# GENOMIC APPROACHES FOR IMPROVEMENT OF UNDERSTUDIED GRASSES

EDITED BY: Keenan Amundsen, Gautam Sarath and Teresa Donze-Reiner PUBLISHED IN: Frontiers in Plant Science

#### *Frontiers Copyright Statement*

*© Copyright 2007-2017 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

> *All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88945-242-2 DOI 10.3389/978-2-88945-242-2



Topic Editors:

**Keenan Amundsen,** University of Nebraska-Lincoln, United States

**Gautam Sarath,** United States Department of Agriculture—Agricultural Research Service, United States

**Teresa Donze-Reiner,** West Chester University, United States

Mowed buffalograss in Mead, Nebraska, USA. Photo by Keenan Amundsen

Grasses are diverse, spanning native prairies to high-yielding grain cropping systems. They are valued for their beauty and useful for soil stabilization, pollution mitigation, biofuel production, nutritional value, and forage quality; grasses encompass the most important grain crops in the world. There are thousands of distinct grass species and many have promiscuous hybridization patterns, blurring species boundaries. Resources for advancing the science and knowledge base of individual grass species or their unique characteristics vary, often in proportion to their perceived value to society. For many grasses, limited genetic information hinders research progress. Presented in this research topic is a brief snapshot of creative efforts to apply modern genomics research methodologies to the study of several minor grass species.

**Citation:** Amundsen, K., Sarath, G., Donze-Reiner, T., eds. (2017). Genomic Approaches for Improvement of Understudied Grasses. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-242-2

# Table of Contents

### **Introduction**

*Editorial: Genomic Approaches for Improvement of Understudied Grasses*

Keenan Amundsen, Gautam Sarath and Teresa Donze-Reiner

### **Stress tolerance**


Fatma-Ezzahra Yousfi, Emna Makhloufi, William Marande, Abdel W. Ghorbel, Mondher Bouzayen and Hélène Bergès


Dawid Perlikowski, Mariusz Czyżniejewski, Łukasz Marczak, Adam Augustyniak and Arkadiusz Kosmala

### **Molecular marker development and applications**

*Transcriptome Profiling of Buffalograss Challenged with the Leaf Spot Pathogen* **Curvularia inaequalis**

Bimal S. Amaradasa and Keenan Amundsen

*Validating DNA Polymorphisms Using KASP Assay in Prairie Cordgrass (***Spartina pectinata** *Link) Populations in the U.S.*

Hannah Graves, A. L. Rayburn, Jose L. Gonzalez-Hernandez, Gyoungju Nah, Do-Soon Kim and D. K. Lee


### **Enhanced biomass production**


# Editorial: Genomic Approaches for Improvement of Understudied Grasses

#### Keenan Amundsen<sup>1</sup>\*, Gautam Sarath<sup>2</sup> and Teresa Donze-Reiner<sup>3</sup>

*<sup>1</sup> Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE, United States, <sup>2</sup> Grain, Forage and Bioenergy Research Unit, United States Department of Agriculture—Agricultural Research Service, Lincoln, NE, United States, <sup>3</sup> Department of Biology, West Chester University, West Chester, PA, United States*

Keywords: biomass yield, genotypic diversity, grasses, stress tolerance, RNA-seq, proteomics, genomics, differential gene expression

**Editorial on the Research Topic**

#### **Genomic Approaches for Improvement of Understudied Grasses**

Grasses are diverse, spanning native prairies to high-yielding grain cropping systems. They are valued for their beauty and useful for soil stabilization, pollution mitigation, biofuel production, nutritional value, and forage quality; grasses encompass the most important grain crops in the world. There are thousands of distinct grass species and many have promiscuous hybridization patterns, blurring species boundaries. Resources for advancing the science and knowledge base of individual grass species or their unique characteristics vary, often in proportion to their perceived value to society. For many grasses, limited genetic information hinders research progress. Presented in this research topic is a brief snapshot of creative efforts to apply modern genomics research methodologies to the study of several minor grass species.

Native or naturalized grass species offer unique adaptation advantages and often have better heat, drought, and salinity tolerance than recently introduced species. Genotypes with unique combinations of traits frequently arise and many offer robust sources of stress resistance or desirable production characteristics, but identifying those plants is challenging. Al-Dakheel and Hussain present a novel method for field-evaluating and identifying salinity tolerance in buffelgrass (Cenchrus ciliaris L.). Using a field-pot system over three years, they identified 12 salt-tolerant buffelgrass accessions with high dry biomass yield. Salinity tolerance is important in modern agriculture, particularly for many alternative grasses, as they are often managed on marginal lands. Yousfi et al. specifically assayed the role of WRKY genes in salinity tolerance in durum wheat (Triticum turgidum L.). Five WRKY-containing bacterial artificial chromosomes (BACs) were identified, sequenced, and annotated. Differential responses of WRKY genes in salt-sensitive vs. salt-tolerant germplasm were observed, suggesting a role in salt tolerance. Another interesting finding, common to many grasses, was that 74.6% of the sequenced BACs contained transposable elements, which are often found in high copy numbers in grasses and further complicate genetic studies. Yue et al. used a global transcriptome approach to identify transcripts differentially expressed between a waxy, drought-sensitive cultivar of broomcorn millet (Panicum miliaceum L.) and one that was non-waxy and salt and drought tolerant. They reported the first assembly of broomcorn millet and identified 292 transcripts differentially expressed between the studied cultivars. These transcripts may be important in the morphological differences associated with waxiness and in drought and salinity tolerance. In addition to salinity tolerance, drought tolerance is essential for grasses grown on marginal lands.
Hybrids can form between Italian ryegrass (Lolium multiflorum Lam.) and tall fescue (Festuca arundinacea Schreb.) and are exploited to combine desirable traits from each species. Perlikowski et al. examined two hybrid forms differing in their photosynthetic capacity during drought stress and their ability for membrane regeneration following removal of the stress, offering insights into the role of metabolic alterations on drought tolerance and membrane recovery.

#### Edited by:

*Sergio Lanteri, University of Turin, Italy*

#### Reviewed by:

*Sergio Lanteri, University of Turin, Italy*

*Rieseberg Loren, University of British Columbia, Canada*

#### \*Correspondence:

*Keenan Amundsen kamundsen2@unl.edu*

#### Specialty section:

*This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science*

Received: *25 April 2017* Accepted: *23 May 2017* Published: *09 June 2017*

#### Citation:

*Amundsen K, Sarath G and Donze-Reiner T (2017) Editorial: Genomic Approaches for Improvement of Understudied Grasses. Front. Plant Sci. 8:976. doi: 10.3389/fpls.2017.00976*

Amaradasa and Amundsen studied the interaction between a fungal pathogen, Curvularia inaequalis, and resistant and susceptible American buffalograss [Buchloë dactyloides (Nutt.) Engelm. syn. Bouteloua dactyloides (Nutt.) Columbus]. Their analysis led to the development of RNA-based markers that have potential to screen and identify sources of host resistance in the absence of an in vivo assay. These markers have the added advantage of being gene-based, avoiding some of the challenges associated with genetic studies in complex polyploid genomes. Genetic marker development and testing are important early steps in working with grass species lacking genomic resources. As part of their study, Yue et al. identified 35,216 simple sequence repeats in broomcorn millet that could be developed into molecular markers. Graves et al. used single nucleotide polymorphism (SNP) markers to develop KASP assays. The KASP assays were able to discriminate hybridization and self-fertilization events in populations of prairie cordgrass (Spartina pectinata Link), which is challenging in a species with diverse and complex inheritance patterns. The ability to exploit important traits from breeding populations is critical to maximizing their value. Grinberg et al. used perennial ryegrass (Lolium perenne L.) breeding populations as a model to compare different predictive models in a genomic prediction framework, with the ultimate goal of improving several biomass, yield, and nutritional value traits. Miao et al. examined the role of histones, another important conserved protein family essential for genome stability, in switchgrass (Panicum virgatum L.) by exploiting genome-specific markers and transforming tobacco to confirm their functional role. They confirmed that the histone genes being investigated could trigger cell death and that their nuclear localization was critical for their function.

Biomass is critically important for grasses destined for use by the biofuel industry. Lignin, cellulose, and pectin are cell wall constituents that influence how grasses can be used for biofuel production. Rai et al. conducted a genome-wide analysis to identify genes associated with cell wall composition of sorghum [Sorghum bicolor (L.) Moench]. With those genes physically mapped to the sorghum genome, researchers can use that information to alter cell wall composition through traditional plant breeding methods. Muthamilarasan et al. also examined expression profiles of cell wall-associated genes in foxtail millet (Setaria italica L.) and identified genes differentially expressed in response to abiotic stress and exogenous hormone applications. They further describe the importance of synteny among grasses and conservation among cell wall genes. The study of Paudel et al. also highlights conservation among the grasses by examining proteins expressed during senescence in switchgrass and prairie cordgrass. Early senescence reduces biomass and is therefore an important process to understand in perennial grasses. By comparing early-senescence genotypes with late-senescence genotypes in both species, proteins intimately involved in senescence were found that could be exploited in future studies and breeding programs to develop germplasm with delayed senescence.

Here we highlight 11 grass species that are understudied relative to the major cereal grasses; they represent only a small fraction of the thousands of known species. As land resources become scarce and demand for highly productive arable land increases, identification of understudied grasses and their desirable traits can fast-forward their suitability for use in marginal lands. The papers presented in this research topic demonstrate novel approaches or new applications of proven methods to improve our understanding of perennial grasses and their function, and illustrate important steps toward use of these understudied species in modern agriculture.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Amundsen, Sarath and Donze-Reiner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genotypic Variation for Salinity Tolerance in *Cenchrus ciliaris* L.

### Abdullah J. Al-Dakheel and M. Iftikhar Hussain\*

*Crop Diversification and Genetic Improvement Section, International Center for Biosaline Agriculture, Dubai, United Arab Emirates*

Scarcity of irrigation water and increasing soil salinization have threatened the sustainability of forage production in arid and semi-arid regions around the globe. Introduction of salt-tolerant perennial species is a promising alternative to overcome forage deficits and meet future livestock needs in salt-affected areas. This study presents the results of a salinity tolerance screening trial carried out in plastic pots buried in the open field for 160 buffelgrass (*Cenchrus ciliaris* L.) accessions over three consecutive years (2003–2005). The plastic pots were filled with a mix of sand, organic fertilizer, and peat moss and were irrigated with water of four salinity levels (EC 0, 10, 15, and 20 dS m<sup>−1</sup>). The results indicate that the average annual dry weights (DW) ranged from 122.5 to 148.9 g/pot in control; 96.4–133.8 g/pot at 10 dS m<sup>−1</sup>; 65.6–80.4 g/pot at 15 dS m<sup>−1</sup>; and 55.4–65.6 g/pot at 20 dS m<sup>−1</sup>. The highest DW (148.9 g/pot) was found with accession 49 and the lowest with accession 23. Principal component analysis shows that PC-1 contributed 81.8% of the total variability, while PC-2 accounted for 11.7% of the total variation among *C. ciliaris* accessions for DW. Hierarchical cluster analysis revealed that a number of accessions collected from diverse regions could be grouped into a single cluster. Accessions 3, 133, 159, 30, 23, 142, 141, 95, 49, 129, 124, and 127 were stable, salt tolerant, and produced good dry biomass yield. These accessions demonstrate sufficient salinity tolerance potential for promotion in marginal lands to enhance farm productivity and reduce rural poverty.

### *Edited by:*

*Teresa Donze, University of Nebraska–Lincoln, USA*

#### *Reviewed by:*

*Caiguo Zhang, University of Colorado Denver, USA*

*Zeran Li, Washington University, USA*

#### *\*Correspondence:*

*M. Iftikhar Hussain m.iftikhar@biosaline.org.ae; mih786@gmail.com*

#### *Specialty section:*

*This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science*

> *Received: 02 April 2016 Accepted: 11 July 2016 Published: 28 July 2016*

### *Citation:*

*Al-Dakheel AJ and Hussain MI (2016) Genotypic Variation for Salinity Tolerance in Cenchrus ciliaris L. Front. Plant Sci. 7:1090. doi: 10.3389/fpls.2016.01090*

Keywords: buffelgrass, biomass yield, multivariate analysis, salt tolerance, genotypic diversity

## INTRODUCTION

Salt-affected soils and millions of hectares of marginal lands have limited the scope for crop production (Wang et al., 2012). According to the Food and Agriculture Organization (FAO), 34 million hectares (11% of the total irrigated area of the world) are affected by different levels of salinization (Food Agriculture Organization of the United Nations, 2012). Moreover, the world loses 0.25–0.5 M ha of agricultural land annually because of salt buildup, which mainly results from irrigation, especially in arid and semiarid areas (Peng et al., 2008; Qadir et al., 2014). Soil salinity reduces the productivity of most crops, although to an extent that varies by species (Roy et al., 2014; Hussain et al., 2016). Besides improving water management practices to reduce salt accumulation in the root zone, there is a need to improve the salinity tolerance of strategically important crops. The use of low-quality saline water for plant production is an option to conserve limited freshwater resources, particularly in the water-scarce regions of the Arabian Peninsula and Africa.

The cultivation of forages, biomass crops, and perennial grasses as feedstock for energy, biomaterials, and livestock rearing has been promoted as an opportunity to improve the sustainability of forage supply and energy security and to contribute to rural development (Ahmad, 2010). Consequently, demand for sustainable biomass production for livestock and energy use has raised interest in perennial crops like Cenchrus ciliaris L. Buffelgrass (C. ciliaris L.) is a perennial C4 forage grass (family Poaceae) that sometimes produces rhizomes and is native to the Arabian Peninsula. C. ciliaris is dominant in natural grazing zones of Ethiopia (Angassa and Baars, 2000), Australia (Buldgen and Francois, 1998), and North Africa (Mseddi et al., 2004). Buffelgrass has proved useful for pasture and soil retention in a wide range of environments due to its drought tolerance, high biomass, deep roots, rapid response to summer rains, and resistance to overgrazing. With their extensive belowground systems, perennial grasses use nutrient and water resources efficiently, control soil erosion, and sequester carbon while restoring soil properties (fertility, structure, organic matter). Compared with annual systems, herbaceous perennial crops have advantages in terms of erodibility and crop management options, such as pesticide and fertilizer inputs (Zhang et al., 2011; Fernando et al., 2012). The salt tolerance of different C. ciliaris genotypes needs to be evaluated to test their suitability for marginal environments and to offer a more practical solution for effective utilization of salt-affected soils. Among buffelgrass accessions, "Texas 4464" from North America has been reported as drought tolerant (Ayerza, 1981), and "Biloela" as salt tolerant (Graham and Humphreys, 1970). Strategies for mitigating salinity problems in crop production include both development of management options (Shannon, 1997) and genetic improvement of current cultivars (Krishnamurthy et al., 2007).

Germplasm of a specific crop collected from diverse sources offers greater genetic diversity and may furnish useful traits to widen the genetic base of crop species. The collection, screening, and description of the existing variability among forage crops are the first steps in the performance evaluation and selection process (Ponsens et al., 2010). Knowledge of germplasm diversity and salinity tolerance is an excellent tool for screening and selecting high-yielding accessions for further evaluation under field conditions. Screening large numbers of genotypes for salinity tolerance in the field is notoriously difficult because of the variability of salinity within fields (Daniells et al., 2001). Moreover, it is difficult to determine the critical parameters under field conditions, since any environmental change could result in a dramatic change in the plant's response to salinity (Shannon, 1997). Although the response of C. ciliaris to salinity stress has been studied by many researchers (Arshad et al., 2000; Hacker and Waite, 2001; Jorge et al., 2008; Ksiksi and El-Shaigy, 2012), to the best of our knowledge, no study has so far evaluated and characterized C. ciliaris genotypes in terms of agromorphological attributes and dry matter yield responses. This study evaluates the morphological and biomass yield responses of C. ciliaris genotypes to water salinity in a pot culture trial.

In a study of salt stress on buffelgrass and its effects on productivity decline, Lanza Castelli et al. (2010) found that the C. ciliaris accession Texas 4464 is susceptible to salt stress at 300 mM NaCl at the seedling stage, while Americana showed tolerance to salinity. The fresh weight, root length, and plant height of these accessions were least affected by salinity. However, screening and selection of large numbers of buffelgrass genotypes for salinity tolerance is lacking. Therefore, the current research was undertaken to identify superior genotypes for forage production under the hot, arid conditions of the UAE using naturally available low-quality saline water. The results of this research are expected to provide useful information that can be used to group accessions for relative comparison and evaluation of biomass yield in areas where problems of soil and water salinity are increasing.

## MATERIALS AND METHODS

### Site Description

The field trials were conducted for three growing seasons during 2003–2005 on the experimental farm of the International Center for Biosaline Agriculture (ICBA), located on the eastern side of Dubai at 25°5′ N and 55°23′ E, at an elevation of 30 m above mean sea level. The soil of the experimental field is a Carbonatic, Hyperthermic Typic Torripsamment with a negligible level of inherent soil salinity (0.2 dS m<sup>−1</sup>). The study area is characterized by very hot, dry days in summer (April–October), when temperatures can reach 45°C, while winter days (December–February) are cooler and dry with low average night-time temperatures (10°C). Because of the aridity and relatively cloudless skies, there are great extremes of temperature, and there are also wide variations between the seasons. Air temperature follows a regular seasonal trend, with a minimum in January and a maximum in July. According to the ICBA weather station data, the average temperature of the study area was 10°C in January, while the average maximum temperature in July was 45°C. The average annual rainfall is <80 mm and falls in short, torrential bursts during the summer months.

### Plant Material, Experimental Design, and Management Practices

In total, 160 accessions of buffelgrass (C. ciliaris L.) were used in this experiment, as shown in **Table 1**. The seeds of C. ciliaris L. were received from the United States Department of Agriculture (USDA) and originated from Asia, Africa, Australia, India, and the USA; some local landraces and commercial cultivars were also included in the study. Seeds of C. ciliaris genotypes (three per pot) were sown in polyvinyl chloride pots (30 × 30 cm) on 2nd March 2003. The pots were buried in the open field (**Figure 1**), near the net house at the experimental research station (ICBA, Dubai, UAE) for three consecutive years (2003–2005). After uniform emergence, seedlings were thinned to one plant per pot to maintain good stand establishment. The experimental soil reached field capacity and permanent wilting point at 0.1 bar and 15 bars, respectively. All pots were irrigated to the upper limit of field capacity before planting.

FIGURE 1 | Screening *Cenchrus ciliaris* for salt tolerance in pots buried in open field at Research Station, ICBA, Dubai, UAE. (A) Germination, (B) growth and tillering, (C) heading stage, and (D) harvesting.

The soil was collected from an open area of ICBA that had not previously been irrigated with saline water. Textural analysis showed that the soil had 98% sand, 0.5% silt, and 1% clay, which classifies it as a sandy soil. The plastic pots were filled with a soil mix (20 kg/pot) containing 50% sand, 25% organic fertilizer, and 25% peat moss. NPK fertilizer was mixed in at the start of the trial and after each harvest. Weeds were removed by hand.

### Irrigation and Saline Water Treatment Application

The C. ciliaris accessions were arranged in the main plots, while the saline water treatments (0, 10, 15, and 20 dS m<sup>−1</sup>) were randomized in a split-plot randomized complete block design (RCBD). Initially, the pots were irrigated with fresh water for 1 month to facilitate germination and seedling establishment. One pot was assigned to each accession per replication, and four replications per treatment were maintained throughout the trial. The saline water treatments were applied 1 month after sowing; each salinity treatment was prepared by diluting ground water (EC: 25 dS m<sup>−1</sup>) with fresh water (1.5 dS m<sup>−1</sup>) in a separate tank to achieve the target salinity and was then delivered to the pots via a drip irrigation system. Irrigation was applied daily at rates equivalent to ET<sub>0</sub> plus 20% for leaching requirements. The four salinity levels were maintained constantly throughout the growth period in all years.
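As a rough illustration of the dilution step, the sketch below estimates the share of ground water in each blend. It assumes that EC mixes linearly by volume, which the text does not state; the function name `groundwater_fraction` is hypothetical, and only the two source ECs (25 and 1.5 dS m<sup>−1</sup>) are taken from the text.

```python
# Assumption: the EC of a blend is the volume-weighted mean of the source ECs
# (linear, conservative mixing). The paper does not state the mixing rule.
GROUNDWATER_EC = 25.0  # dS/m, from the text
FRESHWATER_EC = 1.5    # dS/m, from the text


def groundwater_fraction(target_ec, saline_ec=GROUNDWATER_EC, fresh_ec=FRESHWATER_EC):
    """Volume fraction of saline ground water needed to reach target_ec."""
    if not fresh_ec <= target_ec <= saline_ec:
        raise ValueError("target EC must lie between the two source ECs")
    return (target_ec - fresh_ec) / (saline_ec - fresh_ec)


for target in (10.0, 15.0, 20.0):
    frac = groundwater_fraction(target)
    print(f"{target:4.0f} dS/m: {frac:.1%} ground water, {1 - frac:.1%} fresh water")
```

Under this assumption the 10, 15, and 20 dS m<sup>−1</sup> treatments would need roughly 36, 57, and 79% ground water. The 0 dS m<sup>−1</sup> control lies below the fresh water EC, so it falls outside the range of this calculation.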

The field experiment was equipped with a drip system (pressure-compensating (PC) micro flapper drippers with a flow rate of 4 L hr<sup>−1</sup>), with 0.5 m between rows and 0.25 m between drippers. Each pot had one dripper; the pot area was 0.12 m<sup>2</sup>, which means that each 4 L applied corresponds to a depth of 33 mm, i.e., 33 mm hr<sup>−1</sup>. Irrigation monitoring, scheduling, and salinity management were achieved using Decagon® sensors. During the summer season (9 months, March–November) the pots were irrigated for 30 min per day, which corresponds to an irrigation depth of 16 mm per pot including the leaching fraction; during the mild winter season (3 months, December–February) the irrigation duration was 20 min, which corresponds to an irrigation depth of 11 mm per day including the leaching fraction. The drippers were tested biweekly to check the distribution uniformity and coefficient, using the low-quarter method (the average of the lowest quarter of readings divided by the average of all readings); the data (not shown) indicated that the distribution uniformity was not <85%, with a distribution coefficient of 80%.
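The irrigation depths quoted above follow from dividing the applied volume by the pot area (1 L spread over 1 m<sup>2</sup> equals 1 mm). The sketch below reproduces that arithmetic and shows one common way to compute a low-quarter distribution uniformity. The function names and the dripper readings are hypothetical and are not data from the trial.

```python
def irrigation_depth_mm(flow_l_per_hr, minutes, pot_area_m2):
    """Applied depth in mm: litres delivered divided by area in m^2."""
    return flow_l_per_hr * (minutes / 60.0) / pot_area_m2


def low_quarter_du(readings):
    """Mean of the lowest quarter of readings divided by the overall mean."""
    ordered = sorted(readings)
    n_low = max(1, len(ordered) // 4)
    low_mean = sum(ordered[:n_low]) / n_low
    overall_mean = sum(ordered) / len(ordered)
    return low_mean / overall_mean


# Values from the text: 4 L/hr drippers, 0.12 m^2 pot area.
for label, minutes in (("per hour", 60), ("summer, 30 min", 30), ("winter, 20 min", 20)):
    print(f"{label}: {irrigation_depth_mm(4.0, minutes, 0.12):.1f} mm")

# Hypothetical dripper outputs in L/hr, for illustration only.
example_readings = [3.9, 4.0, 4.1, 3.6, 4.0, 3.8, 4.2, 3.5]
print(f"low-quarter DU: {low_quarter_du(example_readings):.1%}")
```

The computed depths are about 33.3, 16.7, and 11.1 mm, which the text reports as 33, 16, and 11 mm.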

### Harvesting and Biomass Yield Measurements

The plants were harvested at heading stage and weighed with an analytical balance. **Table 2** shows the harvesting dates and number of harvests (cuts) obtained during each year. For each harvest, the total fresh weight of the collected biomass samples was recorded in g/pot. The samples were dried in a forced-air oven at 60°C for 72 h, and dry matter yield (DW) was determined.

TABLE 2 | Sequence of harvesting schedule, stage, cut date, and number of cuts in perennial crop *Cenchrus ciliaris* for salinity tolerance comparison trial for year 2003–2005.

### Data and Statistical Analysis

Statistical analyses were performed on fresh biomass (FW) and dry weight (DW) in two steps:


Figure panels: A = 0, B = 10, C = 15, and D = 20 dS m<sup>−1</sup> salinity level.

## RESULTS

### Impact of Salinity on Average Biomass Yield of *C. ciliaris*

The salinity treatments significantly influenced the growth and biomass yield of the C. ciliaris accessions. The average biomass yields (fresh and dry weight) were calculated by averaging 3 years of data for each salinity level (**Figures 2**, **3**). Fresh weight (FW) ranged between 50 and 450 g/pot in control, 50–350 g/pot at low salinity (10 dS m<sup>−1</sup>), 50–300 g/pot at medium salinity (15 dS m<sup>−1</sup>), and 50–200 g/pot at high salinity (20 dS m<sup>−1</sup>) (**Figures 2A–D**). The distribution was bell shaped in control pots and right skewed at high salinity, where values clustered toward the lower production groups (**Figure 2D**). Accession 127 proved to be a stable, high-yielding genotype that produced the maximum FW (400.8 g/pot) in control and 348.9 g/pot at 10 dS m<sup>−1</sup>. However, at medium and high salinity (15 and 20 dS m<sup>−1</sup>), its biomass yield decreased significantly, to 180.1 and 122.3 g/pot, respectively. Accession 49 proved to be a stable genotype that produced the second highest biomass yield (351.1 g/pot) in the control pots and 292.4 g/pot at 10 dS m<sup>−1</sup>, 219.3 g/pot at 15 dS m<sup>−1</sup>, and 181.7 g/pot at 20 dS m<sup>−1</sup>. At medium and high salinity (15 and 20 dS m<sup>−1</sup>), accession 30 produced 233.8 and 154.2 g/pot FW, respectively, and proved to be a moderately salt-tolerant genotype.
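One way to compare the stability of the two accessions described above is to express each fresh weight as a percentage of the same accession's control yield. The sketch below does this with the FW values reported in the paragraph; relative yield is used here as an illustrative index and is not a calculation described by the authors, and the function name is hypothetical.

```python
SALINITY_LEVELS = (0, 10, 15, 20)  # dS/m

# Fresh weight (g/pot) per salinity level, as reported in the text.
fresh_weight = {
    127: (400.8, 348.9, 180.1, 122.3),
    49: (351.1, 292.4, 219.3, 181.7),
}


def relative_yield(values):
    """Each value as a percentage of the first (control) value."""
    control = values[0]
    return [100.0 * v / control for v in values]


for accession, values in fresh_weight.items():
    shares = ", ".join(
        f"{level} dS/m: {pct:.1f}%"
        for level, pct in zip(SALINITY_LEVELS, relative_yield(values))
    )
    print(f"accession {accession}: {shares}")
```

On this index, accession 127 retains about 87, 45, and 31% of its control FW at 10, 15, and 20 dS m<sup>−1</sup>, while accession 49 retains about 83, 62, and 52%.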

Dry weight ranged from 20 to 160 g/pot in control, 20–140 g/pot at 10 dS m<sup>−1</sup>, 20–100 g/pot at 15 dS m<sup>−1</sup>, and 20–80 g/pot at 20 dS m<sup>−1</sup> (**Figures 3A–D**). The distribution was bell shaped in control pots and right skewed at high salinity (20 dS m<sup>−1</sup>) (**Figure 3D**). The range of variation narrowed with increasing salinity, with DW lying between 20 and 160 g/pot overall as salinity varied from 0 to 20 dS m<sup>−1</sup>. The reduction in dry biomass from the upper to the lower end of each range was 87.5, 85.7, 80, and 75% in control and at 10, 15, and 20 dS m<sup>−1</sup>, respectively (**Figures 3A–D**). The highest mean DW (148.9 g/pot) was recorded for accession 49 in control, while accession 127 produced 141.4 g/pot in control pots (data not shown). Accession 124 produced the maximum DW (80.3 g/pot) at 15 dS m<sup>−1</sup> salinity and proved to be a stable, high-yielding genotype. At the highest salinity (20 dS m<sup>−1</sup>), accession 49 produced the maximum dry biomass (65.6 g/pot) (data not shown).
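One reading that reproduces the four reduction percentages quoted above is (upper − lower)/upper for the dry weight range at each salinity level, including the control. The sketch below checks that arithmetic; this interpretation is an assumption, since the text does not define how the reduction was computed, and the function name is hypothetical.

```python
# Dry weight ranges (lower, upper) in g/pot per salinity level (dS/m), from the text.
dw_ranges = {0: (20, 160), 10: (20, 140), 15: (20, 100), 20: (20, 80)}


def range_reduction(lower, upper):
    """Percentage drop from the upper to the lower bound of a range."""
    return 100.0 * (upper - lower) / upper


for level, (lower, upper) in dw_ranges.items():
    print(f"{level:2d} dS/m: {range_reduction(lower, upper):.1f}%")
```

The four values come out as 87.5, 85.7, 80.0, and 75.0%, matching the percentages in the text when they are paired with 0, 10, 15, and 20 dS m<sup>−1</sup>.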

### Effect of Salinity on Total Annual Yield of *C. ciliaris*

Among the top 10 accessions of C. ciliaris, fresh biomass (total annual yield) ranged from 301.6 to 400.8 g/pot in control, 256.2–348.9 g/pot at 10 dS m<sup>−1</sup>, 192.5–233.8 g/pot at 15 dS m<sup>−1</sup>, and 164.8–185.0 g/pot at 20 dS m<sup>−1</sup> (**Figure 4**). The FW of accessions 112, 37, 30, 23, and 46 was more affected by salinity at 10, 15, and 20 dS m<sup>−1</sup>. Accessions 49, 159, 129, 30, and 38 were the least affected by increasing the salinity level from 0 to 20 dS m<sup>−1</sup>. However, accession 129 produced good fresh biomass in control and at 10 and 15 dS m<sup>−1</sup> but was not stable at higher salinity (20 dS m<sup>−1</sup>), where it produced less FW than the other top 10 genotypes (**Figure 4**). The DW of the buffelgrass accessions showed a high degree of variability across salinity levels. Average annual DW yield ranged from 122.5 to 148.9 g/pot in control, 96.4–133.7 g/pot at 10 dS m<sup>−1</sup>, 65.6–80.3 g/pot at 15 dS m<sup>−1</sup>, and 55.4–65.6 g/pot at higher salinity (20 dS m<sup>−1</sup>). The highest DW (148.9 g/pot) was produced by accession 49 in control and the lowest (55.4 g/pot) by accession 23 at the highest salinity level (20 dS m<sup>−1</sup>) (**Figure 5**).

### Evaluation of Performance of *C. ciliaris* Accessions through Multivariate Analysis

### Principal Component Analysis

Principal component analysis (PCA) showed that, out of four components, only one had an eigenvalue greater than 1. This follows the assumption of Chatfield and Collin (1980) that components with an eigenvalue <1 should be eliminated. The extracted component was subsequently subjected to Varimax rotation to ease interpretation and to clarify its relationship to irrigation water salinity. PC1, with an eigenvalue >1, accounted for 83.5% of the total variability among the C. ciliaris genotypes assessed for fresh biomass (**Figure 6**), whereas PC2 accounted for 10.7% of the total variation for the same parameter. For dry weight (DW), PC1 accounted for 81.8% of the total variability among the 160 genotypes and PC2 for only 11.7%. Salinity applied a gradual selection pressure on the entries that was higher for dry biomass than for fresh biomass. PCA of the C. ciliaris genotypes revealed a diverse grouping pattern. Separation on the basis of PC1 and PC2 showed that the genotypes were scattered across all quadrants, indicating a high level of genotypic variation among the accessions (**Figure 6**). The first principal component correlated with five of the original variables (DWS2, FW2, FWS3, DWS1, and FWS4), most strongly with DWS2. Based on this correlation, the first principal component is primarily a measure of DWS2 (data not shown). On the first principal component, accessions 49, 22, 16, 42, 8, and 149 showed more genotypic variation based on FW and are located close to the axis line. On the second principal component, accessions 49, 133, 23, 3, 30, and 6 showed more genotypic variation based on DW and are also located close to the axis line.
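The eigenvalue cut-off can be related to the reported variance shares. The sketch below assumes the PCA was run on the correlation matrix of four standardized variables, so that the eigenvalues sum to 4; the text does not state this explicitly, and the helper names are hypothetical.

```python
N_COMPONENTS = 4  # "out of four components" in the text

# Percentage of total variance reported for PC1 and PC2.
explained = {
    "fresh weight": {"PC1": 83.5, "PC2": 10.7},
    "dry weight": {"PC1": 81.8, "PC2": 11.7},
}


def eigenvalue(percent, n_variables=N_COMPONENTS):
    """Eigenvalue implied by a variance share when eigenvalues sum to n_variables."""
    return percent / 100 * n_variables


def retained(shares):
    """Names of components whose implied eigenvalue exceeds 1."""
    return [pc for pc, pct in shares.items() if eigenvalue(pct) > 1]


for trait, shares in explained.items():
    implied = {pc: round(eigenvalue(pct), 2) for pc, pct in shares.items()}
    print(trait, implied, retained(shares))
```

Under this assumption only PC1 passes the eigenvalue >1 rule for both traits, in line with the single extracted component described above.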

### Hierarchical Cluster Analysis

Germplasm evaluation for quantitative traits may help to work out the relative importance of various traits within each cluster. Hierarchical cluster analysis based on agronomical traits (fresh and dry biomass) divided the 160 accessions into seven main clusters. The largest number of accessions (42) was present in group I, followed by group II (27), group III (26), group IV (19), group V (17), group VI (17), and group VII (12) (**Figure 7**). Some accessions originating from different geographical areas were grouped into the same cluster, while many others fell into different clusters. Group VII consisted of highly salt-tolerant genotypes (3, 133, 159, 30, 23, 142, 141, 95, 49, 129, 124, and 127) that attained maximum biomass yield at all target salinities and were also among the top 10 accessions. The grouping pattern indicated that the clusters were heterogeneous with regard to geographical origin and that some entries from different geographic regions were pooled in the same cluster (**Figure 7**).

### Performance of Commercial Cultivars of *C. ciliaris*

The observations of the present study were largely consistent with the available descriptions of commercial cultivars. Genotype 159 (Biloela) was salt tolerant and productive and produced higher biomass than other genotypes at the different water salinities (0, 10, 15, and 20 dS m<sup>−1</sup>). The other commercial genotypes, 158 (Gayndah) and 160, did not perform well in terms of fresh and dry biomass and were not stable at medium and high salinities (data not shown).

### DISCUSSION

Yield assessment in field trials provides the most reliable information on salt tolerance; however, spatial and temporal variability in soil salinity (Richards, 1983) makes it difficult to obtain reproducible information. The advantage of using pots buried in the open field is that saline conditions can be controlled and constantly monitored. Salt affects plant growth by limiting the absorption of water and essential nutrients through the roots. Salt stress has an immediate effect on cell growth and enlargement, and high concentrations of salts can be extremely toxic (Munns and Tester, 2008; Hussain et al., 2016). In the present study, biomass yield varied widely among C. ciliaris genotypes grown under different salinity levels. The fresh and dry weight of all genotypes decreased with increasing salt stress; however, salt-sensitive genotypes showed a greater reduction in biomass than tolerant genotypes. Lanza Castelli et al. (2010) demonstrated a significant reduction (50–70%) in height and shoot fresh weight at 30 dS m<sup>−1</sup> in Texas and Biloela accessions. In the present study, accessions 49, 127, 129, 30, and 38 were the least affected by increasing salinity. Accession 129 produced good biomass at low salinity but was not stable at higher salinity. It is therefore suitable for low and medium saline areas where water salinity falls in the range 5–15 dS m<sup>−1</sup>. Our results demonstrate that accessions from Africa were more stable and salt tolerant and attained higher biomass yield than accessions from other regions. C. ciliaris was previously reported to be a dominant grass along the sandy beaches of Eritrea on soils with a salt content of 0.006–0.251% (Hemming, 1961). Skerman and Riveros (1990) demonstrated that it grows well in saline soils, although it is less salt tolerant than Chloris gayana, Cynodon spp., and Panicum antidotale.

The DW of the C. ciliaris accessions showed high variability across salinity levels. The highest DW was produced by accession 49 and the lowest by accession 23. Similar studies on other species such as Bermuda grass (Rodríguez and Miller, 2000), Catharanthus roseus (Jaleel et al., 2008) and C. ciliaris (Lanza Castelli et al., 2010) showed that plant growth, tillering, number of leaves and dry weight decreased under salt stress. Sodium (Na+) and chloride (Cl−) ions (>40 mmol/L) can be toxic to plants because they create an imbalance in plant nutrition through decreased nutrient uptake and translocation to new shoots (Munns et al., 2000; Tester and Davenport, 2003; Munns and Tester, 2008). Nawazish et al. (2006) found that shoot dry weight was severely affected in the C. ciliaris ecotype from Faisalabad, where it decreased from 29.83 to 8.02 g/plant. High salinity modifies plant metabolism, which results in altered plant morphology; cultivar type and the duration and intensity of stress determine the extent of morphological modification. These results indicate that dry biomass yield can be used as a screening and selection criterion for evaluating salt tolerance in a large collection of plant accessions. Based on relative performance (DW), genotypes can be selected within each salinity level. Several accessions from the top 10 selection with high biomass yield potential are of particular interest as a forage resource for livestock in salt-affected agro-ecosystems and are highly suitable for hot, dry arid and semi-arid regions.

Our results show that PCA grouped together accessions with greater morphological similarity; the clusters did not necessarily include all accessions from the same or nearby sites. Diversity of populations within a geographical origin and similarity of populations beyond geographical limits have also been reported in C. ciliaris genotypes (Jorge et al., 2008). Among the well-performing accessions, genotypes 127, 129, 49, 23, 27, 16, and 42 attained higher DW. Variables (genotypes) close to an axis correlate more strongly with that principal component; the axis can be considered a combination of its neighboring variables. Divergence studies of morphological and agronomical traits using principal component and cluster analysis have also been reported by other researchers (Warwick et al., 2006; Dawood et al., 2009), supporting the present finding that both cluster analysis and PCA disclose complex relationships among accessions in a more understandable way. The variability among accessions of diverse origin could be related primarily to their morphological differences and secondarily to agronomic use. The accessions evaluated in this work exhibited a reasonable level of diversity for the studied traits of economic significance (FW, DW), providing a resource for future crop improvement. For the improvement of C. ciliaris, it would be desirable to use diverse accessions with more variability in future breeding programs. For natural resource management and soil stabilization, C. ciliaris accessions that are shorter, more prostrate and more rhizomatous should be selected, because they would provide better ground cover (IBPGR, 1984).

Cluster analysis was used in this study to facilitate the evaluation of salt tolerance among different genotypes. The major advantage of multivariate analysis in evaluating salt tolerance is that it allows multiple parameters to be analyzed simultaneously and increases the accuracy of genotype rankings. Another advantage is the convenience of ranking genotypes when plants are evaluated at different salt levels, e.g., moderate and high salt levels. Cluster analysis proved its validity for establishing genotypic diversity among accessions. These findings are supported by Iqbal et al. (2010), who concluded that quantitative characters were more reliable than biochemical markers, which are specific to particular conditions. Germplasm evaluation for quantitative traits may help to work out the relative importance of various traits within each cluster. Accessions from different geographical regions were pooled in the same cluster. Our results agree with the classification of 68 populations of C. ciliaris in which most populations were grouped mainly according to their agronomical attributes rather than their geographic origin (Jorge et al., 2008). Group VII consists of highly salt-tolerant genotypes that attained maximum biomass yield at all target salinities and were also among the top 10 accessions. Grouping was not associated with geographic distribution; instead, accessions were mainly grouped on the basis of their morphological traits (FW, DW). Thus, it cannot be generalized that accessions with the same origin will always show low diversity among them. For example, in Australia, where C. ciliaris is grown extensively, rainfall can occur during the cooler months, and most cultivars of C. ciliaris respond poorly to these rains owing to the cold temperatures. Accessions originating from cool, dry environments could be promising for these areas through improved performance in spring, owing to a better response to rains during the cooler seasons (Hacker and Waite, 2001). Divergence studies of morphological and agronomical attributes using principal component and cluster analysis have been made by different researchers (Dawood et al., 2009).

A good starting point in selecting suitable accessions for dry areas is to test accessions with good agro-morphological attributes that originate from areas with very low rainfall. Several accessions of potential interest were collected from arid areas of South Africa, Zimbabwe, and Tanzania and from hot arid environments of the UAE and Kenya. Salinity tolerance is generally an important parameter when choosing accessions to evaluate, especially within a country, because of the strong genotype × environment (G × E) interaction. A total of 100 accessions originating from South Africa were found in four of the five clusters, showing that there is not a strong link between environment and morphology and confirming the findings of a study on dry matter and phenology of C. ciliaris in Tunisia (Mseddi et al., 2004). Some other studies also classified American and Gayndah in the same group (Pengelly et al., 1992).

The present study was largely consistent with the available descriptions of commercial cultivars. One main reason for discrepancies among experimental findings could be a strong influence of the environment. High within-cultivar variation in C. ciliaris is a well-established fact (Jorge et al., 2008) and may be one of the reasons why commercial cultivars are not consistent in their performance. Jorge et al. (2008) further concluded that accessions belonging to identical agro-morphological groups were found across a wide range of environments in sub-Saharan Africa. The inherent genetic variability of perennial grasses collected from different habitats of the Cholistan desert has also been evaluated, and a high degree of genotypic variation was recorded among the accessions (Arshad et al., 2000). A genotypic characterization of the collection could provide further evidence to confirm the classification of this study.

### CONCLUSIONS

We found significant variation for salt tolerance among a large collection of buffelgrass genotypes representing global diversity, the USDA collection, commercial varieties and local landraces. We identified several accessions that were stable and productive at high salinity and others that were stable but low yielding. Accessions 3, 133, 159, 30, 23, 142, 141, 95, 49, 129, 124, and 127 were stable and salt tolerant and produced good dry biomass yield at the target salinities in all years. These accessions can be exploited to widen the genetic base of existing C. ciliaris accessions for salt tolerance. They hold appropriate salinity tolerance potential and can be grown to enhance farm productivity in arid and semi-arid areas of the globe. Furthermore, the introduction of salt-tolerant perennial species is a promising alternative for overcoming salinity problems in arid regions. Conserving fresh water resources for high-value purposes while using low-quality saline water can provide both ecological and economic benefits, which are essential for sustainable agriculture in dry lands.

### AUTHOR CONTRIBUTIONS

MH conducted the experiment, collected data, and drafted the manuscript. AA designed the experiment, followed up on data collection, provided support with the PCA statistical analysis and edited the manuscript.

### ACKNOWLEDGMENTS

The authors gratefully acknowledge the International Fund for Agricultural Development (IFAD), Arab Fund for Economic and Social Development (AFESD), and the Islamic Development Bank (IDB) for their financial support through several regional projects. We are highly grateful to Dr. Susan Robertson (ICBA, Dubai, UAE) for help in English language and grammar correction. We are thankful to Dr. Asad Qureshi (ICBA, Dubai, UAE) for help in critical revision of the article. We are extremely thankful to anonymous reviewers for the critical and valuable comments that have greatly improved the article.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Al-Dakheel and Hussain. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Comparative Analysis of *WRKY* Genes Potentially Involved in Salt Stress Responses in *Triticum turgidum L.* ssp. *durum*

Fatma-Ezzahra Yousfi<sup>1,2,3</sup>, Emna Makhloufi<sup>1,2,3,4</sup>, William Marande<sup>2</sup>, Abdel W. Ghorbel<sup>1</sup>, Mondher Bouzayen<sup>3,4</sup> and Hélène Bergès<sup>2</sup>\*

<sup>1</sup> Laboratory of Plant Molecular Physiology, Center of Biotechnology of Borj Cedria, Borj Cedria Science and Technology Park, Hammam-lif, Tunisia, <sup>2</sup> Centre National de Ressources Genomiques Vegetales, French Plant Genomic Center, INRA–CNRGV, Castanet-Tolosan, France, <sup>3</sup> INRA, UMR990 Genomique et Biotechnologie des Fruits, Castanet-Tolosan, France, <sup>4</sup> INPT, Laboratoire de Genomique et Biotechnologie des Fruits, University of Toulouse, Castanet-Tolosan, France

#### *Edited by:*

Teresa Donze, West Chester University, USA

#### *Reviewed by:*

Hao Wang, University of Georgia, USA; Dong-Ha Oh, Louisiana State University, USA; Zoe Joly-Lopez, New York University, USA

> *\*Correspondence:* Hélène Bergès helene.berges@inra.fr

#### *Specialty section:*

This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science

*Received:* 29 June 2016 *Accepted:* 20 December 2016 *Published:* 31 January 2017

#### *Citation:*

Yousfi F-E, Makhloufi E, Marande W, Ghorbel AW, Bouzayen M and Bergès H (2017) Comparative Analysis of WRKY Genes Potentially Involved in Salt Stress Responses in Triticum turgidum L. ssp. durum. Front. Plant Sci. 7:2034. doi: 10.3389/fpls.2016.02034

WRKY transcription factors are involved in multiple aspects of plant growth, development and responses to biotic stresses. Although they have been found to play roles in regulating plant responses to environmental stresses, these roles still need to be explored, especially those pertaining to crops. Durum wheat is the second most widely produced cereal in the world. Complex, large and unsequenced genomes, in addition to a lack of genomic resources, hinder the molecular characterization of tolerance mechanisms. This paper describes the isolation and characterization of five TdWRKY genes from durum wheat (Triticum turgidum L. ssp. durum). A PCR-based screening of a T. turgidum BAC genomic library using primers within the conserved region of WRKY genes resulted in the isolation of five BAC clones. Following full sequencing of the five BACs, fine annotation through the TriAnnot pipeline revealed 74.6% of the entire sequences as transposable elements and a 3.2% gene content, with genes organized as islands within oceans of TEs. Each BAC clone harbored a TdWRKY gene. The study showed a very extensive conservation of genomic structure between TdWRKYs and their orthologs from Brachypodium, barley, and T. aestivum. The structural features of TdWRKY proteins suggested that they are novel members of the WRKY family in durum wheat. TdWRKY1/2/4, TdWRKY3, and TdWRKY5 belong to groups Ia, IIa, and IIc, respectively. Enrichment of cis-regulatory elements related to stress responses in the promoters of some TdWRKY genes indicated their potential roles in mediating plant responses to a wide variety of environmental stresses. TdWRKY genes displayed different expression patterns in response to salt stress that distinguish two durum wheat genotypes with contrasting salt stress tolerance phenotypes. TdWRKY genes tended to react earlier, with a down-regulation in leaves of the sensitive genotype and an up-regulation in leaves of the tolerant genotype. TdWRKY transcript levels in roots increased in the tolerant genotype compared to the sensitive genotype. The present results indicate that these genes might play some functional role in salt tolerance in durum wheat.

Keywords: BAC sequencing, *cis*-regulatory elements, WRKY, transposable elements, salt stress, *Triticum durum*, wheat

### INTRODUCTION

Durum wheat (Triticum turgidum ssp. durum) is a monocot of the Poaceae family, the Triticeae tribe and the Triticum genus. Tetraploid Triticum durum (2n = 4× = 28, AABB) originated when two diploid wild grasses were crossed. The A genome originated from T. urartu (Dvorak, 1988). The origin of the B genome is still under discussion, but so far Ae. speltoides has been put forward as the closest to the donor of this genome (Fernandez-Calvin and Orellana, 1994). Durum wheat has one of the largest and most complex genomes; its size is estimated at 13,000 Mb.

The unavailability of T. durum genome sequences complicates and hinders the identification of the genetic factors and mechanisms behind responses to abiotic stresses such as drought and salinity, which are among the major constraints affecting cereal crops. In terms of commercial production and human food, durum wheat ranks second among wheat species after bread wheat (Triticum aestivum L.). Durum wheat represents 8% of total wheat production, 80% of which is grown under Mediterranean climates (Monneveux et al., 2000). To date, durum wheat is the only tetraploid wheat species of commercial importance that is widely cultivated in these regions, where drought, heat and salinity limit yield considerably. Durum wheat is also more salt sensitive than bread wheat (Gorham et al., 1990), and saline soil has a negative effect on production (Maas and Grieve, 1990). Consequently, special efforts must be made to increase its tolerance.

Plants adapt to adverse environmental conditions through the induction of stress-responsive and stress-tolerance genes, a process that occurs mainly through transcription factors (TFs). TFs are known to mediate stress signal transduction pathways, regulating downstream target gene expression and leading to stress tolerance (Shinozaki and Dennis, 2003; Chen and Zhu, 2004; Yamaguchi-Shinozaki and Shinozaki, 2005; Budak et al., 2013).

WRKY transcription factors belong to a very large family of transcription factors potentially involved in drought and salt stress responses (Budak et al., 2013). This family originated in early eukaryotes and greatly expanded in plants (Zhang and Wang, 2005). It counts over 70 members in Arabidopsis (Arabidopsis thaliana; Eulgem et al., 2000; Dong et al., 2003), 55 in cucumber (Cucumis sativus; Wei et al., 2012), 119 in maize (Zea mays; Ling et al., 2011), 94 in barley (Hordeum vulgare; Liu et al., 2014), 182 encoding genes in soybean (Glycine max; Bencke-Malato et al., 2014), 62 in diploid woodland strawberry (Fragaria vesca; Wei et al., 2016) and 100 in rice (Oryza sativa; Xie et al., 2005).

This transcription factor family is characterized by a 60-amino-acid domain containing the WRKY amino acid sequence at its amino-terminal end and a putative zinc finger motif at its carboxy-terminal end. The domain binds specifically to the (T)(T)TGAC(C/T) sequence motif, known as the W box. Binding requires both the invariable WRKY amino acid signature and the cysteine and histidine residues of the WRKY domain, which tetrahedrally coordinate a zinc atom (Rushton et al., 2010).
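To make the W-box motif concrete, the following sketch scans a DNA string for the TGAC(C/T) core on both strands. It assumes that the parenthesized T residues in (T)(T)TGAC(C/T) are optional flanking bases, so only the core is matched; the function name and the example sequence are hypothetical and are not taken from the TdWRKY promoters.

```python
import re

# Core of the W box, TGAC(C/T), written as a lookahead so overlapping hits are kept.
FORWARD = re.compile(r"(?=(TGAC[CT]))")
# Reverse complement of the core, for hits on the opposite strand.
REVERSE = re.compile(r"(?=([AG]GTCA))")


def find_w_boxes(sequence):
    """Return sorted (start, strand, match) tuples for W-box cores in a DNA string."""
    seq = sequence.upper()
    hits = [(m.start(), "+", m.group(1)) for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-", m.group(1)) for m in REVERSE.finditer(seq)]
    return sorted(hits)


# Hypothetical promoter fragment, for illustration only.
print(find_w_boxes("ACGTTGACCAAAGGTCAAT"))
```

Start positions are zero-based and refer to the forward strand; a hit reported with strand "-" is shown as it reads on the forward strand.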

The existence of either one or two highly conserved WRKY domains is the most vital structural characteristic of the TdWRKY gene (Bi et al., 2016). Furthermore, the global structures of WRKY proteins are highly divergent and can be classified into different groups, which might reflect their distinct roles. WRKY proteins are classified into 3 main groups (I, II, III) based on the number of WRKY domains and the structure of the zinc finger-like-motif.

Group I proteins contain two WRKY domains followed by a C2H2 zinc finger motif. The WRKY proteins from groups II and III contain one WRKY domain followed by a C2H2 or a C2HC zinc finger motif, respectively. Group II can be divided into five subgroups (IIa, IIb, IIc, IId, and IIe) based on additional amino acid motifs (Yamasaki et al., 2005).
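The grouping rule described above can be written as a small decision function. This is a minimal sketch with hypothetical names; it covers only the two features named in the text (number of WRKY domains and zinc finger type) and does not attempt the subgroup IIa to IIe assignment, which depends on additional motifs.

```python
def classify_wrky(n_wrky_domains, zinc_finger):
    """Assign a WRKY protein to group I, II, or III from the two features in the text."""
    zinc_finger = zinc_finger.upper()
    if n_wrky_domains == 2 and zinc_finger == "C2H2":
        return "I"
    if n_wrky_domains == 1 and zinc_finger == "C2H2":
        return "II"
    if n_wrky_domains == 1 and zinc_finger == "C2HC":
        return "III"
    # Any other combination is not covered by the rule as stated.
    return "unclassified"


print(classify_wrky(2, "C2H2"), classify_wrky(1, "C2H2"), classify_wrky(1, "C2HC"))
```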

WRKY genes are known to participate in various developmental and physiological processes, including disease resistance (Bhattarai et al., 2010), senescence (Besseau et al., 2012), growth and development (Guillaumie et al., 2010), as well as biotic and abiotic stress responses (Mingyu et al., 2012). Recently, transgenic Arabidopsis plants overexpressing TaWRKY2 (EU665425) or TaWRKY19 (EU665430) have shown improved tolerance to salt, drought and/or freezing stresses compared with wild-type plants (Niu et al., 2012). Marè et al. (2004) described HvWRKY38 (AY541586), a barley gene coding for a WRKY protein whose expression is involved in cold and drought stress responses.

In durum wheat, very few WRKY EST sequences have been identified. Only partial cDNAs were found (Budak et al., 2013; Cifarelli et al., 2013). Literature does not include any report describing the characterization of a TdWRKY gene from tetraploid wheat species.

Available plant genome sequences from monocot plants are key resources that enable a better understanding of their gene content, structure and function. They are also indispensable for understanding transposable elements, intergenic space organization and composition. Genomic comparisons between the genome A diploid wheat donor (Triticum urartu), hexaploid wheat (T. aestivum, The International Wheat Genome Sequencing Consortium (IWGSC), 2014), barley (H. vulgare, The International Barley Sequencing Consortium, 2014) and Brachypodium (Brachypodium distachyon, The International Brachypodium Initiative, 2010), have not only confirmed the broad synteny between Poaceae gene content but have also helped to deduce their function within a phylogenetic context (Dubcovsky et al., 2001).

In this paper, we obtained genomic sequences of the regions harboring WRKY genes from T. turgidum ssp. durum by screening the Langdon LDN#65 BAC clone library. First, we present the WRKY gene characterization, including phylogenetic analysis and orthologous gene comparison. Then, we analyze the genomic environment of the WRKY genes, including intergenic space composition and cis-acting elements that helped to associate a putative function with TdWRKYs. Finally, we analyzed the differential expression of the TdWRKY genes. TdWRKY1, 3, 4, and 5 were induced by high-salt treatment in two durum wheat varieties, Grecale (GR) and Om Rabiaa (OR), shown to be salt-tolerant and salt-sensitive, respectively, suggesting that TdWRKYs may be involved in salt-stress responses. These data provide new leads toward improving durum wheat tolerance to abiotic stresses.

**Abbreviations:** ABRE, ABA-responsive element; BAC, bacterial artificial chromosome; CRT, C-repeat element; DRE, dehydration-responsive element; ERF, ethylene response factor; EST, expressed sequence tag; LTRE, low-temperature-responsive element; NLS, nuclear localization signal; ORF, open reading frame; TE, transposable element; TF, transcription factor; UTR, untranslated region.

### MATERIALS AND METHODS

### Plant Material

Three genotypes of tetraploid T. turgidum L. ssp. durum (2n = 4× = 28) were used in this study: Langdon LDN#65 for PCR screening of the bacterial artificial chromosome (BAC) library, and OR and GR for functional analyses. OR is a local Tunisian variety and GR is an Italian variety introduced in Tunisia.

### BAC Library Screening

The BAC library constructed from the tetraploid T. turgidum L. ssp. durum (2n = 4× = 28) Langdon LDN#65 genotype was used in this study for PCR screening. The library comprises a total of 516,096 clones organized in 1,344 384-well plates. The average insert size of the BAC clones was estimated at around 131 kb, resulting in 5.1-fold genome coverage (Cenci et al., 2003). The LDN#65 BAC library and related tools are available at the French Plant Genomic Resource Center upon request (http://cnrgv.toulouse.inra.fr/en/library/genomic_resource/Ttu-B-LDN65). The library was organized into two-dimensional pools, and BAC library screening was performed as described by Cenci et al. (2003, 2004). The pooling strategy required 56 PCRs for the superpools (9216 BAC clones each) and 50 additional PCRs for each positive superpool. The screenings were done as described in Makhloufi et al. (2014).
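The library figures above are mutually consistent, which the short calculation below illustrates. It is a sketch only: the 13,000 Mb genome size is the estimate quoted in the Introduction and may differ from the value underlying the reported coverage, and `pcr_count` is a hypothetical helper that simply adds the two screening rounds.

```python
CLONES = 516_096          # total BAC clones in the library
WELLS_PER_PLATE = 384
SUPERPOOL_SIZE = 9_216    # clones per superpool
MEAN_INSERT_KB = 131
GENOME_MB = 13_000        # assumption: genome size quoted in the Introduction

plates = CLONES // WELLS_PER_PLATE
superpools = CLONES // SUPERPOOL_SIZE
plates_per_superpool = SUPERPOOL_SIZE // WELLS_PER_PLATE
# Total cloned DNA (Mb) divided by the genome size gives the fold coverage.
coverage = CLONES * MEAN_INSERT_KB / 1_000 / GENOME_MB


def pcr_count(positive_superpools, first_round=56, per_positive=50):
    """PCRs needed: one per superpool, plus 50 for each positive superpool."""
    return first_round + per_positive * positive_superpools


print(plates, superpools, plates_per_superpool, round(coverage, 1), pcr_count(1))
```

With these inputs the library corresponds to 1,344 plates and 56 superpools of 24 plates each, and the coverage comes out at about 5.2-fold, close to the reported 5.1; the small difference may reflect the genome size estimate used. A single positive superpool would require 106 PCRs in total.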

### PCR Primer Design and PCR Amplification for BAC Screening

Two primer design strategies were used. In the first, O. sativa, T. aestivum and H. vulgare sequences harboring conserved WRKY domains were used as queries (NCBI: http://www.ncbi.nlm.nih.gov/; GrainGenes: http://wheat.pw.usda.gov/GG2/index.shtml; Gramene: http://www.gramene.org/; TIGR: http://www.jcvi.org/; HarvEST: http://harvest.ucr.edu/; Phytozome: http://www.phytozome.net/; PTDB: http://plntfdb.bio.uni-potsdam.de/v3.0; and PlantGDB: http://www.plantgdb.org/) in a BLAST search to find homologous T. turgidum ESTs (Budak et al., 2013; Cifarelli et al., 2013). Multiple alignments of the DNA sequences were performed with ClustalW software (Larkin et al., 2007). To avoid amplification across an exon–intron junction, exon boundaries within Triticeae expressed sequence tags (ESTs) were predicted based on rice and Arabidopsis genomic sequences.

The second strategy relied on the very strong homology between Triticeae species. We tried to identify primers that were not specific to T. turgidum from genes whose function had been studied. For this purpose, WRKY genes that had been specifically studied in the context of abiotic stress were chosen for primer design: HvWRKY38 (AY541586) (group IIa) (Marè et al., 2004), TaWRKY2 (EU665425) (group I) and TaWRKY19 (EU665430) (group I) (Niu et al., 2012).

PCR primers were then designed to cover exons of the entire selected sequence using the PerlPrimer tool v.1.1.2.1 (Marshall, 2004) (Table S1). Primers were tested on genomic DNA of LDN#65 before BAC library screening. Total DNA was extracted from the wheat Langdon 65 variety using the Plant DNAzol® reagent. PCR conditions were as follows: initial denaturation at 95°C for 5 min, followed by 45 cycles of 20 s at 95°C, 16 s at 60°C, and 20 s at 72°C, and a melting curve with an increment of 0.5°C per cycle. PCR products for the selected BACs were separated by electrophoresis (2% agarose).

### BAC Sequencing, Assembly, and Annotation

BAC DNA was extracted using a NucleoSpin® 96 Flash kit (Macherey-Nagel) and the insert size was estimated by NotI digestion (FastDigest NotI; Fermentas). Positive BAC clones were outsourced for 454 Life Sciences pyrosequencing using the GS Junior Roche system (Kit 454 Titanium; Roche). Sequence data assembly was performed with Newbler 2.8 software sold by 454 Life Sciences/Roche for 454 data (Veras et al., 2013). The assembly was performed on data previously processed with the Pyrocleaner software after removing reads contaminated by the host E. coli. Sequenced BAC DNA was analyzed using TriAnnot Pipeline v.3.8 improved for wheat species (http://wheat-urgi.versailles.inra.fr/Tools/TriAnnot-Pipeline), enabling annotation, masking of transposable elements, and gene structure organization (Text S1) (Leroy et al., 2012).

## Alignment, Phylogenetic Tree, and Sequence Analysis of *TdWRKY* Genes

The TdWRKY protein sequences were submitted to the CDD (Conserved Domains Database) from NCBI, and to Motif Scan detection in MyHits (http://myhits.isb-sib.ch/) with Prosite databases selected. Homologous proteins from whole-genome-sequenced monocot plants (wheat, barley, Brachypodium, maize, sorghum and rice) and Arabidopsis were selected for sequence alignments using DNAMAN software (http://www.lynnon.com/).

A neighbor-joining phylogenetic tree was derived from a MUSCLE 3.8 alignment (Edgar, 2004) of TdWRKY proteins, their homologous WRKY proteins from cereals and their closest members from A. thaliana. The tree was produced with MEGA 6 (Tamura et al., 2013) using the neighbor-joining method with 1000 bootstrap replicates.

A 1.5 kb DNA fragment of the 5′ regulatory region upstream of the transcription start of each TdWRKY gene was subjected to in silico analysis using the PLACE signal scan and NSITE-PL (Recognition of PLANT Regulatory motifs with statistics) from the Softberry tool (http://www.softberry.com/). The aim was to search for putative cis-regulatory elements in the promoter region potentially involved in controlling TdWRKY gene expression. The promoter regions of their orthologs in T. aestivum and H. vulgare were scanned in the same way to obtain the average frequencies of cis-acting elements related to abiotic stress in these promoters.
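
The extraction of the upstream region described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the 0-based coordinate convention and the toy sequence are assumptions made for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def upstream_region(bac_seq: str, tss: int, strand: str, length: int = 1500) -> str:
    """Extract up to `length` bp upstream of a transcription start site.

    `tss` is the 0-based index of the first transcribed base on the
    forward strand of the BAC sequence (hypothetical convention).
    For a minus-strand gene the upstream bases lie to the right of the
    TSS and are reverse-complemented, so the result always reads 5'->3'
    towards the gene. The region is truncated at the BAC ends.
    """
    if strand == "+":
        start = max(0, tss - length)
        return bac_seq[start:tss]
    if strand == "-":
        end = min(len(bac_seq), tss + 1 + length)
        return reverse_complement(bac_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


# Toy sequence, not real BAC data
toy = "AAACCCGGGTTT"
plus_promoter = upstream_region(toy, 6, "+", length=3)
minus_promoter = upstream_region(toy, 5, "-", length=4)
```

The default of 1500 bp mirrors the 1.5 kb window used in the text; a region that runs off the end of the insert is simply shortened.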

### Salt Stress

Sterile seeds of two independent genotypes from T. turgidum subsp. durum, OR and GR, were first stored for 48 h at 4°C to initiate germination. Seedlings were sown in Magenta™ vessels containing 50 ml of 50% MS-based medium (Murashige and Skoog, 1962) and were left for 10 d in an in vitro growth chamber under a controlled photoperiod: 14 h of light at 25°C with 80% humidity and a light intensity of 250 µmol m<sup>−2</sup> s<sup>−1</sup>, and 10 h of darkness at 20°C. They were then subjected to abiotic and hormonal stress treatments. For salinity treatment, seedlings were transferred into 50% MS medium containing 200 mM NaCl for 6 or 24 h. Leaves and roots were then harvested separately, dropped immediately into liquid nitrogen, and stored at −80°C for RNA extraction.

### Gene Expression Analysis

Total RNA from at least 30 salt-treated and untreated leaves and roots from the OR and GR genotypes was extracted using a PureLink Plant RNA Reagent kit (Invitrogen). Total RNA was DNase treated (Promega), and first-strand cDNA was reverse transcribed from 2 µg of total RNA using an M-MLV Reverse Transcriptase kit (Promega) according to the manufacturer's instructions. First-strand cDNA generated from salt-treated and untreated samples of either the OR or GR genotype was subjected to real-time quantitative expression analysis. The latter was performed in a fluorometric thermal cycler (DNA Engine Opticon 2; MJ Research, Waltham, MA, USA) using SYBR Green fluorescent dye following the manufacturer's instructions. Results were displayed using SDS2.2 software on an Applied Biosystems 7900 HT Fast Real-Time PCR System. Repeated samples were compared using the CT values of the three replications. Data were normalized to the mean CT of 26S rRNA as an internal reference gene, and the relative expression ratio was calculated using the formula 2<sup>−ΔΔCT</sup>. The signal ratios were then log2-transformed. The gene-specific primers used for PCR are listed in Table S2.
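
The 2<sup>−ΔΔCT</sup> calculation can be illustrated with a short sketch. The function name and the CT values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from this study.

```python
import math
from statistics import mean


def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Relative expression ratio by the 2^-ddCT method.

    Each argument is a list of replicate CT values. The reference gene
    plays the role of 26S rRNA in the text.
    """
    # dCT: target normalized to the reference gene within each condition
    dct_treated = mean(ct_target_treated) - mean(ct_ref_treated)
    dct_control = mean(ct_target_control) - mean(ct_ref_control)
    # ddCT: treated condition relative to the untreated control
    ddct = dct_treated - dct_control
    return 2 ** (-ddct)


# Hypothetical triplicate CT values
ratio = relative_expression(
    ct_target_treated=[22.0, 22.0, 22.0],
    ct_ref_treated=[10.0, 10.0, 10.0],
    ct_target_control=[25.0, 25.0, 25.0],
    ct_ref_control=[10.0, 10.0, 10.0],
)
log2_ratio = math.log2(ratio)
```

With these toy values the target reaches threshold three cycles earlier after treatment, which corresponds to an 8-fold induction, or 3 on the log2 scale.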

### RESULTS

### Identification, Sequencing and General Features of the Sequence Composition of *T. durum* BAC Clones

A total of 516,000 clones from the durum wheat bacterial artificial chromosome (BAC) library were screened for individual clones harboring WRKY genes. Six T. durum BAC clones were selected and fully sequenced by 454 technology.

BAC clone annotation revealed 12 non-TE genic features, classified as 9 protein-coding genes, one hypothetical gene and 2 gene fragments (Table S4). The gene assignments were all supported by at least one full-length cDNA, an EST and/or a homolog in another monocot plant such as T. aestivum, O. sativa, H. vulgare, Z. mays, and B. distachyon. One gene per insert was predicted for TD14H23 (GenBank accession no. KY091673), TD473J01 (GenBank accession no. KY091677), TD493B21 (GenBank accession no. KY091678) and TD315C07 (GenBank accession no. KY091676). The remaining two BAC clones, TD16L16 (GenBank accession no. KY091674) and TD789O23 (GenBank accession no. KY091679), contain four genes each (**Table 1**). Two genes (of known or unknown function), three genes with putative function, four genes with domain-containing proteins, one hypothetical gene and one truncated (fragmented) gene were identified (**Table 1**, Table S4).

An average GC content of 47.5% was found for all BACs. In contrast with the constant GC content, gene and TE composition was highly variable between the different BAC sequences. The proportion of TEs ranged from 64 to 91%, while gene content ranged from 1 to 4 genes per BAC. The coding fraction represents 3.2% of the 960,769 bp of total sequence, while the TE content is 74.6%, distributed as follows: 63.6% for class I, 7.9% for class II and 3.1% for unclassified TEs (Table S3). While class I retrotransposons constitute the highest TE proportion of the 6 sequenced regions, BAC clone TD473J01 shows the highest proportion of CACTA class II elements (17.0%). Class I TE DNA sequences included 27.2% Gypsy-like (150 TEs) and 15.9% Copia-like (71 TEs) long terminal repeat (LTR) retrotransposons. New class I transposable elements were identified for the first time in this study; they account for 17.9% of the sequence length and were identified as de novo LTR retrotransposons (Table S5). Class II TEs (DNA transposons) represent 7.9% of the cumulative sequence length. The CACTA TEs represent the majority (32.4%) of class II DNA sequences. Novel class II TE families were identified, sharing weak homologies with known CACTA, Mutator and Mariner elements. The 3.1% of novel unclassified elements share a stretch of weak homology with other Triticeae unclassified transposable elements (Table S3, Text S2).
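
As a quick consistency check on the figures above, the sketch below shows how a GC percentage can be computed from a sequence and confirms that the three reported TE classes add up to the reported total. The function name and the toy sequences are assumptions for illustration; only the percentages come from the text.

```python
def gc_content(seq: str) -> float:
    """GC percentage of a sequence, ignoring ambiguous bases such as N."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


# TE fractions (% of cumulative sequence) as reported in the text
te_classes = {"class I": 63.6, "class II": 7.9, "unclassified": 3.1}
total_te = round(sum(te_classes.values()), 1)

# Toy sequence, not real BAC data
toy_gc = gc_content("GGCCAATT")
```

The class fractions sum to 74.6%, matching the overall TE content stated in the paragraph.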

### *TdWRKY* and Monocot Gene Structure

A comparison of durum wheat, T. aestivum, H. vulgare, and Brachypodium distachyon WRKY gene sequences helped to establish their structure.

Gene length varies from 1.2 to 4.2 kb and is conserved among the orthologous monocots (the lengths of TdWRKY1 and TdWRKY5 in particular were highly conserved among orthologous members). The number of introns varies from 2 to 4. The intron size is also variable, ranging from 102 to 925 bp (**Figure 1**, **Table 1**). All exon-intron junction sites obey the GT/AG rule, as identified in other eukaryotic genes. To date, the relative organization of the exons and introns is the same for the other WRKY genes characterized in cereals, i.e., the number of exons and introns remains the same and individual introns occur at relatively the same sites for barley, Brachypodium and wheat genes.
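
The GT/AG rule mentioned above is easy to verify once exon coordinates are known. The following is a minimal sketch with hypothetical helper names and a made-up toy gene; exon coordinates are assumed to be 0-based and half-open.

```python
def introns_from_exons(gene_seq: str, exons: list) -> list:
    """Return intron sequences lying between consecutive exons.

    `exons` is a list of (start, end) tuples sorted by position,
    0-based and half-open (assumed convention for this sketch).
    """
    return [
        gene_seq[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


def obeys_gt_ag(intron: str) -> bool:
    """True if the intron starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy gene: exon (6 bp) + intron (12 bp) + exon (6 bp); not real data
toy_gene = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
toy_exons = [(0, 6), (18, 24)]
toy_introns = introns_from_exons(toy_gene, toy_exons)
all_canonical = all(obeys_gt_ag(i) for i in toy_introns)
```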

The size of coding sequences varies from 882 to 1716 bp (**Figure 1**). The size of exonic regions between the orthologous genes was similar although the overall region structure was slightly different between TdWRKYs and their orthologs (exon 2 from TdWRKY1 and TdWRKY2 were almost the same size, but with 1 to 3 more intron phases in corresponding exons. The same was observed with exon 3 from TdWRKY5). Small stretches of exonic region sequences or more exons (BdiWRKY has 4 more exons, in the 5′ moiety, than TdWRKY4) do not contradict the general pattern of overall high WRKY gene conservation between the TdWRKYs and their homologous genes (Text S3).

Comparison of the TdWRKY cDNAs with sequences from other species showed identities of 86% for Brachypodium, 92% for barley and 99% for T. aestivum (**Table 1**, Text S3).

Interestingly, we also identified a cluster of non-syntenic collinear genes, probably originating from genomic rearrangements of gene blocks: gene 1 and gene 4 from TD789O23 (the two gene fragments in **Table 1**). Both were annotated as similar to the WRKY domain protein. Genes 1 and 4 (numbered by order on the BAC clone) share 93 and 94% similarity, respectively, with TdWRKY5. An alignment of TdWRKY5 from TD315C07 with the two fragments (gene 1 and gene 4) from TD789O23 revealed that gene 4 might be a truncated second allelic form of TdWRKY5 (**Figure 1**). The gene 4 genomic sequence is 989 bp shorter (from position 1 to 990, Figure S2) than TdWRKY5, which results in a new predicted coding protein. The predicted protein from gene 4 belongs to group II, as does TdWRKY5. The WRKY domain on exon 2, the PR intron special feature, and the zinc finger-like motif on exon 3 are perfectly conserved. The alignment of TdWRKY5, gene 4 and gene 1 showed that gene 1 and gene 4 are fragments of TdWRKY5 (Figure S2). This disruption was the result of a deletion and an insertion, two of the most important genomic rearrangement events, which play significant roles in genome evolution. This reshuffling created two new genes that are non-syntenic with orthologous species (**Table 1**). It may or may not represent a loss-of-function mutation of the second TdWRKY5 allelic form. Unfortunately, we were only able to recover approximately 450 bp of the 898 bp newly inserted fragment (gene 1), since it is located at the extreme 5′ start of the TD789O23 DNA insert. Exon shuffling may occur in many ways: it can involve transposable elements such as LINE-1 retroelements, Helitron transposons, CACTA elements and LTR retroelements, a crossover during sexual recombination, or alternative splicing. These hypothesized mechanisms should be thoroughly explored.

### TdWRKYs Classification and Phylogeny

WRKY proteins are classified into 3 groups. We used phylogeny to assign a group to our WRKY sequences (**Figure 2**). TdWRKY1, 2, and 4 belong to group I. They all have 2 WRKY domains, each with a C-X(4,5)-C-X(22,23)-H-X-H zinc-finger-like motif. The predicted TdWRKY2 protein contains two WRKY domains, and its N-terminal domain has an altered WRKY motif, WRKYGKK. An alignment of TdWRKY2 with its orthologous proteins from Aegilops, Brachypodium and T. urartu on the UniProtKB database (http://www.uniprot.org/align/) indicates that the protein is altered in its C-terminal conserved WRKY domain: 29 aa are missing from the zinc finger motif (Figure S1).

This might alter W-box recognition. The role of conserved residues in this domain has been studied by Maeo et al. (2001) using mutation experiments. Any mutation in either the WRKYGQK sequence or the zinc finger motif of WRKY domains that affects a cysteine or histidine abolishes DNA binding activity (Knoth et al., 2007). The C-terminal WRKY domain is responsible for binding to DNA, whereas the role of the N-terminal domain is to promote protein-protein interactions (Maeo et al., 2001).

TdWRKY3 and 5 were assigned to groups IIa and IIc, respectively, due to the presence of only one WRKY domain with a specific zinc finger motif, C-X(4,5)-C-X(22,25)-H-X-H.
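
The motif notation used above translates directly into regular expressions. The sketch below is an illustration only: the function name and the synthetic peptide are hypothetical, and the spacing X(22,23) is the group I form quoted in the text (the X(22,25) variant would need `{22,25}`). It is not a substitute for the CDD or Prosite searches used in this study.

```python
import re

# WRKYGQK heptapeptide, also accepting the WRKYGKK variant noted for TdWRKY2
WRKY_MOTIF = re.compile(r"WRKYG[QK]K")
# C-X(4,5)-C-X(22,23)-H-X-H zinc-finger-like motif
ZINC_FINGER = re.compile(r"C.{4,5}C.{22,23}H.H")


def find_wrky_features(protein: str) -> dict:
    """Locate WRKY heptapeptides and zinc-finger-like motifs (0-based)."""
    protein = protein.upper()
    return {
        "wrky_starts": [m.start() for m in WRKY_MOTIF.finditer(protein)],
        "zinc_finger_spans": [m.span() for m in ZINC_FINGER.finditer(protein)],
    }


# Synthetic peptide built to match the patterns; not a real WRKY protein
toy_protein = "MS" + "WRKYGQK" + "TTT" + "C" + "AAAA" + "C" + "A" * 22 + "HAH" + "SS"
features = find_wrky_features(toy_protein)

# Same peptide with the H-X-H end of the zinc finger removed
truncated = find_wrky_features(toy_protein[:40])
```

A sequence that keeps the heptapeptide but loses the histidine pair, as in the truncated toy peptide, still reports a WRKY motif but no zinc finger, which is the kind of disruption described for TdWRKY2.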

We searched the Prosite and Pfam databases using the TdWRKY sequences as queries to identify conserved domains, motifs and active phosphorylation sites (**Figure 3**, Text S4). The analysis predicted 11 specific sequence features across the TdWRKYs identified, including the WRKY domain, a Plant Zinc Clust domain, a Gly\_Rich region and a His\_Rich region (**Figure 3**). Seven types of active site were detected: 32 CK2\_Phospho sites, 22 PKC\_Phospho sites, 12 Asn\_Glycosylation sites, 29 Myristyl sites, 2 TYR\_phospho sites, one amidation\_site and 3 CAMP\_phospho\_sites. These features are related to subcellular localization, signal transduction, transcriptional regulation and protein interaction, and provide a basis for TdWRKY function. TdWRKY1, 4, and 5 might act as activators. TdWRKY2 and 3 contain the active repressor motif (LXLXLX) (Xie et al., 2005) (**Figure 3**).

For additional phylogeny support, the detection of the position and the phase of the intron in each region encoding a WRKY domain is essential. Four of the eight WRKY domains, found in TdWRKY genes, contain an intron in a conserved position (**Figure 1**). This phase 2 intron is localized at the 11th codon downstream of the WRKY motif, interrupting the codon encoding Arg (on 100% of genes). In Group Ia (TdWRKY1, 2, and 4) (the C-terminal WRKY domain), Group IIc (TdWRKY5), IId, IIe, and III genes, this intron comes after the codons for the invariant amino acid sequence PR and separates the WRKY sequence from the zinc finger motif. In Group IIa (TdWRKY3) and IIb genes the intron occurs at a nucleotide position that corresponds to five amino acids after the C-X5-C and separates this from the rest of the zinc finger structure (Tripathi et al., 2012). On TdWRKY3, the WRKY domain was separated from the zinc finger motif by the C-X5-C core but there was not an intron at this position (**Figure 1**).
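
Intron phase follows from the amount of coding sequence that precedes each intron. The sketch below assumes the usual convention (phase 0: the intron falls between two codons; phase 1: after the first base of a codon; phase 2: after the second base). The function name and the exon lengths are hypothetical.

```python
def intron_phases(coding_exon_lengths: list) -> list:
    """Phase of each intron, given the coding length (bp) of each exon.

    Only the coding portion of each exon should be counted, so UTR
    bases must be excluded from the first and last lengths.
    """
    phases = []
    coding_so_far = 0
    # There is one intron after every exon except the last
    for length in coding_exon_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


# Toy gene with three coding exons of 5, 4 and 6 bp
toy_phases = intron_phases([5, 4, 6])
```

In the toy gene the first intron interrupts a codon after its second base (phase 2), the situation described for the conserved intron that splits the Arg codon, while the second intron lies between codons (phase 0).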

### Hypothetical Function Deduced By Protein Domain and Orthologous Gene Function Sequence

The phylogenetic analyses revealed distinct clusters of TdWRKYs. Within the WRKY clade, distinct clusters emerged that correspond to WRKY groups and subgroups and comprise further sub-clusters. TdWRKYs formed one sub-cluster with WRKYs from wheat, barley, T. urartu and Aegilops, whereas rice, maize and sorghum WRKYs were grouped in a separate sub-cluster, and Arabidopsis WRKYs clustered together. Within sub-clusters, durum wheat and T. aestivum WRKYs showed less divergence (as indicated by a shorter branch length) than Brachypodium and barley WRKYs. Similarly, maize and sorghum WRKYs showed less divergence compared to rice. Monocot WRKYs showed less divergence than the closest Arabidopsis WRKYs (**Figure 2**).

TdWRKY1 clustered with TaWRKY19, involved in salt stress response. TdWRKY2 clustered with BdWRKY2 and AtWRKY34, involved in cold stress response (Zou et al., 2010). TdWRKY4 belongs to the TaWRKY53, TaWRKY2 and AtWRKY33 cluster, involved in salt stress (Jiang and Deyholos, 2009). TdWRKY3 clustered with TaWRKY80 and HvWRKY38, involved in dehydration response, and with AtWRKY18, 60, and 40 (Liu et al., 2012), which were described as regulating defense response. TdWRKY5 and its orthologs TaWRKY71, BdWRKY71, and AtWRKY8, involved in salt stress response, were grouped together (**Figure 2**).

Multiple sequence alignment of the conserved sequence from group Ia TdWRKYs (the two WRKY domains were aligned separately) and group IIa and IIc TdWRKY members with their closest Arabidopsis and monocot WRKY proteins (**Figure 4**) revealed the very conserved structure of the WRKY motifs and the amino acid residues potentially interacting with zinc ligands. TdWRKYs consist of a four-stranded β-sheet (1, 2, 3, and 4), with a zinc binding pocket formed by the conserved Cys/His residues located at one end of the β-sheet, and the WRKYGQK core, corresponding to the most N-terminal β-strand (strand β-1), kinked in the middle of the sequence by the Gly residue (Yamasaki et al., 2005).

### Putative cis-Acting Elements Identified in the *TdWRKY* Promoter Region

BAC sequences provide more data about the gene environment, affording more reliable and precise information for further functional studies. A 1.5 kb 5′ regulatory region upstream of the transcription start of the TdWRKY genes was subjected to in silico analysis using a plant cis-acting regulatory DNA elements (PLACE) signal scan to search for putative cis-regulatory elements potentially involved in the control of TdWRKY gene expression. The data indicated the presence of a large number of conserved cis-regulatory elements that are putative targets for TFs reported to mediate responses to environmental stresses or to stress-related hormones (**Table 2**). To regulate gene expression, WRKY factors bind preferentially to a DNA sequence called the W-box, (C/T)TGAC(C/T) (Ciolkowski et al., 2008). This DNA core was over-represented on TdWRKY3, 4 and 5, whereas 6 and 5 W-boxes were found on TdWRKY1 and 2, respectively.

It has also been shown that some WRKY factor types can bind to other types of cis-elements. ABRE-like motifs (ACGTG) and ABRE-related motifs (ACGTGKC and TACGTGTC) were also found in the promoter regions of TdWRKY1, 3, 4, and 5. The MYB-core element (TAACTG) and a number of MYB-related motifs (YAACKG, CNGTTR, and GGATG), as well as a MYC (CANNTG) motif and the MYC-related motifs (CATGTG and CACATG), were present on all promoters. MYC and MYB elements were the most numerous. DRE (TACCGACAT), CRT (RCCGAC), and low-temperature responsive elements (LTREs) (CCGAC), all containing the CCGAC motif that forms the core of the DRE sequence, were well represented in the promoters of TdWRKY1 and TdWRKY3. DRE-like elements such as CBFHv (RYCGAC) were also identified, mostly on TdWRKY1 and TdWRKY3. Two GCC-box motifs (AGCCGCC), target sequences for ERF proteins, were found in the TdWRKY1 and TdWRKY4 promoters. RAV1-A motifs (CAACA), to which RAV1 proteins can bind through their AP2 and B3-like domains, were also present in large numbers on TdWRKY promoters (**Table 2**). The average cis-element frequencies on WRKY gene promoters were deduced from TdWRKYs and from orthologous TaWRKYs and HvWRKYs, which were also scanned for regulatory elements. The results showed that the promoters of TdWRKY genes are the richest in putative regulatory elements compared to HvWRKYs and TaWRKYs. Moreover, the frequencies of GCC boxes, MYB elements and W-boxes in A. thaliana promoter regions are significantly lower than in hexaploid wheat, durum wheat and barley, which might be related to the fact that the average frequency was calculated across the 74 WRKY members (Dong et al., 2003). Promoter analysis in Populus revealed that various cis-acting elements (LTRE, ABRE, ABA, and W-boxes) involved in abiotic stress and phytohormone responses were highly represented in the promoter regions of PtrWRKY genes (Jiang et al., 2014). The data presented by Makhloufi et al. (2014) on durum wheat TdERF indicated the presence of a large number of conserved cis-regulatory elements that are putative targets for TFs reported to mediate responses to environmental stresses or to stress-related hormones.
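
Counting such elements amounts to matching IUPAC consensus strings on both strands of the promoter. The sketch below is a simplified stand-in for a PLACE-style scan, not the tool itself: the function names and the toy sequences are assumptions, while the consensus strings are those quoted in the text, with the W-box (C/T)TGAC(C/T) written as YTGACY.

```python
import re

# IUPAC nucleotide codes needed for the motifs quoted in the text
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "K": "[GT]", "M": "[AC]", "S": "[CG]", "W": "[AT]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYKMSWN", "TGCAYRMKSWN")

MOTIFS = {
    "W-box": "YTGACY", "ABRE-like": "ACGTG", "MYB-core": "TAACTG",
    "MYC": "CANNTG", "CRT": "RCCGAC", "LTRE": "CCGAC",
    "GCC-box": "AGCCGCC", "RAV1-A": "CAACA",
}


def motif_revcomp(motif: str) -> str:
    """Reverse complement of an IUPAC consensus string."""
    return motif.translate(COMPLEMENT)[::-1]


def motif_regex(motif: str):
    """Compile a consensus into a lookahead regex so overlaps are counted."""
    return re.compile("(?=(" + "".join(IUPAC[b] for b in motif) + "))")


def count_sites(seq: str, motif: str) -> int:
    """Count motif occurrences on both strands of `seq`."""
    seq = seq.upper()
    forward = sum(1 for _ in motif_regex(motif).finditer(seq))
    rc = motif_revcomp(motif)
    if rc == motif:  # palindromic consensus: each site is shared by both strands
        return forward
    return forward + sum(1 for _ in motif_regex(rc).finditer(seq))


# Toy promoter with one W-box on each strand; not real sequence data
toy_promoter = "AATTGACCAAGGTCAAAA"
w_boxes = count_sites(toy_promoter, MOTIFS["W-box"])
```

Short consensus strings such as CAACA or CANNTG are expected to occur often by chance in 1.5 kb, so raw counts of this kind are best read alongside the frequencies in orthologous promoters, as done in the text.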

### The Expression Patterns of *TdWRKY*S under Salt-Stress Conditions in Two Contrasting Genotypes

The expression pattern of TdWRKYs in response to short-term salt stress was analyzed in both leaves and roots of the OR and GR durum wheat genotypes, shown to be sensitive and tolerant to high salt, respectively (Makhloufi et al., 2014). Specific primers (Table S2) were designed and used in quantitative real-time PCR. Upon high-salt treatment (200 mM NaCl), the expression levels of TdWRKY1, 3, 4, and 5 were altered (up to 19-fold for TdWRKY4) in the leaves of the sensitive genotype after 6 h (**Figure 5**). Meanwhile, at the same time point in the tolerant genotype, transcript levels remained almost unchanged for TdWRKY3, 4, and 5, and increased almost 5-fold relative to the control for the remaining gene (**Figure 5A**). Thereafter, expression of the TdWRKY genes in leaf tissues increased dramatically at 24 h in both genotypes, although the upregulation was substantially higher in GR (27-fold, 17-fold, 11-fold, and 15-fold for TdWRKY1, 3, 4, and 5, respectively) than in OR (1.5-fold, 13-fold, 3-fold, and 12-fold for TdWRKY1, 3, 4, and 5, respectively) (**Figure 5A**). In treated roots, transcript levels of the four TdWRKYs remained unchanged at 6 h in both the sensitive and tolerant genotypes. Salt application decreased transcript levels of TdWRKY3 (12-fold) and TdWRKY4 (14-fold) after 24 h in the sensitive genotype. In the tolerant genotype, the expression levels of all TdWRKYs remained constitutive after 24 h of stress treatment (**Figure 5B**).

TdWRKY2 carries a deletion within the fourth and last exon, just after the position encoding the second Cys of the zinc finger motif (Figure S1). We verified disruption of the gene by qRT-PCR analysis: two primer pairs positioned 3′ around the deletion site did not amplify any product (Table S2). As a disruption of the zinc finger motif has been shown to completely abolish the W-box-specific DNA binding activity of WRKY transcription factors (Maeo et al., 2001), it is very likely that TdWRKY2 is a true loss-of-function gene, as the predicted TdWRKY2 protein sequence suggested.

### DISCUSSION

### TE Expansion Responsible for Durum Wheat Genome Organization

Despite the accumulation of complete plant genome sequences, the most comprehensive studies of gene-space organization have been carried out on individual BAC clones or on broader regions composed of overlapping BACs, also called contigs (Brenner et al., 1993; Vitte and Bennetzen, 2006; Liu et al., 2007). From the 6 representative BACs, we obtained a cumulative sequence length of almost 1 Mb. TE proliferation was pronounced (74.6% of the sequence). LTR retrotransposons (Long Terminal Repeats, class I) appear to be the most represented in durum wheat sequences (63.6%), as among all grasses (Devos, 2010). Class II DNA transposons are generally less invasive than the retrotransposons in plant genomes. Durum wheat class II representation (7.9%) is slightly lower than that of bread wheat (16%). Within class II members, MITE elements represent only 0.5% of durum wheat sequences, the same percentage as observed in bread wheat (Choulet et al., 2010). The composition of durum wheat sequences is closely comparable to that of bread wheat and consequently to that of the maize genome (Schnable et al., 2009).

Although they are representative of the abundant wheat TEs available in the TREP database (Wicker et al., 2002, 2007; http://botserv2.uzh.ch/kelldata/trep-db/TEClassification.html), the class I and class II TEs observed in the genomic sequences of the wheat genomes may not cover all wheat TEs. It is expected that more wheat TEs will be identified as more wheat genomic sequences become available and as more de novo TE annotation tools are developed (Choulet et al., 2010; Flutre et al., 2011). This is particularly supported by the identification in this study of different novel TE families, most of which are retrotransposons (17.9% of cumulative sequence).

### Durum Wheat Genes Clustered into Small Islands

Our data show that durum wheat genes are clustered mainly into several very small islands (from one to four genes) per BAC, separated by large blocks of repetitive elements. Overall durum wheat gene density was estimated at one gene every 80 Kb.
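
The gene density estimate follows from two numbers given in the Results. The sketch below assumes that the cumulative sequence length is 960,769 bp (consistent with the "almost 1 Mb" stated in the Discussion) and that all 12 annotated non-TE genic features are counted as genes; the variable names are illustrative.

```python
# Values taken from the Results; the unit (bp) is an assumption
cumulative_length_bp = 960_769
genic_features = 12

kb_per_gene = cumulative_length_bp / genic_features / 1000
rounded_kb_per_gene = round(kb_per_gene)

# Per-BAC gene counts reported in the text: four BACs with 1 gene, two with 4
genes_per_bac = [1, 1, 1, 1, 4, 4]
total_from_bacs = sum(genes_per_bac)
```

Under these assumptions the density comes to roughly one gene per 80 kb, and the per-BAC counts add up to the same 12 features.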

Gene islands reflect a proliferation of the genome. They are common features of large and repetitive genomes, such as the wheat and maize genomes, and are not found in small genomes such as those of rice, Arabidopsis thaliana (The Arabidopsis Genome Initiative, 2000), and Brachypodium (Huo et al., 2009). In maize, two BAC contigs (961 and 594 kb) were analyzed and gene blocks varied between one and four genes per block (Kronmiller and Wise, 2008). This suggests that within large plant genomes, gene islands may originate, first, from specific selection against the separation of genes by TE insertions that would be deleterious for gene expression or regulation and, second, from a homogeneous expansion combined with preferential deletions in gene-rich regions (Choulet et al., 2010).

Genes from islands would share common functional characteristics (Hurst et al., 2004). Genes are maintained close to one another because this configuration would provide a selective advantage and functional significance (Batada et al., 2007; Janga et al., 2008). This is confirmed by the identification of co-expressed genes or relatives sharing the same functions that are conserved between the Arabidopsis, rice and poplar genomes (Liu and Han, 2009). TdWRKY2 and gene 2 from TD16L16, which codes for a chloride channel protein (CLC) (**Table 1**) involved in vacuolar compartmentation during salt stress (Hechenberger et al., 1996), might share the same function. This assumption is made more defensible by their very close genomic distance: they are separated by only 500 bp. The same gene structure is conserved in the B. distachyon orthologous region. Gene ontology and expression profiles along the chromosome in T. aestivum revealed that these islands are enriched in genes sharing the same function or expression profiles, suggesting the existence of long-distance regulation mechanisms in wheat (Choulet et al., 2010).

### Correlation between Sequence Similarity and Functional Similarity

Protein sequences contain important information for protein function. We found that TdWRKYs have sequence features including the following domains and motifs: WRKY domains, which bind WRKY proteins to the W-box motif in the promoter of target genes for transcription regulation; plant zinc finger motifs, which function in association with WRKY domains; and NLS (nuclear localization signal) peptides, responsible for directing proteins to the cell nucleus. They also carry active sites such as protein kinase phosphorylation sites, which play important roles in signal transduction pathways. Special features such as Myristyl sites could function during TdWRKY5-mediated gene responses to stresses, as myristoylation sites play a vital role in membrane targeting and signal transduction in plant responses to environmental stresses (Podell and Gribskov, 2004).

Homologous genes with similar sequences are likely to have equivalent functions and to play the same functional role in equivalent biological processes. It is thus very important to identify homologous genes, especially those which are supported by experimental data. TdWRKYs from different groups (Ia, IIa, and IIc) have a close relationship with orthologous genes from barley, Brachypodium, bread wheat, maize, sorghum, Aegilops, T. urartu, rice and Arabidopsis. Genes from T. aestivum, Ae. tauschii, T. urartu, and H. vulgare shared the closest similarity, while rice, sorghum and maize were out-grouped from other monocot plants.

As summarized in **Table 2**, the promoter regions of TdWRKYs were highly rich in cis-acting elements, and most of them were related to stress-induced gene expression, suggesting a putative role, essentially of the TdWRKY1, 3, 4, and 5 genes, in wheat responses to a variety of environmental stresses. A TBLASTN search of the NCBI nucleotide collection database with the 1.5 kb TdWRKY cis-element regions showed that the TdWRKY1-UTR region shares 92% similarity from base 1 to 419 bp with T. aestivum 3B genome scaffold HG670306.1 from 28534545576 to 285345992. This 419 bp stretch contains 4 W-box, 4 MYB, 5 MYC, 4 RAV, and 1 ABRE elements. The 1.5 kb TdWRKY2-UTR region (632-949) also shares 84% similarity with HG670306.1 (74734822-74734532). Elements which might be common to the orthologous regions are 3 W-box, 4 MYB, and 4 MYC. 100% homology was found between the 1.5 kb TdWRKY5 putative cis-element region and its homolog HG670306 from coordinate 436276014 to coordinate 436275955. They share all the cis-elements in **Table 2**. Finally, 99% was shared between all UTR regions from KC174859.1 (TaWRKY53) and the 1.5 kb region from 736 to 1501. This region contains 4 W-box, 6 MYB, 1 MYC, 4 GCC cores, 2 RAV, 2 LTREs, and 3 ABREs.


Two hypotheses might be formulated. Firstly, genes that share the same acting elements on their promoters may function the same way and, secondly, homologous cis-acting elements of TdWRKY1, TdWRKY2, and TdWRKY4 are assigned to the 3B chromosome in bread wheat, suggesting that TdWRKY1, 2, and 4 might be also accommodated within chromosome 3B or 3A from T. turgidum.

Promoter region analyses, gene structure within orthologous members, and phylogeny studies all indicated that TdWRKYs are novel members of the WRKY family in durum wheat and, given their high sequence homology with orthologous monocots and Arabidopsis' known function, it can be postulated that they play similar roles in mediating responses to biotic and abiotic stresses.

### Differential Expression of *TdWRKY*S

In a context where no functional characterization has yet been carried out for WRKYs in durum wheat, our data showed that the TdWRKY genes were inducible by high-salt treatment. Moreover, TdWRKY expression in response to salt stress displayed distinctive patterns in two durum wheat genotypes with contrasting tolerance to abiotic stresses. In the tolerant GR variety, TdWRKYs were strongly induced by salt stress within a few hours (6 h), whereas they were downregulated in the sensitive OR variety. Notably, differences between tolerant and sensitive genotypes were detected mainly in the expression levels in tolerant genotype leaves at 24 h of stress treatment. Peng et al. (2014) showed that unigenes of WRKY members were mostly up-regulated under salt stress in cotton (Gossypium hirsutum L.). They noted that some WRKY genes were expressed in the salt-tolerant genotype Earlistaple 7, but were repressed, weakly induced, or not induced at all in the salt-sensitive Nan Dan Ba Di Da Hua, within 24 h. HvvWRKY2 was induced by salt stress in TR1 (tolerant variety) but not in TS1 (sensitive variety) (Li et al., 2014). Similarly, expression of wheat TaWRKY2 and TaWRKY19 was induced by salt, and both TaWRKY2 and TaWRKY19 enhanced salt tolerance in transgenic Arabidopsis plants compared with the wild type (Niu et al., 2012). The induction pattern showed that the highest gene expression occurs at 3–6 h after salt stress initiation for TaWRKY2 and TaWRKY19, and at 6 and 24 h for HvvWRKY2, which is likely to be the same for TdWRKY members.

### CONCLUDING REMARKS

The complexity and large size (13,000 Mb) of the durum wheat genome have largely prevented the development of genomic resources. Meanwhile, efforts have focused on sequencing target regions selected as covering one or more genes of interest, each called a locus. Several BAC clones covering the corresponding region in one or more wheat genomes have been sequenced as well.

In this study, we targeted durum wheat BAC clones harboring TdWRKY genes potentially involved in response to salt stress. We validated six BAC clones from the genomic library TtuLDN65. The size of these clone inserts after sequencing varies between 120 and 190 kb. The added value of such an approach is that we obtain the coding sequence with the introns and promoter regions that are essential for expression or functional study. Furthermore, it enabled us to access the entire genomic environment of the coding sequence, which might provide new information about structure, conservation, position, order and genomic dynamics. A structural study of the environment of a gene encoding a resistance or tolerance protein can bring a multitude of information that can enrich functional study.

In this study, we identified 5 novel TdWRKY genes potentially involved in salt stress tolerance. Sequence comparison with orthologs from barley, wheat and Brachypodium provides valuable information for determining gene structure. The TdWRKY1, TdWRKY3, TdWRKY4, and TdWRKY5 gene sequences, as well as their exon-intron boundaries, were highly conserved, even with Arabidopsis. These important structural similarities between orthologous WRKY genes are indicative of potential functional conservation.

### AUTHOR CONTRIBUTIONS

HB, MB, AG initiated the project. FY, EM, WM performed experiments. FY performed analysis and interpretation of data for the work. FY wrote the paper. HB, WM revised the paper critically for important intellectual content.

### ACKNOWLEDGMENTS

This research was carried out at the French Plant Genomic Resource Center (CNRGV) (INRA, Toulouse) and the Laboratoire de Genomique et Biotechnologie des Fruits (GBF) (INRA, Toulouse). The authors would like to thank Philippe Leroy (INRA, Clermont-Ferrand) for the TriAnnot pipeline training.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016.02034/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Yousfi, Makhloufi, Marande, Ghorbel, Bouzayen and Bergès. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# De novo Assembly and Characterization of the Transcriptome of Broomcorn Millet (Panicum miliaceum L.) for Gene Discovery and Marker Development

#### Edited by:

Gautam Sarath, United States Department of Agriculture – Agricultural Research Service, USA

#### Reviewed by:

Teresa Donze, University of Nebraska Lincoln, USA; Erin D. Scully, United States Department of Agriculture – Agricultural Research Service, USA

#### \*Correspondence:

Weining Song sweining2002@yahoo.com; Xiaojun Nie small@nwsuaf.edu.cn

†These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science

Received: 31 March 2016; Accepted: 08 July 2016; Published: 21 July 2016

#### Citation:

Yue H, Wang L, Liu H, Yue W, Du X, Song W and Nie X (2016) De novo Assembly and Characterization of the Transcriptome of Broomcorn Millet (Panicum miliaceum L.) for Gene Discovery and Marker Development. Front. Plant Sci. 7:1083. doi: 10.3389/fpls.2016.01083

Hong Yue<sup>1</sup>†, Le Wang<sup>1</sup>†, Hui Liu<sup>1</sup>, Wenjie Yue<sup>1</sup>, Xianghong Du<sup>1</sup>, Weining Song<sup>1,2,3</sup>\* and Xiaojun Nie<sup>1,2</sup>\*

<sup>1</sup> College of Agronomy, Northwest A&F University, Yangling, China, <sup>2</sup> State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling, China, <sup>3</sup> Australia-China Joint Research Centre for Abiotic and Biotic Stress Management in Agriculture, Horticulture and Forestry, Northwest A&F University, Yangling, China

Broomcorn millet (Panicum miliaceum L.) is one of the world's oldest cultivated cereals. It is well-adapted to extreme environments such as drought, heat, and salinity and has efficient C4 carbon fixation. Discovery and identification of the genes involved in these processes will provide valuable information for improving the crop to meet the challenge of global climate change. However, the lack of genetic resources and genomic information makes gene discovery and molecular mechanism studies very difficult. Here, we sequenced and assembled the transcriptome of broomcorn millet using Illumina sequencing technology. After sequencing, a total of 45,406,730 and 51,160,820 clean paired-end reads were obtained for the two genotypes Yumi No. 2 and Yumi No. 3, respectively. These reads were mixed and then assembled into 113,643 unigenes, ranging in length from 351 to 15,691 bp, of which 62,543 contigs could be assigned to 315 gene ontology (GO) categories. Cluster of orthologous groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses assigned 51,020 unigenes to 25 COG categories and 15,514 unigenes to 202 KEGG pathways, respectively. Furthermore, 35,216 simple sequence repeats (SSRs) were identified in 27,055 unigene sequences, of which trinucleotides were the most abundant repeat unit, accounting for 66.72% of SSRs. In addition, 292 differentially expressed genes were identified between the two genotypes, which were significantly enriched in 88 GO terms and 12 KEGG pathways. Finally, the expression patterns of four selected transcripts were validated through quantitative reverse transcription polymerase chain reaction analysis. This study is the first to sequence and assemble the transcriptome of broomcorn millet. It not only provides a rich sequence resource for gene discovery and marker development in this important crop, but will also facilitate further investigation of the molecular mechanisms underlying its favorable agronomic traits.

Keywords: abiotic stress, broomcorn millet, transcriptome, qRT-PCR, SSR

### INTRODUCTION

fpls-07-01083 July 21, 2016 Time: 11:4 # 2

Broomcorn millet (Panicum miliaceum L.) is one of the earliest domesticated cereals worldwide and has historical, evolutionary and agricultural significance (Yang et al., 2012; Stephens et al., 2014). Broomcorn millet was domesticated as early as 10,000 years ago in the semiarid regions of China (Lu et al., 2009), and it was an indispensable food staple in many semiarid regions of East Asia before the cultivation of rice and wheat (Hu et al., 2008). Thus, broomcorn millet had a vital impact on human civilization (Crawford, 2006; Bellwood et al., 2007). Genetically, broomcorn millet is a tetraploid species with a chromosome number of 36 (2n = 4x = 36; Hunt et al., 2010). As a short-day C4 crop, it has many favorable agronomic traits, such as high productivity, a short growing season (60–90 days), a low water requirement and good adaptation to extreme conditions (Hunt et al., 2014). Previous studies revealed that broomcorn millet shows high drought and salinity tolerance (Dai et al., 2011). Thus, discovery and identification of stress-responsive genes from broomcorn millet will not only provide useful information for better understanding the molecular mechanisms of stress tolerance, but also provide indispensable gene resources for improving tolerance in this species as well as other crops (Jiang and Deyholos, 2006).

With the development of next-generation sequencing (NGS) technology, RNA sequencing (RNA-seq) has become a powerful and highly efficient method for obtaining a large number of transcripts and identifying differentially expressed genes (DEGs) at the transcriptome level, and it has been widely used in various plants (Clark et al., 2015; Sartelet et al., 2015; Xu et al., 2015; Zhan et al., 2015). Prior to this study, transcriptome analysis had not been performed in broomcorn millet, which limited further gene identification and molecular mechanism studies (Miller et al., 2002; Salgado et al., 2014; Yates et al., 2014). Here, de novo assembly and analysis of the broomcorn millet transcriptome were performed using high-throughput Illumina paired-end RNA sequencing technology, with the aim of enriching the genetic information and sequence resources available for gene discovery and marker development studies. This is the first study to report the transcriptome characteristics of broomcorn millet; it provides useful information for molecular studies and sheds light on the molecular mechanisms of stress tolerance in broomcorn millet.

### MATERIALS AND METHODS

### Plant Materials and RNA Extraction

Two broomcorn millet genotypes, Yumi No. 2 and Yumi No. 3, kindly provided by Dr. Bai-Li Feng, College of Agriculture, Northwest A&F University, were used as materials in this study. Yumi No. 2 is a waxy, drought-sensitive cultivar, while Yumi No. 3 is a non-waxy, drought- and salt-tolerant cultivar. Seeds of both cultivars were germinated and grown in pots containing peat mixed with sand at a 1:1 ratio. Plants were watered normally and grown under glasshouse conditions (22°C, 16 h photoperiod/20°C, 8 h dark period). For RNA-Seq library construction, tissue samples including leaves, stems, spikes, and roots were collected from five individual plants at different time points. Leaves and roots were harvested at the seedling, jointing, booting, and filling stages, while stems were harvested at the jointing, booting, filling, and mature stages. Spikes were harvested at the mature stage. Three independent tissue collections were performed as biological replicates. All samples were immediately frozen in liquid nitrogen and stored at −80°C for RNA isolation.

Total RNA was isolated separately from each collected sample using TRIzol reagent (Invitrogen) according to the manufacturer's instructions. RNA quality was checked by agarose gel electrophoresis and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), and RNA quantity was measured with a NanoDrop ND-1000 Spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA). Finally, equal amounts of RNA isolated from the tissues of each genotype were pooled into a single RNA sample per genotype (one for Yumi No. 2 and one for Yumi No. 3), and the pooled samples were used for RNA sequencing.

### Library Construction and Illumina Sequencing

RNA-seq library construction and sequencing were performed using Illumina's standard pipeline (Illumina, San Diego, CA, USA). In brief, magnetic oligospheres were used to remove rRNA or tRNA and to enrich mRNA. The mRNA was then sheared into short fragments, and fragments of 180 bp were recovered and column purified. First-strand cDNA was synthesized using random hexamer primers, followed by second-strand cDNA synthesis using DNA polymerase I and RNase H (Invitrogen). The cDNA fragments were end-repaired and A-tailed, and index adapters were then ligated. The purified cDNA libraries were amplified by polymerase chain reaction (PCR) for 15 cycles, the PCR products were separated on Certified Low Range Ultra Agarose, and suitable fragments were selected for deep sequencing. Sequencing was performed on the Illumina HiSeq 2500 platform with the PE100 approach by Shanghai Majorbio Bio-pharm Technology Corporation. All clean Illumina sequencing data have been deposited in the SRA database under accession number SAMN05255231.

### Sequence Data Analysis and De novo Assembly

The quality of the raw reads from the two libraries was checked using SeqPrep and Sickle software. Adapter contamination, low-quality bases (quality score ≤ 20), reads containing more than 10% ambiguous bases, and reads shorter than 20 bases were removed. The resulting clean reads were mixed and then de novo assembled using the Trinity program with the default parameters<sup>1</sup>. K was set to 25 for the Inchworm step, and the transcript isoform with the highest relative abundance after the Butterfly step was selected as the representative transcript for each gene.
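The read-level filters listed above can be illustrated with a minimal Python sketch. It is not the SeqPrep or Sickle implementation: the function names are hypothetical, adapter removal is omitted, and low-quality bases are assumed to be trimmed only from the 3' end, which is a simplification of the windowed trimming such tools typically apply.

```python
def trim_low_quality_tail(seq, quals, min_q=20):
    """Trim bases from the 3' end while their quality score is <= min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] <= min_q:
        end -= 1
    return seq[:end], quals[:end]


def keep_read(seq, quals, min_q=20, max_n_fraction=0.10, min_length=20):
    """Return the trimmed (seq, quals) if the read passes, otherwise None."""
    seq, quals = trim_low_quality_tail(seq, quals, min_q)
    # Discard reads shorter than 20 bases after trimming.
    if len(seq) < min_length:
        return None
    # Discard reads with more than 10% ambiguous (N) bases.
    if seq.upper().count("N") / len(seq) > max_n_fraction:
        return None
    return seq, quals
```

Here `quals` is assumed to be a list of integer quality scores of the same length as `seq`; decoding them from a FASTQ quality string is left out of the sketch.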

<sup>1</sup>http://trinityrnaseq.sourceforge.net

Clustering analysis with CD-HIT software at 95% identity was performed to reduce redundancy among transcripts derived from homeologous genes or different alleles of the same gene, which are common artifacts of pooling tissues from multiple individuals for transcriptome sequencing.

### Functional Annotation


To predict their functions, all unigenes were used as blastx queries, with an e-value < 1e-5, against the following databases: the NCBI Non-redundant Protein (NR) database (2016/1/12), the Swiss-Prot protein database (UniProtKB/Swiss-Prot protein knowledgebase release 2015\_12 statistics), the Cluster of Orthologous Groups (COG) database and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. To evaluate sequence similarity and predict function from homologous genes, three close relatives of broomcorn millet, Panicum hallii, Panicum virgatum and Setaria italica, were also analyzed. The genome annotation sequences of P. hallii<sup>2</sup> and P. virgatum<sup>3</sup> were obtained from JGI Phytozome, and those of S. italica were obtained from the Foxtail millet Database<sup>4</sup>. The hits with the highest sequence similarity were retrieved for analysis. Based on the NR annotation, the 10 top-hit species were identified and gene ontology (GO) classifications were annotated with the Blast2GO program. Metabolic pathways were annotated using KEGG. Goatools<sup>5</sup> and KOBAS (Xie et al., 2011) were used to identify enriched GO terms and KEGG pathways among the DEGs between Yumi No. 2 and Yumi No. 3. DEGs were identified using Fisher's exact test and P-values were corrected for multiple hypothesis testing.
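As an illustration of how the best hit per unigene might be retrieved under the e-value threshold above, the following Python sketch parses BLAST results in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The export format, the function name `best_hits`, and the use of bit score to rank hits are assumptions made for the example; the text only states that the hits with the highest sequence similarity were retained.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Pick one best hit per query from BLAST 12-column tabular lines."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Keep only hits below the e-value cutoff (e-value < 1e-5).
        if evalue >= max_evalue:
            continue
        current = best.get(query)
        # Assumption: the hit with the highest bit score is the best hit.
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Queries with no hit below the cutoff are simply absent from the returned dictionary, which corresponds to unigenes left unannotated by that database.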

### SSR and SNP Marker Screening

Potential simple sequence repeats (SSRs) were detected using MISA software<sup>6</sup>. Repeat units of one to six nucleotides in length were considered. The minimum number of repeats was 10 for mononucleotides, six for dinucleotides, and four for tri-, tetra-, penta-, and hexanucleotides. The maximum distance between two SSRs in a compound microsatellite was 100 nucleotides. Assembled contigs were scanned for single nucleotide polymorphisms (SNPs) with the SNP detection software SOAPsnp (Li et al., 2009), using the method described by Chen et al. (2014) for SNP calling in Stevia rebaudiana. SNPs with low mapping quality scores (≤ 20) and SNPs less than five nucleotides apart were discarded.
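The SSR thresholds described above can be sketched with a regular-expression scan in Python. This is an illustrative approximation rather than the MISA algorithm: it finds perfect repeats only, does not merge neighbouring SSRs into compound microsatellites, and the names `MIN_REPEATS` and `find_ssrs` are hypothetical.

```python
import re

# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 4, 4: 4, 5: 4, 6: 4}


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, for example
            # "AAA" is a mononucleotide repeat, not a trinucleotide one.
            if unit_len > 1 and any(
                unit_len % d == 0 and motif == motif[:d] * (unit_len // d)
                for d in range(1, unit_len)
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)
```

Start positions are 0-based offsets into the sequence. Handling compound microsatellites would require an additional pass that joins hits separated by no more than 100 nucleotides.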

### Analysis and Annotation of Differentially Expressed Genes (DEGs)

Bowtie was used for read mapping. Expression values were normalized as reads per kilobase of exon region per million mapped reads (RPKM). Raw read counts were analyzed with edgeR<sup>7</sup> to identify DEGs. For multiple hypothesis testing, the false discovery rate (FDR) was used to set the P-value threshold (Benjamini et al., 2001). An FDR ≤ 0.05 and a fold change (FC) of at least 2 (|log<sub>2</sub>FC| ≥ 1) were used to define DEGs. Scatter and volcano plots were drawn with the geWorkbench platform<sup>8</sup> using the logRPKM and log<sub>2</sub>FC values. All DEGs were subjected to GO and KEGG annotation, and cluster analysis of DEG patterns was performed using hCluster software<sup>9</sup> following the method previously described by Liu G. et al. (2012).
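The normalization and DEG thresholds above can be made concrete with a short Python sketch. It does not reproduce edgeR's statistical model: the p-values are taken as given, the Benjamini-Hochberg step-up procedure is assumed as the FDR correction, and the function names and the pseudocount are illustrative choices.

```python
import math


def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_deg(expr_a, expr_b, fdr, max_fdr=0.05, min_abs_log2fc=1.0, pseudo=1e-9):
    """Apply the FDR <= 0.05 and |log2 FC| >= 1 thresholds from the text."""
    # The small pseudocount only guards against division by zero.
    log2fc = math.log2((expr_a + pseudo) / (expr_b + pseudo))
    return fdr <= max_fdr and abs(log2fc) >= min_abs_log2fc
```

For example, a 1,000 bp transcript with 500 mapped reads in a library of 10 million mapped reads has an RPKM of 50.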

### qRT-PCR Analysis

Broomcorn millet plants of Yumi No. 2 and Yumi No. 3 were adequately watered and grown under glasshouse conditions (22°C, 16 h photoperiod/20°C, 8 h dark period). After 1 month of growth, the plants were transferred to 4°C, 38°C or 200 mM NaCl, representing low temperature, heat and salt treatments, respectively (Li et al., 2014). Whole plants were collected separately at 0, 3, 6, 12, and 24 h under each treatment, with unstressed plants as the control. The experiment was conducted three independent times with five biological replicate seedlings per treatment. Tissue from the five replicate seedlings was pooled for RNA isolation using the RNAiso reagent (TaKaRa, Japan). The three experiments were treated as biological replicates for RNA extraction, and equivalent amounts of RNA from the three biological replicates of each sample were then pooled into a single RNA sample. cDNA was prepared using the PrimeScript™ RT reagent Kit (TaKaRa, Japan) following the manufacturer's instructions. Quantitative reverse transcription polymerase chain reaction (qRT-PCR) was performed using SYBR Green master mix (TaKaRa, Japan) on an ABI 7300 real-time PCR system (Applied Biosystems, USA). The thermal cycling conditions were as follows: 3 min at 94°C, followed by 40 cycles of 95°C for 15 s, 60°C for 30 s, and 72°C for 1 min. Actin was used as an internal control (Supplementary Table S1). Each reaction was performed in triplicate and the 2<sup>−ΔΔCt</sup> method was used to calculate the expression levels. Student's t-test was used for statistical analysis.
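The 2<sup>−ΔΔCt</sup> calculation named above can be written out as a short Python sketch. The function name and argument layout are hypothetical; each argument is assumed to hold the triplicate Ct values of one reaction, with Actin as the reference gene and the untreated sample as the control.

```python
from statistics import mean


def relative_expression(target_treated, ref_treated, target_control, ref_control):
    """Fold change of a target gene by the 2^(-ddCt) method.

    Each argument is a list of technical-replicate Ct values.
    """
    # dCt: target Ct minus reference (Actin) Ct within each sample.
    d_ct_treated = mean(target_treated) - mean(ref_treated)
    d_ct_control = mean(target_control) - mean(ref_control)
    # ddCt: treated dCt minus control dCt.
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)
```

A target that crosses threshold three cycles earlier in the treated sample than in the control, with unchanged Actin Ct values, gives a fold change of 2<sup>3</sup> = 8.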

### RESULTS AND DISCUSSION

### Transcriptome Sequencing and De novo Assembly

To generate a comprehensive broomcorn millet transcriptome, RNA isolated from the leaves, stem, root, shoots, flower, and spike of the two cultivars was pooled in equal amounts for each cultivar, and the two pools were sequenced separately using the Illumina HiSeq 2000 platform. After sequencing and quality filtering, a total of 45,406,730 high-quality reads for Yumi No. 2 and 51,160,820 reads for Yumi No. 3 were obtained, accounting for approximately 4.3 and 4.9 Gb, respectively (Supplementary Table S2). The reads were then mixed and de novo assembled using Trinity software. A total of 113,643 contigs with a total assembly length of 164,535,293 bp were obtained. To assess the quality and coverage of the assembly, we mapped the paired reads of the two cultivars back to these contigs. In total, 132,599,016 (93.80%) and 113,424,962 (93.29%) reads from Yumi No. 2 and Yumi No. 3, respectively, could be mapped back to the contigs, supporting the accuracy of the assembly. The contigs ranged in size from 351 to 15,691 bp with an average size of 1,448 bp; 24,750 (21.78%) ranged from 100 to 600 bp, 23,423 (20.61%) ranged from 600 to 1000 bp, and 65,470 were longer than 1 kb (**Figure 1**). To the best of our knowledge, sequence resources for broomcorn millet are scarce at present (Hunt et al., 2011). Therefore, the unigene dataset reported here significantly enriches the sequence resources and genetic information available for broomcorn millet and provides a foundation for further study of gene expression, gene function and gene regulation pathways in this species.

<sup>2</sup>https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org\_Phallii

<sup>3</sup>https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org\_Pvirgatum

<sup>4</sup>http://foxtailmillet.genomics.org.cn/page/species/download.jsp

<sup>5</sup>https://github.com/tanghaibao/Goatools

<sup>6</sup>http://pgrc.ipk-gatersleben.de/misa/

<sup>7</sup>http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

<sup>8</sup>https://cabig.nci.nih.gov/community/tools/geWorkbench

<sup>9</sup>https://pypi.python.org/pypi/hcluster/
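As a quick arithmetic check on the assembly summary reported in this section, the average contig length and the share of each length class can be recomputed from the stated totals. The Python snippet below uses only the numbers given in the text; the variable names and bin labels are illustrative.

```python
# Figures reported in the text for the assembled contigs.
total_length_bp = 164_535_293
n_contigs = 113_643
length_bins = {
    "100-600 bp": 24_750,
    "600-1000 bp": 23_423,
    "> 1 kb": 65_470,
}

# Average contig length is total assembled bases divided by contig count.
mean_length_bp = total_length_bp / n_contigs

# Share of contigs in each length class, as a percentage.
bin_percentages = {
    label: round(100.0 * count / n_contigs, 2)
    for label, count in length_bins.items()
}

print(f"mean contig length: {mean_length_bp:.0f} bp")
for label, pct in bin_percentages.items():
    print(f"{label}: {pct}%")
```

The three length classes sum to the 113,643 contigs, the recomputed mean rounds to the reported 1,448 bp, and the contigs longer than 1 kb make up about 57.61% of the assembly.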

### Annotation and Functional Characterization of the Broomcorn Millet Transcriptome

To assess and annotate the assembly, the 113,643 unigenes generated by Trinity were subjected to blastx similarity searches against the NCBI NR, Swiss-Prot, COG and KEGG protein databases with an e-value cutoff of 10<sup>−5</sup>. As a result, 51,629 unigenes had matches to proteins in the NR database, and 40,407 unigenes were similar to proteins in the Swiss-Prot database. In total, 60,352 unigenes were successfully annotated in NR, Swiss-Prot, GO, COG, and KEGG. The size distribution of the open reading frames is shown in Supplementary Figure S1. Furthermore, we examined the homologs of broomcorn millet unigenes in other monocot species, including Sorghum bicolor, Oryza sativa, Brachypodium distachyon, Hordeum vulgare, Aegilops tauschii, Triticum urartu, P. virgatum, and S. italica, by protein similarity search in the NCBI NR database. Results showed that 35,888 (43.68%) unigenes of broomcorn millet had homologous matches to S. bicolor transcripts, 28,537 (34.74%) had hits to Zea mays, 9,194 (11.19%) to O. sativa, 2,995 (3.65%) to B. distachyon, 1,247 (1.52%) to H. vulgare, 1,190 (1.45%) to A. tauschii, 742 (0.90%) to T. urartu, 73 (0.14%) to S. italica, 50 (0.09%) to P. virgatum, and 921 (1.78%) to other species (**Figure 2**). Unexpectedly, only a few broomcorn millet unigenes had significant homologous matches to S. italica and P. virgatum. Further analysis suggested that this might be because the best hits in S. bicolor, Zea mays, and O. sativa had better e-values and similarity distributions than those in P. hallii, P. virgatum, and S. italica. To evaluate sequence similarity with P. hallii, P. virgatum and S. italica more accurately, the predicted protein sequences of these three species were used to identify homologs among the broomcorn millet unigenes by blastx search with an e-value cutoff of 10<sup>−5</sup>. Results showed that 88,538 (77.91%) broomcorn millet unigenes had homologous hits to the P. hallii genome annotation, 90,229 (79.40%) had hits to P. virgatum and 84,024 (73.94%) had hits to S. italica. The e-value and similarity distributions of the hits against the NR database and the P. hallii, P. virgatum, and S. italica genome annotations are shown in Supplementary Figure S2.

To preliminarily understand the function of the broomcorn millet unigenes, those with homologs to previously annotated sequences in the NR database were further annotated with GO terms using the Blast2GO tool (Conesa et al., 2005). A total of 62,543 unigene sequences were annotated to three major GO classes. The largest class was cellular component, accounting for 42.47% of the total annotated unigenes, followed by biological process (38.93%) and molecular function (18.60%).

Within the cellular component class, 'cell,' 'cell part,' and 'organelle' were the most abundant among the 56 categories, together accounting for 72.14% of the genes belonging to this class. In the biological process class, 'metabolic process' and 'cellular process' were the largest and second largest categories, together accounting for 44.09% of the sequences assigned to this class. Among the 16 molecular function categories, the two most abundant were 'catalytic activity' and 'binding,' which accounted for 87.44% of the genes of this class (**Figure 3**).




Furthermore, the assembled unigenes were aligned to the KOG database to classify their putative functions. A total of 33,671 unigenes could be matched to genes in the KOG database, and these unigenes were classified into 25 functional classes. Among them, 'general function prediction only' (7,071; 21.00%) was the largest group, followed by 'signal transduction mechanisms' (4,986; 14.81%), 'post-translational modification, protein turnover and chaperones' (3,835; 11.39%), and 'transcription' (2,570; 7.63%), while only a few unigenes were assigned to 'extracellular structures' (70; 0.21%) and 'cell motility' (11; 0.03%; **Figure 4**).

Finally, KEGG analysis was used to assign unigenes to metabolic pathways. In total, 15,514 unigenes were mapped to 202 KEGG pathways. The most represented pathways were metabolic pathways (ko01100; 25.65%), followed by biosynthesis of secondary metabolites (ko01110; 10.71%), biosynthesis of amino acids (ko01230; 3.57%), pyrimidine metabolism (ko00240; 2.75%), purine metabolism (ko00230; 2.70%), peroxisome (ko04146; 2.55%), spliceosome (ko03040; 2.50%) and plant–pathogen interaction (ko05169; 2.37%; Supplementary Figure S3 and Supplementary Table S3).

### Identification of SSRs and SNPs Loci in Broomcorn Millet

Simple sequence repeats are among the most informative and versatile molecular markers and are widely used in studies of genetic diversity, genetic structure and genetic mapping (Varshney et al., 2005a,b; Arora et al., 2014; Ting et al., 2014). To provide useful information for marker development in broomcorn millet, we first investigated EST-SSR loci using the assembled transcripts. A total of 35,216 SSR loci were identified in 27,055 of the 113,643 unigenes. Among them, 2,536 sequences contained more than one SSR (**Table 1**). The number of repeats per SSR ranged from 4 to 24. Four repeats were the most common (49.99% of SSRs), followed by five (14.45%), 10 (13.38%) and six (8.79%), while SSRs with more than 20 repeats were rare (0.05%). Among the different types of SSRs, trinucleotides were the most abundant repeat unit (66.72% of SSRs), consistent with cabbage, red clover, sweet potato and cucumber (Guo et al., 2010; Wang et al., 2010; Izzah et al., 2014; Yates et al., 2014), followed by mononucleotides (18.26%), dinucleotides (10.33%), tetranucleotides (3.15%), pentanucleotides (1.24%) and hexanucleotides (0.30%; **Figure 5**). Among mononucleotide SSRs, A/T was the most abundant repeat type (95.88%) and C/G the least abundant (4.12%). Among dinucleotide repeats, the most abundant motif was AG/CT (62.56% of dinucleotide SSRs), similar to previous results in sweet potato, coffee, peanut, and Arachis (Poncet et al., 2006; Liang et al., 2009; Wang et al., 2010). In contrast, AC/GT were the most frequent dinucleotide repeats in soybean, maize, rice, wheat, and barley. Among trinucleotide SSRs, CCG/CGG were the most abundant motifs (46.67% of trinucleotide SSRs); these are also the most abundant trinucleotide repeats in soybean (Xin et al., 2012), maize and barley (Kantety et al., 2002). Previous studies have reported that the CCG/CGG motif is very rare in dicotyledonous plants but abundant in monocots (Wang et al., 2011), and the result for broomcorn millet is consistent with this conclusion (**Table 1**). Finally, the length of the SSR loci ranged from 10 to 25 bp; 12 bp was the most frequent length (51.39% of the SSR loci), followed by 15 bp (13.78%) and 10 bp (11.91%; Supplementary Figure S4).
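Motif classes such as AG/CT or CCG/CGG group together a repeat unit, its rotations, and the same repeat read on the opposite strand. A minimal Python sketch of that grouping is shown below; the function name `canonical_motif` and the choice of the alphabetically smallest form as the class label are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Smallest rotation of the motif or of its reverse complement."""
    motif = motif.upper()
    reverse_complement = motif.translate(COMPLEMENT)[::-1]
    candidates = []
    for unit in (motif, reverse_complement):
        # Every rotation describes the same tandem repeat.
        for shift in range(len(unit)):
            candidates.append(unit[shift:] + unit[:shift])
    return min(candidates)
```

With this convention, AG, GA, CT and TC all map to the label AG, and CGG maps to CCG, so counts can be pooled per motif class before percentages are calculated.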

FIGURE 7 | Scatter (A) and volcano (B) plots of the differentially expressed genes (DEGs) between Yumi No. 2 and Yumi No. 3. (A) The x-axis represents the log<sub>e</sub>RPKM value of Yumi No. 2, and the y-axis shows the log<sub>e</sub>RPKM value of Yumi No. 3. Black dots represent genes with no significant difference in expression; red and blue dots indicate genes significantly up-regulated or down-regulated, respectively, in Yumi No. 2 compared to Yumi No. 3 (FDR ≤ 0.001 and log<sub>2</sub>FC ratio ≥ 1). (B) The x-axis represents the log<sub>2</sub>FC values of genes between Yumi No. 2 and Yumi No. 3, and the y-axis shows the log<sub>10</sub>FDR values. Black dots represent genes with no significant difference in expression; red and blue dots indicate genes significantly up-regulated or down-regulated, respectively, in Yumi No. 2 compared to Yumi No. 3 (FDR ≤ 0.001 and log<sub>2</sub>FC ratio ≥ 1).

Single nucleotide polymorphisms have recently become popular markers for high-density genetic mapping, association mapping and population genetic structure studies (Liu et al., 2011). SNPs occurring in coding regions (cSNPs) may cause the loss or change of protein function and can therefore be used directly to assess the impact of mutations on economically important traits (Ellegren, 2008). To characterize the cSNPs in broomcorn millet, we identified SNPs in these two cultivars. A total of 406,062 high-quality SNPs were identified in Yumi No. 2 and 409,850 in Yumi No. 3. For Yumi No. 2, the putative SNPs included 270,068 transitions (A/G, C/T) and 135,994 transversions (G/T, C/G, A/T, A/C), while 273,593 transitions and 136,257 transversions were observed in Yumi No. 3 (Supplementary Figure S5). The average density was 2.46 and 2.49 SNPs per kb for Yumi No. 2 and Yumi No. 3, respectively. In Yumi No. 2, 109,078 (26.87%) SNPs were located in coding sequences (CDSs), 20,446 (5.03%) in untranslated regions (UTRs) and 276,539 (68.10%) in non-coding regions. In Yumi No. 3, the percentages of SNPs located in CDSs, UTRs and non-coding regions were 26.26, 4.93, and 68.81%, respectively (**Figure 6**).
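The split of SNPs into transitions and transversions follows from the chemistry of the two alleles: a change within purines (A/G) or within pyrimidines (C/T) is a transition, and any other change is a transversion. The Python sketch below illustrates the classification and recomputes the transition to transversion ratio from the counts reported above; the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def snp_class(ref, alt):
    """Classify a biallelic SNP as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(transitions, transversions):
    """Ratio of transition count to transversion count."""
    return transitions / transversions


# Counts reported in the text for the two cultivars.
yumi2_ratio = ts_tv_ratio(270_068, 135_994)
yumi3_ratio = ts_tv_ratio(273_593, 136_257)
```

Both cultivars give a ratio close to 2 (about 1.99 for Yumi No. 2 and about 2.01 for Yumi No. 3), and in each case the two classes sum to the reported SNP total.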

### Detection of Differentially Expressed Genes (DEGs)

To detect DEGs, the expression levels of the unigenes in the two cultivars were investigated. A pairwise comparison of RPKM and FC between Yumi No. 2 and Yumi No. 3 in the RNA-seq data sets was first conducted. The scatter and volcano plots show the expression differences of each gene between the two cultivars. Most genes were expressed at similar levels in the two cultivars (black dots), and only a small portion of genes were significantly up-regulated (red dots) or down-regulated (blue dots). A total of 292 DEGs were obtained, of which 128 genes (red dots) were up-regulated and 164 genes (blue dots) were down-regulated in Yumi No. 2 compared to Yumi No. 3, all with statistically significant differences in expression levels (P < 0.05; **Figure 7**). The list of DEGs is shown in Supplementary Table S4 and Supplementary Figure S6.

∗∗P-value < 0.01 (Student's t-test). (A) The expression level of Unigene34608 in Yumi No. 2. (B) The expression level of Unigene34608 in Yumi No. 3. (C) The expression level of Unigene41558 in Yumi No. 2. (D) The expression level of Unigene41558 in Yumi No. 3. (E) The expression level of Unigene33484 in Yumi No. 2. (F) The expression level of Unigene33484 in Yumi No. 3. (G) The expression level of Unigene35973 in Yumi No. 2. (H) The expression level of Unigene35973 in Yumi No. 3.

To determine the potential functions of these DEGs, GO and KEGG analyses were performed. In total, 88 GO terms were significantly enriched in the DEGs at a cutoff of P < 0.05. In the biological process category, the GO terms metabolic process (GO:0008152) and cellular process (GO:0009987) were enriched, while in the cellular component category, cytoplasmic part (GO:0044444) and intracellular organelle (GO:0043229) were enriched. Finally, heterocyclic compound binding (GO:1901363) and organic cyclic compound binding (GO:0097159) were enriched in the molecular function category (Supplementary Table S5). KEGG pathway enrichment analysis identified 12 enriched pathways among the DEGs (Supplementary Table S6). RNA degradation (ko03018), nucleotide excision repair (ko03420), ubiquinone and other terpenoid-quinone biosynthesis (ko00130) and phosphatidylinositol signaling system (ko04070) were the top four enriched pathways. The DEGs clustered into different functional categories, providing an important resource for discovering and identifying functional genes.
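Term enrichment of this kind is commonly assessed with a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The Python sketch below shows that calculation for a single GO term or pathway. It is an illustration under stated assumptions, not the Goatools or KOBAS code: the function name and argument names are hypothetical, and multiple-testing correction is left out.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k: DEGs annotated with the term
    n: total number of DEGs
    K: background genes annotated with the term
    N: total number of background genes
    """
    upper = min(n, K)
    # Probability of seeing k or more annotated genes among n draws
    # without replacement from a background with K annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)
```

In this setting `n` would be the 292 DEGs and `N` the number of annotated background unigenes; the resulting p-values for all tested terms would then be corrected for multiple testing before applying the significance cutoff.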

### Validation of the DEGs Using qRT-PCR

To validate the DEGs from the RNA-seq data, four DEGs that may be involved in abiotic stress response were selected for qRT-PCR analysis. Three DEGs, unigene34608, unigene35973 and unigene41558, were down-regulated in Yumi No. 2 compared to Yumi No. 3, whereas unigene33484 was up-regulated in Yumi No. 2 compared to Yumi No. 3. The expression patterns of three transcripts (unigene33484, unigene34608, and unigene35973) were consistent with those from RNA-seq (**Figure 8**). Although the expression level of the remaining transcript, unigene41558, obtained from qRT-PCR was significantly higher than that from the RNA-seq data, it showed a similar down-regulated trend (**Figure 8**). Consequently, the RNA-seq data provide useful information on gene expression, and the identified DEGs provide a valuable resource for gene discovery and functional analysis in broomcorn millet.

To identify some stress-related genes from Broomcorn millet, the expression profiles of these unigenes which may be involved in abiotic-stress response were further detected under different stresses (**Figure 9**). Unigene34608 is predicted to encode heat shock factor-binding protein 1 (HSBP1). HSBP1 can affects HSF1 DNA binding activity and is a negative regulator in response to heat stress (Satyal et al., 1998). In Yumi No. 2, the results showed that the transcript levels of Unigene34608 had small expression level changes under cold stress, while its expression level was reduced by 0.26-fold under heat stress for 3 h and 0.16-fold after 6 h under salt stress compared to control plants. In Yumi No. 3, the expression level of Unigene34608 was temporarily elevated under cold and heat stress, especially expression level was increased more than 400-fold compared to control plants under cold treatment for 6 h. And it quickly declined to low levels under salt stress for 24 h. Unigene41558 putative encodes a CBL-interacting protein kinase 9 (CIPK9), which interacts with calcium sensor and plays important roles in low-K<sup>+</sup> stress (Liu L.L. et al., 2012; Hung et al., 2014). High expression levels of Unigene41558 were observed under several stress treatments. For example, in Yumi No. 2, expression levels of Unigene41558 increased 103.02-fold compared to untreated controls under cold treatment for 6 h, 52.02-fold under heat treatment for 6 h and 11.39-fold under salt treatment for 12 h. Similar trends were also observed in Yumi No. 3 where highest expression level of unigene41558 were observed under cold stress for 6 h, heat stress for 24 h, and salt stress for 3 h. Unigene33484 is homologous to an acidic Y2Kn dehydrin DHN1 (Allagulova et al., 2003). A previous study revealed over-expression of DHN1 gene positively affected plant growth under abiotic stress (Beck et al., 2007). 
The expression levels of Unigene33484 increased slightly under cold and heat stress in Yumi No. 2, with values 1- to 4-fold higher than untreated controls, whereas under salt stress expression initially declined to 0.26-fold at 6 h, gradually increased to 1.80-fold at 12 h, and peaked at 24 h at levels 100-fold higher than untreated controls. In Yumi No. 3, the expression patterns of Unigene33484 under stress treatment differed from those in Yumi No. 2, increasing 153.22-, 42.27-, and 31.38-fold under cold treatment for 6 h, heat stress for 6 h, and salt stress for 24 h, respectively. These results indicate that Unigene33484 likely plays a role in osmoregulation in broomcorn millet. Unigene35973 is predicted to encode the zinc-finger protein ISAP1, which is involved in regulating cold, dehydration, and salt tolerance in transgenic tobacco (Mukhopadhyay et al., 2004). In this study, expression levels of Unigene35973 in Yumi No. 2 were 7.72-, 6.34-, and 3.75-fold higher under cold, heat, and salt stress, respectively, compared to untreated controls. In Yumi No. 3, expression levels of this unigene were over 100-fold higher than untreated controls under both cold and salt stress. The expression profiles of these four unigenes suggest that they may play an important role in the response to abiotic stress in broomcorn millet, providing a foundation for further study of the molecular mechanisms of stress tolerance in this important crop.

### CONCLUSION

This is the first large-scale de novo assembly and analysis of the transcriptome of broomcorn millet. A total of 113,643 unigenes were obtained, of which 62,543 were functionally annotated. Furthermore, more than 35,000 SSRs and 406,000 SNP loci were identified, which provide an important resource for marker development in this species. This study provides the first insight into the transcriptome of broomcorn millet; it not only provides an invaluable sequence resource and genomic information for molecular studies in this important crop, but also sheds light on the discovery of vital functional genes in the metabolic regulation network involved in adaptation to extreme climatic conditions, and facilitates further studies on the molecular mechanisms of stress tolerance in broomcorn millet.

### AUTHOR CONTRIBUTIONS

HY, WS, and XN conceived and designed the experiments. HY and LW performed the experiments. HY, WY, and HL analyzed the data. XD contributed reagents, materials, and analytical tools. HY, WS, and XN wrote the paper. All authors read and approved the final manuscript.

### FUNDING

This work was mainly funded by the National Natural Science Foundation of China (Grant No. 31401373) and partially supported by the Open Project Program of State Key Laboratory of Crop Stress Biology in Arid Areas, China (CSBAA2014002).

### ACKNOWLEDGMENT

We would like to thank Prof. Baili Feng for kindly providing two varieties of broomcorn millet.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016.01083

### REFERENCES

of tetraploid broomcorn millet, P. miliaceum. J. Exp. Bot. 65, 3165–3175. doi: 10.1093/jxb/eru161

sweetpotato (Ipomoea batatas). BMC Plant Biol. 11:139. doi: 10.1186/1471-2229-11-139


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer ES and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Yue, Wang, Liu, Yue, Du, Song and Nie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Water Deficit Affects Primary Metabolism Differently in Two Lolium multiflorum/Festuca arundinacea Introgression Forms with a Distinct Capacity for Photosynthesis and Membrane Regeneration

Dawid Perlikowski<sup>1</sup>, Mariusz Czyżniejewski<sup>1</sup>, Łukasz Marczak<sup>2</sup>, Adam Augustyniak<sup>1</sup> and Arkadiusz Kosmala<sup>1</sup>\*

<sup>1</sup> Institute of Plant Genetics, Polish Academy of Sciences, Poznań, Poland, <sup>2</sup> Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznań, Poland

#### Edited by:

Teresa Donze, University of Nebraska–Lincoln, USA

#### Reviewed by:

Prateek Tripathi, Scripps Research Institute, USA; Hao Peng, Washington State University, USA

> \*Correspondence: Arkadiusz Kosmala akos@igr.poznan.pl

#### Specialty section:

This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science

> Received: 22 April 2016 Accepted: 06 July 2016 Published: 25 July 2016

#### Citation:

Perlikowski D, Czyżniejewski M, Marczak Ł, Augustyniak A and Kosmala A (2016) Water Deficit Affects Primary Metabolism Differently in Two Lolium multiflorum/Festuca arundinacea Introgression Forms with a Distinct Capacity for Photosynthesis and Membrane Regeneration. Front. Plant Sci. 7:1063. doi: 10.3389/fpls.2016.01063

Understanding how plants respond to drought at different levels of cell metabolism is an important aspect of research on the mechanisms involved in stress tolerance. Furthermore, dissecting drought tolerance into its crucial components by the use of plant introgression forms facilitates a deeper analysis of this trait. The important components of plant drought tolerance are the capacity for photosynthesis under drought conditions and the ability to regenerate cellular membranes after stress cessation. Two closely related introgression forms of Lolium multiflorum/Festuca arundinacea, differing in the level of photosynthetic capacity during stress and in the ability to regenerate their cellular membranes after stress cessation, were used as forage grass models in primary metabolome profiling and in an evaluation of the accumulation level and activity of chloroplast fructose-1,6-bisphosphate aldolase during 11 days of water deficit, followed by 10 days of rehydration. The introgression form characterized by the ability to regenerate membranes after rehydration contained higher amounts of proline, melibiose, galactaric acid, myo-inositol and myo-inositol-1-phosphate, which are involved in osmoprotection and stress signaling under drought. Moreover, during the rehydration period, this form also maintained elevated accumulation levels of most of the primary metabolites analyzed here. The other introgression form, characterized by the higher capacity for photosynthesis, showed a higher accumulation level and activity of chloroplast aldolase under drought conditions, and higher accumulation levels of most photosynthetic products during control and drought periods. The potential impact of the observed metabolic alterations on cellular membrane recovery after stress cessation, and on photosynthetic capacity under drought conditions in grasses, is discussed.

Keywords: chloroplast aldolase, drought, forage grasses, membrane regeneration, photosynthetic activity, primary metabolites

**Abbreviations:** F-1,6-2P, Fructose-1,6-diphosphate; GC, gas chromatography; G-6-P, Glucose-6-phosphate; MS, mass spectrometry; MSTFA, N-Methyl-N-(trimethylsilyl) trifluoroacetamide; PCA, principal component analysis; pFBA, plastid Fructose-1,6-bisphosphate aldolase; RACE, rapid amplification of cDNA ends; ROS, reactive oxygen species; UTR, untranslated region.

### INTRODUCTION

The sessile lifestyle of plants exposes them to many unfavorable environmental conditions that limit their growth and development. Abiotic stresses, such as drought (water deficit), salinity, flooding, and low temperature, strongly affect plants during their life cycle. Among those factors, water deficit is one of the most important: it disturbs plant metabolism, inhibits growth and reduces productivity worldwide (Bray et al., 2000; Zhang et al., 2005). During their evolution, plants developed many strategies for surviving water deficit, namely drought avoidance, drought tolerance, drought escape, and recovery after drought cessation (Fang and Xiong, 2015). Numerous anatomical, physiological and molecular components of plant performance, including root parameters, leaf features, the osmoprotection system, the ROS scavenging system, membrane stability and photosynthetic capacity, influence a plant's response to drought (Fang and Xiong, 2015).

Plant productivity depends mainly on photosynthesis, which is one of the first and most sensitive physiological processes affected by drought (Lawlor, 2002; Munns, 2002). Several studies have been conducted to improve our knowledge of drought-induced inhibition of photosynthesis (Cornic, 2000; Lawlor, 2002; Chaves et al., 2003; Flexas et al., 2004). These studies have shown that the inhibitory effects of drought on photosynthesis could be associated with low CO<sub>2</sub> availability, due to limitations of its diffusion through the stomata (stomatal limitations) and/or due to non-stomatal limitations, including both diffusive (reduced mesophyll conductance) and metabolic (photochemical and enzymatic limitations) processes (Lawlor, 2002; Lawlor and Cornic, 2002; Flexas et al., 2004). Nevertheless, there is an ongoing debate on whether drought stress influences photosynthesis more through stomatal or non-stomatal alterations (Flexas and Medrano, 2002). Stomatal closure has been identified as an early response to decreasing soil water potential, or to a decline in leaf turgor due to a fall in relative water content, and as an efficient way to reduce water loss under drying field conditions. Simultaneously, decreasing stomatal conductance limits carbon uptake into leaves, which affects photosynthesis during mild to moderate drought (Cornic, 2000; Lawlor, 2002). However, during prolonged and more severe periods of drought, metabolic limitations, including reduced activities of crucial photosynthetic enzymes, may also significantly reduce CO<sub>2</sub> assimilation (Flexas et al., 2004; Signarbieux and Feller, 2011). It was suggested that these limitations are mainly associated with enzymes of the Calvin cycle (Chaves et al., 2003). 
This relationship was demonstrated for the inhibition of ribulose-1,5-bisphosphate carboxylase (Rubisco; Chaves et al., 2003), decreases in total Rubisco activity and protein content (Flexas et al., 2006), and an inhibition of Rubisco activase (Lawlor, 2002). Furthermore, the rate of photosynthesis could be limited not only by Rubisco carboxylation but also by the ribulose-1,5-bisphosphate regeneration capacity, which could be reduced mainly by a decreased accumulation of plastid fructose-1,6-bisphosphatase (Kossmann et al., 1995), an inhibition of sedoheptulose-1,7-bisphosphatase (Harrison et al., 1998) and a reduced accumulation of chloroplast fructose-1,6-bisphosphate aldolase (pFBA; Haake et al., 1998, 1999; Uematsu et al., 2012). As demonstrated earlier, alterations in photosynthetic carbon metabolism in response to drought could also be strongly associated with the accumulation levels of several classes of primary metabolites, mostly plant hormones, osmoprotectants and ROS-scavenging molecules, which are crucial for developing drought tolerance (Chandler and Robertson, 1994). Osmoprotectants are important soluble metabolites belonging to the amino acids and carbohydrates; they share common characteristics, such as a small molecular weight and non-toxic character, and can therefore be accumulated in large quantities without harming cell functioning (Rontein et al., 2002). Under osmotic stress, plants produce organic osmolytes from the group of soluble sugars, such as fructose or sucrose; organic alcohols, such as myo-inositol; complex sugars, such as trehalose or fructans; amino acids, such as proline; and modified amino acids, such as glycine betaine (Kido et al., 2013). These compounds could also function as chaperone-like molecules, stabilizing membranes and maintaining the activity and stability of the enzymes crucial for the proper functioning of cell metabolism (Xoconostle-Cazares et al., 2010).

Festuca arundinacea (tall fescue) is one of the most drought tolerant grass species in the Lolium-Festuca complex. Lolium multiflorum (Italian ryegrass) has high yielding capacity but significantly lower levels of tolerance to environmental stresses, such as drought, compared to F. arundinacea. L. multiflorum and F. arundinacea hybridization enables the assembly of complementary characters of both species within a single genotype (Kosmala et al., 2012; Perlikowski et al., 2014). The L. multiflorum/F. arundinacea introgression forms were shown in our earlier work to be excellent plant materials for dissecting drought tolerance of F. arundinacea into several crucial components (Perlikowski et al., 2014). Two introgression forms (4/10 and 7/6) were selected for further research performed at different levels to go deeper into the molecular mechanisms of drought tolerance existing in the group of Lolium-Festuca forage grasses. The introgression form 4/10 with better yield performance under simulated drought conditions in the field (14 weeks), and a faster re-growth after stress cessation, was also characterized by stronger membrane regeneration during recovery after 11 days of drought in simulated pot conditions. It was manifested by the electrolyte leakage parameter, describing the level of membrane stability. This parameter increased significantly on the 11th day of drought application in the two introgression forms, although after re-watering it returned to the values calculated for the conditions before drought initiation only in the form 4/10. On the other hand, the form 7/6 was characterized by a lower yield potential after 14 weeks of drought in the field, and a lower ability of re-growth after rehydration. 
However, this form was also shown to possess a greater photosynthetic capacity during 11 days of drought treatment in pot conditions, compared to the form 4/10, as manifested by the CO<sub>2</sub> assimilation level [µmol(CO<sub>2</sub>) m<sup>−2</sup> s<sup>−1</sup>] (values marked below with the same letter did not differ statistically at P = 0.05, according to the Tukey HSD test). This level was significantly higher in the form 7/6 on the 11th day of drought treatment (7.42 ± 0.16b), compared to the genotype 4/10 (6.58 ± 0.17c; Perlikowski et al., 2014). Our earlier work (Perlikowski et al., 2014) demonstrated that the more efficient photosynthesis during drought in the form 7/6 was, with a high probability, not associated with photoactivity performance, since no differences in chlorophyll fluorescence parameters were observed between the analyzed introgression forms under water deficit. Furthermore, it was also shown in our previous work that under drought conditions the CO<sub>2</sub> assimilation rate was not limited by stomatal aperture: both introgression forms significantly reduced stomatal aperture under drought, compared to the control conditions, but at a closely similar level of stomatal conductance the form 7/6 showed a significantly higher CO<sub>2</sub> assimilation rate than the form 4/10. We suggested that this greater capacity for photosynthesis could be due to a higher efficiency of the Calvin cycle in that introgression form. After pre-screening of protein profiles based on 2-D maps, it was found that the accumulation level of pFBA (EC 4.1.2.13) was higher in the form 7/6 (Perlikowski et al., 2014). These two closely related L. multiflorum/F. arundinacea introgression forms, 4/10 and 7/6, were used in the research presented herein.

In this study, we hypothesize that (i) the introgression form 7/6, with a higher CO<sub>2</sub> assimilation level during drought, will be characterized by higher pFBA accumulation and activity levels, which could be a crucial component of the non-stomatal machinery involved in the regulation of photosynthetic efficiency in the Lolium-Festuca forage grasses; (ii) this phenomenon will also be accompanied by a higher accumulation level of primary photosynthetic metabolites in this form. Moreover, we hypothesize that (iii) the stronger regeneration capacity of the introgression form 4/10, including membrane regeneration after stress cessation, could be associated with higher accumulation levels of key metabolites, including osmoprotectants responsible for the protection of crucial proteins and other important cell components. Thus, the research presented herein, performed on the two introgression forms 4/10 and 7/6, involved: (i) Western blot experiments to confirm the pFBA accumulation level during water deficit and recovery periods, accompanied by pFBA activity measurements, and (ii) primary metabolite profiling under stress conditions and after stress cessation, using GC-MS.

### MATERIALS AND METHODS

### Plant Materials

Plant materials used in the present research involved two L. multiflorum/F. arundinacea introgression forms (genotypes 4/10 and 7/6) obtained after four rounds of backcrossing of the L. multiflorum (4x) × F. arundinacea (6x) hybrid to L. multiflorum (4x). These plants were selected earlier from a larger population under field conditions with respect to their drought tolerance, as described by Perlikowski et al. (2014). After this selection, the two forms, each in four biological replicates, were transferred to pots (1.75 dm<sup>3</sup>) containing a sand:peat (1:3) mixture. The experiment of 11 days of water deficit, followed by 10 days of re-watering, was performed in a growth chamber at a temperature of 22/17 °C (16 h day/8 h night, light of 400 µmol(quanta) m<sup>−2</sup> s<sup>−1</sup>, HPS "Agro" lamps, Philips, Brussels, Belgium), 30% relative air humidity and watering completed. The level of soil water content decreased from 63% of field water capacity observed in control conditions down to approximately 3% on the 11th day of stress duration. After 10 days of re-watering this level returned to the value observed before drought treatment (Perlikowski et al., 2014). The physiological measurements summarized in the introduction were performed during this experiment. A leaf tissue sample (100 mg) was collected from each replicate before drought treatment (control), at three time-points of drought (after 3, 6, and 11 days), and after 10 days of subsequent re-watering, and frozen in liquid nitrogen.

### Identification of Aldolase cDNA Sequences

Full-length cDNA sequences encoding chloroplast fructose-1,6-bisphosphate aldolase (pFBA) were obtained by RACE using a commercial kit (5′/3′ RACE Kit, 2nd Generation, Roche®). Initial primers (forward primer – TTCGAGGAGACCCTCTACCA; reverse primer – GGCTACAGTGCCCTCTCAAG) were designed on the basis of the pFBA mRNA sequence of Brachypodium distachyon available in the NCBI database [gi|357157398|ref|XM\_003577737.1|]. Special primers for the RACE reaction were designed on the basis of the sequenced initial fragment of pFBA cDNA:

- SP-F1 – GACTGTAGATGGCAAGAAGATTGTTGAC
- SP-F2 – CCAATTGTTGAGCCTGAGATCATG
- SP-R1 – CTACTAGCACTCTCTCCATAGGTAGATA
- SP-R2 – ATCAGTAGCTGTAGTTCTTGACGAACAT
- SP-R3 – TTCTCTGGAGGAGCTTGAGAGTGTA
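The primers listed above can be sanity-checked before ordering or re-use. The following is a minimal sketch, not part of the published protocol: it reports the length and GC content of each primer and provides a reverse-complement helper. The dictionary keys `initial-F` and `initial-R` are labels introduced here for illustration, and the two initial primers are written as continuous strings on the assumption that the spaces in the source are line-break artifacts.

```python
# Primer sequences (5'->3') as given in the text. The names "initial-F" and
# "initial-R" are illustrative labels, not names used by the authors.
PRIMERS = {
    "initial-F": "TTCGAGGAGACCCTCTACCA",
    "initial-R": "GGCTACAGTGCCCTCTCAAG",
    "SP-F1": "GACTGTAGATGGCAAGAAGATTGTTGAC",
    "SP-F2": "CCAATTGTTGAGCCTGAGATCATG",
    "SP-R1": "CTACTAGCACTCTCTCCATAGGTAGATA",
    "SP-R2": "ATCAGTAGCTGTAGTTCTTGACGAACAT",
    "SP-R3": "TTCTCTGGAGGAGCTTGAGAGTGTA",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_dna(seq: str) -> bool:
    """Return True if the sequence contains only A, C, G and T."""
    return set(seq.upper()) <= set("ACGT")


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 to 1.0)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Reverse complement, e.g. to map a reverse primer onto the sense strand."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        print(f"{name}: {len(seq)} nt, GC {gc_fraction(seq):.0%}")
```

Running the script prints one summary line per primer; the reverse complement of a reverse primer is the string to search for in the sense-strand cDNA sequence.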

The PCR and RACE products were purified using the QIAEX II Gel Extraction Kit (Qiagen) and ligated into the pGEM-T Easy vector (Promega). Vectors containing the ligated product were transformed into Escherichia coli strain XL1 Blue, and multiplied plasmids from clones selected with X-Gal and IPTG were extracted using the QIAprep Spin Miniprep Kit (Qiagen). The obtained plasmids were sequenced (Molecular Biology Techniques Laboratory, Faculty of Biology, Adam Mickiewicz University, Poznań) using SP6 and T7 primers. The obtained cDNA sequences were aligned with BioEdit software (ver 7.2.5).

### Analysis of Aldolase Accumulation Level

To estimate the pFBA protein accumulation level, a Western blot analysis was performed with an antibody directed against pFBA. The antibody was produced by Agrisera®<sup>1</sup> using a rabbit host immunized with a highly specific pFBA 15-amino-acid peptide (TFEVAQKVWAETFYY). The peptide was selected on the basis of a comparison between the pFBA sequence of the two analyzed introgression forms and the database sequence of a cytosolic enzyme of B. distachyon (XM\_003564823.1), to avoid a cross-reaction of the anti-pFBA antibody with cytosolic FBA. The detailed protocols for protein extraction and Western blotting were as described by Pawłowicz et al. (2012). Briefly, 10 µg of chloroplast proteins from each time-point and introgression form, in three biological replicates, and standard samples were separated by 12% SDS-polyacrylamide gel electrophoresis and electroblotted onto nitrocellulose membranes (Bio-Rad). Immunodetection was performed with a rabbit polyclonal antibody (diluted 1:4000; Agrisera). The antigen–antibody complexes were detected using a chemiluminescent detection system with a secondary anti-rabbit IgG–horseradish peroxidase conjugate (diluted 1:20 000; Sigma) and a chemiluminescent substrate (Westar Supernova – Cyanagen), and the product intensities were estimated using ImageJ software.

<sup>1</sup>www.agrisera.com

### Analysis of Aldolase Activity

The pFBA activity was measured according to a modified Sibley-Lehninger method (Sibley and Lehninger, 1949; Willard and Gibbs, 1968). A protein extract from chloroplasts was prepared according to a modified method used by Kosmala et al. (2012). Briefly, 1 g of frozen leaf material, in three replicates for each sample, was ground in liquid nitrogen, suspended in 4 ml of chloroplast isolation buffer (Sigma–Aldrich), shaken, filtered through a mesh 100 nylon filter (Sigma–Aldrich) and centrifuged for 3 min at 200 g at 5 °C. The collected supernatant was subsequently centrifuged for 15 min at 900 g at 5 °C, and the washed chloroplast pellet was suspended in 2 ml of 0.1 M phosphate buffer (0.1 M Na<sub>2</sub>HPO<sub>4</sub>) with 3% Triton X-100 and shaken for 5 min at 1000 rpm. The collected supernatant was used to determine the pFBA activity. To 2 ml tubes, 50 µl of 0.06 M fructose-1,6-bisphosphate and 140 µl of incubation buffer (0.05 M 2,4,6-trimethylpyridine, 0.08 M hydrazine sulfate, 0.3 mM sodium iodoacetate), pH 7.4, were added and pre-incubated in a water bath for 3 min at 30 °C. Then, 100 µl of chloroplast extract was added and incubated at 30 °C for 2 h. After the incubation, the tubes were chilled and 300 µl of 10% trichloroacetic acid was added to stop the reaction. For each sample, one tube was treated as a reagent blank and was filled with 300 µl of 10% trichloroacetic acid before proceeding. After centrifugation at 10000 g, 100 µl of the collected supernatant was pre-incubated with 100 µl of 0.75 M NaOH at room temperature for 10 min and then incubated in a 30 °C water bath for 10 min with the addition of 100 µl of 0.1% 2,4-dinitrophenylhydrazine. After that step, samples were mixed with 700 µl of 0.75 M NaOH and the absorbance was measured with reference to a reagent blank using a spectrophotometer at a wavelength of 540 nm. 
A standard curve was prepared as follows: 2 ml tubes, in two replicates, were filled in order with 25, 50, 75, and 100 µl of 0.01 mM D-glyceraldehyde and filled up with water to a final volume of 100 µl. In the next step, 100 µl of 2,4-dinitrophenylhydrazine solution was added and the samples were incubated in a 30 °C water bath for 10 min. After incubation, 800 µl of NaOH was added, and after 3 min of incubation the absorbance was measured at 540 nm with reference to a blank sample (100 µl of water plus reagents). The amount of trioses produced in the pFBA assay was read from the standard curve and presented, after calculation, as µg of glyceraldehyde produced by 1 g of plant sample during 1 h.
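The read-out from the standard curve can be sketched in a few lines. The example below is illustrative only: the absorbance readings are made-up numbers, the line is fitted by ordinary least squares (the text does not state how the curve was fitted), and the dilution bookkeeping in `activity_ug_per_g_per_h` is an assumption reconstructed from the volumes given above (590 µl stopped reaction, 100 µl aliquot, 100 µl of a 2 ml extract from 1 g of tissue, 2 h incubation), not a calculation reported by the authors.

```python
GLYCERALDEHYDE_MW = 90.08  # g/mol for C3H6O3


def standard_amounts_ng(volumes_ul=(25, 50, 75, 100), conc_mm=0.01):
    """Glyceraldehyde per standard tube in ng (ul x mmol/l = nmol; nmol x g/mol = ng)."""
    return [v * conc_mm * GLYCERALDEHYDE_MW for v in volumes_ul]


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def amount_from_absorbance(absorbance, slope, intercept):
    """Invert the standard curve: amount (ng) in the 100 ul measured aliquot."""
    return (absorbance - intercept) / slope


def activity_ug_per_g_per_h(amount_ng, reaction_ul=590.0, aliquot_ul=100.0,
                            extract_total_ul=2000.0, extract_used_ul=100.0,
                            tissue_g=1.0, hours=2.0):
    """Scale the aliquot amount to ug per g tissue per hour.

    All default volumes are assumptions derived from the protocol text.
    """
    total_ng = amount_ng * (reaction_ul / aliquot_ul) * (extract_total_ul / extract_used_ul)
    return total_ng / 1000.0 / tissue_g / hours


if __name__ == "__main__":
    amounts = standard_amounts_ng()
    readings = [0.11, 0.21, 0.32, 0.41]  # hypothetical A540 values
    slope, intercept = fit_line(amounts, readings)
    sample_ng = amount_from_absorbance(0.25, slope, intercept)
    print(f"{activity_ug_per_g_per_h(sample_ng):.2f} ug glyceraldehyde per g per h")
```

Because the standards and the assay aliquots both end up in the same final volume (1000 µl), the amount read from the curve refers directly to the 100 µl aliquot; only the upstream dilutions need to be scaled back.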

### Metabolite Profiling

Analysis of primary metabolite accumulation was performed, with slight modifications, according to the protocol described earlier by Wojakowska et al. (2015). This protocol is presented briefly in the following sub-sections.

### Materials and Reagents

Solvents used for extraction and GC-MS analyses were MS-grade methanol, methylene chloride, isopropanol, ribitol; derivatization reagents for GC-MS analyses were MSTFA, O-methylhydroxylamine hydrochloride, pyridine and alkanes (C10–C36) used as retention index standards, purchased from Sigma–Aldrich (Poznań, Poland). Deionized water was purified with a Milli-Q Direct Q3 system (Millipore, Bedford, MA, USA). Homogenization was performed with the MM400 homogenizer (Retsch GmbH, Haan, Germany). Centrifugation was done with the EBA21 centrifuge (Hettich, Tuttlingen, Germany).

### Extraction of Metabolites

A 100 mg sample of dried, powdered leaf material was transferred to a 2 ml plastic tube with two stainless steel balls, and 1.5 ml of 80% methanol in deionized water was added. Ribitol (25 µl of a 1 mg/ml solution) was added to each sample as the internal standard. The samples were homogenized for 10 min at 1800 rpm, sonicated for 15 min and centrifuged for 15 min at 12000 rpm, followed by filtering through PTFE syringe filters 0.45 µm GHP ACRODISC 1 (Waters, Milford, CT, USA). A volume of 300 µl of each sample was transferred to a new tube and evaporated in a SpeedVac concentrator. The dried extract was then derivatized with 50 µl of methoxyamine hydrochloride in pyridine (20 mg/ml) at 37 °C for 90 min with agitation. The second step of derivatization was performed by adding 80 µl of MSTFA and incubating at 37 °C for 30 min with agitation. Samples were subjected to GC/MS analysis directly after derivatization. Each sample was prepared in four replicates. The compounds were considered "identified" when they met the identification criteria established by the GC software used (LECO ChromaTOF), namely: an identity score higher than 700, a Mass Threshold higher than 10 and a matched retention index.

### GC/MS Analysis

Separation was performed using an Agilent 7890A gas chromatograph (Agilent Technologies) connected to a Pegasus 4D GCxGC-TOFMS mass spectrometer (Leco). A DB-5 bonded-phase fused-silica capillary column of 30 m length, 0.25 mm inner diameter and 0.25 µm film thickness (J&W Scientific Co., USA) was used. The GC oven temperature program was as follows: 2 min at 70 °C, raised by 8 °C/min to 300 °C and held for 16 min at 300 °C. The total time of GC analysis was 46.75 min.

Helium was used as the carrier gas at a flow rate of 1 ml/min. A volume of 1 µl of each derivatized sample was injected in splitless mode. The initial PTV (Programmed Temperature Vaporization) injector temperature was 20 °C for 0.1 min and then raised by 600 °C/min to 350 °C. The septum purge flow rate was 3 ml/min and the purge was turned on after 60 s. The transfer line and ion source temperatures were set to 250 °C. In-source fragmentation was performed with 70 eV energy. Mass spectra were recorded in the mass range 35–650 m/z.
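The stated total run time follows directly from the oven program, and the same arithmetic gives the duration of the PTV injector ramp. A small sketch, with function and parameter names chosen here for illustration:

```python
def ramp_minutes(start_c: float, end_c: float, rate_c_per_min: float) -> float:
    """Time needed to heat linearly from start_c to end_c."""
    return (end_c - start_c) / rate_c_per_min


def oven_program_minutes(initial_hold: float, start_c: float, rate_c_per_min: float,
                         end_c: float, final_hold: float) -> float:
    """Total time of a single-ramp program: hold, linear ramp, hold."""
    return initial_hold + ramp_minutes(start_c, end_c, rate_c_per_min) + final_hold


# Oven: 2 min at 70 C, 8 C/min to 300 C, 16 min hold.
total_run = oven_program_minutes(2.0, 70.0, 8.0, 300.0, 16.0)

# PTV injector: 20 C raised by 600 C/min to 350 C.
ptv_ramp = ramp_minutes(20.0, 350.0, 600.0)

if __name__ == "__main__":
    print(f"GC oven program: {total_run:.2f} min")
    print(f"PTV ramp: {ptv_ramp:.2f} min")
```

The ramp from 70 to 300 °C takes 28.75 min, which together with the two holds reproduces the 46.75 min given in the text.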

### Analysis of Mass Spectra

Data acquisition, automatic peak detection, mass spectrum deconvolution, retention index calculation and library search were done with Leco ChromaTOF-GC software (v4.51.6.0). To eliminate retention time shift and to determine the retention indexes (RI) for each compound, the alkane series mixture (C10 to C36) was injected into the GC/MS system. The metabolites were automatically identified by library search (Replib, Mainlib, Fiehn library) with a similarity index above 700 and a retention index within ±10. All known artifact peaks, including alkanes, plasticizers, column bleed, MSTFA artifacts and reagent peaks, were excluded from the final results. To obtain accurate peak areas for the deconvoluted components, unique quantification masses for each component were specified and the samples were reprocessed. The obtained metabolite data were normalized relative to the quant mass peak of the internal standard (ribitol, m/z 217) in each sample before statistical analysis.
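To make the two identification criteria concrete, the sketch below computes a linear (temperature-programmed) retention index from an alkane ladder and applies the similarity and RI-window thresholds quoted above. It uses the commonly cited van den Dool and Kratz interpolation, RI = 100 × [n + (N − n)(t − t_n)/(t_N − t_n)]; whether ChromaTOF uses exactly this formula is an assumption, and the retention times in the example are hypothetical.

```python
from bisect import bisect_right


def linear_retention_index(rt: float, alkane_rts: dict) -> float:
    """Interpolate a retention index between the two bracketing n-alkanes.

    alkane_rts maps carbon number -> retention time (min); times are assumed
    to increase with carbon number.
    """
    carbons = sorted(alkane_rts)
    times = [alkane_rts[c] for c in carbons]
    if not times[0] <= rt <= times[-1]:
        raise ValueError("retention time lies outside the alkane ladder")
    # Index of the alkane eluting at or just before rt (clamped to the last pair).
    i = min(bisect_right(times, rt) - 1, len(times) - 2)
    n, big_n = carbons[i], carbons[i + 1]
    t_n, t_big = times[i], times[i + 1]
    return 100.0 * (n + (big_n - n) * (rt - t_n) / (t_big - t_n))


def accept_match(similarity: float, ri_measured: float, ri_library: float,
                 min_similarity: float = 700.0, ri_window: float = 10.0) -> bool:
    """Thresholds from the text: similarity above 700 and RI within +/-10."""
    return similarity > min_similarity and abs(ri_measured - ri_library) <= ri_window


if __name__ == "__main__":
    ladder = {10: 5.0, 11: 7.0, 12: 9.0}  # hypothetical alkane retention times
    ri = linear_retention_index(6.0, ladder)
    print(ri, accept_match(similarity=850, ri_measured=ri, ri_library=1046.0))
```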

### Statistical Analysis

The normalized, log<sub>2</sub>-transformed mass spectral intensity data were subjected to statistical analysis. Two-way analysis of variance (ANOVA) with genotype and treatment as classification factors, Fisher's least significant difference (LSD) test and PCA were performed using STATISTICA 10 software (StatSoft, Tulsa, OK, USA). The PCA was carried out by eigenvalue decomposition of the data correlation matrix. Significant effects of genotype, time and genotype × time interaction were selected using a family-wise error rate of less than 1%. Fisher's LSD at the 1% level was used to compare samples. Heatmaps of the differences between the means of the time-points and the control were prepared.
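The preprocessing that feeds the statistics and the heatmaps can be illustrated as follows. This is a minimal sketch with hypothetical peak areas: each metabolite peak is divided by the ribitol peak of the same sample, log<sub>2</sub>-transformed, and the heatmap value is taken as the difference between the mean of a time-point and the mean of the control. The function names are introduced here for illustration and do not correspond to any software named in the text.

```python
import math
from statistics import mean


def normalised_log2(peak_area: float, ribitol_area: float) -> float:
    """Log2 of the metabolite peak area relative to the internal standard."""
    return math.log2(peak_area / ribitol_area)


def log2_difference_from_control(time_point: list, control: list) -> float:
    """Difference of replicate means on the log2 scale (one heatmap cell)."""
    return mean(time_point) - mean(control)


if __name__ == "__main__":
    # Hypothetical (peak area, ribitol area) pairs for four replicates.
    control = [normalised_log2(a, r) for a, r in
               [(210, 100), (190, 95), (205, 102), (198, 99)]]
    drought = [normalised_log2(a, r) for a, r in
               [(820, 101), (790, 98), (805, 100), (760, 97)]]
    diff = log2_difference_from_control(drought, control)
    print(f"log2 difference: {diff:.2f} (about {2 ** diff:.1f}-fold)")
```

Working on the log<sub>2</sub> scale makes increases and decreases symmetric around zero, so a value of +1 corresponds to a doubling and −1 to a halving relative to the control.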

### RESULTS

### The Accumulation Level and Activity of Chloroplast Fructose Bisphosphate Aldolase during Drought and Rehydration

The RACE analysis performed on total RNA extracted from the two introgression forms allowed the identification of two pFBA mRNA sequences in each form (**Supplementary Figure S1**). The overall length of the identified sequences, including the 3′ and 5′ UTRs, varied from 1423 to 1430 nucleotides, whereas the length of the coding region was the same for each identified sequence and covered 1164 nucleotides. This coding region showed 99% similarity to the B. distachyon sequence used earlier to design the primers for cloning pFBA from the analyzed introgression forms (**Supplementary Figure S1**). The identified sequences were characterized by several single nucleotide polymorphisms within the coding region, and modifications within the 3′ and 5′ UTRs (**Supplementary Figure S1**). The predicted protein sequence for the analyzed mRNAs covered 388 amino acids, including 37 amino acids of the chloroplast transit sequence (**Supplementary Figure S2**). The predicted molecular mass of the protein was 42.05 and 38.48 kDa with and without the transit sequence, respectively. Its predicted isoelectric point was 5.51, and three amino acid modifications were found between the different protein sequences predicted here (**Supplementary Figure S2**).

The proteomic assays revealed a significantly higher accumulation level (**Figure 1A**) and total activity (**Figure 1B**) of pFBA in the 7/6 introgression form at all the time-points of drought treatment, compared to the 4/10 form. However, in both forms a significant decrease of pFBA activity after 6 and 11 days of drought was observed, compared to the control conditions and initial days of stress duration. On the other hand, after 10 days of rehydration an increase of enzyme activity was revealed but without significant differences between the two forms (**Figure 1B**).

### The Accumulation Level of Primary Metabolites during Drought and Rehydration

### Metabolite Accumulation Dynamics

A total of 937 different metabolite compounds were identified, and 66 were selected for further analysis. These 66 were present in all the biological replicates, had a similarity index value above 700 and/or were manually identified by comparing their retention times with the Golm metabolite VAR5 library data<sup>2</sup>, and had a p-value lower than 0.01 after one-way ANOVA pre-selection (**Supplementary Table S1**). The analyzed metabolites were further divided into nine classes: amino acids, amines, sugars, sugar acids, sugar alcohols, phosphoryl compounds, organic acids, alcohols, and fatty acids (**Figure 2**). Further statistical analysis revealed that 50 compounds showed significant genotype-dependent differences, 63 showed time-point-dependent differences, and 60 showed a significant interaction between genotype and time-point (**Supplementary Table S1**). The introgression form 7/6 was characterized by a significantly higher accumulation level of primary metabolites, including 47 in the control conditions and 32 after 11 days of drought, compared to the 4/10 form. The 4/10 form showed a higher accumulation level of six metabolites in the control conditions and 10 after 11 days of drought, compared to the form 7/6. Only after rehydration did the form 4/10 reveal 28 metabolites with a significantly higher accumulation level than the form 7/6. At this time-point the form 7/6 showed only 14 metabolites with higher accumulation levels than the form 4/10 (**Figure 3A**).

<sup>2</sup>http://gmd.mpimp-golm.mpg.de
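The selection criteria described above (presence in all biological replicates, a similarity index above 700 and/or manual identification, and an ANOVA p-value below 0.01) can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Candidate` record, its field names, and the example values are hypothetical and are not taken from the authors' pipeline; the replicate count of three follows the caption of Supplementary Table S1.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; the constant names are hypothetical.
N_REPLICATES = 3          # three biological replicates (Table S1 caption)
SIMILARITY_CUTOFF = 700   # library similarity index must exceed this value
P_CUTOFF = 0.01           # one-way ANOVA pre-selection threshold


@dataclass
class Candidate:
    """Hypothetical record describing one detected compound."""
    name: str
    replicates_detected: int   # replicates in which the compound was found
    similarity_index: float    # library match score
    manually_identified: bool  # confirmed by retention-time comparison
    anova_p: float             # p-value from the one-way ANOVA


def passes_selection(c: Candidate) -> bool:
    """Return True if the compound meets all three criteria from the text."""
    in_all_replicates = c.replicates_detected == N_REPLICATES
    identified = c.similarity_index > SIMILARITY_CUTOFF or c.manually_identified
    significant = c.anova_p < P_CUTOFF
    return in_all_replicates and identified and significant


# Illustrative values only; these are not measured data.
candidates = [
    Candidate("compound_A", 3, 850.0, False, 0.001),
    Candidate("compound_B", 3, 650.0, True, 0.005),
    Candidate("compound_C", 2, 900.0, False, 0.001),
]
selected = [c.name for c in candidates if passes_selection(c)]
```

In this sketch `compound_C` is rejected because it is missing from one replicate, while `compound_B` is retained through manual identification despite its low similarity index.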

The analysis of changes in accumulation levels between the time-points in the two introgression forms revealed that the patterns of metabolite dynamics during drought differed between the plants. Between the control conditions and the drought time-points, more metabolites were significantly down-regulated in the 7/6 form, especially at the beginning of drought treatment, between the control and the 3rd day of stress, when 36 metabolites decreased their accumulation levels. The abundance of these metabolites started to be up-regulated between the 3rd and 11th day of drought, with 35 metabolites increasing their abundance (**Figures 3B** and **4**). A different pattern of metabolite dynamics was observed in the 4/10 form. More metabolites were up-regulated than down-regulated between all the time-points, but this phenomenon was most visible between the control and the 11th day of drought, with 34 metabolites increasing their abundance (**Figures 3C** and **4**). In the same period, only 21 metabolites increased their accumulation level in the form 7/6 (**Figures 3B** and **4**). Seventeen of these metabolites overlapped in both introgression forms (**Figure 4**). After rehydration, 44 metabolites presented a higher accumulation level, compared to the control conditions, in the form 4/10 (**Figure 3C**). Twenty-six of them belonged to the group of metabolites that significantly increased their abundance after 11 days of drought and maintained this elevated accumulation level, or even increased it, after rehydration. This trend was not observed in the 7/6 form, in which only eight metabolites maintained an elevated abundance after rehydration (**Figure 4**). Moreover, 18 more metabolites, mainly carbohydrates, which were not accumulated under drought conditions, increased their abundance significantly after rehydration in the 4/10 form. This phenomenon was not observed in the 7/6 form (**Figure 4**).

These relations were also visible in the PCA, where Principal Component 1, which accounted for 47% of the variance, clearly separated the two forms with respect to the control values for the analyzed metabolites and indicated that the patterns of metabolite dynamics during drought and rehydration differed between the plants. The highest contribution to this separation came not only from the metabolites in correlation groups 4, 5, and 6, comprising most carbohydrates and both substrates and products of photosynthesis, but also from group 3, comprising amino acids (**Figure 5** and **Supplementary Table S2**).
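To make concrete what it means for a principal component to account for a given share of the variance, the following minimal sketch computes the fraction of total variance captured by the first component of a small data matrix. It is an illustration only, written in plain Python with power iteration, and is not the software or procedure used in the study; the function names and the toy data are assumptions.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix; rows are samples, columns are variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric matrix via power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    p = len(matrix)
    v = [1.0 + 0.1 * i for i in range(p)]  # deterministic, non-uniform start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(v[i] * mv[i] for i in range(p))  # Rayleigh quotient


def pc1_variance_fraction(rows):
    """Fraction of total variance explained by the first principal component."""
    cov = covariance_matrix(rows)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    return leading_eigenvalue(cov) / total_variance


# Toy data (not measured values): two perfectly correlated variables,
# so the first component captures all of the variance.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
fraction = pc1_variance_fraction(toy)
```

For the toy matrix the fraction is 1 (up to rounding); for real metabolite data with many partly independent variables, the first component captures only part of the variance, as with the 47% reported here.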

### Amino Acids and Amines

The accumulation of amino acids from correlation group 3 (**Figure 5**) had a significant impact on the separation of genotypes and time-points according to the PCA Component 2 values (**Supplementary Table S2**). Although a clear separation of genotypes was not in fact observed, the accumulation level of most amino acids increased significantly after 6 and 11 days of drought treatment in both introgression forms, compared to control values (**Figures 2**, **4** and **5**). A higher accumulation level of aspartic acid, isoleucine, lysine, methionine, tryptophan and tyrosine in response to drought was revealed in both introgression forms, and of asparagine only in the form 7/6. On the other hand, alanine, arginine, proline and valine increased their accumulation level after 11 days of drought only in the form 4/10 (**Figures 2** and **4**). These results demonstrated that the proline accumulation level unexpectedly decreased at the beginning of drought (the 3rd and the 6th day of drought) in both analyzed grass forms. However, as drought treatment progressed, proline abundance returned to the values observed in the control conditions in the form 7/6, and was even higher in the form 4/10 (**Figure 2**). The accumulation pattern of the dimethylglycine-like amine differed between the two forms. In the form 4/10, its accumulation level increased after 11 days of drought, without further change after rehydration. In contrast, in the 7/6 form, a progressive decrease in dimethylglycine-like amine accumulation during the drought period, compared to the control conditions, and a slight increase after rehydration to the level observed in the 4/10 form, were noticed (**Figure 2**).

### Carbohydrates

With respect to the accumulation profiles of the identified carbohydrates, including substrates and products of photosynthesis, it was revealed that under the control conditions the form 7/6 had a higher accumulation level of F-1,6-2P, G-6-P 2 and glucose, compared to the form 4/10. The accumulation levels of sucrose and fructose did not differ significantly between the two forms, and only G-6-P 1 showed a higher abundance in the 4/10 form. However, after 11 days of drought the accumulation of F-1,6-2P and G-6-P 2 decreased, and that of G-6-P 1 and mannose-6-phosphate increased, in the 7/6 form to the levels observed in the 4/10 form (**Figure 2**). For the main photosynthesis products (glucose, fructose, and sucrose), a significant increase of accumulation level after 11 days of drought was observed in the analyzed introgression forms. No significant differences between the plants in fructose and sucrose accumulation levels were revealed, and only glucose showed a significantly higher abundance in the 7/6 form (**Figure 2**). After rehydration, all these compounds demonstrated higher accumulation levels in the 4/10 form, compared to the 7/6 form.

L. multiflorum/F. arundinacea introgression forms. The bars represent a mean value (over replications) for Log<sub>2</sub> transformed mass spectra peak intensities. Error bars represent non-weighted standard errors. The letters indicate groups of means that do not differ significantly at a significance level of 0.01 (Fisher's LSD-test).

metabolites with a significantly higher accumulation level between the analyzed forms at the particular time-points. (B) Numbers of metabolites with a significant increase or decrease in an accumulation level between time-points in the 7/6 form. (C) Numbers of metabolites with a significant increase or decrease in an accumulation level between time-points in the 4/10 form.

Other carbohydrates identified here that presented significant differences in accumulation level between the analyzed introgression forms were myo-inositol and myo-inositol-1-phosphate. Their abundance increased in response to drought in both introgression forms, but was significantly higher in the form 4/10 (**Figure 2**). The accumulation level of gentiobiose increased after 11 days of drought in both forms, being higher in the 7/6 form. Melibiose and glycerol revealed a gradual decrease of their abundance during the drought period in the form 7/6 and a more or less stable level in the 4/10 form (**Figure 2**).

### Other Compounds

Sugar acids, such as galactaric acid, gluconic acid 2, glyceric acid, ribonic acid 2, ribonic acid like 1, saccharic acid and threonic acid, revealed a significant increase of accumulation level after 11 days of drought in the 4/10 form, while in the form 7/6 a stable or even decreased accumulation level was observed for the majority of sugar acids. Only galactaric acid showed an increased accumulation level during drought in this introgression form (**Figure 2**). The accumulation patterns of organic acids also differed between the two forms. Five of the six analyzed organic acids presented a higher accumulation level under drought treatment and rehydration in the form 4/10. In the form 7/6, caffeic acid, dehydroascorbic acid, and maleic acid showed a significant reduction in accumulation level under drought. Only glutaric acid and shikimic acid accumulated significantly during drought treatment in this introgression form (**Figure 2**).

## DISCUSSION

### Primary Metabolite Accumulation Profiles with Respect to Chloroplast Fructose Bisphosphate Aldolase Activity and Photosynthetic Capacity

Under mild water deficit conditions stomata closure is the main physiological factor responsible for inhibiting photosynthesis through a reduction of CO<sub>2</sub> availability for the assimilation process (Lawlor, 2002). However, the mechanisms of non-stomatal limitations of photosynthesis have recently become an important topic for discussion. These limitations are also associated with metabolic factors (Flexas and Medrano, 2002), such as the Calvin cycle enzymes, among which pFBA accumulation level plays a crucial role (Haake et al., 1998, 1999; Uematsu et al., 2012). The pFBA is a key enzyme of the regeneration phase of this cycle, and its activity may be important for the regulation of photosynthesis intensity (Haake et al., 1998; Raines, 2003). It initiates the third regeneration phase of the Calvin cycle by catalyzing the reversible conversion of glyceraldehyde-3-phosphate and dihydroxyacetone phosphate to fructose-1,6-bisphosphate (F-1,6-2P), and the condensation of sedoheptulose-1,7-bisphosphate from erythrose-4-phosphate and dihydroxyacetone phosphate (Raines and Lloyd, 2007). Several studies have shown that a changed accumulation level of pFBA in plants could influence the efficiency of photosynthesis (Haake et al., 1998, 1999; Uematsu et al., 2012). However, this relationship has not been confirmed for the forage grasses. Here, we try to fill this gap in our knowledge. Photosynthesis is the main source of substrates for the production of carbohydrates, amino acids, and other cellular compounds. These metabolites serve not only as a cell energy reservoir but often also as signaling molecules important for growth, development, and response to unfavorable conditions (Dolferus, 2014). Thus, an inhibition of photosynthesis is usually reflected in disturbances of growth, biomass production, and metabolite accumulation (Chaves et al., 2010). Perennial grasses accumulate large quantities of soluble carbohydrates in their leaves, while having a low amount of starch (Pollock and Cairns, 1991; Cairns, 2002).

The observed higher accumulation level of most primary metabolites, including glucose, under control conditions and drought treatment in the form 7/6 was probably due to a higher efficiency of photosynthesis in this form. This supports one of the hypotheses formulated in the introduction section. On the other hand, a higher accumulation level of particular primary metabolites, including all the analyzed substrates and products of photosynthesis, after rehydration accompanied a higher potential of recovery in the 4/10 form (Perlikowski et al., 2014).

FIGURE 4 | The accumulation levels of the analyzed metabolites (relative to control values, calculated for mean Log<sub>2</sub> transformed mass spectral peak intensities) at four time-points of the experiment: after 3, 6, and 11 days (D) of DR, and after 10 days of RH in the 4/10 and 7/6 L. multiflorum/F. arundinacea introgression forms. The values lower than the control are shown in shades of red, and the values higher than the control are shown in shades of blue. The asterisks indicate metabolites with a significant increase of an accumulation level after 11 days of drought: blue in the 4/10 form, magenta in the 7/6 form and green in both introgression forms. The crosses indicate metabolites with a significantly higher accumulation level after 10 days of re-watering: blue in the 4/10 form, magenta in the 7/6 form and green in both introgression forms.
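Figure 4 reports accumulation levels relative to control values, calculated from mean Log<sub>2</sub> transformed peak intensities. One plausible reading of that calculation is sketched below. It assumes, as a labeled assumption rather than a confirmed detail of the authors' method, that the relative level is the difference between the mean log2 intensity at a time-point and the mean log2 intensity in the control; the function names and the example intensities are hypothetical.

```python
import math
from statistics import mean


def mean_log2(intensities):
    """Mean of log2-transformed peak intensities across replicates."""
    return mean(math.log2(x) for x in intensities)


def relative_to_control(treated, control):
    """Difference of mean log2 intensities (assumed definition).

    Positive values mean higher accumulation than in the control,
    negative values mean lower accumulation.
    """
    return mean_log2(treated) - mean_log2(control)


# Illustrative intensities for three replicates (not measured data).
control_intensities = [2.0, 2.0, 2.0]
drought_intensities = [8.0, 8.0, 8.0]
change = relative_to_control(drought_intensities, control_intensities)
```

Under this assumption a value of 2 corresponds to a four-fold higher intensity than in the control, and a value of 0 to no change.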

Our results indicated that water deficit had a significant impact on the pFBA activity in the two analyzed introgression forms. However, after re-watering that activity was closer to the control values (before stress application) in the 4/10 form, indicating its higher potential of metabolism recovery after stress cessation. It was previously observed that the impaired expression of pFBA in transgenic lines of Solanum tuberosum (potato) affected plant growth, and negatively influenced carbohydrate accumulation, including phosphoryl substrates for photosynthesis and fructose, glucose, sucrose, and starch in ambient conditions. A lower level of pFBA accumulation also influenced the activity of other Calvin cycle enzymes, such as plastid fructose-1,6-bisphosphatase (Haake et al., 1998, 1999). However, the impact of decreased pFBA activity on photosynthesis rate and metabolite accumulation was observed only when less than 50% of pFBA activity was removed, but even under those conditions the accumulation levels of the most important hexose phosphates were not impaired (Haake et al., 1998). In the present study, an impact of the higher pFBA activity on the higher photosynthesis efficiency and higher metabolite accumulation level in the 7/6 form could be suggested, but in fact the difference between the analyzed plants in the activity of this enzyme was less than 25%. Also, the research performed on transgenic Nicotiana tabacum (tobacco) overexpressing pFBA (Uematsu et al., 2012) clearly showed that an increased accumulation level of pFBA in plant tissues enhanced plant growth, biomass production, CO<sub>2</sub> assimilation rate and ribulose-1,5-bisphosphate accumulation. Both introgression forms analyzed here significantly reduced their stomatal conductance during the drought period, compared to the control conditions, and this process was associated with a reduction of CO<sub>2</sub> assimilation levels in these two plants. However, as also shown in our earlier work, with closely similar levels of stomatal conductance during drought, the form 7/6 revealed a significantly higher CO<sub>2</sub> assimilation rate, compared to the form 4/10 (Perlikowski et al., 2014). In the current study, we demonstrated that this phenomenon was accompanied by higher accumulation and activity levels of pFBA in the form 7/6. Further research is required on the aldolase gene expression level using qRT-PCR as well as on the expression of genes coding other enzymes of the Calvin cycle. As far as we know, this report is the first one for the Lolium-Festuca forage grasses demonstrating crucial components of non-stomatal regulation mechanisms of CO<sub>2</sub> assimilation level and photosynthetic capacity under drought conditions.


the present study. This model involves the results regarding chloroplast aldolase (pFBA) activity and specific metabolites' accumulation levels under drought and recovery conditions in the 7/6 and 4/10 L. multiflorum/F. arundinacea introgression forms. In velvet – characteristics for the 7/6 form and in blue – for the 4/10 form, are presented. The intensity of the arrows shows quantitative differences in the indicated physiological/metabolic process between the analyzed introgression forms. Abbreviations: D, day; EL, electrolyte leakage; H, higher level, compared to the other introgression form; L, lower level, compared to the other introgression form; (+), increased level, compared to the control conditions; (−), decreased level, compared to the control conditions.

### Primary Metabolite Accumulation Profiles in Response to Water Deficit Conditions

In line with our hypothesis, water deficit had a significant impact on the primary metabolism of Lolium-Festuca grasses. The differences in the observed alterations in metabolite profiles were clearly visible between the analyzed introgression forms, mainly with respect to osmoprotectant and signaling molecules.

### Carbohydrates and Their Derivatives

Within the group of osmotically active compounds, soluble carbohydrates are especially important. Apart from their obvious central role at various levels of plant cell metabolism, these compounds are also associated with a wide range of response and signaling pathways altered by environmental stimuli (Hare et al., 1998; Fang and Xiong, 2015). Their small size, neutral character and biochemical compatibility allow them to maintain cellular water potential and interact with other cellular compounds, stabilizing protein and membrane structures (Hoekstra et al., 2001; Xoconostle-Cazares et al., 2010). In the previous section, the accumulation profiles of the identified substrates and products of photosynthesis were discussed. Here, further potential functions of glucose, fructose, and sucrose in the plant cell response to drought treatment are considered. A higher accumulation level of glucose noticed in the introgression form 7/6 on the 11th day of drought could also result in a higher ROS production in this form (Russell et al., 2002), and its higher exposure to potential oxidative damage, reducing the level of regeneration. However, this research aspect requires further work. An increased accumulation level of sucrose, fructose, and glucose was previously observed in forage grasses, such as F. arundinacea (Zwicke et al., 2015) and L. perenne (Foito et al., 2009), and in other plant species, such as Oryza sativa (rice; Ambavaram et al., 2014), Triticum aestivum (wheat; Izanloo et al., 2008), and S. tuberosum (Yang et al., 2015) under dehydration conditions, indicating that the mechanism of water deficit tolerance could be associated with a higher accumulation of soluble carbohydrates (Izanloo et al., 2008; Ambavaram et al., 2014). On the other hand, the research performed on L. perenne under drought treatment showed that sucrose, glucose, and fructose accumulation levels did not change significantly in leaves under stress treatment (Amiard et al., 2003).

Glucose and sucrose are among the final products of carbon assimilation in plants, as well as precursors of the majority of organic compounds used in metabolic pathways (Ap Rees, 1995). Changing environmental conditions, including stress events such as water deficit, often promote the accumulation of sucrose and products of its metabolism in the vegetative tissues. These compounds can serve as osmoprotectants (Kido et al., 2013), replacing dissipating water, especially under desiccation conditions (Hoekstra et al., 2001). The sucrose accumulation level is a result of sucrose biosynthesis or degradation of sugar polymers, such as starch. Both of these processes can be stimulated by drought (Zwicke et al., 2015). Overall, it was shown that sucrose could be associated with a dehydration avoidance mechanism, and can replace the functions of other osmoprotectants, such as trehalose or raffinose (Foito et al., 2009), which were not identified here.

Other well-recognized osmoprotectants revealed here included gentiobiose, melibiose, and glycerol (Bartels and Sunkar, 2005), which presented genotype-dependent differences in accumulation levels under drought conditions. Other important metabolites identified here are myo-inositol and its related compounds, which can be used as substrates for the synthesis of other osmoprotectants, such as raffinose family oligosaccharides (Karner et al., 2004), galactinol (ElSayed et al., 2014) or d-ononitol (Sheveleva et al., 1997), and of a wide spectrum of lipid signaling compounds, such as phosphatidylinositol, phosphatidylinositol-phosphate, polyphosphoinositides, myo-inositol phosphate or sphingolipid-related molecules. These are crucial components of various metabolic pathways involved in the control of gene expression, hormonal regulation and response reactions to stress conditions (Liu et al., 2013; Zhai et al., 2015). Myo-inositol belongs to the sugar alcohol group of metabolites, in which the hydroxyl group can substitute for the hydroxyl group of water during interaction with membrane lipids and proteins, maintaining their structure and properties (ElSayed et al., 2014). The increased accumulation of myo-inositol and myo-inositol-1-phosphate in the 4/10 form was positively correlated with a higher accumulation level of phosphatidylinositol in this form during drought treatment, as described earlier by Perlikowski et al. (2016). It was shown earlier that the overexpression of myo-inositol-1-phosphate synthase, a key enzyme limiting the amount of myo-inositol produced, increased the level of tolerance to osmotic stresses in plants such as rice, tobacco, and potato (Yang et al., 2008; Goswami et al., 2014; Zhai et al., 2015). Also, in other experiments, myo-inositol was accumulated in higher amounts in drought- and salt-treated tobacco (Sheveleva et al., 1997).

In this work, an increased abundance of myo-inositol did not perfectly reflect the accumulation level of its potential product, galactinol, under drought conditions. On the other hand, the increase of myo-inositol and galactinol accumulation levels was not observed earlier in L. perenne under drought conditions (Amiard et al., 2003). The accumulated carbohydrates can above all serve as carbon storage for the recovery period after stress cessation (Hare et al., 1998). A higher accumulation level of myo-inositol, myo-inositol-1-phosphate and melibiose after 11 days of drought in the form 4/10 could positively influence a faster recovery of this form during rehydration.

Shikimic acid is a key compound in the biosynthetic pathway of aromatic amino acids, such as phenylalanine, tyrosine, and tryptophan (Kojima et al., 2015). Shikimic acid was previously observed to accumulate under drought in potato leaflets (Yang et al., 2015) and L. perenne leaves (Foito et al., 2009). An accumulation of known antioxidants, such as threonic acid and dehydroascorbic acid (Debolt et al., 2007), in the form 4/10 under drought and further recovery might be associated with a more efficient regeneration mechanism in this form, linked to more efficient scavenging of ROS. In the form 7/6, those compounds were not highly accumulated during drought treatment, compared to the control conditions, or their accumulation levels even decreased in drought.

### Amino Acids

Amino acids represent a highly significant group of metabolites, and it was demonstrated that their accumulation levels have a significant impact on the expression of plant drought tolerance (Barchet et al., 2014). It was previously observed that the content of particular amino acids increased under drought conditions in plants. In creosotebush (Larrea divaricata) a significant increase of alanine, arginine, histidine, isoleucine, valine, glutamic acid, phenylalanine, and proline abundance was observed (Saunier et al., 1968). Potato plants exposed to drought treatment also accumulated higher amounts of glutamine/glutamic acid, serine, threonine, proline, phenylalanine, isoleucine, leucine, and valine (Yang et al., 2015). The previously observed higher accumulation level of asparagine in O. sativa cultivars under drought conditions was found to be negatively correlated with drought tolerance, and was characteristic of susceptible genotypes with a lower water use efficiency (Degenkolbe et al., 2013). The amino acid accumulation in plants exposed to dehydration conditions could also be due to protein hydrolysis (Saunier et al., 1968), and may also be associated with nitrogen storage for further remobilization in metabolism during the recovery period after stress cessation (Sicher and Barnaby, 2012).

One of the most common reactions of many plants upon experiencing dehydration conditions is an accumulation of proline and glycine betaine (Bandurska and Jóźwiak, 2010). Proline, with its neutral character, does not negatively affect the cell environment and can be accumulated in higher amounts, since it functions as an osmotic adjustment molecule and protects other cellular compounds, such as proteins and lipids, against denaturation and peroxidation (Bandurska, 2001). Proline has a high water potential, and its hydrophobic site can interact with hydrophobic parts of proteins, while its hydrophilic site, with a high affinity to water, allows it to maintain a high water potential, solubility and a native structure of proteins under dehydration conditions (Hoekstra et al., 2001). It was previously reported that related genotypes of F. arundinacea (Abernethy and McManus, 1998), F. rubra and L. perenne (Bandurska and Jóźwiak, 2010; Salehi et al., 2014) accumulated higher levels of proline and glycine betaine during a water deficit period, compared to well-watered plants. On the other hand, it was proposed earlier that proline made a lower contribution to the osmotic adjustment under drought stress in forage grasses, compared to soluble carbohydrates (Barker et al., 1993). Glycine betaine accumulates in plants under water deficit conditions, functioning as an osmoprotectant and stabilizing the structure and activity of cell compounds (Sakamoto and Murata, 2002). It was previously reported that a progressive drought treatment induced an accumulation of glycine betaine in Hordeum vulgare (barley; Zuniga et al., 1989) and F. arundinacea (Abernethy and McManus, 1998). Here, glycine betaine was not identified, although one of its relatives, a dimethylglycine-like precursor, was found among the analyzed metabolites. A higher accumulation level of proline and the dimethylglycine-like compound under the advanced drought conditions (the 11th day of drought) in the form 4/10 could probably positively affect the regeneration capacity of this form during the recovery period, after rehydration.

### Primary Metabolite Accumulation Profiles and Plant Regeneration after Stress Cessation

Recovery after cessation of water deficit conditions refers to a plant's ability to regrow and to produce fresh biomass during tissue regeneration following severe turgor loss and dehydration (Luo, 2010). In previous work, it was shown that the form 4/10, after 10 days of rehydration, was characterized by a significantly decreased level of electrolyte leakage, compared to its level after 11 days of drought, whereas in the 7/6 form an elevated level of electrolyte leakage was also observed after the rehydration period. This indicated that the form 4/10 had more efficient regeneration mechanisms after stress cessation (Perlikowski et al., 2014). During recovery, soluble compounds stabilizing the membrane surface are replaced by water in rehydrating membranes, allowing them to slowly regain their function and preventing any quick rupture of membranes caused by water flow (Hoekstra et al., 2001). It has been noticed that the accumulation of osmotically active compounds during a drought period was not fully reversible after the stress cessation, and this was associated with a stress memory. The rate of possible return of these compounds to the control values depended mainly on the strength of the earlier stress treatment and the level of plant stress tolerance (Morgan, 1984). Thus, we assume that the elevated levels of the most crucial compounds of the photosynthetic pathway and of osmoprotectants observed after rehydration in the 4/10 form could be, at least partially, associated with a stress memory, and with a stronger physiological performance of this form under recovery conditions, compared to the form 7/6. Sucrose decreased its abundance in both introgression forms after rehydration, and this could be associated with metabolic demands to restore normal cell activity and growth after stress cessation (Zwicke et al., 2015).

Osmotic adjustment is considered one of the most important mechanisms of plant tolerance to water deficit conditions, and also plays a crucial role in plant recovery after stress cessation (Morgan, 1984). The impact of drought stress on plant development depends mainly on the stress intensity and duration, but also on genotype-specific traits and earlier plant pre-hardening in stressful conditions. Although the mechanisms driving a stress memory are still poorly understood, evidence exists that the accumulation of some signaling molecules during drought conditions could be, at least partially, associated with this phenomenon (Bruce et al., 2007).

## CONCLUSIONS

The results obtained in this study clearly indicate that accumulation and activity levels of pFBA can influence the capacity of photosynthesis in the L. multiflorum/F. arundinacea introgression forms, through the efficiency of the Calvin cycle. The phenomenon described in our paper is the first example of non-stomatal mechanisms involved in the regulation of photosynthetic rate during prolonged drought treatment in the Lolium-Festuca forage grasses. The higher activity and accumulation levels of pFBA in the 7/6 form were accompanied by a higher accumulation level of most photosynthesis products in this form during control and drought periods. On the other hand, the form 4/10 demonstrated higher accumulation levels of stress tolerance marker metabolites, such as proline, melibiose, galactaric acid, myo-inositol and myo-inositol-1-phosphate, under drought conditions. Their accumulation could be associated with a more efficient capacity of membrane regeneration in this form after stress cessation. Furthermore, during rehydration, the 4/10 introgression form also maintained elevated levels of most of the metabolites analyzed herein. It cannot be excluded that these alterations to metabolism could be involved in a stress memory mechanism in forage grasses; however, this research aspect requires further work. The most important conclusions of this study are also presented graphically in a model figure (**Figure 6**).

## AUTHOR CONTRIBUTIONS

DP and AK designed the experiments. DP, MC, ŁM, and AA conducted the experimental work. DP and AK drew the main conclusions. DP carried out the statistical analysis. DP and AK prepared the first version of the manuscript, and all the authors contributed to further writing and read and approved the final manuscript.

## FUNDING

The plant selection and physiological analysis were performed within projects funded by the Polish Ministry of Agriculture and Rural Development (no. 84; 2011 and 2012).

## ACKNOWLEDGMENT

We thank Prof. Neil Jones from Aberystwyth University for a critical reading of this manuscript.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016.01063

FIGURE S1 | The alignment of chloroplast fructose bisphosphate aldolase (pFBA) cDNA sequences with 3′ and 5′ UTR of 4/10 and 7/6 Lolium multiflorum/Festuca arundinacea introgression forms with pFBA cDNA sequence of Brachypodium distachyon (B.d.; XM\_003577737.1). Gray highlighted text indicates 'start' and 'stop' positions of the coding sequence.

FIGURE S2 | The alignment of chloroplast fructose bisphosphate aldolase (pFBA) predicted protein sequences of 4/10 and 7/6 L. multiflorum/F. arundinacea introgression forms with pFBA protein sequence of Brachypodium distachyon (B.d.; XM\_003577737.1). Black highlighted text indicates a predicted chloroplast transition sequence.

TABLE S1 | The normalized mass spectral intensities for three biological replicates, data base annotations and ANOVA results of the analyzed metabolites. Calculations were performed for five time-points of experiment: before drought, after three, six and 11 days of drought, and after ten days of re-watering in Lolium multiflorum/Festuca arundinacea introgression forms, based on [Log2] transformed normalized mass spectra peak intensities.

TABLE S2 | Loadings of all the PCA components for the analyzed metabolites. Calculations were performed for five time-points of experiment: before drought, after three, six and 11 days of drought, and after ten days of re-watering in Lolium multiflorum/Festuca arundinacea introgression forms, based on [Log2] transformed normalized mass spectra peak intensities.

## REFERENCES


to face abiotic stress. BMC Bioinformatics 14(Suppl. 1):S7. doi: 10.1186/1471- 2105-14-S1-S7


to drought stress in tall fescue (Festuca arundinacea Schreb.). Mol. Biotechnol. 56, 248–257. doi: 10.1007/s12033-013-9703-3


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Perlikowski, Czyżniejewski, Marczak, Augustyniak and Kosmala. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Transcriptome Profiling of Buffalograss Challenged with the Leaf Spot Pathogen Curvularia inaequalis

#### Bimal S. Amaradasa<sup>1</sup> and Keenan Amundsen<sup>2</sup> \*

<sup>1</sup> Department of Plant Pathology, University of Nebraska–Lincoln, Lincoln, NE, USA, <sup>2</sup> Department of Agronomy and Horticulture, University of Nebraska–Lincoln, Lincoln, NE, USA

Buffalograss (Bouteloua dactyloides) is a low maintenance U.S. native turfgrass species with exceptional drought, heat, and cold tolerance. Leaf spot caused by Curvularia inaequalis negatively impacts buffalograss visual quality. Two leaf spot susceptible and two resistant buffalograss lines were challenged with C. inaequalis. Samples were collected from treated and untreated leaves when susceptible lines showed symptoms. Transcriptomes were sequenced and differentially expressed genes were identified. Approximately 27 million raw sequencing reads were produced per sample. More than 86% of the sequencing reads mapped to an existing buffalograss reference transcriptome. A de novo assembly of the unmapped reads was merged with the existing reference to produce a more complete transcriptome. There were 461 differentially expressed transcripts between the resistant and susceptible lines when challenged with the pathogen and 1552 in its absence. Previously characterized defense-related genes were identified among the differentially expressed transcripts. Twenty-one resistant line transcripts were similar to genes regulating pattern-triggered immunity and 20 transcripts were similar to genes regulating effector-triggered immunity. There were also nine up-regulated transcripts in resistant lines that showed potential to initiate systemic acquired resistance (SAR) and three transcripts encoding pathogenesis-related proteins, which are downstream products of SAR. This is the first study characterizing changes in the buffalograss transcriptome when challenged with C. inaequalis.

Keywords: buffalograss (Bouteloua dactyloides), Curvularia inaequalis, leaf spot, defense-related genes, transcriptome, next-generation sequencing

#### Edited by:

Sergio Lanteri, University of Turin, Italy

#### Reviewed by:

Abu Hena Mostafa Kamal, University of Texas at Arlington, USA

Julio Vega-Arreguin, National Autonomous University of Mexico, Mexico

> \*Correspondence: Keenan Amundsen kamundsen2@unl.edu

#### Specialty section:

This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science

Received: 29 December 2015 Accepted: 09 May 2016 Published: 25 May 2016

#### Citation:

Amaradasa BS and Amundsen K (2016) Transcriptome Profiling of Buffalograss Challenged with the Leaf Spot Pathogen Curvularia inaequalis. Front. Plant Sci. 7:715. doi: 10.3389/fpls.2016.00715

### INTRODUCTION

Buffalograss (Bouteloua dactyloides) is a U.S. native, warm-season turfgrass species with exceptional drought, heat, and cold tolerance (Beetle, 1950; Reeder, 1971). Buffalograss requires less fertilizer, pesticide, and water to maintain an acceptable quality level compared to traditional turfgrass species (Riordan et al., 1993). Replacing traditional turfgrass species with buffalograss may help to conserve water, especially in the semi-arid and arid regions of the USA (Riordan et al., 1993). Buffalograss is tolerant of many diseases, but leaf spot can cause decline or death of buffalograss turf. Leaf spot is caused by several species belonging to the Curvularia, Bipolaris, and Cercospora genera (Smith et al., 1989; Smiley et al., 2005). In Nebraska, Curvularia inaequalis (Shear) Boedijn and Bipolaris spicifera (Bainier) Subram (teleomorph: Cochliobolus spicifer Nelson) are commonly isolated from buffalograss with leaf spot symptoms (Amaradasa and Amundsen, 2014b). On lawns, the disease begins as dark brown leaf spots, followed by leaf tip dieback and eventual blighting of entire tillers. As the disease progresses, patches of leaf decline and canopy thinning occur. Leaf spot symptoms of C. inaequalis and B. spicifera are identical and therefore it is not possible to distinguish the causal organism by disease symptoms alone. Disease development commonly occurs when temperatures are 30°C and above. Disease severity increases when buffalograss is under stress from adverse weather conditions such as high temperatures, high humidity, drought, excess rain, and cloud cover. Since buffalograss is often considered a low-maintenance turfgrass, the use of fungicides is usually not preferred by homeowners and lawn care managers. Incorporating host resistance through plant breeding is one way to combat leaf spot disease. Conventional breeding for disease resistance is difficult and time consuming, and is based on inoculation, rating for incidence and severity of disease, and selection of resistant genotypes. Identification of genes that confer leaf spot resistance would enable molecular-based strategies to improve the efficiency of breeding for resistant cultivars.

Today, comparative genetic studies using next generation sequencing (NGS) technology are common for characterizing gene functions in plants and other organisms. NGS technology can be used to sequence both genomic DNA and total RNA (RNA-seq). It produces large numbers of short sequencing reads cost-effectively, and these reads can be assembled into a genome or transcriptome de novo or mapped to a reference to determine differentially expressed genes. Buffalograss has a basic chromosome number of 10 and exists as a ploidy series of diploids, tetraploids, pentaploids, and hexaploids (Johnson et al., 1998). This large, repetitive genome makes whole genome sequencing and annotation difficult. In contrast, transcriptome profiling by RNA-seq has been used to decipher differentially expressed genes in grass systems (Wang et al., 2009; Gutierrez-Gonzalez et al., 2013; Wachholtz et al., 2013; Yang et al., 2013). The number of short reads from RNA-seq data that map to a gene gives an indication of its level of expression, which makes the method highly suitable for gene expression studies (Wang et al., 2009).

To identify differentially expressed defense-related genes in buffalograss, we profiled transcriptomes of two leaf spot resistant and two susceptible lines after challenging with C. inaequalis. For this study, we chose to use Curvularia over Bipolaris since it is more virulent and produces disease symptoms faster. De novo assembly of RNA-seq data from a previous Prestige buffalograss NGS study resulted in a reference assembly of 91,519 contigs (Wachholtz et al., 2013); this previously published buffalograss transcriptome was used as a reference in our study. In the previous study, basal transcriptional expression differences were compared between the two buffalograss cultivars Prestige and 378. However, identification of defense-related genes in response to a pathogen was not part of the previous study. The main objective of our study was to identify buffalograss leaf spot resistance genes differentially expressed between resistant and susceptible buffalograss.

### MATERIALS AND METHODS

### Buffalograss Inoculation and Leaf Tissue Sampling

Two leaf spot resistant (95-55 and NE-BFG-7-3459-17) and two susceptible buffalograss lines (Prestige and NE-BFG-7-3453-50) identified previously were used in our study (Amaradasa and Amundsen, 2014a). Stolons of leaf spot resistant lines and susceptible lines were planted in 7-cm-diameter plastic pots filled with Fafard® 3B Mix potting medium. Pots were kept in a greenhouse with a 16 h day and 8 h night photoperiod. The average daytime and nighttime temperature of the greenhouse was maintained at 30 and 22°C, respectively. Plants were watered daily, fertilized biweekly with 20–20–20 to provide an approximate annual rate of 10 g N m<sup>−2</sup>, and clipped with scissors regularly to a height of 6 to 7 cm to promote prostrate growth and pot coverage. After 12 weeks of growth, plants were arranged in a randomized complete block design (RCBD) with three replications. Single-spore C. inaequalis strain 4L-SS01 was used to prepare a spore culture of 1 × 10<sup>6</sup> spores ml<sup>−1</sup> according to published methods (Brecht et al., 2007). Each pot was sprayed with 15 ml of the spore solution. Untreated controls were also included with three replicates and sprayed with water in place of the spore solution. After 10 days, when susceptible lines were exhibiting distinct disease symptoms, leaf tissue from both inoculated and uninoculated pots was harvested into separate freezer bags and immediately frozen in liquid nitrogen. Samples were kept at −80°C for later use.
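The design described above can be illustrated with a short sketch. It assumes, which the text does not state explicitly, that inoculated and uninoculated pots of all four lines shared the same three blocks, giving eight pots per block and 24 pots in total. The function name `rcbd_layout` and the random seed are hypothetical and serve only to make the example reproducible.

```python
import random

# The four buffalograss lines and two treatments named in the text.
LINES = ["95-55", "NE-BFG-7-3459-17", "Prestige", "NE-BFG-7-3453-50"]
TREATMENTS = ["inoculated", "uninoculated"]


def rcbd_layout(lines, treatments, n_blocks, seed=0):
    """Return one independently shuffled pot order per block.

    Every line x treatment combination appears exactly once in each
    block, which is the defining property of a randomized complete
    block design.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    layout = []
    for block in range(1, n_blocks + 1):
        pots = [(line, trt) for line in lines for trt in treatments]
        rng.shuffle(pots)  # randomize pot positions within the block
        layout.append([(block, line, trt) for line, trt in pots])
    return layout


layout = rcbd_layout(LINES, TREATMENTS, n_blocks=3)
for block in layout:
    for position, (block_id, line, trt) in enumerate(block, start=1):
        print(f"block {block_id} position {position}: {line} ({trt})")
```

Shuffling within each block, rather than across all 24 pots, keeps each replicate complete so that block-to-block greenhouse variation can be separated from line and treatment effects.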

### Total RNA Sequencing and Analysis

Approximately 100 mg of leaf tissue of each sample (95-55, NE-BFG-7-3459-17, Prestige, and NE-BFG-7-3453-50) was homogenized in liquid nitrogen using a mortar and pestle and RNA was extracted using an RNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions. RNA samples were qualitatively analyzed by agarose gel electrophoresis, and quantified using a NanoDrop 2000C spectrophotometer (Thermo Fisher Scientific Inc., Wilmington, DE, USA). Total RNA from 24 samples [4 buffalograss lines × 3 replicates × 2 treatments (inoculated/uninoculated)] was sent to the High-Throughput DNA Sequencing and Genotyping Core Facility located at the University of Nebraska Medical Center, Omaha, Nebraska for transcriptome sequencing. The cDNA libraries were prepared and then sequenced using a HiSeq 2000 sequencing platform (Illumina, San Diego, CA, USA) according to the manufacturer's RNA-seq protocol. The 24 samples were separately barcoded and run on three lanes of the HiSeq 2000 to obtain 100 bp single-end reads. Quality filtering of the reads was done by the Genotyping Core Facility. FastQC<sup>1</sup> was used to visualize the quality of the reads using default parameters. Since FastQC showed several overrepresented reads consisting of Illumina adapter and primer sequences, Trimmomatic-0.30 (Bolger et al., 2014) was used to remove those contaminants. The reads were trimmed to a uniform length of 80 bp prior to downstream analysis. A fastq file containing the sequencing reads and quality data was used for downstream analysis. The sequencing reads were mapped with Bowtie2-2.1.0 (Langmead and Salzberg, 2012) to the B. dactyloides cv. Prestige transcriptome (Wachholtz et al., 2013). Reads that did not map to the reference were retained and assembled using Trinity-r2013-02-25 (Grabherr et al., 2011). The Trinity assembled contigs and the Prestige reference transcriptome were merged and cd-hit-est version 4.5.4 (Weizhong and Godzik, 2006) was used to remove redundancy with a 100% identity threshold to create the buffalograss transcriptome. Single-end raw sequencing reads of each individual were mapped with Bowtie2 to the buffalograss transcriptome to estimate transcript abundance per individual. A read count table was produced using SAMtools (Li et al., 2009) and Perl<sup>2</sup>. To account for the variability in total sequencing output among samples, mapped read counts were normalized and then analyzed for differential expression using the DESeq2 Bioconductor package (Love et al., 2014) in R (version 3.0.2).

<sup>1</sup>http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
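The normalization step can be illustrated with a minimal sketch of the median-of-ratios method, which is the default size-factor estimator in DESeq2. The text does not report which DESeq2 settings were used, so the default is an assumption here. The function names and the toy transcript IDs (`t1` to `t4`) are hypothetical, and the code is a pure-Python illustration of the idea rather than the DESeq2 implementation.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping transcript ID -> list of raw read counts,
    one entry per sample (same sample order for every transcript).
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue  # a zero count gives no finite geometric mean; skip the row
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor of a sample is the median ratio of its counts
    # to the per-transcript geometric mean across samples.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    sf = size_factors(counts)
    return {t: [c / s for c, s in zip(row, sf)] for t, row in counts.items()}


# Toy data: sample 2 was sequenced exactly twice as deeply as sample 1.
counts = {"t1": [10, 20], "t2": [100, 200], "t3": [5, 10], "t4": [0, 3]}
sf = size_factors(counts)
norm = normalize(counts)
print(sf)
print(norm["t1"])
```

In the toy data the second size factor is twice the first, so the normalized counts of `t1`, `t2`, and `t3` become equal across the two samples, which is the intended effect of correcting for sequencing depth.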

Read counts of the two inoculated resistant (R) lines were compared separately to each inoculated susceptible (S) line. We used a final adjusted P-value of < 0.01 to select transcripts that showed a difference in expression between inoculated resistant and susceptible lines. The differentially up-regulated transcripts of 95-55 (R) vs. NE-BFG-7-3453-50 (S) were compared with 95-55 (R) vs. Prestige (S) and common transcripts were identified. Similarly, common up-regulated transcripts between inoculated resistant line NE-BFG-7-3459-17 (R) vs. inoculated susceptible lines were selected. Then the two sets of up-regulated genes (i.e., 95-55 vs. susceptible lines and NE-BFG-7-3459-17 vs. susceptible lines) were compared to each other and common transcripts were identified for annotation. By this filtering procedure, we identified transcripts that are common and differentially up-regulated in both inoculated resistant lines compared to the susceptible lines (up-regulated in resistant inoculated; URI). In the same way, we compared each uninoculated resistant line with each uninoculated susceptible line and used the same filtering procedure to identify genes in common that have different levels of expression between both resistant and susceptible lines (basal up-regulated expressions; BUE). Then we identified down-regulated transcripts in resistant inoculated and uninoculated cultivars compared to inoculated and uninoculated susceptible lines, respectively (down-regulated in resistant inoculated: DRI; basal down-regulated expressions: BDE).
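The filtering procedure above amounts to intersecting the four pairwise comparisons. The following is a minimal sketch under the assumption that each resistant-vs-susceptible comparison has already been reduced to a set of transcript IDs passing the adjusted P-value cutoff; the function name, the line labels `R1`, `R2`, `S1`, `S2`, and the transcript IDs are hypothetical.

```python
def common_upregulated(de_results, resistant, susceptible):
    """Transcripts up-regulated in every resistant line vs. every susceptible line.

    de_results maps a (resistant line, susceptible line) pair to the set
    of transcript IDs called up-regulated in the resistant line.
    """
    per_resistant = []
    for r in resistant:
        # Step 1: keep transcripts shared by this resistant line's
        # comparisons against all susceptible lines.
        shared = set.intersection(*(de_results[(r, s)] for s in susceptible))
        per_resistant.append(shared)
    # Step 2: keep transcripts shared by all resistant lines.
    return set.intersection(*per_resistant)


# Toy input with made-up transcript IDs.
de_results = {
    ("R1", "S1"): {"tA", "tB", "tC"},
    ("R1", "S2"): {"tA", "tB"},
    ("R2", "S1"): {"tA", "tC"},
    ("R2", "S2"): {"tA", "tD"},
}
uri_like = common_upregulated(de_results, ["R1", "R2"], ["S1", "S2"])
print(uri_like)
```

Applying the same function to sets of down-regulated transcripts, or to the uninoculated comparisons, would yield the analogues of the DRI, BUE, and BDE sets.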

<sup>2</sup>http://www.perl.org

The four sets of transcripts were pooled and annotated with Blast2GO using default settings (Conesa et al., 2005). The annotated transcripts were analyzed separately to identify genes responsible for induced resistance (transcripts of URI and DRI) and innate immunity (transcripts of BUE and BDE) in buffalograss. Blast2GO was used to prepare graphs of biological, cellular and molecular processes at level two gene ontologies. Gene ontology IDs resulting from the analysis were mined for disease resistance related terms.
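Mining annotations for disease resistance related terms can be sketched as a keyword filter over the GO term names attached to each transcript. The text does not describe how the mining was done, so this is only one plausible approach; the keyword list is taken from GO terms reported later in the article, and the function name, input layout, and toy transcript IDs are assumptions.

```python
# GO term names reported in the article, used here as search keywords.
DEFENSE_KEYWORDS = (
    "defense response",
    "response to other organism",
    "innate immune response",
    "response to salicylic acid",
)


def mine_defense_terms(annotations, keywords=DEFENSE_KEYWORDS):
    """Return transcripts whose GO term names contain any keyword.

    annotations: dict mapping transcript ID -> iterable of GO term names
    (a hypothetical export of the annotation results).
    """
    hits = {}
    for transcript, terms in annotations.items():
        matched = sorted(
            {t for t in terms if any(k in t.lower() for k in keywords)}
        )
        if matched:
            hits[transcript] = matched
    return hits


# Toy annotations with made-up transcript IDs.
annotations = {
    "tx1": ["defense response", "metabolic process"],
    "tx2": ["catalytic activity"],
    "tx3": ["Innate immune response", "defense response to fungus"],
}
hits = mine_defense_terms(annotations)
print(hits)
```

A substring match deliberately also captures child terms such as "defense response to fungus", at the cost of possible false positives that would need manual review.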

### Validation of BUE and URI Gene Expression

Ten differentially expressed transcripts were selected for validation by reverse transcription PCR (RT-PCR). These transcripts showed up-regulation either in inoculated or uninoculated resistant buffalograss lines compared to inoculated or uninoculated susceptible lines, respectively. Some transcripts were chosen because they did not have any read counts for either inoculated or uninoculated susceptible lines (Supplementary Table S1). The primers (**Table 1**) were designed for each transcript using Primer3web<sup>3</sup> version 4.0.0. The primers for ubiquitin conjugating enzyme (UCE) were used as a positive control (**Table 1**). RNA was extracted from new plants of inoculated and uninoculated 95-55, NE-BFG-7-3459-17, Prestige, and NE-BFG-7-3453-50 as described previously. cDNA was synthesized using an Invitrogen™ SuperScript® III First-Strand Synthesis System (Life Technologies, Grand Island, NY, USA) according to the manufacturer's instructions. Using a standard PCR protocol (each 25 µl reaction mixture contained 1x Taq DNA polymerase buffer, 0.2 µM forward and reverse primers, 0.2 mM each dNTP, and 1 U Taq polymerase), each primer pair was used to amplify 100 ng of cDNA template in a Mastercycler® Pro thermal cycler (Eppendorf, Hamburg, Germany) with the following conditions: initial denaturation at 94°C for 3 min, followed by 35 cycles of 94°C for 30 s, 55°C for 1 min, and 72°C for 1 min, and a final extension at 72°C for 10 min. Thereafter the reaction was stopped by reducing the temperature to 4°C and PCR products were stored at −20°C. Aliquots (5 µl) of amplified products were separated by electrophoresis on a gel containing 1.7% (w/v) agarose and 1x TAE, at 100 V for 1 h. The presence and size of the DNA fragments were verified by staining the gel with ethidium bromide and observing under UV light.
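As a worked example of the cycling conditions above, the programmed hold time of the PCR run can be computed directly from the stated steps. The sketch counts only the programmed holds; temperature ramp times depend on the instrument and are not included. The function name is hypothetical.

```python
def pcr_program_seconds(initial_s, cycles, step_seconds, final_s):
    """Total programmed hold time of a PCR run, excluding ramp times."""
    return initial_s + cycles * sum(step_seconds) + final_s


# Conditions from the text: 94 C for 3 min; 35 cycles of
# 94 C for 30 s, 55 C for 1 min, 72 C for 1 min; 72 C for 10 min.
total_s = pcr_program_seconds(
    initial_s=3 * 60,
    cycles=35,
    step_seconds=(30, 60, 60),
    final_s=10 * 60,
)
total_min = total_s / 60
print(total_s, total_min)  # 6030 100.5
```

The 35 cycles account for 87.5 of the 100.5 programmed minutes, so the run takes roughly 1 h 40 min before ramping is considered.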

### RESULTS

### Filtering and De Novo Assembly of Raw Reads

Sequencing of 24 cDNA libraries constructed from C. inaequalis infected as well as non-infected buffalograss lines produced approximately 655.2 million 100 bp single-end reads (**Table 2**). On average, 30.5 million reads were produced per sample from inoculated leaf tissue and 24.0 million reads were produced per non-inoculated leaf tissue sample. Trimmomatic removed 9 to 17% of overrepresented sequences (**Table 2**). The existing B. dactyloides cv. Prestige reference transcriptome (Wachholtz et al., 2013) had 91,519 contigs. When raw reads of each of the buffalograss samples were mapped to the Prestige reference, 2 to 14% of the reads did not map (**Table 2**).

<sup>3</sup>http://bioinfo.ut.ee/primer3/

The unmapped raw reads were assembled by Trinity into 156,721 contigs. These contigs were merged with the existing Prestige reference, resulting in the buffalograss transcriptome, which consisted of 248,221 transcripts. The final buffalograss reference used for read mapping was prepared by removing all isoforms and keeping only the longest transcript for each gene. This final buffalograss reference had 196,168 transcripts, with a longest transcript of 13,148 bp, an average transcript length of 565 bp, and a median transcript length of 340 bp. We submitted the sequences of the final reference to the NCBI BioProject repository<sup>4</sup> under the project ID PRJNA297834. The distribution of transcript lengths is depicted in **Figure 1**. When the sequencing reads from the 24 buffalograss samples were mapped to the final buffalograss reference, more than 98% of the reads mapped at least once (**Table 2**).

<sup>4</sup>http://www.ncbi.nlm.nih.gov/bioproject/

#### TABLE 1 | Primers used for validating expression of selected buffalograss transcripts.

<sup>a</sup>Endogenous control ubiquitin conjugating enzyme gene.

<sup>a</sup>17, 50, 95 and P refer to NE BG 7-3459-17, NE BG 7-3453-50, 95-55 and Prestige, respectively. C and T indicate untreated and treated samples. <sup>b</sup>Wachholtz et al. (2013).
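The preparation of the final reference, keeping only the longest transcript per gene and then summarizing transcript lengths, can be sketched as follows. The text does not say how isoforms were grouped into genes, so the ID scheme `<gene>_seq<N>` and the helper `gene_of` are hypothetical, as are the toy sequences.

```python
from statistics import mean, median


def longest_per_gene(transcripts, gene_of):
    """Keep only the longest transcript for each gene.

    transcripts: dict mapping transcript ID -> nucleotide sequence.
    gene_of: function mapping a transcript ID to its gene key.
    """
    best = {}
    for tid, seq in transcripts.items():
        gene = gene_of(tid)
        if gene not in best or len(seq) > len(transcripts[best[gene]]):
            best[gene] = tid
    return {tid: transcripts[tid] for tid in best.values()}


def length_summary(transcripts):
    """Count, longest, mean, and median transcript length in bp."""
    lengths = [len(seq) for seq in transcripts.values()]
    return {
        "n": len(lengths),
        "longest": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


def gene_of(tid):
    # Hypothetical ID scheme: "<gene>_seq<N>" marks isoform N of a gene.
    return tid.rsplit("_seq", 1)[0]


toy = {"g1_seq1": "ACGT", "g1_seq2": "ACGTACGT", "g2_seq1": "AC"}
reference = longest_per_gene(toy, gene_of)
summary = length_summary(reference)
print(sorted(reference), summary)
```

In the toy data the shorter isoform `g1_seq1` is dropped, leaving one transcript per gene. A mean length well above the median, as reported for the final reference (565 bp vs. 340 bp), indicates a length distribution skewed toward many short transcripts.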

### Identification of Differentially Expressed Transcripts

The DESeq2 analysis of inoculated lines resulted in 355 up-regulated transcripts (URI) and 106 down-regulated transcripts (DRI) in both resistant lines when compared to the susceptible lines (**Figures 2A** and **3A**). Similarly, the uninoculated resistant lines had 1,076 transcripts with higher expression (BUE) and 476 transcripts with lower expression (BDE) in common when compared with the uninoculated susceptible lines (**Figures 2B** and **3B**). There were 75 transcripts in common between the two up-regulated transcript sets (URI and BUE) and 14 transcripts in common between the two down-regulated transcript sets (DRI and BDE). Eight URI transcripts had more than 10 mapped reads for each of the resistant samples and no mapped reads for the susceptible samples. Of these eight transcripts, five were in common with the BUE transcripts. Similarly, nine BUE transcripts had at least 10 mapped reads for each of the resistant samples and no mapped reads for the susceptible samples. Of these nine transcripts, two were in common with the URI transcripts. Two transcripts (Bodac2015c153835 and Bodac2015c154561) of the DRI group had no mapped reads for any of the resistant samples while all of the inoculated susceptible samples had more than 10 mapped reads.

### Annotation of Differentially Expressed Transcripts

In total, there were 1,356 unique differentially expressed up-regulated transcripts (BUE and URI). These 1,356 transcripts were subjected to Blast2GO and 678 had blast hits, resulting in the annotation of 528 transcripts, while the other 828 were not annotated by Blast2GO. However, some of the 150 sequences that had blast hits but were not annotated (678 blast hits − 528 annotated sequences) did have assigned protein descriptions that were associated with disease resistance. Among the 528 annotated sequences, 381 sequences had biological process associated gene ontology (GO) terms, 425 sequences had molecular function associated GO terms, and 341 sequences had cellular component associated GO terms. The level 2 GO terms for biological process, molecular function, and cellular component are summarized in **Figure 4**. The most prevalent biological process GO terms were cellular process (289 sequences), metabolic process (283 sequences), single-organism process (153 sequences), and response to stimulus (130 sequences). The largest molecular function gene ontologies were catalytic activity (293 sequences) and binding (278 sequences). Cell (301 sequences), organelle (257 sequences), and membrane (166 sequences) were the three largest cellular component gene ontologies.

Additionally, there were a total of 568 unique down-regulated transcripts common to resistant lines (DRI and BDE sets), of which 209 were annotated by Blast2GO. Among the annotated sequences, 152 were associated with biological process GO terms, 164 with molecular function GO terms, and 146 with cellular component GO terms. The level 2 GO terms for biological process, molecular function, and cellular component are summarized in Supplementary Figure S1. The most common level 2 GO terms were similar to those of the up-regulated transcripts (BUE and URI), with 49 sequences representing the plant defense related GO term response to stimulus.
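The unique transcript totals reported above follow from the set sizes by inclusion-exclusion, since transcripts shared between an inoculated set and a basal set are counted once. A short check, using only numbers given in the text (the function name is hypothetical):

```python
def union_size(size_a, size_b, shared):
    """Number of unique items in two sets that share `shared` items."""
    return size_a + size_b - shared


# Up-regulated: URI (355) and BUE (1,076) share 75 transcripts.
up_unique = union_size(355, 1076, 75)
# Down-regulated: DRI (106) and BDE (476) share 14 transcripts.
down_unique = union_size(106, 476, 14)

# Annotation bookkeeping for the up-regulated transcripts.
not_annotated = up_unique - 528      # transcripts without Blast2GO annotation
hits_without_annotation = 678 - 528  # blast hits that were not annotated

print(up_unique, down_unique, not_annotated, hits_without_annotation)
# 1356 568 828 150
```

The results agree with the 1,356 up-regulated and 568 down-regulated unique transcripts stated in the text, and the totals before removing shared transcripts (461 with the pathogen, 1,552 without) match the figures given in the abstract.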

Sequences annotated with plant defense related GO terms were searched and selected ontologies are depicted in **Figure 5**. Level 4 and above defense related GO terms with the highest number of associated transcripts included defense response (43 transcripts), response to other organism (33 transcripts), response to external biotic stimulus (33 transcripts), and innate immune response (14 transcripts). These defense related terms are in the biological process GO and were represented by 57 unique transcripts and each transcript was associated with multiple GO terms. These multiple GO terms and the description of proteins encoded by the above 57 sequences are provided in Supplementary Table S2.

Transcripts were identified that shared similarity to known defense response genes. For example, five transcripts (Bodac2015c170619, Bodac2015c160020, Bodac2015c000447, Bodac2015c139533, and Bodac2015c153835) encode ABC transporter-like proteins and have potential to confer non-host resistance (Shimizu et al., 2010). These transcripts were differentially expressed (URI, BUE, DRI, and BDE sets) between the leaf spot resistant and susceptible lines. The transcript Bodac2015c185871 is similar to a bacterial blight resistance gene. Transcripts Bodac2015c098349, Bodac2015c142490, Bodac2015c170497, Bodac2015c106585, and Bodac2015c139347 were up-regulated in resistant lines and similar to genes encoding Verticillium wilt disease resistance proteins. Transcript Bodac2015c141715, found in the BUE group, was homologous to the gene encoding the immediate-early fungal elicitor protein CMPG1, which has been reported to confer a hypersensitive response (HR) in many plants (González-Lamothe et al., 2006).

We identified a LysM domain receptor-like kinase (Bodac2015c159804) along with several (Bodac2015c146460, Bodac2015c164725, Bodac2015c161960, Bodac2015c142835, and Bodac2015c195842) mitogen-activated protein (MAP) kinase and MAP kinase family sequences. Similar genes were reported to cause pattern triggered immunity (PTI) in rice (Shimizu et al., 2010; Balmer et al., 2013). The transcript Bodac2015c130002 was identified among the BUE transcripts and is similar to Xa21 which mediates resistance against Xanthomonas bacteria in rice (Lee et al., 2009).

Nucleotide-binding site leucine-rich repeat (NBS-LRR) like transcripts Bodac2015c176239, Bodac2015c100000, and Bodac2015c134520, which encode NB-ARC domain proteins, had higher expression in resistant lines. Eighteen leucine-rich repeat receptor-like protein kinase family transcripts were also expressed more in resistant lines. In addition, 16 RPM1-like disease resistance transcripts and one RPS2-like disease resistance transcript (Bodac2015c147648) were identified. The RPM1-like and RPS2-like disease resistance genes have been reported to confer effector-triggered immunity (ETI) in other plants (Mackey et al., 2002; Jones and Dangl, 2006; Balmer et al., 2013). One transcript (Bodac2015c098081) similar to the wheat stripe rust resistance gene Yr10 and two transcripts (Bodac2015c145339 and Bodac2015c146056) similar to the barley stem rust resistance gene Rpg1 had higher expression in buffalograss leaf spot resistant lines. The Yr10 and Rpg1 genes are also responsible for ETI (Brueggeman et al., 2002; Balmer et al., 2013; Zhang et al., 2013).

The transcripts Bodac2015c139945 and Bodac2015c165851 had higher expression (P < 0.006) in leaf spot resistant buffalograss lines and encode a heat shock transcription factor-like protein. Heat shock proteins can fold NBS-LRR proteins and make them active against pathogens (Balmer et al., 2013). Nine transcripts with the response to salicylic acid GO term were also identified in leaf spot resistant buffalograss lines. All of these were found in the up-regulated transcript sets URI and BUE. We also found three transcripts (Bodac2015c146262, Bodac2015c159389, and Bodac2015c130171) encoding pathogenesis-related (PR) proteins with higher expression in the resistant lines.

Although this study was not designed to identify pathogen related genes, we searched for genes expressed by C. inaequalis since pathogen encoded RNA may have been included in our sequencing reads. The transcript Bodac2015c163958 showed homology to a gene encoding the NEP1 effector, a phytotoxic protein identified in Botrytis cinerea (Oliver and Solomon, 2010).

### Gene Expression Validation by RT-PCR

Reverse transcription PCR could distinguish most of the differentially up-regulated genes based on host susceptibility (**Figure 6**). When the difference of gene expression from the transcriptional profiling study was not high, the intensity of PCR bands was similar across all samples (e.g., J and L genes of untreated resistant vs. untreated susceptible samples in **Figure 6** and Supplementary Table S1). The primers used to amplify the endogenous gene UCE resulted in a PCR product of the expected size (61 bp) that was uniformly expressed across all samples.

### DISCUSSION

This study was designed to identify buffalograss defense related genes contributing to host resistance against leaf spot disease. We compared inoculated leaf spot resistant and susceptible buffalograss lines to identify genes with either higher or lower expression in the resistant lines. We also identified differentially expressed transcripts in uninoculated resistant lines when compared to uninoculated susceptible lines. Previous research has demonstrated higher basal expression of defense genes in buffalograss lines resistant to chinch bugs (Blissus occiduus Barber) compared to susceptible lines (Ramm et al., 2013). Higher basal gene expression may also play a role in buffalograss defense against leaf spot disease.

The ABC transporter-like proteins have been reported to confer non-host resistance in Arabidopsis against the non-adapted pathogen Blumeria graminis f. sp. hordei (Stein et al., 2006). Genes conferring non-host resistance do not produce HR but are normally involved in rapid production of cell wall appositions (physical barriers) and antimicrobial metabolites at the site of pathogen entry (Jones and Dangl, 2006). Infection by C. inaequalis has also induced expression of the ABC transporter-like genes Bodac2015c170619, Bodac2015c160020, Bodac2015c000447, Bodac2015c139533, and Bodac2015c153835 in leaf spot resistant lines compared to the susceptible lines.

During the early stages of a pathogen attack the innate immunity of the host allows the plant to overcome the pathogen. In the first stage of this type of immunity, the microbe-associated molecular patterns (MAMP) such as chitin or flagellin are recognized by membrane-localized pattern-recognition receptors (PRR) in plants (Balmer et al., 2013) which results in PTI. In the model monocot rice, LysM PRRs CEBiP and CERK1 have been identified for sensing MAMP chitin (Shimizu et al., 2010). MAMP-signaling activates mitogen-activated protein kinase (MAPK) cascades, which regulate transcription factors (TFs) driving the expression of defense genes (Balmer et al., 2013). In these buffalograss samples we also found genes that encode both LysM domain receptor-like kinase and MAP kinase family proteins. These genes may be involved in PTI in buffalograss against C. inaequalis.

The Xa21 homolog found in buffalograss was originally reported in rice, where it mediates resistance to Xanthomonas bacteria. This gene encodes extracellular and intracellular receptors which can sense the 194-amino acid bacterial protein Ax21, which is conserved across all known Xanthomonas strains (Lee et al., 2009). During an attack, XA21 induces downstream defense mechanisms which lead to the expression of pathogenesis-related (PR) genes and the development of HR (Tena et al., 2011). Xa21 homologs have been found in other grasses such as Brachypodium, sorghum, and maize (Tan et al., 2012). Another example of PTI is the B-lectin receptor kinase Pi-d2 which confers resistance against Magnaporthe grisea in rice (Chen et al., 2006). We found 15 transcripts similar to lectin-domain containing receptor kinases in leaf spot resistant buffalograss.

To overcome PTI related defense signaling, pathogens release avirulence (Avr) proteins. In the host, a second line of plant immunity takes place mostly in the cytoplasm and is mediated by NBS-LRR proteins encoded by plant resistance (R) genes. These NBS-LRR proteins can recognize and neutralize Avr proteins/effectors, which results in ETI (Elmore et al., 2011) and is usually manifested as an HR (Balmer et al., 2013). The NBS-LRR family represents one of the largest and most widely conserved gene families in plants. More than one hundred NBS-LRR genes have been identified in the majority of sequenced plants (Balmer et al., 2013). NBS-LRR proteins usually consist of an N-terminal domain with Toll/Interleukin-1 Receptor (TIR-NBS-LRR, or TNL) or an N-terminal coiled-coil (CC-NBS-LRR, or CNL) motif (Meyers et al., 2003). In Arabidopsis, RPM1 is a peripheral plasma membrane NBS-LRR protein, and in the absence of RPM1, the AvrRpm1 effector protein of some Pseudomonas syringae strains can cause virulence (Mackey et al., 2002; Jones and Dangl, 2006). RPS2 has been reported to confer ETI to Arabidopsis against P. syringae (Jones and Dangl, 2006). We identified several NBS-LRR type genes along with genes that encode RPM1-like and RPS2-like disease resistance proteins.

1356 differentially up-regulated transcripts, 528 were annotated by Blast2GO.

inoculated resistant buffalograss lines compared to inoculated susceptible lines. The BUE group refers to the transcripts expressed more in uninoculated resistant lines in comparison to uninoculated susceptible lines. BDE represents the down-regulated transcripts in uninoculated resistant lines compared to uninoculated susceptible lines.

The Yr10 and Rpg1 homologs found in leaf spot resistant buffalograss lines confer disease resistance in other graminaceous plants. Presently, 53 different Yr genes (Yr1–Yr53) have been identified as conferring resistance to stripe rust (Puccinia striiformis f. sp. tritici) in wheat (Zhang et al., 2013). The barley Rpg1 gene regulates resistance against the stem rust pathogen P. graminis f. sp. tritici (Brueggeman et al., 2002). Both of these gene products interact with pathogen effector proteins and confer resistance. The stripe rust resistance protein yr10 and disease resistance protein rpg1 were identified by the Blast2GO analysis. Similarly, many transcripts encoding NBS-LRR type defense related proteins were also identified. However, a search for transcripts that encode proteins similar to TNL and CNL using a hidden Markov model (HMM; Eddy, 1998; Meyers et al., 2003) may reveal more NBS-LRR genes and this analysis is underway.

We found two heat shock-like genes (Bodac2015c139945 and Bodac2015c165851) in leaf spot resistant lines. The NBS-LRR proteins activated by heat shock proteins interact with pathogen effectors and regulate WRKY TFs to confer plant resistance (Balmer et al., 2013). WRKY TFs regulate many plant processes, including the response to biotic stresses (Zhang and Wang, 2005). Plants under pathogen attack can also show enhanced defense activity in tissues not yet attacked through systemic acquired resistance (SAR). When leaf pathogens show localized infection, defense signals are mobilized to distal plant tissues and induce systemic resistance against a broad range of pathogens including fungi, bacteria, viruses, nematodes, and even insects (Shah, 2009). Prior activation of defense genes in distal tissues renders them more resistant to future attacks. In dicots, salicylic acid has been found to play a major role in regulating SAR, followed by the up-regulation of pathogenesis-related (PR) genes (Balmer et al., 2013). Compared with dicots, knowledge of SAR in monocots is scarce (Balmer et al., 2013). It is therefore of interest that we identified nine transcripts with the "response to salicylic acid" GO term in leaf spot resistant buffalograss lines.

We validated the expression of 10 differentially expressed genes (Supplementary Table S1) by RT-PCR. The majority of the tested genes did produce PCR bands with different intensities and could distinguish resistant and susceptible samples confirming the accuracy of our analysis (**Figure 6**). Quantitative real-time RT-PCR may reveal expression differences of these potential molecular markers more accurately. Interestingly, we found 15 transcripts that have more than 10 read counts in treated and untreated resistant buffalograss and no read counts in the susceptible lines. They may be useful for identifying leaf spot resistant buffalograss lines by molecular methods.
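The presence/absence screen described above (more than 10 read counts in resistant lines, none in susceptible lines) can be sketched as a simple filter over a count table. This is a minimal illustration, not the authors' pipeline: the transcript IDs, sample names, and counts are hypothetical, and the sketch assumes the threshold must be met in every resistant sample, which the text does not specify.

```python
from typing import Dict, List


def resistant_specific(
    counts: Dict[str, Dict[str, int]],
    resistant: List[str],
    susceptible: List[str],
    min_reads: int = 10,
) -> List[str]:
    """Return transcripts with > min_reads in every resistant sample
    and zero reads in every susceptible sample."""
    hits = []
    for transcript, by_sample in counts.items():
        in_resistant = all(by_sample.get(s, 0) > min_reads for s in resistant)
        absent_in_susceptible = all(by_sample.get(s, 0) == 0 for s in susceptible)
        if in_resistant and absent_in_susceptible:
            hits.append(transcript)
    return sorted(hits)


# Hypothetical counts for three made-up transcripts (not data from the study).
example_counts = {
    "tx_1": {"R_inoc": 42, "R_mock": 18, "S_inoc": 0, "S_mock": 0},
    "tx_2": {"R_inoc": 55, "R_mock": 30, "S_inoc": 3, "S_mock": 0},
    "tx_3": {"R_inoc": 9, "R_mock": 25, "S_inoc": 0, "S_mock": 0},
}
markers = resistant_specific(
    example_counts, ["R_inoc", "R_mock"], ["S_inoc", "S_mock"]
)
print(markers)
```

Only `tx_1` passes in this toy table: `tx_2` has reads in a susceptible sample and `tx_3` falls below the threshold in one resistant sample.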

We identified several differentially expressed transcripts that may regulate leaf spot resistance in buffalograss. In particular, the 15 sequences mentioned above that were uniquely expressed in the resistant lines are a new resource for identifying leaf spot resistant buffalograss lines. We also found 21 transcripts in resistant lines that are similar to genes regulating PTI (e.g., sequences encoding LysM, MAPK, and lectin receptor kinase-like proteins) and 20 sequences predicted to encode the NBS-LRR proteins RPM1, RPS2, RPG1, and YR10 regulating ETI. There were also nine up-regulated transcripts in resistant lines that have the potential to initiate SAR and three transcripts encoding PR proteins. This is the first study characterizing changes in the buffalograss transcriptome when challenged with the leaf spot pathogen C. inaequalis. The NBS-LRR type defense related genes identified here are useful for screening large numbers of buffalograss germplasm for C. inaequalis resistance by molecular techniques, thus eliminating laborious and time-consuming traditional greenhouse and field testing. In addition, buffalograss is more susceptible to Curvularia and Bipolaris patch diseases when grown in humid regions. These new molecular resources should improve the efficiency of breeding for leaf spot resistant buffalograss cultivars and may help expand buffalograss into areas prone to leaf spot outbreaks.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct, and intellectual contribution to the work, and approved it for publication.

### FUNDING

We thank the United States Golf Association for funding this research.

### ACKNOWLEDGMENTS

We also thank the University of Nebraska DNA Sequencing Core that receives partial support from the NCRR (1S10RR027754-01, 5P20RR016469, RR018788-08) and the National Institute for General Medical Science (NIGMS; 8P20GM103427, GM103471-09). This publication's contents are the sole responsibility of the authors and do not necessarily represent the official views of the NIH or NIGMS.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016.00715



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Amaradasa and Amundsen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Validating DNA Polymorphisms Using KASP Assay in Prairie Cordgrass (Spartina pectinata Link) Populations in the U.S.

Hannah Graves<sup>1</sup>, A. L. Rayburn<sup>1</sup>, Jose L. Gonzalez-Hernandez<sup>2</sup>, Gyoungju Nah<sup>3</sup>, Do-Soon Kim<sup>3</sup> and D. K. Lee<sup>1</sup>\*

<sup>1</sup> Department of Crop Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA, <sup>2</sup> Plant Science Department, South Dakota State University, Brookings, SD, USA, <sup>3</sup> Department of Plant Science, Research Institute of Agriculture and Life Sciences, College of Agriculture and Life Sciences, Seoul National University, Seoul, Korea

#### Edited by:

Keenan Amundsen, University of Nebraska-Lincoln, USA

#### Reviewed by:

Elisa Bellucci, Università Politecnica delle Marche, Italy; Serge J. Edme, United States Department of Agriculture-Agricultural Research Service, USA

> \*Correspondence: D. K. Lee leedk@illinois.edu

#### Specialty section:

This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science

Received: 27 August 2015 Accepted: 28 December 2015 Published: 22 January 2016

#### Citation:

Graves H, Rayburn AL, Gonzalez-Hernandez JL, Nah G, Kim D-S and Lee DK (2016) Validating DNA Polymorphisms Using KASP Assay in Prairie Cordgrass (Spartina pectinata Link) Populations in the U.S. Front. Plant Sci. 6:1271. doi: 10.3389/fpls.2015.01271

Single nucleotide polymorphisms (SNPs) are one of the most abundant DNA variants found in plant genomes and are highly efficient when comparing genome and transcriptome sequences. SNP marker analysis can be used to analyze genetic diversity, create genetic maps, and utilize marker-assisted selection breeding in many crop species. In order to utilize these technologies, one must first identify and validate putative SNPs. In this study, 121 putative SNPs, developed from a nuclear transcriptome of prairie cordgrass (Spartina pectinata Link), were analyzed using KASP technology in order to validate the SNPs. Fifty-nine SNPs were validated using a core collection of 38 natural populations and a phylogenetic tree was created with one main clade. Samples from the same population tended to cluster in the same location on the tree. Polymorphisms were identified within 52.6% of the populations, split evenly between the tetraploid and octoploid cytotypes. Twelve selected SNP markers were used to assess the fidelity of tetraploid crosses of prairie cordgrass and their resulting F<sup>2</sup> population. These markers were able to distinguish true crosses and selfs. This study provides insight into the genomic structure of prairie cordgrass, but further analysis must be done on other cytotypes to fully understand the structure of this species. This study validates putative SNPs and confirms the potential usefulness of SNP marker technology in future breeding programs of this species.

Keywords: Prairie cordgrass, SNP, marker, Spartina, polymorphism, transcriptome

### INTRODUCTION

Prairie cordgrass (Spartina pectinata Link) is a grass species native to the North American prairie, with a geographic distribution ranging from the southern U.S. (Texas, Arkansas, and New Mexico) to northern Canada, and from the east coast through the Midwest to the western coast of the U.S. (Hitchcock, 1950; Voight and Mohlenbrock, 1979; Barkworth et al., 2007; Gedye et al., 2010). This species is adapted to a wide range of environmental conditions and, in addition, tolerates abiotic stresses such as moderate salinity, waterlogged soils, drought, and cold (Montemayor et al., 2008; Boe et al., 2009; Gonzalez-Hernandez et al., 2009; Kim et al., 2011; Zilverberg et al., 2014; Anderson et al., 2015). Because of its wide adaptability, this warm-season, C4, perennial grass is highly valued for conservation practices, wetland revegetation, streambank stabilization, wildlife habitat, forage production, and, recently, bioenergy feedstock production (Hitchcock, 1950; Barkworth et al., 2007; Montemayor et al., 2008; Gonzalez-Hernandez et al., 2009; Kim et al., 2011; Boe et al., 2013; Zilverberg et al., 2014; Guo et al., 2015). This ability to adapt to such a wide diversity of conditions results in populations becoming adapted to specific environments, ultimately leading to genetically diverse populations. Polyploidy adds to the potential genetic diversity of prairie cordgrass.

Prairie cordgrass is a polyploid species, composed of three cytotypes: tetraploid (2n = 4x = 40), hexaploid (2n = 6x = 60), and octoploid (2n = 8x = 80) (Church, 1940; Kim et al., 2010, 2012). Because of the reproductive and geographic isolation between the cytotypes, there is likely an increase in polymorphisms and potential genetic diversity, especially within the tetraploid and octoploid cytotypes (Soltis et al., 1992; Wendel and Doyle, 2005; Hirakawa et al., 2014). There is a large amount of phenotypic variation present in all cytotypes of prairie cordgrass (Boe and Lee, 2007; Kim et al., 2012; Guo et al., 2015), but there is a lack of knowledge about the genomic structure. A few studies have revealed diversity in highly polymorphic chloroplast DNA regions within and among tetraploid and octoploid populations (Kim et al., 2013; Graves et al., 2015). In prairie cordgrass, EST-SSR markers (Gedye et al., 2010), SSR markers (Gedye et al., 2012), and AFLP markers (Moncada et al., 2007) have been developed. However, these technologies may not be as cost-effective, scalable, successful, or flexible as single nucleotide polymorphisms (SNPs) (Semagn et al., 2014).

SNPs provide a highly efficient way to compare genomic and transcriptome sequences. Because they are one of the most abundant DNA variants found in plant genomes, SNPs are more likely to be related to specific biological functions and phenotypes (Rafalski, 2002; Bundock et al., 2006; Salem et al., 2012). This technology has been applied in genetic diversity analysis, genetic map construction, association map analysis, and marker-assisted selection breeding in many different types of crop species (Byers et al., 2012; Saxena et al., 2012; Semagn et al., 2014; Sindhu et al., 2014; Wei et al., 2014). SNP marker technology is also utilized in high-throughput genotyping, increasing the speed of the selection process by eliminating the need to grow plants to maturity for phenotypic selection (Paux et al., 2012). In order to use SNP markers for genetic improvement, one must follow a three-step process: (1) discover SNPs by aligning sequence reads generated by next-generation sequencing technologies for different genotypes of a given species; (2) validate the SNPs to distinguish DNA polymorphisms of actual allelic variants from those of other biological phenomena such as gene duplication events; and (3) genotype germplasm collections or genetic/breeding populations with the validated SNPs (Saxena et al., 2012).

Step one of the process was accomplished in prairie cordgrass by using a transcriptome assembly derived from multiple genotypes and tissues (Gonzalez et al., personal communication). The second and third steps are yet to be completed for polyploid prairie cordgrass. Several parameters, such as sample size, number of SNPs to be used for analysis, cost effectiveness, and the SNP genotyping platform, must be considered in these analyses (Semagn et al., 2014). Many technologies exist for SNP genotyping, but one performs well in terms of adaptability, efficiency, and cost-effectiveness. Kompetitive allele-specific PCR (KASP), developed by LGC Genomics (Teddington, UK; www.lgcgenomics.com), is a PCR-based homogeneous fluorescent SNP genotyping system that determines the alleles at a specific locus within genomic DNA (Semagn et al., 2014). The KASP technology has been utilized on other polyploid plant species, including switchgrass (LGC Genomics, 2014), cotton (Byers et al., 2012), wheat (Paux et al., 2012), potato (Uitdewilligen et al., 2013), and various triploid citrus species (Cuenca et al., 2013).

In this study, SNPs, identified in the nuclear transcriptome, were converted to the KASP marker system in order to validate that these SNPs are true allelic variants. In addition, KASP markers were used in quality control analysis when making crosses, prairie cordgrass being a putative self-compatible species. The main objectives of this study were (1) to validate SNP polymorphisms identified in the nuclear transcriptome of natural populations of prairie cordgrass in the U.S. and (2) to assess the fidelity of specific tetraploid crosses and selfs, and to elucidate inheritance patterns of SNP markers.

### MATERIALS AND METHODS

### Development and Validation of KASP Genotyping Assays

In a separate study by Gonzalez et al. (personal communication) at South Dakota State University, a transcriptome of prairie cordgrass was assembled using ∼1.2 billion Illumina paired-end reads from various vegetative tissues (roots, leaves, and rhizomes) under various conditions (salt stress, cold stress, and differing photoperiods) in order to capture a diverse set of transcripts in both number and type. The assembly was developed using CLC Genomics Workbench 7.0 (Aarhus, Denmark) and annotated against the sorghum gene models. A set of 146,549 contigs, or transcript assemblies, of 230 bp or more with an N50 of 973 bp was used to mine over 1 million SNPs, insertions, and deletions using the variant detection function in CLC Genomics Workbench. Putative SNPs were filtered based on coverage (minimum of 100X), a window of 80–100 bp free from additional SNPs, and an allele frequency of 20–80%. Initially, nine bi-allelic SNPs associated with enzymes within the lignin biosynthesis pathway were selected for analysis. Additional SNPs were selected without regard to the putative function of the transcript assembly. A total of 121 bi-allelic SNPs were identified for use in this study (**Table 1**). SNPs were sent for primer development to be used in KASP genotyping assays. Genotyping with KASP was performed as follows.
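The three SNP filtering criteria above (coverage, a SNP-free flanking window, and allele frequency) can be sketched as follows. This is an illustration only: the actual filtering was done in CLC Genomics Workbench, the `Snp` record and `filter_snps` function are hypothetical names, and the sketch assumes the window criterion means no other variant within `flank` bp on either side of the candidate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snp:
    contig: str      # transcript assembly the variant lies on
    pos: int         # 1-based position on the contig
    coverage: int    # read depth at the site
    alt_freq: float  # frequency of the alternative allele (0..1)


def filter_snps(
    snps: List[Snp],
    min_cov: int = 100,
    flank: int = 80,
    min_freq: float = 0.20,
    max_freq: float = 0.80,
) -> List[Snp]:
    """Keep SNPs that pass coverage, allele-frequency, and flanking checks."""
    kept = []
    for s in snps:
        if s.coverage < min_cov:
            continue
        if not (min_freq <= s.alt_freq <= max_freq):
            continue
        # Assumption: any other variant within `flank` bp on the same
        # contig disqualifies the candidate, even a low-quality one.
        crowded = any(
            o is not s and o.contig == s.contig and abs(o.pos - s.pos) <= flank
            for o in snps
        )
        if not crowded:
            kept.append(s)
    return kept


# Hypothetical candidates, not variants from the study.
candidates = [
    Snp("contig_1", 500, 150, 0.45),  # passes all three filters
    Snp("contig_1", 900, 80, 0.50),   # coverage below 100X
    Snp("contig_2", 300, 200, 0.10),  # allele frequency below 20%
    Snp("contig_3", 400, 120, 0.30),  # neighbor 50 bp away
    Snp("contig_3", 450, 130, 0.60),  # neighbor 50 bp away
]
passing = filter_snps(candidates)
print([(s.contig, s.pos) for s in passing])
```

A clean flanking sequence matters because KASP primers anneal immediately next to the target SNP, so a second polymorphism under a primer can cause an assay to fail.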

#### TABLE 1 | Summary of SNP sequences, including SNP ID, SNP sequences, and SNP alleles.

\*Failed primers.

Bold letters are actual SNPs (SNP alleles).

For all samples, each amplification reaction contained 50 ng template DNA, KASP V4.0 2x Master mix standard ROX (LGC Genomics, Beverly, MA, USA) and KASP-by-Design assay mix (LGC Genomics, Beverly, MA, USA). The PCR thermocycling conditions for all primers except pcg\_1186 were 15 min at 94°C, followed by 10 cycles of 94°C for 20 s and 61°C for 1 min (dropping 0.6°C per cycle to reach an annealing temperature of 55°C), followed by 26 cycles of 94°C for 20 s and 55°C for 1 min. The PCR thermocycling conditions for primer pcg\_1186 were 15 min at 94°C, followed by 10 cycles of 94°C for 20 s and 65°C for 1 min (dropping 0.8°C per cycle to reach an annealing temperature of 57°C), followed by 26 cycles of 94°C for 20 s and 57°C for 1 min. After amplification, PCR plates were read with a Spectramax M5 FRET-capable plate reader (Molecular Devices, Sunnyvale, CA, USA) using the recommended excitation and emission values. Data were then analyzed using KlusterCaller software (LGC Genomics, Beverly, MA, USA) to identify SNP genotypes.
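The two touchdown programs can be checked with a short calculation: ten cycles dropping 0.6°C from 61°C end at 55°C, and ten cycles dropping 0.8°C from 65°C end at 57°C. The sketch below lists the annealing temperature of every cycle. The function name is hypothetical, and it assumes the first touchdown cycle anneals at the starting temperature, which the text does not state explicitly.

```python
from typing import List


def touchdown_annealing_temps(
    start_temp: float,
    drop_per_cycle: float,
    touchdown_cycles: int,
    final_cycles: int,
) -> List[float]:
    """Annealing temperature (degrees C) for each cycle of a touchdown PCR."""
    # Assumption: cycle 1 anneals at start_temp and each later touchdown
    # cycle is drop_per_cycle cooler than the one before it.
    touchdown = [
        round(start_temp - drop_per_cycle * i, 1) for i in range(touchdown_cycles)
    ]
    final_temp = round(start_temp - drop_per_cycle * touchdown_cycles, 1)
    return touchdown + [final_temp] * final_cycles


# Programs as described in the text: 10 touchdown cycles, then 26 cycles.
standard = touchdown_annealing_temps(61.0, 0.6, 10, 26)
pcg_1186 = touchdown_annealing_temps(65.0, 0.8, 10, 26)
print(standard[:11])
print(pcg_1186[:11])
```

Both programs run 36 amplification cycles in total; only the starting temperature, the per-cycle drop, and the final annealing temperature differ.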

### Core Collection Analysis

In order to validate SNP polymorphisms of prairie cordgrass using KASP, seeds and rhizomes of natural populations were collected from across the continental U.S.A. (Kim et al., 2013) and grown at the Energy Biosciences Institute (EBI) Farm, Urbana, Illinois, USA. Thirty-eight of these populations were selected as a core collection based on geographic distribution, and two plants from each population were sampled, for a total of 76 plants (**Table 2**). Leaf tissue samples were stored at −80°C until DNA extraction was performed. Total genomic DNA was extracted from frozen leaf tissue using the CTAB method (Mikkilineni, 1997) with slight modifications as described by Kim et al. (2013). Fifty-nine KASP genotyping assays out of 121 were selected and used to analyze the collection and five additional Spartina samples, namely S. alterniflora, S. patens (Flageo vt.), S. patens (Sharp vt.), S. patens, and S. bakeri. All of the KASP genotyping assay results were recorded as a two-letter code, or SNP code, i.e., AA, AG, GG. A DNA fingerprint was made by concatenating all the SNP genotypes into a DNA-like sequence, which was then imported into MEGA 6 (Tamura et al., 2013) to make a phylogenetic tree. The maximum parsimony (MP) tree, inferred from 1000 replicates, was obtained using the Subtree-Pruning-Regrafting algorithm with search level one, in which the initial trees were obtained by the random addition of sequences (Felsenstein, 1985; Nei and Kumar, 2000). All positions with <95% site coverage were eliminated.
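The fingerprinting step, in which two-letter SNP codes are concatenated into a DNA-like sequence, can be sketched as below. With 59 assays this yields 2 × 59 = 118 characters per plant. The function names, sample names, and assay IDs are hypothetical, and writing missing calls as `NN` is an assumption; the text does not say how missing calls were encoded before import into MEGA.

```python
from typing import Dict, List, Optional

Calls = Dict[str, Optional[str]]  # assay ID -> two-letter SNP code, or None


def fingerprint(calls: Calls, assay_order: List[str], missing: str = "NN") -> str:
    """Concatenate SNP codes in a fixed assay order into one sequence."""
    return "".join(calls.get(assay) or missing for assay in assay_order)


def to_fasta(samples: Dict[str, Calls], assay_order: List[str]) -> str:
    """Render all samples as FASTA text, one fingerprint per record."""
    lines = []
    for name in sorted(samples):
        lines.append(f">{name}")
        lines.append(fingerprint(samples[name], assay_order))
    return "\n".join(lines) + "\n"


# Hypothetical calls for two plants and three assays (not study data).
assays = ["snp_a", "snp_b", "snp_c"]
plants = {
    "pop1_plant1": {"snp_a": "AA", "snp_b": "AG", "snp_c": "GG"},
    "pop1_plant2": {"snp_a": "AA", "snp_b": "GG", "snp_c": None},
}
fasta_text = to_fasta(plants, assays)
print(fasta_text)
```

Using one fixed assay order for every plant keeps the columns aligned, so each pair of positions in the sequence always corresponds to the same SNP.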

### F<sup>1</sup> Cross

In order to assess the utility of the KASP marker system in confirming specific tetraploid crosses of prairie cordgrass, a reciprocal cross was made between two individuals (PC17-109 × PC20-102) from two populations differing in morphological characteristics of potential agronomic importance. PC17-109 is a tetraploid population from Illinois with a phalanx rhizome type and low seed mass, whereas PC20-102 is a tetraploid population from Kansas with a guerrilla rhizome type and high seed mass. In a greenhouse, the female inflorescence was covered ∼1 day prior to stigma emergence, while pollen was collected from the male parent. Pollen was applied directly to the stigmas with a brush, and the inflorescence was rebagged until anthesis was completed. A total of 83 individuals, 70 F<sup>1</sup> individuals from PC17-109 (female) × PC20-102 (male) and 13 F<sup>1</sup> individuals from PC20-102 (female) × PC17-109 (male), were sampled. F<sup>1</sup> seeds were planted in a greenhouse setting. Leaf tissue samples of each seedling were collected and stored at −80°C until DNA extraction was performed. Total genomic DNA was extracted from frozen leaf tissue as described previously. For the F<sup>1</sup> individuals, 12 KASP genotyping assays were selected based on the parental SNP genotypes (**Table 3**). All of the assay results were recorded as two-letter SNP codes. To determine whether the F<sup>1</sup> progeny followed the segregation of a typical monohybrid cross in relation to SNP genotype, a χ<sup>2</sup> analysis was performed using P = 0.05, df = 2, and a χ<sup>2</sup> critical value of 5.991. The observed and expected genotype counts were recorded for each KASP genotyping assay.
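The goodness-of-fit test described above can be sketched in a few lines. The χ<sup>2</sup> statistic is the sum of (observed − expected)<sup>2</sup> / expected over genotype classes, and for df = 2 the critical value at P = 0.05 has the closed form −2 ln(0.05) ≈ 5.991, matching the value given in the text. The genotype counts below are hypothetical, not the study's results, and the function name is illustrative.

```python
import math
from typing import Sequence


def chi_square(observed: Sequence[float], expected_ratio: Sequence[float]) -> float:
    """Pearson chi-square statistic against an expected segregation ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# With 2 degrees of freedom the chi-square survival function is exp(-x / 2),
# so the critical value at alpha = 0.05 is -2 * ln(0.05).
CRITICAL_DF2 = -2.0 * math.log(0.05)

# Hypothetical genotype counts (e.g., AA, AG, GG) tested against 1:2:1.
observed_counts = [20, 40, 19]
statistic = chi_square(observed_counts, [1, 2, 1])
fits_1_2_1 = statistic < CRITICAL_DF2
print(round(statistic, 3), round(CRITICAL_DF2, 3), fits_1_2_1)
```

A statistic below the critical value means the expected ratio cannot be rejected at the chosen significance level; it does not prove that the ratio is correct.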

### F<sup>2</sup> Self

To assess the utility of the KASP marker system in identifying selfed individuals in the tetraploid background and gauge the segregation pattern, F<sup>2</sup> individuals were generated and genotyped. In a greenhouse, the prairie cordgrass inflorescence was covered ∼1 day prior to stigma emergence with bags constructed to view progression of inflorescence development of F<sup>1</sup> plants. When anthesis was reached, the bags were shaken to promote self-pollination. Bags remained until anthesis was complete. F<sup>2</sup> seeds were collected and planted in a greenhouse setting. A total of eight F<sup>1</sup> individuals were selfed (6 F<sup>1</sup> of PC17-109 × PC20-102 and 2 F<sup>1</sup> of PC20-102 × PC17-109) and 8–11 individuals were sampled from the planted seeds of each of the selfed plants (total of 76). Leaf tissue samples were stored at −80°C until DNA extraction was performed. All 12 of the KASP genotyping assays selected to score the F<sup>1</sup> individuals were also tested on the F<sup>2</sup> individuals. All of the assay results were recorded as a SNP code as done in the F<sup>1</sup> analysis. All SNP codes that were not accurately identified were removed from analysis.

#### TABLE 2 | Summary of plant materials used, including location, cytotype, and number of plants used per population.

### RESULTS

### Development and Validation of KASP Assays

Twenty-six (21.5%) SNPs failed KASP marker development. From the remaining 95 (78.5%), 11 SNPs were found to be monomorphic when tested on the core collection DNA, resulting in 84 SNPs that were true allelic variants. Three of the eleven monomorphic markers were selected to discover if future plant samples would reveal the SNP polymorphisms previously identified in the transcriptome. From the 84 allelic variants, 56 of the most highly polymorphic SNPs were selected for further use in this study, resulting in 59 total KASP genotyping assays (**Table 4**).

### Core Collection

The resulting data set from the DNA fingerprint contained 118 characters. There was an average of 3.8 missing character data points (SNP codes) per population. The maximum parsimony tree identified one clade after correcting for the missing data (**Figure 1**). For 47.4% of the populations, plants sampled from the same populations were observed to form subclades; however, intrapopulational variation was observed.

Out of the 38 prairie cordgrass populations, 52.6% showed polymorphisms within populations. Of the 52.6% polymorphic populations, 50% were octoploid and 50% were tetraploid. Out of the 15 octoploid populations sampled, 66.7% of the populations showed polymorphisms between the two plants sampled and 43.5% of the 23 tetraploid populations showed polymorphisms. The average number of polymorphisms that occurred within each population was 16. In the octoploid populations, 16.4 was the average number of polymorphisms observed, and 15.5 polymorphisms were observed as the average for tetraploids.
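One way to tally within-population polymorphisms is to count the assays at which the two plants sampled from a population carry different SNP codes. The sketch below assumes that definition, which the text does not spell out, and uses hypothetical population names and genotype calls.

```python
from typing import Dict, Optional, Tuple

Calls = Dict[str, Optional[str]]  # assay ID -> two-letter SNP code, or None


def count_differences(plant_a: Calls, plant_b: Calls) -> int:
    """Number of assays where both plants were called and the codes differ."""
    n = 0
    for assay, code_a in plant_a.items():
        code_b = plant_b.get(assay)
        # Compare as sets so that "AG" and "GA" count as the same genotype.
        if code_a and code_b and set(code_a) != set(code_b):
            n += 1
    return n


def summarize(populations: Dict[str, Tuple[Calls, Calls]]):
    """Per-population difference counts and percent of polymorphic populations."""
    counts = {name: count_differences(a, b) for name, (a, b) in populations.items()}
    polymorphic = sum(1 for c in counts.values() if c > 0)
    return counts, 100.0 * polymorphic / len(counts)


# Hypothetical populations with two plants each (not study data).
example = {
    "pop_x": (
        {"s1": "AA", "s2": "AG", "s3": "GG"},
        {"s1": "AA", "s2": "GG", "s3": None},
    ),
    "pop_y": (
        {"s1": "CC", "s2": "TT"},
        {"s1": "CC", "s2": "TT"},
    ),
}
per_pop, pct_polymorphic = summarize(example)
print(per_pop, pct_polymorphic)
```

As a consistency check on the reported figures, 52.6% of 38 populations corresponds to 20 populations, and an even split gives 10 of 15 octoploid populations (66.7%) and 10 of 23 tetraploid populations (43.5%).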

### F<sup>1</sup> Analysis

Only 6 out of 59 possible KASP genotyping assays showed both parents as homozygous for opposite alleles. Three representative assays were selected in which one parent was heterozygous and the other homozygous, and three representative assays were selected in which both parents were heterozygous (**Table 3**). All SNP codes that could not be accurately identified or called, because they did not fall into one of the three genotype clusters, were removed from the χ<sup>2</sup> analysis. Four individuals did not consistently show the expected heterozygous SNP genotype for the KASP genotyping assays in which both parents were homozygous for opposite alleles (pcg\_00050, pcg\_00058, pcg\_00059, pcg\_000106, pcg\_1186, and pcg\_14142). These four individuals, after being analyzed across all 12 assays, were identified as selfs and were removed from the χ<sup>2</sup> analysis (**Table 3**). Using the resulting trimmed data, the χ<sup>2</sup> analysis indicated that normal monohybrid 1:2:1 and 1:1 Mendelian inheritance patterns could not be rejected for any of the primers (**Table 5**).
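The logic used to separate true hybrids from selfs can be sketched as follows: at assays where the parents are homozygous for opposite alleles, a true F<sup>1</sup> hybrid must be heterozygous, whereas a self of the female parent carries the female genotype. The function names, assay IDs, and calls are hypothetical, and the four-way classification is a simplification of the manual inspection described in the text.

```python
from typing import Dict, Optional

Calls = Dict[str, Optional[str]]  # assay ID -> two-letter SNP code, or None


def is_homozygous(code: str) -> bool:
    return len(code) == 2 and code[0] == code[1]


def classify_f1(progeny: Calls, female: Calls, male: Calls) -> str:
    """Classify a progeny as 'hybrid', 'self', 'ambiguous', or 'undetermined'."""
    n_scored = n_het = n_female_like = 0
    for assay, f_code in female.items():
        m_code = male.get(assay)
        p_code = progeny.get(assay)
        if not f_code or not m_code or not p_code:
            continue  # skip assays with a missing call
        opposite = is_homozygous(f_code) and is_homozygous(m_code) and f_code != m_code
        if not opposite:
            continue  # only opposite-homozygous parents are diagnostic
        n_scored += 1
        if set(p_code) == {f_code[0], m_code[0]}:
            n_het += 1
        elif p_code == f_code:
            n_female_like += 1
    if n_scored == 0:
        return "undetermined"
    if n_het == n_scored:
        return "hybrid"
    if n_female_like == n_scored:
        return "self"
    return "ambiguous"


# Hypothetical parental and progeny calls (not study data).
female_parent = {"a1": "AA", "a2": "GG", "a3": "AG"}
male_parent = {"a1": "GG", "a2": "CC", "a3": "AA"}
print(classify_f1({"a1": "AG", "a2": "CG", "a3": "AA"}, female_parent, male_parent))
print(classify_f1({"a1": "AA", "a2": "GG", "a3": "AG"}, female_parent, male_parent))
```

Requiring agreement across all diagnostic assays mirrors the text's criterion that individuals were called selfs only when they were consistently inconsistent with the expected heterozygous genotype.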

### F<sup>2</sup> Analysis

The F<sup>1</sup> parent genotype was identified in order to find SNPs that indicated the parent was homozygous (**Table 6**). For 3 F<sup>1</sup> parents that were selfed, there were F<sup>2</sup> progeny that did not fall into the expected homozygous parental genotype (example in **Table 7**). Unexpected genotypes were identified consistently in two F<sup>2</sup> progeny of 13-F1008, one progeny of 14-F1014, and four progeny of 14-F1071. Individuals that consistently fell into the heterozygous (unexpected) genotype category across multiple homozygous primers were considered outcrosses and not true selfs of the F<sup>1</sup> (**Table 7**). Most of the F<sup>2</sup> progeny showed the SNP genotypes expected from the parental genotype.

The first three primers indicate one parent as heterozygous and one parent as homozygous, the next six primers indicate both parents as homozygous for opposite alleles, and the last three primers indicate both parents as heterozygous. Also shown are SNP assay results for eight out of the 83 F<sup>1</sup> hybrids. Indicated are samples that can be identified as true crosses and selfs.

\*Primers that can distinguish true crosses from selfed samples.

†F<sup>1</sup> individuals that are identified as selfs of the PC20\_102 parent.

### DISCUSSION

In order to validate SNP polymorphisms in prairie cordgrass, 121 SNPs identified from the nuclear transcriptome were sent for KASP assay development. Among 121 SNPs, the assay success rate was 78.5% with 26 assays failing development. This is comparable with findings in the literature of success rates of 83% (Cockram et al., 2012), 88.4% (Saxena et al., 2012), and 80.9% (Semagn et al., 2014). The assays failed mainly due to paralogs within the prairie cordgrass genome. Because not all of the populations used to develop the transcriptome were in the core collection of DNA used in this study, some assays appeared as monomorphic. These selected SNPs may have been derived from the octoploid populations not present in the core collection. Three monomorphic SNPs were selected for further analysis, to see if the SNPs would be polymorphic in future studies. With the failed and monomorphic assays removed, 84 putative SNPs were validated as true allelic variants and 59 SNPs were selected for this study. The 59 highly polymorphic assays were selected based on the criteria that there were at least two of the three genotypes present in a large portion of the samples analyzed. These assays were tested on the 38 natural populations, creating a phylogenetic tree that resulted in one clade containing all of the prairie cordgrass populations. If subclades were observed, the two plants of a single population were represented in the subclade.

Just over half of the populations showed within-population polymorphisms, and these were split equally between octoploid and tetraploid populations. The average number of polymorphisms that occurred within each population did not vary between octoploid and tetraploid populations. This differs from a chloroplast DNA study of prairie cordgrass, in which few, if any, polymorphisms were observed in the tetraploid cytotype (Graves et al., 2015).

SNPs were successfully identified in nuclear transcriptomes of prairie cordgrass and validated as allelic variants that can be used in prairie cordgrass. SNP markers were used to detect significant polymorphisms in prairie cordgrass populations collected from distinct geographic regions in the U.S. These SNP polymorphisms appear to reflect genetic relationships in prairie cordgrass and, therefore, can be used to assess genetic diversity within and among populations in future studies.

The F<sup>1</sup> population, consisting of 83 plants, allows for the assessment of the fidelity of a specific tetraploid cross. Due to the lack of synchronization between the pollen and the ovaries, fewer seeds were obtained when PC20-102 was used as the female, compared with crosses involving PC17-109 as the female. Progeny that had SNP genotypes matching only the female parent were determined to be selfs. Of the F<sup>1</sup> progeny, 95.2% were identified as hybrids. Prairie cordgrass is a protogynous outcrossing species (Gedye et al., 2012), leading to the possibility that later-maturing stigmas could have been exposed to pollen from the same female parent, resulting in 4.8% of the F<sup>1</sup> being selfs. The analysis of the 76 F<sup>2</sup> progeny obtained by selfing eight F<sup>1</sup> plants indicates that the SNPs, and the SNP markers chosen, could distinguish between a true selfed plant and an outcrossed plant. This is based on individuals consistently being genotyped as heterozygous (outcrossed) rather than homozygous (selfed) as expected. Ninety-one percent of the F<sup>2</sup> progeny were identified as successful selfs. Because of the protogynous nature of this species, there is already a natural element working against selfing. This could explain why outcrossed individuals were identified. There is also a possibility that some of the early-maturing stigmas were exposed to pollen in the greenhouse before bagging. This could explain why the proportion of F<sup>2</sup> progeny with unexpected genotypes (outcrosses) was higher than the proportion of F<sup>1</sup> progeny with unexpected genotypes (selfs).

There is evidence that the tetraploid cytotype is an allotetraploid that may follow a disomic inheritance pattern. Two divergent copies in the Waxy lineages of the Spartina genus support the allotetraploid origin of S. pectinata (Fortune et al., 2007). The bivalent pairing that occurs during meiosis (Church, 1940; Marchant, 1968a,b; Bishop, 2015) and the observation of disomic inheritance using genotyping-by-sequencing (Crawford, 2015) both suggest a disomic inheritance pattern in S. pectinata. This hypothesis was tested in a cross between two prairie cordgrass populations, exploiting the bi-allelic nature of the KASP technology to test for Mendelian segregation ratios in a monohybrid-type cross. The analysis of the F<sup>1</sup> hybrids and F<sup>2</sup> selfs suggests that disomic inheritance of SNPs in tetraploid prairie cordgrass is a possibility in this cross, in agreement with the chromosomal and genomic evidence (Marchant, 1968a,b; Fortune et al., 2007; Bishop, 2015; Crawford, 2015).

Analysis indicates that all primers produce expected results from a monohybrid Mendelian cross. df = 2, p = 0.05, critical χ<sup>2</sup> = 5.991.

\*Indicates homozygous SNPs.

The primary requirement of any breeding program is to ensure that accurate crosses are made (Glaszmann et al., 2010). The small flower size of prairie cordgrass and the large number of flowers per head make physical emasculation difficult. The possibility of self-pollination always exists and, therefore, developing a molecular way to distinguish true crosses from selfs is warranted (Fang et al., 2004; Gedye et al., 2012). In prairie cordgrass, SSR markers have been developed that identified successful crosses in this protogynous species without the need for emasculation. This study also confirms that hybrids of prairie cordgrass can be created and verified with molecular markers. However, SSRs can be time-consuming to use, limited in number, and more expensive than SNP markers, which motivates the introduction of these newly developed and validated KASP assays.

### CONCLUSION

This study reports the first SNP marker development for use in prairie cordgrass. SNP markers developed from the nuclear transcriptome were tested on a core collection of DNA and found to be polymorphic among and within populations. The amount of variation differs from previous findings based on chloroplast DNA, which identified the octoploid cytotype as the most variable. However, one must recognize that these SNP markers cover a wide range of expressed genomic DNA vs. two noncoding chloroplast DNA regions, giving nuclear SNP markers an advantage in identifying random genetic variation. These markers were used to assess the validity of true crosses that were made between two different populations using F<sup>1</sup> and F<sup>2</sup> (selfs of F<sup>1</sup>) progeny. Utilizing the bi-allelic nature of the KASP system, χ<sup>2</sup> analysis of the F<sup>1</sup> samples suggests that tetraploid prairie cordgrass may follow Mendelian disomic inheritance, although other modes of inheritance were not ruled out. This analysis provides insight into the genomic structure of this species, supporting the hypothesis that tetraploid prairie cordgrass is an allotetraploid. However, further analysis must be done on other cytotypes to completely understand the genome structure of this species and to evaluate genetic diversity. In addition, this study underlines the usefulness of SNP marker technology in future breeding programs of prairie cordgrass, and enables the final step of using SNP markers to genotype germplasm collections or genetic/breeding populations of prairie cordgrass.

#### TABLE 7 | SNP assay results for F<sup>2</sup> individuals of two out of the eight selfed F<sup>1</sup> samples.

Indicated are samples that can be identified as true selfs and as outcrossed.

\*Primers that can distinguish true selfs from outcrossed samples.

†F<sup>2</sup> individuals that are identified as outcrossed samples.

### ACKNOWLEDGMENTS

This work was funded by the Department of Crop Sciences and the Energy Biosciences Institute, the University of Illinois and supported by Brain Pool Program through the Korean Federation of Science and Technology Societies (KOFST) funded by the Ministry of Science, ICT and Future Planning (151S-4-3-1269).

### REFERENCES


cordgrass (Spartina pectinata Link) in the USA. BioEnerg. Res. 8, 1371–1383. doi: 10.1007/s12155-015-9604-3


in Higher Plants, ed R. J. Henry (Cambridge, MA: CABI Publishing), 97–118.

Zilverberg, C., Johnson, W. C., Archer, D., Kronberg, S., Schumacher, T., Boe, A., et al. (2014). Profitable prairie restoration: the EcoSun Prairie Farm experiment. J. Soil Water Conserv. 69, 22A–25A. doi: 10.2489/jswc.69.1.22a

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Graves, Rayburn, Gonzalez-Hernandez, Nah, Kim and Lee. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Implementation of Genomic Prediction in Lolium perenne (L.) Breeding Populations

Nastasiya F. Grinberg<sup>1</sup>, Alan Lovatt<sup>2</sup>, Matt Hegarty<sup>2</sup>, Andi Lovatt<sup>2</sup>, Kirsten P. Skøt<sup>2</sup>, Rhys Kelly<sup>2</sup>, Tina Blackmore<sup>2</sup>, Danny Thorogood<sup>2</sup>, Ross D. King<sup>1</sup>, Ian Armstead<sup>2</sup>, Wayne Powell<sup>2,3</sup> and Leif Skøt<sup>2</sup>\*

<sup>1</sup> Manchester Institute of Biotechnology, University of Manchester, Manchester, UK, <sup>2</sup> Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK, <sup>3</sup> CGIAR Consortium, CGIAR Consortium Office, Montpellier, France

Perennial ryegrass (Lolium perenne L.) is one of the most widely grown forage grasses in temperate agriculture. In order to maintain and increase its usage as forage in livestock agriculture, there is a continued need for improvement in biomass yield, quality, disease resistance, and seed yield. Genetic gain for traits such as biomass yield has been relatively modest. This has been attributed to its long breeding cycle, and the necessity to use population-based breeding methods. Thanks to recent advances in genotyping techniques there is increasing interest in genomic selection (GS), from which genomically estimated breeding values are derived. In this paper we compare the classical RRBLUP model with state-of-the-art machine learning techniques that should lend themselves readily to use in GS, and demonstrate their application to predicting quantitative traits in a breeding population of L. perenne. Prediction accuracies varied from 0 to 0.59 depending on trait, prediction model and composition of the training population. The BLUP model produced the highest prediction accuracies for most traits and training populations. Forage quality traits had higher accuracies than yield-related traits. There appeared to be no clear pattern to the effect of the training population composition on the prediction accuracies. The heritability of the forage quality traits was generally higher than for the yield-related traits, and could partly explain the difference in accuracy. Some population structure was evident in the breeding populations, and probably contributed to the varying effects of training population on the predictions. The average linkage disequilibrium between adjacent markers ranged from 0.121 to 0.215. Higher marker density and a larger training population closely related to the test population are likely to improve the prediction accuracy.

Keywords: perennial ryegrass, genomic selection, BLUP, machine learning, forage crop

### INTRODUCTION

#### Edited by:

Keenan Amundsen, University of Nebraska-Lincoln, USA

#### Reviewed by:

Hao Peng, Washington State University, USA; Scott Eric Warnke, United States Department of Agriculture–Agricultural Research Service, USA

\*Correspondence: Leif Skøt, lfs@aber.ac.uk

#### Specialty section:

This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science

Received: 11 September 2015; Accepted: 25 January 2016; Published: 12 February 2016

#### Citation:

Grinberg NF, Lovatt A, Hegarty M, Lovatt A, Skøt KP, Kelly R, Blackmore T, Thorogood D, King RD, Armstead I, Powell W and Skøt L (2016) Implementation of Genomic Prediction in Lolium perenne (L.) Breeding Populations. Front. Plant Sci. 7:133. doi: 10.3389/fpls.2016.00133

Genetic improvement of crops involves the selection of plants with superior characteristics in terms of traits that are considered important. This could be yield (biomass or seed), resistance to diseases and pests and better tolerance to abiotic stress. The selection criteria have been and still are based largely on phenotypic performance. Phenotypic assessment can be time-consuming and laborious, particularly for perennial crops. There is pressure to increase agricultural output at a faster rate to keep up with population growth and reduced area available for agricultural production. Molecular marker assisted selection (MAS) represents a way of potentially reducing the time and effort needed for phenotypic testing (Lande and Thompson, 1990; Dekkers and Hospital, 2002; Xu and Crouch, 2008). The success of MAS is dependent upon sufficient linkage disequilibrium (LD) between a marker and the phenotypic QTL (quantitative trait locus), and the QTL explaining a substantial proportion of the variation for the trait. Often, this is not the case, and the association between marker and QTL is not significant, and is thus discarded. Therefore, the use of MAS in plant breeding has not been widespread. Recent improvements in genotyping techniques have made it possible to cover the genome with densely populated molecular markers, and this has paved the way for genome wide association studies (GWASs) (Rafalski, 2002; Flint-Garcia et al., 2003) in which marker-trait associations can be identified in breeder relevant and more diverse populations, rather than bi-parental mapping populations. The disadvantages of this approach include low statistical power from small population sizes, confounding population structure of the germplasm used, and overestimation of the effect of few significant marker associations with QTL (Heffner et al., 2009).

Genomic selection (GS) represents a way of dealing with many of the problems of current MAS methodology. The term was first used by Meuwissen et al. (2001) to describe the use of genome wide molecular markers to simultaneously estimate the effect of all markers across the genome, irrespective of whether they are significant, in order to calculate a genomically estimated breeding value (GEBV) of selection candidates. GS depends upon the establishment of a training population, for which both phenotypic and genotypic data are available. The marker effects calculated from these data can be used to estimate the breeding values in populations with only genotypic data available (Meuwissen et al., 2001; Heffner et al., 2009). In terms of prediction methods, the most widely used are genomic or ridge regression BLUP (best linear unbiased prediction) and other penalized regression methods (Gianola et al., 2006; de los Campos et al., 2009; Li and Sillanpää, 2012) and various Bayesian techniques (Meuwissen et al., 2001; de los Campos et al., 2009; Habier et al., 2011). However, these techniques do not explicitly account for interactions. There is currently considerable interest in applying machine learning (ML) to science, and reviews have recently appeared (Ghahramani, 2015; Jordan and Mitchell, 2015). These methods are increasingly being applied in GWASs and GS (Dudoit et al., 2002; Long et al., 2007; Ziegler et al., 2007; Szymczak et al., 2009; Ogutu et al., 2011, 2012; Ornella et al., 2014; Spindel et al., 2015). ML algorithms are well suited to application in plant-breeding datasets. Most are easy to use and are easily available in a variety of implementations. Many methods perform attribute selection (e.g., lasso, regression trees) or assign importance scores to variables (e.g., random forest, boosted trees).
Some methods, such as tree based approaches, do not require any assumptions about the underlying trait (e.g., additivity of effects, the numbers and size of interactions, depth of interactions etc.) and are able to capture complex non-linear relationships between response and regressors.

Genomic selection is an attractive alternative to classic selection methods, and it has been adopted in animal breeding, particularly dairy cattle (Schaeffer, 2006; Pryce and Daetwyler, 2012; Hayes et al., 2013b). The uptake of GS has been slower in plant breeding, but is now gathering pace. Many papers have assessed the potential use of GS in simulation and empirical studies of some of the major crops (Bernardo and Yu, 2007; Heffner et al., 2009, 2010, 2011; Piepho, 2009; Zhong et al., 2009; Jannink et al., 2010; Albrecht et al., 2011; Poland et al., 2012; Zhao et al., 2012, 2013; Xu, 2013; Bentley et al., 2014; Jarquin et al., 2014; Wang et al., 2014; Xu et al., 2014). The applicability of GS in perennial crops such as trees and forages is even more appealing, due to the possibility of significantly reducing the length of the breeding cycle (Grattapaglia and Resende, 2011). Some empirical studies in trees suggest reasonable prediction accuracies can be obtained (Resende et al., 2012a,b; Zapata-Valenzuela et al., 2012; Beaulieu et al., 2014). Two factors need to be taken into consideration when dealing with breeding in many forage crops such as perennial ryegrass. Firstly, the performance of individual spaced plants generally does not correlate well with the phenotype in sward for many economically important traits (Casler and Brummer, 2008). Secondly, most of the important forage crops are outbreeding, so variety development is usually based on population improvement via recurrent selection schemes (Posselt, 2010; Conaghan and Casler, 2011). These factors probably contribute to the low genetic gains achieved in forages, but other factors have been suggested, including a lack of a harvest index trait to breed for, inability to exploit heterosis and a large number of target traits with no or negative correlation between them (Casler and Brummer, 2008). 
Two recent reviews have assessed the prospects for GS in perennial forage crops such as grasses and legumes (Hayes et al., 2013a; Resende et al., 2014). The latter concluded that GS is likely to be most beneficial when phenotypic values of spaced plants do not correlate with those in sward, when within-family selection is difficult or impossible, and for traits that can be assessed only after several years of plot trials. Hayes et al. (2013a) also suggested that significant modifications to most current mass selection breeding schemes in, e.g., perennial ryegrass would be desirable/necessary to implement GS effectively.

However, there is very little empirical data available from forage crops with evaluation of GS performance. Lipka et al. (2014) described the use of GS in predicting breeding values in switchgrass (Panicum virgatum L.), a perennial grass which is being developed as an energy crop. They obtained cross validation accuracies of up to 0.52. Slavov et al. (2014) reported prediction accuracies varying between 0.05 (dry matter) and 0.95 (moisture) with an average of 0.57 for 17 traits in the energy grass, Miscanthus sinensis. Both used association panels as the training and validation population. Recently, an empirical study of genomic prediction of biomass yield in tetraploid alfalfa reported prediction accuracies between 0.21 and 0.60 depending on the breeding cycle (Li et al., 2015). The authors concluded that the selection efficiencies per unit time based on GS were better than for phenotypic selection. To our knowledge, no empirical data have been published of GS performance in perennial ryegrass, the most important forage crop in temperate grassland agriculture. Here we report our first results of an evaluation of GS in the populations from a long-standing and successful recurrent selection breeding program at the Institute of Biological, Environmental, and Rural Sciences (IBERS). The current populations were established in the late 1980s from a relatively small founder population, and have now been through up to 14 generations of selection and recombination. We have used current and some historical phenotypic data from plot trials of half-sib progeny of mother-plants in combination with genotypic data from the mother-plants. Higher prediction accuracies were obtained for traits related to forage quality, particularly water soluble carbohydrates (WSCs) and digestibility (DMD), than for biomass yield. For most trait-training population combinations the ridge regression BLUP prediction method outperformed the three ML methods employed here. We discuss possible explanations for the results as well as potential ways of improving prediction accuracies, particularly for biomass yield.

### MATERIALS AND METHODS

### Plant Material and Breeding Cycle

Plant material from the perennial ryegrass breeding populations was used to obtain genotypic and phenotypic data. In order to put the data collection into context, a brief description of the breeding cycle is given. It is also illustrated in **Figure 1**. Any given cycle starts with a polycross of about 400–600 plants from four to six families. Those parents have been collected from spaced plant field plots. Approximately 100 of the highest seed yielding mother-plants are selected to provide half-sib progeny for evaluation in sward plot trials. Four replicate plots of the half-sib progeny are evaluated over three growing seasons. Biomass yield was recorded for seven cuts each year for the first 2 years, and material from cuts 4 and 5 in the 1st year was used to obtain estimates of dry matter digestibility (DMD), WSCs and nitrogen, with near infrared reflectance spectroscopy (NIRS) (Lister and Dhanoa, 1998). The mean of the results from those two cuts was used in the present analysis. At several stages during all three growing seasons, persistency was assessed by scoring ground cover visually on a scale of 0–9. In the breeding program the phenotypic data are used to select three to five parents from the mother-plants for polycrossing to obtain a synthetic population for variety trials. The results are also used to inform the selection of three to six half-sib families for each new generation. Around 400–600 genotypes from the spaced plant trials of 1000 plants from each family are selected for polycrossing. However, other factors, such as plant stature, disease resistance, and winter survival are also taken into consideration in this selection.

Broad sense heritabilities were calculated as follows:

$$H_B^2 = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_E^2} \tag{1}$$

where $\sigma_G^2$ is the genetic variance, and $\sigma_E^2$ is the residual error variance. The variance components were obtained from a one-way analysis of variance of each of the traits separately. The standard deviation was obtained via leave-one-out jackknife analysis.
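As a rough illustration of Equation (1), the sketch below estimates the two variance components from a balanced one-way layout and attaches a leave-one-out jackknife standard deviation. It is a minimal sketch under stated assumptions, not the authors' R code: the grouping factor is taken to be the half-sib family, the design is assumed balanced, the jackknife is assumed to drop one family at a time, and all function names are hypothetical.

```python
import math
from statistics import mean


def variance_components(plots):
    """Return (sigma_G^2, sigma_E^2) from a balanced one-way ANOVA.

    plots maps a family label to its list of replicate plot values.
    Assumption: every family has the same number of replicates.
    """
    groups = list(plots.values())
    g, r = len(groups), len(groups[0])
    grand = mean(v for grp in groups for v in grp)
    ms_between = r * sum((mean(grp) - grand) ** 2 for grp in groups) / (g - 1)
    ms_within = sum((v - mean(grp)) ** 2 for grp in groups for v in grp) / (g * (r - 1))
    # Method-of-moments estimate; truncated at zero (an assumption of this sketch).
    sigma_g = max((ms_between - ms_within) / r, 0.0)
    return sigma_g, ms_within


def broad_sense_h2(plots):
    """Equation (1): sigma_G^2 / (sigma_G^2 + sigma_E^2)."""
    sigma_g, sigma_e = variance_components(plots)
    return sigma_g / (sigma_g + sigma_e)


def jackknife_sd(plots):
    """Leave-one-family-out jackknife standard deviation of the estimate."""
    keys = list(plots)
    n = len(keys)
    estimates = [
        broad_sense_h2({k: plots[k] for k in keys if k != left_out})
        for left_out in keys
    ]
    centre = mean(estimates)
    return math.sqrt((n - 1) / n * sum((e - centre) ** 2 for e in estimates))


# Made-up plot values for four families with four replicates each.
example = {
    "hs1": [5.1, 4.9, 5.3, 5.0],
    "hs2": [6.2, 6.0, 6.4, 6.1],
    "hs3": [4.0, 4.3, 4.1, 3.9],
    "hs4": [5.6, 5.4, 5.8, 5.5],
}
print(broad_sense_h2(example), jackknife_sd(example))
```

The example values are invented purely to exercise the functions and carry no biological meaning.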

### Genotyping and Linkage Disequilibrium

A 3K Illumina Infinium iSelect Array was used for genotyping of the mother-plants. The SNPs in the array were identified on the basis of polymorphisms in transcriptome libraries from perennial ryegrass plants representing six diverse populations. The development and validation of this array was described in detail previously (Blackmore et al., 2015). The DNA was extracted from leaf material of the mother-plants from each generation as described (Skøt et al., 2011), except for the F12 generation. None of the mother-plants from that generation are in existence, so the DNA was obtained from the husks of the seed derived from the respective mother-plants. In total, DNA samples of sufficient quality were obtained from 86 mother-plants of the F12 generation. After allele calling in the Illumina GenomeStudio software, the genotypic scores were converted to −1, 0, and 1 for input into the various prediction models.
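The recoding step mentioned at the end of this paragraph can be sketched as a simple lookup. This is only an illustration: the `AA`/`AB`/`BB` call strings, the choice of which homozygote receives −1, and the use of `None` for failed calls are assumptions, and `encode_calls` is a hypothetical helper rather than part of any genotyping software.

```python
# Assumed call strings; which homozygote maps to -1 is arbitrary in this sketch.
GENOTYPE_CODES = {"AA": -1, "AB": 0, "BA": 0, "BB": 1}


def encode_calls(calls):
    """Map biallelic genotype calls to -1/0/1; unknown or failed calls become None."""
    return [GENOTYPE_CODES.get(call) for call in calls]


print(encode_calls(["AA", "AB", "BB", "--"]))
```

Missing calls are left as `None` here; how they were handled before model fitting is not stated in the text.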

Linkage disequilibrium data (r<sup>2</sup>) were obtained using a consensus genetic map containing 1670 markers from the 3K Infinium Array as described in Blackmore et al. (2015). The LD landscape plots were generated based on an R script described earlier (Wang et al., 2013), but modified and improved for L. perenne.
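For readers unfamiliar with the statistic, one common way to approximate pairwise r<sup>2</sup> from unphased genotype scores is the squared Pearson correlation between the scores at two markers. The sketch below shows that calculation; whether the study used this exact estimator is not stated, so treat it as an assumption, and `genotype_r2` as a hypothetical name.

```python
def genotype_r2(marker_a, marker_b):
    """Squared Pearson correlation between two vectors of -1/0/1 genotype scores.

    Pairs with a missing score (None) are skipped. Returns NaN when either
    marker is monomorphic in the retained samples.
    """
    pairs = [(a, b) for a, b in zip(marker_a, marker_b) if a is not None and b is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    if s_aa == 0 or s_bb == 0:
        return float("nan")
    return s_ab * s_ab / (s_aa * s_bb)


print(genotype_r2([-1, 0, 1, 1], [-1, 0, 1, 1]))    # identical markers
print(genotype_r2([-1, 1, -1, 1], [-1, -1, 1, 1]))  # uncorrelated markers
```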

### Training and Test Populations

This work was aimed at making genomic predictions of the breeding values of the 100 mother plants of the F14 generation based on training populations consisting of various parts of the previous generations of both the intermediate- and late flowering breeding populations. We wanted to assess the effect of training population size and relatedness to F14 on prediction ability, and also to compare a number of different prediction models in terms of their performance. Three training populations were used. The first was based on the F13 generation, which is closest genetically to F14 (see **Figure 2**). It consisted of 54 mother plants. The second included data from all the intermediate-flowering generations for which we have genotypic and phenotypic data, namely F11, F12, and F13 (this training population is referred to as 'INT'). The size of that training population was 259. Finally, we also included the late-flowering population F5. This brought the training population size up to 364 (we refer to this training population as 'ALL').

All phenotypic data were normalized with respect to each subpopulation's mean and scaled to have variance 1. Thus, hybrid phenotypes F11 + F12 + F13 and F5 + F11 + F12 + F13 do not have variance of exactly 1.
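The per-generation scaling can be sketched as below. This is a minimal illustration with made-up numbers, assuming that each generation is centred on its own mean and divided by its own sample standard deviation; `standardize_by_generation` is a hypothetical helper. The toy example also shows why a pooled training set built from separately scaled generations need not have a sample variance of exactly 1.

```python
from statistics import mean, stdev, variance


def standardize_by_generation(values, generations):
    """Centre and scale each phenotype within its own generation."""
    grouped = {}
    for value, gen in zip(values, generations):
        grouped.setdefault(gen, []).append(value)
    # Assumption: the sample standard deviation (n - 1 denominator) is used.
    stats = {gen: (mean(vals), stdev(vals)) for gen, vals in grouped.items()}
    return [(v - stats[g][0]) / stats[g][1] for v, g in zip(values, generations)]


phenotypes = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
labels = ["F11", "F11", "F11", "F12", "F12", "F12"]
scaled = standardize_by_generation(phenotypes, labels)
print(scaled)            # each generation now has mean 0 and variance 1
print(variance(scaled))  # pooled sample variance of the combined set
```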

### Prediction Models

We investigated the predictive abilities of four methods: GBLUP from statistical genomics and three ML methods. The advantage of GBLUP compared to standard multivariate regression is the ability to cope with the p >> n situation and prevent overfitting via the penalty mechanism. We used GBLUP as the benchmark method against which we compared the three ML models.
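To make the penalty mechanism concrete, the following sketch fits a ridge-penalized marker model in its dual form, which only requires solving an n × n system and therefore remains tractable when p >> n. It is a simplified illustration, not the rrBLUP implementation used in the study: the penalty `lam` (the ratio of residual to marker-effect variance) is passed in by hand rather than estimated by REML, the intercept is approximated by the training mean, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge_marker_effects(Z, y, lam):
    """Return (intercept, marker effects) for genotype matrix Z (n plants x p markers)."""
    n, p = len(Z), len(Z[0])
    mu = sum(y) / n
    # Dual form: u = Z' (Z Z' + lam I)^-1 (y - mu); the system is n x n, not p x p.
    gram = [
        [sum(Z[i][k] * Z[j][k] for k in range(p)) + (lam if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    alpha = solve(gram, [yi - mu for yi in y])
    effects = [sum(Z[i][k] * alpha[i] for i in range(n)) for k in range(p)]
    return mu, effects


def predict_gebv(Z_new, mu, effects):
    """Predicted values for plants that have genotypes only."""
    return [mu + sum(z * u for z, u in zip(row, effects)) for row in Z_new]


# Toy data: three plants, five markers (more markers than plants).
Z_train = [[-1, 0, 1, 1, 0], [1, 1, -1, 0, 0], [0, -1, 1, 1, 1]]
y_train = [0.4, -0.9, 0.5]
mu, u = ridge_marker_effects(Z_train, y_train, lam=1.0)
print(predict_gebv([[0, 0, 1, 1, 0]], mu, u))
```

Larger values of `lam` shrink all marker effects more strongly toward zero, which is what keeps the fit stable when there are more markers than plants.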

We used two tree-based methods: random forests (RF) (Breiman, 2001) and boosted trees (GBM) (Friedman, 2001). Both methods are non-parametric and make no assumptions about the distribution or any other properties of the data they are applied to, which is an advantage.

For RF we used the standard values: the number of variables considered at each split was 1/3 of the total number, and each terminal node held a minimum of five observations. Trees were grown to their maximal depth and were not pruned, and we grew 500 trees per forest.

For GBM we used a shrinkage parameter (which discounts each successive tree to avoid overfitting) of 0.01, a subsampling rate (proportion of data used to construct each tree) of 0.5 and trees of depth 5, of which we grew 1500 per model.

Standard deviations are in brackets. Trait identification: total7c, total biomass yield over all 7 cuts; conscuty, Yield of conservation cut, i.e., second cut; vegyld, Total biomass yield minus conservation cut; gcscore, Ground cover score; dmd, Dry matter digestibility (%); n, nitrogen (%); wsc, Water soluble carbohydrates (%).

Thirdly, we used the k-nearest neighbors algorithm (KNN), a model that predicts each new sample point based on the values of its nearest (according to some metric) neighbors in the training set. In KNN regression this prediction is just the average over the values in the neighborhood. This is an example of a lazy learning method: generalization beyond the training data only occurs when test data are introduced. The advantage of the method is its simplicity and ease of use (one effectively only has one tuning parameter, k, the number of neighbors to consider for each new instance). In the context of GS it has the further advantage that the genetic relatedness of plants in the training and test populations is exploited, as only plants genetically close to the target are used to calculate each GEBV. For each trait we used the optimal number of neighbors chosen via cross-validation on the corresponding training population (between 1 and 10 for the F13 training population, between 3 and 20 for INT and between 4 and 26 for ALL).
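The KNN regression described here can be sketched in a few lines. The text does not specify the distance metric or the cross-validation scheme, so the Euclidean distance on genotype scores, the leave-one-out scheme and the squared-error criterion below are all assumptions, and the function names are hypothetical.

```python
import math


def knn_predict(train_X, train_y, x_new, k):
    """Average phenotype of the k training plants closest to x_new."""
    # Assumption: Euclidean distance on the -1/0/1 genotype scores.
    by_distance = sorted((math.dist(row, x_new), y) for row, y in zip(train_X, train_y))
    return sum(y for _, y in by_distance[:k]) / k


def choose_k(train_X, train_y, candidates):
    """Pick k by leave-one-out squared error on the training population."""
    def loo_error(k):
        total = 0.0
        for i in range(len(train_y)):
            rest_X = train_X[:i] + train_X[i + 1:]
            rest_y = train_y[:i] + train_y[i + 1:]
            total += (knn_predict(rest_X, rest_y, train_X[i], k) - train_y[i]) ** 2
        return total
    return min(candidates, key=loo_error)


# Toy training set: four plants scored at two markers.
X = [[-1, -1], [-1, 0], [1, 1], [1, 0]]
y = [1.0, 1.2, 3.0, 3.2]
best_k = choose_k(X, y, candidates=[1, 2, 3])
print(best_k, knn_predict(X, y, [1, 1], best_k))
```

Because each prediction averages only the nearest plants, the GEBV of a test plant is driven by its closest relatives in the training set, which is the property highlighted in the text.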

Performance of each model was assessed by calculating Spearman's rank correlation, r(y, GEBV), between the corresponding predicted values and the observed F14 phenotypic values.
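For reference, Spearman's rank correlation is the Pearson correlation of the rank-transformed values. The sketch below is an illustrative implementation with average ranks for ties; the function names are hypothetical and the numbers are made up.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / (s_xx * s_yy) ** 0.5


def spearman(observed, predicted):
    """Spearman's rank correlation between observed phenotypes and GEBVs."""
    return pearson(average_ranks(observed), average_ranks(predicted))


observed = [1.0, 2.0, 3.0, 4.0]
gebv = [10.0, 20.0, 30.0, 25.0]
print(spearman(observed, gebv))
```

Because only the ranking matters, this measure rewards a model for ordering the selection candidates correctly even if the predicted values are on a different scale from the phenotypes.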

All analysis was done in R (R Core Development Team, 2014); we used the gbm package for GBM, randomForest for RF, FNN for KNN (Hastie et al., 2009) and rrBLUP (Endelman, 2011) for BLUP.

### RESULTS

### Phenotypic Data and Heritabilities

The phenotypic data were obtained from sward trials derived from half-sib progeny of the 100 or so mother-plants of each generation. The quality traits, such as digestibility, WSCs and nitrogen tended to have higher heritability than the yield-related traits (**Table 1**). There is also variation between years and cuts, highlighting the effects of time. The heritabilities for the biomass yields in the 2nd year tended to be lower than for the 1st year, particularly for F14, but also for the other Intermediate generations.

### Structure of the Breeding Populations

A 3K SNP Infinium array was used as a platform for genotyping the ryegrass breeding populations (Blackmore et al., 2015). **Figure 2** shows the first two principal components from a PCA on the full genotypic data set, Intermediate F11–F14 and Late F5–F6 (note that F6 was not used in the analysis elsewhere, since no phenotypic information for it was available at the time of writing, but was included in the PCA, since genotype data were available). The first principal component clearly separates the genotypes into two clusters, one containing the Intermediate population and one containing the Late. The two generations of the Late breeding populations, F5 and F6, form one single cluster, while the Intermediate generations separate along the second principal component. While F13 and F14 form one cluster, F12 and in particular F11 are partially separated from the F13–F14 cluster. LD in the total breeding population is illustrated in two ways. **Figure 3** shows r<sup>2</sup> between pairs of markers against the corresponding pairwise distances for each of the seven chromosomes. The average distance between consecutive markers is given in brackets above each plot. The average LD for each pairwise marker distance ranged from 0.121 to 0.215. Supplementary Figure S1 shows landscape and heatmap plots, and they demonstrate that the average pairwise LD ignores some local variations in LD along the chromosomes. The landscape plots and heatmaps show the presence of some hotspots of LD, particularly on chromosomes 2 and 6, while the overall level of LD fluctuates between 0.1 and 0.2.

### Genomic Predictions

There are two phases where GS can potentially accelerate the breeding program (**Figure 1**). One is at the spaced plant nursery stage where genotypic information of all the mother plants could assist in the selection of the families being taken forward to the next generation. We do not yet have that information. The other stage is the selection of parents for a new variety or synthetic population, and this is the focus of this first experiment. This is based on genotypic information from the 100 or so mother-plants selected for sward trials of their half-sib progeny. We compared four prediction models for the three training sets. The results, recorded as correlations between genomically predicted values and phenotypic data, are summarized in **Tables 2–4**. All four methods were poor at predicting the conservation cut yield, while the predictions of total yield were slightly better overall and for vegetative yield even better. For most traits the BLUP method outperformed the other methods (see **Tables 2–4**). RF was the second best method with KNN and GBM trailing behind. The highest correlation between observed and predicted values was observed for the forage quality traits, particularly WSCs. This was especially pronounced for the BLUP method, where the correlation approached 0.6 when INT or ALL (INT + F5) was used as the training population. There was, however, no consistent pattern to the effect of the training population. For BLUP and RF a trend toward better performance was discernible with increasing size of the training population, particularly for the quality traits. However, even that was not entirely consistent. For example, DMD had the highest accuracy with F13 as the training population (**Tables 2–4**). For the yield based traits, the best prediction accuracies were generally found in the 1st year harvests for the BLUP method (**Tables 2–4**). For the two largest training populations (INT and ALL), the prediction accuracy for ground cover (gcscore) was higher in year 2 than in year 1. Data for ground cover in year 3 are not yet available for F14, so prediction accuracies could not be calculated. Of the three biomass yield related traits the highest prediction accuracies were obtained with the BLUP method. The prediction accuracies for these traits were all higher for year 1 data with the BLUP model.

derived from three mapping families (see Materials and Methods) with a superimposed cubic smoothing spline.

### DISCUSSION

### Accuracy and Prediction Model

The highest prediction accuracy for each trait is highlighted in bold. Trait identification is as described in Table 1.

TABLE 3 | Correlation between observed phenotypes and GEBV predicted by the four methods trained on INT (Intermediate F11 + F12 + F13).

The highest prediction accuracy for each trait is highlighted in bold.

This work represents the first empirical evaluation of GS in perennial ryegrass, the most important temperate forage grass crop. We tested four prediction models and three training populations in order to assess the effect of the method and the size and composition of the training populations. The comparison between the different prediction models was most straightforward, since this can be done for each training population. Overall, BLUP was the best performing method, but ML techniques were reasonably successful on the F13 training population, where they outperformed BLUP for 4 out of 11 traits (**Table 2**). Traits with higher heritability consistently gave better prediction accuracy. This was particularly evident for DMD and WSCs (**Tables 2–4**), which both have the highest heritability (**Table 1**) and the highest prediction accuracy. One of the characteristics of the quality traits is that the frequency distribution in terms of percentage of dry matter was unimodal even after combining the data for different generations, while yield-related traits differ markedly between years, location and generation, and so have bi- or tri-modal frequency distributions. We have tried to mitigate these environmental effects here by scaling the trait values separately for each generation, and also normalizing them against phenotypic values of control varieties. Furthermore, we considered yield-related traits in different years as different traits. The effectiveness of this is very much dependent on the presence or absence of genotype by environment interaction (G × E). If there is considerable G × E the predictions will be different for different years. Furthermore, variation in heritability between years is also likely to have an effect on accuracy. **Tables 2–4** show that there are differences in prediction accuracies between years, and in particular differences between the effects of composition of the training population. The generally lower prediction accuracies for the yield-related traits are consistent with their lower heritability (**Table 1**).

### Accuracy and Training Population

TABLE 4 | Correlation between observed phenotypes and GEBV predicted by the four methods trained on ALL (INT + Late F5).

The highest prediction accuracy for each trait is highlighted in bold.

The relationship between size and composition of training population on the one hand and prediction accuracy on the other was complex, and more difficult to interpret. This is because the change in training population size is compounded by the population structure (**Figure 2**). For 16 of the 44 trait/prediction method combinations, the prediction accuracies increased when replacing F13 with all intermediate generations, i.e., F11 + F12 + F13. This increased the training population size from 54 to 259, so everything else being equal, an increase in accuracy would be expected. However, for more than half of the combinations this was not the case. A further increase in the training population with 105 individuals from the Late F5 generation did not improve accuracy appreciably for most of the trait/method combinations. Population structure could partially explain this result. While the most obvious difference was between the Intermediate and the Late groups, F12 and particularly F11 diverged from the F13/F14 cluster (**Figure 2**). The genetic distance between the generations could possibly explain why we do not see a consistent increase in prediction accuracy with an increase in training population size. This may be equivalent to the situation in animal breeding where there are examples of loss in prediction accuracy in across-breed predictions as compared to within-breed predictions (Daetwyler et al., 2012; Erbe et al., 2012). Due to the limited extent of LD across breeds, it is estimated that large cross-breed reference populations are needed (Goddard and Hayes, 2009). In the ryegrass breeding populations, the genetic separation is most likely driven by a combination of deliberate selection and genetic drift, the latter of which is more important in a population with a small effective population size. A small effective population size limits the number of genes causing an effect on a trait. The original number of founders of the Intermediate population was low (10), but polycrossing in subsequent generations included approximately 400 plants, and thus helped generate a great many more haplotypes than the original 20. A combination of a larger effective population size and genetic separation requires a higher coverage of SNP markers. An estimate of the effective population size of the breeding population can be obtained from the empirical estimates of LD we have obtained.
The expectation of LD is given by $r^2 = 1/(4N_e c + 1)$, where $N_e$ is the effective population size, and $c$ is the distance between adjacent markers in Morgans (Sved, 1971). Assuming an LD estimate of 0.1 (**Figure 3**, Supplementary Figure S1) and an average distance of 0.003 Morgans between adjacent markers, the effective population size is 281. This is somewhere between the original number of founders (10) and the number of parents in the polycrosses of selected spaced plants at each generation (400). Given a prediction accuracy of r = ∼0.5 and heritability of 0.4 (approximate values for WSC), one would expect to require a training population size of 2983 unrelated individuals. The appropriate values have been substituted in the following equation: $r^2 = Nh^2/(Nh^2 + M_e)$, where $M_e = 2N_eL/\ln(4N_eL)$. $L$ is the genome size in Morgans (eight for L. perenne), $h^2$ is heritability, and $N$ is the size of the training population (Meuwissen, 2009). The prediction accuracies we have obtained here, at least for the quality traits, with a much smaller training population are likely due to the strong relatedness of the training population to the test population. Relatedness would thus appear to be a very important factor determining the success of GS.
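The two formulas quoted above can be wrapped in small helper functions, which makes it easy to see how the expected accuracy responds to LD, marker spacing, heritability and training population size. The sketch below is a hedged illustration: the function names are hypothetical, and the inputs in the usage lines are round placeholder values chosen for readability, so the printed numbers are not meant to reproduce the figures quoted in the text.

```python
import math


def effective_population_size(r2, c):
    """Solve r^2 = 1 / (4 * Ne * c + 1) for Ne (c in Morgans)."""
    return (1.0 / r2 - 1.0) / (4.0 * c)


def independent_segments(ne, genome_morgans):
    """Me = 2 * Ne * L / ln(4 * Ne * L)."""
    return 2.0 * ne * genome_morgans / math.log(4.0 * ne * genome_morgans)


def expected_accuracy(n_train, h2, me):
    """Accuracy r from r^2 = N h^2 / (N h^2 + Me)."""
    return math.sqrt(n_train * h2 / (n_train * h2 + me))


def required_training_size(target_r, h2, me):
    """Invert the accuracy formula for N at a target accuracy r."""
    r2 = target_r ** 2
    return r2 * me / (h2 * (1.0 - r2))


# Placeholder inputs (assumptions for illustration only).
ne = effective_population_size(r2=0.2, c=0.005)
me = independent_segments(ne, genome_morgans=8.0)
n_needed = required_training_size(target_r=0.5, h2=0.4, me=me)
print(ne, me, n_needed)
```

Under these formulas the required training population grows with the effective population size and shrinks with heritability, which matches the qualitative argument made in the text about small, closely related training sets.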

### Genomic Prediction in Future Ryegrass Breeding

The breeding program described here is similar to the suggested generalized scheme for implementation of GS in forage crops (Hayes et al., 2013a). It thus represents a suitable template for this initial evaluation of prediction accuracies. The particular methodology of the ryegrass breeding program, however, presents a challenge. The need to use sward trials to obtain realistic phenotypic data, especially for biomass-related traits, means that the prediction accuracies in our implementation are based on genotypic data from mother plants and phenotypic data from sward derived from seed of half-sib progeny of those mother plants. Given the mixture of genotypes in such a sward, this is likely to lower the obtainable prediction accuracies. If genotypic data were available from all the potential pollen donors in the polycrosses, it would enable us to predict allele frequencies in the progeny, but this was not economically feasible. In a white spruce population it was also found that prediction accuracies decreased markedly when the validation population was unrelated (or had unknown relationship) to the training population (Beaulieu et al., 2014). Prediction accuracies between 0.327 and 0.435 were found where the relationship between training and validation population was closest. The larger training population and number of markers (1694 and 6358, respectively) could explain the more consistent results across traits compared to our results. Nevertheless, the prediction accuracies for the forage quality traits are comparable to those in white spruce. In switchgrass, prediction accuracies for a range of morphological and quality traits varied between 0 and 0.55, and are thus also within the same range as in ryegrass. In alfalfa it was recently reported that genomic prediction accuracies of biomass yield were higher within the same breeding cycle than across cycles (Li et al., 2015). This is consistent with the situation in the ryegrass breeding program.
The higher and more consistent accuracies reported in alfalfa are most likely due to higher heritabilities for the biomass traits, and to the fact that the phenotypic and genotypic data were obtained from the same spaced plants, and not from half-sib progeny.

As has been pointed out previously (Daetwyler et al., 2012; Liu et al., 2015) prediction accuracies are determined to a large extent by genomic relationships (population structure) and LD. Given the limited number of markers used in this study and the extent of LD in the breeding populations, it would seem likely that the accuracies obtained here are attributable to the capture of the relatedness between genotypes. In other words, the closer the relationship between training population and test population, the fewer markers are required to obtain a given accuracy (Liu et al., 2015).

Other factors that influence the accuracy are the environmental factors affecting plants grown in different years. This is highlighted by the variable prediction accuracies between years for the yield related traits (**Tables 2–4**). These factors make combining populations into homogeneous training sets a non-trivial, and often difficult, task. They also make it difficult to tune the hyperparameters of ML models on the training set; for instance, parameters deemed optimal by tuning on any of the three training populations were often suboptimal when tested on the F14 population. This significantly reduced the accuracy achieved on the F14 test set by the three ML models, which are usually very powerful. Another reason for the comparatively good performance of GBLUP is the fact that biomass-related and forage quality traits are controlled by many QTLs with small effects, a situation which is optimal for GBLUP. However, for some of the combinations RF performed better than GBLUP (e.g., **Table 2**, DMD). If one ML prediction method consistently outperformed the other methods, it would be easy to "mix and match" prediction methods to traits. At present the results are not sufficiently consistent to consider this. Obtaining more biomass yield data from different sites (environments) should improve prediction accuracies.

In this work we considered the phenotypic performance in sward, and the GEBVs obtained from this can be used to inform which parents to select for generating a potential new synthetic population or variety (**Figure 1**). For this purpose prediction accuracies would need to be as high as the predictions based on phenotypic evaluation. While this is not the case, the GEBVs can also be used to assist in the selection of families (seed of a mother plant) for the next generation of the spaced plant nursery. The long running IBERS breeding scheme outlined in **Figure 1** is in fact very similar to the one proposed in a recent review (Hayes et al., 2013a). As we obtain increasingly complete information on the pedigree of the breeding populations from the genotypic data, we can begin to make informed decisions to maximize the genetic variation in the breeding population, and perhaps even reduce its size while maintaining variation. The improvement of GEBVs over generations will eventually lead to a situation where they can compete with the phenotypic evaluation, and thus begin to save time (Hayes et al., 2013a).

We demonstrated the use of a GS approach, in which one standard statistical method and three ML methods were compared for predicting GEBVs in L. perenne. The results are most encouraging for forage quality traits, such as WSCs and DMD, and highlight several important points. Improved prediction accuracies are desirable for the yield related traits, particularly in the second year. A larger training population closely related to the validation population and a larger number of markers would probably improve accuracy. However, low heritability of a trait makes such improvements more difficult to achieve. Future work might involve devising more efficient ways of combining different sub-populations, since small training population size together with genome-wide LD (**Figure 3**, Supplementary Figure S1) limits the prediction ability in GS. It would also be very interesting to incorporate meteorological data into ML models, thus not only accounting for some of the environmental effects, but also uncovering G × E interactions.

### AUTHOR CONTRIBUTIONS


NG conceived the work, analyzed the data and wrote the paper, AL provided phenotypic data from the breeding programme and analyzed some of the data, MH conceived the work and developed the SNP CHIP, Andi Lovatt maintained and propagated the plants and provided technical assistance, KPS ran the SNP CHIP analysis, RK provided technical assistance with DNA extraction and marker analysis, TB developed the SNP CHIP and provided the marker data, DT developed the genetic map used as the basis for the LD analyses, RDK conceived the work and supervised the data analysis, IA conceived the work, WP conceived the work, LS conceived and supervised the work, analyzed some of the data and wrote the paper.

### ACKNOWLEDGMENT

This work was funded by a responsive mode grant under the Industrial Partnership Award scheme from the BBSRC (BB/J006955/1) and Germinal Holdings LTD.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016.00133



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Grinberg, Lovatt, Hegarty, Lovatt, Skøt, Kelly, Blackmore, Thorogood, King, Armstead, Powell and Skøt. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification and Characterization of Switchgrass Histone *H3* and *CENH3* Genes

Jiamin Miao<sup>1,2</sup>, Taylor Frazier<sup>1</sup>, Linkai Huang<sup>2</sup>, Xinquan Zhang<sup>2</sup>\* and Bingyu Zhao<sup>1</sup>\*

*<sup>1</sup> Department of Horticulture, Virginia Tech, Blacksburg, VA, USA, <sup>2</sup> Department of Grassland Science, Sichuan Agricultural University, Ya'an, China*

Switchgrass is one of the most promising energy crops and only recently has been employed for biofuel production. The draft genome of switchgrass was recently released; however, relatively few switchgrass genes have been functionally characterized. CENH3, the major histone protein found in centromeres, along with canonical H3 and other histones, plays an important role in maintaining genome stability and integrity. Despite their importance, the histone *H3* genes of switchgrass have remained largely uninvestigated. In this study, we identified 17 putative switchgrass histone *H3* genes *in silico*. Of these genes, 15 showed strong homology to histone *H3* genes including six *H3.1* genes, three *H3.3* genes, four *H3.3-like* genes and two *H3.1-like* genes. The remaining two genes were found to be homologous to *CENH3*. RNA-seq data derived from lowland cultivar Alamo and upland cultivar Dacotah allowed us to identify SNPs in the histone *H3* genes and compare their differential gene expression. Interestingly, we also found that overexpression of switchgrass histone *H3* and *CENH3* genes in *N. benthamiana* could trigger cell death of the transformed plant cells. Localization and deletion analyses of the histone *H3* and *CENH3* genes revealed that nuclear localization of the N-terminal tail is essential and sufficient for triggering the cell death phenotype. Our results deliver insight into the mechanisms underlying the histone-triggered cell death phenotype and provide a foundation for further studying the variations of the histone *H3* and *CENH3* genes in switchgrass.

Keywords: *Panicum virgatum* L., histone H3, CENH3, *Agrobacterium*-mediated transient assay, cell death

### INTRODUCTION

Plant nucleosomes are composed of a protein octamer that contains two molecules of each of the core histone proteins: H2A, H2B, H3, and H4. The canonical histone H3 protein is one of the most important protein components of this complex. The histone H3 protein has a main globular domain (including αN-helix, α1-helix, Loop1, α2-helix, Loop2, and α3-helix motifs) and a long N-terminal tail, and is post-translationally modified significantly more than the other histone proteins (Ingouff and Berger, 2010). The histone H3 family consists of four main members. Three of these members, H3.1, H3.2, and H3.3, share more than 95% identity. The fourth member is the histone H3 variant CENH3, which is highly divergent in sequence in comparison to canonical histone H3. CENH3 is specifically present at centromeres and is essential for centromere and kinetochore formation in all organisms (Allshire and Karpen, 2008).

#### *Edited by:*

*Keenan Amundsen, University of Nebraska-Lincoln, USA*

#### *Reviewed by:*

*Alan Rose, University of California, Davis, USA Hao Peng, Washington State University, USA*

#### *\*Correspondence:*

*Xinquan Zhang zhangxq@sicau.edu.cn; Bingyu Zhao bzhao07@vt.edu*

#### *Specialty section:*

*This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science*

*Received: 22 January 2016 Accepted: 21 June 2016 Published: 12 July 2016*

#### *Citation:*

*Miao J, Frazier T, Huang L, Zhang X and Zhao B (2016) Identification and Characterization of Switchgrass Histone H3 and CENH3 Genes. Front. Plant Sci. 7:979. doi: 10.3389/fpls.2016.00979*

The histone H3 proteins have been extensively characterized in many plant species. Recent studies have demonstrated that histone H3 variants have evolved to play an important role in specialized plant functions, including heterochromatin replication (Jacob et al., 2014), gene silencing (Mozzetta et al., 2014), flowering time regulation (Shafiq et al., 2014), stem elongation (Chen et al., 2013), gene activation (Ahmad and Henikoff, 2002a,b) and so on. More recently, some studies have reported that the aberrant expression of CENH3 may cause errors during mitosis, such as triggering the formation of micronuclei and the subsequent loss of chromosomes, thus resulting in aneuploidy and possibly reduced fertility (Ravi and Chan, 2010; Lermontova et al., 2011; Sanei et al., 2011).

Switchgrass (Panicum virgatum L.), a warm-season perennial grass species native to North America, has a growth habitat that ranges from northern Mexico into southern Canada. Switchgrass has recently been targeted as a model herbaceous species for biofuel feedstock development (McLaughlin and Adams Kszos, 2005). With a base chromosome number of 9, ploidy levels within naturally occurring populations of the species vary from diploid (2n = 2x = 18) to dodecaploid (2n = 12x = 108) (Nielsen, 1944; Hultquist et al., 1996). Switchgrass has two distinct ecotypes, lowland and upland, that are characterized based on growth habitat. The lowland ecotypes, which are commonly tetraploid (2n = 4x = 36), are frequently found in warm and humid environments in the southern regions of North America. The upland ecotypes, however, can vary in ploidy (4x, 6x, and 8x) and are adapted to colder and more arid environments in central to northern North America (Porter, 1966; Hopkins et al., 1996). Under natural conditions, inter-ecotype hybridizations between switchgrass plants with different ploidy levels are rare. Even so, aneuploidy is common in switchgrass, indicating that its genome is unstable in nature, which complicates genetic studies and breeding selection (Costich et al., 2010). The cause of switchgrass genome instability could be related to abnormal centromeres that interact with the mitotic spindle, which can lead to selective chromosome loss (Bennett et al., 1976). It is possible that preferential or differential expression of switchgrass H3 and CENH3 genes may contribute to switchgrass genome instability. Despite their importance, both switchgrass histone H3 and CENH3 genes have remained largely uninvestigated.

In this study, we used a homology-based approach to identify histone H3 and CENH3 genes in switchgrass. We identified 15 putative switchgrass histone H3 genes and two CENH3 genes. RNA-seq analysis allowed us to detect genetic polymorphisms in these genes, as well as differences in their expression levels, between lowland cultivar Alamo and upland cultivar Dacotah. We also characterized these switchgrass histone H3 and CENH3 genes with respect to their subcellular localization and phenotypic responses in Nicotiana benthamiana. Interestingly, we report for the first time that transient overexpression of histone H3 or CENH3 genes can trigger programmed cell death in Nicotiana benthamiana. The results of this study provide a better understanding of the function of switchgrass histone H3 genes and may aid in improving future genetic studies of this important biofuel crop.

### MATERIALS AND METHODS

### Plant Material

Switchgrass cv. Alamo, cv. Dacotah and N. benthamiana (PI 555478) seeds were obtained from the USDA (United States Department of Agriculture) germplasm center. Italian ryegrass (Lolium multiflorum L.) cv. Changjiang No.2 seeds were obtained from Sichuan Agricultural University of China. Two independent NtSGT1-RNAi N. tabacum transgenic lines (Traore et al., under revision) were also used for histone H3 transient expression analyses. Switchgrass, tobacco and Italian ryegrass were planted in pots and grown in a growth chamber programmed for 16 h light at 28°C and 8 h dark at 24°C.

### Database Search and Sequence Analysis

The Arabidopsis histone H3 gene (At1g09200) was used as a query in a BLASTX search against the switchgrass draft genome (Panicum virgatum v1.1, DOE-JGI, http://www.phytozome.net/pvirgatum). Multiple sequence alignment of histone H3 genes/proteins was performed using DNAMAN 7.0 (Lynnon Biosoft, San Ramon, USA), and visualized using BOXSHADE 3.21 (http://www.ch.embnet.org/software/BOX\_form.html). The neighbor-joining phylogenetic tree was generated using MEGA6 (Tamura et al., 2013).

### RNA-seq Analysis of the Putative Switchgrass Histone *H3* and *CENH3* Genes

RNA-sequencing reads from leaf tissue of three biological replicates of Alamo and Dacotah were obtained (Hupalo et al., in preparation) and imported into CLC Genomics Workbench version 7.5 (CLCBio/Qiagen, Boston, MA). The reads were quality trimmed and filtered and mapped to the Panicum virgatum reference genome (v1.1) using the default parameters with the following adjustments: maximum number of hits for a read = 1, similarity fraction = 0.9, length fraction = 1.0, mismatch cost = 2, insertion cost = 3 and deletion cost = 3. After mapping, the reads were locally realigned and variants were called using the Basic Variant Detection tool with the following settings: ploidy = 4, minimum coverage = 3, minimum count = 2, minimum frequency (%) = 25.0 and base quality filter = yes. The RPKM and fold difference values for the switchgrass histone H3 and CENH3 genes of Alamo and Dacotah were calculated using an unpaired two-group comparison experiment feature that is part of the CLC Genomics Workbench program.
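The variant-calling thresholds listed above can be read as a simple per-site filter. The sketch below is an illustrative Python rendering of that logic only; it is not the CLC Genomics Workbench implementation, and the function name and its inputs (a variant allele count and a read coverage for one site) are hypothetical.

```python
def passes_variant_filter(allele_count: int, coverage: int,
                          min_coverage: int = 3, min_count: int = 2,
                          min_frequency_pct: float = 25.0) -> bool:
    """Return True if a candidate variant meets all three thresholds.

    allele_count : number of reads supporting the variant allele
    coverage     : total number of reads covering the site
    """
    if coverage <= 0:
        return False  # no reads, nothing to call
    if coverage < min_coverage or allele_count < min_count:
        return False
    frequency_pct = 100.0 * allele_count / coverage
    return frequency_pct >= min_frequency_pct
```

For example, a variant supported by 2 of 10 reads meets the coverage and count thresholds but fails the 25% frequency threshold. With ploidy set to 4, a 25% minimum frequency would correspond to a variant carried on one of four chromosome copies; that reading is an interpretation of the settings, not a statement made in the text.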

### RT-PCR Analysis of the Histone *H3* Genes

Total RNAs from fresh leaves of Alamo, tobacco, Arabidopsis and Italian ryegrass were isolated using an RNeasy® Plant Mini Kit (Qiagen, Valencia, CA) and were treated with the RNase-Free DNase Set (Qiagen) to remove DNA contamination. First-strand cDNA was synthesized by reverse transcription (RT) using a DyNAmo™ cDNA Synthesis Kit (Fisher Scientific Inc, Pittsburgh, PA). PCR was performed using either gene-specific primers or conserved primers, as listed in **Table S1**. Gene-specific primers were designed based on the nucleotide sequences of the 5′ and 3′ UTRs of specific switchgrass histone H3 genes. Switchgrass histone H3 conserved primers were designed based on the sequence alignment of the 15 switchgrass histone H3 genes. Tobacco and Italian ryegrass histone H3 gene-specific primers were designed based on known ESTs (EF051133.1, AB366152.1, and AB205017.1) located in GenBank. Primers for Arabidopsis thaliana H3.3 (At1g13370) and CENH3 (At1g01370) were designed based on the nucleotide sequences identified in the TAIR (The Arabidopsis Information Resource) database.

The iProof™ high fidelity DNA polymerase (Bio-Rad, Hercules, CA) was used for all PCR amplifications and the PCR program was run as follows: 98°C for 3 min; 30 cycles of 98°C for 30 s, 58°C for 45 s, and 72°C for 50 s; and a final extension at 72°C for 7 min. The PCR products were gel purified using an E.Z.N.A.® MicroElute Gel Extraction Kit (Omega Bio-Tek Inc., Norcross, GA) and were subsequently cloned into the pENTR™/D-TOPO® vector (Invitrogen, Carlsbad, CA). The vector was then transformed into E. coli (DH5α) (Life Technologies) and the bacteria were grown at 37°C. Plasmid DNAs were isolated using a Qiagen plasmid miniprep kit (Qiagen, Valencia, CA) and sequenced using an M13 forward primer (**Table S1**) at the core facility of Virginia Bioinformatics Institute (Blacksburg, VA, USA).

For deletion analyses of switchgrass histone H3 genes, a series of primers (**Table S1**, **Figure 3**) were designed to amplify different fragments based on the gene sequence of PvH3.3 (Pavir.Ib01857.1) and PvCENH3 (Pavir.J05674.2). Chimeric gene fusions of the PvH3.3 N-terminal tail and the PvCENH3 fold domain were constructed by overlapping PCR (**Figure 3C**).

### Cloning of the Histone *H3* Genes into a Plant Expression Vector

Using the Gateway® LR Clonase® II Enzyme mix kit (Invitrogen, Carlsbad, CA), the different histone H3 genes were sub-cloned into the binary vectors pEarleyGate101, pEarleyGate104 (Earley et al., 2006) and pEAQ-HT-DEST3 (Sainsbury et al., 2009). These vectors were then transformed into E. coli (C2110) (Wu and Zhao, 2013) by electroporation. The derived histone H3-YFP, YFP-H3, and H3-Hisx6 fusion genes were cloned behind the CaMV35S promoter. The plasmid constructs were sequenced using either the 35S forward primer, the YFP forward primer or gene-specific primers (**Table S1**).

### *Agrobacterium*-Mediated Transient Assays in Tobacco Plants

The pEarleyGate101-PvH3s-YFP, pEarleyGate104-YFP-H3s, and pEAQ-HT-PvH3.3-Hisx6 plasmid DNAs were transformed into Agrobacterium tumefaciens strain GV2260 by electroporation. Transformed Agrobacterium cells were grown on LB agar medium supplemented with 50 µg/ml kanamycin and 100 µg/ml rifampicin and incubated at 28°C for 2 days. Agrobacterium-mediated transient assays in tobacco leaves were performed as described previously (Traore and Zhao, 2011). The fully expanded leaves of 3–4 week old N. benthamiana or N. tabacum plants were chosen for infiltration. The fluorescence signal of the histone H3-YFP protein was monitored 2 days post inoculation using a confocal microscope (Zeiss Axio Observer.A1, Carl Zeiss MicroImaging, Inc., Thornwood, NY).

### Analysis of Histone H3-YFP Fusion Proteins by Western Blot

Agrobacterium tumefaciens GV2260 cultures carrying the pEG101-PvH3s-YFP constructs were infiltrated into young but fully expanded tobacco leaves at a concentration of OD600nm = 0.5. Leaf disks (1.9 cm diameter) were collected 3 days post inoculation using a cork borer, ground in liquid nitrogen and re-suspended in 100 µl 3 × Laemmli buffer containing 16% β-mercaptoethanol. The tissue was then boiled for 10 min and pelleted at a high speed for 10 min. Twenty microliters of protein extract was applied to and separated on a 10% SDS-PAGE gel. The proteins were blotted to a PVDF membrane using a Bio-Rad Trans-Blot® Turbo™ Transfer System. The membrane was blocked with 5% nonfat skim milk in 1 × Tris-saline buffer supplemented with 0.5% Tween 20 (1 × TBST). Next, the membrane was probed with anti-HA-HRP (Abcam, 1:2000) and the signal was detected using SuperSignal® West Pico Chemiluminescent Substrate (Thermo Scientific, Waltham, MA). The chemiluminescent signals were exposed to autoradiography film (Genesee Scientific, San Diego, CA) using a Kodak film processor (Kodak, A Walsh Imaging, Inc, Pompton Lakes, NJ).

### RESULTS

### Seventeen Histone *H3* Genes were Identified from the Current Draft Switchgrass Genome

To identify putative switchgrass histone H3 genes, we used the Arabidopsis histone H3 gene (At1g09200) as a query in a BLASTX search against the switchgrass draft genome (Panicum virgatum v1.1). Seventeen potential switchgrass histone H3 genes showing significant similarity to the Arabidopsis histone H3 gene (the majority of E-values were <1.00e-30) were identified (**Table 1**).
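For readers who run a comparable search locally, hits can be screened by E-value as sketched below. The sketch assumes BLAST tabular output (`-outfmt 6`, whose twelve default columns end in E-value and bit score); the search reported here was run against the Phytozome resource, so the function name and any input lines are hypothetical.

```python
def filter_hits(tabular_lines, max_evalue=1e-30):
    """Return (subject_id, evalue) pairs for hits with E-value below max_evalue.

    tabular_lines : iterable of tab-separated lines in BLAST outfmt 6 layout
                    (qseqid, sseqid, pident, length, mismatch, gapopen,
                     qstart, qend, sstart, send, evalue, bitscore)
    """
    hits = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        subject_id = fields[1]
        evalue = float(fields[10])
        if evalue < max_evalue:
            hits.append((subject_id, evalue))
    return hits
```

The default threshold mirrors the E-value of 1.00e-30 mentioned above; a less stringent cutoff can be passed as `max_evalue` to inspect weaker candidates.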

The seventeen switchgrass histone H3 genes encode nine different proteins (**Table 1**). These nine proteins were aligned with human (NCBI Reference Sequence: NP\_003520.1, NP\_066403.2, and NP\_002098.1), mouse (NP\_659539.1, NP\_038576.1, and NP\_032236.1), Arabidopsis (At1g09200, At4g40030, and At1g01370), rice (Os01g0866200, Os03g0390600, and Os05g0489800) and maize (NP\_001131276.1, AFW71933.1, and NP\_001105520.1) histone H3 and CENH3 proteins (**Figure 1A**). All of the switchgrass histone H3 proteins, except for Pavir.J05674.2 and Pavir.J25829.1, which encode putative CENH3 proteins, display a high degree of homology and conservation to the histone proteins of these different species.

To further classify the switchgrass histone H3 proteins, we used the switchgrass and Arabidopsis histone H3 proteins to generate a phylogenetic tree (**Figure 1B**). The switchgrass histone H3 proteins largely grouped into five clades. The first clade consisted of Pavir.J01005.1, Pavir.Ga01868.1, Pavir.J18804.1, Pavir.J00640.1, Pavir.J20671.1, and Pavir.Db02133.1 and grouped with the Arabidopsis H3.1 proteins. The second clade contained Pavir.J07529.1, which grouped with the Arabidopsis H3.1-like proteins. The third clade consisted of Pavir.J26857.1, Pavir.J24812.1 and Pavir.J09299.1 and grouped with the Arabidopsis H3.3-like proteins. The fourth clade contained Pavir.Ia03121.2, Pavir.Ib01857.1, and Pavir.J05563.1 and grouped with the Arabidopsis H3.3 proteins. Finally, Pavir.J05674.2 and Pavir.J25829.1 grouped with the Arabidopsis CENH3 protein. Pavir.Fa02085.1 and Pavir.J10481.1, along with the H3.1-like and H3.3-like proteins, are novel H3 variants; however, they appear to be clustered into the H3.3 or H3.1 groups due to amino acid substitutions commonly found in these variants at positions 32, 42, 88, and 91 (Luger et al., 1997; Malik and Henikoff, 2003). These proteins also contain other sequence variations, such as insertions. According to the Arabidopsis proteins with which they grouped most closely, we renamed these switchgrass histone H3 proteins as PvH3.1, PvH3.1-like (A and B), PvH3.3, PvH3.3-like (A, B, and C), PvCENH3.1, and PvCENH3.2. Therefore, we identified six major histone H3s (H3.1) and three histone H3 variants (H3.3) from the current draft switchgrass genome (**Figure 1**, **Table 1**).

*<sup>c</sup>RPKM = reads per kilobase per million = number of mapped reads/length of transcript in kilobases/million mapped reads.*

*<sup>d</sup>Fold Difference = the amount by which the mean expression values differ between Alamo and Dacotah.*

*<sup>e</sup>Chromosome location can be found in the Phytozome v10.1 Panicum virgatum v1.1 genome.*

Although most of the canonical switchgrass histone H3 proteins and H3 variants have 136 amino acids, we also found histone H3 variants (Pavir.Fa02085.1) that have small insertions at position 12. Pavir.J05674.2 and Pavir.J25829.1 are centromeric histone H3 variants (PvCENH3) that have highly diverse sequences. CENH3 shows significant sequence divergence between switchgrass and Arabidopsis, and it also differs significantly from canonical histone H3s. The CENH3 proteins are highly variable in the N-terminal tail; however, the histone fold domain is relatively conserved (**Figure 1**).

Using the gene IDs, the CDS sequences for all of the switchgrass histone H3 genes were obtained from Phytozome (Panicum virgatum v1.1). Alignment of the CDS sequences with the genomic DNA sequences for all of the switchgrass histone H3 genes allowed us to identify the exons and introns of these genes. The PvH3.1 genes do not have any introns, whereas the PvH3.3 and PvCENH3 genes contain between one and six introns (**Table 1**). Although the intron splicing sites are conserved, the sequence content in the introns is highly diverse (**Figure S1**).
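The exon and intron bookkeeping described above can be illustrated with a small sketch. It assumes that the CDS-to-genome alignment has already been reduced to exon coordinates (1-based, inclusive) on one strand, so that the introns are simply the gaps between consecutive exons. The function name and any coordinates passed to it are hypothetical.

```python
def introns_from_exons(exons):
    """Return intron intervals (1-based, inclusive) lying between exons.

    exons : iterable of (start, end) tuples on the genomic sequence;
            exons are assumed not to overlap.
    """
    ordered = sorted(exons)
    introns = []
    # Walk over consecutive exon pairs and record the gap between them.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns
```

A gene whose CDS aligns as a single block yields an empty list, which corresponds to the intronless case reported for the PvH3.1 genes.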

### Sequence Polymorphism and Expression Variation of Histone *H3* Genes Identified from Two Switchgrass Cultivars

In order to evaluate gene expression of the 17 putative switchgrass histone H3 genes, we obtained and analyzed RNA-seq datasets that were constructed from the leaf tissue of two switchgrass cultivars, Alamo and Dacotah (Hupalo et al., in preparation). As summarized in **Table 1**, at least 10 of the histone H3 genes could be identified in the RNA-seq datasets that were generated from Alamo and Dacotah. Based on the RPKM (Reads Per Kilobase per Million reads mapped) values, four of the switchgrass histone H3 genes (Pavir.J26857.1, Pavir.J24812.1, Pavir.Ia03121.2, and Pavir.Ib01857.1) are expressed at relatively higher levels than the others in switchgrass leaf tissue.

The expression variation of the Alamo and Dacotah histone H3 genes was also analyzed by comparing the expression fold difference values between the two cultivars. As shown in **Table 1**, four out of the 17 histone H3 genes displayed more than a 2-fold expression difference in RPKM values. Interestingly, two Alamo histone H3.1 genes (Pavir.J00640.1 and Pavir.J01005.1) are expressed 29.22- and 10.89-fold higher, respectively, than their homologous genes in Dacotah.
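The RPKM and fold-difference values discussed here follow from simple ratios. The sketch below is an illustrative calculation, not the CLC Genomics Workbench implementation: RPKM is taken as mapped reads divided by transcript length in kilobases and by total mapped reads in millions, and fold difference is assumed to be the ratio of the mean RPKM values of the two cultivars. The function names are hypothetical.

```python
def rpkm(mapped_reads: float, transcript_length_bp: float,
         total_mapped_reads: float) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = transcript_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return mapped_reads / kilobases / millions


def fold_difference(rpkm_a, rpkm_b) -> float:
    """Ratio of mean RPKM in group A to mean RPKM in group B."""
    mean_a = sum(rpkm_a) / len(rpkm_a)
    mean_b = sum(rpkm_b) / len(rpkm_b)
    if mean_b == 0:
        raise ValueError("mean of group B is zero; ratio is undefined")
    return mean_a / mean_b
```

In this sketch each group would hold one RPKM value per biological replicate, so that `fold_difference` compares replicate means for the two cultivars.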

The current release of the switchgrass genome (Panicum virgatum v1.1) contains 636 Mb of sequencing data assembled onto 18 scaffolds with an additional 593 Mb remaining on unanchored contigs (http://phytozome.jgi.doe.gov). Five of the 17 histone H3 genes identified in this study (Pavir.Fa02085.1, Pavir.Db02133.1, Pavir.Ga01868.1, Pavir.Ia03121.2, and Pavir.Ib01857.1) have been anchored onto a given switchgrass chromosome, while the others are located on contigs. Eight out of the 17 histone H3 genes are identical in Alamo and Dacotah. The other nine genes have nucleotide sequence polymorphisms that contain anywhere from 4 to 25 SNPs (**Table 1**, **Table S2**).

### Cloning Switchgrass Histone *H3* Genes by RT-PCR

To validate the gene sequences and expression levels of the predicted switchgrass histone H3 genes, we used a pair of conserved primers (**Table S1**) to perform RT-PCR on the leaf cDNAs of cv. Alamo. PCR products were cloned and 22 clones were randomly chosen for sequencing analysis. Fifteen clones (68%) carried the DNA sequences coding for PvH3.3, which suggests that the PvH3.3 gene has a relatively higher expression level in switchgrass leaf tissue in comparison to the other histone H3 genes. This is consistent with the RNA-seq data in which the two PvH3.3 genes in both Alamo and Dacotah exhibited a relatively higher RPKM value than the RPKM value of the other histone genes (**Table 1**).

Four H3.1 genes (Pavir.Ga01868.1, Pavir.J18804.1, Pavir.J00640.1, and Pavir.J01005.1), two H3.3 genes (Pavir.Ia03121.2 and Pavir.Ib01857.1), two H3.3-like genes (Pavir.J26857.1 and Pavir.J24812.1) and one CENH3 gene (Pavir.J05674.2) were also amplified from Alamo cDNA using the histone H3 and CENH3 specific primers (**Table S1**). In addition, one H3.3 gene (PvH3.3) was amplified in our RT-PCR analysis (data not shown) that was not identified in the switchgrass genome (Panicum virgatum v1.1).

### PvH3s and PvCENH3 Fused to YFP are Predominately Located in the Plant Cell Nucleus

To test the subcellular localization of the switchgrass histone H3 genes, we cloned Pavir.J01005.1 (PvH3.1), Pavir.Ib01857.1 (PvH3.3), Pavir.J24812.1 (PvH3.3-like), Pavir.J05674.2 (PvCENH3), and PvH3.3 (not identified in our genome search, but identified by RT-PCR) into the binary vector pEarleyGate101 (Earley et al., 2006), which fused a C-terminal YFP (yellow fluorescent protein) to each of the histone genes. The fusion genes were then transiently expressed in Nicotiana benthamiana plant cells. PvH3s-YFP and PvCENH3-YFP fusion proteins localized predominantly in the plant cell nucleus, whereas the control YFP was located in both the cytosol and the nucleus (**Figure S2**).

### Overexpression of Histone *H3-YFP* Triggers a Cell Death Phenotype in *N. benthamiana*

When transiently overexpressed in N. benthamiana, both the PvH3-YFP and the PvCENH3-YFP proteins triggered a cell death phenotype 3 days post inoculation (dpi) (**Figures 2A,B**). Interestingly, overexpression of the histone H3 genes cloned from other plant species, including Arabidopsis H3.3 and CENH3, Italian ryegrass H3.3 and N. benthamiana H3.3 and CENH3, also triggered cell death in the transformed N. benthamiana plant cells (**Figures 2B,C**).

To test if the C-terminal YFP fusion affected the cell-death-triggering ability of histone H3, we fused YFP to the N-terminus of switchgrass H3.3 by cloning the H3.3 gene into the binary vector pEarleyGate104 (Earley et al., 2006). As shown in **Figure 2C**, YFP-PvH3.3 and YFP-PvCENH3 could also trigger cell death.

Fusion of YFP to either end of a protein may alter the native protein structure and function. Thus, YFP-tagged proteins can produce unexpected and unwanted phenotypes. To determine if YFP was contributing to the cell death phenotype of the histone H3 genes, we also cloned PvH3.3 into the pEAQ-HT-DEST3 vector, which fuses a smaller tag (6xHistidine) to the C-terminal end of the protein (Sainsbury et al., 2009). As shown in **Figure 2E**, transient expression of pEAQ-HT-PvH3.3 was still able to trigger cell death. Therefore, we conclude that overexpression of solely the PvH3.3 protein is able to trigger cell death in N. benthamiana.

Histone H3 is one of the key elements of the nucleosome, which is predominantly located in the nucleus. To test if nuclear localization of histone H3 is required for its ability to trigger the cell death phenotype, we fused a myristoylation signal peptide to the N-terminus of the PvH3.3 gene (Pavir.Ib01857.1) that also contained an N-terminal YFP tag (**Figure S3**). As shown in **Figure 2D**, the Myr-YFP-PvH3.3 fluorescence signal was primarily detected on the plasma membrane with no obvious nuclear localization. Interestingly, the fusion of the myristoylation signal peptide completely inhibited the cell death phenotype triggered by YFP-PvH3.3. This suggests that the PvH3.3 protein needs to be localized in the plant nucleus in order to trigger cell death.

### Overexpression of the N-Terminal Tail of *PvH3.3* and *PvCENH3* Triggers Programmed Cell Death of Transformed Tobacco Cells

Histone H3 proteins have two domains: an N-terminal tail and a histone fold domain (Luger et al., 1997; Malik and Henikoff, 2003). In order to determine the part of the PvH3.3 protein that is essential for triggering cell death in N. benthamiana, we performed a deletion mutagenesis series from both the N- and C-terminal ends. Five different fragments of PvH3.3 (**Figure 3A**) were used to generate PvH3.3-YFP fusion genes. As shown in **Figure 3D**, the N-terminal tail (1–43aa) is the part of the protein that maintains the ability to trigger cell death. All of the fragments that contained the N-terminal tail were predominantly localized in the nucleus (**Figures S4A–C**). This suggests that there is an unidentified nuclear localization signal in the N-terminal tail sequence. The fragments that contained solely the PvH3.3 C-terminal histone fold domain, however, lost the ability to localize in the nucleus (**Figures S4D,E**) and could not trigger cell death in the transformed N. benthamiana plant cells (**Figure 3D**).

To investigate whether the N-terminal tail of PvCENH3 is also sufficient to cause cell death, both the N-terminal tail and the histone fold domain of the PvCENH3 gene were amplified (**Figure 3B**) and cloned into expression vectors that fused them with a C-terminal YFP tag. After transient expression of each construct in tobacco, the fluorescence signal showed that the N-terminal tail of PvCENH3 localized solely in the nucleus, whereas the histone fold domain was distributed between the nucleus and the cytosol (**Figures S4F,G**). Therefore, PvCENH3 must also contain an uncharacterized nuclear localization signal in its N-terminal tail. As with PvH3.3, overexpression of the N-terminal tail caused cell death by 3 dpi, whereas the histone fold domain failed to trigger cell death (**Figure 3E**). A previous report has suggested that a chimeric histone H3-CENH3 fusion gene can trigger chromosome elimination in Arabidopsis (Ravi and Chan, 2010). In this study, we also fused the N-terminal tail of PvH3.3 with the histone fold domain of PvCENH3 (**Figure 3C**). We found that transient expression of PvH3.3-PvCENH3-YFP could also trigger cell death in N. benthamiana plant cells (**Figure 3E**). Expression of all the PvH3.3 and PvCENH3 fragments in N. benthamiana was confirmed by Western blot analyses (**Figure 3F**).

### Silencing of the *SGT1* Gene in *N. tabacum* Inhibits the Cell Death Phenotype Triggered by *PvH3.3* and *PvCENH3*

The cell death phenotype caused by the overexpression of PvH3.3 and PvCENH3 is similar to the hypersensitive response (HR) triggered by the interaction between plant pathogen effectors and cognate plant R genes (Coll et al., 2011). Since the HR-like cell death that is triggered by many R genes requires the function of SGT1, which is a conserved immune signaling component (Peart et al., 2002), we tested whether SGT1 is also required for the elicitation of cell death induced by overexpression of PvH3.3 and PvCENH3 in Nicotiana tabacum. Two independent NtSGT1-RNAi transgenic lines (Traore et al., under revision) were used for transient overexpression of PvH3.3 and PvCENH3. As shown in **Figure 4A**, transient expression of PvH3.3(1–63aa)-YFP and PvCENH3(1–72aa)-YFP triggered cell death phenotypes on the wild type N. tabacum plants but failed to trigger any phenotype on the NtSGT1-RNAi plants. Western blot analysis found that the PvH3.3(1–63aa)-YFP and PvCENH3(1–72aa)-YFP fusion proteins were both expressed in the wild type and the NtSGT1-RNAi transgenic plants (**Figure 4B**). Therefore, NtSGT1 is essential for promoting histone H3-mediated cell death. Further studies are needed to investigate the role that SGT1 may play in histone gene-mediated cell death.

FIGURE 3 | Diagram of eight fragments of the switchgrass histone H3.3 and CENH3, and the phenotype of overexpression of different fragments of switchgrass *histone H3.3* and *CENH3* in *N. benthamiana*. (A) PCR primer locations for different fragments of switchgrass histone H3; (B) PCR primer locations for different fragments of switchgrass CENH3; (C) The chimeric gene containing the N-terminal tail of switchgrass histone H3.3 and the histone-fold domain of CENH3 was constructed by overlap PCR. Agrobacterium strains expressing the different DNA fragments outlined in (A–C) were inoculated in *N. benthamiana* and the cell death phenotype was photographed at 4 days post inoculation. (D) 1, Fragment 1-YFP; 2, Fragment 2-YFP; 3, Fragment 3-YFP; 4, Fragment 4-YFP; 5, Fragment 5-YFP; 6, YFP only (negative control); (E) 1, PvCENH3-YFP; 2, Fragment 8-YFP; 3, Fragment 6-YFP; 4, Fragment 7-YFP; 5, YFP only. (F) Western blot to detect switchgrass histone H3-YFP fusion proteins. Different fragments of histone H3-YFP fusion proteins transiently expressed in *N. benthamiana* plant cells were detected by western blot. 1, Fragment 1-YFP; 2, Fragment 2-YFP; 3, Fragment 3-YFP; 4, Fragment 4-YFP; 5, Fragment 5-YFP; 6, Fragment 6-YFP; 7, Fragment 7-YFP; 8, YFP only (negative control); 9, *N. benthamiana* total proteins (negative control); the western blot membrane was also stained with Ponceau S Staining Solution to show equal loading of each protein sample.

### DISCUSSION

In this study, we identified 17 potential switchgrass histone H3 genes in the draft switchgrass genome (v1.1). Although their nucleotide sequences differ, these 17 genes encode 9 distinct proteins. Even though switchgrass is a tetraploid, only two of these genes are putative CENH3 genes, which encode the key histone component of the centromere. RNA-seq analysis of the histone H3 genes in leaf tissue of Alamo and Dacotah found that some of the histone H3 genes have sequence polymorphisms at the nucleotide level (**Table S2**). Alamo and Dacotah represent lowland and upland switchgrass ecotypes, respectively, and are genetically diverse from each other. Therefore, the identified SNPs could be developed into molecular markers that can distinguish histone H3 gene alleles between the different ecotypes of switchgrass. We also found that transient overexpression of several of the putative switchgrass histone H3 proteins in tobacco leaves can trigger cell death.

### Histone *H3* Genes in Switchgrass

Plant histone H3 genes belong to a gene family with multiple members. For instance, maize and rice have approximately 14–15 histone H3 genes (Ingouff and Berger, 2010). Additionally, Arabidopsis contains 15 histone H3 genes including five H3.1 genes, one H3.1-like gene, three H3.3 genes, five H3.3-like genes, and one CENH3 gene (Okada et al., 2005). In this study, we used a homology-based approach to identify histone H3 genes in switchgrass. We found six H3.1 genes, three H3.3 genes, four H3.3-like genes, two H3.1-like genes, and two CENH3 genes in the switchgrass genome. Core histones are among the most highly conserved proteins in eukaryotes, emphasizing their important role in maintaining genome stability and structure (Marino-Ramirez et al., 2011). In this study, we were also able to determine that histone H3 proteins from different species are highly conserved. The histone H3 proteins and the CENH3 proteins share a relatively conserved histone fold domain; however, the N-terminal tail of each protein group shows significant sequence divergence, both among species and between the two groups (**Figure 1A**). This is consistent with previous studies showing that CENH3 has evolved rapidly, particularly in its N-terminal tail, and that the adaptive evolution of the N-terminal tail may be a response to changing centromeric satellite repeats (Henikoff et al., 2001; Malik and Henikoff, 2001; Talbert et al., 2002).
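The per-class counts quoted above can be checked against the stated family sizes with a few lines of arithmetic. The following is only an illustrative sketch: the dictionaries restate the numbers given in the text, and the variable and function names are assumptions introduced here, not part of the original analysis.

```python
# Per-class histone H3 gene counts as stated in the text.
# Names are illustrative only; they do not come from the original study.
switchgrass_h3 = {
    "H3.1": 6,
    "H3.1-like": 2,
    "H3.3": 3,
    "H3.3-like": 4,
    "CENH3": 2,
}
arabidopsis_h3 = {
    "H3.1": 5,
    "H3.1-like": 1,
    "H3.3": 3,
    "H3.3-like": 5,
    "CENH3": 1,
}


def family_size(counts):
    """Return the total number of genes summed over all classes."""
    return sum(counts.values())


switchgrass_total = family_size(switchgrass_h3)
arabidopsis_total = family_size(arabidopsis_h3)
print("switchgrass:", switchgrass_total)
print("Arabidopsis:", arabidopsis_total)
```

The totals agree with the 17 switchgrass genes and 15 Arabidopsis genes reported in the text.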

In this study, 10 PvH3 genes were cloned from switchgrass leaf cDNA. The other PvH3 genes may be expressed at undetectable levels in switchgrass leaves but may be significantly expressed in other tissue types. For example, AtMGH3, which is an Arabidopsis histone H3 gene, can only be detected in mature buds and mature bicellular and tricellular pollen (Okada et al., 2005).

Non-transgenic *N. tabacum* total proteins (negative control).

### The Subcellular Localization of Histone H3

The fluorescent signals of the histone H3-YFP fusion proteins were mainly located in the nucleus. This is similar to the histone H3 that was studied in Drosophila (Ahmad and Henikoff, 2002b). Histone H3 fused with two tandem repeats of GFP in yeast cells also localizes to the nucleus (Mosammaparast et al., 2002). Using the pSORT software (http://www.psort.org/), we identified a putative nuclear localization signal (NLS), "KRVTIMPKDIQLARRIR," which is conserved among the switchgrass histone H3 proteins. This NLS is located in the Loop2 and α3-helix of the protein. Two DNA binding motifs, "KAPRKQL" and "PFQRLVREI," were identified in the N-terminal tail and the α1-helix, respectively (**Figure 1A**). In our transient assay experiments, however, only the N-terminal tail fused with YFP was predominantly localized in the plant nucleus, and the fragments that did not contain the N-terminal tail lost the ability to localize in the nucleus (**Figure S4**). This result is similar to that reported for H3 and H4 GFP fusion proteins in yeast cells: the amino-terminal domains of yeast H3-(1–58) and H4-(1–42) are necessary and sufficient for nuclear transport, whereas truncated proteins lacking the amino-terminal domain, H3-(58–136) and H4-(42–103), showed reduced nuclear accumulation (Mosammaparast et al., 2002). Previous studies have indicated that the histone-fold domain of CENH3 controls kinetochore localization (Sullivan et al., 1994; Lermontova et al., 2006; Black et al., 2007). In this study, we demonstrated that the N-terminal tail of PvCENH3 is essential and sufficient for its nuclear localization (**Figure S4**). Interestingly, we identified a putative NLS, "PKKKLQF," in the N-terminal tail of the switchgrass CENH3. Previous studies have reported that NLS activity is found in the N-terminal tail domains of all core histones (Baake et al., 2001; Mosammaparast et al., 2001). Therefore, switchgrass histone H3 may also contain an NLS in its N-terminal tail. Site-directed mutagenesis of this putative NLS will be necessary to further validate its function.
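As a rough illustration of how the motifs quoted above could be located in a protein sequence, the sketch below performs a plain exact-string scan. This is not the pSORT prediction method used in the study. The sequence `TOY_PROTEIN` is a hypothetical placeholder assembled from the quoted motifs and arbitrary filler residues, not a switchgrass histone sequence, and the function name is an assumption introduced here.

```python
import re

# Motifs quoted in the text above.
MOTIFS = {
    "putative NLS (H3)": "KRVTIMPKDIQLARRIR",
    "DNA-binding motif 1": "KAPRKQL",
    "DNA-binding motif 2": "PFQRLVREI",
}

# HYPOTHETICAL toy sequence: the quoted motifs joined by arbitrary filler.
# It is NOT a real switchgrass histone H3 sequence.
TOY_PROTEIN = (
    "MART" + "KAPRKQL" + "AAAA" + "PFQRLVREI" + "GGGG"
    + "KRVTIMPKDIQLARRIR" + "GERA"
)


def find_motifs(protein, motifs):
    """Map each motif name to a list of (start, end) hits.

    Coordinates are 1-based and inclusive, as is usual for protein positions.
    A zero-width lookahead is used so overlapping occurrences are reported.
    """
    hits = {}
    for name, motif in motifs.items():
        pattern = re.compile("(?=" + re.escape(motif) + ")")
        hits[name] = [
            (m.start() + 1, m.start() + len(motif))
            for m in pattern.finditer(protein)
        ]
    return hits


hits = find_motifs(TOY_PROTEIN, MOTIFS)
for name, positions in hits.items():
    print(name, positions)
```

An exact-match scan of this kind only reports where a known motif sits; whether the motif functions as an NLS still has to be tested experimentally, for example by the site-directed mutagenesis proposed above.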

### Overexpression of Histone *H3* Triggers Cell Death in *N. benthamiana* Plant Cells

In this study, we found that overexpression of switchgrass histone H3 genes could trigger a cell death phenotype in N. benthamiana. Since overexpression of histone H3 genes of N. benthamiana, Arabidopsis and Italian ryegrass also triggered a similar cell death, the cell death phenotype that we observed is not simply due to the heterologous expression of the switchgrass histone H3 genes in N. benthamiana. Therefore, aberrant expression of histone H3 genes appears to be toxic to plant cells. The cytotoxicity of histones has been previously reported in yeast and mammalian cells (Singh et al., 2010). A delicate balance between histone protein concentrations and DNA synthesis during the packaging of the genome into chromatin is essential for cell viability (Singh et al., 2010). Insufficient amounts of histone proteins inside the cell have been shown to be lethal (Han et al., 1987). On the other hand, excessive levels of histone proteins have also proved to be deleterious for cell growth as they promote genomic instability, increase DNA damage sensitivity and accelerate cytotoxicity (Gunjan and Verreault, 2003; Singh et al., 2009). The toxicity of large amounts of histone proteins could be attributed to their highly positive charge, which may exhibit non-specific electrostatic interactions with many negatively charged subcellular molecules including nucleic acids, such as DNA and RNA, and negatively charged proteins (Singh et al., 2010). Interestingly, mammalian histones have been found in the extracellular space between cells. These extracellular histones may bind to cell membrane receptors and may activate multiple signaling pathways that can trigger diverse cellular responses including cytotoxicity, proinflammation, procoagulation and barrier dysfunction (Chen et al., 2014; Xu et al., 2015).

Chronic obstructive pulmonary disease (COPD) is a progressive disease that is characterized by extensive lung inflammation and apoptosis of pulmonary cells. Inflamed lung cells trigger an 8-fold increase in production of hyperacetylated histone H3.3, a modified version of this histone protein that is resistant to proteasomal degradation. As a result, the damaged cells release acetylated H3.3 into the extracellular space where it binds to lung structural cells and induces apoptosis (Barrero et al., 2013). Xu et al. found that a mixture of histones was cytotoxic to the endothelial cells (EA.hy926) of sepsis patients, and that the toxic effects were mainly due to histones H3 and H4 (Xu et al., 2009). In addition, sera from patients with sepsis directly induced histone-mediated cardiomyocyte death ex vivo (Alhamdi et al., 2015). In vivo studies on septic mice also confirmed the cause-effect relationship between circulating histones and the development of cardiac injury, arrhythmias and left ventricular dysfunction (Alhamdi et al., 2015). Therefore, the over-production of histones, either inside of cells or in the extracellular space, is toxic in mammalian cells. In the future, it will be interesting to test if increased levels of extracellular histones can also trigger cell death in plants.

The cytotoxicity caused by aberrant levels of histone proteins is normally avoided by regulating histone gene expression (Gunjan and Verreault, 2003). Correspondingly, other studies have also suggested that histone H3 gene expression is tightly regulated, either at the transcriptional or translational levels, in different tissue types and during various developmental stages (Reichheld et al., 1998; Marino-Ramirez et al., 2005; Forcob et al., 2014). Further study on the regulation mechanisms underlying histone H3 gene expression may help us gain a deeper understanding of the biological functions of histone H3 proteins.

The cell death phenotype observed in this study is similar to the hypersensitive response (HR) triggered by the interaction between plant disease resistance (R) proteins and their cognate effectors (Coll et al., 2011). The HR-like cell death is usually associated with ion fluxes across the plasma membrane and a burst of reactive oxygen species, such as H<sub>2</sub>O<sub>2</sub> and superoxide anion radicals. This leads to increased cytosolic Ca<sup>2+</sup> levels, activated protein kinase cascades, global transcriptional reprogramming, nuclear DNA cleavage, rapid cytoskeletal reorganization, organelle dismantling and eventually results in a rapid cell death (Pontier et al., 1998). In the future, it will be interesting to investigate if the cell death triggered by histone H3 genes activates signaling cascades that overlap with the HR-like cell death triggered by R proteins. In this study, we demonstrated that silencing of SGT1, which promotes HR-like cell death in response to plant pathogens, in N. tabacum could completely abolish the cell death phenotype triggered by PvH3s (**Figure 4A**). Therefore, SGT1 may also function in promoting histone H3-mediated cell death.

A previous study in mice has suggested that CENH3 fused with GFP may have altered protein function and thus may affect mouse embryo development (Kalitsis et al., 2003). For example, transgenic mice carrying the heterozygous CENPA-GFP/CENPA alleles were healthy, fertile and normal, whereas the mice carrying homozygous CENPA-GFP/CENPA-GFP alleles had delayed development and died during the embryo development stage (Kalitsis et al., 2003). It is possible that the homozygous transgenic mice have increased CENPA-GFP protein accumulation that is lethal. We speculate that the GFP fused to switchgrass CENH3 may also have resulted in a high accumulation of PvCENH3-GFP proteins, causing the cell death phenotype in the transient assay on N. benthamiana plants. However, further investigation is needed to understand the mechanism of cell death caused by CENPA-GFP in either mice or N. benthamiana.

Recent studies have shown that transgenic Arabidopsis plants expressing a chimeric GFP-tailswap protein (N-terminal tail of CENH3 was replaced by H3.3 N-terminal tail) display abnormal chromosome segregation during meiosis (Ravi and Chan, 2010; Lermontova et al., 2011). When the transgenic Arabidopsis carrying GFP-tailswap were crossed with a wild type plant, the GFP-tailswap derived genome was eliminated from the zygotes of some F<sub>1</sub> individuals, thus generating a high proportion of haploid (45%) and aneuploid (28%) progeny (Ravi and Chan, 2010). In this study, we also generated a chimeric switchgrass CENH3 gene by replacing the N-terminal tail of CENH3 with the PvH3.3 N-terminal tail. Our transient assay results showed that this chimeric gene was also able to trigger a cell death phenotype (**Figure 3E**). Therefore, we speculate that the Arabidopsis chimeric GFP-tailswap gene may have altered native gene expression levels or resulted in histone protein accumulation in transgenic Arabidopsis, which ultimately was lethal to the plant cells. Further studies are needed to fully investigate the mechanisms underlying GFP-tailswap toxicity in Arabidopsis.

In this study, we revealed that the overexpression of the N-terminal tail of histone PvH3.3 in the nucleus is essential and sufficient for triggering a cell death phenotype. Histone H3 proteins are important components of the nucleosome and play critical roles in the formation of higher-order chromatin (Dorigo et al., 2004; Kan et al., 2007; Sperling and Grunstein, 2009). Previous studies have demonstrated that histone N-terminal tails play an important role in the structure and stability of nucleosomes. The N-terminal truncation of histone H3 and CENH3 proteins enhances the transient unwrapping of DNA at the nucleosomal entry/exit regions, which disrupts some histone-DNA contacts and thus reduces chromatin stability (Biswas et al., 2011; Tachiwana et al., 2011; Iwasaki et al., 2013). We speculate that the overexpressed histone H3 N-terminal tail may compete with native histone H3s for interaction with the nucleosomes, thus causing chromosomal instability. This interference may also lead to aberrant gene expression and disrupt faithful DNA replication, ultimately resulting in a cell death phenotype. In the future, using ChIP-seq or RNA-seq to identify genes that are up- or down-regulated by the overexpression of histone H3 may allow us to understand how these proteins contribute to cell death.

### ACCESSION NUMBERS

The RNA-seq data used in this article can be found in the GenBank database under the following accession numbers: SRR3473343, SRR3473344, SRR3467193, SRR3467194, SRR3467195, SRR3467196, SRR3467197, and SRR3467198.

### AUTHOR CONTRIBUTIONS

BZ and XZ designed the research projects. JM and TF performed the experiments. JM, TF, LH, XZ, and BZ wrote the manuscript. All authors reviewed and edited the manuscript before submission.

### ACKNOWLEDGMENTS

The authors thank Drs. Hadrien Peyret and Lomonossoff for providing the vector pEAQ-HT-DEST3 used in this study. This project was partly supported by a grant from the Institute of Critical Technology and Applied Sciences (ICTAS) of Virginia Tech, the Bioprocess and Biodesign Research Center of the College of Agricultural and Life Science of Virginia Tech, a grant (2011-67009-30133) from the United States Department of Energy and National Institute of Food and Agriculture to BZ, the Foundation of American Electric Power, the National High Technology Research and Development Program (863 Program) of China (No. 2012AA101801-02) and the Virginia Agricultural Research Station (VA135872).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016.00979

Figure S1 | Alignment of exons and introns of switchgrass histone H3.3 variants. The positions of the three introns are marked with a black line above the DNA sequence. The conserved nucleotides are highlighted in red. The polymorphic nucleotides A and G are in blue, while C and T are in black.

Figure S2 | Subcellular localization of different histone H3-YFP fusion proteins. (A) PvH3.3-YFP localized predominantly to the nucleus of the transformed tobacco plant cells; (B) PvCENH3-YFP localized predominantly to the nucleus of the transformed tobacco plant cells; (C) YFP only (negative control) localized in both the cytosol and the nucleus of the transformed tobacco plant cells.

Figure S3 | Diagram of the binary vector pEG203-MyrYFP for expressing proteins fused with an N-terminal myristoylation signal peptide. A DNA fragment containing an N-terminal myristoylation signal peptide fused to the YFP gene was amplified by overlap PCR. The resulting fragment was inserted into pEarleyGate 203 to generate pEG203-MyrYFP. Myr, Myristoylation signal peptide; TEV, TEV protease cleavage site.

Figure S4 | Localization of different fragments outlined in Figure 3 of H3.3-YFP and CENH3-YFP fusion proteins in the transformed tobacco plant cells. (A) Fragment 1-YFP; (B) Fragment 2-YFP; (C) Fragment 3-YFP; (D) Fragment 4-YFP; (E) Fragment 5-YFP; (F) Fragment 6-YFP; (G) Fragment 7-YFP.

Table S1 | Primer sequences used in this study.

Table S2 | SNPs identified between cv. Alamo and cv. Dacotah. SNV, single nucleotide variant.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Miao, Frazier, Huang, Zhang and Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Identification, Characterization, and Expression Analysis of Cell Wall Related Genes in *Sorghum bicolor* (L.) Moench, a Food, Fodder, and Biofuel Crop

Krishan M. Rai<sup>1</sup>, Sandi W. Thu<sup>1</sup>, Vimal K. Balasubramanian<sup>1</sup>, Christopher J. Cobos<sup>1</sup>, Tesfaye Disasa<sup>1,2</sup> and Venugopal Mendu<sup>1</sup>\*

*<sup>1</sup> Department of Plant and Soil Science, Fiber and Biopolymer Research Institute, Texas Tech University, Lubbock, TX, USA, <sup>2</sup> National Agricultural Biotechnology Research Center, Ethiopian Institute of Agricultural Research, Addis Ababa, Ethiopia*

#### *Edited by:*

*Teresa Donze, West Chester University, USA*

#### *Reviewed by:*

*Chuang Ma, Northwest A&F University, China Liezhao Liu, Northwest A&F University, China*

> *\*Correspondence: Venugopal Mendu venugopal.mendu@ttu.edu*

#### *Specialty section:*

*This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science*

*Received: 28 May 2016 Accepted: 11 August 2016 Published: 31 August 2016*

#### *Citation:*

*Rai KM, Thu SW, Balasubramanian VK, Cobos CJ, Disasa T and Mendu V (2016) Identification, Characterization, and Expression Analysis of Cell Wall Related Genes in Sorghum bicolor (L.) Moench, a Food, Fodder, and Biofuel Crop. Front. Plant Sci. 7:1287. doi: 10.3389/fpls.2016.01287*

Biomass based alternative fuels offer a solution to the world's ever-increasing energy demand. With the ability to produce high biomass in marginal lands with low inputs, sorghum has a great potential to meet second-generation biofuel needs. Despite the importance of the sorghum crop in the biofuel and fodder industries, there is no comprehensive information available on the cell wall related genes and gene families (biosynthetic and modification). It is important to identify the cell wall related genes to understand the cell wall biosynthetic process as well as to facilitate biomass manipulation. Genome-wide analysis using gene family specific Hidden Markov Models of conserved domains identified 520 genes distributed among 20 gene families related to biosynthesis/modification of various cell wall polymers such as cellulose, hemicellulose, pectin, and lignin. Chromosomal localization analysis of these genes revealed that about 65% of cell wall related genes were confined to four chromosomes (Chr. 1–4). Further, 56 tandem duplication events involving 169 genes were identified in these gene families, which could be associated with expansion of genes within families in sorghum. Additionally, we identified 137 Simple Sequence Repeats related to 112 genes and target sites for 10 miRNAs in some important families such as cellulose synthase, cellulose synthase-like, and laccases. To gain further insight into potential functional roles, expression analysis of these gene families was performed using publicly available data sets in various tissues and under abiotic stress conditions. Expression analysis showed tissue specificity as well as differential expression under abiotic stress conditions. Overall, our study provides comprehensive information on cell wall related gene families in sorghum, which offers a valuable resource to develop strategies for altering biomass composition by plant breeding and genetic engineering approaches.

Keywords: cell wall polymers, cellulose, hemicellulose, lignin, pectin, plant biomass, sorghum, abiotic stress

### INTRODUCTION

Sorghum (Sorghum bicolor), a C4 grass species, is one of the world's important multipurpose cereal crops with uses in food, fodder, and biofuel industries. Its relatively small genome (∼730 Mbp) makes sorghum an ideal model bioenergy crop compared to other C4 crops such as switchgrass, sugarcane or miscanthus with bigger and more complex genomes. Grain sorghum is grown worldwide with an annual production of 62 million tons of grain from an estimated area of 42 million hectares (FAOSTAT data 2013; http://faostat3.fao.org). Biomass yield of energy sorghum (fodder sorghum) is twice that of grain sorghum due to a longer vegetative growth period and increased leaf area, which allow greater radiation interception and efficient conversion of the synthesized carbon into cell wall polysaccharides (Olson et al., 2012). Apart from being a source of food, livestock feed, and biofuel, sorghum is also a source of malt for brewing and various food industries (Taylor et al., 2006). With a high biomass yield potential (15–40 Mg/ha), sorghum can be used as a high value energy source in the second generation biofuel industry (Rooney et al., 2007). Due to its high adaptability to different environmental conditions such as drought, salinity, and water-logging, its ability to grow in marginal land areas, and its efficient light to biomass energy conversion and nitrogen utilization rates, sorghum has emerged as a favorite multipurpose crop in recent years (Taylor et al., 2010; Byrt et al., 2011).

The focus of second generation biofuels is to produce biofuels from lignocellulosic material, which is derived mainly from plant cell walls. Lignocellulosic material is mainly composed of cellulose (15–40%), hemicellulose (30–40%), lignin (20–30%), and pectins (Mendu et al., 2011b). The composition of these structural polymers varies between primary and secondary cell walls, among different tissues of an individual plant, and among different plant species (Mendu et al., 2011b; Welker et al., 2015). The primary cell wall (PCW) is present in all plant cell types whereas the secondary cell wall (SCW) is present in specific cell types such as tracheary elements (TE) and sclerenchymal cells. Cellulose, the most abundant polymer on earth, is a homopolymer of β-(1,4)-linked glucose monomers (Cosgrove, 2005; Somerville, 2006) whereas hemicelluloses are branched heteropolymers of pentose and hexose sugar monomers (Burton et al., 2010; Ochoa-Villarreal et al., 2012). Pectin, a complex polymer with an α-(1–4)-linked D-galacturonic acid backbone, is another polysaccharide which is mainly present in primary cell walls. Three classes of pectins, distinguished by the nature of the sugars on the branches, are known in plants: homogalacturonans (HG), rhamnogalacturonans-I (RG-I), and rhamnogalacturonans-II (RG-II; Burton et al., 2010). The cellulose microfibrils are cross-linked with various matrix polysaccharides such as hemicelluloses and pectins, thereby forming a complex polymeric network to maintain the cell wall strength (Cosgrove, 2005; Muthamilarasan et al., 2015). In addition to cellulose, hemicellulose and pectin, plant secondary cell walls are enriched with lignin. Lignin is a complex aromatic heteropolymer synthesized mainly from three canonical hydroxycinnamyl alcohol monomers viz. p-coumaryl (H), coniferyl (G), and sinapyl (S) alcohols (Boerjan et al., 2003; Vanholme et al., 2010; Welker et al., 2015).
Lignin is ester- and ether-linked with cellulose and hemicellulose polysaccharides in the plant cell walls with the help of ferulic acid (Harris and Trethewey, 2010). Callose, another glucan polymer (β-1,3-linked), is present in the cell walls of specialized structures involved in pollen development, cell wall formation during cytokinesis, and plasmodesmatal canals (Nedukha, 2015). Apart from developmental deposition, callose is deposited in response to various external stimuli including biotic and abiotic stresses (Chen and Kim, 2009; Muthamilarasan et al., 2015).

Cell wall biosynthesis, reassembly, and degradation are complex processes, which involve cell wall biosynthetic, modification, and degrading enzymes. Cellulose is synthesized by plasma membrane localized cellulose synthase complexes, while other matrix polysaccharides such as hemicelluloses and pectins are synthesized in the Golgi complex, followed by their transport and cross-linking/embedding, which involve cell wall biosynthetic, modifying and degrading enzymes. Cell wall hydrolyzing enzymes produced by bacteria, fungi, and nematodes (Rai et al., 2015) degrade plant cell walls to gain entry into the plant cell and access the sugars for their survival, while the cell wall hydrolyzing enzymes produced by plant cells are primarily involved in controlled cleavage of wall polymers to facilitate cell growth and elongation (Cosgrove, 2005). The Carbohydrate Active enZymes (CAZy; http://www.CAZy.org/) database broadly classifies cell wall enzymes into 135 families of Glycoside Hydrolases (GHs), 98 families of Glycosyl Transferases (GTs), 24 families of Polysaccharide Lyases (PLs), 16 families of Carbohydrate Esterases (CEs), and 13 families of Auxiliary Activities (AAs) enzymes based on the presence of protein catalytic or functional domains (Lombard et al., 2014). Some other web based databases such as Cell Wall Navigator (Girke et al., 2004) and Cell Wall Genomics (https://cellwall.genomics.purdue.edu/families/index.html) further classify these enzymes into different groups based on the biological processes in which they are involved.
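To give a sense of the relative sizes of the CAZy classes listed above, the short sketch below tallies the family counts exactly as quoted in the text (Lombard et al., 2014) and computes each class's share of the total. The counts are restated from the paragraph above and may differ from the current CAZy release; the variable names are assumptions introduced for illustration.

```python
# CAZy family counts per enzyme class, as quoted in the text
# (these reflect the cited source, not necessarily the current database).
cazy_families = {
    "Glycoside Hydrolases (GH)": 135,
    "Glycosyl Transferases (GT)": 98,
    "Polysaccharide Lyases (PL)": 24,
    "Carbohydrate Esterases (CE)": 16,
    "Auxiliary Activities (AA)": 13,
}

# Total number of families across all five classes.
total_families = sum(cazy_families.values())

# Percentage share of each class, rounded to one decimal place.
shares = {
    name: round(100 * count / total_families, 1)
    for name, count in cazy_families.items()
}

print("total families:", total_families)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

On these numbers, glycoside hydrolases and glycosyl transferases together account for most of the families, which is consistent with the emphasis the following paragraphs place on those two classes.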

Most of the enzymes involved directly in polysaccharide biosynthesis belong to the glycosyl transferases. Glycosyl transferases form glycosidic bonds by catalyzing the transfer of sugar moieties from donor to acceptor molecules (Scheible and Pauly, 2004). Cellulose microfibrils are synthesized exclusively by cellulose synthase A (CESA) protein complexes, which belong to the GT2 family of enzymes. Apart from CesA genes, plants also contain Cellulose synthase like (Csl) genes, which are involved in hemicellulose and other glucan biosynthesis (Lerouxel et al., 2006). Among the other hemicellulose biosynthetic enzymes, xyloglucan α-1,6-xylosyltransferases (GT34), xyloglucan fucosyltransferases (GT37), and xyloglucan galactosyltransferases (GT47) are involved in the synthesis of various xylan and xyloglucan molecules (Zhong and Ye, 2003; Del Bem and Vincentz, 2010; Vuttipongchaikij et al., 2012; Zabotina et al., 2012; Voiniciuc et al., 2015). The pectin biosynthetic galacturonosyltransferase (GT8) genes such as FRAGILE FIBER8, IRREGULAR XYLEM8, and IRREGULAR XYLEM9 are reported to be involved in glucuronoxylan biosynthesis (Lee et al., 2007; Yin et al., 2010). In addition to the regular cell wall polymers, callose, a β-1,3-glucan, is deposited by callose synthases (glucan synthase like; Gsls), which belong to the GT48 family (Farrokhi et al., 2006; Muthamilarasan et al., 2015). Integration of new polymers into the cell wall through the synergistic action of biosynthesis and wall loosening is essential to maintain wall integrity during cell elongation (Cosgrove, 2005). This loosening and reassembly is accomplished by the combined action of various degrading enzymes such as glycoside hydrolases (Buchanan et al., 2012; Glass et al., 2015; Wei et al., 2015), pectin lyases (Jiang et al., 2013), and xyloglucan endotransglucosylases/hydrolases (XTH; Rose et al., 2002; Nishitani and Vissenberg, 2007), and of cell wall loosening proteins such as expansins (Cosgrove, 2015; Marowa et al., 2016) and yieldins (Okamoto-Nakazato et al., 2001). In sorghum, a total of 12 CesA and 36/37 Csl genes have been reported in previous studies (Paterson et al., 2009; Yin et al., 2009). Characterization of the sorghum (1,3;1,4)-β-glucan biosynthetic gene subfamilies CslF and CslH showed that CslF6 plays an important role in elongating cells, while CslH3 has a major role in cells that have stopped growing and started depositing storage compounds (Ermawar et al., 2015a). In a recent study, genes encoding cellulose, lignin, and glucuronoarabinoxylan biosynthetic enzymes were shown to be dynamically expressed during the different developmental stages of sorghum (McKinley et al., 2016). The expansin and XTH encoding genes were also shown to be differentially expressed in the growing stem internodes of sorghum. One of the glycoside hydrolase gene families, endo-(1,4)-β-glucanase (GH9), has been studied across 5 grass genomes, and 24 members were reported from sorghum (Buchanan et al., 2012).

The focus of second-generation biofuel production from plant biomass is to utilize the sugars from lignocellulosic material for biofuels, in particular for bioethanol production. To utilize lignocellulosic biomass for bioethanol production, the cell wall polysaccharides need to be separated from lignin and hydrolyzed by polysaccharide degrading enzymes to produce fermentable sugars, a process called saccharification (Lin and Tanaka, 2005). The presence of interlinked lignin around cell wall polysaccharides contributes to biomass recalcitrance by hindering enzyme access to the polysaccharides (Ermawar et al., 2015b). Separation of lignin from other cell wall polysaccharides requires pretreatment with concentrated acids at high temperatures. In addition, the hydroxyl groups in the cellulose units allow intra- and intermolecular hydrogen bonding, which makes the structure more crystalline. Decreasing the lignin content, reducing the cellulose crystallinity, or both will improve saccharification efficiency. A comparative analysis of lignin biosynthesis related gene families has been performed across the plant kingdom, including sorghum (Xu et al., 2009). In sorghum, several mutants (bmr, brown midrib; and rg, red for green) with reduced lignin content showed increased saccharification and digestibility compared to control plants (Palmer et al., 2008; Xin et al., 2008; Saballos et al., 2009; Yan et al., 2012; Petti et al., 2013; Sattler et al., 2014). Among these bmr mutants, several loci have been identified, including bmr2 encoding 4-coumarate:coenzyme A ligase (4CL), bmr6 encoding cinnamyl alcohol dehydrogenase (CAD), and bmr12 and bmr18 encoding caffeic acid O-methyltransferase (COMT) enzymes of the monolignol pathway (Saballos et al., 2009; Sattler et al., 2009; Scully et al., 2016). Sorghum biomass digestibility and saccharification efficiency can be further improved by targeting various genes involved in lignin biosynthesis together with genes that alter cellulose crystallinity. Apart from lignin related gene families, CesA, Csls, and Gsls are among the most studied cell wall related gene families in sorghum. As research on the function of cell wall genes in the model plant Arabidopsis advances, it is now essential to identify and characterize the cell wall related genes in sorghum in order to engineer sorghum biomass for food, feed, biofuel, and bioproduct applications.

The present study focuses on mining the publicly available S. bicolor genome for identification and comprehensive analysis of gene families involved in the biosynthesis of cell wall biopolymers. In addition, various other gene families involved in degradation and reassembly of cell walls were analyzed. Further, phylogenetic analysis, physical mapping, and duplication analysis of the identified genes were performed to gain insight into the relationships among the genes and their origin. All the identified genes were also analyzed for the presence of SSR markers and miRNA target sites for molecular breeding and biotechnological applications. Publicly available transcriptome datasets from various tissues were analyzed to study the expression pattern of these genes. Furthermore, to understand the expression pattern of cell wall related gene families under abiotic stress conditions, differential expression analysis of tissues treated with exogenous abscisic acid (ABA) and polyethylene glycol (PEG) was also performed. The identification and analysis of cell wall related gene families in the present study should help the research community plan effective strategies for more efficient utilization of biomass for various applications.

### MATERIALS AND METHODS

### Data Retrieval and Identification of Cell Wall Related Gene Families

Publicly available gene, protein, and chromosome sequences were downloaded from the Phytozome 11 database (https://phytozome.jgi.doe.gov/pz/portal.html#; Goodstein et al., 2012) for the identification and analysis of cell wall related gene families in sorghum. Protein sequences from other plants were downloaded from the Cell Wall Navigator database (Girke et al., 2004) to build family specific HMM profiles using the HMMER v3.1b1 package (http://www.ebi.ac.uk/Tools/hmmer/). We first performed multiple alignment of the downloaded family specific sequences using Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/) and saved the output alignments as \*.stockholm files. Using each family specific \*.stockholm alignment file as input for the hmmbuild program, we built the family specific HMM profiles (Data Sheet 1). The sorghum proteome was screened with HMMER, using default parameters, to identify protein sequences belonging to the various cell wall related families. All the identified proteins were screened for the presence of their characteristic Pfam domains. The successful candidate proteins were further verified for the presence of conserved domains using NCBI's Conserved Domain Database (CDD; http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi; Marchler-Bauer et al., 2011). Additionally, Arabidopsis cell wall related proteins were used as queries to search the sorghum genome using blastP with an e-value of 10<sup>−5</sup>, and the hits were further validated with a CDD search. The protein sequences identified by the HMMER analysis were compared to the blast identified proteins to prepare a gene family specific non-redundant gene list. The coding and amino acid sequences of all the identified members were retrieved from the sorghum genome dataset obtained from the Phytozome database and used for further analysis. The molecular weight and pI-values of all the identified proteins were calculated using the online tool Compute pI/Mw (http://web.expasy.org/compute\_pi/; Gasteiger et al., 2005).
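The comparison of HMMER and blastP hits described above amounts to a set union after applying the e-value cutoff. The following is a minimal sketch of that step, not the pipeline actually used: it assumes the blastP results were exported in the standard 12-column tabular format (subject ID in column 2, e-value in column 11), and the function names and sequence IDs are hypothetical.

```python
def blast_subjects(tabular_lines, max_evalue=1e-5):
    """Collect subject IDs from BLAST tabular rows that pass the e-value cutoff.

    Assumes the standard 12-column tabular layout (an assumption, not stated
    in the text): column 2 is the subject ID and column 11 is the e-value.
    """
    hits = set()
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed rows
        if float(fields[10]) <= max_evalue:
            hits.add(fields[1])
    return hits


def merge_hits(hmmer_ids, blast_ids):
    """Return a sorted, non-redundant list of IDs found by either search."""
    return sorted(set(hmmer_ids) | set(blast_ids))


# Hypothetical example rows and IDs, for illustration only.
rows = [
    "queryA\tprotein1\t80.0\t100\t0\t0\t1\t100\t1\t100\t1e-30\t200",
    "queryB\tprotein2\t30.0\t50\t0\t0\t1\t50\t1\t50\t0.01\t30",
]
merged = merge_hits(["protein3", "protein1"], blast_subjects(rows))
print(merged)
```

Candidates in the merged list would still need the Pfam and CDD domain checks described above before being assigned to a family.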

### Phylogeny, Physical Mapping, and Duplication Analysis of Cell Wall Related Genes

The protein sequences of each family were aligned using the ClustalW program of the MEGA v6 package (Tamura et al., 2013). A separate phylogenetic tree was constructed for each gene family with MEGA v6 using the neighbor-joining method, and a bootstrap test was performed with 1000 iterations. To map the physical locations of the identified cell wall related genes on sorghum chromosomes, their genomic coordinates and chromosome numbers were retrieved from the annotation file (.gff) downloaded from the Phytozome database. The physical localization of genes was drawn using the Mapchart 2.30 software (Voorrips, 2002). Furthermore, all the identified genes were analyzed for tandem duplications within the genome using the Plant Genome Duplication Database (http://chibba.agtec.uga.edu/duplication/) dataset (Lee et al., 2012).
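Retrieving gene coordinates from the annotation file can be sketched as follows. This is an illustrative example only: it assumes a GFF3-style file with nine tab-separated columns and gene identifiers stored in a `Name` or `ID` attribute, and the function name, chromosome labels, and gene IDs are hypothetical.

```python
from collections import defaultdict


def gene_positions(gff_lines, wanted_ids):
    """Map chromosome -> sorted list of (start, end, gene_id) for requested genes."""
    by_chrom = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and empty lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep gene features only
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        gene_id = attrs.get("Name") or attrs.get("ID")
        if gene_id in wanted_ids:
            by_chrom[cols[0]].append((int(cols[3]), int(cols[4]), gene_id))
    return {chrom: sorted(genes) for chrom, genes in by_chrom.items()}


# Hypothetical annotation lines, for illustration only.
example = [
    "##gff-version 3",
    "Chr01\tsource\tgene\t100\t900\t.\t+\t.\tID=geneA.v1;Name=geneA",
    "Chr01\tsource\tmRNA\t100\t900\t.\t+\t.\tID=geneA.1;Parent=geneA.v1",
    "Chr02\tsource\tgene\t50\t400\t.\t-\t.\tID=geneB.v1;Name=geneB",
    "Chr01\tsource\tgene\t20\t80\t.\t+\t.\tID=geneC.v1;Name=geneC",
]
positions = gene_positions(example, {"geneA", "geneC"})
print(positions)
```

The per-chromosome lists produced this way are the kind of input a mapping tool needs, and their lengths give the gene counts per chromosome.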

### Identification of SSR Markers in Cell Wall Related Genes

Coding sequences of all the cell wall related genes were searched for SSRs using the microsatellite identification tool (MISA, http://pgrc.ipk-gatersleben.de/misa/misa.html). The search criterion was a minimum of five repeat units for dinucleotide (DNRs), trinucleotide (TNRs), tetranucleotide (TtNRs), pentanucleotide (PNRs), and hexanucleotide (HNRs) repeats. Mononucleotide repeats (MNRs) were excluded from the analysis. The maximum distance between two markers in a compound microsatellite was set to 100 bp.
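The search criteria above can be illustrated with a small regular-expression scan. This is a simplified sketch and not MISA itself: the function names are hypothetical, compound SSRs are not handled, and skipping motifs that are themselves repeats of a shorter unit (for example `AA` or `ATAT`) is a design choice made here so that homopolymer runs are not reported as dinucleotide repeats.

```python
import re

MIN_REPEATS = 5               # minimum number of repeat units, as stated in the text
MOTIF_LENGTHS = range(2, 7)   # di- to hexanucleotide; mononucleotides are excluded


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeats) tuples; 0-based, end-exclusive."""
    seq = seq.upper()
    found = []
    for k in MOTIF_LENGTHS:
        # A k-mer followed by at least MIN_REPEATS - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, MIN_REPEATS - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                found.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(found)


# Made-up sequence: (AG)6, (CAG)5, and a poly-A run that must be ignored.
demo = "TTT" + "AG" * 6 + "CCC" + "CAG" * 5 + "A" * 12
ssrs = find_ssrs(demo)
print(ssrs)
```

Two SSRs reported by such a scan that lie within 100 bp of each other would correspond to a compound microsatellite under the setting described above.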

### *In silico* Prediction of miRNA Target Sites on Cell Wall Related Genes

The identified cell wall related genes from individual families were analyzed for the presence of miRNA target sites using the psRNATarget server (http://plantgrn.noble.org/psRNATarget/; Dai and Zhao, 2011). A maximum expectation value of 3.0 was used, with all other parameters left at their defaults.

### Expression Analysis of Cell Wall Related Genes at Various Developmental Stages of Sorghum

Publicly available transcriptome datasets from different developmental stages of sorghum (stem, 20 days old leaves, vegetative meristem, floral meristem, spikelet, flowers, embryos, and seeds) were downloaded from NCBI's Short Read Archive (SRA) database (http://www.ncbi.nlm.nih.gov/sra). All the transcriptome datasets were mapped to the cell wall related genes using the QSeq program of the DNASTAR Lasergene package (http://www.dnastar.com/t-nextgen-qseq.aspx). For mapping, only the 520 gene sequences belonging to the cell wall gene families were used as reference. Transcript abundance was visualized for individual gene families as hierarchically clustered heat maps generated with MeV (http://www.tm4.org/mev.html), using the self-normalized RPKM (Reads Per Kilobase per Million reads) values calculated by the QSeq program.
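For reference, the commonly used definition of RPKM can be written as a short function. This is a sketch of the standard formula only; how QSeq computes its "self-normalized" values is not described here, so any equivalence is an assumption, and the function name and example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp gene in a library of 10 million reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

Because the reads were mapped only to the 520 cell wall genes, the library size used for normalization matters for how the resulting values compare with genome-wide RPKM estimates.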

### Differential Gene Expression Analysis under Various Abiotic Stress Conditions

The role of the identified cell wall related sorghum genes under abiotic stress conditions (exogenous ABA and PEG induced osmotic stress) in root and shoot was analyzed using publicly available transcriptome datasets (Dugas et al., 2011). In brief, the published experiments were performed by germinating S. bicolor BTx623 seeds and treating the seedlings on the 8th day after germination with 20 µM ABA (dissolved in NaOH), 57.1 µM NaOH (control for ABA), 20% PEG-8000, or Milli-Q water (control for PEG treatment). After 27 h of treatment, total RNA was extracted from the shoots and roots in three biological replicates and sequenced using the Illumina platform. The respective datasets of stress treated tissues and controls were downloaded from the SRA database of NCBI (Dugas et al., 2011). The expression pattern of cell wall genes was analyzed using the QSeq program of the DNASTAR Lasergene package with the self-normalized RPKM method. Fold change was calculated using the RPKM values of H2O and NaOH treated root and shoot tissues as controls for PEG and ABA, respectively. All the differentially expressed genes were analyzed for statistical significance using Student's t-test with multiple-hypothesis testing at a threshold of less than 0.05. Significantly differentially expressed genes (fold change ≥2.0, p < 0.05) from the different stress conditions were used to find genes commonly up or down regulated in root and shoot using the online tool Venny 2.1 (http://bioinfogp.cnb.csic.es/tools/venny/).
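The filtering and overlap steps described above can be sketched as follows. This is an illustration under stated assumptions, not the published workflow: p-values are taken as given inputs, a tiny constant is added to avoid division by zero (an assumption, not mentioned in the text), and the function names, gene IDs, and numbers are hypothetical.

```python
def classify(treated_rpkm, control_rpkm, p_value, min_fc=2.0, alpha=0.05, eps=1e-9):
    """Label a gene 'up', 'down', or None using fold change >= 2.0 and p < 0.05."""
    if p_value >= alpha:
        return None
    ratio = (treated_rpkm + eps) / (control_rpkm + eps)  # eps guards against zero RPKM
    if ratio >= min_fc:
        return "up"
    if ratio <= 1.0 / min_fc:
        return "down"
    return None


def common_genes(root_calls, shoot_calls, direction):
    """Genes with the same call in root and shoot (the overlap of a two-set Venn diagram)."""
    root = {gene for gene, call in root_calls.items() if call == direction}
    shoot = {gene for gene, call in shoot_calls.items() if call == direction}
    return root & shoot


# Hypothetical (treated RPKM, control RPKM, p-value) triples per gene.
root_data = {"gene1": (40.0, 10.0, 0.01), "gene2": (5.0, 20.0, 0.01), "gene3": (15.0, 10.0, 0.01)}
shoot_data = {"gene1": (30.0, 10.0, 0.02), "gene2": (50.0, 20.0, 0.20), "gene3": (9.0, 10.0, 0.01)}

root_calls = {gene: classify(*values) for gene, values in root_data.items()}
shoot_calls = {gene: classify(*values) for gene, values in shoot_data.items()}
shared_up = common_genes(root_calls, shoot_calls, "up")
print(root_calls, shoot_calls, shared_up)
```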

### RESULTS

### Identification of Cell Wall Related Genes from Sorghum

Cell wall related gene families have been shown to play crucial roles in various biological processes related to plant development and to biotic and abiotic stress responses (Hamann, 2012; Lombard et al., 2014; Le Gall et al., 2015). Lignin biosynthetic gene families of sorghum have been analyzed elsewhere (Xu et al., 2009); hence, the present study focused on the gene families involved in other cell wall related processes such as polysaccharide synthesis, reassembly, and degradation (Girke et al., 2004). Additionally, previously unreported laccase genes, which are involved in lignin biosynthesis, were also analyzed in the present study. All 47,205 protein-coding transcripts and proteins from the publicly available S. bicolor genome were downloaded and analyzed. The HMMER search identified a total of 520 genes from 20 cell wall related gene families, with an average of 26 genes per family (**Table 1**, **Figure 1**, and Supplementary Table 1). Among the analyzed gene families, expansin with 83 members was the largest gene family, whereas rhamnogalacturonan I lyases (CAZy ID: PL4) was the smallest family with 6 members (**Table 1** and **Figure 1A**). According to the CAZy distribution, the identified genes were classified into glycosyl transferases (160), glycoside hydrolases (201), pectin/pectate lyases (16), carbohydrate esterases (35), auxiliary activity (25), and expansin (83) families (**Table 1**).
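As a quick consistency check, the class totals quoted above can be summed to confirm the overall gene count and the per-family average. The numbers are taken directly from the text; the variable names are illustrative only.

```python
# Gene counts per CAZy class, as reported in the text.
cazy_class_counts = {
    "glycosyl transferases": 160,
    "glycoside hydrolases": 201,
    "pectin/pectate lyases": 16,
    "carbohydrate esterases": 35,
    "auxiliary activity": 25,
    "expansin": 83,
}
n_families = 20  # number of gene families analyzed, as reported in the text

total_genes = sum(cazy_class_counts.values())
average_per_family = total_genes / n_families
print(total_genes, average_per_family)
```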

### I. Cell Wall Polysaccharide Biosynthetic Gene Families of Sorghum

Plant cell wall biosynthesis is a complex process involving a plethora of enzymes and resulting in a vast variety of cross-linked cell wall polysaccharides. The majority of these enzymes belong to the glycosyl transferase superfamily, which is involved in the synthesis of cellulose, hemicellulose, pectin, and callose.

### **Cellulose biosynthetic genes**

Genome-wide analysis of sorghum showed the presence of 11 CesA genes. In contrast to the 12 CesA genes reported in previous studies (Paterson et al., 2009; Yin et al., 2009), the present study identified only 11 SbCesA genes using conserved domain profile based HMMER scanning of the sorghum genome (**Table 1** and **Figure 1B**). Further investigation showed that, according to the updated sorghum genome (v3.1, Phytozome), two previously predicted CesA genes (Sb03g004310.1 and Sb03g004320.1) are in fact a single gene (Sobic.003G049600.2). In addition, domain analysis of these 11 CESA proteins showed the presence of the canonical cellulose synthase (CS, PF03552), zinc-binding RING-finger (ZF, PF14569), and glycosyl transferase 2 (GT2, PF13632) domains (**Figure 1B**). The majority of the CESAs (9/11) contained all canonical domains, while Sobic.003G296400.1 lacked the ZF domain and Sobic.010G183700.1 lacked both the ZF and GT2 domains. Eukaryotic CesA genes were first cloned from cotton (Pear et al., 1996) and were later reported from Arabidopsis (10), maize (12), poplar (18), and foxtail millet (14) (Richmond and Somerville, 2000; Appenzeller et al., 2004; Djerbi et al., 2005; Muthamilarasan et al., 2015). In the cluster analysis, CESA proteins clustered with CSLD and CSLF proteins (**Figure 2**), consistent with earlier reports (Ermawar et al., 2015a). This clustering is due to the presence of common conserved domains among the cellulose synthase and cellulose synthase like families of proteins. Chromosomal distribution of the CesA genes showed 4 and 3 genes on chromosomes 2 and 1, respectively, while the remaining four are present on chromosomes 3 (2 genes), 9 (1 gene), and 10 (1 gene; **Figure 4A**).

### **Hemicellulose biosynthetic genes**

The sorghum genome contains four hemicellulose biosynthetic enzyme families, i.e., cellulose synthase like (Csl; GT2), xyloglucan xylosyltransferases (XXT; GT34), xyloglucan fucosyltransferases (MUR2, XFT; GT37), and xyloglucan galactosyltransferases (MUR3, XGT; GT47). A total of 104 genes representing the Csl (36), XXT (12), MUR2 (19), and MUR3 (37) gene families were identified (**Table 1**, **Figure 1A**, and Supplementary Table 1). Further, domain analysis of these genes revealed the presence of the CS (PF03552) domain in CSL, the GT34 (PF05637) domain in XXT, the XG\_Ftase (PF03254) domain in MUR2, and the exostosin (PF03016) domain in MUR3 family proteins (**Figure 1B**). Phylogenetic analysis of CSL proteins with Arabidopsis CSLs clustered them into 6 sub-families, namely CSLA with 8 members, CSLC, CSLD, and CSLE with 5 each, CSLF with 10, and CSLH with 3 (**Figure 2**). Phylogenetic analysis of the other hemicellulose related gene families XXT (GT34) and XFT (GT37) showed uniform clustering with their Arabidopsis homologs (**Figures 3A,B**). Phylogenetic analysis of sorghum XGT (GT47) members with Arabidopsis homologs further clustered them into 5 subfamilies, A, B, C, D, and E, with 13, 7, 3, 4, and 10 members, respectively (**Figure 3F**). Physical mapping of the Csl family genes showed their distribution over all chromosomes except Chr. 5 (**Figure 4A**). The largest group of Csl genes, roughly one-third of the family (13), was located on Chr. 2. Physical mapping of GT34 family members showed their distribution over six chromosomes (Chr. 1, 2, 3, 4, 5, and 8), whereas GT37 members were present on five chromosomes (Chr. 2, 4, 6, 8, and 10) with a maximum of 8 genes on Chr. 4 (**Figure 4A**). Members of GT47, another hemicellulose specific family, were distributed on all the chromosomes except Chr. 5, with about one-third (12) located on Chr. 1 (**Figure 4A**).

### **Pectin biosynthetic genes**

A total of 33 genes were identified as members of homogalacturonan α-1,4-galacturonosyltransferases (GAUT), a GT8 family involved in pectin biosynthesis. The presence of the glycosyl\_transferase\_8 domain (PF01501) in the pfam analysis confirms the annotation of these genes as GT8 members (**Figure 1B**). Sorghum GAUT members were further clustered into 5 sub-families (A–E) based on phylogenetic analysis with Arabidopsis homologs (**Figure 3D**). Among the sub-families, D was the largest with 19 members, followed by C (6), A (5), B (2), and the smallest sub-family, E, with a single member (**Figure 3D**). GT8 members were distributed on all the chromosomes, with 8 members present on Chr. 1 (**Figure 4A**).

### **Lignin biosynthetic genes**

Sorghum lignin biosynthetic genes, with the exception of the laccase family, were previously analyzed along with those of other species (Xu et al., 2009). Here, we analyzed the laccase gene family, which is involved in lignin biosynthesis. Laccases belong to the CAZy AA1 class of enzymes, which play an important role particularly in lignin metabolism. A total of 25 sorghum genes were identified as laccase family genes based on the presence of three copper containing conserved domains, namely Cu\_oxidase (PF00394), Cu\_oxidase\_2 (PF07731), and Cu\_oxidase\_3 (PF07732) (**Table 1** and **Figure 1B**). Phylogenetic analysis showed that some sorghum laccase proteins grouped with Arabidopsis proteins, whereas others formed distinct clusters (**Figure 3E**). Physical mapping of laccase genes showed their presence on all the chromosomes except Chr. 2, 6, and 7, with a maximum of 9 genes on Chr. 3 (**Figure 4A**).

### **Other cell wall biosynthetic genes**

The glucan synthase-like (Gsl) gene family, a GT48 family involved in biosynthesis of the specialized polysaccharide callose, was found to have 12 members in the sorghum genome based on the glucan\_synthase (PF02364) conserved domain (**Figure 1B**). Phylogenetic analysis of SbGSL proteins showed their uniform distribution with Arabidopsis and rice homologs (**Figure 3C**). Physical mapping of these genes showed that their distribution is limited to Chr. 1, 3, 4, and 10, with a maximum of 4 genes on Chr. 3 (**Figure 4A**).

### II. Gene Families Involved in Cell Wall Reassembly and Degradation

Apart from cell wall biosynthetic enzymes, gene families involved in dynamic and complex cell wall extension and reassembly processes such as controlled degradation, loosening, and reassembly of cell wall polymers were also analyzed in the present study. A total of 335 genes were identified from 12 gene families that are involved in cell wall modifications.

### **Cell wall loosening gene families**

A total of 83 and 24 genes were identified as members of the expansin and yieldin gene families, respectively, which are primarily involved in cell wall loosening (**Table 1**). The expansins were identified based on the presence of the conserved domains DPBB\_1 (PF03330) and pollen\_allerg\_1 (PF01357) observed in the pfam domain analysis (**Figure 1B**). Phylogenetic analysis of these proteins classified them into two major clusters, Expansin-A with 40 proteins and Expansin-B with 43 proteins (Supplementary Figure 1A). Physical mapping and distribution analysis of these genes showed their presence on all the chromosomes except Chr. 5 and 8 (**Figure 4B**). Chromosome 1 carried the largest number of expansin genes (34). Pfam domain profiling of another cell wall loosening protein family, the yieldins (GH18), showed the presence of the glyco\_hydro\_18 (PF00704) conserved domain (**Figure 1B**). Phylogenetic analysis showed that sorghum yieldins are more similar to rice yieldins than to Arabidopsis proteins, indicating conservation of yieldins among monocots (Supplementary Figure 1B). Further, yieldins were distributed on six sorghum chromosomes, namely Chr. 1, 2, 3, 5, 6, and 7, with a maximum of 8 genes on Chr. 5 (**Figure 4B**). Xyloglucan endotransglucosylases/hydrolases (XTHs), another important group of cell wall loosening proteins, have the dual role of hydrolyzing and extending the existing cell wall. A total of 35 sorghum genes were identified as XTH family members based on the observed conserved domains glyco\_hydro\_16 (PF00722) and XET\_C-term (PF06955) (**Table 1** and **Figure 1B**). Phylogenetic analysis of XTH proteins classified them into 3 sub-families: subfamily A with 6 genes, B with 19 genes, and C with 10 genes (Supplementary Figure 1C). Physical mapping of these genes showed their distribution over six chromosomes (Chr. 1, 2, 4, 6, 7, and 10), with a maximum of 7 genes on Chr. 10 (**Figure 4B**).

### **Glycoside hydrolases**

Among the identified cell wall modifying gene families in sorghum, there are 7 GH gene families (GH9, GH10, GH16, GH17, GH18, GH28, and GH35) with 201 genes (**Table 1**). Of these 7 families, GH16 and GH18 have also been classified as cell wall loosening proteins. Among the remaining GH families, GH17 (glucan 1,3-β-glucosidases) was the largest with 54 genes, followed by GH28 (polygalacturonases) with 38 genes, GH9 (endo-1,4-β-glucanases) with 26 genes, GH35 (β-galactosidases) with 13 genes, and GH10 (endo-xylanases) with 11 genes (**Table 1**). Further, Pfam domain analysis showed the presence of the conserved Glyco\_hydro\_9 (PF00759) domain in GH9, Glyco\_hydro\_10 (PF00331) and CBM\_4\_9 in GH10, Glyco\_hydro\_17 (PF00332) and X8 (PF07983) in GH17, Glyco\_hydro\_28 (PF00295) and pectate\_lyase\_3 (PF12708) in GH28, and Glyco\_hydro\_35 (PF01301) in GH35 (**Figure 1B**). Phylogenetic analysis of these gene families further classified GH17 and GH28 into 5 and 7 sub-families, respectively, whereas GH9, GH10, and GH35 clustered with Arabidopsis proteins without any sub-classification (Supplementary Figures 1D–G). GH17 was clustered into sub-families A with 13 genes, B with 13 genes, C with 16 genes, D with 9 genes, and E with 3 genes (Supplementary Figure 1F). GH28 was sub-clustered into 7 sub-families, namely A with 12 genes, C with 6 genes, D with 7 genes, E with 4 genes, F with 5 genes, and G with 2 genes (Supplementary Figure 1G); however, no sorghum proteins were found in sub-family B. Physical distribution of these genes on the sorghum genome showed the presence of GH9 genes on all chromosomes except Chr. 5 and 8, GH10 genes on five chromosomes (Chr. 1, 2, 3, 4, and 6), GH17 genes on all the chromosomes, GH28 genes on all the chromosomes except Chr. 8, and GH35 genes on all the chromosomes except Chr. 5 and 6 (**Figure 4B**).

### **Pectin modifying enzymes**

Among the identified cell wall related genes, 51 genes were classified as members of 4 gene families (2 pectin lyases and 2 pectin esterases) involved in pectin modification. The two pectin related lyase (PL) families, namely pectate and pectin lyases (PL1) and rhamnogalacturonan I lyases (PL4), were found to have 10 and 6 genes, respectively (**Table 1**). Conserved domain analysis of these proteins revealed the presence of the Pec\_lyase\_C (PF00544) and Rhamno\_gal\_lyase (PF06045) domains in PL1 and PL4, respectively (**Figure 1B**). An additional Pec\_lyase\_N (PF04431) domain was found in the PL1 family, while CBM\_like (PF14683) and Fn3\_3 (PF14686) domains were seen in PL4 family members. Phylogenetic analysis of PL1 family members along with Arabidopsis and rice homologs showed further clustering into three sub-families. PL1 sub-family B was the largest with 8 genes, whereas sub-families A and C had one gene each (Supplementary Figure 1I). Phylogenetic analysis of PL4 genes showed more similarity to rice PL4 genes than to those of Arabidopsis (Supplementary Figure 1J). Chromosomal distribution of these genes showed the presence of PL1 genes on six chromosomes (Chr. 1, 3, 4, 6, 8, and 10), whereas PL4 genes were limited to 4 chromosomes (Chr. 5, 7, 8, and 9) (**Figure 4B**).

Pectin esterases are another class of enzymes involved in cell wall reassembly. A total of 23 genes were identified as PME (CE8) homologs based on the conserved pectin esterase (PF01095) domain, whereas 12 genes were identified as PAE (CE13) family members based on the pectin acetyl esterase domain (PF03283) (**Table 1** and **Figure 1B**). Phylogenetic analysis showed that sorghum PME proteins were evenly distributed among Arabidopsis PME proteins, whereas sorghum PAE proteins, as expected, showed more similarity to rice PAEs than to Arabidopsis PAEs (Supplementary Figures 1K,L). PME genes were distributed on all 10 sorghum chromosomes with a maximum of 5 genes on Chr. 3, whereas PAE family members were limited to 6 chromosomes (Chr. 1, 2, 3, 4, 6, and 9) with a maximum of 7 genes on Chr. 3 (**Figure 4B**).

### Chromosomal Localization and Duplication Analysis

Chromosomal localization of the identified cell wall related genes on the 10 sorghum chromosomes was performed using the Mapchart 2.30 mapping software (**Figures 4A,B**). Approximately 65% (336) of the cell wall related genes were present on 4 chromosomes, namely chromosome 1 with 22.3% (116), chromosome 3 with ∼16% (83), chromosome 2 with 13.2% (69), and chromosome 4 with 13.1% (68). The remaining genes were distributed over the other six chromosomes, with a minimum of 15 genes on chromosome 8 (**Figure 4**). Further, all the cell wall related families were analyzed for tandem duplication within the respective gene families to study their expansion. Out of the 20 gene families analyzed, 56 tandem duplication events involving 169 genes were observed in 17 families (**Figures 4A,B**). No tandemly duplicated genes were observed in the CesA, glucan synthase, and β-galactosidase gene families. The expansin gene family had the highest number of tandem duplication events (14), involving 51 genes. Among the other gene families, MUR3 (5 events/15 genes), XTH (5 events/13 genes), MUR2 (4 events/15 genes), GH28 (4 events/9 genes), yieldins (4 events/14 genes), laccases (4 events/10 genes), GH17 (3 events/10 genes), XXT (3 events/6 genes), and PAE (2 events/5 genes) showed a considerable number of duplications. The Csl gene family also had 2 duplication events involving 9 genes. Chromosome 1 had the maximum number (14) of tandem duplications of the cell wall gene families, followed by chromosome 3 (9 duplications), chromosome 4 (8 duplications), and chromosome 2 (6 duplications). Only one tandem duplication event among cell wall genes was observed on chromosome 8.

### Cell Wall Related Genes with SSR Markers in Sorghum

Microsatellites, or SSR markers, are short tandem DNA repeats that form one of the most efficient classes of molecular markers owing to their genome-wide distribution and high level of polymorphism. Expressed or coding sequence derived SSRs (ESSRs) have been reported to be more conserved than genomic SSRs (Guo et al., 2006), which makes ESSRs an important tool for marker assisted selection in plant breeding programs. Considering the importance of cell wall related genes in developing sorghum mutants for biofuel applications, we analyzed all of the identified cell wall related genes for the presence of SSR markers. Out of 520 genes, 112 genes were identified with 137 SSRs (125 simple and 6 compound; **Figure 5A**, Supplementary Tables 2, 3). Among the identified SSRs, tri-nucleotide repeats (TNRs) were the most abundant with 111 occurrences, followed by 24 DNRs. The identified SSRs were present in all the 20 families analyzed, with the highest representation in the Csl (16) and xyloglucan galactosyltransferase (16) gene families (**Figure 5B**).

### Cell Wall Related Genes with Putative miRNA Target Sites in Sorghum

MicroRNAs (miRNAs) are small, conserved non-coding RNA molecules that are known to regulate gene expression at the transcriptional and post-transcriptional levels. To understand the potential roles of miRNAs in regulating cell wall related gene expression, all 520 cell wall genes from the various families were analyzed for the presence of miRNA target sites. A total of 10 genes were found to have miRNA target sites, of which 6 belong to laccase (Sobic.001G422300.1, Sobic.003G352700.1, Sobic.003G352800.1, Sobic.003G353200.1, Sobic.005G198500.1, and Sobic.009G162800.1), 2 belong to Gsl (GT48) (Sobic.003G298900.1 and Sobic.004G107800.1), and one each to CesA (Sobic.003G049600.2) and Csl (Sobic.008G125700.1) (Supplementary Table 4). Six different miRNA families (miR156, miR164, miR397, miR528, miR5566, and miR6230) were identified as targeting these cell wall related genes.

### Expression Profile of Cell Wall Related Genes in Different Organs of Sorghum Plant

The availability of whole transcriptome data online presented an excellent opportunity to identify candidate genes that play key roles in specific organs during sorghum development. The information on candidate genes can be further used to engineer cell walls in a cell/tissue specific manner to meet various industrial needs, particularly in the biofuel/feed industry. Publicly available whole transcriptome datasets (Supplementary Table 5) were used to analyze the spatial expression of the cell wall related genes in 8 different organs (leaves, embryo, seed, stem, spike, flower, and vegetative as well as floral meristem) using RPKM values. The relative expression data are presented family-wise as individual heat maps in order to better analyze the role of genes from each family (**Figures 6**, **7**, Supplementary Table 6). Among the CesA genes, 7 out of 11 showed high expression in all analyzed organs, whereas 3 genes were highly expressed in stem, flowers, and spikes with moderate expression in leaves, seeds, and vegetative meristem (VM; **Figure 6A**). One CesA gene (Sobic.010G183700) was expressed exclusively in leaves. A mixed pattern of expression was observed in the Csl gene family, with 11 genes showing high expression and 3 genes showing very low expression in all the 8 tissues analyzed (**Figure 6B**). Among hemicellulose biosynthetic genes, the 12 xyloglucan xylosyltransferase genes formed two main clusters, the first with 5 genes having medium to high expression in all the 8 tissues and the second with 7 genes having tissue specific expression, mainly in leaves (**Figure 6C**). In the xyloglucan fucosyltransferase gene family, 3 genes (Sobic.004G308200, Sobic.004G308400, Sobic.004G308600) showed higher expression in all the 8 tissues, while expression of the remaining genes was mostly limited to leaves (**Figure 6D**). About half (19) of the xyloglucan galactosyltransferase gene family showed a higher expression level in all the tissues, whereas the other half (18) showed moderate to low expression in various tissues (**Figure 6E**). In the homogalacturonan α-1,4-galacturonosyltransferase gene family, 25 out of 33 genes showed high expression in all the tissues, whereas the remaining 8 genes showed tissue specific expression (**Figure 6F**). The lignin biosynthesis related laccase genes showed high expression in selected tissues such as leaves, seeds, stem, flower, and spikes (**Figure 6G**). With few exceptions, the laccases were not expressed in embryo and meristematic tissues.

FIGURE 6 | Heat map showing hierarchical clustering of the sorghum's cell wall related biosynthetic gene families in various developmental stages. (A) CesA, (B) Csl, (C) Xyloglucan xylosyltransferases, (D) Xyloglucan fucosyltransferases, (E) Xyloglucan galactosyltransferases, (F) Homogalacturonan α-1,4-galacturonosyltransferase, (G) Laccases, (H) Glucan synthase-like. RNA-seq data from various developmental stages viz. stem (St), 20 days old leaves (Lv), vegetative meristem (VM), floral meristem (FM), spikelet (Sp), flowers (FL), embryos (Em), and seeds (Sd) were mapped on gene sequences related to above gene families. The respective RPKM values were used to construct heatmap with scale bar on the top showing expression of the genes. Red colors represent high expression whereas green represents low expression.

All 12 genes of the glucan synthase gene family showed high expression across all the tissues analyzed (**Figure 6H**).

Apart from the polysaccharide biosynthetic gene families, the spatial expression of the 12 gene families involved in degradation and reassembly was also analyzed. Among expansin genes, 16 genes from 2 clusters showed medium to high expression in all the tissues (**Figure 7A**). The remaining expansin genes were mostly expressed in flower, spike, leaves, seeds, and/or stem. Yieldins were consistently expressed in leaves, whereas some of them showed high expression in all the other tissues (**Figure 7B**). Most of the XTH gene family showed moderate to high expression in almost all the tissues analyzed, other than a cluster with no expression in embryo and meristematic tissues (**Figure 7C**). Genes of the five GH families (endo-xylanases, endo-1,4-β-glucanases, glucan 1,3-β-glucosidases, polygalacturonases, and β-galactosidases) were highly expressed in leaves, stem, seed, flowers, and spikes, apart from the clusters with high expression in all tissues (**Figures 7D,E,G,H,J**, respectively). Other than β-galactosidases, a significant proportion of genes from these families were not expressed in the embryo and meristematic tissues. Among the 10 genes encoding pectate lyases, 3 showed expression in all the tissues, whereas expression of the remaining genes was limited to flower, spike, and seeds (**Figure 7F**). In the other pectin related family, the rhamnogalacturonan I lyases, only one gene showed consistent expression in all the tissues, whereas the remaining genes showed leaf-specific expression (**Figure 7I**). Among the two families of esterases, genes encoding PMEs grouped into two major clusters based on expression: the first with moderate to high expression in almost all the tissues analyzed, and the second with expression limited to tissues other than embryo and meristem (**Figure 7K**). All the PAE family genes showed medium to high expression in almost all the tissues analyzed (**Figure 7L**).
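The RPKM values underlying the heat maps normalize raw read counts by gene length and by sequencing depth, which is what makes expression comparable across genes and organs. A minimal sketch of that standard normalization is given here; the read counts, library sizes, and gene length are hypothetical illustration values, not data from this study, and the function name `rpkm` is an assumption.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: one 2 kb gene measured in two organs.
# Each entry is (reads mapped to the gene, total mapped reads in the library).
gene_length_bp = 2_000
samples = {
    "stem": (500, 20_000_000),
    "leaf": (50, 10_000_000),
}

values = {
    organ: rpkm(reads, gene_length_bp, total)
    for organ, (reads, total) in samples.items()
}

for organ, value in values.items():
    print(f"{organ}: {value:.2f} RPKM")
```

Because both gene length and library size appear in the denominator, a long gene or a deeply sequenced library does not by itself inflate the value.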

### Differential Expression Analysis of Cell Wall Related Genes under Different Abiotic Stress Conditions

Differential expression analysis of cell wall related genes under two abiotic stress treatments (ABA and osmotic stress) was performed to analyze their response in seedling roots and shoots. A total of 19 and 29 genes were found to be significantly up-regulated (FC ≥ 2.0 and p < 0.05) in the sorghum shoots, whereas 34 and 67 genes were found to be significantly up-regulated in the roots subjected to ABA and PEG treatment, respectively (**Figures 8A,B**, Supplementary Table 7, Supplementary Figures 2A,C, 3A,C). Similarly, 53 and 25 genes were significantly down-regulated in shoots, whereas 133 and 14 genes were down-regulated in roots treated with ABA and PEG, respectively (**Figures 8A,C**, Supplementary Table 7, Supplementary Figures 2B,D, 3B,D). A relatively higher number of genes was down-regulated in ABA-treated shoots and roots than under PEG treatment. Comparative analysis of differentially expressed genes in roots and shoots subjected to ABA and PEG treatment showed common up-regulation of 1 gene (**Figure 8B**), whereas no gene was found to be down-regulated in common (**Figure 8C**).

FIGURE 7 | Heat map showing hierarchical clustering of the sorghum's cell wall related gene families involved in reassembly and degradation in various developmental stages. (A) Expansins, (B) Yieldins, (C) XTH, (D) Endo-xylanases, (E) Endo-1,4-β-glucanases, (F) Pectate and pectin lyases, (G) Glucan 1,3-β-glucosidases, (H) Polygalacturonases, (I) Rhamnogalacturonan I lyases, (J) β-Galactosidases, (K) Pectin methyl esterases, (L) Pectin acetyl esterases. RNA-seq data from various developmental stages viz. stem (St), 20 days old leaves (Lv), vegetative meristem (VM), floral meristem (FM), spikelet (Sp), flowers (FL), embryos (Em), and seeds (Sd) were mapped on gene sequences related to above gene families. The respective RPKM values were used to construct heatmap with scale bar on the top showing expression of the genes. Red colors represent high expression whereas green represents low expression.

ABA-treated shoots showed up-regulation of polygalacturonases (4) and down-regulation of expansins (12), laccases (9), glucan 1,3-β-glucosidases (8), and polygalacturonases (5) (Supplementary Figures 2A,B). ABA-treated roots showed up-regulation of glucan 1,3-β-glucosidases (5), homogalacturonan α-1,4-galacturonosyltransferases (4), laccases (3), and xyloglucan galactosyltransferases (3), and down-regulation of expansins (38), glucan 1,3-β-glucosidases (18), laccases (11), Csls (11), and XTHs (9) (Supplementary Figures 3A,B). A large number of expansins were down-regulated in ABA-treated shoots as well as roots. Glucan 1,3-β-glucosidases were also among the most down-regulated families in ABA-treated shoots and roots. Similarly, most of the up-regulated genes in PEG-treated roots belonged to the expansin (21), XTH (10), and glucan 1,3-β-glucosidase (6) families, whereas the down-regulated genes in PEG-treated roots mainly belonged to yieldins (3) and Csls (3) (Supplementary Figures 3C,D). The major up-regulated cell wall related gene families in PEG-treated shoots were expansins (6), glucan 1,3-β-glucosidases (4), and XTHs (4). As in PEG-treated roots, yieldins were the most down-regulated gene family in PEG-treated shoots (Supplementary Figures 2C,D).
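The selection criteria used above (FC ≥ 2.0 and p < 0.05) and the Venn-style comparison across the four tissue and treatment combinations can be sketched in a few lines. This is an illustration only: the gene names and values are hypothetical, the function name `split_degs` is invented for the example, and treating a fold change of 0.5 or less as down-regulation is an assumption, since the text states only the FC ≥ 2.0 cutoff.

```python
from typing import Dict, Set, Tuple

FC_CUTOFF = 2.0   # fold-change threshold stated in the text
P_CUTOFF = 0.05   # significance threshold stated in the text


def split_degs(results: Dict[str, Tuple[float, float]]) -> Tuple[Set[str], Set[str]]:
    """Split genes into up- and down-regulated sets.

    `results` maps gene -> (fold change treated/control, p-value).
    Assumption: down-regulation means fold change <= 1 / FC_CUTOFF.
    """
    up, down = set(), set()
    for gene, (fold_change, p_value) in results.items():
        if p_value >= P_CUTOFF or fold_change <= 0:
            continue  # not significant, or not a valid ratio
        if fold_change >= FC_CUTOFF:
            up.add(gene)
        elif fold_change <= 1.0 / FC_CUTOFF:
            down.add(gene)
    return up, down


# Hypothetical results for the four tissue/treatment combinations.
conditions = {
    "shoot_ABA": {"geneA": (3.0, 0.01), "geneB": (0.4, 0.02), "geneC": (2.5, 0.20)},
    "shoot_PEG": {"geneA": (2.2, 0.03), "geneB": (1.1, 0.50)},
    "root_ABA": {"geneA": (4.0, 0.001), "geneB": (0.3, 0.01)},
    "root_PEG": {"geneA": (2.0, 0.04), "geneD": (5.0, 0.01)},
}

up_sets, down_sets = {}, {}
for name, results in conditions.items():
    up_sets[name], down_sets[name] = split_degs(results)

# Genes shared by all four sets, i.e. the central region of a four-way Venn diagram.
common_up = set.intersection(*up_sets.values())
common_down = set.intersection(*down_sets.values())
print("common up:", sorted(common_up))
print("common down:", sorted(common_down))
```

Keeping the per-condition sets separate before intersecting them makes it easy to report both the individual counts and the overlaps from the same data.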

### DISCUSSION

Cell wall biogenesis is a dynamic process that involves the synergistic action of multiple gene families involved in the biosynthesis as well as the controlled degradation and reassembly of cell wall polymers. Broadly, glycosyl transferases are the major class of enzymes involved in cell wall polysaccharide biosynthesis, while substrate-specific glycoside hydrolases, pectin related lyases, and various esterases, together with cell wall loosening enzymes, are responsible for cell wall extension. Among these families, cellulose synthase and cellulose synthase-like genes are among the most studied in model, tree, and crop plants (Richmond and Somerville, 2000; Appenzeller et al., 2004; Djerbi et al., 2005; Muthamilarasan et al., 2015). Successful production of renewable biofuels and bioproducts from lignocellulose requires a comprehensive understanding of the genes involved in the biosynthesis of plant lignocellulosic material. A comprehensive report on cell wall related genes was missing for sorghum, which is an important food, fiber, bioproduct, and biofuel crop. Understanding the presence and distribution of cell wall related genes in sorghum would augment plant breeding and biotechnological approaches to develop sorghum plants with altered cell wall composition for various industrial applications, apart from crop improvement. In the present study, 520 genes from 20 cell wall related gene families have been identified and characterized in silico. Gene expression analysis of the identified genes was performed in different organs under normal and abiotic stress conditions to understand their role in cell wall development and the abiotic stress response of sorghum. These candidate genes are putative targets of reverse genetics for crop improvement, apart from adding value to sorghum.

FIGURE 8 | Differential expression analysis (DEG) of sorghum cell wall related genes in ABA and PEG treated shoot and root. (A) Details of differentially expressed genes (Fold change ≥ 2.0 and *p* < 0.05) during stress. (B) Venn diagram representing up-regulated genes during stress. (C) Venn diagram representing down-regulated genes during stress.

Deconstruction of lignocellulosic material is important for bioethanol production from plant biomass. Current bioethanol production technology involves separation of lignin from the lignocellulosic material, saccharification of wall polysaccharides into sugars, and fermentation. With existing conversion technologies, lignocellulose-based bioethanol production is not technically or economically competitive with fossil-based gasoline. Further, the cost of bioethanol from plant biomass is higher (\$1.5/gal) than that of starch-based bioethanol (\$0.9/gal) (http://www.nrel.gov/docs/fy01osti/28893.pdf). Improving the efficiency of lignocellulosic biomass deconstruction, particularly the separation of lignin from other wall polymers, is essential to make bioethanol production economically feasible. Altering the biomass composition is essential to reduce biomass recalcitrance and improve conversion. Understanding the composition of lignocellulosic biomass and the genes involved in its biosynthesis facilitates biomass engineering to improve conversion efficiency.

Lignocellulosic material is mainly composed of cellulose, hemicellulose, lignin, and pectin. Cellulose is a linear homopolymer of β-1,4-linked glucose molecules that accounts for 30–40% of cell wall weight. CesA genes were first identified in bacteria and later in cotton, followed by other plants (Saxena et al., 1990; Pear et al., 1996). Cellulose is synthesized by plasma membrane localized cellulose synthase complexes (CSCs) composed of multiple CESA proteins that produce individual glucan chains (Persson et al., 2007; Kumar and Turner, 2015). The individual glucan chains form cellulose microfibrils, and several microfibrils form a cellulose fiber; hence, the number of cellulose synthases present in an individual species is important for understanding the cellulose biosynthetic process. Moreover, cellulose synthases have cell-, tissue-, and development-specific roles in plants (Taylor et al., 2000; Mendu et al., 2011a). Hence, there are multiple CesAs in each species, and the number of genes varies with the plant species: Arabidopsis (10), maize (12), poplar (18), and foxtail millet (14) (Richmond and Somerville, 2000; Appenzeller et al., 2004; Djerbi et al., 2005; Muthamilarasan et al., 2015). In the present study we identified 11 CesA genes, in contrast to earlier reports of 12 CesA genes in sorghum; the difference is due to the removal of errors in the updated sorghum genome assemblies (**Table 1**, **Figure 1**). Though the number of CesAs present in sorghum is known, it is important to study the role of individual CesAs in primary and secondary cell wall biosynthesis in order to modify the biomass composition in a specific organ or tissue.

Hemicellulose composition and biosynthesis are complex because hemicelluloses are branched polysaccharides, in contrast to homopolymeric cellulose. Hemicelluloses play an important role in cell wall polymer cross-linking and help maintain cell wall integrity and strength. Hemicellulose content and composition differ between monocot and dicot plants. The cell walls of monocots such as sorghum contain 20–50% hemicelluloses, which makes them an attractive source of pentose sugars (Welker et al., 2015). Bioethanol production from pentose sugars, in addition to hexoses, is currently being heavily investigated (Unrean and Srienc, 2010). Understanding hemicellulose biosynthesis and the genes involved in the biosynthetic process will help to alter the biomass composition for easier deconstruction as well as to improve the hexose to pentose ratio. Other than the cellulose synthase-like genes, the hemicellulose biosynthesis genes have not been studied very well. In the present investigation, the large family of sorghum Csls has been classified into 8 different groups (CslA to H) based on phylogenetic studies (Ermawar et al., 2015a; **Figure 2**). In particular, two clusters, CslF and CslH, were observed to be unique to sorghum with no Arabidopsis homologs, which is in agreement with previous reports of these clusters as grass specific (Paterson et al., 2009; Ermawar et al., 2015a), while no sorghum Csl genes were placed in clusters CslB and CslG (**Figure 2**). Similar clustering of SbCsl genes has been reported previously in the sorghum draft genome report (Paterson et al., 2009). Apart from Csls, we also identified three additional hemicellulose gene families: xyloglucan xylosyltransferases (GT34), xyloglucan fucosyltransferases (GT37), and xyloglucan galactosyltransferases (GT47). Homogalacturonan α-1,4-galacturonosyltransferases (GT8), a pectin biosynthesis related gene family, was also observed to be one of sorghum's large cell wall biosynthetic gene families (**Table 1**, **Figure 1**). Pectin molecules play an important role in cell adhesion and contribute to biomass recalcitrance through extensive interlinks with other cell wall polymers. Overall, information on cell wall biosynthetic genes will help to design customized biomass for economical production of biofuels and bioproducts from sorghum. Apart from easier deconstruction and saccharification, enhancing the total sugars in the walls will help to improve the cost effectiveness of bioethanol production from sorghum biomass.

Cell wall biosynthesis is dynamic; it allows cell elongation while maintaining the wall integrity needed to withstand internal turgor pressure. The degradation/assembly mechanism plays important roles in the cell wall building process, wall strength, and integrity. Altering the wall degradation/assembly process will influence cell wall deconstruction and digestibility; hence, identification and characterization of genes involved in cell wall degradation/assembly is important. The degradation/assembly related genes identified in this study are distributed across 3 cell wall loosening related gene families, 6 families of glycoside hydrolases, 2 pectin related lyase families, and 2 pectin related esterase gene families. The conserved domain analysis of the cell wall related gene families (**Figure 1**), along with the clusters obtained from phylogenetic analysis with Arabidopsis and rice proteins, suggests the evolutionarily conserved nature of these proteins (Supplementary Figure 1). Physical mapping revealed that approximately 65% of cell wall related genes are confined to chromosomes 1–4 (chromosome 1 with 22.3%, chromosome 3 with ∼16%, chromosome 2 with 13.2%, and chromosome 4 with 13.1%; **Figure 4**, Supplementary Table 1). These chromosomes, which are hotspots of cell wall genes, can be targeted in breeding and crop improvement programs to alter cell wall composition. Further, the 56 tandem duplication events observed in these genes were distributed across all 10 chromosomes, with the most duplications observed on the first 4 chromosomes. The large number of duplications on chromosomes 1–4 could be a possible reason for the presence of ∼65% of the genes on these chromosomes. Further, a total of 137 SSR markers were found in ∼22% of the cell wall related genes, with the highest representation in the Csl (16) and xyloglucan galactosyltransferase (16) gene families (**Figure 5**, Supplementary Tables 2, 3). Among these, TNRs, with a ∼81% share, were the most abundant SSRs, which is in agreement with previous reports of TNR abundance in plants. These molecular markers will help breeding programs to select genes in a breeding population or to introgress specific cell wall related genes. Further, in silico analysis for the presence of miRNA targets revealed miR156, miR164, miR397, miR528, miR5566, and miR6230 target sites in 10 independent cell wall related genes (Supplementary Table 4). Three of these miRNA families, viz. miR156, miR164, and miR528, have been reported to be differentially expressed in stem and leaves during sugar accumulation in sweet sorghum (Yu et al., 2015). Further, Yu et al. (2015) reported miR164 and miR528 as stem-specific miRNAs, whereas miR156 was up-regulated in the leaves at the dough stage. Over-expressed miR156 has been reported to cause the Corngrass1 (Cg1) phenotype in maize (Chuck et al., 2007). Further, four of the six sorghum laccase family genes were found to have miR397 target sites and showed differential expression under drought stress conditions (Hamza et al., 2016), indicating a potential change in sorghum cell wall composition under stress.
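To make the SSR terminology concrete, a simple sequence repeat is a short motif repeated in tandem, and trinucleotide repeats (TNRs) are SSRs with a three-base motif. A minimal regex-based sketch of SSR detection follows. It is not the tool used in this study: the minimum repeat counts, the function name `find_ssrs`, and the test sequence are hypothetical choices for illustration.

```python
import re
from typing import List, Tuple

# Assumed minimum number of tandem copies per motif length (hypothetical thresholds).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence: str) -> List[Tuple[int, str, int]]:
    """Return (start position, motif, copy number) for each perfect SSR found."""
    sequence = sequence.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # A motif of `motif_len` bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. "ATAT").
            if any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len)
                if motif_len % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return hits


# Hypothetical sequence containing one trinucleotide and one dinucleotide repeat.
sequence = "GGTT" + "CAG" * 6 + "TTGA" + "AT" * 7 + "CCGG"
hits = find_ssrs(sequence)
for start, motif, copies in sorted(hits):
    print(f"pos {start}: ({motif}){copies}")

# Share of trinucleotide repeats among all SSRs found in this toy sequence.
tnr_share = sum(1 for _, motif, _ in hits if len(motif) == 3) / len(hits)
```

Running such a scan over every gene in a family and tallying motif lengths is one way a TNR share like the one reported above could be computed, although dedicated SSR tools also handle imperfect and compound repeats.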

Cell wall composition and gene expression vary among different tissues of the plant (stem, root, leaves, etc.) and among the cell types within a tissue (i.e., epidermal, xylem, phloem, fiber cells, etc.; Hatfield et al., 1999; McKinley et al., 2016). Expression analysis across different tissues provides important insight into the role of cell wall related genes in each particular tissue. In the present study, we found differential expression of genes among different tissues (**Figures 6**, **7**). This analysis provides information on tissue-specific target genes for bioengineering purposes. A recent study of sorghum gene expression in pre- and post-anthesis stages of stem internodes showed differential expression of genes involved in growth, cell wall development, and stem sugar accumulation (McKinley et al., 2016). CesA, Csls, callose synthases, XTHs, glucuroarabinoxylan biosynthetic genes, expansins, glycosyl hydrolases, pectin lyases/esterases, and lignin biosynthetic genes were among the major differentially expressed cell wall related genes (McKinley et al., 2016). In the present study, expression analysis of sorghum cell wall related genes showed that most of the genes from the CesA, xyloglucan galactosyltransferase, homogalacturonan α-1,4-galacturonosyltransferase, glucan synthase-like, glucan 1,3-β-glucosidase, polygalacturonase, β-galactosidase, and pectin acetyl esterase families were expressed in all the stages studied. The remaining cell wall gene families contained stage-specific genes, ubiquitously expressed genes, or both.

Environmental conditions including temperature, drought, osmotic stress, and salinity have been shown to affect gene expression and crop productivity (Tenhaken, 2014; Wang et al., 2016). With environmental conditions changing fast across the globe, studying the effect of stress on plants is important. In addition, to avoid food/fuel competition, biofuel crops have been advocated for cultivation on marginal lands with limited irrigation and minimal input. Upon exposure to these adverse environmental conditions, plants alter their gene expression and biochemical metabolism, including cell wall composition, in order to survive. The cell wall component that adds most to the cost of bioethanol production is lignin. It has been reported that abiotic stress results in increased lignin content in plants (Moura et al., 2010). This increases the cost of bioethanol production; hence, there is a need to develop bioenergy crops that do not accumulate more lignin when grown on marginal lands with limited irrigation and low inputs. A better understanding of cell wall gene expression under abiotic stress is important for designing strategies to produce crops on marginal lands with less lignin accumulation. Analysis of the sorghum transcriptome under abiotic stress showed differential expression of a significant number of cell wall related genes (**Figure 8**). Expression of cell wall genes was altered more in roots than in shoots. Among the differentially expressed gene families, expansins, laccases, and glucan 1,3-β-glucosidases showed down-regulation in ABA-treated roots and shoots (Supplementary Figures 2B, 3B). Similarly, following PEG treatment, expansins and XTHs were among the up-regulated genes in roots as well as shoots, whereas yieldins were among the highly down-regulated genes in both tissues (Supplementary Figures 2C,D, 3C,D). Since most of the cell wall related gene families contain multiple genes, each with either a specific or a redundant function, there is a need to characterize the function of individual genes in order to develop a fine annotation of their function in normal growth and development as well as under abiotic stress conditions.

### CONCLUSIONS

Comprehensive information on cell wall related genes would facilitate biosynthetic pathway engineering for enhanced biomass production as well as efficient deconstruction and saccharification. Lignin content and cellulose crystallinity contribute to poor separation and saccharification, which are the biggest hurdles in the cost-efficient utilization of sorghum biomass for biofuel production. Here we have identified various cell wall related gene families and analyzed their gene expression patterns, but the functional role of the individual genes is still not known. Sorghum lines carrying cell wall related gene mutations have shown higher saccharification efficiency and are being used for animal feed; hence, further analysis and functional characterization will lead to the development of more efficient sorghum lines for the animal feed, biofuel, and bioproduct industries. Apart from this, analyzing the cell wall composition of sorghum under abiotic stress conditions and its correlation with differentially expressed genes will also shed light on the mechanisms involved in the regulation of cell wall biosynthesis and degradation. The present study analyzed the gene expression of sorghum seedlings exposed to abiotic stress, which provides valuable information; however, a detailed study at the developmental stages that are critical for biomass harvest will provide the information necessary to manipulate the biomass through plant breeding and genetic engineering. Overall, the comprehensive information developed in the present study can be used to expand the set of target genes as well as to develop better strategies for future sorghum crop improvement programs.

### AUTHOR CONTRIBUTIONS

KR designed the work, performed the analysis and wrote the manuscript. ST, VB, CC, and TD helped with bioinformatics analysis, prepared figures and wrote the manuscript. VM conceived the idea, designed work and wrote the manuscript. All the authors have read and approved the manuscript.

### ACKNOWLEDGMENTS

This work was supported by the Department of Plant & Soil Sciences, Texas Tech University, the USDA/Agricultural Research Service (Ogallala Aquifer Initiative), and USDA-FAS.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016.01287




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Rai, Thu, Balasubramanian, Cobos, Disasa and Mendu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Integrative analysis and expression profiling of secondary cell wall genes in C<sub>4</sub> biofuel model *Setaria italica* reveals targets for lignocellulose bioengineering

Mehanathan Muthamilarasan<sup>1†</sup>, Yusuf Khan<sup>1†</sup>, Jananee Jaishankar<sup>1</sup>, Shweta Shweta<sup>1</sup>, Charu Lata<sup>2</sup> and Manoj Prasad<sup>1</sup>\*

*<sup>1</sup> National Institute of Plant Genome Research, New Delhi, India, <sup>2</sup> Division of Plant-Microbe Interactions, CSIR-National Botanical Research Institute, Lucknow, India*

### *Edited by:*

*Gautam Sarath, United States Department of Agriculture - Agricultural Research Service, USA*

#### *Reviewed by:*

*Lam-Son Tran, RIKEN Center for Sustainable Resource Science, Japan; Erin D. Scully, United States Department of Agriculture - Agricultural Research Service, USA*

*\*Correspondence:*

*Manoj Prasad manoj\_prasad@nipgr.ac.in*

*† These authors have contributed equally to this work.*

#### *Specialty section:*

*This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science*

*Received: 26 June 2015 Accepted: 22 October 2015 Published: 04 November 2015*

#### *Citation:*

*Muthamilarasan M, Khan Y, Jaishankar J, Shweta S, Lata C and Prasad M (2015) Integrative analysis and expression profiling of secondary cell wall genes in C4 biofuel model Setaria italica reveals targets for lignocellulose bioengineering. Front. Plant Sci. 6:965. doi: 10.3389/fpls.2015.00965*

Several underutilized grasses have excellent potential for use as bioenergy feedstock due to their lignocellulosic biomass. Genomic tools have enabled identification of lignocellulose biosynthesis genes in several sequenced plants. However, the non-availability of whole genome sequences of bioenergy grasses hinders the study of bioenergy genomics and their genomics-assisted crop improvement. Foxtail millet (*Setaria italica* L.; Si) is a model crop for studying systems biology of bioenergy grasses. In the present study, a systematic approach has been used for identification of gene families involved in cellulose (*CesA/Csl*), callose (*Gsl*) and monolignol biosynthesis (*PAL, C4H, 4CL, HCT, C3H, CCoAOMT, F5H, COMT, CCR, CAD*) and construction of a physical map of foxtail millet. Sequence alignment and phylogenetic analysis of identified proteins showed that monolignol biosynthesis proteins were highly diverse, whereas CesA/Csl and Gsl proteins were homologous to rice and *Arabidopsis*. Comparative mapping of foxtail millet lignocellulose biosynthesis genes with other C<sub>4</sub> panicoid genomes revealed maximum homology with switchgrass, followed by sorghum and maize. Expression profiling of candidate lignocellulose genes in response to different abiotic stresses and hormone treatments showed their differential expression pattern, with significantly higher expression of *SiGsl12, SiPAL2, SiHCT1, SiF5H2,* and *SiCAD6* genes. Further, due to the evolutionary conservation of grass genomes, the insights gained from the present study could be extrapolated for identifying genes involved in lignocellulose biosynthesis in other biofuel species for further characterization.

Keywords: foxtail millet (*Setaria italica* L.), secondary cell wall biosynthesis, lignocellulose, bioenergy grasses, genomics, comparative mapping

## INTRODUCTION

Cell wall polymers of living plants constitute a predominant proportion of their biomass, which is formed by fermentable linked sugars. These polymers form a major structural component of the plant cell wall and, in particular, secondary cell walls provide mechanical strength and rigidity to vascular plants (Wang et al., 2013; Zhong and Ye, 2015). Secondary cell walls are present in tracheary elements, xylem, phloem, extraxylary and interfascicular fibers, sclereids and seed coats, and are made of cellulose, hemicelluloses and lignin. Cellulose, the primary unit, cross-links with hemicelluloses including xylan and glucomannan and is impregnated with the phenolic polymer lignin; altogether, this complex polymeric network forms the secondary cell wall. The proportion of cellulose, hemicelluloses, and lignin varies among different plant species and, of note, the composition may also vary in response to diverse developmental and environmental conditions (Zhong and Ye, 2015). Being the prime constituents of wood and fiber, secondary cell walls have been extensively studied to understand and exploit their biofuel prospects. Biochemical and genomic methods have identified the genes encoding enzymes that participate in the biosynthesis of secondary cell wall components.

Pear et al. (1996) were the first to identify cellulose synthase (CesA) genes in cotton; following this, CesA genes have been identified in other plants and their numbers were shown to vary between plant species. In Arabidopsis, 10 CesA genes have been identified (Richmond and Somerville, 2000), whereas 12 in maize (Appenzeller et al., 2004), 16 in barley (Burton et al., 2004), and 18 in poplar (Djerbi et al., 2005) have been reported. The CesA enzymes belong to the glycosyltransferase-2 (GT-2) superfamily, which is defined by an eight-transmembrane topology and conserved cytosolic substrate binding and catalytic residues (McFarlane et al., 2014). In addition to CesA, plants also have cellulose synthase-like (Csl) genes, which can be involved in the biosynthesis of hemicellulose and other glucans (Lerouxel et al., 2006). Csl-encoded enzymes can also synthesize other polysaccharides that are not components of the hemicellulose matrix (Lerouxel et al., 2006). So far, several types of Csl genes have been identified, denoted as CslA to CslK. CslA encodes (1,4)-β-D-mannan synthases (Dhugga et al., 2004; Liepman et al., 2005), CslF and CslH encode the mixed linkage glucan synthases for (1,3;1,4)-β-glucan biosynthesis (Burton et al., 2006; Doblin et al., 2009), CslC genes are involved in xyloglucan biosynthesis (Cocuron et al., 2007), and CslD in xylan and homogalacturonan synthesis (Hamann et al., 2004; Bernal et al., 2008a,b; Li et al., 2009), whereas the functional roles of other Csl genes remain elusive (Yin et al., 2009). Notably, CslB and CslG are specific to dicots whereas CslF and CslH are found only in monocots (Fincher, 2009; Doblin et al., 2010), but recently two CslG genes were identified in Panicum virgatum (Pavirv00027268m and Pavirv00027269m; Yin et al., 2014).

Callose is a (1,3)-β-D-glucan, which is not present in cell walls but deposited in the walls of specialized tissues such as pollen mother cell walls, plasmodesmatal canals, and sieve plates in dormant phloem during normal growth and development (Stone and Clarke, 1992). In addition, callose is also deposited in response to environmental stimuli including abiotic stress, wounding, and pathogen challenge (Stone and Clarke, 1992; Muthamilarasan and Prasad, 2013). Callose is synthesized by callose synthases, which are encoded by glucan synthase-like (Gsl) genes (Saxena and Brown, 2000; Cui et al., 2001). To date, 12 Gsl genes have been identified in Arabidopsis, 13 in rice, 9 in poplar, and 8 in barley (Farrokhi et al., 2006).

In the case of lignin biosynthesis, phenylalanine is metabolized through the phenylpropanoid pathway to produce hydroxycinnamoyl-CoA esters, which enter the lignin branch of this pathway and are converted to monolignols. The process requires the involvement of phenylalanine ammonia lyase (PAL), trans-cinnamate 4-hydroxylase (C4H), 4-coumarate CoA ligase (4CL), hydroxycinnamoyl CoA:shikimate/quinate hydroxycinnamoyl transferase (HCT), p-coumaroyl shikimate 3′-hydroxylase (C3H), caffeoyl CoA 3-O-methyltransferase (CCoAOMT), ferulate 5-hydroxylase (F5H), caffeic acid O-methyltransferase (COMT), cinnamoyl CoA reductase (CCR), and cinnamyl alcohol dehydrogenase (CAD) (Bonawitz and Chapple, 2010; Zhong and Ye, 2015). Of these enzymes, PAL is the first enzyme of the phenylpropanoid pathway; it catalyzes the deamination of phenylalanine to generate cinnamic acid, and C4H hydroxylates cinnamic acid to generate p-coumaric acid (Harakava, 2005). 4CL performs CoA esterification of p-coumaric acid and caffeic acid, whereas HCT catalyzes the conversion of p-coumaroyl-CoA and caffeoyl-CoA into the corresponding shikimate or quinate esters, and C3H converts these esters to the corresponding caffeoyl esters. Following this, CCoAOMT catalyzes methylation of caffeoyl CoA to produce feruloyl CoA, whereas CCR converts hydroxycinnamoyl CoA esters to their corresponding aldehydes (Harakava, 2005). F5H had been assumed to catalyze the conversion of ferulic acid to 5-hydroxyferulic acid, but recombinant DNA studies in Arabidopsis and Liquidambar styraciflua revealed that F5H converts coniferaldehyde and coniferyl alcohol to sinapaldehyde and sinapyl alcohol, respectively (Humphreys et al., 1999; Osakabe et al., 1999).
COMT is involved in the conversion of 5-hydroxyconiferaldehyde and/or 5-hydroxyconiferyl alcohol to sinapaldehyde and/or sinapyl alcohol, respectively (Osakabe et al., 1999; Parvathi et al., 2001), while CAD catalyzes the conversion of cinnamyl aldehydes into their corresponding alcohols (Harakava, 2005). The genes encoding these enzymes have recently been identified and characterized in several plant species (Raes et al., 2003; Vanholme et al., 2012; Shen et al., 2013; Carocha et al., 2015; van Parijs et al., 2015).

With the rising impacts of global climate change, reducing greenhouse gas emissions is essential, and this could be facilitated by generating biorenewables. Importantly, production of lignocellulosic biofuels from secondary cell wall biomass has become a strategic research area, as it holds the potential to enhance energy security. C<sub>4</sub> grasses, namely switchgrass (P. virgatum), napier grass (Pennisetum purpureum), pearl millet (P. glaucum), and foxtail millet (Setaria italica), have recently gained momentum in lignocellulosic biofuel research due to their high-efficiency CO<sub>2</sub> fixation and efficient conversion of solar energy to biomass through C<sub>4</sub> photosynthesis and photorespiration-suppressing modifications, respectively (Schmer et al., 2008; Byrt et al., 2011; van der Weijde et al., 2013). In addition, these grasses possess better water use efficiency (WUE), higher nitrogen use efficiency (NUE), the capacity to grow in arid and semi-arid regions, and relatively high tolerance to environmental constraints including heat, drought, salinity, and water-logging. For these reasons, C<sub>4</sub> photosynthesis is an important trait for lignocellulosic biofuel crops (Byrt et al., 2011; van der Weijde et al., 2013).

Recently, foxtail millet (S. italica) and its wild progenitor, green foxtail (S. viridis), have been recognized as suitable experimental models for biofuel research owing to their genetic relatedness to several biofuel grasses (Li and Brutnell, 2011; Zhang et al., 2012; Lata et al., 2013; Petti et al., 2013; Diao et al., 2014; Warnasooriya and Brutnell, 2014; Muthamilarasan and Prasad, 2015). The genomes of both foxtail millet and green foxtail have been sequenced (Bennetzen et al., 2012; Zhang et al., 2012), and the availability of the foxtail millet draft genome sequence in the public domain has facilitated various genetic and genomic studies in this model crop pertaining to stress response and crop improvement (Diao et al., 2014; Muthamilarasan and Prasad, 2015; Muthamilarasan et al., 2015), though no comprehensive genome-wide study on biofuel traits has been performed. Petti et al. (2013) compared the lignocellulosic feedstock composition, cellulose biosynthesis inhibitor response, saccharification dynamics, and CesA gene family of green foxtail with those of sorghum, maize, and switchgrass. That study identified eight potential CesA gene family members for functional genomic characterization (Petti et al., 2013).

The present study was performed to identify the gene families participating in lignocellulose biosynthesis using computational approaches. Further, qRT-PCR analysis of a few genes was performed to understand their expression patterns in response to different abiotic stress treatments.

### MATERIALS AND METHODS

### Identification of Lignocellulose Biosynthesis Gene Families

Protein sequences of enzymes involved in cellulose biosynthesis, namely CesA, Csl, and Gsl, of rice and Arabidopsis were retrieved from the Cell Wall Genomics web server (https://cellwall.genomics.purdue.edu/intro/index.html). The sequences for PAL, C4H, 4CL, HCT, C3H, CCoAOMT, F5H, COMT, CCR, and CAD reported in other crops (Appenzeller et al., 2004; Burton et al., 2004; Carocha et al., 2015; Zhong and Ye, 2015) were retrieved from the respective literature, and an HMM profile was generated for each family. Specifically, the sequences of each family were aligned using Clustal Omega (http://www.ebi.ac.uk/Tools/msa/clustalo/) and HMM profiles were built using the hmmbuild command (http://hmmer.janelia.org/). The HMMER tool was used under default parameters to identify the respective homologous proteins in the foxtail millet protein dataset retrieved from Phytozome v10.2 (http://phytozome.jgi.doe.gov/) (Muthamilarasan et al., 2014a). The protein sequences were confirmed using HMMSCAN (http://www.ebi.ac.uk/Tools/hmmer/search/hmmscan) analysis, and the respective genomic, transcript, and CDS sequences were downloaded from Phytozome by BLAST searching the retrieved protein sequences against the S. italica database under default parameters (http://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org\_Sitalica).
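The align, build, and search steps described above can be sketched as three command lines per gene family. The snippet below is only an illustration, not the authors' actual script: the function name, file names, and the choice of Stockholm format for the alignment are assumptions, and the commands are merely assembled (not executed) so the sketch runs without Clustal Omega or HMMER installed.

```python
from pathlib import Path


def profile_search_commands(family_fasta, proteome_fasta, workdir="."):
    """Assemble the align -> build -> search commands for one gene family.

    All file names are hypothetical. Search thresholds are left at the
    tools' defaults, as stated in the text.
    """
    work = Path(workdir)
    stem = Path(family_fasta).stem
    alignment = work / f"{stem}.sto"   # multiple sequence alignment
    profile = work / f"{stem}.hmm"     # profile HMM built from the alignment
    hits = work / f"{stem}.tbl"        # per-target hit table
    return [
        # 1. Align the reference family members with Clustal Omega.
        ["clustalo", "-i", str(family_fasta), "-o", str(alignment), "--outfmt=st"],
        # 2. Build the profile HMM from the alignment.
        ["hmmbuild", str(profile), str(alignment)],
        # 3. Search the foxtail millet proteome with the profile.
        ["hmmsearch", "--tblout", str(hits), str(profile), str(proteome_fasta)],
    ]


if __name__ == "__main__":
    for cmd in profile_search_commands("CesA.fa", "Sitalica_proteins.fa"):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where the tools are available.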

### Protein Properties and Phylogenetic Analysis

The properties of the identified cell wall-related proteins, including molecular weight, pI, and instability index, were computed using the ExPASy ProtParam tool (http://web.expasy.org/protparam/). The amino acid sequences of the respective families were imported into MEGA v6 (Tamura et al., 2013) for multiple sequence alignment and phylogenetic tree construction using the neighbor-joining method with bootstrap analysis of 1000 replicates (Muthamilarasan et al., 2014b). Sequence alignment and analysis were performed using BioEdit v7.2.5 (http://www.mbio.ncsu.edu/bioedit/bioedit.html).
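As a rough illustration of one of these properties, the molecular weight of a protein can be approximated by summing average residue masses and adding one water molecule for the free termini. The sketch below is not ProtParam itself; the function name is hypothetical and the mass table uses standard average residue masses, so small rounding differences from the web tool are to be expected.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # added once for the N- and C-terminal H and OH


def molecular_weight(sequence: str) -> float:
    """Approximate average molecular weight (Da) of an unmodified protein."""
    seq = sequence.strip().upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    if not seq:
        return 0.0
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


if __name__ == "__main__":
    # Divide by 1000 to report in kDa, as in the supplementary tables.
    print(round(molecular_weight("MAGSW") / 1000, 3), "kDa")
```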

### Physical Mapping and Gene Structure Analysis

The chromosomal locations of the cell wall biosynthesis genes, including chromosome number, gene start and end positions, gene length, and orientation, were obtained from Phytozome, and a physical map was constructed using MapChart (Voorrips, 2002). Tandem and segmental gene duplications were identified using MCScanX (Wang et al., 2012) according to the protocol of the Plant Genome Duplication Database (Lee et al., 2012). Gene structure was predicted using Gene Structure Display Server v2.0 (http://gsds.cbi.pku.edu.cn/).

### Promoter Analysis, Targeting miRNA, and Marker Prediction

The upstream genomic sequences (∼2 kb) of the lignocellulose pathway genes of foxtail millet were retrieved from Phytozome, and cis-regulatory elements were identified by Signal Scan Search using the New PLACE web server (https://sogo.dna.affrc.go.jp/cgi-bin/sogo.cgi?page=analysis&lang=en).

Mature miRNA sequences of foxtail millet were downloaded from miRBase v21 (Kozomara and Griffiths-Jones, 2014) and FmMiRNADb (Khan et al., 2014). This information, along with the miRNA data of a dehydration stress library (Yadav et al., unpublished data), was used to identify the miRNAs targeting the transcripts of lignocellulose pathway genes using the psRNATarget server (Dai and Zhao, 2011) under default parameters. The large-scale genome-wide molecular markers, namely simple sequence repeats (SSR; Pandey et al., 2013), expressed sequence tag (EST)-SSRs (eSSR; Kumari et al., 2013), and intron-length polymorphic markers (Muthamilarasan et al., 2014c), were retrieved from the Foxtail millet Marker Database (http://www.nipgr.res.in/foxtail.html; Suresh et al., 2013) and searched for their presence in the genic and promoter regions of lignocellulose biosynthesis genes using an in-house Perl script.
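The in-house script is not described further, but the marker search amounts to an interval-overlap test between marker coordinates and each gene body or its upstream window. The following Python sketch shows one way this could look; it is an assumption-laden illustration rather than the authors' code. The `Gene` record, the function names, the 1-based inclusive coordinates, and the reuse of the ∼2 kb upstream window as the promoter are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def promoter_window(gene: Gene, upstream: int = 2000) -> Tuple[int, int]:
    """Coordinates of the region immediately upstream of the gene."""
    if gene.strand == "+":
        return max(1, gene.start - upstream), gene.start - 1
    # On the minus strand, "upstream" lies at higher coordinates.
    return gene.end + 1, gene.end + upstream


def classify_marker(gene: Gene, chrom: str, m_start: int, m_end: int,
                    upstream: int = 2000) -> Optional[str]:
    """Return 'genic', 'promoter', or None for a marker interval."""
    if chrom != gene.chrom:
        return None
    if m_start <= gene.end and m_end >= gene.start:
        return "genic"
    p_start, p_end = promoter_window(gene, upstream)
    if m_start <= p_end and m_end >= p_start:
        return "promoter"
    return None


if __name__ == "__main__":
    gene = Gene("hypothetical_gene", "scaffold_1", 5000, 8000, "+")
    print(classify_marker(gene, "scaffold_1", 3500, 3520))
```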

### Comparative Genome Mapping and Evolutionary Analysis

Protein sequences of the lignocellulose pathway genes of foxtail millet were BLASTP searched against the protein sequences of switchgrass (Panicum virgatum), rice (Oryza sativa), and poplar (Populus trichocarpa), and hits with more than 80% identity were selected. The genomic and CDS sequences, along with chromosomal locations, for these proteins were retrieved by performing BLAST searches under default parameters against the corresponding genomes retrieved from Gramene (http://www.gramene.org/), and comparative maps were visualized using Circos (Krzywinski et al., 2009). Reciprocal BLAST was also performed to ensure a unique relationship between the homologous genes (Mishra et al., 2013). Non-synonymous substitutions per non-synonymous site (Ka) and synonymous substitutions per synonymous site (Ks) for paralogous (tandemly and segmentally duplicated) as well as homologous (comparative mapping data) gene pairs were calculated with the codeml program in PAML using PAL2NAL (Suyama et al., 2006). The Ka/Ks ratios and the estimated times of duplication and divergence (T = Ks/2λ, where λ = 6.5 × 10<sup>−9</sup>) were calculated according to Puranik et al. (2013).
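The divergence-time formula can be made concrete with a short worked example. The sketch below assumes that λ is expressed in substitutions per synonymous site per year (the usual convention, though not stated explicitly in the text) and reports T in millions of years; the function names and the purifying/neutral/positive labels for Ka/Ks are illustrative additions.

```python
LAMBDA = 6.5e-9  # substitution rate from the text; per site per year is assumed


def divergence_time_mya(ks: float, rate: float = LAMBDA) -> float:
    """T = Ks / (2 * lambda), converted from years to million years."""
    if ks < 0:
        raise ValueError("Ks cannot be negative")
    return ks / (2.0 * rate) / 1e6


def selection_mode(ka: float, ks: float) -> str:
    """Conventional reading of the Ka/Ks ratio for a gene pair."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    ratio = ka / ks
    if ratio < 1:
        return "purifying"
    if ratio > 1:
        return "positive"
    return "neutral"


if __name__ == "__main__":
    # A gene pair with Ks = 0.13 gives 0.13 / (2 * 6.5e-9) = 1e7 years.
    print(divergence_time_mya(0.13), selection_mode(0.05, 0.13))
```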

### *In silico* Expression Profiling in Tissues and Drought Stress

The transcriptome data of different tissues, namely root (SRX128223), stem (SRX128225), leaf (SRX128224), spica (SRX128226), and a drought stress library (SRR629694) as well as its control (SRR629695) were retrieved from European Nucleotide Archive (http://www.ebi.ac.uk/ena) (Zhang et al., 2012; Qi et al., 2013). The reads were filtered using NGS Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html), mapped on foxtail millet genome using CLC Genomics Workbench v4.7.1, normalized by RPKM method and a heat map was generated using MultiExperiment Viewer (MeV) v4.9 (Saeed et al., 2003).
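For reference, RPKM (reads per kilobase of transcript per million mapped reads) scales a raw read count by both gene length and library size. The sketch below shows the standard formula only; it is not the CLC Genomics Workbench implementation, and the function name and example numbers are hypothetical.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """RPKM = reads * 1e9 / (gene length in bp * total mapped reads)."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2 kb gene in a library of 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```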

### Plant Materials, Stress and Hormone Treatments and Quantitative Real-time PCR Analysis

Seeds of foxtail millet cv. "IC-403579" (dehydration and salinity tolerant) were grown under optimum conditions following Lata et al. (2014). Twenty-one-day-old seedlings were exposed to 250 mM NaCl (salinity), 20% PEG6000 (dehydration), 4°C (cold), 100 mM abscisic acid (ABA), 100 mM methyl jasmonate (MeJA), and 100 mM salicylic acid (SA) treatments (Mishra et al., 2013; Puranik et al., 2013; Kumar et al., 2015), and whole seedlings were collected at 0 h (control), 1 h (early), and 24 h (late) (Yadav et al., 2015). The samples were frozen immediately in liquid nitrogen and stored at −80°C. RNA isolation, cDNA synthesis, and qRT-PCR analysis were performed according to Puranik et al. (2013) in three technical replicates for each of three biological replicates using the primers listed in **Supplementary Table S1**. All qRT-PCR data were the means of at least three independent experiments, and the results were presented as mean values ± SE. The significance of differences between the mean values of the control and each stressed sample was assessed using one-way analysis of variance (ANOVA), and comparison among means was carried out with the Tukey-Kramer multiple comparisons test using GRAPHPAD INSTAT software v3.10 (http://www.graphpad.com). The differences in the effects of stress treatments on various parameters in the 16 foxtail millet genes under study were considered statistically significant at ∗P < 0.05, ∗∗P < 0.01, ∗∗∗P < 0.001.
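The summary statistics reported here (mean ± SE and the asterisk notation) can be illustrated with a few lines of Python. This is only a sketch of the arithmetic: the ANOVA and Tukey-Kramer tests themselves were run in GraphPad InStat and are not reproduced, and the function names and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_se(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error (sample SD / sqrt(n)) of replicate values."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two replicates are needed for an SE")
    return mean(values), stdev(values) / sqrt(n)


def significance_stars(p_value: float) -> str:
    """Map a P-value to the asterisk notation used in the text."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


if __name__ == "__main__":
    fold_changes = [1.0, 2.0, 3.0]  # hypothetical biological replicates
    m, se = mean_se(fold_changes)
    print(f"{m:.2f} ± {se:.2f} {significance_stars(0.004)}")
```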

### RESULTS

### CesA/Csl and Gsl Superfamily of Foxtail Millet

HMM searches identified the presence of 14 CesA (SiCesA) and 39 Csl (SiCsl) proteins in foxtail millet (**Supplementary Table S2**). Among the 14 SiCesA proteins, one was found to be an alternate transcript (Si028766m), whereas in SiCsl, three alternate transcripts (Si029554m, Si035399m, and Si035101m) were identified. Domain analysis of SiCesA proteins revealed the presence of both the cellulose synthase domain (CS; PF03552) and the zinc finger structure (ZF; PF14569) in all the proteins except SiCesA8 and SiCesA10, which have only the CS domain (**Supplementary Table S3**). In addition, all the SiCesA proteins except SiCesA8 had Glycosyl transferase 2 (GT2; PF13632) domain. In the case of SiCsl proteins, 36 proteins (primary transcripts) were identified, of which 10 belonged to SiCslA, 6 to SiCslC, 5 to SiCslD, 4 to SiCslE, 7 to SiCslF, 2 each to SiCslH and SiCslJ families (**Supplementary Table S2**). Interestingly, two members of CslJ have been identified in foxtail millet, which was previously considered to be a cereal-specific gene family (Doblin et al., 2010). Domain analysis showed that all the SiCslA and SiCslC proteins possess GT2 domain (PF13641, PF13632, PF00535, and PF13506) (**Supplementary Table S3**).

All 5 SiCslD proteins possess CS (PF03552) and GT2 (PF13632) domain, and interestingly, SiCslD2, SiCslD4, and SiCslD5 were evidenced to have an additional RING/Ubox like zinc-binding domain (PF14570), whereas SiCslD3 has two CS domains (**Supplementary Table S4**). All the SiCslE proteins except SiCslE2 have more than one CS domain and SiCslE3 has an additional GT2 domain (PF13641). In the case of SiCslF proteins, all of the members except SiCslF6 have two CS domains and in addition, SiCslF1, SiCslF3, and SiCslF7 possess GT2 domain (PF13632). Two members each belonging to CslH and CslJ family proteins were identified and both the group members have two CS domains (**Supplementary Table S4**).

A total of 12 Gsl (SiGsl) proteins were identified in foxtail millet, and all possessed the glucan synthesis (GS) domain (1,3-beta-glucan synthase component; PF02364) (**Supplementary Table S5**). The number of GS domains within these proteins varied: SiGsl1, SiGsl6, and SiGsl12 have two GS domains, whereas SiGsl11 has three. In addition, SiGsl2, SiGsl3, SiGsl5, SiGsl7, SiGsl8, SiGsl10, and SiGsl11 have a 1,3-beta-glucan synthase subunit FKS1, domain-1 (PF14288). Furthermore, SiGsl8, SiGsl10, and SiGsl11 have an additional Vta1 (VPS20-associated protein 1)-like domain (PF04652) (**Supplementary Table S5**).

### Monolignol Pathway Proteins of Foxtail Millet

HMM profiling of PAL (SiPAL), C4H (SiC4H), 4CL (Si4CL), HCT (SiHCT), C3H (SiC3H), CCoAOMT (SiCCoAOMT), F5H (SiF5H), COMT (SiCOMT), CCR (SiCCR), and CAD (SiCAD) proteins in foxtail millet identified 10, 3, 20, 2, 2, 6, 2, 4, 33, and 13 members, respectively (**Supplementary Table S6**). Splice variants were found among these members, including three each in Si4CL16 and SiCCR14, two in SiCCR11 and one each in Si4CL5, SiCCoAOMT1, SiCOMT, SiCCR11, and SiCCR17. HMMSCAN revealed a diverse domain organization of these proteins (**Supplementary Table S7**). All of the SiPAL proteins possess the aromatic amino acid lyase (PF00221) domain, whereas Cytochrome P450 (PF00067) was present in all SiC4H, SiC3H, and SiF5H proteins. AMP-binding enzyme (PF00501) and AMP-binding enzyme C-terminal (PF13193) domains were present in all the Si4CL proteins except Si4CL13, which has only an AMP-binding enzyme domain. Both SiHCT1 and SiHCT2 have transferase family (PF02458) domains, and SiCCoAOMT proteins were found to possess O-methyltransferase (PF01596) and methyltransferase (PF13578) domains with the exception of SiCCoAOMT, which has two O-methyltransferase domains (**Supplementary Table S7**). The O-methyltransferase domain was also present in SiCOMT proteins, whereas SiCOMT2 has an additional dimerisation domain (PF13578). A diverse domain composition was observed among SiCCR proteins in addition to the presence of the signature NAD-dependent epimerase/dehydratase family (PF01370) and 3-beta hydroxysteroid dehydrogenase/isomerase family (PF01073) domains. Almost all the SiCCR proteins possess additional domains including GDP-mannose-4,6 dehydratase (PF16363), Male sterility protein (PF07993), NmrA-like family (PF05368), NAD(P)H-binding (PF13460), Polysaccharide biosynthesis protein (PF02719), and KR domains (PF08659). Of note, SiCCR7 was devoid of any of these domains except the NAD-dependent epimerase/dehydratase family domain, and SiCCR3 has an additional Alcohol dehydrogenase GroES-like domain (PF08240) (**Supplementary Table S7**). The presence of Alcohol dehydrogenase GroES-like and Zinc-binding dehydrogenase (PF00107) domains is the characteristic feature of SiCAD proteins, and in addition to these, the D-isomer specific 2-hydroxyacid dehydrogenase, NAD-binding domain (PF02826) was present in SiCAD4, SiCAD9, and SiCAD12. Moreover, an alanine dehydrogenase/PNT, C-terminal domain (PF01262) was found in SiCAD12 and SiCAD13 (**Supplementary Table S7**).

### Properties of Lignocellulose Pathway Proteins

Among the SiCesA proteins, SiCesA4 was the largest protein with 1095 amino acids (aa), followed by SiCesA2 (1092 aa), SiCesA11 (1090 aa) and SiCesA3 (1088 aa), and the smallest was SiCesA8 (884 aa) (**Supplementary Table S2**). The molecular weight of these proteins also varied accordingly, ranging from SiCesA8 (95.5 kDa) to SiCesA11 (123.2 kDa), with an isoelectric pH (pI) of 6.03 (SiCesA10) to 8.15 (SiCesA1). The protein instability index was between 36.07 (SiCesA11) to 50.62 (SiCesA8), which signified that all the SiCesA proteins except SiCesA2, SiCesA8, and SiCesA10 were stable. In the case of SiCsl proteins, the smallest protein was SiCslE2 with 144 aa and the largest was SiCslD1 (1217 aa), and their respective molecular weights ranged from 16.4 kDa (SiCslE2) to 132.2 kDa (SiCslD1). The pI of SiCsl proteins ranged from 4.61 (SiCslE2) to 9.32 (SiCslF7), and their instability index range (31.44–67.71) revealed that a maximum of SiCsl proteins (∼33%) were stable. The size and molecular weights of SiGsl proteins ranged from 418 aa (47.8 kDa in SiGsl9) to 1956 aa (225.2 kDa in SiGsl8). Similarly, pI range of these proteins was between 8.61 (SiGsl12) and 9.69 (SiGsl9). The instability index range between 28.89 and 52.08 indicated that ∼46% of SiGsl proteins were stable and the rest are unstable (**Supplementary Table S2**).

The SiPAL class of monolignol pathway proteins showed a narrow range of protein properties, as their sizes varied from 699 (SiPAL1 and SiPAL2) to 851 aa (SiPAL10), with molecular weights from 74.9 kDa (SiPAL2) to 91.1 kDa (SiPAL10) (**Supplementary Table S6**). The pI range of SiPAL was between 5.82 and 6.52, and their instability index range (28.82–39.84) showed that all the proteins except SiPAL5 were stable. The three members of SiC4H, namely SiC4H1, SiC4H2, and SiC4H3 had molecular sizes of 530 aa (59.7 kDa), 430 aa (49.3 kDa), and 506 aa (57.9 kDa), respectively. Their respective pI were 9.26, 7.72, and 9.33, and their instability index (46.46, 49.84, and 48.61) revealed that SiC4H proteins were stable. Among the Si4CL proteins, Si4CL4 and Si4CL10 were the smallest proteins with 198 aa (21.8 and 21.7 kDa in size, respectively) and the largest was Si4CL9 (642 aa; 68.5 kDa). Their pI range was between 5.14 and 8.98. The protein instability index ranged from 24.76 (Si4CL3) to 47.96 (Si4CL6) hinting that all the Si4CL proteins except Si4CL3 were stable. SiHCT, SiC3H, and SiF5H proteins have two members each, with a narrow range of protein properties, and all these proteins were found to be stable as indicated by their stability index. A significant difference was observed with the sizes of SiF5H members since SiF5H1 was 158 aa (16.7 kDa) and SiF5H2 was 524 aa (57.7 kDa) (**Supplementary Table S6**). Among SiCCoAOMT proteins, the smallest protein was SiCCoAOMT1 with 243 aa (25.7 kDa) and the largest was SiCCoAOMT5 with 307 aa (33.4 kDa). The pI range was between 5.04 and 8.94, and the protein instability index range (27.69–51.49) showed that except SiCCoAOMT4, all others were stable. The three-member SiCOMT class proteins have molecular sizes of 247 aa (25.8 kDa; SiCOMT1), 402 aa (43.53 kDa; SiCOMT2), and 153 aa (16.71 kDa; SiCOMT3). The pI values were 5.09, 5.97, and 9 for SiCOMT1, SiCOMT2, and SiCOMT3, respectively. 
The instability index range (42.24–52.75) hinted that all SiCOMT proteins are stable. Among the monolignol pathway proteins, SiCCR class has the highest number (26 members) and their sizes ranged from 27.2 kDa (251 aa; SiCCR26) to 69.13 kDa (625 aa; SiCCR9), with a pI range of 4.72 (SiCCR23) to 9.32 (SiCCR19). The protein instability index ranged from 24.86 (SiCCR18) to 54.11 (SiCCR13), which points out that ∼77% of SiCCR proteins were stable. In the case of SiCAD proteins, SiCAD9 and SiCAD13 were the smallest proteins with 336 aa (35.6 and 36.4 kDa in size, respectively) and SiCAD8 was the largest with 495 aa (52.7 kDa). The pI ranged from 5.05 to 9.24, and the instability index (19.35–39.79) showed that ∼50% of SiCAD proteins are unstable (**Supplementary Table S6**).

### Sequence Alignment and Phylogenetic Analysis of CesA/Csl and Gsl Proteins

SiCesA and SiCsl proteins were aligned individually, and the alignment revealed the presence of the conserved "DXD, D, QXXRW" motif in both superfamilies. All the SiCesA proteins except SiCesA8 have a "DCD, D, QVLRW" consensus sequence, whereas SiCesA8 had a unique "DYD, D" sequence and lacked the "QXXRW" motif (**Supplementary Figure S1**). Notably, the SiCesA8 protein has only the CS domain, while the other SiCesA proteins possess CS, ZF, and GT2 domains (**Supplementary Table S3**). In the case of SiCsl proteins, the "DXD" motif is absent in all the members of SiCslA and SiCslC and in SiCslE2 (**Supplementary Figure S2**). This motif was predominantly "DCD," except in SiCslF1 and SiCslF2, which have "DGD." The second consensus "D" amino acid is present in all the SiCsl members (as "ED"), except SiCslA6, SiCslE2, and SiCslF4 (**Supplementary Figure S2**). In addition, SiCslA6 and SiCslE2 also lacked the "QXXRW" motif, whereas subgroup-wise conservation of this motif was observed in the rest of the members. The majority of SiCslA (7) and all the SiCslC members have the "QQHRW" motif, whereas SiCslE proteins have "QHKRW," and SiCslH and SiCslJ proteins have "QYKRW" and "QNKRW" motifs, respectively (**Supplementary Figure S2**). The unrooted phylogenetic tree constructed using the amino acid sequences of SiCesA/Csl proteins along with CesA/Csl proteins of rice and Arabidopsis (https://cellwall.genomics.purdue.edu/intro/index.html) showed 2 distinct clusters, namely I and II (**Figure 1**). Cluster I was resolved into six branches including CesA, CslD, CslE, CslF, CslH, and CslJ, whereas cluster II had two branches, CslA and CslC.
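Presence or absence of the "DXD" and "QXXRW" motifs, as discussed above, can be checked on an unaligned sequence with simple regular expressions, where "X" stands for any residue. The sketch below is a simplified stand-in for inspecting the alignment; the function name and the toy sequences are hypothetical, and because short patterns such as "DXD" occur by chance, hits would still need to be confirmed against the aligned positions.

```python
import re
from typing import List, Tuple

DXD = re.compile(r"D.D")      # "X" = any single residue
QXXRW = re.compile(r"Q..RW")


def find_motif(sequence: str, pattern: "re.Pattern[str]") -> List[Tuple[int, str]]:
    """Return (1-based position, matched text) for non-overlapping hits."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    toy = "MKDCDAAQVLRWAL"  # hypothetical fragment, not a real SiCesA sequence
    print(find_motif(toy, DXD), find_motif(toy, QXXRW))
```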

Sequence alignment of SiGsl proteins showed that the N-terminal region of all these proteins was diverse, whereas the C-terminal region was conserved (**Supplementary Figure S3**). Prediction of transmembrane (TM) helices in these proteins using TMHMM Server v2.0 (http://www.cbs.dtu.dk/services/TMHMM/) showed the presence of 7–16 TM helices in SiGsl proteins (**Supplementary Figure S4**). Phylogeny of foxtail millet, rice and Arabidopsis Gsl proteins showed three clusters (**Figure 2**). Cluster I included SiGsl4, SiGsl5, and SiGsl7, whereas cluster II comprised SiGsl2 and SiGsl3. SiGsl1, SiGsl6, SiGsl8, SiGsl10, SiGsl11, and SiGsl12 were included in cluster III.

### Sequence Alignment and Phylogenetic Analysis of Monolignol Biosynthesis Pathway Proteins

Sequence alignment and analysis of SiPAL proteins showed that all the members are almost completely conserved (**Supplementary Figure S5**). SiPAL2 was found to possess an extended N-terminal sequence of about 135 amino acids, which is unique to this class of protein. A phylogenetic tree constructed with PAL sequences of foxtail millet, eucalyptus, poplar, tobacco, medicago and Arabidopsis showed that the SiPAL proteins are phylogenetically divergent from the rest (**Figure 3A**). Sequence alignment of SiC4H showed that all the members share the conserved P450 superfamily domain and P450-featured motifs, namely, haem-iron binding motif (PFGVGRRSCPG), the T-containing binding pocket motif (AAIETT, the E-R-R-E-R-E-R), for optimal orientation of the enzyme (**Supplementary Figure S5**). Further, presence of conserved substrate recognition sites (SRSs) of C4H/CYP73A5 enzymes, including SRS1 (SRTRNVVFDIFTGKGQDMVFTVY), SRS2 (LSQSFEYNY), SRS4 (IVENINVAAIETTLWS), and SRS5 (RMAIPLLVPH) was also evidenced (**Supplementary Figure S5**). Phylogeny of SiC4H along with C4H protein sequences of other organisms showed the grouping of SiC4H1 with C4H1 proteins of eucalyptus and Phaseolus vulgaris, whereas SiC4H2 and SiC4H3 were found to be more divergent (**Figure 3B**).

Si4CL protein sequence alignment showed the presence of 2 highly conserved peptide motifs "box I" (LPYSSGTTGLPKGV; AMP binding signature) and "box II" (GEICIRG), in addition to other conserved regions (**Supplementary Figure S5**). Phylogeny of 4CL proteins showed grouping of Si4CL1, Si4CL2, Si4CL15, and Si4CL16 with switchgrass (Pvi4CL1), demonstrating their close proximity and similarly, Si4CL11 was found to be grouped with Pvi4CL2, whereas other Si4CL proteins formed their own distinct cluster (**Figure 3C**). Alignment of SiHCT sequences showed that all the proteins have the conserved motifs for the acyl transferase family, namely "HXXXDG" and "DFGWG" (**Supplementary Figure S5**). Multiple sequence alignment of SiC3H proteins showed the presence of Cytochrome P450 cysteine heme-iron ligand signature [FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD] (**Supplementary Figure S5**). The conserved motifs including three putative S-adenosyl-L-methionine binding motifs (A, B, and C) and CCoAOMT signature motifs (D, E, F, G, and H) were identified through multiple sequence alignment of SiCCoAOMT proteins (**Supplementary Figure S5**). Phylogenetic analysis of SiHCT, SiC3H, and SiCCoAOMT proteins with their respective family members of other organisms revealed the dissimilarity of foxtail millet proteins compared to their homologs (**Figures 3D–F**). In the case of CCoAOMT, SiCCoAOMT2 formed a distinct clade, whereas other SiCCoAOMT members were grouped together in one clade (**Figure 3F**).
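The cytochrome P450 cysteine heme-iron ligand signature quoted above is written in PROSITE pattern syntax, where `x` means any residue, `[...]` lists allowed residues, and `{...}` lists forbidden ones. A minimal sketch of translating such a pattern into a regular expression is given below; the function name is hypothetical, and only the syntax elements that appear in this signature are handled (repeat counts such as `x(2)` and terminal anchors are deliberately not supported).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regex string.

    Supported elements: single residues, x, [allowed] and {forbidden}.
    """
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        if element == "x":
            parts.append(".")                      # any residue
        elif element.startswith("{") and element.endswith("}"):
            parts.append("[^" + element[1:-1] + "]")  # forbidden residues
        else:
            parts.append(element)                  # residue or [allowed] set
    return "".join(parts)


if __name__ == "__main__":
    signature = "[FW]-[SGNH]-x-[GD]-{F}-[RKHPT]-{P}-C-[LIVMFAP]-[GAD]"
    regex = re.compile(prosite_to_regex(signature))
    # The haem-iron binding motif reported for SiC4H contains a match.
    print(regex.search("PFGVGRRSCPG").group())
```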

Because some members are truncated, SiF5H1 was not aligned with SiF5H2, nor SiCOMT2 with SiCOMT1 and SiCOMT3 (**Supplementary Figure S5**). Protein sequence alignment between SiCOMT1 and SiCOMT3 did not highlight any consensus motif, and their phylogenetic analysis with COMT proteins of other plants showed grouping of SiCOMT with ZmaCOMT of maize (**Figure 3G**). Sequence alignment of SiCCR proteins revealed that the conserved "KNWYCYGK" motif, the catalytic site or the binding site for the cofactor NADPH (Larsen, 2004), has been diversified in foxtail millet (**Supplementary Figure S5**). Except for SiCCR1 and SiCCR24, the SiCCR proteins have at least one amino acid change in this motif, which could be attributed to the substrate affinity of CCR proteins (Pichon et al., 1998). Phylogenetic analysis of SiCCR proteins showed that most of these proteins were clustered in a separate group, whereas a few were grouped with CCR proteins of maize, switchgrass and poplar (**Figure 3H**). Alignment results of SiCAD highlighted a high degree of similarity in conserved domains and binding residues, including the Zn-1 binding domain motif GHE(X)<sub>2</sub>G(X)<sub>5</sub>G(X)<sub>2</sub>V, the NADP(H) cosubstrate-binding motif GXG(X)<sub>2</sub>G (glycine-rich repeat) and the Zn-2 metal ion binding motif GD(X)<sub>9,10</sub>C(X)<sub>2</sub>C(X)<sub>2</sub>C(X)<sub>7</sub>C (**Supplementary Figure S5**). The phylogenetic tree of SiCAD with CAD proteins of other plant species showed clustering of most SiCAD proteins in one clade, with complete out-grouping of SiCAD10. SiCAD1 and SiCAD11 were found to cluster with poplar CAD proteins (**Figure 3I**).

### Gene Structure of Lignocellulose Pathway Genes

The sequence data of genomic DNA, transcript and CDS along with chromosomal locations of confirmed protein sequences of identified lignocellulose biosynthesis pathway enzymes were retrieved and analyzed for gene size, intron-exon and physical position (**Supplementary Tables S2**, **S6**). The size of SiCesA genes ranged from 3.1 (SiCesA8) to 6.9 kb (SiCesA9) and few genes including SiCesA3, SiCesA7, SiCesA5, and SiCesA9 have a maximum of 13 introns, whereas SiCesA12 was intronless (**Supplementary Figure S6**). The gene sizes of SiCsl ranged from 1.7 (SiCslA6 and SiCslE2) to 6.6 kb (SiCslA1 and SiCslF6), and their gene structure analysis revealed that SiCsl genes have up to eight introns (**Supplementary Figure S7**). The only intronless gene of SiCsl superfamily was SiCslE2. Among the SiGsl gene family members, SiGsl3 was the smallest gene (3.2 kb), whereas the largest one was SiGsl4 (17 kb). Interestingly, SiGsl genes were evidenced to contain numerous introns. SiGsl7 has a maximum of 49 introns, whereas SiGsl2 and SiGsl3 were intronless (**Supplementary Figure S8**).

SiPAL gene sizes ranged from 2.1 (SiPAL4) to 4.6 kb (SiPAL3), of which SiPAL4, SiPAL5, and SiPAL6 were intronless, SiPAL2 has two introns and other SiPAL genes have 2 introns each (**Supplementary Figure S9**). Among the Si4CL genes, Si4CL3 was the smallest gene (2 kb), whereas Si4CL15 was the largest (6.7 kb). A total of 10 Si4CL genes have 5 introns each, while the maximum number of introns was found in Si4CL5 (6 introns). Si4CL3 has the fewest, with a single intron (**Supplementary Figure S9**). The size of SiCCoAOMT genes ranged from 0.8 (SiCCoAOMT4) to 3 kb (SiCCoAOMT2), with the maximum number of introns (7) in SiCCoAOMT2. SiCCoAOMT3 and SiCCoAOMT4 have one intron each (**Supplementary Figure S9**). Among the SiCCR genes, SiCCR3 was 1.3 kb in size and, although it is the smallest gene of this class, it has eight introns. SiCCR9 and SiCCR22 are the largest genes, with a size of 5.8 kb, and both have 4 introns. SiCCR2 has a maximum of 10 introns, while SiCCR7 is the only intronless gene in this group. The size of SiCAD genes ranged from 1.4 (SiCAD9) to 4.2 kb (SiCAD1 and SiCAD8), with SiCAD7, SiCAD8, and SiCAD9 having a minimum of 2 introns each, whereas SiCAD5 has a maximum of 6 introns (**Supplementary Figure S9**).

### Chromosomal Location and Gene Duplication of Lignocellulose Pathway Genes

The identified secondary cell wall biosynthesis genes were plotted onto the nine chromosomes of foxtail millet to generate the physical map (**Figure 4**), which showed that the largest share of lignocellulose biosynthesis pathway genes (31; ∼22%) was present on chromosome 2, followed by chromosome 9 (24 genes; ∼17%) and chromosome 1 (21 genes; ∼15%), and a minimum of 4 genes (∼3%) were mapped on chromosome 8. Expansion of the respective gene families within the genome was analyzed by investigating tandem and segmental duplication, which showed that 7 genes underwent tandem duplication, whereas segmental duplication did not occur among the lignocellulose pathway genes (**Figure 4**). SiCesA members were distributed on chromosomes 2 (4 genes), 4 (1), 5 (2), and 9 (3), and none of the genes showed evidence of tandem or segmental duplication. SiCsl genes were present on all the chromosomes except chromosome 8, and duplication analysis revealed that SiCslE3 and SiCslE4 were a tandemly duplicated gene pair on chromosome 2. SiGsl members were distributed on chromosomes 1 (2 genes), 2 (1), 4 (2), 5 (4), and 9 (3), and no duplication pattern was observed in this gene family.

CAD proteins of *Setaria italica* (Si), *Eucalyptus gunnii* (Egu), *E. grandis* (Egr), *Nicotiana tabacum* (Nta), *Populus trichocarpa* (Ptr), *Pinus pinaster* (Ppi), *Pinus taeda* (Pta), *Medicago truncatula* (Mtr), *Panicum virgatum* (Pvi), *Zea mays* (Zma), *Malus domestica* (Mdom), *Vitis vinifera* (Vvi), *Eucalyptus globulus* (Egl), *Populus alba* x *Populus grandidentata* (Pag), *Petroselinum crispum* (Pec), *Populus tremuloides* (Ptm), *Phaseolus vulgaris* (Pvu), and *Eucalyptus robusta* (Er).

Among the monolignol biosynthesis genes, most SiPAL genes were located on chromosomes 1 (5) and 7 (3), and interestingly, SiPAL4 and SiPAL5 as well as SiPAL8 and SiPAL9 were identified as tandem duplicates. The three SiC4H genes were found on chromosomes 1, 3, and 5, one on each (**Figure 4**). The highest number of Si4CL genes was on chromosome 9 (7 genes), of which Si4CL11 and Si4CL12 form a tandemly duplicated gene pair. Chromosomes 1 and 6 have two Si4CL members each, and chromosomes 2, 3, 4, 5, 7, and 8 have one member each. The two members each of SiHCT, SiC3H, and SiF5H, as well as the three SiCOMT genes, were located on chromosomes 1, 3, 6, 7, 8, and 9 (**Figure 4**). Four out of five SiCCoAOMT genes were located on chromosome 6 and SiCCoAOMT1 was mapped on chromosome 2; duplication analysis revealed that SiCCoAOMT3 and SiCCoAOMT4 form a tandemly duplicated gene pair. Among the SiCCR genes, SiCCR26 could not be mapped because its coordinates were not available in the Phytozome database. Of the 25 SiCCR genes mapped, the largest number was on chromosome 4 (8), followed by chromosomes 2 (6) and 1 (4). Of the 13 SiCAD genes, the largest number was on chromosome 2 (5), with a minimum of one each on chromosomes 1, 4, and 9. SiCAD2 and SiCAD3 on chromosome 2 as well as SiCAD8 and SiCAD9 on chromosome 6 were found to be tandemly duplicated gene pairs (**Figure 4**).
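The per-chromosome counts quoted above can be cross-checked with a short tally. The following is a minimal sketch, not part of the original analysis; the dictionary and function names are illustrative, and only the counts themselves are taken from the text.

```python
# Per-chromosome gene counts as reported in the text (chromosome -> genes).
# Variable and function names are illustrative, not from the original study.
si4cl_by_chromosome = {1: 2, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 1, 8: 1, 9: 7}
sigsl_by_chromosome = {1: 2, 2: 1, 4: 2, 5: 4, 9: 3}


def family_total(counts):
    """Total number of mapped genes in a family."""
    return sum(counts.values())


def richest_chromosome(counts):
    """Chromosome carrying the most members of the family."""
    return max(counts, key=counts.get)


si4cl_total = family_total(si4cl_by_chromosome)
sigsl_total = family_total(sigsl_by_chromosome)

print("Si4CL genes mapped:", si4cl_total,
      "| richest chromosome:", richest_chromosome(si4cl_by_chromosome))
print("SiGsl genes mapped:", sigsl_total,
      "| richest chromosome:", richest_chromosome(sigsl_by_chromosome))
```

The totals agree with the family sizes reported elsewhere in the article (17 Si4CL and 12 SiGsl genes).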

### Promoter Analysis on Lignocellulose Pathway Genes

In silico analysis for predicting putative cis-regulatory elements showed the presence of universal as well as gene-specific cis-elements in the upstream regions of lignocellulose pathway genes (**Supplementary Tables S8**, **S9**). A total of 271 cis-elements were found in CesA/Csl and Gsl genes, of which 15 (5.5%), namely ACGTATERD1, ARR1AT, CAATBOX1, CACTFTPPCA1, DOFCOREZM, EBOXBNNAPA, GATABOX, GT1CONSENSUS, GTGANTG10, MYCCONSENSUSAT, NODCON2GM, OSE2ROOTNODULE, POLLEN1LELAT52, WBOXNTERF3, and WRKY71OS, were present in all these genes (**Supplementary Table S8**). Thirty-nine cis-elements (∼14%) were unique, each occurring in only one gene of the CesA/Csl and Gsl superfamilies, such as ABADESI1 (SiCslF6), CEREGLUBOX3PSLEGA (SiCesA2), GBOXLERBCS (SiCslA8), ZDNAFORMINGATCAB1 (SiCslA6), and TATCCACHVAL21 (SiGsl3). In addition, a few cis-elements were present in all genes except one or two; these include BIHD1OS (SiCslC4), CCAATBOX1 (SiCslA1, SiCslF4), CURECORECR (SiCslC3, SiGsl1), DPBFCOREDCDC3 (SiCslC2, SiCslC4), EECCRCAH1 (SiGsl5), MYBCORE (SiCesA3, SiCslD4), RAV1AAT (SiCslD1), and SORLIP1AT (SiCesA4, SiGsl8). Of note, no superfamily-specific regulatory elements were identified (**Supplementary Table S8**).

A total of 293 cis-elements were detected in the upstream regions of monolignol pathway genes, of which 10 (3.4%) were present in all the genes and 37 (∼13%) were unique to a single gene (**Supplementary Table S9**). The elements present in all the genes were ARR1AT, CAATBOX1, CACTFTPPCA1, DOFCOREZM, EBOXBNNAPA, GATABOX, GT1CONSENSUS, GTGANTG10, WBOXNTERF3, and WRKY71OS. A few cis-regulatory elements were present in all except one or two genes; these include ACGTATERD1 (SiPAL2), CURECORECR (SiPAL2, SiPAL10), and MYBCORE (SiPAL7, SiCCR16). As with CesA/Csl and Gsl, no superfamily-specific regulatory elements were found among the monolignol genes (**Supplementary Table S9**).
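The three classes of cis-elements described above (present in every gene, present in all but one or two genes, and unique to a single gene) can be derived from a simple gene-to-element table. The sketch below is illustrative only: the gene names and promoter contents in `toy_promoters` are hypothetical placeholders, the function names are ours, and it is not the procedure used by the authors. Only the element totals passed to `share` come from the text.

```python
from collections import Counter


def classify_elements(promoters):
    """Split cis-elements into universal, near-universal and unique classes.

    promoters: dict mapping gene name -> set of cis-element names.
    """
    n_genes = len(promoters)
    counts = Counter(e for elems in promoters.values() for e in set(elems))
    universal = sorted(e for e, c in counts.items() if c == n_genes)
    unique = sorted(e for e, c in counts.items() if c == 1)
    # Elements missing from only one or two genes, with the genes lacking them.
    near_universal = {
        e: sorted(g for g, elems in promoters.items() if e not in elems)
        for e, c in counts.items()
        if 1 < c < n_genes and n_genes - c <= 2
    }
    return universal, near_universal, unique


def share(part, total):
    """Percentage of elements in a class, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Hypothetical toy input; gene names and contents are placeholders.
toy_promoters = {
    "gene1": {"ARR1AT", "GATABOX", "MYBCORE", "ABADESI1"},
    "gene2": {"ARR1AT", "GATABOX", "MYBCORE"},
    "gene3": {"ARR1AT", "GATABOX", "MYBCORE"},
    "gene4": {"ARR1AT", "GATABOX", "MYBCORE"},
    "gene5": {"ARR1AT", "GATABOX"},
}
universal, near_universal, unique = classify_elements(toy_promoters)
print(universal, near_universal, unique)

# Shares quoted in the text, recomputed from the reported totals.
print(share(15, 271), share(39, 271), share(10, 293), share(37, 293))
```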

### MicroRNAs and Molecular Markers of Lignocellulose Pathway Genes

In silico scanning of lignocellulose pathway gene transcripts to identify their targeting miRNAs showed that the transcripts of SiCslC2, SiGsl10, and SiF5H2 could be targeted by the miRNAs sit-miRn29, sit-miR114-npr and sit-miR395b, respectively (**Supplementary Table S10**). SiGsl3 was predicted to be targeted by two foxtail millet miRNAs, namely sit-miR156d-1 and sit-miR156d-2. These miRNAs may have a role in post-transcriptional gene silencing that regulates lignocellulose pathway gene expression. Identification of previously reported molecular markers in the genic and regulatory regions of lignocellulose pathway genes revealed the presence of SSR and ILP markers in 34 genes (**Supplementary Table S11**). Of these, three genes have two markers each, three genes have three markers each, and the remaining 28 genes possess a single marker. Among the markers, SSRs were predominant (∼81%) and the rest were ILPs (∼19%).
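The marker counts above imply a total number of markers that the text does not state explicitly. The sketch below works it out, reading the marker description as three genes with two markers, three genes with three markers and 28 genes with one marker. The SSR and ILP counts it prints are inferred from the rounded percentages rather than reported in the article, so they should be treated as an assumption.

```python
# Markers per gene -> number of genes, read from the text (3 + 3 + 28 = 34 genes).
genes_by_marker_count = {1: 28, 2: 3, 3: 3}

total_genes = sum(genes_by_marker_count.values())
total_markers = sum(k * n for k, n in genes_by_marker_count.items())

# Inferred, not reported: SSR counts whose share rounds to the quoted ~81%.
ssr_candidates = [
    k for k in range(total_markers + 1)
    if round(100 * k / total_markers) == 81
]

print("genes with markers:", total_genes)
print("total markers:", total_markers)
for ssr in ssr_candidates:
    ilp = total_markers - ssr
    print("SSR:", ssr, "ILP:", ilp,
          "ILP share (%):", round(100 * ilp / total_markers))
```

Under this reading, only one integer split of the markers is consistent with the rounded SSR and ILP percentages.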

### Expression Profile of Lignocellulose Pathway Genes in Tissues and Dehydration Stress

Expression of all the genes in four tissues and under dehydration stress was calculated using RPKM values derived from RNA-seq data. The tissue-specific expression profile showed a differential expression pattern for all the genes, with relatively lower expression in leaf (**Figure 5**). In the CesA/Csl and Gsl superfamilies, SiCesA1, SiGsl2, SiGsl10, and SiGsl12 showed higher expression in all four tissues than the other members of the same gene family. Tissue-specific higher expression of SiCslD1 in spica, and of SiCslE4 and SiCslJ2 in leaf, was also observed. Many genes, including SiCesA6, SiCesA8, SiCslA3, SiCslC3, SiGsl3, and SiGsl7, were not expressed in these tissues (**Figure 5A**). Tissue-specific expression profiling of monolignol genes showed higher expression of SiPAL1, SiPAL2, SiPAL7, SiC4H2, Si4CL1, Si4CL3, Si4CL6, SiHCT2, SiCOMT2, SiCCR11, SiCAD1, and SiCAD5 in all four tissues studied. Tissue-specific higher expression was observed for SiPAL4, Si4CL10, and SiCAD3 in root, and for Si4CL9 and SiCAD12 in spica. Similar to CesA/Csl and Gsl, monolignol genes also showed relatively lower expression in leaf tissue (**Figure 5B**). Expression profiling of all the genes in response to dehydration stress showed almost uniform expression in control and stress samples (**Figure 5**). Comparison of expression patterns between the tissue and stress libraries revealed that the expression of most lignocellulose pathway genes was unaltered. Only three genes, namely SiCslA8, SiCslA9, and Si4CL4, showed higher expression in the dehydration stress library than in the control; of these, SiCslA8 and SiCslA9 were expressed only during stress and not in any of the tissue-specific RNA-seq libraries. A few genes that were highly expressed in the control were down-regulated during stress, including SiCslA5, SiCslA6, SiCslA7, SiCslF2, and SiCCR26 (**Figure 5**).
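For readers unfamiliar with the unit, RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw read count by gene length and sequencing depth. The sketch below implements the standard definition; the function name and the example numbers are illustrative and are not values from this study.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 1e9 * reads mapped to the gene
           / (gene length in bp * total mapped reads in the library)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return 1e9 * read_count / (gene_length_bp * total_mapped_reads)


# Illustrative numbers only: 500 reads on a 2 kb gene in a 10-million-read library.
example_value = rpkm(read_count=500, gene_length_bp=2000,
                     total_mapped_reads=10_000_000)
print(example_value)
```

Because both gene length and library size appear in the denominator, values are comparable across genes of different sizes and across the tissue and stress libraries.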

### Homologous Relationships of Lignocellulose Pathway Genes with Other Grasses

Homologs of foxtail millet lignocellulose pathway genes in sequenced C<sub>4</sub> panicoid genomes, namely switchgrass (Panicum virgatum), sorghum (Sorghum bicolor), and maize (Zea mays), were derived (**Figure 6**). The highest lignocellulose pathway gene-based homology was observed between foxtail millet and switchgrass, as 19 genes of foxtail millet showed homology with 60 genes of switchgrass (**Supplementary Table S12**). Of the 19 foxtail millet genes, six belonged to SiGsl, four to SiCCR, three each to SiCsl and SiPAL, and one each to SiHCT, Si4CL and SiCAD. Eighteen foxtail millet genes showed an orthologous relationship with 41 sorghum genes, of which SiGsl11 had a maximum of 11 homologs, followed by SiGsl7 (7 homologs) and SiGsl5 and SiCCR17 (3 homologs each) (**Supplementary Table S13**). In the case of foxtail millet-maize homology, 26 foxtail millet genes showed a homologous relationship with 38 maize genes (**Supplementary Table S14**). Among the foxtail millet genes, SiGsl had a maximum of 7 homologs in maize, followed by SiGsl7 (3 homologs).

Among the lignocellulose pathway proteins, CADs and COMTs are well characterized, as they play a key role in secondary cell wall lignification (Saballos et al., 2009; Saathoff et al., 2011a,b, 2012; Sattler et al., 2012; Trabucco et al., 2013). Sequence analysis of these proteins in several grasses identified conserved motifs in a few members, which distinguish them as lignifying proteins from the non-lignifying proteins. Lignifying CADs possess 12 additional amino acids, T49, Q53, L58, M60, C95, W119, V276, P286, M289, L290, F299, and I300, which are involved in substrate recognition and binding (Youn et al., 2006). Of the 13 SiCAD proteins, SiCAD11 contains 11 of the 12 conserved amino acid residues. Of note, the active substrate-binding residues W119 and F298, which determine specificity for aromatic alcohols, and the NADP(H) binding site, S212, were present in SiCAD11. Sequence-based homology analysis showed a high percentage of identity between SiCAD11 and lignifying CADs of other grasses, namely switchgrass (Pavir.J34526; 91%), sorghum (Sobic.006G211900; 89%) and maize (GRMZM5G844562; 85%). Similarly, the conserved amino acids M130, N131, L136, A162, H166, F176, M180, H183, I319, M320, and N324, which function in substrate binding and positioning in COMTs (Sattler et al., 2012; Trabucco et al., 2013), were present in SiCOMT02 of foxtail millet. SiCOMT02 showed high percent identity to homologs in sorghum (Sobic.007G047300; 94%), switchgrass (Pavir.Fa01907; 85%), and maize (AC196475.3; 89%).

### Duplication and Divergence of Lignocellulose Pathway Genes

The number of non-synonymous substitutions per non-synonymous site (Ka) and synonymous substitutions per synonymous site (Ks) was calculated for paralogous as well as homologous gene pairs, and the Ka/Ks ratio along with the time of divergence (in million years ago; mya) was derived. The Ka/Ks ratio for tandemly duplicated gene pairs ranged from 0.09 to 0.18 with an average value of 0.13, which suggested that these genes were under strong purifying selection (Ka/Ks < 1), and the duplication event was predicted to have occurred around 25 mya (**Supplementary Table S15**). Among homologous gene pairs, the Ka/Ks ratio was highest between foxtail millet and switchgrass (average Ka/Ks = 0.91; **Supplementary Table S12**), whereas foxtail millet-sorghum and foxtail millet-maize homologs showed an average ratio of 0.19 (**Supplementary Tables S13**, **S14**). Since these values were less than 1, they indicate that purifying selective pressure acted on the respective protein-coding genes. The divergence between foxtail millet and switchgrass was predicted to have occurred around 4.7 mya, whereas the divergence of foxtail millet-sorghum and foxtail millet-maize occurred around 27 mya. This suggests that duplication and divergence have played a key role in shaping the lignocellulose pathway multigene families in foxtail millet and other C<sub>4</sub> grass genomes.
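The interpretation of Ka/Ks and the conversion of Ks into a divergence time can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the function names are ours, the molecular-clock relation T = Ks / (2λ) is the commonly used one, and the substitution rate λ = 6.5 × 10⁻⁹ per synonymous site per year is an assumed value often applied to grasses that is not stated in this section. The Ka and Ks inputs in the example are hypothetical.

```python
# ASSUMPTION: synonymous substitution rate (per site per year); not stated in
# this section, but a value commonly applied to grass genomes.
ASSUMED_RATE = 6.5e-9


def selection_mode(ka, ks):
    """Classify selective pressure from the Ka/Ks ratio."""
    ratio = ka / ks
    if ratio < 1:
        return "purifying"
    if ratio > 1:
        return "positive"
    return "neutral"


def divergence_time_mya(ks, rate=ASSUMED_RATE):
    """Estimate divergence time (million years ago) as T = Ks / (2 * rate)."""
    return ks / (2 * rate) / 1e6


# Hypothetical gene pair: Ka = 0.042, Ks = 0.325 (Ka/Ks is about 0.13).
print(selection_mode(0.042, 0.325))
print(divergence_time_mya(0.325))
```

With this assumed rate, a Ks of about 0.33 corresponds to roughly 25 mya; a different rate would rescale all divergence times proportionally.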

### Expression Profile of Candidate Genes during Stress and Hormone Treatments

Expression profiling of sixteen candidate lignocellulose biosynthesis genes, namely SiCesA5, SiCesA9, SiGsl2, SiGsl12, Si4CL10, SiPAL2, SiPAL7, SiC4H2, SiHCT1, SiCCoAOMT3, SiF5H2, SiCOMT2, SiCCR7, SiCCR22, SiCAD1, and SiCAD6, in response to stress (dehydration, salinity, cold) and hormone (abscisic acid, salicylic acid, methyl jasmonate) treatments was performed at two time points (1 h, early; 24 h, late). These candidates were chosen based on (i) expression profiles deduced in silico using RNA-seq data, (ii) representation of the nine chromosomes of foxtail millet, and (iii) their function in secondary cell wall formation, such as SiCOMT2 in lignification. Overall, the study demonstrated a differential expression pattern of these genes during stress and hormone treatments, except for SiCCR22, which was down-regulated under all conditions (**Figure 7**). SiGsl2 and SiGsl12 were highly expressed during all three stress conditions, whereas SiCAD6 was up-regulated during both salinity and dehydration stress. Dehydration stress induced the expression of all the genes except SiCCoAOMT3, SiCOMT2, SiCesA5, SiCCR22, SiPAL7, SiCCR7, and SiCesA9, though the degree of induction varied between genes. Salinity stress induced the expression of SiC4H2, SiCAD6, SiF5H2, SiGsl12, and SiGsl2, while SiPAL2 was induced during early salt stress and SiCAD1, Si4CL10, and SiCCR7 were up-regulated in late-phase salinity stress, suggesting significantly higher expression among members of the SiGsl and SiCAD families. Significant up-regulation of SiGsl2, SiGsl12, Si4CL10, SiHCT1, and SiCCR7 during cold stress suggests the putative involvement of these genes in strengthening the cell wall for tolerance to low temperature. Higher expression of these genes was also found during both early and late phases of treatment with salicylic acid and methyl jasmonate.
Differential expression of candidate genes was observed during all hormone treatments except abscisic acid (ABA), which had no effect on the expression of most candidate genes. The exceptions were SiGsl2, which was induced at the early phase of ABA treatment, SiCCR7 and SiCesA9, which were induced at the late phase, and SiC4H2, which was induced at both phases. In addition, expression of SiCCoAOMT3, SiCOMT2, SiCCR22, SiPAL7, SiCAD1, and SiCAD6 was down-regulated during hormone treatments, while SiF5H2 was up-regulated only in the late phase of salicylic acid treatment.
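Fold changes of this kind are commonly derived from quantitative RT-PCR threshold cycles by the 2^-ΔΔCt method, normalizing the target gene against a reference gene (the figure legend names *Act2* as the internal control). This section does not state the exact calculation used, so the sketch below is an assumption about the general approach; the function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^-ddCt method.

    dCt  = Ct(target) - Ct(reference), computed separately per sample
    ddCt = dCt(treated) - dCt(control)
    fold = 2 ** (-ddCt); values > 1 indicate up-regulation.
    """
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_treated - delta_control)


# Hypothetical Ct values: target crosses threshold two cycles earlier under stress.
induced = relative_expression(22.0, 18.0, 24.0, 18.0)
# Hypothetical Ct values: target crosses threshold one cycle later under stress.
repressed = relative_expression(25.0, 18.0, 24.0, 18.0)
print(induced, repressed)
```

The method assumes near-100% amplification efficiency for both target and reference primers; where that does not hold, an efficiency-corrected model would be needed.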

## DISCUSSION

Cellulose, hemicelluloses and lignin constitute the complex polymeric structure of secondary cell wall and the lignocellulose biosynthesis pathway involves the action of cellulose synthase (CesA), cellulose synthase-like (Csl), glucan synthase-like (Gsl), phenylalanine ammonia lyase (PAL), trans-cinnamate 4-hydroxylase (C4H), 4-coumarate CoA ligase (4CL), hydroxycinnamoyl CoA:shikimate/quinate hydroxycinnamoyl transferase (HCT), p-coumaroyl shikimate 3′ -hydroxylase (C3H), caffeoyl CoA 3-O-methyltransferase (CCoAOMT), ferulate 5-hydroxylase (F5H), caffeic acid O-methyltransferase (COMT), cinnamoyl CoA reductase (CCR), and cinnamyl alcohol dehydrogenase (CAD) genes, which are well studied in several crop plants as well as trees for understanding and improving biofuel traits (Zhong and Ye, 2015). In the present study, all these gene families in foxtail millet were systematically identified and characterized using in silico approaches, and expression profiling of chosen genes was performed in response to several stress as well as hormonal treatments for identifying target genes for functional characterization.

A total of 13 CesA and 36 Csl genes were identified in foxtail millet. All the SiCesA proteins were found to possess the characteristic cellulose synthase (CS) domain, and 12 SiCesA had an additional zinc finger (ZF) structure. Similarly, 11 CesA proteins have been reported in rice, of which 9 contained both CS and ZF domains, and 2 lacked the ZF domain (Wang et al., 2010). The role of CesA proteins in cellulose biosynthesis in both primary and secondary cell walls has been well dissected in Arabidopsis. In this plant, 10 CesA genes have been identified (Richmond and Somerville, 2000), of which AtCesA1, AtCesA3, and AtCesA6 were reported to be involved in primary cell wall cellulose synthesis (Persson et al., 2007), AtCesA4, AtCesA7, and AtCesA8 in secondary cell wall development, and AtCesA2, AtCesA5, AtCesA9, and AtCesA10 in tissue-specific cellulose biosynthesis processes (Gardiner et al., 2003; Taylor et al., 2003). Recent functional characterization of AtCesA proteins led to the identification of unidirectional movement of these protein complexes in seed coat epidermal cells, where they deposit cellulose involved in mucilage extrusion, adherence and ray formation (Griffiths et al., 2015). In flax (Linum usitatissimum), 14 distinct CesA genes were identified and targeted for silencing using a virus-induced gene silencing (VIGS) approach, which showed impacts on outer-stem tissue organization and secondary cell wall formation (Chantreau et al., 2015). A genome-wide association study of single nucleotide polymorphisms (SNPs) developed through re-sequencing of diverse chickpea accessions revealed a superior haplotype and favorable natural allelic variants in the upstream regulatory region of a CesA gene, denoted as Ca\_Kabuli\_CesA3 (Kujur et al., 2015). Interestingly, up-regulation of this superior gene haplotype resulted in higher transcript expression of the Ca\_Kabuli\_CesA3 gene in pollen and pod of a high pod/seed number chickpea accession, thus resulting in enhanced accumulation of cellulose (Kujur et al., 2015). The specific allelic variant caused cellulose changes specifically in pollen tubes of chickpea; therefore, investigating the homologous gene of foxtail millet identified in the present study could provide novel clues on its role, which could be manipulated to achieve greater biomass yield and bioconversion efficiency.

cold stress (CS) as well as abscisic acid (ABA), salicylic acid (SA) and methyl jasmonate (MJ) treatments for 0 (Control: CTL), 1 and 24 h. *Act2* was used as an internal control to normalize the data. The error bars representing standard deviation were calculated based on three technical replicates for biological triplicates. Statistical analysis between treatment and control using Tukey-Kramer multiple comparisons test has been performed and the differences in the effects of stress treatments in all the genes were considered statistically significant at \**P* < 0.05, \*\**P* < 0.01, \*\*\**P* < 0.001.

Physical map of SiCesA genes showed their distribution in chromosomes 2, 3, 4, 5, and 9, with a maximum of 4 genes in chromosome 4 and minimum of one gene in chromosome 3 (**Figure 4**). Extension of gene families is attributed to the occurrence of three major duplication mechanisms, namely segmental, tandem and retroposition (Cannon et al., 2004). However, none of these duplications were found to be involved in the expansion of SiCesA genes as revealed through MCScanX analysis though both tandem and segmental duplication events were reported in OsCesA family (Wang et al., 2010). Being a member of glycosyltransferase 2 (GT2) family, CesA proteins have the conserved "DXD, D, QXXRW" motif (Somerville et al., 2004) and conforming to this, all the SiCesA proteins except SiCesA8 have a "DCD, D, QVLRW" consensus sequence, whereas SiCesA8 had a unique "DYD, D" sequence and the motif "QXXRW" was absent. Similar sequence variations have also been reported by Wang et al. (2010) in rice. Studies on CesA gene family in crop plants have revealed the presence of a large family of cellulose synthase-like (Csl) genes with sequence similarity to CesA (Richmond and Somerville, 2000), and these genes are shown to be involved in biosynthesis of hemicelluloses (Yin et al., 2009). Similar to CesA, Csl proteins also belong to GT2 family and possess the conserved "DXD, D, QXXRW" motif (Somerville et al., 2004). In foxtail millet, 36 Csl genes were identified and categorized as CslA, CslC, CslD, CslE, CslF, CslH, and CslJ in accordance to the classification followed by Wang et al. (2010) in rice. Interestingly, 2 CslJ genes were identified in foxtail millet, which were reported to be specific to cereals though they are not present in rice and Brachypodium (Fincher, 2009). Domain analysis has shown the presence of GT2 domains in all SiCslA and SiCslC proteins, whereas other SiCsl possess CS domain. 
Similar reports in Arabidopsis and rice have shown the presence of the characteristic GT2 domain in CslA and CslC proteins (Yin et al., 2009; Wang et al., 2010). Studies have shown that the CslA and CslC subgroups are the most divergent proteins, which have evolved through duplication and divergence from a common ancestral gene (Yin et al., 2009; Del Bem and Vincentz, 2010), and therefore share similar structural and physicochemical properties (Youngs et al., 2007). Nevertheless, the membrane topology and enzymatic function of these proteins differ markedly (Davis et al., 2010; Liepman and Cavalier, 2012). In addition, most SiCslD family proteins have an additional RING/U-box-like zinc-binding domain, which contains a C3HC4 motif capable of binding zinc cations.

The molecular processes and biological functions of Csl genes have been less explored than those of CesA genes, though Csl proteins are equally important in cell structuring. Numerous reports have supported the involvement of CslA proteins in the synthesis of 1,4-β-mannan and glucomannan backbones (Dhugga et al., 2004; Liepman et al., 2005; Suzuki et al., 2006; Goubet et al., 2009; Gille et al., 2011), and heterologous expression of CslA genes has shown that a single enzyme can integrate mannose and glucose into glucomannan chains (Suzuki et al., 2006; Liepman et al., 2007; Gille et al., 2011). Similarly, CslC genes encode xyloglucan glucan synthases, which are involved in xyloglucan biosynthesis (Cocuron et al., 2007). Heterologous expression of AtCslC4 in Pichia pastoris produced soluble 1,4-β-glucans with a low degree of polymerization, whereas expression of AtCslC4 along with AtXXT1 (xyloglucan xylosyltransferase) produced insoluble 1,4-β-glucans with a higher degree of polymerization, suggesting the cooperative action of both enzymes in xyloglucan biosynthesis (Liepman and Cavalier, 2012). Though CslD proteins were speculated to be involved in xylan and homogalacturonan synthesis (Hamann et al., 2004; Bernal et al., 2008a,b; Li et al., 2009), Arabidopsis csld mutants have been shown to possess severe phenotypic defects including deformed root hairs (csld2; Bernal et al., 2008b), bursting root hairs (csld3; Bernal et al., 2008b), defective growth of the pollen tube (csld1 and csld4; Bernal et al., 2008b; Wang et al., 2011) and reduced plant growth (csld5; Bernal et al., 2008a). These reports suggest a role for CslD in normal growth and development of plants beyond their function in xylan and homogalacturonan synthesis. The present study identified 4 SiCslE genes, a family that has not yet been characterized in any crop species. One CslE gene in Arabidopsis and two in rice have been reported to date.
The CslF family of genes is considered to be present among grass species, where it regulates the synthesis of mixed-linkage glucan ((1,3;1,4)-β-D-glucan) (Hazen et al., 2002; Burton et al., 2006). Mutation of the barley CslF6 gene resulted in a reduction of (1,3;1,4)-β-D-glucan and had an impact on the chemical composition of barley grains (Hu et al., 2014), whereas overexpression of this gene in Nicotiana benthamiana led to accumulation of (1,3;1,4)-β-D-glucan (Wong et al., 2015). Recently, Jin et al. (2015) have demonstrated the role of OsCslF6 in affecting phosphate accumulation, altering the level of carbon metabolism in rice. Similar to CslF, CslH and CslJ are also grass-specific gene families involved in deposition of (1,3;1,4)-β-D-glucan (Doblin et al., 2009; Yin et al., 2009, 2014). In the present study, two genes each belonging to the CslH and CslJ families were identified.

Similar to CesA/Csl, the glucan synthase-like (Gsl) protein family is also involved in polysaccharide biosynthesis, particularly in synthesis of the 1,3-β-D-glucan callose (Li et al., 1999). Callose is deposited in developing cell walls of fiber cells, seed hairs and plasmodesmatal canals. Moreover, deposition of callose is also reported in response to pathogen invasion (Muthamilarasan and Prasad, 2013) and abiotic stress including desiccation, wounding and metal toxicity (Stone and Clarke, 1992). In spite of the importance of Gsl genes, only limited studies have been performed to elucidate the molecular role of these genes and their respective proteins. In Arabidopsis, 12 Gsl genes have been identified (https://cellwall.genomics.purdue.edu/intro/index.html), and mutating AtGSL5 has been found to confer resistance to powdery mildew infection (Nishimura et al., 2003). A similar report by Jacobs et al. (2003) has also shown that silencing of AtGsl5 enhances the resistance of silenced lines to Sphaerotheca fusca, Golovinomyces orontii, and Blumeria graminis. In contrast to the role of callose as a physical barrier that prevents pathogen invasion, the reports by Nishimura et al. (2003) and Jacobs et al. (2003) have demonstrated the resistance of Arabidopsis to pathogens in the absence of callose. These reports underline the importance of studying the molecular and physiological roles of Gsl proteins in response to biotic as well as abiotic stress, and the present investigation has identified 12 SiGsl genes, which could serve as interesting candidates for functional characterization as foxtail millet is tolerant to environmental stresses.

In the case of monolignol biosynthesis, ten key enzymes namely PAL, C4H, 4CL, HCT, C3H, CCoAOMT, F5H, COMT, CCR, and CAD have been identified and characterized in the present study. Through systematic analysis, 10, 3, 17, 2, 2, 5, 2, 3, 26, and 13 proteins belonging to PAL, C4H, 4CL, HCT, C3H, CCoAOMT, F5H, COMT, CCR, and CAD families, respectively were identified (**Supplementary Table S6**). These numbers compared with the genes reported in Arabidopsis, poplar and eucalyptus has shown that foxtail millet has higher number of PAL genes (10) whereas other three organisms have 4, 5, and 9 genes, respectively (Raes et al., 2003; Shi et al., 2010; Carocha et al., 2015). Both foxtail millet and poplar have 2 C4H and 17 4CL genes, whereas Arabidopsis and eucalyptus have lesser number of C4H and 4CL genes. Of note, foxtail millet has a maximum of 26 CCR genes, while Arabidopsis has 7 and eucalyptus has 2 genes (Raes et al., 2003; Shi et al., 2010; Carocha et al., 2015). The identified monolignol biosynthesis genes were distributed in all the nine chromosomes of foxtail millet, of which two gene-pairs each of SiPAL (SiPAL4-SiPAL5; SiPAL8- SiPAL9) and SiCAD (SiCAD2-SiCAD3; SiCAD8-SiCAD9), and one pair each of Si4CL (Si4CL11-Si4CL12) and SiCCoAOMT (SiCCoAOMT3-SiCCoAOMT4) were identified to be tandemly duplicated (**Figure 4**). Phylogenetic analysis of foxtail millet monolignol biosynthesis proteins with bona fide proteins of eucalyptus, tobacco, poplar, Arabidopsis, maize, medicago and grape revealed that predominant proteins of foxtail millet are highly divergent (**Figure 3**).

Furthermore, promoter analysis has been performed for foxtail millet lignocellulose biosynthesis genes, which revealed the presence of diverse cis-regulatory elements that fall under the following categories; (i) cis-elements which are universally present in all the gene family members, (ii) cis-elements which are present in all the gene family members except one gene, and (iii) cis-element which is unique to any one gene of its corresponding gene family (**Supplementary Tables S8**, **S9**). These data suggest the transcriptional control of cell wall genes by the action of network of transcription factors. This would assist in understanding gene regulatory mechanism controlling the expression of lignocellulose genes and fine tuning them to achieve the optimal pattern of secondary cell-wall deposition. Since gene expression is also regulated at post-transcriptional level through miRNAs, the present study also identified foxtail millet miRNAs which target the transcripts of lignocellulose biosynthesis genes (**Supplementary Table S10**). Moreover, different kinds of molecular markers including SSRs, eSSRs, and ILPs present in both upstream and genic region of lignocellulose biosynthesis genes have been identified (**Supplementary Table S11**), which could be useful for conducting genomics-assisted breeding for biofuel traits in foxtail millet. In silico expression profiles of all the lignocellulose biosynthesis genes in four tissues as well as dehydration library revealed the differential expression of these genes in these tissues and during stress, thus signifying their putative involvement in biological functions other than cell wall structuring. This is supported by the reports on mutants of studied genes in Arabidopsis and other plants in which severe phenotypic defects have been observed.

In addition to being potential targets for biofuel traits, the lignocellulose biosynthesis genes have also been reported to play a vital role in abiotic stress responses. Chen et al. (2005) have shown that Arabidopsis CesA8 mutants accumulate increased levels of ABA, proline and sugars, and express higher levels of stress-related genes, and thus possess enhanced tolerance to drought and osmotic stress. Considering this, Guerriero et al. (2014) analyzed the expression of nine putative CesA genes in response to cold, heat and salt stress in Medicago sativa and identified a salt/heat-induced and a cold/heat-repressed group of genes, which suggests the putative involvement of cellulose synthases in conferring abiotic stress tolerance. Similar to CesA genes, Csl genes have also been shown to participate in the stress-responsive machinery. Characterization of the salt-overly sensitive6 allele of AtCslD5 has demonstrated a reactive oxygen species-based signaling mechanism in response to osmotic stress in Arabidopsis (Zhu et al., 2010). Similarly, accumulation of callose in response to environmental stimuli through overexpression of Gsl genes has been extensively studied (Nedukha, 2015). Stass and Horst (2009) have reported the production of abiotic stress-induced callose in all plants through a highly conserved signaling pathway. Lignification has also been reported to be induced during abiotic stresses (Moura et al., 2010). In view of these, expression profiling of candidate genes in response to dehydration, salinity and cold stress as well as ABA, SA, and MeJA treatments was performed, which showed significantly higher expression of SiGsl2 and SiGsl12 in all the stress conditions. A few genes, including SiCAD6, SiC4H2, SiPAL2, SiF5H2, Si4CL10, SiHCT1, and SiCCR7, were up-regulated at the early phase, the late phase, or both phases of the stresses.
Similarly, differential expression patterns were observed for all the genes during hormone treatments and of note, ABA treatment has no significant impact on the expression of the majority of genes.

Notably, the expression profiles of candidate lignocellulose biosynthesis genes correlated with the cis-regulatory elements present in the promoter regions of the respective genes. The genes that are up-regulated during dehydration and salinity stress, including SiGsl2, SiGsl12, SiPAL2, SiC4H2, Si4CL10, SiF5H2, SiHCT1, SiCAD1, and SiCAD6, have one or more of the "response to dehydration stress" cis-motifs ABRELATERD1, ACGTATERD1 and MYCATRD22 in their promoter regions (Vandepoele et al., 2009; Yan et al., 2014). Similarly, SiGsl2, SiGsl12 and Si4CL10, which showed higher expression under cold stress, have the CACGT motif, which was reported to be responsible for the response to cold stress (Vandepoele et al., 2009). In the case of hormonal treatments, the methyl jasmonate-responsive cis-element BOXLCOREDCPAL (Yan et al., 2014) was found in the promoter regions of SiCesA5, SiGsl2, SiGsl12, Si4CL10, SiPAL2, SiC4H2, and SiCCR7. These genes showed significant up-regulation at the early phase, the late phase, or both phases of methyl jasmonate treatment. Similarly, ABA-responsive genes such as SiC4H2, SiCCR7, SiGsl2, and SiCesA9 have both MYCCONSENSUSAT and MYCATRD22 cis-motifs, which have been reported to be MYC recognition sites in the promoter of the dehydration-responsive rd22 gene, which in turn is ABA-dependent (Yan et al., 2014), suggesting that these genes were activated in response to ABA. Thus the present study suggests that the interaction of cis-elements and transcription factors has resulted in differential gene expression through activation or repression of the respective genes in response to various environmental stresses and hormone treatments (Lee et al., 2002; Benitez et al., 2013). However, the potential correlation between the cis-elements and the response to a specific elicitor condition is indirect. It is possible that they are linked, but primary evidence for such a link is not provided here.
It is also not known if there were any changes to cell walls in the plants used for expression analyses. Altogether, the present investigation suggests the putative involvement of these genes in strengthening the cell wall for tolerance to abiotic stresses, and they could serve as potential candidates for further functional characterization.

### CONCLUSIONS

The present study identified genes belonging to the CesA/Csl, Gsl, PAL, C4H, 4CL, HCT, C3H, CCoAOMT, F5H, COMT, CCR, and CAD superfamilies in foxtail millet, and the genes were mapped onto nine chromosomes. In silico analyses of putative protein properties and gene structures revealed diverse characteristic features of these proteins, and gene duplication analysis showed that a few gene family members underwent tandem duplication. Phylogenetic analysis of the respective proteins demonstrated that, except for the CesA/Csl and Gsl superfamilies, the monolignol biosynthesis proteins are highly diverse. Promoter analysis showed the presence of various unique and common cis-regulatory elements upstream of lignocellulose biosynthesis genes, and potential miRNAs of foxtail millet were identified that target a few genes for post-transcriptional gene silencing. In addition, three types of molecular markers were found in lignocellulose biosynthesis genes, which could be used in genomics-assisted breeding. Comparative genome mapping of foxtail millet lignocellulose biosynthesis genes with the sequenced C<sub>4</sub> panicoid genomes revealed higher homology with switchgrass, followed by sorghum and maize. Evolutionary analysis showed that both paralogous and homologous gene pairs underwent intense positive purifying selection, and duplication occurred ∼25 mya, whereas divergence of foxtail millet and switchgrass occurred ∼4 mya. Similarly, divergence of foxtail millet from sorghum and maize was predicted to occur ∼27 mya. In silico expression analysis of all the identified genes in four tissues and a dehydration stress library of foxtail millet revealed their differential expression patterns, and also suggested a putative biological function of these genes in processes other than cell wall biosynthesis.
Expression profiling of candidate genes in response to dehydration, salinity, and cold stress, along with ABA, SA, and MeJA treatments, supported the differential expression of these genes, with significantly higher expression of the SiGsl12, SiHCT1, and SiCAD6 genes. The results suggested that these genes could be used as potential candidates for functional characterization for biofuel traits. Though similar studies have already been completed in switchgrass, sorghum, and maize, the present study, conducted in the biofuel model foxtail millet, would facilitate improving the crop for efficient biofuel production.

### AUTHOR CONTRIBUTIONS

MP conceived and designed the experiments. MM, YK, JJ, SS, CL performed the experiments. MM, CL, MP analyzed the results. MM, MP wrote the manuscript. MP approved the final version of the manuscript.

### FUNDING

Research on foxtail millet genomics at MP's laboratory is funded by the Core Grant of National Institute of Plant Genome Research, New Delhi, India.

### ACKNOWLEDGMENTS

MM acknowledges University Grants Commission, New Delhi, India for providing Research Fellowship. The authors thank Mr. Rohit Khandelwal for critically reading the manuscript.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2015.00965

Supplementary Figure S1 | Multiple sequence alignment of SiCesA proteins.

Supplementary Figure S2 | Multiple sequence alignment of SiCsl proteins.

Supplementary Figure S3 | Multiple sequence alignment of SiGsl proteins.

Supplementary Figure S4 | Prediction of transmembrane domains in the SiGsl proteins. Red line represents transmembrane, blue line represents inside and pink line represents outside orientation.

Supplementary Figure S5 | Multiple sequence alignment of monolignol biosynthesis proteins.

Supplementary Figure S6 | Gene structure of *SiCesA* genes.

Supplementary Figure S7 | Gene structure of *SiCsl* genes.

Supplementary Figure S8 | Gene structure of *SiGsl* genes.

Supplementary Figure S9 | Gene structure of monolignol biosynthesis genes.

Supplementary Table S1 | Details of primers used in qRT-PCR analysis.

Supplementary Table S2 | Details of *SiCesA/Csl* and *SiGsl* superfamily genes of foxtail millet.

Supplementary Table S3 | Details of various domains present in SiCesA proteins.

Supplementary Table S4 | Details of various domains present in SiCsl proteins.

Supplementary Table S5 | Details of various domains present in SiGsl proteins.

Supplementary Table S6 | Details of monolignol biosynthesis pathway genes of foxtail millet.

Supplementary Table S7 | Details of various domains present in monolignol biosynthesis pathway proteins.

Supplementary Table S8 | Summary of cis-regulatory elements present in *SiCesA/Csl* and *SiGsl* superfamily genes.

Supplementary Table S9 | Summary of cis-regulatory elements present in monolignol biosynthesis pathway genes.

Supplementary Table S10 | Details of foxtail millet miRNAs identified to target the transcripts of lignocellulose pathway genes.

Supplementary Table S11 | Summary of molecular markers present in lignocellulose pathway genes.

Supplementary Table S12 | The Ka/Ks ratios and estimated divergence time for homologous lignocellulose pathway proteins between *Setaria italica* and *Panicum virgatum*.

Supplementary Table S13 | The Ka/Ks ratios and estimated divergence time for homologous lignocellulose pathway proteins between *Setaria italica* and *Sorghum bicolor*.

Supplementary Table S14 | The Ka/Ks ratios and estimated divergence time for homologous lignocellulose pathway proteins between *Setaria italica* and *Zea mays*.

Supplementary Table S15 | The Ka/Ks ratios and estimated divergence time for tandemly duplicated lignocellulose pathway proteins.

### REFERENCES


cellulose synthase complexes in Arabidopsis. Proc. Natl. Acad. Sci. U.S.A. 104, 15566–15571. doi: 10.1073/pnas.0706592104


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Muthamilarasan, Khan, Jaishankar, Shweta, Lata and Prasad. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Proteomic Responses of Switchgrass and Prairie Cordgrass to Senescence

Bimal Paudel<sup>1‡</sup>, Aayudh Das<sup>1†‡</sup>, Michaellong Tran<sup>1</sup>, Arvid Boe<sup>2</sup>, Nathan A. Palmer<sup>3</sup>, Gautam Sarath<sup>3</sup>, Jose L. Gonzalez-Hernandez<sup>2</sup>, Paul J. Rushton<sup>4</sup> and Jai S. Rohila<sup>1,2\*</sup>

<sup>1</sup> Department of Biology and Microbiology, South Dakota State University, Brookings, SD, USA, <sup>2</sup> Department of Plant Science, South Dakota State University, Brookings, SD, USA, <sup>3</sup> Grain, Forage and Bioenergy Research Unit, United States Department of Agriculture - Agricultural Research Service, Lincoln, NE, USA, <sup>4</sup> 22nd Century Group Inc., Clarence, NY, USA

### Edited by:

Soren K. Rasmussen, University of Copenhagen, Denmark

#### Reviewed by:

Xiyin Wang, North China University of Science and Technology, China; Abu Hena Mostafa Kamal, University of Texas at Arlington, USA; Angela Mehta, Embrapa Recursos Genéticos e Biotecnologia, Brazil

\*Correspondence:

Jai S. Rohila jai.rohila@sdstate.edu

### †Present Address:

Aayudh Das, Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX, USA

‡These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science

Received: 23 November 2015; Accepted: 24 February 2016; Published: 14 March 2016

#### Citation:

Paudel B, Das A, Tran M, Boe A, Palmer NA, Sarath G, Gonzalez-Hernandez JL, Rushton PJ and Rohila JS (2016) Proteomic Responses of Switchgrass and Prairie Cordgrass to Senescence. Front. Plant Sci. 7:293. doi: 10.3389/fpls.2016.00293

Senescence in biofuel grasses is a critical issue because early senescence decreases potential biomass production by limiting aerial growth and development. Two-dimensional differential in-gel electrophoresis (2D-DIGE) followed by mass spectrometry of selected protein spots was used to evaluate differences between leaf proteomes of early-senescing (ES) and late-senescing (LS) genotypes of prairie cordgrass (ES/LS PCG) and switchgrass (ES/LS SG), just before and after senescence was initiated. Analysis of the manually filtered and statistically evaluated data indicated that 69 proteins were significantly differentially abundant across all comparisons, and the largest share (41%) were associated with photosynthetic processes as determined by gene ontology analysis. Ten proteins were found in common between PCG and SG, and nine and 18 proteins were unique to PCG and SG, respectively. Five of the 10 differentially abundant spots common to both species were increased in abundance, and five were decreased in abundance. Leaf proteomes of the LS genotypes of both grasses analyzed before senescence contained significantly higher abundances of a 14-3-3 like protein and a glutathione-S-transferase protein when compared to the ES genotypes, suggesting differential cellular metabolism in the LS vs. the ES genotypes. The higher abundance of 14-3-3 like proteins may be one factor that impacts the senescence process in both LS PCG and LS SG. Aconitate hydratase was found in greater abundance in all four genotypes after the onset of senescence, consistent with literature reports from genetic and transcriptomic studies. A Rab protein of the Ras family of G proteins and an S-adenosylmethionine synthase were more abundant in ES PCG when compared with the LS PCG. In contrast, several proteins associated with photosynthesis and carbon assimilation were detected in greater abundance in LS PCG when compared to ES PCG, suggesting that a loss of these proteins potentially contributed to the ES phenotype in PCG. Overall, this study provides important data that can be utilized toward delaying senescence in both PCG and SG, and sets a foundational base for future improvement of perennial grass germplasm for greater aerial biomass productivity.

Keywords: biofuel grasses, cordgrass, switchgrass, proteomics, senescence

## INTRODUCTION

Prairie cordgrass (PCG) and switchgrass (SG) are two warm-season C<sub>4</sub> grasses that are widely adapted to North American climatic conditions and have great potential as feedstock for the lignocellulosic-based biofuels industry (Lee and Boe, 2005; Sarath et al., 2008; Gonzalez-Hernandez et al., 2009). Senescence is a crucial plant development process that enables the remobilization of nutrients within a plant (Noodén et al., 1997). Leaf senescence is marked by the degradation of subcellular compartments, such as chloroplasts, and the remobilization of nutrients to other parts of the plants, such as seeds and underground rhizomes in the case of PCG and SG. The balance between oxidative stress and antioxidant activity plays a crucial role during senescence (Prochazkova et al., 2001). Senescence can be initiated by complex signals of age-specific factors in the genome and by temperature or day length in the case of seasonal senescence, and is often accompanied by increased ROS (reactive oxygen species) and oxidative stress (Jones et al., 2012; Palmer et al., 2015). Although ROS are generated continuously during normal growth, they are balanced by various antioxidant pathways, thus maintaining an optimal cellular redox state for growth. However, during stress-related and age-specific senescence, these antioxidant pathways cannot overcome oxidative stress, leading to senescence rather than normal growth. Mitochondria are among the major sources of ROS and ROS-related stress signals during programmed cell death (Zhao and Xu, 2000; Fleury et al., 2002).

The onset of early senescence, which is different from aging (Lim et al., 2003), is known to affect the amount of biomass accumulated by a perennial grass (Sarath et al., 2014). During senescence, leaves lose their photosynthetic capability sharply and, as a result, much less carbon is assimilated by the plant. Because leaves are the primary organ in plants that fixes carbon and directly contributes to plant biomass yields, untimely leaf senescence can cause losses in the total potential biomass yields of perennial grasses (Rinerson et al., 2015). Moreover, early senescing plants become more vulnerable to pathogen attacks, especially by fungal pathogens, further reducing the biomass of biofuel crops (Ahonsi et al., 2013). From an evolutionary point of view, senescence evolved to maximize the reutilization of nutrients that were accumulated by leaves during the growing season (Bleecker, 1998). Senescence in plants is a genetically programmed sequence of biochemical and physiological changes, but little is known about it at the protein level in bioenergy crops. Early and recent studies (Gan and Amasino, 1997; Biswal and Biswal, 1999; Palmer et al., 2014, 2015) suggested that senescence is highly correlated with the differential expression of senescence-associated genes (SAGs). By controlling these signature genes, which regulate the juvenile to adult phase transition in plants, it may be possible to modify or enhance the biomass properties of a wide range of bioenergy feedstocks (Chuck et al., 2011). Identifying and using new SAGs is therefore necessary in breeding programs for bioenergy feedstock crops.

A transcriptomic study of switchgrass flag leaf development from elongation through the onset of senescence was performed recently by Palmer et al. (2015). Many candidate genes were identified that were presumably involved in regulating the expression of senescence-related pathways. The authors found that during the onset of senescence, the decline in leaf chlorophyll content was associated with a significant upregulation of transcripts coding for enzymes involved in chlorophyll degradation and of a large number of SAGs. Moreover, genes involved in nitrogen and mineral utilization, such as those coding for ureide, ammonium, nitrate, and molybdenum transporters, shared expression profiles that were significantly co-regulated with the expression profiles of NAC transcription factors. Similarly, Rinerson et al. (2015) identified 240 WRKY genes in the switchgrass genome and studied their expression during flag leaf development. Twenty-eight of these WRKY genes were identified as possible targets for increasing biomass yields in SG by delaying senescence. During senescence, protein activation by post-translational modification may be an additional mechanism of regulation. Several studies have shown that mRNA abundance for some genes may not necessarily be a predictor of protein abundance (Wang et al., 2006; Carp and Gepstein, 2007).

Similar to transcriptomics, high-throughput proteomic studies are a powerful tool to analyze changes in protein accumulation levels and post-translational modifications (Liu et al., 2013; Robbins et al., 2013; Wang et al., 2014), but no detailed proteomic study has been conducted on leaf senescence in either switchgrass or cordgrass.

Here, the power of global proteomics was utilized with an aim to profile and identify SG and PCG proteins that were associated with the senescence process using two contrasting genotypes for each species. Differentially abundant proteins in leaves from an early-senescence (ES) genotype were compared with those in a late-senescence (LS) genotype before and after the onset of senescence. Our results provide insights into the molecular basis of the differential responses of the two economically important bioenergy feedstock crops. These results may be instrumental in the rational engineering of senescence in SG and PCG with longer growing periods for increased biomass production.

### MATERIALS AND METHODS

### Plant Materials and Treatments

Leaf tissues were collected in triplicate from field-established clones of two different genotypes of SG [Genotype # 5 (ES SG) and Genotype # 4 (LS SG)]. There was a 10-day difference between the two genotypes in the date of onset of anthesis (10 August for the ES SG genotype compared with 20 August for the LS SG genotype, on average, at Brookings, SD). These plants were selected from clonally maintained field nurseries established from random seedlings obtained from cultivar Sunburst; similarly, phenotypic selection from a larger PCG population was used to identify the ES PCG and LS PCG genotypes (Boe, unpublished). Similar to the phenological difference observed between the two switchgrass genotypes, there was about a 10-day difference in the onset of anthesis between the two PCG genotypes at the Brookings, SD location. Leaves were collected from three clonal replicates of all four genotypes at two different times, before senescence and after senescence, for a total of 24 samples. Samples were assigned to groups A through H based on source population and harvest time (**Table 1**). The timing of senescence, for the purpose of defining the before- and after-senescence harvests, was determined by measuring chlorophyll content (Palmer et al., 2015) using a hand-held chlorophyll meter (**Supplementary Figure 1**). After harvest, all leaf tissues were snap frozen in liquid nitrogen and stored at −80°C until further use.

### Sample Preparation, 2-D Differential In-Gel Electrophoresis (2D-DIGE), Gel Staining, Image Analysis, and Protein Identification by LC-MS/MS

The fold change in abundance of different proteins was analyzed in the form of ratios for eight different sets of comparisons based on group assignment (**Table 1**). Four are ratios of LS to ES PCG and SG: (i) the ratio of LS to ES PCG before senescence (G/E), (ii) the ratio of LS to ES PCG after senescence (H/F), (iii) the ratio for LS to ES SG before senescence (C/A), and (iv) the ratio of LS to ES SG after senescence (D/B). Similarly, the other four sets were as follows: (v) the ratio for after to before senescence in ES PCG (F/E), (vi) the ratio for after to before senescence in LS PCG (H/G), (vii) the ratio for after to before senescence in ES SG (B/A), and (viii) the ratio for after to before senescence in LS SG (D/C).
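The eight comparisons listed above can be enumerated programmatically. The following is a minimal sketch, assuming the group labels A through H map to species, genotype, and harvest time as implied by the ratios in the text (Table 1 itself is not reproduced here); the names `GROUPS`, `find_group`, and `build_comparisons` are illustrative only and are not part of the original analysis.

```python
# Group assignment inferred from the ratios given in the text
# (e.g., G/E is LS PCG before senescence over ES PCG before senescence).
# This mapping is an assumption reconstructed from the prose, not a copy of Table 1.
GROUPS = {
    "A": ("SG", "ES", "before"),
    "B": ("SG", "ES", "after"),
    "C": ("SG", "LS", "before"),
    "D": ("SG", "LS", "after"),
    "E": ("PCG", "ES", "before"),
    "F": ("PCG", "ES", "after"),
    "G": ("PCG", "LS", "before"),
    "H": ("PCG", "LS", "after"),
}


def find_group(species, genotype, time):
    """Return the group label for a species/genotype/harvest-time combination."""
    for label, key in GROUPS.items():
        if key == (species, genotype, time):
            return label
    raise KeyError((species, genotype, time))


def build_comparisons():
    """List the eight ratios as 'numerator/denominator' group labels."""
    comparisons = []
    # (i)-(iv): LS over ES within each species and harvest time
    for species in ("PCG", "SG"):
        for time in ("before", "after"):
            comparisons.append(
                f"{find_group(species, 'LS', time)}/{find_group(species, 'ES', time)}"
            )
    # (v)-(viii): after over before within each species and genotype
    for species in ("PCG", "SG"):
        for genotype in ("ES", "LS"):
            comparisons.append(
                f"{find_group(species, genotype, 'after')}/{find_group(species, genotype, 'before')}"
            )
    return comparisons


if __name__ == "__main__":
    for number, ratio in enumerate(build_comparisons(), start=1):
        print(number, ratio)
```

Laying the design out this way makes it easy to check that every ratio compares groups differing in exactly one factor (genotype or harvest time).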



Two genotypes of SG and two genotypes of PCG were selected with two treatments for each for sample collection. Samples were collected in triplicate.

One gram of leaf tissue was ground to powder under liquid nitrogen using a mortar and pestle. Three hundred microliters of 2-D cell lysis buffer (30 mM Tris-HCl, pH 8.8, containing 7 M urea, 2 M thiourea, and 4% CHAPS) was added to the ground tissue, which was then sonicated on ice (Hurkman and Tanaka, 1986). Next, the tubes were shaken for 30 min on a shaker at room temperature. Samples were then centrifuged for 30 min at 4°C at 25,000 × g, and the supernatant was collected. The protein concentration in the supernatant was measured using the Bio-Rad protein assay with BSA as a standard, following the manufacturer's guidelines (Peterson, 1983). Lysate samples were diluted with 2-D lysis buffer to a concentration of 5 mg/ml.
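The final dilution step is a simple C1·V1 = C2·V2 calculation. The sketch below illustrates it under stated assumptions: the function name `buffer_to_add_ul` and the example concentration and volume are hypothetical, and only the 5 mg/ml target comes from the text.

```python
def buffer_to_add_ul(stock_mg_per_ml, stock_volume_ul, target_mg_per_ml=5.0):
    """Volume of lysis buffer (in µl) to add so the lysate reaches the target concentration."""
    if stock_mg_per_ml < target_mg_per_ml:
        raise ValueError("lysate is already below the target concentration")
    # C1 * V1 = C2 * V2  ->  V2 = V1 * C1 / C2; buffer to add = V2 - V1
    final_volume_ul = stock_volume_ul * stock_mg_per_ml / target_mg_per_ml
    return final_volume_ul - stock_volume_ul


if __name__ == "__main__":
    # Hypothetical lysate: 100 µl measured at 12.5 mg/ml
    print(buffer_to_add_ul(12.5, 100.0))
```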

### Minimal Cy Dye Labeling

The 2D-DIGE was performed by Applied Biomics (Hayward, CA) following the protocol described in Robbins et al. (2013) and Das et al. (in press). Briefly, 1.0 µl of diluted Cy Dye (1:5 diluted with DMF from 1 nM/µl stock) was added to 30 µg of protein lysate, followed by brief vortexing. The tubes were kept in the dark on ice for 30 min, followed by the addition of 1.0 µl of 10 mM lysine to each sample and vortexing; the reaction was then kept in the dark on ice for an additional 15 min. A pooled protein sample was prepared containing equal amounts of all 24 samples and was labeled with Cy2 using the same protocol. The Cy2-labeled sample was used as an internal control to account for gel-to-gel variation. For each gel, the Cy2-, Cy3-, and Cy5-labeled samples were mixed with 2X 2-D sample buffer (8 M urea, 4% CHAPS, 20 mg/ml DTT, 2% pharmalytes, and a trace amount of bromophenol blue). Then, 100 µl of destreak solution and rehydration buffer (7 M urea, 2 M thiourea, 4% CHAPS, 20 mg/ml DTT, 1% pharmalytes, and a trace amount of bromophenol blue) were added to make a final volume of 350 µl for the 18 cm IPG strip (pH 3-10). This was mixed well and spun before loading the labeled samples into the strip holder.

### IEF, SDS-PAGE, Image Scan and Data Analysis

After loading the labeled samples into the IPG strip holder, 18 cm strips were put facedown, and 1.5 ml of mineral oil was added on top of the strips. Isoelectric focusing (IEF) was then carried out in the dark at 20°C following the manufacturer's protocol (Amersham BioSciences). After IEF, the IPG strips were incubated in freshly made equilibration buffer I (50 mM Tris-HCl, pH 8.8, containing 6 M urea, 30% glycerol, 2% SDS, a trace amount of bromophenol blue, and 10 mg/ml DTT) for 15 min with slow shaking. Then, the strips were rinsed in freshly prepared equilibration buffer II (50 mM Tris-HCl, pH 8.8, containing 6 M urea, 30% glycerol, 2% SDS, a trace amount of bromophenol blue, and 45 mg/ml iodoacetamide) for 10 min with slow shaking. The strips were then rinsed once in the SDS gel running buffer and transferred onto the SDS gel (12.5% acrylamide SDS gel prepared using low-fluorescence glass plates). They were then sealed with 0.5% (w/v) agarose solution (in SDS gel running buffer). The SDS gels were run at 15°C and stopped when the dye front ran out of the gels.


After SDS-PAGE, gel images were scanned using a Typhoon TRIO (GE Healthcare Bioscience, Pittsburgh, PA, USA), and the scanned images were analyzed using Image Quant software (version 6.0, GE Healthcare). An in-gel analysis of the images was conducted using DeCyder software, version 6.5 (GE Healthcare Bioscience), with its difference in-gel analysis (DIA) tool. The estimated number of spots was set to 3000. To keep experimental variation low, the automated DeCyder tool was used for background subtraction and normalization of visible protein spots. The DIA datasets, along with the images, were loaded into the DeCyder biological variation analysis (BVA) module. The spot intensity data were normalized against the internal standard sample that was labeled with Cy2 dye. Protein spots whose abundance increased or decreased by at least 1.5-fold with a Student's t-test value of p ≤ 0.05 were considered to be differentially abundant, with the additional condition that the spot be present and analyzed in all 2D gels. These ratios were calculated by the DeCyder image analysis software from spot volumes. As per the manufacturer's instructions, the spot ratios were calculated as follows: (volume of secondary image spot/volume of primary image spot). This ratio indicated the change in spot volume between the two images. These ratio values were normalized so that the modal peak of volume ratios was zero (since the majority of proteins are not up or down regulated). This ratio parameter is referred to as the volume ratio. In all DeCyder 2D Software DIA tables (**Supplementary Tables 1**, **2**), the volume ratio is expressed in the range of 1–1,000,000 for increases in spot volumes and −1 to −1,000,000 for decreases in spot volumes. Values between −1 and 1 are not represented; hence, a two-fold increase and a two-fold decrease are represented by 2 and −2, respectively (and not 2 and 0.5 as might have been expected) (GE Healthcare Bio-Sciences, Pittsburgh, USA).
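The signed volume-ratio convention and the selection thresholds described above can be illustrated with a short sketch. This is not DeCyder code: the function names are hypothetical, the input is assumed to be a plain secondary/primary volume ratio, and the additional requirement that a spot be present in all gels is not modeled.

```python
def signed_fold_change(ratio):
    """Convert a plain volume ratio (secondary/primary, > 0) to the signed convention.

    Ratios of 1 or more are kept as-is; ratios below 1 are reported as the
    negative reciprocal, so values strictly between -1 and 1 never occur.
    """
    if ratio <= 0:
        raise ValueError("volume ratio must be positive")
    return ratio if ratio >= 1.0 else -1.0 / ratio


def is_differentially_abundant(ratio, p_value, fold_cutoff=1.5, alpha=0.05):
    """Apply the fold-change and t-test thresholds described in the text."""
    return abs(signed_fold_change(ratio)) >= fold_cutoff and p_value <= alpha


if __name__ == "__main__":
    # A two-fold decrease is reported as -2 rather than 0.5
    print(signed_fold_change(2.0), signed_fold_change(0.5))
    print(is_differentially_abundant(0.5, 0.03))
```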

### Spot Picking and Trypsin Digestion and Mass Spectrometry

Protein spots that were statistically significant (see Section IEF, SDS-PAGE, Image Scan and Data Analysis), with p ≤ 0.1 and a cut-off value of 1.5-fold, were picked using the Ettan Spot Picker (GE Healthcare). The picked gel spots were washed a few times, digested with modified porcine trypsin protease (Trypsin Gold, Promega), and then desalted using Zip-tip C18 (Millipore, Billerica, MA, USA) (Robbins et al., 2013; Gupta et al., 2014; Hayashi et al., 2015; Das et al., in press). Peptides were eluted from the Zip-tip with 0.5 µl of matrix solution (α-cyano-4-hydroxycinnamic acid, 5 mg/ml in 50% acetonitrile, 0.1% trifluoroacetic acid, and 25 mM ammonium bicarbonate) and spotted onto a MALDI plate (Robbins et al., 2013; Hayashi et al., 2015; Das et al., in press).

MALDI-TOF (MS) and TOF/TOF (tandem MS/MS) were performed on a 5800 mass spectrometer (AB Sciex). MALDI-TOF mass spectra were acquired in reflectron positive ion mode, averaging 2000 laser shots per spectrum. TOF/TOF tandem MS fragmentation spectra were acquired for each sample, averaging 2000 laser shots per fragmentation spectrum on each of the 10 most abundant ions present in each sample (excluding trypsin autolytic peptides and other known background ions) (Gupta et al., 2014; Hayashi et al., 2015; Das et al., in press).

### Database Search

Analysis of the MS/MS results was performed using GPS Explorer, version 3.5, which was equipped with a MASCOT search engine (Matrix Science). A search of the National Center for Biotechnology Information non-redundant (NCBInr) database and of Phytozome v.10.3 (http://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org\_Pvirgatum) was performed without constraining the protein molecular weight or isoelectric point, with variable carbamidomethylation of cysteine, oxidation of methionine residues, and one missed cleavage allowed in the search parameters (**Table 2** along with **Supplementary Tables 5**, **6**). Candidates with either a protein score C.I.% (Confidence Interval) or an ion C.I.% greater than 95 were considered significant.
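The acceptance rule for identifications can be written as a one-line predicate. The sketch below is illustrative only: the `Identification` record, its field names, and the example values are assumptions and do not reflect the GPS Explorer export format.

```python
from dataclasses import dataclass


@dataclass
class Identification:
    """Hypothetical record for one spot's best database hit."""
    spot: int
    protein_score_ci: float  # protein score C.I.%
    ion_score_ci: float      # total ion score C.I.%


def is_significant(hit, threshold=95.0):
    """Accept a hit if either C.I.% value exceeds the threshold."""
    return hit.protein_score_ci > threshold or hit.ion_score_ci > threshold


if __name__ == "__main__":
    # Made-up values for illustration only
    hits = [
        Identification(spot=1, protein_score_ci=99.8, ion_score_ci=80.0),
        Identification(spot=2, protein_score_ci=60.0, ion_score_ci=97.5),
        Identification(spot=3, protein_score_ci=90.0, ion_score_ci=95.0),
    ]
    print([hit.spot for hit in hits if is_significant(hit)])
```

Note that a value of exactly 95 is rejected, because the text specifies "greater than 95".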

### Bioinformatic Analyses

The best hit proteins from MS/MS were searched by BLAST against the NCBI (http://www.ncbi.nlm.nih.gov/) and Phytozome (Goodstein et al., 2012) databases. NCBI, the Arabidopsis Information Resource (http://arabidopsis.org) (Rhee et al., 2003), and Uniprot (http://uniprot.org) (The UniProt Consortium, 2008) were used to retrieve further information on protein functions. The Kyoto Encyclopedia of Genes and Genomes website (KEGG; http://www.genome.jp/kegg/) (Kanehisa and Goto, 2000) was used to retrieve information on the involvement of proteins in metabolic pathways. The STRING database, version 9.1 (http://string-db.org/) (Szklarczyk et al., 2011), was used to predict protein-protein interactions. We retrieved primary interaction data from the STRING database and portrayed the interactions in Cytoscape 3.1.1 software, along with a categorization of their functions (Shannon et al., 2003).

### RESULTS AND DISCUSSION

### Genotypes of SG and PCG with Contrasting Senescence Phenotypes

The two contrasting genotypes of switchgrass selected from "Sunburst" differed in morphological and phenological traits. The LS genotype headed approximately 10 days later than the ES genotype and reached full senescence 2–3 weeks later in autumn. The LS genotype was more disease resistant, taller, and produced higher average biomass yields (12.2 Mg dry matter ha<sup>−1</sup> compared with 8.5 Mg dry matter ha<sup>−1</sup>) than the ES genotype.

Similar differences were found between the LS PCG (origin in southeastern South Dakota) and the ES PCG (origin in southeastern North Dakota). The LS PCG headed approximately 2 weeks later and reached full senescence approximately 2–3 weeks later in autumn than the ES PCG plants. The LS PCG genotype was taller (2.5 m) than the ES PCG plants (1.25 m) at its maximum height during anthesis. At peak standing in mid-summer, ES PCG plants produced an average of 7.1 Mg dry matter ha<sup>−1</sup> compared with 15.4 Mg dry matter ha<sup>−1</sup> for the LS PCG plants.

#### TABLE 2 | Proteins identified by 2D-DIGE followed by Mass Spectrometry.




\*Different spot number corresponds to the same protein.

\*\*pI value 0 corresponds to the proteins with isoelectric point either below 3.0 or above 10.0 or it's mixed up with a nearby spot, thus pI value is unspecific.

#Largest protein based on molecular weight; \$Smallest protein based on MW.

### Proteomic Response during Pre- and Post-Senescence in PCG and SG

A representative 2D-DIGE gel image of sample A1/C7 is shown in **Figure 1**. 2D-DIGE along with MS/MS revealed the differential abundance of 74 protein spots (**Figure 1**). Among those, 69 different protein spots were selected based on statistical analyses (threshold of significance of p ≤ 0.1) and a cut off value of 1.5 fold increase/decrease in protein abundance, which is shown in the heat map (**Figure 2**).

To achieve an improved understanding of the proteomic responses due to senescence, differentially abundant proteins that were identified in both PCG and SG were categorized according to their GO annotation function (**Figure 3**; Camon et al., 2003). Our analysis indicated that the 69 differentially abundant proteins were involved in various biological processes, including photosynthesis (41%), amino acid metabolism (13%), carbohydrate metabolism (12%), ATP synthesis (6%), protein metabolism (5%), kinase activity (4%), ATP signal transduction (3%), cell division (3%), sterol metabolism (1%), auxin binding (1%), and GTPase signal transduction (1%).

When we analyzed the fold change in protein abundance for after-to-before senescence in LS and ES PCG, nine unique protein spots were found to be differentially abundant that were absent in SG (**Supplementary Table 1**), whereas ten protein spots were found to have common abundance patterns between SG and PCG (**Figure 4**). Among those 19 protein spots of interest, 9 had higher abundance, and 10 had lower abundance. Our proteomic profiling of both ES and LS SG genotypes revealed that 28 major protein spots were differentially abundant (**Supplementary Table 2**). Among these 28 protein spots, 18 proteins were found exclusively in SG, whereas the other 10 protein spots were found in both SG and PCG (**Figure 3**). Overall, 10 protein spots had increased and 18 protein spots had decreased abundance in SG.

To investigate the changes in the protein profiles before and after the onset of senescence (four different comparisons: B/A, D/C, F/E, and H/G) in PCG and SG (**Table 1**), we identified 10 major differentially abundant protein spots based on statistical significance (p ≤ 0.05) (seven different proteins: putative aconitate hydratase, ribulose bisphosphate carboxylase large chain, oxygen-evolving enhancer protein 1, glutathione S-transferase, hypothetical protein ZEAMMB73, and hypothetical protein VITISV) (**Figure 2**). Five protein spots were found to have higher abundance and five spots had lower abundance, most likely in response to senescence processes.

### Biological Implications of Selected Proteins

### Role of Putative Aconitate Hydratase during Senescence

Putative aconitate hydratase was found to have increased abundance during senescence in all four comparisons (**Figure 5**). Aconitate hydratase catalyzes the isomerization of citrate to isocitrate in the TCA and glyoxylate cycles (Evans et al., 1966). Sugar metabolism toward gluconeogenesis, hexose formation, and conversion to sucrose seems to be an important phenomenon to signal the source-sink translocation in perennial grasses (**Supplementary Figure 2**). Aconitate hydratase in Arabidopsis is also reported to bind the mRNA of CSD2 (CuZn superoxide dismutase 2), which was found to play a key role in antioxidant defense mechanisms (Gregersen and Holm, 2007; Moeder et al., 2007). Moreover, reduced aconitate hydratase levels in cells reportedly enhanced resistance to oxidative stress. It was also reported that upregulation of the cytoplasmic aconitate hydratase gene in wheat flag leaf plays a crucial role in the mitochondria with respect to oxidative cell damage (Gregersen and Holm, 2007). Therefore, it is possible that aconitate hydratase in both SG and PCG helps to mediate oxidative stress and regulate plant growth or senescence by influencing the expression of genes for redox balance (Moeder et al., 2007). Two genes encoding aconitate hydratases were significantly upregulated in senescing SG flag leaves (Palmer et al., 2015), corroborating the findings at the protein level.

### Role of Glutathione S-Transferase in Anti-Oxidation and Its Early Senescing and Late Senescing Response in Prairie Cordgrass

It has been reported that GST is a marker gene for oxidative stress and that it participates in cellular protection when plants are subjected to wounding, pathogen attack, and lipid peroxidation (Bilang and Sturm, 1995; Grant et al., 2000). Consistent with previous findings, our proteomic analysis also showed significantly higher levels (five-fold) of a GST protein when its fold change was analyzed before and after senescence in the ES SG genotype (**Figure 5**).

Our results also showed that a GST was significantly more abundant in LS PCG than in ES PCG before senescence (**Figure 6**). However, after senescence, the levels were higher in ES PCG than in LS PCG (**Supplementary Table 1**), similar to what was observed for SG. This observation suggests that consistently higher GST levels in PCG before the onset of senescence may play a key role in delaying senescence signals by potentially delaying ROS-mediated signaling. On the other hand, we found increased levels of this GST during senescence when we compared GST levels from after-to-before senescence in four different sets of comparisons (B/A, D/C, F/E, and H/G). These results imply that the levels of this specific GST increased during senescence. ROS are continuously generated in plants in different organelles, such as mitochondria, chloroplasts, lysosomes, and peroxisomes, but are eliminated by different mechanisms to keep a balance between generation and removal. GST plays a crucial role in the removal of reactive oxygen species. Major pathways involving GST retrieved from KEGG indicate that GST seems to be a crucial enzyme in balancing cellular oxidative stress.

### Proteins Involved in the Regulation of Photosynthesis During Pre- and Post-Senescence

It is known that photosynthetic activity declines rapidly during leaf senescence (Kura-Hotta et al., 1987). Previous studies in rice showed that RuBisCO is highly active during leaf expansion but drastically reduced during senescence (Suzuki et al., 2001). Based on our 2-D DIGE analysis in SG and PCG, we also found significantly reduced levels of RuBisCO large subunit when the fold changes for this protein were compared before and after senescence in both PCG and SG genotypes (**Figure 5**). RuBisCO is the major enzyme responsible for assimilating carbon dioxide (Spreitzer and Salvucci, 2002). Reduced RuBisCO content can be expected to reduce CO<sub>2</sub> fixation and net photosynthesis. It has been reported that RuBisCO is the prime target for degradation as a consequence of increased levels of ROS in chloroplasts (Khanna-Chopra, 2012).

Furthermore, 2D-DIGE analysis in the current investigation revealed that other proteins involved in photosynthesis, such as oxygen evolving enhancer (OEC) protein in photosystem II, ferredoxin-NADP-reductase, and the α- and β-subunits of ATP synthase, had higher abundance in the LS PCG compared with the ES PCG genotype, particularly when protein fold changes were compared before senescence. All of these proteins are involved in electron transfer or in ATP synthesis in chloroplasts (Haehnel, 1984). Consistently higher abundance of these proteins in the LS PCG compared with the ES PCG before senescence suggests that an overall upregulated machinery of photosynthesis in plants may contribute to delaying the senescence process (**Supplementary Figure 3**). In line with our findings in PCG, downregulation of OEC has also been found to decrease photosynthesis in mutant Arabidopsis plants where one of the psbO genes was defective (Hager et al., 2002; Ifuku et al., 2005).

Higher abundance of the enzymes involved in photosynthetic electron transfer and ATP synthesis, along with the increased abundance of enzymes involved in the Calvin cycle, seems to contribute to the accumulation of higher biomass in LS PCG compared with ES PCG. However, consistently increased abundance of the photosynthetic machinery itself may have contributed to delaying senescence in the LS genotypes. Photosystem II and its corresponding OEC proteins are tightly regulated during senescence (Lim et al., 2007). Downregulation of the OEC gene is known to reduce photosynthetic activity in mutant Arabidopsis plants, indicating that OEC protein plays a vital role in photosynthesis by oxidizing water in photosystem II on thylakoid membranes (Lundin et al., 2007; Dwyer et al., 2012). Interestingly, our proteomic evaluation showed that the levels of OEC proteins were reduced after the onset of senescence compared with before senescence in PCG leaves, and similar results were observed in SG as well (**Figure 5**). Although why OEC is downregulated during senescence remains unknown, it is clear that downregulation of OEC leads to photo-damage in photosystem II, with the malfunctioning of acceptor and donor electron carriers (Allahverdiyeva et al., 2009). This photodamage may play a role in producing more ROS, thus destroying chlorophylls and RuBisCO large subunit and signaling the initiation of senescence. Interestingly, higher abundance of ferredoxin-NADP-reductase and the chloroplastic α- and β-subunits of ATP synthase in the LS PCG compared with the ES PCG also supports the idea that increased abundance of photosynthesis-related proteins may contribute to delaying the senescence process in this plant.

### Role of 14-3-3 Like Protein, S-adenosylmethionine Synthase, and Ras-Related RAB Protein in Senescence Signaling

The 14-3-3 like proteins are well studied and are involved in apoptotic and cell survival signaling in Arabidopsis (van Hemert et al., 2001). Similarly, S-adenosylmethionine (SAM) synthase generates SAM, which is involved in the ethylene synthesis pathway and in many methylation-dependent pathways. The involvement of RAB protein in vesicular transport mechanisms of cells is well documented (Grbic and Bleecker, 1995; Fu et al., 2000; Woollard and Moore, 2008). We found differential abundance of these proteins in LS PCG relative to ES PCG (G/E) before senescence (**Figure 6**).

We observed that a 14-3-3 like protein was highly abundant in LS PCG compared with ES PCG before senescence, whereas SAM synthase and RAB protein had lower abundance. Earlier studies showed that a 14-3-3 protein antagonizes pro-apoptotic activity by playing a key role in integrating the signals for cell death and survival (Fu et al., 2000). It has been reported that the overexpression of the Arabidopsis 14-3-3 gene GF14λ in cotton plants resulted in maintaining green plants and significantly delaying senescence and improving drought stress tolerance (Yan et al., 2004; Gregersen et al., 2013). Similarly, the overexpression of a 14-3-3 gene in potato plants delayed senescence and elevated antioxidant activity, whereas the downregulation of this gene led to early senescence (Wilczyñski et al., 1998; Łukaszewicz et al., 2002). Thus, the overexpression of genes encoding 14-3-3 proteins with appropriate promoters in PCG may be a good strategy to delay senescence. Studies on RAB genes related to senescence are very limited, but their overexpression is known to accelerate leaf senescence in mutant Arabidopsis plants (Kwon et al., 2009). Consistently lower abundance of RAB proteins in the LS PCG compared with the ES PCG also highlights the probable role of RAB genes in the senescence processes of this plant.

### Higher Abundance of the Calvin Cycle and Pentose Phosphate Pathway Derived Proteins in Late Senescing PCG

Transketolase, sedoheptulose-1,7-bisphosphatase (SBPase), and fructose-bisphosphate aldolase, all of chloroplastic origin, had increased abundance in LS PCG compared with ES PCG when the fold changes were evaluated before senescence. All three enzymes are part of the Calvin cycle and play a role in important rate limiting steps (**Supplementary Figure 4**).

As one of the crucial pentose phosphate pathway (PPP) enzymes, transketolase catalyzes the reversible reaction in the formation of ribulose-5-phosphate, and this reaction increases the regeneration rate of RuBisCO (Jones et al., 2012). Transketolase also facilitates isoprenoids synthesis and ultimately reduces oxidative stress (Bouvier et al., 1998, 2005; Lichtenthaler, 1999; Loreto et al., 2001; Vickers et al., 2009). Therefore, the regulation of the PPP and the Calvin cycle seems to be a factor during senescence in plants. Increased abundance of transketolase in LS PCG compared with ES PCG potentially signifies its role in PPP signaling required for cell survival or senescence.

Similarly, SBPase, which produces sedoheptulose-7-phosphate during the Calvin cycle, speeds up the regeneration of RuBisCO with RuBP, thereby speeding up carbon fixation (Lefebvre et al., 2005). An SBPase loss-of-function mutant in Arabidopsis, sbp, showed severe growth retardation and poor SBPase-dependent carbon assimilation and starch biosynthesis (Moore et al., 2003). Moreover, overexpression of SBPase in tobacco plants resulted in enhanced photosynthesis and growth rates in the early stage of development (Lefebvre et al., 2005; Feng et al., 2007; Liu et al., 2012), and increased growth rate and accumulation of biomass (Tamoi et al., 2006). Similarly, the higher abundance of SBPase and fructose-bisphosphate aldolase in LS PCG also corroborates the importance of the Calvin cycle during senescence. The overexpression of SBPase, with suitable promoters in tobacco and rice, has been reported to enhance photosynthetic rates and growth rates, whereas the downregulation of this gene resulted in retarded growth (Lefebvre et al., 2005; Tamoi et al., 2006; Feng et al., 2007; Liu et al., 2012). Therefore, the overexpression of the genes responsible for this enzyme may be a good strategy for increasing biomass in biofuel crops, such as PCG and SG. Data from the study of natural senescence of switchgrass flag leaves would support a positive role for SBPase in delaying senescence. SBPase transcript levels decreased significantly in flag leaves with the onset of senescence in field-grown switchgrass plants (Palmer et al., 2015).

Another potential Calvin cycle enzyme, fructose bisphosphate aldolase, regulates photosynthetic carbon flux. Importantly, downregulation of fructose bisphosphate aldolase causes reduced photosynthesis, altered levels of sugar and starch, and retarded growth in potato plants (Haake et al., 1998). Studies on young Brassica napus leaves showed that fructose bisphosphate aldolase expression is lower in mature green leaves but is then enhanced again during senescence (Buchanan-Wollaston, 1997). Consistent with earlier reports, our proteomic profiling shows increased abundance of fructose-bisphosphate aldolase in LS PCG, suggesting the significance of Calvin cycle activities in this late senescing genotype. It has also been reported that the overexpression of the Arabidopsis plastid fructose bisphosphate aldolase in tobacco plants resulted in augmented biomass by up to 2.2-fold (Uematsu et al., 2012). Producing plants overexpressing the genes encoding this enzyme may be another strategy to produce more biomass in biofuel crops. Biomass accumulation in LS PCG is higher compared with ES PCG (Boe, unpublished). Thus, the accumulation of higher amounts of biomass in LS PCG seems to be the result of upregulation of the Calvin cycle, increased photosynthesis, and increased carbon assimilation.

In C<sub>4</sub> plants, such as SG and PCG, CO<sub>2</sub> is generally not the rate-limiting factor as the oxygenase activity of RuBisCO is mostly limited due to high internal CO<sub>2</sub> concentrations in these plants. As a result, the regeneration rate of RuBisCO plays a critical role in enhancing the photosynthesis rate and increasing biomass. Increased abundance of Calvin cycle enzymes (transketolase, sedoheptulose-1,7-bisphosphatase, and fructose-bisphosphate aldolase) along with increased abundance of photosynthesis machinery (oxygen evolving enhancer proteins, ferredoxin-NADP-reductase, and the α- and β-subunits of ATP synthase) likely increases the regeneration rate of RuBisCO and the overall carbon assimilation in the LS PCG genotype.

### Oxidative Phosphorylation and Senescence

Two key proteins of oxidative phosphorylation, succinate dehydrogenase (Complex II) and cytochrome c oxidase (Complex IV), were found in differential abundance during senescence. Complex IV was found to be more abundant in ES PCG compared with LS PCG before senescence, whereas complex II had increased abundance during senescence in PCG when the protein fold change was analyzed for "after senescence" to "before senescence" (**Supplementary Figure 5**). Oxidative phosphorylation is well known to generate ROS in both animal and plant cells (Fleury et al., 2002; Turrens, 2003).

Consistently lower abundance of complex IV of the oxidative phosphorylation pathway (i.e., cytochrome c oxidase) in LS PCG compared with ES PCG, along with higher abundance of complex II (i.e., succinate dehydrogenase) during senescence in PCG, suggests the probable role played by oxidative phosphorylation pathways in the senescence mechanism of PCG. Upregulation of oxidative phosphorylation may contribute to the generation of ROS, thereby signaling for senescence; thus, its downregulation may play a role in delaying the senescence process, as observed in LS PCG. We also found that during senescence, there was higher abundance of proteins involved in fatty acid breakdown, proteolysis, starch conversion, and gluconeogenesis. Upregulation of fatty acid breakdown also mediates mitochondrial dysfunction and ROS generation, ultimately increasing oxidative stress in plants. Upregulation of genes involved in fatty acid breakdown, proteolysis, starch conversion, and gluconeogenesis also results in hexose accumulation, which signals for developmental senescence (Dai et al., 1999). Similarly, upregulation of sucrose phosphate synthase during senescence may have played a role in sucrose formation from hexoses and translocation to the sinks, which are rhizomes in the case of these perennial grasses (Lemoine et al., 2013). Many of these changes at the protein level mirror data observed at the transcript level in senescing switchgrass flag leaves (Palmer et al., 2015).

### Prediction of Protein-Protein Interactions

To better understand cellular and molecular functions, it is crucial to address various functional interactions between proteins. Using computational methods (see Section Bioinformatic Analyses), we tried to model a variety of functional protein-protein interactions between the differentially abundant proteins during senescence in PCG and SG to elucidate the complex underlying regulatory processes. From the database search, we found 39 homologous proteins in Arabidopsis (**Supplementary Table 3**) and analyzed their interactions (**Supplementary Table 4**). Our computational analysis of protein-protein interaction networks revealed two major proteins (PETC and SBPase) that have more than eight interactions considered to be at the central body of the network system (**Figure 7**). Most of the proteins central to this network were related to photosynthesis, whereas other connected proteins were involved in metabolism, ATP synthesis, and cell signaling (**Supplementary Table 4**). The majority of the proteins involved in photosynthesis were more abundant in LS PCG compared with ES PCG before senescence (**Figure 8**). This suggests that the upregulation of the photosynthesis machinery is a key factor in delaying senescence. Senescence mediated differential expression of genes encoding these central body proteins of the network may ultimately affect the specific pathways of related predicted partner proteins during growth.
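The idea of "central" proteins with more than eight interactions can be illustrated by counting distinct interaction partners per protein in an edge list. The following is a minimal sketch and not the tool used in this study; the function names, the `more_than` parameter, and the edge list are hypothetical placeholders rather than the interactions reported in **Supplementary Table 4**.

```python
from collections import defaultdict


def degree_table(edges):
    """Count distinct interaction partners for each protein.

    Edges are treated as undirected; duplicate pairs and self-loops
    do not inflate the count.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {protein: len(partners) for protein, partners in neighbors.items()}


def hubs(edges, more_than=8):
    """Return proteins with more than `more_than` distinct partners."""
    degrees = degree_table(edges)
    return sorted(p for p, d in degrees.items() if d > more_than)


# Hypothetical edge list: one protein with nine partners, plus a
# duplicated pair that should be counted only once.
edges = [("HUB_A", f"P{i}") for i in range(9)] + [("P0", "P1"), ("P1", "P0")]

print(hubs(edges))  # ['HUB_A']
```

Using sets for the partner lists keeps the count robust when an interaction database reports the same pair in both directions.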

## CONCLUSION

This study identified 69 statistically significant, differentially abundant proteins in PCG and SG genotypes contrasting in senescence, namely ES (early senescence) and LS (late senescence), using two-dimensional differential in-gel electrophoresis (2D-DIGE) followed by mass spectrometry of leaf samples collected just before and after onset of senescence. The goal was to begin a catalog of leaf proteins that could be indicators of either phenotype as a starting point for more proteomic and physiological studies to understand mechanisms impacting senescence in these two biofuel grasses. LS genotypes of both grasses analyzed before senescence contained significantly higher abundances of a 14-3-3 like protein and a glutathione S-transferase protein when compared to the ES genotypes. The higher abundance of 14-3-3 like proteins may be one factor that impacts the senescence process in both LS PCG and LS SG and provides a target at the protein and genetic level to evaluate in field nurseries containing PCG and SG germplasm. Species-specific differences in proteins between the ES and LS genotypes were also observed, indicating that subtle variations in the accumulation of specific proteins could influence longevity. As an example, the maintenance of proteins associated with the photosynthetic machinery in LS PCG compared to ES PCG suggests that factors that control proteolysis within plastids could be important in delaying senescence and boosting biomass yields, especially toward the end of the growing season. Overall, this study opens the door toward a fuller understanding of various differentially abundant senescence-related proteins and should prove valuable to the future improvement of perennial grasses for higher biomass yields (Kim et al., 2014; Das et al., 2015). Application of such gel-based quantitative proteomic approaches together with mapping of post-translational modifications will provide comprehensive insights into the regulation of various senescence-related proteins, corresponding to their biological function (Komatsu et al., 2013).

## AUTHOR CONTRIBUTIONS

AB, GS, JG, PR, and JR designed the experiments. BP, AD, and MT performed the experiments. BP, AD, GS, JG, PR, NP, and JR analyzed the results. BP, AD, AB, NP, GS, JG, and JR wrote the manuscript.

## FUNDING

This research was supported by funding from the North Central Regional Sun Grant Center at South Dakota State University through a grant provided by the US Department of Agriculture under award number 2010-38502-21861. This work was supported in part by the USDA-ARS CRIS project 3042-21000-030-00D and CRIS project SD00H541-15. The U.S. Department of Agriculture, Agricultural Research Service, is an equal opportunity/affirmative action employer and all agency services are available without discrimination. Mention of commercial products and organizations in this manuscript is solely to provide specific information. It does not constitute endorsement by USDA-ARS over other products and organizations not mentioned.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016.00293

Supplementary Figure 1 | Pre- and post-senescence periods were determined by measuring the chlorophyll content. Red arrow indicates the mid line between pre- and post-senescence period that was determined on the basis of chlorophyll measurement. Sample interval indicates number of weeks and CCI indicates relative amount of chlorophyll content in that particular sample.

Supplementary Figure 2 | Proposed model of sugar metabolism toward hexoses and sucrose formation during senescence which signals for early floral development, translocation of sugars from source to sink, and early senescence. Analysis revealed higher abundance of five proteins shown above during senescence, whereas three proteins (β-ketoadipyl-CoA thiolase, cysteine protease, and sucrose phosphate synthase) had consistently lower abundance in LS PCG compared to ES PCG when analyzed before senescence onset.

Supplementary Figure 3 | Three proteins (oxygen evolving enhancer protein of photosystem II, ferredoxin-NADP-reductase, and α- and β-subunit of ATP synthase) found with consistently higher abundance in LS PCG compared to ES PCG that are involved in key steps of photosynthesis. Figure of photosynthesis reproduced from KEGG reference pathway (http://www.genome.jp/kegg/pathway/map/map00195.html).

Supplementary Figure 4 | Proteins of the Calvin cycle found in greater abundance in LS PCG compared to ES PCG that catalyze the vital rate-limiting steps are marked with red arrows. These enzymes are 1- fructose bisphosphate aldolase, 2- transketolase, 3- sedoheptulose-1,7-bisphosphatase.

Supplementary Figure 5 | Increased abundance of complex II and complex IV of the oxidative phosphorylation pathway as observed in our study. Higher abundance of these complexes may contribute to increased ROS levels, thus signaling the expression of senescence-associated genes. Consistently higher abundance of complex IV was observed in ES PCG compared with LS PCG. Figure of oxidative phosphorylation reproduced from KEGG pathway (http://www.genome.jp/kegg/pathway/map/map00190.html).

Supplementary Table 1 | Fold change in differential abundance of protein spots for after to before senescence, in Early and Late PCG. F/E represents ratio for after to before senescence in Early PCG, whereas H/G represents that ratio for late PCG. Positive values represent increased abundance; negative values represent decreased abundance. All ratios included are statistically significant (p < 0.01).

Supplementary Table 2 | Fold change in differential abundance of protein spots for after to before senescence, in Early and Late SG. B/A represents ratio for after to before senescence in Early SG, whereas D/C represents that ratio for late SG. Positive values represent increased abundance; negative values represent decreased abundance. All ratios included are statistically significant (p < 0.01). Match quality H means high confidence, L means low confidence and N means no confidence. ∗NA, not available.

Supplementary Table 3 | Proteins, and their Arabidopsis homologs and percentage similarity.

Supplementary Table 4 | Protein function in protein-protein interaction table.

Supplementary Table 5 | Protein identifications with NCBInr database search.

Supplementary Table 6 | Protein identifications with Phytozome v10.3 database search.

### REFERENCES

in oxidative stress, carbon assimilation, and multiple aspects of growth and development in Arabidopsis. Mol. Plant 5, 1082–1099. doi: 10.1093/mp/sss012


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Paudel, Das, Tran, Boe, Palmer, Sarath, Gonzalez-Hernandez, Rushton and Rohila. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.