# NATURAL DIVERSITY IN THE NEW MILLENNIUM

EDITED BY: Joanna M. Cross, Chiarina Darrah, Nnadozie Oraguzie, Nourollah Ahmadi and Aleksandra Skirycz PUBLISHED IN: Frontiers in Plant Science

#### *Frontiers Copyright Statement*

*© Copyright 2007-2016 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

> *All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88919-952-5 DOI 10.3389/978-2-88919-952-5


# **NATURAL DIVERSITY IN THE NEW MILLENNIUM**

Topic Editors:

**Joanna M. Cross,** Inonu University, Turkey

**Chiarina Darrah,** Eunomia Research and Consulting, UK

**Nnadozie Oraguzie,** Washington State University, Prosser, USA

**Nourollah Ahmadi,** Centre de coopération internationale en recherche agronomique pour le développement, France

**Aleksandra Skirycz,** Instituto Tecnológico Vale Desenvolvimento Sustentável/Vale Institute of Technology Sustainable Development, Belem, Brazil

Electron microscopy of a transversal cut of the stem of an Arabidopsis Columbia plant. Image by Joanna M. Cross

Natural diversity has been extensively used to understand plant biology and improve crops. However, studies were commonly based on visual phenotypes or on a few measurable parameters. Nowadays, a large number of parameters can be measured thanks to next generation sequencing, metabolomics, proteomics, and transcriptomics, thus providing unprecedented resolution in the detection of natural diversity. This enhanced resolution offers new possibilities for understanding plant biology. Technological advances also contribute to a better assessment of the biodiversity loss currently taking place. Hence, the topic presents an overview of efforts to maintain biological diversity in crops and of the possibilities offered by recent technologies in the assessment of natural variation, and ends with examples of the diversity found even at the cellular level.

**Citation:** Cross, J. M., Darrah, C., Oraguzie, N., Ahmadi, N., Skirycz, A., eds. (2016). Natural Diversity in the New Millennium. Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-952-5

# Table of Contents

*Editorial: Natural diversity in the new millennium* Joanna M. Cross, Chiarina Darrah, Nnadozie Oraguzie, Nourollah Ahmadi and Aleksandra Skirycz

## **1) Natural diversity in a populated world**



Shallu Thakur, Pankaj K. Singh, Alok Das, R. Rathour, M. Variar, S. K. Prashanthi, A. K. Singh, U. D. Singh, Duni Chand, N. K. Singh and Tilak R. Sharma

*Analysis of genetic variation and diversity of Rice stripe virus populations through high-throughput sequencing*

Lingzhe Huang, Zefeng Li, Jianxiang Wu, Yi Xu, Xiuling Yang, Longjiang Fan, Rongxiang Fang and Xueping Zhou

*A Genome-wide Combinatorial Strategy Dissects Complex Genetic Architecture of Seed Coat Color in Chickpea*

Deepak Bajaj, Shouvik Das, Hari D. Upadhyaya, Rajeev Ranjan, Saurabh Badoni, Vinod Kumar, Shailesh Tripathi, C. L. Laxmipathi Gowda, Shivali Sharma, Sube Singh, Akhilesh K. Tyagi and Swarup K. Parida

## **3) Natural diversity at the molecular level**


Walaa K. Mousa and Manish N. Raizada

*Target or barrier? The cell wall of early- and later-diverging plants vs cadmium toxicity: differences in the response mechanisms*

Luigi Parrotta, Gea Guerriero, Kjell Sergeant, Giampiero Cai and Jean-Francois Hausman

# Editorial: Natural diversity in the new millennium

Joanna M. Cross<sup>1</sup>\*, Chiarina Darrah<sup>2</sup>, Nnadozie Oraguzie<sup>3</sup>, Nourollah Ahmadi<sup>4</sup> and Aleksandra Skirycz<sup>5</sup>

<sup>1</sup> Horticulture Department, Inonu University, Malatya, Turkey, <sup>2</sup> Eunomia Research and Consulting, Bristol, UK, <sup>3</sup> Irrigated Agriculture Research and Extension Center (IAREC), Washington State University, Prosser, WA, USA, <sup>4</sup> Centre de coopération internationale en recherche agronomique pour le développement, Montpellier, France, <sup>5</sup> Instituto Tecnológico Vale Desenvolvimento Sustentável/Vale Institute of Technology Sustainable Development, Belem, Brazil

Keywords: natural diversity, crop improvement, next generation sequencing, biodiversity, plant biotechnology

Natural diversity is a recurrent theme in everyday life and in research. Thus, the objective of this topic was to highlight progress or novelties that occurred in this new millennium. The topic includes many different fields as can be seen by the variety of the articles. Indeed, subjects covered include variations in strawberry aroma, diversity of protein families, and peculiar ecosystems. However, several themes connect this apparently unrelated set of articles.

Unfortunately, one cannot ignore the issue of biodiversity and its loss. The United Nations Climate Change Conference was held in Copenhagen in 2009 to discuss ways of mitigating environmental change. No simple agreement was signed by all participating countries. That, combined with press coverage, has left a negative impression. The general opinion is thus that the world is divided in two: an environmentally friendly Western group and an environmentally unfriendly rest of the world. The articles reveal a more complex picture, as many countries are trying to reconcile sustainable industries with economic realities. For instance, the Brazilian government has issued laws to safeguard the peculiar Canga ecosystems while allowing mining (Skirycz et al., 2014). Efforts are underway to produce improved oil palms to spare land (Barcelos et al., 2015). Moreover, many ecological studies can be found in the literature, which highlights the international concern over the loss of biodiversity. Unfortunately, biodiversity loss also applies to crops. Many landraces are abandoned as farmers adopt a few high-yielding varieties, which results in a significant loss of genetic variability. Accordingly, more diversity was found in a disease resistance gene in rice landraces than in common varieties (Thakur et al., 2015). Once again, the articles illustrate an emphasis on collecting cultivars, even for orphan crops. For instance, the Ethiopian Institute of Biodiversity houses 5000 accessions of tef, a major staple crop for the country (Assefa et al., 2015). The number represents a four-fold increase over the last 20 years. Likewise, significant resources exist for millet (Goron and Raizada, 2015).

Assessment of biodiversity requires a combination of phenotypic and molecular techniques. Technological improvements have been significant in all areas. For example, metabolomics can now detect many complex molecules such as aroma compounds. As a result, factors determining the taste of strawberry (Negri et al., 2015) and other fruits can be elucidated. However, the most spectacular change comes from the decreased cost and enhanced speed of Next Generation Sequencing. Fifteen years ago, international consortia sequenced a few model species and major crops. Now many universities are acquiring their own sequencing equipment. Most plants covered in the articles have some sequencing resources, either transcriptomic data or an annotated genome. This opens enormous possibilities for identifying new genetic variations, assessing the variability of isozymes, and associating a given phenotype with regions in the genome. For instance, sequencing reveals the enormous evolutionary potential of viruses (Huang et al., 2015). In addition, extensive diversity was found in the MAPK, MAPKK, and MAPKKK kinase families in grapevine (Çakir and Kılıçkaya, 2015).

Edited and reviewed by: Richard A. Jorgensen, University of Arizona, USA

\*Correspondence: Joanna M. Cross joanna.cross@inonu.edu.tr

#### Specialty section:

This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science

Received: 20 September 2015 Accepted: 09 October 2015 Published: 29 October 2015

#### Citation:

Cross JM, Darrah C, Oraguzie N, Ahmadi N and Skirycz A (2015) Editorial: Natural diversity in the new millennium. Front. Plant Sci. 6:897. doi: 10.3389/fpls.2015.00897

Finally, these articles illustrate the spectacular biological diversity of plants. They seem to survive in any environment, by any means. Hence, a seemingly fixed structure such as the cell wall shows remarkable flexibility in response to environmental, physiological, and genetic cues (Parrotta et al., 2015). Moreover, the richness of plant metabolism is further enhanced by symbiotic relations with microorganisms (Mousa and Raizada, 2015). However, while well described, this diversity harbors many mysteries as to biological functions, evolutionary adaptations, and physiological mechanisms. With the enhanced technologies, let's hope we can understand and safeguard our beautiful world.

## AUTHOR CONTRIBUTIONS

JC, CD, and AS contacted participants; JC and AS edited manuscripts; NA and NO partly edited articles.

## REFERENCES

strawberries "Profumata di Tortona" (F. moschata) and "Regina delle Valli" (F. vesca). Front. Plant Sci. 6:56. doi: 10.3389/fpls.2015.00056


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Cross, Darrah, Oraguzie, Ahmadi and Skirycz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Canga biodiversity, a matter of mining

*Aleksandra Skirycz<sup>1</sup>, Alexandre Castilho<sup>2</sup>, Cristian Chaparro<sup>1</sup>, Nelson Carvalho<sup>1</sup>, George Tzotzos<sup>1</sup> and Jose O. Siqueira<sup>1</sup>\**

<sup>1</sup> Department of Sustainable Development, Vale Institute of Technology, Belém, Brazil <sup>2</sup> Vale S.A., AP Supervisao PCM, Carajas, Brazil

#### *Edited by:*

Elena R. Alvarez-Buylla, Universidad Nacional Autónoma de Mexico, Mexico

#### *Reviewed by:*

Luis Enrique Eguiarte, Universidad Nacional Autónoma de México, Mexico Alma Pineyro-Nelson, University of California Berkeley, USA

#### *\*Correspondence:*

Jose O. Siqueira, Department of Sustainable Development, Vale Institute of Technology, Rua Boaventura da Silva, 955 (Nazaré), 66055-090 Belém, PA, Brazil e-mail: jose.oswaldo.siqueira@itv.org

The Brazilian name canga refers to the ecosystems associated with superficial iron crusts typical of the Brazilian state of Minas Gerais (MG) and some parts of the Amazon (Flona de Carajas). Ironstone is associated with mountain plateaux and so, in addition to high metal concentrations (particularly iron and manganese), canga ecosystems, like other rock outcrops, are characterized by isolation and environmental harshness. Canga inselbergs, all together, occupy no more than 200 km<sup>2</sup> spread over thousands of km<sup>2</sup> of the Iron Quadrangle (MG) and the Flona de Carajas, resulting in considerable beta biodiversity. Moreover, the presence of different microhabitats within the iron crust is associated with high alpha biodiversity. Hundreds of angiosperm species have been reported so far across remote canga inselbergs and different microhabitats. Among these are endemics such as the cactus *Arthrocereus glaziovii* and the medicinal plant *Pilocarpus microphyllus*. Canga is also home to iron and manganese metallophytes, species that evolved to tolerate high metal concentrations. These are particularly interesting for studying metal homeostasis, as both iron and manganese are essential plant micro-elements. Besides being models for metal metabolism, metallophytes can be used for bio-remediation of metal-contaminated sites, and as such are considered among the priority species for canga restoration. "Biodiversity mining" is not the only mining business attracted to canga. Open cast iron mining generates as much as 5–6% of Brazilian gross domestic product, and a dialog between mining companies, government, society, and ecologists, enforced by legal regulation, is ongoing to find a compromise for canga protection and, where mining is unavoidable, for ecosystem restoration. The environmental factors that shaped canga vegetation, canga biodiversity, the physiological mechanisms that play a role, and ways to protect and restore canga are reviewed.

**Keywords: canga, iron, ecosystem, endemism, restoration**

#### **CANGA AND SIMILAR ECOSYSTEMS WORLD-WIDE**

In Brazil, the term canga refers to the ecosystem associated with superficial iron crust and present in the states of Minas Gerais (MG; Quadrilatero Ferrifero) and Para (Serra de Carajas; **Figure 1A**; **Table 1**). Literally translated, Minas Gerais means "general mines," reflecting the state's strong ties to the mining industry due to its rich mineral deposits. The Iron Quadrangle (IQ) covers approximately 7200 km<sup>2</sup>, with the superficial iron crusts distributed on the tops of mountains at altitudes ranging from 1000 to 2000 m above sea level. The total area covered by canga in the IQ is relatively small, approximately 100 km<sup>2</sup> (Dorr, 1964), and dispersed on mountain inselbergs (600–700 m above sea level). Similarly, in the Amazon state of Para (**Figure 1A**) canga occupies only 2 or 3% of natural clearings on iron rocks of the Carajas mountains, approximately 90 km<sup>2</sup>, with the surrounding area covered by dense tropical forest (Nunes, 2009).

Canga-like areas, characterized by the presence of ironstone and ferricrete soils, are not restricted to Brazil; in other parts of the world, however, they are known under different names. Canga-like habitats, referred to as banded iron formations (BIFs), have been reported in South-Western Australia (Gibson et al., 2010, 2012; **Table 1**). Like canga, BIFs form isolated inselbergs (some 700 km apart) on tops of mountains and are characterized by high local and regional diversity. Like canga, they are endangered by the mining industry, and their restoration poses a major challenge (Gibson et al., 2010).

### **ENVIRONMENTAL FORCES THAT SHAPED CANGA BIODIVERSITY**

The canga ecosystem has been shaped by thousands of years of evolution that gave rise to communities thriving in unique and severe environmental conditions. As is the case with other mountain top outcrops, the canga biome had to adapt to high ultraviolet (UV) exposure, high daily temperatures, rapid water loss, strong winds and poorly developed soil cover (Jacobi et al., 2007). High transpiration and low soil retention capacity challenge plants during periods of drought, even during the wet season. In addition to being shallow, acid (pH ∼4.0, in comparison to pH ∼7.0 for optimal agricultural soils) and nutritionally poor, canga soils also contain toxic levels of aluminum and heavy metals (**Table 1**). More specifically, canga substrates contain little phosphorus (P), magnesium (Mg), and calcium (Ca) but plenty of iron (Fe) and manganese

**FIGURE 1 | Canga. (A) Location of canga inselbergs investigated in Carajas and IQ.** Map generated and reproduced from Google Maps. Green indicator (Nunes, 2009), red indicators (Jacobi et al., 2007), violet indicator (Viana and Lombardi, 2007), blue indicator (Pifano et al., 2010). **(B–H)** Photographs of canga taken in the Carajas South Range, courtesy of Alexandre Castilho. **(I)** Mining operation in the Carajas region, courtesy of Alexandre Castilho.

(Mn; Vincent and Meguro, 2008; Oliveira et al., 2009; Silva, 2013). Accumulation of copper (Cu), nickel (Ni), zinc (Zn), chromium (Cr), and lead (Pb) has also been reported in plant samples harvested from the canga outcrop (Porto and da Silva, 1989; Silva, 1992). Importantly, metal toxicity is aggravated by low pH, which increases metal mobility and hence availability for plant uptake; for example, under low pH iron is released from otherwise insoluble ferric oxides (Morrissey and Guerinot, 2009).

Adding to the severe abiotic conditions, the floral composition of canga is also a result of mineral and topographical heterogeneity, which gives rise to distinct microhabitats within canga (e.g., Jacobi et al., 2007; Viana and Lombardi, 2007). Iron-rich substrates may be totally fragmented or form a thick, solid crust. Cliffs, caves, grassland, rock fields, cracks and rock crevices, depressions, temporary or permanent ponds, and temporarily flooded areas are all part of canga. Finally, canga biodiversity is affected by the surrounding ecosystems (Jacobi and Carmo, 2008a); in Carajas, canga constitutes an enclave within an Amazonian forest, whilst in the IQ canga is located in the transition zone between the Cerrado and the Atlantic Forest.

To summarize, harsh environmental conditions have supported the evolution of hardy and often unique plants that can thrive in dry, metal-rich but nutrient-poor soil. High biodiversity results from canga being not one habitat but rather a conglomerate of distinct microhabitats, from its dispersal along mountain tops rather than forming a continuous habitat, and from the high biodiversity of the surrounding ecosystems.

## **PLANT BIODIVERSITY PORTFOLIO**

Canga is home to hundreds of plant species (Jacobi and Carmo, 2011). Close to 500 have been reported so far, sampled from only a small percentage of the total area occupied by canga. Three taxonomic studies published for the IQ (Jacobi et al., 2007; Viana and Lombardi, 2007; Pifano et al., 2010) and one for the Carajas Mountains (Nunes, 2009) are referred to here.

Looking at the plant species, several characteristics emerge. Firstly, the floristic composition of canga differs significantly from that of the neighboring ecosystems. When compared with the Atlantic and Amazonian forests (Nunes, 2009; Pifano et al., 2010), the difference is obvious at first sight: the canga landscape resembles savanna covered by shrubs and herbaceous plants, with few trees and associated crawlers

#### **Table 1 | Comparison of iron based ecosystems in Brazil (canga) and Australia (banded iron formations).**


\*Approximate differences in element composition when compared to standard soil (Reganold et al., 2010). \*\*Dominance calculated based on number of species.

(**Figures 1B–H**). Thus, not surprisingly, 80% of the species identified in canga were found to be exclusive to canga in comparison with the Atlantic forest, swamp and riparian forest (Pifano et al., 2010). Less obviously, the canga floristic composition also differs from that of other rocky outcrops present in the IQ. A comparison of the 16 most represented angiosperm families found in sandstone and canga rocky outcrops revealed significant differences. For instance, whereas Solanaceae and Verbenaceae species are absent from the sandstone inselbergs, they are abundant in canga. The opposite is true for Eriocaulaceae and Xyridaceae (Jacobi and Carmo, 2008b).

Secondly, canga is characterized by high beta diversity, which refers to the differences measured between distinct canga inselbergs (Jacobi et al., 2007). In their work, Jacobi et al. (2007) investigated two sites approximately 32 km apart. Out of 235 fern and angiosperm species, only 27% were common to both sites. Even less overlap is found between the three studies from the IQ, which cover cangas hundreds of kilometers apart. Only two species were identified in all three studies and 18% were shared between any two of them. Not surprisingly, then, the plant composition in the Carajas mountains, 1000 km away, shares only seven species with those found in the IQ.
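The overlap figures quoted above can be made concrete with a small calculation. The following is a minimal sketch, assuming that the 235 species are the pooled total across both sites and that overlap is expressed as shared species divided by pooled species (the Jaccard index); the cited studies may have used a different measure. The species labels and function names are hypothetical and serve only as an illustration.

```python
def jaccard_similarity(site_a, site_b):
    """Shared species divided by all species pooled over both sites."""
    a, b = set(site_a), set(site_b)
    pooled = a | b
    if not pooled:
        raise ValueError("both species lists are empty")
    return len(a & b) / len(pooled)


def turnover(site_a, site_b):
    """A simple beta-diversity proxy: 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(site_a, site_b)


# Hypothetical toy species lists (not data from the cited studies).
site_1 = {"sp_a", "sp_b", "sp_c", "sp_d"}
site_2 = {"sp_c", "sp_d", "sp_e", "sp_f", "sp_g"}
similarity = jaccard_similarity(site_1, site_2)  # 2 shared out of 7 pooled

# Counts quoted in the text: 235 species in total, 27% common to both sites.
# Assumption: 235 is the pooled total, so about 63 species were shared.
total_species = 235
shared_fraction = 0.27
approx_shared = round(total_species * shared_fraction)
```

Under this reading, a shared fraction of 27% corresponds to a turnover of 73% between two sites only about 32 km apart.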

Thirdly, canga is characterized by high alpha biodiversity, which refers to differences measured within a single canga inselberg (Jacobi et al., 2007). High alpha biodiversity is strongly related to the presence of distinct microhabitats within canga, with differences in the amount and composition of soil and in moisture considered the main factors shaping diverse plant communities (Vincent and Meguro, 2008; Nunes, 2009; Silva, 2013). A comprehensive study by Viana and Lombardi (2007) reported 358

plant species representing 70 angiosperm plant families, which is approximately 15% of all known angiosperm families worldwide. The researchers examined and described four distinctive canga microhabitats. The majority of the plant species were found on so-called "grassy fields," which, as the name implies, form a seemingly homogeneous sea of grasses and sedges mixed with subshrubs and shrubs growing on a fragmented iron crust. In contrast, "rocky fields" represent a more alpine-like ecosystem characterized by a diversity of perennial and annual herbs (both monocots and dicots) growing in rock crevices formed in a solid iron crust. Notably, the endemic species, namely the cactus *Arthrocereus glaziovii*, the bromeliads *Dyckia consimilis* and *Vriesea minarum*, the orchid *Oncidium gracile*, *Sinningia rupicola*, and the legume tree *Mimosa calodendron*, are all found in the open "grassy" or "rocky" fields (Jacobi and Carmo, 2011). In places with deeper substrate and more organic matter, forest islands are established, with small tree species and shrubs covered with crawlers and epiphytes. The last microhabitat represents an area disturbed by human activity and is broadly similar to the grassy fields, albeit with only one fourth of the species richness. Remarkably, more than 60% of the plant species (235 out of 358) were unique to a given microhabitat and only five were present in all of them. This kind of microhabitat specificity was also reported by Jacobi et al. (2007) and Nunes (2009). Examples of other microhabitats found in canga and described by Jacobi et al. (2007) include small permanent rock pools inhabited by unicellular and filamentous algae; ponds formed during the wet season in shallow depressions and covered with rare monocotyledonous plants from the genus *Eriocaulon*; and entrances of small caves underneath the iron crust, which provide shade and humidity for mosses and delicate herbs.
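The microhabitat specificity described above (species unique to one microhabitat versus species present in all of them) can be tallied from per-microhabitat species lists. The sketch below is illustrative only: the survey data, species labels, and function name are hypothetical, and only the counts 235 and 358 come from the text.

```python
from collections import Counter


def microhabitat_specificity(habitats):
    """Return (species unique to one microhabitat,
    species present in every microhabitat, total species count)."""
    occurrences = Counter(
        species
        for species_set in habitats.values()
        for species in set(species_set)
    )
    n_habitats = len(habitats)
    unique = {sp for sp, n in occurrences.items() if n == 1}
    ubiquitous = {sp for sp, n in occurrences.items() if n == n_habitats}
    return unique, ubiquitous, len(occurrences)


# Hypothetical toy survey with the four microhabitat types named in the text.
toy_survey = {
    "grassy_field": {"sp1", "sp2", "sp3", "sp4"},
    "rocky_field": {"sp1", "sp5", "sp6"},
    "forest_island": {"sp1", "sp7"},
    "disturbed_area": {"sp1", "sp2"},
}
unique, ubiquitous, total = microhabitat_specificity(toy_survey)

# Fraction quoted in the text: 235 of 358 species unique to one microhabitat.
reported_unique_fraction = 235 / 358  # roughly 0.66, i.e. "more than 60%"
```

In the toy survey, five of seven species occur in a single microhabitat and one occurs in all four, which mirrors the pattern reported for canga on a much smaller scale.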

Finally, it was demonstrated that the exact floristic composition of canga can vary between dry and wet seasons (Nunes, 2009; Silva, 2013). Although canga is home to relatively few annual plants (Jacobi and Carmo, 2011), partial or complete loss of aboveground organs is a common adaptation to the dry season (Jacobi et al., 2007).

To our knowledge, canga diversity has to date been estimated exclusively on the basis of taxonomic identification, which may lead to either under- or over-estimation (Frankham, 2010), the former being more common. To illustrate the point, it suffices to mention the recent work of Nistelberger et al. (2014). Analysis of genomic and mitochondrial regions of a millipede species from the Australian BIFs pointed to the long-term isolation of distinctive populations inhabiting remote inselbergs, and hence to the need to place each population into a separate conservation unit.

When looking at individual species, canga biodiversity may seem overwhelming. However, there is uniformity when looking at vegetation type and plant families. In general, canga vegetation can be described as rocky, epilithic (growing on stone), and shrub dominated, with a large number of sedges, grasses, and orchids. Plant species are evenly distributed between shrubs, subshrubs, herbs, and epiphytes. Trees and crawlers are far less diverse (Jacobi and Carmo, 2011). In terms of families, the richest are Poaceae (grasses and sedges), Asteraceae (daisy-like herbs), Fabaceae (legumes), Myrtaceae (subshrubs and shrubs), Melastomataceae (shrubs and small trees), and Orchidaceae (orchids; Jacobi et al., 2007; Viana and Lombardi, 2007; Nunes, 2009; Pifano et al., 2010). It is, however, important to mention that these families are rich not only in canga but also in other rocky outcrops (Jacobi and Carmo, 2008b), and that they encompass 35% of the total ∼31000 angiosperm species reported in Brazil (Forzza, 2010). Area-wise, ironstone outcrops are dominated by shrubs, mainly dicots. Monocots account for a large proportion of herbaceous plants (grasses and sedges) and subshrubs. Dicot herbs are represented by numerous species with relatively small population sizes (Jacobi et al., 2007). This general characteristic of canga outcrops is shared with the plant communities found in Australian BIFs (**Table 1**), whose landscape is dominated by shrubs and herbaceous species from the Fabaceae, Myrtaceae, and Poaceae families, all generously represented in canga (Gibson et al., 2012).

In summary, canga ecosystems are characterized by high local and regional diversity. To date, hundreds of plant species, some endemic, have been identified across diverse canga microhabitats and geographical locations. The total area of canga is estimated to be approximately 200 km<sup>2</sup>, of which approximately 1–2% has been studied, pointing to the need for additional investigation. Genomics tools can play a crucial role in the accurate estimation of genetic diversity.
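The area figures quoted in this review can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text (roughly 100 km<sup>2</sup> of canga within the 7200 km<sup>2</sup> IQ, roughly 90 km<sup>2</sup> in Carajas, a total of about 200 km<sup>2</sup>, and 1–2% studied); the variable and function names are illustrative, and all values are approximations.

```python
# Approximate areas quoted in the text, in square kilometres.
iq_canga_km2 = 100.0       # canga within the Iron Quadrangle (Dorr, 1964)
carajas_canga_km2 = 90.0   # canga in the Carajas mountains (Nunes, 2009)
iq_total_km2 = 7200.0      # total extent of the Iron Quadrangle
total_canga_km2 = 200.0    # overall estimate quoted for canga


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# The two regional estimates add up to roughly the quoted total.
combined_km2 = iq_canga_km2 + carajas_canga_km2

# Canga covers only a small share of the Iron Quadrangle (about 1.4%).
iq_canga_percent = percent(iq_canga_km2, iq_total_km2)

# If 1-2% of the total canga area has been studied, that is 2-4 km2.
surveyed_km2 = tuple(total_canga_km2 * p / 100.0 for p in (1.0, 2.0))
```

On these figures, the floristic knowledge summarized here rests on only a few square kilometres of surveyed ground.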

## **CANGA AS SOURCE OF NEW METAL HYPER-ACCUMULATING SPECIES**

To cope with the harsh environmental conditions, canga plants developed a number of physiological adaptations, of which metal tolerance is by far the most interesting. Together with Australian BIFs, cangas may constitute one of the few ecosystems worldwide to "mine" for plants adapted to high iron. This section reviews why that matters.

Metallophytes, that is, species that evolved on metalliferous soils, are commonly divided into those that can exclude metals and those that can accumulate or even hyper-accumulate metals in their shoots (Baker, 1981; reviewed by Krämer, 2010 and Leitenmaier and Küpper, 2013). The first class represents the majority of the species (>95%), which evolved avoidance mechanisms to block metal uptake into cells or to mediate metal efflux and storage in root vacuoles, far from the photosynthetic cells. In contrast, a few percent of plants have undergone extensive adaptation that allows them to accumulate one or a few metals in their shoots, where the metals act as a repellent for herbivores and pathogens (Boyd, 2012). Metal transport from soil into the plant shoot, complexation with organic compounds to reduce toxicity, and sequestration inside vacuoles are the key processes considered for metal accumulation and tolerance (Krämer, 2010).

Growing on metal-rich soils, the canga plant community is considered a unique source "to mine" for novel metal hyper-accumulating taxa. Of these, manganese and iron hyper-accumulators stand out for a number of reasons. (1) In contrast to the several hundred known nickel, cadmium, and zinc hyper-accumulators, no more than 20 manganese hyper-accumulating species have been reported to date (Fernando et al., 2013). To our knowledge, no iron accumulating taxa have been reported and studied. (2) Manganese and iron are essential plant micro-nutrients and their deficiency impacts plant yield (Marschner, 2012). But in high concentrations,

manganese and iron are toxic to plants (Lynch and St Clair, 2004). As unraveling the molecular mechanisms underlying plant iron and manganese homeostasis is considered key to addressing both the deficiency and the toxicity issues reported for agricultural crops, hyper-accumulators can be considered a good tool. For instance, analogous to zinc hyper-accumulators (Krämer, 2010), canga metallophytes are expected to encompass metal sensitive populations, which grow in ironstone outcrops along canga but in metal free substrate. Metal tolerant and metal sensitive populations within a species provide a great resource for genetic and molecular mining. On the one hand, recombinant inbred lines derived from parents differing in the trait of interest are the most common way to discover the underlying QTLs. On the other hand, in systems biology "omics" based analysis, differences in transcripts, proteins, and/or metabolites measured between tolerant and sensitive plants are good candidates for explaining the observed phenotypes. (3) Soils with toxic levels of manganese are associated with mineral deposits but also with iron ore mining (strip mining generates manganese rich dust) and processing (the steel industry uses large amounts of manganese). A high concentration of manganese is toxic to plants but also to animals; in humans, manganese poisons the nervous system, resulting in Parkinson-like symptoms (Taba, 2013). Because hyper-accumulators have the capacity to remove manganese from the soil, they are considered a useful resource for phyto-remediation of post-industrial and post-mining areas (Whiting et al., 2004). (4) Finally, as metallophytes, like endemics, are unique to canga, they are a priority for post-mining canga restoration (see below).

Despite the obvious interest, we found only two rather old studies reporting the metal content of canga plants. Porto and da Silva (1989) and Silva (1992) examined 24 canga plant species, from both IQ and Carajas, for foliar concentration of Mn, Ni, Cr, Cu, Pb, and Fe. Twenty of the investigated specimens accumulated one or several metals at levels several times higher than the "standard reference plant." Although this is not enough to classify them as hyper-accumulators (van der Ent et al., 2013), it is a good start for further canga research. Most interestingly, seven of the examined species accumulated at least double the iron or manganese leaf concentration of standard reference plants (**Figure 2**). Two more species accumulated iron, and one more manganese, in roots but not in their leaves (**Figure 2**). In conclusion, the studies of Porto and da Silva (1989) and Silva (1992) demonstrate that canga plants have developed various strategies to cope with high iron and manganese concentrations, through exclusion or through accumulation, providing justification for additional studies.

To summarize, quoting Krämer (2010): "metal hyperaccumulators will be instrumental in the development of systems biology approaches toward an integrated understanding of plant metal homeostasis" and Erskine et al. (2013): "to prevent continued biodiversity loss and to benefit from the unique adaptive mechanisms that exclude, tolerate, and even accumulate toxic metals in mine site rehabilitation, metallophytes must be recognized as vital asset at developmental stage of a mining operation." From this perspective, identification of canga metallophytes appears an obvious direction to pursue.

## **OTHER PHYSIOLOGICAL ADAPTATIONS OF THE CANGA PLANTS**

Besides metal tolerance, and similar to the Australian BIFs (**Table 1**), the canga plant community is known for its overall resilience to water scarcity (Jacobi et al., 2007). Rocky plant communities found in IQ are home to the largest group of resurrection plants in Brazil, which are able to tolerate almost complete desiccation by partial or complete loss of their above-ground organs during periods of severe drought. Less drastic adaptations include thick, waxy, imbricate leaves; guard cells sunken into pits; extensive and deep root systems; and the presence of water storing organs, e.g., pseudo-bulbs in orchids. Some plants, such as Clusia trees, are characterized by CAM photosynthesis, which allows them to keep guard cells shut during the day by using night acquired CO<sub>2</sub> stored in the organic acid malate. The slow growth rates of canga plant species may be seen as yet another adaptation to water shortage. In stressful environments, plants accumulate rather than spend their resources (carbon and water) at the expense of growth, to avoid starvation if the conditions get worse (Skirycz et al., 2011).

### **CANGA, METABOLITES, AND MEDICINAL PLANTS**

To our knowledge, there is no adequate molecular understanding of the physio-morphological traits that allow canga plants to thrive in their stressful environment. Of course, parallels can be drawn from model plants, but the particulars need investigation. The potential for interesting findings can be demonstrated by available metabolite data. Many of the IQ plant species that are also present in canga have been investigated for their metabolite profiles. These plants were those with reported medicinal properties, and metabolite analysis was performed to investigate plausible agents responsible for the observed therapeutic effects (**Table 2**). Although these studies did not aim at understanding plant physiology, they revealed a richness of secondary metabolites associated with the IQ ecosystem, with some of the compounds being well known for their involvement in plant stress responses. For example, flavonols play a role in oxidative stress associated with drought and high UV (Skirycz et al., 2007), and past findings point to a role of anthocyanin in metal complexation (Hale et al., 2001). Based on existing data, we believe that more comprehensive metabolite profiling of the canga plant species will shed light on their adaptation to the environment of this unique rocky outcrop.

Touching upon medicinal plants, it is interesting to mention the small tree *Pilocarpus microphyllus* (popular name jaborandi), endemic to Brazil and found in Flona de Carajas, where it is associated with forest clearings and poor soils such as those present in canga. Leaves of Pilocarpus are collected by local communities and sold to pharmaceutical companies as a source of the FDA approved anti-glaucoma drug Timpilo (Costa, 2012). The main constituent of Timpilo is pilocarpine, an imidazole alkaloid inaccessible to cheap organic synthesis. To avoid irreversible depletion of the wild population, pilocarpine collection is regulated by the environmental agencies. Because of its value for the local community, jaborandi is considered one of the key plants for post-mining re-vegetation. Unfortunately, when cultivated, Pilocarpus gives low levels of pilocarpine. It is thought that one of the reasons for this is the lack of environmental stressors (e.g., metals or low soil pH) that boost pilocarpine production in the wild. We believe that a molecular understanding of pilocarpine synthesis would shed light on jaborandi adaptation to disturbed environments. In turn, this may result in income generation for local communities and the pharmaceutical industry, as well as environmental benefits arising from more effective conservation and restoration practices of the mining industry.

#### **HOW TO PROTECT AND RESTORE?**

In 1992, during the Earth Summit in Rio de Janeiro, the Convention on Biological Diversity (CBD) was signed to help create legislation for biodiversity protection. Mineral rich regions present a serious dilemma for the CBD (Gibson et al., 2010; Jacobi et al., 2011). On the one hand, they are home to unique ecosystems, often considered biodiversity hot spots. On the other hand, the demand for metallic ores is growing rapidly and so is the number of mining permits, a trend that will continue in the future. Thus, to sustain metalliferous ecosystems while at the same time ensuring satisfactory returns from mining activities, a compromise must be made. As long-term outcomes of any restoration efforts cannot be predicted (see below), it is indisputable that parts of canga must be protected; exactly how much is a subject of debate. In Brazil, state protected reserves, which are exempted from mining rights, cover a relatively small part of the mineral regions. In MG, 39 km<sup>2</sup> of canga lies within the borders of Serra do Rola Moca State Park (Jacobi et al., 2008). In Carajas, iron ore is locked in two mountain ranges called Serra Sul and Serra Nord, located in the territory of the Natural Reserve Park, Flona de Carajas. Mining already takes place in Serra Nord (**Figure 1I**), and a new mining complex in Serra Sul, S11D, is planned to start producing in 2016. Importantly, parts of S11D will be exempted from mining, as a result of heavily negotiated agreements between environmental agencies and the mining company. For instance, of the 187 iron-stone caves 152 will be fully protected (Vale Sustainability Report, 2012). Moreover, a protection belt between the mountain lakes, do Violão and do Amendoim, and the mining operation is planned to protect the lakes and the surrounding canga outcrop (**Figures 1C,D**).

**Table 2 | Examples of secondary metabolites found in plant species from Minas Gerais (MG) and identified in canga outcrop by Jacobi et al. (2007).**

In mining areas where protection is not possible, ecosystem restoration is a legal requirement. In Brazil, to obtain or renew mining permits, companies need to submit area restoration plans to be implemented upon mine closure. These include a plethora of steps that precede the closure and sometimes even mining itself (see below). Moreover, actions mitigating ecosystem destruction during the mining operation need to be included. For example, as part of the compromise with the environmental agencies, a number of technological solutions have been implemented in the S11D Carajas complex, such as truckless, conveyor belt operated delivery of the iron ore from the mining site to the processing plant outside the Flona de Carajas reserve (Vale Sustainability Report, 2012, 2013). It is noteworthy that no such or similar solutions were required from mining companies when exploitation started 30 years ago in Serra Nord. Similarly, in the past, it was legally acceptable to rehabilitate exhausted mining areas with plant cover from any available plant species (see Griffith and Toy, 2001 for the early re-vegetation strategies of iron mines in IQ). Nowadays the aim is to restore ecosystems relying exclusively on native taxa.

The dependence of canga on the unique biotic components mentioned earlier makes restoration a difficult task. Iron ore open air mining is aggressive. Ironstone outcrops and associated biota are stripped so that the iron ore deposits can be accessed, with the subsequent excavation reaching even 300 m in depth. The landscape left at the end of the mining operation has its topography erased, with none of the previous terrain structure or biotic and abiotic components remaining. Hydrology can be seriously disturbed due to changes in the water table and erosion (Castro et al., 2011a,b). Despite these considerable difficulties, the principal restoration steps do not differ from those used in the recovery of other post-mining areas, and are listed here keeping canga in mind. (1) As mentioned previously, restoration goals are set before the mining begins and long before it ends. (2) Careful ecosystem reviews (e.g., Jacobi and Carmo, 2011) are used as restoration guidelines. Species information is collected together with habitat information (e.g., soil composition; for canga see, e.g., Nunes, 2009; Messias et al., 2013) and used for ecological modeling (reviewed by Memmott, 2009; Stouffer et al., 2012) to delineate and visualize relations between organisms (for canga see, e.g., Araújo et al., 2006; Jacobi et al., 2008; Jacobi and Carmo, 2011). Based on the above, priority species and interactions are selected. For instance, detailed analysis of the angiosperm composition identified the following species as priorities for canga recovery in IQ: (i) the grass *Andropogon ingratus*, the small shrub *Lychnophora pinaster*, and the sedge *Bulbostylis fimbriata* as most abundant in terms of individual number and total area occupied; (ii) *Mimosa calodendron* as a nitrogen fixing nursery taxon; and (iii) Vellozia species for their metal tolerance and accumulation mechanisms (see previous section on metallophytes; Jacobi et al., 2008). 
Plants used by indigenous communities, such as *Pilocarpus microphyllus*, and endemic species, such as *Arthrocereus glaziovii*, have also been considered a priority. (3) Mining areas are
prepared so as to minimize deforestation, e.g., as in the S11D complex, by shifting all the ore processing and tailings outside the Flona de Carajas territory. (4) Topsoil removal and storage is now being practiced, as topsoil contains seed banks and associated micro-biomes and was proven to be a top priority for ecosystem restoration (Cooke and Johnson, 2002). For instance, the work of Rezende et al. (2013) demonstrated successful growth of 14 out of 15 tested canga herbaceous, shrubby, and woody species over a period of 42 months, using topsoil salvaged from iron ore mining sites in MG. Another published example is a pilot study done for a new iron-ore mining complex in IQ, called Minas Rio (Mello et al., 2014). Flora rescue operations from canga areas to be mined salvaged 45 plant species belonging to different families and life forms, which were subsequently introduced together with 5 cm of top-soil to a different location. After a period of approximately 3 years, almost 2000 individuals representing 38 plant species were recorded. Notably, the majority of the species, such as the already mentioned sedge *Bulbostylis fimbriata*, successfully reproduced within the duration of the experiment. Since 2010, a further 108 plant species have been rescued by the Minas Rio operation. (5) Landscape architecture to stabilize slopes and restore some of the original topography is also required to recreate a functional ecosystem, as for example in the case of the Germano mine in MG (see below; Castro et al., 2011a). (6) Revegetation using available planting material. As described above, many canga species can be successfully rescued before mining begins and subsequently cultivated *ex situ* awaiting restoration. Seeds and plantings obtained from *ex situ* collections constitute the obvious starting material for re-vegetation. (7) During the early phases of restoration, applying fertilizers and using chemical means to deal with invading plant species and a plethora of pests is often required. 
Introducing native legumes to fix nitrogen is not uncommon. (8) Monitoring criteria, which are essential to assess restoration progress, are usually decided at the early operation stages (Cooke and Johnson, 2002). Soil organic matter content and species richness are traditionally used; however, physiological parameters (Cooke and Suski, 2008), nowadays increasingly accessible from aerial photographs (e.g., ground cover, rate of biomass increase, chlorophyll content), are becoming more popular.

An often cited example of a successful restoration effort is the Jarrah forest in Western Australia (reviewed by Grant and Koch, 2007). Within 50 years of the mining operation closure, the original ecosystem, with more than 700 native plant and 200 animal species, has been reproduced and is nowadays considered fully self-sustainable. It demonstrates that restoration is possible where all involved parties honor their obligations; that they might not is the main fear of environmentalists. In Brazil, two examples of recently closed iron ore mines are Aguas Claras and Germano in MG (Castro et al., 2011a). Due to its location, the area of Aguas Claras will be used to build a small town neighboring a new lake created in the depleted mining pit. In contrast, the Germano pit closure focuses on environmental rehabilitation, with extensive landscape architecture being carried out at the moment. Closure of Minas Rio (MG) and S11D (Carajas), new projects that include canga restoration planning, is envisaged only in 30–40 years. As it will be a very long time before restoration success can be evaluated, the need for canga protection remains, as mentioned before, indisputable.

### **SUMMARY, CANGA, AND MEANING OF BIODIVERSITY IN THE NEW MILLENNIUM**

Although very small in size, canga demonstrates the biodiversity challenges of the ongoing Millennium. Due to its unique setting, canga contributes significantly to regional and global biodiversity, and as such is seen as a holy grail by conservationists and a treasure box by biologists. Lying on mineral deposits, canga is, however, threatened by human activities, presenting an important dilemma between biodiversity and economic gains, ever more acute in the current millennium, with the human population and its prosperity rapidly growing. A difficult compromise between human economic activities and biodiversity is simply unavoidable. Protect and restore discussions are not easy, and new concepts, such as environmental offsets, are emerging. Hopefully the current Millennium will bring new solutions to biodiversity preservation as it brings challenges. "The future is unknown but it can be invented."


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. ITV-DS Institute is funded by the mining company Vale, which has mining activities in the regions of canga occurrence. This review is related to one of the ITV roles, which is support of canga characterization and restoration efforts. Considering that Frontiers is a scientific journal, all authors kept to the objectivity criteria.

*Received: 16 June 2014; accepted: 03 November 2014; published online: 24 November 2014.*

*Citation: Skirycz A, Castilho A, Chaparro C, Carvalho N, Tzotzos G and Siqueira JO (2014) Canga biodiversity, a matter of mining. Front. Plant Sci. 5:653. doi: 10.3389/fpls.2014.00653*

*This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science.*

*Copyright © 2014 Skirycz, Castilho, Chaparro, Carvalho, Tzotzos and Siqueira. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# **Oil palm natural diversity and the potential for yield improvement**

*Edson Barcelos <sup>1</sup>\*, Sara de Almeida Rios <sup>1</sup> , Raimundo N. V. Cunha <sup>1</sup> , Ricardo Lopes <sup>1</sup> , Sérgio Y. Motoike <sup>2</sup> , Elena Babiychuk <sup>3</sup> , Aleksandra Skirycz <sup>3</sup> and Sergei Kushnir <sup>3</sup>*

*1 Embrapa Amazonia Ocidental, Empresa Brasileira de Pesquisa Agropecuária, Manaus, Brazil, <sup>2</sup> Department of Phytotechnology, Federal University of Viçosa, Viçosa, Brazil, <sup>3</sup> Department of Sustainable Development, Vale Institute of Technology, Belém, Brazil*

African oil palm has the highest productivity amongst cultivated oleaginous crops. The species can constitute a single crop capable of fulfilling the growing global demand for vegetable oils, which is estimated to reach 240 million tons by 2050. Two types of vegetable oil are extracted from the palm fruit on a commercial scale. Crude palm oil and palm kernel oil have different fatty acid profiles, which increases the versatility of the crop in industrial applications. Plantations of the current varieties have an economic life-span of around 25–30 years and produce fruits all year round. Thus, a predictable annual palm oil supply enables marketing plans and adjustments in line with economic forecasts. Oil palm cultivation is one of the most profitable land uses in the humid tropics. Oil palm fruits are the richest plant source of pro-vitamin A and vitamin E. Hence, the crop both alleviates poverty and could provide a simple practical solution to eliminate global pro-vitamin A deficiency. Oil palm is a perennial, evergreen tree adapted to cultivation in biodiversity rich equatorial land areas. The growing demand for palm oil threatens the future of the rain forests and has a large negative impact on biodiversity. Plant science faces three major challenges in making oil palm the key element of building the future sustainable world. The global average yield of 3.5 tons of oil per hectare (t) should be raised to the full yield potential, estimated at 11–18t. The tree architecture must be changed to lower labor intensity and improve mechanization of the harvest. Oil composition should be tailored to the evolving needs of the food, oleochemical and fuel industries. The release of the oil palm reference genome sequence in 2013 was the key step toward these goals. The molecular bases of agronomically important traits can be, and are beginning to be, understood at single base pair resolution, enabling gene-centered breeding and engineering of this remarkable crop.

#### **Keywords: oil palm,** *E. guineensis***,** *E. oleifera***, germplasm, hybrid, breeding**

#### *Edited by:*

*Ann E. Stapleton, University of North Carolina Wilmington, USA*

#### *Reviewed by:*

*Mingsheng Chen, Chinese Academy of Sciences, China R. H. V. Corley, Independent, UK*

#### *\*Correspondence:*

*Edson Barcelos, Embrapa Amazonia Ocidental, Empresa Brasileira de Pesquisa Agropecuária, Rodovia AM 010, Km 29, Manaus, Amazonas 69011-970, Brazil edson.barcelos@embrapa.br*

#### *Specialty section:*

*This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science*

*Received: 30 September 2014 Accepted: 09 March 2015 Published: 27 March 2015*

#### *Citation:*

*Barcelos E, Rios SA, Cunha RNV, Lopes R, Motoike SY, Babiychuk E, Skirycz A and Kushnir S (2015) Oil palm natural diversity and the potential for yield improvement. Front. Plant Sci. 6:190. doi: 10.3389/fpls.2015.00190*

## **Introduction**

No human activity has altered the face of the planet more than agriculture (Foley et al., 2005), which is one of the principal causes of biodiversity loss (Green et al., 2005). The increase in global agricultural production is expected to happen in tropical countries, where cropland expanded by approximately 48,000 km<sup>2</sup> per year from 1999 to 2008 (Phalan et al., 2013). Cropland development is most controversial in the tropics, because they support high species richness and endemism, and have large projected increases in demand for food from human populations growing in size and wealth (Laurance et al., 2014). Analysis of crop distribution and expansion in 128 tropical countries showed that overall expansion of annual crops has been more rapid and more widespread than expansion of perennial crops, and has occurred across much of South America, Africa, and tropical Asia. Crops that expanded most during 1999–2008 were soybeans, maize, paddy rice, and sorghum, in that order (Phalan et al., 2013). Soybean expansion is further recognized as a major cause of biodiversity loss in the Brazilian Cerrado savannas (Fearnside, 2001).

Individual crops differ in their biodiversity impacts, depending on how and where they are cultivated. Coffee covers a relatively small area, but tends to replace habitats of particularly high biodiversity value (Phalan et al., 2013). Oil palm fruit is only fifth on a list of biodiversity threats (Phalan et al., 2013); nevertheless, "Few developments generate as much controversy as the rapid expansion of oil palm into forest-rich developing countries such as Indonesia" (Sheil et al., 2009). Why some crops have received relatively little attention from conservationists is a matter of debate, yet the negative impacts of the South-East Asian oil palm industry on biodiversity, and on orangutans in particular, have been well documented and publicized (Fitzherbert et al., 2008). With regard to the destruction of great ape habitat, similar concerns are also expressed about wildlife habitat conversion by oil palm plantations in Africa (Wich et al., 2014), where about two million hectares are likely to be converted for oil palm cultivation (Murphy, 2014). Species diversity, density and biomass of invertebrate communities are estimated to suffer decreases of at least 45% from land-use transformation of tropical forests to oil palm plantations (Barnes et al., 2014). Furthermore, as a substitute for reforestation, the native biodiversity of oil palm plantations is far lower than that of rubber tree plantations, which are the primary current threat to the rain forests in Cambodia (Fitzherbert et al., 2008). On the other hand, oil palm was found to be the most sustainable with respect to the maintenance of soil quality, net energy production and greenhouse gas emissions, when biodiversity loss due to oil palm expansion was analyzed in relation to alternative crops for oil or energy, such as soybean, rapeseed, corn or sugar cane (de Vries et al., 2010). 
Peatland deforestation for oil palm cultivation in West Kalimantan, Indonesia, has a large negative impact on greenhouse gas emissions (Carlson et al., 2012). However, global analysis of oil palm cultivation suggests that the crop may encourage forest reversion and lower global emissions (Villoria et al., 2013), mainly because oil palm plantations store more carbon than alternative agricultural land uses (Sayer et al., 2012).

Deforestation and peatland degradation can be avoided when degraded lands, such as *Imperata cylindrica* grassland in Indonesia (Wicke et al., 2011) and cattle pastures in the Amazon (Godar et al., 2014; Villela et al., 2014), are used for oil palm cultivation; a solution embraced by environmentalists and policy makers (World Resources Institute<sup>1</sup>). Even Greenpeace admits that "good palm oil" is acceptable if policy makers: (1) put an end to deforestation; (2) introduce peatland restoration policies; (3) support smallholder farms and (4) involve local communities in palm oil business<sup>2</sup>. To reduce the environmental footprint of oil palm, The Roundtable for Sustainable Palm Oil (RSPO) was established in 2004<sup>3</sup>. The RSPO is a non-profit association that brings together palm oil producers, processors and traders, consumer goods manufacturers, retailers, banks and investors, as well as environmental and social non-governmental organizations (NGOs) to develop and implement a global standard for sustainable palm oil in order to produce Certified Sustainable Palm Oil (CSPO). On the other hand, smallholder farmers have difficulty meeting CSPO criteria (Murphy, 2014).

Why is this global integration so important? Oil palm cultivation is one of the most profitable land uses in the humid tropics (Sayer et al., 2012). In the state of Pará, Brazil, for example, the average annual monetary return on investment is US\$ 2000 per hectare (Villela et al., 2014). With a further 32 million hectares of degraded land suitable for oil palm cultivation (Ramalho-Filho et al., 2010), the crop has the potential to evolve into a multibillion dollar business in Brazil. Oil palm is often considered an industrial crop, but in many areas it is a valuable smallholder crop (Feintrenie et al., 2010). Globally, three million smallholders live from oil palm cultivation. The share of palm oil production by small, family-owned estates is 30% worldwide and reaches 80% in Nigeria, Africa's largest producer (Morcillo et al., 2013). Government policies in Malaysia, Indonesia and Brazil favor smallholder involvement in the oil palm industry. Indonesia has a target of 40% of production coming from smallholders, who supply the oil mills. In Indonesia, the livelihoods of 25 million people depend in one way or another on oil palm production (Murphy, 2014). Thus, oil palm cultivation alleviates poverty and, with the right governmental policies, could transform the livelihoods of millions of people (Sayer et al., 2012).

Oil palm (*Elaeis guineensis*, Jacq.) is by far the most productive oil crop and alone is capable of fulfilling the large and growing world demand for vegetable oils, which is estimated to reach 240 million tons by 2050 (Corley, 2009). Per hectare of cropland, oil palm plantations give 3–8 times more oil than any other temperate or tropical oil crop. In 2012, for instance, 56.2 million tons of palm oil were produced on 17.24 million hectares. Only 23.6 million tons of oil were extracted from rapeseed grown on 36.4 million hectares<sup>4</sup>. In November 2014, palm oil was valued by the international commodities markets as the vegetable oil with the lowest production costs<sup>5</sup>, e.g., US\$ 700 and US\$ 850 per metric ton of palm oil and rapeseed oil, respectively. Oil palm cultivation, however, is labor intensive. As labor costs increase, overseas workers often provide a cheap labor force. Half a million Indonesian workers are now being recruited to Malaysian oil palm plantations. Manpower shortfall resulted in 15% losses of fruits in East Malaysia, whilst the lack of good field training is a further contributing factor to production losses (Murphy, 2014). It is a challenge for palm breeders to alter tree architecture in such a way as to lower the labor intensity of the crop and to facilitate mechanization of the harvest.
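The per-hectare comparison in this paragraph can be checked with a short calculation. The sketch below uses only the 2012 production and area figures quoted in the text; the variable and function names are illustrative assumptions and do not come from any cited source.

```python
# Worked check of the 2012 oil productivity figures quoted in the text.
# Units: production in million tons of oil, area in million hectares.
# All names here are illustrative assumptions, not from the cited sources.
PRODUCTION_MT = {"oil palm": 56.2, "rapeseed": 23.6}
AREA_MHA = {"oil palm": 17.24, "rapeseed": 36.4}


def oil_yield_t_per_ha(crop: str) -> float:
    """Average oil yield in tons per hectare (Mt / Mha = t/ha)."""
    return PRODUCTION_MT[crop] / AREA_MHA[crop]


palm_yield = oil_yield_t_per_ha("oil palm")
rapeseed_yield = oil_yield_t_per_ha("rapeseed")
yield_ratio = palm_yield / rapeseed_yield

print(f"oil palm: {palm_yield:.2f} t/ha")
print(f"rapeseed: {rapeseed_yield:.2f} t/ha")
print(f"ratio:    {yield_ratio:.1f}x")
```

With these figures the ratio comes out at roughly 5, which falls within the 3–8 fold range stated in the text.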

<sup>1</sup>http://www.wri.org

<sup>2</sup>http://www.greenpeace.org

<sup>3</sup>http://www.rspo.org

<sup>4</sup>http://faostat3.fao.org/home/E

<sup>5</sup>http://www.indexmundi.com

Oil palm is a C3, evergreen tropical perennial tree (Corley and Tinker, 2003). Adult palms planted at the optimal density of 130–150 trees per hectare are in a relatively steady state in terms of canopy development and have a large leaf area index of between 4 and 5, leading to a light interception efficiency close to one (Pallas et al., 2013). Unlike other studied angiosperms, oil palm does not regulate photosynthesis to adjust source–sink imbalances (Legros et al., 2009); instead, photosynthates are converted into a reserve pool of non-structural carbohydrates (NSC), mainly located in the tree trunk as glucose and starch. The main physiological function of transitory NSC storage is to balance sink and source fluctuations over periods ranging from a day to seasons. The NSC reserve pool in oil palm is so large that it can theoretically sustain tree growth for 7 months and, most importantly, maintain fruit growth and energetically costly oil biosynthesis at fruit maturation, regardless of cloud cover or periodic suboptimal growth conditions (Legros et al., 2009). In contrast, in sunflower, an annual oil crop, intercepted solar radiation during seed filling is rate limiting and determines weight per seed and oil concentration (Aguirrezábal et al., 2003). Thus, the unusual characteristics of source-sink interactions, combined with the high efficiency of solar radiation interception and continuous year-round fruit production, are likely the ecophysiological traits that determine the superior productivity of oil palm (Basri Wahid et al., 2005).

Plant scientists commonly argue that finding solutions for increasing crop yield potential, e.g., doubling yield by improving photosynthesis efficiency, and closing the yield gap will satisfy the food demand of the growing human population, which is estimated to reach 9–10 billion by the year 2050. The challenge for oil palm planters will be to close the yield gap between the average plantation output, at present 3.5t, and that of some of the best known varieties, which in favorable agro-climatic conditions produce up to 9–12t (Murphy, 2009). However, the intensification of land use appears to result in further biodiversity loss, the so-called tradeoffs between biodiversity value and yield (Phalan et al., 2011, 2014). High-yield oil palm expansion spares land at the expense of forests in the Peruvian Amazon (Gutiérrez-Vélez et al., 2011). Thus, an increase in the yield potential of the oil palm crop is a necessary, but not sufficient, requirement for the sustainable future of tropical forests.
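The size of the yield gap described here can be expressed as a fraction of the attainable yield. The sketch below is a minimal illustration using the 3.5t average and the 9–12t range quoted from Murphy (2009); the function name and the definition of the gap as `1 - actual / potential` are assumptions made for illustration, not a method taken from the cited work.

```python
def yield_gap_fraction(actual_t_per_ha: float, potential_t_per_ha: float) -> float:
    """Share of the attainable yield that is not realized (0 means no gap)."""
    if potential_t_per_ha <= 0:
        raise ValueError("potential yield must be positive")
    return 1.0 - actual_t_per_ha / potential_t_per_ha


# Figures from the text, in tons of oil per hectare.
AVERAGE_YIELD = 3.5               # current average plantation output
BEST_VARIETY_RANGE = (9.0, 12.0)  # best varieties under favorable conditions

gaps = [yield_gap_fraction(AVERAGE_YIELD, best) for best in BEST_VARIETY_RANGE]
for best, gap in zip(BEST_VARIETY_RANGE, gaps):
    print(f"best {best:.0f} t/ha -> gap {gap:.0%}")
```

Under this definition, the average plantation realizes only about 29–39% of what the best varieties achieve.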

The aim of this review is to acquaint the reader with natural diversity of oil palm species and how it can be used to increase productivity of the future oil palm plantations. We will begin by covering basic facts about oil palm biology and cultivation, followed by a brief review of germplasm and genomic resources, and how these have been used for gene discovery and breeding purposes. The potential of other oleaginous palms to lower environmental impact of oil palm cultivation is emphasized.

## **Oil Palm Biogeography, Biology and Cultivation**

The genus *Elaeis* of the monocotyledonous palm family *Arecaceae* was formally introduced into botanical classification in 1763 by Nicholas Joseph Jacquin, who described *Elaeis guineensis*, known as African oil palm. The Greek word "ελαιον" (oil), transliterated "elaion," gave the genus its name (Hartley, 1977). The genus comprises two taxonomically well-defined species; the second is the American oil palm, *E. oleifera*. The well-known, phylogenetically closest relative of the oil palms is the coconut palm, *Cocos nucifera*, which is also a vegetable oil producing crop.

## **African Oil Palm**

The crop grown world-wide is the African oil palm, which is naturally abundant in all the African rain forests. Both climate and humans have shaped its modern biogeographic distribution. The late Holocene phase of dramatic forest decline, around 2500 years ago, was favorable for the expansion of this sun-loving, pioneering species (Maley and Chepstow-Lusty, 2001). The oleaginous properties of the fruits have been important in the subsistence economy in Africa for the past 5000 years (Sowunmi, 1999).

Analysis of the species' natural genetic diversity suggests that wild populations can be separated into three groups located at the extreme west of Africa, in equatorial Africa and on Madagascar Island. The highest allelic diversity was found among Nigerian palm populations, indicating the possible center of origin (Bakoumé et al., 2015). Semi-wild feral populations of African oil palm found in the Brazilian state of Bahia are very similar to palms from Nigeria and were most likely established during the period of the slave trade (Bakoumé et al., 2015).

Thus, the majority of wild oil palm populations inhabit tropical lowlands with the average annual rainfall of about 1780–2280 mm and temperature ranging from 24 to 30°C. Accordingly, varieties standard in cultivation are usually sensitive to water deficit. The atmospheric humidity also strongly influences oil palm photosynthetic capacity. Low air humidity restricts stomatal opening and CO<sub>2</sub> uptake (Smith, 1989). Another ecophysiological characteristic that limits the latitude and altitude ranges of cultivation is cold sensitivity. The growth of common oil palm varieties is suppressed at ambient temperatures below 15°C (Corley and Tinker, 2003). Oil palm can be cultivated on a broad range of soils (Corley and Tinker, 2003).

African oil palm trees can reach 15–18 meters in height, up to 30 meters in a dense forest. It is believed that some palm groves are more than 200 years old (Corley and Tinker, 2003). The leaves could be 8 meters in length. It takes about 2 years for the first leaf primordia to reach the fully expanded stage. To achieve maximal yield on commercial plantations, the leaf length is a critical trait that determines tree planting density.

Oil palms are a monoecious species that produces unisexual male and female inflorescences in an alternating cycle. Such "temporal dioecism" results in allogamous reproduction by cross-pollination (Adam et al., 2011). Inflorescences are enclosed during their development by the spathe, a large bract, which ruptures just before flower maturity is reached. Both genetic and environmental factors influence inflorescence sex determination. Reduced photosynthesis due to defoliation or high density planting, for example, promotes male inflorescence development. This observation was critical for the development of industrial scale seed production by controlled pollination (Durand-Gasselin et al., 1999). Breeding for higher productivity results in varieties that produce a larger number of female inflorescences and therefore suffer from a shortage of pollen. Thus, pollination efficiency has a large impact on the yield of the crop.

Whilst wind pollination can occur, maximal pollination efficiency depends on insects. The main pollinators are weevils, a type of beetle, of the genus *Elaeidobius*, in particular *E. kamerunicus*. Weevils complete their entire life cycle by feeding on the palm male flowers and male inflorescence tissues. The aniseed scent, which is shared by female and male inflorescences, is attributed to the emission of methyl chavicol (Lajis et al., 1985), which attracts *E. kamerunicus* (Hussein et al., 1989). In their search for male inflorescences, weevils visit female flowers, depositing pollen grains by accident (Tandon et al., 2001). To achieve maximal pollination efficiency, *E. kamerunicus* was introduced on plantations both in South-East Asia and in Latin America (Corley and Tinker, 2003).

The oil palm fruit is a sessile drupe. Fruits grow in large bunches and mature 5–6 months after pollination. Oil palm accessions show great variation in fruit shape and size (Corley and Tinker, 2003). The pericarp of the oil palm fruit is subdivided into the outer layer, or exocarp, the fleshy mesocarp, and the endocarp, which in oil palm is called the shell. The shell encases the seed or kernel, i.e., embryo and endosperm. The crude palm oil and kernel palm oil are extracted from mesocarp and kernel, respectively.

In wild type palms, endocarp thickness varies from 2 to 8 mm among accessions (Corley and Tinker, 2003). Endocarp development depends on the major effect *SHELL* gene. *SHELL* mutant alleles with co-dominant monogenic inheritance characterize the so-called *pisifera* palms (*sh/sh*) that produce shell-less fruits (Beirnaert and Vanderweyen, 1941). Wild type *Sh/Sh dura* palms develop fruits with a thick endocarp. Intraspecific heterozygous (*Sh/sh*) hybrids, known as *tenera* palms, have thinner shells surrounded by a distinct fiber ring (Beirnaert and Vanderweyen, 1941). The shell thickness has a major effect on oil content, with *teneras* having 30% more mesocarp and, correspondingly, 30% greater oil content in bunches than *duras* (Corley and Tinker, 2003). Owing to their higher oil yields, *tenera* palms were already selected by the pre-colonial cultures in West Africa (Devuyst, 1953).
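The co-dominant monogenic inheritance described above can be illustrated with a simple Punnett-square calculation. The sketch below is only an illustration under the assumption of standard Mendelian segregation with equal gamete viability; the function names `genotype` and `cross` are hypothetical and are not taken from any of the cited studies.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Fruit forms by SHELL genotype, as described in the text.
FRUIT_FORM = {"Sh/Sh": "dura", "Sh/sh": "tenera", "sh/sh": "pisifera"}


def genotype(allele_a, allele_b):
    """Return a canonical genotype label with the wild-type allele first."""
    first, second = sorted((allele_a, allele_b))  # "Sh" sorts before "sh"
    return f"{first}/{second}"


def cross(mother, father):
    """Expected fruit-form fractions among the progeny of one cross.

    Assumes Mendelian segregation: each parent passes on either of its
    two alleles with equal probability.
    """
    combos = [
        genotype(a, b)
        for a, b in product(mother.split("/"), father.split("/"))
    ]
    counts = Counter(FRUIT_FORM[g] for g in combos)
    return {form: Fraction(n, len(combos)) for form, n in counts.items()}


if __name__ == "__main__":
    print(cross("Sh/Sh", "sh/sh"))  # dura x pisifera
    print(cross("Sh/sh", "Sh/sh"))  # tenera x tenera
```

Under these assumptions a *dura × pisifera* cross is expected to yield only *tenera* progeny, whereas a *tenera × tenera* cross segregates into *dura*, *tenera*, and *pisifera* in a 1:2:1 ratio.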

African oil palms differ in the coloring of the exocarp, producing either *nigrescens* or *virescens* fruit types. *Nigrescens* fruits accumulate large amounts of anthocyanins, which accounts for the deep violet to black color at the fruit apex (**Figure 1A**). When unripe, *virescens* fruits are green, turning orange during ripening due to accumulation of carotenoids and chlorophyll degradation. Five spontaneous dominant mutant alleles in a *VIRESCENS* gene abolish anthocyanin synthesis, which explains the *virescens* fruit type (Singh et al., 2014). The *nigrescens* fruit phenotype is the likely wild type. The occurrence of *virescens* palms is usually less than 1%; however, in some Congolese oil palm populations up to 50% of trees produce *virescens* fruits. Artificial selection by local communities has driven the persistence of the newly arising mutations (Zeven, 1972).

In most angiosperms, flowers and immature fruitlets are naturally thinned by organ abscission in response to nutritional status. This phenomenon is negligible in oil palm, which sheds only ripe fruits (Roongsattham et al., 2012). It is plausible that low abscission of flowers and immature fruitlets contributes to the exceptional oil palm productivity. Ripe fruit shedding is the main indicator of whether a bunch is ready to be harvested (Corley and Tinker, 2003); on the other hand, shedding is one of the causes of losses at harvesting (Osborne et al., 1992).

**FIGURE 1 |** *E. guineensis* **and** *E. oleifera.* **(A)** Commercial African *tenera* oil palm from a cross *dura* Deli *× pisifera* Nigeria. The tree is 5-years-old and has *nigrescens* fruit type (inset). **(B)** 26-years-old plantation of *tenera* palms (Deli *×* Ghana). Trees are 7–8 meters tall. **(C)** Fruit bunch harvest. A worker is using a knife to cut off bunches from a tree on the plantation shown in **(B)**. Bunch ripeness is assessed by the presence of shed fruits on the ground. **(D)** *E. oleifera*. A wild tree that is more than 30-years-old (Manicoré, Amazonas, Brazil). **(E)** Walking palm. The same tree as in **(D)** photographed from a different angle to illustrate the procumbent trunk, which is outlined by the dotted line. **(F)** African oil palm male inflorescences. This individual has particularly long male inflorescence stalks, which are indicated with arrows. **(G)** Fruit bunch stalk. The 3-months-old fruit bunch of the American oil palm is shown. Bar indicates the short stalk. **(H)** Collection of an *E. oleifera* bunch. Harvesting oil palm requires skilled workers, even when trees are not very tall. **(I)** Thickness of the bunch stalk. The cut-off bunch from the tree in **(H)**. Cutting through the bunch stalk (orange arrow), which is composed of a very fibrous tissue, requires physical strength.

The stalk of fruit bunches is short and thick in oil palms (**Figures 1G,I**). The stalks of male inflorescences are longer (**Figure 1F**). Cutting fruit bunches off the trees is a laborious process (**Figures 1B,C,H**), thus short and thick stalks are the traits that limit harvest mechanization (Le Guen et al., 1990).

To achieve synchronous germination of commercially produced seeds, combined temperature and humidity treatments are required to break the dormancy. Each germinated seed is maintained in a pre-nursery for 4–5 months until the plantlet reaches the four-leaf stage, after which young palms are grown for about a year in a nursery before they are transferred to the field. Establishment of a leguminous cover prior to planting prevents soil erosion and surface run-off, improves soil structure and palm root development, increases the response to mineral fertilizer, and reduces the danger of micronutrient deficiencies (Corley and Tinker, 2003).

## **American Oil Palm**

*Elaeis oleifera* (Kunth) Cortés is known as the American oil palm. The species is native to and broadly dispersed in Central America and the northern regions of South America. Small and dense *E. oleifera* populations grow along riverbanks, tolerating both shade and flooding well, which indicates a broader environmental adaptability compared to the African oil palm (Corley and Tinker, 2003). Judging by the higher morphological variation of the trees, the region occupied by Colombia, Suriname, and North-West Brazil is thought to be the species' center of origin (Meunier, 1975; Ooi et al., 1981). In the Amazon River Basin, which is considered a center of secondary diversification (Meunier, 1975; Barcelos et al., 2002), many *E. oleifera* populations are found on Amazonian Dark Earths, Terra Preta de Índio in Portuguese. Amazonian Dark Earths were formed in the past by pre-Columbian populations and are highly sustained fertile soils supported by microbial communities that differ from those extant in adjacent soils (Lima et al., 2014). In spite of this association of palms with human habitation, there are no historical indications of artificial selection for improved yield, which in *E. oleifera* remains significantly lower than in the African oil palm. The oil to bunch ratio of *E. oleifera* is about 5%, as compared to 25% in *E. guineensis teneras* (Barcelos, 1998b).

A distinguishing feature of *E. oleifera* is a much shorter, often procumbent trunk, a trait for which the species is also known as a walking palm (**Figures 1D,E**). After procumbence, the basal part of the plant dies, whilst adventitious roots sprouting from the part in contact with the soil allow the plant to restart growth. The high proportion of parthenocarpic fruits, which may constitute up to 90% of the total, is another striking characteristic of the *E. oleifera* fruit bunches as compared to the African species. Parthenocarpic fruits often abort, contributing to poor yield. Immature fruits are green, turning orange at maturity, which resembles *virescens* African oil palms (**Figures 2A,B**). The positioning of leaflets on *E. oleifera* leaves differs from that of *E. guineensis* (Corley and Tinker, 2003). *E. oleifera* pollination depends on insects. However, the profile of volatiles emitted by the inflorescences at anthesis is different, and the species does not synthesize methyl chavicol (Gomes, 2011).

## **Interspecies Oil Palm Hybrids**

African and American oil palm species are sexually compatible (Hardon and Tan, 1969). F1 hybrids show vegetative vigor and mid-parent stem growth increment (Corley and Tinker, 2003). *E. oleifera* leaf morphology and parthenocarpic fruit development (**Figure 2D**) behave as dominant traits. F1 hybrids from crosses between some *E. oleifera* accessions and *nigrescens E. guineensis* parents have the *virescens* fruit phenotype, with fruits that are bright orange at maturity (**Figures 2C,D**).

**FIGURE 2 |** *E. oleifera × E. guineensis* **interspecific hybrids at EMBRAPA breeding station Rio Urubu (Amazonas, Brazil). (A)** Controlled pollination. A worker is removing the spathe on an elite *E. oleifera* tree that is three meters tall. He wears safety gear to climb the tree, a protective mask and gloves. In commercial seed production, female inflorescences are bagged 1 week before anthesis. **(B)** Bunch ripening. Three bunches at different times after pollination are visible. As the fruits mature, they change color from green to orange. The youngest female inflorescences were not used in pollination; those bunches are still enclosed in the spathe, which is an *E. oleifera* species-specific characteristic. **(C)** F1 interspecific hybrid on a commercial plantation. The tree is 4 years old. Its stem is about half as long as that of African oil palm trees of the same age. Orange mature bunches are easily spotted within the tree crown. **(D)** Parthenocarpic fruits in a mature bunch of the F1 interspecific hybrid. Inset: a few fruits were transversely cut. The fruit to the right has a normal seed. The white endosperm tissue of the kernel is surrounded by the black endocarp; the mesocarp is orange. Parthenocarpic fruits have residual black endocarp and no kernel. **(E)** Female inflorescence of an F1 hybrid at anthesis. The inflorescence is still enclosed in the spathe, which is similar to the parent *E. oleifera.* **(F)** Female flowers. The same inflorescence as in **(E)**. The spathe was removed to show female flowers at anthesis. Three-lobed flower stigmas are yellowish in color and receptive to pollen. **(G)** Male inflorescence of an F1 hybrid at anthesis. The overall morphology and flower identity are normal. **(H)** Andromorphic inflorescences. This 15-years-old tree developed five fully andromorphic inflorescences, indicated with blue arrows, and a single male inflorescence (behind the leaf rachises to the right). The orange arrow points to a fruit bunch. **(I)** Partial andromorphy. An individual andromorphic inflorescence can show different proportions of male flowers (orange arrow) and female flowers (blue arrow). **(J)** Fruitlets from an andromorphic inflorescence. Some fruitlets abort after anthesis, and the dead, dry flower pistils are black (orange arrow); others develop parthenocarpically (blue arrow). Andromorphic inflorescences in **(H)** are full of developing fruitlets.

To achieve reasonable yield from F1 hybrids, assisted pollination is required, i.e., one worker to pollinate 10–20 hectares of a plantation. Depending on the accessions of the *E. oleifera* and *E. guineensis* parents, a number of developmental abnormalities can contribute to lower F1 hybrid fertility, including lower pollen yield, poor pollen germination, and poor anther dehiscence (Corley and Tinker, 2003), as well as lower emission of volatiles by the inflorescences at anthesis (Gomes, 2011), which is a likely cause of poor attractiveness for *E. kamerunicus* (Tan, 1985). The spathe encasing the female inflorescence at anthesis could present a mechanical barrier for pollinating insects (**Figures 2B,E,F**). Of interest for developmental biologists are the andromorphic inflorescences, a trait that reduces pollen production and characterizes some F1 hybrid combinations (**Figures 2G–J**). Whether allopolyploidization will alleviate or exacerbate the fertility problems of the interspecific oil palm hybrids is unknown.
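To give a sense of the labor that assisted pollination implies, the rule of thumb of one worker per 10–20 hectares can be turned into a rough staffing range. The sketch below is only a back-of-the-envelope calculation; the function name `pollinators_needed` and the example plantation size are hypothetical, and real staffing will depend on palm age, terrain, and flowering intensity.

```python
import math


def pollinators_needed(area_ha, ha_per_worker_low=10, ha_per_worker_high=20):
    """Return (fewest, most) workers needed to pollinate `area_ha` hectares.

    The defaults reflect the range quoted in the text: one worker
    covers between 10 and 20 hectares.
    """
    fewest = math.ceil(area_ha / ha_per_worker_high)
    most = math.ceil(area_ha / ha_per_worker_low)
    return fewest, most


if __name__ == "__main__":
    # Hypothetical 1000-hectare hybrid plantation.
    print(pollinators_needed(1000))
```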

It is possible that some of the F1 hybrid developmental abnormalities and dominance-recessiveness gene interactions can be attributed to the non-additive gene expression that is typical for interspecific hybrids (Chen, 2013). In view of the development of oil palm genomics, it is tempting to consider ion beam deletion mutagenesis (Ishikawa et al., 2012) as an experimental tool to identify the molecular basis of the traits that distinguish oil palm species and their hybrids.

## **Natural Variation for Oil Palm Improvement**

The large scale establishment of Congolese and South-East Asian commercial plantations in the 1910s–1920s was quickly followed by research on crop improvement by selection and breeding (Corley and Tinker, 2003). Oil palm breeding was influenced by maize breeding, which relies on the development of inbred parental lines to produce homogeneous F1 hybrids. Accordingly, reciprocal recurrent selection and family-individual selection methods were commonly chosen by oil palm breeders for the development of parental lines that are used in commercial F1 hybrid seed production. Ten percent productivity gains per decade were reported by the French and Malaysian breeding programs (Corley and Tinker, 2003).

The exploitation of the superior oil content of the *teneras* began in the 1930s on Congolese plantations, which led Beirnaert to explain the *dura*, *pisifera*, and *tenera* phenotypes by the co-dominant monogenic inheritance of the *shell* (*sh*) mutant allele (Beirnaert and Vanderweyen, 1941). On Malaysian plantations, the cultivation of *teneras* (*Sh/sh*) produced by controlled pollination of *duras* (*Sh/Sh*) with pollen of *pisifera* palms (*sh/sh*) took off in 1956. Most commercial seeds today are intraspecific *dura × pisifera* (D *×* P) hybrids (Corley and Tinker, 2003).

Oil palm breeders use "breeding populations of restricted origin" (BPRO) that can be traced back to distinct, often small groups of wild or unimproved ancestral palms (Rosenquist, 1986). For example, Deli *duras* are used today as the mothers for almost all commercial *tenera* seed production. The Deli *dura* palms can be traced to four individuals planted in the Bogor Botanical Gardens (Java, Indonesia) in 1848 (Kushairi and Rajanaidu, 2000; Cochard et al., 2009). The most commonly used *pisiferas* descend from a limited number of origins as well. A single Django *tenera* palm from Congo gave rise to the AVROS (Algemeene Vereniging van Rubber Planters ter Oostkust van Sumatra, now Indonesian Oil Palm Research Institute, IOPRI<sup>6</sup>) *pisiferas* widely used for seed production in Indonesia, Malaysia, Papua New Guinea, and Costa Rica (Rajanaidu et al., 2000; Corley and Tinker, 2003). Thus, commercial *teneras* have a narrow genetic base due to the restricted number of ancestral progenitors (Hardon, 1968; Ooi and Rajanaidu, 1979).

Many breeders realized the need for new material of sufficient genetic variability for future progress. Thus, the major task of the oil palm research institutions was adequate characterization of the natural diversity, either by prospection of *in situ* natural populations or by the establishment of *ex situ* germplasm collections. Beginning in the 1950s, tens of thousands of palms from *in situ* natural populations were screened at low cost to isolate just a few elite individuals that were then used in breeding programs. This strategy was largely adopted by the Institut de Recherches pour les Huiles et Oléagineux (IRHO), presently CIRAD<sup>7</sup> (Meunier, 1969). Since the 1970s, to capture a broader spectrum of natural variation and to protect the trees from destruction, the Palm Oil Research Institute of Malaysia (PORIM), presently the Malaysian Palm Oil Board (MPOB)<sup>8</sup>, has engaged in random sampling of palms from wild populations to establish an *ex situ* germplasm collection of 1467 accessions (Rajanaidu, 1994). This strategy is more expensive than *in situ* prospection. The MPOB Nigerian collection alone occupies 200 hectares. However, the *ex situ* approach allows more thorough characterization of the traits of interest and guarantees the preservation of traits that might become of interest in the future (Rajanaidu et al., 2000; Paterson et al., 2013). The fate of *in situ* germplasm is unpredictable, as the populations can be either destroyed or replanted. In the survey organized by the Food and Agriculture Organization of the United Nations, 29 participating institutions reported a total of 21103 oil palm accessions (F.A.O, 2010)<sup>9</sup>.

Phenotypic screens of the *E. guineensis* germplasm collections conducted by the main oil palm research centers MPOB (Malaysia), CIRAD (France), IOPRI (Indonesia), and the Empresa Brasileira de Pesquisa Agropecuária (EMBRAPA, Brazil) revealed a significant phenotypic diversity for the valuable agronomical characteristics, such as: (1) leaf petiole, rachis length, i.e., breeding for the so-called compact palms; (2) increment in the growth in height, i.e., breeding for shorter palms; (3) bunch number, weight and production, i.e., oil yield; (4) fresh fruit bunch and crude palm oil yield; (5) total and vegetative dry matter production; (6) fruit and kernel size, i.e., fatty acid (FA) profile; (7) fruit shell thickness; (8) Fusarium wilt disease tolerance; (9) FA composition and iodine value; (10) carotene and vitamin E contents; (11) lipase activity; (12) *in vitro* regeneration potential; (13) drought and cold tolerance (Rajanaidu et al., 2000; Corley and Tinker, 2003). Based on the phenotypic data, selected accessions have been subsequently used to develop improved *dura* and *pisifera* fruit type parental palm varieties.

<sup>6</sup>http://iopri.org/

<sup>7</sup>http://www.cirad.fr/

<sup>8</sup>http://www.mpob.gov.my/

<sup>9</sup>http://www.fao.org/docrep/013/i1500e/i1500e00.htm

Oil palm breeders became interested in the agronomic potential of *E. oleifera* at the beginning of the last century. *E. oleifera* was introduced to Africa in the 1920s and to Asia in the 1950s. However, it is only in the last 30–40 years that *E. oleifera* natural populations have been thoroughly sampled to establish *ex situ* germplasm collections in Malaysia, Ivory Coast, Costa Rica and Brazil (Meunier, 1975; Escobar, 1981; Ooi et al., 1981; Rajanaidu, 1986; Barcelos et al., 1999, 2002). The FAO database registered 506 accessions, of which 244 are maintained by EMBRAPA at a breeding research station<sup>10</sup> located in the municipality of Rio Preto da Eva, state of Amazonas, Brazil.

The American oil palm is a source of many economically valuable traits, of which the most important are (1) slow height increment, which simplifies harvest (Corley and Tinker, 2003); (2) a higher proportion of unsaturated FAs in palm oil (Montoya et al., 2014); (3) lower lipase activity in mature fruit mesocarp, which extends the period between harvest and fruit processing (Sambanthamurthi et al., 1995; Cadena et al., 2013); (4) higher vitamin A and E contents, which improve oil nutritional value (Rajanaidu et al., 2000); and (5) broader environmental adaptability (Barcelos, 1986, 1998b). In addition, the American oil palm is more resistant to several diseases (Corley and Tinker, 2003), including bud-rot caused by *Phytophthora palmivora* and *Fusarium* wilt (Barcelos, 1986).

Hybrids between *E. guineensis* and *E. oleifera* excited much interest because of their slower growth and the higher unsaturation of their oil (Corley and Tinker, 2003). Interest in interspecific hybrids increased further with the recognition of their resistance to fatal yellowing disease, which is a major threat to oil palm cultivation in Latin America, a discovery that led to the first commercial plantations in the 1980s. Amblard et al. (1995) analyzed 429 hybrid progenies obtained by crossing *E. oleifera* and *E. guineensis* of different origins. Bunch and oil production in the best inter-origin combinations reached 85 and 78% of the average values for the commercial oil palm cultivars, respectively. Some selected hybrids have oil productivity as high as commercial *teneras*, but only with assisted pollination because of the serious fertility problems. Seeds of high yielding interspecific hybrids are produced by CIRAD/PalmElit<sup>11</sup> (COARI hybrids) and EMBRAPA/Dendê do Pará S.A.<sup>12</sup> (MANICORÉ hybrids), using wild *E. oleifera* palms indigenous to the Coari and Manicoré municipalities in the Amazon river basin.

## **Genomics for Oil Palm Improvement**

Oil palm is a diploid (2n = 32) with an estimated genome size of 1.8 gigabases (Gb). A total of 1.535 Gb of the *E. guineensis* (AVROS, *pisifera* fruit form) reference genome assembly was released to the public in 2013 (Singh et al., 2013b) and is freely available<sup>13</sup>. For comparative purposes, the genome of the American oil palm was also sequenced. Most of the reference genome is represented by segmental duplications, and not triplications, indicating that oil palm is a paleotetraploid. Analysis of conserved gene order revealed that the duplications were retained in *E. oleifera*, so that the segmental duplications pre-dated the divergence of the African and American oil palms. On the other hand, 57% of the 1.8-Gb *E. guineensis* genome comprises repetitive elements, of which 47% were uncharacterized previously, with 73% absent from the *E. oleifera* genome, indicating extensive molecular speciation that might account for the fertility problems of interspecific oil palm hybrids.

Genome sequence and transcriptome data from 30 tissue types were used to predict at least 34,802 genes (Singh et al., 2013b). De novo assembly from RNA-seq data resulted in 51,452 oil palm unigenes (Lei et al., 2014). To characterize the genic regions in a greater detail, the methylation filtered libraries of the African and American oil palm species were sequenced (Low et al., 2014). Sequence analysis revealed single nucleotide polymorphisms (SNP) at densities 2.30 and 2.83 per 100 bp for *E. guineensis* and *E. oleifera*, respectively.
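The SNP densities quoted above are easier to compare when expressed as an average distance between neighboring polymorphisms. The following sketch performs only this unit conversion; it assumes that SNPs are counted over the sequenced genic regions and makes no claim about how evenly they are distributed. The function name `mean_snp_spacing_bp` is hypothetical.

```python
def mean_snp_spacing_bp(snps_per_100_bp):
    """Convert a SNP density (per 100 bp) into mean spacing in base pairs."""
    return 100.0 / snps_per_100_bp


# Densities reported in the text (Low et al., 2014).
DENSITY_PER_100_BP = {"E. guineensis": 2.30, "E. oleifera": 2.83}

if __name__ == "__main__":
    for species, density in DENSITY_PER_100_BP.items():
        spacing = mean_snp_spacing_bp(density)
        print(f"{species}: about one SNP every {spacing:.1f} bp")
```

On these figures, the sequenced regions carry roughly one SNP every 43 bp in *E. guineensis* and every 35 bp in *E. oleifera*.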

For a perennial tree that flowers only 2–3 years after seed germination, breeding oil palm requires 10–19 years per cycle of phenotypic selection (Wong and Bernardo, 2008). Molecular breeding uses genetic markers linked to the traits of choice for early pre-selection of desired phenotypes and has the potential to greatly shorten the breeding cycle, reducing costs. The data of Low et al. (2014) were used to generate a final set of 4,451 SNPs that were selected for developing a customized oil palm specific SNP array (OPSNP3) printed on the Infinium HD iSelect BeadChips platform (Ting et al., 2014). Genotyping across 199 palms from two separate F1 hybrid mapping populations, i.e., an *E. oleifera × E. guineensis* interspecific cross and a *dura × pisifera* intraspecific cross, took less than 3 months (Ting et al., 2014) and greatly improved marker density and genome coverage in comparison to the first reference maps based on AFLP and SSR markers (Barcelos, 1998a; Barcelos et al., 2002; Singh et al., 2009; Billotte et al., 2010; Ting et al., 2013). Refined genetic maps combined with careful phenotyping of trees are likely to facilitate mapping and identification of the molecular bases of both monogenic and quantitative trait loci (QTL) that underpin major agricultural traits of interest (Yang et al., 2014).

## **Increasing Oil Palm Productivity**

The immediate impact of the oil palm genome sequence was identification of the *SHELL* gene (Singh et al., 2013a) that was shown to encode a MADS-box transcription factor homologous to the *Arabidopsis* ovule identity and seed development regulator SEEDSTICK. Two different amino-acid substitution mutations in a dimerization and DNA-binding domain of the SHELL protein occur in the *shMPOB* and *shAVROS* spontaneous mutant alleles. Mutant proteins are likely to act as *trans*-dominant negative isoforms, which explains the co-dominant phenotype in *tenera* palms. Molecular markers for *SHELL* gene alleles could be used to distinguish *dura*, *tenera*, and *pisifera* plants in the nursery long before they are planted in a field. Nursery stage screening can eliminate erroneous planting of *dura* palms and control the precision of hybrid seed production. Marker-assisted introgression of the *SHELL* gene alleles on different genetic backgrounds could accelerate construction of new *dura* and *pisifera* palms.

<sup>10</sup>https://www.embrapa.br/en/amazonia-ocidental

<sup>11</sup>http://www.palmelit.com/en/

<sup>12</sup>https://www.embrapa.br/en/amazonia-ocidental

<sup>13</sup>http://genomsawit.mpob.gov.my/genomsawit/

It will be important to understand whether the *shMPOB* and *shAVROS* alleles, or similar *trans*-dominant alleles constructed by protein engineering, show a dosage effect that could further increase the oil yield in palms that have the genotype *Sh/sh/sh*, for example. The effect of *SHELL* gene variation on the mesocarp yield of other commercially useful palm species, such as the date (*Phoenix dactylifera*), açaí (*Euterpe oleracea*), and peach (*Bactris gasipaes*) palms, can be tested.

Further yield enhancements were brought by breeding *dura* and *pisifera* parents for the higher ratio of female inflorescences; bunch weight; oil to bunch ratio; oil recovery and earlier flowering. Donors of such traits are breeding accessions of Deli palms that have high yield; AVROS lines characterized by precocity, high yields and growth vigor; Ekona palms that have high oil to bunch ratio and earlier flowering Yangambi palms (Alvarado et al., 2010).

The best *dura × pisifera* combinations have a 28–32% oil to bunch ratio and can produce up to 10t annually, e.g., (i) CIRAD Deli × Yangambi (PalmElit<sup>14</sup>); (ii) Evolution (Alvarado et al., 2010)<sup>15</sup>, crosses of Deli *dura* with composite *pisifera* carrying traits introgressed from several oil palm populations.

The potential yield of hypothetical oil palm genotypes that combine physiologically plausible attributes is estimated at about 18.5t (Corley, 1998), which is almost double that of the best varieties on the oil palm seed market. Site yield potential varies amongst individuals of the same descent, in part due to residual genetic variation in parental lines. Exceptionally high yields of 12–13t could be anticipated if some well performing individual trees could be multiplied (Sharma and Tan, 1997). Oil palms do not branch unless the single terminal vegetative shoot apical meristem is damaged. Thus, to produce planting material from high yield elite individuals, the only practical way of vegetative propagation is *in vitro* clonal propagation, or micropropagation (Corley and Tinker, 2003). Oil palm micropropagation remains an inefficient, lengthy process, however. Most genotypes are recalcitrant in tissue culture, which requires empirical tests of numerous media formulas. Introgression into elite varieties of the superior somatic embryogenesis capacity known for some palm accessions is considered a partial solution to the problem (Ting et al., 2013). The use of inducible versions of genes that promote somatic embryogenesis is a worthwhile approach (Heidmann et al., 2011) that has not been tested with oil palm.
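The yield figures quoted in this review can be placed side by side as simple yield gaps. The sketch below is only illustrative arithmetic: it takes the average plantation output (3.5t), the best commercial combinations (10t) and the estimated potential (18.5t) at face value, assumes that all three are expressed in the same units, and uses a hypothetical helper named `yield_gap`.

```python
def yield_gap(actual, benchmark):
    """Return (absolute gap, fraction of the benchmark achieved)."""
    return benchmark - actual, actual / benchmark


# Figures quoted in the text, all in the same units (t).
AVERAGE_PLANTATION = 3.5   # present average plantation output
BEST_COMMERCIAL = 10.0     # best dura x pisifera combinations
POTENTIAL = 18.5           # estimated physiological potential

if __name__ == "__main__":
    gap, achieved = yield_gap(AVERAGE_PLANTATION, BEST_COMMERCIAL)
    print(f"Average vs. best commercial: gap {gap:.1f}t, {achieved:.0%} achieved")

    gap, achieved = yield_gap(BEST_COMMERCIAL, POTENTIAL)
    print(f"Best commercial vs. potential: gap {gap:.1f}t, {achieved:.0%} achieved")
```

Read this way, average plantations reach about a third of what the best commercial material delivers, and the best commercial material reaches a little over half of the estimated potential.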

An alternative to micropropagation of oil palm is reverse breeding, a plant breeding technique that produces parental lines for any heterozygous plant (Dirks et al., 2009). The suppression of meiotic crossovers and the transmission of non-recombinant chromosomes to haploid gametes are key to reverse breeding. Gametes are subsequently regenerated as doubled-haploid offspring, among which the parental lines are selected (Wijnker et al., 2014). Thanks to the knowledge of the oil palm gene space (Low et al., 2014), meiosis regulators, such as DMC1 orthologs (Wijnker et al., 2012), can be identified and then controlled. The frequencies of spontaneous haploids in seed progeny are very low, 1,100 haploids among 60 million seedlings (Dunwell et al., 2010). Cultures of oil palm microspores have not yielded doubled-haploids so far (Corley and Tinker, 2003). The totipotency of the male gametophyte is thought to be negatively regulated by a histone deacetylase-dependent mechanism, which is affected by the stress treatments, such as cold or heat shock, that are used to induce haploid embryo development in culture (Li et al., 2014). Two percent of oil palm microspores exposed to low temperature and starvation stress initiated cell division and formed embryoids (Indrianto et al., 2014). It is likely that inhibitors of histone deacetylases, trichostatin A for example (Li et al., 2014), will further increase the efficiency of oil palm microspore reprogramming to somatic embryogenesis. Genotyping doubled-haploids using SNP arrays (Ting et al., 2014) could then enable reverse breeding of oil palms.
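The rarity of spontaneous haploids can be made concrete by converting the reported counts into a frequency. The sketch below is plain arithmetic on the two numbers quoted above (1,100 haploids among 60 million seedlings); the function name `expected_haploids` and the example screen size are hypothetical, and the calculation assumes that the reported frequency applies uniformly to any seedling population.

```python
HAPLOIDS_FOUND = 1_100
SEEDLINGS_SCREENED = 60_000_000

# Frequency of spontaneous haploids per seedling.
HAPLOID_FREQUENCY = HAPLOIDS_FOUND / SEEDLINGS_SCREENED

# Average number of seedlings that must be screened to find one haploid.
SEEDLINGS_PER_HAPLOID = SEEDLINGS_SCREENED / HAPLOIDS_FOUND


def expected_haploids(n_seedlings, frequency=HAPLOID_FREQUENCY):
    """Expected number of haploids in a screen of `n_seedlings`."""
    return n_seedlings * frequency


if __name__ == "__main__":
    print(f"Frequency: {HAPLOID_FREQUENCY:.2e} per seedling")
    print(f"About one haploid per {SEEDLINGS_PER_HAPLOID:,.0f} seedlings")
    # Hypothetical screen of one million seedlings.
    print(f"Expected in 1,000,000 seedlings: {expected_haploids(1_000_000):.1f}")
```

This corresponds to roughly one haploid per 55,000 seedlings, which illustrates why induced haploid production from microspores is attractive.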

## **Yield Gap Caused by the Diseases**

Two major diseases threaten the oil palm industry. In South-East Asia, the basal stem rot disease caused by white rot fungi of the genus *Ganoderma* is the major problem (Paterson, 2007; Rees et al., 2009). Nearly 60% of plantations in Malaysia have reported diseased trees. Basal stem rot is lethal: infected plants stop producing fruit and eventually die. The average tree mortality rate of 3.7% is equivalent to losses of US\$ 570 million per year (Mohammed et al., 2014). White rot fungi are characterized as facultative saprophytes, which are generally difficult to control. There are no good sources of natural genetic disease resistance among either African or American oil palm accessions (Durand-Gasselin et al., 2005). The research efforts have focused on a more detailed understanding of the molecular defense responses in those plant-pathogen interactions, with the hope of finding practical solutions to control the disease (Ho and Tan, 2014). To make cellulose available, white rot fungi are capable of degrading lignin to carbon dioxide and water (Paterson, 2007). Thus, understanding lignin biosynthesis in oil palm is of interest, whilst lignin structure modification by breeding could result in genetic resistance. Candidate genes for breeding basal stem rot resistance are being sought amongst transcripts and proteins that alter their expression patterns upon infection (Ho and Tan, 2014). Current practical solutions are biological control with *Trichoderma* spp. fungi and palm endophytes, and the implementation of correct agronomical and phytosanitary practices (Mohammed et al., 2014).

The mysterious and devastating disease known as fatal yellowing (translated from the Portuguese "amarelecimento fatal") or lethal bud rot ("pudrición de cogollo," Spanish) is considered a major problem for the oil palm industry in Latin America (Chinchilla, 2008). Entire estates in Panama, Colombia, Suriname, Brazil, and Ecuador have been destroyed by the disease (De Franqueville, 2003). Fatal yellowing has variable symptoms, which causes considerable confusion in the research field (Corley and Tinker, 2003). Despite numerous efforts to test candidate fungi, bacteria, phytoplasmas and viroids, there is no conclusive evidence that a phytopathogen is the primary cause of the disease.

<sup>14</sup>http://www.palmelit.com/

<sup>15</sup>http://www.asd-cr.com/

Microorganisms are thought to play a rather opportunistic role in the development of the disease, which is primed by certain environmental factors. Remarkably, root system development is altered even before affected palms show symptoms in the shoot (Chinchilla, 2008). A morphological and histological study showed that, unlike healthy-looking palms, diseased palms from Ecuador and Brazil did not have roots with soft, white tips, the so-called fine root system. Only a few meristematic cells could be detected in the apical shoot and root meristems, indicating cell cycle arrest (Kastelein et al., 1990). This finding can be validated by using reporters of entry into the M-phase of the cell cycle. Chemical screening for compounds that re-activate the cell cycle in affected roots could result in a treatment for fatal yellowing.

*Elaeis oleifera* shows resistance to fatal yellowing. The trait is dominant in interspecific *E. oleifera × E. guineensis* F1 hybrids. In spite of their lower productivity and the need for manual pollination, F1 hybrid seeds are produced on a commercial scale for planting. Otherwise, we are not aware of any systematic efforts to introgress *E. oleifera* genetic resistance to fatal yellowing into the African oil palm genetic background. This is most probably because no causal link has been established between (a)biotic factors and disease development, which makes screening procedures unpredictable. On the contrary, ASD Costa Rica has released for sale the seeds of an unusual interspecific hybrid variety, AMAZON<sup>16</sup>, which is rather an introgression of *E. guineensis* productivity traits into the American oil palm genetic background.

## **Palm Oil Composition and Content**

Along with coconut oil, crude palm oil and particularly palm kernel oil are among the few highly saturated vegetable fats. On average, crude palm oil contains 44% palmitic acid (C16:0), 5% stearic acid (C18:0) and traces of myristic acid (C14:0), which together constitute half of the FAs found in triacylglycerols (TAG) synthesized by the *E. guineensis* fruit mesocarp (Sambanthamurthi et al., 2000). Unsaturated FAs in TAG are represented by 40% oleic acid (C18:1), 10% linoleic acid (C18:2) and traces of linolenic acid (C18:3). The food industry consumes eighty percent of palm oil, partly as a replacement for *trans*-FAs. The oleochemical industry manufactures soaps, detergents, lubricants, solvents, bioplastics and biodiesel from the remaining twenty percent.
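The balance between saturated and unsaturated FAs described above can be checked with simple arithmetic. The following Python sketch sums the quoted average percentages by saturation class. Components given only as "traces" are entered as zero, which is an assumption made here because the text gives no numbers for them; the dictionary and function names are illustrative.

```python
# Average crude palm oil composition quoted in the text, in % of total FAs.
# "Traces" are entered as 0.0 (assumption: the text gives no value for them).
CRUDE_PALM_OIL = {
    "C14:0": 0.0,   # myristic acid, traces
    "C16:0": 44.0,  # palmitic acid
    "C18:0": 5.0,   # stearic acid
    "C18:1": 40.0,  # oleic acid
    "C18:2": 10.0,  # linoleic acid
    "C18:3": 0.0,   # linolenic acid, traces
}


def double_bonds(fatty_acid: str) -> int:
    """Number of double bonds encoded in shorthand such as 'C18:1'."""
    return int(fatty_acid.split(":")[1])


def saturated_share(profile: dict[str, float]) -> float:
    """Summed percentage of FAs without double bonds."""
    return sum(pct for fa, pct in profile.items() if double_bonds(fa) == 0)


def unsaturated_share(profile: dict[str, float]) -> float:
    """Summed percentage of FAs with at least one double bond."""
    return sum(pct for fa, pct in profile.items() if double_bonds(fa) > 0)


print(f"Saturated:   {saturated_share(CRUDE_PALM_OIL):.0f}%")
print(f"Unsaturated: {unsaturated_share(CRUDE_PALM_OIL):.0f}%")
```

The quoted values add up to roughly equal saturated and unsaturated fractions, with the small remainder attributable to the trace components.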

Dietary FAs play significant roles in the cause and prevention of cardiovascular disease. *Trans*-FAs from partially hydrogenated vegetable oils have well-established adverse effects and should be eliminated from the human diet (Michas et al., 2014). Palm oil may be an unhealthy fat because of its high saturated FA content. A meta-analysis of 51 dietary intervention studies showed both favorable and unfavorable changes in coronary heart disease and cardiovascular disease risk markers when palm oil was substituted for the primary dietary fats, whereas only favorable changes occurred when palm oil was substituted for *trans*-FAs (Fattore et al., 2014).

A higher degree of FA unsaturation is therefore a desirable target for palm oil improvement. The iodine index, commonly used as a measure of unsaturation, varies from approximately 50 to 60% in *E. guineensis*, with the highest values measured for the La Mé variety (Montoya et al., 2014), but was found to be anywhere between 70 and 80% in *E. oleifera* (Chavez and Sterling, 1991). Amongst the *E. oleifera* accessions, the unsaturated FA content ranges from 47 to 69% for C18:1, 2 to 19% for C18:2, and 0.1 to 1.2% for C18:3. Interspecies *E. oleifera × E. guineensis* hybrids planted in Latin America have a mid-parent phenotype with an iodine index varying from 58 to 71% (Ong et al., 1981). Nineteen QTLs controlling FA composition were identified in interspecies pseudo-backcross populations (Montoya et al., 2014). Importantly for breeding purposes, the work of Montoya et al. (2014) indicates that FA composition is not linked to biomass yield traits. Mapping intra-gene SNPs in candidate genes related to oleic acid (C18:1) biosynthesis supported several QTLs associated with acyl-ACP thioesterase type A (FATA) and ∆9 stearoyl-ACP desaturase (SAD) (Montoya et al., 2013). Genome sequence analysis identified the oil palm gene repertoire playing a role in FA biosynthesis, TAG assembly, carbon fluxes and fruit ripening, as well as regulators of these processes (Singh et al., 2013b). In combination with the high-density genetic map (Ting et al., 2014) and further intra-gene SNP characterization, introgression of high oleic acid content from *E. oleifera* into varieties of *E. guineensis* is coming within reach.
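The link between FA composition and the iodine index can be illustrated with a stoichiometric estimate. The Python sketch below is not the method used in the cited studies. It assumes that each double bond takes up one molecule of I<sub>2</sub>, that each FA is present as a simple triacylglycerol of three identical chains, and that FA percentages can be read as mass percentages of those triacylglycerols. Atomic masses are standard values; the function names are illustrative, and the composition is the average crude palm oil profile (44% C16:0, 5% C18:0, 40% C18:1, 10% C18:2) with trace components omitted.

```python
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999, "I": 126.904}

# Average crude palm oil profile in % (trace components omitted).
PALM_OIL = {"C16:0": 44.0, "C18:0": 5.0, "C18:1": 40.0, "C18:2": 10.0}


def parse_fatty_acid(name: str) -> tuple[int, int]:
    """Split shorthand such as 'C18:1' into (carbon atoms, double bonds)."""
    carbons, bonds = name[1:].split(":")
    return int(carbons), int(bonds)


def tag_molar_mass(carbons: int, bonds: int) -> float:
    """Molar mass of a triacylglycerol carrying three identical acyl chains.

    The formula is C(3n+3) H(6n+2-6k) O6 for chain length n and k double
    bonds per chain (for example tristearin, n=18 and k=0, is C57H110O6).
    """
    n_c = 3 * carbons + 3
    n_h = 6 * carbons + 2 - 6 * bonds
    return n_c * ATOMIC_MASS["C"] + n_h * ATOMIC_MASS["H"] + 6 * ATOMIC_MASS["O"]


def iodine_value(profile: dict[str, float]) -> float:
    """Estimated grams of I2 taken up per 100 g of oil."""
    i2_mass = 2 * ATOMIC_MASS["I"]
    total = 0.0
    for name, percent in profile.items():
        carbons, bonds = parse_fatty_acid(name)
        # Three chains per triacylglycerol, one I2 per double bond.
        total += percent * 3 * bonds * i2_mass / tag_molar_mass(carbons, bonds)
    return total


print(f"Estimated iodine value of average crude palm oil: {iodine_value(PALM_OIL):.1f}")
```

With these inputs the estimate falls within the 50 to 60 range reported for *E. guineensis*, which shows how a shift from palmitic toward oleic or linoleic acid translates into a higher iodine index.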

An alternative approach to increasing oleic acid content at the expense of palmitic acid relies on genetic engineering of key enzymes for palmitic acid synthesis, β-ketoacyl-ACP synthase II (KAS II) or palmitoyl-ACP thioesterase (Sambanthamurthi et al., 2009). KAS II activity was shown to be positively correlated with unsaturated FA content across palms from the PORIM germplasm in Malaysia (Sambanthamurthi et al., 2009). Reducing *Arabidopsis* KAS II levels was found to be sufficient to convert its oilseed composition to one resembling a palm-like tropical oil (Pidkowich et al., 2007). Mesocarp-specific over-expression of the KAS II gene and antisense RNA suppression of palmitoyl-ACP thioesterase have been undertaken and currently await evaluation (Sambanthamurthi et al., 2009).

Freshly pressed unrefined palm oil is also known as red palm oil due to its deep orange-red color. Large amounts of carotenoids, predominantly α- and β-carotene, in the range of 180–2500 µg g<sup>-1</sup> mesocarp dry weight are measured in African oil palm populations (Rajanaidu et al., 2000; Tranbarger et al., 2011). Even higher contents, up to 4000 µg g<sup>-1</sup> mesocarp, characterize American oil palm accessions, whereas F1 interspecies hybrids have mid-parent values of carotenoid content (Sambanthamurthi et al., 2000). In terms of retinol equivalents (RE), standard batches of red palm oil contain seventeen times more β-carotene than carrots. A few grams, i.e., 1.5–6.5 tablespoons, of red palm oil provide approximately 600 RE of β-carotene<sup>17</sup>, which is sufficient to meet daily vitamin A requirements in humans and to prevent childhood blindness from vitamin A deficiency (Burri, 2012). Indeed, intervention studies worldwide have demonstrated the usefulness of palm oil supplementation for improving vitamin A status, a deficiency of which is commonly experienced by poor communities in Asia, Africa and South America (Rice and Burns, 2010). Red palm oil is also rich in tocotrienols, an unsaturated form of natural vitamin E. Tocotrienols have health benefits in disease prevention owing to their antioxidative, antihypercholesterolemic, and antiangiogenic effects (Wong and Radhakrishnan, 2012). *E. guineensis* populations with high vitamin E content are available as well (Kushairi et al., 2004).

<sup>16</sup>http://www.asd-cr.com/

<sup>17</sup>https://hungermath.wordpress.com/2013/01/02/red-palm-oil-as-a-sourceof-beta-carotene/

Oil content, oil composition and oil-accumulating cell types are major traits of economic interest (Durrett et al., 2008). To decipher the molecular mechanisms that underpin those traits, oil palm fruit development represents an excellent experimental model for "omics" trait dissection. Correlation analysis of transcriptome and metabolome data has been performed on mesocarp samples harvested at multiple time-points during fruit development (Bourgis et al., 2011; Tranbarger et al., 2011). For comparative purposes and to advance understanding of carbon partitioning between storage carbohydrates and TAG, similar data sets were generated in date palm (*Phoenix dactylifera*), a closely related palm species that accumulates almost exclusively sugars rather than oil in the fruit mesocarp (Bourgis et al., 2011). Both studies drew similar conclusions (Bourgis et al., 2011; Tranbarger et al., 2011). The transcript abundance of the FA biosynthetic machinery was remarkably coordinated with oil deposition in mesocarp tissues during fruit maturation. In contrast, enzymes of the TAG assembly pathway showed little or no up-regulation during fruit maturation, indicating that TAG assembly is not rate limiting for oil accumulation. Comparative co-expression analysis with transcriptomes of *Arabidopsis*, corn and date palm further implicated the oil palm APETALA2 (AP2)/ETHYLENE RESPONSE FACTOR family transcription factor EgWRI1-1 in the regulation of FA accumulation. EgWRI1-1 is homologous to WRINKLED1, a transcriptional regulator of glycolysis and FA synthesis in *Arabidopsis* embryos (Cernac and Benning, 2004). Interspecies genetic complementation indicated that the palm and *Arabidopsis* genes could be functional orthologs (Ma et al., 2013).

Oil content and composition differ between the oil palm fruit mesocarp, endosperm and embryo. At 5 months after pollination, the dry mass of the endosperm contained 50% oil, in which lauric acid (C12:0) was the predominant FA. The major FAs of mesocarp oil were palmitic (C16:0) and oleic (C18:1) acids. The oil palm embryo also stored up to 27% oil, which contained 25% linoleic acid (C18:2) (Dussert et al., 2013). To understand the mechanisms behind such differences in oil content and FA composition, transcriptome and lipid profiles were compared during the development of oil palm fruit. Accumulation of lauric acid in the endosperm relied on up-regulation of an acyl-acyl carrier protein thioesterase and of isoforms of TAG assembly enzymes (Dussert et al., 2013). Three paralogs of WRINKLED1 were proposed as candidate regulators determining the different lipid profiles. In agreement with Tranbarger et al. (2011), *EgWRI1-1* was found to operate in the mesocarp. *EgWRI1-2* and *EgWRI1-3* were predominantly expressed in the endosperm. Interestingly, the embryo did not express any of the *EgWRI1* paralogs (Dussert et al., 2013).

To provide new breeding material, targeted approaches such as Ecotilling (Till et al., 2006) can be applied to screen oil palm germplasm collections for loss- or gain-of-function alleles in the identified subsets of lipid biosynthetic and regulatory genes. There are very few reports on induced mutagenesis in oil palm (Corley and Tinker, 2003). We are not aware of any chemically mutagenized oil palm populations that could enable standard TILLING (Till et al., 2006). Given the 4-year-long seed-to-seed cycle, the practicality of such an oil palm population, comprising a few thousand trees that will live for the next 100 years, appears doubtful. This view may change, pending progress with the production of doubled haploids from oil palm microspores (Indrianto et al., 2014). Mutagenizing microspores may result in a "tilling" population of trees homozygous at all genetic loci, the analysis of which would add supporting evidence for the "omics" data.

## **Oil Palm Tree Architecture**

Harvesting oil palm is a difficult task that is expensive in manual labor, compared with the ease of combine-harvesting arable crops (Corley and Tinker, 2003). Radical changes to oil palm tree architecture are needed to enable the development of harvesting machines. The traits of major utility for harvesting mechanization were found to be palm height and bunch stalk length and thickness (Le Guen et al., 1990). A number of problems arise as palms age on a plantation. Fruit harvest is complicated when oil palm trees are taller than two to three meters. Cutting the bunch stalk becomes physically challenging. Falling bunches bruise the fruits, many of which detach, demanding additional labor to collect loose fruits from the ground. Bruising activates TAG hydrolysis, which lowers oil quality. It is also more difficult to assess bunch ripeness. Though oil palm is a long-lived species, replanting is thought to be required for plantations 20–25 years of age (Corley and Tinker, 2003).

Depending on the environment and genetic makeup, African oil palm accessions have a height increment of 45–75 cm a year. The annual height increment of *E. oleifera* may be only 5–10 cm. Interspecific F1 hybrids between *E. guineensis* and *E. oleifera* have a mid-parent growth phenotype of 15–25 cm annual height increment (Corley and Tinker, 2003), indicating that *E. oleifera* is a useful source for gene introgression.
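The practical meaning of these growth rates can be illustrated by asking how long a palm takes to add a given amount of trunk height. The Python sketch below uses the annual increments quoted above and a 3 m gain as an example, taken from the upper end of the two to three meter height at which harvesting is said to become complicated. It assumes a constant annual increment over the whole period, which is a simplification introduced here; the names are illustrative.

```python
# Annual height increment ranges quoted in the text, in cm per year.
ANNUAL_INCREMENT_CM = {
    "E. guineensis": (45, 75),
    "E. oleifera": (5, 10),
    "E. oleifera x E. guineensis F1": (15, 25),
}


def years_to_gain(height_cm: float, increment_range: tuple[float, float]) -> tuple[float, float]:
    """Return (fastest, slowest) number of years needed to gain height_cm.

    Assumes the annual increment stays constant (simplification).
    """
    slow, fast = increment_range
    return height_cm / fast, height_cm / slow


for palm, increments in ANNUAL_INCREMENT_CM.items():
    shortest, longest = years_to_gain(300, increments)
    print(f"{palm}: {shortest:.1f} to {longest:.1f} years to gain 3 m")
```

Under this simplification the hybrids take roughly three times as long as African oil palm to gain the same trunk height, which is the property that makes the slow-growing parent attractive for breeding.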

Intraspecific variation enabled Malaysian breeding programs to develop the PORIM series of dwarfish palms with a yield potential of 7 t and an annual height increment of 40 cm (Rajanaidu, 1994; Rajanaidu et al., 2000). Hybrids from crosses of Bamenda and Ekona *E. guineensis* accessions developed by ASD Costa Rica<sup>18</sup> are slower growing (45–50 cm/year) and are also known for their overall high cold and drought tolerance.

To find QTLs that control tree growth rates, two breeding populations of oil palm were used for linkage mapping (Lee et al., 2015). For selected genotypes of the *dura* and *pisifera* parents, the heights of the 6-year-old *tenera* palms in the F1 populations ranged from 71.0 to 180 cm, with an average of 137.6 cm.

<sup>18</sup>http://www.asd-cr.com/

The QTL positioned on linkage group 5 explained 51.0% of the phenotypic variation, suggesting that it plays a major role in the height variation of the selected palm genotypes. The oil palm genome sequence located the QTL more precisely within a 65.6 kb region that includes eight genes, of which the gene encoding an asparagine synthase-related protein is thought to be responsible for the tree height variation amongst the analyzed *teneras* (Lee et al., 2015). Along with glutamine synthases, asparagine synthases have an important role in nitrogen assimilation and allocation within the plant. Interestingly, ectopically expressed pine glutamine synthase accelerates the growth of poplar trees (Man et al., 2011).

In an *E. oleifera × E. guineensis* F1 hybrid population, a wild palm was discovered that, in addition to a short trunk, had relatively short leaves due to a spontaneous heritable change in leaf length. The derived breeding program resulted in commercial COMPACT<sup>19</sup> varieties sold as clones. Compared to the 7–8 meter long leaves of standard *tenera* hybrids, COMPACT palm leaves are 6.5 meters long, which allows very high density planting of 180–200 trees per hectare. COMPACT palms have a height increment of < 40 cm/year. Micropropagated clones are rather expensive. Fortunately, the leaf length trait appears to be semi-dominant. Hybrids between COMPACT palms and standard *E. guineensis* lines, such as Deli, Ghana and Nigeria, have 6.6–6.9 meter long leaves and can be planted at a density of 170 trees per hectare, which is still higher than the industry standard of 138–143 palms/hectare (Corley and Tinker, 2003). Compared to clones, seeds of such hybrids are more affordable for smallholder farmers, who may see such genetic material as an opportunity to increase production and make better use of scarce land resources (Alvarado et al., 2010).
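The relation between planting density and the distance between palms can be sketched with a little geometry. The Python example below assumes that palms are planted on an equilateral triangular grid, a layout that is not stated in the text, so that each palm occupies an area of d² × √3/2 for a spacing d. The function names are illustrative.

```python
from math import sqrt

SQUARE_METERS_PER_HECTARE = 10_000
TRIANGLE_FACTOR = sqrt(3) / 2  # area per palm = spacing**2 * sqrt(3) / 2


def palms_per_hectare(spacing_m: float) -> float:
    """Planting density for an equilateral triangular grid (assumed layout)."""
    return SQUARE_METERS_PER_HECTARE / (spacing_m ** 2 * TRIANGLE_FACTOR)


def spacing_for_density(palms_per_ha: float) -> float:
    """Distance between neighboring palms that gives the requested density."""
    return sqrt(SQUARE_METERS_PER_HECTARE / (palms_per_ha * TRIANGLE_FACTOR))


# Densities quoted in the text: industry standard, COMPACT hybrids, COMPACT clones.
for density in (138, 143, 170, 180, 200):
    print(f"{density} palms/ha -> about {spacing_for_density(density):.2f} m between palms")
```

Under the triangular-grid assumption, moving from the industry standard to the highest COMPACT density shortens the distance between neighboring palms by well over a meter, which is consistent with shorter leaves being the trait that permits denser planting.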

The increases in wheat and rice yields during the "Green Revolution" were enabled by the introgression of dwarfing traits into the plants (Hedden, 2003). The "Green Revolution" genes revealed the central role of gibberellin (GA) in the control of plant stature. Wheat *Reduced height (Rht)* genes interfere with the GA signal transduction pathway. The rice *semidwarf1* (*sd1*) gene impairs GA biosynthesis. The *Arabidopsis GA5* gene is the ortholog of the rice "Green Revolution" gene *SD1* (Barboza et al., 2013). Importantly, semidwarf individuals found in natural *Arabidopsis thaliana* populations carried 21 different independent loss-of-function mutations at *GA5*. Semidwarfness had no obvious general tradeoff affecting *Arabidopsis* plant performance traits (Barboza et al., 2013). Semidwarfism transgenes modifying GA, promoting root growth and enhancing morphological diversity have been tested in hybrid poplar trees (Elias et al., 2012). Analysis of mutants in cereals further implicated brassinosteroids in the control of plant architecture (Dockter and Hansson, 2015). These findings have direct implications for the gene-centered analysis of oil palm natural variation and for tree architecture engineering.

## **Preventing (Post)Harvest Losses**

The chemical properties of oils used in commerce are extremely important. The hydrolysis of TAG and the release of free FAs have a strong impact on the quality of commodity oil, because oil with a free FA content above five percent is thought to be unfit for human consumption (Ebongue et al., 2008). Oil palm mesocarp contains a highly active lipase that can bring the free FA content of crushed tissue to 30% within five minutes. The biological function of lipase in palm fruit mesocarp is uncertain. Importantly, TAG hydrolysis does not occur in undamaged fruits. It is critical to reduce bruising of the fruits before they reach the oil mill, where the first processing step is high-pressure steam sterilization to inactivate both palm and microbial lipases (Corley and Tinker, 2003).

To gain flexibility in post-harvest fruit processing and to allow extended ripening for increased yields, elite oil palm lines with low lipase (LL) activity were selected. Oil pressed from LL fruits had substantially less free FAs than oil from standard genotypes (Ebongue et al., 2008). The *E. guineensis LIPASE1* (*EgLIP1*) gene associated with the LL trait has been identified, allowing marker-based introgression of the LL trait into any elite oil palm genotype (Morcillo et al., 2013). Approximately 30% of oil palm cultivation worldwide, and up to 80% in Africa, is on smallholder farms. To extract the best quality oil, farmers have only a limited period of time to deliver their produce to the oil extraction mills. Commercialization of LL trees is estimated to generate an economic gain of almost one billion US dollars per year (Morcillo et al., 2013).

*Elaeis oleifera* naturally has very low lipase activity in the fruit mesocarp, whilst interspecific *E. oleifera × E. guineensis* hybrids are promising crosses with reduced lipase activity (Cadena et al., 2013). Overall, it appears that several *E. oleifera* traits that could have been interpreted as loss-of-function mutations, e.g., *virescens* fruits, LL, and the lack of methyl chavicol biosynthesis, behave at least as co-dominant or dominant alleles in interspecies hybrids. Molecular identification of the genes controlling such traits (Morcillo et al., 2013; Singh et al., 2014) will help to reach a mechanistic understanding of interspecies genome interactions.

Shedding of ripe fruits from the bunches before they reach the factories is an important source of harvest losses (Osborne et al., 1992). The fruit abscission zone develops at the base of the subtending fruit. It is a two-stage process involving primary and adjacent abscission zones. Abscission zones contain very low amounts of methylated pectin and high levels of polygalacturonase (PG) activity, which is involved in the depolymerisation of the cell wall pectin homogalacturonan (Henderson et al., 2001). The oil palm PG family comprises at least 14 genes, of which the ethylene-inducible *EgPG4* is the most highly expressed in the fruit base (Roongsattham et al., 2012). Altogether, we may anticipate the breeding of palms with delayed, if not abolished, fruit shedding. The question then becomes how to assess fruit bunch ripeness for such palms in practice.

The majority of commercial *teneras* have *nigrescens*, i.e., anthocyanin-colored, fruit exocarp. To determine that bunches on *nigrescens* palms are ripe, harvesters rely on the presence of detached fruits on the ground (Corley and Tinker, 2003; Singh et al., 2014). It is thought that scoring the profound change in color from green to bright orange upon ripening in *virescens* fruits could be an alternative way to assess bunch ripeness in the field (Sambanthamurthi et al., 2009). The *E. guineensis VIRESCENS* (*VIR*) gene encodes an R2R3-MYB transcription factor (Singh et al., 2014). The dominant-negative *virescens* phenotype is explained by the expression of VIR protein isoforms truncated at the carboxyl terminus, which is likely to function as a transcription activation domain. Interestingly, *E. oleifera* naturally has the *virescens* fruit phenotype. Singh et al. (2014) could not identify a *VIR* homolog in the current *E. oleifera* draft genome assembly, indicating that the *virescens* phenotype of *E. oleifera* fruit could be explained by a natural deletion mutation.

<sup>19</sup>http://www.asd-cr.com/

## **Breeding for Expanded Cultivation Range**

The standard cultivation range of commercial oil palm varieties lies within 20° of the equator (Corley and Tinker, 2003). The cultivation range, as well as the breeding challenges, is expected to evolve due to climate change (Paterson et al., 2013). Expansion of the cultivation range to sub-tropical regions (Lei et al., 2014) could lower the negative impact of the crop on tropical biodiversity.

Cold tolerance was observed in natural oil palm groves situated at 1000–2000 meters above sea level in the Bamenda Highlands of Cameroon and in Kigoma District, Tanzania (Blaak and Stirling, 1996). A breeding program by ASD de Costa Rica<sup>20</sup> and FAO<sup>21</sup>, using Bamenda and Kigoma germplasms, led to new cold-tolerant commercial varieties that in addition showed precocity when planted at sea level, producing fruit 2 years after planting (Chapman et al., 2003). With the goal of gaining a molecular understanding of the cold stress response and of expanding African oil palm cultivation to sub-tropical regions, including Hainan province in southern China, the cold stress response in oil palm was analyzed by deep RNA sequencing (Lei et al., 2014). This work revealed 51,452 expressed sequences from *E. guineensis*. Transcriptome data analysis resulted in the discovery of 5791 gene-based simple sequence repeat (SSR) markers, of which 916 distinguished genes differentially expressed in response to cold stress (Xiao et al., 2014).

Breeding African oil palm for stress tolerance, especially drought tolerance, has been found to be challenging (Corley and Tinker, 2003). The alternative to lower the environmental footprint of the crop, whilst meeting the growing demand for vegetable oils, is to domesticate other oleaginous palm species that have different profile of ecophysiological adaptations.

To address the problem of fatal yellowing disease, which restricts oil palm cultivation in Latin America, breeders at ASD Costa Rica introduced a new hybrid variety, AMAZON. The mother trees are *E. oleifera*, originating from wild palms indigenous to the Manaus region (Amazonas state, Brazil). The *pisifera* parents were selected from the progeny of an *E. oleifera × E. guineensis* interspecies hybrid backcrossed to *E. guineensis*. In the AMAZON hybrid, the *E. guineensis* contribution is therefore smaller than a haploid genome. This work is a pioneering step toward the domestication of *E. oleifera* through gene introgression for higher yield.

*Acrocomia aculeata*, known as macaw palm, or macaúba in Portuguese, is particularly interesting for the development of a new oleaginous crop because of its high productivity potential, simpler harvest, and ability to grow in arid sub-tropical areas (Pires et al., 2013). Whilst collecting *E. oleifera* accessions in Latin America, Malaysian breeders added several other oleaginous palms, such as *Oenocarpus* spp. and *Bactris gasipaes*, to their *ex situ* germplasm collection (Corley and Tinker, 2003). It is beyond doubt that the recent advances in understanding the oil palm genome and physiology, and in identifying key genes controlling productivity and TAG biosynthesis, will accelerate domestication breeding programs.

## **Concluding Remarks**

In this review we focused on oil palm genetic diversity and on how modern genomics tools could contribute both to the basic understanding of the physiology, metabolism and development of this remarkable crop, and to the accelerated breeding of high-yielding varieties with a tailored oil composition.

Many of the discussed traits could be engineered using genetic modification of the crop (Murphy, 2014). Of particular interest is the successful regeneration of plants from protoplast cultures, which were shown to be a superior starting material for PEG-mediated DNA transfection and microinjection (Masani et al., 2014). This system could be an excellent recipient for implementing genome editing technology (Joung and Sander, 2013; Shan et al., 2013). For instance, instead of time-consuming gene introgression, the *EgLIP1* gene could be knocked out in a high oil yield *tenera* elite palm, or in its *dura* and *pisifera* parents. High oleic acid content and the ability to synthesize polyunsaturated fatty acids (PUFA) could be similarly engineered using simpler *Agrobacterium*-mediated genetic modification (Murphy, 2014). Those traits, in combination with high pro-vitamin A and vitamin E content, could result in new palm varieties for the extraction of virgin red palm oil of unprecedented nutritional quality, serving millions, if not billions, of people in impoverished countries. As of today, palm oil is the only non-GMO oil on the global market, which some activists believe increases its value.

Plant breeders will continue their efforts to increase primary crop productivity; however, the immediate challenges in closing the yield gap lie in providing smallholder farmers with access to the best planting material, together with balanced application of fertilizers and overall correct agronomical practices. The negative impact of the crop on biodiversity is undeniable; high-yielding varieties are available to spare the land. It is the socio-economic drivers, governmental decisions and environmental activists' views that make oil palms the "Palms of Controversies" (Rival and Levang, 2014).

## **Acknowledgments**

We thank the reviewers for constructive criticism and suggestions that allowed us to improve the quality of the text. We apologize to the colleagues whose valuable contributions to oil palm breeding and improvement were not cited due to space limitations.

<sup>20</sup>http://www.asd-cr.com

<sup>21</sup>http://www.fao.org

## **References**


abscission in the monocot species oil palm. *BMC Plant Biol.* 12:150. doi: 10.1186/1471-2229-12-150


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Barcelos, Rios, Cunha, Lopes, Motoike, Babiychuk, Skirycz and Kushnir. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Genetic diversity in tef [*Eragrostis tef* (Zucc.) Trotter]

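*Kebebew Assefa<sup>1</sup>, Gina Cannarozzi<sup>2</sup>, Dejene Girma<sup>2,3</sup>, Rizqah Kamies<sup>4</sup>, Solomon Chanyalew<sup>1</sup>, Sonia Plaza-Wüthrich<sup>2</sup>, Regula Blösch<sup>2</sup>, Abiel Rindisbacher<sup>2</sup>, Suhail Rafudeen<sup>4</sup> and Zerihun Tadele<sup>2</sup>\**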

*<sup>1</sup> National Tef Research Program, Debre Zeit Agricultural Research Center, Ethiopian Institute of Agricultural Research, Debre Zeit, Ethiopia, <sup>2</sup> Crop Breeding and Genomics, Institute of Plant Sciences, Department of Biology, University of Bern, Bern, Switzerland, <sup>3</sup> National Agricultural Biotechnology Laboratory, Holetta Agricultural Research Center, Ethiopian Institute of Agricultural Research, Holetta, Ethiopia, <sup>4</sup> Plant Stress Laboratory, Department of Molecular and Cell Biology, University of Cape Town, Cape Town, South Africa*

#### *Edited by:*

*Joanna Marie-France Cross, Inonu University, Turkey*

#### *Reviewed by:*

*Sergio Lanteri, University of Turin, Italy Rodomiro Ortiz, Swedish University of Agricultural Sciences, Sweden*

#### *\*Correspondence:*

*Zerihun Tadele, Crop Breeding and Genomics, Institute of Plant Sciences, Department of Biology, University of Bern, Altenbergrain 21, 3013 Bern, Switzerland zerihun.tadele@ips.unibe.ch*

#### *Specialty section:*

*This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science*

> *Received: 29 December 2014 Accepted: 05 March 2015 Published: 26 March 2015*

#### *Citation:*

*Assefa K, Cannarozzi G, Girma D, Kamies R, Chanyalew S, Plaza-Wüthrich S, Blösch R, Rindisbacher A, Rafudeen S and Tadele Z (2015) Genetic diversity in tef [Eragrostis tef (Zucc.) Trotter]. Front. Plant Sci. 6:177. doi: 10.3389/fpls.2015.00177*

Tef [*Eragrostis tef* (Zucc.) Trotter] is a cereal crop resilient to adverse climatic and soil conditions, and possessing desirable storage properties. Although tef provides high quality food and grows under marginal conditions unsuitable for other cereals, it is considered to be an orphan crop because it has benefited little from genetic improvement. Hence, unlike other cereals such as maize and wheat, the productivity of tef is extremely low. In spite of the low productivity, tef is widely cultivated by over six million small-scale farmers in Ethiopia, where it is annually grown on more than three million hectares of land, accounting for over 30% of the total cereal acreage. Tef, a tetraploid with 40 chromosomes (2n = 4x = 40), belongs to the family Poaceae and, together with finger millet (*Eleusine coracana* Gaertn.), to the subfamily Chloridoideae. It originated and was domesticated in Ethiopia. There are about 350 *Eragrostis* species, of which *E. tef* is the only species cultivated for human consumption. At present, the gene bank in Ethiopia holds over five thousand tef accessions collected from geographical regions diverse in terms of climate and elevation. These germplasm accessions appear to have huge variability with regard to key agronomic and nutritional traits. In order to properly utilize this variability in developing new tef cultivars, various techniques have been implemented to catalog the extent and unravel the patterns of genetic diversity. In this review, we present some recent initiatives investigating the diversity of tef using genomics, transcriptomics and proteomics, and discuss the prospects of these efforts for providing molecular resources that can aid modern tef breeding.

Keywords: *Eragrostis tef*, diversity, genomics, proteomics, tef, transcriptomics, variability

## Introduction

Tef [*Eragrostis tef* (Zucc.) Trotter] is the major food crop in Ethiopia, where it is annually cultivated on more than three million hectares of land (CSA, 2014). Compared to other cereals, tef is more tolerant to extreme environmental conditions, especially water-logging. It is unique in its ability to grow and yield on poorly drained Vertisols, which most cereals cannot tolerate. Unlike other cereals, the seeds of tef can be easily stored under local storage conditions without losing viability, since the grains are resistant to attack by storage pests (Ketema, 1997). Tef grain is also a rich source of protein and nutrients and has additional health benefits, including the fact that the seeds are free from gluten (Spaenij-Dekking et al., 2005). According to a recent study, the bio-available iron content was significantly higher in tef bread than in wheat bread (Alaunyte et al., 2012). In general, tef provides quality food and grows under marginal conditions, many of which are poorly suited to other cereals. However, tef is considered to be an orphan crop since it is only of regional importance and has until recently not been the focus of crop improvement (Naylor et al., 2004; Assefa, 2014).

Despite its versatility in adapting to extreme environmental conditions, the productivity of tef is very low with the national average standing at 1.5 t/ha (CSA, 2014). Tef's major yield limiting factors are lack of cultivars tolerant to lodging, drought, and pests (Assefa et al., 2011). Lodging is the permanent displacement of the stem from the upright position. Tef possesses tall, weak stems that easily succumb to lodging caused by wind or rain. In addition, lodging hinders the use of high input husbandry since the application of increased amounts of nitrogen fertilizer to boost the yield results in severe lodging. When this occurs, both the yield and the quality of the grain and the straw are severely reduced and both manual and mechanical harvesting are impeded. Various attempts have been made by the research community to develop lodging-resistant tef cultivars (Assefa et al., 2011; Tadele and Assefa, 2012) but presently no cultivar with reasonable lodging resistance has been obtained.

The analysis of genetic relationships amongst tef varieties is an important component of improvement programs because it provides information about the genetic diversity of the crop and sets a platform for stratified sampling of breeding populations. Tef represents a unique biodiversity component in the agriculture and food security of millions of farmers in Ethiopia. The conservation, characterization, and utilization of the existing tef genetic diversity are becoming increasingly important in view of the evolving needs and manifold challenges of small-scale farmers in Ethiopia. This is primarily because tef has remarkable genetic traits useful for most Ethiopian farmers to utilize for coping with erratic climatic conditions, generation of household income, and fulfilling concerns of nutritional needs. Moreover, the conservation and utilization of the tef genetic resources offer a reliable basis for enhancing food security and developing crop diversification in the moisture stress and challenging agro-ecological areas of the country.

Here, we present an overview of the results of the major studies made on tef diversity and recent initiatives underway to better understand the diversity at molecular level and utilize these diversities in improving the crop using modern genetic and genomic tools.

## Taxonomy and Accessions of Tef

Tef belongs to the Poaceae or Grass family as do all economically important cereals. It is closely related to finger millet (*Eleusine coracana* Gaertn.) as both are in the subfamily Chloridoideae. The genus *Eragrostis* comprises about 350 species, of which only tef is cultivated for human consumption. Unlike wheat, barley and rice, which are all C3 plants, tef (along with maize and sorghum) is a C4 plant which efficiently utilizes carbon dioxide during photosynthesis. This can be seen by tef's Kranz-type leaf anatomy with vascular centers surrounded by bundle sheath cells containing a high number of chloroplasts and by the low CO2 compensation point of the leaves, also typical of C4 as opposed to C3 species (Kebede et al., 1989).

Tef is an allotetraploid (2n = 4x = 40). Over the past few decades the ancestry of tef has been investigated using morphological and cytogenetic methods (Jones et al., 1978), biochemical methods (Bekele and Lester, 1981), and phylogenetic analysis using ribosomal DNA and transcription factor genes (Espelund et al., 2000) or nuclear and plastid genes (Ingram and Doyle, 2003). It has been suggested that *Eragrostis pilosa* is closely related to tef while *E. heteromera* and *E. cilianensis* are more distantly related (Ingram and Doyle, 2003). Similar conclusions were reached using biochemical methods (Bekele and Lester, 1981). The close relationship between tef and *E. pilosa* is also evidenced by the successful hybridization of these two species (Tefera et al., 2003a). This hybridization generated viable offspring and ultimately resulted in the release in 2009 of a variety called *Simada* (DZ-Cr-285 RIL295) from the inter-specific hybrid of tef [DZ-01-2785 × *E. pilosa* (line 30-5); MoA, 2013]. However, since *E. pilosa*, like tef, is a tetraploid, the diploid ancestors of tef remain unknown.

Ethiopia is the origin and center of diversity for tef (Vavilov, 1951), harboring landraces with a wide array of phenotypic diversity, and also wild progenitors and related wild species. Charring experiments suggest that the domestication history of tef might be different from that of barley and wheat since in some cases tef might not survive the high temperatures tolerated by other cereals (D'Andrea, 2008).

As in any crop improvement program, tef breeding also relies mainly upon the germplasm resources existing in the genetic stock. Diverse types of accessions are available in the country, and collection, evaluation, and utilization of tef germplasm by national and international groups began in Ethiopia in the late 1950s. However, organized collection at the national level was made after the establishment of the Plant Genetic Resources Center of Ethiopia (PGRC/E) in 1976. After several changes in its name and mandate, the institute responsible for germplasm collection and maintenance as well as distribution is currently called the Ethiopian Institute of Biodiversity (EIB). The institute's holdings grew from only 1067 tef accessions, as reported by Demissie (1991), to 5169 accessions in 2011 (Tesema, 2013). This more than fourfold increase in the collection size in just two decades shows both the presence of a wide diversity of germplasm in the country and the commitment of institutes and individuals to collect and preserve this germplasm for future use.

Characterization of the accessions according to their properties such as morphology is important in order to provide information to interested researchers or other sectors of society. The first and most comprehensive detailed morphological descriptions for 35 tef cultivars were given based on phenology, plant vigor, shoot and root related traits, panicle form, spikelet size, growth habit, and lemma and caryopsis color (Ebba, 1975; **Table 1**).

## Phenotypic Diversity in Tef

Tef is highly diverse and variable in terms of morphological and agronomic characters. The distribution of the crop in different agro-ecological zones coupled with the selection by farmers on the basis of their preferred traits has resulted in a number of varieties with unique characters. Genetic diversity analysis of tef accessions facilitates the development of improved varieties with high productivity and yield stability. In view of this fact, efforts have been made to assess and quantify the extent and pattern of genetic diversity in the tef germplasm collections using different approaches (**Table 2**).

## Diversity in Natural Populations

The first studies on phenotypic diversity in tef germplasm used 124 single panicles collected from the major tef producing areas in Ethiopia as a source of seed. The germplasm accessions showed significant variability for plant height, panicle length, maturity, seed color, seed yield, lodging, and panicle form (Mengesha et al., 1965). As shown in **Figure 1**, at least four distinct panicle forms are present in tef accessions, namely very compact, semi-compact, fairly loose, and very loose.

TABLE 1 | Selected properties of 35 tef ecotypes (cultivars) characterized (Ebba, 1975).


*Panicle form: V-loose, very loose; F-loose, fairly loose; S-comp, semi-compact; V-comp, very compact. Seed color: Br, brown; mBr, medium brown; lBr, light brown; poW, purple orange white; yWh, yellow white.*



*RIL: recombinant inbred lines.*

Later, studies involving 2255 tef lines collected from different parts of the country showed high variation for flag leaf area, single plant grain yield, and straw yield (Ketema, 1993). The analyses of 9885 accessions collected from 14 former provinces of Ethiopia showed simple coefficient of variation (SCV) estimates ranging from 32% for primary panicle branches to 217% for grain yield/plant (Bekele, 1996). While using SCV, the extent of variation among traits is not affected by the magnitudes of values and units of measurement. Since SCV does not efficiently measure diversity among traits, phenotypic (PCV) and genotypic (GCV) coefficients of variation, which are based on partitioning of the total variance into components of genetic and non-genetic factors, are now more extensively used. Accordingly, various breeders have applied these two indices in evaluating the tef germplasm (Tefera et al., 1990; Hundera et al., 1999; Assefa et al., 2000; Chanyalew et al., 2009). Most of these studies revealed significant to highly significant differences among the genotypes for most of the traits examined, and this variability would serve as a basis for the improvement of the crop. Because the magnitude of genetic variation is better assessed from GCV, breeders usually focus on traits with high GCV estimates. High GCV values were reported for tiller number, panicle weight, grain yield per panicle, plant biomass, and grain yield (Assefa et al., 1999, 2001b; Balcha et al., 2003; Tefera et al., 2008; Chanyalew et al., 2009). This wide genetic variation indicates much potential for improving the crop through direct selection and/or hybridization.
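
The three indices discussed above can be illustrated with a minimal sketch. It assumes the usual textbook definitions: SCV as the sample standard deviation divided by the mean, and PCV and GCV as the square roots of the phenotypic and genotypic variances divided by the grand mean, with phenotypic variance taken as the sum of genotypic and environmental variance. The function names and all numeric values are hypothetical and are not taken from the cited studies.

```python
import math
import statistics


def simple_cv(values):
    """Simple coefficient of variation (%): sample SD divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def pcv_gcv(genotypic_var, environmental_var, grand_mean):
    """Return (PCV, GCV) in percent from variance components.

    Assumption: phenotypic variance = genotypic + environmental variance.
    """
    phenotypic_var = genotypic_var + environmental_var
    pcv = 100.0 * math.sqrt(phenotypic_var) / grand_mean
    gcv = 100.0 * math.sqrt(genotypic_var) / grand_mean
    return pcv, gcv


# Hypothetical plant heights (cm) and variance components, for illustration only.
heights = [80.0, 100.0, 120.0]
print(simple_cv(heights))

pcv, gcv = pcv_gcv(genotypic_var=64.0, environmental_var=36.0, grand_mean=100.0)
print(pcv, gcv)
```

Because GCV uses only the genotypic component, it can never exceed PCV under this partition; a small gap between the two suggests that most of the observed variation is heritable.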

Characters with huge variability include: days to panicle emergence (25–81), days to maturity (50–140), number of grains/plant (9,000–90,000), plant height (20–156 cm), number of tillers/plant (5–35), panicle type (from very loose to very compact), flag leaf area (2–26 cm²), and culm diameter (1.2–5 mm; Ketema, 1993; Assefa et al., 2001a,b). Variability in tef germplasm for culm internode diameter is a key factor in the identification of tef lines with improved lodging resistance.

Soon after the discovery of breeding techniques for tef (Berhe, 1975), several studies were made to investigate the inheritance of key agronomic traits and their contributions to tef breeding. The initial studies dealt with investigations of the inheritance of lemma color, seed color, and panicle form in F2 and F3 populations of crosses involving genotypes with contrasting phenotypes (Berhe et al., 1989a,b,c). Subsequent studies were made by several other researchers (Tefera et al., 2003a,b, 2008; Chanyalew et al., 2006; Yu et al., 2006; Zeid et al., 2011).

#### Diversity due to Agro-Ecology

Significant clinal diversity was reported in tef germplasm populations collected from different altitudinal zones for traits such as days to maturity, number of culm nodes, first and second basal culm internode diameter, and harvest index (Assefa et al., 2001b). Likewise, significant altitude-based diversity in tef germplasm populations was found for traits such as main shoot culm node number, days to maturity, diameters of the first and second lowest primary shoot culm internodes, and harvest index (Assefa et al., 2002a). However, no significant differences for qualitative traits (such as lemma, seed and anther colors and panicle form) were reported among the altitudinal zones (Kefyalew et al., 2000). On the other hand, for the trait days to maturity, 36 heterogeneous tef populations had lower diversity levels for accessions collected between altitudes of 1800 and 2400 m, while high diversity was noted for accessions obtained below 1800 m above sea level (Assefa et al., 2000).

Evaluations of 70 accessions of tef collected from different regions of Ethiopia showed significant variations within populations, among populations within regions, and among regions in most of the phenotypic traits (Tadesse, 1993). On the other hand, studies based on evaluations of 3600 tef lines representing 36 populations collected from the Central and Northern Regions of Ethiopia revealed significant regional diversity for seed color and days to maturity (Kefyalew et al., 2000). Furthermore, other studies showed significant regional diversity for lemma color, number of culm internodes, and counts of basal and middle spikelet florets in tef germplasm populations from different parts of the country (Assefa et al., 2002b).

An experiment at two locations using 144 accessions collected from different regions of Ethiopia showed that accessions from the same origin clustered into different classes and those from different origins also clustered into the same group (Adnew et al., 2005). Other studies further confirmed that the level of genetic diversity is higher in tef germplasm within a region than between regions, and as a result, accessions that had originated from the same region and altitude were grouped into distinct and distant clusters (Assefa et al., 2001b).

On the other hand, no significant differences were obtained among diverse altitude zones for parameters like days to panicle emergence, culm and panicle length, number of panicle branches, counts of fertile florets/spikelet, and shoot biomass (Assefa et al., 2001a,b). Moreover, diversity studies using 33 accessions collected from North-Western Ethiopia and four improved varieties (Ayalew et al., 2011) and selected tef genotypes (Plaza-Wüthrich et al., 2013) revealed considerable variations among the genotypes for many of the traits assessed.

However, this genetic variability is rapidly declining as farmers are quickly adopting improved cultivars and using them instead of landraces. In order to reduce the expected genetic erosion, the EIB has made rescue collections from different agro-ecological zones.

## Molecular Diversity in Tef

In the past, efforts have been made to characterize and analyze the diversity levels in cultivars of tef and its relatives based on approaches other than morphological or phenotypic data (**Table 3**). Before high-throughput sequencing provided copious amounts of molecular data, chromatography, flow cytometry, gel electrophoresis, and polymorphism assays were used for the molecular characterization of genetic diversity.

TABLE 3 | Studies made on molecular (genotypic) diversity in tef.

∗*chloroplast DNA, 18S r, VP1 DNA.*

∗∗*AFLP, EST-SSR, SNP/INDEL, IFLP, ISSR.*

#### Proteins as a Marker

Early work using differences in protein content to classify and distinguish different accessions of tef employed the chromatography and electrophoresis of proteins involved in traits of interest such as seed storage proteins. Studies on the relatedness between *Eragrostis* species and tef accessions using chromatography of leaf phenolics and electrophoresis of seed proteins as biochemical markers showed complex patterns of variation amongst tef cultivars (Bekele and Lester, 1981). Similarly, polymorphisms among tef seed storage proteins (albumin, globulin, and prolamin) were found based on SDS-PAGE (Bekele et al., 1995). The study was able to classify 37 cultivars into seven groups, and suggested that the polymorphisms in albumins and globulins could be exploited to identify genotypes with desirable nutritional qualities.

## Genomics

Finding and exploiting DNA sequence variation within a genome is of utmost importance for crop genetics and breeding (Varshney et al., 2009). Over the last three decades, different methods have been developed to detect and quantify the genetic diversity of tef. The first techniques employed were flow cytometry, sequencing of single genes or regions and genotyping using AFLP, RAPD, RFLP, inter-simple sequence repeat (ISSR), and simple sequence repeat (SSR) markers, and these have all shed light on the structure of allelic diversity within selected tef germplasm collections (Girma et al., 2014). As shown in **Figure 2**, an SSR marker was used successfully to study the relationships among diverse tef genotypes, including natural accessions and improved varieties. However, only a small part of the diversity has been studied, and many of the essential questions still remain unanswered. Currently high-throughput single nucleotide polymorphism (SNP) genotyping is one of the methods that has been used to detect and exploit the genetic diversity of several crops. Genetic diversity analyses in some of the agriculturally important food crops such as sorghum (Nelson et al., 2011; Morris et al., 2013), barley (Close et al., 2009), rice (Thomson et al., 2012), bread wheat and emmer wheat (Akhunov et al., 2009), durum wheat (Trebbi et al., 2011), and maize (Yan et al., 2010) have been carried out with SNP genotyping methods employing next generation sequencing technologies.

FIGURE 2 | Phylogenetic tree of natural accessions and improved varieties of tef. The ∗ represents improved varieties. The phylogenetic tree was constructed from ∼200 bp surrounding an SSR marker located on linkage group nine (Zeid et al., 2011). Quncho, the most popular variety in Ethiopia, was produced from a cross between the high-yielding Dukem variety and the white-seeded Magna variety.

## Genome Size and Ploidy Determination

The genomic content of tef was first studied using flow cytometry, a popular method for ploidy screening and genome size estimation (Dolezel and Bartos, 2005). In the first measurement using four tef cultivars, the genome size was found to be between 714 and 733 Mbp (Ayele et al., 1996), relatively small for a grass (**Table 4**). The small genome size of tef made it a good candidate for genetic mapping and later genome sequencing. In addition, 32 of the first 35 tef ecotypes characterized (Ebba, 1975) as well as three commercial varieties were tested for ploidy level; all were tetraploid. In another study with 10 released varieties of tef, following optimization of the flow cytometry conditions, the resulting genome size estimates were between 648 and 926 Mbp (Hundera et al., 2000; **Table 4**).
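
For readers unfamiliar with how flow cytometry readings become genome sizes, the sketch below shows the general arithmetic. It rests on several assumptions that are not taken from the cited studies: the sample 2C DNA content is scaled from a reference standard by the ratio of fluorescence peak positions, 1 pg of DNA is taken as roughly 978 Mbp, and genome size is expressed on a 1C basis (half the 2C value). The cited papers may have used a different conversion factor or basis, and all numbers here are hypothetical.

```python
MBP_PER_PG = 978.0  # assumed conversion factor: 1 pg of DNA is roughly 978 Mbp


def sample_2c_pg(sample_peak, standard_peak, standard_2c_pg):
    """Estimate sample 2C DNA content (pg) from the peak ratio to a standard."""
    if standard_peak <= 0:
        raise ValueError("standard_peak must be positive")
    return standard_2c_pg * sample_peak / standard_peak


def genome_size_1c_mbp(dna_2c_pg, mbp_per_pg=MBP_PER_PG):
    """Convert a 2C DNA content (pg) into a 1C genome size (Mbp)."""
    return (dna_2c_pg / 2.0) * mbp_per_pg


# Hypothetical peak positions and reference standard, for illustration only.
dna_2c = sample_2c_pg(sample_peak=200.0, standard_peak=500.0, standard_2c_pg=2.5)
print(dna_2c, genome_size_1c_mbp(dna_2c))
```

Keeping the conversion factor as a parameter makes it explicit that estimates from different laboratories are only comparable when the same factor and reference standard are used.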

### Sequence-Based Diversity

Around the same time, sequencing of single genes and small genomic regions was also employed to measure diversity and genetic relationships. Sequence analysis of non-coding regions of chloroplast DNA, 18S rDNA, and the transcription factor VP1 did not show significant intra-specific variation among six tef cultivars (Espelund et al., 2000). In addition, two rht1 (reduced height) gene homologs and three sd1 (semi-dwarf) genes were later sequenced for 31 accessions of tef (Smith et al., 2012). A low level of nucleotide diversity was observed and the genetic diversity could be organized into 2–4 haplotypes, a relatively small number.

## Molecular Markers

Molecular markers are short sections of DNA that differ between varieties, and thus can be used for identification of a germplasm by a specific pattern of polymorphisms, to assess diversity and to determine relationships. Genetic relationships among accessions of *E. tef*, *E. pilosa*, and *E. curvula* which were collected from Ethiopia and the USA were assessed based on AFLP (Bai et al., 1999b; Ayele and Nguyen, 2000) and RAPD markers (Bai et al., 2000). These analyses depicted relatively low levels (18%) of polymorphism within *E. tef*, and high similarity between *E. tef* and *E. pilosa*. The Jaccard similarity coefficient (size of the intersection of two sets divided by the size of the union) among two tef populations ranged from 84 to 96% for RAPD and from 73 to 99% for AFLP markers, indicating very close similarity among accessions. On the other hand, ISSR analysis on 92 tef genotypes from seven regions plus improved varieties showed higher diversity among tef cultivars with Jaccard similarity coefficients ranging from 26 to 86% (Assefa et al., 2003a). A comparison of AFLP, EST-SSR, ISSR, and SSR markers for polymorphisms in tef recombinant inbred lines concluded that EST-SSR and ISSR markers had as much polymorphism as AFLP markers (Chanyalew et al., 2007).
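
The Jaccard coefficient defined above is simple to compute once each accession is represented by the set of marker bands scored as present. The following is a minimal sketch under that assumption; the band labels and accessions are hypothetical and do not come from the cited studies.

```python
def jaccard_similarity(bands_a, bands_b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(bands_a), set(bands_b)
    union = a | b
    if not union:
        raise ValueError("at least one band must be scored as present")
    return len(a & b) / len(union)


# Hypothetical bands scored as present in two accessions (illustrative labels).
accession_1 = {"m01", "m02", "m03", "m05"}
accession_2 = {"m01", "m02", "m04", "m05"}

similarity = jaccard_similarity(accession_1, accession_2)
print(f"{similarity:.0%}")
```

Bands absent from both accessions do not enter the calculation, so the coefficient depends on how many polymorphic bands a marker system reveals. This is one reason why estimates from different marker types are not directly comparable.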

Assessment of genetic diversity and relationships among 326 tef accessions, 13 wild relatives, and four commercial varieties from the United States based on 39 SSR markers, 26 of which were flanking QTL intervals for stem strength related traits, yield and lodging index showed genetic similarity (GS) estimates of between 0.20 and 0.99 among tef accessions (Zeid et al., 2012), and this contrasted with the narrow genetic background suggested in the earlier studies described above. A large base of genetic diversity is indispensable for successful breeding programs. However, the diversity in tef has never been sufficient to produce the desired improvement in lodging resistance. Given the complexity of lodging and its component traits such as plant height, and culm internode length and diameter, alternative approaches including genetic transformation in line with marker-assisted selection should be considered for improving the malignant lodging syndrome in tef.

TABLE 4 | Variations in 2C DNA content and genome size among tef genotypes.

The afore-mentioned study of Zeid et al. (2012) also revealed 27 cases where accessions were identical to one or more of the other accessions. According to the authors, the high GS estimates from previous studies (Ayele et al., 1999; Bai et al., 1999a, 2000; Ayele and Nguyen, 2000) using the same plant material (landraces) were a marker-dependent issue rather than the result of low polymorphism in tef as previously suggested. An SSR marker used to construct a phylogenetic tree for 16 natural accessions and four improved varieties of tef showed the relationship among these genotypes (Cannarozzi et al., 2014). A multiple sequence alignment of approximately 200 base pairs was variable at 32 sites of which 25 were informative for determining evolutionary relationships.
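
The distinction between variable and informative sites can be made concrete with a short sketch. It assumes that "informative" is meant in the parsimony sense, i.e., a column with at least two character states that each occur in at least two sequences; the source does not state which criterion was applied. The toy alignment is hypothetical, and sequences are assumed to be uppercase and already aligned.

```python
from collections import Counter


def classify_sites(alignment, ignored="-N?"):
    """Count variable and parsimony-informative columns in an alignment."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    variable = informative = 0
    for column in zip(*alignment):
        # Gaps and ambiguous characters are skipped rather than counted as states.
        counts = Counter(base for base in column if base not in ignored)
        if len(counts) < 2:
            continue  # invariant column
        variable += 1
        # Informative: at least two states are each seen in two or more sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return variable, informative


# Hypothetical four-sequence alignment, for illustration only.
toy_alignment = ["ACGTA", "ACGTA", "ATGAA", "ATGAC"]
print(classify_sites(toy_alignment))
```

In the toy alignment the last column is variable but not informative, because the differing base occurs in a single sequence and therefore cannot group accessions.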

#### Genetic Mapping

Genetic maps show the position of the molecular markers and QTLs relative to each other in terms of recombination frequency, and are used to find genes responsible for traits of interest. The first genetic map of tef was produced with an intra-specific cross between the 'Kaye Murri' and 'Fesho' cultivars and contained 211 AFLP markers in 25 linkage groups (Bai et al., 1999a). The low number of polymorphisms found between the two varieties of tef impeded its use in breeding. The same group later produced an RFLP linkage map using 116 RILs from the cross of 'Kaye Murri' with *E. pilosa* (Zhang et al., 2001). This inter-specific cross produced far more polymorphisms; however, the level of polymorphism was still smaller than that of other grasses.
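
As an illustration of how recombination frequency translates into map distance, the sketch below implements the two classical mapping functions, Haldane and Kosambi. Which function the cited tef maps used is not stated here, so the choice is an assumption. The functions take a per-meiosis recombination fraction; populations of recombinant inbred lines require an additional correction that is omitted. The input value is hypothetical.

```python
import math


def _check_fraction(r):
    """Mapping functions are defined for recombination fractions in [0, 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")


def haldane_cm(r):
    """Haldane map distance in centimorgans (assumes no crossover interference)."""
    _check_fraction(r)
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in centimorgans (allows for some interference)."""
    _check_fraction(r)
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Hypothetical recombination fraction between two markers.
r = 0.10
print(round(haldane_cm(r), 2), round(kosambi_cm(r), 2))
```

For closely linked markers both functions give distances close to 100 × r, and they diverge as r approaches 0.5, where markers behave as unlinked.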

The group of Sorrells has been instrumental in identifying QTLs associated with yield related traits and producing genetic maps of tef using RILs from a cross between 'Kaye Murri' and *E. pilosa* with a variety of markers (Chanyalew et al., 2005; Yu et al., 2006; Zeid et al., 2011). Clusters of QTLs controlling yield and plant architecture were identified, thereby forming useful targets for applied breeding.

### High-Throughput Genomics

During the last 5 years, tef genomics research has moved from analysis of a handful of genetic polymorphisms toward whole genome sequencing and genome-wide polymorphism search. The genome and the transcriptome of the tef genotype Tsedey (DZ-Cr-37) were sequenced by the Tef Improvement Project at the Institute of Plant Sciences, University of Bern (Cannarozzi et al., 2014). Genome sequencing has many applications in tef improvement. First and foremost, primer sequences can be identified without resorting to other genomes or degenerate primers. This is especially important for the isolation of homeologous copies of each sub-genome for techniques such as Targeting Induced Local Lesions IN Genomes (TILLING), which require genome-specific primers. The genome has already been used to discover genetic markers such as SNPs and SSRs useful for marker-assisted breeding, for the construction of high density genetic maps and for linkage disequilibrium studies on diverse germplasm. Possession of the genomic sequences allows an understanding of the molecular basis of the mechanisms of tef's many desirable properties such as its tolerance to many abiotic and biotic stresses. The genes obtained from these analyses could then be transferred to other economically important crops.

## Transcriptomics

To date, the transcriptome from only one tef improved variety has been sequenced (Cannarozzi et al., 2014), precluding comparison of transcriptomes between varieties or accessions. For the Tsedey improved variety (DZ-Cr-37), a normalized transcriptome library was prepared and sequenced from roots and shoots of tef seedlings resulting in a transcriptome with 27756 gene clusters and 38333 transcripts. In addition, a second non-normalized library was obtained from various tef tissues subjected to drought and water-logging, resulting in a similar number of gene clusters.

An RNA-Seq study of two different varieties of quinoa (Raney et al., 2014), one representing valley ecotypes and another one representing high plains ecotypes, under different watering conditions was recently conducted. It was found that 27 putative gene products were differentially expressed based on variety × treatment interaction. These included significant differences in root tissue in response to increasing water stress. A similar strategy could be employed for tef varieties to discover the QTLs responsible for specific accessions' traits.

### Proteomics

Proteomics has emerged as an indispensable tool to analyze the whole or specific protein complement present in a particular tissue, organ, cell, or organelle (Agrawal et al., 2005; Benkeblia, 2011). In recent years, plant proteome analysis has evolved into high-throughput techniques resulting in the generation of high quality data with the continuous improvements made in sample preparation, protein separation, mass spectrometry, and protein search algorithms (Thelen, 2007; Benkeblia, 2011).

The application of proteomic studies has led to the discovery of a number of important proteins, and has facilitated attempts to explore their importance in improving plant yield and tolerance to environmental stresses (Salekdeh and Komatsu, 2007; Mochida and Shinozaki, 2010; Benkeblia, 2011). Similarly, to take advantage of the diversity among tef lines, proteomic approaches can be narrowed and refined to investigate which proteins are characteristic of specific lines or play important roles in a selected tef line. The corresponding genes of these proteins of interest can then be isolated and characterized from the tef genome provided it is comprehensively annotated. The particular phenotype conferred by the protein(s) of interest can then be introduced or enhanced in other tef lines using genetic and transgenic approaches to improve crop productivity. This functional genomics approach has been proposed as a standard 'omic' strategy for the improvement of many crop species (Agrawal and Rakwal, 2006).

To date, there has been no published proteomic study on tef with respect to protein profiling or comparative proteomics, while numerous such studies have been done on maize (Mohammed, 2005; Zhu et al., 2006; Prinsi et al., 2009), wheat (Jiang et al., 2012; Budak et al., 2013) and rice (Agrawal and Rakwal, 2006; Kim et al., 2014) using both gel-based and gel-free (mass spectrometry) techniques. Recently, proteomic profiling of the tef drought response has been undertaken, and should contribute valuable information on the key biological processes affected by water loss in tef (Kamies et al., 2014).

A key constraint affecting tef yield is salinity in the lowland and Rift Valley areas of Ethiopia, especially in the Awash valley and lower plains (Asfaw and Dano, 2011). The effects of increased salinity on tef yield and yield components were investigated by screening 15 lowland tef genotypes (10 accessions and 5 varieties) at different salinity levels. The authors found grain yield per main panicle to be the most affected by increased salinity, and although there were differences in genetic variation between tef varieties and accessions, salt tolerance was observed in accession 237186 and variety DZ-Cr-37 (Tsedey) (Asfaw and Dano, 2011). This particular variety of tef thus requires further proteomic and metabolomic investigation in order to elucidate the mechanisms of salt tolerance in tef and to identify salt tolerance markers.

A comparative proteomics approach could be employed to investigate the cell wall proteome in both the tef stem and root tissues. A similar comparative proteomic study was done on maize primary and lateral roots whereby proteins involved in cell wall metabolism, cell elongation, lignin metabolism, defense, and citrate cycle were identified (Liu et al., 2006; Zhu et al., 2006). Such a study can be done on tef to identify and characterize stress-related cell wall proteins.

It is important to note that future tef improvements using the 'omics' tools should be conducted on one standardized consensus tef variety to allow for ease of comparison across functional genomic studies and to facilitate interpretation of data. Many studies have been conducted on the improved variety DZ-Cr-37 mostly because it is grown in areas which receive low rainfall (especially terminal drought-prone areas), and has been proposed to have a degree of drought tolerance in addition to being widely adaptable to differing climates (Ayele et al., 2001; Assefa et al., 2003a, 2011; Admas and Belay, 2011; Cannarozzi et al., 2014). Furthermore, since the genome and transcriptome information of this variety is available (Cannarozzi et al., 2014), it provides a platform for different proteomic strategies such as sub-cellular proteomics or phospho-proteomics to investigate stresses associated with tef. As stated earlier, proteomics is a functional tool that can provide insight into phenotypes of interest, and is largely dependent on the level of clarity and certainty provided by the databases generated and the level of annotation of the sequences. Although tef genome sequencing has been conducted, database annotation is still in its infancy, so proteo-bioinformatic approaches are somewhat limited; this will be remedied in time as more and more protein sequences are curated.

## Tef Diversity in Key Traits

## Grain Yield and Shoot Biomass

Development of varieties with high grain yield has been one of the top priorities of the National Tef Improvement Program in Ethiopia (Assefa et al., 2011). This varietal development process depends on the variability available within the gene pool. Over the past three decades, several studies (Assefa et al., 1999, 2000, 2001b, 2003b; Teklu and Tefera, 2005) were conducted to assess this variability, and tests both at research stations and on-farm yield trials were carried out at various locations. Over 30 improved varieties have been developed pushing the national average tef yield from 0.7 t/ha in 1994 to 1.5 t/ha in 2013 (CSA, 2014) hinting that the yield potential in tef can be further exploited. Variability in shoot biomass has also been studied in the majority of the above-mentioned studies, and a wide range (4–105 g/plant) was reported, suggesting the presence of high variability for this trait within the tef gene pool.

## Seed Size and Seed Coat Characteristics

Despite the importance of seed size in terms of both agronomy and productivity, there exists only one study on the variability of seed size in tef. Using two improved tef genotypes, sieve-graded larger tef seeds had an increased seed yield, but it was concluded that this increase did not justify seed grading in tef (Belay et al., 2009). Seed coat characteristics in tef have received little research attention. The only study reported in literature showed slime cell differences in two tef genotypes and a wild *Eragrostis* species (Kreitschitz et al., 2009). The authors reported the presence of slime cells, a type of modified epidermal cells covering the fruit of the genotypes under investigation, and that such cells could play an adaptive role for tef plants growing in dry areas.

## Physiology and Agronomy Related Traits

Due to a growing interest in utilizing tef as a gluten-free alternative to rice, there is corresponding interest in producing tef at a larger scale in some western countries. However, as a short day tropical cereal, growing tef in the temperate regions during the summer when the days get longer poses a big challenge. In order to investigate the ability of tef to flower in response to changes in the photoperiod, the effect of the relative lengths of day and night was studied using four tef cultivars. Two of the four cultivars had a stronger photoperiod response; panicle initiation as well as development and outgrowth of the panicle were influenced by photoperiod (van Delden et al., 2012).

### Nitrogen-Use Efficiency

Nitrogen use efficiency (NUE), defined as the ratio of grain yield to supplied N, is a key parameter for evaluating a crop cultivar, and it is composed of N uptake efficiency and N physiological use efficiency (de Macalel and Vlek, 2004). Breeding for NUE in tef could play a considerable role in reducing the amount of nitrogen fertilizer applied without affecting yield significantly. The NUE of tef is very low, ranging from 16 to 34% (Tulema et al., 2005). In the last decade, some authors looked at the genetic variation in NUE of tef (Tulema et al., 2005; Balcha et al., 2006; Habtegebrial et al., 2007). We suggest that further comparisons of nitrogen-use efficiency within the tef gene pool are important to evaluate the performance of genotypes under limited nitrogen supply.
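The decomposition of NUE into its two components can be illustrated numerically. The following is a minimal sketch, assuming the common convention that N uptake efficiency is N uptake divided by N supplied, and N physiological use efficiency is grain yield divided by N uptake, so that their product equals grain yield per unit of supplied N. The function name and the plot values are hypothetical and are not taken from the cited studies.

```python
def nitrogen_use_efficiency(grain_yield_kg_ha, n_supplied_kg_ha, n_uptake_kg_ha):
    """Return (NUE, uptake efficiency, physiological efficiency).

    Assumed conventions (not specified in the text above):
      uptake efficiency        = N uptake / N supplied        (kg N per kg N)
      physiological efficiency = grain yield / N uptake       (kg grain per kg N)
      NUE                      = grain yield / N supplied     (kg grain per kg N)
    """
    if n_supplied_kg_ha <= 0 or n_uptake_kg_ha <= 0:
        raise ValueError("N supplied and N uptake must be positive")
    uptake_eff = n_uptake_kg_ha / n_supplied_kg_ha
    physio_eff = grain_yield_kg_ha / n_uptake_kg_ha
    nue = grain_yield_kg_ha / n_supplied_kg_ha
    return nue, uptake_eff, physio_eff


# Hypothetical plot: 1500 kg/ha grain, 60 kg/ha N supplied, 30 kg/ha N taken up.
nue, uptake_eff, physio_eff = nitrogen_use_efficiency(1500.0, 60.0, 30.0)
print(f"NUE = {nue:.1f} kg grain per kg N supplied")
print(f"uptake efficiency = {uptake_eff:.2f}, physiological efficiency = {physio_eff:.1f}")
```

Separating the two components in this way shows whether a genotype with low NUE is limited mainly by how much N it takes up or by how efficiently it converts absorbed N into grain.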

### Osmotic Adjustment and Root Depth

Water deficit and salinity are among the abiotic production constraints limiting survival, growth, and productivity of tef. However, it is likely that there exists variability within the tef germplasm pool, and certain tef genotypes could adopt strategies such as osmotic adjustment to resist these constraints. Systematic sampling of 54 tef genotypes from the entire gene pool showed a significant genotype effect on osmotic adjustment and root depth, irrespective of the area from where the genotypes were collected (Ayele et al., 2001).

## Stress Related Traits

### Drought Tolerance

The production areas of tef range from the cool highlands to the dry lowlands that are often associated with moisture deficit during critical stages of plant development. Studies investigating the effect of moisture deficit on the performance of tef plants range from variability in key characters and response studies (Degu et al., 2008; Mengistu, 2009; Ginbot and Farrant, 2011; Shiferaw et al., 2012) to mapping QTLs related to economically important traits under water deficit conditions (Degu, 2010). In general, the majority of the studies have shown that there is genetic variability among the genotypes investigated suggesting that the tef gene pool harbors moisture stress tolerant genotypes that could be screened through efficient tools such as molecular markers.

### Salinity and Acidity Tolerance

Due to the anticipated changes in the climate and expansion of farmlands in the rift valley areas, studying and documenting the effect of such growing conditions on tef production and productivity is worthwhile. Earlier, a few of such studies have been published including one which showed the presence of broad intra-specific variability among the ten tef accessions studied for salinity tolerance (Asfaw and Dano, 2011), and one which showed the presence of genetic variability for tolerance to soil acidity and aluminum toxicity in selected tef genotypes (Abate et al., 2013).

## Nutrition, Health, and Consumers' Preference Related Traits

### Seed Color and Consumers' Preference

The Ethiopian Standards Agency recognizes four classes of tef grain mainly based on color of the seed (QSAE, 2001). These are very white, white, brown and mixed (commonly known as *Sergegna*). Oftentimes, farmers produce brown-seeded types for home consumption and white types for sale. Assessment of the diversity patterns of the seed color in tef with respect to growing regions and altitude zones revealed that the majority of tef collections from the north and northwestern part of Ethiopia were white-seeded as compared to those from the southern part of the country which were brown-seeded (Assefa et al., 2002b).

### Nutritional Quality and Physico-Chemical Properties of Tef Seed

Knowledge of the physical properties of tef seed can be useful for agronomy, storage, marketing, and other socio-cultural purposes. A handful of studies have been carried out on the starch and protein contents of tef seed. Starch is the principal carbohydrate of all cereals, and represents from 56% (oats) to 80% (maize) of the grain dry matter (Eliasson and Larsson, 1993). The starch characteristics of tef seed have been extensively studied (Bultosa et al., 2002, 2008; Bultosa and Taylor, 2003, 2004; Bultosa, 2007). The scientific study of tef grain protein, and more specifically its amino acid composition, extends back over 50 years. Previous reports indicated that tef seed contains a good balance of the essential amino acids, except lysine (Jansen et al., 1962). Three decades later, investigations of the seed albumin, globulin, and prolamin fractions showed considerable polymorphism in these protein fractions among the 37 tef cultivars investigated (Bekele et al., 1995). Around the same time, Tatham et al. (1996) purified and characterized prolamins of tef. According to this study, the tef protein is made up of 9–14% prolamins and these are similar to prolamins of maize and sorghum. This value is in a similar range to the previous results (3–15%; Bekele et al., 1995). However, according to a recent report, the prolamin content of three tef genotypes studied reached as much as 40% (Adebowale et al., 2011). These studies differ in the number of genotypes used and the methods employed. Clearly variability exists within the tef gene pool, and a comprehensive study with more genotypes and modern tools to characterize and document the seed protein fractions is necessary. More recently, studies on tef seeds have changed course: three studies by Gebremariam et al. (2013a,b,c) investigated the malt quality attributes, while another by Boka et al. (2013) assessed the antioxidant properties of differentially processed tef grain.

As a potential alternative gluten-free food source for celiac patients, tef has been studied along with wheat, oat, rye, barley, rice, maize, and triticale (Spaenij-Dekking et al., 2005). This study showed that the tef cultivars evaluated contained no gluten or gluten homologs. This was the first scientific evidence for the absence of gluten in tef flour. Recently, this has been supported by results from the genome sequence initiative (Cannarozzi et al., 2014).


## Conclusion

The broad spectrum of trait diversity in tef implies great opportunities for genetic improvement through either direct selection or intra-specific hybridization between parental lines with desirable traits. In addition, statistical tools such as correlation analysis can be used to aid selection of candidates in breeding programs. Additionally, several mutagenized populations have been developed to supplement the natural diversity present in tef. As some studies reviewed here used only a few or selected tef genotypes, they may not be representative of the existing diversity in tef accessions. Future research is required to explore diversity in different traits of agronomic and nutritional importance. Concerted efforts of all stakeholders in research, development and funding are required to promote the research and development of vital crops such as tef in order to promote food and nutrition security.

## Acknowledgments

Research and development work in Tadele's Lab is supported by the Syngenta Foundation for Sustainable Agriculture, SystemsX and the University of Bern.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Assefa, Cannarozzi, Girma, Kamies, Chanyalew, Plaza-Wüthrich, Blösch, Rindisbacher, Rafudeen and Tadele. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Genetic diversity and genomic resources available for the small millet crops to accelerate a New Green Revolution

#### Travis L. Goron and Manish N. Raizada\*

Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada

Small millets are nutrient-rich food sources traditionally grown and consumed by subsistence farmers in Asia and Africa. They include finger millet (Eleusine coracana), foxtail millet (Setaria italica), kodo millet (Paspalum scrobiculatum), proso millet (Panicum miliaceum), barnyard millet (Echinochloa spp.), and little millet (Panicum sumatrense). Local farmers value the small millets for their nutritional and health benefits, tolerance to extreme stress including drought, and ability to grow under low nutrient input conditions, ideal in an era of climate change and steadily depleting natural resources. Little scientific attention has been paid to these crops, hence they have been termed "orphan cereals." Despite this challenge, an advantageous quality of the small millets is that they continue to be grown in remote regions of the world which has preserved their biodiversity, providing breeders with unique alleles for crop improvement. The purpose of this review, first, is to highlight the diverse traits of each small millet species that are valued by farmers and consumers which hold potential for selection, improvement or mechanistic study. For each species, the germplasm, genetic and genomic resources available will then be described as potential tools to exploit this biodiversity. The review will conclude with noting current trends and gaps in the literature and make recommendations on how to better preserve and utilize diversity within these species to accelerate a New Green Revolution for subsistence farmers in Asia and Africa.

Keywords: finger millet, kodo millet, foxtail millet, barnyard millet, proso millet, little millet, New Green Revolution, biodiversity

**Abbreviations:** EST, expressed-sequence tag; RFLP, restriction fragment length polymorphism; AFLP, amplified fragment length polymorphism; SSR, simple sequence repeat; WUE, water use efficiency; NUE, nitrogen use efficiency.

#### Edited by:

Joanna Marie-France Cross, Inonu University, Turkey

#### Reviewed by:

Dayong Li, Chinese Academy of Sciences, China; Velu Govindan, CIMMYT, Mexico; Anil Kumar, G B Pant University of Agriculture and Technology, India

#### \*Correspondence:

Manish N. Raizada, Department of Plant Agriculture, University of Guelph, 50 Stone Road East, Guelph, ON N1G 2W1, Canada; raizada@uoguelph.ca

#### Specialty section:

This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science

> Received: 21 December 2014 Accepted: 27 February 2015 Published: 24 March 2015

#### Citation:

Goron TL and Raizada MN (2015) Genetic diversity and genomic resources available for the small millet crops to accelerate a New Green Revolution. Front. Plant Sci. 6:157. doi: 10.3389/fpls.2015.00157

## Small Millets—Valuable Crops Neglected by the Green Revolution

The "Green Revolution" represents a period of massive agricultural advancement, and is often credited with saving over a billion people from starvation in the developing world (Borlaug, 2000; Evenson and Gollin, 2003). The initial focus of the Revolution was the promotion of semi-dwarf varieties of major cereal grain crops especially rice, wheat, and maize. Such modern varieties were also methodically bred to deal with environmental stresses, and in many cases produced yields several times higher than local cultivars. A highly cited example is the global success of "miracle rice" in the 1960s (De Datta et al., 1968). When faced with potential mass famine, the Punjab region of India collaborated with international advisors to introduce IR8, a semi-dwarf rice modern variety. IR8 was found to produce up to 10 times the yield of traditionally grown varieties (De Datta et al., 1968) and helped to transform India's food production from deficit to surplus; national rice production tripled accompanied by a dramatic drop in price. IR8 and its progenitors as well as other modern varieties of cereals were further exported to other regions of the world with similar results especially in Latin America and Asia (Evenson and Gollin, 2003).

However, there are regions of the world that did not experience a Green Revolution. Sub-Saharan Africa experienced a lag in the benefits of modern varieties although efforts were made for their introduction and establishment (Ejeta, 2010). Reasons for the failure are complex. Many commentators point to institutional and political difficulties that may have hindered dissemination of new technology (Ejeta, 2010). However, it is also important to consider the agroeconomic complexities of the region, where a mixture of species less common elsewhere in the world are traditionally grown (Evenson and Gollin, 2003). A wide range of climatic zones and unique farming practices with a spectrum of soil types also created a challenge. In the early part of the Green Revolution, breeding generally consisted of modifying pre-existing genetic resources of wheat, maize, and rice in which research had already been conducted by developed nations. These varieties would be further bred to incorporate additional traits to increase yields. The strategy was not applicable to many African crops where essentially no formal work existed for researchers to build upon. In fact, it has been suggested that some African farmers faced increased hardship in response to the Green Revolution as a result of a global drop in food prices caused by its massive success elsewhere (Evenson and Gollin, 2003).

More optimistically, in the later years of the Green Revolution, research broadened to include less common food crops and began to close the gap in yield increases due to modern varieties. Locally administered organizations, such as the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), established research programs that included farmers in the dialog to strategically build a bank of genetic resources for traditionally grown species better suited to local climates and cropping systems. One group of such species is collectively known as the small millets and includes six cereal crops: finger millet (Eleusine coracana), foxtail millet (Setaria italica), kodo millet (Paspalum scrobiculatum), proso millet (Panicum miliaceum), barnyard millet (Echinochloa spp.), and little millet (Panicum sumatrense). Though all six cereals share a similar superficial classification (small grained cereals), they differ vastly in their phylogenies and continue to be grown in some of the most remote farms on Earth; thus isolation has maintained a wealth of agricultural and functional diversity. Their uses vary from animal fodder to human consumption, in which the small seeds can be ground into flour, cooked as porridge, or alternately fermented into enriched foods or alcoholic products. Where they are traditionally grown (**Figure 1**), small millets are highly valued for their diverse benefits and in many instances are considered nutritionally superior to other carbohydrate sources like rice and wheat (Hegde et al., 2005). Additionally, many of the small millets require very little fertilizer input as compared to more intensive grain cropping monocultures. Many reports also exist regarding their high degree of pest resistance and long-term storability, both traits which make the cultivation of small millets good insurance against famine and crop failure (Tsehaye et al., 2006; Reddy et al., 2011).

FIGURE 1 | Depictions of small millet cultivation. (A) A typical subsistence small millet farm in India where the crops are grown under low input conditions and valued for their high stress tolerance. Source: M. Raizada. (B) Finger millet seed heads nearing maturity at the University of Guelph in Canada. The seed heads resemble the fingers of a human hand. Source: T. Goron. (C) Finger millet growing in a terraced field on a smallholder farm in Nepal. Source: M. Raizada. (D) Drudgery associated with transporting grain in the rural areas of Nepal. Source: M. Thilakarathna.

Although previously neglected, the value of small millets in modern agricultural stability has begun to be identified. Much work has been accomplished toward the development of modern varieties with the goal of better directing existing diversity toward agricultural challenges of the new millennium. The purpose of this review is to highlight the diverse traits of each crop that are valued by farmers and consumers (e.g., nutritional quality) that have potential for selection, improvement or mechanistic study, along with other phenotypes of interest, then to describe the germplasm, genetic and genomic resources available as potential tools to exploit this biodiversity. The review will conclude with noting current trends and gaps in the literature and make recommendations on how to better preserve and utilize diversity within these species to accelerate a New Green Revolution.

## Diversity of the Small Millets

### Finger Millet (Eleusine coracana)

Finger millet was domesticated in western Uganda and the Ethiopian highlands (**Figure 2**) at least 5000 years ago before introduction to India approximately 3000 years ago (Dida et al., 2008). It is called finger millet, because the inflorescence resembles the fingers of a human hand (**Figure 1**). The morphology of the inflorescence can be used to differentiate between the two subspecies, africana and coracana (Dida and Devos, 2006). Each subspecies can be further divided into several races. Finger millet is an allotetraploid. Genomic donors of the "A" genome are most likely Eleusine indica and Eleusine trisachya (Liu et al., 2014). The "B" genome has yet to be uncovered, and may have been contributed by an extinct ancestor (Liu et al., 2014). It is cultivated on 1.8 million ha in India, and also fills a substantial niche in eastern Africa (**Table 1**) (Dida and Devos, 2006). Kenyan farmers receive a high price for the grain, often twice that of maize and sorghum (Dida and Devos, 2006). The crop is highly valued in part due to its nutritional content, being especially calcium rich. Finger millet also contains methionine and tryptophan, amino acids which are often absent in starch-based diets of some subsistence farmers (Bhatt et al., 2011). Health benefits have been investigated, including anti-cancer and anti-diabetic activity, arising, respectively, from the grain's polyphenol content (anti-oxidant activity) and high fiber (which promotes slow digestion and hence stability of blood sugar) (Chandrasekara and Shahidi, 2011a; Devi et al., 2014). The species will produce 5 tons/ha under optimum conditions (Dida and Devos, 2006) and requires very little nitrogen fertilization, with some reports indicating the most economic rate of application may be between 20 and 60 kg/ha (Hegde and Gowda, 1986; Pradhan et al., 2011). 
The plant is highly tolerant to drought and salt stress, though a wide diversity of stress resistance has been reported across genotypes (Uma et al., 1995; Bhatt et al., 2011). Unlike many crops consumed by subsistence farmers, finger millet has maintained high socio-economic importance in the Indian and African semi-arid tropics (Benin et al., 2004; Gull et al., 2014) and has received a level of investigation unattained by some of its cousins.

ICRISAT conserves 6804 finger millet germplasm accessions originating from 25 different countries. Other organizations manage germplasm banks of their own, the largest of which are summarized in **Table 2**. From these large collections, ICRISAT and other institutions group all genotypes according to region of origin or other parameters (Brown, 1989; Diwan et al., 1995; Hu et al., 2000; Wang et al., 2007). A subset of each group is selected that is representative of the genetic diversity of the crop: this group is termed the "core collection" and typically consists of ∼10% of all available accessions. Core collections facilitate breeding by providing an efficient means to screen for desired traits from a large pool of genotypes. Mini-core collections, that represent ∼1% of the total accessions, can be used by these institutions to further streamline the available genetic diversity.
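The grouping-then-subsetting procedure described above can be sketched as a proportional stratified sample. This is only an illustration under simplifying assumptions: accessions are grouped by a single label (region of origin), and roughly 10% are drawn at random from each group. The accession identifiers, the group names, and the function `sample_core_collection` are hypothetical; published core collections are typically assembled with trait-informed procedures rather than purely random draws.

```python
import random
from collections import defaultdict


def sample_core_collection(accessions, fraction=0.10, seed=0):
    """Draw a proportional random subset from each group of accessions.

    accessions: iterable of (accession_id, group) pairs.
    fraction:   share of each group to retain (about 0.10 for a core
                collection, about 0.01 for a mini-core collection).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for accession_id, group in accessions:
        groups[group].append(accession_id)

    core = []
    for group in sorted(groups):
        members = sorted(groups[group])
        # Keep at least one accession per group so no group disappears.
        k = max(1, round(fraction * len(members)))
        core.extend(rng.sample(members, k))
    return core


# Hypothetical collection: 120 accessions from one region, 80 from another.
regions = ["East Africa"] * 120 + ["South Asia"] * 80
accessions = [(f"ACC{i:04d}", region) for i, region in enumerate(regions)]
core = sample_core_collection(accessions, fraction=0.10)
print(len(core), "accessions selected for the core collection")
```

Sampling within groups rather than across the whole collection keeps small groups represented, which is the point of stratifying before selecting the subset.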

The morphological diversity present within finger millet is immense. For example, a range of seed colors can be produced which are correlated with protein and calcium content (Vadivoo et al., 1998). Landraces with different attributes (e.g., time to maturity, bird tolerance, drought tolerance, disease tolerance) are valued by farmers based on local agricultural complexities that reflect their productivity across multiple agroeconomic zones (Tsehaye et al., 2006). For example, in the Ethiopian highlands, three high-yield landraces were identified and further developed into the commercial lines Tadesse, Padet, and Boneya (Aduguna, 2007). During a severe drought, Tadesse finger millet was the only cereal that remained productive. Farmers received double the price for the grain as compared to maize (Aduguna, 2007). This study illustrates what can be accomplished if germplasm banks are properly utilized for the selection of desirable traits.

The degree of morphological differences in finger millet requires even core collections to be quite large; specialized tools will be needed to simplify characterization of functional diversity. Molecular markers represent one class of such tools, including restriction fragment length polymorphisms (RFLP), amplified fragment length polymorphisms (AFLP), expressed-sequence tags (EST), and simple sequence repeats (SSR). Very few are reported for finger millet but more are beginning to appear in the literature. Molecular markers have been utilized in attempts to characterize calcium dynamics (Yadav et al., 2014b), disease resistance (Babu et al., 2014d), and in the association mapping of various agronomic traits as well as tryptophan accumulation (Babu et al., 2014a,b). Marker-assisted research has suggested that there was little sequence diversity in finger millet populations (Muza et al., 1995; Salimath et al., 1995; Yadav et al., 2014b), but this would be surprising given the geographic diversity in which finger millet is grown. Molecular markers have enabled linkage maps of the genome to be assembled (Dida et al., 2007). While progress has recently increased, the availability of a published genomic sequence would accelerate the development of markers to assist with genotype classification and breeding. In March 2014, the Bio-resources Innovations Network for Eastern Africa Development (Bio-Innovate) announced a finger millet sequencing project (**Table 3**); the initial genome assembly has been completed and the full sequence is expected by the end of 2014<sup>1</sup>.

Research illuminating the finger millet transcriptome is beginning to appear. As the crop is valued for its high calcium content, studies have characterized calcium sensing and accumulation mechanisms across genotypes differing in their grain calcium content with the use of transcriptome high-throughput sequencing (Kumar et al., 2014b; Singh et al., 2014). A similar transcriptome analysis has been conducted on salinity responsiveness (Rahman et al., 2014). To investigate mechanisms behind the crop's impressively high nitrogen utilization efficiency (NUE), the behavior of the transcription factors Dof1 and Dof2 has been analyzed. It was found that in the roots of a high-protein variety, the EcDof1/EcDof2 ratio was greater than that of a low-protein variety, indicating a higher activation of N uptake and assimilation genes (Gupta et al., 2014a). The authors suggest that this ratio may in the future be utilized to screen other genotypes for high NUE.

Homologs of genes known to be agronomically important in major cereals, such as the transcripts described above, may assist with targeted breeding efforts in crops that are less characterized. Specifically, sequence variants of these genes may be used to develop orthologous molecular markers; those variants that correlate with desired traits may be used to screen accessions and subsequently assist in marker-assisted breeding efforts. This strategy may represent a way forward in the small millets. For example, finger millet researchers have isolated orthologs of genes known to be involved in grain amino acid composition (Opaque 2) and calcium content (calcium transporters, calmodulin) (Reddy et al., 2011; Nirgude et al., 2014). The researchers then associated SSR polymorphisms within these genes to characterize accessions that differed in their protein and calcium content, thus creating a targeted, cost-effective crop improvement strategy. A similar strategy to improve finger millet seed calcium content was also reported independently that focused on orthologs of calcium-binding proteins (CBPs) with extensive characterization of a seed dominant calmodulin (Kumar et al., 2014a,c). A parallel strategy has been suggested for disease resistance in finger millet based on the initial isolation of disease resistance receptors (Reddy et al., 2011; Babu et al., 2014c).

Progress has also occurred with respect to transgenic protocols for finger millet utilizing Agrobacterium and callus cell bombardment (Kothari et al., 2005; Ceasar and Ignacimuthu, 2009, 2011; Sharma et al., 2011; Jagga-Chugh et al., 2012; Plaza-Wüthrich and Tadele, 2012). Such techniques have allowed finger millet plants to be improved for drought and salinity tolerance (Ramegowda et al., 2012; Anjaneyulu et al., 2014; Hema et al., 2014), zinc accumulation (Cakmak, 2008; Ramegowda et al., 2013), and disease resistance (Latha et al., 2005).

### Foxtail Millet (Setaria italica)

Named for the bushy, tail-like appearance of its immature panicles, foxtail millet has received a promising amount of research attention. Domesticated in China (**Figure 2**) approximately 8700 years ago, foxtail millet is considered one of the world's oldest crops and ranks second in total world millet production, providing six million tons of grain for people throughout areas in southern Europe and Asia (Li and Wu, 1996; Yang et al., 2012). It is one of the main food crops in regions of the dry north of China (Wang et al., 2012). Foxtail millet is cultivated to a limited extent in North America for silage, birdseed, and as a cover crop. It is quick to mature, able to produce seed in 75–90 days, and sometimes grown as a "catch-crop" in between the plantings of other species (Baltensperger, 2002). Herbicide-resistant lines of foxtail millet have been identified and studied in detail (Zhu et al., 2006). Additionally, the plant is quite drought resistant and tolerant to salt stress (Jayaraman et al., 2008). The cultivar "Prasad" has been identified as being particularly salt-tolerant, perhaps due to an effective antioxidant mechanism mediated by polyamine accumulation (Sudhakar et al., 2015).

<sup>1</sup>http://bioinnovate-africa.org/about-us/news/item/162-finger-millet-genomics-project-to-provide-researchers-with-better-tools-for-variety-production

#### TABLE 2 | Significant germplasm collections of the small millets.

<sup>a</sup>http://www.icrisat.org/crop-fingermillet.htm <sup>b</sup>http://www.ars-grin.gov/npgs/index.html <sup>c</sup>http://www.icrisat.org/crop-foxtailmillet.htm <sup>d</sup>http://www.gene.affrc.go.jp/index\_en.php <sup>e</sup>http://www.ars-grin.gov/npgs/index.html <sup>f</sup>http://www.ars-grin.gov/npgs/index.html <sup>g</sup>http://www.icrisat.org/crop-prosomillet.htm <sup>h</sup>http://www.ars-grin.gov/npgs/index.html <sup>i</sup>http://www.gene.affrc.go.jp/index\_en.php <sup>j</sup>http://www.icrisat.org/crop-barnyardmillet.htm <sup>k</sup>http://www.gene.affrc.go.jp/index\_en.php <sup>l</sup>http://www.ars-grin.gov/npgs/index.html <sup>m</sup>http://www.icrisat.org/crop-littlemillet.htm <sup>n</sup>http://www.ars-grin.gov/npgs/index.html

#### TABLE 3 | Small millet genomic resources and features.

<sup>a</sup>http://www.ncbi.nlm.nih.gov/ <sup>b</sup>http://bioinnovate-africa.org/about-us/news/item/162-finger-millet-genomics-project-to-provide-researchers-with-better-tools-for-variety-production

As opposed to finger millet which was the result of a single domestication event (Dida et al., 2008), the history of foxtail millet is more complex. Sequence diversity of 250 Chinese genotypes was found to be quite high, averaging 20.9 alleles per locus when examined with 77 SSRs (Wang et al., 2012). Alleles clustered into two main geographic diversity centers, indicating the possibility of two domestication events within China; more work is needed to confirm this hypothesis (Wang et al., 2012). Additionally, it has been suggested that foxtail millet was independently domesticated in Europe based on archeological evidence (Jusuf and Pernes, 1985; Hunt et al., 2008; Hirano et al., 2011).
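The "alleles per locus" statistic quoted above is simply the number of distinct alleles observed at each marker, averaged over all markers. A minimal sketch, assuming SSR alleles are recorded as fragment sizes and missing calls as `None`; the marker names and genotype calls below are invented for illustration and are not data from Wang et al. (2012).

```python
def mean_alleles_per_locus(genotypes):
    """Count distinct alleles at each locus and average across loci.

    genotypes: dict mapping locus name -> list of allele calls across
               accessions (e.g. SSR fragment sizes); None marks a missing call.
    Returns (mean number of alleles per locus, per-locus counts).
    """
    counts = {
        locus: len({allele for allele in calls if allele is not None})
        for locus, calls in genotypes.items()
    }
    if not counts:
        raise ValueError("at least one locus is required")
    return sum(counts.values()) / len(counts), counts


# Hypothetical calls for three SSR markers scored on five accessions.
toy_genotypes = {
    "SSR_A": [150, 152, 152, 158, None],
    "SSR_B": [201, 201, 205, 205, 205],
    "SSR_C": [98, 100, 102, 104, 98],
}
mean_alleles, per_locus = mean_alleles_per_locus(toy_genotypes)
print(per_locus, mean_alleles)
```

Because the count depends on how many accessions are sampled, values from panels of different sizes are not directly comparable without some form of correction.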

Foxtail millet is closely related to the hardy weed Setaria viridis, which is assumed to be its progenitor. S. viridis, or green foxtail, often exists in close proximity to its cultivated cousin and is problematic throughout Eurasia and North America with many reports of herbicide resistance (Morrison et al., 1989; Marles et al., 1993; Heap, 1997). Some evidence suggests genetic clustering across foxtail species is dictated primarily by region and not taxonomy, implying that interspecific hybridization between S. viridis and modern S. italica is common (Li et al., 1942; Jusuf and Pernes, 1985). Indeed, deliberate crosses between these species have resulted in resistance to a variety of herbicides (Darmency and Pernes, 1985, 1989; Wang et al., 1996; Wang and Darmency, 1997). However, agronomic traits in many of the crosses were closer to the weedy variety of Setaria; hybrids displayed seed shedding, spindly shoot tissue, and low yield as well as the fertility losses associated with hybridization. These reports highlight the possibility of using interspecific hybridization to study different agronomically valuable traits from wild millet relatives in a domesticated genetic background for future breeding applications.

After its domestication in China, foxtail millet spread throughout Asia, Europe, and eventually to North America (Jusuf and Pernes, 1985). Its large range has resulted in three different races, each with multiple subraces. Moharia is common in Europe, Russia, and the Middle East. Maxima can be found in Eastern China, Georgia, Japan, Korea, Nepal, northern India, and the USA where it was introduced for the purposes of animal feed. Indica predominates in southern India and Sri Lanka (**Table 1**) (Jusuf and Pernes, 1985).

An interesting feature of modern foxtail millet diversity is the global distribution of two phenotypically different varieties: the waxy and non-waxy grain types (Van et al., 2008). Waxiness in cereal grains is caused by lowered levels of amylose in the grain endosperm, which gives the grain a sticky texture when cooked (Van et al., 2008). Geographical occurrence of these two groups of foxtail millet varieties coincides with the ethnological preferences of local human populations. In East and South East Asia, some local communities are known to prefer sticky cereals (e.g., glutinous rice) driven by the use of chopsticks by these cultures; it is in these regions that the waxy millet phenotype can be found (Van et al., 2008). The non-waxy grain phenotype is more widespread, cultivated throughout Eurasia and parts of Africa (Kawase et al., 2005). Control of the phenotype is due to transposable-element (TE) insertion events interrupting amylose production, and foxtail millet has been suggested as a model for studying TE-mediated evolution (Kawase et al., 2005).

Like finger millet, there is an abundance of foxtail millet germplasm available to the scientific community (**Table 2**). Due to its importance in China, the Chinese National Genebank (CNGB) appears to maintain the largest collection by far, totalling 26,670 accessions as of 2012 (Wang et al., 2012). ICRISAT holds germplasm from 26 countries, and genebanks in Japan (National Institute of Agrobiological Sciences, NIAS) and the USA (USDA, Plant Genetic Resources Conservation Unit, PGRCU) ensure access to a wide range of foxtail millet diversity. Some core and mini-core collections have been assembled (Upadhyaya et al., 2008, 2011). However, considering the wide range of foxtail millet cultivation and the diversity of accessions, many more core collections should be generated, especially in China (Li et al., 1998) to facilitate breeding efforts. Diverse foxtail millet landraces may provide valuable alleles to assist in these breeding efforts. For example, landraces from the north of China are typically well-adapted to cold weather with short growing seasons, and are highly sensitive to light and temperature changes while those from southern regions grow better in high temperatures and humidity (Wang et al., 2012), demonstrating the types of useful alleles that may exist for this crop.

Foxtail millet has enjoyed more genetic characterization than the other small millets. Recently there has been a push to utilize the species as a model system for biofuel grasses. It is closely related to the bioenergy crops switchgrass (Panicum virgatum), napier grass (Pennisetum purpureum), and pearl millet (Pennisetum glaucum) (Doust et al., 2009). Foxtail millet has several characteristics that are valued in a model system—a small genome (∼490 Mbp), small plant size, and a quick generation time, unusual for C4 grasses. As a result, two full reference sequences have been compiled using genotypes Yugu1 and Zhang Gu (Bennetzen et al., 2012; Zhang et al., 2012). In these studies, the authors also created high-density linkage maps with another foxtail millet line and green foxtail, and examined the evolution and mechanisms of C4 photosynthesis in detail (Bennetzen et al., 2012; Zhang et al., 2012).

Instigated by the newly available sequence data, research in foxtail millet molecular genomics continues to rapidly progress. Many genetic markers have been reported and utilized in foxtail millet to generate maps, analyze DNA polymorphisms, evolutionary origin(s), and relatedness to other cereals for future crop improvement efforts (Wang et al., 1998; Schontz and Rether, 1999; Jia et al., 2009; Yadav et al., 2014a). A large library of markers consisting of intron-length polymorphisms (ILPs) has been generated, in part enabled by an abundance of EST data which can be used to generate flanking primers (Muthamilarasan et al., 2014). Initial work toward marker-based, high-throughput genotype identification has been accomplished (Gupta et al., 2012; Pandey et al., 2013). For example, an allele-specific single nucleotide polymorphism (SNP) coding for a dehydration responsive element binding (DREB) gene was shown to associate with stress tolerance (Lata et al., 2011). The SNP has potential in marker-assisted breeding selection, and was validated in a foxtail millet core collection in which the allele was found to account for 27% of total variation of stress-induced lipid peroxidation (Lata and Prasad, 2013). In an association mapping study, eight SSR markers were found to correlate with nine different agronomic traits (Gupta et al., 2014b). ESTs and peptides have been identified which are differentially expressed between salt tolerant and non-tolerant cultivars (Veeranagamallaiah et al., 2008; Puranik et al., 2011). A genome-wide transcriptome has been generated after exposure to drought stress, in which regulatory roles of small interfering RNAs and non-coding RNAs were described (Qi et al., 2013). From this study, 2824 annotated genes were identified with drought-responsive expression patterns. Such comprehensive studies should be extended to other stress pathways for better characterization of available foxtail millet germplasm. 
The data might also be used to design useful millet microarrays. Using the reference genomes described above, research groups have begun to re-sequence genotypes of foxtail millet and identify vast libraries of SNPs and other markers (Bai et al., 2013; Jia et al., 2013). This information has been used to classify landraces according to flowering time, yield attributes, waxy character, and other agronomically important traits (Bai et al., 2013; Jia et al., 2013). The re-sequencing of diverse foxtail millet germplasm should continue as a strategy to aid marker-assisted breeding efforts. Much work has also been accomplished on the behavior of transcription factors in foxtail millet under a variety of stressful conditions, details of which have been conveniently compiled in the database "FmTFDb" (Bonthala et al., 2014). The availability of these data is expected to greatly accelerate functional genomics in all small millet species.

Lastly, transgenic protocols have been developed for foxtail millet, with both Agrobacterium (Wang et al., 2011) and callus bombardment methods reported (Kothari et al., 2005; Ceasar and Ignacimuthu, 2009; Plaza-Wüthrich and Tadele, 2012), enabling some potentially useful molecular analyses. In one study, a pollen-specific gene has been altered to impair anther function by a co-suppression mechanism (Qin et al., 2008) which might be adapted for the development of male-sterile plants, valuable in breeding foxtail millet hybrid varieties.

### Kodo Millet (Paspalum scrobiculatum)

Kodo millet was domesticated roughly 3000 years ago in India (**Figure 2**), the only country today where it is harvested as a grain in significant quantities, mainly on the Deccan plateau (**Table 1**) (de Wet et al., 1983b). The grain contains a diverse range of high-quality protein (Geervani and Eggum, 1989; Kulkarni and Naik, 2000), and has high anti-oxidant activity (anti-cancer) even when compared to other millets (Hegde and Chandra, 2005; Hegde et al., 2005; Chandrasekara and Shahidi, 2011b). Like finger millet, kodo is rich in fiber and hence may be useful for diabetics (Geervani and Eggum, 1989). It is drought tolerant and can be grown in a variety of poor soil types from gravelly to clay (de Wet et al., 1983b; M'Ribu and Hilu, 1996). Most genotypes take 4 months to mature (de Wet et al., 1983b). Like foxtail millet, a weedy counterpart of kodo exists and is problematic throughout old-world farming systems, especially in damp areas (de Wet et al., 1983b; Becker and Johnson, 2001). It is believed that kodo was probably first harvested as a weed alongside other cereals like rice, perhaps leading to multiple domestication events of the millet across its current range (de Wet et al., 1983b). This practice continues in parts of Africa where the weed is also sometimes harvested during famine (de Wet et al., 1983b; Neumann et al., 1996; Ogie-Odia et al., 2010). In Africa, kodo is referred to as black rice or bird's grass (M'Ribu and Hilu, 1996). Limited molecular marker analysis has shown that kodo millet genotypes cluster by African vs. Indian origin (M'Ribu and Hilu, 1996).

Kodo millet is divided into three races (regularis, irregularis, and variabilis) based on panicle morphology (de Wet et al., 1983b). In southern India, there are small (karu varagu) and large seeded (peru varagu) varieties recognized, often grown together in the same field (de Wet et al., 1983b). General morphological variability is high, with large variance reported in many phenotypic parameters such as time before flowering, tiller number, and yield (Subramanian et al., 2010; Upadhyaya et al., 2014).

Kodo millet is a crop that might be described as incompletely domesticated, with some authors calling the cereal "pseudocultivated" (de Wet, 1992; Blench, 1997). As such, systematic breeding of kodo millet remains neglected, but limited efforts have shown promise. Various metrics of plant productivity including dry fodder yield, plant height, and grain yield have revealed good heritability; improvement of these traits has been observed through breeding, with four highly productive genotypes thus far identified (Upadhyaya et al., 2014). Pathogen resistance has been noted as a good breeding target, in particular resistance to smut (Sorosporium paspali and Ustilago spp.) and rust (Puccinia substriata Ellis and Barth), which are both major hindrances of kodo yield (Upadhyaya et al., 2014). Another potential target for breeding may be resistance to the fungi Aspergillus flavus and Aspergillus tamarii, which produce cyclopiazonic acid that can cause sleepiness, tremors, and giddiness in those who consume infected grain, a condition known as "kodua poisoning" (Rao and Husain, 1985). Grain lodging can occur before harvest; an earlier maturity time might therefore also be targeted (de Wet et al., 1983b). It is also interesting that some cultivated landraces have maintained the perennial nature of their wild ancestor and continue to initiate culms following the maturity of older shoots (de Wet et al., 1983b). If this regeneration trait can be encouraged through breeding and hybridization, it may reduce fertilization inputs and labor.

Unfortunately, no genetic or molecular maps of the kodo millet genome appear to be available (Dwivedi et al., 2012), likely because of the problem of persistent cross-hybridization with its wild relatives. Molecular markers for kodo millet are few, but have been utilized in characterizing diversity and phylogeny (M'Ribu and Hilu, 1996; Kushwaha et al., 2015). There has been some preliminary work in miRNA target site prediction using ESTs from kodo (Babu et al., 2013). In this study, target genes were found to be involved in carbohydrate metabolism, cellular transport, and as structural proteins, but a severe lack of kodo DNA information limited the work; the closely-related rice genomic sequence was used for binding-site prediction. With respect to transgene methodology for kodo, the media conditions for callus regeneration protocols have been investigated; regenerated plantlets were successfully grown to maturity in soil (Ceasar and Ignacimuthu, 2010).

ICRISAT conserves 656 accessions of kodo millet, and a core collection has been established that reflects the phenotypic diversity of the entire collection (Upadhyaya et al., 2014). Some universities also maintain large kodo millet seed banks, a good example being the University of Agricultural Sciences in Bangalore (Ceasar and Ignacimuthu, 2010). As the crop is not significant outside of India, there are few reports of other banks with substantial numbers of accessions (**Table 2**). However, some organizations do keep collections for the purposes of studying the species as a weed as noted above; the US Department of Agriculture has 336 accessions in their National Plant Germplasm System (GRIN)<sup>2</sup>. While seed of African origin does exist in some of these sources, it is rare. Better coverage and ecological exploration of the African continent would help to reveal and preserve diversity of valuable traits which might otherwise be missed by international scientists.

### Proso Millet (Panicum miliaceum)

Proso millet, also called broomcorn and common millet, was domesticated in Neolithic China as early as 10,000 years ago (**Figure 2**) (Lu et al., 2009). The sequence diversity within proso provides evidence for a single site of domestication in the Chinese Loess Plateau (M'Ribu and Hilu, 1994; Hu et al., 2008, 2009). Proso millet expanded across Eurasia and was introduced to North America in the 1700s where it is now primarily used for animal fodder and birdseed (Bagdi et al., 2011). Proso is the true millet referenced in classical European and Middle Eastern sources, referred to by ancient Romans as "milium" (Smith, 1977). Archeological evidence of proso in Eastern Europe dating to 8000 years ago raises the possibility of a secondary independent domestication event, but additional study is needed to confirm this observation (Hunt et al., 2008, 2011). Proso millet was important in the diets of humans across Eurasia prior to the introduction of wheat, barley and potatoes (Kalinova and Moudry, 2006). Today it is only consumed in significant quantities in India (where it is known as pani varagu in Tamil), Nepal, western Myanmar, Sri Lanka, Pakistan, and South East Asian countries (Nirmalakumari et al., 2008). A weedy variety is widespread, which is likely the result of field escape and not due to the spread of the wild ancestor (McCanny and Cavers, 1988). Recent molecular analysis using chromosomal in situ hybridization has implicated Panicum capillare or a close relative as one of the genetic ancestors of proso (Hunt et al., 2014).

<sup>2</sup>http://www.ars-grin.gov/

The benefits of consuming proso include its high protein content which ranges from 11.3 to 17% of grain dry matter (Kalinova and Moudry, 2006). Genotypic diversity in protein content and amino acid profile has been observed (Kalinova and Moudry, 2006). Like other small millets, the applicability of the grain in preventing cancer, heart disease, and managing liver disease and diabetes has been investigated with promising results (Nishizawa and Fudamoto, 1995; Nishizawa et al., 2002; Park et al., 2014; Zhang et al., 2014). There may be additional untapped phytochemical value as indicated by a wide range of genotype-specific grain colors (Zhang et al., 2014).

Proso millet is well-adapted to dry sandy soils, and might be the earliest dryland-farming crop in East Asia (Baltensperger, 2002; Lu et al., 2009). It may have the lowest water requirement of any cereal, able to produce harvestable grain with only 330–350 mm of annual rainfall (Baltensperger, 2002; Seghatoleslami et al., 2008; Hunt et al., 2011). Proso millet matures quickly within 60–90 days, a feature that contributes to its drought resistance and also makes it a good catch-crop (Baltensperger, 2002; Hunt et al., 2014). Genotype has been shown to affect drought tolerance by influencing harvest-index, yield, and water use efficiency (WUE) (Seghatoleslami et al., 2008). In that study, a hybrid genotype outperformed local varieties, validating the potential of breeding proso millet with high WUE. Preliminary work in characterizing proso miRNAs has been accomplished with the goal of understanding mechanisms responsible for the cereal's impressive drought resistance (Wu et al., 2012). Despite its drought tolerance, proso is best adapted to temperate latitudes, unlike other small millets. It grows further north than any other millet, up to a latitude of 54°N, and at elevations as high as 3500 m (Baltensperger, 2002). Substantial salinity tolerance has been reported in proso but with significant varietal diversity, with some especially tolerant varieties reported (Sabir et al., 2011; Liu et al., 2015). A higher sodium concentration in roots compared to shoots has been suggested as a biomarker for future breeding efforts (Sabir et al., 2011; Liu et al., 2015).
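The two derived quantities named above can be illustrated with a minimal sketch. It assumes the common agronomic definitions (harvest index as grain yield divided by total above-ground biomass, and WUE as grain yield per millimetre of water used), which may differ from the exact definitions used by Seghatoleslami et al. (2008). The genotype names and all numbers are hypothetical and serve only to show the arithmetic.

```python
def harvest_index(grain_yield_kg_ha, biomass_kg_ha):
    """Fraction of above-ground biomass that ends up as grain (assumed definition)."""
    return grain_yield_kg_ha / biomass_kg_ha


def water_use_efficiency(grain_yield_kg_ha, water_used_mm):
    """Grain yield per mm of water used, in kg/ha/mm (assumed definition)."""
    return grain_yield_kg_ha / water_used_mm


# Hypothetical genotypes grown with the same 340 mm of water.
TRIALS = {
    "hypothetical_hybrid": {"grain": 1200.0, "biomass": 4000.0, "water": 340.0},
    "hypothetical_landrace": {"grain": 850.0, "biomass": 3400.0, "water": 340.0},
}

for name, t in TRIALS.items():
    hi = harvest_index(t["grain"], t["biomass"])
    wue = water_use_efficiency(t["grain"], t["water"])
    print(f"{name}: harvest index = {hi:.2f}, WUE = {wue:.2f} kg/ha/mm")
```

With equal water supply, a genotype that yields more grain has a proportionally higher WUE under this definition, which is why yield, harvest index and WUE tend to be reported together.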

Cultivated proso millet is divided into five races (Reddy et al., 2007). Race miliaceum resembles wild proso with large, open inflorescences and sub-erect branches with few subdivisions. Patentissimum is very similar to miliaceum with narrow, diffuse panicle branches. These two races are found across the entire Eurasian range of proso, and are considered primitive. Contractum, compactum, and ovatum have more compact inflorescences which are drooped, cylindrical, and curved, respectively (Reddy et al., 2007). ICRISAT holds 842 accessions from all five races (**Table 2**) (Reddy et al., 2007). The diversity of this collection has been characterized in terms of flowering time, plant height, panicle exsertion, and inflorescence length (Reddy et al., 2007). Other significant collections of proso are summarized in **Table 2**. Perhaps the largest collection of proso is held by the N.I. Vavilov All-Russian Scientific Research Institute of Plant Industry in St. Petersburg, with roughly 8778 accessions as of 2012 (Dwivedi et al., 2012). Aside from ICRISAT (Upadhyaya et al., 2014), few proso millet core collections appear to exist for breeding purposes. Preliminary diversity clustering based on agronomic traits was performed on the Chinese collection for the purpose of SSR-based characterization (Hu et al., 2009). Perhaps the Chinese subset of 118 landraces could be repurposed and slightly modified to become a true core collection. Explant regeneration techniques have been published for proso, allowing transgenic work to be explored in the future (Plaza-Wüthrich and Tadele, 2012).

The genetic sequence diversity of proso has been examined to a limited degree. The sequence diversity is moderate to high (Karam et al., 2006; Cho et al., 2010; Hunt et al., 2011), perhaps due to continuing hybridization with wild varieties (Colosi and Schaal, 1997). Molecular markers in proso have often been derived from the available sequence data of related species including switchgrass, rice, wheat, barley and oat (Hu et al., 2009; Rajput et al., 2014). AFLP markers have shown promise in grouping proso based on biotype, but were insufficient in differentiating between wild and cultivated varieties (Karam et al., 2004). To the best of our knowledge, no genetic or molecular maps of the proso millet genome are available (Dwivedi et al., 2012).

Like foxtail millet, proso has waxy grain varieties, which are preferred in some areas of Asia because of their glutinous nature, again to facilitate consumption with chopsticks (Graybosch and Baltensperger, 2009). Clustering by geographical sequence diversity corresponds with this regional preference (Hu et al., 2008). Like other glutinous cereals, waxy types of proso have no detectable amylose in the seed endosperm, due to a mutation in the Waxy gene (Hunt et al., 2010). Molecular markers have been developed to identify these waxy genotypes and breed glutinous varieties that are highly valued by consumers (Araki et al., 2012). Proso has been compared to maize in its ethanol production ability, and fermentation efficiency was found to be the highest in waxy varieties (Rose and Santra, 2013). The authors suggest that encouraging the fermentation of proso millet could help stabilize its price in the USA where it is already grown for birdseed and fodder. Finally, proso millet has been utilized as a model organism for C4 carbon metabolism, specifically in the study of aspartate aminotransferase and malate translocation which both contribute to the higher efficiency of C4 photosynthesis (Taniguchi et al., 1995; Taniguchi and Sugiyama, 1996, 1997; Sentoku et al., 2000).

## Barnyard Millet (Echinochloa spp.)

Although sometimes referred to as a single taxonomic group, barnyard millet is composed of two separate species belonging to the genus Echinochloa. Echinochloa esculenta (syn. Echinochloa utilis, Echinochloa crus-galli) is cultivated in Japan, Korea, and the northeastern part of China while Echinochloa frumentacea (syn. Echinochloa colona) is found in Pakistan, India, Nepal, and central Africa (**Table 1**) (Yabuno, 1987; Wanous, 1990). Both species have overlapping morphological traits that make differentiation problematic. Visual identification is only possible based on the presence or absence of an awn and subtle differences in spikelet and glume morphology (de Wet et al., 1983c). Consequently, the common names Japanese and Indian barnyard millet have been suggested to simplify research and investigation of their phylogeny (Yabuno, 1987). Despite having such strong phenotypic similarities, cytology and marker work have shown the two millets to be genetically distinct; F<sub>1</sub> hybrids of the two species are sterile (Yabuno, 1962; Hilu, 1994). Both species are known for their fast maturity, high storability, and the ability to grow on poor soil (Yabuno, 1987). ICRISAT currently holds 743 accessions of these barnyard millets from nine countries, with a core collection of 89 varieties recently established (Upadhyaya et al., 2014). Other significant collections can be found at NIAS and the USDA (Hilu, 1994). Sequence data and genetic map availability for both millets are generally low (Dwivedi et al., 2012). Initial transgenic work has been reported on the Japanese variety, but callus regeneration protocols have been reported for both species (Gupta et al., 2001; Kothari et al., 2005).

In addition to the two cultivated species, research has also been conducted on 20–30 wild Echinochloa barnyard millet relatives, some of which have agriculturally interesting traits including rice-mimicry and perennial growth habit. Hybridization within the genus is rampant, and is thought to have contributed to the evolution and current diversity of barnyard millets (Hilu, 1994; Yamaguchi et al., 2005).

### Japanese Barnyard Millet (Echinochloa esculenta)

Japanese barnyard millet originated in eastern Asia (**Figure 2**) from its wild counterpart E. crus-galli, "barnyard grass" (Yabuno, 1987; Hilu, 1994). It can be differentiated from the Indian species by its larger, awned spikelets with glumes that appear papery instead of membranous (de Wet et al., 1983c). It is tolerant to cold and was historically grown in areas where the climate or land did not suit rice production, particularly in the north of Japan (Yabuno, 1987). In Japan, folklore states that barnyard millet originated from the dead body of a god. Along with proso millet, it makes up part of the "Gokoku," a general term for five staple grains (Yabuno, 1987). Japanese barnyard millet has been found in the coffins of 800-year-old mummies from the Iwate prefecture, and documents from the 1700s list different cultivars organized by maturity time (Yabuno, 1987). Its historical importance might be attributed to the relief it provided in times of rice crop failure. However, Japanese barnyard millet production has sharply decreased in the last century due to the introduction of cold-tolerant rice varieties and better irrigation practices (Yabuno, 1987). Nevertheless, today it remains the most common millet consumed in Japan, with reported health benefits common to many of the small millets such as its ability to lower plasma glucose concentration, insulin, adiponectin and tumor necrosis factor-α when fed to diabetic mice (Nishizawa et al., 2009). The protein content of Japanese barnyard millet is twice as high as that of rice (Yabuno, 1987). Across genotypes there is diversity in the levels of proteins and healthy lipids, with one genotype suggested as having particularly beneficial antioxidant activity (Kim et al., 2011).

Unlike other small millets consumed in East Asian countries such as foxtail and proso, barnyard millet has no glutinous variety. However, some landraces have been identified which contain very low levels of amylose due to a deletion in one of three waxy genes. One such landrace, "Noge-Hie," was treated with γ-radiation resulting in progeny lacking the Waxy (Wx) protein (Hoshino et al., 2010). The trait was stably inherited, and this new glutinous variety ("Chojuromochi" in Japan) might be useful for increasing demand for millet products among Japanese consumers.

The morphological and physiological diversity of Japanese barnyard millet is suggested to be high (Nozawa et al., 2006). Flowering time, inflorescence shape, and spikelet pigmentation, among other features, vary across landraces. The species can be grouped into the races utilis and intermedia (Upadhyaya et al., 2014). Molecular diversity studies for Japanese barnyard millet have begun using the non-coding regions of chloroplast DNA as well as nuclear molecular markers (RAPDs, SSRs) and isozymes, although these studies appear to be limited in their sample number (Hilu, 1994; Nakayama et al., 1999; Yamaguchi et al., 2005; Nozawa et al., 2006). Though DNA sequence information in Japanese barnyard millet is otherwise lacking, studies performed on the closely related barnyard grass (E. crus-galli) have generated important sequence information. For example, extensive transcriptomic profiling and annotation have been performed on herbicide resistant varieties of barnyard grass resulting in 74 ESTs, which might be adapted to the study of the cultivated relative (Li et al., 2013; Yang et al., 2013).

### Indian Barnyard Millet (Echinochloa frumentacea)

Indian barnyard millet, or sawa, was domesticated in India (**Figure 2**) across its current range from its wild counterpart E. colona, "jungle rice" (Yabuno, 1987; Hilu, 1994). In India, this millet is either harvested as a weed along with a main crop or is grown in a mixture with finger millet and foxtail millet (Gupta et al., 2009b). It is generally cultivated on hilly slopes in tribal areas where few other agricultural options exist and is indispensable in the northwest Himalayan region (Gupta et al., 2009b). Quick maturity makes the species well-adapted to regions with little rainfall (Channappagoudar et al., 2008). Indian barnyard millet contains antifeedants which are present at concentrations higher than in rice, and it displays resistance to the feeding activity of brown planthopper (Kim et al., 2008). In central Africa it is fermented to make beer or used for food, and has been found in the intestines of pre-dynastic Egyptian mummies (de Wet et al., 1983c). When fed to diabetic humans, significant reductions of blood glucose levels and LDL cholesterol have been reported (Ugare et al., 2014).

Significant phenotypic variation is observed in Indian barnyard millet. Four morphological races (laxa, robusta, intermedia, and stolonifera) were recognized by de Wet in 1983 based on the lengths of flag leaves, peduncles, inflorescences, racemes, as well as plant height and basal tiller number. Race laxa is endemic to the Sikkim Himalayas and only available in a few collections (de Wet et al., 1983c). More recently, a variety of morphological parameters were examined, and principal component analysis (PCA) indicated three morphotypes corresponding to races robusta, intermedia, and stolonifera; laxa was absent, suggesting that efforts must be made to collect more of this race (Gupta et al., 2009b). The authors saw high variability in grain yield, straw yield, and number of productive tillers. They report that the number of racemes, flag leaf width, and internode length showed high correlation with grain yield and should be considered by breeders when performing selections, and promising donor genotypes of these and other traits have been reported (Channappagoudar et al., 2008; Gupta et al., 2009b). Variation across genotypes in photosynthesis and related traits such as transpiration and stomatal conductance has also been observed (Subrahmanyam and Rathore, 1999). Grain smut (Ustilago panici-frumentacei) is a major hindrance of yield, but progress has been made in advanced breeding lines which display low susceptibility when compared to other accessions in which high variability remains (Gupta et al., 2009a).

An early study using RAPD markers suggested that the sequence diversity of Indian barnyard millet is significantly higher than that of the Japanese species, perhaps because of multiple domestication events in different locations across India (Hilu, 1994). Variation of markers was 44%, which is high when considering the inbreeding nature of the crop (Hilu, 1994). However, more comprehensive studies are needed that utilize a greater number of molecular markers and genotypes. Similarly, DNA sequence analyses are lacking in Indian barnyard millet.

## Little Millet (Panicum sumatrense)

Also called sama, little millet is cultivated to a limited extent in India, Sri Lanka, Pakistan, Myanmar, and other South East Asian countries (**Table 1**) (Hiremath et al., 1990). In India it is important to tribes of the Eastern Ghat mountains and grown in combination with other millets (Hiremath et al., 1990). Little millet is a domesticated form of the weedy species Panicum psilopodium (de Wet et al., 1983a). The chromosomes of hybrids of Panicum sumatrense and P. psilopodium pair almost perfectly with only a single quadrivalent, indicating that divergence between the two species may have initially occurred through a single reciprocal translocation (Hiremath et al., 1990). Hybrid plants are fertile and vigorous with non-shattering spikelets, and thus introgression of genes between the two species is common (Hiremath et al., 1990). This hybridization ability combined with its wide range of cultivation across India suggests that little millet was domesticated independently several times, although exact dates remain undetermined (de Wet et al., 1983a). Little millet is comparable to other cereals in terms of fiber, fat, carbohydrates, and protein, and rich in phytochemicals including phenolic acids, flavonoids, tannins, and phytate (Pradeep and Guha, 2011). Like many other small millets, it is drought, pest and salt tolerant (Sivakumar et al., 2006b; Bhaskaran and Panneerselvam, 2013; Ajithkumar and Panneerselvam, 2014). The time to maturity for most cultivars is about 90 days (de Wet et al., 1983a).

Little millet is divided into two races based on panicle morphology, nana and robusta. Race nana matures faster and produces less biomass than robusta (de Wet et al., 1983a). In a tribal area of the Indian Kolli hills, diversity among locally grown landraces of little millet was found to be high for all morphological traits measured both within and between landraces despite a small sampling area (Arunachalam et al., 2005). High diversity, heritability and genetic advancement were observed in terms of yield and productive tillers in a collection of 109 landraces, meaning that the crop might be a good candidate for varietal development (Nirmalakumari et al., 2010). A different collection of 460 accessions of little millet held by ICRISAT displayed genetic variation for most of the traits examined (Upadhyaya et al., 2014). A core collection of 56 genotypes was identified which was representative of the entire seed bank. Increased heritable lodging resistance has been introduced to a population of little millet with γ-ray mutational breeding (Nirmalakumari et al., 2007).

The molecular biology of little millet has been explored to a limited extent. As part of a study to identify seven millet species based on their chloroplast DNA, the trnS-psbC gene region was characterized and subjected to RFLP analysis (Parani et al., 2001). This study showed that it was possible to distinguish all the millet species when the enzymes HaeIII and MspI were used in combination. To investigate mechanisms behind little millet's high prolamine content, a zein-like storage protein was isolated and sequenced (Sivakumar et al., 2006a). Furthermore, α-amylase from little millet has been isolated and characterized in terms of biomass and optimum pH (Usha et al., 2011). To the best of our knowledge, no protocols for callus regeneration or transgenic technology have been published. Little millet is perhaps the least studied of the small millet species and there is much that requires investigation, including the establishment of a genetic map and sequenced genome.
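The logic of distinguishing species with two enzymes used in combination can be illustrated with a toy in silico digest. This is a minimal sketch, not the method of Parani et al. (2001): the three sequences are invented for illustration and are not real trnS-psbC data, and the function names are hypothetical. The only biological facts assumed are the standard recognition sites (HaeIII cuts GG^CC, MspI cuts C^CGG).

```python
# Recognition site and the offset within the site after which the enzyme cuts.
ENZYMES = {
    "HaeIII": ("GGCC", 2),  # GG^CC
    "MspI": ("CCGG", 1),    # C^CGG
}


def cut_positions(seq, enzyme):
    """Return every cut position of one enzyme along the top strand."""
    site, offset = ENZYMES[enzyme]
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_lengths(seq, enzymes):
    """Fragment sizes (largest first) after digesting with all listed enzymes."""
    cuts = sorted({pos for enzyme in enzymes for pos in cut_positions(seq, enzyme)})
    bounds = [0] + cuts + [len(seq)]
    return sorted((end - start for start, end in zip(bounds, bounds[1:])), reverse=True)


# Invented 22-bp sequences: b has lost the MspI site, c has lost the HaeIII site.
SEQS = {
    "species_a": "ATATGGCCATATATCCGGATAT",
    "species_b": "ATATGGCCATATATCTGGATAT",
    "species_c": "ATATGACCATATATCCGGATAT",
}

for name, seq in SEQS.items():
    print(
        name,
        "HaeIII:", fragment_lengths(seq, ["HaeIII"]),
        "MspI:", fragment_lengths(seq, ["MspI"]),
        "both:", fragment_lengths(seq, ["HaeIII", "MspI"]),
    )
```

In this toy case HaeIII alone cannot separate species_a from species_b, and MspI alone cannot separate species_a from species_c, whereas the double digest gives three distinct fragment patterns.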

## Trends, Gaps and Recommendations on How to Foster Diversity within Orphaned Small Millets for the New Green Revolution

The World Summit on Food Security has set a target of 70% more food production by 2050, requiring annual increases of 44 million tons, 38% above current annual increases (Tester and Langridge, 2010). Climate change will cause additional difficulties as many regions are becoming drier with increasingly severe weather patterns (Dai, 2011), and fossil-fuel based nitrogen use is increasingly restricted by legislation intended to slow climate change (Tester and Langridge, 2010). The small millets have the potential to meet these challenges, given their drought tolerance and ability to grow under low input conditions, along with other health-promoting traits valued by humans. Unfortunately, the small millets suffer from low yields (only 0.8 tons grain per hectare) (Plaza-Wüthrich and Tadele, 2012). For the small millets to succeed, priority traits for breeding will need to include improving yield under stress conditions (low input, salt, drought, pests, pathogens). Fortunately, an attractive feature of the small millets is that they continue to be cultivated in remote areas, which has preserved their biodiversity, giving breeders potential access to unique genes for crop improvement. Due to limited resources, however, efforts thus far have concentrated primarily on characterizing and reporting the extensive diversity present in seed banks, with few genetic and genomic tools available to exploit this biodiversity for crop improvement. A further challenge in some species (e.g., foxtail millet) is persistent cross-hybridization with wild relatives. Improved varieties of small millets could play a role in the "New Green Revolution," a term coined to reflect novel strategies which will be required to deal with complex challenges in developing nations including increasing population and ever-diminishing arable land (Den Herder et al., 2010).
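As a quick check on the production figures quoted above, and assuming that "38% above current annual increases" means the required annual increase is 1.38 times the current one (the symbol $\Delta_{\text{current}}$ is introduced here only for illustration), the implied current annual increase is:

$$
\Delta_{\text{current}} \approx \frac{44\ \text{million tons}}{1.38} \approx 31.9\ \text{million tons per year}
$$

Under that reading, production growth would need to rise by roughly 12 million tons per year over its current pace.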

#### Exploiting Diversity within Seed Banks

Diversity is the basis of crop improvement. As described in this review, the small millets possess considerable morphological and genetic sequence variation that can be used by breeders to generate improved varieties. Seed banks across the globe conserve collections of small millets as shown in **Table 2**, but a challenge is that less diverse germplasm is available for species that are cultivated in a limited geographic region. For example, little millet, which is mainly grown in the Eastern Ghats of India, is represented by a collection of only 466 accessions (Upadhyaya et al., 2014). By contrast, ICRISAT currently holds 6804 accessions of finger millet, a crop widely grown on 1.8 million ha throughout India with extensive cultivation in Eurasia and Africa<sup>3</sup>. Core collections follow the same patterns, with several reported for finger millet but only one for little millet (Upadhyaya et al., 2014). It is essential, however, that core collections be established for all of the millets, especially at larger seed banks, to facilitate efficient trait selection. As modern small millet cultivation for human consumption typically occurs in poor nations (with some exceptions), the seed bank infrastructure and associated reporting in the scientific literature and in online databases are sparse and difficult for breeders from foreign nations to access. Furthermore, trait descriptions for each accession are often not reported. Improved funding, coordination, communication and sharing of genetic resources are needed to overcome these problems.

#### Harvesting Genes from the Wild

Though interspecific hybrids between some cultivated and wild millets can be problematic, the wild relatives of the small millets may serve as donors of useful genes for crop improvement (e.g., herbicide resistance). To enable breeding, the hybridization ability of Indian and Japanese barnyard millet (Yabuno, 1962; de Wet et al., 1983c) may thus serve as an advantage. However, full realization of this breeding potential may require embryo rescue techniques to bring weak F<sub>1</sub> progeny to adulthood (Plaza-Wüthrich and Tadele, 2012) and better access by breeding programs to wild germplasm (Hajjar and Hodgkin, 2007). Today, the wild germplasm is sometimes studied only from a weed science perspective (Peterson and Nalewaja, 1992; Dilday et al., 2001).

## Combining Traditional Knowledge of Diversity with Modern Techniques

Small millets are often grown in remote regions of the world, and hence significant traditional knowledge of millet diversity persists that can serve as a valuable resource for crop improvement. Isolated farming communities often cultivate dozens of locally known millet landraces that are valued for a wide variety of traits (e.g., short duration to combat delayed rains as the result of climate change). Farmers use a complex system to classify their landraces, and in some instances this classification is considered more informative than scientific phylogeny (Rengalakshmi, 2005). On the opposite end of the technological spectrum, research using simple DNA barcoding in lieu of larger numbers of molecular markers is being attempted to classify the small millets down to the landrace level (Newmaster et al., 2013). A unique opportunity in the small millets is combining traditional knowledge with molecular techniques to characterize diversity for the purposes of crop improvement.

## The Need for Complete Linkage Maps, Molecular Markers and Genome Sequences

As described above, in some species, markers including RFLPs, AFLPs, ESTs, and SSRs have been linked to beneficial traits including stress tolerance (Lata et al., 2011). Other, less conventional selective biomarkers have been suggested including differing ratios of transcription factors under stress (Gupta et al., 2014a). However, several small millets lack molecular and genetic markers (e.g., little millet and kodo millet) and no robust linkage maps appear to exist (Dwivedi et al., 2012). Genome and EST sequencing efforts will assist in the development of molecular markers in these species, along with using reference genomes (e.g., from major cereal relatives) to identify orthologous markers. Currently, only the foxtail millet genome has been sequenced and published (Bennetzen et al., 2012; Zhang et al., 2012).

### Advances in Transgene Research and Molecular Mechanisms

As noted in this review, detailed protocols for callus regeneration and transgenic technology have been published for all small millet species except little millet (Kothari et al., 2005; Ceasar and Ignacimuthu, 2009; Plaza-Wüthrich and Tadele, 2012). Since women who farm small millets bear the drudgery of removing weeds manually (Rengalakshmi, 2005), an attractive transgene trait may be glyphosate herbicide resistance (RoundupReady).

As the small millets are respected by traditional farmers for their extreme abiotic and biotic stress resistance, an understanding of the molecular mechanisms underlying these traits may lead to agronomic improvement of related major cereals. Unfortunately millet diversity remains largely unexplored at the level of molecular mechanism, with the exception of a limited number of studies noted earlier. One especially attractive target will be to understand the ability of barnyard millet to grow under extremely low nitrogen conditions.

<sup>3</sup>http://www.icrisat.org/crop-fingermillet.htm

FIGURE 3 | Indigenous technologies and practices of modern small millet farmers. (A) A typical granary in the Eastern Ghats of India used for small millet storage. (B) A woman farmer in Northern India holds a basket used for separating millet grain from chaff. She stands beside a manual millstone used for grinding millet grain into flour. Source: M. Raizada.

#### Socio-Economic Constraints

Despite the promise of the small millets, various socio-economic constraints have limited their consumption and hence contributed to a loss of cultivated diversity:

First, a major reason why the small millets are declining in production is that these crops are typically labor-intensive; women are often responsible for manual post-harvest processing, grain threshing and milling (Rengalakshmi, 2005) (**Figure 3**). To overcome this obstacle, inexpensive machinery is needed.

As noted above, a second challenge to greater adoption of small millets is their comparatively low yield (Plaza-Wüthrich and Tadele, 2012) as a result of the lack of scientific attention. However, the benefits of adding millet to the cropping system may outweigh the drawbacks of low yield (e.g., to combat local protein deficiency or crop failure in stressful environments) (Plaza-Wüthrich and Tadele, 2012). Furthermore, the small millets can be grown in very stressful environments, where major cereals may fail.

Third, family-farm-level diversity is heavily affected by community access to seed, which may be limited by current rural seed systems (Nagarajan et al., 2007). However, the presence of local seed markets has been found to increase millet diversity, indicating that such markets may serve as good points of introduction for improved varieties.

Finally, agricultural policies in different nations have negatively impacted the cultivation and research of small millets. Production in many areas is becoming displaced by mainstream cereals: in Kenya, the focus has been placed on the cultivation of maize instead of finger millet (Dida et al., 2008), while in Northern Japan, cold-tolerant rice has almost completely replaced barnyard millet (Yabuno, 1987). Reduced cultivation of these millets in financially-rich countries like Japan is problematic, because it may decrease global research funding for these crops. However, recent reports revealing medicinal and nutritional benefits of these species (absence of gluten, cancer inhibition, control of blood-glucose and cholesterol) might catalyze consumer interest and hence funding in the developed world (Hegde et al., 2005; Nishizawa et al., 2009; Kim et al., 2011; Zhang et al., 2014). Nevertheless, landraces from these areas should be preserved in seed banks to ensure their conservation.

Given these socio-economic constraints, millets must not be blindly advocated in the developing world in biodiversity strategies. Prior to their introduction, multi-disciplinary surveys must be undertaken with local farmers concerning their nutrition, seed availability, economy, climate, and other crops in the cropping system.

## Conclusions

Modern agriculture is characterized by dominance of a few crop species with a trend toward genetic homogenization as a result of the global exchange of alleles via breeding. In contrast, traditional farmer landraces of the small millets continue to be cultivated under relative genetic isolation, and hence provide living examples of genetic and phenotypic biodiversity in contemporary agriculture. The small millets are valued by traditional farmers for their nutritional content and health promoting properties, ability to grow under low input conditions and tolerance to extreme environmental stress, especially drought. In a world facing limiting natural resources and climate change, these crops thus hold tremendous potential as valuable instruments in the toolkit of the New Green Revolution. It is hoped that germplasm resources combined with modern genomic tools can help to accelerate exploitation of this biodiversity.

## Author Contributions

Both TG and MR conceived of the manuscript. TG wrote the manuscript and MR edited the manuscript. Both authors read and approved the final manuscript.

## Acknowledgments

We thank Dr. Malinda Thilakarathna (University of Guelph, Raizada Lab) for providing photos of millet cropping systems, and Dr. Kirit Patel (Canadian Mennonite University) for inspiring this review. TG received partial scholarship support from the Queen Elizabeth II Graduate Scholarship in Science and Technology and additional support from a grant to MR from the International Development Research Centre (IDRC) and the Canadian Department of Foreign Affairs, Trade and Development (DFATD) as part of the CIFSRF program.

## References


foxtail millet [Setaria italica (L.) P. Beauv]. Plant Cell Rep. 31, 323–337. doi: 10.1007/s00299-011-1168-x


molecular markers for grain protein and calcium content in finger millet (Eleusine coracana (L.) Gaertn.). Mol. Biol. Rep. 41, 1189–1200. doi: 10.1007/s11033-013-2825-7


contrasting genotypes of finger millet (Eleusine coracana L.) through RNA-sequencing. Plant Mol. Biol. 85, 485–503. doi: 10.1007/s11103-014-0199-4


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Goron and Raizada. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Comparative analysis of fruit aroma patterns in the domesticated wild strawberries "Profumata di Tortona" (*F. moschata*) and "Regina delle Valli" (*F. vesca*)

#### *Alfredo S. Negri<sup>1</sup>, Domenico Allegra<sup>2</sup>, Laura Simoni<sup>2</sup>, Fabio Rusconi<sup>2</sup>, Chiara Tonelli<sup>3</sup>, Luca Espen<sup>1</sup>\* and Massimo Galbiati<sup>2,3</sup>\**

<sup>1</sup> Dipartimento di Scienze Agrarie e Ambientali - Produzione, Territorio, Agroenergia, Università degli Studi di Milano, Milan, Italy

<sup>2</sup> Plant Model System Platform, Fondazione Filarete, Milan, Italy

<sup>3</sup> Department of Life Sciences, Università degli Studi di Milano, Milan, Italy

#### *Edited by:*

Aleksandra Skirycz, Instituto Tecnológico Vale Desenvolvimento Sustentável/Vale Institute of Technology Sustainable Development, Brazil

#### *Reviewed by:*

Camila Caldana, Brazilian Bioethanol Science and Technology Laboratory (CTBE) - Centro Nacional de Pesquisa em Energia e Materiais/Associação Brasileira de Tecnologia de Luz Síncrotron, Brazil; Detlef Ulrich, Julius Kühn-Institute, Germany

#### *\*Correspondence:*

Luca Espen, Dipartimento di Scienze Agrarie e Ambientali - Produzione, Territorio, Agroenergia, Università degli Studi di Milano, Via Celoria 2, Milan 20133, Italy e-mail: luca.espen@unimi.it; Massimo Galbiati, Department of Life Sciences, Università degli Studi di Milano, Via Celoria 26, Milan 20133, Italy e-mail: massimo.galbiati@unimi.it

Strawberry is one of the most valued fruits worldwide. Modern cultivated varieties (Fragaria × ananassa) exhibit large fruits, with intense color and prolonged shelf life. Yet, these valuable traits were attained at the cost of the intensity and the variety of the aroma of the berry, two characteristics highly appreciated by consumers. Wild species display smaller fruits and reduced yield compared with cultivated varieties, but they accumulate broader and augmented blends of volatile compounds. Because of the large diversity and strength of aromas occurring in natural and domesticated populations, plant breeders regard wild strawberries as important donors of novel scented molecules. Here we report a comprehensive metabolic map of the aroma of the wild strawberry Profumata di Tortona (PdT), an ancient clone of F. moschata, considered one of the most fragrant strawberry types of all. Comparison with the more renowned woodland strawberry Regina delle Valli (RdV), an aromatic cultivar of F. vesca, revealed a significant enrichment in the total level of esters, alcohols and furanones and a reduction in the content of ketones in the aroma of PdT berries. Among esters, particularly relevant was the enhanced accumulation of methyl anthranilate, responsible for the intensive sweetish impression of wild strawberries. Interestingly, increased ester accumulation in PdT fruits correlated with enhanced expression of the Strawberry Alcohol Acyltransferase (SAAT) gene, a key regulator of flavor biogenesis in ripening berries. We also detected a remarkable 900-fold increase in the level of mesifurane, the furanone conferring the typical caramel notes to most wild species.

**Keywords: strawberry,** *Fragaria moschata***,** *Fragaria vesca***, gas chromatography-mass spectrometry, aroma, volatiles**

## **INTRODUCTION**

Garden strawberries (*Fragaria* × *ananassa*) are among the most appreciated fruits and represent a valuable economic crop with a global annual production that exceeds 4.5 Mt (FAOSTAT, 2014). Berries of high-yield modern varieties are characterized by large size, attractive color and prolonged shelf life (Hancock, 1999). Yet, the sensory quality of traded strawberries is often criticized, as they lack flavor and fragrance. Sensory perceptions originate from the combination of sweetness, texture and aroma (Christensen, 1983). Among these features, aroma remains the most valued quality indicator for consumers worldwide (Azodanlou et al., 2003; Colquhon et al., 2012). Just as in other fruits, strawberry aroma is a complex blend of volatile organic compounds (VOCs). These compounds represent only 0.001 to 0.01% of the berry fresh weight but have a major effect on its flavor and fragrance (Buttery, 1983). As many as 360 volatiles have been identified in ripe strawberries; these include esters, aldehydes, ketones, alcohols, terpenes and furanones (Menager et al., 2004; Jetti et al., 2007). Individual compounds, although often present in minute quantities, may have a significant impact on the aroma. The reduced fragrance of most garden strawberries derives from the relatively limited accumulation of ester molecules, frequently combined with an excess of lactones, which usually cause a disproportionate peach note.

In contrast to garden varieties, wild strawberries are renowned for their intense flavor and fragrance. Most spontaneous species bear small fruits which accumulate higher levels and wider assortments of volatile molecules, compared with cultivated varieties (Honkanen and Hirvi, 1990). The ample natural variation occurring among the wild ancestors of garden strawberries provides a valuable source of novel volatile compounds for breeding new commercial strawberries with improved aroma properties (Ulrich and Hoberg, 2000a). Over 20 wild species are found within the *Fragaria* genus, of which the diploid woodland strawberry (*F. vesca*) is the most common (Rousseau-Gueutin et al., 2009). Among other species, musk strawberries (*F. moschata*) are recognized for their distinctive and extraordinarily strong aroma. Native to highland areas from France to Siberia, musk strawberries were widely cultivated in Europe until the mid-1900s, when they were replaced by firmer, higher yielding and more remunerative *F.* × *ananassa* cultivars (Darrow, 1966).

Today, only a few musk strawberries survive in farm plantings, and only on a very small scale. Noteworthy is the Italian clone Profumata di Tortona (PdT), regarded as one of the most fragrant strawberry types of all (Urruty et al., 2002). PdT is a dioecious strawberry, having the male and female reproductive organs in separate flowers on separate plants. Berries, distinguishable by the intense red color of the peel and the whitish flesh, possess a delightful sour–sweet, slightly astringent flavor, with green, caramel and clove-like notes (Pet'ka et al., 2012). The hallmark of PdT is its peculiar floral, spicy aroma, with hints of honey, musk and wine. The fragrance of this strawberry is so intense that a few ripe berries can perfume an entire room with a penetrating mango-like, tropical scent. Unlike most cultivated strawberries, PdT has an extremely limited harvesting season. Berries are only available for a period of 10–15 days, coinciding with the second half of June. Currently, the commercial cultivation of PdT is restricted to the municipality of Tortona, in the Piedmont region in Northern Italy. Remarkably, the first historical evidence of musk strawberries in this area dates back to the year 1411 (Bergaglio, 2007). Cultivation lingered into the early 1960s, when strawberry fields succumbed to urban development. Most recently, there has been renewed interest in the Profumata di Tortona, considered a delicacy both for fresh consumption and gourmet preparations.

Diversity of volatile patterns in woodland strawberries in comparison to cultivated garden varieties has been extensively investigated (Drawert et al., 1973; Ulrich et al., 1997). Recent surveys of aroma profiles across 16 *F. vesca* accessions and five *F.* × *ananassa* cultivars identified significant differences in the accumulation of individual esters, ketones and terpenoids between the two strawberries (Ulrich and Olbricht, 2013, 2014). In particular, small esters, including ethyl hexanoate, methyl butanoate and methyl hexanoate, were found in higher amounts in garden strawberries compared with woodland accessions. Conversely, the key ester methyl anthranilate (MA) was more abundant in *F. vesca*. Similarly, ketones (e.g., 2-pentanone, 2-heptanone, and 2-nonanone) and terpenoids (e.g., myrtenal, myrtenyl acetate, α-terpineol) occurred at higher levels in wild berries, with the exception of the monoterpene linalool, which was more abundant in garden strawberries (Ulrich and Olbricht, 2013, 2014).

Detailed profiling of the aroma composition in musk strawberries has only been reported for a few spontaneous populations (Ulrich et al., 2007; Pet'ka et al., 2012). Urruty and colleagues performed a first partial assessment of the volatile compounds produced by ripe PdT berries (Urruty et al., 2002). These authors determined the abundance of 23 preselected VOCs in the aroma of two *F. moschata* clones (Capron Royal and PdT) compared with 15 garden varieties. Selected compounds, representing major constituents of the strawberry aroma, included esters (e.g., methyl hexanoate, MA), monoterpenes (e.g., linalool, nerolidol), ketones (e.g., 2-pentanone, 2-heptanone), aldehydes (e.g., 2-hexenal), lactones and furanones (e.g., γ-decalactone, mesifurane). Among the volatiles analyzed, Capron Royal and PdT displayed lower levels of small esters such as methyl hexanoate, compared with cultivated strawberries. In contrast, both *F. moschata* clones revealed exceptionally high levels of MA, which was barely detectable in most garden varieties (Urruty et al., 2002). Despite the relevance of these findings, a more comprehensive analysis of the aroma profile of PdT is required to fully uncover the volatile composition of ripe PdT berries and gain more insight into their extraordinary aromatic properties.

Here we report a re-assessment of the VOC composition of PdT berries, based on a non-targeted Solid-Phase Micro-Extraction/Gas Chromatography-Mass Spectrometry (SPME/GC-MS) approach (Ulrich and Hoberg, 2000b). We compared the aroma of PdT with that of the woodland strawberry Regina delle Valli (RdV). The latter is a widely grown cultivar of *F. vesca*, renowned for its intense and pleasant aroma. Most importantly, the aroma composition of this strawberry has not been investigated in previous studies. In total, we identified 131 VOCs in the headspace of the two strawberries, which provide a comprehensive picture of the aroma patterns of PdT and RdV berries. As a whole, our results help shed new light on the natural variation occurring in the aroma of wild strawberry species.

## **MATERIALS AND METHODS**

#### **PLANT MATERIAL**

Plants of Profumata di Tortona were provided by "Consorzio per la valorizzazione e la tutela della Fragola Profumata di Tortona," Tortona, Italy. Regina delle Valli plants were purchased from Azienda Agricola Ortomio, Forlì, Italy (http://www.ortomio.it/). Both strawberries were grown in the production area of Tortona (Italy) under commercial conditions, according to the standards adopted by the "Consorzio per la valorizzazione e la tutela della Fragola Profumata di Tortona." Fruits for the analysis were harvested with the assistance of local producers, to ensure selection of uniform, healthy and fully ripe berries. Five randomized samples, composed of 15 individual ripe fruits each, were collected in the early morning in a single harvest. The extremely short harvesting season of PdT did not justify the adoption of multiple harvests. Berries employed in our analysis represent a faithful sample of the commercial fruits that are normally available to consumers.

#### **SPME/GC-MS SAMPLE PREPARATION**

Harvested fruit samples were immediately frozen at −20°C and stored at −80°C. Prior to analysis, fruits were powdered in liquid nitrogen, and 1 g of fresh weight for each sample was incubated at 30°C for 5 min. Following the addition of 300 μL of a saturated NaCl solution, 900 μL of the homogenized mixture was transferred to a 10 mL screw-cap headspace vial. Three technical replicates were performed for each sample.

#### **AUTOMATED SPME/GC-MS**

Volatiles were sampled by SPME with a 2 cm × 50/30-micron DVB/Carboxen/PDMS Stable Flex fiber (Sigma, Milano, Italy). Extraction and desorption of the volatiles were performed automatically by a CombiPAL autosampler (CTC Analytics, Zwingen, Switzerland) as described (Zorrilla-Fontanesi et al., 2012). Chromatography was performed on a DB-5ms (30 m × 0.25 mm × 1 μm) column (Sigma, Milano, Italy) with helium at a constant flow of 1.2 mL/min, according to Zorrilla-Fontanesi et al. (2012). Mass spectra were recorded in scan mode in the 35 to 220 mass-to-charge ratio range by a 5975B mass spectrometer (Agilent Technologies, Cernusco sul Naviglio, Italy) (ionization energy 70 eV; scanning speed 7 scans/s). The Enhanced ChemStation software (Agilent Technologies, Cernusco sul Naviglio, Italy) was used for recording and processing of chromatograms and spectra. Three technical replicates were conducted for each sample.

#### **COMPOUND IDENTIFICATION AND RELATIVE QUANTIFICATION**

Compound abundance was determined using the software MET-IDEA, which directly extracts ion intensities from a list of ions coupled with their relative retention time values (Broeckling et al., 2006). The ion list was built as follows. For each biological replicate, a randomly chosen chromatogram was analyzed by the Automated Mass Spectral Deconvolution and Identification System (AMDIS; http://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:amdis), comparing the deconvoluted spectra with entries in the National Institute of Standards and Technology MS library (NIST08) (Type of Analysis, Simple; Resolution, Medium; Sensitivity, Medium; Shape Requirement, Low; Component Width, 12; Minimum Match Factor, 70). The 113 entries with a match score larger than 800 in the NIST Search software were extracted and added to a .msl file containing the 1537 plant-derived metabolites of the VOC BinBase library (Skogerson et al., 2011) to assemble the library used for the analysis of the 10 randomly chosen chromatograms. During this second cycle of AMDIS analysis, the Minimum Match Factor was set to 90 and the resulting .fin files were joined in a single ion list, manually eliminating overlapping entries. The resulting ion list was employed for MET-IDEA analysis over all 30 chromatograms. Peak areas of selected specific ions were integrated for each compound. The relative content (R.C.) of each tentatively identified metabolite (expressed as a percentage) was calculated as the ratio between each peak area and the sum of all the peak areas present in the chromatogram, multiplied by 100 [R.C. = (Area<sub>peak</sub>/ΣAreas<sub>peak</sub>) × 100]. CAS numbers and flavoring descriptors were retrieved from the web-based Chemical Search Engine (http://www.chemindustry.com/apps/chemicals) and from the online edition of the "Specifications for Flavorings" database (http://www.fao.org/ag/agn/jecfa-flav/), respectively.
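The relative content calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `relative_content` and the peak areas in `example_areas` are hypothetical and are not taken from the MET-IDEA output of this study.

```python
def relative_content(peak_areas):
    """Return the relative content (%) of each compound in one chromatogram.

    peak_areas maps a compound name to its integrated peak area
    (arbitrary units). R.C. = (area of peak / sum of all peak areas) * 100.
    """
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("total peak area must be positive")
    return {name: 100.0 * area / total for name, area in peak_areas.items()}


# Hypothetical peak areas for a single chromatogram (illustrative only).
example_areas = {
    "myrtenyl acetate": 2010.0,
    "octyl acetate": 1270.0,
    "all other peaks": 6720.0,
}
rc = relative_content(example_areas)
```

By construction the relative contents of one chromatogram sum to 100%, so values are comparable between samples even when the absolute amount of extracted volatiles differs.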

#### **STATISTICAL ANALYSIS**

Differences in volatile accumulation between the two cultivars were investigated with the Soft Independent Modeling of Class Analogies (SIMCA) provided by the software Unscrambler (Camo Process AS, Oslo, Norway), through the construction of a Principal Component Analysis (PCA) model for each cultivar. Samples were projected on the orthogonal system constituted by the two models to assess their object-to-model distance and to judge their membership of one of the two classes. The capability of variable *k* to discriminate between model PdT and model RdV (fitting samples from model PdT onto model RdV) was described by the Discrimination Power (DiscrPower), computed as:

$$\text{DiscrPower} = \frac{\text{S}_{\text{PdT}}(\text{RdV}, k)^2 + \text{S}_{\text{RdV}}(\text{PdT}, k)^2}{\text{S}_{\text{RdV}}(\text{RdV}, k)^2 + \text{S}_{\text{PdT}}(\text{PdT}, k)^2}$$
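As a minimal sketch, the discrimination power of a single variable can be computed once the four residual standard deviations are available. The code assumes the usual SIMCA definition, in which the cross-fit residual variances (samples of one class fitted onto the other class model) are divided by the self-fit residual variances. The function name, argument names and numerical values below are hypothetical; they do not reproduce the Unscrambler implementation or the data of this study.

```python
def discrimination_power(s_cross_ab, s_cross_ba, s_self_aa, s_self_bb):
    """Discrimination power of one variable between class models A and B.

    s_cross_ab, s_cross_ba: residual standard deviations of the variable
        when samples of one class are fitted onto the other class model.
    s_self_aa, s_self_bb: residual standard deviations of the variable
        when samples are fitted onto their own class model.
    """
    self_variance = s_self_aa ** 2 + s_self_bb ** 2
    if self_variance == 0:
        raise ValueError("self-fit residual variance must be non-zero")
    return (s_cross_ab ** 2 + s_cross_ba ** 2) / self_variance


# Hypothetical residual standard deviations for one volatile.
power = discrimination_power(3.0, 4.0, 1.0, 2.0)
```

A value close to 1 means that the variable fits both class models equally well and so does not help separate them, whereas larger values flag volatiles that distinguish the two strawberries.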

The significance of the differences in volatile levels between the two strawberries was tested with a Student's *t*-test, considering variations with *p* < 0.01 as significant.

#### **QUANTITATIVE PCR (qPCR) ANALYSIS**

RNA was isolated from Small Green, Turning and Red fruits according to Schultz et al. (1994). Reverse transcription and qPCR analysis were performed as previously described (Galbiati et al., 2011). *SAAT* expression was analyzed using primers SAAT-F1 (5′-TTGGATGGGGGAGGACATCAT-3′) and SAAT-R1 (5′-CACCCACGCTTCAATTCCAGTA-3′). Gene expression was normalized using the *ACTIN* gene (GenBank: JN616288.1), amplified with primers ACT-F1 (5′-ATGTTGCCCTTGACTACGAACAA-3′) and ACT-R1 (5′-TGGCCGTCGGGAAGCTCATA-3′). Primer efficiency was first assessed on both genomic DNA and cDNAs derived from PdT and RdV fruits, to avoid differences in amplification efficiency between the two genotypes. Changes in *SAAT* gene expression were calculated relative to *ACTIN* using the ΔΔCt method (Livak and Schmittgen, 2001).
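The relative quantification of Livak and Schmittgen (2001) expresses the change in target gene expression as 2 raised to the power of −ΔΔCt. The sketch below illustrates the arithmetic only. It assumes an amplification efficiency close to 100% for both primer pairs, and the function name and Ct values are hypothetical rather than measurements from this study.

```python
def relative_expression(ct_target, ct_reference,
                        ct_target_calibrator, ct_reference_calibrator):
    """Fold change of a target gene by the 2^(-ddCt) method.

    Each Ct is a threshold cycle. The target (e.g., SAAT) is normalized to
    a reference gene (e.g., ACTIN) in the sample of interest and in a
    calibrator sample; the two normalized values are then compared.
    Assumes both amplicons amplify with ~100% efficiency.
    """
    delta_ct_sample = ct_target - ct_reference
    delta_ct_calibrator = ct_target_calibrator - ct_reference_calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target crosses the threshold three cycles
# earlier (relative to the reference) in the sample than in the calibrator.
fold = relative_expression(22.0, 18.0, 25.0, 18.0)
```

With these illustrative values, ΔΔCt is −3, which corresponds to an 8-fold higher normalized expression in the sample than in the calibrator.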

## **RESULTS**

#### **PdT AND RdV BERRIES DISPLAY DISTINCT VOCs PATTERNS**

Fully ripe berries of *F. moschata*, clone Profumata di Tortona (PdT), and *F. vesca*, cv Regina delle Valli (RdV), were analyzed by SPME/GC-MS. In total, through the construction of a non-redundant ion list collecting information from the AMDIS analysis of 10 different chromatograms, 131 VOCs were tentatively identified in the headspace of the two strawberries. GC-MS data obtained from individual biological and technical replicates were analyzed using the Soft Independent Modeling of Class Analogies (SIMCA), based on the models of the two genotypes built with Principal Component Analysis (PCA) (Svante and Michael, 1977). The SIMCA Coomans plot showed full differentiation of PdT and RdV, based on their aroma components. All the samples grouped in their respective membership class, as determined by a significance level of 5% (**Figure 1**). The model assigned the highest discriminating power to the ketones heptan-2-one and nonan-2-one (**Table 1**). A third ketone molecule, undecan-2-one, also scored among the most significant volatiles for the discrimination between the two strawberries. Additional molecules with high discriminating power included several esters, such as hexyl butanoate, methyl benzoate, ethyl hexanoate and pinocarvyl acetate, the furanone γ-hexalactone, and the terpenes pinocarveol and myrtenyl acetate (**Table 1**).

#### **COMPARATIVE ANALYSIS OF AROMA PATTERNS IN PdT AND RdV BERRIES**

Among the 131 volatiles identified in the aroma of the two strawberries, 47 were classified as esters, which are long known to represent the most abundant VOCs found in ripe strawberries (Latrasse, 1991) (**Table 2**). Terpenoids included 20 mono- and 4 sesqui-terpenes (**Table 3**). Amongst the remaining aroma constituents we identified 5 alcohols, 9 aldehydes (**Table 4**), 6 ketones, and 4 lactones (**Table 5**). Additional compounds comprised a single fatty acid (hexanoic acid), a single alkane hydrocarbon (tetradecane), and 36 volatiles of unknown chemical identity.

The relative abundance of individual chemical classes differed significantly between the two strawberries (Supplementary Figure 1). Quantitatively, esters were the predominant volatiles in berries of PdT, accounting for nearly 50% of the aroma. In comparison, their relative amount was drastically reduced in the aroma of RdV, representing only 25% of the total volatiles. Terpenes were the second most abundant class of compounds found in PdT (24%). A comparable level of total terpenes was identified in RdV (21%). Conversely, the two strawberries showed a marked difference in the relative level of ketones, which were severely reduced in PdT (4%) but highly abundant in RdV (27%). We also observed a disparity in the relative abundance of alcohols, which were far more copious in PdT compared with RdV, representing 7 and 2% of the total volatile molecules, respectively. No substantial differences were detected in the relative amount of aldehydes and lactones. Aldehydes accounted for 6 and 9% of total volatiles in PdT and RdV, respectively, whilst lactones were the least represented molecules, covering only 1.4 and 0.4% of the aroma of PdT and RdV, respectively. Finally, compounds of uncertain identity were slightly more abundant in RdV (14%) compared with PdT (8%) (Supplementary Figure 1).

**Table 3 | Terpene molecules identified in the headspace of PdT and RdV berries by GC-MS.**

Analysis of individual constituents of the aroma identified the monoterpene myrtenyl acetate as the most abundant volatile in both PdT and RdV (**Table 3**). This finding is in line with previous work, which demonstrated that this molecule dominates the terpenoid profile of wild strawberries (Aharoni et al., 2004). Interestingly, the level of myrtenyl acetate was significantly augmented in PdT (20.1%) compared with RdV (15.4%). The relative quantities of two ester molecules were also preeminent in the aroma of PdT, namely octyl acetate (12.7%) and 4-acetyloxybutyl acetate (11.6%) (**Table 2**). Together with myrtenyl acetate, these two compounds constituted 45% of the total volatiles found in this strawberry. Both esters were also detected in the aroma of RdV, even though their relative abundance was drastically reduced compared with PdT (4.6 and 3.6%, respectively). Two ketones, 2-heptanone and 2-nonanone, were the most abundant molecules, after myrtenyl acetate, in the aroma of RdV, accounting for 12.7 and 11.9% of the total volatiles, respectively (**Table 5**). Conversely, the amount of these two molecules was significantly reduced in PdT (0.7 and 0.2%, respectively).
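The relative contents quoted above can be turned into PdT/RdV ratios with simple arithmetic, as in the sketch below. The percentages are those given in the text; the variable names and the use of a plain ratio as the fold difference are illustrative choices, not the authors' procedure.

```python
# Relative contents (% of total volatiles) quoted in the text.
pdt = {
    "octyl acetate": 12.7,
    "4-acetyloxybutyl acetate": 11.6,
    "2-heptanone": 0.7,
    "2-nonanone": 0.2,
}
rdv = {
    "octyl acetate": 4.6,
    "4-acetyloxybutyl acetate": 3.6,
    "2-heptanone": 12.7,
    "2-nonanone": 11.9,
}

# Ratio above 1: compound relatively enriched in PdT; below 1: enriched in RdV.
pdt_over_rdv = {name: pdt[name] / rdv[name] for name in pdt}

# Combined share of the three dominant PdT volatiles
# (myrtenyl acetate, octyl acetate and 4-acetyloxybutyl acetate).
top_three_pdt = 20.1 + 12.7 + 11.6
```

The two esters come out roughly three times more abundant in PdT, while the two ketones are far more abundant in RdV. The three dominant PdT volatiles sum to 44.4%, in line with the approximately 45% reported.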

Additional abundant components of the aroma of PdT included 1-hexanol (5%) (**Table 4**) and several other esters, such as hexyl formate (4.5%), methyl anthranilate (MA) (3.4%) and hexyl acetate (2.7%) (**Table 2**). On average, the content of these volatiles was significantly higher in the aroma of PdT compared with RdV. The only aldehyde found in relatively high amounts in both strawberries was 2-hexenal (**Table 4**). Its content was significantly greater in RdV (5.4%) compared with PdT (2.7%). The most abundant ketone in the aroma of PdT was 2-tridecanone, reaching 2.3% of the total volatiles (**Table 5**). In contrast to the other ketones, the level of this molecule was significantly reduced in RdV (0.6%) compared with PdT.

Among less abundant molecules (<1% of total volatiles), major differences between the two strawberries were observed within terpenes, esters, and furanones. In particular, the levels of α-pinene, a monoterpene specifically identified in the aroma of *F. vesca* (Aharoni et al., 2004), linalool, known to dominate the terpenoid profile of cultivated strawberry (Aharoni et al., 2004), and α-citronellol were significantly reduced in the aroma of PdT in comparison with RdV (**Table 3**). In contrast, the level of megastigma-3,7(E),9-triene, the major terpene found in the essential oil of some Eucalyptus species (El-Mageed et al., 2011), was 18 times higher in PdT compared with RdV. We also observed a 74- and 108-fold increase in the accumulation of the terpenes 3-cyclohexen-1-ol,5-methylene-6-(1-methylethyl)-(9CI) and 1H-Cyclopropa[a]naphthalene, respectively, in the aroma of PdT compared with RdV (**Table 3**). Among minor esters, the most striking differences were detected for 2-methylbutanoic acid and methyl 2-methylbutanoate, which were 120 and 150 times more abundant, respectively, in berries from PdT relative to RdV (**Table 2**). Additional esters over-represented in the aroma of PdT included [(E)-3-phenylprop-2-enyl] acetate (24-fold), octyl 3-methylbutanoate (16-fold), tridecan-2-yl acetate (14-fold) and methyl tiglate (13-fold) (**Table 2**).

Although present in lower amounts than other volatiles, lactones and furanones were more abundant in the aroma of PdT than in that of RdV. Remarkably, mesifurane, the typical furanone conferring sweet caramel notes to wild strawberries, was nearly 900 times more abundant in PdT compared with RdV (**Table 5**).


**Table 4 | Alcohol and aldehyde molecules identified in the headspace of PdT and RdV berries by GC-MS.**



#### **ANALYSIS OF SAAT EXPRESSION IN DEVELOPING BERRIES**

In strawberries, only very few genes have been directly associated with aroma biogenesis in ripening fruits. Among them, the *Strawberry Alcohol Acyltransferase* gene (*SAAT*), controlling a key step in ester biosynthesis (Aharoni et al., 2000), is of particular interest. The enzyme AAT catalyzes the transfer of an acyl moiety from acyl-CoA onto specific alcohols, resulting in the production of ester molecules (Harada et al., 1985). Intriguingly, octyl acetate, the most abundant ester in both PdT and RdV aroma, has been demonstrated to be a genuine AAT product (Aharoni et al., 2000). The activation of *SAAT* expression during berry development has been positively correlated with the onset of ester accumulation. In *F.* × ananassa, *SAAT* expression is induced at an early stage of fruit ripening, peaks at the turning stage, and is rapidly down-regulated in red fruits (Aharoni et al., 2000). We compared the expression profile of the *SAAT* gene in developing berries from PdT and RdV (Supplementary Figure 2) to unravel potential differences in the level of gene expression and/or in the kinetics of *SAAT* activation between the two strawberries.

As shown in **Figure 2**, both RdV and PdT accumulated comparably low levels of *SAAT* transcripts in small green fruits. At the turning stage, both strawberries displayed very strong activation of *SAAT* expression. Interestingly, at this stage, the degree of gene activation was significantly higher in PdT compared with RdV (*t*-test, *p* < 0.01). This finding is consistent with the increased accumulation of esters observed in PdT relative to RdV fruits. Finally, in red fruits *SAAT* expression was down-regulated to the same extent in both berries (**Figure 2**).

#### **DISCUSSION**

The vast array of wild species found in the genus *Fragaria* provides an exceptionally large and conveniently located germplasm, which can serve for breeding novel quality traits into garden strawberries (Hancock and Luby, 1993). Key to the successful exploitation of the natural variation occurring in the *Fragaria* genus is the detailed characterization of the aroma profile of individual species, ecotypes and clones. This study unraveled the volatile composition of two domesticated wild strawberries: *F. moschata* clone Profumata di Tortona and *F. vesca* cv. Regina delle Valli. Both strawberries are regarded as highly aromatic, although the scent of PdT is far more intense and persistent (Urruty et al., 2002).

Sampling procedures are key to the analysis of aroma composition in strawberries, as significant changes in VOC profiles can occur among harvest dates within a single season (Schwieterman et al., 2014). Different strategies employing multiple harvests or the laborious harvest of all the available fruits throughout the season have been proposed to overcome this obstacle (Ulrich and Olbricht, 2013; Schwieterman et al., 2014). Our results rely on the analysis of biological replicates from a single harvest. In contrast to perpetual flowering accessions, characterized by prolonged harvesting seasons, the seasonal flowering PdT clone bears fruit for less than 2 weeks. Environmental changes and differences in plant physiology, known to affect fruit quality over months (Schwieterman et al., 2014), are unlikely to influence the aroma composition of PdT berries within a period of 15 days.

We identified 131 volatiles in ripe berries of PdT and RdV, a number exceeding the aroma compounds usually found in commercial strawberries. A recent survey of the chemical diversity of the aromas of 35 different garden varieties recognized no more than 80 VOCs even in the most fragrant commercial genotypes (Schwieterman et al., 2014). Comparative analysis of aroma patterns revealed a significant enrichment in esters and alcohols, along with a severe reduction in ketones, in berries from PdT compared with RdV. Conversely, the two strawberries showed comparable levels of terpenes, aldehydes and lactones (Supplementary Figure 1). Over 130 different ester molecules have been identified in strawberries (Latrasse, 1991). In *F.* × ananassa, the chemical composition of ester volatiles is usually dominated by methyl and ethyl esters, even though the abundance of each form varies with cultivar (Forney et al., 2000). Methyl esters were the prevalent form in the aroma of both PdT and RdV (22 molecules), whereas ethyl esters were poorly represented (3 molecules) (**Table 2**).

Even if it is generally difficult to establish a direct correlation between individual aroma constituents and specific sensory impressions, the fragrance of wild strawberries largely depends upon the relatively high amounts of methyl anthranilate (MA) (Ulrich et al., 1997). The intensive sweetish-flowery impression of this ester is at the basis of the definition of strawberry aroma chemo-types, which are essentially subdivided into MA-containing and MA-free types (Ulrich et al., 1997). Our analysis uncovered a nine-fold increase in the level of MA in berries from PdT compared with RdV, thus emphasizing the role of this key ester in determining the unique fragrance of musk strawberry. This conclusion is corroborated by previous studies, which reported a markedly higher concentration of MA in *F. moschata* relative to *F. vesca* (Urruty et al., 2002). It is also interesting to note that MA and γ-decalactone can directly inhibit the *in vitro* growth of relevant strawberry pathogens, thus implying that these volatiles may contribute to the healthiness of the berry at harvest (Chambers et al., 2013).

We also detected significant differences between the two strawberries in the accumulation of low-abundance esters, for instance methyl 2-methylbutanoate, which was 150 times more abundant in PdT compared with RdV (**Table 2**). Along with other esters of butanoic acid, this molecule, which confers a sweet and fruity note to the aroma, is found in higher amounts in garden strawberries compared with woodland accessions (Ulrich and Olbricht, 2013). We also detected a 16-fold increase in the relative level of octyl 3-methylbutanoate in the headspace of PdT compared with RdV. This molecule, conferring an apple-pineapple odor, is present in the most flavorful commercial varieties but undetected in the least flavorful. Evidence indicates that octyl 3-methylbutanoate is an important component of strawberry aroma, potentially enhancing the perceived sweetness intensity independently of individual sugars (Schwieterman et al., 2014).

Ester accumulation in ripening strawberries is directly associated with the expression of the *SAAT* gene, encoding a fruit-specific ALCOHOL ACYLTRANSFERASE (AAT) (Aharoni et al., 2000). It is intriguing to speculate that the enhanced accumulation of ester molecules in the aroma of PdT results from the hyperactivation of *SAAT* expression observed in turning fruits from PdT, as compared with RdV (**Figure 2**). Further support for this hypothesis comes from the observation that genuine products of the SAAT enzyme, including octyl acetate, [(Z)-hex-3-enyl] acetate, 2-phenylethyl acetate and nonyl acetate (Aharoni et al., 2000), were found at higher levels in the headspace of PdT compared with RdV (**Table 2**).

The implication of the increased alcohol accumulation in PdT, as compared with RdV, for the quality of the berry is uncertain, as these molecules normally have little impact on the aroma (Larsen and Watkins, 1995). Nevertheless, Schwieterman and colleagues recently reported a direct effect of the level of 1-hexanol on sweetness and flavor intensity in different garden cultivars (Schwieterman et al., 2014). Notably, we found a significant 4-fold increase in the relative amount of 1-hexanol in PdT compared with RdV, possibly suggesting a positive role for this molecule in determining the unique flavor of musk strawberries.

PdT and RdV displayed a similar terpene profile, largely dominated by myrtenyl acetate, by far the most abundant molecule found in the aroma of both strawberries. Higher concentrations of myrtenyl acetate are normally found in woodland strawberries compared with garden varieties (Ulrich and Olbricht, 2013). We observed a moderate, although significant, increase in the level of myrtenyl acetate in PdT compared with RdV (**Table 3**). This is in accord with previous comparative analyses of other *F. moschata* clones and *F. vesca* accessions (Ulrich et al., 1997). Major differences in the accumulation of low-abundance terpenes involved linalool and 1H-Cyclopropa[a]naphthalene. The former, conferring pleasant flowery, citrus-like notes, represents the predominant monoterpene found in cultivated strawberries (Aharoni et al., 2004; Ulrich and Olbricht, 2014). Its relative concentration was 19-fold higher in RdV compared with PdT (**Table 3**). Conversely, 1H-Cyclopropa[a]naphthalene was 108 times more abundant in PdT in comparison with RdV. Interestingly, this molecule is among the major constituents of some agarwood oils, highly appreciated for their unique and intense fragrance and for their therapeutic properties (Takemoto et al., 2008). To our knowledge, this volatile compound has never been reported in previous analyses of strawberry aromas and could represent a novel target for enhancing the fragrance of traded strawberries.

Even if present in relatively low amounts, furanones are considered dominant components of strawberry aroma. In particular, furaneol (DHF) and its methyl ether mesifurane (DMF) contribute to the typical caramel-like, sweet, floral and fruity aroma of the berry (Jetti et al., 2007). Interestingly, PdT strawberries revealed an enhanced accumulation of total furanones compared with RdV. In particular, we detected a nearly 900-fold increase in the relative amount of DMF compared with RdV (**Table 5**). Augmented levels of DMF in *F. moschata* accessions relative to *F. vesca* have been previously reported, yet not to this very large extent (Ulrich et al., 1997). The identification of the genetic bases for such an increased mesifurane production is beyond the scope of this work. Yet it is important to note that PdT may represent a valuable breeding material to enhance the DMF content of garden varieties. Our analysis did not reveal detectable amounts of DHF in either PdT or RdV berries. This is in contrast with a preceding work, which identified this furanone in different *F. vesca* and *F. moschata* genotypes (Urruty et al., 2002). One possible explanation for such a discrepancy could reside in differences between the analytic methods employed in previous studies and ours. It is interesting to note that DHF accumulation has been negatively correlated with the quality of the berry, as DHF-type strawberries are generally characterized by medium to poor flavor (Ulrich et al., 1997). The lack of DHF in PdT and RdV berries could alternatively be correlated with the organoleptic excellence of their fruits.

As a whole, our data provide a comprehensive metabolic map of PdT, the most fragrant strawberry of all. Despite the fact that *F. moschata* is not a direct ancestor of commercial garden strawberries, the aroma profile of PdT could assist the exploitation of this ancient clone as breeding material to enhance fruit quality in commercial strawberries. The successful introgression of wild-derived traits into cultivated garden varieties largely depends upon the possibility of generating viable and fertile interspecific hybrids. Evidently, crosses between octoploid *F.* × ananassa and species at lower ploidy level, including hexaploid PdT, are rather difficult. Yet, breeders have successfully performed interploid crosses between cultivated strawberries and *F. vesca* or *F. moschata*, which produced viable hybrids with partial seed set (Luby et al., 1991). Synthetic octoploids containing varying levels of *F. moschata* have also been generated using colchicine-induced artificial doubling of chromosome number (Evans, 1982a,b). Finally, advancements in genetic engineering of cultivated strawberries have opened unprecedented possibilities for the breeding of new varieties with desirable traits. Genetic transformation has been successfully employed to enhance strawberry resistance to pests, herbicides, diseases and environmental stresses, as well as fruit quality (reviewed in Qin et al., 2008). Further developments are expected in which metabolomic data, such as those provided in this study, combined with genome-wide transcriptomic analysis and next generation genome-sequencing strategies, will allow the identification of suitable molecular targets for engineering of volatile biosynthesis in strawberries. In this perspective, the genome of Profumata di Tortona will prove an invaluable source of genetic material.

## **ACKNOWLEDGMENTS**

The authors thank the "Consorzio per la valorizzazione e la tutela della Fragola Profumata di Tortona" (Tortona, Italy) and "Progetto Derthona" (Tortona, Italy) for providing the plant material along with useful technical and historical information on the clone "Profumata di Tortona"; and "Agrodinamica s.r.l." (Tortona, Italy) for technical support with field activities.

### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fpls.2015.00056/abstract

#### **REFERENCES**


Hancock, J. (1999). *Strawberries.* Oxford, UK: CAB International.


of their active components. *J. Nat. Med.* 62, 41–46. doi: 10.1007/s11418-007-0177-0


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 27 October 2014; accepted: 22 January 2015; published online: 11 February 2015.*

*Citation: Negri AS, Allegra D, Simoni L, Rusconi F, Tonelli C, Espen L and Galbiati M (2015) Comparative analysis of fruit aroma patterns in the domesticated wild strawberries "Profumata di Tortona" (F. moschata) and "Regina delle Valli" (F. vesca). Front. Plant Sci. 6:56. doi: 10.3389/fpls.2015.00056*

*This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science.*

*Copyright © 2015 Negri, Allegra, Simoni, Rusconi, Tonelli, Espen and Galbiati. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Extensive sequence variation in rice blast resistance gene Pi54 makes it broad spectrum in nature

Shallu Thakur<sup>1,2</sup>, Pankaj K. Singh<sup>1</sup>, Alok Das<sup>1†</sup>, R. Rathour<sup>3</sup>, M. Variar<sup>4</sup>, S. K. Prashanthi<sup>5</sup>, A. K. Singh<sup>6</sup>, U. D. Singh<sup>6</sup>, Duni Chand<sup>2</sup>, N. K. Singh<sup>1</sup> and Tilak R. Sharma<sup>1</sup>\*

<sup>1</sup> National Research Centre on Plant Biotechnology, Pusa Campus, New Delhi, India, <sup>2</sup> Department of Biotechnology, Himachal Pradesh University, Shimla, India, <sup>3</sup> Department of Agricultural Biotechnology, CSK Himachal Pradesh Agricultural University, Palampur, India, <sup>4</sup> Central Rainfed Upland Rice Research Station, Central Rice Research Institute, Hazaribagh, India, <sup>5</sup> School of Agricultural Biotechnology, University of Agricultural Sciences, Dharwad, India, <sup>6</sup> Indian Agricultural Research Institute, New Delhi, India

#### Edited by:

Joanna Marie-France Cross, Inönü University, Turkey

#### Reviewed by:

Mehboob-ur-Rahman, National Institute for Biotechnology and Genetic Engineering, Pakistan

Swarup Kumar Parida, National Institute of Plant Genome Research, India

#### \*Correspondence:

Tilak R. Sharma, National Research Centre on Plant Biotechnology, Pusa Campus, LBS Building, New Delhi-110012, India trsharma@nrcpb.org; trsharma1965@gmail.com

#### †Present Address:

Alok Das, Indian Institute of Pulses Research, Kanpur, India

#### Specialty section:

This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science

Received: 30 January 2015 Accepted: 30 April 2015 Published: 21 May 2015

#### Citation:

Thakur S, Singh PK, Das A, Rathour R, Variar M, Prashanthi SK, Singh AK, Singh UD, Chand D, Singh NK and Sharma TR (2015) Extensive sequence variation in rice blast resistance gene Pi54 makes it broad spectrum in nature. Front. Plant Sci. 6:345. doi: 10.3389/fpls.2015.00345

The rice blast resistance gene Pi54, cloned from the rice line Tetep, is effective against diverse isolates of Magnaporthe oryzae. In this study, we prospected the allelic variants of this dominant blast resistance gene from a set of 92 rice lines to determine the nucleotide diversity, pattern of its molecular evolution, phylogenetic relationships and evolutionary dynamics, and to develop allele specific markers. High quality sequences were generated for homologs of the Pi54 gene. Using comparative sequence analysis, InDels of variable sizes were observed in all the alleles. Profiling of the selected sites of SNP (Single Nucleotide Polymorphism) and amino acids (N sites ≥ 10) exhibited constant frequency distribution of mutational and substitutional sites between the resistant and susceptible rice lines, respectively. A total of 50 new haplotypes based on the nucleotide polymorphism were also identified. A unique haplotype (H\_3) was found to be linked to all the resistant alleles isolated from indica rice lines. Unique leucine zipper and tyrosine sulfation sites were identified in the predicted Pi54 proteins. Selection signals were observed in the entire coding sequence of resistance alleles, as compared to the LRR domains for susceptible alleles. This is the first report of extensive variability of Pi54 alleles in different landraces and cultivated varieties, which possibly accounts for its broad-spectrum resistance to Magnaporthe oryzae. The sequence variation in two consensus regions of 163 and 144 bp was used for the development of allele specific DNA markers. Validated markers can be used for the selection and identification of better allele(s) and their introgression in commercial rice cultivars employing marker assisted selection.

Keywords: allele mining, Pi54, rice landraces, polymorphism, blast resistance genes, allele specific markers

## Introduction

Blast disease, caused by the fungus Magnaporthe oryzae, is one of the most widespread and devastating diseases of rice. Management of rice blast through host resistance is a promising component of the Integrated Disease Management (IDM) programme. To date, about 101 major rice blast resistance (R) genes have been identified, and 20 of them have been cloned and characterized (Sharma et al., 2012). The R-genes identified, cloned and characterized so far are categorized into eight classes based on their amino acid motif organization (Sharma et al., 2014). The majority of loci associated with rice blast disease resistance have been reported on chromosome 11 of rice, based on genome wide association studies (Wang et al., 2014). Although several blast resistance loci have been identified, only a few of them have been employed in breeding for blast management in India (Singh et al., 2011). Further, only limited success has been realized in durable resistance breeding programmes, owing to the variability of the pathogen across locations. Harnessing the rice diversity that has adapted in farmers' fields over the years appears to be a promising alternative in the search for resistance sources.

Exploring the genetic variants from germplasm (wild and cultivated) is currently being envisaged in many crop species. One of the most widely used methods to identify variants, known as allele mining, employs polymerase chain reaction (PCR) based techniques to amplify homologs (possible alleles) from the gene pool. Recently, allele mining for blast resistance has been reported from wild and cultivated species of rice (Yang et al., 2007; Geng et al., 2008; Huang et al., 2008). Studies of the Pi-ta gene in rice lines including wild (AA and CC genome) and cultivated species indicated that the sequence was conserved before divergence (Wang et al., 2008). In another study, Pi-ta orthologs from 26 accessions (Oryza rufipogon, O. sativa, O. meridionalis, and O. officinalis) collected from 10 different countries highlighted a dimorphic pattern of nucleotide polymorphism and low nucleotide diversity at the LRD region (Yoshida and Miyashita, 2009). Along similar lines, the allelic variants and flanking sequences of Pi-ta have been studied in 159 geographically diverse accessions of Oryza species (AA genome) (Lee et al., 2009). The Pi-ta alleles have also been studied extensively in Indian landraces (Thakur et al., 2013b). Other blast resistance loci like Pid3, Pi9 and Piz(t) have been explored to study the nucleotide polymorphism and evolutionary pressure (Shang et al., 2009; Liu et al., 2011; Thakur et al., 2013a). However, such detailed analysis is lacking for the important blast resistance gene Pi54, which confers broad spectrum resistance to blast disease (Sharma et al., 2010). The Pi54 gene is located on chromosome 11 and has a unique zinc finger domain in addition to the LRR domain (Sharma et al., 2005a,b; Gupta et al., 2012). Functional complementation indicated that this gene provides a stable and high level of resistance against geographically diverse strains of M. oryzae collected from different parts of India (Rai et al., 2011). The gene possibly triggers up-regulation of defense response genes (callose, laccase, PAL, and peroxidase) and transcription factors (NAC6, Dof Zinc finger, MADS box, bZIP, and WRKY) that fortify the cell wall/plasmodesmata, leading to a hypersensitive response and affecting the resistance reaction (Gupta et al., 2011). Currently, the gene is being used in breeding programmes for enhanced blast resistance (Singh et al., 2011). An ortholog of the Pi54 gene from a wild species of rice has also been recently cloned and functionally validated (Das et al., 2012). However, the allelic variants of the Pi54 gene have not been characterized from rice landraces, which are believed to have co-evolved with the pathogen and hence represent a better "evo-devo" perspective of the resistance reaction.

To date, cultivated and wild species of rice have been employed for prospecting novel variants of blast resistance genes. Landraces too represent unmatched genetic potential for rice improvement. Local landraces, or local rice varieties, are genetically diverse, balanced populations that are in equilibrium with the environment and pathogens, in contrast to modern rice varieties. Unlike high yielding varieties, the landraces are endowed with tremendous genetic variability, as they are not subjected to subtle selection over a long period of time. This probably helps landraces to adapt to wide agro-ecological niches and makes them a source of unmatched qualitative traits, medicinal properties and important genetic resources for resistance to pests and diseases. Owing to their specific domination in geographical niches, landraces have genes for resistance to biotic stresses, which have not been widely utilized or incorporated into modern varieties (Ram et al., 2007). The landraces grown in rice blast "hot-spots" of the Indian subcontinent have remained largely unexplored. Molecular markers linked to major R-genes represent an important tool for marker assisted selection (MAS) (Costanzo and Jia, 2010). Variations in terms of Single Nucleotide Polymorphisms (SNPs) and insertion-deletions (InDels) covering the entire genic segment can be compared among genotypes to identify functional markers to aid the selection process. Markers associated with two cloned blast R genes (Pi-b and Pi-ta), as well as PCR-based SNP markers for the Piz and Pik loci (Hayashi et al., 2004; Jia et al., 2009; Zhai et al., 2011), are a few examples. Conventional breeding with MAS would therefore benefit from the development of new R gene specific markers, which would allow pyramiding of multiple genes in adapted germplasm toward realizing broader spectrum disease resistance. Molecular population genetic analysis of local landraces and cultivated varieties might provide insight into the selection forces maintaining resistance and preventing evolution of new specificities in natural pathogen populations. Therefore, this study was conducted with the following objectives: (i) analysis of variants of Pi54 alleles from the cultivated varieties and Indian landraces of rice collected from different eco-geographical regions; (ii) structural analysis of Pi54 alleles to understand molecular evolution at the locus; and (iii) development of allele specific functional markers for use in marker assisted selection.

## Materials and Methods

## Plant Material and Fungal Culture

A set of 92 rice lines (landraces and cultivated varieties) were selected from different geographic locations of India for prospecting of Pi54 alleles. The diagnostic isolate of M. oryzae (Mo-nwi-37-1) was used for the phenotypic evaluation of all the rice lines (Rai et al., 2011; Rathour et al., unpublished data).

## Preparation of Fungal Culture

The fungal culture of Mo-nwi-37-1 was maintained on Oat Meal Agar (HiMedia, India) medium in pre-sterilized Petri plates (90 mm diameter). For sporulation, the culture was multiplied in Mathur's medium (Dextrose 8 g/L, Magnesium sulfate 2.5 g/L, Potassium phosphate 2.75 g/L, Neo-Peptone 2.5 g/L, Yeast Extract 2.0 g/L, and agar 16 g/L). The culture plates were maintained at 22°C for 12–16 days under constant illumination with white fluorescent light (55µF/Em/s). For the preparation of fungal spores, 5 ml of 0.2% gelatine solution was added to the agar surface of each plate, which was gently rubbed with a scraper to separate conidia from the conidiophores. The spore concentration was adjusted to approximately 10<sup>5</sup> spores/ml. The seedlings were sprayed with about 1 ml of spore suspension per plant at the 2–3 leaf stage.

## Inoculation of Rice Lines with Diagnostic M. oryzae Isolate

Rice lines were grown in plastic pots (12 inch dia.) containing sterilized potting mixture in the rice blast testing facility, NRCPB, IARI, New Delhi. The rice lines Tetep and Taipei 309 were used as positive and negative controls, respectively. The photoperiod was set to 16 h light/8 h dark. The day and night temperatures were maintained at 25°C and 21°C, respectively, with relative humidity (RH) of more than 90%. All the seedlings assessed in the experiment were sprayed simultaneously with M. oryzae spore suspension of 10<sup>5</sup> spores/ml. Disease reaction was recorded 7 days after inoculation using a 0–5 disease assessment scale (Bonmann et al., 1986), where 0 = no evidence of infection; 1 = brown specks smaller than 0.5 mm in diameter, no sporulation; 2 = brown specks about 0.5–1.0 mm in diameter, no sporulation; 3 = roundish to elliptical lesions, 1–3 mm in diameter, gray center surrounded by brown margins, lesions capable of sporulation; 4 = typical spindle shaped blast lesions capable of sporulation, 3 mm or longer; 5 = lesions as in 4 but about half of 1–2 leaf blades killed by coalescence of lesions. Reaction types 0, 1, 2, and 3 were considered resistant, while 4 and 5 were considered susceptible.
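The resistant/susceptible call described above is a simple threshold on the 0–5 score. The following minimal sketch illustrates that rule; the function name and the example line names are hypothetical and not taken from the study.

```python
def classify_reaction(score):
    """Map a 0-5 blast disease score to a phenotype class.

    Scores 0-3 are treated as resistant and 4-5 as susceptible,
    following the rule stated in the text.
    """
    if score not in (0, 1, 2, 3, 4, 5):
        raise ValueError(f"score must be an integer from 0 to 5, got {score!r}")
    return "resistant" if score <= 3 else "susceptible"


if __name__ == "__main__":
    # Hypothetical scores for illustration only; not data from the study.
    example_scores = {"line_A": 1, "line_B": 3, "line_C": 4}
    for line, score in example_scores.items():
        print(line, score, classify_reaction(score))
```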

## PCR Amplification and Sequencing

Genomic DNA was extracted from fresh leaves of selected rice lines using the modified Cetyltrimethyl Ammonium Bromide (CTAB) method of DNA isolation (Murray and Thompson, 1980). For PCR amplification, the nucleotide sequence of the blast resistance gene Pi54 (Loc\_Os11g42010) was retrieved from the NCBI database (www.ncbi.nlm.nih.gov/). Overlapping oligos Pi54\_F1 (CAATATAGCTGGGAATTTCAGAGG) and Pi54\_R1 (AGATAATGTGTTTGTCTGGCTGTC); Pi54\_F2 (CATGAACAGAGCACTGATGACATA) and Pi54\_R2 (GGATAACAAGCACTGAGCCATATC); Pi54\_F3 (CCGTTCTGACCATAGAAATTATCG) and Pi54\_R3 (GTGCAATTACATAAGCTAGACCTTG) were designed using Primer 3 software (Rozen and Skaletsky, 2000) to amplify a 1.5 kb region using the primer walking technique. PCR was performed with genomic DNA isolated from rice landraces and cultivated varieties using Pfu polymerase (FINNZYMES OY, Keilaranta, Espoo, Finland) with the following thermal cycling conditions: initial DNA denaturation at 95°C for 2 min followed by 30 cycles of 95°C for 30 s, 58°C (Pi54\_F1 and Pi54\_R1; Pi54\_F2 and Pi54\_R2) or 60°C (Pi54\_F3 and Pi54\_R3) for 30 s, 72°C for 1 min, final elongation at 72°C for 10 min and hold at 4°C. The PCR derived amplification products were used as template for determining DNA sequences using Sanger's dideoxy method of sequencing.
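The primer set and cycling conditions above can be summarized in a few lines of code. This is a minimal sketch under stated assumptions: the primer sequences and step durations are copied from the text (with line-break spaces removed), while the dictionary, the function names and the GC-content summary are illustrative additions, and ramp times and the final 4°C hold are ignored.

```python
# Primer sequences as given in the text; names of the containers are hypothetical.
PRIMERS = {
    "Pi54_F1": "CAATATAGCTGGGAATTTCAGAGG",
    "Pi54_R1": "AGATAATGTGTTTGTCTGGCTGTC",
    "Pi54_F2": "CATGAACAGAGCACTGATGACATA",
    "Pi54_R2": "GGATAACAAGCACTGAGCCATATC",
    "Pi54_F3": "CCGTTCTGACCATAGAAATTATCG",
    "Pi54_R3": "GTGCAATTACATAAGCTAGACCTTG",
}

# Annealing temperature (degrees C) per primer pair, as stated in the text.
ANNEALING_C = {("Pi54_F1", "Pi54_R1"): 58, ("Pi54_F2", "Pi54_R2"): 58,
               ("Pi54_F3", "Pi54_R3"): 60}


def gc_fraction(seq):
    """Fraction of G and C bases in a primer sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def program_seconds(cycles=30, initial=120, denature=30, anneal=30,
                    extend=60, final_extension=600):
    """Nominal run time of the cycling program, excluding ramps and the hold."""
    return initial + cycles * (denature + anneal + extend) + final_extension


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        print(f"{name}: {len(seq)} nt, GC = {gc_fraction(seq):.2f}")
    print("Nominal program time (min):", program_seconds() / 60)
```

With the stated durations, the nominal program time is 120 s + 30 × 120 s + 600 s = 4320 s, that is 72 min before ramping.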

## Sequencing of PCR Amplicon

The purified PCR amplicon was sequenced directly, according to the manufacturer's instructions, using modified Sanger's dideoxy terminator cycle sequencing chemistry. Each amplicon was sequenced twice in both forward and reverse directions using primers specific to the amplified product. The cycle sequencing reaction was run with the following thermal cycling conditions: 30 cycles of denaturation (95°C for 20 s), annealing (60°C for 20 s), and extension (60°C for 4 min), followed by a hold at 4°C. The purified sequencing products were resolved on a capillary-based automated DNA sequencer (ABI 3730xl DNA Analyzer). Full length sequences were obtained by assembling multiple reads of each fragment using Phred/Phrap and Consed software (Ewing and Green, 1998). Each fragment was sequenced at least four times and the high quality (Phred 20) consensus sequence was used for data analysis.
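For readers unfamiliar with Phred scores, the quality threshold mentioned above corresponds to a base-calling error probability through the standard relation Q = −10 log<sub>10</sub>(P). The sketch below illustrates this conversion; the function names are illustrative and are not part of the Phred/Phrap software.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score Q to the estimated error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred quality score Q."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: error probability = {phred_to_error_probability(q):g}")
```

A Phred score of 20 therefore corresponds to an expected error rate of 1 in 100 base calls.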

## Analysis of Sequenced Pi54 Alleles

The sequence data were aligned using ClustalW 2.0 (Larkin et al., 2007) with default alignment parameters and manually corrected in MEGA 4.0. Gene coding regions were predicted with FGENESH (Solovyev et al., 2006) using the original Pi54 (Tetep) sequence as a reference. The LRR domain was identified as described earlier (Sharma et al., 2005a). Motifs were identified using the motif scan software (http://hits.isb-sib.ch/cgi-bin/PFSCAN) and the SMART tool (http://smart.embl-heidelberg.de/). Phylogenetic analysis was performed with MEGA 4.0 using the Neighbor-Joining method (Saitou and Nei, 1987). The bootstrap consensus tree inferred from 1000 replicates was used to represent the evolutionary history of the taxa analyzed (Felsenstein, 1985). Branches corresponding to partitions reproduced in less than 50% of bootstrap replicates were collapsed. The tree was drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Maximum Composite Likelihood method (Tamura et al., 2004) and are expressed as the number of base substitutions per site. All positions containing gaps and missing data were eliminated from the dataset (complete deletion option). The overall transition:transversion ratio and the numbers of variable and parsimony-informative positions were calculated using MEGA 4.0 software (Tamura et al., 2007).
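To make the complete deletion option and the transition:transversion count concrete, here is a minimal sketch operating on a toy alignment. It counts raw differences only; it is not the Maximum Composite Likelihood distance computed by MEGA, and the toy sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def complete_deletion(alignment):
    """Drop every column that has a gap or ambiguous base in any sequence."""
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] in "ACGT" for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


def count_ti_tv(a, b):
    """Count transitions and transversions between two aligned sequences."""
    ti = tv = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ti += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ti, tv


def p_distance(a, b):
    """Proportion of sites that differ (uncorrected distance)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    toy = ["ACGT-A", "GCTTAA", "ACGTAC"]  # hypothetical aligned sequences
    clean = complete_deletion(toy)
    print(clean)
    print(count_ti_tv(clean[0], clean[1]), p_distance(clean[0], clean[1]))
```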

The DnaSP 5.10 software was used for the analysis of nucleotide polymorphism (Rozas et al., 2003). The aligned DNA sequences were imported into DnaSP to calculate S (number of polymorphic or segregating sites), π (nucleotide diversity), θ (Theta from S, Theta-W), and D (Tajima's D), and to draw the sliding window of nucleotide diversity (π). Haplotype networks were constructed from the potential SNP sites by statistical parsimony with the software TCS 1.21 (Templeton et al., 1992; Clement et al., 2000). The networks were assembled from an absolute distance matrix between haplotypes, i.e., the number of mutations separating each pair of haplotypes, with a parsimony probability of 95%. Haplotype diversity (Hd; Nei, 1987) was compared between disease-resistant and susceptible phenotypes of Oryza species using DnaSP 5.10 (Rozas et al., 2003).
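The summary statistics named above follow standard textbook definitions, which the sketch below applies to a toy gap-free alignment: π is the mean number of pairwise differences per site, θw is S divided by the harmonic number a1 = Σ 1/i (i = 1 … n−1) and by the alignment length, and Hd = n/(n−1) · (1 − Σ p_i²). This is an illustration only; DnaSP handles gaps, missing data, and multiple hits in ways this sketch does not.

```python
from collections import Counter
from itertools import combinations


def segregating_sites(seqs):
    """S: number of alignment columns with more than one state."""
    return sum(1 for col in zip(*seqs) if len(set(col)) > 1)


def nucleotide_diversity(seqs):
    """pi: mean pairwise differences per site."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return diffs / len(pairs) / len(seqs[0])


def watterson_theta(seqs):
    """theta_w per site: S / a1 / L, with a1 = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a1 = sum(1 / i for i in range(1, n))
    return segregating_sites(seqs) / a1 / len(seqs[0])


def haplotype_diversity(seqs):
    """Hd = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(seqs)
    counts = Counter(seqs)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "AATT"]  # hypothetical alleles
    print(segregating_sites(toy), nucleotide_diversity(toy),
          watterson_theta(toy), haplotype_diversity(toy))
```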

## Development of Functional Markers

PCR-based co-dominant and dominant STS (sequence tagged site) markers were designed to target consensus InDels of 144 and 163 bp, respectively. For the dominant marker, the forward primer (DPi54\_163F) was designed from the region flanking the 163 bp insertion, and the reverse primer (DPi54\_163R) was designed from the sequence of the insertion itself. For the co-dominant marker, the forward (CPi54\_144F) and reverse (CPi54\_144R) primers were designed from the regions flanking the 144 bp insertion. The primer pairs [CPi54\_144F (AAGTACTTCATGATCTATTCTACTGG) and CPi54\_144R (CCGTTCTGACCATAGAAATTATCG); DPi54\_163F (ACCATGACTAGCTATGAAAAATCT) and DPi54\_163R (AGAATAGATCATGAAGTACTTGAAAC)] were designed using Primer 3 software (Rozen and Skaletsky, 2000). PCR amplification was carried out with Taq DNA polymerase (Vivantis, USA) on a programmable Thermal Cycler (BioRad, Washington DC, USA) using the following temperature profile: initial DNA denaturation at 95°C for 2 min; 35 cycles of denaturation at 94°C for 20 s, annealing at 55°C (DPi54\_163F and DPi54\_163R) or 58°C (CPi54\_144F and CPi54\_144R) for 30 s, and extension at 72°C for 1 min; final extension at 72°C for 10 min; and hold at 4°C. The PCR-amplified products were resolved on a 2% agarose gel in 1.0X TAE buffer.

## Results

## Phenotypic Evaluation of Rice Landraces

The 92 rice lines used in the present study were grown under contained conditions, and 15-day-old seedlings were challenged with a representative M. oryzae isolate, Mo-nwi-37-1. One week after inoculation, all the rice lines were grouped into resistant and susceptible categories based on their reaction to M. oryzae. Of the 92 rice lines, 72 were resistant and the remaining 20 were susceptible according to the disease assessment scale (Bonmann et al., 1986) (Table S1). These lines were used for allele mining of the Pi54 gene.

### Nucleotide Polymorphism

To determine the nucleotide diversity at the Pi54 locus in rice lines, 1.5 kb long fragments were amplified from 92 rice lines and sequenced using 3 different overlapping primer combinations (**Figure 1**). All the fragments were sequenced and high quality (>Phred 20) assembled sequence of each allele has been deposited in the EMBL database (Table S1). For sequence analysis, the Pi54 alleles were grouped into three different categories: (i) phenotypes: resistant and susceptible (ii) landraces and cultivated varieties; (iii) indica and japonica types (**Table 1**). Nucleotide variations were high in the genic region of Pi54 allele. A total of 197 SNPs, large insertions of 38, 49, 144, and 163 bps and single base pair deletions were identified in the Pi54 alleles.

We calculated the percentage of mutational change with respect to the reference Pi54 gene and compared it within and between disease-resistant and susceptible alleles of Oryza species. All mutational changes were scored for specific positions only. The number of mutations per scored site was equal to or greater than 10 (i.e., Nmut ≥ 10). Overall, 40 mutational sites were identified across the alignment, of which 27 are transitions (ti) and 12 are transversions (tv) (Table S2).

TABLE 1 | Different groups and rice accessions used in mining for blast resistance gene Pi54.

1–3, PCR fragments obtained from overlapping fragments of Pi-54 allele.


Further, mutational profiles of disease-resistant and susceptible phenotypes of Oryza species were constructed for all alleles mined from the 92 rice lines. It was found that 61% of mutational sites have mutations at one site, 18% of the sites have mutations at two to nine sites, and the remaining 21% of sites have mutations at 10 or more sites (**Figure 2A**). Most of the mutational sites exhibited a constant frequency distribution between the resistant and susceptible groups (**Figure 2B**).
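The three frequency classes used in this profiling (one, two to nine, and 10 or more) amount to a simple binning of per-site mutation counts. The sketch below shows that tally on hypothetical counts; the input values are invented for illustration and are not taken from Table S2.

```python
from collections import Counter

BIN_LABELS = ("1", "2-9", ">=10")


def bin_label(n_mut: int) -> str:
    """Assign a per-site mutation count to one of the three classes."""
    if n_mut < 1:
        raise ValueError("a mutational site must carry at least one mutation")
    if n_mut == 1:
        return "1"
    if n_mut <= 9:
        return "2-9"
    return ">=10"


def bin_percentages(counts):
    """Percentage of mutational sites falling in each class."""
    tally = Counter(bin_label(c) for c in counts)
    return {label: 100 * tally[label] / len(counts) for label in BIN_LABELS}


if __name__ == "__main__":
    hypothetical_counts = [1, 1, 3, 12]  # invented per-site counts
    print(bin_percentages(hypothetical_counts))
```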

The Pi54 alleles of landraces harbor substantially higher polymorphism than the Pi54 alleles of cultivated species. This is attributable to the heterogeneous nature of the rice landraces used in the present study, which might have accumulated more mutations during the course of evolution. The nucleotide diversity was higher in the Pi54 alleles of resistant lines than in those of susceptible lines (**Table 2**). The Pi54 alleles of indica and japonica species were almost equally diverse at the nucleotide level. Higher nucleotide variation at the resistant loci is possibly due to interaction with highly avirulent and frequently mutating strains of M. oryzae. Overall, a total of 198 polymorphic sites, excluding InDels, were identified in the 1.5 kb region of the Pi54 alleles. The maximum number (225) of polymorphic sites was identified in the Pi54 alleles isolated from Indian landraces; however, very little diversity was found in the alleles cloned from japonica species (**Table 2**). The average pairwise nucleotide diversity (π) and Watterson's nucleotide diversity estimator (θw) over the Pi54 alleles in susceptible rice lines (π = 0.0208 and θw = 0.01916) were the lowest among all the Pi54 alleles included in the present study. Within the groups, the nucleotide diversity was lower in the Pi54 alleles of cultivated varieties (π = 0.02254 and θw = 0.02102) than in those of landraces (π = 0.03417 and θw = 0.03877). Among the Pi54 alleles of indica and japonica species, diversity was lower in the alleles of japonica species (**Table 2**). The LRR domain harbors substantial diversity, reflecting the selection pressure it has undergone. Higher diversity was observed in the LRR domain of japonica species than in the others (**Figure 3**).

#### Phylogenetic Relationship between the Alleles

The phylogenetic tree constructed from the Pi54 allelic sequences of the 92 accessions resolved two major clusters (**Figure 4**). Both clusters were further divided into separate sub-clusters, but species-specific clustering was not obtained. Separate phylogenetic trees were also constructed for the Pi54 alleles derived from resistant and susceptible lines. All the resistant Pi54 orthologs were grouped into three clusters (**Figure 4A**). Clusters I and II were each further divided into two sub-clusters, i.e., IA and IB, and IIA and IIB. Sub-cluster IA and cluster III consisted entirely of resistance alleles belonging to the indica group, whereas all other sub-clusters included alleles of both indica and japonica groups. The Pi54 alleles from susceptible lines were grouped into two major distinct clusters (**Figure 4B**).

## Pattern of Molecular Evolution

A haplotype network was constructed to identify mutational changes, based on potential SNPs of all Pi54 alleles isolated from the 92 rice lines. We identified fifty new haplotypes from the nucleotide polymorphism of the Pi54 alleles, and the network was used to determine the linkage among them (**Figure 5**). In this network, the 50 haplotypes were clustered into five major haplogroups (a major haplogroup contains three or more Pi54 alleles), with the rest forming minor haplogroups. We identified one resistant-phenotype-specific haplotype (H\_3); the rest of the major haplotypes contained alleles from both resistant and susceptible lines. Furthermore, H\_4 was the only mixed haplotype, consisting of Pi54 alleles from indica, japonica, and aus rice accessions. Similar haplotype networks were constructed separately for the resistant and susceptible Pi54 alleles. Among the resistance alleles, 41 haplotypes were identified, which clustered into five major haplogroups (**Figure S1**). All the major haplogroups consisted of Pi54 alleles isolated from indica rice lines. Among the Pi54 alleles from susceptible rice lines, 14 haplotypes were identified, which clustered into four major haplogroups (**Figure S2**). The major haplogroups of these rice lines consisted of Pi54 alleles of indica rice lines, except for H1, which is a mixed cluster of alleles from both


TABLE 2 | Nucleotide polymorphism of the Pi54 alleles mined from different accessions of rice.

S, Number of polymorphic or segregating sites; π, Nucleotide diversity; θ, Watterson's nucleotide diversity estimator based on silent sites. Tajima's D calculation is based on the difference between the number of segregating sites and the average number of nucleotide differences. Jukes–Cantor (JC) corrected synonymous differences per synonymous site (Ks) and non-synonymous differences per non-synonymous site (Ka). For Tajima's D, no value is significant at 0.10 level (P > 0.10) and 0.10 > P > 0.05 (for bold).

Disease susceptible phenotypes: Whole sequence [1–4194] 87 0.0208 0.00286 0.01916 0.00728 0.27313 <1

Disease susceptible phenotypes: CDS [1338–3585] 71 0.02208 0.00327 0.02104 0.00806 0.15424 <1

Disease susceptible phenotypes: LRR [3324–3462] 3 0.0057 0.00174 0.00699 0.00454 −0.56505 >1

indica and japonica lines. High haplotype diversity (0.935/0.019) was observed within the studied data set of 92 Pi54 alleles of Oryza species. Haplotype diversity was higher in the disease-susceptible Pi54 alleles (0.958/0.028) than in the resistance alleles (0.935/0.019) (**Table 3**).

In the present study, sequence analysis indicated a variable number (0–3) of Open Reading Frames (ORFs) in the Pi54 locus; in 28 rice lines, no ORF was detected. The absence of an ORF in these sequences might be due to reshuffling or recombination events at the locus that removed the start codon or produced a pseudogenized allele, which has lost its function over time. In 40 rice lines, a single exon was predicted, whereas two and three exons were predicted in the allelic sequences of 21 and 10 rice lines, respectively (Table S3). This may be due to the creation of new splice sites during the course of evolution. Variation in the number of ORFs might have arisen from the selection pressure the locus underwent during the evolutionary process. Various insertions have also been identified in the ORFs. The presence of these insertions suggests a differential role in regulating disease resistance.
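To illustrate how the loss of a start codon leaves a sequence with no detectable ORF, the sketch below runs a naive forward-strand ORF scan (ATG to the first in-frame stop) on toy sequences. This is a deliberately simplified stand-in and an assumption of this example: the study used FGENESH, an HMM-based gene predictor that also models splice sites, which this scan does not attempt.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon
    and must contain at least min_codons codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    intact = "CCATGAAATTTTAGGG"   # toy sequence with a start codon
    mutated = "CCACGAAATTTTAGGG"  # same sequence, ATG changed to ACG
    print(find_orfs(intact, min_codons=1))
    print(find_orfs(mutated, min_codons=1))
```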

The percentage of substitutional change was calculated at the protein level for the Pi54 gene and compared within and between disease-resistant and susceptible phenotypes of Oryza species. The amino acid substitutions were scored for specific positions. The number of substitutions per scored site was equal to or greater than 10 (i.e., Nmut ≥ 10). Overall, 23 substitutional sites were identified across the alignment of 64 Pi54 proteins (Table S4). The number of mutations per mutational site was calculated across all 64 aligned predicted sequences: 66% of mutational sites had a mutation at one site, 19% of the sites had mutations at two to nine sites, and the remaining 15% of sites had mutations at 10 or more sites (**Figure 6A**). The mutational profiling of disease-resistant and susceptible phenotypes of Oryza species indicated that most of the mutational sites showed a constant frequency distribution between the resistant and susceptible groups (**Figure 6B**).

The nucleotide sequences of all the Pi54 alleles were translated, and the predicted proteins ranged between 73 and 486 amino acid residues, with many predicted functional domains. The Zinc-finger domain (ZnF) was predicted in all but a few of the Pi54 proteins (**Figure S3**). The sequence of the predicted ZnF domain was highly conserved (100% similarity) in all the Pi54 proteins (**Figure S3**). Important motifs identified in the translated sequences of the Pi54 alleles are N-glycosylation sites, phosphorylation sites (kinase C phosphorylation site, casein kinase II phosphorylation site, tyrosine kinase phosphorylation site), tyrosine sulfation site, N-myristoylation site, and leucine zipper (Table S5). These sites were present in variable numbers in all the Pi54 alleles, except for the tyrosine kinase phosphorylation site and the leucine zipper. The presence of phosphorylation sites in the Pi54 alleles indicates their involvement in signal transduction through activation of downstream genes. The N-myristoylation sites in the predicted proteins play an important role in membrane anchoring, whereas the N-glycosylation sites have a significant role in protein targeting. A unique leucine zipper was identified in 54 and 73% of the resistant and susceptible Pi54 proteins, respectively. Leucine zippers are usually found as part of the DNA-binding domain of many transcription factors and are therefore involved in regulating gene expression. Similarly, tyrosine sulfation sites, which play an important role in strengthening protein-protein interactions, were present in 18 and 21% of the resistant and susceptible Pi54 proteins, respectively. From the above results, it can be concluded that the unique leucine zipper and single tyrosine sulfation sites identified in some of the predicted Pi54 protein sequences were absent in the reference Pi54 protein.

To evaluate the phylogenetic relationship amongst the predicted Pi54 proteins, neighbor-joining trees were constructed using the LRR regions (**Figure S4**). All the predicted Pi54 proteins were grouped into three separate clusters of mixed type. This is in contrast to species-specific groups obtained from NBS and LRR domains of Pi9 alleles (Liu et al., 2011).

#### Analysis of Evolutionary Dynamics

To test the evolutionary selection dynamics of the Pi54 alleles in 92 Oryza accessions, we evaluated the deviation from neutrality with the D statistic (Tajima's D test) (Tajima, 1989). In the present study, the Pi54 alleles taken together have been subjected to positive selection [Tajima's D = −1.07559], deviating from the model of neutrality (**Table 2**). It is noteworthy that the Pi54 alleles of different groups, such as landraces and cultivated varieties, indica and japonica species, and blast-resistant lines, have been subjected to positive selection, whereas balancing selection operates in the Pi54 alleles of susceptible rice lines [Tajima's D = 0.27313]. This might be due to variable selection pressure acting on the locus or to the differing sample sizes used in the present analysis. Further, the coding regions of the 92 Pi54 alleles and of the Pi54 alleles of different groups were analyzed. The value of Tajima's D was negative in the entire coding region (−1.03898) and in the LRR domain (−1.70363), indicating purifying selection in the CDS and LRR regions (**Table 2**). The ratio of non-synonymous (ka) to synonymous (ks) divergence in the whole sequence, the coding region, and part of the coding region (LRR domain) was calculated for all 92 Pi54 alleles and separately for the Pi54 alleles from different groups (**Table 2**). The value of the ka/ks ratio was used as a criterion for the presence or absence of positive selection for amino acid substitutions. The value of ka/ks in the whole sequence and in the coding region of the Pi54 alleles was less than one, indicating a low level of polymorphism in these regions, in contrast to the high level of polymorphism in the LRR domain of all Pi54 alleles, where the ka/ks ratio was greater than one (**Table 2**). In the LRR region of the Pi54 alleles of different groups, the value of ka/ks is greater than one, which indicates that positive directional selection might have favored amino acid substitution in this region. In the present study, the LRR region of the Pi54 alleles was quite variable and might have a role in different recognition specificities, possibly making the gene more durable.
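For readers who want to see how the D statistic is assembled, the following sketch implements the standard formula of Tajima (1989) on a gap-free toy alignment: D = (k − S/a1) / sqrt(e1·S + e2·S·(S − 1)), where k is the mean number of pairwise differences and S the number of segregating sites. Negative values reflect an excess of low-frequency variants and positive values an excess of intermediate-frequency variants. The toy sequences are hypothetical, significance testing is omitted, and the treatment of gaps in DnaSP is not reproduced.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Tajima's D for equal-length, gap-free sequences.

    Returns None when D is undefined (no segregating sites or n < 4).
    """
    n = len(seqs)
    s = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if n < 4 or s == 0:
        return None
    pairs = list(combinations(seqs, 2))
    # k: mean number of pairwise nucleotide differences
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    rare_excess = ["AAAA", "TAAA", "ATAA", "AATA", "AAAT"]  # singletons only
    intermediate = ["AAAA", "AAAT", "AATT", "AAAA"]
    print(tajimas_d(rare_excess), tajimas_d(intermediate))
```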

## Development of Allele Specific Markers

Allele-specific DNA markers are important for the introgression of resistant alleles into cultivated rice varieties using a marker-assisted selection (MAS) strategy. In this study, we developed dominant and co-dominant STS markers based on the large DNA insertions in the allelic sequences of Pi54. The dominant marker (DPi54\_163F and DPi54\_163R) was specifically designed to amplify a fragment of 278 bp in resistant Pi54 alleles and to give no band in susceptible alleles (**Figure 7A**). Using this marker, we were able to distinguish 22 rice lines having the resistant Pi54 allele (presence of the 278 bp amplification product) and 15 rice lines having susceptible Pi54 alleles (**Table 4**). Similarly, the co-dominant marker (CPi54\_144F and CPi54\_144R) was tested in a set of the rice lines used in the present study. PCR amplification with the co-dominant marker yields fragments of 557 and 313 bp (**Figure 7B**). The 557 bp fragment was amplified in 10 susceptible rice lines, and the 313 bp band was present in 28 resistant rice lines. Eight rice lines showed the presence of both alleles (**Table 5**).
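The scoring logic of the two markers can be summarized in a few lines. The sketch below maps observed band sizes to allele calls using the fragment sizes reported above (278 bp for the dominant marker; 313 and 557 bp for the co-dominant marker). The size tolerance and the call labels are assumptions of this sketch, and a missing band with the dominant marker cannot by itself be distinguished from a failed PCR.

```python
DOMINANT_BAND_BP = 278        # resistant allele, dominant marker
CODOM_RESISTANT_BP = 313      # resistant allele, co-dominant marker
CODOM_SUSCEPTIBLE_BP = 557    # susceptible allele, co-dominant marker


def _has_band(observed_bp, expected_bp, tolerance_bp):
    """True if any observed band lies within tolerance of the expected size."""
    return any(abs(size - expected_bp) <= tolerance_bp for size in observed_bp)


def call_dominant(observed_bp, tolerance_bp=15):
    """Score the dominant marker from a list of band sizes in bp."""
    if _has_band(observed_bp, DOMINANT_BAND_BP, tolerance_bp):
        return "resistant allele"
    return "susceptible allele or failed PCR"


def call_codominant(observed_bp, tolerance_bp=15):
    """Score the co-dominant marker from a list of band sizes in bp."""
    has_r = _has_band(observed_bp, CODOM_RESISTANT_BP, tolerance_bp)
    has_s = _has_band(observed_bp, CODOM_SUSCEPTIBLE_BP, tolerance_bp)
    if has_r and has_s:
        return "both alleles"
    if has_r:
        return "resistant allele"
    if has_s:
        return "susceptible allele"
    return "no call"


if __name__ == "__main__":
    print(call_dominant([280]))
    print(call_codominant([310, 560]))
```

Because the co-dominant marker yields a band for either allele, a lane with no band points to a failed reaction rather than a genotype, which is why it is reported as "no call" here.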

## Discussion

Breeding efforts have capitalized on only a fraction of the genetic diversity available to us. Food availability needs to be increased in the face of intensifying demand, climate change, soil degradation, and land and water shortages. Farmers are the custodians of seeds of crop species, primitive varieties (local domesticates called landraces), and wild relatives of crop species (McCouch, 2013). The biodiversity present within farmer-adapted landraces must be mined to discover novel sources of resistance to pests and diseases. Chromosome 11 of rice has been reported to carry the largest number of disease resistance loci and the highest frequency of copy number variations (CNVs). Genes in most of the CNVs were reported to be associated with the resistance phenotype (Yu et al., 2011; Wang et al., 2014). The allelic variants of the dominant blast resistance gene Pi54, located on chromosome 11, were prospected, and variations in terms of SNPs and InDels were documented. These variations might play an important role in the durability of the Pi54 gene against the M. oryzae population. In earlier studies, InDels and SNPs were shown to play a pivotal role in R-gene evolution through selection (Shen et al., 2006). A 5 Mb region (super locus) physically linked to the Pi-ta gene has been reported to impart the resistance phenotype (Jia and Martin, 2008; Lee et al., 2009). The higher frequency of SNPs observed in the present study might be due to the combined analysis of both landraces and cultivated species. Similarly, higher variation was observed between O. sativa and O. rufipogon in a 26 kb region of DNA sequence spanning 22 loci (Rakshit et al., 2007). Further, in all the Pi54 alleles, transitions were more frequent than transversions. This agrees with the general pattern in DNA sequences, where transitions have been reported to occur at higher frequencies than transversions (Brown et al., 1982; Gojobori et al., 1982; Curtis and Clegg, 1984; Wakeley, 1996). This is consistent with earlier genome-wide SNP discovery studies in multiple rice genotypes (Huang et al., 2009; McNally et al., 2009; Yamamoto et al., 2010; Thakur et al., 2014). A relatively higher (70%) frequency of transition substitutions between indica and japonica was observed in earlier studies (Feltus et al., 2004; Shen et al., 2004; International Rice Genome Sequencing Project, 2005). The present study indicates that Pi54 alleles belong to the type II category (intermediate diversified), similar to another blast resistance gene, Pi9 (Yang et al., 2008; Liu et al., 2011). It is increasingly believed that percent polymorphism is directly correlated with evolutionary change (Shen et al., 2006; Yang et al., 2008). Our results suggest that the intermediate level of polymorphism in the Pi54 alleles may be due to the mixed evolutionary pressure experienced by the locus during co-evolution with the rice blast pathogen. Because this gene has not been transferred to cultivated varieties, it might have experienced less pressure from the pathogen.

Hn, Number of Haplotypes; Hd, Haplotype diversity; V(Hd), Variance of Haplotype diversity; SD, Standard deviation of Haplotype diversity.

In the present study, 198 polymorphic sites were identified in the 1.5 kb region of the Pi54 alleles. The nucleotide diversity was higher in the Pi54 alleles of resistant lines than in the susceptible alleles. Previously, it was reported that R-genes experience both high and low levels of sequence diversity depending upon the locus (Yang et al., 2008). Nucleotide diversity (0.024) in A. thaliana was higher than the average π (0.008) in 334 randomly distributed genomic regions due to nucleotide differences between resistant and susceptible alleles, indicating that these alleles have been maintained for a long period of time under natural conditions (Schmid et al., 2005). In barley, the nucleotide diversity ranged from 0.0021 to 0.0189 for genes associated with grain germination (Russell et al., 2004). In another study, diversity was lowest in cultivated species as compared to other Oryza species (Lee et al., 2009). However, the variation in the flanking regions of the Pi-ta gene was highest in O. rufipogon (0.00355), followed by cultivated species of Oryza (Lee et al., 2011). Studies of many R gene loci, such as Rpp5 in Arabidopsis thaliana and Rp1 in Zea mays (Noel et al., 1999; Sun et al., 2001), have revealed a high level of polymorphism between the alleles. In resistant landraces, the expression of the Os11g0225100 locus was higher than in susceptible ones. Even after inoculation, the resistance level in the resistant landrace increased, while it did not change in the susceptible landrace. This high diversity is interpreted as evidence for the fast evolution of these R gene loci.

In the phylogenetic analyses, the Pi54 alleles from susceptible lines were grouped into two major clusters, while all the resistant Pi54 orthologs were grouped into three clusters. A mixed type of grouping was obtained in all the clusters and sub-clusters. However, landraces are found in all the sub-clusters, corroborating the claim of higher variability and heterogeneity. This is in contrast to the Pi9 alleles, wherein cultivated rice along with its ancestors clustered into one group and African cultivated rice along with its ancestors grouped into a separate cluster, suggesting that different selection pressures have acted on the two groups during domestication and/or natural selection (Liu et al., 2011).

Important motifs were identified in the translated sequences of the Pi54 alleles, i.e., N-glycosylation, phosphorylation, tyrosine sulfation, N-myristoylation, and leucine zipper. These sites were present in variable numbers in all the Pi54 alleles, except for the tyrosine kinase phosphorylation site and the leucine zipper. The Zn-finger domain protein is reported to be responsive to wounding, stress hormones, cold, salt, submergence, heavy metals, and desiccation (Vij and Tyagi, 2008). The presence of motif sites in the translated sequence indicates a role in downstream signaling of defense response genes (callose, laccase, PAL, and peroxidase) and transcription factors (NAC6, Dof Zinc finger, MAD box, bZIP, and WRKY) that fortify the cell wall/plasmodesmata, leading to a hypersensitive response and affecting the resistance reaction (Gupta et al., 2011).

In this study, fifty new haplotypes were identified from the nucleotide polymorphism of the Pi54 alleles. Interestingly, we identified one haplotype that is resistance specific (H\_3). A small number of haplotypes was detected previously within the gene pool of cultivated barley (H. vulgare). In Hordeum species, the total number of haplotypes identified in H. spontaneum (46) was almost double that detected in H. vulgare (Russell et al., 2004). Similarly, 16 haplotypes were identified from the nucleotide polymorphism of fifty-one Pi-ta alleles (Huang et al., 2008). In another study, 53 Pi-ta haplotypes were identified from the nucleotide polymorphism of 229 rice accessions belonging to

and co-dominant PCR marker. (A) PCR amplification with the dominant marker (D163\_F and D163\_R) in 18 rice lines. Lanes 1–9: presence of band in rice lines (Acharmati, Aditya, Ananda, Beesginsali, Bidarlocal-2, Dhanaprasad, Gautam, Himdhan, and Pusa sugandh 5); lanes 10–18: absence of band in rice lines (HR-12, TP-309, Co-39, Nipponbare, Shiva, Satti, Pusa basmati-1, Ram Jawain-100, and Sathia-2). (B) PCR amplification with the co-dominant 100 bp molecular marker. Lanes 1–10: presence of susceptible band (CO-39, HR-12, CT-10006-7-2M-5-1P3M, Jyoti, Kavali kannu, Mesebatta, Nipponbare, Samba mahsuri, Shiva, Taipei-309); lanes 11–18: presence of resistant band (Tetep, Budda, Chiti zhini, CSR 10, Gonrra bhog, Himalya 799, Pusa Sugandh 3, Ranbir basmati, Sadabahar); lanes 19–25: presence of both the bands (Belgaum basmati, CSR-60, HLR-142, Indrayani, IRAT-144, Kariya).

#### TABLE 4 | List of rice accessions distinguishable based on dominant marker.

Homozygous RR Homozygous rr Acharmati Pusa sugandh 5 Mesebatta Aditya Salumpikit HR-12 Ananda Samleshwari TP-309 Beesginsali Siddasala Co-39 Bidarlocal-2 V L Dhan 21 Nipponbare Dhanaprasad Vasanesanna batta Shiva Gautam Vikash Satti Himdhan Virendra Pusa basmati-1 IRBB-4 Ram Jawain-100 IRBLB-BIRBLB-5-M Sathia-2 Kala Dhan Lalan kanda Kulanji pille Superbasmati Mingola Parijat Mysore mallige Malviya dhan T23

TABLE 5 | List of rice accessions distinguishable based on co-dominant marker.


seven Oryza species. These findings highlighted the importance of analysis and utilization of haplotypes from landraces and related wild species for crop improvement (Lee et al., 2009).

Balancing and purifying selection have been observed in the evolution of R-genes. The value of Tajima's D was negative in the entire coding region (−1.03898) and in the LRR domain (−1.70363), indicating purifying selection in the CDS and LRR regions. Similar values were also reported for Pi-ta alleles of O. rufipogon and Pi9 alleles of five Oryza species (AA genome) (Lee et al., 2009; Liu et al., 2011). Our analyses imply that selection pressure is high on both the CDS and the LRR domain in resistant lines, but only on the LRR domain in susceptible rice lines. Elucidation of the selection mechanism acting on these domains could help to design new crop improvement strategies for the future.

In the LRR region of the Pi54 alleles of different groups, the value of ka/ks was greater than 1.0, implying that positive directional selection might have favored amino acid substitution in the region. Similar results were also obtained for Piz(t) alleles of Indian landraces from diverse locations (Thakur et al., 2013b). In contrast, the LRR regions encoded by the Pi-km1 and Pi-km2 blast resistance genes were highly conserved (Ashikawa et al., 2010). Importantly, LRRs have direct interacting roles with effector proteins (Young and Innes, 2006). Most isolated R-genes encode proteins possessing an LRR domain, of which the majority also contain an NBS domain (Sharma et al., 2012). The higher level of polymorphism in the LRR region, as obtained in the case of the Pi54 gene, is thought to be involved in the recognition of effector proteins; consequently, the evolutionary pressure exerted on the host by virulent M. oryzae races results in high variability in the LRR domain. The LRR regions of many Arabidopsis R genes have ka/ks ratios >1, suggesting that these R genes have evolved under positive selection pressure (Bergelson et al., 2001). Two basic strategies have evolved for an R protein to recognize a pathogen effector (also called an avirulence (Avr) factor): direct physical interaction, and indirect interaction via association with other host proteins targeted by the Avr factor (Xiao et al., 2008). It has also been reported that variation for disease resistance is maintained by frequency-dependent selection, even though there is a fitness cost associated with the maintenance of R genes in the absence of their matching Avr (Stahl et al., 1999). Flax L genes and their matching Avr genes in flax rust undergo strong diversifying selection, suggesting direct interaction (Dodds et al., 2006). However, in A. thaliana, the indirect interaction between RPM1 and AvrB, in which the RIN4 protein acts as a target for binding to AvrB and the AvrB-induced phosphorylation of RIN4 then activates RPM1, has been reported to be under balancing selection (Mackey et al., 2002).

Development of allele-specific functional markers holds the key for marker-assisted selection. The functional markers based on genic Pi54 InDels can be applied in MAS for blast resistance breeding programmes. The absence of diagnostic bands in some resistant or susceptible cultivars might be because of genetic recombination during meiosis. Nevertheless, the present study extends the repertoire of functional markers for screening genotypes. A similar functional InDel-based marker has been developed for the blast resistance gene Pikm (Costanzo and Jia, 2010). The sequences of nine blast resistance genes have been used for the development of functional markers based on InDels (Hayashi et al., 2006). However, allele-specific markers were not known for the blast resistance gene Pi54; hence the markers developed in this study would be of great significance to breeders. Increasing efforts to clone more resistance genes worldwide will accelerate the development of more dominant resistance gene based markers for molecular breeding, thereby accelerating the introduction of durable, broad-spectrum blast resistance genes into widely adapted, high-yielding rice cultivars.

From the above discussion, we conclude that the nucleotide variation was high in the LRR domain of all the Pi54 alleles cloned and characterized in this study. In disease resistant alleles, selection pressure was high in LRR domain and CDS region whereas in susceptible counterpart selection pressure exerted only in the LRR domain. It was also evident that LRR domain of Pi54 alleles was diversified because of high selection pressure. The co-dominant and dominant functional markers developed in the present study can be used in marker-assisted breeding programs aimed at improvement of blast resistance in elite rice cultivars. The diversity information based on genetic structure is an extremely important pre-breeding material in selecting parents for intra- and inter- group crosses to broaden the genetic base of modern rice cultivars. This study helps understand the extent of variability present in the landraces and cultivated varieties of rice that can be employed in future for selection of better alleles and their utilization in resistant breeding programmes.

## Author Contributions

Conceived and designed the experiments: TRS. Performed the experiments and analyzed data: ST. Contributed materials/analysis tools: RR, MV, SP, DC, AD, NS, AS, UD, and PS. Wrote the paper: TRS, ST, and AD.

## Acknowledgments

TRS is thankful to the National Agricultural Innovation Project (NAIP) (C4/C1071), ICAR, for financial support. TRS is thankful to the Department of Science and Technology, Govt. of India for JC Bose National Fellowship. The authors are thankful to the Officer in Charge, National Phytotron Facility, Indian Agricultural Research Institute, New Delhi, for providing basic facilities for growing and maintaining Indian local landraces.

## Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2015.00345/abstract

Figure S1 | Haplotype network based on 187 potential SNPs of the Pi54 resistant alleles. Each group of haplotypes is shown as a solid circle, and five major haplotypes are marked in larger circles. Each branch represents a single mutational step. Branches with small solid circles indicate that there is more than a single mutational step between haplotypes. Different sizes of circles represent the different numbers of each haplotype.

Figure S2 | Haplotype network based on 97 potential SNPs of the Pi54 susceptible alleles. Each group of haplotypes is shown as a solid circle, and four major haplotypes are marked in larger circles. Each branch represents a single mutational step. Branches with small solid circles indicate that there is more than a single mutational step between haplotypes. Different sizes of circles represent the different numbers of each haplotype.

Figure S3 | Analysis of the ZnF domain of 61 Pi54 proteins. Arrows indicate that the ZnF domain was not predicted in spite of the presence of the ZnF sequence in the Pi54 proteins. Single amino acid substitutions in two Pi54 proteins are encircled.

Figure S4 | Phylogenetic tree of the LRR domain of 64 Pi54 proteins. The Neighbor-Joining method was used to construct the tree. Definite clustering was not obtained in all the trees.

## References


the Pi-k<sup>h</sup> gene of rice, which confers resistance to Magnaporthe grisea. Mol. Genet. Genomics 274, 569–578. doi: 10.1007/s00438-005-0035-2


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Thakur, Singh, Das, Rathour, Variar, Prashanthi, Singh, Singh, Chand, Singh and Sharma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Analysis of genetic variation and diversity of Rice stripe virus populations through high-throughput sequencing

### *Lingzhe Huang<sup>1†</sup>, Zefeng Li<sup>2†</sup>, Jianxiang Wu<sup>1</sup>, Yi Xu<sup>1</sup>, Xiuling Yang<sup>1</sup>, Longjiang Fan<sup>2</sup>, Rongxiang Fang<sup>3</sup>\* and Xueping Zhou<sup>1</sup>\**

#### *Edited by:*

*Joanna Marie-France Cross, Inonu, Turkey*

#### *Reviewed by:*

*Abdelaleim Ismail ElSayed, Zagazig University, Egypt Angelantonio Minafra, CNR Institute for Sustainable Plant Protection, Italy Shahin S. Ali, United States Department of Agriculture, USA*

#### *\*Correspondence:*

*Xueping Zhou, State Key Laboratory of Rice Biology, Institute of Biotechnology, Zhejiang University, Hangzhou 310058, People's Republic of China zzhou@zju.edu.cn; Rongxiang Fang, State Key Laboratory of Plant Genomics, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, People's Republic of China fangrx@im.ac.cn †These authors have contributed equally to this work.*

#### *Specialty section:*

*This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science*

> *Received: 31 October 2014 Accepted: 05 March 2015 Published: 24 March 2015*

#### *Citation:*

*Huang L, Li Z, Wu J, Xu Y, Yang X, Fan L, Fang R and Zhou X (2015) Analysis of genetic variation and diversity of Rice stripe virus populations through high-throughput sequencing. Front. Plant Sci. 6:176. doi: 10.3389/fpls.2015.00176*

*<sup>1</sup> State Key Laboratory of Rice Biology, Institute of Biotechnology, Zhejiang University, Hangzhou, People's Republic of China, <sup>2</sup> Institute of Crop Science and Institute of Bioinformatics, Zhejiang University, Hangzhou, People's Republic of China, <sup>3</sup> State Key Laboratory of Plant Genomics, Institute of Microbiology, Chinese Academy of Sciences, Beijing, People's Republic of China*

Plant RNA viruses often generate diverse populations in their host plants through error-prone replication and recombination. Recent studies on the genetic diversity of plant RNA viruses in various host plants have provided valuable information about RNA virus evolution and the emergence of new diseases caused by RNA viruses. Using high-throughput sequencing, we analyzed the genetic diversity of Rice stripe virus (RSV) populations in *Oryza sativa*, a natural host of RSV, and compared it with that of RSV populations generated during infection of *Nicotiana benthamiana*, an experimental host. From infected *O. sativa* and *N. benthamiana* plants, a total of 341 and 1675 site substitutions were identified in the RSV genome, respectively, and the average substitution ratio at these sites was 1.47% and 7.05%, respectively, indicating that the RSV populations from the infected *N. benthamiana* plant are more diverse than those from the infected *O. sativa* plant. Our results provide direct evidence that a virus might allow higher genetic diversity for host adaptation.

#### Keywords: Rice stripe virus, population, diversity, deep sequencing

## Introduction

RNA viruses are known for their high evolutionary potential, due mainly to their error-prone and fast genome replication (Domingo and Holland, 1997; Roossinck, 1997; Elena and Sanjuán, 2007; Lauring and Andino, 2010; Sanjuan et al., 2010). In addition, template switching of the viral RNA-dependent RNA polymerase (RdRp) during virus replication, known as RNA recombination, also accelerates virus evolution (Nagy and Simon, 1997; Garcia-Arenal et al., 2001). Using these strategies, viruses are capable of bringing beneficial mutations into and removing deleterious mutations from their genomes (Worobey and Holmes, 1999). Mutations in the viral genomic sequence may lead to changes in virus host range and disease symptoms, and to the emergence of new viruses in nature. Previous reports have indicated that mutant viruses often appear soon after an RNA virus infects its host plant and form diverse populations in infected cells; these studies relied on sequencing randomly selected clones derived from an infected plant (Domingo and Holland, 1997; Roossinck, 1997; Schneider and Roossinck, 2000; Garcia-Arenal et al., 2001, 2003).

Rice stripe virus (RSV) is one of the most damaging rice pathogens in China and was first identified in rice fields in China in 1963 (Lin et al., 1990). It has now been identified in rice fields in 16 provinces of China and causes significant yield losses (Xiong et al., 2008; Wei et al., 2009). RSV is the type member of the genus *Tenuivirus* and infects mainly rice and a few other species in the family *Poaceae*, including maize, oat, and wheat. In the field, RSV is transmitted efficiently by the small brown planthopper (*Laodelphax striatellus*) in a persistent manner. RSV can propagate within the insect vector and be transmitted to its progeny (Koganezawa et al., 1975; Falk and Tsai, 1998). Under laboratory conditions, RSV can be transmitted to an experimental host, *Nicotiana benthamiana*, through mechanical inoculation (Kong et al., 2014).

Rice stripe virus is a single-stranded RNA virus containing four genome segments, designated RNA 1, 2, 3, and 4 (**Figure 1**; Zhu et al., 1991, 1992; Takahashi et al., 1993; Toriyama et al., 1994). RNA 1 is negative sense and encodes a putative protein of approximately 337 kDa. This protein was suggested to be part of the RSV RdRp complex and is associated with the RSV filamentous ribonucleoprotein (Toriyama et al., 1994). The other three RNA segments of RSV are all ambisense, and each contains two ORFs, one in the viral sense (vRNA) and the other in the viral complementary sense (vcRNA; Ramirez and Haenni, 1994). The vRNA 2 ORF encodes a membrane-associated protein of 22.8 kDa and the vcRNA 2 ORF encodes a poly-glycoprotein of 94 kDa. The functions of these proteins are still unknown (Takahashi et al., 1993; Falk and Tsai, 1998). The vRNA 3 ORF encodes a 23.9-kDa suppressor of RNA silencing (Xiong et al., 2009) and the vcRNA 3 ORF encodes a nucleocapsid protein of 35 kDa (Kakutani et al., 1991; Zhu et al., 1991). The vRNA 4 ORF encodes a major non-capsid protein of 21.5 kDa that functions in disease symptom development (Kong et al., 2014). The vcRNA 4 ORF encodes a protein of 32.5 kDa that is required for RSV cell-to-cell movement in infected plants (Xiong et al., 2008).

Determining the complexity of RNA virus populations in their host plants requires a rapid, reliable, and cost-effective sequencing method. Because viral RNAs with specific mutations may accumulate to lower levels than the most frequent viral RNAs in infected cells, the traditional small-scale sequencing used in many earlier studies might fail to detect these non-conserved or low-abundance viral RNAs. Although the evolution of RSV over long periods of time and across large geographical areas in China has been studied (Wei et al., 2009), the population diversity of RSV within a single plant is still unknown. In this study, we used high-throughput sequencing (also known as second-generation sequencing) to investigate the diverse population of RSV in an infected *Oryza sativa* plant (a natural host) and compared it with that observed in an infected *N. benthamiana* plant (an experimental host). We found that the RSV populations from the infected *N. benthamiana* plant are more diverse than those from the infected *O. sativa* plant. Our results provide direct evidence that a virus might allow higher genetic diversity for host adaptation.

## Materials and Methods

## Virus Source, Host Plants, and Virus Inoculation

Infected *O. sativa* plants with typical RSV symptoms were collected from rice fields in Jiangsu Province, China. A single plant infected only with RSV was identified through RT-PCR using specific primers and ELISA using specific antibodies as described previously (Wang et al., 2004), and leaves of this plant were then stored at −80 °C until use. *N. benthamiana* seedlings were grown inside an insect-free room at a constant temperature of 25 °C with a 16 h light period. Leaves from the RSV-infected *O. sativa* plant were ground in 0.1 M phosphate buffer, pH 7.0, and the crude extract was mechanically inoculated onto leaves of *N. benthamiana* seedlings at the six- to seven-leaf stage. Young systemic leaves of one *N. benthamiana* plant showing yellow vein symptoms were harvested at 30 days post inoculation (dpi). The harvested fresh *N. benthamiana* leaves and the frozen RSV-infected *O. sativa* leaves were used for the viral population analysis.

## Viral cDNA Library Construction and Sequencing

Total RNA was extracted from 1 g of leaf tissue from each of the frozen *O. sativa* and fresh *N. benthamiana* samples using TRIzol reagent following the manufacturer's instructions (Invitrogen, Carlsbad, CA, USA), and cDNA libraries of RSV were then constructed. First strand cDNA was synthesized using the SuperScript III reverse transcriptase (Invitrogen, Carlsbad, CA, USA). Primers for first strand cDNA synthesis were designed according to the highly conserved 3′ terminal regions of the four RSV RNAs: 5′-acacaaagtccagaggaaaacaa-3′ for RNA 1, 5′-acacaaagtctgggtataacttctt-3′ for RNA 2, 5′-acacaaagtcctgggtaaaatag-3′ for RNA 3, and 5′-acacaaagtccagggcatttgt-3′ for RNA 4. Second strand cDNAs were synthesized using DNA polymerase I and RNase H. Ends of the double-stranded cDNAs were repaired using T4 DNA polymerase and Klenow DNA polymerase. Paired-end cDNA libraries were prepared using standard Illumina protocols and then sequenced on an Illumina Genome Analyzer II instrument (Illumina, San Diego, CA, USA).

## Bioinformatics Analysis

To minimize artifacts arising from sequencing errors, short reads were pre-processed using the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx\_toolkit/) by removing low-quality reads (below Q30) and adaptors. The reads representing each sample were mapped onto the reference sequences of the RSV genome segments (NC\_003753.1, NC\_003754.1, NC\_003755.1, and NC\_003776.1) using the MAQ program (Mapping and Assembly with Qualities, Version 0.7.1) with default parameters (Li et al., 2008).
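The read-level quality filter can be illustrated with a minimal Python sketch. It assumes Phred+33 encoded FASTQ quality strings and treats a read as low quality when its mean Phred score falls below 30. The exact criterion and FASTX-Toolkit options used in the study are not stated, so `mean_phred` and `keep_read` are hypothetical helpers for illustration, not the toolkit's own logic.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(quality: str, min_q: float = 30.0) -> bool:
    """Assumed rule: keep a read only if its mean quality is at least min_q."""
    return mean_phred(quality) >= min_q


# Under Phred+33, 'I' encodes Q40 and '#' encodes Q2 (toy reads, not study data).
reads = {"read_a": "IIIIIIII", "read_b": "II##II##"}
kept = [name for name, qual in reads.items() if keep_read(qual)]
print(kept)
```

A stricter rule, such as requiring a minimum fraction of bases at or above Q30, would only change the body of `keep_read`.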

## Results

## High-throughput Sequencing of RSV Genome from Infected *O. sativa* and *N. benthamiana* Plants

The cDNA libraries prepared from the RSV-infected *O. sativa* plant and the RSV-infected *N. benthamiana* plant were sequenced using the Illumina/Solexa platform. The sequences have been submitted to the Sequence Read Archive (SRA) under accession number SRP051574. A total of 27,965,556 reads were obtained, and 6.67% of them were mapped to the RSV genomic sequence. The average sequencing depths of the RSV genome segments ranged from 953.8 (RNA 2) to 2129.1 (RNA 4) in *O. sativa* and from 3260.2 (RNA 1) to 7330.5 (RNA 3) in *N. benthamiana*. The number of mapped bases and the sequencing depth of each RSV RNA segment are shown in **Table 1**. These results indicate that the data obtained are sufficient to represent the RSV populations in a single infected plant.
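As a rough illustration of how these figures relate, the sketch below computes the approximate number of mapped reads from the reported totals and shows the usual definition of average depth (mapped bases divided by segment length). The definition is an assumption about how depth was calculated, and the segment length and base count passed to `average_depth` are hypothetical values, not entries from Table 1.

```python
def average_depth(mapped_bases: int, segment_length: int) -> float:
    """Average sequencing depth = mapped bases / reference segment length."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return mapped_bases / segment_length


# Reported totals: 27,965,556 reads, 6.67% of which mapped to the RSV genome.
total_reads = 27_965_556
mapped_reads = round(total_reads * 0.0667)

# Hypothetical segment: 2,000 nt covered by 4,000,000 mapped bases.
example_depth = average_depth(4_000_000, 2_000)
print(mapped_reads, example_depth)
```

With the reported percentages, roughly 1.87 million reads mapped to the RSV genome.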

## Genetic Diversity of RSV Populations in *O. sativa*

With the high-throughput sequencing results, a "two-dimensional view" of RSV populations is drawn. This includes average nucleotide diversity (number of sites with substitution/number of total sites of corresponding sequence) and substitution ratio within each site (number of substituted nucleotide observed/number of total nucleotide sequenced). Substitutions determined in the RSV populations from the infected *O. sativa* plant were located at 123 different sites in RNA 1 (1.37%), 22 sites in RNA 2 (0.63%), 76 sites in RNA 3 (3.07%), and 60 sites in RNA 4 (2.78%). Among the four RSV RNAs, substitutions occurred more densely in RNA 4. There are 13 sites with a substitution ratio above 20% in the RSV RNAs, and all of them were found in RNA 4, except one at the 3570th site in RNA 1 (**Figure 2A**). Average nucleotide diversities in the *NS3*, *CP*, *SP*, and *MP* ORFs were found to be much higher than those in the *RdRp*, *NS2*, and *NSvc2* ORFs and non-coding regions (**Figure 3A**). Although the average nucleotide diversities in the *NS3* and *CP* ORFs were similar to those in the *SP* and *MP* ORFs, the average substitution ratios in these two ORFs were much lower (**Figure 3B**). Interestingly, the average substitution ratio (5.64%) in non-coding regions was higher than any ratio found in the ORFs analyzed. The average nucleotide diversity in non-coding regions was, however, only 0.89%, lower than in all ORFs analyzed except *NS2* (**Figure 3**).

TABLE 1 | Number of mapped bases and sequencing depth of RSV genome segments.
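The two parameters defined above can be made concrete with a short Python sketch. It assumes a per-site table of base counts (a hypothetical `pileup` list built from toy numbers), takes the most frequent base at each site as the master base, and counts any non-master read as a substitution. Whether the study applied a minimum frequency threshold before calling a site substituted is not stated, so none is used here.

```python
from collections import Counter


def site_substitution_ratio(counts: Counter) -> float:
    """Fraction of reads at one site that differ from the most common base."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    master_count = counts.most_common(1)[0][1]
    return (total - master_count) / total


def population_summary(pileup: list[Counter]) -> tuple[float, float]:
    """Return (average nucleotide diversity, mean ratio over substituted sites)."""
    if not pileup:
        raise ValueError("pileup must contain at least one site")
    ratios = [site_substitution_ratio(c) for c in pileup]
    substituted = [r for r in ratios if r > 0]
    diversity = len(substituted) / len(pileup)
    mean_ratio = sum(substituted) / len(substituted) if substituted else 0.0
    return diversity, mean_ratio


# Toy data: four sites, two of which carry substitutions.
pileup = [
    Counter({"A": 1000}),
    Counter({"C": 900, "T": 100}),
    Counter({"G": 1000}),
    Counter({"T": 800, "A": 200}),
]
diversity, mean_ratio = population_summary(pileup)
print(f"diversity={diversity:.2%}, mean substitution ratio={mean_ratio:.2%}")
```

The first value counts how many sites vary, while the second measures how far the variants have spread at those sites, which is why the two can move independently.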

Transitions (i.e., A ↔ G and C ↔ T) were detected at 238 sites and transversions (i.e., A ↔ C, A ↔ T, G ↔ C, and G ↔ T) were found at 103 sites. The ratio of transitions to transversions was about 2.3 (**Table 2**). A previous report also indicated that transitions occurred on average 2.5 times more frequently than transversions (Lemey et al., 2009). Our result shows that there is no significant difference in substitution tendency among the four transitions (**Table 2**). At the transition or transversion sites, the average ratio of each substitution varied without clear preferences, e.g., the average ratio of the A → T transversion (2.79%) was much higher than that of the A → G transition (0.80%). The average ratio of the C → A transversion (2.12%) was similar to that of the C → T transition (2.19%; **Table 2**).
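The classification and the reported ratio can be checked with a few lines of Python. This is only a sketch of the standard purine/pyrimidine rule; the helper names are illustrative, and the site counts are the ones quoted in the paragraph above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(n_transitions: int, n_transversions: int) -> float:
    """Ratio of transition sites to transversion sites."""
    return n_transitions / n_transversions


print(is_transition("A", "G"), is_transition("A", "T"))
# Site counts reported for the O. sativa populations.
print(round(ts_tv_ratio(238, 103), 2))
```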

Non-synonymous substitutions were also found in 71 codons located within 6 of the 7 ORFs analyzed. These include 25 codons in *RdRp*, 7 in *NSvc2*, 9 in *NS3*, 12 in *CP*, 8 in *SP*, and 10 in *MP*. No non-synonymous substitution was detected in the *NS2* ORF. Only 4 of the 71 codons had a substitution ratio above 10%. These include the 150th codon in the *NS3* ORF (16.19%, Pro → Leu), the 8th codon in the *SP* ORF (10.92%, Val → Ile), and the 12th and the 200th codons in the *MP* ORF (12.62 and 34.45%, Ser → Asn and Ile → Val, respectively; **Figure 4A**). Synonymous substitution sites were 3.5 times more numerous than non-synonymous substitution sites. The average substitution ratio of synonymous substitutions was 1.64 times higher than that of non-synonymous substitutions.

FIGURE 2 | Substitution patterns in the RSV genome occurred in *Oryza sativa* (A) and *Nicotiana benthamiana* (B). The substitution pattern in each of the four RSV RNA segments is presented. The vertical coordinates represent the substitution ratios derived from the master sequence of each population at each site (number of substituted nucleotide observed/number of total nucleotide sequenced).

TABLE 2 | Substitution tendency among RSV quasispecies in different hosts.
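The distinction between synonymous and non-synonymous substitutions used above can be sketched in Python with the standard genetic code. The codons in the example are illustrative choices that reproduce the amino acid changes named in the text (for instance Pro → Leu); the actual codons observed in the study are not given, and the sketch assumes sequences are already written in the coding sense.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change by whether the encoded amino acid changes."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    return "synonymous" if ref_aa == alt_aa else "non-synonymous"


print(classify("CCT", "CTT"))  # Pro -> Leu (illustrative codons)
print(classify("GTT", "ATT"))  # Val -> Ile (illustrative codons)
print(classify("CCT", "CCC"))  # Pro -> Pro
```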

## Genetic Diversity of RSV Populations in *N. benthamiana*

To investigate the role of the host in RSV evolution, the virus was mechanically inoculated into *N. benthamiana*, an experimental host of RSV. In the infected *N. benthamiana* plant, substitutions were found at 1675 sites in the RSV populations, approximately five times as many as in the RSV populations from the infected *O. sativa* plant. The average nucleotide diversity and substitution ratio were also very different between these two groups of RSV populations. For RSV populations from *N. benthamiana*, substitutions were detected at 874 sites in RNA 1 (9.74%), 368 sites in RNA 2 (10.47%), 204 sites in RNA 3 (8.24%), and 149 sites in RNA 4 (6.91%). The substitutions occurred more densely in RNA 1 and 2. There were 206 substitution sites with substitution ratios above 20%, and more than half of them were found in RNA 1 (**Figure 2B**).

Our results also indicate that the average nucleotide diversity in each ORF and in the non-coding regions is higher for RSV populations from *N. benthamiana*, especially in the *RdRp*, *NS2*, and *NSvc2* ORFs and the non-coding regions. Consequently, the differences in average nucleotide diversity among the ORFs and non-coding regions are decreased. The highest average nucleotide diversity is in the *NSvc2* ORF (12.97%) and the lowest is in the non-coding regions (6.14%; **Figure 3A**). The average substitution ratios are also increased in all ORFs analyzed. The highest average substitution ratio is found in the *SP* ORF (9.25%) and the lowest in the *CP* ORF (3.75%). Interestingly, the average substitution ratio is not increased in non-coding regions (**Figure 3B**).

A total of 1323 transition sites and 352 transversion sites were detected in the sequences of RSV populations from *N. benthamiana*. The transitions occurred on average 3.76 times more frequently than the transversions, but no significant difference in frequency was observed among the four transition groups (**Table 2**). The average substitution ratios of the four transition groups varied without a clear preference. The highest average substitution ratio was for transversion A → T (8.11%) found in RSV populations from *O. sativa* (**Table 2**).

Non-synonymous substitutions were found in 147 codons distributed across the seven ORFs. These include 60 non-synonymously substituted codons in *RdRp*, 6 in *NS2*, 35 in *NSvc2*, 16 in *NS3*, 11 in *CP*, 6 in *SP*, and 13 in *MP*. Twenty-four of the 147 codons had substitution ratios above 10%, and two of them were detected at the 8th codon in the *SP* ORF and the 200th codon in the *MP* ORF, respectively. It is worth mentioning that the RSV populations from *O. sativa* also had high substitution ratios at these two codons. There are 24 unique non-synonymous substitutions in the RSV populations from *N. benthamiana*, which are likely to be mutations caused by the interaction between RSV and *N. benthamiana* (**Figure 4B**). The average substitution ratios of the synonymous and non-synonymous substitutions were about five times those found in the RSV populations from *O. sativa*.

### Population Structures of RSV in Different Host Plants

The master sequences of the RSV populations from the infected *O. sativa* and *N. benthamiana* plants were identified after mapping the reads from the high-throughput sequencing data onto the reference sequences of the RSV genome segments. The two master sequences differ by 156 substitutions (0.9% of genome nucleotides). The master sequence from the infected *O. sativa* plant is composed of 39.09% GC nucleotides, while that from the infected *N. benthamiana* plant is composed of 38.92%. When the substitution tendency bias was analyzed, more transitions than transversions (approximately fourfold) were observed, and there were remarkably more G/C to A/T substitutions than the reverse substitutions (**Table 3**). After RSV was transmitted from *O. sativa* to *N. benthamiana*, the G/C ↔ A/T substitutions varied among the regions of the RSV genome (**Table 4**). For example, the G/C ↔ A/T substitutions were similar in the non-coding regions of the two master sequences. In the coding regions, the G/C → A/T substitutions seemed to occur more often than the reverse substitutions (**Table 4**). As reported previously, the rice genome is GC rich and rice coding sequences contain 45–75% G and C, whereas the tobacco genome is GC poor and its coding sequences contain 40–60% G and C (Salinas et al., 1988; Matassi et al., 1989; Carels and Bernardi, 2000). The decreased GC level after the virus was transmitted from *O. sativa* to *N. benthamiana* suggests that RSV may adjust its codon usage to ensure efficient protein translation in a new host.
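The comparison of the two master sequences can be sketched as follows. The snippet computes GC content and counts G/C → A/T versus A/T → G/C changes between two aligned sequences. The short sequences are toy examples, not the RSV master sequences, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def gc_shift_counts(before: str, after: str) -> dict[str, int]:
    """Count G/C->A/T and A/T->G/C changes between two aligned sequences."""
    if len(before) != len(after):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"GC_to_AT": 0, "AT_to_GC": 0}
    for old, new in zip(before.upper(), after.upper()):
        if old in "GC" and new in "AT":
            counts["GC_to_AT"] += 1
        elif old in "AT" and new in "GC":
            counts["AT_to_GC"] += 1
    return counts


# Toy aligned sequences standing in for the two master sequences.
first_host = "ACGTGCGATC"
second_host = "ACATGTGATC"
print(gc_content(first_host), gc_content(second_host))
print(gc_shift_counts(first_host, second_host))
```

An excess of `GC_to_AT` over `AT_to_GC` changes lowers the overall GC content, which is the direction of the shift reported between the two hosts.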

## Discussion

Since the quasispecies hypothesis was first introduced by Eigen and Schuster (1977), this population-based framework has been used extensively to study RNA virus evolution. Based on complete or partial genomic sequences, RSV isolates in China were mainly divided into two subtypes: subtype II from Yunnan province and subtype I from the rest of China. RSV isolates in subtype II were more genetically diverse than those in subtype I, and the mean genetic distance of RSV genes between the two subtypes, within subtype I, and within subtype II ranged from 0.0529 to 0.0865, 0.0123 to 0.0256, and 0.0183 to 0.0387, respectively (Wei et al., 2009; Huang et al., 2013). In this study, we used high-throughput sequencing to analyze and profile RSV populations from infected *O. sativa* or *N. benthamiana* plants. Two parameters, average nucleotide diversity and average substitution ratio at single sites, were obtained through high-throughput sequencing and used to evaluate genetic diversity in RSV populations from an infected *O. sativa* or *N. benthamiana* plant. For the whole RSV genome, the average nucleotide diversity and average substitution ratio were 1.99% and 1.47% from *O. sativa*, and 9.79% and 7.05% from *N. benthamiana*, respectively. RSV isolates in China were reported to have low genetic diversity because of the narrow host range of the virus (Wei et al., 2009). Our study found that the diversity of RSV populations increased after RSV was transmitted from *O. sativa* to *N. benthamiana*. *N. benthamiana* might impose a quite different selection pressure from *O. sativa*, and this observation also provides direct evidence that a virus might allow higher genetic diversity for host adaptation (Fargette et al., 2006).

TABLE 4 | Number of A/T **↔** G/C substitutions in the master sequence after RSV was transmitted from *O. sativa* to *N. benthamiana.*

The two parameters are independent of each other regardless of the origin of the two virus populations. For example, the average nucleotide diversity in the *CP* ORF was the highest in the RSV populations from the infected *O. sativa* plant, whereas the average substitution ratio in this ORF was much lower than in some of the other ORFs (**Figure 3**). Also, transitions were much more frequent than transversions in the RSV populations from both infected *O. sativa* and *N. benthamiana* plants (**Table 4**), but their average substitution ratios did not differ (**Figure 2**). These results indicate that the two parameters are likely driven by different factors.

The genetic diversity in RNA virus populations is considered to be controlled by interactions between host and viral factors (Schneider and Roossinck, 2001). RNA virus evolution therefore involves at least two steps: substitutions emerge via replication errors, and the host then selects between the original and substituted bases. During the second step, accumulation of the less adaptive variant might be repressed by the host–virus interaction. If the host–virus interaction favors the substituted base, it will accumulate more than the original base and eventually replace it. Otherwise, the substituted base will accumulate at a lower level or even be eliminated. The substitution ratio thus reveals the fate of each substitution under host selection, and sites with high substitution ratios may be essential for virus evolution in the corresponding host. In this study, sites with high substitution ratios in *O. sativa* and *N. benthamiana* were documented. Further work is necessary to confirm their function in virus evolution and host–virus interaction.

The observation that the average substitution ratio of non-coding regions did not change after RSV was transmitted from *O. sativa* to *N. benthamiana* indicates that the RSV non-coding sequences may not play an important role in RSV–host plant interactions. On the other hand, both synonymous and non-synonymous substitution ratios increased remarkably after RSV was transmitted from *O. sativa* to *N. benthamiana*. This observation suggests that both synonymous and non-synonymous substitutions are essential in RSV–host plant interactions. One possible explanation is that the non-synonymous substitutions alter features of RSV protein(s) so that they interact better with specific host factors during infection, e.g., during replication and movement in the plant, while synonymous substitutions alter the codon composition to match the host codon usage bias.

## Conclusion

We found that the RSV populations from the infected *N. benthamiana* plant are more diverse than those obtained from rice, and that high-throughput sequencing is a powerful technology for investigating the population genetic diversity of plant RNA viruses.

## Acknowledgments

This research was supported by the National High Technology Research and Development Program of China (2012AA101505), the earmarked fund for Modern Agro-industry Technology Research System (CARS-01-19), and the National Natural Science Foundation of China (31272015).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Huang, Li, Wu, Xu, Yang, Fan, Fang and Zhou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# A Genome-wide Combinatorial Strategy Dissects Complex Genetic Architecture of Seed Coat Color in Chickpea

#### Edited by:

Maria J. Monteros, The Samuel Roberts Noble Foundation, USA

#### Reviewed by:

Steven B. Cannon, United States Department of Agriculture - Agricultural Research Service, USA Jianping Wang, University of Florida, USA Bharadwaj Chellapilla, Indian Agricultural Research Institute, India

#### \*Correspondence:

Swarup K. Parida swarup@nipgr.ac.in; swarupdbt@gmail.com

† These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science

Received: 21 February 2015 Accepted: 26 October 2015 Published: 17 November 2015

#### Citation:

Bajaj D, Das S, Upadhyaya HD, Ranjan R, Badoni S, Kumar V, Tripathi S, Gowda CLL, Sharma S, Singh S, Tyagi AK and Parida SK (2015) A Genome-wide Combinatorial Strategy Dissects Complex Genetic Architecture of Seed Coat Color in Chickpea. Front. Plant Sci. 6:979. doi: 10.3389/fpls.2015.00979

Deepak Bajaj<sup>1†</sup>, Shouvik Das<sup>1†</sup>, Hari D. Upadhyaya<sup>2†</sup>, Rajeev Ranjan<sup>1</sup>, Saurabh Badoni<sup>1</sup>, Vinod Kumar<sup>3</sup>, Shailesh Tripathi<sup>4</sup>, C. L. Laxmipathi Gowda<sup>2</sup>, Shivali Sharma<sup>2</sup>, Sube Singh<sup>2</sup>, Akhilesh K. Tyagi<sup>1</sup> and Swarup K. Parida<sup>1</sup>\*

<sup>1</sup> National Institute of Plant Genome Research, New Delhi, India, <sup>2</sup> International Crops Research Institute for the Semi-Arid Tropics, Telangana, India, <sup>3</sup> National Research Centre on Plant Biotechnology, New Delhi, India, <sup>4</sup> Division of Genetics, Indian Agricultural Research Institute, New Delhi, India

The study identified 9045 high-quality SNPs using both genome-wide GBS- and candidate gene-based SNP genotyping assays in 172 chickpea accessions, comprising 93 cultivated (desi and kabuli) and 79 wild accessions. GWAS in a structured population of 93 sequenced accessions detected 15 major genomic loci exhibiting significant association with seed coat color. Five seed color-associated major genomic loci underlying robust QTLs mapped on a high-density intra-specific genetic linkage map were validated by QTL mapping. The integration of association and QTL mapping with gene haplotype-specific LD mapping and transcript profiling identified novel allelic variants (non-synonymous SNPs) and haplotypes in a MATE secondary transporter gene regulating light/yellow brown (LB/YB) and beige (BE) seed coat color differentiation in chickpea. Down-regulation of the BE seed coat color-associated MATE gene haplotype was correlated with reduced proanthocyanidin accumulation in the mature seed coats of BE compared with LB/YB seed-colored desi and kabuli accessions. This seed color-regulating MATE gene revealed strong purifying selection pressure primarily in LB/YB seed-colored desi and wild Cicer reticulatum accessions compared with the BE seed-colored kabuli accessions. The functionally relevant molecular tags identified have the potential to help decipher the complex transcriptional regulatory gene function of seed coat coloration and to clarify the selective sweep-based evolutionary pattern of the seed color trait in cultivated and wild accessions during chickpea domestication. The genome-wide integrated approach employed will expedite marker-assisted genetic enhancement for developing cultivars with desirable seed coat color types in chickpea.

Keywords: chickpea, GBS, GWAS, haplotype, QTL, seed coat color, SNP, wild accession

## INTRODUCTION

Chickpea (Cicer arietinum L.) is the third most produced food legume crop globally and serves as a vital human dietary source of protein enriched with essential amino acids (Kumar et al., 2011; Gaur et al., 2012; Varshney et al., 2013). Seeds are a defining characteristic of chickpea and have high economic value. The physical appearance of seeds, such as light color and large size, has always been a trait of consumer preference and trade value, besides being an important target quality component and adaptation trait (Cassells and Caddick, 2000; Graham et al., 2001; Hossain et al., 2010). The seed coat color is an important agronomic trait in chickpea that varies widely among core and mini-core germplasm collections, cultivated desi and kabuli types, landraces and wild species accessions (Upadhyaya and Ortiz, 2001; Upadhyaya et al., 2002, 2006). Considering the agronomic importance and wide phenotypic variation of seed coat color in chickpea, it is essential to identify the underlying heritable forces and potential genes/QTLs (quantitative trait loci) influencing this large trait variation in desi and kabuli chickpea. Based on defined agromorphological descriptors, the large-scale phenotyping of core, mini-core and reference core chickpea accessions has classified and characterized these accessions into 24 diverse classes of seed coat color types (like brown, light brown, green, yellow-brown and creamy) (Upadhyaya and Ortiz, 2001; Upadhyaya et al., 2002, 2008). This revealed the predominance of yellow/light brown and beige seed coat color in desi and kabuli chickpea.

Tremendous efforts have been made to study the genetic inheritance pattern of seed coat color and to identify and clone multiple genes harboring the major QTLs regulating this trait in diverse crop plants, including rice, soybean, sesame, Brassica and grape (Furukawa et al., 2006; Liu et al., 2006; Sweeney et al., 2006, 2007; Yang et al., 2010; Rahman et al., 2011; Huang et al., 2012; Zhang et al., 2013). The formation of different seed coat colors is governed by organ/tissue-specific accumulation of plant secondary metabolites like flavonoid compounds, including anthocyanin, flavonols, isoflavones, flavonol glucosides, phlobaphenes, leucoanthocyanidin, and proanthocyanidin in the seed coat (Furukawa et al., 2006). A complex biochemical pathway, the phenylpropanoid pathway (flavonoid and anthocyanin biosynthetic pathway), is known to be involved in synthesis of these secondary metabolites in the specific tissues of crop plants (Dixon et al., 2002). The advances in structural and functional genomics have greatly catalyzed isolation of most of the transcriptionally regulated loci/genes that encode enzymes of this important biosynthetic pathway governing seed coat color in rice, maize, Arabidopsis, Brassica, Medicago, common bean, pea, Antirrhinum and Petunia (Ludwig et al., 1989; Mol et al., 1998; Hu et al., 2000; Forkmann and Martens, 2001; Sakamoto et al., 2001; Winkel-Shirley, 2001; Saitoh et al., 2004; Xie and Dixon, 2005; Furukawa et al., 2006; Lepiniec et al., 2006; Quattrocchio et al., 2006; Zhao and Dixon, 2009; Golam Masum Akond et al., 2011; Li et al., 2012; Ferraro et al., 2014; Liu et al., 2014). However, the genetic analysis of the seed color trait in both desi and kabuli chickpea has mostly been restricted to estimation of heritability and molecular mapping of only a few seed coat color-associated QTLs because of its quantitative and polygenic nature of inheritance (epistatic and pleiotropic effects) (Hossain et al., 2010). Moreover, the markers associated with these QTLs controlling seed color are yet to be validated across diverse genetic backgrounds and environments for their deployment in marker-assisted breeding programs to develop improved chickpea cultivars with desirable seed coat color.

A combinatorial approach of SNP (single nucleotide polymorphism) and SSR (simple sequence repeat) marker-based genome-wide association study (GWAS), candidate gene-based association analysis and traditional bi-parental QTL mapping in conjunction with differential gene expression profiling and gene-based molecular haplotyping/LD (linkage disequilibrium) mapping is a widely recognized strategy for dissecting the complex yield and quality component quantitative traits in different crop plants, including chickpea (Konishi et al., 2006; Sweeney et al., 2007; Tian et al., 2009; Mao et al., 2010; Li et al., 2011; Kharabian-Masouleh et al., 2012; Parida et al., 2012; Kujur et al., 2013, 2014; Negrão et al., 2013; Saxena et al., 2014a; Zuo and Li, 2014; Bajaj et al., 2015). Natural variation and artificial selection in many cloned genes (e.g., red pericarp color-regulating Rc gene in rice) underlying the QTLs regulating seed coat color during domestication have established seed color as a common target trait for both domestication and artificial breeding in crop plants (Sweeney et al., 2007; Hossain et al., 2011; Meyer et al., 2012). Henceforth, the above-said integrated strategy can be deployed for identification of functionally relevant novel molecular tags (markers, genes, QTLs, and alleles) controlling seed coat color and unraveling the molecular genetic basis of such natural trait variation in various cultivated and wild accessions of chickpea adapted to diverse agro-climatic regions. The resultant findings would assist us to genetically select desi and kabuli accessions for diverse desirable seed coat color, driving genomics-assisted breeding and ultimately development of tailored chickpea cultivars with preferred seed color.

In view of the aforementioned prospects, our study integrated GWAS and candidate gene-based association analysis with traditional QTL mapping, differential gene expression profiling, gene haplotype-specific LD mapping and proanthocyanidin (PA) quantitation assay to identify novel potential genomic loci (genes/gene-associated targets) and alleles/haplotypes regulating seed coat color and understand their haplotype-based evolutionary/domestication patterns in cultivated and wild chickpea.

## MATERIALS AND METHODS

## Discovery, Genotyping and Annotation of Genome-wide SNPs

For large-scale discovery and high-throughput genotyping of SNPs at a genome-wide scale, 93 phenotypically and genotypically diverse desi (39 accessions) and kabuli (54) chickpea accessions (representing diverse eco-geographical regions of 21 countries of the world) were selected (Table S1) from the available chickpea germplasm collections (16991, including 211 minicore germplasm lines; Upadhyaya and Ortiz, 2001; Upadhyaya et al., 2001, 2008). In addition, 79 diverse Cicer accessions belonging to five annual wild species, namely C. reticulatum (14), C. echinospermum (8), C. judaicum (22), C. bijugum (19), and C. pinnatifidum (15), and one perennial species, C. microphyllum, were chosen for the haplotype-based evolutionary study (Table S1). The genomic DNA from the young leaf samples of 172 cultivated and wild chickpea accessions was isolated using the QIAGEN DNeasy 96 Plant Kit (QIAGEN, USA).

The genomic DNA of the accessions was digested with ApeKI and ligated to adapters carrying unique barcodes, and 2 × 96-plex GBS libraries were constructed. Each of the two 96-plex GBS libraries was pooled and sequenced (100-bp single end) individually using an Illumina two-lane flow cell of the HiSeq2000 NGS (next-generation sequencing) platform following Elshire et al. (2011) and Spindel et al. (2013). The unique barcodes attached to each of the 172 accessions were used as a reference for demultiplexing their high-quality sequence reads. The decoded high-quality FASTQ sequences were mapped to the reference kabuli draft chickpea genome (Varshney et al., 2013) employing Bowtie v2.1.0 (Langmead and Salzberg, 2012). The sequence alignment map files generated from the kabuli genome were analyzed using the reference-based GBS pipeline/genotyping approach of STACKS v1.0 (http://creskolab.uoregon.edu/stacks) to identify the high-quality SNPs (SNP base quality ≥20, supported by a minimum sequence read depth of 10) from 172 accessions (as per Kujur et al., 2015). The kabuli genome annotation (Varshney et al., 2013) was utilized as an anchor for structural and functional annotation of GBS-based SNPs in different coding (synonymous and non-synonymous SNPs) and non-coding sequence components of genes and genomes (chromosomes/pseudomolecules and scaffolds) of kabuli chickpea.
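
The quality thresholds quoted above can be illustrated with a minimal sketch. It is not the STACKS pipeline: the `SnpCall` record, its field names and the example values are hypothetical, and only the two cut-offs (base quality ≥20, read depth ≥10) come from the text.

```python
from dataclasses import dataclass

# Cut-offs quoted in the text; everything else in this sketch is hypothetical.
MIN_BASE_QUALITY = 20
MIN_READ_DEPTH = 10


@dataclass(frozen=True)
class SnpCall:
    """Hypothetical per-accession SNP call (not a STACKS data structure)."""
    snp_id: str
    accession: str
    base_quality: int
    read_depth: int


def passes_filter(call: SnpCall) -> bool:
    """Return True when a call meets both quality cut-offs."""
    return (call.base_quality >= MIN_BASE_QUALITY
            and call.read_depth >= MIN_READ_DEPTH)


def filter_calls(calls):
    """Keep only the calls that pass the quality filter, preserving order."""
    return [call for call in calls if passes_filter(call)]
```

A call with base quality 19 or read depth 9 would be dropped, whereas one sitting exactly on both thresholds (20 and 10) would be retained.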

## Mining and Genotyping of Candidate Gene-based SNPs

For large-scale mining and high-throughput genotyping of candidate gene-based SNPs in chickpea, a selected set of 28 cloned genes/QTLs known to be involved in regulation of seed pericarp color in related dicot species, Arabidopsis thaliana, Medicago, and soybean (Lepiniec et al., 2006; Zhao and Dixon, 2009; Yang et al., 2010), was acquired. This set includes a tandem array of three seed color candidate co-orthologous tt12-MATE family transporter repeated genes located at the selective sweep target region of kabuli chromosome 4 (reported previously by Varshney et al., 2013). The coding sequences (CDS) of these known/candidate seed color-regulating genes were BLAST searched against the CDS of kabuli genes to identify the best possible gene orthologs in chickpea. The CDS (functional domains) and 1000-bp upstream regulatory regions (URRs) of the identified true kabuli chickpea gene orthologs (E-value: 0 and bit score ≥500) were targeted to design (Batch Primer3, http://probes.pw.usda.gov/cgi-bin/batchprimer3/batchprimer3.cgi) the forward and reverse primers with an expected amplification product size of 500–700 bp. For mining the gene-based SNPs, 24 desi, kabuli, and wild chickpea accessions were first selected randomly from the 172 accessions (utilized for genome-wide GBS-based SNP genotyping) (Table S1). The gene-based primers were used for PCR amplification of the genomic DNA of all these 24 accessions, and the products were resolved in 1.5% agarose gel. The amplified PCR products were sequenced, the high-quality sequences were aligned, and SNPs were detected in diverse sequence components of orthologous chickpea genes among accessions following Saxena et al. (2014a). The gene-derived coding and URR-SNPs discovered were further genotyped in the genomic DNA of 172 cultivated and wild chickpea accessions employing Sequenom MALDI-TOF (matrix-assisted laser desorption ionization-time of flight) MassARRAY (http://www.sequenom.com) as per Saxena et al. (2014a). According to allele-specific mass differences between extension products, the genotyping information of different coding and URR-SNPs among accessions was recorded.

## Estimation of Polymorphism and Diversity Statistics

To measure the PIC (polymorphism information content) and genetic diversity coefficient (Nei's genetic distance; Nei et al., 1983), and to construct an unrooted neighbor-joining (NJ)-based phylogenetic tree (with 1000 bootstrap replicates) among chickpea accessions, the high-quality SNP genotyping data were analyzed with PowerMarker v3.51 (Liu and Muse, 2005), MEGA v5.0 (Tamura et al., 2011) and TASSEL v5.0 (http://www.maizegenetics.net). A 100-kb non-overlapping sliding window approach of TASSEL v5.0 was implemented to estimate the various nucleotide diversity measures (θπ, θω, and Tajima's D) following Xu et al. (2011) and Varshney et al. (2013).
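
As an illustration of the PIC statistic mentioned above, the sketch below implements the classical Botstein-type formula, PIC = 1 − Σ p<sub>i</sub><sup>2</sup> − Σ<sub>i&lt;j</sub> 2 p<sub>i</sub><sup>2</sup> p<sub>j</sub><sup>2</sup>. The function names are hypothetical, and whether PowerMarker handles heterozygous or missing calls in exactly this way is an assumption that should be checked against its documentation.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(alleles):
    """Allele frequencies from a flat list of observed alleles at one marker."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def pic(freqs):
    """Polymorphism information content from allele frequencies summing to 1."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in freqs)
    # Correction term summed over every unordered pair of distinct alleles.
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross
```

For a biallelic SNP with equal allele frequencies, `pic([0.5, 0.5])` evaluates to 0.375, and a monomorphic marker gives 0.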

## Determination of Population Genetic Structure and LD Patterns

The model-based program STRUCTURE v2.3.4 was utilized to determine the population genetic structure among chickpea accessions adopting the methods of Kujur et al. (2013, 2014) and Saxena et al. (2014a). The high-quality genome-wide and candidate gene-based genotyping data of SNPs physically mapped on eight kabuli chromosomes were analyzed using the command line interface of PLINK and the full-matrix approach of TASSEL v5.0 (Saxena et al., 2014a; Kumar et al., 2015) to evaluate the genome-wide LD patterns (r<sup>2</sup>, the frequency correlation between pairs of alleles across a pair of SNP loci) and LD decay (by plotting average r<sup>2</sup> against 50 and 20 kb uniform physical intervals across chromosomes) in chromosomes and population of chickpea.
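
The r<sup>2</sup> statistic and the binning used for an LD decay curve can be sketched as follows. This is a simplified illustration, not the PLINK or TASSEL implementation: it assumes fully homozygous (inbred) accessions so that each genotype can be coded as 0 or 1, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def r_squared(locus_a, locus_b):
    """r^2 between two biallelic loci coded 0/1 across the same accessions."""
    n = len(locus_a)
    if n == 0 or n != len(locus_b):
        raise ValueError("loci must be non-empty and of equal length")
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    p_ab = sum(a * b for a, b in zip(locus_a, locus_b)) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic locus")
    return d * d / denom


def mean_r2_by_bin(positions, loci, bin_size):
    """Average r^2 of all SNP pairs on one chromosome per physical-distance bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, j in combinations(range(len(positions)), 2):
        bin_index = abs(positions[j] - positions[i]) // bin_size
        sums[bin_index] += r_squared(loci[i], loci[j])
        counts[bin_index] += 1
    # Key each bin by its lower bound in bp.
    return {b * bin_size: sums[b] / counts[b] for b in sorted(sums)}
```

Calling `mean_r2_by_bin` with `bin_size=50_000` or `bin_size=20_000` corresponds to the 50 and 20 kb intervals mentioned in the text; plotting the returned averages against their bin lower bounds gives a decay curve.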

## Phenotyping for Seed Coat Color

One hundred seventy-two chickpea accessions were grown in the field [following randomized complete block design (RCBD)] during the crop growing season for two consecutive years (2011 and 2012) at two diverse geographical locations (New Delhi; latitude/longitude: 28.4°N/77.1°E and Hyderabad; 17.1°N/78.9°E) of India. The seeds of all accessions were phenotyped and classified/characterized into different seed color types by visual estimation of seed coat/pericarp color of mature seeds (stored no longer than 5 months) representing each accession with at least three replications (Upadhyaya and Ortiz, 2001; Upadhyaya et al., 2002, 2006; Hossain et al., 2011). Based on this large-scale replicated multi-location/years field phenotyping, the 172 chickpea accessions were broadly categorized/characterized in accordance with 24 predominant seed coat color types (BL: Black, B: Brown, LB: Light brown, DB: Dark brown, RB: Reddish brown, GB: Grayish brown, SB: Salmon brown, OB: Orange brown, GR: Gray, BB: Brown beige, Y: Yellow, LY: Light yellow, YB: Yellow brown, OY: Orange yellow, O: Orange, YE: Yellow beige, I: Ivory, G: Green, LG: Light green, BR: Brown reddish, M: Variegated, BM: Black brown mosaic, LO: Light orange and BE: Beige) (Figure S1). The frequency distribution and broad-sense heritability (H<sup>2</sup>) of seed coat color were measured using SPSS v17.0 (http://www.spss.com/statistics) as per Kujur et al. (2013, 2014) and Saxena et al. (2014a).

## Trait Association Mapping

The genome-wide and candidate gene-based SNP genotyping information of 93 cultivated desi and kabuli accessions was integrated with their seed coat color phenotyping, ancestry coefficient (Q matrix obtained from population structure) and relative kinship matrix (K) data using GLM (Q model)- and MLM (Q+K model)-based approaches of TASSEL v5.0 (Kujur et al., 2013, 2014; Kumar et al., 2015). Furthermore, the efficient mixed-model association (EMMA; P+K, K, and Q+K models) and P3D/compressed mixed linear model (CMLM) interfaces of GAPIT (Lipka et al., 2012), relying upon principal component analysis (PCA), were employed for GWAS. The relative distributions of observed and expected −log<sub>10</sub>(P) values for each SNP marker-trait association were compared individually utilizing the quantile-quantile plot of GAPIT. The P-value threshold of significance was corrected for multiple comparisons according to the false discovery rate (FDR cut-off ≤0.05) (Benjamini and Hochberg, 1995). By integrating all four model-based outputs of TASSEL and GAPIT, the potential SNP loci in the target genomic (gene) regions revealing significant association with seed coat color trait variation at the highest R<sup>2</sup> (degree of SNP marker-trait association) and lowest FDR-adjusted P-values (threshold P < 1 × 10<sup>−6</sup>) were identified.
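
The Benjamini and Hochberg (1995) correction cited above can be sketched as a step-up adjustment of raw P-values. This is a generic illustration with a hypothetical function name; it is not the code path used inside TASSEL or GAPIT.

```python
def benjamini_hochberg(p_values):
    """Return FDR-adjusted P-values in the same order as the input."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant(p_values, fdr=0.05):
    """Flag the tests whose adjusted P-value is at or below the FDR cut-off."""
    return [q <= fdr for q in benjamini_hochberg(p_values)]
```

With the SNP-level P-values from one association model as input, `significant(p_values, fdr=0.05)` marks the marker-trait associations that survive the FDR cut-off stated in the text.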

## QTL Mapping

For validation of seed color-associated genomic loci through QTL mapping, an intra-specific F<sub>7</sub> RIL mapping population of 190 individuals (ICC 12299 × ICC 8261) was developed. The chickpea accessions ICC 12299 (originating from Nepal) and ICC 8261 (originating from Turkey) are LB/YB and BE seed coat colored desi and kabuli landraces, respectively. A selected set of 384 SNPs showing polymorphism between the mapping parental accessions (ICC 12299 and ICC 8261) and 96 previously reported SSR markers (Winter et al., 1999, 2000; Hossain et al., 2011; Thudi et al., 2011) physically/genetically mapped on LGs (linkage groups)/chromosomes (considered as anchors) were genotyped using the MALDI-TOF mass array SNP genotyping assay and a fluorescent dye-labeled automated fragment analyser (Kujur et al., 2013, 2014; Saxena et al., 2014a,b; Bajaj et al., 2015). The genotyping data of parental polymorphic SNP and SSR markers mapped on LGs (chromosomes) of an intra-specific genetic map and the replicated multi-location seed coat color field phenotyping information (following the aforesaid phenotyping methods of 172 accessions) of 190 RIL mapping individuals and parental accessions were analyzed using the composite interval mapping (CIM) function of MapQTL v6.0 (Van Ooijen, 2009, as per Saxena et al., 2014a). The phenotypic variation explained (PVE) by QTLs (R<sup>2</sup> %) was measured at a significant LOD (logarithm of odds) threshold (>5.0) to identify and map the major genomic loci underlying robust seed color QTLs in chickpea.
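
To give a feel for how a LOD score relates to PVE, the sketch below uses the textbook relation for a single-QTL regression model, PVE = 1 − 10<sup>−2·LOD/n</sup>, where n is the number of individuals. This is an assumption made for illustration only: MapQTL's CIM fits cofactors and may report PVE differently, so these functions (whose names are hypothetical) are not a reimplementation of its output.

```python
import math


def pve_from_lod(lod, n_individuals):
    """Approximate proportion of variance explained for a single-QTL model."""
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return 1.0 - 10.0 ** (-2.0 * lod / n_individuals)


def lod_from_pve(pve, n_individuals):
    """Inverse relation: LOD score implied by a given PVE (0 <= pve < 1)."""
    if not 0.0 <= pve < 1.0:
        raise ValueError("pve must lie in [0, 1)")
    return -(n_individuals / 2.0) * math.log10(1.0 - pve)
```

Under this simplified relation, a QTL that just reaches LOD 5.0 in a population of 190 RILs would explain roughly 11% of the phenotypic variation.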

## Differential Expression Profiling

To infer differential gene regulatory function, the expression profiling of seed color-associated genes validated by both GWAS/candidate gene-based association analysis and QTL mapping was performed. For expression analysis, 11 contrasting chickpea accessions as well as parental accessions and two homozygous individuals of a RIL mapping population (ICC 12299 × ICC 8261) representing two predominant seed coat color types, LB/YB (desi: ICCX810800, desi: ICC 12299, desi: ICCV 10, desi: ICC 4958, desi: ICC 15061, kabuli: ICC 6204, and kabuli: Annigeri) and BE (kabuli: ICC 20268, kabuli: ICC 8261, kabuli: Phule G0515, and desi: ICC 4926), were selected (Table S1). The vegetative leaf tissues, seeds without seed coats, and seed coats scraped from the immature (15–25 days after podding/DAP) and mature (26–36 DAP) seeds (30–40 seeds per accession in replicates) of all these accessions and mapping parents/individuals were used for RNA isolation. The isolated RNA was amplified with the gene-specific primers using the semi-quantitative and quantitative RT-PCR assays as per Bajaj et al. (2015). Three independent biological replicates of each sample and two technical replicates of each biological replicate, with no-template and no-primer controls, were included in the quantitative RT-PCR assay. A house-keeping gene, elongation factor 1-alpha (EF1α), was utilized as an internal control in the RT-PCR assay for normalization of expression values across various tissues and developmental stages of accessions and mapping parents/individuals. The significant difference of gene expression in immature and mature seed coats as compared to vegetative leaf tissues and seeds without seed coats (considered as control) of each accession and mapping parent/individual was determined and compared among each other to construct a heat map using TIGR MultiExperiment Viewer (MeV, http://www.tm4.org/mev).
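
Normalization against an internal control such as EF1α is commonly done with the 2<sup>−ΔΔCt</sup> method. The text does not state which calculation was used, so the sketch below is only an assumed illustration; the function names and the Ct values in the usage note are hypothetical.

```python
from statistics import mean


def delta_ct(ct_target, ct_reference):
    """Ct of the target gene normalized to the internal control gene."""
    return ct_target - ct_reference


def fold_change(ct_target_sample, ct_ref_sample,
                ct_target_control, ct_ref_control):
    """Relative expression (sample vs. control tissue) by 2^-(ddCt)."""
    dd_ct = (delta_ct(ct_target_sample, ct_ref_sample)
             - delta_ct(ct_target_control, ct_ref_control))
    return 2.0 ** (-dd_ct)


def mean_ct(technical_replicates):
    """Average the Ct values of technical replicates before normalization."""
    return mean(technical_replicates)
```

For example, a target Ct of 24 in seed coat and 26 in leaf, with the reference gene at Ct 18 in both tissues, gives `fold_change(24, 18, 26, 18) == 4.0`, that is, four-fold higher expression in the seed coat relative to the control tissue.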

## Molecular Haplotyping

For gene-based marker haplotyping/LD mapping, the 2 kb URR, exons, introns and 1 kb DRR (downstream regulatory region) of one strong seed color-associated gene (validated by GWAS, QTL mapping, and differential expression profiling) amplified from 172 chickpea accessions, including 93 cultivated and 79 wild accessions (Table S1), were cloned and sequenced. For this, the gene-amplified PCR products were purified (QIAquick PCR Purification Kit, QIAGEN, USA), ligated with pGEM-T Easy Vector (Promega, USA) and transformed into competent DH5α E. coli strain by electroporation (BIO-RAD, USA) to screen blue/white colonies. The plasmids isolated from 20 random positive white clones carrying the desired inserts were sequenced in both forward and reverse directions twice on a capillary-based Automated DNA Sequencer (Applied Biosystems, ABI 3730xl DNA Analyzer, USA) using the BigDye Terminator v3.1 sequencing kit and M13 forward and reverse primers. The trace files were base called separately for all the 10 clones of individual accessions and primers, checked for quality using phred and then assembled into contigs using phrap. The high-quality sequences generated for the gene were aligned and compared among 172 chickpea accessions using the CLUSTALW multiple sequence alignment tool in MEGA 6.0 (Tamura et al., 2013) to discover the SNP and SSR allelic variants among these accessions. The SNP-SSR marker-based haplotypes were constituted in the gene to determine the haplotype-based LD patterns among accessions following Kujur et al. (2013, 2014) and Saxena et al. (2014a). To evaluate the trait-association potential and determine the haplotype-based evolutionary pattern of the gene, the marker-based gene haplotype-derived genotyping information was correlated with the seed color field phenotyping data of the accessions used. 
To assess the potential of seed color-regulatory haplotypes constituted in a gene, differential expression profiling in the seed coats dissected from mature seeds of aforesaid six contrasting chickpea accessions as well as mapping parents and two homozygous RIL individuals representing LB/YB (ICCX-810800, ICC 12299, and Annigeri)- and BE (ICC 20268, ICC 8261, and ICC 4926)-specific haplotype groups was performed using their respective gene haplotype-based primers (following aforementioned methods).
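
The idea of constituting marker-based haplotypes in a gene and relating them to seed color classes can be sketched as a simple tabulation. The data layout, function names and example alleles below are hypothetical; the sketch only groups accessions by their ordered marker alleles and counts seed color classes per haplotype, without any statistical test.

```python
from collections import Counter, defaultdict


def haplotype_of(alleles):
    """Join the ordered marker alleles of one accession into a haplotype label."""
    return "-".join(alleles)


def haplotypes_by_phenotype(genotypes, phenotypes):
    """Count seed color classes observed for each gene haplotype.

    genotypes:  accession -> tuple of alleles at the gene's markers (same order)
    phenotypes: accession -> seed coat color class (e.g. "LB/YB" or "BE")
    """
    table = defaultdict(Counter)
    for accession, alleles in genotypes.items():
        table[haplotype_of(alleles)][phenotypes[accession]] += 1
    return {hap: dict(counts) for hap, counts in table.items()}
```

A haplotype whose counts fall almost entirely in one seed color class would be a candidate trait-associated haplotype for follow-up, for example by expression profiling as described above.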

## Estimation of Proanthocyanidin (PA)

The seed coats (∼50 mg; three independent biological replicates per accession) were scraped from mature seeds of the aforementioned six contrasting chickpea accessions, mapping parents and two homozygous RIL individuals representing LB/YB and BE seed coat color haplotypes. These seed coat tissues were used for soluble (with 0.2% DMACA) and insoluble (with butanol-HCl) PA extraction and quantification following the methods of Pang et al. (2008) and Liu et al. (2014).

## RESULTS AND DISCUSSION

## Large-scale Discovery, High-throughput Genotyping and Annotation of Genome-wide GBS- and Candidate Gene-derived SNPs

The sequencing of 2 × 96-plex ApeKI GBS libraries generated an average of 200.7 million high-quality sequence reads that were evenly distributed (mean: 2.1 million reads per accession) across 172 chickpea accessions, including 93 cultivated and 79 wild accessions (Table S1). Notably, on average 87% of these sequence reads were mapped to unique physical positions on the kabuli reference genome. The sequencing data generated from 93 cultivated (desi and kabuli) and 79 wild chickpea accessions in this study have been submitted to the freely accessible NCBI Short Read Archive (SRA) database (www.ncbi.nlm.nih.gov/sra) with accession numbers SRX845396 (http://www.ncbi.nlm.nih.gov/gquery/?term=SRX845396) and SRX971856, respectively. In total, 8673 high-quality SNPs (with read depth ≥10, SNP base quality ≥20, <1% missing data and ∼2% heterozygosity in each accession) differentiating 172 accessions were discovered using the kabuli reference genome-based GBS assay (Table S2). This highlights the advantages of the GBS assay in rapid large-scale mining and high-throughput genotyping of accurate and high-quality SNPs simultaneously at a genome-wide scale in chickpea. The successful applications of the GBS assay, particularly in construction of high-density intra- and inter-specific genetic linkage maps, high-resolution GWAS and fine mapping/positional cloning of genes/QTLs governing important agronomic traits, have been demonstrated recently in diverse legume crops, including chickpea (Sonah et al., 2013, 2014; Deokar et al., 2014; Jaganathan et al., 2014; Liu et al., 2014). In this context, the GBS-based genome-wide SNPs identified by us have potential to be utilized for genomics-assisted breeding applications in chickpea.

A number of diverse known cloned genes governing the complex, transcriptionally regulated anthocyanin and flavonoid biosynthetic pathways that accumulate specific secondary metabolites (mainly anthocyanin and proanthocyanidin) in the seed coat for its coloration/pigmentation have been identified and characterized in Arabidopsis, soybean and Medicago (Focks et al., 1999; Debeaujon et al., 2001; Sagasser et al., 2002; Saitoh et al., 2004; Baxter et al., 2005; Xie and Dixon, 2005; Furukawa et al., 2006; Lepiniec et al., 2006; Liu et al., 2006, 2014; Quattrocchio et al., 2006; Sweeney et al., 2006, 2007; Zhao and Dixon, 2009; Erdmann et al., 2010; Yang et al., 2010; Golam Masum Akond et al., 2011; Rahman et al., 2011; Bowerman et al., 2012; Chen et al., 2012, 2014; Huang et al., 2012; Li et al., 2012; Routaboul et al., 2012; Nguyen et al., 2013; Pourcel et al., 2013; Wang et al., 2013; Zhang et al., 2013; Ichino et al., 2014; Pesch et al., 2014; Wroblewski et al., 2014; Xu et al., 2014). Recently, a tandem array of three seed color-regulating candidate co-orthologous tt12-MATE family transporter repeated genes of M. truncatula primarily affecting proanthocyanidin biosynthesis under strong purifying selection pressure has been identified at the selective sweep target region of kabuli chromosome 4 (Varshney et al., 2013). Moreover, the detailed genetic regulation of seed color in chickpea is poorly understood vis-a-vis other legumes. Therefore, the seed color-related known cloned and candidate gene orthologs of legumes (Medicago and soybean) and Arabidopsis were targeted in our study for genetic association mapping of seed color traits in chickpea. To assess the potential of the aforesaid known/candidate genes for seed color trait association in chickpea, genetic association mapping was performed by high-throughput genotyping of novel SNP allelic variants mined from diverse sequence components of these genes in 93 cultivated and 79 wild accessions. The PCR amplicon-based sequencing of 28 seed color-related known/candidate kabuli gene (CDS and 1 kb-URRs) orthologs in 24 selected chickpea accessions and further comparison of their high-quality gene amplicon sequences detected numerous coding and non-coding SNP allelic variants. The subsequent large-scale validation and high-throughput genotyping of these mined SNP alleles in 172 cultivated and wild accessions by MALDI-TOF mass array successfully identified 372 genic SNPs (233 coding non-synonymous/synonymous and 139 URR SNPs) with an average frequency of 2.82 SNPs/kb (varied from 1.58 to 3.65 SNPs/kb) (**Table 1**).

The genome-wide GBS- and candidate gene-based SNP genotyping overall identified 9045 SNPs showing polymorphism among 172 cultivated and wild chickpea accessions. Of these, 7488 and 1557 SNPs were physically mapped on the eight chromosomes and on scaffolds of the kabuli genome, respectively (**Figure 1A**). A maximum of 1628 SNPs (21.7%) were mapped on kabuli chromosome 4, whereas a minimum of 321 SNPs (4.3%) were mapped on chromosome 8. The structural annotation of 9045 SNPs identified 5903 (65.3%) and 3142 (34.7%) SNPs in the 1859 genes and intergenic regions of the kabuli genome, respectively (**Figure 1B**). The gene-based SNPs were most abundant in the CDS (2385 SNPs, 40.4%), followed by introns (1576, 26.7%), and least abundant (928, 15.7%) in the URRs. In total, 1205 (50.5%) and 1180 (49.5%) coding SNPs in the 756 and 743 genes exhibited synonymous and non-synonymous substitutions, respectively (**Figure 1B**). The relative distributions of GBS-based SNPs, including non-synonymous SNPs physically mapped on eight kabuli chromosomes, showing polymorphism among 93 cultivated (desi and kabuli) and 79 wild accessions are depicted individually in a Circos circular ideogram (**Figure 1C**). The functional annotation of 1859 SNP-carrying genes exhibited their maximum correspondence to growth, development and metabolism-related proteins (43%), followed by transcription factors (22%) and signal transduction proteins (11%). The structurally and functionally annotated genome-wide and gene-derived SNPs physically mapped on kabuli chromosomes and the novel SNP allelic variants mined from diverse seed color-regulating known/cloned and candidate genes can be utilized for various large-scale genotyping applications, particularly in complex seed color quantitative trait dissection of chickpea.

## Polymorphism and Molecular Diversity Potential of SNPs

The identified 9045 genome-wide GBS- and candidate gene-based SNPs revealed polymorphism among 172 cultivated and wild chickpea accessions with a higher PIC (0.13–0.41 with a mean of 0.32) and nucleotide diversity (θπ: 2.14, θω: 2.19, and Tajima's D: −1.59) potential (Table S3). The intra-specific polymorphic and nucleotide diversity potential detected by SNPs within wild and cultivated (desi and kabuli) chickpea species was lower than the inter-specific polymorphism and nucleotide diversity among species (Table S3).

The molecular diversity potential detected by 9045 SNPs among 172 desi, kabuli and wild chickpea accessions exhibited a broad range of genetic distances varying from 0.15 to 0.83 with a mean of 0.52. A lower genetic diversity level (mean genetic distance: 0.32) was observed among 93 desi and kabuli chickpea accessions than among 79 wild chickpea accessions (0.47). The unrooted neighbor-joining phylogenetic tree constructed using 9045 genome-wide and candidate gene-based SNPs depicted a distinct differentiation among 172 cultivated and wild chickpea accessions and further clustered these accessions into two major groups (cultivated and wild) as expected (Figure S2). The intra- and inter-specific polymorphism and nucleotide diversity potential as well as the broader allelic (functional) diversity detected by 9045 SNPs among 172 cultivated (desi and kabuli) and wild chickpea accessions are much higher than those of earlier similar documentation (Nayak et al., 2010; Gujaria et al., 2011; Roorkiwal et al., 2013; Varshney et al., 2013). This suggests the utility of our developed genome-wide and gene-based SNP markers for faster screening of preferable diverse accessions/inter-specific hybrids in cross (introgression) breeding programs for chickpea varietal improvement.

## Association Analysis for Seed Coat Color Trait

For GWAS and candidate gene-based association mapping, 8837 SNPs showing polymorphism among 93 desi and kabuli chickpea accessions based on genome-wide GBS- and candidate gene-based SNP genotyping were utilized. The neighbor-joining phylogenetic tree, high-resolution population genetic structure and PCA differentiated all 93 chickpea accessions from each other and clustered them into two distinct populations, POP I and POP II, representing mostly the kabuli and desi chickpea accessions with BE and LB/YB seed coat color, respectively (**Figures 2A–C**). The implication of chromosomal LD patterns, including LD decay, in GWAS to identify potential genomic loci associated with diverse agronomic traits has been well demonstrated in many crop plants, including chickpea (Zhao et al., 2011; Thudi et al., 2014). The determination of LD patterns in a population of 93 accessions using 7488 SNPs (physically mapped across eight kabuli chromosomes) demonstrated a higher LD estimate (r<sup>2</sup>: 0.68) and extended LD decay (r<sup>2</sup> decreased to half of its maximum value) at approximately 350–400 kb physical distance in kabuli chromosomes (**Figure 2D**). These extensive LD estimates and the extended LD decay in a domesticated self-pollinating crop plant like chickpea, in contrast to other domesticated self-pollinated crops like rice and soybean (Hyten et al., 2007; Mather et al., 2007; McNally et al., 2009; Huang et al., 2010; Lam et al., 2010; Zhao et al., 2011), are quite expected. This could be due to the extensive contribution of four sequential bottlenecks during the domestication of chickpea (Lev-Yadun et al., 2000; Abbo et al., 2003; Berger et al., 2005; Burger et al., 2008; Toker, 2009; Jain et al., 2013; Kujur et al., 2013; Varshney et al., 2013; Saxena et al., 2014b), leading to a greater reduction of genetic diversity in cultivated chickpea than in other domesticated selfing plant species. 
The longer LD in chickpea chromosomes requires fewer markers to saturate the genome resulting in low resolution and thus indicating low significant association between the genetic markers and genes controlling the phenotypes in chickpea. In the present context due to insufficient and non-uniform marker coverage on a larger genome of chickpea with low intra-specific polymorphism, the GWAS integrated with candidate gene-based association analysis will be of much relevance in efficient quantitative dissection of complex traits in chickpea (Neale and Savolainen, 2004). This strategy of trait association mapping has been successfully implemented in a domesticated as well as larger genome selfpollinated crop species like barley, where extended LD decay was observed at genome level (Haseneyer et al., 2010; Varshney et al., 2012). Henceforth, the combinatorial approach of GWAS



TABLE 1 | Twenty-eight seed color-related known/candidate gene orthologs of chickpea selected for genetic association mapping.

and candidate gene-based association mapping would strengthen the possibility of identifying numerous high-resolution, valid trait-associated genomic loci at a genome-wide scale in chickpea.

represent the distribution of SNPs mined from 93 cultivated (desi and kabuli) and 79 wild chickpea accessions, respectively.

Keeping that in view, GWAS and candidate gene-based association analysis were performed by integrating the genotyping data of 8837 SNPs in 93 accessions with their replicated multilocation/years seed coat color field phenotyping information. A normal frequency distribution with broad seed color phenotypic variation (Figure S3A) and 75% broad-sense heritability (H<sup>2</sup>) was evident among the 93 accessions. The accessions exhibiting consistent phenotypic expression of seed coat color across two geographical locations/years (supported by high H<sup>2</sup>) were utilized for subsequent SNP marker-trait association. The integration of outcomes from four model-based approaches [TASSEL (GLM and MLM) and GAPIT (EMMA and CMLM)] with a quantile-quantile plot of the expected and observed −log<sub>10</sub>(P)-values at an FDR cut-off ≤0.05 in GWAS and candidate gene-based association analysis identified 15 genomic loci

showing significant association with seed coat color at P ≤ 10<sup>−6</sup> (**Figures 3A,B**, **Table 2**). These include eight genome-wide GBS-based and seven candidate gene-derived SNPs. Fourteen seed color-associated genomic loci were physically mapped on seven kabuli chromosomes. The remaining SNP locus was located in a scaffold region of the kabuli genome. One of the 15 seed color-associated genomic SNP loci was derived from an intergenic region, whereas the remaining 14 SNP loci were located in diverse coding (seven, including six non-synonymous SNPs) as well as non-coding intronic (two SNPs) and URR (five) sequence components of 13 kabuli genes (**Table 2**).
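As an illustration of how an FDR cut-off and a quantile-quantile comparison operate on a set of association P-values, the following minimal sketch applies the Benjamini-Hochberg step-up procedure and computes expected −log<sub>10</sub>(P) quantiles under the null. Both choices are assumptions made for illustration: the text does not state which FDR procedure or plotting convention TASSEL/GAPIT applied here, and the function names are hypothetical.

```python
from math import log10


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag P-values that pass the Benjamini-Hochberg FDR threshold (assumed procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Largest rank k such that p_(k) <= (k / m) * alpha
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


def qq_points(p_values):
    """Pairs of (expected, observed) -log10(P) using the i / (m + 1) plotting position."""
    m = len(p_values)
    observed = sorted(p_values)
    return [(-log10(i / (m + 1)), -log10(p)) for i, p in enumerate(observed, start=1)]
```

Points lying well above the diagonal of the resulting Q-Q plot indicate markers with stronger association than expected by chance.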

The 15 significant SNPs each explained (R<sup>2</sup>) on average 28% (range 20–43%) of the seed coat color phenotypic variation in the population of 93 chickpea accessions. The total phenotypic variation for seed coat color explained by the 15 SNPs was 61%. The significant association of multiple SNP loci in more than one genomic region (genes) with seed coat color suggests a complex genetic inheritance pattern/regulation of this target quantitative trait in chickpea. This association was evident from

FIGURE 2 | (A) Unrooted phylogenetic tree (Nei's genetic distance), (B) population genetic structure (optimal population number K = 2, shown in two different colors) and (C) principal component analysis (PCA) using 8837 genome-wide GBS- and candidate gene-based SNPs assigned the 93 kabuli and desi chickpea accessions with BE (beige) and LB/YB (light/yellow brown) seed coat color mostly into two major populations, POP I and POP II, respectively. In the population structure, the accessions represented by vertical bars along the horizontal axis were classified into K color segments based on their estimated membership fraction in each K cluster. In PCA, PC1 and PC2 explained 7.1 and 17.4% of the total variance, respectively. (D) LD decay (mean r<sup>2</sup>) measured in a population of 93 cultivated desi and kabuli chickpea accessions using 7488 SNPs physically mapped on eight kabuli chromosomes. The plotted curve denotes the mean r<sup>2</sup>-values among SNP loci spaced at uniform 50 kb physical intervals from 0 to 1000 kb across chromosomes. The uppermost plotted line indicates the mean r<sup>2</sup>-values among SNPs spaced at uniform 20 kb physical intervals from 0 to 200 kb on chromosomes.

high chromosomal and population-specific LD estimates with preferable LD decay of trait-associated linked/unlinked multiple SNP loci and further by the observed genetic heterogeneity of seed color traits across two populations. We observed no significant differences in the association potential of SNP loci with seed color among populations, despite the diverse genetic architecture of seed coat color traits in the two populations and across the entire population. Therefore, the seed coat color-associated genomic loci (genes) identified here by both GWAS and candidate gene-based association mapping could have potential for establishing rapid marker-trait linkages and identifying genes/QTLs regulating seed color in chickpea. Seven SNP loci (three non-synonymous coding and four URR SNPs) in six different known cloned and candidate seed color genes (tt12, tt4, tt3, tt19, ban-ANR, and tt18) had significant association (P: 1.3 × 10<sup>−6</sup> to 1.5 × 10<sup>−8</sup> with R<sup>2</sup>: 20–35%) with seed coat color trait in chickpea (**Table 2**). Interestingly, non-synonymous and regulatory SNPs identified in different dormancy and stress-related known/candidate genes (encoding inositol monophosphate and disease resistance proteins) were also significantly associated with seed coat color in chickpea. This is consistent with previous studies that reported strong correlation/interactions among genes/QTLs regulating three important agronomic traits, namely seed coat pigmentation, seed dormancy/germination ability and stress tolerance in legumes (Harborne and Williams, 2000; Furukawa et al., 2006; Lepiniec et al., 2006; Kovinich et al., 2012; Saxena et al., 2013; Smýkal et al., 2014).
The transcriptional regulation of numerous genes involved in anthocyanin and flavonoid biosynthetic pathways leads to accumulation of diverse major secondary metabolites in specific plant organs, which are known to play an important role in the collective alteration of seed coat pigmentation/coloration, dormancy and defense (stress) response in crop plants

between expected and observed −log<sub>10</sub>(P)-values with FDR cut-off <0.05 to detect significant genomic loci (genes) associated with seed color trait in chickpea.

(Gore et al., 2002; Benitez et al., 2004; Senda et al., 2004; Isemura et al., 2007; Caldas and Blair, 2009; Diaz et al., 2010; Kongjaimun et al., 2012). Notably, one non-synonymous SNP locus (T/G) in the CDS of the MATE (multidrug and toxic compound extrusion) secondary transporter gene (Ca18123, mapped on kabuli chromosome 2), which exhibited a stronger association (P: 1.7 × 10<sup>−9</sup> with R<sup>2</sup>: 43%) with seed color than the other 14 seed color-associated SNPs, was selected as a potential candidate for further analysis. The discovery of non-synonymous and regulatory SNP loci, particularly in the coding regions and URRs of known/candidate genes associated with seed color traits, signifies their functional relevance for rapid identification and characterization of seed color trait-regulatory genes in chickpea. Non-synonymous (amino acid-substituting) and regulatory SNPs in genes are known to alter their transcriptional regulatory mechanisms for controlling diverse traits of agronomic importance, including seed/grain size and weight in rice (Fan et al., 2006; Mao et al., 2010; Li et al., 2011; Zhang et al., 2012) and chickpea (Kujur et al., 2013, 2014; Saxena et al., 2014b; Bajaj et al., 2015).

Two candidate seed color-associated MATE secondary transporter genes [including one (Ca05557) reported earlier in the selective sweep region of kabuli chromosome 4 (Varshney et al., 2013)], identified by both candidate gene-based association mapping and GWAS, were selected (Table S4) to compare nucleotide diversity measures as well as the direction and magnitude of natural selection acting on these genes. These genes revealed a similar trend of nucleotide diversity (θπ and θω) and Tajima's D-based selection pressure in LB/YB and BE seed colored desi and kabuli chickpea accessions, respectively (Table S4). Interestingly, the degree of nucleotide diversity was 61% lower in the LB/YB seed colored cultivated desi accessions than in the kabuli chickpea accessions with BE seed color. An extreme reduction of nucleotide diversity (θπ: 0.73–0.79 and θω: 0.77–0.82) and strong purifying selection with reduced Tajima's D (−2.89 to −2.63) were evident specifically in LB/YB seed colored cultivated desi chickpea accessions compared with BE seed colored kabuli accessions (θπ: 1.93–1.96, θω: 1.95–1.98 and D: 1.82–1.90) (Table S4). These findings provide clear evidence for a signature of strong purifying selection in LB/YB seed colored desi accessions in contrast to BE seed colored kabuli chickpea accessions, which is in line with enduring purifying selection for seed coat/pericarp color traits in crop plants,



\*CWSNP (cultivated wild SNP) and gSNP (gene-derived SNPs).

CDS, coding sequence; NSyn, non-synonymous; Syn, synonymous; URR, upstream regulatory region.

<sup>a</sup>Details regarding SNPs are mentioned in the Table S2.

gSNP: gene-derived SNPs associated with seed coat color trait identified by candidate gene-based association mapping.

including chickpea (Sweeney et al., 2007; Varshney et al., 2013). Therefore, seed color in chickpea vis-à-vis other crop plants can be considered as a target trait for both domestication and artificial breeding (Sweeney et al., 2007; Hossain et al., 2011; Meyer et al., 2012).
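For readers unfamiliar with the diversity statistics compared above, the sketch below computes the mean number of pairwise differences (π), Watterson's estimator (S/a<sub>1</sub>) and Tajima's D from a small alignment, following the standard Tajima (1989) formulas. It is a toy illustration only: the sequences are hypothetical, the values are per locus rather than per site, and the scaling of the θπ and θω values reported in Table S4 is not specified in the text.

```python
from itertools import combinations
from math import sqrt


def segregating_sites(seqs):
    """Number of alignment columns with more than one allele."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all sequence pairs (pi, per locus)."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences; None when undefined."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4 or s == 0:
        return None  # the variance term vanishes for n < 4; no signal when S = 0
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None
    return (mean_pairwise_differences(seqs) - s / a1) / sqrt(variance)


# Hypothetical alignments: rare variants only vs. intermediate-frequency variants.
rare_variants = ["TAAA", "ATAA", "AATA", "AAAA"]
balanced_variants = ["AAA", "AAA", "TTT", "TTT"]
```

An excess of rare variants gives a negative D, as reported for the desi accessions, whereas intermediate-frequency variants give a positive D, as reported for the kabuli accessions.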

## Validation of Seed Color-associated Genomic Loci through QTL Mapping

We constructed a high-density intra-specific genetic linkage map (ICC 12299 × ICC 8261) by integrating 415 markers, including 382 SNPs and 33 previously reported parental polymorphic SSR markers, across eight chickpea LGs (LG1 to LG8). This genetic map covered a total map length of 1065.7 cM, with a mean inter-marker distance of 2.57 cM (Table S5). The most and least saturated LGs were LG1 (average inter-marker distance: 1.92 cM) and LG5 (4.0 cM), respectively. The mean map density (average inter-marker distance: 2.57 cM) observed in our intra-specific genetic linkage map was higher than or comparable with earlier reports (Cho et al., 2002, 2004; Cobos et al., 2005; Radhika et al., 2007; Kottapalli et al., 2009; Gaur et al., 2011; Kujur et al., 2013, 2014; Sabbavarapu et al., 2013; Stephens et al., 2014; Varshney et al., 2014). Therefore, this high-density intra-specific genetic linkage map has the potential to be utilized as a reference for rapid targeted mapping of genes/QTLs governing diverse agronomic traits, including seed coat color in chickpea. A bi-directional transgressive segregation-based normal frequency distribution (Figure S3B), including significant variation in seed coat color trait along with 78% H<sup>2</sup>, was observed in this mapping population (ICC 12299 × ICC 8261). Two principal seed coat color (BE and LB/YB) groupings were evident in the 190 RIL mapping individuals and parental accessions; the primary (PC1) and secondary (PC2) components accounted for 52 and 48% of the total seed coat color variance, respectively (Figure S3C). This continuous color distribution indicated a quantitative genetic inheritance pattern of seed coat color trait in the mapping population, and thus the developed bi-parental mapping population could be utilized for complex seed color trait dissection through QTL mapping in chickpea.
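The reported map density can be checked with simple arithmetic. The sketch below assumes that the mean inter-marker distance was obtained by dividing the total map length by the number of mapped markers, which reproduces the stated 2.57 cM; dividing by the number of marker intervals instead (markers minus linkage groups) would give a slightly larger value. The function names are hypothetical.

```python
def mean_distance_per_marker(total_cm, n_markers):
    """Total map length divided by marker count (convention assumed to match the text)."""
    return total_cm / n_markers


def mean_distance_per_interval(total_cm, n_markers, n_linkage_groups):
    """Alternative convention: each linkage group with k markers has k - 1 intervals."""
    return total_cm / (n_markers - n_linkage_groups)


per_marker = mean_distance_per_marker(1065.7, 415)
per_interval = mean_distance_per_interval(1065.7, 415, 8)
```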

QTL mapping was performed by correlating the genotyping data of 415 mapped SNP and SSR markers with the multilocation/years seed coat color field phenotyping data of the RIL mapping individuals and parental accessions. This analysis identified five major (LOD: 5.6–11.5) genomic regions harboring robust QTLs (validated across two locations/years) controlling seed coat color traits, which were mapped on five kabuli chromosomes (Table S6). The phenotypic variation explained (PVE) for seed color by individual QTLs (R<sup>2</sup>) varied from 20.1 to 38.7%. The PVE estimated for all five major QTLs together was 39.4%. All these QTLs revealed additive gene effects (ranging from −3.9 to −2.5), inferring the effective contribution of ICC 8261 alleles at these loci to the BE seed coat color trait. The five SNPs in the genes showing tight linkage with the five major seed color-associated robust QTLs (CaqSC2.1, CaqSC4.1, CaqSC4.2, CaqSC6.1, and CaqSC7.1) had high seed color trait association potential based on our GWAS and candidate gene-based association analysis (**Table 2;** Table S6). Remarkably, the strong association potential of one non-synonymous SNP (T/G) identified in the CDS of the MATE secondary transporter gene, compared with other genomic loci associated with seed coat color trait, was ascertained both by QTL mapping (R<sup>2</sup>: 35.6–38.7%) and by genetic association analysis (R<sup>2</sup>: 43%) (**Table 2**). Therefore, we selected the MATE secondary transporter gene as a target candidate for seed coat color trait regulation and validated it further through differential expression profiling and molecular haplotyping in chickpea.

To determine the potential and novelty of the 15 seed coat color-regulating genomic loci and five major QTLs detected by association and QTL mapping, respectively, the markers linked to/flanking known seed color-associated QTLs/genes (reported earlier in QTL mapping studies, Hossain et al., 2011) were selected for validation in the 93 diverse seed color-specific desi and kabuli chickpea accessions and the mapping population under study. These comparative analyses revealed correspondence of one known QTL (CaqSC2.1) regulating seed coat color between past (SSR marker interval: TA194-TR19) and our present studies based on congruent flanking [CWSNP 1273 (25.2 cM)-CWSNP1277 (30.2 cM)]/linked marker (CWSNP 1275) physical/genetic positions on chromosome (LG) 2. Hence, the 14 genomic loci and 4 QTLs regulating seed color identified here by high-resolution GWAS/candidate gene-based association and QTL mapping are novel and exhibited a population-specific genetic inheritance pattern for seed color trait regulation in chickpea. Further validation of these functionally relevant molecular tags is required in diverse genetic backgrounds and/or through fine mapping/map-based positional cloning prior to their utilization in marker-assisted genetic improvement of chickpea for desirable seed color.

## Validation of Seed Color-associated Genes through Differential Expression Profiling

To determine the regulatory pattern of genes for seed coat color, thirteen genes, including five seed color-associated genes validated by both association analysis and QTL mapping, were utilized for differential expression profiling. The RNA isolated from vegetative leaf tissues, seeds without seed coats and seed coats scraped from the immature and mature seeds of 11 contrasting chickpea accessions as well as mapping parents and two homozygous RIL individuals representing two predominant seed coat color types, LB/YB (desi: ICCX810800, desi: ICC 12299, desi: ICCV10, desi: ICC 4958, desi: ICC 15061, kabuli: ICC 6204, and kabuli: Annigeri) and BE (kabuli: ICC 20268, kabuli: ICC 8261, kabuli: Phule G0515, and desi: ICC 4926), was amplified with these gene-specific primers using semi-quantitative and quantitative RT-PCR assays (Table S7). Five genes, including four known cloned seed color genes (tt4-chalcone synthase, tt3-dihydroflavonol-4-reductase, ban-ANR-anthocyanidin reductase, and tt19-glutathione-S-transferase), showed seed coat-specific differential (>2.5-fold up- and/or down-regulation, P ≤ 0.01) expression [compared with leaves and seeds (without seed coats)] specifically in two seed developmental stages of LB/YB and BE seed colored desi and kabuli accessions (Figure S4). This suggests that differential regulation and accumulation of transcripts encoded by these four known genes might be essential for the synthesis of proanthocyanidin and anthocyanin biosynthetic enzymes in chickpea seed coats for their coloration/pigmentation (Lepiniec et al., 2006; Zhao and Dixon, 2009; Zhao et al., 2010).

Remarkably, a seed coat-specific MATE secondary transporter gene (validated by both association and QTL mapping) showing strong association with seed coat color trait exhibited pronounced differential down-regulation (∼5–7-fold, P ≤ 0.001) in the mature/immature seed coats of BE seed colored chickpea accessions (ICC 20268, ICC 8261, Phule G0515, and ICC 4926) compared with the mature/immature seed coats of LB/YB seed colored accessions (ICCX810800, ICC 12299, ICCV10, ICC 4958, ICC 15061, ICC 6204, and Annigeri) (Figure S4). An even stronger down-regulation (∼13-fold, P ≤ 0.001) of this seed coat-specific gene was observed in the mature seed coats of all four BE seed colored chickpea accessions relative to the immature seed coats of all seven LB/YB seed colored accessions. In addition, differential down-regulation (∼3-fold, P ≤ 0.001) of the MATE gene in mature relative to immature seed coats was evident within each of the LB/YB and BE seed coat color groups (Figure S4). The reduction of chickpea MATE gene transcript level specifically in the mature seed coat parallels the differential expression pattern observed previously for the TT12 and MATE1 genes of Arabidopsis and Medicago, respectively (Marinova et al., 2007; Pang et al., 2008; Zhao and Dixon, 2009; Zhao et al., 2010). Overall, these outcomes imply the functional significance of the MATE gene in possible transcriptional regulation of seed color trait in chickpea.
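Fold changes of this kind are commonly derived from quantitative RT-PCR cycle threshold (Ct) values. The sketch below uses the widely applied 2<sup>−ΔΔCt</sup> calculation as an assumed example; this passage does not state which quantification method was used, and the Ct values shown are hypothetical.

```python
def relative_expression(ct_target_test, ct_ref_test, ct_target_control, ct_ref_control):
    """Relative expression of a target gene by the 2^(-ddCt) method (assumed here)."""
    delta_test = ct_target_test - ct_ref_test          # normalize test sample to reference gene
    delta_control = ct_target_control - ct_ref_control  # normalize control sample likewise
    return 2 ** -(delta_test - delta_control)


# Hypothetical Ct values: target gene in a BE seed coat (test) vs. an LB/YB seed coat (control).
ratio = relative_expression(27.0, 20.0, 24.5, 20.0)
fold_down = 1 / ratio  # values above 1 mean lower expression in the test sample
```

In this toy case the target gene needs 2.5 more normalized cycles in the test sample, which corresponds to a down-regulation of between 5- and 6-fold.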

## Molecular Haplotyping of a Strong Seed Color-regulating Gene

For molecular haplotyping of a strong seed color-associated gene (validated by association analysis, QTL mapping and expression profiling), the 6843 bp cloned amplicon covering the entire 2 kb URR, 7 exons, 6 introns and 1 kb DRR of the MATE secondary transporter kabuli gene (Ca18123) was sequenced and compared among 93 cultivated and 79 wild chickpea accessions. These analyses identified four SNPs, including one URR and two coding SNPs, as well as one coding SSR in the MATE gene (**Figure 4A**, Table S8). These include one missense (non-synonymous) coding SNP (T/G) causing a valine (GTG) to glycine (GGG) amino acid substitution. The gene-based haplotype analysis, integrating the genotyping data of four SNPs and one SSR among 172 accessions, revealed four haplotypes (**Figure 4B**). The haplotype-specific LD mapping and association analysis using the four marker-based haplotypes identified in the MATE gene inferred strong association potential (P: 2.3 × 10<sup>−11</sup> with R<sup>2</sup>: 45%) of the gene with seed coat color trait. Moreover, we observed a significantly higher degree of LD (r<sup>2</sup> > 0.85 with P < 1.1 × 10<sup>−6</sup>) across the entire 6843 bp sequenced region of this strong seed color-associated gene (**Figure 4C**). Remarkably, two specific haplotypes, haplotype I [C-(GTTG)3-T-T-A] and haplotype II [C-(GTTG)3-G-T-A], affected by one functional

non-synonymous coding SNP (T/G) in the MATE gene had strong association potential for BE and YB/LB seed coat color differentiation, respectively (**Figure 4B**). Compared with the other two haplotypes (III and IV), haplotypes I and II were represented most commonly in BE seed colored kabuli (68.5%) and YB/LB seed colored desi (66.7%) accessions, respectively. A stronger down-regulated expression (∼5-fold) of the BE seed color-specific gene haplotype I was observed in the mature seed developmental stage of three contrasting BE seed colored chickpea accessions (ICC 20268, ICC 8261, and ICC 4926) as well as mapping parents and homozygous RIL individuals compared with three YB/LB seed colored accessions (ICCX 810800, ICC 12299, and Annigeri), mapping parents and homozygous mapping individuals (**Figure 4D**). To correlate the differential expression/regulatory pattern of the MATE gene-encoding transcript with proanthocyanidin (PA) synthesis/accumulation (Lepiniec et al., 2006; Marinova et al., 2007; Zhao and Dixon, 2009; Zhao et al., 2010), we estimated the soluble and insoluble PA content (three replicates) in the mature seeds of the aforementioned six contrasting accessions as well as mapping parents and homozygous RIL individuals representing the BE- and YB/LB-specific haplotype groups I and II, respectively. This revealed a significant reduction of both soluble and insoluble PA contents in the mature seed coats of BE seed colored accessions (soluble and insoluble PA: 2.1 and 1.6 nmol/mg seeds, respectively) relative to LB/YB seed colored accessions (6.7 and 5.9 nmol/mg seeds) (**Figure 4E**). Collectively, a strong association potential of the MATE gene with seed color trait was established by integrating GWAS and candidate gene-based association analysis with QTL mapping, differential expression profiling, SNP-SSR marker-based high-resolution gene-specific LD mapping/haplotyping, haplotype-specific transcript profiling and PA estimation (soluble and insoluble) assay.
These analyses identified a novel natural allelic variant (T/G) causing a non-synonymous amino acid substitution (valine to glycine) and potential haplotypes [C-(GTTG)3-T-T-A] and [C-(GTTG)3-G-T-A] in a MATE gene regulating BE and YB/LB seed coat color differentiation in kabuli and desi chickpea accessions, respectively. The value of such a combinatorial approach for rapid delineation of functionally relevant genes/QTLs controlling diverse agronomic traits, including starch biosynthesis, seed pericarp coloration/pigmentation, seed shattering and seed setting rate in rice as well as seed size/weight and pod number in chickpea, has been well established (Konishi et al., 2006; Sweeney et al., 2007; Kujur et al., 2013, 2014; Li et al., 2013; Saxena et al., 2014a; Bajaj et al., 2015). Therefore, the natural allelic variants and haplotypes identified in a strong seed color-associated MATE gene by an integrated genomic approach have the potential to help decipher its complex transcriptional regulatory function in seed coat coloration during seed development and subsequently to support marker-assisted genetic enhancement of chickpea for preferred seed color.
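The way marker genotypes are combined into gene-based haplotypes and tallied per accession group can be sketched as follows. The haplotype strings follow the notation used above, with alleles at the five markers joined in order, but the accession records and the helper names are hypothetical and serve only to show the bookkeeping, not the study's actual data or software.

```python
from collections import Counter, defaultdict


def build_haplotype(alleles):
    """Join the ordered alleles at the gene's markers into one haplotype string."""
    return "-".join(alleles)


def haplotype_frequencies(records):
    """Frequency of each haplotype within each group of accessions.

    records: iterable of (group_name, tuple_of_alleles) pairs.
    """
    counts = defaultdict(Counter)
    for group, alleles in records:
        counts[group][build_haplotype(alleles)] += 1
    return {
        group: {hap: n / sum(counter.values()) for hap, n in counter.items()}
        for group, counter in counts.items()
    }


# Hypothetical accessions; only the third marker (the T/G coding SNP) varies.
HAP_I = ("C", "(GTTG)3", "T", "T", "A")
HAP_II = ("C", "(GTTG)3", "G", "T", "A")
toy_records = [
    ("kabuli", HAP_I), ("kabuli", HAP_I), ("kabuli", HAP_II),
    ("desi", HAP_II), ("desi", HAP_II), ("desi", HAP_II), ("desi", HAP_I),
]
toy_frequencies = haplotype_frequencies(toy_records)
```

The resulting per-group frequencies are the kind of quantity summarized by the percentages of desi, kabuli and wild accessions carrying each haplotype.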

We observed sharing of all four SNP-SSR marker-based haplotypes among the 93 cultivated desi and kabuli as well as 79 wild accessions representing primary, secondary and tertiary gene pools (**Figure 4B**). The B/GR and BE seed color-associated haplotypes IV and I were the most (40.5%) and least (10.1%) represented in the wild chickpea accessions, respectively. However, the B/GR-associated haplotype IV was primarily represented by the wild chickpea accessions (38%) of the secondary gene pool (C. bijugum, C. judaicum, and C. pinnatifidum) (**Figure 4B**). The wild accessions of C. reticulatum from the primary gene pool had the maximum representation of YB/LB (10.1%) seed color-associated haplotype II. The greatest haplotype sharing was evident between BE seed color-associated haplotype I in cultivated kabuli accessions and YB/LB seed color-associated haplotype II in desi accessions, followed by sharing of YB/LB seed color-associated haplotype II between desi and wild C. reticulatum accessions (**Figure 4B**). The B/GR seed color-associated haplotype IV gave evidence of a putative recombination event during chickpea domestication. These marker haplotype sharing analyses reflected a single origin of the "G" functional SNP allele, primarily from C. reticulatum, which was subsequently introgressed into most of the LB/YB seed colored desi accessions rather than into kabuli accessions with BE seed coat color. The nucleotide diversity (θπ and θω) and Tajima's D estimation (Table S3) further inferred the occurrence of very strong purifying selection (with reduced D) in favor of retention of the LB/YB seed coat color-associated "G" SNP locus and haplotype II in the MATE secondary transporter gene, specifically in desi and wild C. reticulatum accessions, toward assortment of the more preferred LB/YB seed coat color in chickpea.
These findings may be a result of domestication bottlenecks coupled with artificial selection/modern breeding efforts that are constantly practiced during the genetic improvement of chickpea accessions for diverse seed coat color characteristics of high consumer preference and trade value. These findings infer that the natural allelic/haplotype variation discovered in the MATE gene is possibly associated with seed coat color trait evolution and that seed color is therefore expected to represent an important component of the domestication trait in chickpea.

The chickpea MATE gene is a homolog of the Arabidopsis TT12 (transparent testa 12) and Medicago MATE1 genes and encodes a membrane protein, a multidrug and toxic compound extrusion (MATE) secondary transporter (Debeaujon et al., 2001). The detailed phenotypic characterization of tt12 and mate1 mutants and complementation analysis of their corresponding genes in Arabidopsis and Medicago, respectively, inferred that these tonoplastic late biosynthetic genes mediate transport of precursors for proanthocyanidin (PA) biosynthesis into the vacuoles of seed coat endothelial cells in developing immature seeds for their coloration/pigmentation (Lepiniec et al., 2006; Marinova et al., 2007; Zhao and Dixon, 2009; Zhao et al., 2010). The combinatorial approach of GWAS, QTL mapping, transcript profiling, gene-based marker haplotyping/LD mapping and PA estimation assay indicates that the BE seed color-associated haplotype I [C-(GTTG)3-T-T-A] identified in a MATE gene might be involved in down-regulation/lower expression and accumulation of its encoding transcript, which led to reduced synthesis of both soluble and insoluble PAs in pericarps of BE seed colored mature seeds relative to YB/LB seed colored mature seeds with haplotype II [C-(GTTG)3-G-T-A]. These results are in agreement with earlier observations on lower accumulation of PAs

(soluble/insoluble) in the vacuoles of endothelial cells of the mature seed coat during later stages of seed development and their involvement in differential seed coat/pericarp coloration in crop plants (Abrahams et al., 2003; Kitamura et al., 2004; Liu et al., 2014). However, detailed molecular characterization of the MATE secondary transporter gene is required to decipher its underlying regulatory mechanism, evolutionary cues and biochemical interactions governing diverse seed coat coloration in desi, kabuli and wild accessions during chickpea domestication. Therefore, the strong seed color-regulatory MATE gene, once functionally well characterized, can be utilized in marker-assisted seed color breeding for developing chickpea varieties with improved seed coat color of consumer preference and trade value.

## REFERENCES


## ACKNOWLEDGMENTS

The authors gratefully acknowledge the financial support for this research study provided by a research grant from the Department of Biotechnology (DBT), Government of India (102/IFD/SAN/2161/2013-14). SD and RR acknowledge the DBT and CSIR (Council of Scientific and Industrial Research) for Junior/Senior Research Fellowship awards.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2015.00979




**Conflict of Interest Statement:** The reviewer Bharadwaj Chellapilla declares that, despite being affiliated with the same institute and having previously collaborated with the author Shailesh Tripathi, the review process was handled objectively. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Bajaj, Das, Upadhyaya, Ranjan, Badoni, Kumar, Tripathi, Gowda, Sharma, Singh, Tyagi and Parida. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## **Mitogen-activated protein kinase cascades in** *Vitis vinifera*

#### *Birsen Çakır<sup>1</sup>\* and Ozan Kılıçkaya<sup>2</sup>*

<sup>1</sup> Department of Horticulture, Faculty of Agriculture, Ege University, Izmir, Turkey, <sup>2</sup> Department of Pharmaceutical Biotechnology, Faculty of Pharmacy, Cumhuriyet University, Sivas, Turkey

Protein phosphorylation is one of the most important mechanisms to control cellular functions in response to external and endogenous signals. Mitogen-activated protein kinases (MAPK) are universal signaling molecules in eukaryotes that mediate the intracellular transmission of extracellular signals resulting in the induction of appropriate cellular responses. MAPK cascades are composed of four protein kinase modules: MAPKKK kinases (MAPKKKKs), MAPKK kinases (MAPKKKs), MAPK kinases (MAPKKs), and MAPKs. In plants, MAPKs are activated in response to abiotic stresses, wounding, and hormones, and during plant-pathogen interactions and cell division. In this report, we performed a complete inventory of MAPK cascade genes in Vitis vinifera, the whole genome of which has been sequenced. By comparison with the MAPK, MAPK kinase, MAPK kinase kinase and MAPK kinase kinase kinase members of Arabidopsis thaliana, we revealed the existence of 14 MAPKs, 5 MAPKKs, 62 MAPKKKs, and 7 MAPKKKKs in Vitis vinifera. We identified orthologs of V. vinifera putative MAPKs in different species, and ESTs corresponding to members of MAPK cascades in various tissues. This work represents the first complete inventory of MAPK cascades in V. vinifera and could help elucidate the biological and physiological functions of these proteins in V. vinifera.

*Edited by:*

Joanna Marie-France Cross, İnönü University, Turkey

*Reviewed by:*

Matthew R. Willmann, University of Pennsylvania, USA

Pao-Yang Chen, Academia Sinica, Taiwan

Samia Daldoul, Center of Biotechnology of Borj Cedria, Tunisia

#### *\*Correspondence:*

Birsen Çakır, Department of Horticulture, Faculty of Agriculture, Ege University, Bornova/Izmir 35100, Turkey birsencakir@hotmail.com

#### *Specialty section:*

This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science

*Received:* 08 February 2015 *Accepted:* 07 July 2015 *Published:* 22 July 2015

#### *Citation:*

Çakır B and Kılıçkaya O (2015) Mitogen-activated protein kinase cascades in Vitis vinifera. Front. Plant Sci. 6:556. doi: 10.3389/fpls.2015.00556

**Keywords: MAP kinase,** *Vitis vinifera***, signal transduction, protein phosphorylation**

## **Introduction**

Mitogen-activated protein kinase (MAPK) cascades are highly conserved modules of signal transduction in eukaryotes including yeast, animals, and plants. MAPK cascades play an important role in protein phosphorylation during signal transduction events (Rodriguez et al., 2010). MAPK cascades typically consist of three protein kinases, MAPK, MAPK kinase (MAPKK), and MAPK kinase kinase (MAPKKK), but sometimes include a MAP3K kinase (MAP4K); each kinase phosphorylates the corresponding downstream substrate (Jonak et al., 2002; Champion et al., 2004).

MAPK is activated via phosphorylation of conserved threonine (T) and tyrosine (Y) residues in the catalytic subdomain by its specific MAPKK, which is in turn activated by phosphorylation of two serine/threonine residues in a conserved S/T-X<sub>3–5</sub>-S/T motif by an upstream MAPKKK (Stulemeijer et al., 2007; Zaïdi et al., 2010; Huang et al., 2011). Upon activation, the MAPK can be translocated into the nucleus or cytoplasm to trigger cellular responses through phosphorylation of downstream transcription factors or components of the transcription machinery, while some MAP kinases, like ERK3, are constitutively present in the nucleus and may function

**Abbreviations:** MAPK, mitogen-activated protein kinase; ORF, open reading frame.

in the nucleus (Lee et al., 2004; Pedley and Martin, 2005; Fiil et al., 2009; Nadarajah and Sidek, 2010). MAPKKK is usually activated by a G protein, but sometimes activation is mediated via an upstream MAP4K (Champion et al., 2004).

MAPK proteins contain 11 evolutionarily conserved kinase domains that may be involved in substrate specificity or protein-protein interaction (Nadarajah and Sidek, 2010). MAPK cascade proteins have TEY or TDY phosphorylation motifs in the region between kinase domains VII and VIII (Group et al., 2002), which provides a protein-binding domain for the activation of MAPKs (Rohila and Yang, 2007).

In plants, MAPKs are involved in cellular responses to hormones, plant growth and development, regulation of the cell cycle, and responses to biotic and abiotic stresses (Jonak et al., 1993; Wilson et al., 1997; Zhang and Klessig, 1997; Bögre et al., 1999; Nishihama et al., 2001; Bergmann et al., 2004; Lukowitz et al., 2004; Katou et al., 2005; Meng et al., 2012).

A variety of genes encoding MAPKs have been cloned from *Arabidopsis*, rice, tobacco, barley, and oat (Huttly and Phillips, 1995; Knetsch et al., 1996; Mizoguchi et al., 1998; Nadarajah and Sidek, 2010; Zaïdi et al., 2010; Sun et al., 2014). The *Arabidopsis* genome contains 20 MAPK genes (Group et al., 2002; Jonak et al., 2002). MAPK genes such as AtMPK4 and AtMPK6 have been identified in *Arabidopsis* (Ichimura et al., 1998, 2000; Nadarajah and Sidek, 2010). It has been reported that MAPK genes are involved in biotic and abiotic stress responses (Mizoguchi et al., 1996; Ichimura et al., 2000; Asai et al., 2002; Nadarajah and Sidek, 2010). For example, OsMAPK3, OsMAPK6, and the MAPK kinase OsMKK4 are induced by a chitin elicitor in rice, and the activated form of OsMKK4 induces cell death (Kishi-Kaboshi et al., 2010). Similarly, NtWIPK, OsMPK5, and AtMPK3 were activated by pathogens and abiotic stresses (Zhang and Klessig, 2001; Hamel et al., 2006; Rohila and Yang, 2007). AtMPK4 and AtMPK6 are activated by osmotic stress, low humidity, low temperature, and wounding (Ichimura et al., 2000; Teige et al., 2004). AtMPK3 and AtMPK6 are also regulated by biotic elicitors via AtMKK4/5, and AtMPK4 is a negative regulator of the defense response (Asai et al., 2002). In addition, AtMPK3 and AtMPK6 are involved in embryo, anther, and inflorescence development and in stomatal distribution on the leaf surface (Bergmann et al., 2004; Gray and Hetherington, 2004; Bush and Krysan, 2007).

MKKs are activated by the phosphorylation on conserved serine and threonine residues in the S/T-X3-5-S/T motif and characterized by a putative MAPK-docking domain K/R-K/R-K/R-X1-6-L-X-L/V/S, and a kinase domain (Group et al., 2002). To date, many MAPKKs have been identified from several plant species. All the identified MAPKK genes from *Arabidopsis*, rice and poplar contain 11 catalytic subdomains (Ichimura et al., 2002; Rao et al., 2010; Wang et al., 2014c). In *Arabidopsis*, MKK1 was activated by wounding and abiotic stress (Matsuoka et al., 2002). Alfalfa SIMKK mediates both salt and elicitor-induced signals (Kiegerl et al., 2000; Cardinale et al., 2002). NtMEK2 activates SIPK and WIPK resulting in cell death (Yang et al., 2001).
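The two MAPKK motifs described above can be written as regular expressions. The following is a minimal sketch, assuming one-letter amino acid codes and reading "X3-5" and "X1-6" as runs of 3 to 5 and 1 to 6 arbitrary residues; the function name and the toy sequences are hypothetical and are not taken from the *Vitis* proteome.

```python
import re

# S/T-X3-5-S/T: two Ser/Thr residues separated by 3 to 5 arbitrary residues.
ACTIVATION_LOOP = re.compile(r"[ST].{3,5}[ST]")
# K/R-K/R-K/R-X1-6-L-X-L/V/S: putative MAPK-docking domain.
DOCKING = re.compile(r"[KR][KR][KR].{1,6}L.[LVS]")


def find_mapkk_motifs(seq):
    """Return (start, matched_text) pairs for each motif (hypothetical helper).

    Matches are non-overlapping and positions are 0-based.
    """
    seq = seq.upper()
    return {
        "activation_loop": [(m.start(), m.group()) for m in ACTIVATION_LOOP.finditer(seq)],
        "docking": [(m.start(), m.group()) for m in DOCKING.finditer(seq)],
    }


if __name__ == "__main__":
    # Toy sequence for illustration only.
    print(find_mapkk_motifs("AAKRKPAGLTLAA"))
```

Because both patterns are short and degenerate, hits of this kind are only candidates; a real screen would presumably also require the motif to lie in the expected region of the kinase domain.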

MAPKKKs form the largest class of MAPK cascade enzymes, with 80 members in *Arabidopsis* classified into three subfamilies, MEKK, Raf, and ZIK, containing 21, 48, and 11 genes, respectively (Jonak et al., 2002). Plant MAPKKKs are characterized by different primary structures of their kinase domains, but are conserved within a single group (Champion et al., 2004). The MEKK subfamily comprises a conserved kinase domain of G(T/S)Px(W/Y/F)MAPEV (Jonak et al., 2002). The ZIK subfamily contains GTPEFMAPE(L/V)Y while the Raf subfamily has GTxx(W/Y)MAPE (Jonak et al., 2002). All the MAPKKK proteins have a kinase domain, and most of them have a serine/threonine protein kinase active site (Wang et al., 2015). In the Raf subfamily, most of the proteins have a long N-terminal regulatory domain and a C-terminal kinase domain. By contrast, the majority of the members of the ZIK subfamily have an N-terminal kinase domain (Wang et al., 2015). However, the MEKK subfamily has a less conserved protein structure, with a kinase domain located at the C- or N-terminus or in the central part of the protein (Wang et al., 2015). Homologs of MAPKKKs have been identified in plant species such as alfalfa, *Arabidopsis*, and tobacco (Kovtun et al., 2000; Nishihama et al., 2001; Lukowitz et al., 2004; Nakagami et al., 2004). The MEKK subfamily contains NPK1, NbMAPKKKα, NbMAPKKKγ, and NbMAPKKKε in tobacco (Jin et al., 2002; del Pozo et al., 2004; Liu et al., 2004; Melech-Bonfil and Sessa, 2010), MEKK1 in *Arabidopsis* (Asai et al., 2002), and SlMAPKKKα and SlMAPKKKε in tomato (Oh et al., 2010; Sun et al., 2014). The second subfamily, Raf, includes *Arabidopsis* CTR1/Raf1 (Kieber et al., 1993), EDR/Raf2 (Frye et al., 2001), and DSM1 in rice (Ning et al., 2010). In *Arabidopsis*, MEKK1 regulates defense responses against different pathogens including bacteria and fungi (Asai et al., 2002; Qiu et al., 2008; Galletti et al., 2011). In addition, AtEDR1, a Raf-like MAPKKK, regulates SA-inducible defense responses (Frye et al., 2001). Members of the ZIK subfamily, which contains 10 and 9 members in *Arabidopsis* and rice, respectively, are able to regulate flowering time and circadian rhythms (Wang et al., 2008; Kumar et al., 2011).
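The three subfamily signatures quoted above (G(T/S)Px(W/Y/F)MAPEV for MEKK, GTPEFMAPE(L/V)Y for ZIK, GTxx(W/Y)MAPE for Raf) can be used for a first-pass assignment of a MAPKKK sequence. The sketch below is illustrative only: the function name is hypothetical, "x" is read as any single residue, and the order in which the patterns are tried is an assumption made here because the patterns overlap (a ZIK signature ending in V also satisfies the MEKK pattern, and some MEKK signatures satisfy the Raf pattern).

```python
import re

# Most specific pattern first, because the three signatures overlap.
SIGNATURES = [
    ("ZIK", re.compile(r"GTPEFMAPE[LV]Y")),
    ("MEKK", re.compile(r"G[TS]P.[WYF]MAPEV")),
    ("Raf", re.compile(r"GT..[WY]MAPE")),
]


def classify_mapkkk(seq):
    """Return 'ZIK', 'MEKK', 'Raf', or None for a protein sequence.

    Hypothetical helper: the first matching signature wins.
    """
    seq = seq.upper()
    for name, pattern in SIGNATURES:
        if pattern.search(seq):
            return name
    return None


if __name__ == "__main__":
    # Toy fragments for illustration, not real grapevine sequences.
    for fragment in ("AAGTPEFMAPEVYAA", "AAGSPAWMAPEVAA", "AAGTAAYMAPEAA"):
        print(fragment, classify_mapkkk(fragment))
```

A signature match of this kind would normally be cross-checked against the phylogenetic placement of the sequence, since a single substitution in the motif leaves a sequence unclassified.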

A putative phosphorylation domain T/Sx5T/S is found between domains VII and VIII in MAP4Ks, which is identical to the phosphorylation motif of MAPKKs from plants (Jouannic et al., 1999; Ichimura et al., 2002). Both domains participate in peptide-substrate recognition (Champion et al., 2004). MAP4Ks can be linked to the plasma membrane through association with a small GTPase or lipid (Qi and Elion, 2005). They are directly activated by stimulated interaction with adaptor proteins (Qi and Elion, 2005). The MAP4Ks are divided into eight classes including PAK-related, Gck, Mst, Tao, Ste/PAK, and Sok (Champion et al., 2004). The majority of MAP4Ks are from the large class of Ste20 protein kinases, which exhibit a highly diverse noncatalytic domain (Dan et al., 2001). The PAKs, which have a C-terminal catalytic domain, are separated from the GC kinase-related polypeptides, which contain an N-terminal catalytic domain (Dan et al., 2001). Most of the MAP4Ks contain an N-terminal catalytic domain, but members of the STE20/PAK group have a C-terminal kinase domain, and some plant MAP4Ks have their kinase domain in the middle of the sequence (Leprince et al., 1999). The *Arabidopsis* genome contains 10 putative MAP4Ks (Champion et al., 2004). A maize gene encodes MIK, a GCK-like kinase belonging to a subfamily of MAP4Ks (Llompart et al., 2003) that links membrane-located receptors to MAP kinases (Dan et al., 2001). Some MAP4Ks are able to phosphorylate MEKK or Raf members, whereas other MAP4Ks either phosphorylate MAPKKs or function as adaptors (Champion et al., 2004).

However, the functions of most MAPK genes in plants are still unknown. Although MAPK cascades are involved in signaling multiple defense responses, the role of *Vitis* MAPK cascades in response to biotic and abiotic stresses has not been elucidated. In previous studies in grapevine, a few components of the MAPK gene family were isolated (Wang et al., 2014a). In addition, the MAPKKK gene family was identified and its expression profiles were analyzed in different organs in response to different stresses (Wang et al., 2014b). Interestingly, the expression of a *VvMAP* kinase gene was induced by salinity and drought (Daldoul et al., 2012). However, the MAPKK and the MAPKKKK subfamilies have not yet been characterized. To explore the role of MAPK cascade proteins in biotic and abiotic stress responses in grapevine, the publicly available grapevine genome (Jaillon et al., 2007) was analyzed to identify all members of the MAPK cascade proteins. Using these databases, we characterized all members of the MAPK cascades of *V. vinifera* and performed a phylogenetic analysis in comparison with members of the *Arabidopsis* MAPK cascade proteins.

## **Materials and Methods**

## **Genome-wide Identification of MAPK Cascade Genes in Grapevine**

The MAPK cascade protein sequences of *Arabidopsis thaliana* were used to search against the *V. vinifera* proteome 12× database (http://www.genoscope.cns.fr/externe/GenomeBrowser/Vitis/) using a BLASTP analysis (http://www.ncbi.nlm.nih.gov/blast) (Altschul et al., 1990) with scores higher than 400 and an "E" value > e-120 (Çakır and Kılıçkaya, 2013). The sequences of *Arabidopsis* MAPK cascade proteins were obtained from TAIR (http://www.arabidopsis.org/). The MAPK domain (PS01351), ATP-binding domain (PS00107), protein kinase domain (PS50011), and serine/threonine protein kinase active site (PS00108) were identified in the sequences of polypeptides corresponding to *V. vinifera* MAPK cascade proteins using the Conserved Domain Database (CDD) at NCBI (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) and PROSITE (http://prosite.expasy.org/) (Marchler-Bauer et al., 2009). In addition, the NCBI non-redundant protein database was screened with each sequence in order to independently validate the automatic annotation.
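A score and E-value screen of this kind could be applied to BLASTP tabular output (the standard 12-column `-outfmt 6` layout) roughly as sketched below. Several points are assumptions rather than the authors' procedure: the text writes the E-value criterion as "> e-120", whereas this sketch treats 1e-120 as an upper bound, which is the usual convention for a stringency cutoff; the function name, the subject IDs, and all numbers in the sample rows are invented for illustration.

```python
def filter_blast_hits(tabular_text, min_bitscore=400.0, max_evalue=1e-120):
    """Keep hits with bit score above min_bitscore and E-value at or below max_evalue.

    Expects BLAST tabular rows with the standard columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    (Hypothetical helper; thresholds are an interpretation of the text.)
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if bitscore > min_bitscore and evalue <= max_evalue:
            hits.append((fields[0], fields[1], evalue, bitscore))
    return hits


# Invented sample rows; the subject IDs are placeholders, not real accessions.
SAMPLE = "\n".join([
    "AtMPK3\tVv_hyp_001\t84.0\t370\t59\t0\t1\t370\t1\t370\t0.0\t650",
    "AtMPK3\tVv_hyp_002\t71.5\t360\t100\t2\t5\t364\t3\t360\t1e-125\t450",
    "AtMPK3\tVv_hyp_003\t40.2\t300\t170\t6\t20\t310\t15\t300\t3e-45\t170",
])

if __name__ == "__main__":
    for hit in filter_blast_hits(SAMPLE):
        print(hit)
```

With the sample rows, the first two hits pass both thresholds and the third is rejected on both.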

## **Multiple-sequence Alignment and Phylogenetic Tree Construction**

The putative MAPK cascade protein sequences were aligned using CLUSTAL W and subjected to phylogenetic analysis by both the maximum parsimony and the distance (neighbor-joining) methods with 1000 bootstrap replicates (Saitou and Nei, 1987; Thompson et al., 1994). The phylogenetic tree was illustrated using MEGA5. Because similar results were obtained with both methods, only the single tree retrieved from the distance analysis is discussed in detail.

For MAPK cascade subfamilies from both *V. vinifera* and *A. thaliana*, multiple sequence alignment was performed using the multiple sequence comparison by log-expectation (MUSCLE) alignment tool (http://www.ebi.ac.uk/Tools/msa/muscle/) (Edgar, 2004). The phylogenetic analysis was performed using a neighbor-joining method with 1000 bootstrap replicates and visualized with MEGA5 software (Tamura et al., 2011). The theoretical molecular weight and isoelectric point of each protein were predicted using Compute pI/Mw (http://au.expasy.org/tools).
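As a rough illustration of what a theoretical molecular weight means, the value can be approximated as the sum of average residue masses plus one water molecule for the free termini. The sketch below uses standard average residue masses in daltons; it is not the implementation of the Compute pI/Mw tool, the function name is hypothetical, and it ignores modifications and non-standard residues.

```python
# Average residue masses (Da) of the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01524  # one water per chain (N-terminal H plus C-terminal OH)


def theoretical_mw(seq):
    """Approximate molecular weight (Da) of an unmodified protein sequence."""
    seq = seq.upper()
    unknown = set(seq) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER


if __name__ == "__main__":
    # Free glycine is about 75.07 Da, a quick sanity check of the table.
    print(round(theoretical_mw("G"), 2))
```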

## **Orthology Analysis and Database Search**

Orthology analysis was performed using the PHOG web server (http://phylofacts.berkeley.edu/orthologs/) (Datta et al., 2009). The sequences of conserved domains with similarity over 70% and an "E" value of 0.0 were selected as queries. The selected sequences of conserved domains from different species were then used in a BLASTP search against the *V. vinifera* protein sequence database. The best hits were annotated as putative orthologous sequences (Moreno-Hagelsieb and Latimer, 2008).
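The best-hit annotation step can be sketched as follows. This is an illustration under assumptions: the text only states that best hits were annotated as putative orthologs, so the reciprocal check shown here (the criterion examined by Moreno-Hagelsieb and Latimer, 2008) may be stricter than what was actually applied, and all names and scores in the toy data are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


if __name__ == "__main__":
    # Toy scores: AtA and VvX pick each other; AtB picks VvY, but VvY prefers AtA.
    forward = [("AtA", "VvX", 700), ("AtA", "VvY", 300), ("AtB", "VvY", 500)]
    reverse = [("VvX", "AtA", 690), ("VvY", "AtA", 510), ("VvY", "AtB", 480)]
    print(reciprocal_best_hits(forward, reverse))
```

In the toy data only the AtA/VvX pair is reported, since the second pairing is a best hit in one direction only.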

Expressed sequence tags (ESTs) were identified by BLASTn searches of the *V. vinifera* EST database (http://www.ncbi.nlm.nih.gov/dbEST) using the sequences of all of the MAPK cascade proteins as queries. The positive sequences were then confirmed by alignment with the query ORF.

## **Results and Discussion**

## **Genome-wide Identification of MAPK Cascade Genes in** *Vitis vinifera*

*Vitis vinifera* MAPK cascade sequences were mined from the grapevine genome proteome 12× database (Jaillon et al., 2007). We identified 88 ORFs encoding putative MAPK cascade proteins containing at least a MAPK domain by BLAST searches of the grapevine genome proteome 12× database with the amino acid sequences of the MAPK cascade proteins from *A. thaliana* as queries (**Table 1**). The completed *Vitis* genome contains 14 MAPKs, 5 MAPKKs, 62 MAPKKKs, and 7 MAPKKKKs (**Table 1**).

## **Phylogenetic Analysis**

All predicted MAPK cascade family sequences were aligned using ClustalW (Thompson et al., 1994). A rooted phylogenetic tree was constructed by alignment of full length amino acid sequences using the MEGA5 program and maximum parsimony and distance with neighbor-joining methods (Saitou and Nei, 1987) (**Figure 1**). One thousand bootstrap replicates were produced for each analysis.
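The bootstrap step resamples alignment columns with replacement, so that each replicate has the same length as the original alignment; a tree is then built from every replicate and branch support is the fraction of replicates recovering a given grouping. Only the resampling is sketched below. MEGA5 performs this internally; the function names, the seed parameter, and the toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name to an aligned string (equal lengths).
    Whole columns are drawn, so residues from one column stay together.
    """
    n_cols = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n=1000, seed=0):
    """Generate n pseudo-replicate alignments from a seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]


if __name__ == "__main__":
    toy = {"seqA": "ACDE", "seqB": "ACDF"}  # toy alignment, illustration only
    replicates = bootstrap_replicates(toy, n=3)
    print(len(replicates), [len(r["seqA"]) for r in replicates])
```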

*Vitis* MAPK cascade sequences can be divided into four subfamilies on the basis of the presence of conserved threonine and tyrosine residues in the TxY motif located in the activation loop (T-loop) between kinase subdomains VII and VIII. In addition, we identified a MAPKKKK subfamily with 7 members in the *Vitis* genome, which has the conserved amino acid motif TFVGTPxWMAPEV as described (Jonak et al., 2002). The members of each of the four subfamilies clustered more tightly with each other than with members of other subfamilies (**Figure 1**).


**TABLE 1 | Detailed inventory of the** *Vitis* **MAPK cascade proteins.**




### **MAPKs**

The phylogenetic analysis showed that the VvMAPKs were divided into five distinct groups, which is more than in previous reports (Kumar and Kirti, 2010; Nadarajah and Sidek, 2010). Among the plant species compared, Group V MAPKs are found only in the grapevine genome. All identified ORFs encoding MAPKs were named VvMPK1 through 14. Hyun et al. (2010) reported 12 MAPKs based on the 8x sequence coverage of the grapevine genome, whereas we identified a total of 14 ORFs in the *Vitis* 12x genome coverage, which may be due to errors corrected in the 12x genome sequence coverage. The grapevine genome contains fewer MAPKs than *Arabidopsis* (20 MAPKs) (Ichimura et al., 2002) and rice (17 MAPKs) (Liu and Xue, 2007). Members of the *Vitis* MAPK subfamily show 20–86% identity to each other. Full-length MAPK proteins ranged in size from 195 to 769 amino acids (**Table 1**). Variation in the length of the entire MAPK gene is usually due to differences in the length of the MAPK domain and/or in the number of introns. The difference in length among *MAPK* genes may indicate the presence or absence of motifs which could affect functional specificity.
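Percent identity between two members of a subfamily is computed from their aligned sequences. The sketch below shows one common convention, counting identical residues over the columns in which neither sequence has a gap; the text does not state which denominator was used, so this choice, like the function name and toy strings, is an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the denominator
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


if __name__ == "__main__":
    # Five ungapped columns, four identical.
    print(percent_identity("ACDE-F", "ACDQGF"))
```

Other conventions (dividing by the alignment length or by the shorter sequence length) give different values for the same pair, which matters when comparing identity ranges across studies.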

VvMPK12 and VvMPK14 belong to group I, which contains well-characterized *MAPK* genes including *AtMPK3* and *AtMPK6* (**Figure 2**). It has been demonstrated that *AtMPK3* and *OsMPK5* were activated in response to pathogens and abiotic stresses (Zhang and Klessig, 2001; Hamel et al., 2006; Rohila and Yang, 2007). *OsMPK5* plays an important role in resistance to blast disease (Song and Goodman, 2002; Huang et al., 2011). *AtMPK6* can be activated by various abiotic and biotic stresses (Ichimura et al., 2000; Yuasa et al., 2001; Feilner et al., 2005; Huang et al., 2011). Similarly, *PtrMAPK* is involved in resistance to both dehydration and cold (Huang et al., 2011).

Group II MAPKs are involved in both abiotic stresses and cell division in *Arabidopsis*. VvMPK13, VvMPK11, and VvMPK9 cluster with Group II, which includes AtMPK4, AtMPK5, AtMPK12, and AtMPK11. AtMPK4 and its upstream MAPKK AtMKK2 can be activated by biotic and abiotic stresses (Ichimura et al., 2000; Teige et al., 2004).

VvMPK4 and VvMPK8 belong to group III. AtMPK1 in group III is regulated by salt stress treatment (Mizoguchi et al., 1996). In addition, AtMPK1 and AtMPK2 are activated by ABA (Ortiz-Masia et al., 2007). Group III genes, such as rice BWMK1 and alfalfa TDY1, are activated by wounding and pathogens (Nowak et al., 1997; Lynch et al., 2001).

Group IV members of the *Vitis* MAPKs, which include VvMPK1, VvMPK3, VvMPK5, VvMPK6, and VvMPK7, have the TDY motif in their T-loop and lack the C-terminal CD domain, which is consistently found in members of the other MAPK groups. VvMPK2 and VvMPK10, belonging to group V, were separated from the other groups.

The orthology analysis program identified 114 orthologs from various plant species for this subfamily (**Table 2**). The VvMPK3 amino acid sequence shows 83% similarity with AtMPK9, and VvMPK12 shows 84% similarity with AtMPK3 from *A. thaliana*. The members of the VvMAPK subfamily share between 75.8 and 91.8% similarity with the MAPK members from *Ricinus communis*, *Oryza sativa*, and *A. thaliana*. The phylogenetic analysis of the *A. thaliana* and *V. vinifera* MAPK subfamilies confirmed the ortholog pairs VvMPK14/AtMPK6, VvMPK12/AtMPK3, VvMPK11/AtMPK13, VvMPK13/AtMPK12, VvMPK7/AtMPK16, and VvMPK3/AtMPK9 (**Figure 2**).

All 14 *Vitis* MAPK proteins are represented in the *Vitis* EST database (Supplementary Table 1) and are expressed in different tissues such as fruits, berries, buds, flowers, leaves, and roots. In addition, 12 *VvMPK* genes were isolated (Wang et al., 2014a). Expression analysis of *VvMPK* genes showed that all *VvMPK* genes are expressed during grapevine growth and development, and under biotic and abiotic stresses (Wang et al., 2014a).

#### **MAPKKs**

This subfamily consists of 10 members in the *Arabidopsis* genome (Group et al., 2002), whereas the *Vitis* genome contains 5 members of the MAPKK subfamily. The full-length VvMKK sequences range in size from 224 to 519 amino acids (**Table 1**). The members of the MAPKK subfamily in the *Vitis* genome share 29–40% similarity with each other. By phylogenetic analysis, we also identified orthologs of *Vitis* MAPKKs in *Arabidopsis* such as VvMKK5/AtMKK3 (78.6% similarity), VvMKK3/AtMKK6 (83.1% similarity), and VvMKK2/AtMKK2 (70.4% similarity), supported with significant bootstrap values. The phylogenetic analysis confirmed that VvMKK3 shares 83.3% similarity with its homolog from *Arabidopsis* on the basis of orthology analysis (**Figure 3**, **Table 2**).

To date, none of the *Vitis* MAPKK homologs have been cloned or characterized. However, 98 ESTs were identified for this subfamily in different tissues in response to biotic or abiotic stresses (Supplementary Table 2). A role for the MAPK kinase MKK1 in abiotic stress signaling was previously demonstrated (Matsuoka et al., 2002). Analysis of MKK1 revealed that drought, salt stress, cold, and wounding activated MKK1, which in turn activates its downstream target MPK4 (Matsuoka et al., 2002). Tobacco NtMEK2 is functionally interchangeable with two *Arabidopsis* MAPKKs, AtMKK4 and AtMKK5, in activating the downstream MAPKs (Ren et al., 2002). MdMKK1 was reported to be downregulated by ABA (Wang et al., 2010). In *Arabidopsis*, AtMKK3 is upregulated in response to ABA (Hwa and Yang, 2008). Interestingly, AtMKK1/AtMKK2 play an important role in signaling in ROS homeostasis (Liu, 2012).

### **MAPKKKs**

With 62 members, the MAPKKK subfamily represents the largest subfamily of *V. vinifera* MAPK cascade proteins, although it is smaller than those of *Arabidopsis* (80 members) and rice (75 members) (Colcombet and Hirt, 2008; Rao et al., 2010). Recently, Wang et al. (2014b) identified 45 MAPKKK genes in the grapevine 12x genome coverage. The difference in the number of MAPKKK members identified in the grapevine genome may be related to the "E" value > E-120 used in this report, which is more significant. In addition, a domain scan using two different databases (PROSITE and CDD) can identify more sequences in the grapevine genome.

#### **TABLE 2 | Orthologs of** *Vitis* **MAPK cascade proteins identified in diverse plant species.**

Columns 1–4 contain the protein name, Vitis proteome 12× ID, GenBank ID, species, percentage identity (%ID), UniprotKB ID.

The members of the *Vitis* MAPKKK subfamily share 11–35% identity with each other and are distributed across various chromosomes (from 2 to 18) (**Table 1**). The full-length *Vitis* MAPKKK sequences range from 175 (VviMAPKKK38) to 1397 (VviMAPKKK17) amino acids. The phylogenetic analysis of both *Vitis* and *Arabidopsis* MAPKKK sequences shows that this subfamily is categorized into three main groups with bootstrap values up to 93% (**Figure 4**).

The first group contains MAPKKKs whose kinase domains have similarity to MEKK subfamily members (**Figure 4**) (Jonak et al., 2002). A second group includes Raf subfamily members while a third group presents ZIK subfamily members (**Figure 4**) (Jonak et al., 2002). In total, there are 21 VviMAPKKKs in the MEKK subfamily, while there are 12 in the ZIK subfamily and 29 in the Raf subfamily among the 62 members in the *Vitis* genome.
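The counts reported for grapevine can be checked for internal consistency with a few lines of arithmetic. The sketch below only restates numbers given in the text (14 MAPKs, 5 MAPKKs, 62 MAPKKKs, 7 MAPKKKKs; 21 MEKK, 12 ZIK, and 29 Raf members); the variable names are hypothetical.

```python
# Counts of Vitis MAPK cascade proteins as reported in the text.
vitis_cascade_counts = {"MAPK": 14, "MAPKK": 5, "MAPKKK": 62, "MAPKKKK": 7}
# Breakdown of the MAPKKK subfamily as reported in the text.
vitis_mapkkk_counts = {"MEKK": 21, "ZIK": 12, "Raf": 29}

total_orfs = sum(vitis_cascade_counts.values())
total_mapkkk = sum(vitis_mapkkk_counts.values())

if __name__ == "__main__":
    print("total ORFs:", total_orfs)
    print("MAPKKK breakdown matches:", total_mapkkk == vitis_cascade_counts["MAPKKK"])
```

The four classes sum to 88 ORFs, and the three MAPKKK groups sum to the 62 members reported for that subfamily.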

Analysis of the conserved domains of VviMAPKKKs identified a long regulatory domain in the N-terminal region and a kinase domain in the C-terminal region in most VviMAPKKKs. It has been suggested that the long regulatory domain in the N-terminal region of the Raf subfamily may be involved in protein-protein interactions and may regulate or specify their kinase activity (Jouannic et al., 1999). Twenty members of the *Vitis* MAPKKK subfamily share 75.1–89.2% similarity with their orthologs from different plant species (**Table 2**).

We identified at least 640 ESTs for 59 of the *Vitis* MAPKKKs (Supplementary Table 3), indicating that the MAPKKK subfamily is transcriptionally active. The expression profiles of *VviMAPKKK* genes suggested that some of them are involved in responses to biotic and abiotic stresses in different tissues and organs (Wang et al., 2014b). In support of a role for some *Vitis* MAPKKKs, *AtMEKK1* expression is enhanced by drought and salt stress (Mizoguchi et al., 1996). Recently, it was reported that AtMKK1/MKK2 and AtMEKK1 were able to negatively regulate programmed cell death (PCD) as well as immune responses (Kong et al., 2012). In tobacco, NPK1-MEK1-Ntf6 are also involved in resistance to tobacco mosaic virus (TMV) (Jin et al., 2002; Liu et al., 2004). In addition, AtEDR1, a Raf-like MAPKKK, could negatively regulate SA-inducible defense responses (Frye et al., 2001).

#### **MAPKKKKs**

In non-plants, MAPKKKs are activated either through phosphorylation by MAPKKK kinase (MAPKKKK or MAP4K) (Posas and Saito, 1997; Sells et al., 1997) or by G protein and G protein-coupled receptors (Fanger et al., 1997; Sugden and Clerk, 1997).

Several MAP4Ks have been identified in plant genomes based on phylogenetic analyses of their kinase domain. A MAP4K named MIK was characterized from *Zea mays* (Wang et al., 2014d). Recently, a new MAP4K from the GCK-II subfamily named ScMAP4K1, which plays important roles in ovule, seed, and fruit development, was characterized (Major et al., 2009).

In fully sequenced genomes, like *Arabidopsis* and rice, at least 10 protein kinases can be phylogenetically classified as MAP4K (Champion et al., 2004). Little is known about the roles of MAP4Ks in plants. Seven ORFs showing strong similarity with the 10 *Arabidopsis* MAP4Ks were identified in the *Vitis* genome (**Figure 5**) and shared 18–74% similarity with each other. They have been named VvMAP4K1 through 7 (**Table 1**). The phylogenetic analysis of *V. vinifera* and *A. thaliana* MAP4K proteins identified several orthologs in the two species such as VvMAP4K4/AtMAP4K8 (70% similarity), VvMAP4K1/AtMAP4K3 (66% similarity), VvMAP4K7/AtMAP4K4 (68% similarity), and VvMAP4K6/AtMAP4K10 (64% similarity) (**Figure 5**).

In addition, we identified several orthologs from different species for 3 VvMAP4Ks (**Table 2**). All 7 ORFs encoding *Vitis* MAP4Ks are transcriptionally active (Supplementary Table 4), but none of them has been cloned and characterized.

**from** *Arabidopsis* **and** *Vitis vinifera***.** The amino acid sequences of all Arabidopsis MAPKKKK proteins and those of Vitis vinifera were aligned using the MUSCLE program and subjected to phylogenetic analysis by the distance with neighbor-joining method using the MEGA5 programme.

AtMAP4K4 (At5g14720), AtMAP4K5 (At4g24100), AtMAP4K6 (At4g10730), AtMAP4K7 (At1g70430), AtMAP4K8 (At1g79640), AtMAP4K9 (At1g23700), AtMAP4K10 (At4g14480).

## **Conclusions**

This report represents the first complete genome-wide analysis of MAPK cascade proteins in grapevine. The identification of *Vitis* MAPK cascade proteins and their comparative analysis with the *Arabidopsis* MAPK cascade proteins indicate that MAPK cascade genes have been conserved during evolution. In this report, we annotated 88 ORFs encoding MAPK cascade proteins in *V. vinifera* using a bioinformatics approach. Taken as a whole, our data provide a foundation for future biological and physiological analysis of MAPK cascades from *V. vinifera*.

## **References**


## **Author Contributions**

BÇ conceived and designed all research. OK performed the bioinformatic analyses. BÇ analyzed data and wrote the article.

## **Acknowledgments**

This work was funded by the Department of Horticulture, Ege University, Turkey.

## **Supplementary Material**

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2015.00556


weighting, position-specific gap penalties and weight matrix choice. *Nucleic Acids Res.* 22, 4673–4680. doi: 10.1093/nar/22.22.4673


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Çakır and Kılıçkaya. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# **Biodiversity of genes encoding anti-microbial traits within plant associated microbes**

*Walaa K. Mousa<sup>1,2</sup> and Manish N. Raizada<sup>1</sup>\**

<sup>1</sup> Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada, <sup>2</sup> Department of Pharmacognosy, Faculty of Pharmacy, Mansoura University, Mansoura, Egypt

The plant is an attractive versatile home for diverse associated microbes. A subset of these microbes produces a diversity of anti-microbial natural products including polyketides, non-ribosomal peptides, terpenoids, heterocyclic nitrogenous compounds, volatile compounds, bacteriocins, and lytic enzymes. In recent years, detailed molecular analysis has led to a better understanding of the underlying genetic mechanisms. New genomic and bioinformatic tools have permitted comparisons of orthologous genes between species, leading to predictions of the associated evolutionary mechanisms responsible for diversification at the genetic and corresponding biochemical levels. The purpose of this review is to describe the biodiversity of biosynthetic genes of plant-associated bacteria and fungi that encode selected examples of antimicrobial natural products. For each compound, the target pathogen and biochemical mode of action are described, in order to draw attention to the complexity of these phenomena. We review recent information on the underlying molecular diversity and draw lessons through comparative genomic analysis of the orthologous coding sequences (CDS). We conclude by discussing emerging themes and gaps, the metabolic pathways in the context of the phylogeny and ecology of their microbial hosts, and potential evolutionary mechanisms that led to the diversification of biosynthetic gene clusters.

**Keywords: genes, biodiversity, evolution, plant associated microbes, rhizosphere, endophyte, antimicrobial secondary metabolites**

#### *Edited by:*

Joanna Marie-France Cross, Inönü University, Turkey

#### *Reviewed by:*

Nancy Keller, University of Wisconsin, USA Antoine Danchin, Amabiotics SAS, France

#### *\*Correspondence:*

Manish N. Raizada, Department of Plant Agriculture, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada raizada@uoguelph.ca

#### *Specialty section:*

This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science

*Received:* 21 January 2015 *Accepted:* 23 March 2015 *Published:* 10 April 2015

#### *Citation:*

Mousa WK and Raizada MN (2015) Biodiversity of genes encoding anti-microbial traits within plant associated microbes. Front. Plant Sci. 6:231. doi: 10.3389/fpls.2015.00231

## **Introduction**

The plant is an attractive versatile home for diverse microbes that can colonize internal plant tissues (endophytes), live on the surface (epiphytes) or in the soil surrounding the root system (rhizosphere microbiota) (Barea et al., 2005; Johnston-Monje and Raizada, 2011). Plant associated microbes have the potential to be used for biocontrol, the use of living organisms to suppress crop disease (Eilenberg, 2006), through various mechanisms including the production of antibiotics (Compant et al., 2005). Diverse classes of antimicrobial secondary metabolites of microbial origin have been reported (Mousa and Raizada, 2013), including polyketides, non-ribosomal peptides, terpenoids, heterocyclic nitrogenous compounds, volatile compounds, and bacteriocins, as well as lytic enzymes. Polyketides and non-ribosomal peptides constitute the majority of microbial derived natural products (Cane, 1997). Interestingly, the tremendous structural diversity of antimicrobial secondary metabolites originated via limited metabolic pathways utilizing few primary metabolites as precursors (Keller et al., 2005). Underlying the diversification of antimicrobial metabolites must have been a corresponding genetic diversification of ancestral genes driven by co-evolutionary pressures (Vining, 1992).

The revolution in genomics, genome mining tools and bioinformatics offers a new opportunity to connect biochemical diversity to the underlying genetic diversity and to analyze the evolutionary events leading to biodiversity (Zotchev et al., 2012; Scheffler et al., 2013; Deane and Mitchell, 2014).

The scope of this review is to describe the biodiversity of biosynthetic coding sequences (CDS) of plant-associated microbes (bacteria and fungi) that encode selected examples of antimicrobial secondary metabolites and lytic enzymes. For each example, the target pathogen(s) and mode of action are described where known, in order to highlight the diversity of biochemical targets. Out of necessity, the review focuses on compounds for which in-depth molecular analysis has been conducted. We review data pertaining to the underlying molecular diversity and highlight comparative genomic data of the orthologous genes. The review concludes with a discussion of common themes and gaps in the literature, and of the role of evolution in the diversification of biosynthetic gene clusters, including horizontal gene transfer (HGT).

## **Biosynthetic Genes Encode Diverse Chemical Classes of Anti-Microbial Compounds**

The diversity of compounds described in this review, the underlying genes, microbes, and pathogenic targets are summarized (**Tables 1**, **2**).

## **Polyketides**

The structures of the polyketides described in this review are shown (**Figure 1**).

## **2,4-Diacetylphloroglucinol**

2,4-diacetylphloroglucinol (2,4-DAPG) is a well-studied fluorescent polyketide metabolite produced by many strains of fluorescent *Pseudomonas* spp. that contributes to disease-suppressive soils of crops (McSpadden Gardener et al., 2000; Mavrodi et al., 2001). 2,4-DAPG is synthesized by the condensation of three molecules of acetyl coenzyme A and one molecule of malonyl coenzyme A to produce the precursor monoacetylphloroglucinol (MAPG) (Shanahan et al., 1992). In *P. fluorescens* strain Q2-87, four coding sequences (CDS) within the *phl* operon are responsible for biosynthesis of 2,4-DAPG: a single CDS (*phlD*) encoding a type III polyketide synthase is responsible for the production of phloroglucinol from the condensation of three acetyl-CoAs, and then three CDS (*phlACB*) encoding acetyltransferases are sufficient to convert phloroglucinol to 2,4-DAPG via MAPG (Bangera and Thomashow, 1999; Yang and Cao, 2012). It was suggested that the peptides encoded by *phlACB* may exist as a multi-enzyme complex (Bangera and Thomashow, 1999). *phlD* has been the subject of interest because it has homology to chalcone and stilbene synthases from plants, which suggests horizontal gene transfer (HGT) between plants and their rhizosphere microbial populations (Bangera and Thomashow, 1999). Whereas *phlACB* coding sequences are highly conserved between eubacteria and archaebacteria (Picard et al., 2000), a considerable degree of polymorphism was reported for *phlD* (Mavrodi et al., 2001). *phlA* transcription is negatively regulated by the product of *phlF*, which also appears to mediate repression by fusaric acid (Delany et al., 2000), a metabolite of fungal plant pathogens that has previously been implicated in repression of biosynthesis of the anti-fungal compound phenazine (see above) (van Rij et al., 2005). These observations demonstrate the ongoing arms race between plants, their fungal pathogens and associated anti-fungal antagonists, leading to gene diversification.

## **Mupirocin**

The polyketide mupirocin or pseudomonic acid is one of the major antibacterial metabolites produced by *Pseudomonas fluorescens* (Fuller et al., 1971) and is widely used as a clinical antibiotic (Gurney and Thomas, 2011). Mupirocin can inhibit the growth of methicillin-resistant *Staphylococci, Streptococci, Haemophilus influenzae,* and *Neisseria gonorrhoeae* (Sutherland et al., 1985). In terms of the mode of action, mupirocin inhibits isoleucyl-tRNA synthetase, and hence prevents incorporation of isoleucine into newly synthesized proteins, thus terminating protein synthesis (Hughes and Mellows, 1980). Biochemically, mupirocin has a unique chemical structure that contains a C9 saturated fatty acid (9-hydroxynonanoic acid) linked to C17 monic acid A (a heptaketide) by an ester linkage (Whatling et al., 1995). Mupirocin is derived from acetate units incorporated into monic acid A and 9-hydroxynonanoic acid via polyketide synthesis (Whatling et al., 1995). At the molecular level, the mupirocin biosynthetic gene cluster (*mup* operon) in *P. fluorescens* is complex, and includes 6 Type I polyketide synthases that are multifunctional as well as 29 proteins of single function within a 65 kb region, which are incorporated into 6 larger coding sequences (modules *mmpA-F*) (El-Sayed et al., 2003; Gurney and Thomas, 2011). The gene cluster is non-standard as the CDS are not in the same order as the biosynthetic steps (El-Sayed et al., 2003; Gurney and Thomas, 2011). The acyltransferase (AT) domains of the polyketide synthases (PKS) are not present in each genetic module but are instead encoded by a separate CDS (from the *mmpC* module), and this classifies these PKS as *in-trans* AT PKSs (El-Sayed et al., 2003). With respect to gene regulation, two putative regulatory genes, *mupR* and *mupI,* were identified within the cluster that are involved in quorum sensing (QS) dependent regulation (El-Sayed et al., 2001).

An interesting feature of this system in *P. fluorescens* is that self-resistance to mupirocin is also encoded by a CDS (*mupM*) within the biosynthetic gene cluster (El-Sayed et al., 2003). *mupM* encodes an isoleucyl-tRNA synthetase (IleS) that is resistant owing to polymorphisms within the mupirocin binding site (El-Sayed et al., 2003; Gurney and Thomas, 2011). A second resistant IleS, which showed 28% similarity to the *mupM* product, was cloned from *P. fluorescens* NCIMB 10586 outside of the *mup* gene cluster (Yanagisawa et al., 1994). Human pathogens that have high-level mupirocin resistance are associated with an additional gene that encodes a novel IleS with similarity to eukaryotic counterparts; this resistance gene is associated with transposable elements and is carried on plasmids, facilitating its rapid spread (Eltringham, 1997; Gurney and Thomas, 2011).



There is also genetic evidence that the entire *mup* gene cluster in *Pseudomonas* arose by horizontal gene transfer; specifically the genes encoding tRNAVal and tRNAAsp were found upstream of the *mupA* promoter region leading to speculation that the *mup* cluster arose from homologous recombination between chromosomal tRNA genes and possibly a plasmid containing the *mup* cluster (El-Sayed et al., 2003). The inclusion of a resistant IleS (*mupM*) within the *mup* biosynthetic cluster might have facilitated such horizontal gene transfer, as otherwise uptake of the mupirocin gene cluster would have been immediately suicidal.

## **Difficidin**

Difficidin is a polyketide with an interesting geometry that involves four double bonds in the Z configuration (Chen et al., 2006). Difficidin is produced by various *Bacillus* species such as *B. subtilis* and *B. amyloliquefaciens* FZB 42 and has broad antibacterial activity against human and crop pathogens (Zimmerman et al., 1987; Chen et al., 2006, 2009). A large gene cluster (*pks3*) encoding difficidin (and oxydifficidin) was characterized in *B. amyloliquefaciens* (Chen et al., 2006). This compound is included in this review because *pks3* is adjacent to other polyketide synthesis gene clusters, *pks1* and *pks2*, that encode bacillaene and macrolactin, respectively (Chen et al., 2006; Schneider et al., 2007). All three gene clusters share sequence homology and a similar order of CDS, and are located close to one another on the chromosome, leading Chen et al. (2006) to hypothesize that they emerged by homologous recombination from a common ancestral gene cluster, resulting in gene duplication. This system provides insights into the diversification of polyketides.

## **Pyoluteorin**

Pyoluteorin (PLt) is a phenolic polyketide with bactericidal, herbicidal, and fungicidal properties. PLt can suppress damping-off disease in cotton caused by the fungus *Pythium ultimum* (Howell and Stipanovic, 1980). PLt and phenazine (see below) may act synergistically to suppress such soil-borne fungal diseases in plants, as some studies have suggested that the two biosynthetic pathways interact with one another (Ge et al., 2007; Lu et al., 2009). The biosynthesis of PLt involves condensation of proline with three acetate equivalents through chlorination and oxidation. The carbon skeleton is built up by the action of a single multienzyme complex (Nowak-Thompson et al., 1999). In *Pseudomonas fluorescens* Pf-5, a 24 kb segment contains the PLt biosynthetic operon (*pltABCDEFG*). PLt biosynthesis is catalyzed by type I polyketide synthases (*pltB*, *pltC*), an acyl-CoA dehydrogenase (*pltE*), an acyl-CoA synthetase (*pltF*), a thioesterase (*pltG*), and halogenases (*pltA, pltD, pltM*), with *pltM* located adjacent to the gene cluster (Nowak-Thompson et al., 1999). A significant delay in the expression of the PLt biosynthetic operon was reported in the cucumber spermosphere compared to cotton, which correlated with the timing of infection with the fungal root pathogen *Pythium ultimum* (Kraus and Loper, 1995). The authors suggest that such temporal differences may be responsible for differential disease suppression in diverse plant hosts.

The *plt* biosynthetic operon has been shown to be regulated by a LysR family transcriptional activator, encoded by *pltR* (Nowak-Thompson et al., 1999). Interestingly, *pltR* is tightly linked and transcribed divergently to the biosynthetic gene cluster (Nowak-Thompson et al., 1999). In earlier studies involving the biosynthetic operon of phenazine, its LysR transcriptional regulator gene (*phzR*) was also shown to be tightly linked to its corresponding biosynthesis gene cluster (Pierson et al., 1998). As both phenazine and PLt combat soil-borne fungal diseases in plants, we speculate that strong evolutionary pressures in the rhizosphere may have promoted HGT of the biosynthetic operons to new rhizosphere microbial hosts; the activator-cluster gene module would facilitate activation of the biosynthetic CDS following such gene transfer.

## **Jadomycin**

Jadomycin is a member of the angucycline family of antibiotics produced by *Streptomyces* species such as *S. venezuelae*. Jadomycin (*jad*) production is induced under stress conditions such as phage infection or heat shock (Doull et al., 1994; Jakeman et al., 2009). The *jad* biosynthetic gene cluster in *S. venezuelae* is complex and closely related to type II polyketide synthase genes (Han et al., 1994; Zou et al., 2014). Jadomycin is of interest here because upstream of the *jad* operon are sets of negative regulatory genes including *jadR1R2R3* and *jadW123* (Yang et al., 1995; Zou et al., 2014). *jadW123* encodes enzymes for the biosynthesis of gamma-butyrolactones (GBL), whereas JadR2 is a pseudoreceptor for GBL which, upon binding GBL, activates JadR1 and JadR3; these subsequently act as positive and negative transcriptional regulators of the *jad* biosynthetic operon, respectively (Zou et al., 2014). GBLs are becoming well known as regulators of secondary metabolism in Gram-positive bacteria, analogous to the related acyl homoserine lactone compounds which mediate QS in Gram-negative bacteria (Nodwell, 2014). QS is a method of communication between bacterial populations that activates genes based on high cell density through the signal molecule N-acyl-homoserine lactone (AHL) (Whitehead et al., 2001). Whereas QS signaling molecules are thought to be synthesized and sensed by the same species (Nodwell, 2014), the GBL/*jad* system is interesting because recent data suggest that GBL can signal across different *Streptomyces* species to activate different polyketide biosynthetic pathways (Nodwell, 2014; Zou et al., 2014).
Biologically, it has been shown that different *Streptomyces* species, which are soil microbes, can live on the same grain of soil alongside a diversity of bacteria (Keller and Surette, 2006; Vetsigian et al., 2011), suggesting there may have been evolutionary selection for inter-species coordination for antibiotic production (Nodwell, 2014), resulting in enhanced genetic complexity associated with the *jad* locus.

## **Non-Ribosomal Peptides**

The structures of non-ribosomal peptides described in this review are shown (**Figure 2**).

### **Zwittermicin A**

Zwittermicin A is a polyketide/nonribosomal peptide hybrid antibiotic produced by *B. cereus* and *B. thuringiensis* (Raffel et al., 1996) with activity against oomycetes such as *Phytophthora medicaginis* and some other pathogenic fungi (Silo-Suh et al., 1998). Zwittermicin A has a unique structure that includes glycolyl moieties, a D-amino acid, and ethanolamine, in addition to the unusual terminal amide produced from ureidoalanine (a nonproteinogenic amino acid) (Kevany et al., 2009). Zwittermicin A is thought to be biosynthesized as part of a larger metabolite that is processed twice to form zwittermicin A and two other metabolites (Kevany et al., 2009). The complete biosynthetic operon encoding zwittermicin A includes 27 open reading frames (CDS, *zmaA* to *zmaV*) that extend over 62.5 kb of the *Bacillus cereus* UW85 genome, in addition to five individual genes (*kabR* and *kabA*–*kabD*) (Kevany et al., 2009). In this study, support was gained for the hypothesis that the skeleton of zwittermicin A is assembled by a megasynthase enzyme involving multiple nonribosomal peptide synthetases (NRPS) and PKS; the megasynthase has multiple modules containing distinct domains that catalyze the different steps in the pathway (Emmert et al., 2004; Kevany et al., 2009). Evidence suggested that the CDS included 5 NRPS modules (Kevany et al., 2009). It is noteworthy that a similar gene cluster was characterized on a plasmid in *B. cereus* AH1134, suggesting that the pathway can be transferred horizontally (Kevany et al., 2009). Consistent with the mobility of this operon, an orthologous 72-kb region encoding zwittermicin A in *Bacillus thuringiensis* was shown to be flanked by putative transposase genes on both edges, suggesting that it may be a mobile element that was gained by *B. cereus* through horizontal gene transfer. Since zwittermicin A has been reported to enhance the activity of protein toxins that attack insects (Broderick et al., 2000), it was hypothesized that transfer of this operon into *B. thuringiensis* permitted the microbe to gain insecticide-promoting factors to combat insects during co-evolution (Luo et al., 2011).

## **Fusaricidins A–D**

Fusaricidins are guanidinylated β-hydroxy fatty acids attached to a cyclic hexapeptide including four D-amino acids (Kajimura and Kaneda, 1997; Schwarzer et al., 2003). These antibiotics are produced by *Paenibacillus polymyxa* strains and exhibit antifungal activity against diverse plant pathogens including *Aspergillus niger*, *Aspergillus oryzae*, *Fusarium oxysporum,* and *Penicillium thomii* (Kajimura and Kaneda, 1996, 1997) as well as *Leptosphaeria maculans*, the causal agent of blackleg disease in canola (Beatty and Jensen, 2002). The amino acid chains of fusaricidins are linked together and modified by a non-ribosomal peptide synthetase (NRPS). The multi-domain NRPS consists of up to 15,000 amino acids and is therefore considered among the longest proteins in nature (Schwarzer et al., 2003). NRPS incorporation is not limited to the 21 standard amino acids translated by the ribosome, and this promiscuity contributes to the great structural diversity and biological activity of non-ribosomal peptides (Li and Jensen, 2008).

In *P. polymyxa* E681, the fusaricidin biosynthetic gene cluster (*fusGFEDCBA*) has been characterized, in which the NRPS coding sequence, the largest CDS in the cluster, was observed to encode a six-module peptide (Choi et al., 2008; Li and Jensen, 2008; Li et al., 2013). The biosynthetic cluster includes other CDS responsible for biosynthesis of the lipid moiety but does not contain transporter genes (Li and Jensen, 2008). In *P. polymyxa*, a promoter for the *fus* operon was identified and shown to be bound by a transcriptional repressor (AbrB) which previous studies implicated as a regulator of sporulation; this is of interest since fusaricidin was observed to be synthesized during sporulation, thus coordinating the microbe's secondary metabolism with its life cycle (Li et al., 2013).

Allelic diversity is typically thought to be responsible for producing chemical diversity. However, an interesting feature of the *fus* cluster is that a diversity of fusaricidins, differing in their incorporated amino acids (Tyr, Val, Ile, allo-Ile, Phe), can be produced by a single allele of *fusA;* the underlying mechanism is that the NRPS A-domain, responsible for recognition of amino acids, has relaxed substrate specificity (**Figure 3**) (Han et al., 2012).

#### **Polymyxins**

Polymyxins are a family of non-ribosomal lipopeptide antibiotics composed of ten amino acids, a polycationic heptapeptide ring and a fatty acid derivative at the N terminus (Storm et al., 1977). They are produced by Gram-positive bacteria and target Gram-negative species by altering the structure of the cell membrane. The polymyxin family includes polymyxins A, B, D, E (colistin), and M (mattacin) (Shaheen et al., 2011). Polymyxin B exhibits potent antibacterial activity against *Klebsiella pneumoniae*, *Pseudomonas aeruginosa,* and *Acinetobacter* spp. (Gales et al., 2006). However, polymyxins exhibit a remarkable degree of neurotoxicity and nephrotoxicity, which limits their clinical use (Li et al., 2006).

In *Paenibacillus polymyxa* PKB1 (the same strain that controls fungal pathogens of plants by producing fusaricidins, see above), a 40.8 kb polymyxin biosynthetic gene cluster was shown to contain five coding sequences, *pmxA-E.* Three CDS (*pmxA, B, E*) encode subunits of NRPS, each responsible for the modular incorporation of amino acids, while two genes (*pmxC, D*) encode a permease belonging to the ABC-type transporter family (Shaheen et al., 2011). In both *P. polymyxa* PKB1 and *P. polymyxa* E681, the arrangement of the NRPS coding sequences in the *pmx* cluster does not match the amino acid sequence in the produced polymyxin, which is unusual for NRPS-encoded peptides (Choi et al., 2009; Shaheen et al., 2011).

With respect to the diversity of polymyxins, polymyxins differ in the amino acid composition of residues 3, 6, and 7, in the D vs. L stereochemistry of the incorporated amino acids, as well as in the lipid moiety (Choi et al., 2009; Shaheen et al., 2011). In *P. polymyxa* SC2 and *P. polymyxa* PKB1, an allelic variant was uncovered within NRPS domain 3 using bioinformatic analysis of the genome which correlated with incorporation of the D rather than L form of 2,4-diaminobutyrate in amino acid position 3, explaining the mechanism for the production of two subtypes of polymyxin B (Shaheen et al., 2011). With respect to the diversity of residues 6 and 7, *P. polymyxa* E681 and *P. polymyxa* PKB1 produce polymyxins that differ in these amino acids, producing polymyxin A and B, respectively (Shaheen et al., 2011). Bioinformatic analysis revealed that the DNA sequences of the *pmx* gene clusters were 92% conserved at the nucleotide level, but differed considerably in the domains corresponding to modules 6 and 7 (Shaheen et al., 2011). These two sets of observations led the authors to suggest that the diversity of polymyxins arises from mixing and matching of alleles of the NRPS modular domains, hence combinatorial chemistry, rather than relaxed substrate specificity as seen in other secondary metabolites such as fusaricidins (see above).
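The contrast between the two routes to chemical diversity discussed above (one promiscuous allele versus combinations of strict module alleles) can be illustrated with a toy enumeration. The Python sketch below is illustrative only: the residue set is the one listed above for fusaricidins, but the scaffold, the position that varies, and the allele labels for modules 6 and 7 are hypothetical placeholders, and nothing here models a real NRPS.

```python
from itertools import product

# Residues reported above for the variable fusaricidin position.
FUSARICIDIN_ACCEPTED = ["Tyr", "Val", "Ile", "allo-Ile", "Phe"]


def relaxed_specificity_products(scaffold, variable_index, accepted):
    """One NRPS allele with a promiscuous A-domain: one peptide per accepted residue."""
    peptides = []
    for residue in accepted:
        peptide = list(scaffold)
        peptide[variable_index] = residue
        peptides.append(tuple(peptide))
    return peptides


def module_allele_products(alleles_by_position):
    """Strict A-domains: each combination of module alleles gives one peptide variant."""
    positions = sorted(alleles_by_position)
    choices = [alleles_by_position[p] for p in positions]
    return [dict(zip(positions, combo)) for combo in product(*choices)]


# Hypothetical six-residue scaffold; "X" marks residues held constant,
# and the index of the variable residue is chosen arbitrarily.
scaffold = ("X", "X", "?", "X", "X", "X")
single_allele = relaxed_specificity_products(scaffold, 2, FUSARICIDIN_ACCEPTED)

# Hypothetical alleles for modules 6 and 7 (placeholder residue labels).
pmx_like = module_allele_products(
    {6: ["res6_a", "res6_b"], 7: ["res7_a", "res7_b"]}
)

print(len(single_allele), "peptides from one promiscuous allele")
print(len(pmx_like), "peptides from combinations of strict module alleles")
```

In the first case a single genotype yields several products; in the second, each strain carries one combination of alleles and the diversity is only visible across strains.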

Another interesting feature of the *pmx* gene clusters is that the polymyxin transporters might also transport fusaricidin, since the *fus* biosynthetic cluster lacks any transporter genes (see above), and as both antibiotics are cationic lipopeptides (Shaheen et al., 2011). The authors found support for this hypothesis, as deletion mutations in *pmxC* and *D* genes also reduced the antifungal activity of fusaricidin against *Leptosphaeria maculans* although the two biosynthetic gene clusters are not linked. It is worth noting that there is no evidence yet of genes responsible for lipidation of the peptide residue in the characterized polymyxin clusters, suggesting that this function might be encoded elsewhere in the genome (Shaheen et al., 2011).

#### **Iturins**

Iturins are a family of non-ribosomal cyclolipopeptides consisting of seven α-amino acid residues and one β-amino acid, the latter noted as a unique feature compared to other lipopeptide antibiotics (Constantinescu, 2001; Leclère et al., 2005; Hamdache et al., 2013). The iturin family includes compounds such as bacillomycins D, F and L, bacillopeptins, iturins A, C, D and E, and mycosubtilins (Hamdache et al., 2013). Iturins are produced by different strains of *B. subtilis* and *B. amyloliquefaciens,* and exhibit potent antifungal activity against major phytopathogens including *R. solani, Fusarium oxysporum,* and *F. graminearum*, the latter responsible for Fusarium head blight in wheat (Gueldner et al., 1988; Constantinescu, 2001; Tsuge et al., 2001; Dunlap et al., 2013). The mechanism of action involves disruption of the target fungal plasma membrane (Thimon et al., 1995). In both *B. subtilis* RB14 and *B. amyloliquefaciens* AS43.3, the iturin A biosynthetic operons were shown to contain four coding sequences (*ituDABC*) coding for: a putative malonyl coenzyme A transacylase, a protein with three functions (fatty acid synthetase, amino acid transferase, and peptide synthetase), and two peptide synthetases, respectively (Tsuge et al., 2001; Dunlap et al., 2013).

Regarding diversification within the chemical family, iturin A from *B. subtilis* RB14 has a structure similar to that of mycosubtilin, which is produced by *B. subtilis* ATCC 6633, but with inverted amino acids at the 6th and 7th positions (Tsuge et al., 2001). By comparative analysis of orthologous CDS between these two strains (*ituC* and *mycC*, respectively), it was suggested that the NRPS amino acid adenylation domain may have been intragenically swapped during evolution, which would also imply an HGT event (Tsuge et al., 2001). Comparative genome analysis between at least three sequenced *itu* clusters may reveal further information concerning the diversification of the iturin family (Tsuge et al., 2001; Blom et al., 2012; Dunlap et al., 2013).

#### **Bacilysin**

Bacilysin is a non-ribosomally produced dipeptide composed of an L-alanine residue at the N terminus and a non-proteinogenic amino acid, L-anticapsin, at the C terminus (Walker and Abraham, 1970; Stein, 2005). Compared to the more elaborate nonribosomal peptides noted above, bacilysin is noteworthy because it is amongst the simplest peptides in nature, adding to the structural diversity of observed non-ribosomal peptides. Bacilysin is produced by *Bacillus* species such as *B. pumilus, B. amyloliquefaciens*, and *B. subtilis* (Leoffler et al., 1986; Phister et al., 2004) and has been shown to have antimicrobial activity against various bacteria and fungi such as *Candida albicans* (Kenig and Abraham, 1976). Mechanistically, bacilysin is a prodrug that is activated by the action of a peptidase enzyme that releases the active moiety, anticapsin (Rajavel et al., 2009). Anticapsin inhibits bacterial peptidoglycan or fungal protein biosynthesis through blockage of glucosamine synthetase, resulting in cell lysis (Kenig et al., 1976). Biosynthesis of bacilysin originates from the prephenate aromatic amino acid pathway (Hilton et al., 1988; Parker and Walsh, 2012).

In *B. subtilis* the biosynthesis of bacilysin is encoded by the operon *bacABCDE* (*ywfB-G*), in addition to a monocistronic gene (*ywfH*) (Inaoka et al., 2003). *bacABC* is likely responsible for the biosynthesis of anticapsin while *bacDE* (*ywfEF*) encodes a ligase and an efflux transporter protein for self protection, respectively (Steinborn et al., 2005; Rajavel et al., 2009). The bacilysin biosynthetic operon is positively regulated by QS pheromones, in particular PhrC (Yazgan et al., 2001; Köroğlu et al., 2011), and negatively regulated by ScoC, a transition state regulator (Inaoka et al., 2009). The transition state in bacteria is a period of decision making.

## **Terpenoids**

The structures of terpenoids described in this review are summarized (**Figure 4**).

## **Trichodermin and Harzianum A**

Trichothecene mycotoxins are produced by several fungal genera; examples include deoxynivalenol (DON) from *Fusarium*, and harzianum A and trichodermin from *Trichoderma arundinaceum* and *T. brevicompactum*, respectively (Cardoza et al., 2011). Trichodermin was reported to have antifungal activity against the fungal pathogens *Rhizoctonia solani* and *Alternaria solani* (Chen et al., 2007) as well as other fungal genera (Tijerino et al., 2011). Trichodermin inhibits protein synthesis in eukaryotes by inhibiting peptidyl transferase that catalyzes translational elongation and/or termination (Wei et al., 1974) and by inhibiting peptide bond formation at the initiation stage of translation (Carter et al., 1976).

Comparative analysis has been conducted on the CDS responsible for trichothecene biosynthesis in *Fusarium* and *Trichoderma*. In *Fusarium*, trichothecenes are encoded by a gene cluster called the TRI cluster; this cluster also encodes regulatory and transport proteins (Proctor et al., 2009). In *Trichoderma*, an orthologous TRI cluster was discovered in which 7 CDS were conserved with *Fusarium*, but the two clusters showed interesting evolutionary divergence (Cardoza et al., 2011) which may be informative for understanding the genetics underlying other anti-fungal metabolites. In *Fusarium*, the TRI cluster includes *tri5*, which encodes trichodiene synthase; this enzyme catalyzes the first committed step in trichothecene biosynthesis, the cyclization of farnesyl pyrophosphate to form trichodiene (Hohn and Beremand, 1989). In *Fusarium*, *tri5* is located within the TRI cluster, but surprisingly it is not associated with the orthologous cluster in *Trichoderma*. Three additional CDS responsible for trichothecene biosynthesis in *Fusarium* (*tri7*, *tri8*, *tri13*) are missing from the *Trichoderma* cluster, along with a CDS of unknown function (*tri9*) (Cardoza et al., 2011). Interestingly, two of the apparently conserved biosynthetic CDS (*tri4* and *tri11*, based on sequence homology) were demonstrated to have diverged functionally between *Trichoderma* and *Fusarium* based on heterologous expression analysis: in *Trichoderma*, *tri4* catalyzes three out of four oxygenation reactions carried out by its corresponding *Fusarium* ortholog; *tri11* catalyzes distinctive hydroxylation reactions in *Fusarium* (C-15) and *Trichoderma* (C-4). Finally, amongst the CDS which are conserved between *Fusarium* and *Trichoderma*, head-to-tail vs. head-to-head rearrangements are observed (e.g., *tri3*, *tri4*) (Cardoza et al., 2011).
These results demonstrate multiple evolutionary events (rearrangement, functional diversification, gene loss, gene gain) within one biosynthetic gene cluster (**Figure 5**).

### **Phomenone**

Phomenone is a sesquiterpene synthesized by various fungi including *Xylaria* sp., an endophytic fungus isolated from *Piper aduncum*, and reported to have antifungal activity against the pathogen *Cladosporium cladosporioides* (Silva et al., 2010). Phomenone is structurally similar to the PR toxin metabolite of *Penicillium roqueforti*, which functions by inhibiting RNA polymerase and thus inhibits protein synthesis at the initiation and elongation steps (Moule et al., 1976). A biosynthetic precursor for phomenone A is aristolochene (Proctor and Hohn, 1993). In *P. roqueforti* NRRL 849, a gene required for aristolochene biosynthesis (*ari1*) was characterized and shown to encode a sesquiterpene cyclase named aristolochene synthase (AS) (Proctor and Hohn, 1993). Expression of *ari1* occurs in stationary phase cultures and is regulated transcriptionally (Proctor and Hohn, 1993).

#### **Paclitaxel (Taxol)**

The diterpene paclitaxel (Taxol) is reported to be produced by at least 20 diverse fungal endophyte genera inhabiting various plant species (Zhou et al., 2010). Taxol was reported to be produced by some fungal endophytes that inhabit conifer wood, and its ecological function was suggested to be that of a fungicide against host pathogens (Soliman et al., 2013). Taxol acts by stabilizing microtubules and inhibiting spindle function, leading to disruptions in normal cell division (Horwitz, 1994). However, Taxol was originally purified from *Taxus* trees (Wani et al., 1971) and its biosynthesis was shown to be encoded by plant nuclear genes, apparently redundantly. As very few plant genera produce Taxol, it is interesting to speculate whether its biosynthetic genes may have been transferred horizontally from fungi to plants.

The Taxol biosynthetic pathway in plants requires 19 enzymatic steps. The first committed step in biosynthesis of plant Taxol is cyclization of GGDP to taxa-(4,5),(11,12)-diene catalyzed by taxadiene synthase (TS) (Hezari et al., 1995). Thirteen plant Taxol biosynthetic genes from *Taxus* were used in BLASTP searches to identify potential homologs in *Penicillium aurantiogriseum* NRRL 62431 (Yang et al., 2014). Seven putative homologous genes were identified, though the homology scores were as low as 19%; these genes were claimed to encode: phenylalanine aminomutase (PAM), geranylgeranyl diphosphate synthase (GGPPS), taxane 5α-hydroxylase (T5OH), taxane 13α-hydroxylase (T13OH), taxane 7β-hydroxylase (T7OH), taxane 2α-hydroxylase (T2OH) and taxane 10β-hydroxylase (T10OH). Another gene encoding an AT (PAU\_P11263) was identified by using BLASTP against the GenBank database. However, no homologs of plant TS were identified; the authors claimed that the fungus might catalyze taxadiene synthesis by a unique enzymatic system (Yang et al., 2014). Position-Specific Iterative BLAST showed one gene from the bacterial genus *Mycobacterium* with potential similarity to plant TS, suggesting lateral gene transfer from plants to mycobacteria (Yang et al., 2014).
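To make figures such as "19%" more concrete, percent identity over a pairwise protein alignment can be computed as sketched below. This is a simplified illustration under stated assumptions: the quoted scores are taken to be sequence identities, the two sequences are assumed to be already aligned with `-` marking gaps, the denominator is the number of gap-free columns, and the toy sequences are invented. BLASTP reports identity over its own local alignments, which this sketch does not reproduce.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Invented toy alignment: 4 of the 6 gap-free columns match.
toy_identity = percent_identity("MKT-LLV", "MRTALLI")
print(f"{toy_identity:.1f}% identity")
```

At identities this low, a shared evolutionary origin is hard to distinguish from chance similarity, which is one reason the fungal homolog assignments above are described as putative.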

In a parallel study to isolate fungal Taxol biosynthetic genes, a different approach was taken where PCR primers designed from the plant genes that encode Taxol were used as a primary screen against fungi (Xiong et al., 2013). The study identified putative homologs of fungal TS as well as BAPT (which encodes the critical C-13 phenylpropanoid side-chain CoA acyltransferase) with ∼40% sequence identities to their plant counterparts. Despite this progress, other reports remain skeptical that fungi actually encode Taxol (Heinig et al., 2013).

Recent studies have demonstrated complex three-way interactions in Taxol biosynthesis between a Taxol-producing fungal endophyte, other endophytes and the host plant. Host endophytic fungi appear to elicit plant TS transcription or transcript accumulation. Specifically, TS transcript and the corresponding protein were reduced upon treating both young plantlets and old *Taxus* wood with fungicide (Soliman et al., 2013). In a parallel study, co-culture of the Taxol-producing endophyte *Paraconiothyrium* SSM001 with two presumptive fungal endophytes of the same yew tree host elicited paclitaxel accumulation from the endophyte, suggesting inter-species interactions between endophytes inhabiting the same host niche (Soliman and Raizada, 2013).

#### **Helvolic Acid**

Helvolic acid is a fusidane triterpene produced by *Aspergillus fumigatus* (Lodeiro et al., 2009) and the yeast *Pichia guilliermondii* Ppf9 (Zhao et al., 2010). Helvolic acid was reported to inhibit the spore germination of *Magnaporthe oryzae*, the causal agent of rice blast disease (Zhao et al., 2010). The biosynthetic genes for helvolic acid are clustered as nine genes coding for protostadienol synthase, which catalyzes formation of the precursor (17Z)-protosta-17(20),24-dien-3-ol, along with genes that encode squalene-hopene cyclase, four cytochrome P450 monooxygenases, short chain dehydrogenase, two transferases and 3-ketosteroid 1-dehydrogenase (Lodeiro et al., 2009). The authors reported that the P450 monooxygenases from different fungi shared substantial sequence identity across recent evolution, while the transferases duplicated and diversified into paralogous gene families (Lodeiro et al., 2009). This observation suggests that even within a single gene cluster, there may be different selection pressures on adjacent genes belonging to the same biosynthetic pathway. Interestingly, the helvolic acid biosynthetic gene cluster in *A. fumigatus* is located in the subtelomere chromosome region (Lodeiro et al., 2009), which is associated with high rates of evolutionary recombination and diversification. However, the gene cluster lacks introns, which is a trait sometimes associated with subtelomeric regions, but this observation might also be evidence of HGT from bacteria (Lodeiro et al., 2009).

## **Alkaloids**

The structures of alkaloids described in this review are summarized (**Figure 6**).

### **Ergot**

Ergot alkaloids are produced by the sexual *Epichloe* fungi and their asexual derivatives *Neotyphodium* within the Clavicipitaceae family, which inhabit Pooideae grasses (Schardl, 2010). Ergot alkaloids can interact with receptors of the central nervous system and exhibit toxic effects on nematodes, insects, and mammalian herbivores including livestock which graze these grasses (de Groot et al., 1998; Gröger and Floss, 1998). In Europe in the Middle Ages, consumption of ergot-infected grain or grasses caused convulsions, paranoia and hallucinations in livestock and humans, known as St. Anthony's Fire (Dotz, 1980). The diverse ergot alkaloids share a tetracyclic ergoline backbone derived from tryptophan and dimethylallyl diphosphate (Flieger et al., 1997). Gene clusters for ergot alkaloid biosynthesis have been identified in various Ascomycete species belonging to *Aspergillus*, *Penicillium,* and *Claviceps*. Seven genes encode the ergoline scaffold including dimethylallyltryptophan synthase (DMATS), which catalyzes the first committed step. DMATS is responsible for the prenylation of L-tryptophan with dimethylallylpyrophosphate (DMAPP) to produce 4-dimethylallyltryptophan (4-DMAT) (Heinstein et al., 1971). Ergots have diversified into three classes, caused by diverse substituents attached to the carboxyl group of the tetracyclic ergoline backbone, in particular the presence of an amide group (creating ergoamides), a peptide-like amide moiety (creating ergopeptines) or the absence of these moieties (creating clavine alkaloids) (Wallwey and Li, 2011). These structural modifications are responsible for the differential physiological and pharmacological effects of the ergot family, which include treatment of postpartum hemorrhage, leukemia, and Parkinson's disease.
The genetic basis for ergot diversification into these 3 major classes is associated with the presence or absence of nonribosomal peptide synthases (NRPS) which catalyze the biosynthesis of the peptide moieties on the ergoline backbone (Wallwey and Li, 2011). For example, four NRPS genes are present in *Claviceps purpurea* (which produces ergopeptines) but absent in *Aspergillus fumigatus* (which produces clavine alkaloids). Inactivation of these genes suggests that two of the NRPS genes (*lpsA* and *lpsB*) are also responsible for synthesis of the ergoamides (Haarmann et al., 2008). Interestingly, further diversification of the peptide moiety within *C. purpurea* has been reported to be caused by fine-scale allelic diversification of the NRPS genes (Haarmann et al., 2005). There is additional evidence to suggest that diversification of the ergot alkaloid gene clusters is associated with DNA transposons and retroelements, which were observed in the cluster encoding ergovaline, an ergot alkaloid from *Epichloe festucae* associated with livestock toxicity (Fleetwood et al., 2007). As an interesting note, the genes encoding ergovaline were highly expressed only during biotrophic growth of the fungus within the host grass plant, and not when the mycelia were cultured *in vitro*, suggesting that the host might have a regulatory role in the expression of the fungal gene cluster (Fleetwood et al., 2007).

### **Loline Alkaloid**

Loline is a saturated pyrrolizidine alkaloid produced by the fungus *Neotyphodium uncinatum*, the asexual mutualistic derivative of *Epichloe*, and is known to protect the host plants of the fungus from insects (Blankenship et al., 2001; Schardl, 2010). The loline biosynthetic pathway was suggested to involve proline and homoserine (Spiering et al., 2005). In *N. uncinatum*, two homologous gene clusters encoding loline were identified, named LOL-1 and LOL-2 (Spiering et al., 2005). The LOL-1 cluster comprises nine genes (*lolF-1, lolC-1, lolD-1, lolO-1, lolA-1, lolU-1, lolP-1, lolT-1, lolE-1*) within a 25-kb chromosomal segment, while the LOL-2 cluster contains the same homologs (except for *lolF*) ordered and oriented the same as in LOL-1. This evidence suggests that the loline clusters may represent a recent segmental duplication event (Spiering et al., 2005).

An interesting ecological situation exists in grasses infected with *Epichloe* fungi (sexual form of *Neotyphodium*): the fungus reduces the ability of these plants to propagate sexually (it chokes the inflorescences), which, without compensatory mechanisms, would prevent vertical transmission of the fungus (Zhang et al., 2010). However, to compensate, the fungal stromata attract fly vectors which transmit the fungal spores to other plants, permitting horizontal transfer of the fungus. Loline accumulates in young tissues of the grasses, providing insect protection to these young hosts; however, if loline were also to accumulate in the grass inflorescences, it would kill the fly vector of the fungus. Upon further investigation, this apparent paradox was resolved: in these grass inflorescences, transcription of the loline biosynthesis genes was dramatically downregulated compared to plants with healthy inflorescences (infected with the symbiotic asexual *Neotyphodium*), permitting the fly vectors to survive (Zhang et al., 2010). These observations suggest strong selection pressure to evolve the regulatory elements of these genes.

## **Heterocyclic Nitrogenous Compounds**

The structures of heterocyclic nitrogenous compounds described in this review are summarized (**Figure 6**).

### **Phenazines**

Phenazines are a group of naturally occurring heterocyclic nitrogenous antibiotics produced exclusively by bacteria and widely reported in fluorescent *Pseudomonas* (Mavrodi et al., 2006, 2013). Phenazines are potent antifungal compounds that can combat soil-borne pathogens (Ligon et al., 2000) such as *Rhizoctonia solani*, *Gaeumannomyces graminis* var. *tritici*, *Pythium* spp. (Gurusiddaiah et al., 1986) and *Fusarium oxysporum* (Anjaiah et al., 1998). Mechanisms of action include: (1) accumulation of toxic molecules such as hydrogen peroxide and superoxide due to the redox potential of phenazine (Hassan and Fridovich, 1980; Hassett et al., 1995); and (2) elicitation of induced host resistance (Audenaert et al., 2002). Ecologically, the evidence suggests that the plant rhizosphere promotes phenazine-producing bacteria to combat pathogens (Mazzola et al., 1992; Mavrodi et al., 2013).

Phenazine is derived from the shikimic acid pathway, with amino-2-deoxyisochorismic acid (ADIC) as the branchpoint to phenazine (McDonald et al., 2001). ADIC is then converted to trans-2,3-dihydro-3-hydroxyanthranilic acid, which undergoes dimerization to form phenazine-1-carboxylic acid, the first derivative of the phenazines (McDonald et al., 2001). Phenazine biosynthesis in *Pseudomonas fluorescens* is encoded by a single or duplicated core of five CDS, *phzADEFG*, that encode ketosteroid isomerase, isochorismatase, anthranilate synthase, trans-2,3-dihydro-3-hydroxyanthranilate isomerase, and pyridoxamine oxidase, respectively (Mavrodi et al., 2013). In *Pseudomonas*, the core may include other CDS such as *phzB*, which was duplicated from *phzA*, and *phzC*, which encodes 3-deoxy-D-arabino-heptulosonate-7-phosphate synthase that is responsible for diverting carbon from the shikimate pathway to phenazine (Pierson and Pierson, 2010).

Comparisons between *Pseudomonas* species and other genera have revealed both conservation and diversity of the core phenazine biosynthetic CDS. For example, the phenazine biosynthesis operon in *Burkholderia cepacia* maintains the five core enzymes observed in *Pseudomonas*, as reviewed (Mavrodi et al., 2006). However, there is evidence to suggest that these coding sequences spread to enteric bacteria and *Burkholderia* species via horizontal gene transfer, because these genes can be observed in plasmids and transposons (Mavrodi et al., 2010). For example, in *Erwinia herbicola*, a biosynthetic cluster of 16 CDS (*ehp*) was isolated from a plasmid, of which 15 coded for D-alanyl griseoluteic acid (AGA) while *ehpR* was observed to confer resistance to AGA (Giddens et al., 2002). Other differences in the core have also been observed between *Pseudomonas* species and others; for example, in both *Burkholderia cepacia* and *Erwinia herbicola*, *phzA* is not duplicated (Mavrodi et al., 2006).

Structural diversity of phenazines in different species is achieved by specific genes that may be located within the cluster or elsewhere in the genome. For example, in *P. chlororaphis*, the *phzH* gene is located downstream of the phenazine operon, where it encodes an aminotransferase responsible for converting phenazine-1-carboxylic acid (PCA) to phenazine-1-carboxamide, the characteristic green pigment of *P. chlororaphis* (Chin-A-Woeng et al., 1998). In *P. aureofaciens*, *phzO* was identified as the gene that encodes an aromatic monooxygenase, responsible for catalyzing the hydroxylation of PCA to form the broad spectrum antifungal compound, 2-OH-PCA (Delaney et al., 2001). In *P. aeruginosa* (PAO1), two diversification genes were discovered: *phzM* was shown to be involved in the synthesis of pyocyanin, while *phzS* encodes a monooxygenase that catalyzes the production of 1-hydroxyphenazine (Mavrodi et al., 2001).

### **Pyrrolnitrin**

Pyrrolnitrin is a chlorinated phenylpyrrole antibiotic purified initially from *Burkholderia pyrrocinia* (Arima et al., 1964) then subsequently from other species including pseudomonads, *Myxococcus fulvus*, *Enterobacter agglomerans*, and *Serratia* sp. (Chernin et al., 1996; Kirner et al., 1998; Hammer et al., 1999). Pyrrolnitrin was initially used for treatment of skin mycoses caused by *Trichophyton* fungus, then was developed as an effective fungicide for crops against *Botrytis cinerea* (Hammer et al., 1993), *Rhizoctonia solani* (El-Banna and Winkelmann, 1998) and *Gaeumannomyces graminis* var. *tritici* (Tazawa et al., 2000). In *P. fluorescens*, the pyrrolnitrin biosynthetic operon consists of four coding sequences (*prnABCD*) coding for tryptophan halogenase (*prnA*), a decarboxylase (*prnB*), monodechloroaminopyrrolnitrin halogenase (*prnC*), and an oxidase (*prnD*) (Hammer et al., 1997; Kirner et al., 1998). Comparative analysis indicates that the pyrrolnitrin biosynthetic operon is differentially conserved between divergent species with 59% similarity among diverse bacterial strains such as *Pseudomonas*, *Myxococcus fulvus*, and *Burkholderia cepacia*, with lower similarity shown for *prnA* in *M. fulvus* (45%) (Hammer et al., 1999). Furthermore, RFLP-based polymorphisms within a 786 bp *prnD* fragment suggested that there may have been lateral gene transfer of the *prn* operon from *Pseudomonas* to *Burkholderia pyrrocinia* (Souza and Raaijmakers, 2003). Consistent with such mobility, transposase-encoding genes surrounding the *prn* biosynthetic operon were observed in *Burkholderia pseudomallei* (Costa et al., 2009).

## **Volatile Compounds**

In this section, only the most well studied volatile compound, hydrogen cyanide, is discussed.

### **Hydrogen Cyanide (HCN)**

Hydrogen cyanide (HCN) is a volatile secondary metabolite produced by *P. aeruginosa* and diverse rhizosphere fluorescent pseudomonads, where it exhibits biocontrol activity against pathogenic fungi such as *Thielaviopsis basicola*, the fungal causal agent of black root rot of tobacco (Voisard et al., 1989, 1994; Frapolli et al., 2012). Mechanistically, HCN functions by inhibiting important metalloenzymes such as cytochrome *c* oxidase (Blumer and Haas, 2000) and/or by complexing metals in the soil (Brandl et al., 2008). HCN is biosynthesized from glycine (Castric, 1977) in an oxidative reaction catalyzed by HCN synthase, a membrane-bound flavoenzyme (Castric, 1994; Blumer and Haas, 2000). The biosynthesis of HCN occurs in the presence of an electron acceptor such as phenazine methosulfate (Wissing, 1974).

In *P. aeruginosa* PAO1, the HCN synthase biosynthetic operon *hcnABC* was characterized (Pessi and Haas, 2000). *hcnA* was reported to encode a protein similar to formate dehydrogenase, while *hcnB* and *hcnC* encode products with similarity to amino acid oxidases (Laville et al., 1998; Svercel et al., 2007). In a phylogenetic analysis of 30 fluorescent pseudomonads, no evidence was found for HGT of the *hcn* gene cluster; rather, the locus appears to be inherited exclusively vertically (Frapolli et al., 2012). HCN has also been detected in *Chromobacterium violaceum*, but the underlying genes, which might give new insights into HCN biosynthesis outside of the pseudomonads, have not been reported (Blom et al., 2011).

## **Bacteriocin**

In this section, only the most well studied compound from this class is discussed. The structure of agrocin 84 described in this review is included (**Figure 6**).

### **Agrocin 84**

Agrocin 84 is a 6-N-phosphoramidate of an adenine nucleotide analog (Roberts and Tate, 1977). This compound is produced by non-pathogenic strains of *Agrobacterium radiobacter* and is used for biocontrol of crown gall, a tumorous disease resulting from overproduction of auxin and cytokinin hormones stimulated by the Ti plasmid after it has transferred from *A. radiobacter* and integrated within host plant chromosomal DNA (Wang et al., 1994). Recently, it was shown that agrocin 84 employs a novel mechanism to inhibit leucyl-tRNA synthetases and hence inhibit translation (Chopra et al., 2013), though it was earlier suggested that agrocin 84 acts by inhibiting DNA synthesis (Das et al., 1978).

In *Agrobacterium radiobacter* K84, biosynthesis of and immunity to agrocin 84 are encoded by 17 coding sequences (the *agn* operon) located on a 44-kb conjugal plasmid, *pAgK84*, though the plasmid has 36 CDS in total (Kim et al., 2006). The two most interesting CDS are *agnB2* and *agnA*, which encode aminoacyl tRNA synthetase homologs. The agrocin 84 antibiotic is essentially a nucleotide attached to an amino acid-like moiety (methyl pentanamide), and its mode of action was proposed to involve competitive binding to the active site of leucyl-tRNA synthetases (Reader et al., 2005). *agnB2* encodes a leucyl-tRNA synthetase homolog that confers self-immunity to agrocin 84 since it does not bind the antibiotic (Kim et al., 2006). Normally, a tRNA synthetase acts as a ligase that catalyzes the attachment of an amino acid to a tRNA which includes an anticodon; the catalysis results in a phosphoanhydride bond between the amino acid and ATP as the initial step in the aminoacylation of tRNA (Ibba and Söll, 2000). Surprisingly, *agnA* encodes a truncated homolog of an asparaginyl-tRNA synthetase which lacks the anticodon-binding domain, but maintains the catalytic domain. Thus, *agnA* appears to be a fascinating example of a gene that evolved from an ancient tRNA synthetase (for asparagine), but now is a biosynthetic enzyme for an antibiotic that inhibits a paralogous enzyme (for leucine attachment) (Kim et al., 2006).

The *agn* operon may have an evolutionary history of horizontal gene transfer, as *pAgK84* is transferable both between and within species: *Rhizobium* transconjugants that received the *pAgK84* plasmid from *Agrobacterium* could synthesize agrocin 84 and acquired immunity as well (Farrand et al., 1985).

A final fascinating feature of the *agn* system is an apparent second form of ancient evolutionary pressure on the genes responsible for the biosynthesis of the antibiotic. Agrocin 84 is a chemical mimic of agrocinopines, a class of compounds that is a source of plant-derived nitrogen for the pathogens targeted by the antibiotic; the pathogens have their own Ti plasmids that encode transporters that transport not only agrocinopines but also agrocin 84 (Ellis and Murphy, 1981; Hayman and Farrand, 1988; Kim and Farrand, 1997). Hence the *agn* biosynthetic genes evolved to create a chemical structure that not only mimics the tRNA synthetase substrate of the pathogen target, but also targets its nitrogen uptake machinery.

## **Enzymes**

In this section, only the most well studied anti-fungal enzyme, chitinase, is discussed.

### **Chitinase**

Chitinases are enzymes that break down chitin, one of the fungal cell wall components composed of repeated units of N-acetyl-D-glucos-2-amine, linked by β-1,4 glycosidic bonds (Bhattacharya et al., 2007). Fungi and hence chitin are enriched in soil, and thus soil microbes are abundant sources of chitinases (which can also target insects) (Hjort et al., 2010). Examples of chitinase-producing microbes include: fluorescent *Pseudomonas* strains isolated from the sugarcane rhizosphere that can target *Colletotrichum falcatum*, the causative agent of red rot disease in this crop (Viswanathan and Samiyappan, 2001); *Actinoplanes missouriensis* that antagonizes *Plectosporium tabacinum*, the causal agent of lupin root rot in Egypt (El-Tarabily, 2003); and *Stenotrophomonas maltophilia* that suppresses summer patch disease in Kentucky bluegrass (Kobayashi et al., 2002). Chitinases are produced by diverse bacterial genera including *Pseudomonas*, *Streptomyces*, *Bacillus*, and *Burkholderia* (Quecine et al., 2008). Chitinases are divided into two major categories, exochitinases and endochitinases. Of the four reported endochitinase family members (glycoside hydrolase families 18, 19, 23, and 48), primarily families 18 and 19 have been reported in bacteria, with only a single example of a family 23 chitinase (Prakash et al., 2010).

In *Stenotrophomonas maltophilia* 34S1, the chitinase family 18 gene has one CDS that encodes a protein with seven domains: a catalytic domain, a chitin binding domain, three putative binding domains, a fibronectin type III domain and a polycystic kidney disease domain (Kobayashi et al., 2002). Bacterial chitinase family 18 has been shown to display different types of diversity. First, sequence analysis has shown that the catalytic domain and substrate binding domain, which are separated by a linker, have evolved independently. As the domain sequences do not match the taxonomies of their hosts, it has been suggested that domain swapping has been an important generator of diversity in this family, combined with HGT (**Figure 7**) (Karlsson and Stenlid, 2008). Unusual examples of chitinase genes are those that contain multiple family 18 catalytic domains within the same peptide that appear to function independently of one another (Howard et al., 2004). Additional examples of family 18 biodiversity include genes that contain non-consensus sequences at the catalytic site, as well as a bacterial subgroup that consists solely of a catalytic domain (Karlsson and Stenlid, 2008).

Unlike family 18 chitinases that are widely distributed among the prokaryotes, family 19 chitinases are restricted to green nonsulfur and purple bacteria, as well as actinobacteria (Prakash et al., 2010). Based on sequence alignments of family 19 chitinases in prokaryotes and eukaryotes, strong evidence has emerged that this gene family in actinobacteria and purple bacteria was derived from flowering plants by HGT (Prakash et al., 2010). Furthermore, HGT from plants to purple bacteria may have occurred as two independent events in the distant past, followed more recently by HGT to actinobacteria (Prakash et al., 2010). The core architecture and catalytic sites of bacterial and plant family 19 chitinases are nearly identical. The sequence analysis further suggests that there was subsequent HGT from purple bacteria and actinobacteria to nematodes and arthropods, respectively (Prakash et al., 2010).

## **Discussion**

The objective of this paper was to review the biodiversity of antimicrobial compounds, their mode(s) of action and underlying biosynthetic genes within plant-associated microbes. This review covered diverse biosynthetic gene clusters that encode polyketides, non-ribosomal peptides, terpenoids, alkaloids, heterocyclic nitrogenous compounds, volatile compounds, bacteriocins, and lytic enzymes. The reviewed evidence suggests that these biosynthetic genes have diversified at different levels, each based on distinct evolutionary mechanisms:

## **Species Level Diversification**

An emerging theme from the literature is that horizontal gene transfer (HGT) appears to have played a major role in the evolutionary diversification of plant-associated microbial species through inheritance of anti-microbial traits. There is evidence that HGT may have occurred from: bacteria to bacteria such as those that inhabit the rhizosphere (e.g., phenazine); from bacteria to fungi (e.g., helvolic acid); from bacteria to nematodes and arthropods (e.g., chitinase family 19); possibly from fungi to plants (e.g., Taxol); from plants to bacteria (e.g., phenazine and chitinase family 19); and even from higher eukaryotes to bacteria (e.g., IleS, pseudomonic acid resistance protein) (**Figure 8**). As noted in the literature, diverse factors might have facilitated these remarkable gene transfer events including: (1) the clustering of genes encoding the secondary metabolite; (2) homologous recombination between chromosomes and transconjugated plasmids (e.g., phenazine, zwittermicin A); (3) the presence of mobile elements (DNA transposons and retroelements) flanking the biosynthetic operons (e.g., zwittermicin A, phenazine and ergovaline); and (4) the presence of genes that encode self-immunity to the antibiotic within the biosynthetic cluster, as otherwise receiving the cluster would have caused immediate suicide (e.g., mupirocin). It is worth noting that some gene clusters show no evidence of HGT (e.g., hydrogen cyanide).

## **Genome Level Diversification**

A second emerging theme from the literature is that a subset of plant-associated microbial genomes have diversified through duplications of entire gene clusters responsible for the synthesis of antimicrobial compounds. For example, as noted above, in *Neotyphodium uncinatum*, there are two homologous gene clusters that encode loline, LOL-1 and LOL-2, the likely result of a segmental duplication event within this fungus. Another noted example is from *Bacillus* species in which three tandemly duplicated gene clusters, *pks1*, *pks2*, and *pks3*, encode the polyketides bacillaene, macrolactin, and difficidin, respectively, the likely result of a homologous recombination event.

## **Intra Gene Cluster Diversification**

A third interesting theme from the literature is that gene clusters encoding anti-microbial compounds have diversified extensively in their internal organization, permitting biochemical diversification. The biosynthetic genes for these compounds are clustered in fungi or organized into operons in bacteria; in the latter, they are generally located on chromosomes but occasionally on plasmids (e.g., agrocin 84). Diversity within each cluster can include varying combinations of biosynthetic coding sequences (CDS), transporters for the respective compound, regulatory genes and CDS that confer self-immunity (e.g., mupirocin, agrocin 84). The biosynthetic operons vary in how many CDS synthesize the core skeleton (e.g., synthetases) as well as in how many encode decoration enzymes (e.g., hydroxylases, acyltransferases). However, the decoration enzymes may be encoded outside the gene cluster (e.g., phenazine operon). Furthermore, the biosynthetic CDS may be organized into genetic modules (e.g., NRPS) that vary in number. Each gene cluster is also associated with distinct DNA regulatory elements, for example to receive signals such as from quorum sensing. For example, comparative analysis of the trichothecene biosynthetic gene clusters (TRI) in *Fusarium* and *Trichoderma* showed multiple evolutionary diversification events within a single biosynthetic gene cluster family (e.g., head-to-tail vs. head-to-head rearrangements) (**Figure 5**). In another example, comparative analyses of the polymyxin operon showed mixing and matching of CDS, resulting in diversification of the compounds. Similarly, diversification of the phenazines likely arose through a diversity of biosynthetic decoration enzymes (e.g., hydroxylases). Another intriguing observation is from the helvolic acid biosynthetic gene cluster, in which transferase CDS were shown to have duplicated and diversified into paralogous gene families. As noted above, an interesting feature of this gene cluster is that it is located in the subtelomeric chromosome region, which is associated with high rates of evolutionary recombination.

## **Diversification Within Coding Sequences (CDS)**

A final emerging theme from the literature is that diversification of anti-microbial traits in plant-associated microbes arose from allelic diversification. For example, intragenic swapping of domains was observed within the same genetic module (e.g., iturin A, mycosubtilins). As another example, whereas most chitinase genes possess a single catalytic domain, examples were noted where a single CDS encodes two catalytic domains (**Figure 7**). In general, the literature notes that domains within the same CDS can evolve independently (e.g., catalytic vs. substrate binding domains of chitinase); combined with the existence of linker peptides between domains as sites of homologous recombination, these features can result in novel alleles following domain swapping (e.g., family 18 chitinases). Whereas such allelic diversification plays a major role in the diversification of compound structures, caused for example by DNA mutations within the substrate binding domain, the literature also provides examples where biochemical diversity has arisen from relaxed substrate specificity of the biosynthetic enzymes (**Figure 3**). A representative example of the latter is the promiscuous fusaricidin NRPS, in which the same recognition domain in different species can recognize and incorporate different amino acids; furthermore, it can recognize amino acids beyond the 21 standard amino acids translated by the ribosome, which results in significant structural diversity.

## **Dynamic Evolution Driven by Selection Pressures**

It is interesting to speculate on the evolutionary selection pressures that have led to the diversification at the various biological levels noted above. At the most basic level, diversification was likely driven by a three-way co-evolution between the plant-associated microbe, its target pathogen and the host plant. This co-evolution may have occurred within a specific plant tissue niche or within soil associated with the rhizosphere (e.g., phenazine and PLt to combat soil-borne pathogens). However, there is also evidence for four-way interactions that include additional microbes (e.g., Taxol, jadomycin) and insects (e.g., ergovaline). These complex interactions can be bi-directional (e.g., loline). Within the producing organism, there is evidence for selection pressure to coordinate biosynthesis of the antimicrobial compound with the life cycle of the microbe (e.g., fusaricidins). There may also have been selection for genetic efficiency (e.g., potential sharing of transporter genes between polymyxin and fusaricidin). These selection pressures have led to fascinating individual stories, including the evolution of mimicry to facilitate antibiosis (e.g., agrocin).

## **Ecological and Evolutionary Lessons**

When the examples of anti-microbial pathways were grouped by the phylogeny of their host microbes, several trends were observed (**Figure 9**, **Tables 1**, **2**). Specifically: (1) some anti-microbial genes are apparently widely distributed among diverse taxonomic classes of bacteria (e.g., chitinases); (2) some metabolic pathways are widely distributed within one taxonomic class, such as pyrrolnitrin, which shows up in more than half of the presented species of Proteobacteria; (3) other anti-microbial pathways appear to be more restricted (e.g., fusaricidin, polymyxin, jadomycin). These results may correlate with the evolutionary age of these genetic pathways, or may represent a bias based on how well the pathway has been studied. More widespread genome sequencing and/or the use of orthologous gene probes may help to inform the evolutionary origins of these anti-microbial pathways.

## **Bacterial Pathway Lessons**

The selected examples of anti-microbial pathways from plant-associated bacteria found in the literature and presented in this review are distributed across Proteobacteria, Actinobacteria and Firmicutes (**Figure 9**). This may not be surprising as Proteobacteria and Actinobacteria are among the most widespread bacterial taxa associated with plants, perhaps because of their saprophytic capabilities (Bulgarelli et al., 2012).

Within these phyla, *P. fluorescens* (Proteobacteria) and *B. subtilis* (Firmicutes) were observed to produce a plethora of antimicrobial compounds belonging to diverse chemical classes including polyketides, non-ribosomal peptides, heterocyclic nitrogenous compounds, volatiles and enzymes, which reflects the diversity of the metabolic machineries of these species. As *P. fluorescens* and *B. subtilis* are both model systems, these results also support the bias within this literature noted above.

*Bacillus* sp. and *Pseudomonas* sp. are ubiquitous microbes that can survive in diverse ecological niches (Compant et al., 2005). Both have elegant survival strategies that involve the production of antibiotics, surfactin, cyanide, biofilms, and induction of host resistance (Espinosa-Urgel, 2004; Dini-Andreote and van Elsas, 2013). These unique adaptations have led to their widespread study and use as biocontrol agents (Santoyo et al., 2012).

Genome analysis of *P. fluorescens* has provided insight into its ecological competency and evolutionary mechanisms. The versatile and rapid adaptability of *P. fluorescens* to diverse environmental cues may be attributed to over 200 characterized signal transduction proteins which enhance its sensing capability (Garbeva and de Boer, 2009; Humair et al., 2010). With respect to co-evolution, the *P. fluorescens* genome is exceptionally rich in repetitive extragenic palindromic (REP) elements, target sites for transposases and recombinases, with 1052 REP elements identified in *P. fluorescens* Pf-5 (compared to 21 in *P. aeruginosa* PAO1 and 365 in *P. syringae* DC3000) (Tobes and Pareja, 2005, 2006). REPs likely affected genome evolution by gene gain, loss or rearrangement (Silby et al., 2011). The latest version of the genome sequence and annotation of *P. fluorescens* was recently released (Martínez-García et al., 2015).

*B. subtilis* is naturally competent genetically, with a cascade of competence-specific DNA-uptake proteins that bind and transport DNA, in addition to a dynamic recombination mechanism which transforms chromosomal or plasmid DNA via different pathways (Chen and Dubnau, 2004; Kidane et al., 2009). Additionally, the *B. subtilis* genome encodes integrative and conjugative element binding (*ICEBs1*) proteins responsible for excision, integration, and transfer of DNA (Lee et al., 2007) that likely have facilitated HGT. Comparative genomic analysis of *B. subtilis* strains revealed 298 accessory segments that potentially originated from mobile elements including plasmids, transposons and phages. This implies extensive HGT events that led to diversification of the arsenal of anti-microbial pathways within *Bacillus* (Zeigler, 2011). The complete genome sequence and genome annotation of *B. subtilis* are available (Barbe et al., 2009; Belda et al., 2013).

## **Fungal Pathway Lessons**

In contrast to bacteria, all the fungal examples presented in the review belong to a single classification, the Pezizomycotina (filamentous fungi), a subdivision of Ascomycota, the largest phylum of fungi (Blackwell, 2011), including representatives from Eurotiomycetes and Sordariomycetes (**Figure 9**). Pezizomycotina has an ancient origin in the Cambrian period, ca. 530 Mya (Prieto and Wedin, 2013).

Pezizomycotina species are the most ubiquitous fungi with extremely diverse lifestyles, suggesting a corresponding diversity of ecological strategies (Spatafora et al., 2006; Beimforde et al., 2014) reflected in their production of a range of antimicrobial metabolites. There may be at least two reasons for this metabolic diversity: HGT and recombination. First, HGT from bacteria to fungi was previously reported in Ascomycota, of which 65% were observed in Pezizomycotina (Marcet-Houben and Gabaldón, 2010). Second, secondary metabolism gene clusters in Pezizomycotina show evidence of recent gene expansion (Arvas et al., 2007). Interestingly, most of these genes are located in the subtelomeric region (Rehmeyer et al., 2006), which is associated with a considerably higher rate of recombination and correspondingly more rapid evolution than other regions in the genome (Freitas-Junior et al., 2000), an example represented in this review by helvolic acid (**Table 1**).

## **Gaps and Future Perspectives**

Despite the apparent progress in understanding the genetic mechanisms underlying the diversity of anti-microbial compounds produced by plant-associated microbes, significant gaps and opportunities remain. The major challenge is that a vast majority of plant-associated microbes are unculturable, a phenomenon that greatly limits our understanding of species diversification and evolution. It is worth noting that considerable progress toward cultivation of unculturable microbes has started to be achieved (Pham and Kim, 2012; Stewart, 2012). The modified cultivation methods attempt to simulate the natural environment, and include community culturing and the use of high-throughput microbioreactors and laser microdissection (Pham and Kim, 2012). Another challenge is that the literature appears to be biased toward model organisms, with insufficient data from other organisms in the phylogenetic tree for comparative genomics and evolutionary studies. For example, despite our efforts to originally focus this review only on antimicrobial pathways from endophytes, it became clear that the associated genes from endophytes have largely been unexplored, compared to free-living rhizosphere model species. Indeed, there remains a lack of detailed genetic analysis underlying many anti-microbial compounds across microbes (endophytic and non-endophytic) and a lack of information to connect allelic diversity with compound diversity.

With respect to understanding the biosynthetic pathways of these metabolites, more information is needed as to the extent that diverse anti-microbial pathways coordinate and share biosynthetic enzymes. An important question in metabolic biosynthesis is understanding how chemical substrates are channeled along metabolic pathways from one enzyme to the next; from this review, it appears that some anti-microbial pathways solve this problem by using mega-synthase enzymes (e.g., zwittermicin A), but for other pathways, investigation of enzyme-enzyme interactions will be informative. An interesting future area of study will be to investigate the subcellular location of biosynthetic and storage proteins, especially of self-toxic compounds that may need to be sequestered. To that end, there have been advances in studying compartmentalization and secondary metabolite trafficking machinery (Roze et al., 2011; Lim and Keller, 2014; Kistler and Broz, 2015), which offer strategies to move forward.

A significant challenge in this discipline is the study of antimicrobial compounds in their native ecological context, as most reports are based only on *in vitro* studies. In particular, because the target pathogen affects the host plant, more information is needed as to how the plant and the anti-pathogenic microbe coordinate and regulate one another. For example, in the jadomycin pathway, evidence suggests that the plant sensing of the pathogen stimulates the anti-pathogen pathway in the associated beneficial microbe. The potential complexity of plant-microbe interactions and associated signaling networks is well studied in model systems such as *Rhizobium* (Janczarek et al., 2015). Though *Rhizobium* is a symbiotic microbe of legume plants, these studies suggest that a wealth of information remains to be explored for other plant-associated microbes, in particular endophytes (Kusari et al., 2012).

Also within the ecological context, basic biochemical questions are raised, such as whether the anti-microbial pathway is regulated by the target pathogen, for example through feedback inhibition once the pathogen has been eliminated. To help understand the genetic regulation of these anti-microbial pathways, analysis of gene expression with respect to the microbial life cycle would be a useful avenue of investigation, similar to the interesting findings from the fusaricidin pathway. An interesting study concerning aflatoxin, a polyketide mycotoxin, revealed strong evidence for the potential link between the fungal growth stage and polyketide biosynthesis (Zhou et al., 2000). Furthermore, intracellular tracking of aflatoxin biosynthetic enzymes in *Aspergillus parasiticus* showed significant accumulation in the vacuoles of specific cells but its absence in neighboring ones (Hong and Linz, 2008). This surprising result led Roze et al. (2011) to hypothesize spatial and temporal gene expression of the associated biosynthetic pathway, at different developmental resolutions ranging from a single cell to a fungal colony.

A related major challenge is that many natural products reported in the literature were initially isolated as part of screens for new compounds from total extracts, and hence the ecological functions of these compounds, as well as their underlying genes, remain unknown.

As anti-pathogenic metabolites may be self-toxic, the evolution of self-resistance is a particularly fascinating avenue of study, which this review demonstrates has been investigated for a limited number of pathways (e.g., mupirocin). Diverse self-resistance mechanisms have been reported in the microbial literature (Schäberle et al., 2011; Westman et al., 2013; Stegmann et al., 2015), suggesting that each plant-associated microbe with antimicrobial activity may employ unique self-protection strategies.

The recent advances in genome sequencing combined with gene editing tools will facilitate more in-depth analysis of orthologous biosynthetic genes in diverse species. Bioinformatic genome mining of biosynthetic gene clusters, combined with new advances in metabolomics, may also lead to the discovery of a diverse array of novel bio-active natural products. Moreover, merging these techniques with knowledge of microbial coevolution and ecology (Vizcaino et al., 2014) along with advanced microscopy and imaging techniques will open a new era of discovery to harvest the diversity of natural products to combat evolving pathogens.

## **Author Contributions**

WM wrote the manuscript, and WM and MR edited the manuscript.

## **Acknowledgments**

We thank Travis Goron (University of Guelph) for helpful comments on the manuscript. WM was supported by a generous scholarship from the Government of Egypt. MR was supported by grants from the Grain Farmers of Ontario, OMAFRA, and the CIFSRF program by the International Development Research Centre and DFATD of the Government of Canada.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Mousa and Raizada. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Target or barrier? The cell wall of early- and later-diverging plants vs cadmium toxicity: differences in the response mechanisms

#### **Luigi Parrotta<sup>1</sup>, Gea Guerriero<sup>2</sup>, Kjell Sergeant<sup>2</sup>, Giampiero Cai<sup>1</sup>\* and Jean-Francois Hausman<sup>2</sup>\***

<sup>1</sup> Dipartimento Scienze della Vita, Università di Siena, Siena, Italy

<sup>2</sup> Environmental Research and Innovation, Luxembourg Institute of Science and Technology, Esch-sur-Alzette, Luxembourg

#### **Edited by:**

Joanna M.-F. Cross, İnönü University, Turkey

#### **Reviewed by:**

Zoë A. Popper, National University of Ireland, Ireland

Danuta M. Antosiewicz, University of Warsaw, Poland

Metin Turan, Yeditepe University, Turkey

#### **\*Correspondence:**

Giampiero Cai, Dipartimento Scienze della Vita, Università di Siena, via Mattioli 4, I-53100 Siena, Italy e-mail: giampiero.cai@unisi.it

Jean-Francois Hausman, Environmental Research and Innovation, Luxembourg Institute of Science and Technology, 5, Avenue des Hauts-Fourneaux, L-4362 Esch-sur-Alzette, Luxembourg e-mail: jean-francois.hausman@list.lu

Increasing industrialization and urbanization result in the emission of pollutants into the environment, including toxic heavy metals such as cadmium and lead. Among the different heavy metals contaminating the environment, cadmium raises great concern, as it is ecotoxic and as such can heavily impact ecosystems. The cell wall is the first structure of plant cells to come in contact with heavy metals. Its composition, characterized by proteins, polysaccharides and in some instances lignin and other phenolic compounds, confers the ability to bind heavy metals non-covalently and/or covalently via functional groups. A strong body of evidence in the literature has shown the role of the cell wall in heavy metal response: it sequesters heavy metals, but at the same time its synthesis and composition can be severely affected. The present review analyzes the dual property of plant cell walls, i.e., barrier and target of heavy metals, by taking Cd toxicity as an example. Following a summary of the known physiological and biochemical responses of plants to Cd, the review compares the wall-related mechanisms in early- and later-diverging land plants, by considering the diversity in cell wall composition. By doing so, common as well as unique response mechanisms to metal/cadmium toxicity are identified among plant phyla and discussed. After discussing the role of hyperaccumulators' cell walls as a particular case, the review concludes by considering important aspects for plant engineering.

**Keywords: cadmium, plant cell wall, heavy metal stress, heavy metal biosorption, cell wall polysaccharides, lignin**

#### **INTRODUCTION**

Unlike animals, plants are sessile organisms and therefore cannot escape from potentially life-threatening conditions. This, in part, explains the great metabolic plasticity of plant cells, which have evolved mechanisms enabling them to adapt to and cope with environmental challenges. Plants growing on contaminated soils typically display either tolerance or avoidance. Plants tolerate heavy metals by sequestering them in specific plant organelles to keep them segregated from vital cellular components, or by the synthesis of enzymes involved in detoxification. Alternatively, the uptake and translocation of the heavy metal is decreased. However, there is a wide diversity in the details of the response, and species- or even clone-specific molecular responses have been recorded. Cadmium (Cd) is known to be phytotoxic and to affect plant physiological processes, from roots to shoots. Adverse effects caused by Cd are of considerable importance for all plants but, from an agricultural point of view, Cd is a toxic environmental pollutant that affects crop productivity (Du et al., 2012). While numerous reports have analyzed the cellular and enzymatic mechanisms involved in the response to Cd stress in plants, only a handful of studies have focused on the role of the plant cell wall. The cell wall is the outermost structure of plant cells and therefore the first in contact with the environment. It has important chemical characteristics, which make it a very good biosorbent of heavy metals. The presence of different functional groups (e.g., carboxyl, hydroxyl) deriving from the different wall polysaccharides favors, for instance, ion-exchange mechanisms with wall counter-ions. The cell wall is in turn affected by heavy metals, since both its biosynthesis and composition can be altered. This leads to modifications in its physico-chemical properties which increase the binding capacity and at the same time lower the entry of heavy metals into the protoplast (Krzesłowska, 2011).

#### **IMPACT OF Cd POLLUTION ON PLANT CELLS AND ORGANS**

Cd can have many consequences on plant physiology. It can affect both shoot growth and leaf biomass in *Zea mays* by impairing chlorophyll synthesis and by promoting the expression of defense proteins (Lagriffoul et al., 1998). In garlic (*Allium sativum*), Cd reduces root growth in a concentration-dependent manner (Liu et al., 2003).

It is also capable of damaging processes related to plant reproduction, with logical impacts on plant dispersal and biodiversity. For example, Cd can affect pollen germination and pollen tube growth by altering the polarization mechanism and by inducing abnormalities within the cell, such as blocking cytoplasmic streaming and alteration of the cytoplasmic organization (Sawidis, 2008). Moreover, in a recent study, the influence of exposure to a variety of heavy metals on the germination of 23 flax cultivar seeds was assessed. Using root elongation as the main parameter, large cultivar-dependent differences were found for some heavy metals (notably Cd, Ni and Co), while for other metals a more homogenous influence is described (Soudek et al., 2010).

The impact of Cd on crops can also be exerted by negatively influencing the translocation of nutrients, especially of minerals; in Cd-exposed tomato K, Fe, Mn, and Zn are reported to be poorly translocated in roots, while P and Mn uptake is drastically reduced in fruits (Moral et al., 1994). In pea, Cd has effects on both roots and leaves, and a significant inhibition of growth is combined with a reduction of transpiration and photosynthesis rate, as well as a general deterioration of the nutrient status (Sandalio et al., 2001). These are just a few examples describing the broad range of Cd-caused impacts on plants. For more information on the overall effects caused by Cd, readers can refer to specific reviews (Benavides et al., 2005; Qadir et al., 2014).

Cd is essentially absorbed by roots, but it localizes to all plant organs and tissues. Leaves are usually the preferential accumulation site of Cd. In leaves, Cd accumulates in different regions, which correspond to less metabolically active areas and to forthcoming necrotic regions. This suggests that an active metabolism might be required to detoxify Cd in plants and that the absence of detoxification mechanisms precedes necrotic events (Cosio et al., 2005). In the leaves of sensitive plants, Cd exerts its toxic activity by inducing chlorosis, thereby inhibiting growth. The effects caused by Cd range from morphological changes to reduction of the photosynthesis rate to decreased transpiration up to cell apoptosis (Souza et al., 2011).

Cd also affects the structure and function of roots. In *Allium cepa*, Cd treatment induces abnormalities, such as extensive vacuolization, condensation of the cytoplasm, damage to mitochondrial cristae, plasmolysis, and condensation of chromatin. Although dense granules were detected between the cell wall and the plasma membrane, no specific sites of Cd accumulation were found in the cell walls of *Allium* roots (Liu and Kottke, 2004).

As plants are exposed to different environmental stresses, the effects of Cd contamination cannot be isolated from injuries caused by other environmental stresses and it might sometimes be difficult to predict the combinatory effect (Mittler, 2006). The simultaneous presence of different stresses might multiply the negative effects; however, as already observed in rice seedlings, one type of stress can also partially suppress the effects of another. In rice, for example, heat stress can ameliorate the negative effects of Cd pollution through the activation of protective systems based on anti-oxidant activities (Shah et al., 2013). However, the study of combined stresses is much less advanced than that of single stresses, although some studies have been performed combining Cd-exposure with other treatments (Sergeant et al., 2014). Given that polluted soils are generally not contaminated with Cd alone, the exposure of plants to mixtures of metals (Printz et al., 2013a,b), or the use of real-life polluted soils in experiments (Evlard et al., 2014a), could give new insights into the metabolic adjustments of plants when exposed to high concentrations of trace nutrients.

## **ROLE OF PHYTOCHELATINS AND METALLOTHIONEINS IN THE RESPONSE OF PLANTS TO Cd STRESS**

Although Cd is a generally toxic contaminant of the ecosystem, plants have evolved diverse mechanisms to respond to Cd contamination of their growth substrate. As for many compounds with detrimental effects on cellular metabolism, the toxicity of Cd is often alleviated through its sequestration to specific cellular compartments, such as vacuoles, or within specialized cells, such as trichomes (Harada et al., 2010). Adaptation of plants to Cd-contaminated soils might also rely on the symbiotic cooperation with other organisms. In *Medicago sativa*, colonization of roots by arbuscular mycorrhizae increases the tolerance of plants to Cd and the decreased Cd toxicity observed is somewhat proportional to the extent of colonization. It is likely that fungi possess a battery of enzymatic activities/mechanisms capable of chemically modifying Cd, thereby making it less toxic (Wang et al., 2012).

The cell wall can act effectively as a biosorbent of Cd, alleviating the toxic effects of this heavy metal. It is clear that the cell wall-based protection mechanism is only one of the processes used by plants to cope with the damage induced by Cd. Before exploring more specifically the molecular mechanisms in which the cell wall is involved, it is necessary to introduce the other protective mechanisms with which plants are equipped. These are based on specialized oligopeptides (namely phytochelatins and metallothioneins), on biochemical responses and on the intracellular sequestration of Cd.

Plants can respond by producing specific chelating agents, such as phytochelatins, that complex with Cd, thereby reducing its toxic potential (Carrier et al., 2003). The production of phytochelatins coincides with the activation of sulfur metabolism. This increases the synthesis of cysteine and of reduced glutathione (GSH), an antioxidant precursor of phytochelatins (Zhang and Shu, 2006). Increased levels of sulfur can also affect the distribution of Cd in plants by remobilizing it from the cell wall to the cytosolic compartment. As sulfur is important in the biosynthesis of sulfhydryl proteins, it is suggested that a higher content of sulfur generates larger amounts of proteins capable of sequestering Cd in the cytosol, thereby reducing the accumulation of the metal in the cell wall (Zhang et al., 2014c). Therefore, the availability of sulfur might control the synthesis rate of these sulfur-rich proteins (Loeffler et al., 1989; Rüegsegger et al., 1990). Although this information is inferred from indirect observations (i.e., by determining that, in the presence of higher levels of sulfur, Cd is preferentially located in the vacuole), this also suggests that the balance between cell wall-bound and cytosol-associated Cd depends on the availability of proteins capable of sequestering Cd.

The use of phytochelatins for the detoxification of Cd-stressed plants is considered an important mechanism, but its actual role is still unclear and seems to be organ-dependent. Roots of the hyperaccumulator *Sedum alfredii* do not make efficient use of the phytochelatin-based mechanism for protection against Cd, while this process is conversely used in the shoots of the same plant (Zhang et al., 2010b). The use of different tolerance mechanisms in shoots and roots is also found in other hyperaccumulators such as *Typha angustifolia*, one based on GSH-related antioxidant systems (in leaves) and the other based on GSH-related chelation system (in roots; Xu et al., 2011).

An important class of cysteine-rich proteins binding heavy metals in plants and animals are metallothioneins (MTs). Their role in heavy metal stress in plants is known and their potential use in biotechnology is supported by recent studies, which have analyzed the effects of MT overexpression in *Arabidopsis thaliana* and tobacco (Gu et al., 2014; Zhou et al., 2014). The transformed plants showed increased tolerance to Cd stress, which was accompanied by a lower accumulation of H<sub>2</sub>O<sub>2</sub> in *A. thaliana* and by a higher ROS scavenging activity in tobacco.

Data in the literature indicate that a universal mechanism of Cd tolerance is not present and that different, often closely related plants respond differently when exposed to Cd. For instance, Cd-tolerant cultivars of black oat (*Avena strigosa*) accumulate Cd in the leaves, mainly in the cell wall. Those plants use a phytochelatin-based response when exposed to Cd, but also induce the up-regulation of ascorbate peroxidase and superoxide dismutase, indicating that an anti-oxidant response is triggered upon Cd treatment. By contrast, the phytochelatin-based response appears unimportant in Cd-sensitive plants, although their total content increased upon heavy metal exposure (Uraguchi et al., 2009).

## **OTHER FACTORS AFFECTING THE RESPONSE/ADAPTATION/DEFENSE OF PLANTS TO Cd**

The number and diversity of resistance mechanisms used by plants might be larger than expected. For example, it was reported that treatment of Cd-stressed plants with salicylic acid alleviates a number of effects caused by Cd. It is likely that the positive effects of salicylic acid are not directly correlated with the removal of Cd, but with the activation of response and defense genes, which in turn may code for enzymes involved in the maintenance of homeostasis of plant cells (Moussa and El-Gamal, 2010). Activation of pathogenesis-related defense proteins of the glucanase family is likely to be a common trait of many plants in response to Cd. This process was observed in different plants, for instance maize and soybean treated with Cd and other metals, suggesting that this defense mechanism is shared by relatively distant species (Piršelová et al., 2011).

The response of plants to Cd also requires the activation of enzymes capable of refolding proteins, as one of the most dramatic effects of intracellular Cd is protein denaturation. Therefore, some of the enzymes activated during response and adaptation to Cd belong to the chaperone family and chaperone-like proteins are likely active during the response to Cd pollution. As Cd can denature proteins, heat-shock proteins (HSPs) are involved in the refolding of denatured proteins and indeed levels of HSPs have been reported to increase after Cd treatment (Sergio et al., 2007). Increased levels of HSP70 were also detected in the bryophyte *Conocephalum conicum* exposed to Cd and Pb, suggesting that the need to refold proteins is a prerequisite to maintain cellular activity during heavy metal stress (Basile et al., 2013).

In pea roots, Cd and Cu treatment were reported to affect the enzymatic activity of proteins involved in the oxidation and peroxidation of cell wall components, such as guaiacol peroxidase, ascorbate peroxidase, coniferyl alcohol peroxidase, NADH oxidase and indole-3-acetic acid (IAA) oxidase (Chaoui et al., 2004). The evidence that Cd might increase the activity of cell wall-bound coniferyl alcohol peroxidase is of particular interest because this enzymatic activity metabolizes coniferyl alcohol, a monolignol used as substrate presumably involved in the lignification process (Quiroga et al., 2001). This suggests that Cd pollution might have severe repercussions on the development of secondary cell walls, by reducing their rigidity and robustness. An increase in guaiacol peroxidase activity was also reported in lichens exposed to higher concentrations of Cd (Sanità Di Toppi et al., 2005). As lichens are fungi-algae symbionts generally resistant and tolerant to heavy metal pollution, the use of plasma membrane-associated peroxidases might be a general mechanism to counteract the negative effects caused by Cd.

Changes in cell wall-associated peroxidase activity were also observed in roots of *Brassica juncea* stressed by Cd treatment. Here, Cd increases the activity of ionically cell wall-bound proteins, thereby increasing peroxidase activity too (Verma et al., 2008). The upregulated peroxidase activity is supposedly involved in the adaptation mechanism to Cd stress.

Although specific studies have not been conducted to determine the impact of Cd exposure on the cell wall proteome and on the diversity of responses in closely related species, numerous cell wall localized proteins were identified in general proteome studies on the impact of heavy metal pollution, namely Cd excess, on plants. These include general stress responsive proteins. For instance, in a study on two flax cultivars with different tolerance levels to Cd exposure, the cell wall-localized chitinase increased in both cultivars, but the protein accumulated at significantly higher levels in the more tolerant cultivar (Hradilova et al., 2010).

Proteomics studies in poplar found that exposure to Cd resulted in an increased accumulation of β-1,3-glucanase and chitinase (Kieffer et al., 2008; Durand et al., 2010). Moreover, Kieffer et al. (2008) identified the significantly higher accumulation of different isoforms of cell wall-localized peroxidase, linking the cell wall and anti-oxidative enzymes found in this compartment with Cd stress. An intra-specific comparison of different *Populus nigra* ecotypes found differences in the mechanism of tolerance, indicating the variability and highlighting the difficulty of generating a physiological model, even at the species level. In Cd-stressed poplar plants, the activity of superoxide dismutase decreases, while the enzymatic activity of both peroxidase and catalase increases significantly in roots; this suggests that these plants trigger a response based on enhanced oxidase activity. Although the cytology of root cells is significantly altered, these plants exhibit a considerable capacity as hyperaccumulators and they have been consequently proposed as phytoremediators (Ge et al., 2012).

In addition to the various mechanisms outlined above, extracellular compartmentalization is another strategy used by many plants to limit the injuries caused by Cd. This mechanism involves the binding and accumulation of Cd in the cell wall, which serves as a reservoir for Cd. It is capable of concentrating and accumulating Cd, thus preventing it from penetrating inside the cell, where it can cause damage. In the legume white lupin, the ability of the cell wall and vacuole-deposited phytochelatins to sequester Cd is relatively comparable in stems, but the cell wall is more effective in both roots and leaves (Vázquez et al., 2006). This suggests that these organs have specialized cell walls capable of sequestering Cd, while stems have mostly developed the phytochelatin-based mechanism for protection against Cd. Predominant accumulation of Cd in the cell wall of roots was also observed in *Spartina alterniflora* (Pan et al., 2012) and *Kandelia obovata* (Weng et al., 2012) treated with different concentrations of Cd. Enhanced accumulation of Cd in the cell wall of leaf cells was conversely reported in *Alternanthera philoxeroides* (Xu et al., 2012a).

Before discussing how Cd affects plant cell wall and how it can be a barrier against Cd, we need to introduce its key features.

### **THE PLANT CELL WALL: A LIVING STRUCTURE**

Plant cells are enveloped by a layer composed of structural proteins and polysaccharides (which can be impregnated by the aromatic polymer lignin), interwoven to form an intricate mesh, the cell wall. This structure has a complex three-dimensional organization and is considered an example of a natural biocomposite. The plant cell wall constitutes a material with exceptional mechanical performance and is taken as a model for the creation of materials with superior properties.

Cell walls protect the living protoplasts against external insults but, at the same time, they are very plastic since modifications in their composition and structure are responsible for enabling cell growth and development. Two types of cell walls can be distinguished: primary and secondary cell walls. The first is typically found in actively growing plant cells and is characterized by the presence of cellulose, pectin and hemicellulose. It is quite thin and flexible, thus enabling cell expansion (Guerriero et al., 2014a). Secondary walls are specialized structures synthesized when plant cells have ceased to elongate. They are responsible for mechanical strengthening (namely via the deposition of lignin) and accompany the phase of secondary growth, which determines the increase in stem girth (Guerriero et al., 2014a,b). The progressive stem thickening ensures resistance to bending. The main components of secondary walls are cellulose, xylan and lignin (Cosgrove and Jarvis, 2012).

A third type of wall (more specifically a wall layer) can be distinguished, which is found in specialized cell types and/or in response to mechanical stimuli: the gelatinous wall layer (G-layer). This type of wall is characterized by the presence of a thick layer of crystalline cellulose and is typically found in tension wood. Cells with a G-layer are also found in the stem of fiber crops like hemp, flax and nettle and are associated with the phloem. These cells are very long and known as bast fibers (Mellerowicz and Gorshkova, 2012; Guerriero et al., 2013).

Besides encasing and shielding plant protoplasts, the cell wall can be affected by nutritional stress and undergoes variations in composition. For example, grapevine callus subjected to deficiency in S, N, and P showed walls with a decreased cellulose content, an increased lignin amount, and alterations in the methylesterification of pectin (Fernandes et al., 2013). These results show how plant cell walls are remodeled in response to mineral deficits and how their composition is modified according to the stress condition applied. The plant cell wall indeed takes active part in transmitting exogenous signals to the interior of the cells: evidence shows the existence of a cell wall integrity (CWI) maintenance mechanism which is triggered upon exogenous stresses (Hamann and Denness, 2011; Wolf et al., 2012; Engelsdorf and Hamann, 2014). The knowledge concerning CWI maintenance is far from being complete; however, given the similarity existing with that of yeast, both chemical and physical signals are supposed to contribute to it (Hamann and Denness, 2011). An example of a physical signal is represented by the weakening of the cell wall (for instance as a consequence of treatments with drugs like dichlobenil or isoxaben) up to a point that it can no longer counteract the internal turgor pressure, with consequent cell swelling and bursting. A weakened wall causes stretching of the plasma membrane and activation of mechanosensitive channels which trigger an increase in the cytoplasmic concentration of calcium (Humphrey et al., 2007; Hamann and Denness, 2011).

Chemical signals are generated upon degradation of the cell wall by pathogens: for example, pectin-derived oligogalacturonides (OGAs) with a specific length, degree of acetylation and methylesterification, are capable of inducing rapid defense responses (Hamann and Denness, 2011; Vallarino and Osorio, 2012; Ferrari et al., 2013). The presence of signals, which allow the wall to "sense" the status of the cell, implies the existence of receptors that can intercept those signals. Several receptors have been described in the literature and among them wall-associated kinases (WAKs) have been well studied. Those receptors are located in the plasma membrane, but have a domain protruding into the cell wall and a cytoplasmic kinase domain (Anderson et al., 2001; Humphrey et al., 2007). Their structure is therefore ideal to create a cell wall-plasma membrane-cytoplasm continuum and to transfer the signal from the outside to the inside of the cell. The role of these receptors is particularly interesting if one considers the relationship existing between heavy metal stress and water balance within plant cells. Gating of aquaporins was recorded within 10 min of heavy metal application to onion epidermal cells, which results in alterations in the turgor pressure (Przedpelska-Wasowicz and Wierzbicka, 2011). Since the WAK membrane receptors are activated upon changes in turgor pressure (a phenomenon observed in pollen grains of *Picea wilsonii* after Cd stress; Wang et al., 2014), their functional study could provide important insights into the sensing mechanism and reveal downstream players involved in the signaling cascade.

### **THE CELL WALL AS A BARRIER TO Cd**

The composition of the cell wall shows many differences among phyla and has played a vital role in their survival throughout evolution (Sarkar et al., 2009 and references therein). In the attempt to understand how different taxonomic groups respond to Cd (or heavy metals in general), it is valuable to compare the chemical composition and structure of cell walls from different phyla. Such information might shed light either on alternative resistance mechanisms, or on how to improve the strategy used to tolerate Cd. The rationale behind this assumption is that the variety of responses to Cd pollution, as found among different phyla, might be mirrored by molecular and cytological "tricks" that plants have evolved at the cell wall level. In discussing this issue, it is necessary to consider that evidence for the cell wall acting as a barrier against Cd (and other heavy metals) is not immediately clear from every study. Molecular and structural data are scarce, which leads to a fragmented picture and hampers the development of a unifying model. Moreover, it cannot be ruled out that a unifying model does not exist, and it is therefore necessary to present individual situations. The ability of either a plant cell or a plant *in toto* to tolerate Cd can be implemented at different levels and with different mechanisms. In the attempt to outline the different resistance mechanisms in which cell walls are involved, it is possible to rank them based on (a) the morphology of the barrier organ (for example the root), (b) the transport capacity of the contaminant, (c) the absorption of Cd into the cell wall by polysaccharides or proteins, (d) the hindrance to Cd penetration through the plasma membrane.

The first part of the section aims at surveying the wall-related strategies developed by early- and later-diverging embryophytes. Charophytes, the closest algal relatives of embryophytes, are also briefly treated, given their importance in the study of land plant cell wall origin (Sørensen et al., 2011; Mikkelsen et al., 2014). The second part of the section discusses the mechanisms of Cd absorption/translocation in different plant organs and tissues, while considering the cell wall organization/composition.

### **PLANT CELL WALL ACROSS DIFFERENT PHYLA: COMPOSITION AND MECHANISMS OF RESISTANCE TO Cd**

Charophyte cell walls possess high amounts of mannose-containing hemicellulose, glucuronic acid, mannuronic acid and 3-*O*-methyl rhamnose (Popper and Fry, 2003), and possess neither lignin nor cutin. More importantly, they are capable of calcifying, and calcite is known to bind and sequester heavy metals (Gomes and Asaeda, 2013). The macrophyte alga *Chara australis* can accumulate Cd and be used for remediation of contaminated soils (Clabeaux et al., 2013). Studies carried out on another species of macrophyte, *Chara fragilis*, showed that binding of uranyl species to the cell walls was mainly due to the presence of calcite (Daković et al., 2008). The biomineralization potential of Charophytes is therefore very promising for phycoremediation (i.e., the use of algae for the removal of pollutants) and further studies should be performed to link the cell wall composition of these algae to their calcifying potential.

Bryophytes, a group of early-diverging non-vascular plants, possess walls with mannose-containing hemicellulose, uronic acids and 3-*O*-methyl rhamnose, similarly to Charophytes (Sarkar et al., 2009). They lack lignin, although they contain lignans and other lignin-like polymers (Sarkar et al., 2009 and references therein). Bryophytes are known bioindicators of heavy metal pollution and their walls show high biosorption capacity because of the numerous ion-exchange sites (e.g., from uronic acids). The wall of the moss *Pohlia drummondii* was shown to form, together with the plasma membrane, an efficient shield against elevated zinc doses (Lang and Wernitznig, 2011); the living protoplast is therefore protected from harmful effects. The moss *Scorpiurum circinatum*, treated with different heavy metals including Cd, immobilized the toxic ions in its cell walls, which therefore serve as the main detoxification site (Basile et al., 2012).

The Pteridophytes, another group of early-diverging plants, are very interesting organisms, as their cell walls show specialization to support the differentiation of a vascular tissue: galactomannan and glucomannan are abundant in secondary walls of leptosporangiates, and lignin, together with xylan, has also been observed (Sarkar et al., 2009 and references therein). In the fern *Lygodium japonicum* it was shown that the wall pectins bind copper via homogalacturonans (Konno et al., 2005). In *Salvinia auriculata*, Cd was shown to induce severe deformations at the cell wall level. Moreover, an opaque layer could be observed along the middle lamella (Wolff et al., 2012). This finding might be explained by the binding capacity of pectins, the chief component of the middle lamella, a feature which is shared by seed plants. Among Pteridophytes, Equisetales certainly deserve attention, as their walls contain mixed-linkage glucans and they are known for their biosilicification activity. Interestingly, a relationship was proposed between these hemicelluloses and Si mineralization (Fry et al., 2008). It would be interesting to investigate whether the sequestration of heavy metals by Equisetales takes place via mixed-linkage glucans.

These data indicate that the cell wall is a target for the accumulation of Cd in early-diverging plants, such as Bryophytes and Pteridophytes. These organisms use molecular mechanisms that are compatible with the chemical composition of their cell wall. However, as outlined above, both the structure and composition of the cell wall changed during evolution and it is therefore expected that chemically-different cell walls might respond differently to Cd pollution. The in-depth study of early-diverging plant cell wall biosynthesis and composition can reveal fine molecular details and inspire further avenues for improving the cell wall capacity to accumulate toxic heavy metals (see section "Future outlook").

In later-diverging plants (such as spermatophytes), the increasing number of tissues and organs, coupled with the more sophisticated architecture, also generated cell walls with an increasing variability at the chemical and physical level. Consequently, it is likely that the diversity in Cd tolerance extends to the organ, tissue and even to the cellular level (see next paragraph).

In seed plants, and more specifically in flowering plants, most of the information was obtained from a limited number of species and, in some cases, from crop plants.

The idea that the cell wall can exert a protective role was supported by studying Cd toxicity in crop plants. Pea plants were shown to be more sensitive than maize plants to Cd exposure; compared to pea plants, maize plants exhibited a higher percentage of cell wall-associated Cd, while pea showed a higher content of Cd in the cytoplasm (Lozano-Rodríguez et al., 1997). This finding suggests that the cell wall acts as a barrier, preventing Cd from entering the cytoplasm where it is extremely toxic for plant cells. It follows that different cell wall structures and textures might bind Cd differently and thus provide different levels of protection. Comparable results were found in strawberry, where treatment with different concentrations of Cd highlighted that the cell wall of leaves and roots is the primary reservoir for Cd, with root cell walls showing higher binding capacity than leaf cell walls (Zhang et al., 2010a). The fact that roots display a higher accumulating capacity than leaves can be an index of their protective activity, preventing the entry and accumulation of Cd in the cytoplasm. If the concept of the cell wall as a barrier is true, the explanation for the differential accumulating capacity of plant cell walls has to be found in their chemical and physical diversity. This diversity can cover not only the polysaccharide component of the cell wall, but also the protein fraction, or modifications induced on one, the other or both components.

#### **PLANT CELL WALL COMPOSITION IN DIFFERENT ORGANS AND TISSUES: LINK WITH Cd ABSORPTION AND TRANSLOCATION**

Since Cd is primarily adsorbed at the root level, it follows that the anatomy and molecular structure of the cell wall in root cells is a critical parameter. Findings in different plant species indicate that roots with higher content of suberin and lignin may be more impermeable to Cd and therefore more resistant to its absorption and translocation (Lux, 2010). Increased accumulation of suberin was observed in the roots of the monocotyledonous medicinal plant *Merwilla plumbea* exposed to Cd, a process interpreted as a protective response against penetration of Cd in the cells (Lux et al., 2011). The presence of root mucilage is another factor playing an important role in Cd adsorption, as shown in sunflower (Yang and Pan, 2013). The differential structure of roots might explain why Cd translocation is different between plants and variable even within the same taxonomic group.

Screening of different clones of *Salix* grown on identical real-life soil revealed remarkable differences in the translocation of Cd and other metals (Evlard et al., 2014b). Similar differences were found in solanaceous plants (Yamaguchi et al., 2011), different poplar species (He et al., 2013) and between ecotypes of the hyperaccumulators *Thlaspi caerulescens* and *T. praecox* (Xing et al., 2008).

In plants restraining Cd in roots, Cd is sequestered in the endodermis and the cortical regions of roots. When this filter fails, Cd can penetrate the xylem vessels and translocate to the shoot. Therefore, the specific composition of cell walls in the endodermis and cortex is critical (Akhter et al., 2014). Ultimately, the cell wall of roots is considered an effective barrier against penetration of Cd, and a relationship can be established between higher accumulation of Cd in the roots *vs* lower accumulation in the leaves (Sun et al., 2013). This suggests that Cd tolerance depends on the capacity of roots to limit the diffusion of Cd towards the entire plant.

The protective activity exerted by the cell wall may not only depend on the polysaccharide fraction, but also on other cell wall molecules that can bind Cd and other heavy metals. As earlier suggested by Sanità Di Toppi and Gabbrielli (1999), the response of later-diverging plants to Cd may involve different types of molecular mechanisms, from the phytochelatin-based sequestration, to cell wall immobilization and plasma membrane exclusion, to the use of stress proteins. In this context, the cell wall (by acting as a molecular barrier) may be part of both adaptation mechanisms to permanent pollution and response mechanisms to acute stress. However, additional factors might be required to achieve full functionality and protection against Cd pollution. In fact, in addition to the binding of Cd to polysaccharides and proteins, tolerance to Cd might also involve the sequestration of Cd complexes in the cell wall. In crop plants such as maize and wheat, addition of P concomitantly to Cd treatment increased the tolerance of plants to Cd (Zhimin et al., 1999). P may associate with Cd thereby generating insoluble complexes that remain within the cell wall. Combination of binding to P and polysaccharides is also possible.

Therefore, the addition of P to contaminated soils seems to modify their physicochemical status, making Cd less available for plants and thereby improving the tolerance of plants to Cd pollution. In addition, Cd accumulates more consistently in the cell wall of plants growing on P-supplemented soils, which therefore hinders its entry into the cytoplasm (Siebers et al., 2013).

If the differential capacity of plants to resist Cd is linked to the chemical variability of their cell walls, it is clear that such variability can be consistently increased by modifying the protein component of the cell wall. Specific cell wall proteins could provide an additional barrier by binding Cd and thus preventing it from penetrating into the cytoplasm. In barley treated with Ni and/or Cd, the apoplastic fraction of leaves showed an increased amount of apoplastic proteins that accumulate concomitantly with the metal treatment (Blinda et al., 1997). Therefore, the response of plants to contaminating metals may also involve the specific *de novo* synthesis of proteins that potentially make plants more tolerant. Proteins not directly involved in Cd binding have also been reported to increase Cd tolerance in plants. As previously discussed, an example is represented by the cysteine-rich proteins identified in *Digitaria ciliaris* and *Oryza sativa*. The genes coding for these proteins were found to be upregulated following Cd treatment, and transgenic plants overexpressing cysteine-rich proteins are more tolerant to Cd stress because entry of Cd into the cytoplasm is prevented (Kuramata et al., 2009). This finding also highlights the possibility of manipulating plants in order to strengthen their mechanisms of tolerance to pollutants.

Accumulation of Cd in the cell wall (or more generally in the apoplast) is also dependent on the efficient transport of Cd through the plasma membrane, so that cytoplasmic Cd can be transported outside and incorporated in the cell wall. In this way, the content of cytoplasmic Cd decreases while Cd accumulates in the cell wall. Transport of Cd through the plasma membrane is likely dependent on membrane transporters. Their involvement is also suggested by indirect evidence showing that, while Cd is promptly adsorbed in the cell wall, its influx into the symplast is linear with Cd concentration, which suggests the activity of low-affinity transport systems (Redjala et al., 2009). Recently the transporter IRT1 (IRON-REGULATED TRANSPORTER 1) was shown to be involved not only in the uptake of iron, but also of toxic metals, including Cd (Barberon et al., 2014). This root transporter is recycled via a mechanism involving a phosphatidylinositol-3-phosphate-binding protein, FYVE1, which controls its delivery to the outer plasma membrane domain of root epidermal cells.
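The link drawn above between a linear symplastic influx and low-affinity transport can be made explicit with a standard Michaelis-Menten description of carrier-mediated uptake. This is only an illustrative sketch: the symbols $J$ (influx), $J_{\max}$ (maximal influx), $K_m$ (half-saturation constant) and $C$ (external Cd concentration) are introduced here for illustration and are not taken from Redjala et al. (2009).

```latex
\[
J(C) = \frac{J_{\max}\, C}{K_m + C},
\qquad
C \ll K_m \;\Longrightarrow\; J(C) \approx \frac{J_{\max}}{K_m}\, C
\]
```

Under this assumption, a transporter whose $K_m$ is high relative to the tested concentrations (i.e., a low-affinity system) yields an apparently linear, non-saturating influx, whereas a high-affinity system would approach $J_{\max}$ within the same concentration range.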

These observations stress the importance of studying the cell wall structure of both tolerant and resistant plants in order to identify those molecules that provide the highest protection against Cd. One example of an important molecule is lignin. The protective activity of lignin is suggested by pretreatment of wheat plants with wheat germ agglutinin, which generally exerts an amelioration of Cd effects. This pretreatment likely involves a different balance between hormone levels but an important phenotypic effect is an accelerated lignification of cell walls (Bezrukova et al., 2011). Such secondary modifications might render cell walls less permeable to Cd thus limiting its entry in the cell. Lignin can also bind Cd thereby inhibiting its diffusion into the cells. Higher abundance of lignin in secondary cell walls might thus make plants more tolerant to the effects of Cd treatment.

In *Pisum sativum* leaves, lignin was shown to be deposited following Cd stress and this process was accompanied by an oxidative burst in the xylem vessel cell walls (Rodríguez-Serrano et al., 2009). Moreover, Cd was shown to favor the process of xylogenesis, by inducing the activity of enzymes via H<sub>2</sub>O<sub>2</sub> accumulation (Schützendübel et al., 2001). The use of sophisticated labeling techniques involving quantum dot technologies can highlight the interaction between Cd and specific cell wall components (Djikanovic et al., 2012). The multitude of activities of the cell wall as a barrier against Cd pollution is schematically summarized in **Figure 1**.

## **THE CELL WALL AS TARGET: EFFECTS OF Cd ON CELL WALL SYNTHESIS AND COMPOSITION**

In the previous paragraphs, the cell wall has been considered as a potential reservoir for Cd and thus as an efficient barrier to its penetration in the plant. In some cases, Cd is not efficiently neutralized and it might become an injuring factor by affecting the structure of cell walls. The toxic activity of Cd can be exerted at different levels. Cd can alter the physical structure of the cell wall by interacting with negatively charged molecules (i.e., carboxyl groups and sulfates). In this way, Cd can substantially modify the cell wall resistance to turgor pressure thereby making the cells weaker. In addition to interacting with the polysaccharide component, Cd can interfere with the enzymatic activities present in the cell walls by inhibiting the enzymatic reactions that strengthen the cell wall structure. Furthermore, Cd can affect directly or indirectly the synthesis of cell wall components, or interfere with the transport of the latter to the final destination.

Presence of Cd also affects the deposition pattern of pectins. In the hypocotyl of *Linum usitatissimum*, an excess of Cd changes the ratio between low and high methylesterified pectins, causing the former to accumulate consistently in epidermal cells and leading to the collapse of the subepidermal cell layer (Douchiche et al., 2007). This imbalance affects the primary walls and may cause swelling of hypocotyl tissues. It is not clear how Cd causes such effects, but it is likely that it alters the expression level of pectin-metabolizing enzymes such as pectin methyl-esterase (PME). Overexpression of this enzyme can change the ratio between low and high methyl-esterified pectins by altering their level of esterification. This leads to a relatively higher abundance of acidic pectins and severely impairs the shaping of plant cells, which is based on a precise balance in the degree of esterification. In addition to changing the degree of esterification of pectins, Cd can also affect the chemistry of these polysaccharides more generally, for example by altering the expression of peroxidases, which in turn alter the chemical structure of homogalacturonans (Paynel et al., 2009). The effects caused by Cd on the development of cell walls and, more specifically, on pectins might be more subtle and not immediately appreciable. This is the case for flax fibers, where Cd treatment affects the deposition of secondary cell walls in terms of adhesion of cellulose microfibrils. By contrast, the effects on both low and high methylesterified pectins are not immediately explicable (Douchiche et al., 2011).

That pectins are important in binding Cd is demonstrated by experiments with plants grown under P deficiency. Lower levels of P alleviate the negative effects caused by Cd because P deficiency reduces the content of pectins and PME in the cell wall. The reduced level of this enzyme further shifts the ratio between low and high methyl-esterified pectins, leaving fewer active sites available for Cd binding and thereby triggering a heavy metal-exclusion mechanism (Zhu et al., 2012).

Further studies on the role of pectins in response to Cd exposure were carried out in rice. The addition of NO donors concurrently with Cd treatment makes plants more resistant to Cd injuries. The resistance is mainly based on the overproduction of pectins and on the simultaneous reduction of cellulose levels (Xiong et al., 2009). It is likely that pectins adsorb Cd via their carboxyl groups, thus preventing the metal from entering the cytoplasm. It would be interesting to analyze whether NO donors alter the expression of genes coding for enzymes involved in pectin synthesis. The possible role of NO in alleviating the symptoms caused by Cd treatment has already been reviewed (Xiong et al., 2010).

The binding of toxic elements such as heavy metals to pectin in the cell walls has long been observed and described (Malovikova and Kohn, 1982). There is a vast literature on the chemical interaction between heavy metals and pectins (Krzesłowska, 2011), although the exact mechanism of damage caused by Cd on pectins is not completely clear. It is likely that Cd can substitute for calcium ions, thereby modifying the rigidity of the pectin skeleton. This has been suggested in the alga *Ulva lactuca*, where the rhamnose subunits are cross-linked by calcium and the replacement of calcium by Cd can decrease the rigidity of the cell wall (Webster and Gadd, 1996). It is noteworthy that calcium can alleviate the injuries caused by Cd, likely either because calcium can compete for binding to cell wall polysaccharides, or because Cd might use calcium channels to penetrate inside cells (Suzuki, 2005). At the cellular level, Cd can induce remarkable aberrations in the cell wall structure, resulting in changes in cell shape (Sawidis and Reiss, 1995; Sawidis, 2008). These observations are consistent with the hypothesis that Cd interferes with the process of cell wall structuring, probably by altering the exact arrangement of pectins. In some experimental cases, exposure to soils contaminated with Cd induces a general increase in the levels of pectin and a reduction of methylesterified pectins (Astier et al., 2014). Nevertheless, understanding exactly the alterations induced by Cd on the structure of the pectin skeleton is not simple, and the use of antibodies directed against different chemical forms of pectin has not helped to define exactly how Cd damages the pectin fraction of the cell wall (Douchiche et al., 2011).
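The competition between calcium and Cd for wall binding sites mentioned above can be pictured with a simple competitive Langmuir-type isotherm. This is a minimal sketch that assumes a single class of equivalent, independent sites, which is a strong simplification for pectins; the symbols $\theta_{\mathrm{Cd}}$ (fraction of sites occupied by Cd) and $K_{\mathrm{Cd}}$, $K_{\mathrm{Ca}}$ (apparent binding constants) are hypothetical and are not values or notation from the cited studies.

```latex
\[
\theta_{\mathrm{Cd}} =
\frac{K_{\mathrm{Cd}}\,[\mathrm{Cd}^{2+}]}
     {1 + K_{\mathrm{Cd}}\,[\mathrm{Cd}^{2+}] + K_{\mathrm{Ca}}\,[\mathrm{Ca}^{2+}]}
\]
```

Under these assumptions, raising $[\mathrm{Ca}^{2+}]$ at a fixed $[\mathrm{Cd}^{2+}]$ enlarges the denominator and lowers $\theta_{\mathrm{Cd}}$, which is one way to read the proposal that calcium competes with Cd for binding to cell wall polysaccharides. The expression does not capture the alternative explanation based on calcium channels.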

Direct effects of Cd on the synthesis of other polysaccharides (such as cellulose and callose) are not known. However, it was reported that Cd triggers the accumulation of specific Cd-induced glycine-rich proteins (cdiGRPs). These are cell wall-associated proteins that might positively regulate the synthesis of callose. The cdiGRPs likely work in cooperation with an additional cell wall protein, the so-called GrIP (cdiGRP-interacting protein; Ueki and Citovsky, 2005). The model proposed by the authors suggests that Cd increases the post-translational accumulation of cdiGRP, which is also dependent on the expression of GrIP. Once accumulated in the cell wall, cdiGRP enhances the production and/or accumulation of callose, which consequently represents a cell wall adaptation to Cd pollution. Accumulation of callose in response to metal stress probably does not involve altered mechanisms of callose removal. In maize and soybean, treatment with metals (including Cd) induces the accumulation of callose in the cell wall. Results indicate that the accumulation is not related to decreased enzymatic activity of glucanases and reinforce the hypothesis that accumulation of callose is primarily dependent on a higher rate of synthesis and/or deposition (Piršelová et al., 2012). As for callose, there are few reports on the relationship between Cd and cellulose, and the relationship between Cd pollution and cellulose synthesis is far from being completely understood.

The effects of Cd on cell wall synthesis might also be indirect. Cd is known to affect the level of apoplastic sucrose, probably by altering the activity of cell wall invertase. Consequently, sucrose accumulates in the cell wall and decreases in the cytoplasm (Podazza et al., 2006). Lower levels of cytoplasmic sucrose can have detrimental effects on cell wall synthesis by making substrates for the biosynthesis of cell wall precursors less available. Synthesis of both cellulose and callose requires UDP-glucose, which can be produced by two different metabolic pathways. The first requires the enzyme UDP-glucose pyrophosphorylase (UGPase), which catalyzes the reversible production of UDP-glucose and pyrophosphate from glucose-1-phosphate and UTP; the second involves the activity of sucrose synthase, which breaks down sucrose (in the presence of UDP) into UDP-glucose and fructose (Kleczkowski et al., 2010). Lower levels of cytoplasmic sucrose may reduce the activity of sucrose synthase and consequently may lower the production rate of UDP-glucose. As sucrose synthase is associated with both callose synthase and cellulose synthase (Brill et al., 2011), it follows that the Cd-induced reduction of cytoplasmic sucrose might affect the synthesis of cellulose and callose.
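The two routes to UDP-glucose described above can be summarized as reaction equations. The abbreviations Glc-1-P, UDP-Glc and PP$_i$ are shorthand introduced here; the second reaction is written with UDP as the nucleotide co-substrate of sucrose synthase, which is the standard stoichiometry of that enzyme.

```latex
\begin{align*}
\text{Glc-1-P} + \text{UTP}
  &\;\rightleftharpoons\; \text{UDP-Glc} + \text{PP}_i
  && \text{(UGPase)} \\
\text{sucrose} + \text{UDP}
  &\;\rightleftharpoons\; \text{UDP-Glc} + \text{fructose}
  && \text{(sucrose synthase)}
\end{align*}
```

Written this way, it is apparent that only the second route consumes sucrose directly, so a drop in cytoplasmic sucrose would be expected to limit the sucrose synthase route in the first place.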

Another pathway by which Cd can affect the synthesis and deposition of cell wall is indirect and relates to the effects of Cd on cytoplasmic streaming. It is known that delivery of cell wall synthesizing enzymes requires an intact cytoskeleton and an intact membrane system: cell wall synthesizing enzymes traffic along the cytoskeleton, which guides the enzymes towards the final insertion sites in the plasma membrane (Crowell et al., 2010). The precise balance between insertion and removal of cell wall synthesizing enzymes is critical to correctly deposit the cell wall and to assemble a proper cell wall texture. Cd affects the intracytoplasmic movement of organelles and vesicles in the epidermal cells of the model plant *A. cepa* (Wierzbicka et al., 2007). Cytoplasmic streaming disorders are likely to impact negatively on the cell wall assembly. We do not know how Cd affects organelle streaming, whether through an unspecific pathway (for example, by altering the ionic homeostasis of cells), or through a more specific mechanism that directly inhibits the proteins involved in organelle transport. Nevertheless, it is clear that Cd has a considerable influence on the structure of actin filaments and consequently on cytoplasmic streaming. In *A. thaliana* root hairs, Cd alters the balance between influx and efflux of Ca<sup>2+</sup>, which consequently abolishes the Ca<sup>2+</sup> gradient. Since Ca<sup>2+</sup> regulates the dynamics of actin filaments in tip-growing cells, the Cd-induced imbalance of Ca<sup>2+</sup> has negative effects on actin filaments that in turn impair the motility of organelles and vesicles, as reviewed by Wan and Zhang (2012). As observed by the authors, the cell wall structure is also affected because of the altered cytoplasmic streaming (Fan et al., 2011). The relationship between Cd and Ca<sup>2+</sup> is likely more complex than expected in view of the finding that a member of the plant cadmium-resistance (PCR) protein family is a Ca<sup>2+</sup> efflux transporter, which is involved in the finer regulation of intracellular Ca<sup>2+</sup> concentration (Song et al., 2011). The global effects of Cd on cell wall synthesis are summarized in **Figure 2**.

### **Cd ACCUMULATION IN PLANTS: THE CASE OF HYPERACCUMULATORS AND THE EFFECT OF NUTRIENTS**

In the previous sections, data concerning how Cd can be a harmful element for plants and how plants can respond to damage caused by Cd were presented and discussed. Emphasis was put on how evolutionarily-distant plants can respond in different ways to Cd treatment; this is an indication that several different mechanisms have evolved in response to Cd. In this section, the focus will be put on those plants showing heavy metal hyperaccumulating capacity, since they represent an extreme case. Moreover, the study of their cell walls can lead to important observations, which can be transferred to biotechnological approaches. In this section considerations on the role of macro- and micro-nutrients in the increased Cd accumulation/tolerance will also be made, since this aspect is linked with the potential of enhancing the plant capacity to accumulate Cd.

Hyperaccumulator plants are characterized by the extraordinary capacity of accumulating Cd and of translocating it towards the aerial parts of plants (namely leaves) and of detoxifying the element locally (Rascio and Navari-Izzo, 2011). At the level of leaves, Cd can be accumulated in specific sites, such as the vacuoles or the cell wall. It is therefore possible to assume that in hyperaccumulators the ability of the cell wall to absorb Cd is so high that Cd fails to actively penetrate inside the cytoplasm and remains confined within the cell wall. Even if the molecular mechanisms working in hyperaccumulator plants are the same as those present in Cd-resistant plants, the main difference is that the former can concentrate Cd in the cell wall at extreme levels without apparent damage. There are many examples of hyperaccumulators and, as the study is continuing, new species are added to this list. The root hairs of *T. caerulescens*, a metal hyperaccumulator, can accumulate consistent levels of Cd in their cell walls thereby preventing Cd from penetrating in the cytoplasm (Nedelkoska and Doran, 2000). Massive accumulation of Cd in hyperaccumulator plants is also based on complexes with phosphates within the cell wall. However, not all cell types accumulate Cd at the same level (Küpper et al., 2000). Accumulation of Cd and Zn may have particular relevance in hyperaccumulator plants, such as *Sedum plumbizincicola*, in which heavy metals are chiefly accumulated in shoots and relatively less in roots (Cao et al., 2014). *Dittrichia viscosa* is another model-accumulator plant that accumulates large amounts of Cd in the cell wall (Fernández et al., 2014).

The ability of certain species to accumulate Cd is linked not only to a higher ROS scavenging activity, but also to Fe homeostasis: studies carried out on the Cd-accumulator *Solanum nigrum* and the low Cd-accumulating *Solanum torvum* showed that Cd treatment caused a lower Fe accumulation in *S. torvum* compared to *S. nigrum* (Xu et al., 2012b). This difference is related to the increased expression of the Fe transporters IRT1 and IRT2 in the roots of the Cd-accumulator *S. nigrum* (Xu et al., 2012b).

The most important application of hyperaccumulator plants is linked to the possibility of decontaminating polluted sites (see also "Future outlook"). Consequently, researchers are looking for hyperaccumulators that can be used to detoxify contaminated soils in an economically feasible and reasonable way. This has been already proposed for two species of *Iris* (Han et al., 2007). Those plants, capable of restoring optimal conditions in contaminated soils, have evolved molecular mechanisms to tolerate high concentrations of contaminants. The acquisition of tolerance to contaminants involves the development of mechanisms for the accumulation of pollutants in specific cell sites, such as the vacuole and cell wall of root cells, where Cd might be complexed with insoluble phosphates that strongly limit the translocation of Cd to aboveground tissues and organs (Zhang et al., 2014b). This suggests that the hyperaccumulating capacity of cell walls can be induced or improved by adding specific chemical groups. This can involve both the polysaccharide and the protein component of the cell wall. Studies on transgenic tobacco plants overexpressing a xyloglucan endotransglucosylase/hydrolase gene support this view. In those plants, xyloglucans accumulated at lower levels, which limits the binding of Cd. This suggests that xyloglucans could be relevant binding sites for Cd and are involved in Cd tolerance: plants with lower levels of xyloglucans accumulate lower amounts of this heavy metal (Han et al., 2014).

Another significant example is represented by the addition of Si. Si considerably reduced the net influx of Cd, likely by producing Si-cell wall complexes that efficiently adsorb Cd, thus preventing its entry into the cytoplasm (Liu et al., 2013). In such a way, the cell wall can be potentiated, thus becoming a more effective barrier against Cd. The ameliorating activity of Si is dependent on its concentration, as well as on the concentration of Cd, and Si has a major effect on Cd influx by modifying the cell wall texture. However, Si does not affect intracellular processes like photosynthesis (Lukacová et al., 2013). In mangrove plants, Si is known to alleviate the effects of Cd pollution by restricting Cd to the cell wall of roots and reducing the concentration of Cd in the symplast. Therefore, Si enhances the capacity of cell walls to restrain Cd, thereby limiting its diffusion (Zhang et al., 2014a).

Improving the hyperaccumulating capacity of plants may not be simple, as the number of genes involved may be large. A preliminary assay in the hyperaccumulator *S. alfredii* revealed that plants under Cd stress respond by modifying the expression level of more than 100 genes responsible for different functions (including cell wall modification; Gao et al., 2013). Comparable analyses using suppression subtractive hybridization screens showed that the activity of several genes is important in making plants of *Salix caprea* tolerant to Cd pollution. Some of the investigated genes belong to the family of cell wall modifying enzymes (Konlechner et al., 2013). This finding opens the way to the identification of potential genes (and gene products) capable of raising the tolerance of plants to Cd.

The study of hyperaccumulators' cell walls will lead to the identification of those functional chemical groups capable of adsorbing high amounts of Cd, thereby preventing its entry into the cytoplasm. In addition to understanding the chemistry of this process, it is also necessary to identify the cellular mechanisms that allow such chemical groups to be added to the cell wall (e.g., specific enzymes that modify the chemical structure of cell wall components). It follows that the development of hyperaccumulator plants requires a high level of knowledge. Ideally, the most convenient result would be to combine the hyperaccumulating capacity of non-crop plants (in which this process has been widely studied) with the economic advantages of crop plants. By transferring the genetic information underlying Cd resistance into crop plants, it will be possible to design improved plants capable of simultaneously detoxifying the soil and maintaining its productivity (Wu et al., 2010).

## **PLANT BIOTECHNOLOGY AND Cd TOXICITY: WHAT SHOULD BE CONSIDERED?**

The examples listed previously show that plants are equipped with different mechanisms to cope with the toxicity induced by heavy metals (Cd, specifically). Studying the variety of responses will allow us to identify specific mechanisms used to tolerate large doses of pollutant. Those mechanisms (based either on chelating proteins, on general biochemical responses, or on sequestration in the vacuole/cell wall) may not necessarily be present in plants of economic interest and may concern either plant products or plant waste. For example, an interesting application of the detoxifying ability of cell walls is the use of waste vegetable components rich in pectins as biosorbents for heavy metals (Wan Ngah and Hanafiah, 2008). Knowledge of these biosorbent mechanisms can be used to develop plants capable of sequestering Cd (hyperaccumulators; Douchiche et al., 2010). Pectins can also find applications outside the plant context. Fruits rich in pectin can absorb significant amounts of heavy metals (Schiewer and Patil, 2008). Such technologies can be extended to pectin gels extracted from commercial plants, such as sugar beet. These studies, in addition to demonstrating that pectin-enriched matrices can actually be applied for Cd removal, confirm that Cd binds pectin matrices by displacing calcium ions (Mata et al., 2009, 2010). Other food wastes have been proposed in this regard, for example citrus peels (Schiewer and Iqbal, 2010).

Several studies have shown altered Cd accumulation in plants engineered to express genes involved in heavy metal uptake, translocation or chelation. For example, as previously mentioned, tobacco plants expressing the Zn and Cd translocator *HMA4* from *A. thaliana*, which controls translocation to the shoot (Wong and Cobbett, 2009), show decreased Cd accumulation, but also deep transcriptional remodeling leading to increased lignification (Siemianowski et al., 2014). Restricting the examples to the cell wall, which is the topic of this review, the upregulation of a peroxidase, *O*-methyltransferase and hydroxycinnamoyl reductase leads to an increased deposition of lignin in a specific layer between the root epidermis and the first cortical layer, which consequently blocks Cd apoplastic movement towards the stele (Siemianowski et al., 2014). When exposed to Cd, however, these transgenic tobacco plants show an enhanced Fe and Zn deficiency status linked to the overexpression of two metal uptake genes, *ZIP1* and *IRT1* (Siemianowski et al., 2014). These results show how challenging it is to improve heavy metal tolerance in plants through biotechnology, as plants overexpressing a specific gene might show unfavorable or unexpected features linked to the response of the host to the transgene.

The insertion of new genes whose products change the chemical structure of the cell wall might be a way to engineer plants that are more Cd-resistant. An alternative strategy would be to introduce specific mutations in genes already present. A mutation in one gene coding for a specific cellulose synthase subunit significantly alters the cellulose content and decreases the cell wall thickness in rice plants. Although there were no clear phenotypic differences in comparison to control plants, mutant plants showed a significant decrease in the translocation rate of Cd along xylem vessels, which appeared abnormal in shape. The new morphological structure of the xylem, although not impairing the overall shape of the plants, results in a consistent reduction in the transport of Cd to leaves, thereby making rice plants more tolerant to Cd pollution (Song et al., 2013). Together with other studies, this finding highlights that the translocation of Cd from roots to shoots is likely more critical than the mere uptake of Cd at the root level (Xin et al., 2013). Therefore, the accumulation of Cd in the leaves seems more closely related to the translocation rate than to Cd uptake, which reinforces the need to study how Cd is translocated through the xylem and which factors can be modified or regulated to slow this process.

The presence of the so-called "metal cross-homeostasis" has been identified as a primary factor influencing the phenotype of plants with a modified expression of genes involved in metal homeostasis (Antosiewicz et al., 2014). This cross-homeostasis is primarily due to the cross-talk existing between the homeostatic mechanisms of different metals: for example the concentration of different metals regulates the same players of the transcriptional wiring, the same transporters and chelators (Antosiewicz et al., 2014).

Biotechnological approaches represent the most effective way to engineer heavy metal accumulation in plants. In particular, three elements are relevant to devising effective engineering approaches for heavy metal uptake/accumulation in plants: (1) the choice of specific promoters, for instance tissue-specific promoters (Antosiewicz et al., 2014), which restrict any modifications to a specific tissue or organ; (2) integrative studies aimed at deciphering the response of plants to a specific metal at the transcriptional, protein and metabolic levels; (3) the choice of the host in which to express a specific transgene. This last feature is very relevant, as fast-growing plants with high biomass yield should be preferred. As an example, fiber crops like hemp (*Cannabis sativa* L.) represent very good candidates for engineering approaches. Hemp has a deep root system and a fast growth rate, and is known to require less water than other crops (like cotton): all these features make it extremely interesting for biotechnology. Moreover, the hemp genome has been sequenced (van Bakel et al., 2011), which can greatly favor the design of constructs, e.g., for synthetic biology. Last but not least, the tissues of hemp stems show a great diversity of chemical composition: the core is lignified, while the cortex is rich in cellulosic bast fibers. This heterogeneity in cell wall composition offers a variety of functional groups potentially useful for heavy metal sequestration in hemp stem tissues.

## **FUTURE OUTLOOK: WHAT COMES NEXT?**

To conclude this review, a list of points that need further study is proposed below, with the aim of achieving a more comprehensive knowledge of the mechanisms involved in Cd toxicity in plants and of devising new strategies for the biotechnological improvement of plant tolerance to toxic heavy metals.


there might be several possibilities for the cell wall to interfere with the toxic activity of Cd. Therefore, the study of cell walls (in terms of assembly and composition) in hyperaccumulators can provide important information on how to improve the survival or the productivity of plants growing on Cd-contaminated soils. In this context, it is important to identify the natural targets of Cd at the cell wall level and to figure out which of these targets can be modified to prevent the negative effects of Cd. One approach to develop this point is the functional study of those targets via generation of mutants/overexpressors. The purpose of this approach is to generate plants that either cannot absorb Cd or can sequester Cd in the cell wall thereby reducing or avoiding detrimental effects in terms of growth or productivity.

(3) Studies on the effects induced at the cell wall level by the application of beneficial mineral nutrients deserve further attention. In particular, the role of Si and P in alleviating the toxic effects of Cd should be analyzed via an integrative biology approach to identify the targets of this interaction.

## **ACKNOWLEDGMENTS**

GG, KS, and JFH acknowledge partial financial support obtained through the National Research Fund, FNR Project CANCAN C13/SR/5774202 and project CADWALL INTER/FWO/12/14. The authors also thank Lucien Hoffmann for critical reading. The conception of this Review is linked to activities conducted within the COST FP1105 and COST FP1106 actions.

#### **REFERENCES**


exposed barley seedlings. *Plant Cell Environ.* 20, 969–981. doi: 10.1111/j.1365-3040.1997.tb00674.x


accessions of the hyperaccumulators *Thlaspi caerulescens* and *Thlaspi praecox*. *New Phytol.* 178, 315–325. doi: 10.1111/j.1469-8137.2008.02376.x


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 27 October 2014; accepted: 19 February 2015; published online: 13 March 2015.*

*Citation: Parrotta L, Guerriero G, Sergeant K, Cai G and Hausman J-F (2015) Target or barrier? The cell wall of early- and later-diverging plants vs cadmium toxicity: differences in the response mechanisms. Front. Plant Sci. 6:133. doi: 10.3389/fpls.2015.00133*

*This article was submitted to Plant Genetics and Genomics, a section of the journal Frontiers in Plant Science.*

*Copyright* © *2015 Parrotta, Guerriero, Sergeant, Cai and Hausman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*