# GENOMICS IN AQUACULTURE TO BETTER UNDERSTAND SPECIES BIOLOGY AND ACCELERATE GENETIC PROGRESS

EDITED BY: José Manuel Yáñez, Ross Houston and Scott Newman PUBLISHED IN: Frontiers in Genetics

#### *Frontiers Copyright Statement*

*© Copyright 2007-2016 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

> *All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88919-957-0 DOI 10.3389/978-2-88919-957-0


## **GENOMICS IN AQUACULTURE TO BETTER UNDERSTAND SPECIES BIOLOGY AND ACCELERATE GENETIC PROGRESS**

## Topic Editors:

**José Manuel Yáñez**, University of Chile, Chile

**Ross Houston**, University of Edinburgh, UK

**Scott Newman**, Genus plc, USA

Salmon eyed-eggs. Image by Aquainnovo S.A.

From a global perspective aquaculture is an activity related to food production with large potential for growth. Considering a continuously growing population, the efficiency and sustainability of this activity will be crucial to meet the needs of protein for human consumption in the near future. However, for continuous enhancement of the culture of both fish and shellfish there are still challenges to overcome, mostly related to the biology of the cultured species and their interaction with (increasingly changing) environmental factors. Examples of these challenges include early sexual maturation, feed meal replacement, immune response to infectious diseases and parasites, and temperature and salinity tolerance.

Moreover, it is estimated that less than 10% of the total aquaculture production in the world is based on populations genetically improved by means of artificial selection. Thus, there is considerable room for implementing breeding schemes aimed at improving productive traits having significant economic impact. By far the most economically relevant trait is growth rate, which can be efficiently improved by conventional genetic selection (i.e. based on breeding values of selection candidates). However, there are other important traits that cannot be measured directly on selection candidates, such as resistance against infectious and parasitic agents and carcass quality traits (e.g. fillet yield and meat color). These traits can be more efficiently improved using molecular tools to assist breeding programs, either by marker-assisted selection, which uses a few markers explaining a high proportion of the trait variation, or by genomic selection, which uses thousands of markers to estimate genomic breeding values. The development and implementation of new technologies applied to molecular biology and genomics, such as next-generation sequencing methods and high-throughput genotyping platforms, are rapidly increasing the availability of genomic resources in aquaculture species. These resources will provide powerful tools to the research community and will aid in the determination of the genetic factors involved in several biological aspects of aquaculture species. In this regard, it is important to discuss which strategies will be most efficient in solving the primary challenges that are affecting aquaculture systems around the world.

The main objective of this Research Topic is to provide a forum to communicate recent research and implementation strategies in the use of genomics in aquaculture species with emphasis on (1) a better understanding of fish and shellfish biological processes having considerable impact on aquaculture systems; and (2) the efficient incorporation of molecular information into breeding programs to accelerate genetic progress of economically relevant traits.

**Citation:** Yáñez, J. M., Houston, R., Newman, S., eds. (2016). Genomics in Aquaculture to Better Understand Species Biology and Accelerate Genetic Progress. Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-957-0

# Table of Contents

*Genomics in aquaculture to better understand species biology and accelerate genetic progress*

José M. Yáñez, Scott Newman and Ross D. Houston

## **Reviews**

Paulino Martínez, Ana M. Viñas, Laura Sánchez, Noelia Díaz, Laia Ribas and Francesc Piferrer

## **Mini Reviews**

*Genetic considerations for mollusk production in aquaculture: current state of knowledge*

Marcela P. Astorga

## **Original Research Articles**

*Primary analysis of repeat elements of the Asian seabass (***Lates calcarifer***) transcriptome and genome*

Inna S. Kuznetsova, Natascha M. Thevasagayam, Prakki S. R. Sridatta, Aleksey S. Komissarov, Jolly M. Saju, Si Y. Ngoh, Junhui Jiang, Xueyan Shen and László Orbán

David Marancik, Guangtu Gao, Bam Paneru, Hao Ma, Alvaro G. Hernandez, Mohamed Salem, Jianbo Yao, Yniv Palti and Gregory D. Wiens

Ali Ali, Caird E. Rexroad, Gary H. Thorgaard, Jianbo Yao and Mohamed Salem

## **Perspectives**

Marc Vandeputte and Pierrick Haffray

*Genetic improvement of Pacific white shrimp [***Penaeus (Litopenaeus) vannamei***]: perspectives for genomic selection*

Héctor Castillo-Juárez, Gabriel R. Campos-Montes, Alejandra Caballero-Zamora and Hugo H. Montaldo

## Genomics in aquaculture to better understand species biology and accelerate genetic progress

José M. Yáñez<sup>1,2</sup>\*, Scott Newman<sup>3</sup> and Ross D. Houston<sup>4</sup>

<sup>1</sup> Faculty of Veterinary and Animal Sciences, University of Chile, Santiago, Chile, <sup>2</sup> Aquainnovo, Puerto Montt, Chile, <sup>3</sup> Genus plc, Hendersonville, TN, USA, <sup>4</sup> The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, UK

Keywords: aquaculture, genome, breeding programs, QTL, single nucleotide polymorphisms, next-generation sequencing

The production of fish and shellfish through aquaculture is an increasingly important source of high-quality animal protein, with a worldwide production of 66.6 million tons in 2012 (FAO, 2014). Considering the continuously growing global human population and increasing demand for fish products, improvements in the scale, efficiency, and sustainability of aquaculture are essential. To achieve this, several challenges facing the culture of fish and shellfish species need to be overcome. These relate to the diverse biology of the cultured species and their interaction with environmental factors. Examples include outbreaks of infectious diseases, control of sexual maturation, sustainable feed for carnivorous species, and tolerance of diverse and changing environments. This "Frontiers in Livestock Genomics" Research Topic highlights the opportunities offered by recent developments in the field of genomics, and in particular high-throughput sequencing, to contribute to addressing these challenges, with a focus on selective breeding programmes.

The use of selective breeding as a tool to improve the biological efficiency of production in aquaculture generally lags behind plant and farm animal industries, and less than 10% of aquaculture production is based on genetically-improved stocks (Gjedrem et al., 2012). Encouragingly, annual genetic gains reported for aquatic species are in general substantially higher than those for terrestrial farm animals (Gjedrem et al., 2012) and there is considerable scope for achieving significant positive economic impact via improved breeding schemes. However, the status of breeding programs and the level of technology used for aquatic species production are wide-ranging, from use of wild seed stocks through to family-based selection incorporating genomic tools. Family selection and genomic tools can be applied to improve traits that are expensive or difficult to measure on the selection candidates themselves, including disease resistance (Yáñez et al., 2014; Ødegård et al., 2014), flesh color (Colihueque and Araneda, 2014; Ødegård et al., 2014) and other appearance traits such as body shape and skin pigmentation (Colihueque and Araneda, 2014) in finfish species. In contrast, despite the global importance of mollusc species for aquaculture, few selective breeding programmes exist and genomic tools and knowledge for these species are typically lacking (Astorga, 2014).

Genomic resources such as whole genome reference sequences, high-density SNP genotyping arrays and genotyping-by-sequencing are in development for several aquaculture species. Fuller characterisation of these resources is underway and is resulting in improved fundamental knowledge of genome structure and biology, highlighted in this issue by the analysis of repeat elements in the Asian sea bass genome (Kuznetsova et al., 2014). These resources will provide powerful tools for the research community and will aid in the determination of the genetic factors involved in the regulation of complex traits. For example, high-throughput RNA sequencing can give a holistic view of the host response to infectious diseases, and help identify the important genes and pathways defining genetic resistance, as demonstrated in this issue for rainbow trout (Ali et al., 2014; Marancik et al., 2014) and penaeid shrimp (Santos et al., 2014). Sequencing technology has also facilitated the development of abundant genetic markers that have multi-faceted applications for selective breeding of aquatic species, including parentage assignment in mixed-family environments, providing greater control over family representation and inbreeding (Vandeputte and Haffray, 2014). Medium or high-density SNP arrays can be used to predict genomic breeding values for economically-important traits in well-developed breeding programmes, such as those for Atlantic salmon (Ødegård et al., 2014). For instance, based on simulations of a Pacific white shrimp breeding program, genetic progress of disease resistance traits is faster with genomic-enabled selection compared to conventional phenotype-based selection due to higher accuracy (Castillo-Juárez et al., 2015). Incorporation of genetic marker information can also be a useful asset to optimize genetic diversity and future genetic gain when establishing base populations for breeding programmes (Fernández et al., 2014). Furthermore, these genomic tools can be applied to investigate putative genomic signatures of selection during the domestication process of farmed fish species, thus potentially identifying genomic regions underlying variation in relevant phenotypes in wild and domestic fish populations (López et al., 2015).

Edited and reviewed by: Max F. Rothschild, Iowa State University, USA

\*Correspondence: José M. Yáñez, jmayanez@uchile.cl

#### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 10 February 2015; Accepted: 17 March 2015; Published: 01 April 2015

#### Citation:

Yáñez JM, Newman S and Houston RD (2015) Genomics in aquaculture to better understand species biology and accelerate genetic progress. Front. Genet. 6:128. doi: 10.3389/fgene.2015.00128

Aquaculture species typically have several common features, for example high fecundity and external fertilization, plus a short evolutionary distance from their wild ancestors. The reproductive features enable flexible mating structures to be used for breeding programmes, and can provide a powerful resource for genetic studies of complex traits, such as disease resistance (Yáñez et al., 2014). However, the diversity between these species is enormous and often necessitates the establishment of species-specific reproduction and breeding programmes. For example, there is a remarkable variety of sex-determination systems within aquatic farmed species, and the study of Martínez et al. (2014) highlights various methods of controlling sex ratio within aquaculture breeding programmes. This species diversity also presents an issue for choosing suitable model organisms to inform on the biology of the farmed species of interest. Model finfish species, such as zebrafish, have been well-characterized and Ulloa et al. (2014) highlight their utility for the evaluation of the response to alternative diets. However, due to the vast evolutionary distance between certain farmed aquatic and model species, it is clear that direct research on the species of interest can often be the most feasible and informative.

The aquaculture industry has often been innovative and visionary in its application of new technologies to improve production. Genomics presents another major opportunity, and the research published in this special issue provides several excellent examples of its potential or realized application. Using genomic tools to more effectively utilize genetic variation in economically-important traits via sustainable breeding programmes is paramount to the continued successful growth and stability of aquaculture production.

## Acknowledgments

The authors would like to acknowledge funding from Genus plc, CORFO (11IEI-12843 and 12PIE-17669), Government of Chile, Programa U-Inicia, Vicerrectoría de Investigación y Desarrollo, Universidad de Chile, the UK Biotechnology and Biological Sciences Research Council (BBSRC) (BB/H022007/1) and from the Roslin Institute's BBSRC Institute Strategic Funding Grant.

## References

Kuznetsova, I. S., Thevasagayam, N. M., Sridatta, P. S. R., Komissarov, A. S., Saju, J. M., Ngoh, S. Y., Jiang, J., Shen, X., and Orbán, L. (2014). Primary analysis of repeat elements of the Asian seabass (Lates calcarifer) transcriptome and genome. Front. Genet. 5:223. doi: 10.3389/fgene.2014.00223


Yáñez, J. M., Houston, R. D., and Newman, S. (2014). Genetics and genomics of disease resistance in salmonid species. Front. Genet. 5:415. doi: 10.3389/fgene.2014.00415

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Yáñez, Newman and Houston. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Genetics and genomics of disease resistance in salmonid species

*José M. Yáñez<sup>1,2</sup>\*, Ross D. Houston<sup>3</sup> and Scott Newman<sup>4</sup>*

<sup>1</sup> Faculty of Veterinary and Animal Sciences, University of Chile, Santiago, Chile

<sup>2</sup> Aquainnovo, Puerto Montt, Chile

<sup>3</sup> The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Midlothian, UK

<sup>4</sup> Genus plc, Hendersonville, TN, USA

#### *Edited by:*

Peng Xu, Chinese Academy of Fishery Sciences, China

#### *Reviewed by:*

Zhi-Liang Hu, Iowa State University, USA

Yniv Palti, United States Department of Agriculture, USA

#### *\*Correspondence:*

José M. Yáñez, Faculty of Veterinary and Animal Sciences, University of Chile, Avenue Santa Rosa 11735, P.O. Box 8820808, La Pintana, Santiago, Chile e-mail: jmayanez@uchile.cl

Infectious and parasitic diseases generate large economic losses in salmon farming. Genetic improvement for disease resistance may represent a feasible and sustainable way to prevent disease outbreaks. To include disease resistance in the breeding goal, prior knowledge of the levels of genetic variation for these traits is required. Furthermore, information on the genetic architecture and molecular factors involved in resistance against diseases may be used to accelerate the genetic progress for these traits. In this regard, marker assisted selection and genomic selection are approaches which incorporate molecular information to increase the accuracy when predicting the genetic merit of selection candidates. In this article we review and discuss key aspects related to disease resistance in salmonid species, from both a genetic and genomic perspective, with emphasis on the applicability of disease resistance traits in breeding programs in salmonids.

**Keywords: salmon, disease resistance, breeding programs, QTL, genomic selection**

## **INTRODUCTION**

Farming of salmonid species is one of the largest aquaculture industries, with a worldwide production of approximately 1.9 million tons of high value product in 2010 (Food and Agriculture Organization of the United Nations [FAO], 2012). As in other animal production systems, the success and sustainability of salmonid aquaculture largely depends on the control of diseases. A clear example of the negative impact of infectious diseases in salmon farming is the unprecedented economic loss caused by outbreaks of the viral disease infectious salmon anemia (ISA) between 2007 and 2009 in Chile (Asche et al., 2010).

Genetic improvement programs are focused on increasing economic return of aquaculture systems via selective breeding (Gjedrem, 2012). In this regard, all heritable and economically relevant traits should be included in the breeding objective. Thus, in salmonid species, traits such as growth rate, flesh color, and resistance to viral, bacterial, and parasitic diseases should be included (Gjedrem, 2000, 2012). Selective breeding can utilize trait information recorded on selection candidates themselves or, particularly in the case of disease or invasive traits, on relatives. Until now, salmon breeding programs have typically included disease resistance based only on information from relatives, which limits the genetic progress achievable in each generation. This is because of the lower accuracy of estimated breeding values (EBVs) when using only sib information compared to the accuracy obtained when using information from the selection candidates themselves (Falconer and Mackay, 1996).
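The accuracy argument can be illustrated with classical selection index theory. The following is a minimal sketch, assuming a simple additive model with phenotypic variance scaled to 1; the function name `index_accuracy`, the heritability value and the family size are illustrative assumptions, not figures taken from the studies cited above.

```python
import numpy as np

def index_accuracy(h2, n_sibs, use_own, relatedness=0.5, c2=0.0):
    """Accuracy of a selection index built from the mean of n_sibs relatives,
    optionally combined with the candidate's own phenotype.

    h2: heritability; relatedness: 0.5 for full sibs, 0.25 for half sibs;
    c2: common-environment variance as a fraction of phenotypic variance.
    """
    t = relatedness * h2 + c2                      # covariance between two sibs
    var_sib_mean = (1 + (n_sibs - 1) * t) / n_sibs
    if use_own:
        P = np.array([[1.0, t], [t, var_sib_mean]])  # (co)variances of records
        g = np.array([h2, relatedness * h2])         # covariances with true BV
    else:
        P = np.array([[var_sib_mean]])
        g = np.array([relatedness * h2])
    b = np.linalg.solve(P, g)                      # index weights
    return float(np.sqrt(b @ g / h2))              # correlation of index with BV

h2 = 0.2       # assumed heritability, for illustration only
sib_only = index_accuracy(h2, n_sibs=10, use_own=False)
own_plus_sibs = index_accuracy(h2, n_sibs=10, use_own=True)
print(round(sib_only, 3), round(own_plus_sibs, 3))
```

Under these assumptions, sib-only accuracy cannot exceed the square root of 0.5 for full sibs however large the family, because the within-family (Mendelian sampling) part of the breeding value is never observed. Adding a record on the candidate raises the accuracy, which is the option that lethal challenge tests rule out.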

Recent advances in molecular biology techniques, such as next generation sequencing and high throughput genotyping methods, have helped identify genetic variants influencing phenotypic variation for different traits in a wide range of organisms (Goddard and Hayes, 2009). Molecular markers can be used for a variety of applications in livestock and aquaculture species, such as strain and hybrid identification, genetic variability and genetic diversity evaluation, parentage analyses, quantitative trait loci (QTL) mapping, marker assisted selection (MAS), and genomic selection (GS; Liu and Cordes, 2004; Goddard and Hayes, 2009). Information from a few molecular markers linked to QTL (i.e., genomic regions harboring genes with a significant effect on the trait) might be implemented in breeding schemes through MAS, if they explain a high proportion of genetic variation in the trait. Additionally, the information from thousands of markers might be simultaneously incorporated into genetic evaluation to estimate genomic breeding values (GEBVs; Meuwissen et al., 2001). These marker-based methods may be particularly useful for the improvement of traits that are complicated or impossible to measure directly on selection candidates, as is the case of resistance to disease (Sonesson and Meuwissen, 2009; Villanueva et al., 2011; Taylor, 2014). A typical first step to implement MAS or GS is to quantify the level of genetic variation in the trait and to dissect its genetic architecture. In salmonids, there is limited information on the genetic architecture of disease resistance. Nevertheless, it is expected that more knowledge on the QTL or genes affecting disease resistance traits will be revealed in the near future, facilitated by the increasing availability of genomic resources and better understanding of the biology of immune response in these species.
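To make the idea of estimating GEBVs from many markers at once more concrete, the sketch below fits all marker effects jointly by ridge regression (often called SNP-BLUP), one of several possible genomic prediction models and not necessarily the one used in any study cited here. The data are simulated, and the names `snp_blup` and `simulate`, the marker number, the allele frequency and the heritability are all assumptions made for illustration.

```python
import numpy as np

def snp_blup(genotypes, phenotypes, shrinkage):
    """Ridge-regression estimates of marker effects and the resulting GEBVs.

    genotypes: (n_fish, n_snps) allele counts coded 0/1/2.
    shrinkage: ratio of residual variance to per-marker effect variance.
    """
    Z = genotypes - genotypes.mean(axis=0)     # centre allele counts per marker
    y = phenotypes - phenotypes.mean()
    lhs = Z.T @ Z + shrinkage * np.eye(Z.shape[1])
    marker_effects = np.linalg.solve(lhs, Z.T @ y)
    return marker_effects, Z @ marker_effects  # GEBV = sum of marker effects

def simulate(n_fish=200, n_snps=500, h2=0.3, seed=1):
    """Simulated genotypes and phenotypes (hypothetical data, not real fish)."""
    rng = np.random.default_rng(seed)
    genotypes = rng.binomial(2, 0.5, size=(n_fish, n_snps)).astype(float)
    true_effects = rng.normal(0.0, 1.0, n_snps)
    breeding_values = (genotypes - genotypes.mean(axis=0)) @ true_effects
    noise_sd = np.sqrt(breeding_values.var() * (1 - h2) / h2)
    phenotypes = breeding_values + rng.normal(0.0, noise_sd, n_fish)
    return genotypes, phenotypes, breeding_values

genotypes, phenotypes, true_bv = simulate()
# With allele frequency 0.5, each marker contributes 2pq = 0.5 to the
# additive variance, so lambda = n_snps * 0.5 * (1 - h2) / h2.
shrinkage = 500 * 0.5 * (1 - 0.3) / 0.3
effects, gebv = snp_blup(genotypes, phenotypes, shrinkage)
print(np.corrcoef(gebv, true_bv)[0, 1])
```

In practice the marker effects would be estimated in a phenotyped and genotyped reference group (for example challenge-tested sibs) and then applied to genotyped selection candidates that have no phenotype of their own.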

This paper reviews aspects of conventional breeding to improve disease resistance in salmonids and the application of molecular tools for the identification of genetic factors involved in these traits. Additionally, the incorporation of molecular information into breeding schemes to improve disease resistance is discussed.

## **IMPORTANCE OF DISEASE CONTROL IN SALMON FARMING**

The health status of farmed fish is one of the main factors affecting the economic return in the salmon industry. Despite scientific, professional, and technical strategies aimed at improving health management, many novel pathological conditions have emerged in salmonid fish species worldwide in recent decades. A detailed description of each disease affecting culture of salmonid species would greatly exceed the purpose of this review. However, some particular examples are discussed to demonstrate the large economic impact diseases can cause in salmon production.

One of the most striking cases affecting salmon farming was the economic crisis triggered by ISA virus outbreaks in Chile from mid-2007. The production of Chilean Atlantic salmon suffered a dramatic decrease due to increasingly frequent outbreaks between 2007 and 2009. In fact, total production of Atlantic salmon between 2005 and 2010 decreased by more than 60% in volume (Asche et al., 2010). Currently, ISA virus outbreaks appear to be controlled to a low number of events per year in Europe and North and South America. However, the prevalence and emergence of other viral diseases is still of concern. In Northern European countries including Norway and Scotland, ISA outbreaks have been rare in recent years. Nevertheless, infectious pancreatic necrosis (IPN), caused by an aquatic birnavirus, has caused high levels of mortality in Europe, particularly during the window of susceptibility following transfer from freshwater to seawater (Roberts and Pearson, 2005). Other viral diseases have also emerged and pose serious threats to salmon aquaculture, such as heart and skeletal muscle inflammation (HSMI), associated with a piscine reovirus, and pancreas disease (PD), caused by an alphavirus, which has shown an increase in recent years (Biering et al., 2012). These viruses cause direct economic losses through mortality and indirect losses through reduced growth rate and treatment costs.

Among bacterial diseases with a negative impact on salmon farming, salmon rickettsial syndrome (SRS), caused by the Gram-negative bacterium *Piscirickettsia salmonis*, is one of the main sanitary challenges in the Chilean salmon industry. This disease affects different salmonid species, including Atlantic salmon (*Salmo salar*), coho salmon (*Oncorhynchus kisutch*), and rainbow trout (*O. mykiss*; Fryer and Hedrick, 2003) and can generate economic losses equivalent to 25% of total profit in salmon exports in Chile (Rozas and Enríquez, 2014). Other bacterial diseases, such as those caused by *Aeromonas salmonicida*, *Vibrio anguillarum*, and *Vibrio/Aliivibrio salmonicida*, are recognized to be efficiently controlled by vaccination and do not currently represent a major economic threat for salmon (Biering et al., 2012).

In terms of parasitic diseases, two different species of sea lice, *Lepeophtheirus salmonis* and *Caligus rogercresseyi*, are the most detrimental parasites for salmon farming at a worldwide level. In this regard, it has been estimated that on average, the economic impact of sea lice infestation is about 6% of the total value produced by the world salmon industry (Costello, 2009). An emerging threat to salmon production worldwide is Amoebic Gill Disease (AGD), which has been the major disease of farmed salmonid production in Tasmania for several decades (Mitchell and Rodger, 2011) and has appeared relatively recently in most major salmon-producing countries (Ruane and Jones, 2013). The free-living amoebic protozoan *Neoparamoeba pemaquidensis* is the primary causative agent of the disease, which can cause serious morbidity and reduced growth, in addition to increasing susceptibility to other pathogens (Mitchell and Rodger, 2011).

The measures used for prevention and treatment (vaccinations, antibiotics, antiparasitic drugs, and biosecurity measures) of some of the diseases presented above have typically been only partially effective in field conditions (Bravo et al., 2013; Jones et al., 2013; Rozas and Enríquez, 2014). Where effective vaccines do exist, administration typically requires individual handling and treatment of all production fish, which can be expensive and impractical in a large-scale production environment. Because improvement in the economic efficiency of salmon farming depends on disease prevention and control (Asche and Roll, 2013), it is imperative to develop alternative effective and sustainable strategies. Genetic improvement of disease resistance represents a feasible solution to increase the sanitary status in animal production (Stear et al., 2001; Bishop, 2010). In this regard, there is increasing scientific literature aiming at both quantifying levels of host genetic variation for resistance against different diseases and identifying the specific genetic factors that influence these traits in salmonid species, as discussed below.

## **CONVENTIONAL BREEDING FOR DISEASE RESISTANCE IN SALMONIDS**

Resistance to diseases can be defined as the ability of the host to limit infection by reducing pathogen replication (Råberg et al., 2007; Doeschl-Wilson et al., 2012). Selecting animals with increased resistance to specific diseases is a feasible method to improve productivity and animal welfare and offers advantages over other control methods against infection, such as the cumulative and permanent benefits of the improved resistance (Stear et al., 2001; Bishop, 2010). Disease resistance has been a target trait for the salmon breeding industry for at least 20 years, with a Norwegian salmon breeding program including resistance to bacterial and viral diseases in its breeding goal since 1993 (Gjøen and Bentsen, 1997). However, the study of disease resistance and its incorporation into breeding programs can be hindered by the difficulty in determining and measuring accurate and appropriate phenotypes (Bishop and Woolliams, 2014). This in turn influences the accuracy of disease resistance EBVs that can be achieved. Another limitation is that disease information is typically only available from relatives of the selection candidates and not directly from the candidates themselves. In the following, we review the main aspects of breeding for resistance to infectious diseases in salmonids and discuss current status and future directions of research in this area.

## **CHALLENGE AGAINST PATHOGENS**

Host resistance to viral and bacterial pathogens can often be measured, in practical terms, as survival (and/or mortality) of individuals during an outbreak (Ødegård et al., 2011). Data and samples from field outbreaks can be used opportunistically to make inference about genetic resistance to infectious diseases. For this purpose, it is necessary that the pedigree of the population be accurately determined, often using genetic markers or electronic tagging of the fish (Guy et al., 2006). However, using the information from field outbreaks has some disadvantages, such as the difficulty of identifying the exact cause of death, because the factors that influence survival under these conditions are likely to be diverse. Furthermore, the availability of information depends on the occurrence of high-mortality outbreaks, which are usually prevented or controlled to avoid serious economic loss. Moreover, the inference of pedigree using molecular markers can be expensive and laborious. Therefore, survival data are often obtained from experimental challenges, which can readily be standardized to control other variables and potentially allow a clearer interpretation of the results. In this case, it is necessary that a high genetic correlation exists between the trait measured in experimental and field conditions. High genetic correlations (*r*<sub>g</sub> ≥ 0.95) between field trials and experimental challenges to furunculosis in Atlantic salmon have been reported (Gjøen et al., 1997; Ødegård et al., 2006), suggesting that results from experimental challenges are likely to be directly applicable to commercial production systems. Therefore, challenge tests will often be more accurate and reliable than field outbreaks, due to decreased environmental variability and higher practical feasibility. In fact, challenge testing is currently used to select for resistance to viral, bacterial, and parasitic diseases in breeding programs for Atlantic salmon and rainbow trout (Gjøen and Bentsen, 1997; Leeds et al., 2010; Yáñez and Martínez, 2010; Ødegård et al., 2011; Gjedrem, 2012; Wiens et al., 2013a).

## **IMMUNOLOGICAL AND PHYSIOLOGICAL VARIABLES AS INDIRECT MEASURES OF RESISTANCE**

Direct genetic selection for improved disease resistance based on challenge testing can be costly and time consuming, and has negative animal welfare implications. Furthermore, selection decisions using this strategy can only be carried out using information from relatives and not the candidates themselves. Indirect selection based on the measurement of other characteristics that are genetically correlated with disease resistance would simplify the data collection and allow the incorporation of individual information. Some studies have aimed at determining the genetic variation of physiological and immunological variables, and the correlation between them and survival in challenge tests in salmon. Examples of variables that have been studied to date are hemolytic activity of serum and lysozyme activity (Røed et al., 1993; Lund et al., 1995), plasma levels of cortisol (Fevolden et al., 1993; Weber et al., 2008), levels of IgM and antibody titer (Lund et al., 1995), serum α2-antiplasmin (Salte et al., 1993), and bactericidal and complement activity (Hollebecq et al., 1995). However, even when some studies show significant correlations between resistance and immune parameters, the proportion of the total variation in survival that could be explained by immune variables has been considered too low to be useful as a selection criterion. Hence, the prediction of breeding values for survival based on these variables may not be practically useful (Gjøen and Bentsen, 1997). This may be due in part to the complexity of the mechanisms involved in the immune response and the large number of factors that may be involved in disease resistance, which makes it difficult to use the information from a single parameter for the genetic evaluation of disease resistance.

## **GENETIC VARIATION IN RESISTANCE TO INFECTIOUS DISEASES**

A requirement to improve a trait by means of artificial selection is that sufficient genetic variation for this trait exists in the population. Heritability is the proportion of the total phenotypic variance that is attributable to additive genetic variation (Falconer and Mackay, 1996). For disease resistance traits, heritability estimates can vary due to differences in trait definitions and the statistical models used in the analysis (Yáñez and Martínez, 2010; Ødegård et al., 2011). For example, some studies consider the binary trait of survival or mortality as a measure of resistance, whereas others consider the survival time following challenge. These require different statistical approaches to analysis, and are likely to inform on different components of the host response to infection. Nonetheless, there have been several studies aimed at determining levels of additive genetic variation for resistance to different diseases affecting salmonid species (see **Table 1**). The results from these studies show the feasibility of improving disease resistance through genetic improvement and the potential of this approach for helping in the control of disease problems in salmonids.
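The definition of heritability above can be written out as a one-line calculation. The following is a minimal sketch only: the function name and the variance components are hypothetical values chosen for illustration, not estimates from any of the cited studies.

```python
def narrow_sense_heritability(var_additive, var_phenotypic):
    """Proportion of phenotypic variance attributable to additive genetic variance."""
    if var_phenotypic <= 0:
        raise ValueError("phenotypic variance must be positive")
    if not 0 <= var_additive <= var_phenotypic:
        raise ValueError("additive variance must lie between 0 and the phenotypic variance")
    return var_additive / var_phenotypic


# Hypothetical variance components for a challenge-test survival trait.
h2 = narrow_sense_heritability(var_additive=0.3, var_phenotypic=1.2)
print(round(h2, 2))
```

In practice the variance components themselves come from a statistical model fitted to pedigree and phenotype data, which is where the trait definition (binary survival versus survival time) and model choice affect the estimate.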

## **GENETIC CORRELATIONS BETWEEN DISEASE RESISTANCE AND OTHER TRAITS**

The potential to simultaneously improve resistance to different diseases and other economically important traits is partly dependent on the genetic correlations between the traits. Few studies to date have aimed to determine the genetic correlations for disease resistance traits in salmon. Some studies in Atlantic salmon have indicated positive genetic correlations between resistance to bacterial diseases such as furunculosis, BKD, and cold-water vibriosis (Gjedrem and Gjøen, 1995; Gjøen et al., 1997). Weak negative genetic correlations between resistance against ISA and bacterial diseases such as furunculosis, vibriosis, and cold-water vibriosis have been reported (Gjøen et al., 1997). However, positive genetic correlations between resistance against ISA virus and furunculosis have also been reported (Ødegård et al., 2007b). In rainbow trout, weak genetic correlations between viral hemorrhagic septicemia (VHS) and bacterial diseases such as enteric redmouth disease (ERM) and rainbow trout fry syndrome have been found (Henryon et al., 2005). Kjøglum et al. (2008) reported only weak genetic correlations between resistance to IPN, ISA, and furunculosis. Verrier et al. (2013b) also failed to detect any genetic correlation between host resistance to two rhabdoviral pathogens (VHS virus and infectious hematopoietic necrosis virus). Additionally, weak genetic correlations have been estimated between resistance against SRS and *C. rogercresseyi* in Atlantic salmon (Yáñez et al., 2014a). In general, these results suggest no clear-cut relationship between genetic resistance to one pathogen and genetic resistance to another pathogen.
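For readers less familiar with the parameter, a genetic correlation is the additive genetic covariance between two traits scaled by their additive genetic standard deviations. The sketch below assumes the (co)variance components have already been estimated; the function name and numbers are hypothetical.

```python
import math


def genetic_correlation(cov_additive_xy, var_additive_x, var_additive_y):
    """Additive genetic covariance divided by the product of additive genetic SDs."""
    if var_additive_x <= 0 or var_additive_y <= 0:
        raise ValueError("additive genetic variances must be positive")
    return cov_additive_xy / math.sqrt(var_additive_x * var_additive_y)


# Hypothetical components for resistance to two different pathogens.
r_g = genetic_correlation(cov_additive_xy=0.06, var_additive_x=0.2, var_additive_y=0.45)
print(round(r_g, 2))
```

A value near zero, as in many of the pathogen pairs reviewed above, means that selection on one resistance trait is expected to produce little correlated change in the other.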

Moreover, it is important to know the genetic correlations between disease resistance and other economically important traits in salmon production, especially for selection index development. Previous reports of correlations between disease resistance and production traits have been zero (Silverstein et al., 2009), low to moderately negative (Henryon et al., 2002), inconsistent (Beacham and Evelyn, 1992; Henryon et al., 2002), and low to moderately positive (Gjedrem et al., 1991; Perry et al., 2004; Yáñez et al., 2014a). Thus, it is important to evaluate these parameters for each particular breeding population to maximize the impact of selecting for improved disease resistance. Additionally, the relationship between ploidy levels and disease resistance is of interest because all-female triploid rainbow trout are advantageous for production. Weber et al. (2013) showed that triploid fish are generally slightly more susceptible to Bacterial Coldwater Disease than diploids, but that selection for improved resistance in diploids is also effective for triploid production.

**Table 1 | Heritability values (***h<sup>2</sup>* **) and their SE for resistance to different infectious and parasitic diseases in salmonid species.**

## **RESISTANCE VERSUS TOLERANCE**

If resistance is defined as an individual's ability to block the reproduction of a pathogen, then disease tolerance can be defined as the ability to limit the impact of infection on a host (Råberg et al., 2007; Doeschl-Wilson et al., 2012). Information relating to the genetic basis of disease tolerance has been sparse in animal genetics studies to date (Doeschl-Wilson et al., 2012). However, it is important to disentangle resistance from tolerance because the appropriate trait to select for may differ depending on the disease, host species, and environment. It is also possible that the two traits will be antagonistic, which could result in inadvertent undesirable outcomes of selection for disease resistance (Doeschl-Wilson et al., 2012). The implications of selection for resistance or tolerance on the host–pathogen interaction and pathogen evolution have also been considered. For example, selection for resistance *per se* may lead to selection pressure for higher virulence in the pathogen, whereas selection for tolerance could result in co-existence of pathogen and host with minimal impact on the performance of the host population. Therefore, in future disease studies and in selective breeding programs, it will be important to carefully consider the optimal disease trait to target for maximum benefit at the population level, as well as the analytical issues involved in distinguishing resistance from tolerance.

## **STRATEGIES FOR GENETIC DISSECTION OF DISEASE RESISTANCE TRAITS**

Major advances in nucleotide sequencing, ever-improving bioinformatics pipelines, and high-throughput genotyping tools have helped to identify genes associated with complex traits in various species of vertebrates. In salmonids, this is reflected in the substantial increase in genomic resources for these species during recent years and, for example, in the formation of an international collaboration to sequence the Atlantic salmon genome (Davidson et al., 2010). The salmon genome has undergone a relatively recent duplication and has a very high content of long repeat elements. This has hampered the sequencing and assembly of a reference genome for salmonids. However, a high-quality reference sequence has been published for rainbow trout (Berthelot et al., 2014) and is available for Atlantic salmon (Davidson et al., 2010). Further, high density SNP genotyping arrays were recently developed for Atlantic salmon (Houston et al., 2014; Yáñez et al., 2014b), and a lower density platform (Lien et al., 2011) has previously been used for QTL mapping and population genetics. Currently, these genomic resources are increasingly being used for the identification of the genetic factors involved in the resistance to different diseases in salmonids. The strategies used in the study of the genetic architecture of disease resistance can be classified as: (i) candidate gene approaches, (ii) QTL mapping, and (iii) gene expression studies.

## **CANDIDATE GENE APPROACH**

Candidate gene theory states that a significant proportion of phenotypic variance of one trait in a population is determined by the presence of polymorphisms within genes known to be involved in the physiological regulation of that trait (Rothschild and Soller, 1997). This approach requires previous knowledge on the biology of the species, biochemical pathways, and especially gene sequences, to study the variation within specific candidate genes. In aquaculture species, the availability of annotated gene sequences of known function is typically low, but is likely to increase in the short term with the use of high-throughput sequencing and ongoing genome sequencing projects.

In vertebrates, the major histocompatibility complex (MHC) has attracted much attention in studies of association between genetic variants and disease resistance. However, other genes are likely to play an important role in the mechanisms of disease resistance in production animals, model organisms, and humans (Hill, 1999; Qureshi et al., 1999). To our knowledge, there are no studies aimed at establishing association between candidate genes, other than the MHC, and resistance to infectious diseases in salmonid species.

### *Major histocompatibility complex*

The MHC is a multigene family that acts at the interface between the immune system and infectious pathogens. The MHC gene family comprises two subfamilies: class I and II. Both classes are membrane glycoproteins involved in the processing and removal of pathogens (Thorgaard et al., 2002). MHC genes have been identified, cloned, and characterized in Atlantic salmon, rainbow trout, and other salmonids (Grimholt et al., 1993; Hordvik et al., 1993; Hansen et al., 1996; Shum et al., 1999, 2002). Furthermore, it has been shown that these genes are highly polymorphic in these species (Grimholt et al., 1994, 2002; Miller and Withler, 1996; Hansen et al., 1999; Garrigan and Hedrick, 2001; Aoyagi et al., 2002). As in other vertebrates, there are two types of class I genes: UAA, which are highly divergent, non-polymorphic, and expressed at low levels, and UBA, which are polymorphic, expressed at high levels in spleen, and have structural features similar to those of classical class Ia molecules, which present antigen to T lymphocytes (Shum et al., 1999). The class II genes are divided into class II A (DAA) and class II B (DAB), depending on whether they encode the α or β chain of the molecule, respectively (Grimholt et al., 2000; Stet et al., 2002). Both loci (DAA and DAB) co-segregate as haplotypes, suggesting a close physical linkage between them in Atlantic salmon (Stet et al., 2002).

An association between a polymorphism linked to MHC class II genes and resistance to infectious hematopoietic necrosis (IHN) virus in backcrosses of rainbow trout and cutthroat trout (*O. clarki*) has been found, but it was relatively weak and dependent on the family analyzed (Palti et al., 2001). Suggestive associations between rainbow trout MHC class IB alleles and bacterial cold water disease have also been shown (Johnson et al., 2008). In Atlantic salmon, an association between MH class IIB alleles and resistance against *A. salmonicida* has been reported (Langefors et al., 2001; Lohm et al., 2002). In the same species, MHC class I and class II variants have been associated with susceptibility to IHN (Miller et al., 2004) and resistance to furunculosis and ISA (Grimholt et al., 2003; Kjøglum et al., 2006). Although associations between MH gene variants and disease resistance have been established, the MHC is most likely not the only factor influencing genetic variation in disease resistance. For instance, in Atlantic salmon, a non-MHC effect for resistance to IPN, furunculosis and ISA has been detected (Kjøglum et al., 2005). Because disease resistance traits will typically be polygenic in nature, it is important to consider variants in a genome-wide context, along with possible interactions between genes (epistasis). This limits the candidate gene approach as a comprehensive strategy for incorporating molecular information into the genetic evaluation of these traits.

## **QTL MAPPING**

Quantitative trait loci mapping is a strategy providing information on the location and effect of the gene variants influencing complex quantitative traits, but without prior hypotheses. The QTL detection methodologies are based on the use of (typically anonymous) DNA markers dispersed throughout the genome to identify genomic regions involved in the genetic variation of a particular trait, by means of statistical analyses utilizing the co-segregation between markers and the (unknown) causative variants.
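The idea of scanning anonymous markers for co-segregation with a trait can be illustrated with a deliberately simple stand-in: a single-marker regression of phenotype on allele count. This is an assumption made for illustration and not the linkage or interval-mapping methods used in the studies cited in this review; the marker names and toy data are hypothetical.

```python
from statistics import mean


def marker_regression(allele_counts, phenotypes):
    """Least-squares slope and R^2 of phenotype on allele count (0, 1 or 2)."""
    g_bar, p_bar = mean(allele_counts), mean(phenotypes)
    sxx = sum((g - g_bar) ** 2 for g in allele_counts)
    syy = sum((p - p_bar) ** 2 for p in phenotypes)
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # monomorphic marker or constant phenotype: nothing to test
    sxy = sum((g - g_bar) * (p - p_bar) for g, p in zip(allele_counts, phenotypes))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def scan(markers, phenotypes):
    """Rank markers by the share of phenotypic variance they explain."""
    results = {name: marker_regression(g, phenotypes) for name, g in markers.items()}
    return sorted(results.items(), key=lambda item: item[1][1], reverse=True)


# Toy data: six fish, days survived after challenge (hypothetical).
days_survived = [10, 12, 20, 22, 30, 32]
markers = {
    "m1": [0, 0, 1, 1, 2, 2],  # allele count tracks survival closely
    "m2": [0, 1, 2, 0, 1, 2],  # allele count tracks survival only weakly
}
for name, (slope, r2) in scan(markers, days_survived):
    print(name, round(slope, 2), round(r2, 3))
```

A real analysis would also need a significance threshold that accounts for the number of markers tested and for family structure, which this sketch ignores.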

### *DNA markers*

The development of DNA markers has had a major impact on studies of genetic variation in animals and fish. The categories of DNA markers widely used historically include Restriction Fragment Length Polymorphisms (RFLP), Random Amplified Polymorphic DNA (RAPD), Amplified Fragment Length Polymorphism (AFLP), microsatellites, and single nucleotide polymorphisms (SNPs). More recently, SNP markers have become the predominant marker due to their abundance, ease of discovery and low cost of genotyping per locus (Houston et al., 2014). However, these marker types differ from one another in their mode of inheritance (i.e., dominant or codominant), identification and detection methods, number of spanning loci and polymorphic information content (Liu and Cordes, 2004).

Microsatellites are tandem repeats of short nucleotide motifs, generally one to six base pairs long. Different alleles are generated due to the variation in the number of repeats. Their main useful features for genetic studies include high variability, codominant inheritance, abundance, and wide distribution across the genome. One of the major advantages of microsatellites is their high degree of polymorphism, with tens of alleles often observed at a single locus in an outbred population. Additionally, their genotyping is typically based on a simple DNA amplification by means of polymerase chain reaction (PCR). Thus, genotyping can be relatively rapid and cheap, and the amount of DNA required is minimal (nanograms). However, the scoring of microsatellites often requires optimization and manual input, which reduces their scalability for large genetic studies. There are a considerable number of microsatellites identified for different salmonid species, available for use in genetic studies (e.g., Cairney et al., 2000; Gilbey et al., 2004; Phillips et al., 2009).
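To make the notion of repeat-number alleles concrete, the sketch below counts consecutive copies of a motif in two made-up sequences. The flanking sequences, the motif and the helper name are hypothetical; real microsatellite alleles are normally scored from PCR fragment lengths rather than from sequence.

```python
import re


def longest_repeat_run(sequence, motif):
    """Largest number of consecutive copies of `motif` found in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(match.group(0)) // len(motif) for match in pattern.finditer(sequence)]
    return max(runs, default=0)


# Two hypothetical alleles at one locus: same flanks, different (CA)n repeat number.
allele_1 = "GGTT" + "CA" * 12 + "TTAG"
allele_2 = "GGTT" + "CA" * 15 + "TTAG"

n1 = longest_repeat_run(allele_1, "CA")
n2 = longest_repeat_run(allele_2, "CA")
print(n1, n2, len(allele_2) - len(allele_1))  # the length difference is what PCR sizing detects
```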

Historically, AFLPs (which typically score SNPs) had the advantage of being generated more easily and cheaply than SNPs and microsatellites, because previous knowledge on genomic sequence is not needed for their generation. However, their mode of inheritance, similar to RAPDs, is dominant, i.e., it is not possible to distinguish between heterozygous and one category of homozygous genotypes without the use of special equipment and software (Piepho and Koch, 2000). This reduces the amount of information provided by these kinds of markers.
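The information lost with dominant markers can be shown with a toy scoring scheme. This is an illustrative sketch only: genotypes are written as two-letter strings in which `A` stands for the band-producing allele, and both function names are hypothetical.

```python
def score_codominant(genotype):
    """Codominant marker (e.g., microsatellite or SNP): every genotype is distinguishable."""
    return "".join(sorted(genotype))  # "AA", "Aa" or "aa"


def score_dominant(genotype):
    """Dominant marker (e.g., AFLP or RAPD): only band presence or absence is observed."""
    return "band" if "A" in genotype else "no band"


for genotype in ["AA", "Aa", "aa"]:
    print(genotype, score_codominant(genotype), score_dominant(genotype))
```

With dominant scoring, `AA` and `Aa` fall into the same class, so three genotypes collapse into two observable phenotypes.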

Single nucleotide polymorphisms have several key advantages for genetic studies, which have led to a rapid increase in their popularity over recent years. Some of these include high abundance, amenability to automated scoring in large numbers, simultaneous assays of thousands of markers (e.g., using SNP arrays) and presence in both coding and non-coding regions. Therefore, SNPs have been applied recently for construction of dense genetic maps, which can be used for fine mapping of QTLs and facilitate the identification of causative genes involved in the genetic variation of specific characters (e.g., Lien et al., 2011; Gonen et al., 2014). Recently, there has been a rapid increase in available SNPs for salmonid species, mainly for Atlantic salmon and rainbow trout (Hayes et al., 2007; Sanchez et al., 2009; Everett et al., 2011, 2012; Hohenlohe et al., 2011; Lien et al., 2011; Houston et al., 2012, 2014; Salem et al., 2012). Additionally, SNP arrays are available for simultaneous genotyping of tens to hundreds of thousands of markers in rainbow trout (Palti et al., 2014b) and Atlantic salmon (Houston et al., 2014; Yáñez et al., 2014b). These SNP resources are likely to be increasingly applied to high-resolution mapping of disease resistance genes in salmonid species.

In recent years, the same sequencing technology that has led to the increased ease of high-density SNP discovery described above has enabled direct genotyping of individual fish based on sequence information alone. Such techniques are collectively termed 'genotyping by sequencing' (GBS). Although GBS techniques encompass a diverse range of laboratory and bioinformatic pipelines, they are all based on the principle of using nucleotide barcodes ligated to the fragmented genomic DNA of individual fish and high-throughput sequencing in multiplexed pools (Davey et al., 2013). Partly due to the lack of established genomics resources for many farmed fish species, the aquaculture genetics community has been an early adopter and developer of these techniques. In particular, RAD sequencing has been utilized in both Atlantic salmon (e.g., Houston et al., 2012) and rainbow trout (Palti et al., 2014a) as a conduit to incorporating genomic information into aquaculture breeding programs. While the application of GBS has huge potential for applied research in salmonids, there are also some challenges in managing and interpreting these datasets. For example, discovery of 'false' SNPs using sequencing can occur, particularly given the recent whole genome duplication in salmonid species. Therefore, species-tailored bioinformatics pipelines are typically required to minimize these issues.
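The barcode principle shared by GBS methods can be sketched as a demultiplexing step that assigns pooled reads back to individual fish. The version below is a simplification under stated assumptions: barcodes sit at the start of each read, must match exactly, and are of equal length; the barcodes, sample names and reads are all hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Assign each read to a sample by exact match of its leading barcode."""
    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])  # trim the barcode off
                break
        else:
            unassigned.append(read)  # no known barcode at the start of this read
    return dict(by_sample), unassigned


barcodes = {"ACGT": "fish_01", "TGCA": "fish_02"}
pooled_reads = ["ACGTTTGACCA", "TGCAGGATCCA", "ACGTCCATTGA", "GGGGTTTTAAA"]

by_sample, unassigned = demultiplex(pooled_reads, barcodes)
print(by_sample)
print(unassigned)
```

Production pipelines usually also tolerate sequencing errors in the barcode and check for the expected restriction-site remnant, neither of which is attempted here.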

### *Detection of QTL affecting disease resistance in salmonid species*

Salmon may be more likely than terrestrial farmed species to have QTL of major effect because they have had fewer generations of selection in the farmed environment and therefore the standing genetic variation for traits of economic importance may still be very large. The development of a genetic map based on linkage between genetic markers is the first step towards the identification of QTL. To date, several linkage maps have been constructed using different marker types for rainbow trout (e.g., Young et al., 1998; Sakamoto et al., 2000; Nichols et al., 2003; Guyomard et al., 2006; Rexroad et al., 2008; Palti et al., 2011, 2012; Guyomard et al., 2012); Atlantic salmon (Gilbey et al., 2004; Moen et al., 2004b; Lien et al., 2011; Gonen et al., 2014); coho salmon (McClelland and Naish, 2008); brown trout (*Salmo trutta*; Gharbi et al., 2006); Arctic charr (*Salvelinus alpinus*; Woram et al., 2004); and sockeye salmon (*Oncorhynchus nerka*; Everett et al., 2012).

Using microsatellite markers, two QTL with major effects on resistance to IPN in rainbow trout have been detected. These loci explained a large proportion (27 and 34%) of the phenotypic variation in a family from a backcross between a strain susceptible to IPN (YK-RT101) and a resistant one (YN-RT201; Ozaki et al., 2001). Using AFLP and microsatellite markers, QTLs for IHN resistance have been identified in three different linkage groups of the same species (Rodriguez et al., 2004). For the same disease, RFLP markers were associated with resistant and susceptible families in backcrosses of rainbow trout and cutthroat trout (Palti et al., 1999). Extensive research into the genetic architecture of resistance to Bacterial Coldwater Disease has also been undertaken, with evidence for QTL of major effect (Wiens et al., 2013b; Vallejo et al., 2014a,b). Additionally, major QTL have been detected for resistance to whirling disease caused by the myxosporean parasite *Myxobolus cerebralis* (Baerwald et al., 2010) and VHS (Verrier et al., 2013a).

In Atlantic salmon, using AFLP markers, two QTL associated with resistance to ISA have been detected in two full sib families (Moen et al., 2004a). One of these QTLs has been validated using microsatellite markers from a higher number of genotyped fish. This QTL explained 6% of the phenotypic variation for resistance to ISA and has been mapped to linkage group VIII of the Atlantic salmon genome (following SALMAP notation; Moen et al., 2007). In the same species, a major QTL for resistance to IPN has been identified using data collected from a 'field' seawater outbreak of the disease (Houston et al., 2008a). The QTL detection strategy exploited the low male recombination rate observed in salmonids in two stages. First, just two to three microsatellites per chromosome and male segregation were used to determine linkage groups with a significant effect. Second, a higher number of markers per linkage group and female segregation were used to confirm the previously detected QTL and position it within the linkage group (Houston et al., 2008a). The major QTL, mapped to linkage group 21, was subsequently confirmed by analyzing nine additional families and a higher saturation of markers (Houston et al., 2008b). The same QTL was then confirmed and fine mapped in an independent population, and haplotypes of markers that could predict the genotype at the QTL were identified based on the linked microsatellites (Moen et al., 2009). Using a restriction-site associated DNA sequencing approach (RAD sequencing), several additional SNP markers linked to the QTL were identified and two SNP markers showed a significant population-level association with resistance in two year classes from an Atlantic salmon breeding population (Houston et al., 2012). The high proportion of the total variance for IPN resistance that this QTL explains has allowed the incorporation of markers linked to it into MAS schemes for the genetic improvement of this trait in Atlantic salmon in both Norway (Moen et al., 2009) and Scotland (Houston et al., 2010).

In general, confidence intervals for mapped QTLs are large. This issue has two consequences: first, wide confidence intervals might contain a large number of genes (thousands) and, therefore, identification of the causative polymorphism is challenging. Second, the use of markers linked to QTLs in MAS programs is complicated, since the linkage phase between the marker and the QTL may differ from family to family across the population. An alternative for QTL fine mapping is to use information from linkage analysis in conjunction with information from linkage disequilibrium (LD) across the population (Meuwissen et al., 2002). Through simulation studies, the power and accuracy of this combined approach have been successfully tested for QTL fine mapping, accounting for the structure of commercial salmon populations (Hayes et al., 2006). Additionally, the availability of both high-density SNP marker panels and high-resolution genetic maps will help to detect associations between markers and QTLs with higher precision by means of across-population LD mapping (Goddard and Hayes, 2009). These strategies, along with a reference sequence and a consolidated physical map of the genome, will facilitate the identification of causative mutations affecting disease resistance traits through positional studies in salmonid species.
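Population-level LD between a marker and a QTL is commonly summarized by the squared correlation between alleles at the two loci, r². The sketch below computes it from phased haplotypes; it assumes biallelic loci coded 0/1 and known phase, and the haplotype sets are hypothetical.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes given as (allele_a, allele_b) pairs."""
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n  # frequency of allele 1 at the first locus
    p_b = sum(h[1] for h in haplotypes) / n  # frequency of allele 1 at the second locus
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator


complete_ld = [(1, 1)] * 5 + [(0, 0)] * 5          # marker allele always travels with QTL allele
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]           # alleles combine at random

print(ld_r2(complete_ld), ld_r2(no_ld))
```

A marker with r² close to 1 with the causative variant predicts it consistently across families, whereas a marker with low r² is only informative within families where the phase has been established.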

## **GLOBAL GENE EXPRESSION**

Functional genomics, defined as the application of experimental methods of genomic or systemic coverage to assess gene function using data from structural genomics (mapping and sequencing), has been recognized as an area of primary interest in disease studies (Hiendleder et al., 2005). These methodologies broaden the spectrum of biological research to study, simultaneously, the expression of thousands of genes at the transcriptional level. Currently, genomic resources and new sequencing technologies have helped to assess differential gene expression levels in the response against diseases in salmonid species. These data can help to pinpoint functional genetic variation underlying disease resistance.

In an early example, using suppression subtractive hybridization (SSH) and liver samples from individuals injected with a *V. anguillarum* bacterium and from normal individuals, more than 25 genes important in the immune response in rainbow trout were identified, including sequences of proteins of the acute phase of inflammation, complement, and coagulation system (Bayne et al., 2001). Using the same technique, genes involved in signal transduction and innate immunity (among others) have been identified as relevant factors in response to a challenge against *A. salmonicida* in Atlantic salmon (Tsoi et al., 2004).

The availability of ESTs (expressed sequence tags) and cDNA libraries has allowed the development of DNA microarrays, which can be used to study the differential expression patterns of a large number of genes simultaneously in salmonids (Rise et al., 2004b; Ewart et al., 2005; von Schalburg et al., 2005). Using a microarray of human cDNAs, differentially expressed transcripts against a challenge with *A. salmonicida* have been identified in Atlantic salmon. However, due to species divergence, only 6% of the sequences of the microarray showed detectable hybridization against salmon liver cDNA (Tsoi et al., 2003). Rise et al. (2004a) conducted the first study using a microarray constructed from salmon cDNA libraries. Following this, differential gene expression has been assessed between macrophages infected and uninfected with *P. salmonis*, and in hematopoietic kidney from Atlantic salmon individuals challenged or not against the same pathogen. Differentially expressed genes were proposed to be relevant in immune response and as potential biomarkers of infection with *P. salmonis* (Rise et al., 2004a). Additionally, using an Atlantic salmon cDNA microarray including more than 4,000 genes extracted from liver, spleen and hematopoietic kidney, several differentially expressed genes in response to infection by *A. salmonicida* have been found (Ewart et al., 2005). Furthermore, the transcriptomic response against vaccination with an *A. salmonicida* bacterium has been assessed in Atlantic salmon, revealing temporal and tissue differences in terms of expression levels, which may be relevant in establishing protection (Martin et al., 2006). In addition, gene expression profiles in response to a DNA vaccine for the IHN virus have been studied in rainbow trout, identifying 910 genes modulated in the injection site, and also determining the overexpression of genes of the type I Interferon system (IFN-1) in other tissues, suggesting that this system forms the basis of early antiviral immunity (Purcell et al., 2006). In the studies presented above, certain transcripts that showed variation in their expression levels during infection have low levels of homology with well-characterized genes available in public databases and, thus, do not have a known function (Rise et al., 2004a; Ewart et al., 2005; Martin et al., 2006; Purcell et al., 2006). Therefore, studies based on sequences of the encoded proteins aimed at deciphering their role in the immune response against infection are still needed.

To date, few studies have focused on the analysis of differential expression between resistant and susceptible fish for a particular disease. For example, Sutherland et al. (2014) analyzed gene expression profile differences between chum and pink salmon during infections with sea lice (*L. salmonis*) to gain insight into the functional mechanisms underlying the divergent resistance to lice observed in these two species. Zhang et al. (2011) took a similar approach to examine the differences between resistant and susceptible salmon families to the pathogen *A. salmonicida* to highlight the importance of several innate immune response genes. Cofre et al. (2014) also examined differential expression of several candidate genes in families with divergent resistance to IPNV. Finally, Langevin et al. (2012) compared the transcriptional response of resistant and susceptible clonal lines for Bacterial Coldwater Disease. The information provided by this type of analysis might be useful in the discovery of new sets of genes, with or without an assigned function, which may be associated with disease resistance (Walsh and Henderson, 2004). Another possibility is to consider the differential expression levels as a quantitative trait. This may allow identification of QTLs associated with differences in gene expression patterns between resistant and susceptible individuals (eQTL; Pomp et al., 2004; de Koning et al., 2005). However, it remains unclear how the information given by expression analysis can be used in breeding programs (Walsh and Henderson, 2004).

## **MOLECULAR MARKER-ASSISTED SELECTION (MAS) AND GENOMIC SELECTION (GS)**

While selective breeding based on performance information from the selection candidate and its relatives is a highly successful means of genetic improvement in the trait of interest, utilizing genetic markers can provide an improvement in both genetic gain and selection accuracy (Goddard and Hayes, 2009). Historically, the use of genetic markers required mapping of QTL and therefore a key parameter for subsequent application in MAS programs is the level of LD between markers and causative mutations at a population level (Goddard and Meuwissen, 2005).

One factor reducing the level of LD each generation is recombination. When the LD between the QTL and the marker only exists within families and not across families, recombination can break the association between marker alleles and the QTL between families. Therefore, the linkage phase between the marker and the QTL should be determined in each generation and separately for each family if it is to be utilized in selection (Wientjes et al., 2013). To determine whether the marker and QTL are in LD within each family, phenotypic records and genotypes are needed in each generation. This makes the implementation of MAS exploiting only within-family LD (i.e., linkage) between the QTL and markers unattractive (Dekkers and Van der Werf, 2007). In the case of disease resistance traits, this means that all families comprising the breeding nucleus would need to be challenged each generation.
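The erosion of LD by recombination can be illustrated with the textbook expectation that the disequilibrium coefficient D shrinks by a factor (1 − c) per generation, where c is the recombination rate between the two loci. This sketch assumes a large, randomly mating population with no selection or drift, which real breeding nuclei do not satisfy; the starting value and rates are hypothetical.

```python
def ld_after(d0, recombination_rate, generations):
    """Expected disequilibrium coefficient D after a number of generations of recombination."""
    if not 0 <= recombination_rate <= 0.5:
        raise ValueError("recombination rate must be between 0 and 0.5")
    return d0 * (1 - recombination_rate) ** generations


d0 = 0.2
for c in (0.01, 0.1, 0.5):  # tightly linked, loosely linked, unlinked
    trajectory = [round(ld_after(d0, c, t), 4) for t in (0, 1, 5, 10)]
    print(c, trajectory)
```

Under these assumptions, tightly linked markers retain most of their association with the QTL over several generations, whereas loosely linked markers lose it quickly, which is why marker-QTL phase must otherwise be re-established within families.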

Marker assisted selection schemes are likely to be optimal when the markers explain a large proportion of the total variance of the trait, as in the case of the QTL for resistance to IPN in Atlantic salmon (Houston et al., 2008a, 2012; Moen et al., 2009). The information from markers linked to this QTL is being applied in MAS programs in different breeding populations of Atlantic salmon. Although SNP variants associated with IPN resistance in two different year classes of a breeding nucleus have been discovered (Houston et al., 2012), there is still a need to validate if these alleles are associated with resistance in independent Atlantic salmon populations. This will depend on the LD relationship between the markers and the causative mutation in the population(s) of interest. As such, testing the association between markers and IPNV mortality in these populations should be considered a prerequisite for effective commercial application.

Another way to exploit LD at the population level is to use information from high-density marker panels to predict the GEBVs of selection candidates (Meuwissen et al., 2001). This approach effectively takes all markers into account when estimating the breeding value of a candidate, without requiring any marker to surpass a significance threshold for association with the trait of interest. The effectiveness of this strategy will also depend on the magnitude of the effects associated with the markers. Once the effects of markers across the entire genome have been estimated, they can be used to select individuals lacking phenotypes (Meuwissen et al., 2001). As such, in the case of disease resistance, there would be fewer requirements for challenge testing of siblings for diseases of economic importance. Recombination results in decay of LD in each generation, and the magnitude of this reduction depends on several population features (Porto-Neto et al., 2014). This means that, in practice, it is necessary to verify the accuracy of the estimated GEBVs and the selection response in each generation.
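The idea of estimating all marker effects jointly and summing them to obtain a GEBV can be sketched as follows. This is a minimal example that assumes a ridge-regression (SNP-BLUP style) estimator as a stand-in; it is not necessarily the estimator used in the cited studies, and the function names, the `ridge_lambda` shrinkage parameter, and the genotype coding (0, 1, 2 allele counts) are assumptions made for illustration.

```python
import numpy as np


def estimate_marker_effects(genotypes, phenotypes, ridge_lambda):
    """Ridge-regression estimates of marker effects from a training population.

    genotypes: (n_individuals, n_markers) array of allele counts (assumed 0/1/2).
    phenotypes: (n_individuals,) array of trait records.
    ridge_lambda: shrinkage parameter (assumed known here).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    phenotypes = np.asarray(phenotypes, dtype=float)
    centred_g = genotypes - genotypes.mean(axis=0)   # centre each marker
    centred_y = phenotypes - phenotypes.mean()       # centre the phenotype
    n_markers = centred_g.shape[1]
    lhs = centred_g.T @ centred_g + ridge_lambda * np.eye(n_markers)
    return np.linalg.solve(lhs, centred_g.T @ centred_y)


def predict_gebv(candidate_genotypes, training_marker_means, marker_effects):
    """GEBV of unphenotyped candidates: sum of marker effects weighted by genotype."""
    candidate_genotypes = np.asarray(candidate_genotypes, dtype=float)
    return (candidate_genotypes - training_marker_means) @ marker_effects
```

In this sketch, candidates need genotypes only; phenotypes (for example, challenge-test survival) are required just for the training individuals used to estimate the marker effects.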

*Best linear unbiased prediction* (BLUP) combines information from pedigree and phenotypes to predict breeding values (EBVs) of individuals. Molecular markers provide a new source of information, which will give higher accuracy when predicting EBVs. Thus, selection response will be potentially higher in traits in which the accuracy is low, i.e., traits with low heritability or traits that cannot be measured in the selection candidate, such as disease resistance (Sonesson and Meuwissen, 2009; Villanueva et al., 2011; Taylor, 2014). The relative increase in accuracy depends on the amount of variation explained by the markers. It is also possible to combine breeding values for disease resistance and other performance traits into a selection index with specified weightings based on their economic importance for multi-trait genetic improvement.
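The combination of breeding values into a selection index can be illustrated with a short sketch. It assumes a simple linear index in which each (G)EBV is multiplied by an economic weight; the trait names, weights, and function names are hypothetical and serve only to show the arithmetic.

```python
def selection_index(ebvs, economic_weights):
    """Linear index: sum over traits of economic weight times breeding value."""
    return sum(weight * ebvs[trait] for trait, weight in economic_weights.items())


def rank_candidates(candidates, economic_weights):
    """Return candidate names ordered from highest to lowest index value."""
    scores = {name: selection_index(ebvs, economic_weights)
              for name, ebvs in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical breeding values and weights, for illustration only.
weights = {"growth": 1.0, "ipn_resistance": 4.0}
candidates = {
    "fish_A": {"growth": 2.0, "ipn_resistance": 0.5},
    "fish_B": {"growth": 1.0, "ipn_resistance": 1.0},
}
ranking = rank_candidates(candidates, weights)
```

With these assumed weights, a candidate with a lower growth EBV can still rank first if its disease resistance EBV is sufficiently high.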

In livestock species the effects of QTL have been shown to exhibit a moderate leptokurtic gamma distribution, suggesting a small number of loci of large effect and high number of loci of small effect (Hayes and Goddard, 2001), which is likely to be expected in aquaculture species for most economically relevant traits. Therefore, it is expected that more than one marker is necessary to efficiently assist breeding programs using molecular information. The availability of high throughput SNP genotyping platforms will allow the use of these markers in selective breeding for aquaculture species by means of the implementation of GS schemes (Goddard and Hayes, 2009; Sonesson and Meuwissen, 2009; Taylor, 2014). There are currently at least three independent initiatives to generate high density SNP arrays for Atlantic salmon, run by groups from three of the leading producers of this species: Chile, Scotland and Norway. The first of these to be published is that of Houston et al. (2014) whereby approximately 132 K SNP markers were identified and verified in several populations of farmed and wild Atlantic salmon. The Chilean group validated almost 160 K SNP markers from a 200 K SNP platform for Atlantic salmon, genotyping fish representing wild and farmed populations from Europe, North America, and Chile (Yáñez et al., 2014b). Additionally, a 57 K SNP chip is available for rainbow trout (Palti et al., 2014b). Therefore, it can be expected that programs using a high number of markers in population LD will be implemented in the near future. However, it is necessary to determine the economic benefit of using high-density panels, versus the use of low-density panels supported by imputation methodologies and establish a strategy for GS taking into account technical and economic feasibility.

### **CONCLUSION**

There is increasing information related to the determination of the genetic basis for disease resistance in salmonid species. The rapid development of genomic resources in these species will provide new tools for genetic dissection of these traits. The use of these tools will be of great help in identifying loci involved in the genetic variation of disease resistance. This information will increase our understanding of the underlying biology of resistance to disease. Further, it will be crucial for the implementation of MAS or GS programs that include disease resistance within the breeding objective. These methods will increase the accuracy of selection of candidates, thereby improving the selection response. However, the economic feasibility and profitability of implementing these new strategies, and their comparison with conventional selection schemes, must be studied for each particular breeding program in salmonids. It will also be necessary to assess the long-term impact of these strategies on the control of each specific disease, in an epidemiological context.

## **ACKNOWLEDGMENTS**

This work was partially funded by grants from CORFO (11IEI-12843 and 12PIE-17669), Government of Chile and from Programa U-Inicia, Vicerrectoría de Investigación y Desarrollo, Universidad de Chile. José M. Yáñez would like to thank María E. López and Liane Bassini for their help in revising the manuscript.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 26 September 2014; accepted: 06 November 2014; published online: 26 November 2014.*

*Citation: Yáñez JM, Houston RD and Newman S (2014) Genetics and genomics of disease resistance in salmonid species. Front. Genet. 5:415. doi: 10.3389/fgene.2014.00415 This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Yáñez, Houston and Newman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Applications in the search for genomic selection signatures in fish

## *María E. López1,2 \*, Roberto Neira1 and José M. Yáñez 2,3 \**

<sup>1</sup> Faculty of Agricultural Sciences, University of Chile, Santiago, Chile

<sup>2</sup> Aquainnovo, Puerto Montt, Chile

<sup>3</sup> Faculty of Veterinary and Animal Sciences, University of Chile, Santiago, Chile

#### *Edited by:*

Peng Xu, Chinese Academy of Fishery Sciences, China

#### *Reviewed by:*

Magadan Mompo Susana, Institut National de la Recherche Agronomique Unité de Virologie et Immunologie Moleculaires, France William S. Davidson, Simon Fraser University, Canada

#### *\*Correspondence:*

María E. López and José M. Yáñez, Faculty of Agricultural Sciences and Faculty of Veterinary and Animal Sciences, University of Chile, Avenue Santa Rosa 11735, La Pintana, P. O. Box 8820808, Santiago, Chile e-mail: me.lopez.dinamarca@gmail.com; jmayanez@uchile.cl

Selection signatures are genomic regions harboring DNA sequences functionally involved in the genetic variation of traits subject to selection. Selection signatures have been intensively studied in recent years because of their relevance to evolutionary biology and their potential association with genes that control phenotypes of interest in wild and domestic populations. Selection signature research in fish has been confined to a smaller scale, due in part to the relatively recent domestication of fish species and limited genomic resources such as molecular markers, genetic mapping, DNA sequences, and reference genomes. However, recent genomic technology advances are paving the way for more studies that may contribute to the knowledge of genomic regions underlying phenotypes of biological and productive interest in fish.

**Keywords: fish, selection, domestication, single nucleotide polymorphisms, genome**

## **INTRODUCTION**

Selection signatures are genomic regions that harbor DNA sequences involved in genetic variation of traits subject to natural or artificial selection (Qanbari et al., 2012). Currently, due to advances in genomic technologies and statistical methods, such signatures can be identified in the genomes of various species.

Most studies in this field of genetics are based on the concept of *hitchhiking*, which suggests that selection affects the genome at a specific region, leaving "signatures" around the selected gene(s) (Smith and Haigh, 1974). Specifically, the hitchhiking theory focuses on the spread of new variants in a population due to selection for their favorable effects (Przeworski, 2002; Kim and Nielsen, 2004). Selection involving alleles from the population's standing genetic variation produces specific and detectable DNA sequence patterns (Hermisson and Pennings, 2005).

The search for these molecular signatures has been the subject of intense research in recent years in both domesticated and wild populations of plants and animals, as well as in humans. These studies have been motivated by two main objectives: (1) a strong interest in the evolutionary past of the species and basic molecular mechanisms governing this evolution and (2) the expectation of an association between these genomic regions and biological functions or phenotypes of interest, since these regions should have some functional or adaptive importance underlying their selection (Nielsen et al., 2007). These studies are possible due to the development of various methods aimed at detecting selection at the molecular level in population samples. Information on allelic frequencies or haplotype patterns segregated in the population can be used to identify signatures, since selection modifies the patterns of genetic variation expected under the neutral theory of molecular evolution.

Most studies in domesticated populations have focused on detecting relatively old selection signatures dating back hundreds or thousands of generations (e.g., Flori et al., 2009), with few studies on genetic changes during early domestication stages (Trut et al., 2009).

Certain fish species provide unique models for studying the effects of selection and domestication, as their populations were domesticated recently and are available as both wild and domesticated populations simultaneously.

In this article we present different aspects involved in studying selection signatures at a genomic level in different species and discuss the potential application of these studies in fish populations to unravel recent selection and domestication processes in these species.

## **IDENTIFICATION OF LOCI ASSOCIATED WITH TRAITS OF INTEREST**

The search for genes controlling phenotypic variation can be performed in two different ways. The first is the "top–down" approach, which begins with knowledge of the phenotype of interest and uses genetic analysis to identify causal genes or regions. This approach includes candidate gene studies, identification of quantitative trait loci (QTLs), and association mapping. These studies have certain limitations, including the need for an *a priori* hypothesis about which genes underlie the trait of interest, information about family relationships between individuals, and access to a large number of relatives with phenotypic records (Gu et al., 2009). The second is the "bottom–up" approach, which, in contrast, begins with genomic information and involves statistical evaluation of molecular information to identify regions subject to selection (Ross-Ibarra et al., 2007). This approach searches for patterns of linkage disequilibrium, genetic differentiation, or frequency spectrum that are inconsistent with the neutral evolution model in order to identify selection signatures (Qanbari et al., 2010). Recent advances in genomics provide a new paradigm for the "bottom–up" strategy in the form of population genomics, a discipline that infers genetic and evolutionary parameters of a population based on datasets from the whole genome (Black et al., 2001).

In this context, population genomics relies on two basic principles or assumptions. First, neutral loci will be equally affected by demographic effects and by the evolutionary history of the population. Second, loci under selection will tend to behave distinctively, revealing atypical variation patterns (Luikart et al., 2003).

## **MODELS OF SELECTION**

Natural selection can be defined as the differential contribution of genetic variation to future generations (Aquadro et al., 2001) due to differential reproduction of some phenotypes/genotypes over others under prevailing environmental conditions at a given time (Futuyma, 1998). It is the driving force behind Darwinian evolution and can be subdivided into different types, depending on the evolutionary outcome (Hurst, 2009).

Directional selection tends to decrease variation *within* a population but may increase or decrease variation *among* populations. Positive selection is a type of directional selection that favors alleles that increase fitness of individuals. When directional selection eliminates unfavorable mutations, it is called purifying selection (also known as negative selection).

Diversifying (or disruptive) selection favors variety and benefits individuals with extreme phenotypes over those with intermediate phenotypes. In this type of selection, the propagation of an allele never reaches fixation; it may therefore occur when an allele is initially subject to positive selection and then to negative selection once its frequency becomes too high (Nielsen, 2005).

Balanced selection, which helps to maintain an equilibrium point at which both alleles remain in the population, has several forms, including frequency-dependent selection and overdominance, which occurs when the heterozygote has the higher biological fitness, and therefore variability is maintained in the population (Nielsen, 2005).

## **SELECTION SIGNATURES**

In the classic "hitchhiking" scenario, first described by Smith and Haigh (1974), a new allelic variant that represents a favorable adaptive substitution originates within the population as a new mutation, and its frequency increases as a result of constant selection pressure. When a favorable allele is selected, and its frequency increases to fixation in a population, genetic variation in the surrounding DNA segment is altered; that is, the increased frequency of the selected allele also produces increased frequency of closely-linked alleles (Pennings and Hermisson, 2006).

The ancestral variation, i.e., genetic variation present in a population prior to a selection process, is maintained only if recombination during this phase disrupts the association between an adjacent locus and the selected site. The resulting pattern of such a selective event is a strong reduction in genetic variation around the selected site, known as a "*hard sweep,"* which corresponds to the classic selective sweep (Pritchard et al., 2010).

There is a second scenario in which an adaptive substitution involves multiple copies of a favorable allele in the population. This may occur for two reasons. First, when an adaptation arises from genetic variation, many copies of the favorable allele may be present in the population. Fixation of this allele may involve descendants of more than one of these copies. Second, a favorable allele can be introduced in the population by recurrent mutation or migration during a selection phase, and again, several descendants of independent origin may contribute to the allelic fixation. In both cases, different alleles of loci adjacent to any such favorable copies will be retained in the population, resulting in different haplotypes (Pennings and Hermisson, 2006).

Selection signatures involving descendants of more than one copy of the selected allele and, therefore with different haplotypes at closely-linked sites, are called "soft sweeps." This type of selection signature results in different haplotype patterns than the "*hard sweeps*" described above and it is more difficult to detect as it only produces a slight reduction in the levels of adjacent polymorphisms (Cutter and Payseur, 2013).

Furthermore, when adaptation occurs by polygenic selection, it induces an increase in the allelic frequency of several loci which have a favorable effect on a particular phenotype; however, these polygenic alleles do not necessarily achieve fixation, and the resulting haplotype pattern corresponds to several partial selection signatures or multiple "partial sweeps" (Pritchard et al., 2010).

Finally, when purifying or negative selection reduces the frequency or eliminates a deleterious allele, the genetic diversity at linked loci also decreases, which is known as "*background selection*" (Charlesworth, 1993).

**Figure 1** schematically summarizes the patterns caused by "*hard sweeps*," "*soft sweeps*," and "partial sweeps" that correspond to selective events for favorable variants in a population, as well as the pattern produced by "*background selection*."

## **DOMESTICATION AND RECENT ARTIFICIAL SELECTION IN FISH**

Domestication is the process by which various species have been adapted to a captive environment by humans. Such adaptation is accomplished through systematic breeding over generations and is characterized by changes in behavior, morphology, and physiology, as well as adaptive genetic changes caused by artificial and natural selection (Price, 1999).

In fish, domestication occurred very recently compared to land animals. One theory to explain the late domestication of aquatic species suggests that, due to the high fertility of these species, only a small number of broodstock were required to obtain a sufficiently large progeny in subsequent generations. After a few generations, inbreeding increases considerably, and the resulting inbreeding depression reduces fitness and productive performance. As a result, fish farmers were forced to repeatedly take new broodstock from the wild, interrupting the continuity of domestication and breeding (Gjedrem et al., 2012). For this reason, aquaculture has lagged behind land animal and plant culture in the use of breeding to enhance biological production efficiency.

On the other hand, it is estimated that less than 10% of aquaculture production is based on genetically improved stocks (Gjedrem, 2012), although the annual genetic gains reported for aquaculture species are substantially higher than for land animals. For example, selection responses reported for growth-related traits can even exceed 10% in fish populations, indicating that selective breeding can substantially enhance aquaculture production (Gjedrem et al., 2012). Recent high selective pressures in farmed fish populations may have shaped genome variation in regions harboring causative mutations of selected traits. The identification of these regions may help in understanding the effects of selection events and in identifying genetic variants involved in phenotypic variation in fish populations.

## **EXAMPLES OF DOMESTICATION AND BREEDING PROGRAMS IN AQUATIC SPECIES**

The fish belonging to the family *Cyprinidae* are most likely the first fish species to be domesticated. For instance, the goldfish (*Carassius auratus*) is an ornamental fish, believed to have been domesticated in China before the 16th century and later taken to Japan and Europe (Purdom, 1993). Another important group of ornamental fish is the koi carp, a variety derived from common carp (*Cyprinus carpio*) and mainly cultivated in Japan. The large variety of colors and forms among koi carp resulted from directed selection and crossbreeding (Gjedrem, 2005).

There is evidence in common carp of a large response to selection for survival to furunculosis (Schaperclaus, 1962). Ilyassov (1987) reported results from 4 to 5 generations of selection in this species for resistance to dropsy disease, which increased survival by 30–40% compared to unselected carp. On the other hand, selection responses reported for growth rate in rohu carp (*Labeo rohita*) have been particularly high, reaching almost 30% per generation (Gjedrem, 2012).

Furthermore, salmonid species are the most intensively selected fish populations. In this regard, the rainbow trout (*Oncorhynchus mykiss*) has a long history of domestication and breeding in the United States, Norway, Finland, and Denmark (McAndrew and Napier, 2011). In 1932, investigators began to select individuals to improve growth rate, number of eggs, and characteristics of sexual maturity (Donaldson and Olson, 1957). Currently there are 13 breeding programs worldwide aimed at improving growth rate, age at sexual maturity, fillet quality, and disease resistance in this species (Rye et al., 2010).

In the case of Atlantic salmon (*Salmo salar*), breeding programs exist in Norway, Scotland, Ireland, Australia, and Chile (Norris et al., 1999; Metcalfe et al., 2003; Glover et al., 2009; Dominik et al., 2010; Rye et al., 2010; McAndrew and Napier, 2011). Several traits of commercial interest such as growth, sexual maturity, meat quality, and disease resistance have been incorporated into breeding objectives. Furthermore, findings from genomic technologies have been incorporated into these breeding programs, for example, the use of QTLs to assist selection for resistance against the viral disease named infectious pancreatic necrosis (Houston et al., 2008; Moen et al., 2009).

Among Pacific salmon, the Chinook salmon (*Oncorhynchus tshawytscha*) originating in British Columbia (BC), Canada, was one of the first salmon species to be domesticated (Kim et al., 2004). Currently, its farming is limited and there are two breeding programs in operation (Rye et al., 2010). Moreover, genetic improvement programs for coho salmon (*Oncorhynchus kisutch*) have been successful in selecting for harvest weight and early spawning, with selection responses of about 10% per generation (Neira et al., 2006).

Tilapias are the second-most important group of cultivated fish in the world. The dominant species is the Nile tilapia (*Oreochromis niloticus*); however, other species of the genus *Oreochromis* are also cultivated (Neira, 2010). The GIFT (Genetic Improvement of Farmed Tilapias) program, which began in 1987 in the Philippines, systematically compared wild and commercial strains in various aquatic environments and established a family-based selection system to improve growth rate (Eknath et al., 1993). The program is currently managed by the WorldFish Center in Malaysia, and genetic gains for growth-related traits range between 10 and 15% (Ponzoni et al., 2011).

Breeding programs have recently been established for other important species such as sea bass (*Dicentrarchus labrax*; Vandeputte et al., 2009), sea bream (*Sparus aurata*), turbot (*Scophthalmus maximus*), Atlantic cod (*Gadus morhua*; Glover et al., 2011), halibut (Glover et al., 2007), and tuna (Owen, 2011).

All of these domestication and artificial selection processes shape the genomes of cultured fish populations, resulting in selection signatures that could potentially be identified using molecular and statistical methods.

## **APPROACHES USED FOR DETECTING SELECTION SIGNATURES**

When a new allelic variant that does not affect the fitness of individuals originates in a population, it is not affected by natural selection and is said to be neutral. Statistical tests aimed at testing a neutral evolution model can be divided into three main classes: (1) tests based on polymorphisms within species; (2) tests based on the differences between species; and (3) tests that use information within and between species. A description of these three approaches is given below.

## **TESTS BASED ON POLYMORPHISMS WITHIN SPECIES**

### *Frequency spectrum*

The frequency spectrum is defined as the allele frequency distribution of a large number of independent loci in a given sample (Nielsen, 2005; Vogl and Clemente, 2012). Deviations from expectations of the neutral model (no selection, recombination, population subdivision, or changes in the effective population size) could be indicative of selection: purifying or negative selection tends to increase the fraction of mutations segregating at low frequencies, while positive selection increases the number of alleles observed at high frequencies (Hurst, 2009).
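To make the definition concrete, the following sketch tabulates an unfolded site frequency spectrum from a small matrix of phased sequences. It assumes biallelic sites coded 0 (ancestral) and 1 (derived), which in practice requires an outgroup to polarize alleles; the function name and data layout are illustrative assumptions.

```python
import numpy as np


def site_frequency_spectrum(haplotypes):
    """Unfolded site frequency spectrum.

    haplotypes: (n_sequences, n_sites) array with 0 = ancestral, 1 = derived.
    Returns an array of length n_sequences - 1 whose entry i - 1 is the number
    of segregating sites at which the derived allele occurs in i sequences.
    """
    haps = np.asarray(haplotypes, dtype=int)
    n_sequences = haps.shape[0]
    derived_counts = haps.sum(axis=0)
    # Discard monomorphic sites (derived allele absent or fixed in the sample).
    segregating = derived_counts[(derived_counts > 0) & (derived_counts < n_sequences)]
    return np.bincount(segregating, minlength=n_sequences)[1:n_sequences]


# Toy sample of four sequences and four sites (hypothetical data).
sample = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
spectrum = site_frequency_spectrum(sample)
```

In this toy sample, two sites carry a derived singleton, one site is fixed for the derived allele, and one is monomorphic for the ancestral allele, so only the singleton class is populated.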

Many tests for detecting selection signatures are based on information provided by the frequency spectrum obtained from DNA sequence data. One of the most commonly used is Tajima's (1989) *D* test, which compares two estimates of genetic variation (θ). The first is obtained from the average number of nucleotide differences between pairs of sequences, and the second from the total number of segregating sites (Nielsen, 2005). If the difference between these two estimates is greater than expected under neutral evolution, this model is rejected. Other tests have incorporated phylogenetic information in order to estimate the direction of change and increase the power to detect deviations from the null hypothesis of the neutral model (Perfectti et al., 2009). One such test is that of Fu and Li (1993), which also calculates a statistic based on the comparison of two estimates of genetic variation, adding phylogenetic information. For example, a related species may be added as an outgroup, such as the inclusion of the chimpanzee in an analysis of human genetic variation (Nielsen, 2005). Likewise, Fay and Wu (2000) developed a test based on the concept that the frequency spectrum expected under neutrality is enriched with mutations at low frequencies, and that, therefore, an excess of mutations at high frequencies is atypical.
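The comparison underlying Tajima's *D* can be sketched in a few lines. The example below follows the usual formulation, in which the average number of pairwise differences is contrasted with the number of segregating sites scaled by a harmonic-number constant, and the difference is divided by its estimated standard deviation. The function name and the 0/1 coding of biallelic sites are assumptions made for illustration, and significance testing is not covered.

```python
import math

import numpy as np


def tajimas_d(haplotypes):
    """Tajima's D from an (n_sequences, n_sites) array of biallelic 0/1 alleles.

    Returns NaN when there are no segregating sites or fewer than four
    sequences, where the variance term is zero.
    """
    haps = np.asarray(haplotypes, dtype=int)
    n = haps.shape[0]
    counts = haps.sum(axis=0)
    counts = counts[(counts > 0) & (counts < n)]   # segregating sites only
    s = int(counts.size)
    if n < 4 or s == 0:
        return float("nan")

    # Estimate 1: average number of pairwise differences.
    pi = float((2 * counts * (n - counts)).sum()) / (n * (n - 1))

    # Estimate 2: number of segregating sites scaled by a1.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = s / a1

    # Constants for the variance of the difference between the two estimates.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of rare variants yields negative values, whereas an excess of intermediate-frequency variants yields positive values; demographic events such as population expansion or subdivision can produce similar deviations, so the statistic alone does not prove selection.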

Researchers have used this approach to detect selection signatures in several species. In humans, for example, evidence of selection has been found in genes related to the immune system and social behavior (Sabeti et al., 2002; Williamson, 2007). In other species such as chickens, it has been possible to identify genomic regions related to production-related traits such as eggshell hardness and immune system characteristics (Qanbari et al., 2012).

### *Linkage disequilibrium (LD) and haplotype structure*

Linkage disequilibrium (LD) refers to the non-random association of alleles at two or more loci. That is, if two alleles at two loci segregate together in greater proportion than expected by chance, it is said that these loci are in linkage disequilibrium. This measure has been widely used to study various demographic events and evolutionary processes in plants and animals, such as breeding systems, patterns of geographic subdivision, events of natural and artificial selection, gene conversion, mutation, and other forces that can cause changes in gene frequency (Slatkin, 2008). LD is affected by different evolutionary factors, including recombination, admixture, bottlenecks, gene flow, genetic drift, inbreeding, and selection (Slatkin, 2008). As a consequence, LD across the genome can vary within and between populations.
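The notion of non-random association can be quantified with the classical coefficient D and the squared correlation r². The sketch below assumes phased haplotypes at two biallelic loci, each coded 0/1; the function name is an illustrative assumption.

```python
import numpy as np


def ld_statistics(locus_a, locus_b):
    """Return (D, r2) for two biallelic loci given phased 0/1 alleles.

    D = p_AB - p_A * p_B, and r2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).
    r2 is NaN when either locus is monomorphic in the sample.
    """
    a = np.asarray(locus_a, dtype=float)
    b = np.asarray(locus_b, dtype=float)
    p_a = a.mean()                 # frequency of allele 1 at locus A
    p_b = b.mean()                 # frequency of allele 1 at locus B
    p_ab = (a * b).mean()          # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denominator if denominator > 0 else float("nan")
    return d, r2
```

For example, two loci whose alleles always co-occur give r² = 1, whereas loci whose alleles are combined at random give values close to 0.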

Thus, another approach to detect genomic selection signatures is based on statistical comparisons of atypical LD patterns at specific haplotypes of certain genomic regions that are inconsistent with the neutral evolution model (Mueller, 2004). This approach has been used in numerous studies to detect selection signatures in humans and in domesticated species (Sabeti et al., 2002; Voight et al., 2006; Hayes et al., 2008). These studies are based on the concept that in a large population, a neutral variant, which by definition is not under selection, will take many generations to become fixed or lost. Recombination and the passing of generations act with stronger intensity, and therefore, LD around these neutral alleles erodes quickly, leaving a smaller surrounding haplotype (Kimura, 1983; Nielsen et al., 2005b; Sabeti et al., 2006).

Conversely, alleles under positive or balanced selection carry other linked alleles with them, generating increased LD in the genomic region, as described for the hitchhiking effect (Smith and Haigh, 1974). LD between these alleles is slowly eroded, such that the adjacent haplotype is longer than expected by chance (Sabeti et al., 2002). Thus, unusually long haplotypes can reflect positive selection. This forms the basis of the EHH statistic ("extended haplotype homozygosity") suggested by Sabeti et al. (2002), which is defined as the probability that two randomly selected chromosomes carrying the core haplotype are identical by descent, and which also measures the decay of haplotype homozygosity as a function of distance. EHH allows for identification of regions with atypical frequencies of extended haplotypes and has been effectively used to detect signatures of recent positive selection within a population (Tang et al., 2004; Walsh et al., 2006).
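A simplified calculation of EHH is sketched below. It assumes that the core is a single SNP rather than a multi-marker core haplotype, and it uses identity by state over the interval as a proxy for identity by descent; both are simplifications, and the function name and data layout are hypothetical.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, end_index):
    """Extended haplotype homozygosity for one core allele.

    haplotypes: sequence of phased haplotypes (each a sequence of alleles).
    core_index: position of the core SNP within each haplotype.
    core_allele: allele at the core SNP that defines the carrier chromosomes.
    end_index: marker index up to which homozygosity is evaluated (inclusive).
    Returns the probability that two randomly chosen carrier chromosomes are
    identical over the whole interval, or NaN with fewer than two carriers.
    """
    low, high = sorted((core_index, end_index))
    carriers = [tuple(h[low:high + 1]) for h in haplotypes
                if h[core_index] == core_allele]
    n_carriers = len(carriers)
    if n_carriers < 2:
        return float("nan")
    class_sizes = Counter(carriers).values()
    return sum(comb(size, 2) for size in class_sizes) / comb(n_carriers, 2)
```

Evaluating this function at increasing distances from the core gives the decay curve of EHH; a slowly decaying curve at a frequent core allele is the pattern expected after recent positive selection.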

Voight et al. (2006) developed the |*iHS*| statistic, or "integrated Haplotype Score," which compares the area under the EHH curve between ancestral and derived alleles. This approach is based on the fact that the area under the EHH curve of an allele under selection will be greater than that of a neutral allele; therefore, the integral of EHH captures this effect. *iHS* corresponds to a standardized log ratio between the areas under the curve of ancestral and derived alleles, which is close to 0 when the EHH decay is similar for both types of alleles. Large negative *iHS* values indicate an extended haplotype around a derived allele, whereas large positive values indicate an extended haplotype around an ancestral allele.
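The integration step behind *iHS* can be illustrated as follows. The sketch assumes that EHH has already been evaluated at a series of marker positions for the ancestral and the derived allele, integrates each curve with the trapezoidal rule, and returns the unstandardized score as the natural logarithm of the ratio of the two areas. The standardization within derived-allele frequency classes used by Voight et al. (2006) is deliberately omitted, and the function names are illustrative assumptions.

```python
import math


def integrated_ehh(positions, ehh_values):
    """Area under an EHH curve computed with the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(positions)):
        width = positions[i] - positions[i - 1]
        area += 0.5 * (ehh_values[i - 1] + ehh_values[i]) * width
    return area


def unstandardized_ihs(positions, ehh_ancestral, ehh_derived):
    """ln(iHH_ancestral / iHH_derived); negative when the derived allele
    sits on the longer haplotype."""
    ihh_ancestral = integrated_ehh(positions, ehh_ancestral)
    ihh_derived = integrated_ehh(positions, ehh_derived)
    return math.log(ihh_ancestral / ihh_derived)
```

In a genome scan, the unstandardized scores would then be standardized against SNPs of similar derived-allele frequency before extreme values are interpreted.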

The *iHS* statistic is more sensitive for detecting rapid increases in frequencies of the derived allele produced by selection. However, it cannot detect selection signatures resulting from complete or nearly complete fixation of a beneficial allele in the population, and therefore cannot detect a significant fraction of variants under positive selection (Qanbari et al., 2011). For this reason, Tang et al. (2007) reported a new method involving comparison of *EHH* at the same site, but between populations, i.e., an approach based on the genetic diversity among divergent populations. These statistics are called site-specific *EHH* (*EHHS*); the area under the *EHHS* curve (*iES*); and the standardized ratio of *iES* between two populations (Hellmann et al., 2003), which reflect haplotype variation among populations. The search for selection signatures from EHH statistical derivatives has been performed in several species such as cattle (Qanbari et al., 2011), poultry (Li et al., 2012; Zhang et al., 2012), swine (Ai et al., 2013) and humans (Sabeti et al., 2007).

#### *Index of population differentiation*

The *F*ST statistic (Wright, 1951) measures the proportion of genetic variation that is due to differences in allele frequencies between populations, relative to the variation within populations (Holsinger and Weir, 2009). It has been one of the most widely used methods for detecting genomic regions that have been under selection (Gianola et al., 2010; Qanbari et al., 2011). The *F*ST for a locus that has been selected in one population but not in another will be higher than for loci not affected by selection, where genetic differentiation is mainly caused by genetic drift (Holsinger and Weir, 2009). Genetic drift affects all loci in the genome similarly; however, loci under selection often behave differently and therefore may present atypical patterns of variation. These atypical patterns can be detected by genotyping, for example, a large number of single nucleotide polymorphisms (SNPs) throughout the whole genome, where loci influenced by selection may be identified by deviations from the empirical distribution of the *F*ST statistic (Cavalli-Sforza, 1966; Akey et al., 2002). That is, relative to a neutral model, outliers with values below a certain level suggest the effect of balanced selection, while outliers with values above a certain level are indicative of directional selection.
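The outlier logic can be sketched with a simple per-SNP estimator. The example assumes two populations of equal weight and biallelic SNPs, computes *F*ST as (H_T - H_S) / H_T from expected heterozygosities, and flags SNPs above an empirical quantile. The estimator choice, the 0.99 threshold, and the function names are assumptions made for illustration; published scans typically use sample-size-corrected estimators.

```python
import numpy as np


def fst_two_populations(p1, p2):
    """Per-SNP FST for two equally weighted populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    Returns NaN when the SNP is monomorphic across both populations.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)                          # pooled heterozygosity
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return float("nan")
    return (h_total - h_within) / h_total


def flag_fst_outliers(fst_values, upper_quantile=0.99):
    """Indices of SNPs whose FST exceeds the chosen empirical quantile."""
    values = np.asarray(fst_values, dtype=float)
    threshold = np.nanquantile(values, upper_quantile)
    return np.flatnonzero(values > threshold)
```

A SNP fixed for alternative alleles in the two populations gives *F*ST = 1, and identical frequencies give 0; in a genome scan, the flagged upper-tail SNPs would be candidates for directional selection that require further validation.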

Various estimates of the *F*ST statistic have been developed and applied in a number of studies to search for selection signatures (Akey et al., 2002; Hayes et al., 2009; Amaral et al., 2011). However, although the outlier approach may be effective in identifying genes under selection, it poses several challenges, such as susceptibility to genotyping errors, population stratification, and false positives, as well as variations in mutation rate and low sensitivity (Narum and Hess, 2011). It is also well known that the outlier detection methods have limited power to detect disruptive selection (Beaumont and Balding, 2004) and weak forms of divergent selection (Wright and Gaut, 2005).

#### **TESTS BASED ON DIFFERENCES BETWEEN SPECIES**

The statistical methodology to detect selection signatures by comparing information between species relies on the fact that genomic substitutions in coding regions are present in two forms: nonsynonymous mutations (dn), which can lead to the replacement of amino acids in the resulting proteins, and synonymous mutations (ds), which do not cause amino acid substitution because of the redundancy of the genetic code (Nielsen, 2005; Biswas and Akey, 2006).

The dn/ds ratio provides information about evolutionary forces acting upon a particular gene. For example, at loci under neutrality, the dn/ds ratio will be equal to 1. In genes subject to functional constraints, non-synonymous substitutions are detrimental and tend to be eliminated from the population by negative selection; therefore, dn/ds < 1. Conversely, an excess of non-synonymous mutations over synonymous mutations (dn/ds > 1) provides evidence for the action of positive selection in favor of non-synonymous substitution, which could provide a comparative advantage at the protein level (Nielsen et al., 2005a).
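The interpretation of the dn/ds ratio can be sketched numerically as follows. This example assumes that the numbers of non-synonymous and synonymous differences and sites between two aligned coding sequences have already been counted, and it applies a Jukes-Cantor correction to each proportion, which is one common simple approach. The function names, the tolerance around 1, and the counts are hypothetical; formal inference would rely on statistical tests rather than a fixed tolerance.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance from a proportion of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Ratio of corrected non-synonymous to synonymous substitution rates."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0.0:
        raise ValueError("ds is zero; the ratio is undefined")
    return dn / ds


def interpret(omega, tol=0.05):
    """Map dn/ds onto the three regimes described in the text.

    tol is an arbitrary illustrative tolerance (an assumption).
    """
    if omega > 1.0 + tol:
        return "positive selection"
    if omega < 1.0 - tol:
        return "negative (purifying) selection"
    return "consistent with neutrality"


# Hypothetical counts for a conserved gene and for a fast-evolving gene.
conserved = dn_ds(nonsyn_diffs=10, nonsyn_sites=700, syn_diffs=30, syn_sites=300)
fast = dn_ds(nonsyn_diffs=30, nonsyn_sites=700, syn_diffs=5, syn_sites=300)
```

With these made-up counts, the first gene yields a ratio below 1 and the second a ratio above 1, matching the negative and positive selection scenarios described above.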

Based on these concepts, several studies have detected selection in many genes and organisms, such as genes related to immune response (Endo et al., 1996; Hughes, 1997; Sawyer et al., 2004), viral receptor genes (Fitch et al., 1997; Nielsen and Yang, 1998; Bush et al., 1999), genes associated with fertility (Swanson et al., 2001, 2003), and genes involved in sensory perception and smell in humans (Gilad et al., 2000).

#### **TESTS THAT USE INFORMATION WITHIN AND BETWEEN SPECIES**

The neutral theory of molecular evolution indicates that genomic regions that evolve rapidly and, thus, have high divergence between species, will also show high levels of polymorphism within species. The Hudson–Kreitman–Aguade (HKA) test compares the level of polymorphism within each species and the observed divergence between related species for two or more loci. The test can determine whether the observed difference is more likely due to neutral or adaptive evolution (Hudson et al., 1987). The HKA test is the precursor to the McDonald–Kreitman test (Howe et al., 2013), which compares synonymous (PS) and non-synonymous (PN) mutations at a specific locus that are polymorphic within a species and synonymous (DS) and non-synonymous (DN) mutations that are fixed between species. Under neutrality, the ratios PN/PS and DN/DS should be the same, while positive selection leads to increased divergence of non-synonymous substitutions (DN/DS > PN/PS; McDonald and Kreitman, 1991).
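The comparison at the heart of the McDonald–Kreitman test can be sketched as a 2 × 2 contingency analysis. The example below computes the ratio (PN/PS)/(DN/DS), often called the neutrality index, together with a one-sided Fisher's exact probability for an excess of fixed non-synonymous differences. The function name and the counts are hypothetical and chosen only for illustration; this is a sketch of the general idea rather than the exact procedure of any cited study.

```python
from math import comb


def mk_test(pn, ps, dn, ds):
    """McDonald-Kreitman style comparison of polymorphism and divergence.

    pn, ps: non-synonymous and synonymous polymorphisms within a species.
    dn, ds: non-synonymous and synonymous fixed differences between species.
    Returns (neutrality_index, one_sided_p), where the p-value is the
    hypergeometric probability of observing at least dn non-synonymous
    fixed differences given the table margins.
    """
    if ps == 0 or dn == 0 or ds == 0:
        raise ValueError("ps, dn and ds must be non-zero for the ratios")
    neutrality_index = (pn / ps) / (dn / ds)

    total = pn + ps + dn + ds
    fixed = dn + ds                 # column margin: fixed differences
    nonsyn = pn + dn                # row margin: non-synonymous changes
    denom = comb(total, fixed)
    p_value = sum(
        comb(nonsyn, k) * comb(total - nonsyn, fixed - k)
        for k in range(dn, min(fixed, nonsyn) + 1)
    ) / denom
    return neutrality_index, p_value


# Hypothetical counts showing an excess of fixed non-synonymous changes.
ni, p = mk_test(pn=2, ps=42, dn=7, ds=17)
# Hypothetical counts where both ratios are identical (neutral expectation).
ni_neutral, p_neutral = mk_test(pn=10, ps=20, dn=10, ds=20)
```

A neutrality index below 1 corresponds to DN/DS > PN/PS, the pattern expected under positive selection, whereas a value close to 1 is the neutral expectation.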

### **GENOMIC RESOURCES IN FISH**

In recent decades, the development of DNA markers has greatly contributed to the study of animal genetics. DNA markers allow us to observe and exploit variation across the genome of an individual (Liu and Cordes, 2004; Tier, 2010).

In fish, a wide range of DNA markers have been used, including amplified fragment length polymorphisms (AFLP), random amplified polymorphic DNA (RAPD), sequence tagged sites (STS), variable number of tandem repeats (VNTR), microsatellites or simple sequence repeats (SSR), SNP, and expressed sequence tags (EST; Liu, 2007). Currently, with the development of high-throughput sequencing technologies many gigabases of nucleotide sequences can be generated in a short period of time, and many SNP and other polymorphisms can be detected using bioinformatics methods (Liu, 2011). These techniques provide an affordable and reliable scale of DNA sequencing in several organisms (Mardis, 2008). They are extensively used in *de novo* sequencing, quantification of gene expression by RNA-seq ("RNA sequencing"; Wang et al., 2009), massive identification of SNP markers using RAD-sequencing ("restriction site associated DNA sequencing"; Rowe et al., 2011), and population genomics studies (Hohenlohe et al., 2010; De Wit et al., 2012).

Although teleost fish are the largest group of vertebrates (about 27,000 species), they are underrepresented in genome sequencing projects (Spaink et al., 2013). **Table 1** shows some of the species that have undergone genome sequencing projects to date.

*Note to Table 1:* Extracted and modified from Spaink et al. (2013). The terms scaffolds or contigs indicate that the genome of the species has been partially sequenced, and the term chromosome indicates that sequencing has been anchored to the existing physical map of the species.

## **SELECTION SIGNATURES IN FISH**

In fish, studies aimed at detecting selection signatures are performed mainly in the context of molecular ecology disciplines. Most of them have been limited to a low level of resolution and restricted to specific genomic regions.

### **MODEL FISH SPECIES**

Using SNP markers from ESTs, loci with outlier *F*ST values were identified in wild populations of zebrafish (*Danio rerio*), suggesting directional selection in genes associated with energy metabolism, homeostasis regulation, and signal transduction, which could be associated with local adaptation among different populations. Further, evidence was found to suggest balancing selection of the gene encoding the receptor for the NS1A influenza virus (Whiteley et al., 2011). In the same study, outlier *F*ST values were found for loci in laboratory strains related to oxidoreductase activity, chromatin condensation, immune response, and induction of apoptosis, among other processes, which could be associated with the domestication process of cultured strains (Whiteley et al., 2011).

### **CICHLIDS**

Keller et al. (2013) detected outlier SNP patterns between five cichlid species from the Lake Victoria area in East Africa, identifying signatures of divergent selection between the two genera that include these species. These selection signals were associated with male color, depth distribution, feeding patterns, and morphological traits that distinguish the genera. Moreover, evidence has been found to suggest selection in the homeobox genes (*dlx*) involved in the development of the nervous system, the craniofacial skeleton, and the formation of connective tissue and appendages (Diepeveen et al., 2013).

### **SALMONIDS**

In lake whitefish (*Coregonus clupeaformis*), a fish of the salmon family distributed across northern Alaska and all of Canada, 24 loci were identified that revealed selection signatures associated with QTL of certain adaptive traits such as swimming behavior, growth rate, morphology, and reproductive traits (Rogers and Bernatchez, 2007).

In Atlantic salmon, Vasemägi et al. (2012) used microsatellite markers and SNP to locate 10 genomic regions showing signatures of directional selection related to characteristics such as growth rate and morphology. Martinez et al. (2013) found strong evidence of selection in a microsatellite marker on chromosome 3, which harbored QTL for body weight. Furthermore, there is evidence that genes associated with immune response have been subject to greater selection pressure compared with other regions of the genome (Tonteri et al., 2010; Portnoy et al., 2014). Other studies in the genera *Oncorhynchus*, *Salmo*, and *Salvelinus* have revealed signatures of balancing selection for genes of the major histocompatibility complex IIB (Aguilar and Garza, 2007; Limborg et al., 2012). In brown trout, analysis with markers linked to genes related to the immune response showed evidence that these genes have been subjected to selection (Jensen et al., 2008). Finally, evidence was found in both brown trout and sockeye salmon of disruptive selection at two loci within the major histocompatibility complex IIB (Hansen et al., 2010; Gomez-Uchida et al., 2011; Meier et al., 2011).

### **OTHER FAMILIES**

In guppies (*Poecilia reticulata*), outlier *F*ST values suggest that between 3.5 and 6.5% of SNP markers are under directional selection. Some of these loci are near QTL associated with ornamental traits and are also located within ESTs (Willing et al., 2010).

In Atlantic cod, Sick detected evidence of selection in 1960 at the locus encoding hemoglobin (*Hb I*) and at the *Pan I* locus, which encodes a protein related to the neuroendocrine system and has recently been associated with vesicle transport in adipocytes (Pogson, 2001). In the same species, Moen et al. (2008) identified 29 SNP with outlier *F*ST values, suggesting that these loci are or have been under selection. These loci were found in genes involved in muscle contraction, immune response, and production of ribosomal proteins. Moreover, Nielsen et al. (2009) found evidence of directional selection for local adaptation to various environmental conditions, including loci with outlier *F*ST values associated with genes involved in the production of heat shock proteins (*Hsp90*), determination of sexual behavior (*Aromatase*), and formation of photoreceptor cells for perception of light (*rhodopsin*).

In stickleback (*Gasterosteus aculeatus*), a fish of the *Gasterosteidae* family, studies using microsatellite markers to assess genetic diversity among marine and freshwater populations revealed evidence of directional selection that might be associated with adaptation of certain populations to freshwater environments (Mäkinen et al., 2008).

## **FUTURE DIRECTIONS**

The search for and detection of genomic signatures produced by selection has provided valuable information that contributes to the understanding of evolutionary forces affecting the genome and gene functions that control phenotypes of biological and economic interest (Nielsen, 2005; Nielsen et al., 2007).

Some fish species provide the great advantage of simultaneous availability as both a wild and a cultivated population. Additionally, these species have unique characteristics in terms of population structure and intra-specific adaptive divergence, mainly due to the diversity of environmental conditions that fish populations inhabit, resulting in populations that exhibit characteristics of strong local adaptation. Comparative studies among these populations would provide benefits in terms of elucidating the effects of selective processes and recent domestication events, which could improve the understanding regarding the impact of the interaction between domesticated and wild populations, the identification of genetic factors involved in economically important traits for aquaculture and unravelling the actual phenotypic variation within and between fish populations.

In domesticated species, the main motivation behind the search for selection signatures lies in the possibility of finding genes or genomic regions associated with traits of economic interest. The development of next-generation sequencing technologies and high-throughput genotyping has made it possible to investigate the effect of selective pressures on genome variation in several domesticated species. In cattle and sheep, researchers have detected selection signatures associated with carcass yield traits, tail fat deposition, dairy traits (Moradi et al., 2012; Rothammer et al., 2013), reproductive traits (Gautier and Naves, 2011; Qanbari et al., 2011), immune response (Gautier and Naves, 2011), coat color, and horn development (Druet et al., 2013), among other characters of interest. Also, in swine, selection signatures have been identified in genomic regions associated with traits such as coat color, ear morphology, reproductive characteristics, and fat deposition (Wilkinson et al., 2013). In chickens, researchers have identified selection signatures associated with eggshell hardness and immune system characteristics (Qanbari et al., 2012), to mention just a few examples. These studies may provide a basis for conducting similar research that allows for investigation of the genomic regions affected by the processes of domestication and natural or artificial selection in fish populations, allowing for discovery of new genes that underlie phenotypic traits of interest and understanding processes relevant for conservation purposes.

There is currently little genomic information for fish species as compared to humans or domesticated animals. This is one of the reasons why selection signatures studies have been conducted in only a few species and generally limited to a low level of genomic coverage (Vasemägi et al., 2005, 2012). However, recent advances in genomic technologies, including high quality reference genome sequences, construction of genetic maps, and development of high-density SNP arrays are paving the way for systematic study of genetic variation in these species.

The development and application of next-generation sequencing approaches will represent a powerful strategy to improve resolution and accuracy when detecting regions under selection in several species. This may lead to determination of the causative genetic factors involved in several biological aspects of aquaculture species. However, the application of these results in aquaculture development requires further studies aimed at determining effective and practical applications of this technology. Disciplines likely to benefit from the discovery of selected regions using next-generation sequencing include, for example, genetic improvement, vaccine and pharmaceutical development, and fish nutrition.

## **CONCLUSION**

The development of genomic methodologies has contributed greatly to the study of genetic variation between and within species. High-resolution studies at the level of the whole genome can identify selection signatures explaining phenotypic variation between and within populations, and therefore potentially identify genetic variants underlying characteristics of biological and economic interest.

Although the application and utility of these techniques in aquaculture species have been limited by a lack of genomic information, there is great potential for conducting such studies, especially in species for which genome sequencing projects and high-density molecular marker platforms are available.

## **ACKNOWLEDGMENTS**

This work was supported by the Programa Formación Capital Humano Avanzado from CONICYT, Government of Chile, and partially funded by grants from CORFO (11IEI-12843 and 12PIE-17669), Government of Chile.

## **REFERENCES**


pancreatic necrosis in Atlantic salmon (*Salmo salar*). *Genetics* 178, 1109–1115. doi: 10.1534/genetics.107.082974


*Salmo salar* L. in Chilean aquaculture facilities. *PLoS ONE* 9:e99358. doi: 10.1371/journal.pone.0099358


a unique immune system. *Nature* 477, 207–210. doi: 10.1038/nature10342


Zhang, H., Wang, S.-Z., Wang, Z.-P., Da, Y., Wang, N., Hu, X.-X., et al. (2012). A genome-wide scan of selective sweeps in two broiler chicken lines divergently selected for abdominal fat content. *BMC Genomics* 13:704. doi: 10.1186/1471-2164-13-704

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 08 October 2014; accepted: 15 December 2014; published online: 14 January 2015.*

*Citation: López ME, Neira R and Yáñez JM (2015) Applications in the search for genomic selection signatures in fish. Front. Genet. 5:458. doi: 10.3389/fgene.2014.00458*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2015 López, Neira and Yáñez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

**REVIEW ARTICLE** published: 29 September 2014 doi: 10.3389/fgene.2014.00340

## Genetic architecture of sex determination in fish: applications to sex ratio control in aquaculture

## *Paulino Martínez<sup>1</sup>\*, Ana M. Viñas<sup>2</sup>, Laura Sánchez<sup>1</sup>, Noelia Díaz<sup>3</sup>, Laia Ribas<sup>3</sup> and Francesc Piferrer<sup>3</sup>*

<sup>1</sup> Departamento de Genética, Facultad de Veterinaria, Universidad de Santiago de Compostela, Lugo, Spain

<sup>2</sup> Departamento de Genética, Facultad de Biología, Universidad de Santiago de Compostela, Santiago de Compostela, Spain

<sup>3</sup> Institut de Ciències del Mar, Consejo Superior de Investigaciones Científicas, Barcelona, Spain

#### *Edited by:*

Ross Houston, University of Edinburgh, UK

#### *Reviewed by:*

Lior David, The Hebrew University of Jerusalem, Israel; Shaojun Liu, Hunan Normal University, China; Christos Palaiokostas, University of Stirling, UK

#### *\*Correspondence:*

Paulino Martínez, Departamento de Genética, Facultad de Veterinaria, Universidad de Santiago de Compostela, Campus de Lugo, 27002 Lugo, Spain; e-mail: paulino.martinez@usc.es

Controlling the sex ratio is essential in finfish farming. A balanced sex ratio is usually good for broodstock management, since it enables the development of appropriate breeding schemes. However, in some species the production of monosex populations is desirable because the existence of sexual dimorphism, primarily in growth or first time of sexual maturation, but also in color or shape, can render one sex more valuable. Knowledge of the genetic architecture of sex determination (SD) is useful for controlling sex ratio and for the implementation of breeding programs. Unlike mammals and birds, which show highly conserved master genes that control a conserved genetic network responsible for gonad differentiation (GD), a huge diversity of SD mechanisms has been reported in fish. Contrary to theoretical predictions, more than one gene is in many cases involved in fish SD, and genetic differences have been observed in the GD network. Environmental factors also play a relevant role, and epigenetic mechanisms are increasingly recognized as important for the establishment and maintenance of the GD pathways. Although major genetic factors are frequently involved in fish SD, these observations strongly suggest that SD in this group resembles a complex trait. Accordingly, the application of quantitative genetics combined with genomic tools is desirable to address its study and, in fact, when applied, it has frequently revealed a multigene trait interacting with environmental factors in model and cultured fish species. This scenario has notable implications for aquaculture and, depending upon the species, approaches ranging from chromosome manipulation or environmental control techniques to classical selection or marker-assisted selection programs are being applied. In this review, we selected four relevant species or fish groups to illustrate this diversity and hence the technologies that can be used by the industry for the control of sex ratio: turbot and European sea bass, two reference species of European aquaculture, and salmonids and tilapia, representing the fish for which there are well-established breeding programs.

**Keywords: sex determination, fish, genetic architecture, sex ratio, aquaculture**

## **INTRODUCTION**

Fish represent the most diverse group of vertebrates, including more than 28,000 species (Nelson, 2006). This diversity is a reflection of their high capacity for adaptation to a broad spectrum of environmental conditions. As a result, fish show amazing morphological, physiological, and behavioral adaptations to live in the highly diverse aquatic environment. Fish also show all types of reproductive strategies, including gonochorism; protandrous, protogynous, and simultaneous hermaphroditism; and unisexuality (Devlin and Nagahama, 2002). These reproductive strategies emerged independently in different lineages during evolution, demonstrating a polyphyletic origin (Avise and Mank, 2009).

Domestication of fish for production is an ancient practice and again shows the high adaptation capacity of this group, especially considering that more than 354 fish species are cultivated all over the world (Food and Agriculture Organization [FAO], 2014). Production of domestic fish largely relies on reproduction, and a vast amount of information has been gathered for its control. Reproduction techniques include production of monosex populations because of the existence of sexual growth dimorphism, either in favor of males or females depending on the species, and also because sometimes the most valuable trait is associated with one sex (e.g., color, shape, secondary sexual ornaments).

Here, we review the available data on the genetic architecture of sex determination (SD) in fish and how sex ratio is controlled in aquaculture production. We contrast this information with models emerging from the classical studies in *Drosophila*, mammals, or birds, with highly conserved mechanisms associated with marked sex chromosome heteromorphisms. We show the huge intra- and interspecific diversity of SD systems in fish, associated with a high evolutionary turnover. Thus, its genetic architecture, although commonly supported by major genes, is also influenced by minor genes and environmental factors, which makes it resemble a complex trait. We illustrate this diversity and its influence on the strategies aimed at the production of the most desired sex in the last section of the paper by analyzing the genetic basis of SD in two important species of European aquaculture, turbot (*Scophthalmus maximus*) and European sea bass (*Dicentrarchus labrax*). We also consider two of the main fish groups with established breeding programs, salmonids and tilapias, all of them with remarkable differences in sex determination.

## **GENETIC ARCHITECTURE OF SEX DETERMINATION IN FISH**

The genetic architecture of a complex trait refers to the genes involved in that trait, their influence, and their interactions in establishing the phenotype. It also takes into account the influence of environmental factors and their interactions with the genotype on the final phenotype (McKay, 2001). Although obtaining all this information is a very ambitious enterprise, the closer we come to this goal, the better we shall understand key questions related to genetic variation and its exploitation in breeding programs.

#### **MAIN FEATURES OF GONAD DEVELOPMENT**

Gonad development includes all developmental processes that transform an undifferentiated primordium into a mature gonad, either ovary or testis. It is the result of two concatenated developmental processes controlled by a hierarchical genetic network: SD and gonad differentiation (GD; **Figure 1**). The sex of gonads is essentially determined by processes, either genetic or environmental, operating at the beginning of development, where a binary decision is taken related to the fate of the undifferentiated primordium (Kobayashi and Nagahama, 2009; Siegfried, 2010; Uller and Helanterä, 2011). Once the fate of the gonad has been established, morphogenetic GD processes work until maturation is completed.

As a consequence of the hierarchical nature of gonad development, a single gene or environmental cue operating at the beginning of development can drive the gonad pathway towards one direction or another. Thus, in contrast to other characters influenced by many genes operating in different routes with important additive effects, here gene interactions likely represent an important genetic component. In particular, epistatic effects may be relevant because a single gene acting upstream or even downstream of a preexisting SD gene (SDg) may take control of gonad fate, thus masking extant genetic variation at other involved loci. Epistatic interactions have been reported between major SD loci in different fish groups (Cnaani et al., 2008; Ser et al., 2010; Parnell and Streelman, 2013), and epistatic allelic variants have also been reported segregating in populations of species with a well-known SD genetic system like medaka (*Oryzias latipes*; Shinomiya et al., 2010). In fact, notable interactions occur between gene products at the initial stages of gonad development, such as the suppression of *wnt4* and β*-catenin,* key genes for ovarian development, by *sox9* and *fgf9* (Nef and Vassalli, 2009); the modulation of gonadal aromatase (*cyp19a1a*), responsible for the balance between androgens and estrogens, by the action of other genes or environmental factors such as temperature (Navarro-Martín et al., 2011); or the interaction between the anti-müllerian hormone (amh1) and its receptor (amhr2), which triggers an essential signaling pathway for testis development (Kamiya et al., 2012).

Gonad development of fish is unusual in the sense that the sexually undifferentiated period can last from weeks to years (Saito and Tanaka, 2009; Berbejillo et al., 2012), opening a large developmental window in which the sexual fate can be influenced by abiotic or biotic environmental factors (Penman and Piferrer, 2008; Baroiller et al., 2009). In such a long period, it is tempting to speculate that the brain may be involved through the hypothalamic–pituitary–gonadal axis (Baroiller et al., 2009). However, despite the fact that the brain certainly integrates environmental stimuli and, in particular, social interactions, which have been shown to be implicated in the process of sex-change in hermaphrodites, currently there is no convincing proof that the brain plays any role in the SD process in gonochoristic fishes.

**Figure 1.** [...] sex-associated loci, an environmental factor (e.g., temperature) in an ecologically relevant context (i.e., occurring normally in the habitat of the species) or a combination of them, typically when the gonads are still sexually undifferentiated or even before they are formed at the most rudimentary stage (pre-gonadal stage). Successive events are connected by horizontal arrows and include differences in the proliferative rate of germ cells (females > males), which can be one of the first effects of the sex determining factor, whether genetic or environmental. During this period, the [...] subsequent sex differentiation, usually in the female → male direction (diagonal dashed arrow). Finally, also at the beginning of gonad differentiation – the transformation of an undifferentiated gonad into a testis or ovary – the accidental (i.e., contamination) or deliberate (e.g., sex control treatment) incorporation of sex steroids, androgens or estrogens can result in female → male (vertical blue dotted arrow) or male → female (vertical red dotted arrow) sex-reversal, i.e., in that genotypic females and males develop into phenotypic males and females, respectively.

Homeothermic vertebrates show a conserved morphogenetic development supported by a strongly canalized genetic sex determining system (Charlesworth et al., 2005). However, fish show inter-specific differences in the morphogenetic events occurring along gonad development. Variation exists in the general pattern of differentiation, in the interaction between somatic and germ cells, and in the time of occurrence and relative weight of the different steps. The number of primordial germ cells has been reported to be the first developmental difference between males and females in species such as medaka (*O. latipes*; Kondo et al., 2009) and stickleback (*Gasterosteus aculeatus*; Lewis et al., 2008), and this has been related to the possible influence of growth-related factors on SD (Schlueter et al., 2007; Baroiller et al., 2009). Also, gonad development has been classified as undifferentiated or differentiated type, respectively, depending on the existence of an initial transitory female stage that can subsequently revert to testis, as in zebrafish (*Danio rerio*; Orban et al., 2009), or the lack of that stage, as in medaka or tilapia (*Oreochromis niloticus*; Saito and Tanaka, 2009). Interaction between somatic and germ cells is also recognized as an important feature for gonad development. In fact, several key genes related to the initial steps of differentiation, like *dmrt1*, *amh*, or *sox9* in males (Lee et al., 2009; Wu et al., 2010) and *cyp19a1a* or *foxl2* in females (Nakamura et al., 2009), are expressed in Sertoli or granulosa/theca cells, respectively, and thus communication between somatic and germ cells is essential for GD. There are species-specific differences in this communication: for example, the ablation of the female germ cells determines the reversal of development towards a testis in zebrafish (*D. rerio*) and tilapia (Siegfried and Nusslein-Volhard, 2008; Fujimoto et al., 2010), while in goldfish (*Carassius auratus*) the female pathway is maintained (Goto et al., 2012).

#### **SEX DETERMINATION: ORIGIN AND EVOLUTION**

A consensus existed until recently on the high conservation of the gene network controlling gonad development among vertebrates, differences being mainly related to changes in the switching mechanism. This hierarchically controlled developmental process would facilitate the control of sex ratio by a single-gene mechanism, but at the same time it would open the opportunity for changing the SD factor in response to new evolutionary scenarios (Marín and Baker, 1998).

Theories on the evolution and genetic architecture of SD in animals have been largely influenced by studies on *Drosophila*, mammals, and birds, all of them showing convergent patterns, with a heteromorphic sex chromosome pair and, as a consequence, a particular sex-linked inheritance model (Bachtrog et al., 2014). A generalized theory on the origin and evolution of SD systems emerged from these data, which assumed a sexual conflict between antagonistic alleles at specific loci favorable to one sex but detrimental to the other (**Figure 2**; Charlesworth et al., 2005). To maintain the beneficial association between the antagonistic allele and the SD locus, recombination would be restricted, giving rise to the permanent heterozygous state of that portion of the sexual pair (Bergero and Charlesworth, 2009). That circumstance would promote the accumulation of repetitive elements and deleterious variants in the SDg-bearing chromosome, contributing to its progressive degeneration and the typical heteromorphic shape of the sexual pair (Charlesworth et al., 2005). Mathematical models based on this theory suggested that only one gene should underlie the SD system, and if more than one gene were segregating, this should represent an unstable equilibrium towards a new SD mechanism (Rice, 1986).

Initial data in ectothermic vertebrates, particularly fish, demonstrated a sharply different picture (Devlin and Nagahama, 2002). More than one sex-associated gene has been reported in many fish species (Cnaani et al., 2008; Ser et al., 2010). In addition, the conservation of the GD cascade has been to some degree questioned, and notable genetic differences have been observed not only at the top, but also downstream of the GD network (Böhne et al., 2013; Herpin et al., 2013). For example, the well-known female-associated aromatase genes (*cyp19a1a* and *cyp19a1b*) have shown a new role in testis differentiation in African cichlids (Böhne et al., 2013). Although antagonistic alleles have been demonstrated to be associated with the SD region in fish (Roberts et al., 2009; Parnell and Streelman, 2013), only 7% of species showed heteromorphic sex chromosomes (Penman and Piferrer, 2008; Oliveira et al., 2009). This has been related to the high evolutionary turnover of SD mechanisms, which limits chromosome differentiation and, as a consequence, different SD systems have been reported between closely related species and even between populations of the same species. However, degeneration and differentiation of sex chromosomes can evolve very quickly (Charlesworth et al., 2005), and a broad degree of heteromorphism has been observed among species of the same genus in Neotropical fish of the genera *Eigenmannia* (Henning et al., 2011), *Characidium* (Vicari et al., 2008), and *Leporinus* (Parise-Maltempi et al., 2013). Thus, sex heteromorphism can involve the whole chromosome and be detectable with the usual cytogenetic techniques, as reported in Neotropical fish (Henning et al., 2011; Parise-Maltempi et al., 2013); be cryptic at the cytogenetic level but involve an important degeneration of the SDg-bearing chromosome, as in stickleback (Ross and Peichel, 2008; Shikano et al., 2011); embrace no more than a few kilobases, as in medaka (*O. latipes*; Matsuda et al., 2002); or show a very tiny differentiation, such as the single SNP observed in the *amhr2* receptor, the only detectable difference between X and Y chromosomes in fugu (*Takifugu rubripes*; Kamiya et al., 2012).

**Figure 2.** [...] to the origin of genes (A,B) with antagonistic alleles favorable to females (Af, Bf) or to males (Am, Bm) associated with a new SDg. Subsequent steps involve accumulation of repetitive elements (rep) and the degeneration of the Y chromosome because of its permanent heterozygotic state at the differential region (d: recessive non functional variant of a sex-linked gene).

To understand the origin of SD regions, both external pressures (i.e., sexual selection) and the internal context (available genes and genomic structure) should be considered. Several genes with different functions have been recruited during evolution as SDg in fish, which shows the opportunistic nature of selection when facing new evolutionary pressures. In this regard, the specific genome duplication that occurred within teleosts may have provided suitable raw material for new sex determinants (Mank and Avise, 2009). A remarkable case which exemplifies the many options available to change SD mechanisms is the association of sex with B chromosomes, usually considered as junk DNA, in species of the genus *Astyanax* and in two species of pufferfish (Vicente et al., 1996; Noleto et al., 2012). The high turnover of the SD region in fish has led to the suggestion that changes in the SD mechanism may be associated with speciation (Ser et al., 2010; Böhne et al., 2013; Parnell and Streelman, 2013).

## **THE GENETIC BASIS OF SEX DETERMINATION IN FISH**

High genetic variation has also been described between fish species regarding the gene responsible for SD, the number of genes involved in this decision and the relationships between them. Currently, five different master genes have been documented in fish: *dmY*, *gsdf*, *amhy*, *amhr2*, and *sdY*. *dmY* (DM-domain gene on the Y chromosome), the SDg of medaka and the first one described in fish (Matsuda et al., 2002), is a transcription factor expressed in the somatic cells surrounding germ cells before sex differentiation and in the testis thereafter, and it is involved in germ cell proliferation and development of pre-Sertoli cells into Sertoli cells. It originated from a segmental duplication of a small autosomal region containing the precursor *dmrt1*, followed by an insertion of the duplicated region on the proto-Y chromosome (Matsuda et al., 2002). *gsdf*, *amhy*, and *amhr2* are members of the TGF-β superfamily involved in cell signaling controlling cell proliferation (Heule et al., 2014). *gsdf* (gonadal soma derived growth factor) is a downstream gene of *dmY* in the SD cascade that has taken the role of master SDg in *Oryzias luzonensis* (Myosho et al., 2012). *amhy* (Y chromosome-specific anti-müllerian hormone) is expressed in the presumptive Sertoli cells of XY males of *Odontesthes hatcheri* during the onset and subsequent course of GD. This gene has been inserted upstream of *amh* in the cascade of male development, becoming a male SDg (Hattori et al., 2012). In *T. rubripes*, *amhr2* (anti-müllerian hormone receptor type 2) is expressed in somatic cells surrounding germ cells and it is thought to be the SDg in this species. The X-chromosome copy of *amhr2* contains a specific SNP variant in the kinase domain which determines lower affinity for the amh hormone, thus leading to the female pathway when homozygous (XX; Kamiya et al., 2012). Finally, *sdY* (sexually dimorphic on the Y chromosome) is linked to the SEX locus of salmonids and is necessary and sufficient to induce testicular differentiation. It has evolved through neofunctionalization from *irf9* by losing its role in the *ifn* signaling pathway and acquiring a new role in SD (Yano et al., 2012).

From an evolutionary perspective, different SDgs have been reported either in closely related species like *O. latipes* (*dmY*) and *Oryzias luzonensis* (*gsdf1*; Myosho et al., 2012) or in divergent ones such as rainbow trout (*Oncorhynchus mykiss*; *sdY*; Yano et al., 2012), fugu (*amhr2*; Kamiya et al., 2012), or pejerrey (*O. hatcheri*; *amhy*; Hattori et al., 2012). Also, the same SDg has been identified in closely related species like *O. latipes* and *O. curvinotus* (*dmY*; Matsuda et al., 2002); *T. rubripes, Takifugu pardalis*, and *Takifugu poecilonotus* (*amhr2*; Kamiya et al., 2012); and in most salmonid species studied so far (*sdY*; Yano et al., 2013). Very recently, *dmrt1* has been suggested as a strong candidate in the half-smooth tongue sole (*Cynoglossus semilaevis*) based on its association with sex and its pseudogenization in the W chromosome (Chen et al., 2014). This would constitute the first SDg reported in a ZW system species and within flatfish, a group of species of great commercial value and with a particular metamorphosis to adapt to demersal life. However, no functional demonstration has been reported to date, so further investigation is required.

Additional information comes from sex-associated marker and QTL studies, which usually uncover variation in major genes controlling the fate of the undifferentiated primordium and are thus basically related to SD. The relationships between the different genomic regions identified through this approach have been established through comparative mapping using model fish as a bridge, taking advantage of the conserved macrosynteny pattern observed in teleosts (Kai et al., 2011; Bouza et al., 2012). In most fish groups analyzed, these SD regions proved to be non-homologous (Piferrer et al., 2012): in the *Oryzias* (medaka) genus up to five different genes/genomic regions seem to be involved in sex determination, and only two species among the six analyzed show the same SDg (Tanaka et al., 2007); in the Gasterosteidae (stickleback) family four different genomic regions have been identified in the five species analyzed, and in the fifth, a fusion of two previously reported SD chromosomes gave rise to the sex chromosome (Ross et al., 2009); in the tilapiine (tilapia) cichlid tribe two chromosomes have been identified [linkage group 1 (LG1) and LG3], with two species associated with LG1, another two with LG3, and the remaining two showing both linkage groups, but other studies also demonstrated association with LG23 (Cnaani, 2013); in the Salmoniformes (salmonids) order most species showed different non-homologous sex-associated genomic regions (Phillips et al., 2001); and finally, in the Poeciliidae (guppy and platyfish) family, up to four different chromosomes are involved in SD (Tripathi et al., 2009). Very likely these non-homologous regions include different SDgs, although in salmonid species showing non-homologous SD genomic regions, the SDg appears to be the same (Yano et al., 2013).

The diversity of SDg in fish highlights the many options available at the undifferentiated stage of gonads to switch and drive gonad fate, although some genes have been recurrently used because of their prominent position in the development cascade (Graves and Peichel, 2010; Heule et al., 2014). Among them, *dmrt1* and related genes found in medaka and likely in half-smooth tongue sole have also been reported in different vertebrates, including birds and amphibians (Smith and Sinclair, 2004; Yoshimoto et al., 2008), which illustrates processes of convergent evolution related to the suitability of some genes acting at the beginning of development. Furthermore, great plasticity has been shown, since this gene can be found in XY or ZW systems, acting either on a presence/absence model (medaka and frog) or in a dose-dependent manner (birds and likely tongue sole; Koopman, 2009; Chen et al., 2014). Three other genes, *gsdf1*, *amh1*, and *amhr2*, have also been reported to be activated at the beginning of development in the male pathway, and thus their SD role fits the top of the GD cascade invoked by theory (Heule et al., 2014). Contrary to previous findings in vertebrates (*sry* and *dmrt-*derived genes), *gsdf1*, *amh1*, and *amhr2* are not transcription factors, but are involved in cell signaling controlling cell proliferation (Heule et al., 2014). Finally, the *sdY* gene, the major SD factor in rainbow trout, represents an unexpected SDg of unknown function, whose carboxy-terminal end is homologous to an interferon-related gene, thus exemplifying the vast source of genes available for leading the SD process (Yano et al., 2012).

In addition to species with a single major SDg, many fish have shown more than one gene of large effect involved in sex determination. Indeed, in the most investigated fish groups at least two major genes (or genomic regions) related to SD have been reported in the same species: within tilapiines, two major male and female determinant genes on LG1 and LG3, respectively (Cnaani et al., 2008); in the platyfish, a poeciliid species, a multifactorial SD system with X, Y, and W sex chromosomes (Schultheis et al., 2009); in the stickleback *Gasterosteus wheatlandi*, a multiple chromosome system due to the fusion of two major SD chromosomes present in other species of Gasterosteidae (Ross et al., 2009); and in cichlids from Lake Malawi, two main SD systems on LG5 and LG7, the first one representing a female epistatically dominant factor (Ser et al., 2010). Finally, a polygenic SD system has also been documented in other species like European sea bass (Vandeputte et al., 2007) and zebrafish (Liew et al., 2012).

The decreasing cost of next generation sequencing (NGS) methodologies will provide much more information in the near future and a more accurate picture of SD in fish. Restriction-site associated DNA (RAD) sequencing, a technique which combines the power of NGS with the simplification of genomes through restriction enzyme digestion (Baird et al., 2008; Davey et al., 2011), is enabling dense genomic screening to study the genetic architecture of multigene traits (Hohenlohe et al., 2010). This methodology is being applied for the study of SD in zebrafish (Anderson et al., 2012) and to identify sex-associated genomic regions in Nile tilapia (*O. niloticus*; Palaiokostas et al., 2013a) and Atlantic halibut (*Hippoglossus hippoglossus*; Palaiokostas et al., 2013b) for application in aquaculture production. In addition, the high capacity of RAD sequencing for SNP discovery and genetic map construction will help to obtain dense maps at candidate regions, to narrow them, and to facilitate the identification of SDgs (Taboada et al., 2014).

Gene expression studies linked to NGS methodologies will also be essential to understand the relationship between morphogenetic effects and the underlying genetic network. Microarrays have been used as a powerful tool for assessing gene expression profiles along GD (Gardner et al., 2012; Sreenivasan et al., 2014), but more recently, RNA-Seq is being applied due to its higher sensitivity, accuracy, and also because it provides additional information on genetic variants linked to expression differences (Sun et al., 2013; Tao et al., 2013). Finally, the evaluation of the pattern of methylation through bisulfite sequencing and other methodologies is providing quick genomic evaluation of the epigenomic maps along development (Cokus et al., 2008) constituting a valuable tool for understanding the regulation of SD and gonadogenesis (Piferrer, 2013).

## **ENVIRONMENTAL FACTORS ON SEX DETERMINATION**

In environmental sex determination (ESD), the first difference between the two sexes is established by differences in the value of an environmental factor. ESD has been well studied in reptiles like crocodiles and turtles (Valenzuela and Lance, 2004). In *Menidia menidia* (Conover and Kynard, 1981), the first fish species in which temperature-dependent sex determination (TSD), a form of ESD, was described, some populations have shown a genetic component underlying SD (Conover and Heins, 1987). Much of the literature in this field demonstrated the influence of environmental factors in laboratory conditions, extreme in some cases, that do not necessarily reflect the conditions that species experience in nature (Ospina-Álvarez and Piferrer, 2008). Nevertheless, the presence of TSD in fish demonstrates the plasticity of gonad development (Baroiller et al., 2009).

Social interactions represent important environmental cues for SD in hermaphroditic species (Godwin, 2009). In these cases, the brain has been shown to be a major player translating social cues into a physiological signal (Kobayashi et al., 2010). Although this is not a mechanism driving the gonad fate at the beginning of development, it is a good example of the plasticity of gonad development in fish and also evidence of the existence of bipotential primordium cells in the differentiated gonads of adult fish (Zhou and Gui, 2010).

Temperature is the environmental factor with the highest influence on SD in fish (Ospina-Álvarez and Piferrer, 2008; Baroiller et al., 2009). The influence of temperature on sex shifting can be exerted at several points of the differentiation cascade. High temperatures usually tend to produce more males, and low temperatures have no effect or produce more females in some cases (Ospina-Álvarez and Piferrer, 2008). The ultimate mechanism (if there is a single one) connecting temperature and sex ratio is not known, and several have been proposed. The influence of temperature on SD has been related to higher stress, giving rise to changes in circulating cortisol levels. In fact, the administration of cortisol in the diet has demonstrated a significant influence on sex ratio (Mankiewicz et al., 2013). However, the ultimate molecular mechanism connecting cortisol levels and masculinization is far from clear. On one hand, Hayashi et al. (2010) proposed a direct up-regulation of the follicle-stimulating hormone (FSH) receptor, which would be connected to germ cell proliferation. On the other, Fernandino et al. (2013) suggested an up-regulation of hsd11b2, a steroidogenic enzyme implicated both in the metabolism of cortisol into cortisone, and in the synthesis of biologically active androgens such as 11-ketotestosterone. Epigenetic regulation of aromatase expression mediated by temperature has also been proposed. Thus, Navarro-Martín et al. (2011) demonstrated that hypermethylation of the aromatase promoter is correlated with high temperature during the thermosensitive period in the European sea bass, strongly suggesting that sex differentiation is under epigenetic control in this species.

## **GENETIC VARIATION WITHIN SPECIES: SEX DETERMINATION AS A COMPLEX TRAIT**

Most studies on SD in fish have focused on identifying the master SDg expected according to previous SD models. However, new data are consistently showing that other minor genetic and environmental factors are also involved in SD, even in species with well studied master genes like *dmY* of medaka (Matsuda et al., 2002), and thus, more effort should be devoted to investigating this variation to get the most accurate picture possible of SD in fish.

The information presented so far shows that: (i) although environmental factors may influence sex ratio, the genetic component usually represents the main SD factor in most studied species; (ii) major genes, those which explain a high proportion of the trait phenotypic variance, are at the basis of SD in many species, likely an expected fact because of the hierarchical nature of the sex development pathway; and (iii) other genes involved in SD interact with the major factors and the environment. Although major genes are involved in SD of many fish species, available data suggest that SD studies and their applications for fish aquaculture should emphasize the complex nature of this trait and thus use appropriate quantitative genetics tools for its study.

In fact, consistent variation among families has been reported on sex ratio in European sea bass (Vandeputte et al., 2007), turbot (Haffray et al., 2009; Martínez et al., 2009), Nile tilapia (Lozano et al., 2013) and zebrafish (Liew et al., 2012). In some species, the additive genetic component underlying SD or sex ratio was even estimated (Vandeputte et al., 2007; Lozano et al., 2013). In zebrafish, a polygenic SD system was suggested based on inter-family and inter-strain variation, and consequently several QTL at different genomic locations were identified, some of them associated with the different strains studied (Bradley et al., 2011; Anderson et al., 2012; Howe et al., 2013). In other species where major loci are involved, the application of QTL screening or genomic association analysis to look for the SD region proved efficient and new SD-related genomic regions were identified (Martínez et al., 2009; Hermida et al., 2013), sometimes denoting important intraspecific variation such as in *Eigenmannia* or in cichlid species complexes (Ser et al., 2010; Henning et al., 2011; Parnell and Streelman, 2013). As a consequence, selection has proved efficient in changing sex ratio in progenies of several species and in other related traits like the sensitivity of sex ratio to temperature (Baroiller et al., 2009; Liew et al., 2012; Lühmann et al., 2012; Lozano et al., 2013).

## **SEX-ASSOCIATED TRAITS IN FINFISH AQUACULTURE: SOME RELEVANT EXAMPLES**

In many fish species one sex grows faster or matures earlier than the other and these differences may be accentuated under aquaculture conditions (Breder and Rosen, 1966; Parker, 1992; Piferrer et al., 2012). Sex-associated growth differences generate size dispersion, and therefore, size grading must be performed for feeding and to avoid cannibalism or size hierarchies affecting social relations (Dou et al., 2004). This represents more work in animal husbandry, and a higher number of production units to accommodate the different growth groups (Piferrer et al., 2012). Sexual growth dimorphism can favor males (e.g., tilapias) or, more frequently, females (flatfish, sea bass, among others). In some cases, as in the turbot, females can be 50% larger than males (Imsland et al., 1997). In other cases, as in the European sea bass, the rearing conditions result in highly male-biased stocks (Piferrer et al., 2005). A great deal of research towards the development of sex control methods has been carried out in fish (Piferrer, 2001; Cnaani and Levavi-Sivan, 2009).

Sex-associated markers are very useful in this context for precocious sex identification, especially in those species lacking morphological sexual dimorphism. This can aid to identify the sex of potential broods in genetic breeding programs and to avoid sex bias in the selected population. However, the most relevant application of sex-associated markers is to identify the genetic sex of sex-reversed individuals after hormonal treatment to accelerate the processes for establishing monosex populations (Penman and Piferrer, 2008). The availability of sex-associated markers or even better the SD master gene makes it possible to shorten this process using marker assisted selection (MAS) or gene assisted selection (GAS), respectively.

## **TURBOT**

The strong sexual growth dimorphism of turbot has promoted the interest of industry in all-female populations. No sex-associated karyotype heteromorphism has been detected in turbot, either after analyzing the mitotic or the 11-fold longer and higher resolution meiotic chromosomes (Bouza et al., 1994; Cuñado et al., 2002). This suggests that the SD region in turbot is small or not large enough to be detected with these cytogenetic techniques. A QTL screening performed with 100 homogeneously distributed microsatellites identified a major SD region in the proximal region of LG5 between two markers separated by 17.4 cM (Martínez et al., 2009). Assuming a single SD region with full penetrance, the SD master gene (SDg) was located at 2.6 cM from the Sma-USC30 microsatellite locus, representing 1.4 Mb according to the general relationship between genetic and physical maps in turbot (Bouza et al., 2007). Also, the analysis of segregation of Sma-USC30 in all families demonstrated that the mother is responsible for sex, supporting a ZZ/ZW system (Martínez et al., 2009) in accordance with the sex ratios observed in progenies from hormonal sex-reversed parents (Haffray et al., 2009). However, sex association of Sma-USC30 showed variation among families (between 84 and 100%) and, in addition, other minor QTL were detected at LG6, LG8, and LG21. Temperature also showed some influence on sex ratios (Haffray et al., 2009), although without the general trend reported in most species where the proportion of males increased with temperature (Ospina-Álvarez and Piferrer, 2008).
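The relationship between genetic and physical distance quoted above can be made explicit with a short calculation. The following is a minimal sketch, assuming that the ratio implied by the two figures in the text (2.6 cM and 1.4 Mb) holds uniformly across the region, which is a simplification because recombination rates vary along chromosomes. The function names are illustrative and do not come from the cited studies.

```python
# Figures quoted in the text (Martínez et al., 2009; Bouza et al., 2007).
MARKER_TO_SDG_CM = 2.6   # genetic distance between the marker and the SDg (cM)
MARKER_TO_SDG_MB = 1.4   # the same distance expressed in megabases (Mb)
QTL_INTERVAL_CM = 17.4   # distance between the two markers flanking the SD region (cM)


def mb_per_cm(physical_mb: float, genetic_cm: float) -> float:
    """Average physical length (Mb) covered by one centimorgan."""
    return physical_mb / genetic_cm


def cm_to_mb(distance_cm: float, ratio: float) -> float:
    """Convert a genetic distance to Mb, assuming a uniform Mb/cM ratio."""
    return distance_cm * ratio


ratio = mb_per_cm(MARKER_TO_SDG_MB, MARKER_TO_SDG_CM)
interval_mb = cm_to_mb(QTL_INTERVAL_CM, ratio)

print(f"Implied ratio: {ratio:.3f} Mb/cM")
print(f"17.4 cM interval under the same ratio: {interval_mb:.2f} Mb")
```

Under this uniform-ratio assumption the 17.4 cM interval would span roughly 9.4 Mb, which gives an order of magnitude for the region that fine mapping has to narrow down.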

Using the Sma-USC30 marker, it was possible to classify correctly 98.4% of the individuals in four out of five families analyzed. This information was essential to develop a molecular tool for precocious sex identification in turbot, currently under a Spanish patent (Ref. number: 2 354 343). Since sex cannot be identified in turbot until fish maturation, precocious sex identification is relevant in breeding programs to estimate sex ratio in selected progenies. This molecular tool is also essential to facilitate the achievement of all-female populations. Because turbot displays a ZW mechanism, getting all-female populations requires a three-generation pedigree starting from hormonal sex-reversed ZW neomales until obtaining WW superfemales in the progeny of the second cross (**Figure 3**). These superfemales would produce all-female offspring after being crossed with normal ZZ males. However, determining the chromosome constitution of ZW neomales or WW superfemales requires individual progeny testing of hormone-treated larvae (ZZ or ZW) and of female offspring in cross II (ZW or WW).

**populations of turbot.** I, II, and III represent the three generations required for the full process. Neomales (ZW) are obtained in generation I by applying methyltestosterone in the diet in undifferentiated fry. Identification of neomales (I) and superfemales (II) is usually done by individual progeny testing, so parents producing 50:50 f/m in cross I and II are culled because they are not neomales and superfemales, respectively. Crossing normal males (ZZ) with superfemales (WW) would produce 100% females (ZW) assuming a single fully penetrant SDg.
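The expected sex ratios that make progeny testing informative can be derived from simple Mendelian segregation. The sketch below is illustrative only: it assumes a single, fully penetrant SDg in a ZZ/ZW system (as the figure legend does), equal viability of ZZ, ZW, and WW genotypes, and that any genotype carrying a W develops as female. The helper names `cross` and `female_fraction` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(sire: str, dam: str) -> dict:
    """Expected offspring genotype frequencies for a single-locus cross.

    Each parent is a two-letter genotype such as "ZZ", "ZW" or "WW";
    one chromosome is drawn at random from each parent.
    """
    offspring = Counter(
        "".join(sorted(a + b, reverse=True))  # write heterozygotes as "ZW"
        for a, b in product(sire, dam)
    )
    total = sum(offspring.values())
    return {genotype: Fraction(n, total) for genotype, n in offspring.items()}


def female_fraction(frequencies: dict) -> Fraction:
    """Fraction of offspring carrying at least one W (assumed female)."""
    return sum((f for g, f in frequencies.items() if "W" in g), Fraction(0))


# Cross I: hormone-treated phenotypic male (ZZ or ZW) x normal female (ZW).
print("ZW neomale x ZW female:", female_fraction(cross("ZW", "ZW")))      # 3/4
print("ZZ male    x ZW female:", female_fraction(cross("ZZ", "ZW")))      # 1/2

# Cross II: normal male (ZZ) x female offspring of cross I (ZW or WW).
print("ZZ male x ZW female:     ", female_fraction(cross("ZZ", "ZW")))    # 1/2
print("ZZ male x WW superfemale:", female_fraction(cross("ZZ", "WW")))    # 1
```

Under these assumptions a 50:50 progeny identifies a treated fish as ZZ in cross I and a dam as ZW in cross II, whereas 75% females reveals a ZW neomale and 100% females a WW superfemale.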

Progeny testing is long and involves at least 2–3 years until fish are mature for checking sex ratio in their progenies. Additionally, sex can only be visually identified 4–6 months after hatching in sacrificed offspring, which globally represents a minimum of 2.5 years for progeny testing per generation. The availability of a genetic marker closely linked to the SDg has enabled assessment of the genotypic sex of fish just after 4–6 months when a fin clip can be obtained without damage, thus saving a minimum of 5 years for the production of all-female progenies. However, some limitations still remain, so this technology should be further refined. First, production of 100% females is not often fulfilled because turbot sex also depends on other minor genetic and environmental factors. In addition, because Sma-USC30 is a marker linked to the SDg, we need to establish association of SmaUSC-E30 to sex at family level because this is not a sex-specific marker. Finally, because crossovers can take place between the marker and the SDg, some of the selected ZW neomales or WW females would not show the expected genetic constitution. Nevertheless, this tool is being used by turbot companies with encouraging results.
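The effect of crossovers between the marker and the SDg can be quantified roughly from the map distance. The following sketch converts the 2.6 cM distance reported above into an expected recombination fraction using Haldane's mapping function. The choice of Haldane's function (which assumes no crossover interference) is an assumption made here for illustration; the source does not state which mapping function was used.

```python
import math


def haldane_recombination_fraction(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    distance_morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


MARKER_TO_SDG_CM = 2.6  # distance between the marker and the SDg quoted in the text

r = haldane_recombination_fraction(MARKER_TO_SDG_CM)
print(f"Expected recombination fraction: {r:.4f}")
print(f"Gametes with marker and SDg still in coupling: {1.0 - r:.4f}")
```

At such a short distance the recombination fraction is close to the map distance itself (about 2.5%), so only a small proportion of marker-based genotype calls would be expected to be wrong through recombination alone, before considering minor genetic and environmental factors.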

The availability of a denser genetic map with around 600 markers (Bouza et al., 2012; Hermida et al., 2013), the enriched database with reproduction and immune genes (Pereiro et al., 2012; Ribas et al., 2013), and the recently assembled turbot genome (Figueras et al., in preparation) is enabling a more refined analysis of the SD region to identify the SDg, to analyze its relationship with the previous suggestive QTL, and to study the evolution of sex chromosomes (Taboada et al., 2014). The fine mapping performed has narrowed the genomic position of the SDg to a few kilobases and many more genetic markers closely associated with the SDg are now available, thus facilitating the precocious evaluation of sex. However, although several strong candidates were identified in that region (*sox2*, *dnajc19*, and *fxr1*), none of them were associated with sex at the species level, which illustrates the difficulty of such an enterprise (Hermida et al., 2013). In addition, this work has provided additional support for the existence of minor factors at LG6, LG8, and LG21 in new families and demonstrated an interactive rather than an additive component between the minor and the major SD QTL.

## **EUROPEAN SEA BASS**

The European sea bass is a gonochoristic marine teleost of the family Moronidae that is present in the NE Atlantic, from Norway to NW Africa and in the Mediterranean up to the Black Sea. As in most teleosts, sea bass does not show sex chromosomes (Devlin and Nagahama, 2002). Female homogamety (XX) has been ruled out since sex ratios of normal diploid and gynogenetic offspring are equivalent (Felip et al., 2002), and offspring from masculinized females is not female-biased (Blázquez et al., 1999). Other data from hormone-treated fish suggested that male homogamety with environmentally male-biased sex ratio would still be a possibility (Vandeputte et al., 2007). However, a polygenic sex determining system (Vandeputte et al., 2007) influenced by temperature (Piferrer et al., 2005) seems to fit the data better. Sex ratio shows interfamily differences with specific and measurable paternal and maternal components and at least two genes would be necessary to explain the variation pattern observed (Vandeputte et al., 2007), although no sex-associated QTL screening was carried out. Temperature can greatly influence European sea bass sex ratios (Piferrer et al., 2005) and a genetic component determining differential sensitivity to the masculinizing power of temperature has been reported (Saillant et al., 2002). The variation found in female percentage due to thermal treatments shows that genetic and environmental components are of comparable magnitude, supporting the notion of a continuum between GSD and ESD components of SD in the European sea bass.

In the European sea bass, gonads remain undifferentiated during post-larval stages until fish attain about 8 cm standard length, usually at about 5–6 months of age (Bruslé and Roblin, 1984; Blázquez et al., 1999; Navarro-Martín et al., 2009a; Díaz et al., 2013). European sea bass females differentiate earlier, are bigger, and mature later than males (Blázquez et al., 1999; Navarro-Martín et al., 2009a; Díaz et al., 2013). While still sexually undifferentiated, European sea bass gonads can be influenced by environmental abiotic factors or external factors such as sex steroids (Navarro-Martín et al., 2009a), but once sex is determined it remains fixed throughout life (Zanuy et al., 2001).

One of the main genes involved in fish GD as outlined above is the aromatase (cytochrome P450). This gene is present in two paralogous copies as a result of the teleost genome duplication, one predominantly expressed in the ovary (*cyp19a1a*; Dalla Valle et al., 2002) and the other in the brain (*cyp19a1b*; Blázquez and Piferrer, 2004). Interestingly, the *cyp19a1a* promoter exhibits important conserved binding sites for several genes of the GD network such as *sf1*, *sox*, *foxl2* or *ar*.

Recent work by Navarro-Martín et al. (2011) showed how temperature during early development is linked to the production of male-biased populations through differences in the methylation levels specifically on the gonadal aromatase promoter at one year. Different CpG loci within the *cyp19a1a* promoter showed different sensitivities to temperature, suggesting different roles in the regulation of aromatase expression. Methylation of the gonadal aromatase promoter is thought to be the cause of the lower expression of aromatase in the temperature-masculinized fish, linked to the suppressed ability of the *sf1* and *foxl2* transcription factors to stimulate gonadal aromatase expression (Navarro-Martín et al., 2011).

While 15°C has been proposed as the optimal temperature for larval rearing in European sea bass (Koumoundouros et al., 2001), optimal growth in juveniles is found at 26°C and 13°C is considered as detrimental (Person-Le Ruyet et al., 2004). European sea bass hatcheries usually apply high temperatures (>20°C) to speed up growth rates, but male-biased progenies result in a loss of biomass. In the European sea bass the thermosensitive period extends from half-epiboly to mid-metamorphosis (∼17–18 mm total length; ∼70 dph; Koumoundouros et al., 2002), and treating fish with high temperatures masculinizes a high proportion of the population (Navarro-Martín et al., 2009b), while temperatures never surpassing 17°C until metamorphosis yielded the maximum female proportion (Pavlidis et al., 2000). However, raising fish at low temperatures (15°C) for a long period also masculinizes the population, even more than a high-temperature treatment (Saillant et al., 2002). A thermal protocol to maximize the number of females was developed (Navarro-Martín et al., 2009a) and recently patented (patent no N200802927). This protocol consists of maintaining water temperature at 17°C until the end of the thermosensitive period, and then increasing the temperature at a rate of 0.5°C/day until 21°C to allow high growth rates. It should be stated that there is no temperature regime that will increase the proportion of females in the European sea bass. Instead, proper management of temperature during early development will avoid the masculinization induced by applying high water temperatures before the thermosensitive period is over. What a proper thermal regime does is allow the production of as many females as the polygenic system will permit, bearing in mind that the sex ratio of a given brood, even reared at the optimal thermal regime, will ultimately depend on the genetic constitution of both parents.
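The thermal protocol described above can be written as a simple temperature schedule. This is a minimal sketch rather than the patented procedure itself: the function name and parameters are hypothetical, and the default end of the thermosensitive period (70 days post hatching) is taken from the approximate figure quoted in the text and would need to be adjusted to the developmental stage actually observed in a given batch.

```python
def protocol_temperature(
    dph: int,
    tsp_end_dph: int = 70,      # assumed end of the thermosensitive period (approx., from the text)
    base_temp_c: float = 17.0,  # temperature held during the thermosensitive period
    target_temp_c: float = 21.0,
    ramp_c_per_day: float = 0.5,
) -> float:
    """Water temperature (in degrees C) for a given age in days post hatching (dph)."""
    if dph <= tsp_end_dph:
        return base_temp_c
    ramped = base_temp_c + ramp_c_per_day * (dph - tsp_end_dph)
    return min(target_temp_c, ramped)


# Print the schedule around the transition out of the thermosensitive period.
for day in (60, 70, 71, 74, 78, 90):
    print(f"{day:3d} dph: {protocol_temperature(day):.1f} C")
```

With these values the ramp from 17 to 21°C takes (21 − 17) / 0.5 = 8 days, so the growth-promoting temperature is reached about a week after the thermosensitive period ends.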

With this information at hand, different strategies have been devised by the European sea bass industry to improve growth rate. On one hand, classical breeding programs may incorporate sex ratio as additional phenotypic information to get female-biased progenies. This strategy would be much more efficient if sex-related QTL screening were performed and sex-associated genetic markers explaining an important proportion of trait variance were incorporated following MAS programs. On the other hand, controlling sex ratio through larval rearing temperature protocols could increase female proportions by avoiding masculinization due to elevated water temperature. Currently, efforts are underway aimed at selecting broodstock that will produce the highest number of females and investigating the possibility that, among these fish, those that are more resistant to the masculinizing temperature can also be selected. An additional step would be further investigation of the epigenetic mechanisms responsible for the inheritance of the high-temperature masculinization, as already done in the half-smooth tongue sole (Chen et al., 2014). In the case of the European sea bass, the goal would be the opposite, i.e., to select as future broodstock those fish that, despite being reared at high (masculinizing) temperature, do not become masculinized. In this way, the production of the highest number of females across different generations could perhaps be achieved.

## **SALMONIDS**

The Salmonidae family (11 genera, 70 species; salmon, trout, char, whitefishes, and graylings) includes several of the most economically important species for the aquaculture and fisheries industry (the third largest world fish production). Salmonids are distributed worldwide and some species have played a vital role in the life and culture of Northern Hemisphere societies for thousands of years.

Atlantic salmon (*Salmo salar*) is the leading intensively farmed marine fish. In 1998 global farmed production exceeded the world's total wild salmon captures, and in 2010 around 1.2 million tons were produced worldwide (Food and Agriculture Organization [FAO], 2014). Atlantic salmon breeding companies have achieved more than 100% growth increase in around six generations of selection, and significant improvement in disease resistance and delay in the onset of sexual maturation. The vast majority of farmed Atlantic salmon eggs and smolts are now sourced from such breeding companies (Bostock et al., 2010). Another important salmonid species, the rainbow trout (*O. mykiss*), is the most widely cultivated cold freshwater fish species in the world.

In salmonids, early maturation occurs differentially in males and females and is responsible for some problems related to intensive culture (Felip et al., 2006), including reduced growth, increased disease susceptibility, and altered organoleptic properties. Thus, in this group, females are preferred for production. In all salmonid species investigated so far, SD is strictly genetic with a male heterogametic sex-determination system (Davidson et al., 2009), although the sex chromosomes show no heteromorphism, or only slight morphological differentiation, in some species. In rainbow trout, all-female production is widespread in Europe since females are still immature at harvest. All-female progenies are produced by crossing neomales (hormonally sex-reversed genotypic XX females) with XX females. Triploids are also produced using chromosome set manipulation techniques to avoid sexual maturation when producing larger individuals (more than 3 kg; Piferrer et al., 2009).

Both endocrine and genetic technologies have been implemented for sex control on a production scale in salmonids. Nevertheless, males and females remain difficult to distinguish phenotypically until the fish become sexually mature and, thus, a genetic/molecular test is still necessary for sexing in some species. Sex-specific markers, including a sequence linked to the growth hormone pseudogene, have been developed in the last decade for Pacific species of the *Oncorhynchus* genus: rainbow trout (*O. mykiss*), Chinook salmon (*O. tshawytscha*), coho salmon (*O. kisutch*), chum salmon (*O. keta*), and pink salmon (*O. gorbuscha*; Brunelli and Thorgaard, 2004).

Recently, *sdY* (sexually dimorphic on the Y chromosome) was identified as a male-specific gene linked to the Y chromosome in most salmonids (Salmoninae, Coregoninae and Thymallinae subfamilies), strongly suggesting that *sdY* may be the conserved master sex-determining gene of this group (Yano et al., 2012, 2013). However, because this gene is not located at homologous genomic positions among the different salmonid species, it has been suggested that it moves between positions in association with mobile elements (Yano et al., 2013). Irrespective of its location, sequences of this gene may represent a useful tool for sexing. Some exceptions to this general rule were nevertheless observed in the Coregoninae subfamily: while *sdY* was male-specific in *Stenodus leucichthys*, both males and females of *Coregonus lavaretus* and *Coregonus clupeaformis* carry the *sdY* gene, so a different SD mechanism appears to be involved.

The analysis of sex-associated markers in several families of the SALTAS Tasmanian Atlantic salmon program from different male lineages allowed the discovery of three sex-associated markers (Ssa03, Ssa06, and Ssa2) mapping to three different linkage groups. Ssa2 is the same sex-associated marker previously reported in European Atlantic salmon populations (Eisbrenner et al., 2014), but Ssa03 and Ssa06 represent new genomic positions. These three loci showed positive amplification for the *sdY* gene, and most individuals analyzed showed good concordance between phenotypic sex and *sdY* PCR amplification, suggesting the movement of *sdY* to new positions within this species. However, some inconsistencies were detected among the sex-marker-associated genotypes, the presence of the *sdY* gene, and the phenotypic sex. Based on these findings, and on the fact that the sdY protein lacks a DNA-binding domain, these authors suggested the existence of another sex-determining gene in Atlantic salmon upstream of *sdY*. They also emphasized the importance of using many families to identify sex-associated markers in salmonid species (Eisbrenner et al., 2014). Furthermore, the recent characterization of the male-specific region on the Y chromosome of rainbow trout, which contains the *sdY* gene and the male-specific marker OmY1, revealed several male-specific SNPs associated with 12 single-copy protein-coding sequences whose role in SD should be further analyzed (Phillips, 2013). These data show that variation is observed even in species with an apparently well-established SD mechanism. Thus, caution should be taken when using sex-associated markers to establish associations and when applying them to early sexing.

## **TILAPIA**

The tribe Tilapiini includes more than 80 species of cichlid fish (family Cichlidae, order Perciformes). Tilapias are endemic to Africa and the Middle East, but they have been introduced into most tropical and subtropical countries for aquatic weed control and aquaculture. Tilapia culture was considered a resource to improve protein supply in developing countries, but nowadays there is also an important market for tilapias in Japan, the United States, the European Union, and other developed countries. Tilapia is one of the fastest-growing fish farming sectors, with China as the leading producer; it constitutes the second most important group of farmed fish after carps and is the most widely grown of any farmed fish (85 producer countries). The main aquaculture species is the Nile tilapia (*O. niloticus*), with a production exceeding 3.2 million metric tons in 2012 (Food and Agriculture Organization [FAO], 2014).

Tilapias exhibit an important sexual growth dimorphism in favor of males. Additionally, tilapias show early maturation (e.g., at 4–5 months of age in Nile tilapia), which results in successive spawning during the growing period and leads to stunted growth (Beardmore et al., 2001). All these circumstances make it difficult to obtain a uniform product, so all-male fry are preferred.

Monosex culture can be obtained by different approaches: manual separation of males and females; hybridization between species to produce all-male offspring; or artificial sex reversal using hormones. The most frequent method for producing all-male populations in tilapia has been treatment with 17α-methyltestosterone included in the diet of sexually undifferentiated fry. If properly applied, farms can produce male populations with 98 to 100% effectiveness. However, marketing hormonally treated fish can raise health concerns, and the direct use of hormones is usually forbidden by food safety regulations in the European Community, although it may be allowed in other countries. One way to circumvent this problem is to combine a method for sex reversal with a breeding scheme aimed at obtaining broodstock that produces monosex fry, following the reverse of the procedure outlined before for turbot (**Figure 3**). In the case of Nile tilapia, it is necessary to produce YY supermales by crossing neofemales (XY) with regular males (XY). As in turbot, the use of sex-linked DNA markers could shorten the process by distinguishing XX, XY, and YY individuals, thus avoiding the identification of individual genotypes by progeny testing.
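The two crosses behind the YY supermale scheme can be illustrated with a small Mendelian sketch. It assumes a strictly monofactorial XX/XY system, equal transmission of both gametes, and full viability of YY fish; these are simplifying assumptions made for illustration only, and the function name `cross` is hypothetical.

```python
from collections import Counter
from itertools import product


def cross(parent_a: str, parent_b: str) -> dict[str, float]:
    """Expected offspring genotype frequencies at a single sex locus.

    Each parent is a two-letter genotype such as "XY". Every gamete
    combination is assumed equally likely (simple Mendelian segregation).
    """
    counts = Counter(
        "".join(sorted(a + b)) for a, b in product(parent_a, parent_b)
    )
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in sorted(counts.items())}


# Step 1: neofemale (XY, hormonally feminized) x regular male (XY).
step1 = cross("XY", "XY")
# Step 2: YY supermale x regular female (XX).
step2 = cross("YY", "XX")

print(step1)  # expected proportions of XX, XY and YY offspring
print(step2)  # expected proportions in the production cross
```

Under these assumptions only a quarter of the step 1 progeny are YY and they cannot be told apart from XY males by phenotype, which is why sex-linked markers are attractive as a replacement for progeny testing.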

Sex chromosomes are not identifiable in tilapias using standard cytogenetic techniques. Most species show 22 chromosome pairs, but none has a heteromorphic sex chromosome pair. Association studies suggested that sex is determined in tilapias by major genes located on linkage groups 1, 3, and 23 in the different species (Cnaani et al., 2008; Cnaani, 2013). Additionally, the heterogametic sex can be either the male or the female, depending upon the species, and ZZ/ZW (LG3) and XX/XY (LG1) systems have been reported within the same genus, i.e., in *O. niloticus* and *Tilapia zillii* (XX/XY), and in *Tilapia mariae*, *Oreochromis aureus*, *Oreochromis karongae*, and *Oreochromis tanganicae* (ZZ/ZW). Some families of blue tilapia have been found segregating for both loci, and in these cases the ZW locus appears to be epistatic over the XY locus (ZW/XY individuals are female; Lee et al., 2004). In Mozambique tilapia (*Oreochromis mossambicus*), the SD locus was found on LG1 (Liu et al., 2013). However, Cnaani et al. (2008) found sex-associated markers on LG1 and LG3 in three families of this species. These discrepancies may be due to the different genetic backgrounds of the families and strains used in these studies (Liu et al., 2013). The differences in SD within species show that minor genetic factors segregate and interact with major genes, in addition to the influence of environmental factors (Baroiller et al., 2009), suggesting that SD should be treated as a quantitative trait and its dissection approached using QTL screening (Eshel et al., 2012).

The main cultured species is the Nile tilapia (GIFT project; Lozano et al., 2013). Most data point to LG1 as the sex chromosome in this species (Cnaani et al., 2008), and the candidate region was recently narrowed by RAD (restriction site associated DNA) sequencing to 1–2 Mb on LG1 (Palaiokostas et al., 2013a). However, other sex-linked markers mapping to LG3 and LG23 have been identified in *O. niloticus* and its hybrids (*O. niloticus* x *O. aureus*) (Eshel et al., 2012). Linkage analysis has also shown that genetic factors are involved in the sensitivity of SD to temperature and that these factors are located, depending on the family, in the same chromosomal regions as the major QTL on LG1, LG3, and LG23 (Lühmann et al., 2012).

The closely sex-associated genomic region identified on LG1 includes 10 genes not previously related to SD in other species, and two SNPs proved very useful for sexing individuals, making them valuable to the industry for producing all-male stocks (Palaiokostas et al., 2013a). However, some of these markers were not associated with sex in different strains or families, so several markers on different linkage groups should be checked before they are applied. Most studies on SD in *O. niloticus* have been carried out on fish derived from the Lake Manzala population in Egypt, but this species is now produced worldwide, and new information is emerging that will probably confirm the need to check a set of markers before sexing specific strains.
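One simple way to check a marker before using it in a new strain is to compare marker-predicted sex with phenotypic sex in a sample of fish of known sex. The sketch below is only illustrative: the function name, the strain labels, and every record are invented for the example and are not data from the studies cited above.

```python
from collections import defaultdict


def concordance_by_strain(records):
    """Fraction of fish whose marker-predicted sex matches phenotypic sex.

    records: iterable of (strain, predicted_sex, phenotypic_sex) tuples.
    Returns a dict mapping each strain to its concordance (0.0 to 1.0).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for strain, predicted, observed in records:
        totals[strain] += 1
        hits[strain] += predicted == observed  # True counts as 1
    return {strain: hits[strain] / totals[strain] for strain in totals}


# Invented toy records (hypothetical strains, not real genotyping data).
records = [
    ("strain_A", "male", "male"),
    ("strain_A", "female", "female"),
    ("strain_A", "male", "male"),
    ("strain_A", "female", "female"),
    ("strain_B", "male", "female"),
    ("strain_B", "female", "female"),
    ("strain_B", "male", "male"),
    ("strain_B", "female", "male"),
]

result = concordance_by_strain(records)
for strain, value in result.items():
    print(f"{strain}: {value:.0%} concordance")
```

In this toy case the marker would be informative in one strain and no better than chance in the other, which is the situation that screening several markers on different linkage groups is meant to detect.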

## **CONCLUSION**

Major genetic factors can explain a high proportion of the SD variance in fish, in accordance with the hierarchical gonad development of vertebrates and with the models proposed to explain its origin and evolution. However, several other minor genetic and environmental factors also influence sex following a complex interactive pattern. Thus, the currently available information supports the idea that sex can be regarded as a complex trait in fish, with the influence of one or more genetic factors in addition to possible environmental influences, depending upon the species. The presence of genetic factors, regardless of whether SD is under the control of a master gene, a polygenic system, or an environmental factor, enables their application in MAS programs to exploit the benefits of a particular sex.

## **ACKNOWLEDGMENTS**

This research work was supported by the Spanish Government (Consolider Ingenio Aquagenomics: CSD2007-00002 project) and Spanish Ministerio de Ciencia e Innovación (AGL2009-13273 and AGL2010-15939) projects to Paulino Martínez and Francesc Piferrer.

## **REFERENCES**


the European sea bass, *Dicentrarchus labrax*, L. *Aquaculture* 256, 570–578. doi: 10.1016/j.aquaculture.2006.02.014


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 16 June 2014; accepted: 10 September 2014; published online: 29 September 2014.*

*Citation: Martínez P, Viñas AM, Sánchez L, Díaz N, Ribas L and Piferrer F (2014) Genetic architecture of sex determination in fish: applications to sex ratio control in aquaculture. Front. Genet. 5:340. doi: 10.3389/fgene.2014.00340*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Martínez, Viñas, Sánchez, Díaz, Ribas and Piferrer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Genetic considerations for mollusk production in aquaculture: current state of knowledge

## *Marcela P. Astorga\**

*Instituto de Acuicultura, Universidad Austral de Chile, Puerto Montt, Chile*

#### *Edited by:*

*José Manuel Yáñez, University of Chile, Chile*

#### *Reviewed by:*

*Ross Houston, University of Edinburgh, UK Jesús Fernández, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Spain*

#### *\*Correspondence:*

*Marcela P. Astorga, Instituto de Acuicultura, Universidad Austral de Chile, Los Pinos s/n. Pelluco, Casilla 1327, Puerto Montt, Chile e-mail: marcelaastorga@spm.uach.cl*

In 2012, world mollusk production in aquaculture reached a volume of 15,171,000 tons, representing 23% of total aquaculture production and positioning mollusks as the second most important category of aquaculture products (fishes are the first). Clams and oysters are the mollusk species with the highest production levels, followed in descending order by mussels, scallops, and abalones. In view of the increasing importance attached to genetic information on aquaculture, which can help with good maintenance and thus the sustainability of production, the present work offers a review of the state of knowledge on genetic and genomic information about mollusks produced in aquaculture. The analysis was applied to mollusks which are of importance for aquaculture, with emphasis on the 5 species with the highest production levels. According to FAO, these are: Japanese clam *Ruditapes philippinarum*; Pacific oyster *Crassostrea gigas*; Chilean mussel *Mytilus chilensis*; Blood clam *Anadara granosa* and Chinese clam *Sinonovacula constricta*. To date, the genomes of 5 species of mollusks have been sequenced, only one of which, *Crassostrea gigas*, coincides with the species with the greatest production in aquaculture. Another important species whose genome has been sequenced is *Mytilus galloprovincialis*, which is the second most important mussel in aquaculture production, after *M. chilensis*. Few genetic improvement programs have been reported in comparison with the number reported in fish species. The most commonly investigated species are oysters, with at least 5 genetic improvement programs reported, followed by abalones with 2 programs and mussels with one. The results of this work will establish the current situation with respect to the genetics of mollusks which are of importance for aquaculture production, in order to assist future decisions to ensure the sustainability of these resources.

**Keywords: mollusk, genomic, aquaculture, genetic improvement, genetic, bivalves**

## **AQUACULTURE**

The aquaculture industry consists of 3 main groups with different production volumes: fish, with a volume of 44.15 million tons in 2012, representing 66.3% of production; mollusks, with 15.17 million tons, equivalent to 22.8%; and crustaceans, with 6.44 million tons, or 9.7%. The production of other types was 0.86 million tons, or 1.3% (FAO, 2014). Based on these values, mollusks form the second largest group; however, the levels of biotechnological research and of the technologies applied are very low compared to the high level of development in cultivated fish species.
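As a quick arithmetic check, the percentage shares quoted above can be recomputed from the stated 2012 volumes. The sketch below uses only the four figures given in the text; the variable names are arbitrary.

```python
# 2012 aquaculture production by group, in million tons (values from the text).
production_mt = {
    "fish": 44.15,
    "mollusks": 15.17,
    "crustaceans": 6.44,
    "other": 0.86,
}

total = sum(production_mt.values())

# Share of each group as a percentage of the total, rounded to one decimal.
shares = {
    group: round(100 * volume / total, 1)
    for group, volume in production_mt.items()
}

print(f"total: {total:.2f} million tons")
for group, share in shares.items():
    print(f"{group}: {share}%")
```

The recomputed shares agree with the percentages reported in the paragraph.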

Biotechnology and genetic research have developed strongly in the last decade, and a large database now exists for fish species of importance for aquaculture (NCBI, GenBank, EMBL). These new research lines have permitted great progress and increased efficiency. Research of this kind has led to the development of many applications for fish production, with an increase in the number of techniques for genetic improvement, and a wider range of genetic variability of the resources for application in studies of traceability, pedigree, degrees of relatedness, and the search for molecular markers for Marker-Assisted Selection. These developments are well reviewed in Hulata (2001), Liu and Cordes (2004), and Martínez (2007); genetic improvement programs are reviewed in Gjedrem (2012) and Gjedrem et al. (2012). In mollusk production, however, the development of such knowledge has been minimal (for Latin American mollusks see Astorga, 2008).

The present work therefore reviews the current state of scientific knowledge on the genetics of mollusk species which are important for aquaculture. The amount of existing information on the genomes of these resources will be assessed—this is needed to establish a baseline for knowledge-based decision-making. Such knowledge would also enable us to define brood-stock selection programs, generate the bases for genetic improvement programs and finally assess the state of a given resource in order to ensure its long term conservation and sustainability.

## **MOLLUSK PRODUCTION**

World mollusk production in 2012 was 15.17 million tons. Total aquaculture production of mollusks is represented by 5 main groups: clams, oysters, mussels, scallops, and abalones. Clam production during 2011 was 4.929 million tons; this was followed by oysters with 4.519 million tons; mussels with 1.802 million tons; scallops with 1.52 million tons; and finally abalones with aquaculture production of 395,000 tons (FAO, 2013).

There are 5 principal mollusk species produced in aquaculture (i.e., with production of over 250,000 tons), all bivalves, corresponding to only 5 of the mollusk species reported in the FAO statistics (2013). Three of these are clams: *Ruditapes philippinarum* or Japanese clam; *Anadara granosa* (= *Tegillarca granosa*), Blood clam or Blood cockle from Malaysia; and *Sinonovacula constricta* or Chinese clam. The other two are the oyster *Crassostrea gigas* and the Chilean mussel *Mytilus chilensis*. The detailed production of each is shown in **Table 1**, including the economic value generated by these 5 resources with the highest volume in shellfish aquaculture. In this review the state of knowledge on the genetics of each of these 5 species will be evaluated.

## **GENETIC RESOURCES**

Genetic resources is the name given to all the genetic or genome information for a group of representative individuals of a species which constitutes a natural resource (CBD Nagoya Protocol, 2010). Knowledge of the genetic resources of species has become increasingly important with the discovery of the importance of genetic indicators for species conservation, obtaining sufficient raw material to generate successful improvement programs, the sustainability of natural banks and the appropriate management of genetic resources (Kenchington et al., 2003).

Correct use of genetic resources allows the species to be developed sustainably, maintaining it for future generations; this information is also the basis for new genetic improvement programs in species in early or advanced stages of aquaculture (Kenchington et al., 2003; Guo, 2009; Lind et al., 2012). To start mollusk production in a hatchery, Gaffney (2006) proposes that high levels of genetic diversity must be ensured among the brood-stock. Proper levels of variability must then be maintained over time, by monitoring genetic variability to avoid excessive inbreeding (Gaffney, 2006). These factors—which require the application of biotechnologies or the design of genetic improvement plans—have been identified as important for starting aquaculture production.
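Monitoring of genetic variability in a brood-stock is often summarized with expected heterozygosity (gene diversity), H<sub>e</sub> = 1 − Σp<sub>i</sub><sup>2</sup>, where p<sub>i</sub> is the frequency of allele *i* at a locus. This statistic is a standard population-genetics measure that the text does not name explicitly, so its use here is an assumption; the allele lists below are invented toy data.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity at one locus: 1 minus the sum of squared
    allele frequencies, computed from a flat list of sampled alleles."""
    counts = Counter(alleles)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Invented toy samples of 20 allele copies each (not real data).
wild_sample = ["A"] * 5 + ["B"] * 5 + ["C"] * 5 + ["D"] * 5
hatchery_sample = ["A"] * 15 + ["B"] * 5

he_wild = expected_heterozygosity(wild_sample)
he_hatchery = expected_heterozygosity(hatchery_sample)

print(f"wild: {he_wild:.3f}, hatchery: {he_hatchery:.3f}")
```

A drop in this value between the founding population and later hatchery generations would be one signal that variability is being lost and that brood-stock management should be revised.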

Genetic improvement programs require reproductive management throughout the whole cycle of the species. Mollusk production is still carried out in extensive aquaculture facilities, where the initial production stages depend completely on the natural environment and on seed obtained from natural banks for the fattening stage, which makes it difficult to design breeding programs. On the other hand, the costs of obtaining seed from the natural environment are drastically lower than those of seed production in a hatchery, threatening the profitability of mollusk production companies and their ability to invest in genetic improvement programs. These programs may become very profitable, but only if the sale value of the product is very high, as occurs with the oyster, or if the genetic gain is directly related with a reduction in production costs.

## **GENETIC INFORMATION ON THE MAIN AQUACULTURE SPECIES**

(Sorted by economic value of production according to **Table 1**).

## *RUDITAPES PHILIPPINARUM*

Genetic information on the Japanese clam *Ruditapes philippinarum* (**Figure 1A**), also known as the Manila clam or *Venerupis philippinarum*, was searched using the ISI Web of Science and National Center for Biotechnology Information (NCBI) search engines and the PubMed database. The search revealed a low level of existing genetic information, even though this is one of the 5 mollusk species with the highest production in the world and ranks first in economic value of production. Only around 30 papers discussing the genetics of this species were found (see database at Annex 1), although we consider that the number is not always representative of the information available. The research focused mainly on the genetic characterization of populations and the evaluation of genetic differentiation and population structure (42%), followed by searches for molecular markers (17%) and evaluation of hybrids with congeneric species (13%). A few papers describe research related to the genetic response associated with the immune system, gene expression, and transcriptome (total 21%), and finally 2 works address heritability and selection measurement (8%). However, although the main focus of research is on molecular genetics and molecular biology, a large number of works were found on gene expression as it relates to the immune response or to exposure to contaminants; there were at least 19 such papers from 2012 to date, showing a strong focus on the transcriptome of this species. A search for genome information in the NCBI databases found data on 30,442 DNA and RNA sequences, 6469 EST (Expressed Sequence Tag) sequences, and 1046 protein sequences (**Table 1**). Little genetic information is available on this species, despite the fact that it has been cultivated in hatcheries since the mid 1970s (Zhang and Yan, 2006). The information is insufficient to generate a basis for developing genetic improvement programs. The heritability of shell length has been calculated in natural and cultured populations at the larval and juvenile stages (Yan et al., 2014); the values found were 0.22 for larvae and 0.39 for juveniles in natural populations, and 0.17 and 0.87, respectively, for cultured populations. These results suggest that selection should be highly effective in this species, and that selecting either a natural or a cultured population to obtain faster growth should be successful. Zhao et al. (2012) established that divergent selection is effective and promotes shell length changes in the Manila clam, finding relatively moderate heritability values of 0.165 and 0.260 but quite high genetic gain values, with an increase in growth of 3.55% over the control line. Zhao et al. (2012) also found that a base population with high genetic variability should be established as the basis for a divergent selection program. This would allow stable production of high quality seed to ensure the sustainability of aquaculture of this resource.

**Table 1 | Aquaculture production and economic value of the 5 mollusk species with production greater than 250,000 tons (FAO, 2013), with review of genetic and genome information.**

*The genetic information is based on the number of genetics-related publications in Web of Science databases, NCBI, PubMed, databases in GenBank, EST (Expressed Sequence Tag) regions and coded proteins.*

**Figure 1 | (A)** *Ruditapes philippinarum*. **(B)** *Crassostrea gigas*. **(C)** *Mytilus chilensis*. **(D)** *Tegillarca granosa* or *Anadara granosa*. **(E)** *Sinonovacula constricta*. Image credit: MNHN.

## *CRASSOSTREA GIGAS*


Unlike the species described above, a large quantity of genetic research has been done on the oyster *Crassostrea gigas* (**Figure 1B**), associated with the search for molecular markers and the genetic characterization of populations. In areas with potential for application in aquaculture, works were found on selection response, heritability measurement, QTL searches, and chromosome manipulation, notably those of Guo et al. (2012), Ge et al. (2014), and Zhong et al. (2014). There were also many works associated with gene expression in relation to environmental stressors and transcriptome analysis (citations from 2010 to date in Annex 2). This agrees with the large number of EST records found in the NCBI database (206,647) as compared to the other species with high aquaculture production (**Table 1**). In general databases of scientific publications, at least 450 citations from the last 10 years were found relating to the genetics of this species. It is therefore one of the bivalve mollusks with the most highly developed genetic and biotechnological research. Furthermore, the nuclear and mitochondrial genomes of this species have been sequenced, which is the case for only 5 species of mollusk and 2 species of bivalve mollusk. There is a specialized database on this oyster, called Gigasbase, run by the IFREMER group in France. At least five genetic improvement programs have been developed for this species (Rye et al., 2010), although Gjedrem et al. (2012) mention only 3 selective breeding programs. There is a large volume of information for the implementation of new genetic improvement programs in the species. For example, we found a work assessing the heritability of shell pigmentation (Evans et al., 2009) which reported high values for narrow-sense heritability (*h*<sup>2</sup> = 0.59). This suggests that total left-shell pigmentation in *C. gigas* is strongly influenced by additive genetic variation and is therefore amenable to selection. Another characteristic which has been extensively studied in oysters is the genetic response to the high summer mortality observed in hatcheries (Dégremont et al., 2010). High realized heritability (*h*<sup>2</sup><sub>r</sub>) values have been estimated for high survival groups (above 0.88), higher than the values for low survival groups (0.55–0.68), and the authors conclude that these values would permit good results in a selective breeding program to improve survival (Dégremont et al., 2010). We also found estimates of growth heritability, with high values reported by Li et al. (2011) (0.40, 0.33, and 0.15) and by Wang et al. (2012) for growth rate (0.457, 0.312, and 0.332). These estimates provide a solid basis for the further development of genetic improvement programs for this species, ensuring greater genetic gain in the short term.
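The link between heritability and expected gain can be made concrete with the breeder's equation, R = h<sup>2</sup>S, where S is the selection differential and R the response per generation; realized heritability is the same relation read backwards, h<sup>2</sup><sub>r</sub> = R/S. This is a textbook relation rather than a method taken from the studies cited above, and the selection differentials and responses used below are hypothetical values chosen only to show the arithmetic.

```python
def expected_response(h2: float, selection_differential: float) -> float:
    """Breeder's equation: expected response R = h^2 * S."""
    return h2 * selection_differential


def realized_heritability(response: float, selection_differential: float) -> float:
    """Realized heritability: observed response divided by the applied
    selection differential (h^2_r = R / S)."""
    if selection_differential == 0:
        raise ValueError("selection differential must be non-zero")
    return response / selection_differential


# Narrow-sense heritability of shell pigmentation quoted in the text.
h2_pigmentation = 0.59
# Hypothetical selection differential, in arbitrary trait units.
S = 10.0
R = expected_response(h2_pigmentation, S)

# Hypothetical observed response of 4.4 units after applying S = 5.0 units.
h2_r = realized_heritability(response=4.4, selection_differential=5.0)

print(f"expected response: {R:.2f} units per generation")
print(f"realized heritability: {h2_r:.2f}")
```

The sketch ignores changes in variance under selection, environmental trends, and genotype by environment interaction, all of which affect real breeding programs.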

## *MYTILUS CHILENSIS*

There is less scientific information published on the mussel *Mytilus chilensis* (**Figure 1C**) than on the two previous species; in August 2014, only 15 papers on the genetics of this species could be found in the Web of Science (Annex 3). Aquaculture production is higher for this species than for other mussels, yet the amount of scientific information published is smaller; for example, around 100 scientific publications were found in the same database for the mussel *Mytilus galloprovincialis*. The largest proportion (50%) of the research published on *M. chilensis* is associated with the genetic characterization and population structure of this species along the Chilean coast. Works can be found on the identification of molecular markers, heritability, selection response, and gene expression, all in much the same proportions and small quantities. This is corroborated by a search for genome information in the NCBI databases, where only 119 records of sequenced genes were found and only 7 EST were identified (**Table 1**). Of the 5 top species in aquaculture production, *M. chilensis* presents the least genome information, and its genome has not been sequenced. However, it should be noted that the genome of the mytilid species *Mytilus galloprovincialis* has been sequenced; this species, cultivated principally around the coast of Spain, presents the next highest aquaculture production among mussels.

There is one record of a genetic improvement program for mussels (Rye et al., 2010; Gjedrem et al., 2012); however, it is not being applied in production because cultivation is based on seed obtained from the natural environment and not from hatcheries. Nevertheless, works exist which estimate heritability and response to selection in this species. High heritability values have been found for a variety of characteristics such as growth in shell height and live weight (0.2–0.9) (Toro et al., 2004a). In larvae these parameters present a heritability range between 0.38 and 0.84 (Toro et al., 2004b). Finally, Alcapan et al. (2007) assessed the effects of environment and ageing on the heritability of body size in *M. chilensis*; they observed great variability in heritability values, conditioned by the site and the age of the individuals. These data would allow good results to be obtained in improvement programs; however, the high production costs of seed obtained from hatcheries as opposed to the natural environment remain a limiting factor for this species. This makes the implementation of these improvement programs uncompetitive for producers in the short term. Nevertheless, the possibility cannot be ruled out that it may become necessary in future to make seed of high genetic quality available to this production sector.

## *ANADARA GRANOSA*

In the case of the Blood cockle *Anadara granosa* or *Tegillarca granosa* (**Figure 1D**), at least 4 publications were found on genetic characterization of populations, and two more relating to the search for molecular markers (Annex 4). There is therefore very little genetic information published for this species, despite its high production value. However, when we searched for genome information, we found more EST records for this species (2278) (**Table 1**) than for *Mytilus chilensis* (only 7 EST). No records were found of current genetic improvement programs in this species, nor any baseline data which would allow the development potential of such programs to be assessed. On the other hand, the natural population structure has been studied (Ni et al., 2012; Wang et al., 2013), and this is a prerequisite for selecting brood-stock and forming families for an improvement program. Wang et al. (2013) reported high genetic variability values, meaning that high diversity exists for brood-stock selection. However, they found genetic divergence at various sites on the coast of China presenting lower genetic variability, possibly caused by the admixture of artificially produced seed. This type of information suggests that genetic improvement programs need to include cross-breeding in their design in order to prevent genetic variability from being reduced, perhaps by using a large number of breeding individuals or a group which is representative of the genetic variability of the population. Ni et al. (2012) observed high genetic structuring in populations, indicating a high risk of erosion and possible local extinction. Populations need to be protected to prevent loss of genetic variability in the species.

## *SINONOVACULA CONSTRICTA*

Of the 5 top species in aquaculture production, the one with the least scientific information published on genetics is the Chinese clam *Sinonovacula constricta* (**Figure 1E**). In the Web of Science databases to August 2014, only three publications were found on searches for molecular markers plus two on the genetic evaluation of populations (Annex 5) (**Table 1**). However, there is a large number of sequenced genes in the NCBI databases, many more than for some of the other species analyzed. In this respect it ranks third among the 5 main species produced in aquaculture (16,383 genes; **Table 1**). There are no records of genetic improvement programs in this species, nor any publications of estimators for the implementation of such programs. Very little is known about its population structure. Niu et al. (2012) observed high genetic variability in 10 populations on the Chinese coast; however, they found genetic differentiation between sites, and propose the possible presence of cryptic species. This aspect should be studied, since it is important to identify the cultivated species correctly and to establish different management methods if more than one species exists.

## **CONCLUDING REMARKS**

On the basis of this review it is apparent that there is no obvious relationship between the economic importance of a species, in terms of the volume produced in aquaculture, and the amount of scientific information available about genetics/genomic resources, when restricted to the five most produced species. In this case our analysis was carried out for research into genetics, which is the basis for genetic improvement programs in aquaculture species. The scientific information found is summarized, with the databases used, in **Table 1**. From this we see that the most studied species, using all the genetic information indicators, is *Crassostrea gigas*. There is great interest in this species and it has been introduced into many parts of the world for cultivation. Among mussels, the small amount of genetic information available on *Mytilus chilensis* is striking. At the same time the congeneric species *Mytilus galloprovincialis* is the most studied mussel; furthermore the latter is one of only two mussel species whose genome has been sequenced, and there are 41,294 records of DNA sequences, 19,756 of EST and 2,424 coded proteins. The species with the largest number of DNA sequences is the oyster *Crassostrea angulata*, with 81,015 DNA sequences, even more than *C. gigas*. On the other hand, *C. gigas* is the bivalve mollusk with the most EST sequences identified.

The quantity of genetic information available on bivalve mollusks varies widely, with great differences between species, even though their production volumes in aquaculture are similar. There is also still little information with which to generate solid bases for genetic improvement programs such as are observed in fish or other cultivated species. The only exception is *C. gigas*, for which adequate base information exists and genetic improvement programs are currently being implemented. Sufficient information exists on the mussel *M. chilensis* to provide a basis for more improvement programs, but the low use of reproduction in controlled environments is a limiting factor in this species. For the other three species there is virtually no basic genetic information on which to predict their response to selection programs. Further research is required to generate a sufficient basis for selective reproduction, and also to clarify information on natural populations, identifying possible cryptic species or strong differentiation between populations. The lack of knowledge about species with high production in aquaculture could lead to population bottlenecks, since if the brood-stock management plan is unsuitable, or seed extraction from local populations is not handled correctly, the populations which sustain these resources could be drastically reduced. This could result in a severe diminution of genetic diversity, to the point where it might be difficult for the species to recover, meaning that in future these resources which are so important for aquaculture would not be sustainable.

## **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fgene.2014.00435/abstract

## **REFERENCES**


FAO. (2014). *The State of World Fisheries and Aquaculture 2014*. *Opportunities and challenges*. Rome: Food and Agriculture Organization of the United Nations, 223.


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 13 September 2014; accepted: 24 November 2014; published online: 10 December 2014.*

*Citation: Astorga MP (2014) Genetic considerations for mollusk production in aquaculture: current state of knowledge. Front. Genet. 5:435. doi: 10.3389/fgene.2014.00435*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Astorga. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## RNA-seq as a powerful tool for penaeid shrimp genetic progress

## *Camilla A. Santos\*, Danielly V. Blanck and Patrícia D. de Freitas*

Laboratory of Molecular Biodiversity and Conservation, Department of Genetics and Evolution, Federal University of São Carlos, São Carlos, Brazil

#### *Edited by:*

José Manuel Yáñez, University of Chile, Chile

#### *Reviewed by:*

Hsiao-Ching Liu, North Carolina State University, USA Gonzalo Rincon, University of California, Davis, USA

#### *\*Correspondence:*

Camilla A. Santos, Laboratory of Molecular Biodiversity and Conservation, Department of Genetics and Evolution, Federal University of São Carlos, Washington Luis Road (SP 310), Km 235, Caixa Postal 676, CEP: 13565-905, São Carlos, São Paulo, Brazil e-mail: camilla.alves@yahoo.com.br

The sequences of all the different RNA transcripts present in a cell or tissue that are related to gene expression and its functional control constitute what is called a transcriptome. Transcripts vary between cells, tissues, and ontogenetic and environmental conditions, and the knowledge that can be gained from them is highly relevant for genetic applications in aquaculture. Some of the techniques used in transcriptome studies, such as microarrays, are being replaced by next-generation sequencing approaches. RNA-seq emerges as a new option for analyzing transcriptome complexity as well as for identifying candidate genes and polymorphisms in penaeid species. Thus, it may also help to understand the mechanisms that determine complex traits and to support the genetic improvement of stocks. This review first presents an overview of transcriptome analysis by RNA-seq, followed by a discussion of how this approach may be applied to genetic progress within penaeid stocks.

**Keywords: differential expression analysis, differential phenotype, high-throughput transcriptome, penaeidae transcriptome, candidate genes**

## **INTRODUCTION**

The term RNA-seq refers to a transcriptome produced by next-generation sequencing (NGS) methods, which ensure good coverage in transcript detection thanks to the sequencing of millions of reads ranging from 25 to 300 bp, depending on the platform used (Wang et al., 2009; Oshlack et al., 2010). The full set of transcripts in a cell is known as the transcriptome. It comprises all types of ribonucleic acids (RNAs), including the protein-coding messenger RNA (mRNA) and non-coding RNAs (ncRNA) such as ribosomal RNAs (rRNA), transfer RNAs (tRNA), and small nuclear RNAs (snRNA). These RNAs may be differentially expressed according to the tissue, the stage of development and the physiological condition being assessed (Wang et al., 2009; Anders and Huber, 2010).

Transcriptome studies have been widely conducted in order to identify new genes, to prospect simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) markers, and to analyze differentially expressed genes. Such approaches have helped to clarify different mechanisms related to cellular control and to describe important metabolic pathways, which enables a better understanding of the genotype–phenotype relationship (Marguerat and Bähler, 2010; Khatri et al., 2012; Qian et al., 2014).

Small- and large-scale transcriptome analyses and differential expression studies, such as Expressed Sequence Tags (ESTs) and microarrays, have been carried out in some penaeid shrimp species (Rojtinnakorn et al., 2002; La Vega et al., 2007; James et al., 2010; Brady et al., 2013). However, RNA-seq approaches are still incipient in shrimp (**Table 1**; Li et al., 2012, 2013; Guo et al., 2013; Sookruksawong et al., 2013; Xue et al., 2013; Zeng et al., 2013; Baranski et al., 2014; Yu et al., 2014). RNA-seq has therefore emerged as a new option for analyzing transcriptome complexity under varied production and/or experimental conditions. Such an approach can support the development of genetically improved strains, focusing mainly on resistance traits.

To obtain a transcriptome via RNA-seq, the following steps should be followed: (i) selection of the tissue of interest and isolation of RNA molecules; (ii) construction of cDNA libraries; (iii) sequencing on an NGS platform; and (iv) analysis of the reads to establish unigenes and assemble the transcriptome using bioinformatics tools.

The choice of tissue should be based on the aim of the study and/or the nature of the genes to be analyzed. By analogy, a transcriptome is like a photograph of a cell taken at a specific time, capturing only the condition during that short period. Tissue selection and the timing of sampling therefore require careful planning; otherwise the experiment as a whole may be biased (Wang et al., 2009). Library construction is crucial for the final result, because the many laboratory procedures involved can introduce biases into the results (Wang et al., 2009). Bioinformatics analysis is also an important step and includes the use of computational tools that enable the processing of the large volumes of data generated by next-generation sequencing (Gavery and Roberts, 2012; Guo et al., 2013).

This review presents a brief overview of the RNA-seq method, including its main advantages and limitations. It then discusses how the technique may be applied to obtain genetic progress in penaeid shrimp farming.

## **RNA-seq: ADVANTAGES AND LIMITATIONS**

The transcriptome assembly may be based on an available reference genome (Wang et al., 2009), which allows similar regions to be located quickly using local alignment algorithms and presents higher reliability due to the large volume of small sized reads coming from alternative splicing. It equally provides a more even coverage of the genome (Anders and Huber, 2010; Qian et al., 2014). On the other hand, even when there is no reference genome available, *de novo* transcriptome assembly may be carried out using specific algorithms, which is a clear advantage for species that have not yet been widely studied (Howe et al., 2013; O'Neil and Emrich, 2013).

When a transcriptome is obtained via RNA-seq, high coverage is achieved, which allows the discovery of new genes and polymorphisms (Marguerat and Bähler, 2010; Yu et al., 2014). Li et al. (2012) evaluated the abundance and coverage of transcriptomes obtained by RNA-seq in *Litopenaeus vannamei*. By comparing these data with the ESTs available in GenBank, they found that only 14.2% (15,519 out of 109,169) of the unigenes obtained by RNA-seq were also present in the EST libraries, so that a large amount of new informative data was generated. In addition, the wide coverage and high resolution provided by this technique ensured high accuracy in SNP discovery in coding genes (Yu et al., 2014).
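The proportion reported by Li et al. (2012) can be checked with simple arithmetic. The sketch below is illustrative only: the function name and the toy identifier sets are hypothetical, and in practice unigenes are matched to ESTs by sequence similarity searches rather than by identical identifiers.

```python
def shared_fraction(unigenes, ests):
    """Return the fraction of unigenes that also occur in the EST collection."""
    unigenes = set(unigenes)
    if not unigenes:
        raise ValueError("unigene set is empty")
    return len(unigenes & set(ests)) / len(unigenes)


# Counts quoted in the text for L. vannamei (Li et al., 2012).
reported_shared = 15_519
reported_total = 109_169
reported_percent = round(100 * reported_shared / reported_total, 1)

# Toy identifiers (hypothetical) showing the same calculation on sets.
toy_unigenes = {"u1", "u2", "u3", "u4"}
toy_ests = {"u2", "e9"}
toy_fraction = shared_fraction(toy_unigenes, toy_ests)

print(f"Unigenes also present in ESTs: {reported_percent}%")
print(f"Unigenes new relative to ESTs: {round(100 - reported_percent, 1)}%")
```

Read the other way round, roughly 86% of the RNA-seq unigenes had no counterpart in the EST libraries, which is what makes the data set so informative.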

Accordingly, RNA-seq allows the detection of single-nucleotide variation, enabling the detection of the expression of protein isoforms and their respective allelic variants and thereby characterizing SNPs (Baranski et al., 2014; Yu et al., 2014). Polymorphic microsatellites or SSRs have also been identified through RNA-seq analysis (Mohd-Shamsudin et al., 2013; Zeng et al., 2013; Baranski et al., 2014). In those cases, though, wider coverage of the reference genome is suggested (Qian et al., 2014), since the presence of highly repetitive regions could be a limiting factor by compromising the transcriptome assembly.

As can be seen, RNA-seq is considered a robust method for large-scale gene expression analysis because it does not require prior knowledge of the genome (Wang et al., 2009) and enables the detection of isoforms arising from alternative splicing (Ghosh and Qin, 2010). Even when several samples are involved, the technique is accessible at moderate cost. Multiplex runs containing up to 10 samples per sequencing lane can be performed on some platforms, so that cost is no longer a limiting factor.

Another advantage of RNA-seq is its wide dynamic range (the ratio between the minimum and maximum expression levels that can be measured). This feature makes it suitable for measuring low, medium and high expression levels of genes without requiring very sophisticated normalization. By contrast, DNA microarrays show reliable results only for medium expression levels and therefore have a much smaller dynamic range. Thus, RNA-seq provides much more informative data while requiring less biological material and lower costs, which has made the technique popular for measuring gene expression on a large scale (Sharov et al., 2004; Wang et al., 2009).
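The notion of dynamic range can be made concrete with a small calculation. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the expression values are invented for illustration and do not come from any study cited here.

```python
import math


def dynamic_range(expression_values):
    """Ratio between the highest and the lowest non-zero expression value."""
    detected = [v for v in expression_values if v > 0]
    if not detected:
        raise ValueError("no expressed genes in the input")
    return max(detected) / min(detected)


def orders_of_magnitude(expression_values):
    """Dynamic range expressed on a log10 scale."""
    return math.log10(dynamic_range(expression_values))


# Hypothetical expression levels (arbitrary units); zero means "not detected".
rnaseq_like = [0.5, 3.0, 120.0, 5000.0, 0.0]
array_like = [40.0, 90.0, 400.0, 2000.0]

print(f"RNA-seq-like range: {orders_of_magnitude(rnaseq_like):.1f} orders of magnitude")
print(f"Array-like range:   {orders_of_magnitude(array_like):.1f} orders of magnitude")
```

A wider range on the log scale means that weakly and strongly expressed genes can be quantified in the same experiment.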

## **RNA-seq APPLICATION WITHIN PENAEID SHRIMP AQUACULTURE**

The use of RNA-seq in penaeid shrimp species can focus on transcriptome characterization, functional annotation, analysis of gene expression profiles, and identification of gene-associated markers. In this section, emphasis is given to the analysis of differential expression, the identification of molecular markers, and the potential of the technique to promote genetic gain and the development of improved penaeid strains. Similar studies have allowed the identification of candidate genes or quantitative trait loci (QTLs) that could be related to traits of interest for aquaculture, such as reproduction, sex determination, growth, immunity, and tolerance of environmental stress. Pathway data are also relevant for obtaining more detail about the mechanisms of interaction between the expressed products and about their importance and applicability.

### **IDENTIFYING CANDIDATE GENES THROUGH DIFFERENTIAL EXPRESSION ANALYSIS**

Although the application of RNA-seq to transcriptome and differential expression studies in aquatic organisms has increased in the past 3 years, the results found in the literature and in the Sequence Read Archive database of the National Center for Biotechnology Information (SRA-NCBI) indicate that the approach is still incipient for penaeid shrimp. The SRA database, for instance, holds only 28 deposits of NGS data for the species *L. vannamei*, *L. stylirostris*, and *Penaeus monodon* (http://www.ncbi.nlm.nih.gov/sra/?term=penaeidae). In the literature, next-generation data have been found only for *P. monodon* (Baranski et al., 2014), *Fenneropenaeus chinensis* (Li et al., 2013) and *L. vannamei* (Li et al., 2012; Guo et al., 2013; Sookruksawong et al., 2013; Xue et al., 2013; Zeng et al., 2013; Yu et al., 2014). Most research in this field has covered the identification of genes connected to immunity, mainly concerning the white spot syndrome virus (WSSV) and the Taura syndrome virus (TSV; Li et al., 2013; Sookruksawong et al., 2013; Xue et al., 2013; Zeng et al., 2013; Baranski et al., 2014). Both syndromes have caused great economic losses for the shrimp industry over the past few decades.

Although crustaceans lack an adaptive immune system, some candidate genes have been obtained from hemolymph and hepatopancreas tissues. This is clearly seen in some differential expression studies of *L. vannamei*, the species that represents the largest share of worldwide marine shrimp production (Gucic et al., 2013). Among the main genes studied are those related to toll-like and signaling receptors, apoptosis, *Vibrio cholerae* infection and other immune proteins (e.g., phagosome, hemocyanin, crustacyanin, antiviral), antioxidant enzymes (peroxidases and glutathione enzymes), and lectins (**Figure 1**; Li et al., 2012, 2013; Sookruksawong et al., 2013; Xue et al., 2013; Zeng et al., 2013; Baranski et al., 2014; Yu et al., 2014).

Data on toll-like and lectin proteins indicate that they may act as signaling molecules, increasing the expression of peptides responsible for controlling the immune response (Wang et al., 2014). Genes associated with apoptosis, on the other hand, may indicate an attempt to prevent the proliferation of viruses and possible damage to genetic material through the death of infected cells. The large number of proteins related to the response to *V. cholerae* infection is probably due to the recurring presence of this group of bacteria in shrimp farming tanks (Banerjee et al., 2012).

Information on the main metabolic pathways and on the number of the most frequent genes in each pathway was also collected as part of the data obtained via functional annotation of RNA-seq results. In penaeids, the most commonly described pathways were those involving general metabolism, the spliceosome, RNA transport, *V. cholerae* infection, the phagosome and antioxidant activity, the latter including peroxidase enzymes (Li et al., 2012, 2013; Sookruksawong et al., 2013; Xue et al., 2013; Zeng et al., 2013; Yu et al., 2014). The spliceosome and RNA transport pathways supposedly act in the formation of new transcripts, providing genetic variants that may contribute to resistance (Yang et al., 2007).

Regarding Gene Ontology (GO) categories, all studies in penaeids have reported largely the same results. Among biological processes, for instance, the most frequent were metabolism and biological regulation. With regard to cellular components, genes are mostly expressed in the cell and in some unspecified organelles. Finally, concerning molecular function, the most common categories were catabolic activity and binding (Li et al., 2012, 2013; Sookruksawong et al., 2013; Xue et al., 2013; Zeng et al., 2013; Baranski et al., 2014; Yu et al., 2014). Overall results such as these were expected, since very little genome information is available for the penaeid species mentioned here. In the case of *L. vannamei*, only approximately 12,000 gene products have been described, which may be useful in a comparative approach concerning a *de novo* assembly (http://www.ncbi.nlm.nih.gov/protein/?term=Litopenaeus+vannamei).

#### **IDENTIFYING GENE ASSOCIATED MARKERS**

RNA-seq technology has also proven to be an extremely useful tool for identifying SNPs, which may in turn be used to develop high-density SNP chips for genome-wide association studies (GWAS) and to build high-density linkage maps (Baranski et al., 2014; Yu et al., 2014). Furthermore, SNPs can be used as markers to distinguish allelic transcripts when studying allele-specific expression (Bell and Beck, 2009).

In a recent study, Yu et al. (2014) prospected SNPs in *L. vannamei*. A total of 58,717 unigenes and 36,277 high quality SNPs were predicted by transcriptomes "*M"* (produced by the authors themselves) and *"P"* (downloaded from the SRA database, accession number SRR346404, which was published by Li et al., 2012), respectively. Those SNPs were spread among 25,071 unigenes and assigned to 254 pathways in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. The main pathways containing a high number of SNPs were metabolic pathways, amoebiasis, *V. cholerae* infection, RNA transport, and actin cytoskeleton regulation.

Baranski et al. (2014) used the approach to build a high-density linkage map in *P. monodon*. A total of 6,000 out of 473,620 putative SNPs/indels were genotyped using the Illumina iSelect genotyping array. Of those SNPs, 3,959 were mapped to 44 linkage groups, and 2,340 of the mapped SNPs were functionally annotated according to the GO database (see datasets S5 and S6 from Baranski et al., 2014). According to the authors, these polymorphisms may be causal or closely linked to other mutations that affect important traits, such as disease resistance and reproductive performance.

The SNPs identified and functionally annotated by Baranski et al. (2014) and Yu et al. (2014) represent a useful resource for understanding the mechanisms that determine complex traits and, consequently, for developing programs aimed at the genetic improvement of these traits in penaeid shrimp strains. Accordingly, those SNPs can be applied both in marker-assisted selection (MAS), using SNPs closely associated with QTL, and in genomic selection, using the complete set of identified SNPs. This can increase the rate of genetic gain per generation in traits of great interest to the shrimp industry, such as growth and resistance to disease.

## **FINAL CONSIDERATIONS**

One of the challenges in achieving genetic gain is the development of penaeid strains that simultaneously show high growth and pathogen resistance.

Genetic correlation studies have shown that there is a negative phenotypic correlation between disease resistance and the weight gained by the animals (Argue et al., 2002; Gitterle et al., 2005; Cock et al., 2009). Cock et al. (2009) add that specimens potentially resistant to WSSV also show low reproductive efficiency. Such observations suggest that genes with pleiotropic effects may be responsible for the trade-off observed between these traits in penaeid shrimp. From this perspective, the RNA-seq technique can be used to discover such genes, since the overlap of differentially expressed genes between pathogen-resistant strains and strains with high weight gain can be verified. Genes that are up-regulated or down-regulated in both types of strain could therefore indicate a possible pleiotropic effect. In addition, mRNA studies combined with RNA-seq could also be used for microRNA (miRNA) analyses. This approach has been applied in aquaculture species such as the freshwater prawn *Macrobrachium rosenbergii* (Tan et al., 2013) and tilapia (Huang et al., 2012). As a result, it has been shown that miRNAs are critical regulators of general cellular functions such as differentiation, proliferation, and cell growth.
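The overlap comparison described above can be sketched as a simple set operation. This is a hedged illustration, not a published pipeline: the function names, gene identifiers and expression directions are all hypothetical, and an overlap alone does not demonstrate pleiotropy; it only flags candidates for further study.

```python
def shared_de_genes(resistant, fast_growth):
    """Genes differentially expressed in both strains, with their directions.

    Each argument maps a gene identifier to "up" or "down".
    The result maps each shared gene to (direction in resistant strain,
    direction in fast-growth strain).
    """
    shared = {}
    for gene in sorted(resistant.keys() & fast_growth.keys()):
        shared[gene] = (resistant[gene], fast_growth[gene])
    return shared


def opposite_direction(shared):
    """Shared genes regulated in opposite directions in the two strains."""
    return [gene for gene, (a, b) in shared.items() if a != b]


# Hypothetical differential expression calls for two strains.
resistant_strain = {"geneA": "up", "geneB": "down", "geneC": "up"}
growth_strain = {"geneB": "down", "geneC": "down", "geneD": "up"}

candidates = shared_de_genes(resistant_strain, growth_strain)
trade_off_candidates = opposite_direction(candidates)

print(candidates)
print(trade_off_candidates)
```

Genes regulated in opposite directions in the two strains may be of particular interest when looking for the basis of a trade-off, although this interpretation is an assumption of the sketch rather than a result from the cited studies.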

Another challenge in aquaculture is the difficulty of achieving sexual maturity and spawning in penaeid species (except for *L. vannamei*) under farming conditions (Lo et al., 2007; Brady et al., 2013). In an attempt to overcome this problem, eyestalk ablation has been practiced for many years. Nevertheless, the practice is associated with high mortality rates and with low spawning and survival rates (Huberman, 2000). Considering this, transcriptome analysis via RNA-seq of reproductive organs from wild and captive specimens of penaeid shrimp may contribute significantly to the identification of the underlying causes of the reproductive dysfunction observed in farmed animals. Furthermore, the discovery of genes involved in gonadal maturation and reproductive performance may assist gametogenesis-handling studies involving these species.

Finally, transcriptome and differential expression analysis by RNA-seq may be a powerful approach for optimizing penaeid diet composition (nutrigenomics), especially for species for which no specific diet is available. The approach may be used to identify specific changes at the molecular level (Chávez-Calvillo et al., 2010), which in turn cause metabolic and physiological changes in shrimp fed different diets (e.g., levels of crude protein, levels of plant protein inclusion and of antioxidants, vitamins, and polyunsaturated fatty acids). Thus, nutrigenomics can be used to produce healthy animals and safe, high quality products for the consumer, emerging as a promising area of research for sustainability and profitability in aquaculture (Cerdà and Manchado, 2013).

Although NGS technologies are proving efficient in work related to gene expression, other methodologies such as third-generation sequencing, also referred to as single-molecule sequencing (Single-Molecule Real-Time, SMRT), are being developed, although they already show limitations. More advanced sequencing techniques are also on the way, such as "next-next-generation" sequencing, which is capable of handling millions of DNA molecules simultaneously, including cDNAs derived from RNAs.

Considering the many technologies that are already available or emerging, researchers can only venture in this world of possible and promising technologies. Various research groups should seek to unite efforts in order to overcome the difficult and challenging task of applying the enormous potential of these new methods to advance and progress in penaeid shrimp aquaculture.

## **ACKNOWLEDGMENTS**

The authors would like to thank the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) for the financial support (2012/17322-8).

## **REFERENCES**


muscle of Nile tilapia (*Oreochromis niloticus*). *J. Anim. Sci.* 90, 4266–4279. doi: 10.2527/jas.2012-5142


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 15 June 2014; accepted: 11 August 2014; published online: 28 August 2014.*

*Citation: Santos CA, Blanck DV and de Freitas PD (2014) RNA-seq as a powerful tool for penaeid shrimp genetic progress. Front. Genet. 5:298. doi: 10.3389/fgene.2014.00298*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Santos, Blanck and de Freitas. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Appearance traits in fish farming: progress from classical genetics to genomics, providing insight into current and potential genetic improvement

## *Nelson Colihueque<sup>1</sup>\* and Cristian Araneda<sup>2</sup>*

<sup>1</sup> Laboratorio de Biología Molecular y Citogenética, Departamento de Ciencias Biológicas y Biodiversidad, Universidad de Los Lagos, Osorno, Chile <sup>2</sup> Laboratorio de Biotecnología y Genética Aplicada a la Acuicultura, Departamento de Producción Animal, Facultad de Ciencias Agronómicas, Universidad de Chile, Santiago, Chile

#### *Edited by:*

José Manuel Yáñez, University of Chile, Chile

#### *Reviewed by:*

Michelle Martinez-Montemayor, Universidad Central del Caribe, Puerto Rico

Antti Kause, MTT Agrifood Research Finland, Finland

#### *\*Correspondence:*

Nelson Colihueque, Laboratorio de Biología Molecular y Citogenética, Departamento de Ciencias Biológicas y Biodiversidad, Universidad de Los Lagos, Avenida Alcalde Fuchslocher 1305, Casilla 933, Osorno, Chile e-mail: ncolih@ulagos.cl

Appearance traits in fish, those external body characteristics that influence consumer acceptance at point of sale, have come to the forefront of commercial fish farming, as culture profitability is closely linked to management of these traits. Appearance traits comprise mainly body shape and skin pigmentation. Analysis of the genetic basis of these traits in different fish reveals significant genetic variation within populations, indicating potential for their genetic improvement. Work into ascertaining the minor or major genes underlying appearance traits for commercial fish is emerging, with substantial progress in model fish in terms of identifying genes that control body shape and skin colors. In this review, we describe research progress to date, especially with regard to commercial fish, and discuss genomic findings in model fish in order to better address the genetic basis of the traits. Given that appearance traits are important in commercial fish, the genomic information related to this issue promises to accelerate the selection process in coming years.

**Keywords: skin pigmentation, body shape, appearance traits, fish farming, quantitative trait loci**

## **INTRODUCTION**

Over the past few decades, body shape and skin pigmentation have become valuable appearance traits in commercial fish (**Table 1**). Due to increasing market sophistication, fish size, meat quality, and other traditional traits are not the only attributes that influence consumer choice at point of sale, especially when fish are sold whole.

On the subject of skin pigmentation, previous work conducted in other livestock species has demonstrated that proper handling of pigmentation traits allows for response to consumer demands for various food products, such as skin color in pigs and egg shell, yolk, and skin color in chickens (see review by Hudon, 1994). This topic is relevant for producers because the color of a food product is a quality attribute for the consumer. For example, consumers perceive redder salmon filets as being fresher, better-tasting, and higher quality as compared with paler salmon, and, therefore, they are willing to pay significantly more for the product (Anderson, 2000; Alfnes et al., 2006). Therefore, to satisfy modern market demands and increase profitability, producers are forced to manage external traits more intensively on an industrial scale, in particular body shape and skin color.

However, this is not an easy task, because body shape and skin color in fish are complex traits, involving numerous genetic and environmental factors. Thus, progress in this field will depend in part on dissecting the underlying genetics of these traits for future implementation of modern selection strategies, such as marker-assisted selection based on molecular data.

In commercial fish, such as the common carp, tilapia, sea bream, and salmonids (Pillay and Kutty, 2005), this strategy will complement progress made to date based solely on breeding values estimated with phenotypic and genealogical information or classical genetics which, for example, has enabled the development of new strains (**Figure 1**).

Further understanding of this issue may be gained from progress achieved in model and ornamental fish, where characterization of the inheritance mode of mutation, genes, and quantitative trait loci (QTL) for external traits is more advanced.

In this review, we describe efforts made to improve the external traits in commercial fish based on classical genetic approaches, as well as recent progress in genomics, the latter initially aimed at identifying the specific region that harbors genes controlling quantitative traits. This information, together with data available on this issue in model fish, will enhance progress in this field, an objective of tremendous importance for producers who need to increase the competitiveness of their cultures by managing external characteristics that give added value to cultured fish.

## **FISH BODY SHAPE**

In common carp, one of the first domesticated fish in the world (Balon, 2004), the long process of domestication has produced a domesticated phenotype very different from the wild-type phenotype (Ankorion et al., 1992; Zhang et al., 2013). Many of these changes have arisen due to intentional selection of traits, but it is equally true that many traits are the result of an unintentional selection process. This phenomenon emerged as a result of the adaptation of fish to captive conditions, which are quite different from the natural environment inhabited by wild-type fish.

**Table 1 | Examples of appearance traits for body shape and skin pigmentation used in fish farming.**

The domesticated phenotype phenomenon arises because the body shape of an organism results from the integration of morphological, behavioral, and physiological traits (Reid and Peichel, 2010), where different genetic and environmental pressures can lead to functional trade-offs (Reznick and Ghalambor, 2001; Ghalambor et al., 2003; Walker, 2010). This creates functional constraints, where those changes with the greatest positive and fewest negative effects on fitness will be selected (Reid and Peichel, 2010). For example, in natural populations, there is a relationship between body shape and swimming performance, but body shape is also influenced by foraging behavior, the risk of predation, and stream velocity (Webb, 1984; Walker, 1997). The trade-off for body shape also operates in captive populations. For instance, selecting cultured populations of rainbow trout for rapid growth results in more rotund fish, given the existence of a positive genetic correlation of body mass with body shape and condition factor (Gjerde and Schaeffer, 1989; Kause et al., 2003); that is, mass gain in fish is achieved by increasing body width and height rather than by increasing body length.
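The link between body mass, length and condition factor can be illustrated numerically. The sketch below assumes Fulton's condition factor, K = 100 W / L³ (W in grams, L in centimeters), which is a widely used definition but is not necessarily the exact measure used in the studies cited above; the fish measurements are hypothetical.

```python
def fulton_condition_factor(mass_g, length_cm):
    """Fulton's condition factor K = 100 * W / L**3 (W in g, L in cm)."""
    if mass_g <= 0 or length_cm <= 0:
        raise ValueError("mass and length must be positive")
    return 100.0 * mass_g / length_cm ** 3


# Two hypothetical fish of equal length; the second has gained mass through
# greater body width and height rather than greater length.
slender = fulton_condition_factor(mass_g=300.0, length_cm=30.0)
rotund = fulton_condition_factor(mass_g=405.0, length_cm=30.0)

print(f"Slender fish K = {slender:.2f}")
print(f"Rotund fish K  = {rotund:.2f}")
```

At a fixed length, any gain in mass raises K, which is consistent with selection for rapid growth producing more rotund fish when mass and condition factor are positively correlated.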

It has been shown that other factors, such as water velocity (Pakkasmaa and Piironen, 2001), rearing environment (Swain et al., 1991), fish density, and diet (Higgs et al., 1992; Einen et al., 1998; Jenkins et al., 1999) may also modify body shape in fish. This phenomenon occurs given that many morphological growth-related traits show phenotypic and genetic correlations in fish (Kause et al., 2003; Martyniuk et al., 2003), whose origins are related to the genetic architecture of traits such as covariation in QTL location or conservation of chromosomal regions homologous across species (Reid et al., 2005).

In common carp, this process has produced various phenotypes of commercial value that are currently used in fish farming, such as the high-backed and elliptical body shape morphs, typical of the Galician and Wuyuanensis strains, respectively (Ankorion et al., 1992; Zhang et al., 2013). The process of domestication can even reach a high level of body shape modification, as has been observed in the ornamental goldfish (*Carassius auratus*), where various morphological traits have been modified (e.g., body shape, fins, and eyes). Several of these modified traits can be found in the same individual, giving rise to popular strains called "monstrosities" (Balon, 1990).

Alterations of morphological characteristics, mainly body length and fins (Haffter et al., 1996), can also be seen in mutants of zebrafish (*Danio rerio*). In the guppy (*Poecilia reticulata*), variation in male body shape occurs in association with mating success (Tripathi et al., 2009). Therefore, available evidence in fish indicates that these organisms are highly amenable to morphological modification, which has already been widely explored in ornamental as well as in model fish.

The underlying genetics of phenotypic variation is beginning to be understood in various commercial fish (Massault et al., 2009; Loukovitis et al., 2013; Zhang et al., 2013). These studies are focusing on finding significant QTL for morphometric traits based mostly on geometric morphometry; in these analyses, different types of molecular markers have been used. These investigations are contributing to our understanding of the genetic architecture of divergence in body shape by determining the number of genes or QTLs that contribute to a particular trait, the number of traits that a particular gene or QTL affects (i.e., the pleiotropic effect), and the location of genes or QTLs within the genome that affect body shape, along with their interaction.

**FIGURE 1 | Examples of commercial fish strains with improved skin pigmentation and body shape. (A)** A wild-type tilapia (*Oreochromis niloticus*) with normal black pigmentation; **(B)** a red strain tilapia (red Yumbo) with improved red skin pigmentation; **(C)** a wild-type rainbow trout (*Oncorhynchus mykiss*) with normal pigmentation and **(D)** a Blue Back rainbow trout with improved intense bluish back, whitish belly, and a reduced number of dark spots; **(E)** a common carp (*Cyprinus carpio*) of var. *haematopterus* (or Amur wild carp) with a spindle-shaped body and steel-gray skin color; and **(F)** a common carp of var. *wuyuanensis* with improved broadly elliptical body (red skin color).

Progress in model fish should be mentioned here, particularly in zebrafish (Haffter et al., 1996) where a set of dominant Mendelian loci affecting body shape and fins in induced mutants have been identified. For example, loci that affect body shape may cause a reduction of overall body length in the adult fish, due to a reduction either in the length of vertebrae (*stöpsel* mutant) or number of vertebrae (*däumling* mutant). Interestingly, mechanisms of body shape variation involving axial length modification also occur naturally across several fish species (Ward and Mehta, 2010), which indicates that this mechanism has been of evolutionary significance for body form differentiation in fish.

Recent work on QTL searching in commercial fish clearly supports the existence of major genes underlying the quantitative genetic variation of morphological and body shape-related traits (**Table 2**). In Gilthead seabream (*Sparus aurata*), Loukovitis et al. (2013), using half-sib regression analysis, found significant morphology QTLs, e.g., distances from pectoral fin to dorsal fin or from pectoral fin to anal fin (see **Table 2**), in three linkage groups (9, 21, and 25) identified at genome-wide level that explain 18.5 to 27.1% of trait variation. This result suggests the existence of one locus in each linkage group affecting several traits in this fish. Moreover, given that QTLs affecting body weight were located at the same positions for the linkage groups 9 and 21 (Loukovitis et al., 2011), the authors conclude that there might be only one pleiotropic QTL in each LG affecting overall body size. This is in accordance with the high genetic correlations (rG > 90%) observed between all traits analyzed (see **Table 2**), including body weight. These results, combined with those obtained from previous studies (Boulton et al., 2011), underline highly significant loci affecting overall morphology in *S. aurata*.

On the other hand, using half-sib regression analysis and variance component analysis at the genome-wide level in sea bass (*Dicentrarchus labrax*), six significant QTLs for a combination of morphometric traits (standard length, head length, body length, pre-anal length, abdominal length, post-anal length, head depth, body depth; see **Table 2**) on linkage groups 1B, 4, 6, 7, 15, and 24 were reported by Massault et al. (2009). These QTLs explain between 9.4 and 16% of phenotypic variance. In this study, a body weight QTL was discovered at the same linkage groups (linkage groups 4 and 6) and at similar positions as morphology QTLs, which might explain the high correlation observed between body weight and all morphometric traits studied in this fish.

Moreover, in common carp (*Cyprinus carpio*), in a primary genome-wide scan using single nucleotide polymorphisms (SNPs) and microsatellite markers, Zhang et al. (2013) found five significant QTLs for body-shape related traits (body height, body width and standard length) located at linkage groups 1, 12, and 20, which explain 20.4 to 20.7%, 18.9 to 21.1%, and 19.5% of phenotypic variance, respectively. Given that QTLs of linkage group 1 were located in the same interval, it was concluded that only one QTL produced pleiotropic effects on these traits, which was not the case for QTLs found in linkage group 12, indicating that different factors control the traits. Importantly, this study provides strong evidence that the marked body shape differences of *Cyprinus carpio* populations, in particular between *Cyprinus carpio* var. *wuyuanensis* and *Cyprinus carpio* var. *haematopterus*, depend on quantitative genetic variations that control different body shape-related traits that may have originated through the process of selective breeding that has occurred for decades in this species.

QTLs were detected at genome-wide level using permutation tests at a significance threshold value of P < 0.05 for *S. aurata* and *D. labrax*, and of P < 0.01 for *C. carpio*.
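The QTL scans summarized above rely on permutation tests to set genome-wide significance thresholds. The following is a minimal Python sketch of that general idea, not the pipeline of any cited study. It assumes a single-marker scan with genotypes coded 0/1/2, a squared correlation as the test statistic, and simulated data; all function names and values are illustrative.

```python
import random

def marker_stat(genotypes, phenotypes):
    """Squared correlation between one marker and the trait (stand-in test statistic)."""
    n = len(phenotypes)
    mg = sum(genotypes) / n
    mp = sum(phenotypes) / n
    sgg = sum((g - mg) ** 2 for g in genotypes)
    spp = sum((p - mp) ** 2 for p in phenotypes)
    if sgg == 0 or spp == 0:
        return 0.0
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotypes, phenotypes))
    return sgp * sgp / (sgg * spp)

def genome_wide_threshold(markers, phenotypes, alpha=0.05, n_perm=200, seed=1):
    """Shuffle phenotypes, record the genome-wide maximum statistic of each
    shuffle, and return the (1 - alpha) quantile of those maxima."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(max(marker_stat(m, shuffled) for m in markers))
    maxima.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return maxima[index]

# Simulated data (assumption): 120 fish, 40 markers, marker 0 has a large effect.
rng = random.Random(42)
n_fish, n_markers = 120, 40
markers = [[rng.choice((0, 1, 2)) for _ in range(n_fish)] for _ in range(n_markers)]
phenotypes = [2.0 * markers[0][i] + rng.gauss(0, 1) for i in range(n_fish)]

threshold = genome_wide_threshold(markers, phenotypes)
observed = [marker_stat(m, phenotypes) for m in markers]
significant = [j for j, s in enumerate(observed) if s > threshold]
```

Taking the maximum statistic across all markers in each shuffle is what makes the threshold genome-wide rather than per-marker: it controls the chance of at least one false positive anywhere in the scan.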

In salmonids, another important group of commercial fish (Pillay and Kutty, 2005), progress has been made in this field through QTLs searching, mainly for growth-related traits, including fork length, body weight, and Fulton's condition factor, and also for meristic traits (for review see Araneda et al., 2008). For example, in Atlantic salmon (*Salmo salar*), four QTLs for condition factor and two for body weight were detected in comparative studies with rainbow trout (*Oncorhynchus mykiss*) and Arctic charr (*Salvelinus alpinus*). One strong QTL explaining 20.1% of variation in body weight was found on linkage group AS-8, while another QTL with a strong effect on condition factor accounting for 24.9% of trait variation was found on linkage group AS-14. This result suggests that a significant portion of quantitative variation in body weight and condition factor in this species is under the control of a few QTLs with relatively large effects.
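Fulton's condition factor, one of the traits mapped in these studies, relates body weight to the cube of body length. A minimal Python sketch follows. It assumes the common convention K = 100 × W / L³ with weight in grams and length in centimetres; the cited studies may use a different scaling, and the fish values are hypothetical.

```python
def fulton_condition_factor(weight_g, length_cm):
    """Fulton's K = 100 * W / L**3 (W in grams, L in centimetres; assumed convention)."""
    if weight_g <= 0 or length_cm <= 0:
        raise ValueError("weight and length must be positive")
    return 100.0 * weight_g / length_cm ** 3

# Two hypothetical fish of equal fork length: the heavier one has the higher K.
slender = fulton_condition_factor(weight_g=270.0, length_cm=30.0)
rotund = fulton_condition_factor(weight_g=405.0, length_cm=30.0)
```

Because length enters as a cube, a fish that gains mass through body depth and width rather than length shows a higher K, which is why the index serves as a simple proxy for body shape.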

However, to date no study has been specifically undertaken to search for QTLs in salmonids based on geometric morphometric methods. It is noteworthy that in some species of this group, such as rainbow trout, a marked intra- and inter-population differentiation in body shape has been observed (Kause et al., 2003; Hecht et al., 2012; Pulcini et al., 2013). For example, Pulcini et al. (2013), in a common-garden experiment using geometric morphometry, found marked morphological variation in body shape traits such as body profile, head length, dorsal and anal fin length, and caudal peduncle size among wild, semi-wild and domestic lines of this species. Domestic lines have a deeper body profile, with longer dorsal and anal fins and shorter and deeper caudal peduncles than wild lines. This differentiation, attributed to exposure of domestic lines to captive conditions, suggests that the variations may result from fixed genetic differences among lines due to the existence of QTL. Therefore, further QTL analysis in rainbow trout would be useful in clarifying the underlying genetics of this striking differentiation in body shape. SNPs can be used to achieve this goal, given that these markers are considered to be the most desirable molecular markers for developing high-density genome scans to discover and locate target genes underlying quantitative traits (Wang et al., 1998). This approach has been demonstrated to be efficient in discovering several QTLs in guppy that control the complex patterns of skin pigmentation of males (Tripathi et al., 2009). However, it requires next-generation sequencing analysis to discover thousands of such SNPs (Miller et al., 2007; Davey et al., 2011) and, in several cases, the development of SNP chips to perform genome-wide scans.

## **SKIN PIGMENTATION**

The cellular basis of skin pigmentation in fish is well known. Skin color depends on five types of pigment cells (or chromatophores) known as melanophores, xanthophores, erythrophores, iridophores, and leucophores, each producing a different color (black or brown; yellow or orange; red; iridescent, blue, silver, or gold; and white, respectively; Fujii, 1969, 1993). The underlying genetics of skin pigmentation phenotype, however, has been explored mostly for qualitative traits by means of large-scale analyses of natural or induced color mutants, mainly in model fishes (see review by Colihueque, 2010). Evidence from studies in these and in other fish species indicates that the inheritance mode of qualitative traits for skin pigmentation has a simple genetic basis (Tave, 1986), i.e., monogenic control, which may be recessive, completely or incompletely dominant, co-dominant, or sex-linked. Moreover, these studies indicate that several genes participate in producing a specific skin color or color pattern and may be involved in chromatophore development, pigment synthesis, and pigment expression. For example, about 90 and 40 genes of this type have been identified in zebrafish and medaka (*Oryzias latipes*), respectively, that control specification, proliferation, survival, differentiation, and distribution of chromatophores, among other processes.

However, recent studies emphasize that skin pigmentation in fish can also possess a more complex genetic architecture, characterized by specific genome regions that harbor genes controlling quantitative traits (**Table 3**). For example, in the threespine stickleback (*Gasterosteus aculeatus*), two significant QTLs on linkage groups 1 and 6 that control the degree of barring and explain 26.6% of the variance of the trait were found (Greenwood et al., 2011). Given that these QTLs were associated with spatial variation in melanophore number (linkage group 6) and degree of melanization of melanophores (linkage group 1), this finding reveals the existence of different loci underlying variation in pigment patterns of this fish, which shows striking diversity among freshwater (barred) and marine (unbarred) populations. Moreover, the number of dorsal and ventral melanophores is also controlled by different loci in this fish, since they were mapped to linkage group 7 and linkage group 1, respectively.

**Table 3 | Examples of QTLs for skin pigmentation traits mapped on different fish model genomes.**

QTLs were detected at genome-wide level using permutation tests at a significance threshold value of P < 0.05 for *G. aculeatus* and *P. reticulata*.

Through synteny analysis, Greenwood et al. (2011) identified the *Gja5* gene contained in the barring QTL on linkage group 6, which encodes a gap junction protein whose mutation disrupts the normal pigmentation pattern in zebrafish, in which spots form in place of the typical horizontal stripes, caused by an altered melanophore distribution (Watanabe et al., 2006). Moreover, the region on linkage group 1 that mapped QTLs associated with both barring and degree of melanization contains the *tyrosinase* gene, which encodes a key enzyme in melanin synthesis, whose mutation eliminates all pigmentation in zebrafish and medaka (Koga et al., 1995; Iida et al., 2004). Therefore, these results suggest that a few genes with large effects underlie the pigmentation pattern variation in the threespine sticklebacks.

In guppy males, a more complex control of pigmentation pattern has been observed (Tripathi et al., 2009), including a phenotype characterized by multi-colored areas with an ornamental function involved in female choice and in male mating success, and therefore, important for male fitness. In the genome of this fish, using interval mapping and the multiple-QTL model, 49 QTLs for 11 areas of pigmentation traits were found (see **Table 3**), which explain 9.4 to 26.8% of phenotypic variation in these traits. In addition, these QTLs were mapped in 19 out of 24 linkage groups of this species, although mainly on linkage group 12 and 4. QTLs located on linkage group 12, which corresponds to the sex chromosome of the guppy, indicate that loci responsible for polymorphisms in guppy color patterns are clustered on this chromosome. This result coincides with previous knowledge regarding physical linkage of major color pattern loci to sex chromosomes in this species (Winge and Ditlevsen, 1947; Khoo et al., 1999). In summary, the results obtained in this fish strongly suggest that multiple QTLs with minor effects contribute to each color trait in guppy males.

In commercial fish, most color phenotypes of commercial value are qualitative traits known to be under Mendelian control, such as the Red Stirling strain of tilapia (*Oreochromis niloticus*; dominant inheritance, McAndrew et al., 1988), or the iridescent metallic blue variant of rainbow trout (recessive inheritance, Kincaid, 1975); therefore, given their simple inheritance mode, they could be more easily subjected to selective breeding for new stocks with particular colors.

However, there are some rainbow trout skin pigmentation phenotypes of commercial value, such as the Blue Back (Colihueque et al., 2011) and Finnish national breeding program (Kause et al., 2003) traits, with complex pigmentation patterns comprising several attributes (skin color, and the number, size, and position of dark spots) that vary continuously at the intrapopulation level. A substantial quantitative genetic component for the different attributes that compose these traits has been reported (Kause et al., 2003; Díaz et al., 2011). As has been seen in model fish, it is possible that these skin pigmentation traits possess a complex genetic architecture, with a variable number of quantitative loci with a minor or major effect for the different trait attributes. Further analysis of these traits will clarify their particular genetic architecture.

## **CONCLUDING REMARKS**

In farmed fish, several traits are taken into account in order to obtain a quality fish harvest suitable for marketing. These traits include body shape and skin pigmentation, both of which affect consumer acceptance of marketed fish at the point of sale. A fish with an improved appearance has greater consumer acceptance and, therefore, has a higher sale value than a fish with a normal appearance.

There has been some progress in this area with commercial fish, including traditional and new cultures, mainly through selective breeding or classical genetic analysis. This selection strategy has resulted in new fish stocks whose market participation is constantly increasing, contributing to the improved profitability of fish cultures. This trend is expected to continue over the next few years due to the sophistication of the market in many areas of the world. Therefore, there is interest in fish selection to ensure specimens that are visually appealing, for example, tilapia, rainbow trout, common carp, gilthead sea bream, and sea bass.

However, to meet this challenge, fish farmers must adapt and align their selective breeding goals with market demands. One tool that may be explored to achieve this objective derives from the discovery of QTLs or genes that underlie body shape and skin pigmentation, in which continuous variation of the different attributes that compose these traits is usually observed. This information could be used to implement selective breeding based on molecular markers tightly linked to QTLs that control various appearance traits of commercial interest, that is, marker-assisted selection. This strategy may offer a more rapid response, yielding fish with a specific external appearance to satisfy market demands.

## **ACKNOWLEDGMENTS**

The suggestions and constructive comments of all those who helped to improve the final version of this manuscript are gratefully acknowledged. The publication fee of this work was supported by the Departamento de Ciencias Biológicas y Biodiversidad of the Universidad de Los Lagos.

## **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 24 April 2014; accepted: 10 July 2014; published online: 04 August 2014. Citation: Colihueque N and Araneda C (2014) Appearance traits in fish farming: progress from classical genetics to genomics, providing insight into current and potential genetic improvement. Front. Genet. 5:251. doi: 10.3389/fgene.2014.00251*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Colihueque and Araneda. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Primary analysis of repeat elements of the Asian seabass (*Lates calcarifer*) transcriptome and genome

## *Inna S. Kuznetsova1,2\*, Natascha M. Thevasagayam1, Prakki S. R. Sridatta1, Aleksey S. Komissarov2,3, Jolly M. Saju1, Si Y. Ngoh1,4, Junhui Jiang1,5, Xueyan Shen1 and László Orbán1,6,7\**

*<sup>1</sup> Reproductive Genomics Group, Strategic Research Program, Temasek Life Sciences Laboratory, The National University of Singapore, Singapore, Republic of Singapore*

*<sup>6</sup> Department of Animal Sciences and Animal Husbandry, Georgikon Faculty, University of Pannonia, Keszthely, Hungary*

*<sup>7</sup> Department of Biological Sciences, National University of Singapore, Singapore, Republic of Singapore*

#### *Edited by:*

*Scott Newman, Genus plc, USA*

#### *Reviewed by:*

*Hendrik-Jan Megens, Wageningen University, Netherlands Scott Newman, Genus plc, USA*

#### *\*Correspondence:*

*Inna S. Kuznetsova and László Orbán, Reproductive Genomics Group, Temasek Life Sciences Laboratory, The National University of Singapore, 1 Research Link, Singapore 117604, Republic of Singapore e-mail: inna@tll.org.sg; laszlo@tll.org.sg*

As part of our Asian seabass genome project, we are generating an inventory of repeat elements in the genome and transcriptome. The karyotype showed a diploid number of 2*n* = 24 chromosomes with a variable number of B-chromosomes. The transcriptome and genome of Asian seabass were searched for repetitive elements with experimental and bioinformatics tools. Six different types of repeats constituting 8–14% of the genome were characterized. Repetitive elements were clustered in the pericentromeric heterochromatin of all chromosomes, but some of them were preferentially accumulated in pretelomeric and pericentromeric regions of several chromosome pairs and had a chromosome-specific arrangement. From the dispersed class of fish-specific non-LTR retrotransposon elements, Rex1 and MAUI-like repeats were analyzed. They were widespread in both the genome and the transcriptome and accumulated in the pericentromeric and peritelomeric areas of all chromosomes. Every analyzed repeat was represented in the Asian seabass transcriptome, and some showed differential expression between the gonads. The other group of repeats analyzed belongs to the rRNA multigene family. The FISH signal for 5S rDNA was located on a single pair of chromosomes, whereas that for 18S rDNA was found on two pairs. A BAC-derived contig containing rDNA was sequenced and assembled into a scaffold containing incomplete fragments of 18S rDNA. Their assembly and chromosomal position revealed that this part of the Asian seabass genome is extremely rich in repeats containing evolutionarily conserved and novel sequences. In summary, transcriptome assemblies and cDNA data are suitable for the identification of repetitive DNA from unknown genomes and for comparative investigation of conserved elements between teleosts and other vertebrates.

**Keywords: repeated DNA, Asian seabass, transcriptome, rDNA, chromosomes**

## **INTRODUCTION**

Asian seabass (*Lates calcarifer*), also known as barramundi, belongs to the family *Latidae* in the order *Perciformes*, which represents the largest order of vertebrates. Most members of the *Latidae* family are endemic fishes in Africa and the Indian and Pacific Oceans (Luna, 2008). The Asian seabass is a fast-growing, popular food fish found in tropical and sub-tropical fresh and salt waters. Their white flaky meat and the low number of Y-bones have made them a highly rated table fish. This species is a protandrous hermaphrodite: individuals typically mature as males and later reverse their sex to become females (Moore, 1979; Guiguen et al., 1993). The availability of classical aquaculture technologies made the Asian seabass potentially suitable for marker-assisted selection (MAS). Over the past 10 years, together with our two partners in Singapore, we have been working on a MAS project of the Asian seabass. Information encoded in the genes and in the whole genome is required to increase the resolution of selection from MAS to genomic selection in order to make the approach suitable to identify minor-effect genes that control only a small proportion of a trait (Liu and Cordes, 2004; Gjedrem and Baranski, 2009; Gjedrem, 2010).

Recently, we have started the Asian seabass Genome Project in order to develop new platforms for the eventual improvement of the selection process. A substantial part of nuclear genomes of most eukaryotes is occupied by various types of repetitive DNA sequences (Britten and Kohne, 1968) that make the assembly of genomic sequences difficult (Sutton et al., 1995). On the other hand, repeated DNA elements have numerous advantages for genomic studies. They have been extensively applied as physical chromosome markers in comparative studies for the identification of chromosomal rearrangements, the identification of sex chromosomes, chromosome evolution analysis and applied genetics (Ferreira and Martins, 2008).

*<sup>2</sup> Institute of Cytology of the Russian Academy of Sciences, St-Petersburg, Russia*

*<sup>3</sup> Theodosius Dobzhansky Center for Genome Bioinformatics, St Petersburg State University, St Petersburg, Russia*

*<sup>4</sup> School of Biological Sciences, Nanyang Technological University, Singapore, Republic of Singapore*

*<sup>5</sup> Agri-Food and Veterinary Authority of Singapore, Singapore, Republic of Singapore*

Repetitive DNA sequences are classified, based on the genomic organization of their repetitive units, as dispersed or tandem. Dispersed repeats, represented by various classes of transposable elements, encode proteins that facilitate their replication and integration into the nuclear genome (Devine et al., 1997). Tandem repeats are organized in tandem arrays that may be large and consist of thousands or even millions of repetitive units arranged in head-to-tail orientation (Willard, 1991). Chromosome loci rich in polymorphic satellite DNA usually show specific banding patterns, and this makes them potentially useful as cytogenetic markers to discriminate individual chromosomes (Lee et al., 2005; Han et al., 2008). The presence of repeat-derived chromosome landmarks enables the identification of individual chromosomes and is a prerequisite to study structural changes accompanying evolution and speciation and to follow chromosome behavior and transmission in interspecific hybrids (Lee et al., 2005).

The processes by which satellites arise and evolve are not well understood; unequal crossing over, gene conversion, transposition and formation of extrachromosomal circular DNA have all been implicated (Carone et al., 2009; Enukashvily and Ponomartsev, 2013). The Nucleolus Organizer Regions (NORs) are described as highly repetitive genome sites related to rRNA synthesis. These regions comprise small, active transcription sites (highly conserved during evolution) and non-transcribed spacer segments (highly variable), organized as two distinct multigene families: the 47S and 5S rRNA genes. Tandemly arrayed rDNA repeats are composed of hundreds to thousands of copies, are interrupted by dispersed elements, and are usually located on separate chromosomes (Cioffi et al., 2011).

Repeat regions can cause serious problems during genome assemblies, especially for those that involve short reads produced by next generation sequencing (NGS) technologies. This is due to the fact that a fragment from a repeat region can have false overlaps with fragments from other repeat regions, resulting in the merging of unrelated regions and incorrect final assemblies. The fragments from repeat regions are identified by the number of potential overlaps each fragment has based on pairwise comparisons (Sutton et al., 1995).
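The idea of spotting repeat-derived fragments by their unusually high number of candidate overlaps can be sketched in a few lines of Python. This is a toy illustration, not the procedure of Sutton et al. (1995). It assumes that two fragments "potentially overlap" when they share at least one exact k-mer; it ignores reverse complements and sequencing errors; and the sequences, `k`, and the `max_overlaps` cutoff are hypothetical.

```python
from collections import defaultdict

def candidate_overlap_counts(fragments, k=8):
    """For each fragment, count how many other fragments share at least one k-mer with it."""
    kmer_to_fragments = defaultdict(set)
    for idx, seq in enumerate(fragments):
        for i in range(len(seq) - k + 1):
            kmer_to_fragments[seq[i:i + k]].add(idx)
    partners = [set() for _ in fragments]
    for members in kmer_to_fragments.values():
        for idx in members:
            partners[idx].update(members - {idx})
    return [len(p) for p in partners]

def flag_repeat_fragments(fragments, k=8, max_overlaps=2):
    """Indices of fragments with more candidate overlaps than expected for unique sequence."""
    counts = candidate_overlap_counts(fragments, k)
    return [idx for idx, c in enumerate(counts) if c > max_overlaps]

# Toy data: four fragments carry the same 12-bp repeat unit, two are unique.
repeat = "ACGTTGCAAGGC"
fragments = [
    "GATTACAG" + repeat,
    repeat + "CCTAGGTA",
    "TGCATGCC" + repeat,
    repeat + "GGATCCAT",
    "CTCTCTGAGAGACTGA",
    "AATTCCGGTTAACCGG",
]
counts = candidate_overlap_counts(fragments)
flagged = flag_repeat_fragments(fragments)
```

In real data the cutoff would be chosen relative to sequencing depth, since even unique regions accumulate overlaps in proportion to coverage.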

In this study, we focus on the primary analysis of repeat elements of the Asian seabass transcriptome and genome using approaches based on classical cytogenetics, molecular biology and next generation sequencing technology. Through this work we have demonstrated that the transcriptome assembly and cDNA data contain repeats and are suitable for the identification of repeat DNAs from an unknown genome and for comparative investigation of conserved repeat elements between species. Altogether, thirteen repeat elements, including 5S rDNA and 18S rDNA, have been identified from the Asian seabass genome; together they constitute 8–14% of the genome. All of them produce short transcripts. Some repeats are enriched on specific chromosomes, while others are generally distributed on the karyotype, except for B-chromosomes.

## **MATERIALS AND METHODS**

## **PRIMARY CULTURE AND CHROMOSOMES PREPARATION**

Asian seabass larvae of 1–2 days post-hatching (dph) age from the Marine Aquaculture Center (St John's Island, Singapore) were sacrificed by placing on ice and they were seeded into cell culture dishes in L15 or RPMI medium supplemented with 20% fetal calf serum and antibiotic and antimitotic solution (Sigma). The primary culture was incubated at 29°C with 5% CO<sub>2</sub>. After 1–2 weeks, the cells were incubated with 0.01% of colchicine for 5–6 h. Chromosome spreads were prepared using the method outlined by Pradeep et al. (2011). Chromomycin A3 (CMA3) and DAPI fluorochrome stainings followed the methodologies of Schmid (1980) and Schweizer (1980), respectively.

## **DNA AND RNA EXTRACTION**

Tissue samples from brain and gonad of four female and four male individuals were collected and stored at −80°C until use. Genomic DNA (gDNA) was extracted from 1 to 2 dph larvae, as well as from the liver, ovary and testis of 5-year-old male and female individuals, using the standard phenol-chloroform procedure (Sambrook and Russell, 2001). Total RNA was isolated using the Trizol kit (Invitrogen).

## **QUANTITATIVE REAL-TIME PCR (qPCR) ANALYSES**

One mg gDNA-free total RNA from the ovary, testis as well as female and male brain, respectively, of adult Asian seabass individuals was reverse-transcribed using a cDNA Synthesis Kit (Qiagen) according to the manufacturer's recommendations. The cDNA samples obtained were diluted 1:10 with sterile water before their use as templates in real-time quantitative PCR (qPCR). Levels of mRNA were determined using qPCR and SYBR Green chemistry on a Stratagene Mx3000P (Agilent Technologies, USA) using elongation factor 1-alpha and ribosomal protein L8 as reference genes. The Relative Expression Software Tool was used to calculate the relative expression of target mRNA. Results are reported as mean ± standard error. Significant differences between means were analyzed with the pairwise fixed reallocation randomization test (Pfaffl et al., 2002). In all cases, a value of *p* < 0.5 was used to indicate significant differences. As the expression levels between the male and female brains were similar, their values were combined.
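The relative-expression calculation and randomization test named above can be illustrated with a simplified Python sketch. It is not the Relative Expression Software Tool itself. It assumes an efficiency-corrected ratio computed from mean Ct values, a single reference value per animal (the study used two reference genes), amplification efficiencies of 2.0, and hypothetical Ct data. The test reallocates animals between groups while keeping each animal's target and reference Ct together.

```python
import math
import random

def expression_ratio(control, sample, e_target=2.0, e_ref=2.0):
    """Efficiency-corrected expression ratio from mean Ct values.
    control, sample: lists of (ct_target, ct_reference) pairs, one pair per animal."""
    def mean(values):
        return sum(values) / len(values)
    d_target = mean([t for t, _ in control]) - mean([t for t, _ in sample])
    d_ref = mean([r for _, r in control]) - mean([r for _, r in sample])
    return (e_target ** d_target) / (e_ref ** d_ref)

def randomization_p_value(control, sample, n_iter=2000, seed=0):
    """Shuffle animals between groups (target and reference Ct stay paired) and
    report how often the shuffled |log ratio| reaches the observed one."""
    rng = random.Random(seed)
    observed = abs(math.log(expression_ratio(control, sample)))
    pooled = list(control) + list(sample)
    n_control = len(control)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        ratio = expression_ratio(pooled[:n_control], pooled[n_control:])
        if abs(math.log(ratio)) >= observed - 1e-12:
            hits += 1
    return hits / n_iter

# Hypothetical Ct values: (target repeat transcript, reference gene) per animal.
brain = [(26.0, 18.0), (26.2, 18.1), (25.9, 17.9), (26.1, 18.0)]
testis = [(23.0, 18.0), (23.1, 18.1), (22.9, 17.9), (23.0, 18.0)]

ratio = expression_ratio(control=brain, sample=testis)
p_value = randomization_p_value(brain, testis)
```

With four animals per group there are only 70 distinct allocations, so the smallest achievable two-sided p-value in this design is about 0.03. That limit matters when interpreting results from small groups.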

## **S1-NUCLEASE TREATMENT**

The procedure was modified from the protocol used to isolate Cot-1 DNA (Zwick et al., 1991; Ferreira and Martins, 2008). DNA samples (300 µL of 100–500 ng/µL genomic DNA in 0.5 M NaCl) were boiled for 30 min and the size range of the fragmented DNA (0.1–5 kb) was checked by electrophoresis in a 1% agarose gel. Samples of 50 µL of fragmented DNA were then denatured at 95°C for 10 min, placed into ice-cold solution for 10 s, and transferred to a 65°C water bath for re-annealing. Following 1 min of re-annealing, the samples were incubated at 37°C for 20 min with 0.5 U of S1 nuclease to permit digestion of single-stranded sequences.

## **PCR-BASED ISOLATION OF REPETITIVE DNAs USING SPECIFIC PRIMERS**

The sequences of Rex1, 5S, and 18S rDNA were amplified directly from total gDNA by specific primers (Supplementary Table S1). Partial sequences of the MoSat\_SB, GGSat\_SB, and YRep\_SB repeat DNAs were detected in the Asian seabass transcriptome dataset. They were PCR-amplified from total gDNA with specific primers (Supplementary Table S1). Primers were designed with PRIMER3 (http://www-genome.wi.mit.edu/genome\_software/other/primer3.html) software using the transcriptome and Repbase data as reference sequences. The telomere probe was obtained by PCR-amplification with (TTAGG)5 primer.

## **SELECTION OF 18S rDNA-CONTAINING CLONES FROM THE BAC LIBRARY, THEIR SEQUENCING AND ASSEMBLY**

The Asian seabass BAC library was a replica of that described earlier (Xia et al., 2010), except this version contained only 38,400 of the 49,152 clones reported earlier. With an average insert size of 98 kb (range: 45–200 kb), this provided a 5.4-fold coverage for the haploid genome. From the library described above, clones from one hundred 384-well plates were processed by a 3D pooling method yielding 140 pooled samples. 5S and 18S rDNA primers (Supplementary Table S1) were used for screening the above pools to identify clones containing 18S rDNA.
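The numbers in this paragraph can be cross-checked with a short Python sketch. The pooling layout is an assumption: the text does not describe it, but one plate pool per plate plus one pool per row and per column of a 16 × 24 plate gives exactly 140 pools. The pool labels and the `decode` helper are hypothetical, and the genome size is only what the stated coverage implies.

```python
ROWS, COLS = 16, 24          # layout of a standard 384-well plate
N_PLATES = 100

def pools_for_clone(plate, row, col):
    """Labels of the three pools (plate, row, column) that contain a given clone."""
    return (f"P{plate}", f"R{row}", f"C{col}")

def decode(positive_pools):
    """Candidate clone addresses implied by a set of PCR-positive pools."""
    plates = [int(p[1:]) for p in positive_pools if p.startswith("P")]
    rows = [int(p[1:]) for p in positive_pools if p.startswith("R")]
    cols = [int(p[1:]) for p in positive_pools if p.startswith("C")]
    return [(p, r, c) for p in plates for r in rows for c in cols]

n_clones = N_PLATES * ROWS * COLS        # clones in the screened library
n_pools = N_PLATES + ROWS + COLS         # PCR reactions needed for one screen

# Genome size implied by 98 kb average inserts at 5.4-fold coverage (kb -> Mb).
implied_genome_mb = n_clones * 98 / 5.4 / 1000

candidates = decode(pools_for_clone(plate=7, row=3, col=12))
```

A single positive clone decodes unambiguously to one address. When several clones are positive, the pools yield every plate/row/column combination, so candidates need confirmation by PCR on individual clones.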

## **BAC SEQUENCING AND ASSEMBLY**

BAC-end sequences were generated by Sanger sequencing using the identical BAC DNA preparations as templates with the universal primers T7 (5′-TAATACGACTCACTATAGGG-3′) and plBRP (5′-CTCGTATGTTGTGTGGAATTGTGAGCC-3′). The inserts of the two BAC clones containing 18S rDNA were sequenced both by Ion Torrent (2,584,864 reads; AIT Biotech, Singapore) and 150 bp paired-end sequencing on Illumina HiSeq (2,584,864 reads; Yourgene, Taipei). The raw reads were corrected with the Quake program with parameter *k* = 13 [http://genomebiology.com/2010/11/11/R116/abstract]. Corrected reads from the two NGS platforms were mapped separately with the Bowtie program (Langmead et al., 2009) to the human reference genome, to the E. coli genome, to the vector sequence (pCC1BA), to the genomes of European seabass (*Dicentrarchus labrax*; http://www.ncbi.nlm.nih.gov/assembly/GCA\_000180815.1/), an African cichlid (*Astatotilapia burtoni*; http://www.ncbi.nlm.nih.gov/assembly/GCA\_000239415.1/), zebrafish (*Danio rerio*; http://www.ncbi.nlm.nih.gov/assembly/GCF\_000002035.4/), Nile tilapia (*Oreochromis niloticus*; http://www.ncbi.nlm.nih.gov/assembly/GCA\_000188235.2/), and Atlantic salmon (*Salmo salar*; http://www.ncbi.nlm.nih.gov/assembly/GCA\_000233375.1/), and to the expected rDNA sequences. Reads that mapped only to E. coli, vector, and/or to the human genome were discarded. The remaining reads were assembled into contigs with the SPAdes 2.4 assembler with default parameters (Bankevich et al., 2012). Short contigs (<200 bp) with a coverage less than 100x were discarded. The assembled contigs were compared against the nucleotide collection (nr/nt) database with BLAST on the NCBI website. To assemble the scaffold drafts, two datasets were used: (1) contigs obtained by single read sequencing (Ion Torrent); and (2) contigs generated from paired-end reads (Illumina HiSeq 2000).
The HiSeq-derived data was used for contig assembly, whereas the Ion Torrent-derived data was used for contig positioning according to their physical map relative to HiSeq data (Supplementary Figure S1). For the overlapping Ion Torrent datasets we computed k-mers with the Jellyfish program (Marcais and Kingsford, 2011) that were mapped to assembled contigs with the Bowtie software (http://www.bowtie-bio.sourceforge.net/index.shtml). According to the computed coverage, the Ion Torrent-derived contigs were separated into three groups: those located in the left part of the physical map, those located in the middle part (overlap), and the rest located in the right part. Assembled contigs have been deposited at GenBank, under the accession numbers of KF432408–KF432412.
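The two filtering rules in this section (discarding contaminant-only reads and discarding short, low-coverage contigs) can be written as simple predicates. The following Python sketch is an illustration rather than the pipeline actually used: the reference labels and function names are hypothetical, and the contig rule is read here as "shorter than 200 bp and below 100x coverage", which is one possible interpretation of the sentence above.

```python
CONTAMINANTS = {"ecoli", "vector", "human"}   # hypothetical reference labels


def keep_read(mapped_to):
    """mapped_to: set of reference names a read aligned to.
    A read is dropped only if it mapped somewhere and every hit was to
    a contaminant reference; unmapped reads are kept for assembly."""
    if not mapped_to:
        return True
    return not (set(mapped_to) <= CONTAMINANTS)


def keep_contig(length_bp, coverage, min_len=200, min_cov=100):
    """Drop contigs that are both short and poorly covered."""
    return not (length_bp < min_len and coverage < min_cov)


reads = {
    "r1": {"ecoli"},                  # contaminant only: dropped
    "r2": {"vector", "human"},        # contaminant only: dropped
    "r3": {"human", "zebrafish"},     # also hits a fish genome: kept
    "r4": set(),                      # unmapped: kept
    "r5": {"rDNA"},                   # expected rDNA sequence: kept
}
kept_reads = sorted(name for name, refs in reads.items() if keep_read(refs))

contigs = [("c1", 150, 40), ("c2", 150, 300), ("c3", 5000, 20)]
kept_contigs = [name for name, length, cov in contigs
                if keep_contig(length, cov)]
```

Mapping against several fish genomes in addition to the contaminant references is what allows a read with a spurious human or bacterial hit to be retained when it also matches a teleost sequence.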

## **THE PARTIAL ASIAN SEABASS TRANSCRIPTOME ASSEMBLY**

The sequences used for the bioinformatic identification of repeats were obtained from the Asian seabass transcriptome assembly (Supplementary Figure S1). The sequencing and assembly of the first batch of HiSeq data are briefly described below. Pooled RNA samples from various organs and developmental stages of Asian seabass were sequenced on Illumina HiSeq 2000 with 2 × 100 paired-end reads. Following quality- and adapter-trimming, a total of ∼487 million reads were obtained and deposited to NCBI SRA [SRP033113]. Potentially contaminant reads were removed from the dataset if they did not show any homology with seabass or zebrafish mRNA sequences, resulting in ∼485 million clean reads (∼242 million pairs), which were then assembled using Trinity (version R2012-06-08) to generate 459,979 contigs.

All these contigs were further assembled using CAP3, resulting in 363,785 contigs. This Transcriptome Shotgun Assembly project has been deposited at GenBank, under the accession number of GAQL00000000 (the version described in this paper, is the first version, GAQL01000000). Contigs with an average read coverage ≥3 were further clustered using CD-HIT-EST resulting in 182,842 contigs.

## **SEQUENCE ANALYSIS**

All sequence comparisons between large sets of sequences were performed using standard algorithms such as BLAST (Altschul et al., 1990). To check for the presence of repeat elements, the sequence sets were searched against the Repbase database (Kohany et al., 2006) using CENSOR (Ver: 4.2.28) with the following parameters: default, nofilter, minsim 0.75, show\_simple, bprg blastn, mode norm, and against the Nile tilapia repeat collection, http://cowry.agri.huji.ac.il/DATA\_SET\_RM/RepeatLib.html (Shirak et al., 2009). Phylogenetic relations between species were determined with the Interactive Tree Of Life software v2.2.2 (Letunic and Bork, 2011). Comparative transcriptome sequence analysis with Repbase of different species is presented in Supplementary Table S3. For detection of microsatellites, the Tandem Repeat Finder version 2.02 software (Benson, 1999) was used with the following parameters: match: 2, mismatch: 7, delta: 7, PM: 80, PI: 10, minscore: 50, maxperiod: 300 (Supplementary Table S4).
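To make the notion of microsatellite detection concrete, the following Python sketch scans a sequence for perfect tandem repeats with a regular expression. It is deliberately much simpler than the alignment-based, mismatch-tolerant scoring used by Tandem Repeat Finder and does not reproduce the parameters listed above; the function name, the unit-length limit, and the minimum copy number are assumptions chosen for illustration.

```python
import re


def find_exact_tandem_repeats(seq, max_unit=6, min_copies=4):
    """Return (start, unit, copies) for perfect tandem repeats.

    The lazy quantifier makes the regex prefer the shortest repeat unit
    at each position; the back-reference then extends it greedily.
    Imperfect repeats (mismatches, indels) are not detected.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    found = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        found.append((match.start(), unit, copies))
    return found


# A dinucleotide microsatellite flanked by short homopolymer runs.
example = find_exact_tandem_repeats("GGGCACACACACATTT")
# A five-copy array of the vertebrate telomeric hexamer.
telomere = find_exact_tandem_repeats("TTAGGG" * 5)
```

An exact-match scan of this kind is suitable only as a quick sanity check; degenerate satellite monomers of the type described in this study require a tool that tolerates sequence divergence between copies.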

## **cDNA LIBRARIES**

The following five additional cDNA sequence datasets were used in the comparative analysis: Nile tilapia (*Oreochromis niloticus*; 67 Mb), three-spined stickleback (*Gasterosteus aculeatus*; 45 Mb), Japanese medaka (*Oryzias latipes*; 38 Mb), mouse (*Mus musculus*; 159 Mb) and human (*Homo sapiens*; 297 Mb). These datasets were obtained from the Ensembl genome browser (http://www.ensembl.org/index.html).

## **CLONING AND SEQUENCING**

The restricted DNA fragments and PCR products were cloned into the pGEM-T plasmid vector (Promega) and transformed into DH5α E. coli competent cells according to standard protocols (Sambrook and Russell, 2001). Positive recombinant clones were sequenced using the Sanger protocol.

## **SOUTHERN AND DOT-BLOT HYBRIDIZATION**

Genomic DNA was dot-blotted onto positively charged nylon membranes (Hybond-N++, Amersham) in a series of dilutions ranging from 50 ng to 2 µg. Filter-immobilized DNA was hybridized with PCR products of repeat DNA labeled with DIG-dUTP. The plasmid and the PCR product containing the cloned fragment were used to produce the calibration curve, which was then used to estimate the proportion of the sequences in the genome. Hybridization was performed at 42°C in a solution composed of 6xSSC, 0.5% SDS, 5xDenhardt, 50 µg/ml salmon sperm, 50% formamide, 0.01 mol/L EDTA, and 20 ng/ml probe (Cardinali et al., 2000; Sambrook and Russell, 2001). Densitometric quantification from the dot-blots was performed using Gel-Pro Analyzer Version 3.1.00.00.
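The way a calibration curve converts a dot-blot signal into a genome proportion can be sketched as follows. This is a minimal illustration assuming a linear relationship between the amount of cloned repeat on the membrane and the densitometric signal; the function names and all numerical values are hypothetical and are not taken from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def repeat_fraction(signal, genomic_ng, slope, intercept):
    """Fraction of the spotted genomic DNA accounted for by the repeat."""
    repeat_ng = (signal - intercept) / slope
    return repeat_ng / genomic_ng


# Hypothetical calibration dots: ng of cloned repeat vs. densitometric signal.
standard_ng = [5, 10, 20, 40]
standard_signal = [55, 105, 205, 405]
slope, intercept = fit_line(standard_ng, standard_signal)

# Hypothetical genomic dot: 500 ng of gDNA giving a signal of 255 units.
fraction = repeat_fraction(255, 500, slope, intercept)
```

In practice the estimate is only valid for dots whose signal falls within the range covered by the standards, which is one reason a dilution series of genomic DNA is spotted.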

For Southern hybridization, samples of genomic and BAC DNA (10 µg) were completely or partially digested with various restriction enzymes (Hind III, RsaII, EcoRI, BamHI, or HinfI). The digestion products were subjected to gel electrophoresis on 1–1.5% agarose gels and transferred by Southern blotting to a Hybond-N++ nylon membrane. The hybridized DIG-labeled probe was detected with an anti-digoxigenin alkaline phosphatase-conjugated antibody (Roche, Germany) according to the manufacturer's protocols.

## **KARYOTYPES AND FLUORESCENT** *IN SITU* **HYBRIDIZATION (FISH)**

To determine the karyotype of the Asian seabass, 20 good metaphase plates were used. The classification of chromosomes followed Levan et al. (1964). Metacentrics (M) and submetacentrics (SM) were described as two-arm chromosomes, whereas subtelocentrics (ST) and acrocentrics (A) were described as one-arm chromosomes.

All probes were labeled with digoxigenin-11-dUTP (DIG) and biotin-16-dUTP (Roche) for FISH. The labeled nucleotides were incorporated into fragments by PCR, using M13 forward and reverse primers. The slides were denatured in 70% formamide/2x SSC for 3 min at 72°C. For each slide, 50 µl of hybridization solution (containing 1 µg of each labeled probe, 50% formamide, 2x SSC, 10% dextran sulfate) was denatured for 10 min at 75°C and allowed to prehybridize for 1 h at 37°C. Hybridization took place for 16–18 h at 37°C. Posthybridization washes were in 4x SSC for 5 min at 73°C and 2x SSC for 5 min at room temperature. Following a wash in PBST (PBS, 0.1% Tween), slides were incubated with Rhodamine Red-avidin (Invitrogen) and FITC-conjugated anti-DIG antibody (Roche). Finally, the slides were counterstained with DAPI and mounted in an antifade solution (Vectashield, Vector Laboratories, Burlingame, CA, USA). Images were captured with a Nikon (CCD) camera on a Zeiss/MetaMorph epifluorescence microscope and were optimized using Adobe Photoshop CS2.

## **RESULTS**

## **THE KARYOTYPE OF ASIAN SEABASS SHOWED 24 PAIRS OF AUTOSOMES AND A VARIABLE NUMBER OF B-CHROMOSOMES**

The karyotype of Asian seabass embryos showed a diploid number of 2*n* = 48 and a karyotype formula of 1M + 1SM + 11ST + 11A (*FN* = 37). The individual karyotypes also contained a variable number (2–10) of additional microchromosomes, suggesting the occurrence of B-chromosomes (**Figures 1A,B**). When stained with CMA3, most of the chromosomes showed the presence of a GC-rich block near the pericentromeric (periCEN) chromatin (**Figure 1C**). All chromosome spreads had at least two GC-rich B-chromosomes (**Figure 1C**), whereas AT-rich B-chromosomes (**Figure 1B**) were detected in only 60% of the chromosome spreads.

## **ISOLATION, CLONING AND CHROMOSOMAL LOCALIZATION OF REPETITIVE ELEMENTS FROM THE ASIAN SEABASS TRANSCRIPTOME AND GENOME**

We have combined the power of three different approaches for the isolation and characterization of repetitive elements from the Asian seabass genome (Supplementary Figure S1). The first approach was digestion of genomic DNA by S1-nuclease. Out of the 12 sequences cloned, nine were unknown short

**FIGURE 1 | The karyotype of Asian seabass contains 24 pairs of A-chromosomes and a variable number (2–10) of additional B-chromosomes. (A)** The karyotype. Metaphase spreads stained with DAPI **(B)** and Chromomycin A3 **(C)**. Arrowheads in **(B)** indicate AT-rich B-chromosomes, whereas those in **(C)** label GC-rich ones.

sequences of 100–300 bp length and were not utilized for subsequent analysis. Two of the remaining three showed similarity with OnSatB (GenBank ID: S57288) periCEN DNA of the Nile tilapia genome (Supplementary Table S1) and were called OnSat\_SB (**Table 1**). The divergence between the two OnSat\_SB sequences was less than 1%. Southern blot hybridization of total DNA with the OnSat\_SB probe yielded satellite-like ladders with steps of 200 bp length (**Figure 2A**). The third sequence was a short fragment of a CR1 non-long-terminal-repeat (non-LTR) retrotransposon, MAUI\_SB (**Table 1**), which showed similarity with the ORF1 part of a non-LTR retrotransposon called MAUI from the genome of the Japanese pufferfish (fugu; *Takifugu rubripes*) (Poulter et al., 1999).

The second approach was based on the analysis of the draft assembly of the Asian seabass transcriptome, which contained a high proportion of unknown transcripts (57%). Their comparison with Repbase showed only a 10% overlap. In total, 14,283 various repeat-containing transcripts were found; their sizes ranged from 26 to 3600 bp (Supplementary Tables S3, S4). They could be divided into seven classes of repeats: DNA transposons, LTR retrotransposons, non-LTR retrotransposons, endogenous retroviruses, microsatellites, satellites and unclassified (Supplementary Tables S3–S5). When the relative proportion of repeats was compared among four teleost (Asian seabass, Japanese medaka, Nile tilapia and three-spined stickleback) and two mammalian (mouse and human) species, the repeat profile for the Asian seabass was the closest to that of the Japanese medaka (**Figure 3A**, Supplementary Table S6). While the overall percentage of repeats in the six vertebrate genomes was similar, the teleost transcriptomes contained a lower percentage of repetitive elements than the mammalian ones (**Table 2**).

Sequences that showed homology with characterized satDNA-like repeat elements from other fish genomes were selected from the draft transcriptome assembly (Supplementary Tables S3, S5). One of them aligned with a mosaic satellite from the zebrafish genome (MoSat\_DR; GenBank ID: DP000237), while two other fragments showed sequence homology with two different sex chromosome-specific DNAs: the first with repetitive AT-rich DNA sequences from the Y chromosome of the Mediterranean fruit fly, *Ceratitis capitata* (YREP\_CC; GenBank ID: AF115330; Supplementary Table S1) and the second with W-specific satellite DNA (GGSat; GenBank ID: X57344, Supplementary Table S1) from the chicken genome (Kawai et al., 2007). The three repeats were cloned and sequenced (**Table 1**); their sequences showed more than 70% homology with the corresponding reference sequences listed above. Southern-blot hybridization of gDNA with MoSat\_SB and the distribution of PCR fragments yielded satellite-like ladder patterns with about 100 bp increments (**Figures 2B,C**). On the other hand, the PCR fragments for YRep\_SB and GGSat\_SB were dispersed (i.e., they did not produce the expected ladder) (**Figure 2C**).

Among the retrotransposon elements, the Rex group is the most highly represented in the different fish species. We managed to clone two Rex1-related sequences, Rex1\_SB and MAUI\_SB, from the Asian seabass using specific primers (**Table 1**). Next, we determined the chromosomal localization of these cloned repetitive sequences. Cytogenetic mapping of these CR1 non-LTR retrotransposons showed stronger signals at the periCEN and telomeric (Tel) regions of all chromosomes and weaker ones on the chromosome arms (**Figures 4A,B**). The FISH signals of YRep\_SB were seen at periCEN regions of all chromosomes and accumulated at 3 pairs of acrocentric chromosomes in the pretelomeric (preTel) and periCEN regions (**Figure 4C**), whereas those of GGSat\_SB were dispersed through all the chromosomes (**Figure 4C**). SatDNAs occupied the periCEN regions of all chromosomes (**Figures 4D,E**). OnSat\_SB and MoSat\_SB FISH signals


**Table 1 | The inventory and characterization of nine repeats isolated from the Asian seabass transcriptome and genome.**

*\*periCEN, pericentromeric.*

*\*\*N/A, not applicable.*

were both enriched mostly at the periCEN regions of all chromosomes (**Figure 4D**), but the latter were also detected in the preTel and periCEN regions of three pairs of acrocentric chromosomes (**Figure 4E**). All teleosts analyzed so far showed the presence of a standard telomeric repeat (TTAGGG)n (Chew et al., 2002). In the Asian seabass, telomere signals were detected only at the ends of the chromosomes (**Figure 4F**).

## **AT LEAST 8% OF THE ASIAN SEABASS GENOME CONTAINS REPEATS AND SOME OF THEM SHOW DIFFERENTIAL EXPRESSION IN THE GONADS**

The relative amount of OnSat\_SB, MoSat\_SB, YRep\_SB, GGSat\_SB, Rex1\_SB, and MAUI\_SB sequences in the Asian seabass genome was estimated using dot-blot hybridization (**Figure 2D**). The results showed that around 4–10% of the genome contained the CR1-type non-LTR retrotransposons MAUI\_SB and Rex1\_SB, whereas about 2% consisted of two types of satDNA: OnSat\_SB and MoSat\_SB (**Figure 2D**). Together, the six repeat types constituted approximately 8–14% of the Asian seabass genome.

Seven repeats belonging to tandem and dispersed classes of repetitive DNA have been discovered. All of them matched sequence reads in the transcriptome, indicating that they produce RNA products (Supplementary Table S3), and four of them (MoSat\_SB, Rex1\_SB, YRep\_SB, and GGSat\_SB) were subjected to quantitative analysis of expression levels. While the relative expression level of MoSat\_SB and Rex1\_SB was higher in the testis than in the ovary (*P* < 0.05; pairwise fixed reallocation randomization test), those of YRep\_SB and GGSat\_SB did not exhibit such differences between the two gonads (**Table 3**). In the brain, the expression level of Rex1\_SB, but not MoSat\_SB, also showed an increase compared to that of the ovary.

## **THE GENOMIC ORGANIZATION OF TWO RIBOSOMAL DNAs SHOWED SPECIES-SPECIFIC AND EVOLUTIONARILY CONSERVED FEATURES**

Evolutionarily conserved fragments of 5S rDNA and 18S rDNA (**Table 1**) were obtained through PCR amplification with teleost-specific primers. Two fragments of 18S rDNA (306 bp and 756 bp) as well as 556 bp fragments of 5S rDNA (120 bp of these sequences are conserved across the species analyzed) were cloned and sequenced (**Figure 5A**, Lanes 1–3).

Labeled probes were produced for both rDNA types and they were hybridized onto the chromosomes. FISH signals for the two rDNAs appeared on separate chromosomes: the signal for 5S rDNA was located on a single chromosome pair (**Figure 5B**), whereas that for 18S rDNA was found in the periCEN region of two pairs of chromosomes (**Figure 5C**).

The 18S rDNA-specific primers were used for screening a BAC library to identify clones containing this ribosomal DNA. Two BAC clones (A6&N6), which partially overlapped forming a contig of ca. 101 kb length were chosen. According to the Southern data, the contig was expected to contain short fragments of 18S rDNA (**Figure 5A**, Lanes 4–5). When the two BAC clones forming the contig were labeled and used as probes for hybridization, the signal was chromosome-specific: it showed accumulation in the periCEN regions of two acrocentric chromosomes, and two periTel regions of two small acrocentric chromosomes, while a slightly dispersed signal was observed on all chromosome arms (**Figure 5D**).

The two BAC clones were sequenced using the HiSeq paired-end protocols and assembled into contigs. Five non-overlapping contigs were produced; their combined length was 120,780 bp (Supplementary Table S2), whereas the length of the A6N6 contig formed by the two BAC clones selected for sequencing was 101 kb according to the physical map (FPC contig 2979; Xia et al., 2010). Then single-end reads produced by the Ion Torrent together with the HiSeq data from the five contigs were used for scaffolding based on: (1) alignment with known fish genomic sequences; and (2) the coverage for each contig by the Ion Torrent reads. The total length of the resulting A6N6 scaffold was 120 kb, which was longer than the combined length of the two BAC inserts. A dot-matrix analysis of the resulting A6N6 scaffold was performed with chromosome sequences corresponding to Linkage group 1 of the European seabass (*Dicentrarchus labrax*; GenBank: FQ310506.3); the results showed 92% identity between the two sequences (**Figure 6**). The A6N6 scaffold did not produce any annotated BLASTN hits. In the absence of LG-specific markers in these sequences, we were unable to match A6N6 to a unique chromosome(s). The reference sequence from the European seabass genome contains incomplete fragments of the rDNA. Comparison to these indicated that our scaffold contained incomplete, short fragments of 18S rDNA (the polyA tail and 40–80 bp). Masking with Repbase sequences showed that only 6.9% of the scaffold yielded hits with annotated repeats. All repeat fragments found were generally short (14–314 bp; Supplementary Table S7). Around 14% of the scaffold produced hits when compared with BLAST against the Asian seabass transcriptome database.

*latipes*), Nile tilapia (*Oreochromis niloticus*), European seabass (*Dicentrarchus labrax*), Asian seabass (*Lates calcarifer*) were generated by Tree of Life interactive tool (Letunic and Bork, 2011).

## **DISCUSSION**

Detailed information about repetitive elements present in a particular genome can provide useful information that can be used to improve de novo genome assemblies. For this reason, we have initiated a repeat inventory of the Asian seabass, as part of our genome project. The transcriptome and genome of this teleost species were searched for repetitive elements with experimental and bioinformatic tools. Studies involving comparative genomics have revealed that most vertebrate lineages contain different populations of dispersed elements, including retrotransposons and DNA transposons, and significant differences could be observed in their proportions among species of the same lineage. Highly repetitive satellite sequences and moderately repetitive transposable elements form constitutive heterochromatin, which is mostly

**Table 2 | Comparative analysis of the repeat content of teleost and mammalian transcriptomes and genomes showed distinct differences between groups and among species.**


*\*Incomplete assembly, in progress.*

*Data for analyses—with the exception of those from Asian seabass—were obtained from NCBI (www.ncbi.nih.gov) and Ensembl (www.ensembl.org) databases.*

**FIGURE 4 | The chromosomal distribution of seven cloned Asian seabass repeats determined by FISH assay. (A)** Rex1\_SB; **(B)** MAUI\_SB; **(C)** YRep\_SB (green) and GGSat\_SB (red); **(D)** OnSat\_SB;

**Table 3 | Comparison of the levels (Ct) of repeat-derived transcripts from the gonad and brain of five male and five female adult Asian seabass individuals detected differential expression of MoSat\_SB and Rex1\_SB between the testis and ovary.**


*\*Indicates the Ct values with statistical difference. Statistically significant differences were calculated by the pairwise fixed reallocation randomization test (\*P < 0.05). The Ct values of male and female brain were similar, thus the data were combined. Elongation factor 1-alpha and ribosomal protein L8 were used as reference genes.*

located in the preTEL, CEN and periCEN regions of chromosomes (for reviews see Choo, 1997; Kawai et al., 2007; Enukashvily and Ponomartsev, 2013). Our data have demonstrated that transcriptomes show similarity to the non-coding regions of genomes as they also contain various dispersed and tandem elements in different teleost species (**Figure 3A**; Supplementary Tables S3–S6). The current size of the Asian seabass transcriptome assembly (169.6 Mb) is much larger than those of the other fish species (38–68 Mb; **Table 1**), presumably due to the following reasons: (i) it is partial and yet to be mapped onto the genome; (ii) while the other teleost transcriptomes were mostly generated

**(E)** MoSat\_SB, and **(F)** Telomere. Chromosomes were counterstained with DAPI (blue). Arrowheads indicate chromosomes with enriched signals. Bar—5µm.

with the aid of Sanger- and/or pyrosequencing of cDNAs, the Asian seabass transcriptome was produced by high-throughput next-generation sequencing, resulting in a much larger amount of sequence data; (iii) Ensembl transcripts were obtained from RNA-seq-based gene models (Flicek et al., 2012), whereas the current version of the Asian seabass transcriptome was not; and (iv) there is a possibility of "inflation" due to unique transcripts that have retained introns (i.e., sequencing of immature mRNAs). Although our "in progress" Asian seabass transcriptome is derived from over a dozen different adult tissues and samples collected at several developmental stages, we cannot exclude the possibility that we have missed repeat-derived transcripts that were active, but not expressed in the samples used for sequencing.

Repeat-derived transcript sequences form only ca. 6–13% of the Asian seabass transcriptome, whereas most of the teleost genomes assembled so far contain more than 14% of various repeat types (**Table 2**). This indicates that the large differences in the repeat content between species are due to the repeat DNAs being present not only in the non-coding regions of genomes (Kramerov and Vassetzky, 2005), but also in the transcriptomes (**Table 2**, Supplementary Tables S3–S6). Studies involving comparative genomics have revealed that most vertebrate lineages contain different populations of retrotransposable elements and DNA transposons with significant differences frequently observed among species of the same lineage (Ferreira et al., 2011). The repeat-derived transcripts of teleosts are generally short, with an average length of 50–400 bp; the longest transcript is no longer than 3.6 kb in Asian seabass, compared to 6 kb in the human transcriptome (Supplementary Table S3). The longest repeat-derived transcripts in human and mouse generally belong to

**FIGURE 5 | Genomic hybridization and chromosomal location of two ribosomal DNAs (5S and 18S rDNA). (A)** 18S rDNA (Lanes 1, 2) and 5S rDNA (Lane 3) PCR products of the *Lates calcarifer* genome visualized on 2% agarose gel. Southern blot hybridization of the Asian seabass total DNA digested by restriction endonucleases HinfI and HindIII with the DIG-labeled 18S rDNA (Lane 4) and 5S rDNA (Lane 5) probes. 18S rDNA PCR product from A6 and N6 BACs was used as a DIG-labeled probe for Southern blot hybridization of the same two BAC clones digested by HinfI (Lane 6). M, molecular mass markers. The corresponding molecular masses are indicated on the left and right of the figure. **(B)** Fluorescent *in situ* hybridization onto metaphase chromosomes with 5S rDNA probe. **(C)** Fluorescent *in situ* hybridization onto metaphase chromosomes with 18S rDNA probe. **(D)** Chromosomal distribution of the A6 and N6 BACs. Arrows indicate chromosomes with enriched signals. Bar: 5 µm.

the non-LTR retrotransposon L1, ERV, or Mariner/TC1 groups (Supplementary Table S3), while those in fish transcriptomes are part of a partially overlapping set of repeat types, excluding ERV. The longest repeat transcript for Asian seabass is MAUI\_SB with 3575 bp (Supplementary Table S3). Some dispersed, transcript-derived repeats from retrotransposons, such as MAUI (Poulter et al., 1999), Gypsy, Rex1 (Volff et al., 2000, 2003), Bell, and TART (Casacuberta and Pardue, 2003) are longer than 1 kb and show similarity with each other across different fish genomes (Supplementary Table S3). The maintenance of *D. melanogaster* telomeric DNA is accomplished by repeated retrotransposition of the non-LTR retrotransposon HeT-A/TART specifically to telomeres, in contrast to the maintenance of tandem arrays of species-specific simple DNA repeats at telomeres by telomerase in many eukaryotic organisms (Maxwell, 2004). The result of transcript analysis was consistent with TART producing an array of heterogeneous sense and antisense transcripts (Maxwell, 2004). TART, a non-LTR retrotransposon, was also found in the Asian seabass transcriptome (Supplementary Tables S3, S6). This repeat belongs to the Jockey group (Casacuberta and Pardue, 2003) and its transcript size is longer in teleosts than in mammals (**Figure 7**). The DNA transposon-based transcript content in human and mouse is lower than that of fishes, but that of the endogenous retrovirus group is higher in mammals (**Figure 3**). We demonstrated a fish-specific transcriptome-based repeat profile with a few groups of repeats, where the proportion of groups Rex1, SINE, R4, Gypsy, and Jockey is higher in fishes than in human and mouse (**Figure 7**), while that of ERV3 and L1 is lower (**Figure 7**; Supplementary Table S6). The relative amount of primate-specific SINE1/7SL or Alu transcripts is about three orders of magnitude higher in human than in fishes.
The same tendency has been observed in the distribution of R4/Rex6, Rex1, and Gypsy retrotransposons based on comparative analysis of whole genome shotgun sequences of pufferfishes (Japanese fugu and green-spotted pufferfish), zebrafish, human and mouse genomes (Volff et al., 2003; Shirak et al., 2009). The short sequence length of most repeats from different fish genomes and transcriptomes, including satDNA (Volff et al., 2003; Shirak et al., 2009), indicates that a large part of fish-specific repeat sequences are not characterized.

Moreover, although transcriptomes tested so far tend to have a unique satDNA distribution profile, all investigated fish transcriptomes have similar sets of satellite DNA (Supplementary Table S5). For example, the major satDNA transcript in Asian seabass is OnSat\_SB, which has shown similarity with the Nile tilapia genome, and is present in the other fish transcriptome datasets, albeit with a lower level of similarity. On the other hand, the two main satDNA sequences of the Nile tilapia genome, Satellite A (SATA) and Satellite B (SATB) (Shirak et al., 2009), have not been found in the partial transcriptome data of Asian seabass so far (Supplementary Table S5). The phylogeny based on mitochondrial and nuclear markers suggests independent amplification from the "library." The existence of a "library of satDNA" has been previously demonstrated experimentally (Meštrović et al., 1998; Bruvo et al., 2003; Pons et al., 2004) and suggested a common ancestor whose genome harbored all or most of the major satDNA families present in the living species at low copy numbers (Pons et al., 2004). Thus, we can confirm that satDNA is shared by a group of related organisms at variable copy number (Mravinac and Plohl, 2010). The existence of short RNA fragments originating from satDNA has been shown experimentally (Rizzi et al., 2004; Valgardsdottir et al., 2005; Enukashvily et al., 2009; Ting et al., 2011; Kuznetzova et al., 2012). Sequencing of such transcripts confirmed that they consisted of satellites only. They are polyadenylated (Rizzi et al., 2004; Enukashvily et al., 2009) and exhibit mostly intranuclear spot-like localization (Valgardsdottir et al., 2005; Kuznetzova et al., 2012). The reported length of satDNA transcripts in mammals varied from 20 bp to 5 kb (Valgardsdottir et al., 2005; Ting et al., 2011), probably because the transcripts were detected at different stages of their processing.
In various cell types and at different stages of development or cell cycle, the transcription of satellites is asymmetrical: either the sense or the antisense strand is transcribed (Rizzi et al., 2004; Valgardsdottir et al., 2005; Enukashvily et al., 2009; Ting et al., 2011; Kuznetzova et al., 2012). Our experiments have demonstrated different expression levels of satDNA (MoSat\_SB) and the CR1 non-LTR retrotransposon Rex1\_SB between the ovary and testis, but not between the male and female brain, of adult Asian seabass (**Table 3**).

The distribution pattern of constitutive heterochromatin is a good chromosome marker for some teleosts (Kantek et al., 2009; Vicari et al., 2010). Different strategies have been utilized for isolating repetitive DNA sequences: (1) the traditional method of genomic DNA restriction (Beridze, 1986); (2) re-association kinetics based on *C0t*–*1* DNA (Devine et al., 1997; Langmead et al., 2009); (3) the microdissection of chromosomes submitted to C-banding and subsequent amplification of heterochromatic sequences using DOP-PCR or WGS amplification primers (Cioffi et al., 2013); and (4) bioinformatic analysis of WGS and NGS sequences (Komissarov et al., 2011) and transcriptome data (Jiang et al., 2012). Some of these approaches have been used to investigate the repeats in this study.

Six new repeats belonging to tandem and dispersed groups of repetitive DNA have been identified; they constitute ∼8–14% of the Asian seabass genome. The MoSat\_SB and YRep\_SB repeats are enriched on individual chromosomes, while the other four (Rex1\_SB, MAUI\_SB, OnSat\_SB, and GGSat\_SB) are generally distributed on the autosomes throughout the karyotype (**Figures 4**, **8**). The analysis of the chromosomal location of repeated elements demonstrated that they are, in most cases, compartmentalized in heterochromatic regions and that the periCEN regions of chromosomes tend to show a unique repeat distribution, as has been observed in several other vertebrates (Beridze, 1986; Enukashvily and Ponomartsev, 2013). Our results have demonstrated that the compartmentalization of repeat elements is mostly restricted to heterochromatic AT-rich segments, whereas most of the chromosomes have large GC-rich blocks within their centromeric regions (**Figures 1**, **8**). The AT-rich B-chromosomes of the Asian seabass contain chromatin which differs from the heterochromatin identified in periCEN regions of the autosomal chromosomes (**Figure 8**). The GC-rich B-chromosomes, as chromosomes with a potentially higher recombination rate, could be useful for future characterization of *Lates* populations throughout the Indian, South-East Asian and Australian regions.

Transposable elements (TEs) can be organized in clusters or dispersed throughout the genome. CR1 elements regulate gene activity and are located within genes, as is MAUI from the fugu genome (Poulter et al., 1999). Among the CR1 clade of the LINE family of non-LTR retrotransposons, Rex was characterized for the first time in the genome of *Xiphophorus* and was found to be widely present in different fish genomes (Volff et al., 2000). Phylogenetic analysis revealed that Rex1 retrotransposons were frequently active during fish evolution. They formed multiple ancient lineages, which underwent several independent and recent bursts of retrotransposition and invaded fish genomes with variable success rate (Volff et al., 2000; Ferreira et al., 2011). The divergence between two cloned sequences of Rex1 in the Asian seabass was ∼10%, whereas about 20% divergence was found among Rex1 elements isolated from three different teleost species (Volff et al., 2000, 2003). The physical mapping of different Rex elements showed that they were primarily compartmentalized in the periCEN heterochromatic regions, although dispersed or clustered signals in euchromatic regions were also observed (**Figure 4A**, Valente et al., 2011). The presence of TEs in heterochromatin can be correlated with their role in the structure and organization of heterochromatic areas (such as centromeres) or with the lower selective pressure that acts on these gene-poor regions. Rex elements were also concentrated in the largest chromosome pair of the Nile tilapia, *Oreochromis niloticus*. This chromosome pair is supposed to have originated by fusions, demonstrating the possible involvement of TEs in chromosome rearrangements (Valente et al., 2011). Rex1, 3, and 6 are non-LTR elements that have been active during the evolution of different fish lineages. In the family *Cichlidae* (*Perciformes*), Rex elements are organized in clusters within the genome of the majority of the species (Martins et al., 2004). Rex elements showed a wide distribution among fishes and could be observed both in the periCEN region and in euchromatic regions of chromosomes. However, at present, due to the lack of an assembled genome, we are unable to resolve their exact location in relation to the periCEN region. Rex elements can also be associated with genes involved in sex determination (Volff et al., 2000). In this study, we have demonstrated the existence of Rex1-derived transcripts in Asian seabass; moreover, their expression levels were significantly lower in the adult ovary than in the testis (**Table 3**).

**FIGURE 8 | The ideogram of the chromosome complement of Asian seabass exhibiting cytogenetic mapping of ribosomal sequences, repeats, and heterochromatic, AT- and GC-rich regions.** Altogether, eight of the 24 autosomal chromosome pairs showed a unique hybridization profile with the various probes.

The Nucleolus Organizer Regions (NORs) are highly repetitive genome sites related to rRNA synthesis (Preuss and Pikaard, 2007; Pinhal et al., 2008). NORs present small, active transcription sites (evolutionarily conserved 120 bp for 5S rDNA and 300–2800 bp for 18S rDNA) and highly variable non-transcribed spacing segments with their own structural dynamics, in which the presence of transposons located close to the genome regions has been identified (Martins et al., 2004). rDNA sequences and their chromosomal location have proven to be valuable as genetic markers to distinguish closely related species and also in understanding the dynamics of repetitive sequences in genomes. In this study we have shown that the chromosomal position of 5S rDNA and 18S rDNA probes in Asian seabass is similar to those observed in different fish species (Martins et al., 2004; Mantovani et al., 2005; Cioffi et al., 2010; Merlo et al., 2010; Nakajima et al., 2012). At the cytogenetic level, the 5S rDNA probes hybridized mostly near the centromere region, while the 18S rRNA gene probe has shown variable positions across different fish species (Martins et al., 2004; Mantovani et al., 2005; Cioffi et al., 2010; Merlo et al., 2010; Nakajima et al., 2012).

In order to improve our understanding of their organization, rDNA sequences were analyzed in the Asian seabass genome. Two clones were identified through hybridization with 18S rDNA-derived probes from a BAC library; they formed a single contig (A6N6). They were sequenced with two NGS platforms (Illumina HiSeq and Ion Torrent) and assembled. The A6N6 contig is extremely rich in repeats, containing conserved (Supplementary Table S7) and novel sequences. Some of them are chromosome-specific, whereas others are dispersed throughout the chromosomes (**Figure 5D**). Due to the shortness of reads, the assembled repeats tend to be highly fragmented. In teleosts, the spacer portion of the major rDNA is generally GC-rich (Schmid and Guttenbach, 1988), whereas our A6N6 scaffold shows a high (59%) AT content. It was shown in the red wolf fish (*Erythrinus erythrinus*) that rDNA sequences were co-localized with the Rex3 retrotransposable element in the centromeric heterochromatin (Valente et al., 2011) and they have also shown strong association with Tc1/Mariner and Rex elements (Schmid and Guttenbach, 1988; Cioffi et al., 2010; Merlo et al., 2010; Nakajima et al., 2012). The presence of Tc1/Mariner and Rex1\_SB repeats was observed in the A6N6 scaffold (Supplementary Table S7). Transcripts derived from the A6N6 scaffold showed sequence homology with a predicted gene (PDZ domain containing ring finger 3) from fish and amphibian genomes. The synteny of the Asian seabass A6N6 scaffold with a European seabass linkage group (**Figure 6**) and their phylogenetic association (**Figure 3B**) strengthened the idea that the chromosomes have undergone rearrangements during evolution. These rearrangements were likely mediated by retrotransposon activity in which the insertion of the retrotransposable element into rDNA sequences created an rDNA-transposon complex that moved and dispersed in the karyotype (Cioffi et al., 2010; Nakajima et al., 2012).

The transcriptome and genome of Asian seabass were searched for repetitive elements with experimental and bioinformatics tools. In summary, our data indicate that the sequence and structure of most Asian seabass repeat DNAs are likely to be unknown. Detailed analysis of the completed genome assembly will provide the final proof for this suggestion. The approach used for the analysis of repeats in this study has yielded useful knowledge, but only about evolutionarily conserved repeats. All analyzed sequences, except Tel, belong to the periCEN part of chromosomes (**Figure 8**); they could form chromosome-specific periCEN patterns of compartmentalization, and the transcripts of some show differential expression between the adult gonads of the two sexes of Asian seabass.

## **ACKNOWLEDGMENTS**

The authors thank Jun Hong Xia and Gen Hua Yue for access to their BAC library. This research was supported by the National Research Foundation, Prime Minister's Office, Singapore under its Competitive Research Programme (Award No: NRF-CRP7-2010-001) and the Strategic Research Program of Temasek Life Science Laboratory for László Orbán, as well as the Russian Ministry of Science (Mega-grant no. 11.G34.31.0068) and a grant from presidium RAS (MCB, N01200955639).

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fgene.2014.00223/abstract

### **REFERENCES**

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. *J. Mol. Biol.* 215, 403–410. doi: 10.1016/S0022-2836(05)80360-2


French Polynesia: histological and morphometric description. *Environ. Biol. Fish* 39, 231–247. doi: 10.1007/BF00005126


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 29 January 2014; accepted: 27 June 2014; published online: 25 July 2014. Citation: Kuznetsova IS, Thevasagayam NM, Sridatta PSR, Komissarov AS, Saju JM, Ngoh SY, Jiang J, Shen X and Orbán L (2014) Primary analysis of repeat elements of the Asian seabass (Lates calcarifer) transcriptome and genome. Front. Genet. 5:223. doi: 10.3389/fgene.2014.00223*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Kuznetsova, Thevasagayam, Sridatta, Komissarov, Saju, Ngoh, Jiang, Shen and Orbán. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Genomic prediction in an admixed population of Atlantic salmon (*Salmo salar*)

#### *Jørgen Ødegård<sup>1</sup>\*, Thomas Moen<sup>2</sup>, Nina Santi<sup>2</sup>, Sven A. Korsvoll<sup>1</sup>, Sissel Kjøglum<sup>1</sup> and Theo H. E. Meuwissen<sup>3</sup>*

*<sup>1</sup> Breeding and Genetics, AquaGen AS, Trondheim, Norway*

*<sup>2</sup> Research and Development, AquaGen AS, Trondheim, Norway*

*<sup>3</sup> Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, Aas, Norway*

#### *Edited by:*

*José Manuel Yáñez, University of Chile, Chile*

#### *Reviewed by:*

*Jiuzhou Song, University of Maryland, USA; Roger Vallejo, United States Department of Agriculture, USA; Ignacy Misztal, University of Georgia, USA*

#### *\*Correspondence:*

*Jørgen Ødegård, Department of Animal and Aquacultural Sciences, Norwegian University of Life Science, Arboretveien 6, PO Box 5003, Aas, NO-1432, Norway e-mail: jorgen.odegard@aquagen.no*

Reliability of genomic selection (GS) models was tested in an admixed population of Atlantic salmon, originating from crossing of several wild subpopulations. The models included ordinary genomic BLUP models (GBLUP), using genome-wide SNP markers of varying densities (1–220 k), a genomic identity-by-descent model (IBD-GS), using linkage analysis of sparse genome-wide markers, as well as a classical pedigree-based model. Reliabilities of the models were compared through 5-fold cross-validation. The traits studied were salmon lice (*Lepeophtheirus salmonis*) resistance (LR), measured as (log) density on the skin, and fillet color (FC), with respective estimated heritabilities of 0.14 and 0.43. All genomic models outperformed the classical pedigree-based model, for both traits and at all marker densities. However, the relative improvement differed considerably between traits, models and marker densities. For the highly heritable FC, the IBD-GS had similar reliability to GBLUP at high marker densities (>22 k). In contrast, for the lowly heritable LR, IBD-GS was clearly inferior to GBLUP, irrespective of marker density. Hence, GBLUP was robust to marker density for the lowly heritable LR, but sensitive to marker density for the highly heritable FC. We hypothesize that this phenomenon may be explained by historical admixture of different founder populations, expected to reduce short-range linkage disequilibrium (LD) and induce long-range LD. The relative importance of LD/relationship information is expected to decrease/increase with increasing heritability of the trait. Still, using the ordinary GBLUP, the typical long-range LD of an admixed population may be effectively captured by sparse markers, while efficient utilization of relationship information may require denser markers (e.g., 22 k or more).

**Keywords: Atlantic salmon, genomic selection, reliability, admixture, genetics**

## **INTRODUCTION**

Aquaculture populations are characterized by high male and female fecundity, typically resulting in large full-sib families. For invasive traits, traditional aquaculture selection programs involve sib-testing, which has limited reliability under classical selection schemes, as selection candidates are evaluated based on mid-parent means. Furthermore, this also leads to increased co-selection among close relatives, and enforcing restrictions on inbreeding will therefore hamper selection on such traits more than selection for individually evaluated traits. For individually evaluated traits, the sizeable family groups of aquaculture species give a substantial potential for within-family selection.

Marker-assisted selection (MAS) can be used to select directly for favorable QTL alleles, a method that allows individual selection of genotyped animals even in the absence of phenotyping. This requires that the QTL effects are known and that carriers of the favorable alleles can be identified through the markers. For Atlantic salmon, the method has been utilized with great success in selection for reduced incidence of infectious pancreatic necrosis (IPN), for which a single QTL explains (nearly) all genetic variation (Moen et al., 2009). In situations where several QTL underlie the trait, MAS will be more complex, and the power of QTL detection lower, as each QTL explains a smaller fraction of the total genetic variance. For such traits genomic selection (GS) is a viable alternative, utilizing information from numerous genome-wide marker loci jointly in the genetic analysis (Meuwissen et al., 2001). The GS methods facilitate computation of individual breeding values for all genotyped animals and do not require any prior knowledge of the underlying QTL. In simulated aquaculture populations, superior performance of GS models compared with classical models has been documented in several publications (e.g., Nielsen et al., 2009; Ødegård et al., 2009; Ødegård and Meuwissen, 2014), while documentation from real aquaculture data has been largely absent so far.

The original idea behind GS was that the genome-wide markers would capture linkage disequilibrium between marker loci and QTL (Meuwissen et al., 2001). However, accuracy of GS has been shown to be non-zero even in the absence of linkage disequilibrium (LD) (Habier et al., 2007), and the actual reliability of GS models can thus be explained by three types of quantitative-genetic information sources contained in the genomic data (Habier et al., 2013): (1) pedigree (ancestry), (2) co-segregation, and (3) LD.


The ancestry (pedigree) is indeed reflected through inheritance of marker loci and is thus implicitly included in the dense marker information, although pedigree is not used directly. Co-segregation is the deviation from independent segregation of alleles as a result of linkage (i.e., deviations between relationships estimated from pedigree and linkage analysis), while LD is the statistical dependency between alleles at different loci in the base generation (i.e., the generation with unknown parents). Information on (1) and (2) can thus explain the non-zero reliability of GS even in the absence of LD. Furthermore, in populations with a strong relationship structure (e.g., livestock and aquaculture populations) LD may not even be the most important of these factors under GS; Wientjes et al. (2013) showed that the level of family relationship between selection candidates and the reference population had a higher effect on reliability of GS than LD *per se*.

There are currently numerous available GS methodologies. The most widely used methods are GS models using identity-by-state (IBS) information on dense genome-wide SNP markers, including the so-called genomic BLUP (GBLUP) and Bayesian methods (e.g., BayesA, BayesB, BayesC, BayesD) (Meuwissen et al., 2001; Habier et al., 2011). Other methods involve use of SNP haplotypes (combining multiple SNPs), that also take identity-by-descent (IBD) information into account (Calus et al., 2008). Finally, GS may be performed based on linkage analysis of genome-wide markers, producing an IBD genomic relationship matrix (IBD-GS), completely ignoring LD information (Villanueva et al., 2005; Luan et al., 2012).

In the following, we will focus on two of these methodologies for use in aquaculture breeding: ordinary GBLUP and IBD-GS. GBLUP can be implemented by ridge-regression on genome-wide marker genotypes (Meuwissen et al., 2001) or by an animal model using a realized genomic relationship matrix estimated from marker genotype similarities across the genome (Hayes et al., 2009). The latter method will be used here. The advantage of the IBD-GS model lies in its ability to utilize realized IBD relationships rather than expected relationships estimated through the pedigree; e.g., full-sibs (which are numerous in aquaculture) are no longer necessarily related by a coefficient of ½, but their relationships depend on the actual length of shared IBD chromosome segments, which are traced by the markers through linkage analysis. Compared with other GS methods, IBD-GS has the advantage that it can be successfully implemented even at extremely low marker densities. This is because the number of recombinations from parent to offspring is usually low (i.e., averaging one per Morgan), and inheritance of long chromosomal blocks can thus be traced accurately even with a few genome-wide markers. A recent simulation study on an aquaculture-like population indicated that IBD-GS works effectively at densities where IBS-based methods are expected to fail, e.g., with 10–20 SNPs/Morgan (Vela-Avitúa et al., in press). Thus, there is no need for dense marker panels, making IBD-GS attractive for cost-effective GS implementation. For dairy cattle, IBD-GS models have been shown to give similar reliability to ordinary GBLUP models with dense markers (Luan et al., 2012). Hence, for livestock populations with large family sizes, realized close relationships (pedigree and co-segregation) are essential for the reliability of any GS model, and GS methodology may thus have large potential even in the absence of strong LD structures. Aquaculture populations typically have strong relationship structures, with selection candidates having numerous full-sibs and potentially both maternal and paternal half-sib groups.

The Norwegian AquaGen Atlantic salmon population originates from the first family-based selective breeding program on Atlantic salmon, going back to the 1970s, based on crossing of wild founders from numerous wild Norwegian river strains (Gjedrem et al., 1991). Originally, four parallel populations were created, one for each year class in a 4-year generation interval. Although as many as 41 river strains were originally included, contributions of the different rivers vary considerably, both between the original base populations of the four year classes and as a result of subsequent selection. Hence, the original farmed populations were indeed heavily admixed. The year-class strains were selected for a common breeding goal, but kept largely separate for 7 generations until 2005, when they were merged into a single population. Hence, the AquaGen population can be regarded as an admixed/synthetic population comprised of genetic material from many wild subpopulations, which likely have been separated for a long time in nature.

Admixture between genetically distinct populations increases LD between all loci (linked and unlinked) that have different allele frequencies in the founding populations (Pfaff et al., 2001). However, LD between unlinked loci will quickly be removed through recombination, while LD between linked loci will be more persistent; e.g., for loci separated by 1 or 10 cM, respectively 90 and 35% of the admixture-induced LD (ALD) is expected to remain even after 10 generations, while 82 and 12% remain after 20 generations. However, admixture will not only introduce long-range ALD, it will also reduce the short-range LD, i.e., the LD existing in the original founder populations. The short-range LD will decrease as phase associations between marker and QTL alleles can differ depending on the origin of the chromosome segments (Thomasen et al., 2013), and haplotype segments with strong LD are thus shorter in admixed populations (Toosi et al., 2010). This can be illustrated by the following example, assuming two sub-populations for simplicity: The frequency of an M1N1 haplotype is (*p* + κ)(*q* + λ) + *D*<sub>I</sub> in population I, where (*p* + κ) and (*q* + λ) are the frequencies of the alleles M1 and N1, expressed as deviations from the across-population frequencies (*p* and *q*), with frequency deviations κ and λ, and *D*<sub>I</sub> is the LD in population I. Similarly, (*p* − κ)(*q* − λ) + *D*<sub>II</sub> is the haplotype frequency in population II. The haplotype frequency in their crossbred offspring (F1) is thus *pq* + κλ + *D̄*, where *D̄* is the average of *D*<sub>I</sub> and *D*<sub>II</sub>. The LD in the F1 cross is κλ + *D̄*, which comprises an ALD term κλ due to the crossbreeding (it depends on frequency differences and is independent of the distance between the loci) and the average of the original population-specific LD coefficients between the loci. *D̄* is on average smaller in absolute value than either *D*<sub>I</sub> or *D*<sub>II</sub>, since they may have opposite signs in the two populations, resulting in a reduced short-range LD in the admixed population.
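The two numerical statements in this paragraph can be checked with a short sketch. It assumes, as an illustration only, that LD between two loci decays by a factor (1 − *c*) per generation of random mating and that the recombination fraction *c* equals the map distance in Morgans (an approximation for short distances); the function names and the example frequencies are hypothetical and not taken from the study.

```python
def remaining_ald(distance_cm, generations):
    """Fraction of admixture-induced LD expected to remain.

    Assumption: recombination fraction c = map distance in Morgans,
    and LD decays as (1 - c) ** t under random mating.
    """
    c = distance_cm / 100.0
    return (1.0 - c) ** generations


def f1_haplotype(p, q, kappa, lam, d_one, d_two):
    """Haplotype frequency and LD in the F1 cross of two populations.

    p, q       : across-population frequencies of alleles M1 and N1
    kappa, lam : frequency deviations (+ in population I, - in population II)
    d_one/two  : within-population LD coefficients (D_I and D_II)
    """
    hap_one = (p + kappa) * (q + lam) + d_one
    hap_two = (p - kappa) * (q - lam) + d_two
    hap_f1 = 0.5 * (hap_one + hap_two)  # equals p*q + kappa*lam + mean(D)
    ld_f1 = hap_f1 - p * q              # equals kappa*lam + mean(D)
    return hap_f1, ld_f1


for cm in (1, 10):
    for t in (10, 20):
        print(f"{cm} cM, {t} generations: {remaining_ald(cm, t):.2f}")
```

Under these assumptions the loop reproduces the 90, 82, 35 and 12% quoted in the text. In `f1_haplotype`, choosing `d_one` and `d_two` with opposite signs shows how the population-specific LD terms can cancel while the κλ term remains.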

The reduced short-range LD originating from founder populations may challenge accurate genomic prediction. Still, long-range ALD (the κλ term) can be effectively captured even by sparse markers, but may explain a limited fraction of the genetic variance, depending on the degree of differentiation between the founding populations.

Hence, effectiveness of GS in admixed populations depends on several layers of information: remaining LD from the founder populations, long-range ALD, and the relationship structure within the existing population. Furthermore, the relative importance of these factors likely depends on genetic architecture, marker density, heritability and the GS methodology used.

The aim of the study was to quantify the importance of marker density for the reliability of ordinary GBLUP models and to compare these estimates with those of IBD-based models that completely ignore LD, i.e., classical pedigree-based models and IBD-GS models. To this end, two traits measured on Atlantic salmon, fillet color (FC) and salmon lice resistance, with high and low heritability, respectively, were studied using alternative GS models and marker densities. So far, no QTL of large effect has been found for lice resistance, whereas major QTL have been found for FC, although these do not explain all genetic variance (Baranski et al., 2010).

## **MATERIALS AND METHODS**

#### **DATA**

The fish used in the material were from the AquaGen population year-class first-fed in 2011. In total, 157 full-sib families (offspring of 99 dams and 97 sires) were sampled for salmon lice (*Lepeophtheirus salmonis*) challenge testing, and 30–40 fish from each of these families were transferred to Nofima at Averøy, Norway and put into sea net-cages in October 2011. Two separate lice tests were conducted the following year, with all families being represented in both tests. Test 1 was conducted in the period July 16–18, 2012 and Test 2 in the period October 17–19, 2012. The total number of challenge-tested fish was 5198, with 2850 and 2348 fish in Test 1 and 2, respectively. Lice challenge testing of the fish was approved by the Norwegian Animal Research Authority (S-2012/148773).

Challenge testing was conducted by closing the net cages with tarpaulins prior to adding *L. salmonis* copepodites to the water. The copepodites attach immediately to the fish and the test aimed at 10–20 copepodites per fish 10–15 days after infection, when the number of lice per fish was recorded at the end of the chalimus II stage (Hamre et al., 2013). Lice count (LC) on the surface of the skin was recorded by manual counting. Average LC per fish was ∼21 lice in Test 1 and ∼13 in Test 2. However, the distribution of LC was highly skewed (**Figures 1**, **2**), with some animals having extremely high infestations (up to 238 parasites on a single fish). Skewness in the distribution of parasite abundance traits is frequently observed, and such traits are thus often analyzed on the log-scale (e.g., Robert et al., 1990; Morand and Guegan, 2000; Rozsa et al., 2000; Davies et al., 2006). Hence, LC was normalized through log-transformation (LogLC), defined as:

$$\text{LogLC} = \log\_{\text{e}} \left( \text{LC} + 1 \right)$$

A constant value of 1 was added to the lice counts to avoid computing errors due to fish with zero recorded lice. Furthermore, there is a tendency toward increasing LC with increasing body size (i.e., body surface) of the fish. Hence, Gjerde et al. (2011) developed an alternative measure of lice resistance, defined as estimated LiceD on the skin:

$$\text{LiceD} = \frac{\text{LC}}{\sqrt[3]{\text{BW}^2}}$$

where BW is body weight (g) at time of recording, and $\sqrt[3]{\text{BW}^2}$ is an approximate measure for the surface skin area of the fish. Still, considerable skewness was also observed for LiceD in the current dataset, while log-transformed LiceD (LogLD) was approximately normal (**Figures 1**, **2**), indicating an approximate lognormal distribution of the trait:

$$\text{LogLD} = \log\_{\text{e}} \left( \frac{\text{LC} + 1}{\sqrt[3]{\text{BW}^2}} \right).$$

Using the latter trait definition, the estimated heritability (i.e., the fraction of variance explained by additive genetic effects) increased substantially compared with a linear model applied to untransformed LiceD (results not shown).
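The three trait definitions above can be written as a minimal sketch. The function names are hypothetical, and the example values are chosen only for easy arithmetic; they are not data from the study.

```python
import math


def skin_area_proxy(body_weight_g):
    """Cube root of squared body weight, the approximate skin-area measure."""
    return body_weight_g ** (2.0 / 3.0)


def log_lice_count(lice_count):
    """LogLC = log_e(LC + 1)."""
    return math.log(lice_count + 1)


def lice_density(lice_count, body_weight_g):
    """LiceD = LC / cube root of BW^2."""
    return lice_count / skin_area_proxy(body_weight_g)


def log_lice_density(lice_count, body_weight_g):
    """LogLD = log_e((LC + 1) / cube root of BW^2)."""
    return math.log((lice_count + 1) / skin_area_proxy(body_weight_g))


# A fish of 1000 g has a skin-area proxy of about 100, so 20 lice give
# a density of about 0.2 lice per unit of the proxy.
print(lice_density(20, 1000.0))
# Fish with zero recorded lice remain defined on the log scale.
print(log_lice_density(0, 1000.0))
```

Adding 1 inside the logarithm keeps fish with zero lice in the analysis, at the cost of a small shift that matters most for lightly infested fish.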

FC was recorded in a subsequent slaughter test in April 2013, where fish originating from both lice challenge tests were jointly recorded for FC (majority of the recorded fish originated from Test 1). The trait FC was defined as the pigmentation (redness) of the fillet and was automatically measured using image analysis with PhotoFish equipment and software (Photofish AS, Ås, Norway). The recorded FC was found to be approximately normally distributed (**Figure 3**).

More descriptive statistics are shown in **Tables 1**–**3**.

#### **GENOTYPING**

A total of 1963 phenotyped individuals were genotyped with a 220 k Affymetrix genome-wide SNP-chip. About half the individuals in Test 1 were genotyped (1444 individuals), but a smaller fraction in Test 2 (519 individuals). The genotyping strategy was intended to serve two purposes: (1) application of GS; and (2) a genome-wide association study, potentially followed by marker-assisted selection (MAS) for the most significant SNP(s). The aim was to establish two experimental selection lines for, respectively, high and low sea lice resistance, using a combination of pedigree-based selection, GS and MAS. Hence, genotyping was not completely random, but particularly focused on the most extreme families (in both directions) with respect to lice resistance. Of the 1963 genotyped animals, 1869 had a phenotype for FC.

**FIGURE 3 | Density plot of pigmentation in salmon fillets (FC) in the slaughter test, April 2013.** A normal density is given with the blue line.

**Table 1 | Descriptive statistics of data from fish participating in lice challenge test 1.**

**Table 2 | Descriptive statistics of data from fish participating in lice challenge test 2.**

#### **STATISTICAL MODELS**

A preliminary analysis showed that there was no significant genetic correlation between logLD and FC, and the two traits were therefore analyzed separately in subsequent analyses.

For logLD, an initial quantitative genetic analysis was run, treating phenotypes of the two lice tests as two correlated genetic traits. A bivariate linear animal model was used for analysis of the data:

$$\mathbf{y} = \begin{bmatrix} \mathbf{y\_1} \\ \mathbf{y\_2} \end{bmatrix} = \begin{bmatrix} \mathbf{X\_1}\mathbf{\beta\_1} + \mathbf{Z\_1}\mathbf{a\_1} + \mathbf{e\_1} \\ \mathbf{X\_2}\mathbf{\beta\_2} + \mathbf{Z\_2}\mathbf{a\_2} + \mathbf{e\_2} \end{bmatrix},$$

**Table 3 | Within-test Pearson correlation coefficients between the traits BW, LC, LogLD and LogLD, with coefficients for tests 1 and 2 given, respectively, above and below the diagonal.**


where **y<sub>1</sub>** and **y<sub>2</sub>** are vectors of LogLD phenotypes from Test 1 and 2, respectively, **β<sub>1</sub>** and **β<sub>2</sub>** are vectors of fixed effects (overall means of the two tests, and effect of observing person, nested within each test), genetic effects are distributed as

$$\begin{bmatrix} \mathbf{a\_1} \\ \mathbf{a\_2} \end{bmatrix} \sim N\left(\mathbf{0}, \mathbf{A} \otimes \mathbf{G\_0}\right),$$

residual effects are distributed as

$$\begin{bmatrix} \mathbf{e\_1} \\ \mathbf{e\_2} \end{bmatrix} \sim N\left(\mathbf{0}, \begin{bmatrix} \mathbf{I}\sigma\_{e\_1}^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{I}\sigma\_{e\_2}^2 \end{bmatrix}\right),$$

**Z<sub>1</sub>** and **Z<sub>2</sub>** are appropriate incidence matrices (assigning animal genetic effects to phenotypes), **I** is an identity matrix of appropriate size, **A** is the pedigree-based numerator relationship matrix, and **G<sub>0</sub>** is the additive genetic (co)variance matrix for the two genetic traits. As the two tests were performed on different individuals, the residual covariance between the two traits was assumed to be zero.

As the genetic correlation between logLD of the two tests was lower than unity (results shown below) and the majority of the genotyped animals came from lice challenge test 1, the predictive ability of the genomic models for lice resistance was assessed using phenotypes of the first test only. FC was recorded at a later stage on fish originating from both lice challenge tests, at the same age and within the same slaughter test. FC of all fish was therefore analyzed jointly as a single genetic trait. Hence, predictive abilities of the different classical and genomic models for logLD (test 1) and FC were assessed using univariate animal models, with the following general characteristics:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{a} + \mathbf{e}$$

where **y** is a vector of phenotypes (logLD or FC), **a** ∼ N(**0**, **G**σ<sub>g</sub><sup>2</sup>) is a vector of random additive genetic effects, **G** is a given relationship matrix (model dependent), and **e** ∼ N(**0**, **I**σ<sub>e</sub><sup>2</sup>) is a vector of random residuals. The fixed effects (**b**) included person (responsible for counting) by day for logLD, and gender of fish for FC. Common environmental effects of family were also tested, but these effects were small and not significantly different from zero (*P* > 0.20) for both traits, and were thus dropped in the final model.

#### *Univariate sub-models*

The different models differed solely with respect to their specification of the relationship matrix **G**:

PED: Classical pedigree-based analysis, i.e., **G** = **A** (numerator relationship matrix).

IBD-GS: Identity-by-descent GS, using a linkage-based IBD relationship matrix for the genotyped animals. The matrix was calculated from a sparse marker set containing 5590 mapped genome-wide SNP markers, using the LDMIP software (Meuwissen and Goddard, 2010). The number of mapped SNPs per chromosome varied from 52 to 396, and relationship matrices were thus computed for each chromosome separately and subsequently averaged over chromosomes to produce **G**.

GBLUP: Identity-by-state GS (ordinary GBLUP), calculating **G** directly from genome-wide SNP markers using the second method of Vanraden (2007). Alternative **G** matrices were tested by extracting random sub-sets from the complete marker data set, including either (a) 1100 (1 k), (b) 2200 (2 k), (c) 4400 (4 k), (d) 22 000 (22 k), (e) 55 000 (55 k), or (f) all 220 000 (220 k) SNP markers. For (a) to (d), a total of 10 non-overlapping replicates (sub-sets of marker genotypes) were generated, while (e) was replicated 4 times. Results were averaged over replicates.
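The GBLUP sub-model can be illustrated with a small, self-contained sketch. It assumes the commonly described form of VanRaden's second method, in which each marker's centered genotype product is weighted by the reciprocal of its expected variance 2*p*(1 − *p*) and the sum is divided by the number of markers. Allele frequencies are taken from the genotyped sample and monomorphic markers are skipped; both choices are assumptions of this sketch, not statements about the study. All function names are hypothetical.

```python
import random


def vanraden_g_method2(genotypes):
    """Genomic relationship matrix from 0/1/2 allele counts.

    genotypes: list of individuals, each a list of allele counts per marker.
    Returns an n x n matrix as a list of lists.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Allele frequencies estimated from the sample itself (assumption).
    freqs = [sum(ind[j] for ind in genotypes) / (2.0 * n) for j in range(m)]
    used = [j for j in range(m) if 0.0 < freqs[j] < 1.0]
    if not used:
        raise ValueError("no polymorphic markers")
    g = [[0.0] * n for _ in range(n)]
    for j in used:
        p = freqs[j]
        weight = 1.0 / (2.0 * p * (1.0 - p))        # reciprocal of marker variance
        z = [ind[j] - 2.0 * p for ind in genotypes]  # centered genotypes
        for a in range(n):
            for b in range(a, n):
                g[a][b] += z[a] * z[b] * weight
    for a in range(n):
        for b in range(a, n):
            g[a][b] /= len(used)
            g[b][a] = g[a][b]
    return g


def non_overlapping_subsets(n_markers, subset_size, n_replicates, seed=0):
    """Randomly split marker indices into disjoint replicate sub-sets."""
    if subset_size * n_replicates > n_markers:
        raise ValueError("not enough markers for disjoint replicates")
    order = list(range(n_markers))
    random.Random(seed).shuffle(order)
    return [order[r * subset_size:(r + 1) * subset_size]
            for r in range(n_replicates)]


# Ten disjoint 22 k panels drawn from 220 k markers, as in option (d).
panels = non_overlapping_subsets(220000, 22000, 10)
print(len(panels), len(panels[0]))
```

The pure-Python loops are written for readability and would be far too slow for the panel sizes in the study; a matrix library would normally be used. Ten disjoint 22 k panels use every one of the 220 k markers exactly once, which matches the replicate counts given for (d) and (e).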

All models utilizing genomic information (IBD-GS and GBLUP) used one-step estimation of EBVs (Legarra et al., 2009; Christensen and Lund, 2010), combining relationships from genotyped and ungenotyped individuals into a unified relationship matrix **H**. Furthermore, the **G** matrices were adjusted to the same average rate of inbreeding and relationship as the numerator relationship matrix, using the ADJUST option in DMU (Christensen et al., 2012; Madsen and Jensen, 2013). Identical variance components were used in all models. These were estimated with the PED model using all phenotypic data.

## *Model comparison*

Reliabilities of the different models were assessed through predictive ability, using five-fold cross-validation, i.e., individuals being both phenotyped and genotyped were randomly sampled into five validation sets, which were predicted one at a time, masking the phenotypes of the validation animals and using all the remaining phenotypes and genotypes as training data. Reliability was estimated as:

$$R_{EBV,BV}^2 = \frac{R_{EBV,y}^2}{h^2}$$

where *R*<sup>2</sup><sub>*EBV*,*y*</sub> is the squared correlation between EBVs of a given model (predicted from the training data, without the phenotype of the animal itself) and the recorded phenotype (*y*), while *h*<sup>2</sup> is the estimated heritability of the trait.
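The validation scheme and the reliability formula can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the random seed, and the toy EBV and phenotype values are hypothetical, and the model fitting itself (predicting EBVs with masked phenotypes) is left out.

```python
import random


def five_fold_assignments(animal_ids, seed=0):
    """Randomly assign each animal to one of five validation sets."""
    rng = random.Random(seed)
    ids = list(animal_ids)
    rng.shuffle(ids)
    return {animal: i % 5 for i, animal in enumerate(ids)}


def squared_correlation(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def reliability(ebv, phenotype, h2):
    """Squared correlation between EBV and phenotype, divided by heritability."""
    return squared_correlation(ebv, phenotype) / h2


# Hypothetical values for five validation animals (not data from the study).
toy_ebv = [0.10, -0.05, 0.20, -0.15, 0.02]
toy_phenotype = [1.4, 0.6, 1.1, 0.9, 1.3]
toy_reliability = reliability(toy_ebv, toy_phenotype, h2=0.14)
```

In practice the EBVs passed to `reliability` would be collected across all five validation sets, each predicted from a model fitted without that set's phenotypes.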

## **RESULTS**

## **ESTIMATED HERITABILITIES AND GENETIC CORRELATIONS**

Heritability of lice resistance (logLD) was estimated for the two tests using a bivariate PED model, as described in the above section. The estimated heritabilities for the two tests were low to moderate (0.14 ± 0.03 and 0.13 ± 0.03 for July and October, respectively), and the estimated genetic correlation between lice resistance in the two tests was high (0.72 ± 0.12). The estimated heritability of FC, based on a univariate PED model, was high (0.43 ± 0.06). Based on likelihood ratio tests, genetic effects were highly significant for both traits (*P* < 0.001).

**FIGURE 4 | Relative increase in reliability<sup>1</sup> of genomic selection models for LR compared with a classical pedigree-based model.**

#### **RELIABILITY OF DIFFERENT MODELS AND MARKER DENSITIES**

Based on the five-fold cross-validation, the reliability of the PED model was slightly higher for FC (0.36) than for lice resistance (0.34). The relative increases in reliability for the different GS models (compared with PED) are presented in **Figures 4**, **5** for lice resistance and FC, respectively. In general, all GS models outperformed the classical PED model, but the relative improvement varied considerably between models and traits. For lice resistance, the relative increase in reliability using GS was substantial (up to 52% for GBLUP with 220 k), but it was moderate for FC (21% for IBD-GS and 22% for GBLUP with 220 k).

<sup>1</sup>Reliability of LR using the PED model was 0.34.

Using GBLUP, higher marker densities were always favorable, but the relative advantage was considerably more pronounced for FC than for lice resistance. For example, the relative increase in reliability of GBLUP for FC was 39% when going from 4 to 220 k, while the corresponding increase for lice resistance was only 11%. Nevertheless, GBLUP was superior to PED for both traits, even at the lowest marker density (1 k). For both traits, going from 22 to 220 k SNPs increased reliability by only ∼1%. Hence, increasing SNP density beyond 22 k would have little practical effect on selection.

<sup>2</sup>Reliability of FC using the PED model was 0.36.

Another striking result was the large difference between the traits with respect to the relative reliability of the IBD-GS model. For the lowly heritable lice resistance, the relative improvement compared with a classical pedigree-based model was considerably lower for IBD-GS than for (220 k) GBLUP (14 and 52% for IBD-GS and GBLUP, respectively). In contrast, for the more highly heritable FC, increases in reliability for the two models were similar (21 and 22% for IBD-GS and 220 k GBLUP, respectively).
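As a worked illustration, the relative increases can be translated back to approximate absolute reliabilities. This assumes that the relative increase Δ is defined as (*R*<sup>2</sup><sub>GS</sub> − *R*<sup>2</sup><sub>PED</sub>)/*R*<sup>2</sup><sub>PED</sub>, which the text implies but does not state explicitly; the values below are therefore rounded approximations rather than reported results.

$$R^2_{GS} = R^2_{PED}\,(1 + \Delta)$$

$$\text{LR: } 0.34 \times 1.14 \approx 0.39 \ (\text{IBD-GS}), \qquad 0.34 \times 1.52 \approx 0.52 \ (\text{GBLUP, 220 k})$$

$$\text{FC: } 0.36 \times 1.21 \approx 0.44 \ (\text{IBD-GS}), \qquad 0.36 \times 1.22 \approx 0.44 \ (\text{GBLUP, 220 k})$$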

## **DISCUSSION**

The estimated heritabilities (0.14 ± 0.03 and 0.13 ± 0.03) for lice resistance (measured as log of LD) obtained in the current study were lower than recent estimates (0.26 ± 0.05) obtained for similar trait definitions (untransformed LD) and testing methods in a previous study on lice resistance in a different salmon population (Gjerde et al., 2011). It should be noted, however, that the previous test was conducted in tanks, while the current test was performed in sea-cages, with copepodids being added directly to the sea-cage.

As one of the aims of the project was to produce extreme high/low lines with respect to lice resistance, families with high/low lice resistance were over-represented among the genotyped animals, which may have some effect on the reliability of the PED and GS models for this trait. The analysis was in all cases validated based on genotyped animals, and extreme families for lice resistance are thus over-represented among the validation animals, which is expected to inflate the between-family variation in the sample. In the PED model, predicted breeding values for animals with masked phenotypes are simply a function of the mid-parent means, and an inflation of the between-family variance in the training sample may thus increase the apparent reliability of the model. This may explain the relatively small difference in reliabilities of the PED model for LR and FC (0.34 and 0.36, respectively), despite the considerable difference in heritability of the two traits (0.14 and 0.43, respectively). Despite this, the relative improvement of the reliability through GS was substantial for LR (up to 52%). For FC, selective genotyping with respect to LR likely had little impact, due to the low correlation between the traits.

The models used in this study utilize the sources of information contained in genomic data differently. The GBLUP model utilizes pedigree (implicitly contained in the genomic data), linkage analysis (animals sharing IBD chromosome segments will necessarily share marker alleles) as well as LD. However, its ability to utilize the different sources of information depends on marker density. IBD-GS utilizes pedigree and linkage analysis and is robust to marker density, while PED, by definition, utilizes the pedigree relationships only. For the GBLUP model, high marker density would be needed to capture both short-range LD and (tiny) variations in co-segregation among relatives. In contrast, the IBD-GS model will utilize linkage analysis information accurately, even at very low marker densities. Furthermore, the relative importance of the different types of information depends on several factors such as structure of the dataset (i.e., number of close relatives in the population), historical *Ne* (i.e., amount of LD), as well as the heritability of the traits involved. In general, it is expected that for a lowly heritable trait, genetic effects estimated over larger groups of individuals, such as LD-associated effects (general association between marker genotypes and phenotypes) and mid-parent means would be more robust and thus relatively more important for the reliability, while linkage-analysis based deviations from pedigree relationships (i.e., largely minor individual deviations) would be relatively more important at higher heritabilities. Thus, the relative advantage of the GBLUP model may be largest at low heritability (e.g., lice resistance), while IBD-GS would be expected to perform relatively better at higher heritability (e.g., FC), which is consistent with results of this study. 

Another contributing factor may be the genetic architecture of the two traits: a major QTL has been published for FC (Baranski et al., 2010), and two more have recently been identified in the AquaGen population. All three QTL for FC were also detected in a genome-wide association study of the current data set, while, in contrast, no major QTL for lice resistance has been found (unpublished results). The GBLUP model assumes that genetic variance is uniformly distributed over the entire genome, which may fit better for lice resistance than FC. For this reason, more advanced Bayesian variable selection models (BayesB, BayesC, etc.) may have a larger potential in FC than LR.

Still, the factors discussed above do not explain the favorable performance of GBLUP for lice resistance at extremely low marker densities (e.g., 4 k), for which limited LD is usually expected (in homogeneous populations), and linkage-based deviations from the expected relationships are unlikely to be accurately captured by IBS information. The explanation may thus lie in the selection history of farmed Atlantic salmon. As described in the introduction, admixture from several distinct wild strains is expected to introduce long-range LD, and simultaneously reduce the short-range LD in the population. This will likely reduce the relative advantage of dense SNP data, as a relatively larger fraction of the available LD may be captured even by sparse marker panels, potentially explaining the good performance of GBLUP for lice resistance even at extremely low marker densities. Current terrestrial livestock populations may also have been formed by admixtures of old populations, but these admixture events occurred longer ago and may have been less extreme than in Atlantic salmon. Still, some admixture effects on the LD structure may also be seen in terrestrial livestock species, and thus contribute to the rather small increases observed in accuracy of GS as marker density increases (VanRaden et al., 2011). A high marker density in GBLUP will be favorable for utilization of linkage analysis information, which is mainly an advantage at higher heritabilities and strong relationship structures (Ødegård and Meuwissen, 2014), i.e., as seen with FC.

The number of genotyped animals was rather limited in the current study. Genotyping larger fractions of the population would be expected to increase the reliability of GS models even further.

## **ACKNOWLEDGMENTS**

The study was supported by the Norwegian Research Council through projects no. 200511/S40, 226266/E40 and 225181, and by Regionale Forskningsfond Midt-Norge through project no. ES486711. We would also like to thank Bjarne Gjerde and Sissel Nergaard, Nofima, for taking responsibility for the sea lice challenge tests.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 01 September 2014; accepted: 31 October 2014; published online: 21 November 2014.*

*Citation: Ødegård J, Moen T, Santi N, Korsvoll SA, Kjøglum S and Meuwissen THE (2014) Genomic prediction in an admixed population of Atlantic salmon (Salmo salar). Front. Genet. 5:402. doi: 10.3389/fgene.2014.00402*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Ødegård, Moen, Santi, Korsvoll, Kjøglum and Meuwissen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Whole-body transcriptome of selectively bred, resistant-, control-, and susceptible-line rainbow trout following experimental challenge with *Flavobacterium psychrophilum*

#### *David Marancik<sup>1</sup>, Guangtu Gao<sup>1</sup>, Bam Paneru<sup>2</sup>, Hao Ma<sup>3</sup>, Alvaro G. Hernandez<sup>4</sup>, Mohamed Salem<sup>2</sup>, Jianbo Yao<sup>3</sup>, Yniv Palti<sup>1</sup> and Gregory D. Wiens<sup>1</sup>\**

*<sup>1</sup> National Center for Cool and Cold Water Aquaculture, Agricultural Research Service, United States Department of Agriculture, Kearneysville, WV, USA*

*<sup>2</sup> Department of Biology, Middle Tennessee State University, Murfreesboro, TN, USA*

*<sup>3</sup> Animal and Nutritional Sciences, West Virginia University, Morgantown, WV, USA*

*<sup>4</sup> High-Throughput Sequencing and Genotyping Unit, Roy J. Carver Biotechnology Center, University of Illinois at Urbana-Champaign, Urbana, IL, USA*

#### *Edited by:*

*Scott Newman, Genus, plc, USA*

## *Reviewed by:*

*Hugo Murua Escobar, Medical Faculty University of Rostock, Germany Dan Nonneman, Agricultural Research Service, United States Department of Agriculture, USA*

#### *\*Correspondence:*

*Gregory D. Wiens, National Center for Cool and Cold Water Aquaculture, Agricultural Research Service, United States Department of Agriculture, 11861 Leetown Rd., Kearneysville, WV 25430, USA e-mail: greg.wiens@ars.usda.gov*

Genetic improvement for enhanced disease resistance in fish is an increasingly utilized approach to mitigate endemic infectious disease in aquaculture. In domesticated salmonid populations, large phenotypic variation in disease resistance has been identified but the genetic basis for altered responsiveness remains unclear. We previously reported three generations of selection and phenotypic validation of a bacterial cold water disease (BCWD) resistant line of rainbow trout, designated ARS-Fp-R. This line has higher survival after infection by either standardized laboratory challenge or natural challenge as compared to two reference lines, designated ARS-Fp-C (control) and ARS-Fp-S (susceptible). In this study, we utilized 1.1 g fry from the three genetic lines and performed RNA-seq to measure transcript abundance from the whole body of naive and *Flavobacterium psychrophilum* infected fish at day 1 (early time-point) and at day 5 post-challenge (onset of mortality). Sequences from 24 libraries were mapped onto the rainbow trout genome reference transcriptome of 46,585 predicted protein coding mRNAs that included 2633 putative immune-relevant gene transcripts. A total of 1884 genes (4.0% genome) exhibited differential transcript abundance between infected and mock-challenged fish (FDR < 0.05) that included chemokines, complement components, tnf receptor superfamily members, interleukins, nod-like receptor family members, and genes involved in metabolism and wound healing. The largest number of differentially expressed genes occurred on day 5 post-infection between naive and challenged ARS-Fp-S line fish correlating with high bacterial load. After excluding the effect of infection, we identified 21 differentially expressed genes between the three genetic lines. In summary, these data indicate global transcriptome differences between genetic lines of naive animals as well as differentially regulated transcriptional responses to infection.

**Keywords:** *Flavobacterium psychrophilum***, bacterial cold water disease, selective breeding, disease resistance, aquaculture, immune gene, tnfrsf, rainbow trout genome**

## **INTRODUCTION**

Selective breeding programs contribute to increased aquaculture production through the generation of animals with improved resistance/tolerance toward infectious disease-causing microorganisms (Gjedrem, 1983, 2005; Van Muiswinkel et al., 1999; Cock et al., 2009; Gjedrem et al., 2012). In 2005, a family-based selective breeding program was initiated at the National Center for Cool and Cold Water Aquaculture (NCCCWA) to improve rainbow trout (*Oncorhynchus mykiss*) survival following exposure to *Flavobacterium psychrophilum* (Silverstein et al., 2009). This pathogen is the etiologic agent of bacterial cold water disease (BCWD) and rainbow trout fry syndrome (RTFS), and causes considerable losses to the rainbow trout aquaculture industry within the U.S. and to trout and salmon populations worldwide (Nematollahi et al., 2003; Barnes and Brown, 2011). Infection of rainbow trout with *F. psychrophilum* typically results in mortality ranging from 2 to 30% of the population, with higher losses caused by co-infection with infectious hematopoietic necrosis virus. A further impact of the disease is that survival following infection has been associated with skeletal deformities (Kent et al., 1989; Madsen et al., 2001). Disease prevention is difficult as the pathogen is geographically widespread, limited chemotherapeutants are available for treatment, and there is currently no commercial vaccine available in the U.S., although killed, subunit, and live-attenuated vaccines are all active areas of research (Gómez et al., 2014; Sundell et al., 2014).

A pedigreed line of rainbow trout, designated ARS-Fp-R, has been subjected to over three generations of selection and demonstrates increased survival following experimental injection challenge (Hadidi et al., 2008; Silverstein et al., 2009; Leeds et al., 2010) and natural exposure (Wiens et al., 2013a), relative to a disease susceptible line, ARS-Fp-S, and a randomly bred control line, ARS-Fp-C. Current research goals at the NCCCWA include elucidating intrinsic factors associated with survival of the ARS-Fp-R line to better understand how selection has altered the genetic control of disease resistance. Phenotypic studies have thus far shown that ARS-Fp-R line fish have decreased organ damage as determined by histopathology (Marancik et al., 2014b) and fewer pathophysiologic changes in plasma biochemistry (Marancik et al., 2014a) during the acute phase of disease following experimental challenge. Experiments that quantified splenic *F. psychrophilum* numbers on days 5 (Hadidi et al., 2008) and 9 post-infection (Marancik et al., 2014a) demonstrate significantly lower splenic bacterial loads in ARS-Fp-R line fish. It is likely these observed differences are a result of a differential immune response to infection.

Changes associated with host immunologic response can be elucidated by profiling alterations in host mRNA abundance between pathogen naive and infected animals (Martin et al., 2006; Beck et al., 2012; Langevin et al., 2012; Peatman et al., 2013; Pereiro et al., 2014; Shi et al., 2014). Previous studies of rainbow trout infected with *F. psychrophilum* demonstrate significant upregulation and downregulation of rainbow trout immune-relevant genes in the limited number of tissues examined (Overturf and LaPatra, 2006; Villarroel et al., 2008, 2009; Evenhuis and Cleveland, 2012; Langevin et al., 2012; Orieux et al., 2013; Henriksen et al., 2014). Microarray analysis of head kidney tissue from susceptible compared to resistant double-haploid rainbow trout lines, identified differences in basal gene expression as well as induction of antimicrobial peptides, complement, matrix metalloproteases, and chemokines 5 days post-infection (Langevin et al., 2012). Taken together, these studies suggest that the rainbow trout immune response to *F. psychrophilum* is likely multifactorial involving both innate and adaptive components.

In this manuscript, we quantify changes in gene transcript abundance between genetic lines with the following goals: (1) identify differentially regulated genes common to the host response to *F. psychrophilum* infection, (2) identify genes differentially regulated between lines in response to *F. psychrophilum* infection, and finally, (3) examine baseline differences in gene expression between naive animals that might contribute to the post-challenge phenotype. For this study, we utilized a comprehensive RNA-seq transcriptome approach, starting with whole-body lysates from 1.1 g fry from pooled fish of naive or experimentally infected genetic lines. Experimental infection during the fry/juvenile life stage allowed gene transcript profiling at a time when genetic lines express robust survival differences, and when epizootics and mortality are described as most severe in production environments (Branson, 1995; Decostere et al., 2000). Sequence reads were aligned to the recently released rainbow trout genome transcriptome (Berthelot et al., 2014), to which we added automated annotation and manual immune gene curation. In summary, we describe common, whole-body gene transcriptional responses to early *F. psychrophilum* infection and potential differences associated with the innate response between genetic lines, and finally, compare our results with published gene expression studies utilizing *F. psychrophilum* challenged rainbow trout.

## **MATERIALS AND METHODS**

## **ETHICS STATEMENT**

Fish were maintained at the NCCCWA and animal procedures were performed under the guidelines of NCCCWA Institutional Animal Care and Use Committee Protocols #053 and #076.

## **EXPERIMENTAL ANIMALS**

The ARS-Fp-R, ARS-Fp-C, and ARS-Fp-S genetic lines were derived from the same base population, and thus differed only as a result of artificial selection for BCWD post-challenge survival (Wiens et al., 2013a). Single-sire × single-dam matings were made within genetic lines between 3-year-old females and 1-year-old neo-males as previously described (Marancik et al., 2014b). Water temperatures in the egg incubation jars were manipulated so that all families hatched within a 1-week period (Leeds et al., 2010). Eggs were pooled within-line at the eyed stage and reared in ∼12.5°C flow-through spring water. The ARS-Fp-R egg pool consisted of contributions from 43 full-sib families, the ARS-Fp-C egg pool consisted of 10 full-sib families, and the ARS-Fp-S egg pool consisted of 11 full-sib families. The resistant-line eggs were progeny of dams that had undergone three generations of BCWD selection, while the sires had undergone four generations of selection. The control-line eggs were progeny of parents that had undergone one generation of selection for increased resistance (2007 year class) and had since been randomly bred. The susceptible-line eggs were progeny of parents that had undergone one generation of selection for increased susceptibility (2007 year class) and had since been randomly bred (Wiens et al., 2013a). Eggs were pooled from a larger number of resistant-line families because more resistant families are generated within the breeding program to apply selection differential, and thus this sampling design more accurately captured the genetic diversity within the resistant line. In addition, the resistant-line egg pool was part of a germplasm release and was utilized in additional challenge studies that will be reported elsewhere (Wiens, unpublished data).

All brood-stock and fish used in this study were certified to be free of common salmonid bacterial and viral pathogens by two independent diagnostic laboratories as described previously, and were negative for *F. psychrophilum* infection (Leeds et al., 2010; Wiens et al., 2013a). Prior to challenge, fry were allowed 1 week to acclimatize to challenge tanks. Mean body weight of the ARS-Fp-R line fish was 1.11 ± 0.05 g, the ARS-Fp-C line was 1.12 ± 0.03 g, and the ARS-Fp-S line was 0.98 ± 0.04 g (±1 SD, pooled weights of *n* = 4 tanks) at the initiation of the challenge. In the challenge facility, photoperiod was adjusted weekly to maintain a natural lighting cycle, and at the time of RNA-seq sample collection, was 14.5 h light: 9.5 h dark. Water quality parameters have been described previously (Wiens et al., 2013a).

## **RNA-seq EXPERIMENTAL DESIGN**

Bacterial challenge was carried out in the NCCCWA challenge facility with *F. psychrophilum* strain CSF-259-93 (initially provided by Dr. S. LaPatra, Clear Springs Foods, Inc.). This strain was previously isolated from a BCWD field case and maintained at −80°C in TYES media supplemented with 10% (v/v) glycerol. It has been consistently utilized as the challenge strain within the selective breeding program (Hadidi et al., 2008; Silverstein et al., 2009; Leeds et al., 2010; Wiens et al., 2013a), and its complete genome sequence has been determined (Wiens et al., 2014). Frozen stock was cultivated on TYES media for 5 days at 15°C, suspended in PBS, and the O.D.<sub>525</sub> adjusted to 0.4. Colony plate counts were performed in triplicate and recorded after 5 days of incubation to estimate the challenge dose, enumerated as viable CFU fish<sup>−1</sup>.

Each tank held 50 randomly assigned fish and was supplied with 2.4 L min<sup>−1</sup> of 12.5 ± 0.1°C flow-through spring water. For each genetic line, two tanks of fish were challenged by *F. psychrophilum* injection and two tanks of fish were injected with PBS and served as non-infected control animals. In total, 100 fish per line were anesthetized with 100 mg/L tricaine methanesulfonate (Tricaine-S, Western Chemical, Inc., Ferndale, WA) and intraperitoneally (IP) injected with 4.2 × 10<sup>6</sup> CFU fish<sup>−1</sup> *F. psychrophilum* suspended in 10 μL of chilled PBS or 10 μL of chilled PBS alone. Injections were performed using a repeater pipette (Eppendorf, Hauppauge, NY) fitted with a 27G × 1/2 inch needle. Fish age at the time of challenge was 49 days post-hatch (617 temperature degree days).

Five fish were sampled per tank on day 1 and on day 5 post-injection for RNA extraction. Survival prior to and following sampling was monitored daily for 21 days, with the exception of one PBS-injected tank that was excluded on day 16 due to water failure. All fish were fed daily by hand with a standard commercial fishmeal-based diet (Ziegler Bros, Inc., USA). The day 1 sampled fish were removed prior to being fed and the day 5 sampled fish were removed after being fed.

## **RNA EXTRACTION, LIBRARY PREPARATION, AND SEQUENCING**

The sampled fish were euthanized with 150 mg mL<sup>−1</sup> tricaine methanesulfonate, individually flash frozen in liquid nitrogen, and stored at −80°C. Total RNA was extracted from individual whole, ground fish using the standard TRIzol protocol (Invitrogen, Carlsbad, CA), and RNA integrity was confirmed by running a 1% agarose gel. Equal amounts of RNA from five fish were pooled from each of the 12 tanks at each of the two time points (total of 24 pools, *n* = 120 fish total). The cDNA libraries were prepared using Illumina's TruSeq Stranded mRNA Sample Prep kit following the manufacturer's instructions. Briefly, mRNA was selected with oligo(dT) beads and chemically fragmented to a size of ∼100–400 nt before annealing of random hexamers and first strand cDNA synthesis. The 24 indexed and barcoded libraries were randomly divided into three groups (eight libraries per group) and sequenced in three lanes of an Illumina HiSeq 2000 (single-end, 100 bp read length) at the University of Illinois at Urbana-Champaign. All raw RNA-seq reads were submitted to the NCBI Short Read Archive under BioProject ID PRJNA259860 (accession number SRP047070). RNA sequence reads that matched the *F. psychrophilum* CSF259-93 genome sequence (GenBank accession CP007627.1) were separated from host RNA-seq data and counted in each library.

Frozen fish homogenate lysates (500 μL), stored at −80°C, were individually processed to isolate genomic DNA (TRIzol DNA Isolation Procedure), and qPCR detection of *F. psychrophilum* genomic DNA was performed as described previously (Marancik and Wiens, 2013). Bacterial genome equivalents were normalized per 100 ng of extracted DNA.

## **GENES DESCRIBED AS REGULATED IN RESPONSE TO** *F. PSYCHROPHILUM* **CHALLENGE**

Based on a meta-analysis of studies in which rainbow trout were challenged with *F. psychrophilum*, 23 genes encoding immune-relevant factors with putative roles in inflammation, innate disease response, and adaptive immunity were utilized as a curated gene reference set that included *cd3*, *cd8*, *mx-1* (Overturf and LaPatra, 2006), *saa* (Villarroel et al., 2008), *igm*, *igt*, *ifn-*γ, *il-8*, *tcr-*β, *tlr5*, *tnf-*α (Evenhuis and Cleveland, 2012), *mt-a*, *sod-1*, *tgf-*β (Orieux et al., 2013), and *cd4*, *il-1*β, *il-6*, *il-17c1*, *il-17c2*, *IL-4/13A*, *foxp3b*, *mhc-I*, and *mhc-II* (Henriksen et al., 2014).

## **RAINBOW TROUT REFERENCE GENOME, GO ANNOTATION, AND IDENTIFICATION OF IMMUNE RELEVANT GENES**

In addition to the curated reference gene dataset described above, the recently released rainbow trout genome sequence, accession CCAF000000000 (Berthelot et al., 2014), encoding 46,585 predicted coding mRNA sequences, was utilized as a more complete reference set of gene models. Briefly, the Onchorhynchus\_mykiss\_pep.fa file (April 24, 2014 release), accessed at (http://www.genoscope.cns.fr/trout-ggb/data), was subjected to the Blast2GO sequence annotation pipeline with the default parameters applied in the blast, mapping, and annotation steps using a local Blast2GO database (GO database file go\_201401-assocdb-data). The NCBI non-redundant protein sequence (nr) database (build 12/13/2013) was used as the reference in the blast step. Of the 46,585 protein-coding gene models with supporting protein evidence from other vertebrates, a total of 46,103 were assigned a "Sequence Description" from the blast step, and most of these predicted protein sequences were assigned GO terms (Supplementary Data Sheet 1) from the Blast2GO annotation step. A subset of 2633 genes were identified as "putative immune relevant" either by sorting genes identified by GO annotation as "Immune Response" or by manual annotation based on sequence description (Supplementary Data Sheet 2). We excluded from analyses gene models encoding long non-coding RNAs and microRNAs. Splice variants were not analyzed and will be the subject of further study.

## **IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES**

To identify the differentially expressed genes, reads from each library were mapped against the annotated gene database and the coding sequences from the rainbow trout genome assembly. Based on the high quality score distribution of the RNA-seq reads, the whole 100 bp of the sequences were used in this step. The Bowtie short read aligner (Langmead et al., 2009) was used to map the reads to the references with a maximum of two mismatches allowed and no gaps. The output of Bowtie was filtered with an in-house Perl script to generate a count table, where the entry in row i and column j represents the total number of reads from library j mapped to transcript i. For principal component analysis (PCA), reads were normalized to reads per kilobase of exon model per million mapped reads (RPKM) using an in-house script (Gao, available upon request) using the formula described previously (Mortazavi et al., 2008).

$$RPKM = \frac{N_{read}}{\left(L_{exon} / 10^3\right)\left(N_{total} / 10^6\right)} = \frac{10^9 \, N_{read}}{L_{exon} \, N_{total}}$$

where *N*<sub>read</sub> is defined as the number of reads mapped to the gene, *L*<sub>exon</sub> is defined as the total bases of the sequence of the gene, and *N*<sub>total</sub> is defined as the total number of reads mapped to the whole reference (sum of the *N*<sub>read</sub> for all genes). Raw data and normalized RNA-seq data (*RPKM*) for all samples including immune relevant genes are included as Supplementary Data (Supplementary Data Sheet 3).
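The RPKM normalization can be illustrated with a short sketch for a single library. The in-house script itself is not available here, so the function name, the dictionary-based inputs, and the toy counts and gene lengths are all assumptions made for illustration.

```python
def rpkm(read_counts, gene_lengths):
    """RPKM per gene for one library.

    read_counts: dict mapping gene -> number of reads mapped to that gene
    gene_lengths: dict mapping gene -> total bases of the gene sequence
    """
    # N_total is the sum of mapped reads over all genes in the reference.
    n_total = sum(read_counts.values())
    return {
        gene: 1e9 * n_read / (gene_lengths[gene] * n_total)
        for gene, n_read in read_counts.items()
    }


# Hypothetical toy library with three genes (not data from the study).
toy_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
toy_lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
toy_rpkm = rpkm(toy_counts, toy_lengths)
```

In this toy library, geneA and geneB receive the same RPKM even though geneB has three times as many reads, because geneB is also three times as long.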

Principal component and nearest neighbor network analyses were performed using Qlucore Omics Explorer (v3.0). Read count data from the 24 samples (RPKM) were log<sub>2</sub> transformed and subjected to normalization (mean = 0 and variance = 1), and variables (genes) were subjected to multi-group comparison with a false discovery rate (FDR) *q* < 0.05 (*R*<sup>2</sup> ≥ 0.36) and *p* < 0.002 [*F*(1, 22) ≥ 12.42].
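The *R*<sup>2</sup> and *F* thresholds quoted above appear to be the same cut-off expressed on two scales. As a hedged check, assuming the standard relation between an *F* statistic and the proportion of variance explained in a linear model with 1 and 22 degrees of freedom (an assumption about how the software reports these values, not something stated in the text):

$$R^2 = \frac{F \cdot df_1}{F \cdot df_1 + df_2} = \frac{12.42 \times 1}{12.42 \times 1 + 22} \approx 0.36$$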

The raw count table was also input into the R package DESeq2 (Love et al., 2014), datasets were selected for pair-wise comparisons, and the standard differential analysis steps of DESeq2 were applied to the selected datasets. The output table of DESeq2 contains a column of adjusted *p*-values (padj) obtained using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), and we utilized a cut-off of *padj* < 0.05 as the criterion for differential expression with no filtering based on fold-change. For more stringent data filtering and visualization, data were first sorted by variance 2.5%, then filtered by *q* < 0.01 and log<sub>2</sub> fold change of >3 (Qlucore Omics Explorer v3.0).
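For readers unfamiliar with the Benjamini-Hochberg adjustment, the following minimal sketch shows how adjusted *p*-values can be computed from raw *p*-values. The function name and the toy *p*-values are hypothetical, and this is not DESeq2's code.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes (not data from the study).
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
toy_padj = benjamini_hochberg(toy_pvalues)
toy_significant = [p < 0.05 for p in toy_padj]
```

Note that two of the toy genes have raw *p*-values below 0.05 but adjusted values above it. DESeq2 also applies independent filtering by default before the adjustment, so its padj column can differ from a plain adjustment of all raw *p*-values.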

### **GO ANNOTATION ENRICHMENT ANALYSES**

The GOSSIP program embedded in the Blast2GO package was used for GO enrichment analysis (Bluthgen et al., 2005a). This program examines each GO term for gene annotation enrichment by comparing a test set with a reference set using Fisher's exact test. In our analysis, the differentially expressed genes were selected as the test set, and all the peptide sequences from the rainbow trout genome assembly for which GO terms were assigned, excluding the test set, were used as the reference set. The FDR is computed by GOSSIP using an analytical equation (Bluthgen et al., 2005b), and the final list of enriched GO terms was selected at FDR < 5%.
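The enrichment test for a single GO term can be sketched as a one-sided Fisher's exact test on a 2 × 2 table (test set versus reference set, annotated versus not annotated). This is a minimal illustration and not GOSSIP's implementation: the function name and the toy counts are hypothetical, and the choice of a one-sided test for over-representation is an assumption.

```python
from math import comb


def enrichment_pvalue(test_with, test_total, ref_with, ref_total):
    """One-sided Fisher's exact test for over-representation of a GO term.

    test_with / test_total: genes with the term / all genes in the test set
    ref_with / ref_total:   genes with the term / all genes in the reference
                            set (the reference set excludes the test set)
    """
    with_term = test_with + ref_with
    population = test_total + ref_total
    denominator = comb(population, test_total)
    upper = min(test_total, with_term)
    # Hypergeometric tail: probability of test_with or more annotated genes.
    tail = sum(
        comb(with_term, k) * comb(population - with_term, test_total - k)
        for k in range(test_with, upper + 1)
    )
    return tail / denominator


# Hypothetical counts: 5 of 10 test genes versus 5 of 90 reference genes.
toy_p = enrichment_pvalue(5, 10, 5, 90)
```

The resulting *p*-values, one per GO term, would then be corrected for multiple testing before applying the FDR threshold.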

#### **VALIDATION OF RNA-seq DATA BY qPCR**

Six genes and primer sets were chosen for qPCR analysis to determine fold-change differences between control and infected fish on day 5 post-infection (Table S1). Extracted RNAs were treated with Optimize™ DNAase I (Fisher Bio Reagents, Hudson, NH) to remove contaminating genomic DNA. One microgram of DNAase-treated RNA was used to make cDNA in a total reaction volume of 20 μL. cDNA was synthesized using the Verso cDNA Synthesis Kit (Thermo Scientific, Hudson, NH) following the manufacturer's protocol. The reverse transcription reaction was carried out using a My Cycler™ Thermal Cycler (Bio Rad, Hercules, CA) at 42°C for 30 min (one cycle amplification) followed by 95°C for 2 min (inactivation). Anchored oligo(dT) primer, at a final concentration of 25 ng/μL, was used to prime the reverse transcription reaction.

Relative gene expression was determined by CFX96™ Real Time System (Bio Rad, Hercules, CA). Forward and reverse primer sequences were manually aligned to their respective genes for validation. cDNA amplification was performed using DyNAmo Flash SYBR Green Master Mix (Thermo Scientific, Hudson, NH) containing 0.1 nm/μL forward and reverse primers and 0.0025 μg/μL of cDNA template in a total reaction volume of 20 μL. Initial denaturation was done at 95°C for 7 min. Forty-three cycles of amplification were carried out under the following conditions: 95°C for 0.1 min (denaturation), 57–64°C (annealing and extension) and 60°C for 5 min (final extension). Different annealing temperatures were used for different primers depending on their melting temperature (Table S1).

Real time data were analyzed using the software Bio-Rad CFX Manager (Bio Rad, Hercules, CA). Differential gene expression was calculated using the standard curve method in which β-actin (Accession: AJ438158) was used as endogenous reference to normalize the target gene. β-actin expression levels demonstrated in RNA-seq data were similar in PBS and *F. psychrophilum*-injected fish (data not shown). qPCR data were quantified using the delta delta Ct (ΔΔCt) method (Schmittgen and Livak, 2008). Ct-values of β-actin were subtracted from Ct-values of the target gene to calculate the normalized value (ΔCt) of the target gene in both the calibrator samples (PBS-injected) and test samples (*F. psychrophilum*-injected). The ΔCt value of the calibrator sample was subtracted from the ΔCt value of the test sample to get the ΔΔCt value. Fold change in gene expression in the test sample relative to the calibrator sample was calculated by the formula 2<sup>−ΔΔCt</sup> and the normalized target Ct values in each infected and non-infected group were averaged. Correlation between gene expression fold-change measured by qPCR and RNA-seq was assessed by Pearson correlation. All statistics were performed with a significance of *P* < 0.05.
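The fold-change arithmetic described above can be written out as a short sketch. The Ct values are hypothetical, and the function assumes one target gene normalized to a single reference gene (β-actin in this study).

```python
def delta_delta_ct_fold_change(ct_target_test, ct_ref_test,
                               ct_target_cal, ct_ref_cal):
    """Fold change of a target gene in test vs. calibrator samples (2^-ddCt).

    ct_*_test : Ct values in the test (infected) sample
    ct_*_cal  : Ct values in the calibrator (PBS-injected) sample
    """
    delta_ct_test = ct_target_test - ct_ref_test   # normalized Ct, test
    delta_ct_cal = ct_target_cal - ct_ref_cal      # normalized Ct, calibrator
    delta_delta_ct = delta_ct_test - delta_ct_cal
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target crosses threshold 3 cycles earlier
# in the infected sample while the reference gene is unchanged.
fold = delta_delta_ct_fold_change(22.0, 18.0, 25.0, 18.0)
```

A ΔΔCt of −3 corresponds to an eight-fold increase in the test sample, since each cycle represents an approximate doubling of product.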

## **RESULTS**

## **DISEASE RESISTANCE PHENOTYPE AND** *F. PSYCHROPHILUM* **LOAD IN THREE GENETIC LINES FOLLOWING EXPERIMENTAL CHALLENGE**

Following injection challenge, survival significantly differed between the three genetic lines (*P* < 0.001) with the ARS-Fp-R line exhibiting the highest survival (92.3%), the ARS-Fp-C line intermediate survival (54.6%), and the ARS-Fp-S line exhibiting the lowest survival (29.4%) (**Figure 1A**). The 62.9 percentage point survival difference between the ARS-Fp-R and ARS-Fp-S lines is consistent with the expected percentage point survival difference calculated from estimated parental breeding values (data not shown), analyzed as previously described (Leeds et al., 2010). Survival of PBS injected fish was >98% per line over the 21-day challenge period. Whole body *Fp* loads, measured by qPCR,

**FIGURE 1 | (A)** Post-challenge survival of ARS-Fp-R line (resistant, blue line), ARS-Fp-C line (control, green line), and ARS-Fp-S line (susceptible, red line) fish, challenged using replicate tanks. Survival differences were significant between genetic lines (*P* < 0.001). Fish were injected with either *F. psychrophilum* in PBS (*n* = 100 fish per line) or with PBS alone (data not shown) and survival monitored for 21 days. Five fish were sampled from each tank (*n* = 10 total) on days 1 and 5 post-challenge (arrows) for RNA-seq analysis. **(B)** Mean *F. psychrophilum* load, genome equivalents per 100 ng extracted DNA (+1 SD) measured by qPCR. Individual fish were tested (*n* = 10 fish per group) with the exception of day 1 ARS-Fp-C line (*n* = 1) as samples were not available. Loads differed significantly between genetic lines on day 5 (*P* < 0.001). **(C)** Mean *F. psychrophilum* cDNA count per library (+1 SD) identified from the RNA-seq dataset.

demonstrated no difference between genetic lines on day 1 but significant differences on day 5 (*P* < 0.001), with ARS-Fp-R line mean loads of 10 ± 22 *F. psychrophilum* genome equivalents (GE) (*n* = 10 fish, ±1 SD), increasing to 484 ± 937 GE in the ARS-Fp-C line (*n* = 10, ±1 SD), and the highest mean load present in the ARS-Fp-S line, 4683 ± 6011 GE (*n* = 10, ±1 SD) (**Figure 1B**). In the RNA-seq libraries, a small number of sequences were present that matched the *F. psychrophilum* CSF259-93 genome. These presumably represent either cDNA from bacterial expressed genes or contaminating genomic DNA. *F. psychrophilum* read counts were similar between genetic lines on day 1 and, by day 5, were lowest in ARS-Fp-R line fish and highest in ARS-Fp-S line fish (**Figure 1C**). In summary, by day 5 post-challenge the ARS-Fp-R, ARS-Fp-C and ARS-Fp-S lines displayed the expected differences in pathogen load and subsequent post-challenge mortality.

## **RAINBOW TROUT RNA-seq REFERENCE DATASET, GO ANNOTATION, AND MANUAL IMMUNE GENE CURATION**

A total of 520 million sequence reads were generated from the 24 libraries with an average of 21.7 million (M) RNA-seq reads per library ranging from 17.4 M to 24.4 M reads (**Table 1**). Approximately half of the reads aligned to the reference transcriptome coding sequence, averaging 11.2 M (51.8%) per library with a range of 9.3 M–12.5 M. Of the 46,585 genes identified in the rainbow trout genome having protein coding evidence, an average of 43,068 (92.4%), exhibited detectable levels of expression, defined as >1 sequence read per gene (**Table 1**). The number of genes with an average of ≥10 read counts across the 24 libraries fell to an average of 32,830 (70.5%). In order to normalize gene expression across samples, data were converted to reads per kilobase of exon model per million mapped reads (RPKM) format (Supplementary Data Sheet 3). Of the putative immune relevant genes, 1797 (68.2%) had an average of ≥1 RPKM across the 24 libraries (Supplementary Data Sheet 3).

## **GLOBAL TRANSCRIPT ABUNDANCE DIFFERENCES BETWEEN CONTROL AND INFECTED AND BY SAMPLE DAY**

Across the complete dataset, comparison of *F. psychrophilum* infected vs. PBS injected groups by principal component and nearest neighbor network analysis identified samples grouped by day and infection status (**Figure 2**). Most tank replicates were connected by nearest-neighbor analysis although there was variation observed between tanks. The replicate tanks for the day 5 PBS injected ARS-Fp-S line (red colored balls), day 1 *F. psychrophilum* infected ARS-Fp-R line (blue colored balls), and day 5 *F. psychrophilum* ARS-Fp-C line (green colored balls) grouped together but were not directly connected by the network analysis.

## **DIFFERENTIALLY EXPRESSED GENES COMMON TO THE THREE GENETIC LINES IN RESPONSE TO** *F. PSYCHROPHILUM* **INFECTION**

Two-group comparison by infection (Qlucore) identified a total of 1884 genes as differentially regulated (*q* < 0.05) accounting for approximately 4.04% of the coding genes identified in the genome (see Table S2, **Qlucore** *q <* **0.05** for the complete list). Of the differentially regulated genes, 279 (14.8%) were categorized as immune relevant by GO or manual annotation. In order to more precisely identify differences between samples, 24 pairwise comparisons were tabulated by DESeq2 using a *q* < 0.05 (**Table 2**). There was a decrease in the number of differentially regulated genes in the infected ARS-Fp-R line fish compared to PBS injected fish between day 1 (*n* = 515 genes) and day 5 (*n* = 428 genes) time points. In contrast, the number of differentially regulated genes in the infected ARS-Fp-C line fish compared to PBS injected fish, increased from day 1 (*n* = 20 genes) to day 5 (*n* = 2201 genes). The number of differentially regulated genes in the infected ARS-Fp-S line fish compared to PBS injected fish increased from day 1 (*n* = 1663) to day 5 (*n* = 2225). In general, there were relatively few differentially regulated genes between the three PBS injected genetic lines, ranging from 3 to 246 genes (**Table 2**). Bacterial challenge increased the number



*Total number of 100 bp reads and the number that match reference genes are listed.*

*<sup>a</sup>The number of expressed genes is defined as genes having at least one read count with 2 bp or less mismatch and no gaps (n = 46,585 reference genes).*

of differentially expressed genes between genetic lines from day 1 to day 5 in all between-line comparisons. There was a strong sampling time effect within each line including PBS injected groups possibly due to differences in feeding or post-injection recovery.

Analysis of GO term enrichment within the pair-wise comparisons revealed a larger number of over-represented terms within the dataset as compared to under-represented terms (**Table 3**). Overrepresented processes in the ARS-Fp-S line included defense response to bacterium, inflammatory response, leukotriene and arachidonic acid production, complement activation, humoral immune response, antigen processing, chemotaxis, B cell homeostasis, interleukin-2 mediated pathway signaling and cellular iron homeostasis (Supplementary Data Sheet 5 for complete list). The ARS-Fp-R line Day 1 GO term enrichment included several categories that include response to wounding and wound healing and was enriched for cytokine and chemokine activity. ARS-Fp-R line day 5 GO term enrichment included complement activation, inflammatory response, B cell mediated immunity and defense response to bacterium.

Pair-wise analysis of differentially regulated genes shared between infected and PBS injected fish within each sampling day identified no genes common to all three lines on day 1, although 274 were shared between ARS-Fp-R and ARS-Fp-S line fish (**Figure 3**). This may in part be due to the low numbers of genes differentially expressed between the ARS-Fp-C line replicates. Among all day 5 samples, 175 common genes were differentially regulated, by pair-wise comparisons, in all three lines (see Table S2, **Pair-wise** for the complete list). Of these, the majority (89%, *n* = 156) were upregulated, while only 19 were consistently downregulated. Application of stringent data filtering to the entire dataset (Qlucore, *q* = 0.01 and >3-fold *log*<sub>2</sub> expression difference) identified 110 genes that were the most robust predictors of infection status that collapsed to 51 common sequence descriptions (**Figure 4** and Table S2: **tab Qlucore** *q <* **0.01, log2** *>* **3 fold** for complete gene list). Many shared coordinated patterns of gene expression across samples were identified by unsupervised hierarchical clustering (**Figure 4**). The most highly regulated immune relevant genes included serum amyloid A (GSONMT00016296001, GSONMT00005013001), complement c1q-like protein 2-like (GSONMT00002696001), differentially regulated trout protein 1 precursor (GSONMT00048193001, GSONMT00025517001), leukocyte cell-derived chemotaxin 2 precursor (GSONMT00024746001, GSONMT00065856001), toll-like receptor 5 membrane form (GSONMT00013855001), c-type lectin domain family 4 member e (GSONMT00005166001), c-type lectin receptor b (GSONMT00023806001), cd59b glycoprotein (GSONMT00025518001), and interleukin-1 receptor type 2-like (GSONMT00066304001). Interestingly, a number of putative metabolic genes were also highly expressed in infected ARS-Fp-S line fish compared to the ARS-Fp-R line including cis-aconitate decarboxylase-like (GSONMT00057407001), tbt-binding partial (GSONMT00003889001), l-serine dehydratase l-threonine deaminase-like (GSONMT00025010001), leptin (GSONMT00002603001), growth differentiation factor 15 (GSONMT00000024001), catechol o-methyltransferase domain-containing protein 1-like (3 genes) and hepcidin (GSONMT00082379001). Multiple paralogues of several less abundantly expressed immune relevant genes included complement component 3 (4 genes), interferon-induced guanylate-binding protein 1 (3 genes), interferon-induced guanylate-binding protein 1-like (5 genes), interferon-induced protein 44-like (3 genes) and microfibril-associated glycoprotein 4-like (3 genes). Differentially regulated cytokine genes included interleukin 11 (GSONMT00009406001) and interleukin 1-beta (GSONMT00005489001) as well as nine chemokine genes that included: cc chemokine (GSONMT00017873001, GSONMT00007278001), c-c motif chemokine 19 precursor (GSONMT00014841001, GSONMT00057082001), c-c motif chemokine 4-like (GSONMT00080018001), chemokine ck-1 (GSONMT00024124001), cxc chemokine (GSONMT00080684001), and interleukin 8 (GSONMT00038968001, GSONMT00059090001). Immune genes of interest also included programmed cell death 1 ligand-1 (GSONMT00040812001), tnf receptor superfamily member 5 (GSONMT00012579001), tnf receptor superfamily member 6b (GSONMT00034343001) and tnf receptor



*aAbbreviations: R, ARS-Fp-R; C, ARS-Fp-C; S, ARS-Fp-S; Fp, F. psychrophilum challenged; PBS, phosphate buffered saline injected.*

superfamily member 9 (GSONMT00057532001), and b-cell receptor cd22-like isoform x2 (GSONMT00072668001).

#### **GENE EXPRESSION IN NAIVE FISH**

Gene expression profiles were compared in PBS-injected ARS-Fp-R, ARS-Fp-C and ARS-Fp-S line fish to examine the effect of selective breeding on the transcriptome of naive animals. In the first analysis, pair-wise comparisons between lines were examined. On day 1 post challenge, three genes demonstrated significantly different transcript abundance between PBS-injected ARS-Fp-R and ARS-Fp-C fish, 76 genes were different between the ARS-Fp-R and ARS-Fp-S line fish, and 28 genes were different between the ARS-Fp-C and ARS-Fp-S line fish (**Table 2** and Supplementary Data Sheet 4: **tabs R\_C PBS Day 1, R\_S PBS Day 1, and S\_C PBS Day 1, respectively**). None of these genes were common to all three comparisons and none were found to be similarly differentially regulated between genetic lines after challenge with *F. psychrophilum*. On day 5 post injection, the number of differentially regulated genes in naive fish increased, with 246 genes demonstrating significant differences in transcript counts between the ARS-Fp-R and ARS-Fp-C lines, 45 genes between the



*aAbbreviations: Fp, F. psychrophilum challenged vs. PBS, phosphate buffered saline injected; D1, day1; D5, day5; R, ARS-Fp-R line; C, ARS-Fp-C line; S, ARS-Fp-S line.*

ARS-Fp-R and ARS-Fp-S lines, and 61 genes between the ARS-Fp-C and ARS-Fp-S lines (**Table 2**). No genes were common to all three comparisons and eight genes exhibited a significant difference within two comparisons with similar trends in infected fish (Supplementary Data Sheet 4). Of these genes, only complement factor h-like (GSONMT00015052001) has a purported immunologic role. Fold gene expression differences were 2.8 and 2.1 between naive resistant and susceptible and infected resistant and susceptible fish, respectively. Complement factor h-like was 1.8 fold-higher in naive control fish compared to naive susceptible fish and 5.0 fold different between the infected lines (Supplementary Data Sheet 4).

A global analysis of expression differences between lines was performed on the entire dataset. In these analyses, data were first filtered for variance ≥2.5% and then subjected to *q* < 0.05, and infection status was removed as a factor. Within the complete dataset, 21 differentially expressed genes were identified (**Figure 5**). Several genes exhibited the pattern of low expression in the ARS-Fp-S line, intermediate expression in the ARS-Fp-C line and highest expression in the ARS-Fp-R line. These included the immune relevant genes tnf receptor superfamily member 14b-like isoform x1 and interleukin-1 receptor-like 1. We also analyzed only day 1 samples, and identified additional immune relevant candidates that included complement c1q-like protein 4 precursor, protein nlrc3-like and gata-binding factor 2-like isoform x2 (**Figure 6**). Interestingly, there were also genes that exhibited the opposite expression pattern: higher in the ARS-Fp-S line and lower in the ARS-Fp-C and ARS-Fp-R lines. These included the putative immune relevant genes nlrc5-partial (**Figure 5**) and ig-like v-type domain-containing protein family 187a-like (**Figure 6**).

## **MANUAL EXAMINATION OF KNOWN IMMUNE GENES EXPRESSED IN RESPONSE TO** *F. PSYCHROPHILUM* **INFECTION**

In order to compare this dataset to previous analyses of gene expression following bacterial challenge, 23 known immune relevant genes were selected and *de novo* sequence alignment of the complete RNA-seq dataset was performed. There was a significant difference in transcript abundance between *F. psychrophilum* and PBS-challenged fish on day 1 and/or day 5 post-infection for 10/23 published gene sequences that were searched against the RNA-seq data set (**Table 4**) (*Padj* < 0.05). Seven genes demonstrated a significant difference in transcript counts between

**FIGURE 3 | Venn diagrams depicting commonalities of regulated genes in infected ARS-Fp-R, ARS-Fp-C, and ARS-Fp-S line fish that showed significant differences in transcript abundance compared to respective PBS-challenged fish groups on days 1 and 5 post-infection.** For all analyses, pair-wise comparisons were calculated with DESeq2 using a *q* < 0.05. Circles are color coded by line, ARS-Fp-S (red line), ARS-Fp-C (green), and ARS-Fp-R (blue).

genetic lines on day 5 post-infection, with no significant differences occurring between genetic lines on day 1 (**Table 4**) (*Padj* < 0.05). There was no significant difference in transcript abundance between infected and control fish or between genetic lines for *tgf-*β, *tcr-*β, *sod1*, *mx-1*, *mt-a*, *mhc-I*, *il-6*, *il-4/13A*, and *foxp3b*. Transcript counts were too low in at least one comparison to provide a significant *Padj* value for *tnf-*α, *il1-*β, *il-17c2*, and *ifn-*γ.

## **VALIDATION OF RNA-seq DATA BY qPCR**

In order to begin validating RNA-seq transcript abundance estimates, selected differentially expressed genes were identified for qPCR analysis (Table S1). There was significant correlation between transcript fold-change values determined by RNA-seq and qPCR in control and infected fish on day 5 post-infection (*R* = 0.75, *P* < 0.001) (**Figure 7**).
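The agreement between platforms is summarized by a Pearson correlation coefficient. A minimal sketch of that calculation follows; the paired fold-change values are hypothetical and are not taken from Figure 7, and whether the authors correlated raw or log-transformed fold changes is not stated in the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical fold changes for the same genes measured by two methods.
rnaseq_fc = [1.0, 2.0, 3.0, 4.0]
qpcr_fc = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(rnaseq_fc, qpcr_fc)
```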

## **DISCUSSION**

To our knowledge this is the first study describing whole-body transcriptome analysis and comparison of three divergently selected lines of rainbow trout exhibiting graded survival differences in response to standardized *F. psychrophilum* challenge. We identified large numbers of differentially expressed genes between genetic lines that increased in number with time and bacterial load. Since we utilized whole fish in these experiments, we interpreted differences in transcript abundance as upregulation/downregulation of genes or alteration of transcript stability. Importantly, the use of entire fish for RNA extraction rules out the potential for cell migration to confound expression differences, which is inherent in studies utilizing a defined immune tissue source (i.e., spleen, blood, or anterior kidney). A limitation of our approach is the likelihood of missing differentially expressed genes expressed only in rare subsets of cells. Nevertheless, our

depth of sequence coverage allowed quantification of transcript abundance from about 70.5% of all protein coding genes identified within the rainbow trout genome and approximately 68.2% of putative immune relevant genes that we identified by automated and manual annotation. Genes identified as differentially expressed were primarily associated with the acute phase response to bacterial infection but we also identified genes associated with innate and adaptive immune responses, physiologic and metabolic processes, and wound healing. This experimental design allowed for inclusion of both mucosal and systemic immune system tissue sampling and demonstrates previously unrecognized changes in gene transcription that occur with BCWD. These findings illustrate the high degree of transcriptional complexity involved in the rainbow trout BCWD response and provide a reference data set to begin to understand the impact of selective breeding on the genetic basis of disease resistance.

## **INDUCTION OF THE ACUTE PHASE RESPONSE AND INNATE IMMUNITY**

The transcriptional response included induction of complement factors, acute-phase proteins, cytokines, chemokines and other genes associated with innate immunity. A relatively high number of complement factors were identified as upregulated after infection. This included multiple transcript alignments to genes encoding complement c3, complement c9, complement c4-b-like, and complement c1q-like proteins. Generation of complement during the acute-phase response has been well-described in rainbow trout and likely imparts direct bactericidal activity (Sunyer and Lambris, 1998; Whyte, 2007). The magnitude and range of complement factor expression suggests the complement system constitutes an important aspect of the host response to *F. psychrophilum* although factors were not found to be significantly different between resistant and susceptible fish. A wide range of acute phase protein encoding genes, including serum

amyloid A and tlr5 were identified as highly expressed after infection. Similar trends have been described from tissues of rainbow trout affected by BCWD (Overturf and LaPatra, 2006; Langevin et al., 2012). Differentially expressed trout protein is a relatively recently recognized immune factor shown to be expressed during the salmonid acute phase response after bacterial infection (Bayne et al., 2001; Tsoi et al., 2004). Along with IL-1, IL-11, and IL-17-c1, these factors likely exert diverse pro-inflammatory and defensive actions including recruitment of inflammatory cells and further amplification of the acute phase response (Stadnyk, 1994; Jorgensen et al., 2001; Carrington et al., 2004; Wang et al., 2009). Significantly higher transcript counts for acute phase proteins and cytokines in the ARS-Fp-S line on day 5 may represent higher induction of pro-inflammatory conditions compared to the ARS-Fp-R line that correlates with higher bacterial loads.

## **DIFFERENTIAL EXPRESSION OF GENES THAT CONTRIBUTE TO ADAPTIVE IMMUNITY**

A limited number of differentially regulated genes were identified that may be associated with adaptive immune processes. There was modest up-regulation of *igm* and *igt* genes and cellular factors associated with cell signaling and activation including *mhc-II*, *cd3*, *cd4*, and *cd8* genes. Higher *igm* gene transcript levels in resistant and control-line fish compared to the susceptible-line suggest an earlier development of antibody mediated processes, although the converse pattern was observed for *mhc-II* gene transcript counts on the same days post-infection.

## **GENES INVOLVED IN WOUND HEALING**

A number of genes associated with wound healing and wound response showed significant expression differences between naive and infected fish but no trends between genetic lines or sample day. These included syndecan-4-like isoform x2, plasminogen activator inhibitor 1-like, and ras-related c3 botulinum toxin substrate 2 precursor. All have roles in localized tissue repair in mammals through augmenting extracellular matrix reorganization, cellular growth and proliferation, and regulation of cell signaling (Romer et al., 1996; Woods and Couchman, 1998; Lin et al., 2005; Ojha et al., 2008). A limited number of genes with described roles in wound healing in other species (i.e., abhydrolase domain-containing protein 2-a-like and a collagen alpha-1 chain-like factor) demonstrated reduced expression in infected fish. Wound repair has not been well-characterized in fish and the general effect of these genes on the host response to infection is unknown. Necrosis of internal organs, peripheral skeletal muscle and skin is likely directly associated with host morbidity and mortality (Nilsen et al., 2011; Marancik et al., 2014b) and further studies are needed to determine how wound healing impacts recovery and survival.

**FIGURE 6 | Heat map showing multi-group comparison by genetic line eliminating infection as a factor (*q* = 0.05) for Day 1 samples.** Variables (genes) are grouped by hierarchical clustering.



*Quantification was performed between Fp and PBS challenged ARS-Fp-R (R), ARS-Fp-C (C), and ARS-Fp-S (S) line fish and between genetic lines on days 1 and 5 post-infection. There were no significant differences between genetic lines on day 1 post-infection.*

## **IDENTIFICATION OF LARGE PARALOGOUS FAMILIES OF PUTATIVE IMMUNE-RELEVANT GENES**

The recent rainbow trout genome contains expansion of putative gene families including protein nlrc3 and nlrc3-like (*n* = 111), protein nlrc5 (*n* = 9), polymeric immunoglobulin receptor-like (*n* = 14), perforin-1-like (*n* = 8), B-cell receptor cd22-like proteins (*n* = 38), cd209-like (*n* = 34), macrophage mannose receptor 1-like (*n* = 28), lrr and pyd domain-containing proteins (*n* = 37), and fish virus induced trim proteins (*n* = 44) (Supplementary Data Sheet 3). The availability of the genome allows a more comprehensive analysis of partially characterized immune gene families including tumor necrosis factor superfamily of ligands (*n* = 26) and toll-like receptors (*n* = 27) (Palti, 2011; Wiens and Glenney, 2011). Of note, a large number of tumor necrosis factor receptor superfamily members (tnfrsf, *n* = 59) were identified (Supplementary Data Sheet 3). While the precise phylogenetic nomenclature and grouping of tnfrsf remains to be resolved, our analysis indicates many members are differentially regulated in response to *F. psychrophilum* infection including tnfrsf 1a precursor (GSONMT00019008001), tnfrsf 1a-like (GSONMT00061996001), tnfrsf 5 precursor (GSONMT00012579001, GSONMT00003531001), tnfrsf 5-like (GSONMT00013182001), tnfrsf 6 (GSONMT00082555001), tnfrsf 6b (GSONMT00034343001), tnfrsf 6b-like (GSONMT00020936001), tnfrsf 9 (GSONMT00057532001), tnfrsf 9-like (GSONMT00050654001), tnfrsf 19-like isoform x1 (GSONMT00055755001) and tnfrsf 19-like isoform x2 (GSONMT00069336001). In addition, basal expression of one paralogue of tnfrsf 14-like isoform x1 (GSONMT00000915001, from a total *n* = 11 paralogues present in the genome) is modestly higher in the ARS-Fp-R line as compared to the ARS-Fp-S line (**Figure 5**). While Blast2GO v.2 provides a description for novel sequences based on natural language text mining functionality (Gotz et al., 2008), we emphasize that detailed annotation and phylogenetic analysis of these genes remains to be undertaken. Some of these genes may be pseudogenes and it is likely that the sequence description of many of the automated annotations we present here will require revision following expert curation, and may also change as improvements are made to the reference rainbow trout genome sequence and analysis of transcript variants.

## **DIFFERENTIALLY EXPRESSED GENES BETWEEN NAIVE FISH FROM THE THREE GENETIC LINES**

There was tight clustering of data and a low number of gene expression differences between PBS-injected fish. This suggests phenotypic differences between the ARS-Fp-R, -C, and -S lines are largely induced by infection and that selective breeding appears to have had a relatively low impact on basal gene expression during the normal physiologic state. Expression of complement factor h-like was observed to be significantly different between both naive and infected resistant and susceptible fish. Complement factor h is a regulatory protein of the alternative complement pathway and although isolated from rainbow trout (Anastasiou et al., 2011), its contribution to the rainbow trout immune response has not yet been characterized.

## **COMPARISON OF DIFFERENTIALLY REGULATED GENES WITH Langevin et al. STUDY**

This study expands on the work of Langevin et al. (2012) who described differential regulation of select genes by microarray and qPCR in the anterior kidney of BCWD resistant and susceptible rainbow trout clones after experimental infection. Both studies showed an increase in transcription of genes encoding pro-inflammatory cytokines, anti-bacterial effectors and matrix metalloproteases. Notably, complement factors, serum amyloid A, mannose-binding proteins, chemokines, and interleukins 1 and 8 all showed significant transcriptional increase on day 5 post-infection. RNA-seq data provide further evidence for upregulation of immunorelevant factors not previously identified, including tlr5, leptin, haptoglobin, and C-type lectin and genes associated with adaptive, physiologic, structural, and intracellular processes. There was no evidence to support differential regulation of interferon-gamma during infection in either study although potential interferon induced genes are differentially regulated. Tumor necrosis factor-alpha was also notably absent from both studies despite previously published (Evenhuis and Cleveland, 2012) and unpublished data (Wiens, unpublished data) suggesting upregulation during infection. Assay sensitivity may have been confounded by whole-body RNA isolation, which could reduce the ability to detect low-abundance cytokines expressed in specific tissues.

There was variability between studies when transcriptomic responses were compared between resistant and susceptible line fish. Both studies observed higher bacterial loads in susceptible fish and increased transcriptional response of a number of genes associated with pro-inflammatory conditions including interleukin-1, cc chemokine and matrix metalloproteinases 1 and 19. Our study further identified serum amyloid A, differentially regulated trout protein, and cytochrome p450 1a as significantly upregulated in susceptible fish. A number of metalloproteinase inhibitors were identified by Langevin et al. (2012) as significantly upregulated in resistant but not susceptible fish but showed no significant differences between genetic lines when analyzed in our study. This variability likely stems from differences in experimental design including challenge route, bacterial dose and strain, tissue sampled, and microarray vs. transcript count quantification and analysis. However, even with these differences in method, both studies demonstrated a robust immune response in *F. psychrophilum* challenged fish with significant differences in the transcriptome of resistant and susceptible fish, associated with immune relevant genes.

## **CONCLUDING REMARKS**

Complex transcriptional differences were identified between lines following infection with *F. psychrophilum* strain CSF259-93. Most of the differentially regulated genes exhibited increased transcript abundance and correlated with higher bacterial loads. It is likely that most of the changes we identified are consequences of the differential expansion of *F. psychrophilum* following infection, especially in the ARS-Fp-S line. However, differences in bacterial load cannot account for transcriptional differences observed on day 1, as bacterial loads were similar between genetic lines. Future efforts will be directed at dissecting early time points following exposure and will focus on identifying inter-individual differences as this study examined pools of fish due to sequencing cost limitations. These data will be integrated with proteomic studies to examine the relationship between transcript and protein levels and to assist in exploring biomarkers of infection, immunocompetency and disease resistance. Also, in this study we did not analyze long non-coding RNA, microRNA and splice variants. Despite these limitations, we suggest that this data set represents an important future resource for exploration of candidate genes identified from QTL analyses being conducted in parallel with this study. Quantitative trait loci (QTL) mapping has identified nine QTL on seven chromosomes that have moderate to large effects on resistance (Vallejo et al., 2014a). We have focused on an Omy19 QTL (Wiens et al., 2013b) and confirmed inheritance in a subsequent generation (Vallejo et al., 2014b). By combining chromosomal position known for many of the genes within this dataset, we have begun to identify differentially regulated genes present on Omy19 and other chromosomes as proof of principle for this approach. Further studies are needed to validate and fine-map the BCWD QTL and these studies are currently underway. These complex patterns support a polygenic architecture of BCWD resistance and will serve as a reference dataset for identifying mechanisms associated with the genetic basis of disease resistance.

### **ACKNOWLEDGMENTS**

We acknowledge technical contributions from Travis Moreland and Joel Caren. Author contributions: conceived and designed the experiments: Yniv Palti, Mohamed Salem, Jianbo Yao, Gregory D. Wiens; performed the experiments: Gregory D. Wiens, Bam Paneru, Hao Ma, Mohamed Salem, Alvaro G. Hernandez; analyzed the data: David Marancik, Guangtu Gao, Gregory D. Wiens; contributed reagents/materials/analysis tools: GG; wrote the paper: David Marancik, Gregory D. Wiens. All authors reviewed and approved the manuscript. This work was supported by Agricultural Research Service CRIS Project 1930-32000-005 "Integrated Approaches for Improving Aquatic Animal Health in Cool and Cold Water Aquaculture," CRIS Project 1930-32000-009 "Utilization of Genomics for Improving Production Traits in Cool and Cold Water Aquaculture" and Agriculture and Food Research Initiative competitive grant no. 2012-67015-30217 from the USDA National Institute of Food and Agriculture to Gregory D. Wiens. The authors have no conflict of interest to declare. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. USDA is an equal opportunity employer.

## **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fgene.2014.00453/abstract

## **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

#### *Received: 24 October 2014; paper pending published: 24 November 2014; accepted: 11 December 2014; published online: 08 January 2015.*

*Citation: Marancik D, Gao G, Paneru B, Ma H, Hernandez AG, Salem M, Yao J, Palti Y and Wiens GD (2015) Whole-body transcriptome of selectively bred, resistant-, control-, and susceptible-line rainbow trout following experimental challenge with Flavobacterium psychrophilum. Front. Genet. 5:453. doi: 10.3389/fgene.2014.00453 This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2015 Marancik, Gao, Paneru, Ma, Hernandez, Salem, Yao, Palti and Wiens. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Optimizing the creation of base populations for aquaculture breeding programs using phenotypic and genomic data and its consequences on genetic progress

#### *Jesús Fernández<sup>1</sup>\*, Miguel Á. Toro<sup>2</sup>, Anna K. Sonesson<sup>3</sup> and Beatriz Villanueva<sup>1</sup>*

*<sup>1</sup> Departamento de Mejora Genética Animal, INIA, Madrid, Spain*

*<sup>2</sup> Departamento de Producción Animal, ETSI Agrónomos, Universidad Politécnica de Madrid, Madrid, Spain*

*<sup>3</sup> Nofima, Ås, Norway*

#### *Edited by:*

*Ross Houston, University of Edinburgh, UK*

#### *Reviewed by:*

*Jeff Silverstein, United States Department of Agriculture, Agricultural Research Service, USA*

*Ross Houston, University of Edinburgh, UK*

#### *\*Correspondence:*

*Jesús Fernández, Departamento de Mejora Genética Animal, INIA, Ctra. Coruña Km 7,5, 28040 Madrid, Spain e-mail: jmj@inia.es*

The success of an aquaculture breeding program critically depends on the way in which the base population of breeders is constructed since all the genetic variability for the traits included originally in the breeding goal as well as those to be included in the future is contained in the initial founders. Traditionally, base populations were created from a number of wild strains by sampling equal numbers from each strain. However, for some aquaculture species improved strains are already available and, therefore, mean phenotypic values for economically important traits can be used as a criterion to optimize the sampling when creating base populations. Also, the increasing availability of genome-wide genotype information in aquaculture species could help to refine the estimation of relationships within and between candidate strains and, thus, to optimize the percentage of individuals to be sampled from each strain. This study explores the advantages of using phenotypic and genome-wide information when constructing base populations for aquaculture breeding programs in terms of initial and subsequent trait performance and genetic diversity level. Results show that a compromise solution between diversity and performance can be found when creating base populations. Up to 6% higher levels of phenotypic performance can be achieved at the same level of global diversity in the base population by optimizing the selection of breeders instead of sampling equal numbers from each strain. The higher performance observed in the base population persisted during 10 generations of phenotypic selection applied in the subsequent breeding program.

**Keywords: base populations, aquaculture breeding programs, SNP markers, optimal contributions, fish genomics**

## **INTRODUCTION**

The success of an aquaculture breeding program critically depends on the way in which the base population of breeders is constructed (Hayes et al., 2006; Holtsmark et al., 2006, 2008a,b). In particular, all the genetic variability that can be used for selection on the traits initially included in the breeding objective is that found in the original breeders. Also, the decisions taken when creating the base population will have consequences on the genetic progress for any additional trait that may be part of future breeding goals, whether production- or fitness-related.

Traditionally, base populations in aquaculture breeding programs were created from wild or from domesticated strains not subjected to any formal selection program (see Gjedrem et al., 1991, for details on the formation of the base population in the Norwegian breeding program for Atlantic salmon and Eknath et al., 1993, 1998, for details on the formation of the base population in the GIFT breeding program for tilapia). In this context, Holtsmark et al. (2006, 2008a) investigated through computer simulation the effects of the number of strains contributing to the base population and the mating strategy (within and across strains) on the genetic gain achieved in the subsequent selection program. These studies assumed wild populations and no knowledge of the genetic structure (i.e., within and between strain diversity) or phenotypic levels for the trait of interest when setting up the base population. Consequently, the simulated strategy was randomly sampling the same number of individuals from each strain to form the base population. However, the solution leading to the highest level of diversity and acceptable levels of phenotypic performance when starting the breeding program surely will depart from equal proportions.

In the absence of genealogies (which could be the case even in improved commercial strains), molecular markers can be used to estimate relationships between and within populations. Hayes et al. (2006) compared random with molecular-based optimized selection of breeders in terms of the genetic variance captured for growth and for two disease traits in Atlantic salmon. However, the phenotypic level of candidates was not considered when sampling the breeders and the number of markers used was small (237 AFLP).

The present availability of large panels of SNPs makes marker diversity more informative on the global genetic variability in the genome than diversity computed from genealogical data (Gómez-Romano et al., 2013). Dense SNP chips have already been developed for Atlantic salmon (130K, Houston et al., 2014), rainbow trout (57K, Palti et al., 2014), catfish (250K, Liu et al., 2014) and carp (250K, Xu et al., 2014). This development of an increasing number of markers for aquaculture species will make it possible in the near future to have accurate measures of genetic relationships between and within strains. In this new scenario, the optimal number of individuals to be sampled from each strain could be calculated in a similar way as when optimizing the construction of mixed populations from different origins in conservation programs aimed at capturing the highest levels of genetic diversity (Eding and Meuwissen, 2001; Caballero and Toro, 2002; Eding et al., 2002).

Nowadays, genetically improved strains are already available for some aquaculture species. The use of improved rather than wild strains when creating base populations would allow the new breeding program to begin from higher phenotypic levels for the trait of interest, making it competitive from the start. The necessity of taking into account the phenotypic value of each candidate strain is even clearer when searching for new breeders to be included in an already established breeding program whose genetic variability has been greatly reduced. The increase in genetic variability should be achieved without compromising the gain in performance previously obtained through artificial selection.

The objective of this paper was to study, using computer simulations, the consequences of using genome-wide molecular information to compute genetic relationships and phenotypic records to optimize the sampling of individuals from different strains when creating base populations in aquaculture breeding programs. Different scenarios varying in the type of strains (wild or commercial), the degree of relationships within and between strains and the level of information (individual or strain information) were considered for generic aquaculture population designs. Results were compared in terms of phenotypic level for the trait of interest and the diversity achieved in the base population and in subsequent generations of selection.

## **MATERIALS AND METHODS**

## **GENOME STRUCTURE**

Diploid individuals were simulated with a genome comprising 20 chromosomes of 1 Morgan each. This genome length could be representative of the genomic architecture of the most cultivated fish species in aquaculture (which have genomes ranging from 10 to 35 Morgans). Each chromosome carried 25,000 neutral biallelic loci that will be referred to as "non-marker loci" hereafter. Additionally, 5–5000 biallelic markers per chromosome (equivalent to SNPs) were simulated interspersed with the neutral loci. Therefore, the total number of available markers ranged from 100 to 100,000 and the density of markers ranged from 5 to 5000 markers/M. All loci were evenly spaced within each chromosome. Markers were used to optimize the contributions of the different strains to the base population while non-marker loci (i.e., the 25,000 neutral loci per chromosome) were used to monitor the effect on genetic diversity of the different strategies evaluated (see below). With dense marker panels we expect that most non-marker loci in the genome will be in linkage disequilibrium (LD) with at least some of the markers and, thus, managing diversity using marker genotypes will lead to the maintenance of diversity in the rest of the genome.
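As a quick check of the figures above, the following sketch recomputes the total number of markers, the marker density and the average marker spacing for the two extreme panels. The function and constant names are illustrative and do not come from the original study.

```python
# Illustrative arithmetic only; names are hypothetical, values are taken from the text.
N_CHROMOSOMES = 20          # chromosomes in the simulated genome
CHROMOSOME_LENGTH_M = 1.0   # length of each chromosome in Morgans


def marker_panel_summary(markers_per_chromosome):
    """Return (total markers, markers per Morgan, average marker spacing in cM)."""
    total = markers_per_chromosome * N_CHROMOSOMES
    density = markers_per_chromosome / CHROMOSOME_LENGTH_M
    spacing_cm = 100.0 * CHROMOSOME_LENGTH_M / markers_per_chromosome
    return total, density, spacing_cm


for per_chromosome in (5, 5000):
    print(per_chromosome, marker_panel_summary(per_chromosome))
```

The sparsest panel therefore places one marker roughly every 20 cM, which helps explain why it tracks diversity at non-marker loci less closely than the densest panel.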

## **GENERATION OF CANDIDATES**

The creation of the candidates to contribute to the base population followed a two-step process. First, a large population in mutation-drift equilibrium was generated. Then, individuals were sampled from this population to form different strains which were allowed to diverge for several generations before being available for the selection of the individuals to be included in the base population.

## *Equilibrium population*

First, to obtain a realistic pattern of linkage disequilibrium, a large population (*N* = 1000 with equal number of males and females) under random selection was simulated for 1000 discrete generations. Each generation, sires and dams were sampled with replacement and population size was kept constant across generations. Initially, genotypes for the individuals were assigned at random independently for each locus (markers and non-markers). Consequently, the initial allelic frequencies were 0.5 for all loci and Hardy-Weinberg and linkage equilibria existed within and between loci, respectively. During this period, mutation was allowed to occur throughout the genome. The mutation rate per locus and generation was *μ* = 2.5 × 10<sup>−3</sup> for both types of loci (marker and non-marker). The number of new mutations simulated at every generation was sampled from a Poisson distribution with mean 2*Nn<sub>c</sub>μn<sub>l</sub>*, where *n<sub>c</sub>* is the number of chromosomes and *n<sub>l</sub>* is the total number of loci (markers and non-markers) per chromosome. Mutations were then randomly distributed across individuals, chromosomes and loci, switching allele 0 to allele 1 and vice versa. When generating the gametes, the number of crossovers per chromosome was drawn from a Poisson distribution with mean equal to 1. Crossovers were randomly distributed without interference. At the end of the process the expected heterozygosity of the population had already reached an equilibrium value.
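The gamete-formation and mutation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function names, the Poisson sampler (Knuth's method) and the representation of a chromosome as a list of 0/1 alleles are all choices made here for the example.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method (suitable for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def make_gamete(hap_a, hap_b, rng, mean_crossovers=1.0):
    """Recombine the two parental haplotypes of one chromosome into a gamete.

    The number of crossovers is Poisson distributed and breakpoints are placed
    uniformly and independently (no interference) between evenly spaced loci.
    """
    n_loci = len(hap_a)
    n_cross = poisson(mean_crossovers, rng)
    # A breakpoint b means that the copied strand switches just before locus b.
    breakpoints = sorted(rng.randrange(1, n_loci) for _ in range(n_cross))
    strands = (hap_a, hap_b)
    current = rng.randrange(2)  # random starting strand
    gamete, next_bp = [], 0
    for locus in range(n_loci):
        while next_bp < len(breakpoints) and breakpoints[next_bp] == locus:
            current = 1 - current
            next_bp += 1
        gamete.append(strands[current][locus])
    return gamete


def expected_mutations(n_individuals, n_chromosomes, mu, loci_per_chromosome):
    """Mean of the Poisson number of new mutations per generation: 2 * N * nc * mu * nl."""
    return 2 * n_individuals * n_chromosomes * mu * loci_per_chromosome


rng = random.Random(2014)
paternal, maternal = [0] * 10, [1] * 10
print(make_gamete(paternal, maternal, rng))
print(expected_mutations(1000, 20, 2.5e-3, 25005))
```

The second call uses the sparsest panel (25,000 non-marker loci plus 5 markers per chromosome) to show the order of magnitude of the mutation input per generation implied by the stated parameters.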

## *Creation of strains*

The second step of the process consisted in randomly sampling 10 different groups of individuals (mimicking 10 different strains) from the population at equilibrium. At this step, a quantitative trait with phenotypic mean (*μ*), initial phenotypic variance [*V<sub>P</sub>*(0)] and heritability [*h*<sup>2</sup>(0)] of 100, 30, and 0.4, respectively, was defined. The trait, measured in both sexes, was controlled by 1000 additive loci (hereafter called selective loci). These selective loci were chosen at random from the previously simulated loci (markers and non-markers). The additive effect of locus *i* (*a<sub>i</sub>*) was sampled from a normal distribution with mean 0 and variance *V<sub>A</sub>*(0)/[2*p*(1 − *p*)*n<sub>sel</sub>*], where *V<sub>A</sub>*(0) is the initial additive variance [*h*<sup>2</sup>(0)*V<sub>P</sub>*(0) = 12], *p* is the average frequency across selective loci and *n<sub>sel</sub>* is the number of selective loci (i.e., 1000). Note that, in this way, the expected additive variance summed over all loci equals *V<sub>A</sub>*(0), assuming that covariances between loci generated by LD are negligible. The phenotypic value for a particular individual *j* was obtained as

$$P_j = \mu + \sum_{i=1}^{n_{sel}} x_i a_i + e_j$$

where *x<sub>i</sub>* is an indicator variable that takes values 1, 0 or −1 for homozygous 11, heterozygous or homozygous 00, respectively, and *e<sub>j</sub>* is the individual environmental deviation that was sampled from a normal distribution with mean 0 and variance *V<sub>E</sub>* = *V<sub>A</sub>*(0)[1 − *h*<sup>2</sup>(0)]/*h*<sup>2</sup>(0). The environmental variance (*V<sub>E</sub>*) was initially calculated for each replicate and strain in order to ensure that all started with the same *h*<sup>2</sup>(0) value and was kept constant across generations.
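The trait model above can be illustrated with a short sketch. It assumes an average allele frequency of 0.5 at the selective loci and uses the nominal additive variance of 12, whereas the study computed the environmental variance per replicate and strain; all function names are hypothetical.

```python
import random


def environmental_variance(v_a, h2):
    """V_E = V_A * (1 - h2) / h2, so that V_A / (V_A + V_E) equals h2."""
    return v_a * (1.0 - h2) / h2


def sample_effects(n_sel, v_a, mean_freq, rng):
    """Draw additive effects a_i from N(0, V_A / [2 p (1 - p) n_sel])."""
    sd = (v_a / (2.0 * mean_freq * (1.0 - mean_freq) * n_sel)) ** 0.5
    return [rng.gauss(0.0, sd) for _ in range(n_sel)]


def phenotype(x, effects, v_e, rng, mu=100.0):
    """P_j = mu + sum_i x_i * a_i + e_j, with genotype codes x_i in {-1, 0, 1}."""
    genetic = sum(xi * ai for xi, ai in zip(x, effects))
    return mu + genetic + rng.gauss(0.0, v_e ** 0.5)


rng = random.Random(42)
V_A, H2, N_SEL = 12.0, 0.4, 1000
v_e = environmental_variance(V_A, H2)
# Assumption for this example: average frequency p = 0.5 at the selective loci.
effects = sample_effects(N_SEL, V_A, mean_freq=0.5, rng=rng)
# Hypothetical individual with genotype codes in Hardy-Weinberg proportions for p = 0.5.
x = [rng.choice((-1, 0, 0, 1)) for _ in range(N_SEL)]
print(round(v_e, 2), round(phenotype(x, effects, v_e, rng), 2))
```

With the stated parameters the environmental variance is 18, which together with the additive variance of 12 recovers the phenotypic variance of 30.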

Once the strains were created, they were allowed to diverge during 20 discrete generations (with constant population size) under three different regimes: (i) random selection; (ii) artificial selection with different selection pressures in order to mimic already improved strains; and (iii) stabilizing selection with different optima in order to mimic wild strains with local adaptations in nature. Four different scenarios varying in the type of strains they included were then defined:


$$W(P) = \exp\left(-\frac{\left(P - P_{\rm opt}\right)^2}{2\omega^2}\right)$$

where *P<sub>opt</sub>* is the optimum phenotype in a particular environment and *ω*<sup>2</sup> is an inverse measure of the strength of stabilizing selection. For some strains *P<sub>opt</sub>* was set to the original mean (100), for others the optimum was set to a lower value (90) and for others the optimum was set to a higher value (110). Selection pressure was relatively strong (*ω*<sup>2</sup> = 5) and population size was equal in all strains (20 males and 20 females).

(4) Mixed Scenario. The set comprised strains of the three types described above (i.e., randomly selected, under directional selection and under stabilizing selection).

**Table 1** summarizes the specific parameters used when creating the 10 strains under each scenario. The sizes of the strains simulated during the 20 generations were rather small in order to force a rapid divergence between them and, thus, real differences in relatedness and phenotypic levels at the end of the differentiation stage. After this period, subpopulations were expanded in order to have a large number of candidates to form the base population. Specifically, the population size of each strain increased to 50 males and 50 females in a single generation of random selection and mating.
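The Gaussian fitness function used for the strains under stabilizing selection can be written as a small helper. How fitness was translated into the choice of parents is not described in this section, so the fitness-proportional sampling in `sample_parents` is an assumption made for this sketch, and both function names are hypothetical.

```python
import math
import random


def stabilizing_fitness(p, p_opt, omega2=5.0):
    """Gaussian fitness W(P) = exp(-(P - P_opt)^2 / (2 * omega^2))."""
    return math.exp(-((p - p_opt) ** 2) / (2.0 * omega2))


def sample_parents(phenotypes, p_opt, n_parents, rng, omega2=5.0):
    """Sample parent indices with replacement, proportionally to fitness (assumed scheme)."""
    weights = [stabilizing_fitness(p, p_opt, omega2) for p in phenotypes]
    return rng.choices(range(len(phenotypes)), weights=weights, k=n_parents)


rng = random.Random(7)
phenotypes = [rng.gauss(100.0, 30.0 ** 0.5) for _ in range(40)]
print(round(stabilizing_fitness(105.0, 110.0), 4))
print(sample_parents(phenotypes, p_opt=110.0, n_parents=5, rng=rng))
```

Fitness equals 1 at the optimum and falls off quickly with ω<sup>2</sup> = 5, which is consistent with the description of the selection pressure as relatively strong.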

## **FOUNDATION OF THE BASE POPULATION**

From the 1000 available candidates (50 males and 50 females from each strain) for each particular scenario, base populations were constructed by selecting 100 males and 100 females following different strategies: (i) taking at random equal numbers of individuals from each strain (strategy E); (ii) determining optimal strain proportions for maximizing the expected heterozygosity (*H<sub>e</sub>*) calculated from the mean coancestry values within and between strains (strategy MC); (iii) determining optimal strain proportions for maximizing the mean (strain) phenotypic value with a restriction on coancestry (strategy MP); (iv) as in (ii) but using individual relationships instead of strain means (strategy IC); and (v) as in (iii) but maximizing individual values of the selected individuals instead of strain means (strategy IP). Note that, in these abbreviations, M and I stand for mean and individual information, respectively, C indicates that the objective is just to minimize coancestry (i.e., maintain high levels of *H<sub>e</sub>*) and P indicates that the objective is to create base populations with a high phenotypic level.

**Table 1 | Parameters used to generate each strain for the four different scenarios simulated.**

*Size and Numbers selected refer to the number of individuals per sex.*

Strategy E is equivalent to that used by Holtsmark et al. (2006, 2008a,b) and provides a reference point for comparisons. Strategy MC followed the methodology presented by Eding and Meuwissen (2001) and Caballero and Toro (2002). The particular objective function to minimize was

$$\sum_{i=1}^{N_s}\sum_{j=1}^{N_s} c_i c_j f_{ij}$$

where *N<sub>s</sub>* is the number of strains, *c<sub>i</sub>* is the proportion of individuals to be sampled from strain *i* and *f<sub>ij</sub>* is the mean coancestry coefficient between strains *i* and *j* calculated from marker genotypes. Contributions were forced to be in the interval [0,1] and to sum up to 1.
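The strategy MC objective can be sketched in a few lines. The marker-based coancestry estimator is not spelled out in this section; the identity-in-state definition used in `molecular_coancestry` below is a common choice and is an assumption here, as are the function names and the toy coancestry matrix.

```python
def molecular_coancestry(g_i, g_j):
    """Mean probability that alleles drawn at random from two individuals are identical in state.

    Genotypes are coded as the count of allele 1 (0, 1 or 2) at each biallelic marker.
    """
    total = 0.0
    for a, b in zip(g_i, g_j):
        total += (a * b + (2 - a) * (2 - b)) / 4.0
    return total / len(g_i)


def weighted_coancestry(c, f):
    """Objective of strategy MC: sum_i sum_j c_i * c_j * f_ij."""
    n = len(c)
    return sum(c[i] * c[j] * f[i][j] for i in range(n) for j in range(n))


# Hypothetical two-strain example: mean coancestry within (diagonal) and between strains.
f = [[0.6, 0.5], [0.5, 0.7]]
c = [0.5, 0.5]
f_bar = weighted_coancestry(c, f)
# Expected heterozygosity of the mixture is taken as one minus its mean coancestry.
print(round(f_bar, 4), round(1.0 - f_bar, 4))
```

Minimizing the weighted coancestry is therefore equivalent to maximizing the expected heterozygosity of the pooled base population.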

Strategy MP searched for the solution that maximized the objective function

$$\sum_{i=1}^{N_s} c_i P_i$$

where *P<sub>i</sub>* is the mean phenotype of strain *i*, while imposing the restriction

$$\sum_{i=1}^{N_s}\sum_{j=1}^{N_s} c_i c_j f_{ij} \le C_E$$

where *C<sub>E</sub>* is the mean coancestry of the base population obtained under strategy E. The restrictions imposed in MC were also applied to MP. In the two strategies relying on strain mean values (MC and MP) the actual number of individuals to be sampled from each strain was obtained by multiplying *c<sub>i</sub>* by the total number of individuals to be selected and rounding to the nearest even integer. This procedure was implemented to ensure that half of the individuals from each strain were males and half females.
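The following toy example illustrates the trade-off behind strategy MP for two hypothetical strains. A grid search stands in for the optimizer used in the study, the coancestry and phenotype values are invented for illustration, and all function names are assumptions.

```python
def weighted_coancestry(c, f):
    """Mean coancestry of the mixture: sum_i sum_j c_i * c_j * f_ij."""
    n = len(c)
    return sum(c[i] * c[j] * f[i][j] for i in range(n) for j in range(n))


def mean_phenotype(c, p):
    """Objective of strategy MP: sum_i c_i * P_i."""
    return sum(ci * pi for ci, pi in zip(c, p))


def best_two_strain_mix(p, f, c_e, steps=100):
    """Grid search over c_1 = k / steps and c_2 = 1 - c_1 (toy stand-in for the optimizer)."""
    best = None
    for k in range(steps + 1):
        c = [k / steps, 1.0 - k / steps]
        if weighted_coancestry(c, f) <= c_e:      # restriction on coancestry
            value = mean_phenotype(c, p)
            if best is None or value > best[1]:
                best = (c, value)
    return best


def strain_counts(c, n_total):
    """Head counts per strain, each rounded to the nearest even integer."""
    return [2 * round(ci * n_total / 2.0) for ci in c]


f = [[0.6, 0.5], [0.5, 0.7]]               # hypothetical strain coancestries
p = [108.0, 100.0]                         # hypothetical strain mean phenotypes
c_e = weighted_coancestry([0.5, 0.5], f)   # coancestry under equal contributions (strategy E)
c_opt, p_opt = best_two_strain_mix(p, f, c_e)
print(c_opt, round(p_opt, 2), strain_counts(c_opt, 200))
```

In this example the better-performing strain can contribute well above half of the breeders without exceeding the coancestry obtained under equal contributions. Rounding each count to an even integer does not by itself guarantee that the counts sum to the target; how such cases were handled is not stated in this section.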

The objective under strategies IC and IP was to minimize

$$\sum_{i=1}^{N}\sum_{j=1}^{N} x_i x_j f_{ij}$$

and to maximize

$$\sum_{i=1}^{N} x_i P_i$$

respectively. Here *N* is the total number of candidates (i.e., all individuals from every strain), *f<sub>ij</sub>* is the coancestry between individuals *i* and *j*, *P<sub>i</sub>* is the phenotype of individual *i* and *x<sub>i</sub>* is an indicator variable that takes a value of 1 if individual *i* is to be selected and 0 otherwise. The sum of the *x<sub>i</sub>* for males and for females was forced to be equal to the number of individuals to be selected for creating the base population (i.e., 100 of each sex). Strategy IP also included a restriction to guarantee that solutions had a global coancestry lower than or equal to that obtained under strategy E. All the optimizations were performed using "simulated annealing" algorithms (Kirkpatrick et al., 1983).
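A minimal simulated annealing sketch for the strategy IC objective is given below. Only the algorithm family is named in the text; the swap move, the geometric cooling schedule, the parameter values and the function names are assumptions made here, and the sex constraint and the phenotype objective of strategy IP are left out for brevity.

```python
import math
import random


def group_coancestry(selected, f):
    """Sum of f_ij over all ordered pairs (self-pairs included) of selected individuals."""
    return sum(f[i][j] for i in selected for j in selected)


def anneal_min_coancestry(f, n_select, rng, n_steps=2000, t_start=1.0, cooling=0.995):
    """Pick n_select individuals (n_select < len(f)) with low group coancestry."""
    n = len(f)
    current = rng.sample(range(n), n_select)
    current_cost = group_coancestry(current, f)
    best, best_cost = list(current), current_cost
    temperature = t_start
    for _ in range(n_steps):
        # Propose swapping one selected individual for one unselected individual.
        unselected = [i for i in range(n) if i not in current]
        candidate = list(current)
        candidate[rng.randrange(n_select)] = rng.choice(unselected)
        cost = group_coancestry(candidate, f)
        delta = cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
        temperature *= cooling
    return sorted(best), best_cost


# Hypothetical candidates: two families of three sibs, unrelated between families.
family = [0, 0, 0, 1, 1, 1]
f = [[0.5 if i == j else (0.25 if family[i] == family[j] else 0.0)
      for j in range(6)] for i in range(6)]
print(anneal_min_coancestry(f, n_select=2, rng=random.Random(11)))
```

In this toy case the lowest coancestry is obtained by taking one individual from each family, which is the kind of solution that individual-level information makes available and strain means do not.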

It must be emphasized that in strategies MC and MP the output of the optimizations is the proportion of individuals to be taken from each strain but the specific individuals are sampled at random. This could be the situation when prior knowledge of the mean genetic relationship or performance of the strains is available but individual information for the candidates is not. In strategies IC and IP the outputs are the particular individuals to be selected as we assume that their genotypes and phenotypic values are known.

As described above, restrictions imposed under strategies MP and IP were set to the level of coancestry (calculated on the markers) obtained under strategy E. Thus, these three strategies could be compared in terms of phenotypic level at the same diversity level.

#### **ARTIFICIAL SELECTION**

To explore the consequences of each strategy for creating the base population on genetic gain and diversity in the subsequent breeding program, 10 generations of artificial selection were simulated for each combination of marker density, type of available strains and strategy. At generation *t* = 0 (i.e., base population), founders were mated at random to form 100 families and 10 offspring were obtained from each couple. Consequently, 1000 individuals were available as candidates for selection. The 100 males and 100 females with the highest phenotypic value for the simulated trait were selected to produce generation *t* + 1 (i.e., phenotypic truncation selection was conducted). Selected individuals were mated at random and, again, 10 offspring were generated from each couple. Therefore, the proportion of selected individuals was 20%, which corresponds to a selection intensity of 1.4.
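The selection intensity quoted above follows from the standard formula for truncation selection on a normally distributed trait, i = z/p, where z is the height of the standard normal density at the truncation point and p is the proportion selected. The sketch below checks this value and shows a simple truncation step; the function names are illustrative, and in the study selection was applied separately within each sex.

```python
from statistics import NormalDist


def selection_intensity(proportion_selected):
    """Standardized selection differential i = z / p for truncation selection."""
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - proportion_selected)
    return std_normal.pdf(threshold) / proportion_selected


def truncation_select(phenotypes, n_select):
    """Return the indices of the n_select individuals with the highest phenotype."""
    ranked = sorted(range(len(phenotypes)), key=lambda k: phenotypes[k], reverse=True)
    return ranked[:n_select]


print(round(selection_intensity(200 / 1000), 3))
print(truncation_select([101.2, 97.5, 110.3, 104.8, 99.0], 2))
```

This value is the large-population approximation; with a finite number of candidates the realized intensity is slightly lower.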

#### **VARIABLES FOR COMPARISON**

In the base population, comparisons between strategies were made in terms of the mean phenotypic value (*P̄*) and *H<sub>e</sub>* of the group of selected individuals. Note that *H<sub>e</sub>*, calculated on the non-marker loci, is a measure of the genetic variation of the population and its ability to adapt to new environments. The contributions of strains to the base population, measured as the proportion of breeders selected from each of the strains, were also considered in the comparisons.

In the artificial selection step, the strategies were also compared in terms of mean breeding value (*BV*) and additive variance (*V<sub>A</sub>*) for the target trait, genealogical inbreeding (*F*) and coancestry (*f*) coefficients and rates of gain, inbreeding (Δ*F*) and coancestry (Δ*f*). For the computation of inbreeding and coancestry, founders in the base population were assumed to be unrelated and non-inbred. In all scenarios, values presented are averages of 100 replicates.
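Two of the comparison variables can be computed with the usual textbook definitions, sketched below. The section does not give the exact formulas used, so the expressions for expected heterozygosity at biallelic loci and for the per-generation rate of inbreeding or coancestry are assumptions based on standard practice, and the function names are hypothetical.

```python
def expected_heterozygosity(allele_freqs):
    """Average of 2p(1 - p) over biallelic loci, with p the frequency of one allele."""
    return sum(2.0 * p * (1.0 - p) for p in allele_freqs) / len(allele_freqs)


def rate_of_change(previous, current):
    """Per-generation rate, e.g. delta F = (F_t - F_{t-1}) / (1 - F_{t-1})."""
    return (current - previous) / (1.0 - previous)


# Hypothetical allele frequencies at four non-marker loci in a group of breeders.
print(round(expected_heterozygosity([0.5, 0.2, 0.9, 1.0]), 4))
# Hypothetical mean inbreeding in two consecutive generations.
print(round(rate_of_change(0.10, 0.19), 4))
```

Expressing the increase relative to the remaining non-inbred fraction keeps the rate comparable across generations.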

### **RESULTS**

#### **CONTRIBUTIONS OF STRAINS TO THE BASE POPULATION**

The proportional contribution of each available strain to the base population is shown in **Table 2** for the most extreme marker densities, the four scenarios and the five strategies simulated. In general, the observed patterns were the same when using high or low density of markers. Particular differences in performance due to marker density are highlighted below.

In the Drift scenario no meaningful differences in contributions were observed between strategies MC, MP, and IC for a particular strain. This was due to the fact that no selection on the quantitative trait was exerted during the generation of the strains and, therefore, the phenotypic mean was equal to the initial value before the divergence period (i.e., 100) for all strains. However, differences arose between strains. The larger the population size of a particular strain, the higher was its contribution. Strains with a small size (strains 1, 2, and 3) had the lowest contributions due to the large loss of genetic diversity during the divergence period. Therefore, these strains were less useful for increasing the amount of diversity stored in the synthetic base population. The opposite happened with large strains (7, 8, 9, and 10), which contributed more than 14% each to the base population (see Drift scenario in **Table 2**). The lowest variance of contributions between strains (beyond strategy E) was found under strategy IP. This could be explained by the fact that strong drift in small populations could result in the existence of individuals with extremely high phenotypes. Thus, as the main objective of strategy IP is achieving a high phenotypic level in the base population, it would be worthwhile to keep individuals not only from large but also from small strains. In this situation a high phenotypic level can be obtained without reducing too much the diversity maintained.

*Contributions for strategies MC, MP, IC, and IP are given as deviations from those of strategy E. The variance of contributions across strains is given in italics. E, equal numbers from each strain; MC, minimization of coancestry based on mean strain values; MP, maximization of phenotype with restriction on coancestry based on mean strain values; IC, minimization of coancestry based on individual values; IP, maximization of phenotype with restriction on coancestry based on individual values.*

In the Selection scenario, the variance of contributions across strains was much lower than in the Drift scenario (1.3–6.2 vs. 12.6–16.7; **Table 2**) except for strategy IP with low marker density. This is a consequence of the greater uniformity between strains, at least for the genetic variability of the trait. In the Selection scenario all strains had been under directional selection for the same trait and all had the same size. The only difference between strains was the strength of selection. Those under the weakest selection pressure (8, 9, and 10) had a higher effective population size (*N<sub>e</sub>*), maintained higher levels of genetic diversity and, thus, on average contributed more to the base population. Strains subjected to a strong selection pressure were those with the lowest contributions, even under the strategies directed to keep high levels of trait performance (see strains 1 and 2, Selection scenario in **Table 2**). This is due to the relatively long time period since the separation of the strains (20 generations). The small *N<sub>e</sub>* induced by the selection pressure rapidly erodes not only neutral variability but also the genetic variance of the trait. Therefore, strains 1 and 2 reached a selection limit before the rest and presented lower phenotypic mean values at the time the base population was created (data not shown). Consequently, they can contribute much neither to diversity nor to trait value.

In the Stabilizing scenario, when the criteria for choosing individuals to constitute the base population included considerations on the phenotypic performance (i.e., strategies MP and IP) the contribution of a strain was proportional to its phenotypic mean (**Table 2**). Contrarily, contributions were almost equalized when the only concern was to keep the highest levels of diversity (strategies MC and IC), given that all strains had an identical population size and a similar selection pressure in the divergence period. An interesting observation was that the variance of contributions across strains greatly decreased for strategies MP and IP when a large panel of SNPs (100,000) was used (see lower section of **Table 2**). With dense genotyping, diversity at selective loci is tightly linked to neutral diversity and, thus, groups of individuals with high phenotype will also have low diversity at the markers. Therefore, optimal solutions include the selection of fewer individuals from the same high performance strain to cope with the restriction on genetic diversity. The lower diversity of high performance groups of individuals is not detected with sparse marker coverage as diversity at selective loci is loosely linked to neutral diversity.

The Mixed scenario included all kinds of strains and, thus, the performance was somewhat more complex. Nevertheless, the general patterns highlighted before (i.e., dependence of contributions on the historical size, the intensity of selection and the optimal phenotypic value, respectively) are still observable (see right part of **Table 2**). A particular observation was that the contribution of strains adapted to a low phenotypic value (i.e., strains created under stabilizing selection with a low optimum) was no longer required even with massive genotyping. Also, it was observed that under strategies MP and IP the low diversity of high performance individuals may be compensated by the high diversity of the "drift" strains with reasonable phenotypic levels.

## **DIVERSITY AND PHENOTYPIC LEVEL IN THE BASE POPULATION**

The mean phenotypic value for the trait of interest and the genetic diversity captured in the group of individuals forming the base population are presented in **Table 3**. It must be pointed out that all strategies except E are explicitly concerned with the maintenance of diversity. However, only MP and IP strategies included considerations about the phenotypic level in the objective function to optimize.

Using strategy IP with a large number of markers in the Mixed scenario led to a base population with mean phenotypic value for the trait 7% higher than under strategy E at the same diversity. The advantage of strategy IP over E in other scenarios ranged from about 4% (Selection scenario) to 6% (Drift scenario), being proportional to the degree of differentiation between strains. This is a logical result if we realize that the margin for improvement with unequal contributions was lower when strains were more similar (for example in the Selection scenario). When relying on the information of few markers, IP led to higher trait means than with dense genotyping (left part of **Table 3**) because diversity at selective loci and at the markers was more loosely linked. Then, groups of individuals with high phenotypic mean for the trait can be found also showing high levels of diversity at the markers and, thus, coping with the restrictions in the optimization. However, the global diversity at non-marker loci will be low (as stated before) yielding solutions that do not achieve the intended balance between phenotype and diversity (right part of **Table 3**).

In general, the higher the number of markers used to estimate relationships the higher the *H<sub>e</sub>* retained in the base population. However, the differences in *H<sub>e</sub>* with different marker densities were small. The largest difference occurred in the Mixed scenario under strategy IP (3% higher *H<sub>e</sub>* when using 100,000 markers instead of 100 markers). Improvements in the level of *H<sub>e</sub>* maintained under IC, which should be the most efficient strategy in terms of diversity captured, were never larger than 1% (**Table 3**).

The genetic diversity maintained in the base population under different strategies was very similar across scenarios. Except for some cases with low number of markers, the strategy capturing the highest levels of neutral *H<sub>e</sub>* was IC, because no other factor but diversity was included in the objective and decisions were taken on the genotypes of the individual candidates and not on the mean strain values. The advantage of this strategy compared to sampling equal numbers of individuals from each strain (strategy E) ranged from 0.4% (in the Selection scenario) to 1.1% (in the Mixed scenario). This result was obtained because the Selection scenario and Mixed scenario present the highest and the lowest degree of homogeneity between the available strains, respectively.

When using the average strain values (i.e., all individuals from the same strain assumed equivalent) levels of diversity obtained under MC were always lower than with IC although, as stated before, differences were small. With increasing number of markers differences diminished and, eventually, disappeared (see **Table 3**, Stabilizing scenario).

**Table 3 | Average phenotypic value and expected heterozygosity (in percentage) under different strategies to select individuals for the base population for the four scenarios considered and for different number of markers (***nm***).**

*E, equal numbers from each strain; MC, minimization of coancestry based on strain values; MP, maximization of phenotype with restriction on coancestry based on strain values; IC, minimization of coancestry based on individual values; IP, maximization of phenotype with restriction on coancestry based on individual values. Standard errors of phenotypic value ranged from 0.13 to 0.25 and those for expected heterozygosity were lower than 0.01%.*

Strategies MP and IP were intended to select individuals with high phenotypic performance while keeping the same level of diversity as E. When using a low number of markers these strategies did not fulfill the restriction (i.e., *He* was lower under MP or IP than under E; see **Table 3**) because the restriction was introduced in the formulation through the molecular coancestry calculated from the markers, whereas diversity results were obtained from the non-marker genotypes. With the largest panel (i.e., 100,000 SNPs) IP did maintain the same diversity level as E but MP maintained slightly lower values because of the random sampling of individuals within strains.
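The restriction above is expressed through molecular coancestry computed from marker genotypes. The sketch below shows one common way to compute it for biallelic markers (identity by state of alleles drawn at random from two individuals) and checks the identity that one minus the average coancestry over all pairs, self-pairs included, equals the expected heterozygosity of the group. The genotype coding and toy data are assumptions for illustration only.

```python
def molecular_coancestry(gi, gj):
    """Mean over loci of the probability that two alleles, one drawn at
    random from each individual, are identical by state (biallelic loci,
    genotypes coded as 0, 1 or 2 copies of the reference allele)."""
    total = 0.0
    for a, b in zip(gi, gj):
        pa, pb = a / 2.0, b / 2.0
        total += pa * pb + (1.0 - pa) * (1.0 - pb)
    return total / len(gi)


def mean_coancestry(genotypes):
    """Average coancestry over all pairs of individuals, self-pairs included."""
    n = len(genotypes)
    s = sum(molecular_coancestry(gi, gj) for gi in genotypes for gj in genotypes)
    return s / (n * n)


# Hypothetical toy data: 4 individuals x 3 loci.
toy = [
    [0, 1, 2],
    [1, 1, 2],
    [2, 1, 2],
    [1, 1, 2],
]
# Expected heterozygosity recovered from the average coancestry.
he_from_f = 1.0 - mean_coancestry(toy)
```

Because the identity only holds for the loci that were genotyped, a restriction met at the markers says nothing certain about non-marker loci, which is the behaviour described above for sparse panels.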

## **GENETIC GAIN AND INBREEDING FROM THE BREEDING PROGRAM**

The capability to respond to artificial selection depends on the amount of additive genetic variance (*VA*) present for the trait. **Figure 1** shows *VA* along the 10 generations of phenotypic truncation selection for all scenarios and strategies used to construct the base population. The highest initial *VA* corresponded to the Mixed scenario as this was the most heterogeneous scenario in terms of types of available strains. The order for the rest of scenarios was Stabilizing, Drift and Selection, following thus the same pattern as that observed for general variability described in the previous section. Within each scenario, patterns of *VA* for each strategy were very similar. The highest values were observed for strategy E and the lowest for strategy IP. A large decrease in *VA* was observed in early generations in all scenarios due to the Bulmer effect (Falconer and Mackay, 1996).

Higher *VA* values in the founders of the breeding program turned into higher initial responses to selection (right panels in **Figure 2**). After the initial generations of selection, gain increased at a lower rate and, after the 10 generations observed values were, in general, inversely related to the initial *VA* (i.e., scenarios with higher initial *VA* maintained lower BV gain for the whole period of selection).

The mean breeding value (BV) of the base population was higher for those scenarios including already selected strains (i.e., Selection and Mixed) and close to zero for the other two scenarios (left panels in **Figure 2**). Irrespective of the differences in the rate of gain for each scenario pointed out before, the rank of scenarios in terms of BV remained the same for the 10 generations of selection.

For a particular scenario, strategy IP always provided the highest BVs (**Figure 2**), even though it generally started from the lowest *VA* (**Figure 1**). When the number of markers used in the creation of the base population was small, the advantage (in terms of mean BV) of strategy IP was greater than that presented in **Figure 2** (data not shown) due to the higher initial differences in trait performance relative to the other strategies, already shown when constructing the base population. But, as discussed before, strategy IP also resulted in lower *He* values (**Table 3**). In all scenarios, strategies MC and IC performed almost identically for *VA*, BV and gain, especially when using a large panel of markers (left panel in **Figure 1** and both in **Figure 2**). Strategy E yielded similar results to those from MC and IC except for the Mixed scenario at early generations, where the average BV was higher for E (left panels in **Figure 2**). However, at the end of the 10 generations of selection the average BV for E, MC, and IC equalized.

It must be highlighted that the trait under selection was simulated with an additive gene action within and across loci. This is the reason for a continuous decay of *VA* for the trait and the corresponding decrease of genetic gain between consecutive generations (**Figures 1**, **2**, respectively). In traits with an important non-additive component the selection process may generate new additive variance which could lead to the maintenance of levels of response to selection larger than expected under a pure additive model.
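The decay of *VA* under a purely additive model can be illustrated with the textbook expression for the genic variance, the sum over loci of 2pqa². This is only a sketch under stated assumptions: Hardy-Weinberg proportions, no linkage disequilibrium (so the Bulmer effect is deliberately ignored), and hypothetical allele frequencies and effects that are not taken from the simulations.

```python
def genic_additive_variance(freqs, effects):
    """Sum over loci of 2 * p * (1 - p) * a^2.

    freqs: frequency p of the favourable allele at each locus.
    effects: additive effect a of one allele copy at each locus.
    Assumes Hardy-Weinberg proportions and no linkage disequilibrium.
    """
    return sum(2.0 * p * (1.0 - p) * a * a for p, a in zip(freqs, effects))


# Hypothetical two-locus trait before and after selection has pushed
# the favourable alleles from intermediate to high frequency.
effects = [1.0, 1.0]
va_start = genic_additive_variance([0.5, 0.5], effects)
va_later = genic_additive_variance([0.9, 0.9], effects)
```

As favourable alleles move toward fixation the 2pq term shrinks, so the variance available for further response shrinks with it.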

The Drift scenario started from the highest *He* levels and also showed the lowest rate of loss of diversity along the breeding program (right panels in **Figure 1**). On the other hand, the fastest decrease in diversity was observed in the Mixed scenario and this was related to the large initial responses obtained under this scenario.

For all scenarios, populations arising from strategies accounting for the phenotypic level of the founders (i.e., MP and IP) lost more *He* during the 10 generations of selection than strategies aiming just at keeping diversity (right panels in **Figure 1**). This could be due to the fact that in groups of selected individuals the genetic variance for the trait would be more correlated to the global genetic diversity under MP and IP strategies than in the other strategies. Consequently, during the breeding program the reduction in *VA* inherent to the selection process also implies greater reductions in *He* across generations. When initial breeders were chosen based on the genotypes for few markers, populations obtained following the IC strategy maintained higher levels of diversity along the generations of selection than when using MC (data not shown), but when a large panel of SNPs was used the performance of both strategies was similar (right panels in **Figure 1**).

As mating was at random throughout the selection process, average inbreeding (*F*) and coancestry (*f*) coefficients run in parallel, with the expected lag for *F*. Therefore, only results for *f* are shown. Especially for the Mixed scenario (and to a lesser extent for the Stabilizing scenario) *f* was higher for strategies MC and IC than for the rest (left panels in **Figure 3**). The reason is that strategies MC and IC keep higher numbers of "low performance" individuals whose descendants will not be selected, leading to higher *f* and, thus, to lower *Ne* than expected. This effect was not detectable in the Selection and Drift scenarios due to the higher homogeneity between strains and individuals for the phenotypic level of the trait.

Irrespective of the scenario, at the beginning of the breeding program there was an increase in the rate of coancestry, Δ*f* (right panels in **Figure 3**), that was due to the removal of individuals with low genetic BV for the trait. Afterwards, Δ*f* stabilized in the Drift and Selection scenarios around 0.4%. This figure is higher than the expected rate (Δ*f* = 0.25%) for a randomly selected population of size 200 (i.e., the number of selected individuals each generation; Woolliams and Bijma, 2000) because between-family selection occurs. In the Stabilizing and Mixed scenarios, Δ*f* decreased monotonically, reaching levels closer to 0.25%.
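The expected rate of 0.25% quoted above is consistent with the standard relation 1/(2N) for N = 200 randomly selected parents. The short sketch below reproduces that arithmetic and, using the inverse relation, converts the observed rate of about 0.4% into an effective population size. The function names are hypothetical and the conversion is the textbook approximation, not a calculation reported in the study.

```python
def expected_rate_of_coancestry(n_parents):
    """Expected per-generation rate under random selection: 1 / (2N)."""
    return 1.0 / (2.0 * n_parents)


def effective_size(rate):
    """Effective population size implied by a per-generation rate: 1 / (2 * rate)."""
    return 1.0 / (2.0 * rate)


rate_random = expected_rate_of_coancestry(200)  # 200 selected individuals
ne_observed = effective_size(0.004)             # observed rate of about 0.4%
```

Under these assumptions a rate of 0.4% corresponds to an effective size of roughly 125, below the 200 individuals actually selected each generation.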

## **DISCUSSION**

The present study has shown that the use of phenotypic information of the candidate strains and the use of genome-wide marker information to infer relationships within and between strains can help to optimize the proportion of individuals to be sampled from each strain when creating base populations for breeding programs in aquaculture species. The advantage of using this information is reflected in the phenotypic and breeding values obtained at the beginning of the program and in the genetic diversity captured by the base population. The advantage remains during the subsequent generations of selection, making the breeding program more profitable. The study has not been designed for a particular species. Instead, we have considered a general genome architecture and a population structure that fit most aquaculture species.

**expected heterozygosity for the non-marker loci along the generations of selection.** Results shown correspond to base populations obtained using 100,000 markers. E, equal numbers from each strain; MC, minimize strain-based coancestry; MP, maximize strain-based phenotypic value with a restriction on coancestry; IC, minimize individual coancestry; and IP, maximize individual phenotypic value with a restriction on coancestry.

Traditionally, when creating base populations for the establishment of a breeding program, no information was available about the genetic relationships between candidate strains or between individuals within strains. Consequently, the usual strategy was to collect an equal number of individuals from as many strains as possible (Holtsmark et al., 2006, 2008a,b). However, the increasing number of molecular markers developed for aquaculture species provides us with the opportunity to estimate genetic relationships within and between strains and to optimize the contribution of each strain. In this study it has been shown that strains harboring low levels of genetic diversity should contribute fewer individuals in order to maximize the global diversity of the base population.
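The idea of optimizing strain contributions can be sketched as minimizing the quadratic form c'Fc, where c holds the proportion contributed by each strain and F the within-strain (diagonal) and between-strain (off-diagonal) coancestries. The solver below is a simple pairwise-exchange scheme chosen for brevity, not the optimizer used in the study, and the coancestry values are hypothetical.

```python
def group_coancestry(c, F):
    """Quadratic form c'Fc: expected coancestry of the pooled population."""
    n = len(c)
    return sum(c[i] * F[i][j] * c[j] for i in range(n) for j in range(n))


def minimize_coancestry(F, sweeps=200):
    """Minimize c'Fc subject to sum(c) == 1 and c >= 0.

    Mass is moved between two strains at a time using the exact optimal
    step for the quadratic, clipped so no contribution becomes negative.
    """
    n = len(F)
    c = [1.0 / n] * n  # start from equal contributions
    for _ in range(sweeps):
        for i in range(n):
            for j in range(i + 1, n):
                curvature = F[i][i] + F[j][j] - 2.0 * F[i][j]
                if curvature <= 0.0:
                    continue
                # (F c)_i - (F c)_j is half the slope along e_i - e_j.
                fi = sum(F[i][k] * c[k] for k in range(n))
                fj = sum(F[j][k] * c[k] for k in range(n))
                t = -(fi - fj) / curvature
                t = max(-c[i], min(c[j], t))
                c[i] += t
                c[j] -= t
    return c


# Hypothetical strain-level coancestries: strain 0 has low diversity
# (high within-strain coancestry), strains 1 and 2 are more diverse.
F = [
    [0.30, 0.05, 0.05],
    [0.05, 0.10, 0.05],
    [0.05, 0.05, 0.10],
]
c_opt = minimize_coancestry(F)
f_equal = group_coancestry([1.0 / 3.0] * 3, F)
f_opt = group_coancestry(c_opt, F)
```

With these values the low-diversity strain ends up contributing about 9% of the base population and the two diverse strains about 45% each, in line with the statement that strains with little diversity should contribute fewer individuals.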

In the creation of base populations, one important point to determine beforehand is whether the objective is to maximize the genetic variance for a particular trait (i.e., the trait in the breeding goal) or to maintain the highest global diversity. For the former objective, Bennewitz and Meuwissen (2005) showed that the optimal strategy is what they called maximum variance total (MVT), which gives more weight to the variance between strains than within strains. However, although the profitability of the breeding program depends on the performance for the target trait, diversity must also be maintained for other traits that are likely to be included in future breeding objectives and for fitness-related traits. Following this logic, the methodology should minimize the global coancestry, which places the same weight on within-strain and between-strain diversity. The latter method is similar to minimizing the long-term inbreeding of the population, as demonstrated by Eding and Meuwissen (2001), and was the chosen strategy for the present study. Accordingly, in our results for three of the simulated scenarios, strategy E yielded the highest *VA* for the target trait in the base population. The lower levels of *VA* observed under strategies MC and IC were due to the fact that the objective was to maximize the global genetic diversity measured as *He* across the whole genome. Hayes et al. (2006) compared random with marker-based optimized selection of breeders from a single population of Atlantic salmon in terms of the genetic variance captured for three different traits (growth and two disease traits). They followed a methodology equivalent to strategy IC presented in this study and found higher additive variances in the breeders for the disease resistance traits but a lower variance for growth when optimizing the selection than when breeders were chosen at random. The explanation for these contrasting results was the different genetic architecture of the traits. It should be noted that the simulated trait in the present study was controlled by a large number of additive loci and had an intermediate heritability typical for growth. This is the reason for the similar performance (i.e., higher levels under the E than under the IC strategy) observed in Hayes et al. (2006) for growth and in the present study (at least for three of the simulated scenarios). Another problem for the interpretation of the results in Hayes et al. (2006) is that they employed 237 AFLPs and this number may not be enough to obtain a high correlation between diversity at markers and at loci controlling growth.

In concordance with the previous considerations, in this study the highest levels of global diversity (*He* measured at the non-marker loci) were captured when optimizing the creation of the base population using individual genotypes (strategy IC) with a large number of SNPs, although differences with the strategy equalizing proportions (strategy E) were small. Surprisingly, scenarios with a limited number of markers (i.e., 100) implied a loss in *He* of only 1% compared with results from using large numbers. In any case, when using few markers the *He* maintained under IC was sometimes lower than under the E strategy because of the lack of correlation between diversity at markers and diversity in the rest of the genome.

When individual information was absent (i.e., strategy MC) there was a reduction in the ability to capture diversity relative to the IC strategy, whatever the number of markers used. Notwithstanding, values of *He* obtained when relying on strain averages were less than 2% lower than those observed when individual genotypes of candidates were available. This is an appealing result for cases where the budget is low and not all candidates can be genotyped.

Beyond all considerations about the genetic diversity, we must remember that the short-term profitability of a breeding program depends on actual mean levels of the phenotypic value for the trait of interest (as long as the breeding goal does not change and no fitness troubles arise in the population). The present study has shown that the mean phenotype of the selected individuals should also be accounted for when constructing the base population if that information is available. The loss of profit resulting from including low performance individuals in the base population may not be economically compensated in a reasonable period of time even if the response to selection is high due to a wider genetic variance for the trait. In fact, our results showed that the superiority of individuals selected under strategy IP lasted for the 10 generations of selection.

The presented results showed that, when reliable information is available for a fixed set of strains, a compromise solution between diversity and performance can be found when creating base populations. Taking as a reference point the strategy of randomly sampling the same number of individuals from each strain (strategy E, equivalent to the methodology used in Holtsmark et al., 2006, 2008a,b), up to 7% higher levels of phenotypic performance can be achieved under strategy IP at the same level of global diversity (*He* measured at the non-marker loci) when using a large panel of SNPs to genotype all candidates. Depending on the market value for the increase of one unit of the target trait this could translate into a large economic gain. Moreover, results showed that phenotypic values remain higher during all generations of artificial selection that were simulated from the base population under the IP strategy. Therefore, there was a clear superiority of fish obtained using this strategy.
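The compromise between performance and diversity can be illustrated with a toy version of the individual-based strategy: choose the group with the highest mean phenotype among those whose average coancestry does not exceed a ceiling. The exhaustive search below is only practical for very small candidate sets and is not the optimizer used in the study; the phenotypes, coancestries and ceiling are hypothetical.

```python
from itertools import combinations


def mean_coancestry(subset, F):
    """Average coancestry over all pairs in the subset, self-pairs included."""
    k = len(subset)
    return sum(F[i][j] for i in subset for j in subset) / (k * k)


def best_subset(phenotypes, F, k, max_coancestry):
    """Exhaustively pick k candidates maximizing the mean phenotype subject
    to a ceiling on the group's mean coancestry (toy sizes only)."""
    best, best_mean = None, float("-inf")
    for subset in combinations(range(len(phenotypes)), k):
        if mean_coancestry(subset, F) > max_coancestry:
            continue  # violates the diversity restriction
        mean_pheno = sum(phenotypes[i] for i in subset) / k
        if mean_pheno > best_mean:
            best, best_mean = subset, mean_pheno
    return best, best_mean


# Hypothetical candidates: strain A performs well but is closely related,
# strain B performs worse but is more diverse; strains are unrelated.
strain = ["A", "A", "A", "B", "B", "B"]
phenotypes = [12.0, 11.0, 10.0, 8.0, 7.0, 6.0]
within = {"A": 0.25, "B": 0.10}
F = [
    [0.5 if i == j else (within[strain[i]] if strain[i] == strain[j] else 0.0)
     for j in range(6)]
    for i in range(6)
]
chosen, mean_pheno = best_subset(phenotypes, F, k=3, max_coancestry=0.23)
```

Here the three best performers all come from strain A and exceed the coancestry ceiling, so the search returns the two best A individuals plus the best B individual instead.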

It must be realized that giving a large weight to the phenotypic value for a particular trait will have consequences on other correlated traits. Special attention should be paid to traits negatively correlated with the target trait which could be of potential interest. In such situations a useful strategy would be to use an index that includes several traits in the objective function to be maximized or to include an additional restriction in the optimization to ensure a minimum acceptable phenotypic level for the secondary traits.

It must be stressed that genotyping for a limited number of markers may give undesirable results because diversity at those markers will be loosely related to diversity at non-marker loci and to diversity at loci controlling the trait. Consequently, individuals with a high phenotypic performance may actually maintain little global diversity but still exhibit by chance high levels of diversity at the markers. Our results show that, when relying on few markers, lower *He* levels were found when optimizing contributions using the IP strategy than using the E strategy even when a restriction was imposed to keep the same level of diversity (**Table 3**). However, results from our simulations suggest that, although it will depend on the particular characteristics of the species under management and the genetic architecture of the available strains, about 1000 SNPs could be enough to efficiently create base populations in aquaculture, as no relevant improvements are obtained by further increasing the number of markers.

For all scenarios, strategies accounting for the phenotypic level of the founders (i.e., MP and IP) started the selection program from lower values of *VA* for the target trait. This fact did not prevent these strategies from maintaining the initial advantage in performance during the 10 generations of selection.

Results of this study suggest that if "healthy" commercial strains (i.e., where diversity is not exhausted and no problems of inbreeding exist) are available they should be used to form the base population because they provide a higher starting performance level. This is even clearer when the objective is to complement the breeders' population in an ongoing selection program. Conversely, the use of wild adapted strains with low performance would only be recommended if we suspect that unique information for other traits of interest is present in them. This could be the situation for strains naturally resistant to a particular disease. Otherwise, the general diversity that they could provide will not compensate for the lower trait phenotypic mean.

The small differences in *He* observed under different strategies in our simulations could be due to the large number of individuals selected to form the base population (200), which makes strategy E perform so well that the other strategies have difficulties in improving *He*. In an extra scenario run with a smaller set of candidates (only four strains with 20 individuals each), harboring lower levels of diversity (mutation-drift equilibrium reached for a population of 100 individuals) and selecting a lower number of breeders (24), strategies implying optimized proportions still showed only slightly greater advantages over strategy E (2% increase). In any case, the differences observed in phenotypic values make it worthwhile to optimize the construction of base populations in aquaculture, and levels of diversity should also be accounted for in that task to get an appropriate balance.

Another advantage of using molecular information, beyond balancing phenotypic values and diversity when optimizing the construction of a base population, is that it provides us with the possibility of estimating the actual relationships between breeders in the base population itself. These relationships can be used for calculating EBVs through BLUP methodology and also for controlling the rate of inbreeding through Optimal Contribution strategies. Holtsmark et al. (2008b) studied the effects on the performance of the breeding program across generations of assuming unrelated and non-inbred founders when they are not. They concluded that an incorrect estimation of the relationships between and within strains and individuals leads to sub-optimal use of subpopulations with an increased risk of loss of alleles of direct and strategic relevance to the breeding program.

If genotypes for a dense panel of markers and phenotypes are available for the same individuals (the candidates to be part of the base population or related individuals), the additive effect of each SNP can be calculated in the same way as in the Genomic Selection methodology (Meuwissen et al., 2001). Thus, the genomic value of candidates can be calculated and used to make decisions instead of their phenotypic value.
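Once SNP effects are available, the genomic value of a candidate is the sum of its allele counts weighted by those effects. The sketch below assumes the effects have already been estimated by some genomic selection method (the estimation step is not shown) and uses hypothetical genotypes and effect sizes.

```python
def genomic_values(genotypes, snp_effects):
    """Genomic value per candidate: sum over SNPs of allele count x effect.

    genotypes: allele counts (0, 1 or 2) per candidate and SNP (assumed coding).
    snp_effects: previously estimated additive effect of each SNP.
    """
    return [sum(g * a for g, a in zip(ind, snp_effects)) for ind in genotypes]


# Hypothetical data: 3 candidates genotyped for 3 SNPs.
genotypes = [
    [0, 1, 2],
    [2, 1, 0],
    [1, 1, 1],
]
snp_effects = [0.5, -0.2, 0.1]
values = genomic_values(genotypes, snp_effects)
# Candidates ordered from highest to lowest genomic value.
ranking = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
```

The resulting values could then take the place of phenotypic records in the objective that is maximized when choosing founders.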

In the present study the selected trait was simulated with an additive gene action both within and across loci. However, traits with commercial interest may have an important dominant component. Dominant effects can be also estimated and then used to design the mating scheme between breeders, at least to form the families from which the selection program will start. This way, the effects of heterosis can be accounted for and extra responses can be obtained in the first round of selection (Toro and Varona, 2010).

In short-lived species, where individuals may not be reproductively active by the time their phenotypic records and genotypes are available, it would be difficult to implement strategies based on individual information (i.e., IC and IP). However, even in such situations strain information can be used to optimize the creation of the base population, as shown in the present study. The impossibility of controlling the specific matings in mass spawning species does not interfere with the optimization of the base population either. Phenotypic levels and diversity can be optimized at the start of the program, although responses in subsequent generations of selection may differ from those shown here given that selected breeders would need to be mated in groups.

In terrestrial livestock species, breeding programs have been running for many years and, thus, it is not very likely that there is a need for creating new base populations. Notwithstanding, our conclusions go beyond the scope of creating base populations. For instance, they may help to make decisions when creating a gene bank for any species. When the aim of such a bank is to store the genetic diversity from the available strains, the same strategies can be applied as when creating the base population of breeders. When, in the future, the stored material is used for creating a live population or for complementing a breeding program, the same results and consequences as those presented here are expected. If the phenotypic values of the candidates are not taken into consideration when determining the sampling scheme, the starting population will show low levels for the trait of interest. Finally, the methodology presented in this study is also useful when the objective is to create a "core" live population within an ex-situ conservation program that aims at collecting the genetic diversity that exists in all available populations. This scenario is common for local breeds of livestock species.

## **ACKNOWLEDGMENTS**

The research leading to these results has received funding from the European Union's Seventh Framework Programme (KBBE.2013.1.2-10) under grant agreement n° 613611.

## **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 24 July 2014; accepted: 06 November 2014; published online: 25 November 2014.*

*Citation: Fernández J, Toro MÁ, Sonesson AK and Villanueva B (2014) Optimizing the creation of base populations for aquaculture breeding programs using phenotypic and genomic data and its consequences on genetic progress. Front. Genet. 5:414. doi: 10.3389/fgene.2014.00414*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Fernández, Toro, Sonesson and Villanueva. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Characterization of the rainbow trout spleen transcriptome and identification of immune-related genes

## *Ali Ali<sup>1,2</sup>, Caird E. Rexroad<sup>3</sup>, Gary H. Thorgaard<sup>4</sup>, Jianbo Yao<sup>5</sup> and Mohamed Salem<sup>1,5</sup>\**

*<sup>1</sup> Department of Biology, Middle Tennessee State University, Murfreesboro, TN, USA*


#### *Edited by:*

*José Manuel Yáñez, University of Chile, Chile*

#### *Reviewed by:*

*Alexandre Rodrigues Caetano, Embrapa Recursos Genéticos e Biotecnologia, Brazil*

*Ed Smith, Virginia Polytechnic Institute and State University, USA*

#### *\*Correspondence:*

*Mohamed Salem, Department of Biology, Box 60, 2055 SCI, 1500 Greenland Dr., Murfreesboro, TN 37132, USA e-mail: mohamed.salem@mtsu.edu*

Resistance against diseases affects profitability of rainbow trout. Limited information is available about functions and mechanisms of teleost immune pathways. Immunogenomics provides powerful tools to determine disease resistance genes/gene pathways and develop genetic markers for genomic selection. RNA-Seq sequencing of the rainbow trout spleen yielded 93,532,200 reads (100 bp). High quality reads were assembled into 43,047 contigs. 26,333 (61.17%) of the contigs had hits to the NR protein database and 7024 (16.32%) had hits to the KEGG database. Gene ontology showed significant percentages of transcripts assigned to binding (51%), signaling (7%), response to stimuli (9%) and receptor activity (4%) suggesting existence of many immune-related genes. KEGG annotation revealed 2825 sequences belonging to "organismal systems" with the highest number of sequences, 842 (29.81%), assigned to immune system. A number of sequences were identified for the first time in rainbow trout belonging to Toll-like receptor signaling (35), B cell receptor signaling pathway (44), T cell receptor signaling pathway (56), chemokine signaling pathway (73), Fc gamma R-mediated phagocytosis (52), leukocyte transendothelial migration (60) and NK cell mediated cytotoxicity (42). In addition, 51 transcripts were identified as spleen-specific genes. The list includes 277 full-length cDNAs. The presence of a large number of immune-related genes and pathways similar to other vertebrates suggests that innate and adaptive immunity in fish are conserved. This study provides deep-sequence data of rainbow trout spleen transcriptome and identifies many new immune-related genes and full-length cDNAs. This data will help identify allelic variations suitable for genomic selection and genetic manipulation in aquaculture.

#### **Keywords: spleen transcriptome, annotation, KEGG, immune-related genes, spleen-specific genes, full-length cDNA**

## **INTRODUCTION**

Teleost fish are the first class of vertebrates that have the elements of both innate and adaptive immune responses (Whyte, 2007). Innate immunity is more important in teleosts as the first line of defense due to the restrictions on adaptive immunity in suboptimal environments (Ullal et al., 2008). Teleost fish have no bone marrow or lymph nodes. Immunogenomics can be used in clarifying the origin and evolution of the immune systems (Kaiser et al., 2008).

The ability of fish to combat viral, bacterial or parasitic pathogens is affected by genetic factors. Thus, genetic selection can improve disease resistance and provide prolonged protection against pathogens (Skamene and Pietrangeli, 1991; Wiegertjes et al., 1996; Van Muiswinkel et al., 1999; Leeds et al., 2010; Wiens et al., 2013a). In addition, investigations on immune reactions in fish could aid in the development of vaccines (Raida and Buchmann, 2009).

Rainbow trout (*Oncorhynchus mykiss*) are a widely distributed and cultured aquaculture species used as a food and sport fish (Thorgaard et al., 2002). Additionally, this organism is used as a model species in many different fields of research such as cancer biology (Tilton et al., 2005), toxicology (Köllner et al., 2002), nutrition (Wong et al., 2013), evolutionary biology (Taylor et al., 2011), and immunology (Nya and Austin, 2011). Rainbow trout are larger than other fish model species, making them a source of large quantities of specific tissues and cells for immunological, biochemical and molecular analyses. Several genomic resources have been developed for research to help improve rainbow trout commercial production traits including disease resistance. The list includes clonal lines (Young et al., 1996; Thorgaard et al., 2002), BAC libraries (Palti et al., 2004), genetic linkage maps (Young et al., 1998; Sakamoto et al., 2000; Rexroad et al., 2008; Palti et al., 2011; Guyomard et al., 2012), microarrays (Salem et al., 2008), expressed sequence tags (ESTs) (Rexroad et al., 2003; Sánchez et al., 2009; Boussaha et al., 2012), single nucleotide polymorphisms (SNPs) (Sánchez et al., 2009; Amish et al., 2012; Boussaha et al., 2012; Houston et al., 2012; Salem et al., 2012; Christensen et al., 2013; Colussi et al., 2014; Palti et al., 2014), next-generation sequence read archives (SRA) and genome as well as transcriptome reference assemblies (Salem et al., 2010; Sanchez et al., 2011; Berthelot et al., 2014; Fox et al., 2014). Several studies have identified many immune-related key genes and gene networks (Thorgaard et al., 2002; Cerdà et al., 2008; Chiu et al., 2010; Zhang et al., 2011). Immunogenomic studies of rainbow trout have established an agricultural importance due to their direct and immediate contributions to the aquaculture industry (Thorgaard et al., 2002).

*<sup>2</sup> Department of Zoology, Faculty of Science, Benha University, Benha, Egypt*

The spleen is a primary hematopoietic and peripheral lymphoid organ (Zapata et al., 2006; Mahabady et al., 2012). This organ has melano-macrophage centers for breakdown of aged erythrocytes, and T-like as well as B-like cells for antigen trapping. In addition, the spleen has a role in antigen presentation and initiation of the adaptive immune response (Espenes et al., 1995; Zapata et al., 1997; Chaves-Pozo et al., 2005; Whyte, 2007; Alvarez-Pellitero, 2008). A positive genetic correlation exists between bacterial cold water disease resistance and spleen index in domesticated rainbow trout (Wiens et al., 2013b).

Transcriptomic profiling is useful in revealing spleen genes that are engaged in the innate and adaptive immune responses and expressed as a result of the presence of toxicants or infection (Pereiro et al., 2012). RNA-Seq (whole transcriptome sequencing) is an effective tool for identifying the functional complexity of transcriptomes, alternative splicing, non-coding RNAs, new transcription units and assembly of full-length genes (Xiang et al., 2010; Huang et al., 2011; Salem et al., 2012; Djari et al., 2013). This deep sequencing technology gives low background noise, aids in identifying allele-specific expression and reveals weakly expressed transcripts. Bioinformatics algorithms have been developed to facilitate transcriptomic profiling (Yang et al., 2012). Therefore, RNA-Seq is a valuable tool in studying functional complexity of the spleen transcriptome and identifying immune-relevant genes and signaling networks (Nie et al., 2012).

In this study, we aimed to (1) characterize the transcriptome of rainbow trout spleen and (2) identify spleen-specific and immune-relevant genes (including full-length cDNAs) that could be used to develop genetic markers for disease resistance. Identifying the networks associated with such genes will be helpful in generating new technologies to improve aquaculture (Takano et al., 2011).

## **RESULTS AND DISCUSSION**

The spleen transcriptome was sequenced from an apparently healthy single homozygous doubled-haploid fish from the Swanson clonal line, the same line used for BAC library construction (Palti et al., 2004) and sequencing both of the whole transcriptome (Salem et al., 2010) and the whole genome reference (Berthelot et al., 2014). A single doubled-haploid fish was used to help overcome the assembly complications associated with the tetraploid genome of the rainbow trout (Allendorf and Thorgaard, 1984). Spleen transcriptome RNA-Seq data were *de novo* assembled into contigs. Assembled contigs were analyzed and annotated to identify genes that are predominantly expressed in the spleen and genes that are involved in immune signaling pathways. Spleen sequencing data yielded a total of 93,532,200 reads with a read length of at least 100 bp. After filtration to remove the adaptors, low complexity reads and duplicates, 58,013,135 (62%) high quality reads (*Q* values *>* 33%) were obtained and assembled into 43,047 contigs with an average contig length of 1154 nt and N50 equal to 1306 nt.
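The assembly statistics quoted above (average contig length and N50) can be computed from a list of contig lengths as sketched below. The lengths used here are hypothetical; only the read counts in the last line come from the text.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of
    the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")


def mean_length(lengths):
    """Average contig length."""
    return sum(lengths) / len(lengths)


# Hypothetical contig lengths (nt).
contigs = [100, 200, 300, 400, 500]
assembly_n50 = n50(contigs)
assembly_mean = mean_length(contigs)

# Share of reads retained after filtering, using the counts given above.
retained_pct = round(100 * 58013135 / 93532200)
```

For the toy lengths, the two longest contigs already cover more than half of the 1500 assembled bases, so the N50 is the length of the second longest contig.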

## **FUNCTIONAL ANNOTATION**

Contigs were searched against the NCBI's non-redundant protein (NR) database using the BLASTX program with an *E*-value cutoff of 1.0E-3. There were 26,333 (61.17%) contigs with hits to the NR database (Table S1). The contigs which had no hits [16,714 (38.83%)] may be attributed to non-coding RNAs, contig misassemblies (Grabherr et al., 2011), limited information about protein sequences of related fish in the NCBI database or diverged sequences of rainbow trout due to partial genome duplication (Ravi and Venkatesh, 2008; Lee et al., 2011). Further work toward characterization of the non-coding RNAs is still needed.

Data statistics of the sequencing, assembly and annotations are presented in **Table 1**. A total of 13,780 (88.40%) of the contigs longer than 1000 bp had BLAST matches, whereas only 12,543 (51.72%) of contigs shorter than 1000 bp had BLAST hits (**Figure 1**). Short sequences may give false-negative results because they are not long enough to show sequence matches or may lack a representative protein domain (Wang et al., 2012). The identity distribution revealed that 70% of the contigs have greater than 80% similarity and 24% have similarity between 60 and 80%. The *E*-value distribution of the top hits to the NR database showed that 28% of the assembled contigs showed significant homology to previously deposited sequences (less than 1.0E-50), and 72% ranged from 1.0E-50 to 0. The assembled contigs have been submitted to the NAGRP Aquaculture Genome Projects (http://www.animalgenome.org/repository/pub/MTSU2014.0811/).

**Table 1 | Statistical summary of rainbow trout spleen sequencing, assembly and annotation.**

Gene ontology (GO) categorizes gene products and standardizes their representation across species (Consortium, 2008). Contigs with lengths of 500 nucleotides or more (43,047) were annotated using the Blast2GO suite (Conesa et al., 2005; Götz et al., 2008); contigs were assigned to appropriate molecular function, biological process and cellular component GO terms (Ashburner et al., 2000). **Figure 2** shows a summary of the GO assignments at the second level in the three areas of gene ontology. The GO term distribution was compared to a transcriptome reference that we previously assembled from 13 tissues (Salem et al., 2010) (**Figure 3**). In the biological process category, the most represented terms were related to cellular process (17%), followed by metabolic process (15%) and biological regulation (12%) (**Figure 2A**). These percentages are lower than those of the corresponding categories in the whole transcriptome reference: 25, 18, and 25%, respectively (**Figure 3A**). Conversely, some immune-relevant sub-categories of metabolic processes exist in higher percentages than their counterparts in the whole transcriptome reference; response to stimuli (9%) and signaling (7%) compared to 3% response to stimuli and 1% immune system process, respectively (**Figure 3A**). These percentages suggest that we identified a larger number of immune-related genes in this study.

Within the molecular function category, 51, 30, and 4% of the spleen transcripts were assigned to binding, catalytic activity and receptor activity, respectively. The whole transcriptome reference showed 46%, 32% and an unlisted percentage, respectively (**Figure 3B**). Pereiro and co-workers suggested that several immune-related processes were represented in the binding and catalytic activity categories (Pereiro et al., 2012). In the cellular component categories, a significant percentage of clusters were assigned to cell (42%), organelle (27%), macromolecular complex (13%), membrane-enclosed lumen (8%) and membrane (7%) (**Figure 3C**), compared to their corresponding categories in the whole transcriptome reference: 59, 24, 9, 3% and an unlisted percentage, respectively (**Figure 3C**). Discrepancies in the GO distribution profiles may be attributed to differences in the nature of the cDNA libraries, the numbers of sequences used to retrieve GO terms, sequencing technology and the assembly approaches. Information about GO terms is supplied in additional Table S2.

KEGG pathway analysis was carried out to categorize and annotate the assembled contigs (Kanehisa and Goto, 2000; Kanehisa et al., 2012). Searching contigs against the KEGG database yielded 7,024 KEGG hits (16.32% of the total number of transcripts), with 4779 unique hits (11.10% of the total number of transcripts). KEGG Orthology (KO) numbers were used to assign sequences to different metabolic pathways (**Table 2**). In total, 2236 KEGG-annotated sequences (31.83% of the total number of KEGG hits) were assigned to metabolism, which was further classified into carbohydrate metabolism (453 sequences, 20.26%), lipid metabolism (349 sequences, 15.61%) and amino acid metabolism (343 sequences, 15.34%). In addition, 1,394 (19.85%) annotated sequences were assigned to genetic information processing, which includes folding, sorting and degradation (484 sequences, 34.72%), translation (459 sequences, 32.93%), replication and repair (255 sequences, 18.29%), and transcription (196 sequences, 14.06%). Further, 1830 sequences (26.05%) were classified as environmental information processing, with 1508 sequences (82.40%) assigned to signal transduction and 283 sequences (15.46%) to signaling molecules and interaction. The cellular processes group contained 1421 (20.23%) KEGG-annotated sequences.

Remarkably, the KEGG organismal systems category contained 2825 (40.22%) annotated sequences with the highest number of sequences assigned to immune system (842 sequences, 29.81%) followed by endocrine system (619 sequences, 21.91%), nervous system (526 sequences, 18.62%), digestive system (268 sequences, 9.49%) and development (210 sequences, 7.43%). Assignments of the organismal systems to the last four categories support the previously reported relationships between function of the spleen and other systems of the body. For example, a subset of genes with functions relevant to neurodevelopment was identified in the spleen transcriptome of the house finch (*Haemorhous mexicanus*) (Backström et al., 2013). Regarding the endocrine functions, it was thought that spleen secretes a hormone-like substance under the control of pituitary gland and adrenal cortex in case of emergencies (Ungar, 1945). Recently, the spleen endocrine function has been confirmed after in-depth studies of its function (Wu, 1998; Horiguchi et al., 2004; Tarantino et al., 2011). A cytokine known as Lymphotoxin was reported to keep the immunological balance of the gastrointestinal tract through regulation of the immune system of the digestive tract which is represented by immune cells, immunoglobulins and intestinal bacteria (Kruglov et al., 2013). In addition, hormones of the gastrointestinal tract activate the immune system in case of gut inflammation (Khan and Ghia, 2010).
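The top-level category shares quoted in the two paragraphs above can be recomputed directly from the reported counts. The short Python sketch below does so; the variable and function names are illustrative, and the counts are copied from the text rather than re-derived from the KEGG output.

```python
KEGG_HITS = 7024       # contigs with a KEGG hit (from the text)
TOTAL_CONTIGS = 43047  # assembled contigs (from the text)

# Top-level KEGG categories and the number of annotated sequences in each
CATEGORY_COUNTS = {
    "metabolism": 2236,
    "genetic information processing": 1394,
    "environmental information processing": 1830,
    "cellular processes": 1421,
    "organismal systems": 2825,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


shares = {name: percent(count, KEGG_HITS) for name, count in CATEGORY_COUNTS.items()}
for name, share in shares.items():
    print(f"{name}: {share}% of KEGG hits")

print("KEGG hits as % of contigs:", percent(KEGG_HITS, TOTAL_CONTIGS))
print("sum of category counts:", sum(CATEGORY_COUNTS.values()))
```

The category counts add up to more than the 7,024 KEGG hits, which is consistent with individual sequences being counted under more than one category, although the text does not state this explicitly.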

#### **TAXONOMIC ANALYSIS**

A BLASTX top-hit species distribution of gene annotations showed highest homology to *Salmo salar* (4,833 BLAST hits; 18.35%), followed by *Oreochromis niloticus* (16.64%), *Maylandia zebra* (16.47%) and *Danio rerio* (16.12%) (**Figure 4**). Other fish species in the BLASTX top-hit distribution were *Takifugu rubripes* (5.98%) and *Oryzias latipes* (5.17%). Rainbow trout itself (983 BLAST hits; 3.73%) fell in the seventh position of the top-hit species distribution. This may be explained by the identification of a large number of new genes in this study and/or the limited number of rainbow trout proteins (6965 proteins) currently available in the NCBI database. The model fish species in the list, *D. rerio*, *T. rubripes* and *O. latipes*, have a large number of proteins available in the NCBI database. The first nine species were all fish, starting with *S. salar* from the family Salmonidae, to which rainbow trout belongs. Therefore, these results support the high quality and high level of phylogenetic conservation of the assembled spleen transcriptome. Other species with known genome sequences appearing in the BLASTX top-hit species distribution included the mammals *Homo sapiens*, *Mus musculus* and *Rattus norvegicus*, the bird *Gallus gallus* (chicken) and the amphibian *Xenopus laevis*.

#### **IMMUNE TRANSCRIPTOME ANALYSIS**

Our transcriptome analysis identified 842 immune-related transcripts in 15 KEGG immune pathways (Table S3). Many of these transcripts are represented by complete cDNA sequences that were identified for the first time in rainbow trout. The immune-related transcripts were mapped to a newly assembled genome reference (Berthelot et al., 2014). The coordinate genome reference IDs and complete/incomplete ORF conditions are provided in Tables S4–S10. The immune-related transcripts were clustered according to their KEGG assigned pathways (**Figure 5**).

**Table 2 | KEGG biochemical mappings for rainbow trout.**

#### *Toll-like receptor signaling pathway*

Toll-like receptors (TLRs) activate the innate immune response through recognition of pathogen-associated molecular patterns (PAMPs), including lipopolysaccharides or peptidoglycan in the bacterial cell wall, β-1,3-glucan in the fungal cell wall and dsRNA from viruses (Medzhitov and Janeway, 2000; Janeway and Medzhitov, 2002). This leads to the activation of nuclear factor-κB (NF-κB), which in turn induces proinflammatory cytokines (Barton and Medzhitov, 2003; Takeda and Akira, 2004).

More than 10 TLRs have been identified in teleost fish including zebrafish (Jault et al., 2004; Meijer et al., 2004), rainbow trout (Tsujita et al., 2004; Palti et al., 2010a), common carp (Kongchum et al., 2010), pufferfish (Oshiumi et al., 2003), channel catfish (Bilodeau and Waldbieser, 2005; Baoprasertkul et al., 2007a,b) and Atlantic salmon (Rebl et al., 2007). In our spleen transcriptomic data, there were 7 transcripts, listed in Table S4, showing high similarity to TLR1, TLR2, TLR3, TLR5, TLR7/8, and TLR9. The coordinate genome reference IDs were identified, and five transcripts matching TLR1, TLR3, TLR7, TLR8, and TLR9 had complete cDNA sequences. Several TLRs were previously identified in rainbow trout (Tsujita et al., 2004; Rodriguez et al., 2005; Ortega-Villaizan et al., 2009; Palti et al., 2010a,b). TLR4 has not been reported in teleosts other than zebrafish, whereas TLR6 is missing from teleost fish altogether (Takano et al., 2011). This study supports the notion that both TLR4 and TLR6 are absent in rainbow trout.

There were three transcripts matching the NF-κB complex. Two transcripts matched each of MKK4/7 and IFN-αβR complexes. Each of MKK3/6 and MEK1/2 complexes has one transcript whereas MAP2K3 and MAP2K1 have no matches. Additionally, AP-1 which is composed of JUN and FOS (Zenz et al., 2008) had only one transcript matching JUN. The remaining 56 transcripts, out of the 73 total transcripts, showed high similarity to other members of the TLR signaling pathway of higher vertebrates (**Figure 6**). A total of 38 transcripts have complete cDNA sequences. To our knowledge, 26 different proteins were annotated for the first time in rainbow trout in the current study. Information about transcripts that showed homology to molecules involved in Toll-like receptor signaling pathway is included in additional Table S4.

## *B cell receptor signaling pathway*

B lymphocytes are involved in antigen-specific defense. They are activated through the binding of antigen to B cell receptors (Batista and Neuberger, 1998) and produce specific antibodies to neutralize foreign particles (Kurosaki et al., 2010). **Figure 7** shows all annotated and non-annotated proteins of the B cell signaling pathway in the current study. A total of 68 sequences were assigned to the B cell signaling pathway, of which 44 were identified for the first time in this transcriptomic study. Information about transcripts that showed homology to molecules involved in B cell receptor signaling pathway is included in additional Table S5.

## *T cell receptor signaling pathway*

Like B cells, T lymphocytes are involved in antigen-specific defense. Both T cell receptors (TCR) and costimulatory molecules such as CD28 are required for T cell activation. The cytotoxicity of cytotoxic T lymphocytes in fish remains obscure owing to the lack of suitable experimental systems, although a few studies have described the lysis of virus-infected cells by NK-like cells in rainbow trout (Moody et al., 1985; Yoshinaga et al., 1994) and channel catfish (Hogan et al., 1996). Information about proteins that are involved in this cascade was very limited in rainbow trout. The annotated transcripts showed high similarity to many members of the T cell receptor signaling pathway of higher vertebrates as shown in **Figure 8**. In this study, many transcripts (56 out of 82) that are included in the T cell receptor signaling pathway were identified for the first time. Information about transcripts that showed homology to molecules involved in T cell receptor signaling pathway is included in additional Table S6.

## *Chemokine Signaling Pathway*

Chemokines have a major role in the trafficking and activation of leukocytes toward the site of inflammation with the aid of the C-terminal domain of their receptors (chemotaxis) (Pease and Williams, 2006). Few chemokine-related genes have been cloned in rainbow trout, including CCR9 (Dixon et al., 2013), CXCR4, CCR7 (Daniels et al., 1999) and CXCL14 (Bobe et al., 2006). Many chemokine receptors have not been reported to date in rainbow trout (Dixon et al., 2013). In the present study, most of the proteins in the chemokine signaling pathway have been identified (**Figure 9**). Out of 107 annotated transcripts, 73 sequences matching 49 proteins have been identified for the first time in rainbow trout. Information about transcripts that showed homology to molecules involved in chemokine signaling pathway is included in additional Table S7.

## *Fc gamma R-mediated phagocytosis*

Clusters of IgG coat the foreign particles in a process termed opsonization. Leukocytes and tissue macrophages phagocytose the opsonized pathogens through Fc gamma receptors (Pacheco et al., 2013). Before the present study, some proteins were known to be involved in Fc gamma R-mediated phagocytosis as listed in Table S8. In the current transcriptome analysis, all annotated proteins belonging to the Fc gamma R-mediated phagocytosis are shown in **Figure 10**. There were 52 sequences out of 82 annotated sequences matching 30 proteins identified for the first time in rainbow trout. Information about transcripts that showed homology to molecules involved in the Fc gamma R-mediated phagocytosis signaling pathway is included in additional Table S8.

## *Leukocyte transendothelial migration*

White blood cells migrate in an amoeboid fashion through the endothelial lining of the blood vessels to drive the immune response to the site of infection (Muller, 2011, 2013). Most of the proteins belonging to this pathway have been identified in this transcriptome analysis (**Figure 11**). There were 92 transcripts showing high similarity to members of the leukocyte transendothelial migration cascade of higher vertebrates. Before the current transcriptome sequencing, some proteins had already been annotated in rainbow trout (Table S9). To our knowledge, 36 proteins belonging to this pathway were reported for the first time in the current study. Information about transcripts that showed homology to molecules involved in leukocyte transendothelial migration is included in additional Table S9.

## *Natural killer (NK) cell mediated cytotoxicity*

NK cells are lymphocytes working as a part of the innate immune system. Although they do not have classical antigen receptors like T and B lymphocytes, their receptors can discriminate between self and non-self cells (Lanier, 2003). In the current transcriptome analysis, many but not all proteins that are involved in the NK cell mediated cytotoxicity pathway were annotated (**Figure 12**). Many proteins involved in this pathway were reported before the current study. In this cascade, 42 transcripts out of 72 identified sequences have been annotated for the first time in rainbow trout. The newly annotated transcripts matched 24 proteins. Information about transcripts that showed homology to molecules involved in NK cell mediated cytotoxicity signaling pathway is included in additional Table S10.

## **SPLEEN-SPECIFIC GENES**

Recently, a total of 51 spleen-specific genes have been identified in our lab (data will be published elsewhere). The assembled contigs were submitted to the NAGRP Aquaculture Genome Projects (http://www.animalgenome.org/repository/pub/MTSU2014.0811/). The coordinate IDs of the spleen-specific transcripts were determined using BLASTN (cutoff *E*-value of 1.00E-10) against our spleen transcriptome (Table S11). As shown in Table S11, the level of gene expression was at least 20-fold higher in the spleen compared to 12 other tissues, with a statistical false discovery rate (FDR) of less than 5%. The spleen-specific genes were mapped to the newly assembled genome reference (Berthelot et al., 2014). The coordinate genome reference IDs and complete/incomplete ORF conditions are provided in additional Table S11.

The list of spleen-specific genes includes: (1) Immune proteins, comprising 5 transcripts such as Fc receptor-like protein 3 like, thrombopoietin receptor precursor, P-selectin precursor, nuclear factor, interleukin 3 regulated (NFIL3), and lectin precursor. (2) Respiratory gas transport proteins, comprising the 4 most highly expressed transcripts, assigned to hemoglobin, and one transcript assigned to carbonic anhydrase II. A large representation of iron-binding proteins was reported in the spleen of virus-infected turbot (Pereiro et al., 2012). (3) Coagulation cascade and adhesion proteins, comprising 10 transcripts; five of the ten transcripts were assigned to platelet glycoproteins, whilst the other 5 transcripts were assigned to coagulation factor XIII A chain precursor, thrombopoietin receptor precursor, integrin alpha 2b, integrin beta-3-like and von Willebrand factor. Thrombocytes appear in the spleen during the first week post-fertilization before appearing in blood because they participate in body defense (Tavares-Dias and Oliveira, 2009). (4) Development proteins, comprising 5 transcripts including ectodysplasin receptor, T-box transcription factor TBX6, homeobox protein Nkx-2.6-like, T-cell leukemia homeobox protein 1 and zinc finger protein Gfi1b. The participation of the house finch (*H. mexicanus*) spleen transcriptome in neurodevelopment through a subset of genes has already been reported (Backström et al., 2013). (5) Transporter proteins, comprising one transcript for band 3 anion exchange protein. Both the biosynthesis and the membrane incorporation of band 3 protein have been studied *in vivo* in erythroid spleen cells (Sabban et al., 1981). Among the other identified spleen-specific genes was 5-aminolevulinate synthase erythroid-specific mitochondrial precursor, which is necessary for heme biosynthesis. Other spleen-specific genes such as rhomboid-like protease 4, methionine aminopeptidase 2, zinc finger protein 143 like, GATA binding factor 1, N-acetyltransferase 6-like, RING finger protein 151 and cytosolic purine 5′-nucleotidase were also expressed. In mouse and rat, 39 spleen-specific genes are found in a tissue-specific database (Xiao et al., 2010). Moreover, 168 RefSeq sequences are preferentially expressed in human spleen based on ESTs (Liu et al., 2008). Further work is still needed to validate the spleen-specific genes obtained from our high-throughput spleen transcriptome analysis.

Next to spleen, the spleen-specific genes showed relatively higher expression in kidney and fat compared to the rest of the tissues. The mean RPKM values of the spleen-specific genes were 5491, 973, and 400 in spleen, kidney and fat, respectively (**Figure 13**). Kidney is one of the large lymphoid organs in teleosts containing macrophages and lymphoid cells (Zapata et al., 2006; Uribe et al., 2011). The relatively high level of expression of the spleen-specific genes in adipose tissue may be attributed to presence of many populations of immune cells in fat tissues (Ferrante, 2013).
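The expression values above are given in RPKM (reads per kilobase of transcript per million mapped reads), which normalizes a raw read count by transcript length and sequencing depth. The sketch below shows the standard formula in Python; the function name and example numbers are illustrative assumptions and do not correspond to any transcript in this study.

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (transcript length in bp * total mapped reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2000 bp transcript,
# in a library with 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))
```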

Most of the 51 spleen-specific genes have functions not related to immunity. Interestingly, when a few of these genes were tested after infection with *Flavobacterium*, three transcripts showed differential expression associated with infection: contig C23964\_c2\_seq1\_P-selectin\_precursor, contig C13\_c2\_seq1\_Hemoglobin\_subunit\_beta-1 and contig C83628\_c0\_seq1\_RING\_finger\_protein\_151 (detailed data will be published elsewhere). These data indicate that many of the spleen-specific genes may have immune functions. Further research is still needed to test whether the functions of the other spleen-specific genes are related to immunity.

### **ORF/ FULL-LENGTH cDNA PREDICTION**

Full-length cDNAs are a crucial tool for many genetic and genomic studies including alternative splicing and characterization of gene duplications or pseudogenes (Xin et al., 2008). To identify full-length cDNAs in the above-mentioned seven immune pathways, contigs were analyzed by the TargetIdentifier server (Min et al., 2005). A total of 38, 29, 30, 49, 31, 37, and 26 full-length cDNAs were identified in the Toll-like receptor signaling pathway, B cell receptor signaling pathway, T cell receptor signaling pathway, chemokine signaling pathway, Fc gamma R-mediated phagocytosis, leukocyte transendothelial migration and NK cell mediated cytotoxicity, respectively (Tables S4–S10). Out of the total number of full-length cDNAs, there were 25, 15, 18, 30, 24, 24, and 19 sequences with a completely sequenced ORF in these pathways, respectively. Likewise, spleen-specific genes were analyzed by the online TargetIdentifier server (Min et al., 2005) to identify full-length cDNAs. A total of 37 full-length sequences, including 24 with a completely sequenced ORF, were identified among the spleen-specific transcripts. Many of these transcripts were annotated for the first time in rainbow trout. Further work is needed to validate these full-length cDNAs and examine their genomic characteristics (UTR length, Kozak sequence, and conserved motifs) in detail.

## **METHODS**

## **TISSUE SAMPLING AND RNA ISOLATION**

Homozygous doubled haploid rainbow trout from the Swanson clonal line were produced at Washington State University (WSU) by androgenesis (Scheerer et al., 1986, 1991; Young et al., 1996; Robison et al., 1999). The fish, approximately 300 g in weight, was reared in recirculating water systems at 12–16°C and had sexually matured as a male prior to sampling. Spleen tissues were collected and frozen in liquid nitrogen, then shipped on ice. All samples were preserved at −80°C until RNA isolation to reduce autocatalytic degradation. Total RNA was isolated from spleen tissues with TRIzol (Invitrogen, Carlsbad, CA) and purified according to the manufacturer's guidelines. Quantity of total RNA was assessed by measuring the absorbance at A260/A280 using a Nanodrop™ ND-1000 spectrophotometer (Thermo Scientific). RNA quality was checked by electrophoresis through a 1% (w/v) agarose gel. Moreover, RNA integrity was tested using the Bioanalyzer 2100 (Agilent, CA).

#### **cDNA LIBRARY PREPARATION AND ILLUMINA SEQUENCING**

RNA-Seq library preparation and sequencing were carried out by the University of Illinois at Urbana-Champaign (UIUC), 901 West Illinois Street, Urbana, IL 61801, USA. A RiboMinus™ Eukaryote Kit V2 (Invitrogen, Carlsbad, CA) was used to deplete rRNA from the total RNA. cDNA libraries were constructed using ∼1 μg of rRNA-depleted RNA following the protocol of the Illumina TruSeq RNA Sample Preparation Kit (Illumina). The resulting synthesized double-stranded cDNA was used in the preparation of the Illumina library. The standard end-repair step was carried out first, followed by the standard ligation reaction in which the end-repaired DNA with a single A base overhang was ligated to the adaptors using T4 DNA ligase (TruSeq RNA Sample Prep Kit v2, Illumina, San Diego, CA). The products of the ligation reaction were purified and size-selected from the gel for the target length (400–450 bp) before ligation-mediated PCR. Cluster generation and sequencing were carried out following the cluster generation and sequencing manual from Illumina (Cluster Station User Guide and Genome Analyzer Operations Guide). All sequenced raw data were first exported in FASTQ format and are currently being uploaded to the NCBI short read archive (SRA).

## **CLC GENOMICS** *DE NOVO* **ASSEMBLY**

*De novo* assembly of the expressed short reads was carried out with CLC Genomics Workbench version 6.0 (CLC bio, Aarhus, Denmark; http://www.clcbio.com/products/clc-genomics-workbench/). The raw data were filtered by removing short, duplicated and low quality reads. CLC was run using the default settings for all parameters including a minimum contig length of 500 bp.
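The read filtering described above (removal of short, duplicated and low quality reads) can be pictured with the following minimal Python sketch. It is not the procedure implemented in the commercial software; the thresholds `min_length` and `min_mean_q`, the Phred+33 quality encoding and the exact-sequence definition of a duplicate are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(reads: Iterable[Read],
                 min_length: int = 50,       # hypothetical threshold
                 min_mean_q: float = 20.0    # hypothetical threshold
                 ) -> List[Read]:
    """Drop short reads, low quality reads and exact sequence duplicates."""
    seen = set()
    kept = []
    for name, seq, qual in reads:
        if len(seq) < min_length:
            continue  # too short
        if mean_phred(qual) < min_mean_q:
            continue  # low mean quality
        if seq in seen:
            continue  # exact duplicate of an earlier read
        seen.add(seq)
        kept.append((name, seq, qual))
    return kept


# Toy reads (hypothetical): r2 duplicates r1, r3 is short, r4 is low quality
reads = [
    ("r1", "ACGT" * 20, "I" * 80),
    ("r2", "ACGT" * 20, "I" * 80),
    ("r3", "ACGT" * 5, "I" * 20),
    ("r4", "TTGCA" * 16, "#" * 80),
]
print([name for name, _, _ in filter_reads(reads)])
```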

## **FUNCTIONAL ANNOTATION AND GENE ONTOLOGY ANALYSIS**

Blast2GO version 2.6.5 (http://www.blast2go.com/b2ghome) was used for the functional annotation and analysis of the assembled contigs according to molecular function, biological process and cellular component ontologies. BLASTX search for sequence homology (*E*-value of 1.0E-3, maximum 20 hits) was carried out against NCBI's non-redundant protein database (NR). GO terms related to the established hits were extracted and modulated. The functional annotations were analyzed and statistical analysis of GO distributions was performed.
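As an illustration of the hit selection criteria described above (*E*-value cutoff of 1.0E-3, at most 20 hits per contig), the following Python sketch filters BLAST results supplied in the standard 12-column tabular layout. The use of tabular output is an assumption made for this example, since the output format used in the study is not stated; the function names and sample records are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Default column order of BLAST tabular output
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

Hit = Tuple[float, str, float]  # (evalue, subject id, percent identity)


def filter_hits(lines: Iterable[str],
                max_evalue: float = 1e-3,
                max_hits: int = 20) -> Dict[str, List[Hit]]:
    """Keep, per query, up to max_hits hits with E-value <= max_evalue."""
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        record = dict(zip(COLUMNS, line.split("\t")))
        evalue = float(record["evalue"])
        if evalue <= max_evalue:
            hits[record["qseqid"]].append(
                (evalue, record["sseqid"], float(record["pident"])))
    # Sort by E-value (lowest first) and truncate to max_hits
    return {query: sorted(found)[:max_hits] for query, found in hits.items()}


def annotated_fraction(hits: Dict[str, List[Hit]], n_contigs: int) -> float:
    """Fraction of contigs with at least one retained hit."""
    return len(hits) / n_contigs


# Hypothetical records for three hits on two contigs
sample = [
    "contig1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t300",
    "contig1\tprotB\t60.0\t150\t60\t0\t1\t450\t1\t150\t0.5\t30",
    "contig2\tprotC\t70.0\t100\t30\t0\t1\t300\t1\t100\t2e-4\t80",
]
kept = filter_hits(sample)
print(kept)
print(annotated_fraction(kept, n_contigs=4))
```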

## **IDENTIFICATION OF IMMUNE-RELATED PROTEINS**

Assembled consensus sequences were uploaded to the KEGG Automatic Annotation Server (KAAS) (Moriya et al., 2007) Ver. 1.67x (http://www.genome.jp/tools/kaas/). The functional annotation of genes was carried out by a local BLAST search against the KEGG database.


The Bi-directional best hit (BBH) method was used to analyze and identify the immune molecules that were present and absent in the seven immune pathways containing the highest number of transcripts that showed high similarity to different members of each pathway. Transcripts were mapped to a newly assembled genome reference. The coordinate genome reference IDs of the immune-related transcripts were determined using BLASTX (cut off *E*-value of 1.00E-10) against the genome protein dataset (Berthelot et al., 2014).
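The idea behind a bi-directional best hit can be sketched as follows: a contig and a reference gene are paired only if each is the other's highest-scoring match. The Python sketch below is a simplified illustration of that principle and not the annotation server's implementation, which applies additional score thresholds; the input dictionaries of bit scores and all identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple

Scores = Dict[Tuple[str, str], float]  # (query, subject) -> bit score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Return the highest-scoring subject for every query."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(forward: Scores, reverse: Scores) -> Set[Tuple[str, str]]:
    """Pairs (contig, gene) where each is the other's best hit."""
    fwd = best_hits(forward)   # contig -> best reference gene
    rev = best_hits(reverse)   # reference gene -> best contig
    return {(contig, gene) for contig, gene in fwd.items()
            if rev.get(gene) == contig}


# Hypothetical scores: contigs searched against genes, and genes against contigs
forward = {("c1", "K1"): 200, ("c1", "K2"): 150,
           ("c2", "K2"): 90, ("c3", "K1"): 120}
reverse = {("K1", "c1"): 210, ("K1", "c3"): 100,
           ("K2", "c1"): 160, ("K2", "c2"): 95}
print(bidirectional_best_hits(forward, reverse))
```

In this toy case only the pair (c1, K1) is reciprocal; c2 and c3 each have a best hit, but the corresponding gene matches another contig better.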

#### **IDENTIFICATION OF SPLEEN-SPECIFIC GENES**

Tissue-specific genes in spleen were identified using Baggerley's test in CLC Genomics Workbench, in which the expression level of a gene in spleen was compared to its expression level in 12 tissues (brain, white muscle, red muscle, fat, gill, head kidney, kidney, intestine, skin, stomach, liver and testis). To define tissue-specific genes, the FDR threshold was set at 5% and the expression level was required to be at least 20-fold higher in spleen than in each of the other 12 tissues. The spleen-specific genes were mapped to the newly assembled genome reference (Berthelot et al., 2014) to determine the coordinate genome reference IDs with a cutoff *E*-value of 1.00E-10.
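The two selection criteria above (FDR below 5% and at least 20-fold higher expression in spleen) can be expressed as a simple filter. The Python sketch below assumes that the FDR-corrected values for each spleen-versus-tissue comparison have already been computed by the statistical test; it does not implement that test. The function name, the data layout and the example values are hypothetical.

```python
from typing import Dict


def is_spleen_specific(expression: Dict[str, float],
                       fdr: Dict[str, float],
                       target: str = "spleen",
                       min_fold: float = 20.0,
                       max_fdr: float = 0.05) -> bool:
    """True if the target tissue exceeds every other tissue by >= min_fold
    and every comparison has an FDR below max_fdr.

    expression: tissue -> expression level (e.g. RPKM)
    fdr: other tissue -> FDR-corrected p-value of the target-vs-tissue test
    """
    target_level = expression[target]
    if target_level <= 0:
        return False
    for tissue, level in expression.items():
        if tissue == target:
            continue
        # A gene not expressed in the other tissue has an unbounded fold change
        fold = float("inf") if level == 0 else target_level / level
        if fold < min_fold or fdr[tissue] >= max_fdr:
            return False
    return True


# Hypothetical gene: 25-fold over kidney and 50-fold over fat
expr = {"spleen": 5000.0, "kidney": 200.0, "fat": 100.0}
fdr_values = {"kidney": 0.01, "fat": 0.001}
print(is_spleen_specific(expr, fdr_values))
```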

### **ORF/ FULL-LENGTH cDNA PREDICTION**

All contigs annotated in the selected KEGG immune pathways (Toll-like receptor signaling pathway, T-cell receptor signaling pathway, B-cell receptor signaling pathway, chemokine signaling pathway, Fc gamma R-mediated phagocytosis, leukocyte transendothelial migration and NK cell mediated cytotoxicity), as well as the spleen-specific genes, were analyzed using the online TargetIdentifier server (Min et al., 2005) to look for open reading frames and putative full-length cDNAs. A BLASTX output file containing the BLASTX results for all cDNA sequences in the "FASTA" file, generated with a cutoff *E*-value of 1.00E-3, was uploaded to the TargetIdentifier program, which requires it as input. A cDNA was considered full-length if the sequence has a 5′ stop codon followed by a start codon, or if the sequence does not have a 5′ stop codon but an in-frame start codon is present prior to the 10th codon of the subject sequence. Based on the BLASTX results, TargetIdentifier also predicts whether the open reading frame (ORF) is completely sequenced.
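The full-length criterion stated above can be pictured with the following simplified Python sketch. It reflects one reading of the rule and is not the TargetIdentifier implementation: the reading frame is taken from the first aligned nucleotide of the BLASTX alignment, and the position of an upstream start codon on the subject protein is estimated by counting codons back from the alignment start. The function name, its parameters and the test sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def is_putative_full_length(cdna: str, align_qstart: int, subject_start: int) -> bool:
    """Apply a simplified version of the full-length rule.

    cdna: sense-strand cDNA sequence
    align_qstart: 0-based index of the first aligned nucleotide (sets the frame)
    subject_start: 1-based position on the protein hit where the alignment begins
    """
    cdna = cdna.upper()
    frame = align_qstart % 3
    # In-frame codon positions from the 5' end up to the alignment start
    positions = range(frame, align_qstart + 1, 3)
    stops = [p for p in positions if cdna[p:p + 3] in STOP_CODONS]
    starts = [p for p in positions if cdna[p:p + 3] == START_CODON]

    if stops:
        # Case 1: a 5' stop codon followed by an in-frame start codon
        return any(p > stops[-1] for p in starts)

    # Case 2: no 5' stop codon; accept an in-frame start codon that maps
    # before the 10th codon of the subject sequence
    for p in starts:
        codons_upstream = (align_qstart - p) // 3
        if subject_start - codons_upstream < 10:
            return True
    return False


# Hypothetical sequences
print(is_putative_full_length("TAAGCCATGGCTGCTGCT", align_qstart=9, subject_start=2))
print(is_putative_full_length("GCTGCTGCTGCT", align_qstart=3, subject_start=50))
```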

## **ACKNOWLEDGMENTS**

This study was supported by the USDA ARS Cooperative Agreement No. 58-1930-0-059. We thank Paul Wheeler for providing tissues from the Swanson doubled haploid trout. Mention of trade names of commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.

## **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fgene.2014.00348/abstract

## **REFERENCES**


Zhang, J., Yang, Y., Wang, Y., Wang, Z., Yin, M., and Shen, X. (2011). Identification of hub genes related to the recovery phase of irradiation injury by microarray and integrated gene network analysis. *PLoS ONE* 6:e24680. doi: 10.1371/journal.pone.0024680

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 08 July 2014; accepted: 16 September 2014; published online: 14 October 2014.*

*Citation: Ali A, Rexroad CE, Thorgaard GH, Yao J and Salem M (2014) Characterization of the rainbow trout spleen transcriptome and identification of immune-related genes. Front. Genet. 5:348. doi: 10.3389/fgene.2014.00348*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Ali, Rexroad, Thorgaard, Yao and Salem. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Zebrafish as animal model for aquaculture nutrition research

## *Pilar E. Ulloa1\*, Juan F. Medrano2 and Carmen G. Feijoo1*

<sup>1</sup> Departamento de Ciencias Biologicas, Facultad de Ciencias Biologicas, Universidad Andres Bello, Santiago, Chile <sup>2</sup> Department of Animal Science, University of California, Davis, Davis, CA, USA

### *Edited by:*

José Manuel Yáñez, University of Chile, Chile

#### *Reviewed by:*

Nelson Colihueque, Universidad de Los Lagos, Chile Gen Hua Yue, National University of Singapore, Singapore

#### *\*Correspondence:*

Pilar E. Ulloa, Departamento de Ciencias Biologicas, Facultad de Ciencias Biologicas, Universidad Andres Bello, República 217, Santiago 8370146, Chile e-mail: pilarelizabeth@gmail.com

The aquaculture industry continues to promote the diversification of ingredients used in aquafeed in order to achieve a more sustainable aquaculture production system. The evaluation of large numbers of diets in aquaculture species is costly and requires time-consuming trials in some species. In contrast, zebrafish (*Danio rerio*) can overcome these drawbacks as an experimental model, and represents an ideal organism to carry out preliminary evaluation of diets. In addition, zebrafish has a sequenced genome, allowing the efficient utilization of new technologies such as RNA-sequencing and genotyping platforms to study the molecular mechanisms that underlie the organism's response to nutrients. Also, biotechnological tools such as transgenic lines with fluorescently labeled neutrophils, which allow the evaluation of the immune response in vivo, are readily available in this species. Thus, zebrafish provides an attractive platform for testing many ingredients to select those with the highest potential of success in aquaculture. In this perspective article, aspects related to diet evaluation in which zebrafish can make important contributions to nutritional genomics and nutritional immunity are discussed.

**Keywords: zebrafish, aquaculture nutrition, diet evaluation, nutritional genomics, nutritional immunity**

## **AQUACULTURE NUTRITION: TRENDS AND FUTURE PERSPECTIVE**

Worldwide fish production is increasing at 8.8% per year (Food and Agriculture Organization of the United Nations [FAO], 2012). However, the decreasing availability of fishmeal, caused by a reduced supply of the fish species used to produce it, limits its use as the primary protein source in fish diets. In fact, harvest volumes of these species decreased from 10.7 million tons in 2004 to 4.2 million tons in 2010 (Food and Agriculture Organization of the United Nations [FAO], 2012). This outcome, apart from generating a negative ecological impact, has led to an increase in the value of fishmeal, affecting the profitability of many aquaculture enterprises (Rosamond et al., 2000). Therefore, the utilization of plant protein meals has emerged as an alternative to fishmeal in aquaculture feeds (Hardy, 2010).

Within the wide range of available vegetable protein sources (peas, lupine, maize, and wheat), soybean has been the most commonly used due to its wide availability in the market, low cost, high content of digestible protein and balanced amino acid profile (Naylor et al., 2009). Increasing dietary levels of soybean meal and other vegetable proteins have been tested in a variety of fish species, with inclusion levels ranging from 20 to 100% of fishmeal replacement. Unfortunately, results have shown several adverse effects, such as reduced growth and intestinal inflammation, even at low levels of inclusion (Gómez-Requeni et al., 2004; Mundheim et al., 2004; Knudsen et al., 2007; Urán et al., 2008).

Different approaches have been taken to make the utilization of plant protein by different fish more efficient, including the improvement of genetic selection in plants aiming to reduce the effect of anti-nutritional components and the stimulation of the intestinal microbiota of fish (Bakke-McKellep et al., 2007; Froystad-Saugen et al., 2009; Merrifield et al., 2009). Additionally, the diversification of new protein ingredients (from animals and plants), and identification of additives (natural or synthetic) with "intestinal protective" activity to promote growth and health have been made a priority (Refstie et al., 2010; Sicuro et al., 2010; Czubinski et al., 2014). Thus, the effects of nutrition on genomics and immunity are being addressed (Montero et al., 2010; Hernández et al., 2013). The implementation of new technologies such as RNA-sequencing, together with progress in sequencing genomes of different fish, can identify the genes affected by nutrition and also identify the genetic variants that influence the organism's response to nutrients.

To fully understand the repercussions of new diets on fish physiology, a shift in approach is required to determine the molecular and cellular pathways that regulate responses to different diets. The evaluation of large numbers of diets directly in aquaculture species entails high costs and long-term assays, so new strategies are needed to accelerate this experimental process and make it cost-effective. It is also essential to determine the molecular mechanisms by which physiological processes are perturbed in response to diet. This will provide insights into how to solve existing problems resulting from nutrition interventions in the aquaculture industry.

As an alternative approach to address the aforementioned issues, we propose the use of zebrafish (*Danio rerio*) as an animal model for high-throughput testing of experimental diets in aquaculture. This teleost fish has numerous advantages related to its fast life cycle, ease of handling and very well-known biology, besides allowing *in vivo* analysis with large numbers of fish (Kimmel et al., 1995). Here, we highlight the important contribution that zebrafish can make in the field of nutritional genomics and nutritional immunity. Both lines of investigation provide useful contributions to the evaluation of diets.

## **ADVANTAGES OF ZEBRAFISH AS AN ANIMAL MODEL FOR AQUACULTURE NUTRITION RESEARCH**

Zebrafish is a well-established animal model for a wide range of research areas, from biomedicine to toxicology (Roush, 1996; Bergeron et al., 2008). The use of this model fish for improving aquaculture production processes has emerged as an important research field (Ulloa et al., 2011; Ribas and Piferrer, 2013). In particular, research on improving husbandry and survival, immune response, nutrition and growth has been carried out in zebrafish, and is expected to provide results applicable to commercially important fish (Oyarbide et al., 2012; Hedrera et al., 2013; Ulloa et al., 2013).

Among the advantages of this model are its ease of handling in breeding and experimentation, short generation interval (∼3 months) and large number of eggs per clutch (100–200), which allow all analyses to be performed with a high number of specimens per data point (Kimmel et al., 1995). Embryos hatch at 2 days post-fertilization and larvae can live for 5 days without feeding due to the consumption of the yolk (Lawrence, 2007). During the larval period all organs and systems are functional, making these individuals physiologically equivalent to adult animals. In fact, both larvae and adult zebrafish can eat a wide variety of foods including live feed (*Paramecium* and *Artemia* nauplii) and different commercial fish diets, as well as experimental plant protein-based diets (Hedrera et al., 2013; Ulloa et al., 2013). The availability of a sequenced genome (assembly Zv9) allows the effect of diet on molecular mechanisms to be evaluated using genomic tools such as RNA-sequencing (RNA-seq; Morozova and Marra, 2008). This technology has recently been used in some aquaculture species and also in zebrafish (Li et al., 2013; Long et al., 2013; Xu et al., 2013; Cui et al., 2014; Liu et al., 2014). In addition, a wide diversity of genetic manipulation approaches is readily available in zebrafish, including a large number of transgenic lines that carry specific promoters coupled to GFP (green fluorescent protein). For example, the Tg(Bacmpx:GFP) line allows specific innate immune cells such as neutrophils to be followed *in vivo* thanks to their fluorescent label (Renshaw et al., 2006). Moreover, it has been demonstrated that neutrophil output from hematopoietic tissue toward affected territories correlates with pro-inflammatory cytokine production, thus making transgenic lines "live indicators" of inflammatory processes (Barros-Becker et al., 2012). All these assays can be carried out with embryos and larvae, which are distributed individually or in small groups in microwell plates in small volumes (0.5–2 ml), allowing sufficient biological replicates in each experiment.

Directly related to the evaluation of diets are two aspects in which zebrafish can make important contributions: nutritional genomics and nutritional immunity.

## **NUTRITIONAL GENOMICS CONTRIBUTIONS**

Nutritional genomics is a discipline that investigates the relationship between the genome and diets. Two approaches are essentially used: "Nutrigenomics," which studies how dietary ingredients affect gene expression, and "Nutrigenetics," which aims to understand how the genetic makeup of an individual coordinates the response to diet (Mutch et al., 2005). Both approaches attempt to clarify how dietary components contribute to phenotype, through altered gene expression and through individual genetic variants.

Since one effect of plant diets on fish phenotypes is on growth, two approaches have been used for more than a decade to understand the genomics associated with fish growth: (1) global evaluation of genes by the creation of microarray platforms based on EST libraries, and (2) evaluation of candidate genes involved in growth. These studies have generated a list of genes that are over- or under-expressed in response to vegetable diets at different developmental stages of fish (Douglas, 2006; Panserat et al., 2008; Von Schalburg et al., 2008; Alami-Durante et al., 2010; Tacchi et al., 2011). However, the results of the many nutritional studies are often difficult to compare with one another, owing to differences in the origin of the same ingredient, feed formulation, genetic background of the fish and experimental design. Moreover, it is common practice to select the biological samples used for transcription analysis at random. This makes it difficult to interpret the data in order to dissect the relationship between genotype and phenotype, as well as the effects of diet on gene expression. Thus, in order to develop a better understanding of the molecular mechanisms modulated by nutrition, it is necessary to select fish according to genotype and phenotype and/or, ideally, from genetically improved populations (Kolditz et al., 2008; Morais et al., 2011; Salem et al., 2012).

Advanced technologies such as RNA-seq and genotyping platforms allow accelerated research in the nutritional genomics field that can be projected to aquaculture species (Houston et al., 2014; Qian et al., 2014). These technologies have been widely used to increase the genomic understanding of phenotypes in other livestock species (Wickramasinghe et al., 2014). A recent aquaculture study analyzed genotype-diet interaction in the transcriptome of Atlantic salmon fed vegetable oil. The authors identified metabolic pathways and key regulators that may respond differently to alternative plant-based feeds depending on genotype (Morais et al., 2011). Salem et al. (2012), using RNA-seq, identified 23 single nucleotide polymorphism (SNP) markers in rainbow trout associated with growth response to a commercial fish meal-based diet. However, despite these efforts, genetic differences (gene expression and SNPs) among fish in relation to growth rate in response to plant protein diets have not been reported.

In order to address this subject, a new approach using zebrafish was developed (Ulloa et al., 2013). Briefly, a population of 24 experimental families was generated to examine growth response in zebrafish fed a plant protein-based diet. At 30 days post-fertilization (dpf) each family was split to generate two replicates (40 fish per family replicate), which created two populations of fish with similar genetic backgrounds. The fish were fed from larval transition (35 dpf) to sexual maturity (98 dpf). The first replicate of 24 families was fed a diet containing 100% plant protein as the only protein source (experimental diet) and the second replicate was fed a diet containing 100% animal protein as the only protein source (control diet). The results showed decreased growth in fish fed the plant protein-based diet compared to fish fed the fish meal-based diet, as has been documented in farmed fish (Gómez-Requeni et al., 2004; Mundheim et al., 2004), and very large growth variations from juvenile to adult stages among families (Ulloa et al., 2013). In order to evaluate the effect of a plant protein-based diet on the expression of growth-related genes in the muscle of zebrafish, individuals from three similar families representative of the mean weight in both populations were selected. To understand the effect of family variation on gene expression, these families were evaluated separately. The results showed that the effects of plant diet and of family variation on gene expression were significantly different, clearly suggesting that gene expression is influenced not only by nutrition but also by genetic differences in each family, as has been described by Gjedrem (2000). Thus, it was demonstrated that to understand the effect of diet in transcriptome analysis, it is important to homogenize the phenotype and genetic components to avoid conflicts in the interpretation of results (Ulloa et al., 2013).

To measure gene expression and identify SNPs associated with growth in zebrafish fed a plant protein diet, a new approach was developed using RNA-seq. Muscle samples collected from low- and high-growth fish were analyzed to identify SNPs in differentially expressed genes. One hundred twenty-four genes were differentially expressed between phenotypes. From these genes, 164 SNPs were selected and genotyped in 240 fish samples. Marker-trait analysis revealed five SNPs in key genes directly related to growth response (unpublished data). This study provided new candidate genes associated with growth that could be evaluated in farmed fish through comparative genomics. Additionally, this strategy promises to be useful for identifying SNPs and characterizing individuals with the highest growth performance in response to a plant protein-based diet.
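As an illustration of what a single-marker "marker-trait analysis" can look like, the following minimal sketch regresses body weight on allele dosage at one SNP. It is not the analysis used in the unpublished study: the function name `snp_effect`, the additive coding (0, 1 or 2 copies of one allele) and the toy weights are all assumptions made for this example, and a real analysis would also account for family structure and multiple testing.

```python
from statistics import mean

def snp_effect(dosages, weights):
    """Least-squares slope and r^2 of body weight on allele dosage (0, 1 or 2).

    Hypothetical helper: assumes a purely additive SNP effect and
    unrelated fish, which is a simplification.
    """
    mx, my = mean(dosages), mean(weights)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in weights)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, weights))
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    slope = sxy / sxx                     # weight change per allele copy
    r2 = sxy ** 2 / (sxx * syy) if syy > 0 else 0.0
    return slope, r2

# Toy data (invented): allele dosage and body weight in grams for six fish.
dosages = [0, 0, 1, 1, 2, 2]
weights = [0.30, 0.34, 0.41, 0.39, 0.50, 0.52]
slope, r2 = snp_effect(dosages, weights)
print(f"additive effect per allele: {slope:.3f} g, r^2 = {r2:.2f}")
```

Repeating such a test for each genotyped SNP, with an appropriate correction for the number of tests, would give a first ranking of candidate markers.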

## **NUTRITIONAL IMMUNITY CONTRIBUTIONS**

Diverse studies in cultured fish have shown that soybean meal induces intestinal inflammation, a pathology called enteritis (Bakke-Mckellep et al., 2000; Knudsen et al., 2007). Inflammation is the hallmark of innate immunity; it is triggered in response to different insults, including pathogens, injury or irritants (Chen and Nuñez, 2010). When inflammation occurs, influx, accumulation, and activation of leukocytes (predominantly neutrophils) are triggered at the affected site during the early stages of the response (Witko-Sarsat et al., 2000). One of the first cytokines to be released is the pro-inflammatory cytokine interleukin-1β (IL-1β; Rodriguez-Tovar et al., 2011). Other proteins essential for neutrophil chemoattraction and migration are the chemokine Cxcl8 and some matrix metalloproteinases (MMPs). Cxcl8 promotes neutrophil recruitment to the sites of insult, whereas MMPs are involved in the degradation of the extracellular matrix in order to promote granulocyte migration (Oliveira et al., 2013). Once neutrophils reach the affected site, they destroy the *insulting agent* through the production of non-specific toxins (Fialkow et al., 2007).

On the opposite side are the anti-inflammatory cytokines, such as transforming growth factor beta (TGF-β) and interleukin 10 (IL-10), which are mainly secreted by macrophages when the inflammatory agent is removed, promoting the end of the inflammatory process (Engelsma et al., 2002; Ouyang et al., 2011). It is noteworthy that an inflamed fish intestine is characterized by shorter mucosal folds, loss of vacuoles in the absorptive cells of the intestinal epithelium, and large infiltration of neutrophils, macrophages, and eosinophils in the *lamina propria*, among other features (Baeverfjord and Krogdahl, 1996).

The severity of inflammation differs between species and depends on the inclusion level of plant feedstuffs in the diet. Among salmonids, Atlantic salmon (*Salmo salar*) appears to be the most affected by the inclusion of plant protein, and rainbow trout (*Oncorhynchus mykiss*) less so (Morris et al., 2005; Bakke-McKellep et al., 2007). However, inflammation has also been described in omnivorous fish such as carp (*Cyprinus carpio*) and zebrafish (Urán et al., 2008; Hedrera et al., 2013). This situation affects the cellular and humoral immunological processes, with negative consequences for food intake and growth (Gómez-Requeni et al., 2004; Mundheim et al., 2004; Montero et al., 2010).

In recent years, additives such as prebiotics (mannan oligosaccharides [MOS] and fructooligosaccharides [FOS]), probiotics (bacteria), immunostimulants (β-glucans), and nucleotides have been incorporated into fish diets in order to control diseases and improve health and immune status against acute stress (Piaget et al., 2007; Staykov et al., 2007; Tahmasebi-Kohyani et al., 2012). In the case of MOS, supplementation at 0.2% in diets with 14% soybean meal inclusion decreased intestinal inflammation in Atlantic salmon (Refstie et al., 2010). In sea bream, supplementation with 0.4% MOS in diets with 31% soybean meal inclusion increased microvilli density and the length of intestinal folds (Dimitroglou et al., 2010). These results suggest that MOS have a protective effect against intestinal inflammation triggered by soybean meal. However, multiple factors were involved, such as interspecific variation, the inclusion level of soybean meal and the percentage of additive used in the supplementation. Thus, further research is needed to compare the efficacy of new "intestinal protector" additives.

Developing research in zebrafish to find solutions to the issues mentioned above first requires corroboration that the intestinal inflammation triggered by soybean meal in zebrafish recapitulates what is seen in farmed fish. This question was addressed directly by Hedrera et al. (2013), who presented a new strategy to analyze the potential intestinal impact of the intake of different food ingredients. Specifically, they analyzed the effects of the ingestion of soybean meal and two of its components, soy protein and soy saponin, in zebrafish, demonstrating that larvae fed with soybean meal developed intestinal inflammation as early as 2 days after the start of feeding. Moreover, saponin, but not soy protein extract, was found to be responsible for the inflammatory response.

These findings support the use of zebrafish screening assays to identify novel ingredients/additives that would lead to the improvement of current fish diets or to the formulation of new ones. **Figure 1** illustrates the two steps in the proposed strategy. The first step is a "pre-screening" carried out in zebrafish. The aim of this step is to evaluate a large number and wide range of ingredients or additives in order to select the most beneficial or least harmful. The second step is the determination of the intestinal effects generated by the selected ingredients in the target fish species. This method eliminates the need to evaluate all diets directly in commercial fish, avoiding costly and time-consuming experimentation.

Although zebrafish has potential in nutritional studies, it is important to highlight that assays in this fish cannot replace analysis in farmed fish, and that results cannot be directly extrapolated to other fish species. For example, the level of enteritis triggered by soybean meal in Atlantic salmon differs from that detected in rainbow trout. What is important is that in both species soybean causes intestinal inflammation that is mainly due to saponin, which also occurs in zebrafish. Moreover, in all these species the levels of proinflammatory cytokines are upregulated, suggesting that the molecular mechanisms are conserved. In a similar way, several studies have shown that the intake of a diet based on soybean meal decreases the growth rate of salmon, rainbow trout, carp, tilapia, sea bream, and also zebrafish (Pongmaneerat et al., 1993; Médale et al., 1998; Fontaínhas-Fernandes et al., 1999; Gómez-Requeni et al., 2004; Mundheim et al., 2004; Ulloa et al., 2013). These results suggest that the biological processes and molecular mechanisms that underlie the growth response to nutrients overlap among different fish, regardless of evolutionary distance or environmental conditions. How signaling cascades are coordinated, and how they affect physiological responses such as growth and inflammation, may be unraveled in zebrafish. Thus, investigations undertaken in zebrafish nutrition could make important contributions to aquaculture nutrition research.

## **FUTURE DIRECTIONS**

The current challenge is to apply the knowledge obtained in zebrafish to benefit the aquaculture industry. In the future, one of the principal challenges will be to cultivate carnivorous fish that can tolerate higher levels of plant protein in their diet. New technologies such as RNA-seq and genotyping platforms will be key to our ability to select fish with increased tolerance to a plant protein diet. The identification of more benign plant ingredients should also be pursued. Thus, it is not hard to imagine that in the near future fish diets will be formulated with ingredients and/or additives according to the genetic background of the strain of interest instead of depending solely on the species.

## **AUTHOR CONTRIBUTIONS**

Pilar E. Ulloa: drafting of the manuscript; Carmen G. Feijoo: drafting of the manuscript and critical revision of the intellectual content; Juan F. Medrano: critical revisions of the intellectual content.

## **ACKNOWLEDGMENT**

This work was supported by Fondecyt grant 3130664 to Pilar E. Ulloa.

## **REFERENCES**


hind-gut enteritis. *Aquaculture* 248, 147–161. doi: 10.1016/j.aquaculture.2005.04.021


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 16 June 2014; accepted: 20 August 2014; published online: 10 September 2014.*

*Citation: Ulloa PE, Medrano JF and Feijoo CG (2014) Zebrafish as animal model for aquaculture nutrition research. Front. Genet. 5:313. doi: 10.3389/fgene.2014.00313*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright © 2014 Ulloa, Medrano and Feijoo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Parentage assignment with genomic markers: a major advance for understanding and exploiting genetic variation of quantitative traits in farmed aquatic animals

## **Marc Vandeputte1,2\* and Pierrick Haffray<sup>3</sup>**

<sup>1</sup> INRA UMR1313 Génétique Animale et Biologie Intégrative, Institut National de la Recherche Agronomique, Jouy en Josas, France

<sup>2</sup> Ifremer, Institut Français de Recherche pour l'Exploitation de la Mer, Palavas-les-Flots, France

<sup>3</sup> Sysaaf, Syndicat des Sélectionneurs Avicoles et Aquacoles Français, Rennes, France

#### **Edited by:**

Ross Houston, University of Edinburgh, UK

#### **Reviewed by:**

Jeff Silverstein, United States Department of Agriculture, USA Gen H. Yue, National University of Singapore, Singapore

#### **\*Correspondence:**

Marc Vandeputte, Ifremer/INRA, Chemin de Maguelone, F-34250 Palavas-les-Flots, France e-mail: marc.vandeputte@jouy.inra.fr

Since the middle of the 1990s, parentage assignment using microsatellite markers has been introduced as a tool in aquaculture breeding. It now allows close to 100% assignment success, and has offered new ways to develop aquaculture breeding using mixed family designs in commercial conditions. Its main achievements are the knowledge and control of family representation and inbreeding, especially in mass spawning species; above all, the capacity to estimate reliable genetic parameters in any species and rearing system with no prior investment in structures; and the development of new breeding programs in many species. Parentage assignment should not be seen as a way to replace physical tagging, but as a new way to conceive breeding programs, which have to be optimized with its specific constraints. One of the most important is to define carefully the number of individuals to genotype, in order to limit costs and maximize genetic gain while minimizing inbreeding. The recent possible shift to (for the moment) more costly single nucleotide polymorphism markers should benefit from future developments in genomics and marker-assisted selection to combine parentage assignment and indirect prediction of breeding values.

**Keywords: aquaculture, parentage assignment, selective breeding, microsatellites, SNPs**

## **INTRODUCTION**

Aquaculture is now the fastest growing animal production sector worldwide, and provides half of the fish for human consumption (FAO, 2014). Such an important sector would be expected to use the best knowledge-based improvement methods, amongst which selective breeding is of paramount importance. However, Gjedrem et al. (2012) estimated that only 10% of aquaculture production worldwide is based on genetically improved stocks. There may be several reasons for this, but one clear technical weakness of aquaculture regarding the development of optimized selective breeding schemes is the fact that pedigree information is difficult and costly to obtain.

The basic reason is rather straightforward: farmed aquatic animals are all too small at hatching (from a few micrograms in mollusks and crustaceans to ca. 100 mg in salmonid fishes) to be physically tagged.

Initially, there were two ways for fish genetic studies and breeding programs to deal with the question of pedigrees. The first and simpler solution was to dispense with a pedigree and use individual selection. In this case, fish are selected solely based on their own individual phenotype (see review in Gjedrem and Thodesen, 2005). Although effective in obtaining genetic gain, this method is very limiting for studying genetic variation as: (1) it provides results only after a minimum of two generations, (2) it requires the maintenance of at least two fish lines, selected/control or divergent lines, (3) it limits the evaluation of genetic variation to one trait only, and (4) the precision of realized heritability estimates is low in reasonably-sized two-generation experiments (Nicholas, 1980).

The second option to solve the pedigree problem is to use separate rearing of families until a size where tagging is possible, as in the Norwegian salmon breeding program, the first family-based selective breeding program in aquaculture, started in 1972 (Gjedrem, 2010). This was successfully extended to major aquaculture species such as salmonids, tilapias, oysters, or shrimps (Krishna et al., 2011; Thodesen et al., 2011; Gjedrem, 2012; Gjedrem et al., 2012; Zak et al., 2014). Although efficient, this method has three main drawbacks when it comes to estimating genetic parameters of traits. First, as families are reared separately, common environmental effects between tanks may inflate heritability estimates. Second, studying genetic variation with separate rearing of progenies requires the preexistence of the family rearing units, i.e., of the infrastructure of the breeding program. Exploratory studies are then difficult to undertake. Third, the number of families is limited to the number of family rearing units used. Mating designs are therefore constrained to those where the number of families produced is low for a given number of parents tested, like single pair mating or nested designs, which, unlike factorial designs, do not allow the separation of additive, maternal, common environment, and dominance effects (Becker, 1967).

Therefore, the provision of a method to trace pedigrees in groups of mixed families, with any type of family structure, was expected to be of great interest to study genetic variation of quantitative traits in aquaculture species, and subsequently to set up new types of breeding programs.

The principles of parentage assignment were set up for livestock paternity testing with allozymes (Jamieson, 1965). In fish, the very first trials were done in the 1970s in Israel, also using allozymes, but the number of families that could be discriminated was very low (<10) and the use of the method was limited to carp in one research team (e.g., Brody et al., 1981). The real start of parentage assignment studies in fish was in the 1990s with the availability of microsatellite markers (Herbinger et al., 1995; Estoup et al., 1998).

## **TECHNICAL ASPECTS OF PARENTAGE ASSIGNMENT**

### **PARENTAGE ASSIGNMENT METHODS**

Basically, two computation methods are used for parentage assignment: exclusion-based methods and likelihood-based methods (see Jones et al., 2010 for a review). Exclusion is very simple and makes no assumptions other than Mendelian segregation of alleles, but is very sensitive to genotyping errors. When error rates are moderate and theoretical assignment power is high, however, genotyping errors can be dealt with by allowing a limited number of allelic mismatches between an offspring and its parents' alleles (Vandeputte et al., 2006), and exclusion remains the gold standard of parentage assignment (Yue and Xia, 2014). Exclusion programs used in aquaculture are PROBMAX (Danzmann, 1997), VITASSIGN (Vandeputte et al., 2006), and FAP (Taggart, 2007). Likelihood methods take a probabilistic approach. In this case, the most likely parent pair is chosen as the true one (possibly integrating a genotyping error rate), but the decision rules rely on assumptions about allelic frequencies. Likelihood methods generally give higher assignment rates than exclusion with low power marker sets, but sometimes give inconsistent results (Herlin et al., 2007; Trong et al., 2013). Using sibship information in calculations can greatly improve the efficiency of the likelihood methods (Wang and Santure, 2009). Likelihood programs used in aquaculture are CERVUS (Kalinowski et al., 2007), PAPA (Duchesne et al., 2002), and PARENTE (Cercueil et al., 2002).
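The exclusion principle with a tolerated number of mismatches can be sketched in a few lines. The code below is a minimal illustration only and does not reproduce the algorithm of PROBMAX, VITASSIGN or FAP: the function names, the data layout (one pair of allele sizes per locus) and the toy genotypes are assumptions made for this example, and missing genotypes are not handled.

```python
from itertools import product

def locus_compatible(off, sire, dam):
    """Mendelian check: one offspring allele from the sire, the other from the dam."""
    a, b = off
    return (a in sire and b in dam) or (b in sire and a in dam)

def count_mismatches(off, sire, dam):
    """Number of loci at which the trio violates Mendelian segregation."""
    return sum(not locus_compatible(off[loc], sire[loc], dam[loc]) for loc in off)

def assign_by_exclusion(off, sires, dams, max_mismatches=0):
    """Return every (sire, dam) pair that is not excluded for this offspring."""
    return [
        (s_id, d_id)
        for (s_id, s_gen), (d_id, d_gen) in product(sires.items(), dams.items())
        if count_mismatches(off, s_gen, d_gen) <= max_mismatches
    ]

# Invented microsatellite genotypes (allele sizes) at three loci.
sires = {
    "S1": {"L1": (100, 104), "L2": (200, 200), "L3": (150, 154)},
    "S2": {"L1": (102, 106), "L2": (202, 204), "L3": (152, 156)},
}
dams = {
    "D1": {"L1": (108, 110), "L2": (206, 208), "L3": (158, 160)},
    "D2": {"L1": (112, 114), "L2": (210, 212), "L3": (162, 164)},
}
clean = {"L1": (100, 108), "L2": (200, 206), "L3": (154, 160)}
# Same fish with a hypothetical sizing error at L3 (160 scored as 161).
noisy = {"L1": (100, 108), "L2": (200, 206), "L3": (154, 161)}

strict_clean = assign_by_exclusion(clean, sires, dams)
strict_noisy = assign_by_exclusion(noisy, sires, dams)
tolerant_noisy = assign_by_exclusion(noisy, sires, dams, max_mismatches=1)
print(strict_clean, strict_noisy, tolerant_noisy)
```

An offspring is considered assigned only when exactly one pair remains. With strict exclusion the mis-scored fish loses its true parents, whereas tolerating one mismatch recovers them, at the price of requiring more assignment power so that false pairs are still excluded.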

A further specific issue is the assignment of polyploids, especially sturgeons (Rodzen et al., 2004) or induced polyploids (Miller et al., 2014). Specific packages have been developed for tetraploids (wHDP; Galli et al., 2011) and for diploids to octoploids (VITASSIGN-OCTO; Vandeputte, unpublished), as well as a general method to transform polyploid genotypes to pseudo-diploid dominant genotypes (Wang and Scribner, 2014).

### **A CRUCIAL ISSUE: THE ASSIGNMENT POWER OF MARKERS USED**

However, whatever the method used, the first requirement for using parentage assignment in practice is to obtain high levels of unique assignments, which primarily depends on the assignment power of the marker set used. This power depends on the exclusion probabilities of the markers used and on the size of the problem to be solved, the total number of putative parents having an exponential effect on the proportion of unassigned individuals (Vandeputte, 2012). Overestimation of the assignment power of markers is very frequent (Vandeputte et al., 2011), and can be explained by Hardy–Weinberg disequilibrium (Wang, 2007), sampling variance and relatedness of parents (Villanueva et al., 2002; Matson et al., 2008), incomplete genotypes, genotyping errors especially caused by stuttering or size-shift (Sutton et al., 2011; Yue and Xia, 2014), and null alleles (Christie, 2010). In some species groups like mollusks, null alleles may be extremely frequent and problematic (Hedgecock et al., 2004), but the main cause of overestimation of the theoretical assignment power is a widespread inappropriate calculation method (Vandeputte, 2012). Typically, assignment power >0.99 can generally be obtained with 8–15 microsatellite markers in fish crosses involving a few tens or hundreds of parents, and a reasonable option when designing a marker set is to include a few more markers than theoretically needed. This then saves a lot of time by providing easy assignment even if small problems of genotyping errors, inbreeding or null alleles appear. High quality genotyping is also essential, and a recent review by Yue and Xia (2014) gives very useful insights into this question.
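The exponential effect of the number of putative parents can be illustrated with a deliberately simplified calculation. The sketch below is not the formula of Vandeputte (2012): it assumes independent loci, a hypothetical per-locus probability of excluding a random false parent pair, and it treats every false pair as unrelated to the true one, which is optimistic because false pairs that share one true parent are harder to exclude. The function names and numerical values are illustrative assumptions.

```python
from math import prod

def combined_exclusion(per_locus):
    """Probability that at least one locus excludes a given false parent pair,
    assuming loci are independent."""
    return 1.0 - prod(1.0 - e for e in per_locus)

def unique_assignment_prob(per_locus, n_sires, n_dams):
    """Simplified probability that all false pairs in a full factorial
    of n_sires x n_dams candidates are excluded."""
    n_false_pairs = n_sires * n_dams - 1
    return combined_exclusion(per_locus) ** n_false_pairs

# Hypothetical marker sets: every locus excludes a false pair with probability 0.8.
six_loci = [0.8] * 6
ten_loci = [0.8] * 10

power_small_cross = unique_assignment_prob(six_loci, 10, 10)
power_large_cross = unique_assignment_prob(six_loci, 100, 100)
power_large_cross_more_loci = unique_assignment_prob(ten_loci, 100, 100)
print(power_small_cross, power_large_cross, power_large_cross_more_loci)
```

Under these assumptions, a marker set that looks adequate for a cross with a few tens of parents can leave a large share of offspring unassigned when hundreds of parents are involved, and a few extra loci restore the power, in line with the advice to include more markers than theoretically needed.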

### **MICROSATELLITES AND SNPs FOR PARENTAGE ASSIGNMENT**

Microsatellites, due to their high number and high variability, are the markers that allowed the development of efficient parentage assignment methods. Today, however, the use of SNPs (single nucleotide polymorphisms) is growing exponentially (Guichoux et al., 2011), although not yet in parentage assignment. It was estimated that ∼6 SNPs give the same assignment power as 1 microsatellite (Glaubitz et al., 2003). Empirical studies tend to suggest that the adequate number of SNPs for an efficient panel would be in the 100–450 range (Trong et al., 2013; Lapègue et al., 2014; Nguyen et al., 2014; Sellars et al., 2014). With such numbers, the classical requirement of unlinked markers within a panel cannot be met, thus lowering the real assignment power. SNPs are individually less expensive to genotype than microsatellites, but multiplexing has decreased the cost of microsatellite genotyping (Guichoux et al., 2011; Yue and Xia, 2014), and for the moment SNPs remain more expensive due to the number required; technology is, however, evolving rapidly for SNPs and not for microsatellites. Empirical studies also sometimes reveal quite a high number of genotyping errors in SNPs (Trong et al., 2013) and the need to test more SNP markers than expected in order to select the appropriate ones (Lapègue et al., 2014; Nguyen et al., 2014). However, prospects for the development of genomic selection with low marker density may imply genotyping of a few hundred to several thousand SNPs per fish (Lillehammer et al., 2013), which in this case would be sufficient to provide parentage assignment at no additional cost. The recent shift to SNP markers has, however, been effective in improving assignment at least in some mollusk species, which suffered from high numbers of null alleles with microsatellites (Lapègue et al., 2014; Nguyen et al., 2014).
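The order of magnitude of the SNP-to-microsatellite equivalence can be explored with a standard single-locus exclusion probability from the parentage testing literature (codominant alleles, Hardy–Weinberg equilibrium, one parent known). The sketch below is not the calculation of Glaubitz et al. (2003); the allele frequencies (a SNP with two equally frequent alleles, a microsatellite with ten equally frequent alleles) and the function names are assumptions chosen for illustration, and the resulting ratio depends strongly on them.

```python
from math import log

def exclusion_one_parent_known(freqs):
    """Single-locus probability of excluding a random false parent when the
    other parent is known, from allele frequencies (assumed to sum to 1)."""
    s2 = sum(p ** 2 for p in freqs)
    s3 = sum(p ** 3 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    s5 = sum(p ** 5 for p in freqs)
    return 1 - 2 * s2 + s3 + 2 * s4 - 3 * s5 - 2 * s2 ** 2 + 3 * s2 * s3

def equivalent_loci(excl_target_locus, excl_small_locus):
    """How many independent 'small' loci match the exclusion power of one
    'target' locus (solves 1 - (1 - small)^n = target for n)."""
    return log(1 - excl_target_locus) / log(1 - excl_small_locus)

snp = exclusion_one_parent_known([0.5, 0.5])      # most informative biallelic SNP
msat = exclusion_one_parent_known([0.1] * 10)     # hypothetical 10-allele microsatellite
snps_per_msat = equivalent_loci(msat, snp)
print(f"SNP: {snp:.4f}, microsatellite: {msat:.4f}, ratio: {snps_per_msat:.1f}")
```

Less polymorphic microsatellites lower the ratio and SNPs with unbalanced allele frequencies raise it, which is one reason why empirical panel sizes vary so widely.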

## **IMPLEMENTATION OF PARENTAGE ASSIGNMENT IN AQUACULTURE**

### **INBREEDING CONTROL**

Mass selection is the simplest way to improve traits such as growth or morphology, but bears a high risk of rapid genetic loss, with highly unbalanced families, which was revealed by parentage assignment mostly in mass spawning species (Perez-Enriquez et al., 1999; Waldbieser and Wolters, 1999; Boudry et al., 2002; Brown et al., 2005; Fessehaye et al., 2006; Herlin et al., 2007; Wang et al., 2008), but also in controlled artificial reproduction systems (Saillant et al., 2002; Kaspar et al., 2008).

The impact of different factors (mating design, mating ratio, number of parents per generation, selection pressure, trait heritability, grading practices) was simulated to improve inbreeding control and optimize genetic progress (Gjerde et al., 1996; Dupont-Nivet et al., 2006; Loughnan et al., 2013; Domingos et al., 2014), including optimal contribution selection, which requires pedigree knowledge (Sonesson, 2005; Skaarud et al., 2011).

## **ESTIMATION OF GENETIC PARAMETERS**

Estimation of heritability and genetic correlations makes it possible to evaluate expected genetic gains and to design breeding programs. This is perhaps where access to pedigree information through genotyping has made its most important and fruitful contribution to aquaculture genetics to date.
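As an illustration of how a recovered pedigree feeds into heritability estimation, the sketch below applies the textbook balanced one-way ANOVA estimator for paternal half-sib families. This is an assumption chosen for brevity: the studies cited here generally rely on mixed-model methods, and the function name and data layout are hypothetical.

```python
def half_sib_heritability(families):
    """Balanced one-way ANOVA on paternal half-sib families.

    `families` is a list of equally sized lists of phenotypes, one list per
    sire (families as recovered by parentage assignment). Returns the raw
    estimate 4 * var_sire / (var_sire + var_within); with small samples it
    can fall outside the 0-1 range.
    """
    s = len(families)
    n = len(families[0])
    if s < 2 or n < 2 or any(len(f) != n for f in families):
        raise ValueError("need >= 2 sires with the same number (>= 2) of offspring")
    means = [sum(f) / n for f in families]
    grand = sum(means) / s
    # Mean squares between and within sire families
    ms_between = n * sum((m - grand) ** 2 for m in means) / (s - 1)
    ms_within = sum((x - m) ** 2 for f, m in zip(families, means) for x in f) / (s * (n - 1))
    # Sire variance component from the expected mean squares
    var_sire = (ms_between - ms_within) / n
    return 4.0 * var_sire / (var_sire + ms_within)
```

The estimator ignores dam, maternal and common environmental effects, so it only shows the principle that family structure, once known, allows phenotypic variance to be partitioned.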

Optimization of mixed family designs for genetic parameters estimation was done by Vandeputte et al. (2001) for strain effects, Dupont-Nivet et al. (2002) for heritability, and Sae-Lim et al. (2010) for genotype by environment (G × E) interaction. After several feasibility studies with few families (Herbinger et al., 1995, 1999; Saillant et al., 2002), heritabilities were estimated in a growing number of fish species for growth (Wilson et al., 2003; Saillant et al., 2006; Dupont-Nivet et al., 2008; Pierce et al., 2008; Wang et al., 2008; Gheyas et al., 2009; Domingos et al., 2013; Whatmore et al., 2013), processing traits (Kocour et al., 2007; Navarro et al., 2009; Saillant et al., 2009; Haffray et al., 2012a), flesh color (Norris and Cunningham, 2004), muscle fiber diameter (Vieira et al., 2007), deformities (Bardon et al., 2009), disease resistance (Guy et al., 2006; Antonello et al., 2009) or sex ratio (Vandeputte et al., 2007), and in shrimps and mollusks for growth (Jerry et al., 2006; Lucas et al., 2006; Kong et al., 2013; Nguyen et al., 2014) or meat yield in mussel (Nguyen et al., 2014). Heritabilities obtained in mixed family rearing are often higher than those recorded in separate rearing, which may be linked to the general absence of between-family environmental variance due to family mixing, although non-genetic maternal effects may persist in mixed family rearing (Haffray et al., 2012b), with possible upward biases on heritability estimates. Explicit comparisons of the same families in mixed or separate rearing designs concluded that separate family rearing induced much higher levels of between-family environmental effects (Herbinger et al., 1999; Ninh et al., 2011).

Mixed family rearing also allowed the estimation of G × E interactions between rearing systems (Dupont-Nivet et al., 2008, 2010c; Navarro et al., 2009; Kvingedal et al., 2010; Domingos et al., 2013; Sae-Lim et al., 2013; Vandeputte et al., 2014), density or rearing temperatures (Saillant et al., 2006), plant-based vs marine feeds (Pierce et al., 2008; Le Boucher et al., 2013; Bestin et al., 2014), and even separated vs mixed fish family rearing designs (Ninh et al., 2011).

A limiting factor of such studies is that, as fish are generally tagged to maximize the collection of individual information, individual performances are not available before physical tagging, which limits genetic studies on early stages. However, recent advances allow individual tagging at 200–400 mg (Ferrari et al., 2014), which should change this situation.

## **IMPLEMENTATION OF BREEDING PROGRAMS**

## **Concepts used and implementation**

The first proposal to use parentage assignment in breeding at an acceptable cost was an improvement of within-family selection called "walk-back selection" (Doyle and Herbinger, 1995). A two-step process of assignment was suggested and tested to achieve a minimal number of selected candidates per family (Herbinger et al., 1995).
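The general idea of walk-back selection can be sketched as follows: candidates are ranked on their own performance, genotyped from the top of the ranking downward, and retained only while their family quota is not yet filled. This is a minimal sketch in the spirit of the description, not the exact protocol of Doyle and Herbinger (1995); the function name, the `assign_family` callable (which stands in for genotyping plus parentage assignment) and the quota parameter are hypothetical.

```python
def walk_back_select(candidates, assign_family, n_broodstock, max_per_family):
    """Select broodstock by walking back down the performance ranking.

    candidates     : list of (fish_id, performance) tuples
    assign_family  : callable fish_id -> family label (hypothetical stand-in
                     for genotyping and parentage assignment)
    n_broodstock   : number of animals to retain
    max_per_family : cap on selected animals per family
    Returns (selected fish ids, number of animals genotyped).
    """
    selected, per_family, genotyped = [], {}, 0
    # Best performers are examined first
    for fish_id, _ in sorted(candidates, key=lambda c: c[1], reverse=True):
        if len(selected) == n_broodstock:
            break
        family = assign_family(fish_id)  # each call represents one genotyping
        genotyped += 1
        if per_family.get(family, 0) < max_per_family:
            per_family[family] = per_family.get(family, 0) + 1
            selected.append(fish_id)
    return selected, genotyped
```

The second return value makes the cost side visible: only animals actually examined are genotyped, which is what keeps the approach affordable compared with genotyping all candidates.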

Since then, public organizations and breeding companies have initiated selection programs using parentage assignment in sturgeons (France, USA), Atlantic salmon (Ireland, Norway, Scotland), tilapia (Philippines), halibut (Norway and Scotland), rainbow trout (France), cod (Norway), gilthead sea bream (France, Greece, Spain), turbot (France), European sea bass (France, Greece), meagre and red drum (France), Asian sea bass (Singapore, Indonesia, Australia), and shrimps (Australia, Thailand, Mexico, Ecuador, Central and South America). This list may be incomplete and represents the present informal expert knowledge of the authors. Little information is publicly available on these programs, but mass selection, family-based selection (often BLUP: best linear unbiased prediction) or a combination of both are used to improve growth, processing yields, quality traits and disease resistance according to different schemes (**Figure 1**).

The key parameters to consider when developing a breeding program using parentage assignment are not only the genotyping cost (12–20 Euros per individual), but also the capacity to produce a large number of families in one batch to avoid tank effects, the true assignment efficiency, and the availability of tools such as rapid mass genotyping capacities (especially for species with a short generation interval), individual tagging to improve traceability and facilitate data collection, automated database systems to collect, store and link performances to tags, DNA samples and pedigrees, and optimized genetic software to rank and mate candidates so as to maximize genetic progress and minimize inbreeding. The use of parentage assignment is not merely "genetic tagging," but requires a complete re-optimization of breeding programs.

## **Optimization of breeding schemes using parentage assignment**

One main target for optimization has been the limitation of numbers genotyped, using two-way nested models for partial pedigrees (Li et al., 2003), or extreme phenotypes with family effect considered as a fixed effect (Morton and Howarth, 2005). BLUP selection normally requires the knowledge of performance and pedigree on all candidates, which is not the case in **Figures 1D,F**. In these cases, the loss of selection efficiency (compared to BLUP with pedigree known on all candidates) depends on selection intensity and genetic parameters (Chapuis et al., 2010; Dupont-Nivet et al., 2010b; Sonesson et al., 2011). In addition, issues linked to mixing of families were studied, such as methods to limit non-genetic maternal effect in salmonids (Haffray et al., 2012b), effect of grading practices to limit cannibalism on family contributions in barramundi (Loughnan et al., 2013), and the importance of considering male maturation status to estimate heritability of growth more accurately (Dupont-Nivet et al., 2010a). Ninh et al. (2011) and Sae-Lim et al. (2013) compared expected genetic gains with different systems of evaluation (mixed/separate families and impact of G × E interactions), while Haffray et al. (2013) proposed application of ultrasound tomography to predict processing yields on live candidates to limit the use of slaughtered sibs.

**Figure 1 |** […] candidates are selected based on their own performance. **(B)** Under walk-back selection, animals selected on their individual performance are genotyped, and a subset of those with balanced family representation is used as broodstock. **(C)** Under BLUP, all animals are genotyped and phenotyped […] are submitted to a lethal challenge (e.g., disease, processing yields) and genotyped and family values are incorporated in breeding value evaluation for the lethal traits. **(F)** BLUP with sib and pre-selection combines panels **(D,E)**.

**Table 1 | SWOT analysis of parentage assignment with genomic markers for aquaculture breeding.**

## **GLOBAL APPRAISAL AND PERSPECTIVES**

The rapid increase of publications using parentage assignment in the last decade shows how powerful this method is to estimate genetic parameters in any species and rearing system. It avoids the initial investment in separate family rearing units and limits associated biases, even more in species with high larval mortality, small larval size, and initial live feeding. Applications are strongly driven by reproductive constraints linked to the need to simultaneously produce enough families (**Table 1**). The cost/information ratio has to be maximized with adequate management of variance sources (number of parents, initial representation of families, or groups of spawns), mating design, and number of individuals genotyped.

Optimal investment in parentage assignment is a balance between the reduction of the investment and operational costs needed for separate family rearing and the cost of genotyping, which presently limits the application of parentage assignment to mass selection and family-based selection on a limited number of traits. Moreover, any new trait that cannot be recorded on the live candidate and has to be measured on sibs requires additional genotyping, whose cost/benefit ratio has to be estimated case by case and compared with the possible use of indirect criteria.

A major benefit of parentage assignment is that it allows high selection pressure (<3%) to be applied in commercial conditions, while still controlling inbreeding. The knowledge of pedigree also allows an increase in selection accuracy (and hence a higher selection gain) on all traits, as well as selection on lethal traits, which cannot be done by individual selection. This technology also makes it easy to combine sanitary protection of the breeding nucleus with sib testing in commercial environments. Parentage assignment offers simplicity and flexibility in the life of the breeding program, which can be easily adapted to new traits, new mating schemes, or different numbers of candidates. This is critical, especially at the initiation of domestication, for "niche" species or in developing countries, where the need for separate rearing systems has often prevented any investment in selective breeding in the past, or has fixed the architecture of the breeding programs.

## **ACKNOWLEDGMENT**

This work received funding from the European Union's Seventh Framework Programme (FP7 2007-2013) under grant agreement no. 613611 (FISHBOOST).

## **REFERENCES**


the European sea bass (*Dicentrarchus labrax*)? *Aquaculture* 294, 194–201. doi: 10.1016/j.aquaculture.2009.06.018


FAO. (2014). *The State of the World Fisheries and Aquaculture 2014*. FAO, Rome.


loci cloned from the Pacific oyster, *Crassostrea gigas*. *J. Shellfish Res.* 23, 379–385.


using microsatellites to assign parentage. *Aquaculture* 259, 146–152. doi: 10.1016/j.aquaculture.2006.05.039


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 12 September 2014; paper pending published: 20 October 2014; accepted: 22 November 2014; published online: 12 December 2014.*

*Citation: Vandeputte M and Haffray P (2014) Parentage assignment with genomic markers: a major advance for understanding and exploiting genetic variation of quantitative traits in farmed aquatic animals. Front. Genet. 5:432. doi: 10.3389/fgene.2014.00432*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics.*

*Copyright* © *2014 Vandeputte and Haffray. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Genetic improvement of Pacific white shrimp [*Penaeus* (*Litopenaeus*) *vannamei*]: perspectives for genomic selection

*Héctor Castillo-Juárez<sup>1</sup>, Gabriel R. Campos-Montes<sup>2</sup>, Alejandra Caballero-Zamora<sup>1</sup> and Hugo H. Montaldo<sup>3</sup>\**

*<sup>1</sup> Departamento de Producción Agrícola y Animal, División de Ciencias Biológicas y de la Salud, Unidad Xochimilco, Universidad Autónoma Metropolitana, Mexico City, Mexico, <sup>2</sup> Departamento El Hombre y su Ambiente, Unidad Xochimilco, Universidad Autónoma Metropolitana, Mexico City, Mexico, <sup>3</sup> Departamento de Genética y Bioestadística, Facultad de Medicina Veterinaria y Zootecnia, Universidad Nacional Autónoma de México, Mexico City, Mexico*

#### *Edited by:*

*José Manuel Yáñez, University of Chile, Chile*

#### *Reviewed by:*

*Juan Steibel, Michigan State University, USA*

*Ross Houston, University of Edinburgh, UK*

## *\*Correspondence:*

*Hugo H. Montaldo, Departamento de Genética y Bioestadística, Facultad de Medicina Veterinaria y Zootecnia, Ciudad Universitaria, Universidad Nacional Autónoma de México, Circuito Exterior s/n, 04510 Mexico City, Mexico montaldo@unam.mx*

#### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 22 September 2014; Accepted: 20 February 2015; Published: 24 March 2015*

#### *Citation:*

*Castillo-Juárez H, Campos-Montes GR, Caballero-Zamora A and Montaldo HH (2015) Genetic improvement of Pacific white shrimp [Penaeus (Litopenaeus) vannamei]: perspectives for genomic selection. Front. Genet. 6:93. doi: 10.3389/fgene.2015.00093*

The use of breeding programs for the Pacific white shrimp [*Penaeus (Litopenaeus) vannamei*] based on mixed linear models with pedigreed data is described. The application of these classic breeding methods has yielded continuous progress of great value in increasing the profitability of the shrimp industry in several countries. Recent advances in areas such as shrimp genomics will allow for the development of new breeding programs in the near future that will increase genetic progress. In particular, these novel techniques may help increase resistance to specific emerging diseases, which is today a very important component of shrimp breeding programs. Thanks to increased selection accuracy, simulated genetic advance using genomic selection for survival to a disease challenge was up to 2.6 times that of phenotypic sib selection.

Keywords: *P. vannamei*, growth, survival rate, disease resistance, genomic selection

## Introduction

Shrimp production is an important activity both in economic terms and from the perspective of its contribution to human nutrition. Total world shrimp production had a value of approximately \$13.6 billion USD in 2013. The most important shrimp production regions in the world are located in Asia, principally China, India, Vietnam, Indonesia, and Bangladesh, and in the Americas, primarily in Ecuador, Brazil, and Mexico (Food and Agriculture Organization of the United Nations [FAO], 2014).

Genetic improvement is an important option for increasing profitability in agriculture and aquaculture (Gjedrem et al., 2012). Several shrimp breeding programs have been implemented in a number of countries, some of which have been reviewed by Neira (2010) and Rye (2012).

The increasing importance of disease in shrimp farming worldwide has stimulated research on breeding programs aimed at increasing disease resistance or tolerance. Evidence for successful selection for disease resistance exists in shrimp for Taura virus and other diseases (Cock et al., 2009; Lightner et al., 2012). In a few cases, alternative programs which combine mass selection in communally raised animals with recovery of family identification using DNA markers have been used to design matings that avoid excessive increases in inbreeding rates and to select animals in the presence of disease, allowing natural selection to act toward increased genetic resistance (Rocha, 2012). Other programs have used experimental challenges combined with mass selection in successive generations and obtained increases in genetic resistance to White Spot Syndrome Virus (Cuéllar-Anjel et al., 2012).

## Shrimp Breeding Programs

Although many commercial shrimp breeding programs are not fully described, most are based on population structures typical for aquaculture breeding, using full- and half-sib family structures (Gjedrem et al., 2012). Most shrimp breeding programs focus on the improvement of growth traits and general survival rate. Some have concentrated on selection for specific diseases. These traits can be improved genetically by within-line selection.

The Maricultura del Pacífico hatchery commercial line population was started in 1998 in Mexico, from a heterogeneous population formed by a mix of domesticated lines and wild shrimp from several origins. The main efforts have been directed toward implementing a breeding program to develop a genetic line oriented toward improving the profitability of biomass production under the production conditions predominating in the northwest part of Mexico. Since biomass production depends on body weight at harvest and survival rate, these two traits have been incorporated as the broodstock selection criteria. The relative importance of these traits in the selection index used is 5:1 for body weight at harvest and survival, respectively, which is based on economic studies under the main production system conditions found in Mexico.

Each year from 2003 to 2010, an average of 15,445 shrimp obtained from 130 females and 93 males were evaluated for body weight at harvest. These evaluations were performed in 2–4 ponds where commercial-like production conditions are reproduced. Since 2009, survival from 65 to 130 days of age has also been genetically evaluated. Mating is designed based on the use of shrimp from the selected families with the higher selection index values. Hence, broodstock animals come, on average, from the upper 27% of families in the case of males, and from the upper 53% of families in the case of females. In addition to these selection procedures, within-family selection is performed at various growing stages based on approximations to the individual body weight. This is carried out in the genetic nucleus, under strict biosecurity conditions. Animals coming from the within-family selection procedures are the animals ultimately used as the next generation broodstock. Reproduction techniques are based on artificial insemination only, using two females per male. These procedures have been described by Castillo-Juárez et al. (2007) and by Campos-Montes et al. (2013), and some methodological implications have been discussed by Montaldo et al. (2013). Considering the time needed to evaluate animals and to obtain mature broodstock ready to use, the complete production cycle required to yield a new generation (generation interval) is 1 year.
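The family ranking step described above can be sketched as a weighted index followed by truncation. The sketch assumes that the 5:1 weighting is applied to standardized family breeding values for body weight at harvest and survival; the exact scaling used in the program is not given here, and the function names and example values are hypothetical.

```python
def rank_families(ebv, w_weight=5.0, w_survival=1.0):
    """Rank families on a two-trait index.

    ebv: dict mapping family -> (standardized EBV for harvest weight,
                                 standardized EBV for survival).
    The 5:1 default weights follow the relative importance quoted in the
    text; applying them to standardized EBVs is an assumption.
    """
    index = {fam: w_weight * bw + w_survival * sv for fam, (bw, sv) in ebv.items()}
    return sorted(index, key=index.get, reverse=True)

def top_fraction(ranked, fraction):
    """Keep the upper `fraction` of ranked families (at least one)."""
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k]

# Hypothetical standardized family EBVs
ebv = {"A": (1.0, 0.0), "B": (0.0, 2.0), "C": (-1.0, 1.0), "D": (0.5, 0.5)}
ranked = rank_families(ebv)
sire_families = top_fraction(ranked, 0.27)  # upper 27% of families for males
dam_families = top_fraction(ranked, 0.53)   # upper 53% of families for females
```

Within-family selection on individual body weight would then be applied inside the retained families, as described in the text.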

## Genetic Progress

Genetic progress has been evaluated in several shrimp selection programs. Hetzel et al. (2000) estimated the selection response per generation for 6-month weight in *Penaeus japonicus* at 8.3%. Andriantahina et al. (2012) estimated the genetic response per generation for 5-month weight in *Litopenaeus vannamei* at 10.7%. The genetic response for body weight at harvest (130 days of age) in the Maricultura del Pacífico commercial line has been evaluated using linear models (BLUP-animal model). The estimated genetic gain as a linear trend from 2003 to 2010 represents an increment of 18.4% of the average body weight for the period. For survival, the estimated genetic gain as a linear trend from 2004 to 2010 was also positive (1.56%).

## Potential of Genomic Selection for Disease Resistance

Diseases are major constraints for aquaculture production. As vaccination is not an option in shrimp and management containment measures are frequently unfeasible, genetic selection is considered a possible option for fighting many diseases in *P. vannamei* and other shrimp species. Genomic selection (GS) increases accuracy compared with conventional selection by taking advantage of both between- and within-family variance in situations where disease testing is carried out on sibs of the actual selection candidates, in order to avoid introducing the pathogen into the breeding nucleus population (Villanueva et al., 2011).

We used SelAction software (Rutten et al., 2002) and methods developed by Dekkers (2007) to deterministically simulate selection response in *P. vannamei* using dense genetic marker arrays (chips). This was done to assess the potential effects of GS in a breeding program oriented to the improvement of disease resistance. This simulation considered the context of a typical shrimp breeding program based on (sib) family selection. We used seven heritability values (0.01, 0.05, 0.10, 0.20, 0.30, 0.40, and 0.50) for survival under experimental infection of shrimp produced in breeding populations of different sizes, with relatively high selection intensities compared with those of our actual breeding program (Campos-Montes et al., 2013). This allows us to evaluate the possible genetic response for survival to different pathogens, which may include a wide range of viral as well as bacterial infections. The proportion of common full-sib environmental effects considered was 0.15 in all cases.

Accuracy from GS was obtained with the formula developed by Daetwyler et al. (2010), assuming a genome size of 28 Morgans and an effective population size of 50. This yielded a value for the number of independent chromosome segments close to 324, which may be considered conservative (Villanueva et al., 2011); therefore, accuracy from GS was probably not overestimated. To reflect the preliminary stage of development of SNP chips in shrimp, the proportion of the additive genetic variance explained by markers was also conservatively set at 0.64, which is below the value of 0.80 used for the 50 K SNP chip for cattle (Daetwyler, 2009).
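The quoted number of independent chromosome segments can be reproduced with one commonly used approximation, Me = 2·Ne·L / ln(4·Ne·L), combined with the accuracy expression sqrt(N·h² / (N·h² + Me)). The sketch below is an illustration under these assumptions; in particular, scaling the accuracy by the square root of the proportion of genetic variance explained by markers is an assumed way of applying the 0.64 value and may differ from the calculation actually performed. Function and parameter names are hypothetical.

```python
from math import log, sqrt

def independent_segments(ne, genome_morgans):
    """Approximate number of independent chromosome segments,
    Me = 2*Ne*L / ln(4*Ne*L)."""
    nel = ne * genome_morgans
    return 2.0 * nel / log(4.0 * nel)

def genomic_accuracy(n_train, h2, me, q2=1.0):
    """Expected accuracy sqrt(N*h2 / (N*h2 + Me)), multiplied by sqrt(q2),
    where q2 is the proportion of additive genetic variance explained by
    markers (assumed scaling)."""
    return sqrt(q2) * sqrt(n_train * h2 / (n_train * h2 + me))

me = independent_segments(ne=50, genome_morgans=28)
print(round(me))  # 324

# Accuracy for the three training population sizes considered in the text
for n in (900, 7500, 15000):
    print(n, genomic_accuracy(n, h2=0.10, me=me, q2=0.64))
```

The expression shows why accuracy rises with training population size and falls with lower heritability, which is the trade-off explored in the simulated scenarios.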

All breeding populations were derived from 30 males and 38 females, with an incomplete nested structure, similar to that described by Montaldo et al. (2013), to produce 150 families to be tested. Population sizes corresponding to 6, 50, and 100 shrimp per female were 900, 7,500, and 15,000 measured offspring, respectively.

Three breeding strategies were considered here for between-family selection: (1) selection on phenotypic information, (2) GS (based on contemporary training and testing populations of similar size, to avoid an increase in generation interval), and (3) a combined selection program that uses both genomic and phenotypic information.

Results (**Figure 1**) show that the potential of GS to develop lines with improved disease resistance is high. GS and combined selection programs gave large increases in selection response for survival, measured in phenotypic standard deviation units, in populations of size 7,500 and 15,000, but smaller increases in the population of size 900. The ratios of the responses from combined and GS programs to those from phenotypic programs increased as heritability decreased. For populations of size 7,500, these ratios were 2.6, 1.7, 1.6, 1.5, 1.4, and 1.4 for combined selection at heritability values of 0.01, 0.10, 0.20, 0.30, 0.40, and 0.50, respectively, and only slightly lower for GS. Analogous ratios for combined selection and GS relative to phenotypic selection for a population of size 15,000 were very similar. Results for combined selection in a population of size 900 showed an advantage over phenotypic selection with lower ratios (from 1.2 to 1.3), while ratios for GS were slightly below 1. These results indicate that with a training population of adequate size (probably of less than 7,500), GS could more than double the rate of selection response for survival to specific diseases, which is higher than previous estimates for GS programs in the context of aquaculture breeding for continuous traits (Nielsen et al., 2011). Interestingly, the relative advantage was greater for smaller heritability values. This is promising for improving survival to disease challenges in aquaculture species, which often have low heritability values (Yáñez and Martínez, 2010).

In this preliminary GS shrimp evaluation, we compared all the programs at the same selection intensity. This may unrealistically disadvantage GS programs, because GS would allow selection of individual animals at higher selection intensities when the individual identification of candidates is possible. Therefore, not all the potential advantages of GS and larger candidate populations to increase selection response were included in this study. Combined and GS responses were similar for populations of sizes 7,500 or 15,000, indicating that GS is capturing almost all the variation explained by the phenotypes, making GS more accurate. A more detailed comparison may include changes in inbreeding rates, and the optimization of the different factors affecting selection response. Advances in shrimp genomics (Yu et al., 2014; Zhang et al., 2014) may lead to the future development of SNP chips for *P. vannamei*, making GS in this species more viable in the near future.

Within the framework of our preliminary calculations, GS for survival rates to disease challenges in shrimp may lead to large increases in selection responses across a wide range of heritability values.

**Figure 1 |** […] and C) in *Penaeus vannamei*.

## Conclusion

The use of the so-called "classic" or conventional methods of quantitative genetics to genetically improve the Pacific white shrimp has allowed for continuous progress of great value in increasing the profitability of the shrimp industry in several countries, as it has in other aquaculture species. Recent advances in areas such as genomics will allow, in the near future, for the development of animal breeding methods that may accelerate shrimp genetic progress. In particular, these novel techniques may help increase resistance to specific emerging diseases, which is today a very important aspect of shrimp breeding programs.

## Acknowledgments

The authors are thankful to CONACyT, Mexico, for providing funds for our breeding research in the hatchery since 2007. Special thanks to Maricultura del Pacífico workers, for their contribution to operating the breeding program.

## References

model, on the estimation of genetic parameters for body weight at 28 days of age in the Pacific white shrimp (*Penaeus (Litopenaeus) vannamei* Boone, 1931). *Aquaculture Res.* 44, 1715–1723. doi: 10.1111/j.1365-2109.2012.03176.x


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Castillo-Juárez, Campos-Montes, Caballero-Zamora and Montaldo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*