# GENETIC AND GENOME-WIDE INSIGHTS INTO MICROBES STUDIED FOR BIOENERGY

EDITED BY: Katherine M. Pappas, Ed Louis, Nigel Minton, Biswarup Mukhopadhyay and Shane Yang

PUBLISHED IN: Frontiers in Microbiology

#### *Frontiers Copyright Statement*

*© Copyright 2007-2017 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

> *All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88945-085-5 DOI 10.3389/978-2-88945-085-5


# **GENETIC AND GENOME-WIDE INSIGHTS INTO MICROBES STUDIED FOR BIOENERGY**

Topic Editors:

**Katherine M. Pappas,** National & Kapodistrian University of Athens, Greece; **Ed Louis,** University of Leicester, UK; **Nigel Minton,** University of Nottingham, UK; **Biswarup Mukhopadhyay,** Virginia Tech, USA; **Shane Yang,** DOE National Renewable Energy Laboratory, USA

Genetic maps of the ethanologenic *Zymomonas mobilis* strain ATCC 29191 chromosome and plasmids (Desiniotis *et al.* 2012; *J. Bacteriol.* 194: 5966). Image provided by the K. M. Pappas Lab.

The global mandate for safer, cleaner and renewable energy has accelerated research on microbes that convert carbon sources to end-products serving as biofuels of the so-called first, second or third generation – e.g., bioethanol or biodiesel derived from starchy, sugar-rich or oily crops; bioethanol derived from composite lignocellulosic biomass; and biodiesels extracted from oil-producing algae and cyanobacteria, respectively. Recent advances in 'omics' applications are beginning to cast light on the biological mechanisms underlying biofuel production. They also unravel mechanisms important for organic solvent or high-added-value chemical production, which, along with those for fuel chemicals, are significant to the broader field of Bioenergy.

The Frontiers in Microbial Physiology Research Topic that led to this e-book ran from 2013 to 2014 and welcomed articles aiming to better understand the genetic basis of Bioenergy production. It invited genetic studies of microbes already used, or with the potential to be used, for bioethanol, biobutanol, biodiesel, and fuel gas production, as well as of microbes that are promising new catalysts for alternative bioproducts. Research on the systems biology of such microbes, gene function and regulation, genetic and/or genomic tool development, metabolic engineering, and synthetic biology leading to strain optimization was considered highly relevant to the topic. Bioinformatic analyses and modeling pertaining to gene network prediction and function were likewise invited. With the e-book now complete, we, the editors, strongly believe that all articles presented herein (original research papers, reviews, perspectives and a technology report) contribute significantly to the emerging insights into microbial-derived energy production.

Katherine M. Pappas, 2016

**Citation:** Pappas, K. M., Louis, E., Minton, N., Mukhopadhyay, B., Yang, S., eds. (2017). Genetic and Genome-Wide Insights into Microbes Studied for Bioenergy. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-085-5

# Table of Contents

*Genetic resources for methane production from biomass described with the Gene Ontology*

Endang Purwantini, Trudy Torto-Alalibo, Jane Lomax, João C. Setubal, Brett M. Tyler and Biswarup Mukhopadhyay

*Genetic resources for advanced biofuel production described with the Gene Ontology*

Trudy Torto-Alalibo, Endang Purwantini, Jane Lomax, João C. Setubal, Biswarup Mukhopadhyay and Brett M. Tyler

*Aromatic inhibitors derived from ammonia-pretreated lignocellulose hinder bacterial ethanologenesis by activating regulatory circuits controlling inhibitor efflux and detoxification*

David H. Keating, Yaoping Zhang, Irene M. Ong, Sean McIlwain, Eduardo H. Morales, Jeffrey A. Grass, Mary Tremaine, William Bothfeld, Alan Higbee, Arne Ulbrich, Allison J. Balloon, Michael S. Westphall, Josh Aldrich, Mary S. Lipton, Joonhoon Kim, Oleg V. Moskvin, Yury V. Bukhman, Joshua J. Coon, Patricia J. Kiley, Donna M. Bates and Robert Landick

*Genomic insights into the fungal lignocellulolytic system of* Myceliophthora thermophila

Anthi Karnaouri, Evangelos Topakas, Io Antonopoulou and Paul Christakopoulos

*Comparative genomics and evolution of regulons of the LacI-family transcription factors*

Dmitry A. Ravcheev, Matvei S. Khoroshkin, Olga N. Laikova, Olga V. Tsoy, Natalia V. Sernova, Svetlana A. Petrova, Aleksandra B. Rakhmaninova, Pavel S. Novichkov, Mikhail S. Gelfand and Dmitry A. Rodionov

*Connecting lignin-degradation pathway with pre-treatment inhibitor sensitivity of* Cupriavidus necator

Wei Wang, Shihui Yang, Glendon B. Hunsinger, Philip T. Pienkos and David K. Johnson


Hui Wei, Yan Fu, Lauren Magnusson, John O. Baker, Pin-Ching Maness, Qi Xu, Shihui Yang, Andrew Bowersox, Igor Bogorad, Wei Wang, Melvin P. Tucker, Michael E. Himmel and Shi-You Ding

*A mathematical model of metabolism and regulation provides a systems-level view of how* Escherichia coli *responds to oxygen*

Michael Ederer, Sonja Steinsiek, Stefan Stagge, Matthew D. Rolfe, Alexander Ter Beek, David Knies, M. Joost Teixeira de Mattos, Thomas Sauter, Jeffrey Green, Robert K. Poole, Katja Bettenbrock and Oliver Sawodny

*Death by a thousand cuts: the challenges and diverse landscape of lignocellulosic hydrolysate inhibitors*

Jeff S. Piotrowski, Yaoping Zhang, Donna M. Bates, David H. Keating, Trey K. Sato, Irene M. Ong and Robert Landick

*Modeling of* Zymomonas mobilis *central metabolism for novel metabolic engineering strategies*

Uldis Kalnenieks, Agris Pentjuss, Reinis Rutkis, Egils Stalidzans and David A. Fell

*Comparative genomics and functional analysis of rhamnose catabolic pathways and regulons in bacteria*

Irina A. Rodionova, Xiaoqing Li, Vera Thiel, Sergey Stolyar, Krista Stanton, James K. Fredrickson, Donald A. Bryant, Andrei L. Osterman, Aaron A. Best and Dmitry A. Rodionov

# Genetic resources for methane production from biomass described with the Gene Ontology

*Endang Purwantini<sup>1</sup>, Trudy Torto-Alalibo<sup>1</sup>, Jane Lomax<sup>2</sup>, João C. Setubal<sup>3,4</sup>, Brett M. Tyler<sup>4,5</sup> and Biswarup Mukhopadhyay<sup>1,4,6</sup>\**

*<sup>1</sup> Department of Biochemistry, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA*

*<sup>2</sup> European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, UK*

*<sup>3</sup> Department of Biochemistry, Universidade de São Paulo, São Paulo, Brazil*

*<sup>4</sup> Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA*

*<sup>5</sup> Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA*

*<sup>6</sup> Department of Biological Sciences, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Jim Spain, Georgia Institute of Technology, USA*

*Marcus Constantine Chibucos, University of Maryland School of Medicine, USA*

#### *\*Correspondence:*

*Biswarup Mukhopadhyay, Department of Biochemistry, Virginia Polytechnic Institute and State University, 123 Engel Hall, 340 West Campus Drive, Blacksburg, VA 24060, USA; e-mail: biswarup@vt.edu*

Methane (CH4) is a valuable fuel, constituting 70–95% of natural gas, and a potent greenhouse gas. Release of CH4 into the atmosphere contributes to climate change. Biological CH4 production or methanogenesis is mostly performed by methanogens, a group of strictly anaerobic archaea. The direct substrates for methanogenesis are H2 plus CO2, acetate, formate, methylamines, methanol, methyl sulfides, and ethanol or a secondary alcohol plus CO2. In numerous anaerobic niches in nature, methanogenesis facilitates mineralization of complex biopolymers such as carbohydrates, lipids and proteins generated by primary producers. Thus, methanogens are critical players in the global carbon cycle. The same process is used in anaerobic treatment of municipal, industrial and agricultural wastes, reducing the biological pollutants in the wastes and generating methane. It also holds potential for commercial production of natural gas from renewable resources. This process operates in digestive systems of many animals, including cattle and humans. In contrast, in deep-sea hydrothermal vents methanogenesis is a primary production process, allowing chemosynthesis of biomaterials from H2 plus CO2. In this report we present Gene Ontology (GO) terms that can be used to describe processes, functions and cellular components involved in methanogenic biodegradation and biosynthesis of specialized coenzymes that methanogens use. Some of these GO terms were previously available and the rest were generated in our Microbial Energy Gene Ontology (MENGO) project. A recently discovered non-canonical CH4 production process is also described. We have performed manual GO annotation of selected methanogenesis genes, based on experimental evidence, providing "gold standards" for machine annotation and automated discovery of methanogenesis genes or systems in diverse genomes. Most of the GO-related information presented in this report is available at the MENGO website (http://www.mengo.biochem.vt.edu/).

**Keywords: Gene Ontology, biomass, biodegradation, methanogenesis, methanogen, bioenergy, carbon cycle, waste treatment**

**Abbreviations:** CODH/ACDS, acetyl-CoA decarbonylase/synthase-carbon monoxide dehydrogenase complex; F420, coenzyme F420 or 7,8-didemethyl-8-hydroxy-5-deazaflavin derivative; F430, coenzyme F430 - a tetrapyrrole; GO:MENGO-UR, GO terms generated in the MENGO project, submitted to the GO consortium and awaiting acceptance; HS-CoM, coenzyme M; HS-CoB or HS-HTP, coenzyme B; H4MPT, tetrahydromethanopterin; H4SPT, tetrahydrosarcinapterin; MF, methanofuran.

#### **INTRODUCTION**

Methane (CH4), the simplest aliphatic hydrocarbon, is a valuable fuel. It constitutes 70–95% (volume/volume) of natural gas (Strapoc et al., 2011). The biological production of methane, which occurs under strictly anaerobic conditions, is critical to the operation of the global carbon cycle, nutrient recovery in the digestive systems of numerous animals, and treatment of municipal and industrial wastes, and it could potentially allow commercial production of methane from renewable resources (Zinder, 1993; Thauer et al., 2008; McInerney et al., 2009). The methane present in geological deposits such as oil and gas reservoirs and coal beds also originated in part from microbial degradation of biomass, and the rest of it was derived from thermal maturation of the remnants from biodegradation (Strapoc et al., 2011). Each of these cases involves anaerobic degradation of biopolymers such as carbohydrates and proteins, as well as lipids, and this process is composed of two broad steps (**Figure 1**): first, generation of substrates for methanogens through a combination of hydrolysis and fermentation; second, methanogenesis or methane production. Methanogenesis is also one of the most ancient respiratory processes on Earth, developing 2.7–3.2 billion years ago, and by virtue of the processes described above it continues to be an important process on the present-day Earth (Leigh, 2002). Furthermore, biological methanogenesis is a significant contributor to climate change because methane, together with water vapor, carbon dioxide and ozone, contributes to the greenhouse effect (Strapoc et al., 2011). According to the United States Environmental Protection Agency (US EPA), "Pound for pound, the comparative impact of CH4 on climate change is over 20 times greater than CO2 over a 100-year period" (http://epa.gov/climatechange/ghgemissions/gases/ch4.html).

For the ecological, evolutionary, and applied interests discussed in the preceding paragraph, methanogens have been investigated intensely over the past six decades (Wolfe, 1991; Thauer, 1998, 2012). This research has resulted in a detailed understanding of the biochemistry of these archaea, especially their unique energy metabolism, methanogenesis, and the mechanistic details of their interactions with other microorganisms in numerous ecological niches. For the same reasons, genomes of methanogens have been analyzed from the early days of genome sequencing. In fact, *Methanocaldococcus jannaschii*, a methanogen, was the first archaeon and third organism to be targeted for complete genome sequence determination (Bult et al., 1996). Since then the genome sequences of more than 170 methanogens have appeared in public databases. These genomes have not only helped to advance the research on methanogens, but have also catalyzed major shifts in our understanding of the relationships of these organisms with the rest of the biological world. It is now known that many of the biological parts and processes that were once thought to be specific to methanogenic archaea are major contributors to the metabolism of numerous non-methanogenic organisms from all domains of life (Takao et al., 1989; Batschauer, 1993; Purwantini et al., 1997; Graham and White, 2002; Chistoserdova et al., 2004; Krishnakumar et al., 2008). Often such discoveries have been based on the detection of methanogen genes in non-methanogen genomes, followed by biochemical analysis of their molecular functions and knowledge-based deductions of their roles in the metabolic pathways of those organisms. In this context, a rich set of GO terms fully describing methanogenesis, together with manually generated gene annotations based on experimental evidence (gold standards), could be a great asset: through automated analysis, it would provide expanded qualifications of the methanogen genes in a non-methanogen genome, such as predicted functions and cellular locations of the gene products. This resource would then allow facile mining of useful parts of methanogenesis systems from both methanogens and non-methanogenic organisms. The Gene Ontology (GO) consists of three sets of terms for describing gene products in terms of biological processes (GO:0008150), cellular components (GO:0005575), and molecular functions (GO:0003674) (Ashburner et al., 2000). These terms are related to each other in a semi-hierarchical fashion (a directed graph structure), from very broad terms (at the top of the hierarchy) to specific ones (at the bottom). GO annotation can thus provide both specific and broader attributes to gene products. This is the primary motivation for the work described in this report.
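The way a specific GO annotation also implies broader attributes can be sketched with a small graph walk. The fragment below is a hand-written toy, not data loaded from the GO database: it uses term IDs that appear later in this report, and the parent links are a simplification assumed for illustration (the real ontology contains additional intermediate terms and relation types). The names `PARENTS`, `ancestors` and `annotated_with` are hypothetical.

```python
from collections import deque

# Toy child -> parents links. ASSUMPTION: simplified by hand for illustration;
# the real ontology has more intermediate terms between these levels.
PARENTS = {
    "GO:1990486": ["GO:0009062"],  # anaerobic fatty acid catabolic process
    "GO:0009062": ["GO:0016042"],  # fatty acid catabolic process
    "GO:0016042": ["GO:0008150"],  # lipid catabolic process
    "GO:0008150": [],              # biological process (root of its ontology)
}


def ancestors(term, parents=PARENTS):
    """Return every broader term reachable from `term`, nearest first."""
    seen = []
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(parents.get(current, []))
    return seen


def annotated_with(term, query, parents=PARENTS):
    """True if an annotation to `term` also implies the broader `query` term."""
    return term == query or query in ancestors(term, parents)


if __name__ == "__main__":
    print(ancestors("GO:1990486"))
    print(annotated_with("GO:1990486", "GO:0016042"))
```

In this sketch, a gene product annotated only with the most specific term is still found by a search for the broader lipid catabolic process, which is the property that makes automated mining across genomes practical.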

#### **GENE ONTOLOGY (GO) DESCRIBING METHANOGENESIS**

The promise cited above has inspired the work on the methanogenesis component of our MENGO (Microbial Energy production Gene Ontology) project. The goal of the MENGO project is to develop a set of GO terms for describing gene products involved in energy-related microbial processes. A major focus is on microbial biomass degradation for the production of biofuels (fuels from renewable resources) such as methane, alcohols, fatty acid esters, hydrocarbons, and hydrogen. To date we have generated 667 terms, and these are available at the GO website (AMIGO: http://tinyurl.com/kh7fqne) as well as at our MENGO website (http://www.mengo.biochem.vt.edu/). Of these, 563 terms are in the Biological Process ontology, 88 in the Molecular Function ontology and 16 in the Cellular Component ontology. More terms are still under review (GO:MENGO-UR), and when the respective GO IDs are assigned we will post them at the MENGO website (http://www.mengo.biochem.vt.edu/). We generated these terms in two ways: (1) our own effort, which involved a review of the relevant literature and creation of terms as the needs were identified; and (2) community input, where MENGO terms were generated following suggestions from members of the bioenergy research community who attended the MENGO workshops that we organized at the following locations: Great Lakes Bioenergy Research Center, University of Wisconsin, Madison, WI (2011); Annual User Meeting of the US Department of Energy Joint Genome Institute, Walnut Creek, CA (2011 and 2012); US Department of Energy's Genomic Science Awardee Meeting IX and X (Crystal City, VA, and Bethesda, MD, respectively) (2011 and 2012).

In this report we present GO terms suitable for describing processes, functions and cellular components involved in methanogenic biodegradation of biomass, including methanogenesis, in the context of both natural and engineered processes. We begin this description with a brief review of the relevant systems. More detailed information, especially the mechanistic details of methanogenesis, is available in several reviews, including some cited here (Wolfe, 1993; Zinder, 1993; Ferry, 1999; Deppenmeier, 2002; Liu and Whitman, 2008; Thauer et al., 2008; Thauer, 2012; Costa and Leigh, 2014; Welte and Deppenmeier, 2014). Furthermore, to remain focused on bioconversion or catabolism, the general cellular biosynthesis processes are not covered in this report; the exceptions are the syntheses of coenzymes that were once thought to be unique to methanogens (Wolfe, 1991, 1992; Graham and White, 2002), some of which were later found to occur in bacteria (Purwantini et al., 1997; Chistoserdova et al., 2004; Krishnakumar et al., 2008). Recently, a non-canonical route that contributes significantly to global biological production of methane has been described (Metcalf et al., 2012), and we describe this system briefly. In numerous environments, complete anaerobic biodegradation of biomass can occur without the formation of methane; here, processes such as sulfate reduction and acetogenesis provide avenues for the disposal of reductants (Isa et al., 1986; Gibson et al., 1988; Widdel, 1988; Zinder, 1993; Breznak, 1994; De Graeve et al., 1994; Raskin et al., 1996; Muller, 2003). Those processes will not be covered here.

The work on the GO for methanogenesis began with a review of the GO database. This showed that, although the GO terms describing many of the biological processes and molecular functions associated with methane biosynthesis were available, the coverage of this area was incomplete. To fill this gap we generated an additional 110 GO terms for methanogenesis. A comprehensive source of this information is on our website (http://www.mengo.biochem.vt.edu/) where the data are available under two menus: MENGO (All MENGO Terms; Process Specific; Ontology Specific; New MENGO Terms) and PATHWAYS (Natural Pathways; Synthetic/Engineered Pathways). Under the MENGO menu a form (Submit MENGO Term) is available for the submission of new terms that will help to describe gene products involved in methanogenesis in a comprehensive manner and to validate the resource through research community input.

#### **METHANOGENIC DEGRADATION OF BIOMASS**

As mentioned in the Introduction, this process is composed of two broad steps, anaerobic biodegradation of biomass generating substrates for methanogens, and methanogenesis (**Figure 1**). The narrative appearing below covers both natural and engineered systems.

### **ANAEROBIC BIODEGRADATION OF BIOMASS**

*Natural systems*

*Anaerobic biodegradation of biomass in sediments.* Annually, plants and photosynthetic microorganisms fix 70 billion tons of carbon into biomass made up of complex biopolymers, such as cellulose, hemicellulose, lignin, proteins and lipids (Thauer et al., 2008). About 1% of this material is mineralized in various anaerobic niches of nature through a process that yields methane and carbon dioxide as end products (**Figure 2**). The combination of photosynthesis (GO:0015979) and macromolecule catabolism (GO:0009057) constitutes the biological component of the biogeochemical process of carbon cycling.

Cellulose is a polymer of D-glucose units connected by β(1→4) bonds. The anaerobic mineralization of cellulose (synonym of "cellulose catabolic process, anaerobic," GO:1990488) starts with hydrolysis of the β(1→4) bonds by cellulases (GO:0008810) produced by anaerobic cellulolytic bacteria and fungi (Adney et al., 1991; Teunissen and Op Den Camp, 1993; Leschine, 1995; Li et al., 1997; Schwarz, 2001; Ljungdahl, 2008; Ransom-Jones et al., 2012). These organisms either secrete the cellulases or carry these enzymes on their cell surfaces (Teunissen and Op Den Camp, 1993; Li et al., 1997). A recent study shows that excreted enzymes with multiple catalytic sites and multiple cellulose-binding modules provide *Caldicellulosiruptor bescii*, an anaerobic thermophile, with a high cellulose-degrading activity (Brunecky et al., 2013). The cellulose catabolic process (GO:0030245) involves the actions of endo-β-1,4-glucanases (GO:0052859) and exo-1,4-β-glucanases or cellobiohydrolases (CBH) (reducing-end-specific, GO:0033945; non-reducing-end-specific, GO:0016162) that generate cellobiose, with intermediate formation of fragments with multiple glucose units (Akin, 1980; Beguin and Aubert, 1994; Bayer et al., 1998; Perez et al., 2002; Hilden and Johansson, 2004), and hydrolysis of cellobiose to glucose (cellobiose catabolic process, GO:2000892) by β-glucosidase (GO:0008422).
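The sequence of enzyme activities described above can be written as a small lookup structure, which is roughly how a pathway page or an annotation pipeline might hold it. This is only an illustrative sketch: the GO IDs are those quoted in the text, whereas the intermediate name `cello-oligosaccharides` (standing for the fragments with multiple glucose units), the table layout and the function name `go_terms_for_route` are assumptions made here.

```python
# Each step: (substrate, product, enzyme activity, GO ID quoted in the text).
# ASSUMPTION: "cello-oligosaccharides" is a label chosen here for the
# intermediate fragments with multiple glucose units.
CELLULOSE_STEPS = [
    ("cellulose", "cello-oligosaccharides",
     "endo-beta-1,4-glucanase", "GO:0052859"),
    ("cello-oligosaccharides", "cellobiose",
     "cellobiohydrolase, reducing-end-specific", "GO:0033945"),
    ("cello-oligosaccharides", "cellobiose",
     "cellobiohydrolase, non-reducing-end-specific", "GO:0016162"),
    ("cellobiose", "glucose", "beta-glucosidase", "GO:0008422"),
]


def go_terms_for_route(start, end, steps=CELLULOSE_STEPS):
    """Walk from `start` to `end`, listing the GO IDs usable at each stage.

    Assumes every step leaving a given substrate yields the same product,
    which holds for this toy table.
    """
    route = []
    current = start
    while current != end:
        options = [s for s in steps if s[0] == current]
        if not options:
            raise ValueError(f"no step leads onward from {current!r}")
        product = options[0][1]
        route.append((current, product, sorted(s[3] for s in options)))
        current = product
    return route


if __name__ == "__main__":
    for substrate, product, go_ids in go_terms_for_route("cellulose", "glucose"):
        print(f"{substrate} -> {product}: {', '.join(go_ids)}")
```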

The cellulose degrading anaerobic microorganisms and other non-cellulolytic anaerobes with access to the products generated by cellulolytic microbes take up and ferment D-glucose to acetate, alcohols, lactate and fatty acids (e.g., propionate, butyrate) via respective biosynthetic processes (**Figure 3**) (Zinder, 1993; Schink, 1997; Ahring, 2003; McInerney et al., 2009). The butyrate biosynthetic process (GO:0046358) involves an intermediate formation of acetyl-CoA (acetyl-CoA biosynthetic process, GO:0006085) whereas for propionate biosynthesis (GO:0019542) succinate generated via the tricarboxylic acid metabolic process or TCA cycle (GO:0072350) serves as the direct precursor (Zinder, 1993; Schink, 1997; Ahring, 2003; McInerney et al., 2009).

Butyrate and propionate, which are called short-chain fatty acids (SCFAs), are further fermented to acetate, hydrogen and CO2 (fatty acid catabolic process: GO:0009062; child term, anaerobic fatty acid catabolic process, GO:1990486) via their respective catabolic processes (butyrate catabolic process, GO:0046359; propionate catabolic process, GO:0019543) (Zinder, 1993; Schink, 1997; Ahring, 2003; McInerney et al., 2009); *Syntrophobacter*, *Syntrophomonas*, *Syntrophus*, *Smithella*, and *Pelotomaculum* species are some of the bacteria that ferment these SCFAs. Ethanol and lactate are also fermented to acetate, hydrogen and CO2 (ethanol catabolic process, GO:0006068; anaerobic lactic acid catabolic process, GO:1990485). The hydrogen biosynthetic process (GO:1902422) is a key element of these fermentation processes and of those described in the preceding paragraph, for the following reason. Several steps of fermentation lead to the reduction of electron carriers such as NAD+ and ferredoxin, producing NADH and reduced ferredoxin. For the fermentation process to continue, NAD+ and ferredoxin must be regenerated, and often the only available route to meet this requirement is the reduction of protons, yielding molecular hydrogen (H2) (GO:1902422) (Zinder, 1993; Schink, 1997; Ahring, 2003; McInerney et al., 2009).

Degradation of hemicellulose follows a path similar to that described for cellulose (summarized in **Figure 3**). The term hemicellulose includes xylan (polymer of xylose), glucuronoxylan (polymer of D-glucuronate and xylose), arabinoxylan (polymer of arabinose and xylose), glucomannan (polymer of glucose and mannose), galactomannan (polymer of galactose and mannose), and xyloglucan (polymer of xylose, glucose and galactose) (Akin, 1980; Perez et al., 2002). These are degraded via specific hemicellulose catabolic processes (GO:2000895) to their respective monomers (Akin, 1980; Perez et al., 2002). The fermentation of monomers yields acetate, hydrogen and CO2 (Wolin and Miller, 1983; Schink, 1997).

Lignin degradation in anaerobic environments (anaerobic lignin catabolic process, GO:1990487) is not well studied and is considered rare to impossible (Akin, 1980; Harwood and Gibson, 1988; Perez et al., 2002; Fuchs, 2008); the broader lignin catabolic process is generally considered an aerobic process (Perez et al., 2002). However, following the degradation of lignin by aerobic microorganisms such as fungi, a variety of aromatic compounds (catechol, benzoate, p-hydroxybenzoate, vanillate, ferulate, syringate, p-hydroxycinnamate, and 3-methoxy-4-hydroxyphenylpyruvate) (Kaiser and Hanselmann, 1982a,b) become available in anaerobic environments. Fermentation of these aromatic compounds by anaerobic bacteria leads to acetate, CO2 and hydrogen (Harwood and Parales, 1996; Fuchs, 2008). Anaerobic degradation of benzoate, one of the lignin monomers, has been studied in detail, and this catabolic process (benzoate catabolic process via CoA ligation, GO:0010128) yields acetate and CO2 (**Figure 3**). The metabolism of several other lignin monomers by anaerobes has also been investigated (Harwood and Parales, 1996; Fuchs, 2008), and some of the relevant information for vanillin, ferulate and catechol is summarized in **Figure 3**.

Anaerobic lipid catabolic processes also lead to acetate and hydrogen (McInerney, 1988; Zinder, 1993; Schink, 1997). The process begins with the hydrolysis of lipids (lipase activity, GO:0016298); the broader lipid catabolic process is represented by GO:0016042. The glycerol released by hydrolysis enters the glycolysis pathway, generating acetate and hydrogen (Zinder, 1993) (**Figure 3**). The fatty acid units are degraded via the β-oxidation pathway (fatty acid beta-oxidation, GO:0006635) to acetate, and the excess reducing equivalents are released as hydrogen (**Figure 3**). In the case of proteins, the amino acids released by the action of proteases or peptidases (peptidase activity, GO:0008233; endopeptidase activity, GO:0004175; exopeptidase activity, GO:0008238) are deaminated oxidatively, releasing ammonia and hydrogen, and the resulting ketoacids are then fermented to acetate and hydrogen (McInerney, 1988; Zinder, 1993; Schink, 1997) (**Figure 3**).

In each of the above cases, as H2 accumulates the oxidation of reduced electron carriers becomes thermodynamically unfavorable, and consequently the fermentation process slows down or even halts (McInerney, 1988; Zinder, 1993; Schink, 1997). Methanogens consume hydrogen and reduce CO2 to methane, thus relieving the block on fermentation (McInerney, 1988; Zinder, 1993; Schink, 1997). These archaea also convert acetate to methane and CO2, and this action also improves the thermodynamics of biodegradation (Zinder, 1993). As CH4 moves to aerobic zones, such as the surface of water overlying sediments, methanotrophic bacteria oxidize this hydrocarbon to CO2 (methane catabolic process, GO:0046188) (Kiene, 1991; Zinder, 1993; Conrad, 2007, 2009). More recent work shows that a significant amount of methane is oxidized anaerobically, and the microbial basis and mechanistic details of this process are beginning to emerge (Conrad, 2009; Thauer, 2011; Milucka et al., 2012; Shima et al., 2012; Haroon et al., 2013; Offre et al., 2013). Hence, by removing the hydrogen-induced thermodynamic block and converting acetate to methane, methanogens facilitate the complete degradation of the biopolymers discussed above.
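The hydrogen-induced thermodynamic block can be illustrated numerically with the standard relation ΔG = ΔG°' + RT ln Q. The sketch below makes several assumptions that are not taken from this report: only the H2 partial pressure departs from standard conditions, the temperature is 25°C, and butyrate oxidation to acetate (releasing 2 H2) is given an approximate standard Gibbs energy of about +48 kJ per mol, a textbook-style value used here purely as an illustrative input. The function names are hypothetical.

```python
import math

R_KJ = 8.314e-3  # gas constant in kJ mol^-1 K^-1
T_K = 298.15     # ASSUMPTION: 25 degrees C

# ASSUMPTION: approximate standard Gibbs energy (kJ/mol, pH 7) for butyrate
# oxidation to acetate with release of 2 H2; an illustrative input only.
BUTYRATE_DG0_PRIME = 48.0
BUTYRATE_N_H2 = 2


def delta_g(delta_g0_prime, n_h2, p_h2_atm):
    """Gibbs energy (kJ/mol) when only the H2 partial pressure is non-standard."""
    return delta_g0_prime + n_h2 * R_KJ * T_K * math.log(p_h2_atm)


def h2_threshold(delta_g0_prime, n_h2):
    """H2 partial pressure (atm) at which the reaction is at equilibrium."""
    return math.exp(-delta_g0_prime / (n_h2 * R_KJ * T_K))


if __name__ == "__main__":
    for p_h2 in (1.0, 1e-3, 1e-5):
        dg = delta_g(BUTYRATE_DG0_PRIME, BUTYRATE_N_H2, p_h2)
        print(f"p(H2) = {p_h2:g} atm -> delta G = {dg:+.1f} kJ/mol")
    print(f"threshold p(H2) = {h2_threshold(BUTYRATE_DG0_PRIME, BUTYRATE_N_H2):.1e} atm")
```

Under these assumptions the reaction is endergonic at high H2 pressure and becomes exergonic only once hydrogen is drawn down to very low levels, which is the service that hydrogen-consuming methanogens provide to the fermenters. Real systems also depend on the concentrations of acetate and the fatty acid, which this sketch holds at standard values.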

In marine anaerobic sediments rich in sulfates, some of the products of biomass degradation also lead to methane production. In general, however, hydrogen and acetate are not available to the methanogens in this environment because, in the presence of sulfate, sulfate-reducing bacteria readily use these materials to reduce sulfate to hydrogen sulfide (dissimilatory sulfate reduction, GO:0019420), and the growth rates and affinities for H2 of the sulfate-reducing bacteria are much higher than those of the methanogens (Widdel, 1988). Nevertheless, several hydrogen-consuming methanogens belonging to the class Methanococci have been isolated from marine environments (Whitman et al., 1986). It has been speculated that these organisms may depend primarily on formate, which could arise from the catabolism of oxalate (GO:0033611) derived from plant materials (Allison et al., 1985); most methanococci are capable of consuming both hydrogen and formate (Boone et al., 1993).

A significant amount of methane is also produced from methylamines, methylsulfides and carbon monoxide (Zinder, 1993; Thauer, 1998; Deppenmeier et al., 1999; Ferry, 1999, 2011; Liu and Whitman, 2008; Thauer et al., 2008). The sources of methylamines are betaine and choline (GO:0006579 and GO:0042426, respectively), while methylsulfides are generated from sulfur-containing compounds such as methionine and dimethylpropiothetin (GO:0009087 and GO:0047869, respectively; **Figure 3**) (Boone et al., 1993). In certain marine environments, carbon monoxide produced by kelp algae provides both reductant and carbon for methanogenesis (methane biosynthetic process from carbon monoxide, GO:2001134) (Rother and Metcalf, 2004; Lessner et al., 2006).

*Anaerobic biodegradation of biomass in animal intestines.* Foregut-fermenting animals such as the ruminants (cattle, sheep, goats), as well as hindgut fermenters such as humans, termites, and horses, employ variations of the overall process shown in **Figure 2** for deriving nutrients from feed or food (Wolin, 1981; Wolin and Miller, 1983; Zinder, 1993; Miller and Wolin, 1996; Weimer, 1998; Hook et al., 2010; Sahakian et al., 2010). In cattle and many other foregut fermenters, the rumen serves as the first site for the degradation of forage (Wolin, 1981; Weimer, 1998; Hook et al., 2010). The residence time for the feed in the rumen is rather short (5.6 h for the fluid and 35 h for particulates in the rumen; compared to about 4.5 months even for nitrate, a soluble compound, in freshwater sediment) (Hristov et al., 2003), which is not conducive to significant growth and activity of slow-growing fatty acid-fermenting bacteria and acetoclastic methanogens (Zinder, 1993). Thus, in this digestive chamber the fatty acids and acetate are not converted to methane, but rather are absorbed by the animal for nutrition (Zinder, 1993). The hydrogen and formate produced during the fermentation are converted to methane by methanogenic archaea. All plant material contains pectin, a methylated carbohydrate, and leaves, shoots and fruit are particularly rich in it. Anaerobic degradation of pectin (anaerobic pectin catabolic process, GO:1990489) serves as an important source of methanol in anaerobic environments (Schink et al., 1981). Thus, ruminants could carry methanogens in their rumens capable of utilizing methanol for methanogenesis and in some cases this has been shown to be true (Mukhopadhyay et al., 1992; Zinder, 1993).

In the hindgut of humans, i.e., the large intestine, the undigested material delivered from the small intestine is fermented, generating fatty acids, some hydrogen, and formate, and the latter two are converted to methane (Wolin, 1981; Zinder, 1993; Miller and Wolin, 1996; Sahakian et al., 2010; Flint, 2011). The process is beneficial to the host as it provides the fatty acids as additional nutrients. However, an uncontrolled production of fatty acids in this hindgut activity has been identified as one of the possible causes of obesity (Schmitz and Langmann, 2006; Nakamura et al., 2010; Sahakian et al., 2010; Flint, 2011).

In certain foregut fermenters such as kangaroos and wallabies and hindgut fermenters such as termites, the removal of hydrogen during biodegradation of complex polymers occurs through acetate formation and not methanogenesis (Brune and Friedrich, 2000; Gagen et al., 2010; Klieve et al., 2012).

#### *Anaerobic biomass degradation in engineered systems: waste treatment and methane production from renewable resources*

Aerobic treatment of municipal and industrial wastes via methods such as activated sludge requires energy input for supplying oxygen (Switzenbaum, 1983; Zinder, 1993). The process also generates a significant amount of microbial biomass (Zinder, 1993), which cannot be discharged to waterways (Zinder, 1993; Paul and Debellefontaine, 2007). In contrast, anaerobic methods not only require much less energy input and produce very little microbial biomass, but also conserve most of the energy present in the waste materials in the form of methane, which can be recovered as fuel (Zinder, 1993; Gao et al., 2014). Anaerobic waste treatment and the production of methane as biofuel from renewable resources follow the basic biological process (macromolecules catabolic process, GO:0009057) that has been described above for methanogenic biodegradation of biomass in sediments (Zinder, 1993). The mixture of methane and carbon dioxide that is produced in all these cases is known as biogas (Ducom et al., 2009). The biogas obtained from waste treatment facilities as well as from bio-digesters processing high sulfur feedstock contains a substantial amount of hydrogen sulfide and nitrogen oxides (Zinder, 1993; Fdz-Polanco et al., 2001; Janssen et al., 2001; Ducom et al., 2009; Diaz and Fdz-Polanco, 2012). Several methods for the removal of these unwanted compounds have been developed and research for developing even better separation methods continues (Fdz-Polanco et al., 2001; Janssen et al., 2001; Ducom et al., 2009; Diaz and Fdz-Polanco, 2012).

### **METHANOGENESIS**

The pathways for methanogenesis or methane biosynthetic process (GO: 0015948) from various substrates and the respective molecular functions are shown in **Figure 4**. Here, the steps leading from CO2 to CH4 form the core, which is used in part or its entirety with other substrates as well (Wolfe, 1991, 1993; Ferry, 1993, 1999, 2011; Thauer et al., 1993, 2008; Deppenmeier et al., 1999; Deppenmeier, 2002; Deppenmeier and Muller, 2008; Liu and Whitman, 2008). These pathways utilize several unusual coenzymes of which methanofuran (MF), tetrahydromethanopterin (H4MPT), tetrahydrosarcinapterin (H4SPT), and coenzyme M (or HS-CoM) carry the carbon moiety destined to generate methane, while coenzyme F420 (a deazaflavin derivative), coenzyme B (HS-CoB or HS-HTP), methanophenazine, and coenzyme F430 (a tetrapyrrole) transfer electrons that are used in carbon reduction (Wolfe, 1991, 1993; Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008). Many unique enzymes and unusual mechanisms are also involved (Wolfe, 1991, 1993; Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008). In the following narrative the term H4MPT represents both H4MPT and H4SPT, which serve the same function in different organisms.

### *Methanogenesis from H2 plus CO2*

This process (GO:0019386) utilizes hydrogen as the primary source of electrons or reductant (Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008). It can operate with or without the involvement of cytochromes (Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008). The cytochrome-independent mode is utilized by methanogens that lack cytochromes (Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008) and is considered one of the most ancient respiratory metabolisms on earth (Leigh, 2002). We describe the process starting with the carbon transfer and reduction steps, followed by the energy production avenues.

*Carbon transfer and reduction.* It is believed that carbon dioxide (CO2) is captured by methanofuran (MF) to form an unstable compound called carboxy-MF (Thauer et al., 1993) which is reduced by formyl-MF dehydrogenase (GO:0018493) in an energy-dependent (endergonic) manner to formyl-MF with a low-potential ferredoxin (Fd) serving as electron carrier (Thauer et al., 2008). Formyl-MF dehydrogenase exists in two forms, one of which contains molybdenum (Fmd) and the other tungsten (Fwd) (Thauer et al., 1993); molybdenum and tungsten are found to be bound to a molybdopterin and growth conditions dictate which metal will be incorporated (Hochheimer et al., 1995). At the next step the formyl group is transferred to H4MPT by a transferase enzyme (Ftr, GO:0030270) to form formyl-H4MPT (Donnelly and Wolfe, 1986; Breitung and Thauer, 1990; Thauer et al., 1993). From this stage H4MPT carries four forms of the fixed carbon representing three oxidation states (Wolfe, 1991, 1993; Thauer et al., 1993, 2008; Ferry, 1999; Deppenmeier and Muller, 2008). First formyl-H4MPT is dehydrated by methenyl-H4MPT cyclohydrolase (Mch, GO:0018759) to form methenyl-H4MPT (Donnelly et al., 1985; Dimarco et al., 1986; Mukhopadhyay and Daniels, 1989; Klein et al., 1993), which in turn is reduced to methylene-H4MPT by the action of one of the two enzymes, F420-dependent methylene-H4MPT dehydrogenase (Mtd, GO:0030268) and a Fe-containing hydrogenase (Hmd, GO:0047068) (Hartzell et al., 1985; Mukhopadhyay and Daniels, 1989; Von Bunau et al., 1991; Schworer et al., 1993; Thauer et al., 1993; Mukhopadhyay et al., 1995). Mtd utilizes reduced F420 (F420H2) as reductant whereas Hmd retrieves electrons from molecular hydrogen (H2) (Hartzell et al., 1985; Mukhopadhyay and Daniels, 1989; Von Bunau et al., 1991; Schworer et al., 1993; Thauer et al., 1993; Mukhopadhyay et al., 1995). 
Methanogens with Hmd also carry paralogs of this protein (HmdII and HmdIII), but these proteins do not reduce methylene-H4MPT (Lie et al., 2013). Two roles of HmdII and HmdIII have been proposed: (a) guiding the maturation of Hmd and (b) linking energy production and protein synthesis (Oza et al., 2012; Lie et al., 2013). Methylene-H4MPT is reduced with F420H2 by the action of F420-dependent methylene-H4MPT reductase (Mer, GO:0018537), providing the last H4MPT derivative on the pathway, methyl-H4MPT (Ma and Thauer, 1990; Te Brommelstroet et al., 1990; Ma et al., 1991; Thauer et al., 2008). The transfer of the methyl group from methyl-H4MPT to coenzyme M is catalyzed by a membrane-bound sodium ion (Na+)-pumping enzyme complex called methyl-H4MPT:coenzyme M methyl transferase (Mtr, GO:0044677) (Becher et al., 1992; Kengen et al., 1992; Gartner et al., 1993). This complex not only yields methyl-coenzyme M (CH3-CoM), but also generates a Na+-gradient that is used for energy production (see below) (Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008). The next step in the sequence yields methane. This last carbon-reduction reaction is catalyzed by CH3-CoM reductase (GO:0044674) with coenzyme B (HS-CoB or HS-HTP) serving as an electron source, resulting in a heterodisulfide, CoM-S-S-CoB, as product in addition to methane (Wolfe, 1991, 1992; Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008). The heterodisulfide is reduced by a reductase (Hdr, GO:0051912) to regenerate HS-CoM and HS-CoB (Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008). Hydrogen-oxidizing methanogens often carry two CH3-CoM reductase isozymes (McrI and McrII) (Rospert et al., 1990), one of which is effective under high hydrogen availability and the other under low hydrogen conditions (Rospert et al., 1990).

*Energy conservation.* First, we describe the details for methanogens lacking cytochromes. The first site of energy conservation is the Mtr reaction (Ferry, 1999; Deppenmeier and Muller, 2008; Thauer et al., 2008). The Na+-gradient generated at this step is directly used for the production of ATP by a membrane-bound AoA1-ATP synthase (GO:1990490) (Deppenmeier and Muller, 2008). Under certain conditions this gradient assists two membrane-associated and energy-converting hydrogenase complexes, EhaA-T and EhbA-Q, to generate reduced Fd with the ability to deliver low redox potential electrons (Thauer et al., 2008; Costa et al., 2010; Lie et al., 2012). The reduced Fd molecules generated by EhaA-T are used for the endergonic formyl-MF dehydrogenase reaction that yields formyl-MF, and those provided by EhbA-Q are used for cellular biosynthesis (Porat et al., 2006; Thauer et al., 2008; Costa et al., 2010; Major et al., 2010; Kaster et al., 2011; Lie et al., 2012). The next energy yielding step is the reduction of CH3-CoM and it is not clear whether the methanogens conserve this energy or the energy is released to strongly favor the forward reaction toward methane formation (Thauer et al., 2008). The reduction of CoM-S-S-CoB involves rather complex electron transfer mechanisms and also is a major site for energy conservation (Thauer et al., 2008; Costa and Leigh, 2014).

In certain methanogens without cytochromes, the reduction of CoM-S-S-CoB and formyl-MF generation is coupled via a novel mechanism called bifurcation (Thauer et al., 2008; Costa et al., 2010; Kaster et al., 2011; Lie et al., 2012) (**Figure 5**). Here, the Vhu hydrogenase retrieves electrons from hydrogen and transfers those to soluble heterodisulfide reductase (Hdr). Hdr utilizes these electrons for two purposes (**Figure 5**): (i) converting CoM-S-S-CoB to HS-CoM and HS-CoB, which requires a relatively lower investment of energy; and (ii) reducing a low potential ferredoxin, which is energetically suitable for the highly energy intensive reduction of CO2 and generation of formyl-MF (Thauer et al., 2008; Costa et al., 2010; Kaster et al., 2011; Lie et al., 2012). This novel mechanism, where a single input (electrons of moderately low potential) is used to generate two outputs (two pools of electrons, with potentials that are higher and much lower than the input) is called electron bifurcation (GO:MENGO-UR). It is a major factor in energy conservation in methanogens as it helps to perform a highly endergonic reaction, such as the generation of formyl-MF, without an investment of ATP or an ion gradient (Thauer et al., 2008; Costa et al., 2010; Kaster et al., 2011; Lie et al., 2012). It seems to be an important tool for energy poor anaerobes (Thauer et al., 2008; Kaster et al., 2011). When withdrawal of intermediates from the methanogenesis pathway for biosynthesis causes a drop in CoM-S-S-CoB levels, the bifurcation process is rendered less efficient (Lie et al., 2012); in that case, as described in the preceding paragraph, an ion-driven hydrogenase system (EhaA-T) is employed for the generation of formyl-MF and this could be considered to be a type of anaplerosis (GO:MENGO-UR) (Lie et al., 2012).

Methanogens with cytochromes do not employ the bifurcation mechanism. Instead, a membrane-bound complex composed of a cytochrome-containing heterodisulfide reductase (HdrDE) (GO:0044678) and a hydrogenase (VhoECG) where VhoC is a b-type cytochrome is utilized (Thauer et al., 2008). The electrons derived from hydrogen by VhoECG are utilized by HdrDE for reduction of CoM-S-S-CoB (Thauer et al., 2008). The overall process is exergonic and thus, in addition to reducing CoM-S-S-CoB, the VhoECG-Hdr complex utilizes excess energy to extrude protons out of the cell. The low potential reduced ferredoxin, which is needed for the generation of formyl-MF, is provided by a proton-gradient-driven membrane-bound energy-converting hydrogenase complex (EchA-F).

As mentioned above, several methanogenesis enzymes form large protein complexes (GO:0043234) and some of these are membrane bound (GO:0019898) and include specialized non-enzyme units such as ion pumps and lipid soluble small compounds. One example is the soluble heterodisulfide reductase complex of methanogens that lack cytochromes, which can be described using a molecular function term, GO:0051912 ("CoB-CoM heterodisulfide reductase activity"), and two cellular component terms, GO:0044678 ("CoB-CoM heterodisulfide reductase complex") and GO:0043234 ("protein complex"). In the case of cytochrome-containing methanogens, one additional cellular component GO term is available for a full description, namely GO:0019898 ("extrinsic component of membrane").
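The way several GO terms combine to describe one complex can be sketched as a small data structure. The example below is only an illustration: the class name `GoTerm`, the dictionary `ANNOTATIONS`, the descriptive labels and the helper `terms_by_aspect` are hypothetical and do not correspond to any GO tool or file format; the GO identifiers and term names are the ones quoted in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoTerm:
    """One GO term: identifier, name, and ontology aspect (hypothetical container)."""
    go_id: str
    name: str
    aspect: str  # "molecular_function" or "cellular_component"


# Terms quoted in the text above.
HDR_ACTIVITY = GoTerm("GO:0051912", "CoB-CoM heterodisulfide reductase activity",
                      "molecular_function")
HDR_COMPLEX = GoTerm("GO:0044678", "CoB-CoM heterodisulfide reductase complex",
                     "cellular_component")
PROTEIN_COMPLEX = GoTerm("GO:0043234", "protein complex", "cellular_component")
EXTRINSIC_MEMBRANE = GoTerm("GO:0019898", "extrinsic component of membrane",
                            "cellular_component")

# Illustrative labels (assumptions) mapped to the terms the text assigns to each case.
SOLUBLE = "Hdr complex, methanogens without cytochromes"
MEMBRANE = "Hdr complex, methanogens with cytochromes"

ANNOTATIONS = {
    SOLUBLE: [HDR_ACTIVITY, HDR_COMPLEX, PROTEIN_COMPLEX],
    # The cytochrome-containing case adds one further cellular component term.
    MEMBRANE: [HDR_ACTIVITY, HDR_COMPLEX, PROTEIN_COMPLEX, EXTRINSIC_MEMBRANE],
}


def terms_by_aspect(label: str, aspect: str) -> list[str]:
    """Return the GO identifiers of one aspect attached to the given complex."""
    return [term.go_id for term in ANNOTATIONS[label] if term.aspect == aspect]
```

Calling `terms_by_aspect(MEMBRANE, "cellular_component")` lists the three cellular component identifiers, which makes the single additional term for cytochrome-containing methanogens easy to see.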

#### *Methanogenesis from formate*

The carbon transfer and reduction steps in this process (GO:2001127) are similar to those described above for methanogenesis from H2+CO2 (GO:0015948). Both the CO2 and reducing power are derived from formate by the action of an F420-dependent formate dehydrogenase (FdhABC) (GO:0043794) (Schauer and Ferry, 1982; Lie et al., 2012); FdhAB subunits form the enzyme that produces CO2 and reduced F420 or F420H2 (HCOO<sup>−</sup> + H<sup>+</sup> + F420 → CO2 + F420H2) and FdhC is thought to import formate into the cell (Wood et al., 2003). CO2 is converted to methane using the CO2-reduction pathway described in **Figure 4**. Some of the reduced F420 (F420H2) participates directly in the Mtd and Mer reactions and a part of it is used by a bifurcating complex that provides electrons of appropriate redox potentials to heterodisulfide reductase and formyl-MF dehydrogenase. This bifurcating complex differs in composition and in some of its properties from the one employed for methanogenesis from H2+CO2 (see above). When a methanogen grows on formate, a part of the Fdh pool associates with the Hdr and Vhu/Vhc hydrogenases, and together with a formyl-MF dehydrogenase they form a bifurcating complex (Lie et al., 2012). This Fdh-containing bifurcating complex utilizes electrons from F420H2 (produced by Fdh) and generates high and low potential electrons, either directly or via production of hydrogen as intermediate, that are consumed in the reduction of CoM-S-S-CoB and the generation of formyl-MF, respectively (Lie et al., 2012).

### *Methanogenesis from ethanol or secondary alcohols plus carbon dioxide*

Only a few methanogens can perform methanogenesis with secondary alcohol as electron source (GO:MENGO-UR; also secondary alcohol catabolic process, GO:MENGO-UR) (Widdel, 1986; Bleicher et al., 1989; Widdel and Wolfe, 1989; Schirmack et al., 2014). These substrates are oxidized to their respective ketones to provide reducing equivalents for the reduction of carbon dioxide to methane via the pathway shown in **Figure 3** (Boone et al., 1993; Zinder, 1993). Ethanol, when used, is converted to acetaldehyde (methanogenesis with ethanol as electron source, GO:MENGO-UR; ethanol catabolic process, GO:0006068). These conversions are consistent with the general observation that methanogens cannot break carbon-carbon bonds in energy substrates other than that found in acetate (see below). Two types of alcohol dehydrogenase have been found in these organisms: one reduces nicotinamides (NAD+ or NADP+) (GO:0004022 and GO:0008106), and the other transfers electrons to coenzyme F420 during alcohol oxidation (GO:0052753) (Widdel and Wolfe, 1989). Most of these enzymes have broad specificities allowing the organisms to use ethanol, 2-propanol, 2-butanol, 2-pentanol, cyclopentanol, cyclohexanol, and 2,3-butanediol (Bleicher et al., 1989).

### *Methanogenesis from carbon monoxide*

Many methanogens can utilize carbon monoxide (CO) although higher levels of this gas inhibit growth of these archaea (Daniels et al., 1977; O'brien et al., 1984; Rother and Metcalf, 2004; Lessner et al., 2006). Three routes of CO utilization have been found in these organisms. In one, called methanogenesis from carbon monoxide (GO:2001134), CO is simply oxidized to CO2 by carbon monoxide dehydrogenase (CODH) (GO:0008805), and the resulting two electrons are used for either hydrogen production (GO:1902422) or ferredoxin reduction (Daniels et al., 1977; Ferry, 1999; Vepachedu and Ferry, 2012). Then the hydrogen and/or reduced ferredoxin are used for methanogenesis from CO2 (GO:0019386). Overall, for every four moles of CO oxidized, one mole of methane and 3 moles of CO2 are produced. The second mode of CO utilization has been found in *Methanosarcina acetivorans* where methanogenesis (GO: 0015948) is inhibited by CO but growth is not (Rother and Metcalf, 2004). This organism uses two non-methanogenic routes for energy production (Rother and Metcalf, 2004), the primary one being acetogenic (acetate biosynthetic process, GO:0019413) and the secondary one being formate-forming (formate biosynthetic process, GO:0015943). Even under these conditions methanogenesis operates at low rates, primarily to provide cellular biosynthetic precursors (Rother and Metcalf, 2004). Here methanogenesis from CO2 involves novel enzymes that transfer the methyl group of CH3-H4MPT to CH3-CoM and serve in the accompanying energy conservation (Lessner et al., 2006); the methyl transfer step could involve a cytoplasmic methyltransferase (CmtA) in addition to a membrane-bound methyl-H4MPT:coenzyme M methyl transferase (Mtr) (Vepachedu and Ferry, 2012). In the third route, CO promotes the production of dimethyl sulfide and methanethiol (3CO + H2S + H2O → CH3SH + 2CO2) and energy is conserved via a yet to be identified system (Moran et al., 2008).
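The stoichiometries quoted above can be checked with a simple element balance. The sketch below is an illustration only: the helper names `atoms`, `side_total` and `is_balanced` are hypothetical, the parser handles only plain formulas without brackets or charges, and the two moles of water on the reactant side of the first reaction are an assumption added to close the balance for the stated ratio of four CO to one CH4 and three CO2.

```python
import re
from collections import Counter


def atoms(formula: str) -> Counter:
    """Count atoms in a plain formula such as 'CH3SH' (no brackets, no charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side: dict[str, int]) -> Counter:
    """Sum atoms over one side of a reaction given as {formula: coefficient}."""
    total = Counter()
    for formula, coefficient in side.items():
        for element, number in atoms(formula).items():
            total[element] += coefficient * number
    return total


def is_balanced(reactants: dict[str, int], products: dict[str, int]) -> bool:
    """True when every element occurs equally often on both sides."""
    return side_total(reactants) == side_total(products)


# Methanogenesis from CO: 4 CO + 2 H2O -> CH4 + 3 CO2 (water is an assumption).
co_to_methane = is_balanced({"CO": 4, "H2O": 2}, {"CH4": 1, "CO2": 3})

# Third route from the text: 3 CO + H2S + H2O -> CH3SH + 2 CO2.
co_to_methanethiol = is_balanced({"CO": 3, "H2S": 1, "H2O": 1},
                                 {"CH3SH": 1, "CO2": 2})
```

Both flags evaluate to true. Leaving the water out of the first reaction breaks the hydrogen and oxygen balance, which shows why water must take part in the overall conversion.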

### *Methanogenesis from methanol, methylamines and methanethiols*

Methanogenesis from all of these substrates involves the formation of methyl-CoM as an intermediate (Ferry, 1999). When methanol serves as the sole substrate for methanogenesis (GO:0019387), it provides both carbon and reductant for methanogenesis and this process consumes four moles of methanol for every three moles of methane generated. Of these, one mole of methanol is oxidized to CO2, generating six-electron equivalents of reductant, which are then used to convert three moles of methanol to three moles of methane (Keltjens and Vogels, 1993). The oxidation of methanol to CO2 involves a part of the CO2 reduction pathway, but in the reverse direction (**Figure 3**) (Wassenaar et al., 1998). The methyl groups enter this oxidation process at the methyl-coenzyme M stage by the action of two methyl transferases, MT1 and MT2 or MT2-M (Van Der Meijden et al., 1984a,b; Keltjens and Vogels, 1993; Wassenaar et al., 1998; Ferry, 1999). MT1 is a two-subunit enzyme (MtaBC) and MT2-M has one subunit (MtaA). The first reaction involves transfer of the methyl group of methanol by MT1 to the corrinoid co-factor of its MtaC subunit; this is an automethylation process (Van Der Meijden et al., 1984a,b; Wassenaar et al., 1998). Then MT2-M or MtaA transfers the methyl group from MtaC to HS-CoM, generating methyl-coenzyme M (Van Der Meijden et al., 1984a,b; Wassenaar et al., 1998). Existence of isozymes of MT1 catering to the growth on methanol under various conditions has also been reported (Bose et al., 2006). The methyl groups destined for oxidation are transferred from CH3-CoM to H4MPT by the membrane-bound methyl-H4MPT:coenzyme M methyl transferase (Mtr) (Fischer et al., 1992; Sauer et al., 1997; Ferry, 1999). This endergonic reaction is assisted by a Na+-gradient and generates CH3-H4MPT (Deppenmeier et al., 1999; Ferry, 1999; Deppenmeier and Muller, 2008). The steps from CH3-H4MPT to CO2 are a reversal of those used for CO2-reduction, except that the organisms performing this process lack Hmd and the F420-dependent Mtd performs the oxidation of methylene-H4MPT to methenyl-H4MPT (Thauer et al., 1993; Deppenmeier et al., 1999; Ferry, 1999; Deppenmeier and Muller, 2008).
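The four-to-three ratio of methanol to methane follows from electron bookkeeping, which can be sketched in a few lines. This is an illustration under stated assumptions: the function `split_methanol` and the constant names are hypothetical, the six electrons per oxidized methanol come from the text, and the two electrons per methanol reduced to methane are inferred from the statement that six electron equivalents reduce three moles of methanol.

```python
from fractions import Fraction

# Electrons released when one methanol is oxidized to CO2 (stated in the text).
ELECTRONS_PER_METHANOL_OXIDIZED = 6
# Electrons consumed when one methanol is reduced to methane
# (inferred: six electron equivalents reduce three moles of methanol).
ELECTRONS_PER_METHANOL_REDUCED = 2


def split_methanol(total_mol: int) -> tuple[Fraction, Fraction]:
    """Split methanol into (oxidized, reduced) moles so that electrons balance.

    Solves oxidized * 6 == reduced * 2 with oxidized + reduced == total_mol.
    """
    total = Fraction(total_mol)
    oxidized = total * ELECTRONS_PER_METHANOL_REDUCED / (
        ELECTRONS_PER_METHANOL_OXIDIZED + ELECTRONS_PER_METHANOL_REDUCED
    )
    reduced = total - oxidized
    return oxidized, reduced


oxidized, reduced = split_methanol(4)
# Each reduced methanol yields one methane; each oxidized methanol yields one CO2.
methane_mol, co2_mol = reduced, oxidized
```

For four moles of methanol the split is one mole oxidized and three moles reduced, matching the ratio given above; exact fractions are used so that other input amounts do not pick up rounding error.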

Utilization of mono-, di- and tri-methylamines (MMA, DMA, and TMA) for methanogenesis (GO:2001128, GO:2001129, GO:2001130 respectively) follows the general process that is described above for methanol except that substrate-specific methyl transferases are involved in the transfer of methyl groups to coenzyme M. Using MT1 and MT2 of the methanol systems as the reference the methylamine-specific methyl transferases have been named as follows (Wassenaar et al., 1996, 1998; Ferguson and Krzycki, 1997; Burke et al., 1998; Ferry, 1999; Ferguson et al., 2000; Paul et al., 2000; Bose et al., 2008): for MMA, MMAMT+MMCP (MT1) and MT2-A (MT2); for DMA, DMA-MT (MT1) and MT2-A (MT2); for TMA, TMA-52+TCP (MT1) and MT2-A (MT2). For methanogenesis from TMA, MT2-A could be substituted by MT2-M (Ferry, 1999). Methanogenesis from methylated thiols (methanethiol, dimethylsulfide, or methylmercaptopropionate; GO:2001133, GO:2001131, and GO:2001132) also involves special methyl transferase proteins (Tallant et al., 2001; Bose et al., 2009). For example, dimethylsulfide is converted to methyl-CoM by the actions of MtsB (MT2) and MtsA (MT2) (Tallant et al., 2001).

Energy conservation during methanogenesis from methylated compounds occurs in at least two ways. The F420H2 generated during the oxidation of the methyl group of CH3-H4MPT to CO2 is oxidized via the membrane-bound F420H2-dehydrogenase complex (reduced coenzyme F420 dehydrogenase activity, GO:0043738), and in the process a lipid soluble membrane-resident cofactor called methanophenazine is reduced (Deppenmeier and Muller, 2008). These events lead to the extrusion of two protons per F420H2 oxidized (Deppenmeier and Muller, 2008). There is another avenue that produces the same outcome and it begins with the release of molecular hydrogen through the oxidation of F420H2 by a soluble F420-dependent hydrogenase (Frh, GO:0050454) (Kulkarni et al., 2009). This hydrogen upon its release from the cell is captured by a membrane-bound hydrogenase complex (Vht/Vtx) (GO:MENGO-UR, GO:MENGO-UR), which transfers electrons generated from the oxidation of hydrogen to methanophenazine and releases two protons outside the cell (Kulkarni et al., 2009). In certain methanogens the latter process is the major route of F420H2 oxidation (Kulkarni et al., 2009). The reduced methanophenazine produced by these reactions is utilized by the membrane-bound heterodisulfide reductase (Hdr)-cytochrome b2 complex (GO:MENGO-UR) for the reduction of CoM-S-S-HTP, and this process provides two more protons outside the cell (Deppenmeier and Muller, 2008; Kulkarni et al., 2009; Welte and Deppenmeier, 2014). All these extruded protons generate protonmotive force, which drives ATP synthesis (GO:0015986) via an ATP synthase (GO:0045259) (Deppenmeier and Muller, 2008; Kulkarni et al., 2009; Welte and Deppenmeier, 2014).

### *Methanogenesis from H2 plus methanol*

A GO term for this process has recently been proposed by us (methanogenesis from H2 and methanol, GO:1990491). Here, the methyl group of methanol is transferred to coenzyme M by two methyl transferases, MT1 and MT2, producing methyl-CoM (Keltjens and Vogels, 1993). The rest of the process, the reduction of methyl-CoM by HS-CoB, the reduction of CoM-S-S-CoB by electrons derived from hydrogen, and the energy conservation, likely follows the system described in the section on methanogenesis from H2 plus CO2; an exception is *Methanosphaera stadtmanae,* which grows only on H2 plus methanol with a supplement of acetate (Miller and Wolin, 1985), as it would employ the cytochrome-independent system (Fricke et al., 2006).

### *Methanogenesis from acetate*

About 70% of the biologically produced methane originates from acetate (GO:0019385) (Ferry, 1992, 1993, 1999). The methyl group of acetate is reduced to methane and the carboxyl group is oxidized to CO2, providing the reductant for methyl reduction (Ferry, 1992, 1993, 1999, 2011; Thauer et al., 2008). The process begins with the activation of acetate by the action of one of two systems, one involving acetate kinase and phosphotransacetylase (GO:0008776 and GO:0008959) and the other catalyzed by acetyl-CoA synthase (synonym of acetate-CoA ligase activity, GO:0003987), both generating acetyl-CoA (Aceti and Ferry, 1988; Jetten et al., 1989; Lundie and Ferry, 1989; Ferry, 1992, 1993, 1999, 2011; Thauer et al., 2008). The first route generates ADP that is converted back to ATP via electron transport phosphorylation at an ATPase (Ferry, 1992, 1993, 1999). In contrast, the second route generates AMP and pyrophosphate, and AMP has to be converted to ADP by adenylate kinase (GO:0004017) through the consumption of one ATP (AMP + ATP → 2ADP) before it can be used for the regeneration of ATP (ADP + Pi + energy → ATP) (Jetten et al., 1989; Zinder, 1993; Berger et al., 2012). Thus, organisms utilizing the acetyl-CoA synthase reaction are placed in an energetically unfavorable situation and exhibit slow growth rates (Zinder, 1993). However, by virtue of this investment they are able to utilize acetate even at very low concentrations and consequently are the predominant acetotrophic methanogens in many anaerobic niches of nature (Zinder, 1993). It is not known whether the energy present in pyrophosphate is conserved or is released via hydrolysis for the purpose of making the acetate activation process thermodynamically more favorable (Welte and Deppenmeier, 2014). The methanogens employing acetyl-CoA synthase carry pyrophosphatase (GO:0016462) and whether the enzyme is positioned to harvest or release energy is not known (Berger et al., 2012; Welte and Deppenmeier, 2014). 
The next step, the breakage of the carbon-carbon bond of the acetate moiety in acetyl-CoA, is catalyzed by an acetyl-CoA decarbonylase/synthase-carbon monoxide dehydrogenase complex (GO:0044672) (Ferry, 1993, 1999; Lu et al., 1994; Grahame, 2003; Li et al., 2006; Wang et al., 2011). The carbonyl group of acetyl-CoA is oxidized to CO2 by the CODH component (GO:0043885) and the reducing equivalents (two electrons) generated by this process help to reduce ferredoxin (Ferry, 1993, 1999, 2011; Lu et al., 1994; Grahame, 2003; Li et al., 2006; Wang et al., 2011). The methyl group of the acetyl group is transferred to H4MPT via a corrinoid cofactor of the CODH/ACDS complex, producing CH3-H4MPT (Ferry, 1999; Grahame, 2003). The methyl group of CH3-H4MPT leads to methane via the actions of methyl-H4MPT:coenzyme M methyl transferase (Mtr) and methyl-CoM reductase (**Figure 4**). The CO2 produced from acetate is hydrated to bicarbonate by a membrane-bound gamma-type carbonic anhydrase (GO:0004089) and is efficiently exported out of the cell (Ferry, 2011). This process is thought to improve the thermodynamic efficiency of methanogenesis from acetate (Ferry, 2011).
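The energetic difference between the two acetate activation routes described above can be expressed as a count of ADP molecules that must be re-phosphorylated per acetate activated. The sketch below is a simplified illustration: the function name `adp_to_rephosphorylate` and the route labels are hypothetical, and the count deliberately ignores pyrophosphate, whose fate the text describes as unknown.

```python
def adp_to_rephosphorylate(route: str, acetate_mol: float = 1.0) -> float:
    """ADP molecules to convert back to ATP per acetate activated (PPi ignored).

    route: "ack_pta" for acetate kinase plus phosphotransacetylase,
           "acs" for acetyl-CoA synthase followed by adenylate kinase.
    """
    if route == "ack_pta":
        # ATP -> ADP during activation: one ADP per acetate.
        return 1.0 * acetate_mol
    if route == "acs":
        # ATP -> AMP + PPi during activation, then AMP + ATP -> 2 ADP:
        # two ADP per acetate, that is, two ATP spent.
        return 2.0 * acetate_mol
    raise ValueError(f"unknown route: {route!r}")


cost_ratio = adp_to_rephosphorylate("acs") / adp_to_rephosphorylate("ack_pta")
```

Under these assumptions the acetyl-CoA synthase route costs twice as much ATP per acetate, which is consistent with the slower growth noted above; the ratio would change if the energy in pyrophosphate turned out to be conserved.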

There are two avenues for energy conservation in methanogenesis from acetate (Deppenmeier and Muller, 2008). One is via the use of the sodium potential generated by Mtr and has been described above. The other is through the oxidation of reduced ferredoxin through one of two complex processes. Certain acetotrophic methanogens oxidize reduced ferredoxin by use of Ech hydrogenase, generating molecular hydrogen and proton potential (Meuer et al., 1999, 2002; Kulkarni et al., 2009). The molecular hydrogen is utilized for the extrusion of additional protons and for heterodisulfide reduction via the Vho hydrogenase, methanophenazine and heterodisulfide reductase, as during methylotrophic methanogenesis (**Figure 4**; see above) (Kulkarni et al., 2009). In methanogens lacking Ech hydrogenase, a complex called Rnf utilizes reduced ferredoxin, producing a sodium gradient and transferring electrons to heterodisulfide reductase via methanophenazine. Thus, Rnf is considered a replacement of the Ech and Vho hydrogenases (Li et al., 2006; Wang et al., 2011). Both the H<sup>+</sup> and Na<sup>+</sup> potentials are utilized by an A1AO ATP synthase (GO:1990490) for ATP production (Deppenmeier and Muller, 2008); in some cases a Na+/H+ antiporter (GO:0015385) called Mrp adjusts the ratio of the two gradients for optimizing the thermodynamic efficiency of the ATP synthase (Li et al., 2006; Wang et al., 2011; Jasso-Chavez et al., 2013).

#### *Biosynthesis of methanogenesis coenzymes*

Many of the coenzymes involved in methanogenesis, namely methanofuran, tetrahydromethanopterin, tetrahydrosarcinapterin, coenzyme M, coenzyme F420, coenzyme B, methanophenazine, and coenzyme F430, have unusual properties. As a result, the respective biosynthesis pathways have attracted attention (Graham and White, 2002). This interest has increased further as some of these coenzymes have been found to perform critical functions in other organisms, such as in actinobacteria (includes mycobacteria and streptomyces groups), methanotrophic and methylotrophic bacteria, cyanobacteria, and plants (Takao et al., 1989; Batschauer, 1993; Purwantini et al., 1997; Chistoserdova et al., 2004; Krishnakumar et al., 2008). Some of the existing knowledge has been summarized at the MENGO website.

### **SYNTHETIC BIOLOGY EXPLOITATION OF METHANOGENESIS PATHWAYS IN METHANOGENS**

Exploitation of methanogens for the production of methane from unnatural substrates has begun. For example, *Methanosarcina acetivorans* has been made proficient in converting methyl acetate to methane and carbon dioxide, and in converting methyl propionate to methane and propionate. This was achieved by expressing a broad-specificity esterase (hydrolase activity, acting on ester bonds, GO:0016788) from *Pseudomonas veronii* in *M. acetivorans* (Lessner et al., 2010). Wild type *M. acetivorans* exhibits only a minor esterase activity. The heterologous esterase in the engineered strain releases methanol from these two esters, and methanol is used for methanogenesis following the pathways described above. Acetate, the other product from methyl acetate is also converted to methane whereas propionate generated from methyl propionate is excreted (Lessner et al., 2010).

#### **A NEW ROUTE FOR BIOLOGICAL PRODUCTION OF METHANE**

A recent discovery (Metcalf et al., 2012; Yu et al., 2013) shows that some of the abundant marine archaea and bacteria, which are distinct from the well-known methanogenic archaea, are likely major producers of methane in nature. Methane is abundant in the oceans, but the source was unclear (Reeburgh, 2007). Methylphosphonate was suspected as the source, as genes encoding carbon-phosphorus lyases are common in marine microbes, but the biosynthetic pathway for methylphosphonate was unknown (Karl et al., 2008). It has recently been shown that the marine archaeon *Nitrosopumilus maritimus* encodes a pathway for methylphosphonate biosynthesis and that it produces cell-associated methylphosphonate esters (Metcalf et al., 2012). The production of methylphosphonate seems to be a widespread process in marine microorganisms; when facing phosphorus limitation, these organisms would degrade methylphosphonate to obtain phosphorus, thus releasing methane (Metcalf et al., 2012). The GO database lacks descriptions for methylphosphonate biosynthetic and catabolic processes, as well as for the following key enzymes: carbon-phosphorus (C-P) lyase, producer of methane from methylphosphonate; and phosphonoacetaldehyde dehydrogenase (Pdh) and methylphosphonate synthase (MPn), two key enzymes for methylphosphonate biosynthesis (Metcalf et al., 2012). However, the terms for the first two enzymes of the methylphosphonate biosynthesis pathway that starts from phosphoenolpyruvate, namely "phosphoenolpyruvate mutase (Ppm) activity" (GO:0050188) and "phosphonopyruvate decarboxylase (Ppd) activity" (GO:0033980), do exist. To cover this new biological process for methane production we have proposed the following new GO terms: phosphonate carbon-phosphorus lyase activity (GO-MENGO-UR); "methane biosynthetic process" (GO:0015948), a parent term; two child terms, "aerobic methane biosynthetic process" (GO:MENGO-UR) and "anaerobic methane biosynthetic process" (GO:MENGO-UR).

## **GO ANNOTATION**

To begin the application of the GO terms to annotating genomes of methanogenic microbes, we have performed GO annotation of the relevant gene products encoded by these genomes. The annotations we created were based solely on experimental evidence (e.g., results from direct assays or mutant phenotypes), in order to provide "gold standards" for subsequent machine annotations. These annotations are available at the MENGO website under the Gene Annotations menu (Gene Annotations for Natural Biological System; Gene Annotations for Synthetic Biological System). Forms for the submission of new annotations (Submit New Gene Annotation for Natural or Synthetic Biological System) are available under the same menu.

We have annotated 80 gene products with the parent term "methane biosynthetic process" (GO:0015948) along with appropriate child terms (**Figure 6**). These gene products were categorized into three groups: 51 gene products for methanogenesis pathways, 19 gene products for biosynthesis of coenzymes specifically used in methanogenesis, and 10 gene products for coenzyme metabolism (see Table S1, Supplementary material).

#### **CONCLUDING REMARKS**

The goal of the MENGO project is to develop a set of GO terms for describing gene products involved in energy-related microbial processes. GO allows annotations of gene products using terms from three ontologies: molecular function, biological process, and cellular component. The GO embodies structured relationships among the terms and the annotations provide links between gene products and the terms (**Figure 6**). This combination allows researchers to infer possible functional roles of gene products in diverse organisms. A set of relevant gene products well-annotated with GO terms will help bioenergy researchers to design synthetic biological systems for commercially viable biofuel production efficiently, as it will allow effective mining for optimal parts from a larger natural inventory. For example, one could mine for amenable parts of a methanogenesis system from all available genomes, including those of organisms that do not produce methane. Thus, the GO terms and associated "gold standard" manual annotations that MENGO has developed should provide the foundation for a growing resource that is of wide value to the microbial bioenergy community. We encourage members of the research community to participate in our effort to develop additional GO terms and to perform manual annotations of gene products with potential applications in bioenergy production and bioremediation. The MENGO website provides electronic forms for the submission of candidate GO terms and annotations for review and subsequent submission to the GO database.

#### **ACKNOWLEDGMENTS**

This work was supported by grant DE-SC0005011 from the US Department of Energy. João C. Setubal was funded by CNPq and FAPESP, and Jane Lomax was supported by EMBL-EBI core funds. We thank Tirtha Bhattacharjee, Morgan Pixa, and Stephen Slaughter for help with the MENGO website and Morgan Pixa and Sujung Kang for help in annotation. We thank the GO consortium for collaboration and members of the bioenergy and bioremediation research community who attended our MENGO workshops and provided suggestions for GO term generation and annotation of gene products.

### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2014.00634/abstract

#### **REFERENCES**


thermophilic *Methanopyrus kandleri*. *Arch. Microbiol.* 159, 213–219. doi: 10.1007/BF00248474


and methanogenic Archaea. *FEMS Microbiol. Lett.* 146, 129–134. doi: 10.1111/j.1574-6968.1997.tb10182.x


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 09 July 2014; accepted: 05 November 2014; published online: 03 December 2014.*

*Citation: Purwantini E, Torto-Alalibo T, Lomax J, Setubal JC, Tyler BM and Mukhopadhyay B (2014) Genetic resources for methane production from biomass described with the Gene Ontology. Front. Microbiol. 5:634. doi: 10.3389/fmicb.2014.00634*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Purwantini, Torto-Alalibo, Lomax, Setubal, Tyler and Mukhopadhyay. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# *Trudy Torto-Alalibo<sup>1,2</sup>, Endang Purwantini<sup>1,2</sup>, Jane Lomax<sup>3</sup>, João C. Setubal<sup>2,4</sup>, Biswarup Mukhopadhyay<sup>1,2,5</sup> and Brett M. Tyler<sup>2,6</sup>\**

<sup>1</sup> Department of Biochemistry, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA

<sup>5</sup> Department of Biological Sciences, Oregon State University, Corvallis, OR, USA

<sup>6</sup> Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA

#### *Edited by:*

Katherine M. Pappas, University of Athens, Greece

#### *Reviewed by:*

Kesen Ma, University of Waterloo, Canada

Christopher J. Brigham, University of Massachusetts at Dartmouth, USA

David E. Graham, Oak Ridge National Laboratory, USA

#### *\*Correspondence:*

Brett M. Tyler, Center for Genome Research and Biocomputing, Oregon State University, 4750 Campus Way ALS3021, Corvallis, OR 97331, USA e-mail: brett.tyler@oregonstate.edu

Dramatic increases in research in the area of microbial biofuel production, coupled with high-throughput data generation on bioenergy-related microbes, have led to a deluge of information in the scientific literature and in databases. Consolidating this information and making it easily accessible requires a unified vocabulary. The Gene Ontology (GO) fulfills that requirement, as it is a well-developed structured vocabulary that describes the activities and locations of gene products in a consistent manner across all kingdoms of life. The Microbial ENergy processes Gene Ontology (http://www.mengo.biochem.vt.edu) project is extending the GO to include new terms to describe microbial processes of interest to bioenergy production. Our effort has added over 600 bioenergy-related terms to the Gene Ontology. These terms will aid in the comprehensive annotation of gene products from diverse energy-related microbial genomes. An area of microbial energy research that has received a lot of attention is microbial production of advanced biofuels. These include alcohols such as butanol, isopropanol, isobutanol, and fuels derived from fatty acids, isoprenoids, and polyhydroxyalkanoates. These fuels are superior to first generation biofuels (ethanol and biodiesel esterified from vegetable oil or animal fat), can be generated from non-food feedstock sources, can be used as supplements or substitutes for gasoline, diesel and jet fuels, and can be stored and distributed using existing infrastructure. Here we review the roles of genes associated with synthesis of advanced biofuels, and at the same time introduce the use of the GO to describe the functions of these genes in a standardized way.

**Keywords: Gene Ontology, advanced biofuels, synthetic biology, cellulosome, advanced alcohols, fatty acid-derived fuel, isoprenoid-derived fuel**

#### **INTRODUCTION**

Depletion of the world's fossil fuel resources and environmental concerns associated with the emission of greenhouse gases have fueled interest in renewable and environmentally friendly alternatives (Köne and Büke, 2010; Hansen et al., 2013; Suranovic, 2013). In this context, advanced biofuels have been of growing interest as these compounds can be generated from non-food cellulosic biomass, can be added directly to gasoline or diesel or sometimes used as stand-alone fuel, and can be stored and distributed using existing infrastructure (Dürre, 2007; Mehta et al., 2010; Weber et al., 2010). Advanced biofuels include alcohols such as butanol, isopropanol, and isobutanol, and fuels derived from fatty acids, isoprenoids, and polyhydroxyalkanoates (PHAs). Production of advanced biofuels from lignocellulosic feedstock begins with biomass deconstruction. The cellulosic component of the biomass is degraded into pentoses and hexoses. The multienzyme complexes involved in the degradation of cellulosic biomass are discussed in this review. The native and engineered pathways leading to the production of advanced biofuels have been studied extensively. As such, the primary published literature is rich in information derived from research on these fuels, but this information has yet to be aggregated in a manner that is easily accessible and amenable to computational analysis. The well-established Gene Ontology (GO; Ashburner et al., 2000) provides a basis for harnessing this information. The GO provides sets of standardized terms to describe molecular functions, biological processes and cellular components across all kingdoms of life (Ashburner et al., 2000; Harris et al., 2004; Torto-Alalibo et al., 2009; Dimmer et al., 2012). The Microbial ENergy processes Gene Ontology (MENGO) consortium<sup>1</sup>, an associate of the GO, initiated an effort in 2011 to extend the GO by developing missing terms associated with bioenergy processes. This effort has generated over 600 bioenergy-related GO terms, most of which are described in this review and the MENGO website<sup>1</sup>. These new terms, together with existing ones in the GO, were used to annotate gene products involved in bioenergy processes. Our emphasis is on selected native pathways leading to the generation of advanced biofuels. The scope of the review is summarized in **Figure 1**.

<sup>2</sup> Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA

<sup>3</sup> European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK

<sup>4</sup> Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, Brazil

<sup>1</sup>http://www.mengo.biochem.vt.edu

#### **THE GENE ONTOLOGY**

<sup>2</sup>http://www.ebi.ac.uk/chebi/

The GO is a structured, species-neutral ontology that describes the attributes of gene products. The GO consists of three distinct ontologies: "GO:0003674 molecular function," "GO:0008150 biological process" and "GO:0005575 cellular component," which are made up of terms (otherwise known as classes) arranged in a graph structure. Terms in the graph are related to one another via relationships, which are of a given type. Examples of relations used in GO include "is\_a," "part\_of," "regulates," "has\_part" and "occurs\_in." Some relationships, namely "is\_a" (is a type of) and "part\_of," are taxonomic such that "child terms" are more specific than the more general "parent" terms (Ashburner et al., 2000; Harris et al., 2004; **Figure 2**). Terms can have one or more "parent terms" and gene products annotated with specific child terms are automatically associated with the corresponding "parent terms" in the graph. Currently, GO uses the CHEBI database<sup>2</sup> as the source of all primary chemical names (Hill et al., 2013). Alternative names are added as synonyms. We made several additions to the CHEBI as it lacked most of the chemical entities used by the MENGO consortium. The GO also actively interacts with and maintains mappings to EC, MetaCyc, Rhea, Reactome, and several other systems<sup>3</sup>. GO is working with Rhea and Reactome on a system of automatic import for enzymatic reactions to avoid duplication of curation effort and to further improve interoperability.
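The automatic association of gene products with parent terms can be illustrated with a minimal Python sketch. It uses a toy excerpt of the graph built from GO identifiers that appear elsewhere in this review; intermediate GO terms are omitted, the relation types are assumed where the text only speaks of "child terms," and the gene product name `geneX` is hypothetical. It is not a representation of the actual GO graph.

```python
from collections import deque

# Toy excerpt: child term -> list of (relation, parent term).
# Intermediate GO terms are omitted and relation types are partly assumed,
# so this is NOT the real ontology structure.
PARENTS = {
    "GO:1990284": [("is_a", "GO:0071271")],     # hexose catabolic process to 1-butanol
    "GO:1990290": [("is_a", "GO:0071271")],     # pentose catabolic process to 1-butanol
    "GO:0071271": [("is_a", "GO:0046165")],     # 1-butanol biosynthetic process
    "GO:1990296": [("part_of", "GO:0043263")],  # scaffoldin complex -> cellulosome
}


def ancestors(term):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relation, parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def propagate(annotations):
    """Expand direct annotations {gene: {terms}} with all ancestor terms."""
    return {
        gene: set(terms).union(*(ancestors(t) for t in terms))
        for gene, terms in annotations.items()
    }


# A hypothetical gene product annotated only with the most specific term.
expanded = propagate({"geneX": {"GO:1990284"}})
print(sorted(expanded["geneX"]))
```

A breadth-first walk is used here only because a term may have more than one parent; any graph traversal that visits each ancestor once would serve equally well.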

The GO is widely recognized as a powerful tool for the annotation of gene products of all organisms including important biofuel producers like *Saccharomyces cerevisiae* and cyanobacteria (Christie et al., 2009; Beck et al., 2012). The ongoing maintenance and expansion of the GO results from international collaborative efforts that are managed by the Gene Ontology consortium (GOC). The GOC encourages and works with associated groups, such as MENGO, on term development and annotation in focused subject areas.

<sup>3</sup>http://wiki.geneontology.org/index.php/Enzymes\_and\_EC\_mappings

<sup>4</sup>http://www.mengo.biochem.vt.edu/

<sup>5</sup>http://tinyurl.com/o2l7nsy

**or products described with Gene Ontology (GO) terms.** Advanced alcohols such as isopropanol and butanol are synthesized via the CoA-dependent pathways; isobutanol, and other branched chain alcohols are produced via the keto-acid pathway and isopentenol via the isoprenoid pathway. Other advanced biofuel products, which include alkanes, alkenes and fatty methyl esters, are derived from fatty acid biosynthesis. Fuel products or their precursors are shown in sky blue shaded boxes. GO descriptions of the biological processes involved in the biosynthesis are shown in tan-shaded boxes. Relevant references are in the text.

The MENGO team has produced over 600 bioenergy-related terms<sup>4,5</sup>. Included in this set are some key terms relevant to the description of advanced biofuel production. For example, terms for the microbial production of isobutanol "GO:1901961 isobutanol biosynthetic process," isopropanol "GO:1902640 propan-2-ol biosynthetic process" and isopentenol "GO:1902934 isopentenol biosynthetic process" were additions made by the MENGO group. When specific gene products are annotated using GO terms, the information is stored in tabular form as "gene association" statements<sup>6</sup>, which are consolidated into a central GO database to enable public access to the annotations (Christie et al., 2009; Beck et al., 2012). Information stored in a gene association statement includes the gene product ID, the GO terms specifying the function of the gene product, its biological context and/or the location of activity. Experimental or bioinformatics evidence is represented by terms from the Evidence Code Ontology<sup>7</sup> (ECO). Also included are the taxonomic source of the gene product and the reference(s) (often a PubMed identifier) indicating the publication(s) from which information was obtained. Selected gene products annotated with GO bioenergy-related terms are provided in **Table 1** and Supplementary Table 1. All other bioenergy-related annotations made by MENGO can be found at the MENGO website<sup>8</sup> or the GO website<sup>9</sup> (AMIGO).
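As an illustration of the information carried by a gene association statement, the following Python sketch parses one record from a simplified, tab-separated line. The six-column layout, the gene product identifier `EXAMPLE:0001`, the taxon value and the reference `PMID:0000000` are hypothetical placeholders chosen for this example; the layout is not the official GO annotation file format, which has additional columns.

```python
from dataclasses import dataclass

# Ontology codes as used in Table 1 of this review.
ASPECTS = {
    "F": "molecular function",
    "P": "biological process",
    "C": "cellular component",
}


@dataclass(frozen=True)
class GeneAssociation:
    gene_product_id: str  # database identifier of the gene product
    go_id: str            # GO term describing function, process or location
    aspect: str           # F, P or C
    evidence_code: str    # e.g. IDA or IMP
    taxon: str            # taxonomic source of the gene product
    reference: str        # publication supporting the annotation


def parse_association(line):
    """Parse one simplified association line (hypothetical 6-column layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    record = GeneAssociation(*fields)
    if record.aspect not in ASPECTS:
        raise ValueError(f"unknown aspect: {record.aspect}")
    if not record.go_id.startswith("GO:"):
        raise ValueError(f"not a GO identifier: {record.go_id}")
    return record


# Placeholder record; only the GO term is taken from the text above.
example = "EXAMPLE:0001\tGO:1901961\tP\tIDA\ttaxon:4932\tPMID:0000000"
record = parse_association(example)
print(record.go_id, ASPECTS[record.aspect], record.evidence_code)
```

A gene product with several roles would simply appear in several such records, one per GO term.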

The GO has also been used extensively in the analysis of high-throughput data including genomic, transcriptomic, proteomic, and metagenomic data (Marinković et al., 2012; Molnar et al., 2012; Benso et al., 2013; Plewniak et al., 2013; Schmidt et al., 2014). Currently, the GO is limited to terms for describing natural processes (Blake and Harris, 2008). It does not provide terms for non-natural processes such as disease states, nor does it provide terms for novel biological functions that have been produced by genetic manipulations or synthetic biology. This is a current limitation of the GO in the context of bioenergy research. The MENGO team is currently discussing with the community the creation of a complementary set of GO-like terms suitable for synthetic biological functions (which we provisionally call SYNGO). To underscore the anticipated broad scope of SYNGO, at the end of this review we provide brief descriptions of some synthetic processes and cell parts that are driving the development of this resource.

#### **MULTIENZYME COMPLEXES FOR DECONSTRUCTION OF BIOMASS: CELLULOSOMES AND XYLANOSOMES**

Complete hydrolysis of cellulose to glucose requires the synergistic action of three general types of glycoside hydrolases (Doi, 2008; Gomez del Pulgar and Saadeddin, 2014): (i) Cellulases or endo-1,4-β-glucanases randomly hydrolyze internal bonds in a cellulose chain, releasing products of varying chain lengths. Non-processive cellulases are truly random, whereas processive cellulases, upon binding a cellulose chain, continue cutting through the bound substrate. (ii) Exo-1,4-β-glucanases "GO:0031217 glucan 1,4-beta-glucosidase activity" work from either the reducing or non-reducing end of the cellulose polymer to release cellobiose. They are sometimes called exocellulases and their processive forms are known as cellobiohydrolases "GO:0016162 cellulose 1,4-beta-cellobiosidase activity." A variation of this group, glucohydrolase "GO:0080079 cellobiose glucosidase activity," acts on the non-reducing end and releases glucose. (iii) β-glucosidases "GO:0008422 β-glucosidase activity" hydrolyze cellobiose released by other enzymes to glucose and are also known as cellobiases (Wilson, 2011, 2012). Activities of these enzymes on oligosaccharides have also been reported and their actions from the non-reducing end release glucose, endoglucanases, exoglucanases, and cellobiases (Chang et al., 2013).

<sup>6</sup>http://www.geneontology.org/GO.format.annotation.shtml

<sup>7</sup>http://www.evidenceontology.org/

<sup>8</sup>http://www.mengo.biochem.vt.edu/

<sup>9</sup>http://tinyurl.com/o2l7nsy

**Table 1 | Abbreviated Gene Ontology table showing annotations of selected experimentally characterized gene products associated with advanced biofuel production.**

∗Molecular function (F), biological process (P), cellular component (C).

†Inferred from direct assay (IDA), inferred from mutant phenotype (IMP).

The major component of hemicellulose is xylan. Two key enzymes collectively called xylanases "GO:0097599 xylanase activity" are responsible for hydrolysis of xylan. Endo-xylanase (endo-1,4-β-xylanase) "GO:0031176 endo-1,4-β-xylanase activity" acts on the homopolymeric backbone of 1,4-linked β-D-xylopyranose producing xylooligomers, and β-xylosidase (xylan-1,4-β-xylosidase) "GO:0009044 xylan-1,4-β-xylosidase activity" acts on the xylooligomers releasing xylose (Biely et al., 1985; Ahmed et al., 2009). Additionally, accessory enzymes such as acetyl xylan esterases "GO:0046555 acetylxylan esterase activity" act to remove the side chain substitution along the xylan backbone (Juturu and Wu, 2013). The GO uses the CHEBI database<sup>10</sup> as the source of all primary chemical names and all alternate names are listed as synonyms. These hydrolytic enzymes exist as free independent enzymes or as multienzyme complexes called cellulosomes "GO:0043263 cellulosome" and xylanosomes "GO:1990358 xylanosome" (Jiang et al., 2004, 2006). This section will focus on using the GO to describe these multienzyme complexes that mediate degradation of polysaccharides. Other aspects of deconstruction of biomass for biofuel can be found in several other reviews (Dodd and Cann, 2009; Blanch et al., 2011; Chundawat et al., 2011; Blumer-Schuette et al., 2014).

<sup>10</sup>http://www.ebi.ac.uk/chebi/

The cellulosome of *Clostridium thermocellum* serves as a paradigm for this enzymatic nanomachine, and thus as a model for studies of its structure and assembly process. The central component of the cellulosome is a non-catalytic "scaffoldin" subunit, which mediates a highly specific interaction between the enzyme-bearing type I dockerin modules and the resident type I cohesin modules (Bras et al., 2012; Smith and Bayer, 2013; Hong et al., 2014). The cellulose-binding domain (CBD) of the scaffoldin subunit aids in binding the complex to the cellulosic substrate. The type II dockerin module, which forms part of the primary scaffoldin, in turn binds to the type II cohesin of anchoring scaffoldins, which connect to the bacterial cell surface (**Figure 3**; Bayer et al., 1985; Bayer and Lamed, 1986). Recently, a third type of dockerin–cohesin interaction (type III) has been characterized in *Ruminococcus flavefaciens.* The type III dockerins and cohesins show a high degree of sequence divergence compared to their type I and II counterparts (Rincon et al., 2005; Karpol et al., 2013). Prior to our work, GO terms related to cellulosomes were "GO:0043263 cellulosome," its regulation terms, and the different hydrolytic enzyme terms. However, the latter terms were not necessarily linked to the cellulosome term in a way that clearly indicated their associations. We used the "domain binding" aspect of the GO to make the multiple functions of the scaffoldin more obvious. To start with, the ability of the non-enzymatic scaffoldin to bind to several cellulose degradative enzymes via the cohesin domain makes it a complex. Therefore we introduced the term "GO:1990296 scaffoldin complex" which has a "part of" relationship to "GO:0043263 cellulosome." In addition, domain-specific terms were also developed, namely: "GO:1990311 type I cohesin domain binding," "GO:1990308 type I dockerin domain binding," "GO:1990312 type II cohesin domain binding," "GO:1990309 type II dockerin domain binding," "GO:1990313 type III cohesin domain binding" and "GO:1990310 type III dockerin domain binding." Gene products binding to each of these modules were assigned the appropriate domain binding term.
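The assignment of domain binding terms can be sketched as a simple lookup. In the Python example below, the six GO identifiers are those listed above; the function names and the rule that a protein carrying a dockerin of a given type is annotated with the cohesin binding term of the same type (and vice versa) are illustrative assumptions drawn from the pairing described in the text, not a curation procedure prescribed by the GO.

```python
# GO "domain binding" terms introduced for cellulosome modules,
# keyed by (module type, module bound by the gene product).
DOMAIN_BINDING_TERMS = {
    ("I", "cohesin"): "GO:1990311",     # type I cohesin domain binding
    ("I", "dockerin"): "GO:1990308",    # type I dockerin domain binding
    ("II", "cohesin"): "GO:1990312",    # type II cohesin domain binding
    ("II", "dockerin"): "GO:1990309",   # type II dockerin domain binding
    ("III", "cohesin"): "GO:1990313",   # type III cohesin domain binding
    ("III", "dockerin"): "GO:1990310",  # type III dockerin domain binding
}

# Dockerins pair with cohesins of the same type.
PARTNER = {"cohesin": "dockerin", "dockerin": "cohesin"}


def binding_term(module_type, bound_module):
    """GO term for a gene product that binds the given module."""
    key = (module_type, bound_module)
    if key not in DOMAIN_BINDING_TERMS:
        raise ValueError(f"no domain binding term for {key}")
    return DOMAIN_BINDING_TERMS[key]


def term_for_carrier(module_type, carried_module):
    """GO term for a protein that carries a module and so binds its partner.

    Illustrative assumption: a protein bearing a type I dockerin binds a
    type I cohesin and is annotated with type I cohesin domain binding.
    """
    return binding_term(module_type, PARTNER[carried_module])


# An enzyme bearing a type I dockerin module.
print(term_for_carrier("I", "dockerin"))
# An anchoring scaffoldin carrying a type II cohesin.
print(term_for_carrier("II", "cohesin"))
```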

The xylanosome occurs naturally in anaerobic fungi and bacteria. However, besides containing several hemicellulases, not much is known about its assembly. We introduced the term "GO:1990358 xylanosome" into the GO. A modular nature, if any, has not been described for the xylanosome. Therefore for now, the only way to identify components of the xylanosome will be at the annotation level where the hemicellulases are annotated with their unique enzymatic activity terms using the Molecular Function ontology and also with the cellular component term "GO:1990358 xylanosome" where there is experimental evidence of an association with the xylanosome.

The synergy and combined action of hemicellulases and cellulases within these multi-enzyme systems makes them efficient in the degradation of biomass (Morais et al., 2010). Systematic annotation of the ever-increasing repertoire of dockerin–cohesin pairs of the cellulosome, using the GO terms described above, will serve to inform their use in engineering more efficient cellulosomes, xylanosomes, and cellulosome-like complexes.

### **PRODUCTION OF ISOPROPANOL AND BUTANOL**

The cellulosome is a multi-enzyme complex involved in the deconstruction of biomass for biofuel production. The central scaffoldin subunit of the cellulosome consists of a cellulose-binding module (colored green), which binds to the cellulosic substrate; and a type-II dockerin (colored dark green), which attaches the scaffoldin to the bacterial cell surface via the type-II cohesins of the anchoring protein (colored blue). The type-I cohesins (colored cellulosome via the enzyme bearing type-I dockerin module. The different subunits of the cellulosome complex and their roles are described with appropriate Gene Ontology terms, notably the "domain binding" terms that are used to curate proteins/enzymes binding to specific modules. Relevant references are in the text. Figure is based on information provided in Smith and Bayer (2013).

Clostridia can produce isopropanol and butanol from sugars via pyruvate and acetyl-CoA [for a review of early work in this area see Stephenson (1949) and references therein]. Short chain alcohols proposed as advanced alcohols include those with C3–C4 units such as isopropanol and butanol (Gronenberg et al., 2013; Xue et al., 2013). These alcohols have superior properties relative to ethanol in terms of energy density or energy content, and ease of storage and distribution. Similar to ethanol, these can also be produced from renewable non-food feedstocks. Traditionally, butanol and isopropanol are produced as a mixture by some clostridial species such as *Clostridium acetobutylicum* possessing the acetone/isopropanol butanol–ethanol (A/IBE) pathway (**Figure 4**). Following breakdown of the pentose sugars via the pentose phosphate pathway "GO:0019323 pentose catabolic process" and hexose sugars via glycolysis "GO:0006096 glycolysis" to pyruvate, the A/IBE process can be divided into two distinct metabolic phases. The first phase involves the formation of the organic acids butyrate and acetate (called "acidogenesis"). In the second phase, these organic acids may act as substrates for the biosynthesis of acetone and alcohols (butanol and ethanol). The latter phase is referred to as "solventogenesis." The production of butyrate and acetate is represented in the GO as "GO:0046358 butyrate biosynthetic process" and "GO:0019413 acetate biosynthetic process," respectively. These two child terms fall under the parent term "GO:0016053 organic acid biosynthetic process." Solventogenic enzymes and gene products associated with the formation of acetone, butanol, ethanol, and isopropanol are annotated with appropriate GO molecular function terms and also with the biological process terms "GO:0043445 acetone biosynthetic process," "GO:0071271 1-butanol biosynthetic process," "GO:0006115 ethanol biosynthetic process" and "GO:1902640 isopropanol biosynthetic process," respectively. The processes represented by these terms fall under "GO:0042181 ketone biosynthetic process" and "GO:0046165 alcohol biosynthetic process." The terms for organic acid and solvent production have the synonyms "acidogenesis" and "solventogenesis," respectively. The addition of synonyms is a feature of the GO that facilitates easy access to concepts represented by diverse descriptions in the literature.
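The role of synonyms in retrieving terms can be shown with a small Python sketch that indexes terms by both their primary names and their synonyms. The term names are taken from the paragraph above; attaching the synonym "solventogenesis" to "GO:0046165 alcohol biosynthetic process" is an assumption made for this example, as the text does not state which solvent-production term carries it.

```python
# Toy term table. Names come from the text; placing "solventogenesis" on
# GO:0046165 is an assumption made for this illustration.
TERMS = {
    "GO:0016053": {
        "name": "organic acid biosynthetic process",
        "synonyms": ["acidogenesis"],
    },
    "GO:0046165": {
        "name": "alcohol biosynthetic process",
        "synonyms": ["solventogenesis"],
    },
    "GO:0043445": {
        "name": "acetone biosynthetic process",
        "synonyms": [],
    },
}


def build_index(terms):
    """Map every lower-cased name and synonym to its GO identifier."""
    index = {}
    for go_id, entry in terms.items():
        for label in [entry["name"], *entry["synonyms"]]:
            index[label.lower()] = go_id
    return index


INDEX = build_index(TERMS)


def lookup(query):
    """Return the GO identifier for a name or synonym, or None if unknown."""
    return INDEX.get(query.strip().lower())


# A literature keyword and the formal term name resolve to the same term.
print(lookup("Acidogenesis"), lookup("organic acid biosynthetic process"))
```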

In *Clostridium*, the physiological transition from acidogenesis to solventogenesis is associated with the stationary phase of the organism's growth cycle. The *Clostridium* Spo0A gene product has been shown to regulate solventogenesis (Ravagnani et al., 2000; Harris et al., 2002; Dürre and Hollergschwandner, 2004; Hu et al., 2011) and activate the downstream genes of the acidogenic process, in addition to controlling sporulation "GO:0043934 sporulation." We annotated Spo0A with the term "GO:1902930 regulation of alcohol biosynthetic process" which has "regulation of solventogenesis" as a synonym, as well as with the term "GO:0043937 regulation of sporulation." Another gene product, SpoIIE, was shown to control sporulation in *C. acetobutylicum* but was not associated with solvent synthesis (Scotcher and Bennett, 2005; Bi et al., 2011). The multiple roles of Spo0A are thus documented as separate entries in the GO association table with its role in sporulation overlapping that of SpoIIE. Biosynthesis of butanol and isopropanol share the same metabolic pathway from pyruvate to acetyl-CoA and thereafter follow respective branches (**Figure 4**).

Microbial production of butanol occurs naturally in certain *Clostridium* species as part of the acetone butanol ethanol (ABE) pathway (Lee et al., 2008a; Qureshi et al., 2010). Some of these bacteria can use both pentose and hexose sugars as substrates. Typically, the breakdown of pentose and hexose sugars to pyruvate follows different paths but subsequent steps from pyruvate to butanol production may comprise the same reactions. The GO terms "GO:1990284 hexose catabolic process to 1-butanol" and "GO:1990290 pentose catabolic process to 1-butanol" are child terms of "GO:0071271 1-butanol biosynthetic process." In the so-called CoA-dependent pathway for butanol production, the generation of butanol is initiated with two molecules of acetyl-CoA. Six enzymes encoded by seven genes mediate reactions leading to the production of butanol. Two molecules of acetyl-CoA are condensed, reduced, and dehydrated to form crotonyl-CoA, which is then reduced to butanol. The enzymes responsible for these reactions are shown in **Figure 4** and **Table 1**, with appropriate GO annotations. The key enzymes leading to butanol production are each described with a distinct molecular function term. The transition from acidogenesis to solventogenesis in some *Clostridium* species occurs at low pH as a result of acid accumulation. Butyrate, produced during acidogenesis, becomes a substrate for butanol production in the solventogenesis phase. This process is mediated by the enzymes butyrate acetoacetate CoA-transferase "GO:0047371 butyrate acetoacetate CoA-transferase activity," butyraldehyde dehydrogenase and butanol dehydrogenase "GO:1990362 butanol dehydrogenase activity."

Isopropanol is naturally produced by *Clostridium beijerinckii* and *Clostridium aurantibutyricum* (George et al., 1983), although the natural yield is very low (Chen and Hiu, 1986). It can be used directly as an additive to gasoline or as a feedstock for biodiesel production (Lee et al., 1995); in the latter case esterification with isopropanol is used to reduce the chances of crystallization at low temperatures. The native pathway involves the condensation of two molecules of acetyl-CoA into a molecule of acetoacetyl-CoA; the coenzyme A is then transferred to acetate or butyrate, catalyzed by CoA transferase "GO:0008410 CoA-transferase activity." The two child terms of CoA transferase are "GO:0008775 acetate CoA transferase" and "GO:0047371 butyrate-acetoacetate CoA-transferase." A reaction catalyzed by acetoacetate decarboxylase "GO:0047602 acetoacetate decarboxylase activity" converts acetoacetate to acetone. Subsequently acetone is converted to isopropanol in a NADPH-dependent reaction catalyzed by a secondary alcohol dehydrogenase (ADH) "GO:0050009 isopropanol dehydrogenase activity."
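The native isopropanol pathway described above can be written down as an ordered list of steps, each paired with the GO molecular function term given in the text. The Python sketch below is a bookkeeping illustration only; the `Step` structure and helper names are hypothetical, and the first step carries no GO identifier because the text does not name the condensing enzyme.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    substrate: str
    product: str
    enzyme: str
    go_id: Optional[str]  # molecular function term, if given in the text


# Steps as described in the text; two acetyl-CoA molecules enter step 1.
ISOPROPANOL_PATHWAY = [
    Step("acetyl-CoA", "acetoacetyl-CoA", "condensing enzyme (not named in the text)", None),
    Step("acetoacetyl-CoA", "acetoacetate", "CoA transferase", "GO:0008410"),
    Step("acetoacetate", "acetone", "acetoacetate decarboxylase", "GO:0047602"),
    Step("acetone", "isopropanol", "secondary alcohol dehydrogenase", "GO:0050009"),
]


def is_connected(pathway):
    """Check that each step's product is the next step's substrate."""
    return all(a.product == b.substrate for a, b in zip(pathway, pathway[1:]))


def route(pathway):
    """List the metabolites in order, from first substrate to final product."""
    return [pathway[0].substrate] + [step.product for step in pathway]


def function_terms(pathway):
    """Collect the GO molecular function terms attached to the steps."""
    return [step.go_id for step in pathway if step.go_id is not None]


print(" -> ".join(route(ISOPROPANOL_PATHWAY)))
print(function_terms(ISOPROPANOL_PATHWAY))
```

Recording a pathway this way makes it straightforward to check that the annotated steps form a continuous route before the gene products are associated with a biological process term.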

#### **ISOBUTANOL PRODUCTION FROM VALINE AND GLYCINE**

The branched-chain alcohol isobutanol (2-methylpropan-1-ol) exhibits superior physicochemical properties similar to those of butanol. Isobutanol is naturally produced during fermentation by *S. cerevisiae*, albeit in low amounts (Dickinson et al., 1998; Branduardi et al., 2013). Dickinson and coworkers examined the metabolic pathways used in the degradation of valine to isobutanol. Catabolic breakdown of valine to isobutanol is mediated by the Ehrlich pathway, which starts with transamination of valine to α-ketovalerate via the branched chain amino acid aminotransferase, Bat2, which is also known as Twt2p and Eca40p. The alternate names are curated in the GO association table under "database object symbol synonyms." The location of activity of gene products is described with the GO Cellular Component ontology. Bat2 has been shown to be localized in the cytosol "GO:0005829 cytosol" and a homolog, Bat1, is localized to the mitochondrial matrix "GO:0005759 mitochondrial matrix," both having the same molecular function "GO:0004084 branched chain amino acid aminotransferase" (**Table 1**). The subsequent decarboxylation step resulting in the formation of isobutyraldehyde is mediated by ketoacid decarboxylase. All three isozymes of pyruvate decarboxylase (Pdc1p, Pdc5p and Pdc6p) have been shown to be capable of decarboxylating α-ketovalerate (Dickinson et al., 1998). Finally, a reduction step mediated by ADH converts the aldehyde to isobutanol (Dickinson et al., 1998). The PDCs are each assigned the same GO Molecular Function term "GO:0004737 pyruvate decarboxylase activity" in separate entries in the GO association table and the alcohol dehydrogenase gene is assigned "GO:0004022 ADH activity." All these gene products are also associated with the biological process term "GO:1901961 isobutanol biosynthetic process."

Another pathway was recently discovered in *S. cerevisiae* for *de novo* isobutanol biosynthesis. A study by Villas-Bôas et al. (2005) suggested that glycine deamination led to the generation of glyoxylate and subsequently the formation of α-ketovalerate and α-isoketovalerate. Building on that work, Branduardi et al. (2013) conducted a study to decipher the components of the pathways leading to the formation of butanol and isobutanol with glycine as substrate. Briefly, glycine is converted into serine by serine hydroxymethyltransferase (Shm2; McNeil et al., 1994), and serine is then deaminated by serine deaminase (CHA1) to form pyruvate. Pyruvate is then converted to isoketovalerate and then to isobutanol following the reactions described above. These findings emphasize that it is important to define the formation of a product based on the substrate utilized, as the same end product is sometimes reached from multiple starting points. The terms "GO:1902697 valine catabolic process to isobutanol" and "glycine catabolic process to isobutanol" capture this difference, and both terms are linked to the parent term "GO:1901961 isobutanol biosynthetic process." The pathway also produces butanol, but since a high level of isobutanol production was observed, McNeil et al. (1994) hypothesized that carbon flux was being shunted away from butanol production toward isobutanol production by the isomerization of α-ketovalerate into α-isoketovalerate. Analogous to these natural processes in *S. cerevisiae*, Atsumi et al. (2008) employed a metabolic engineering approach, which involved diverting 2-keto acid intermediates from the amino acid biosynthetic pathway of *Escherichia coli* to produce other branched-chain alcohols such as 2-methyl-1-butanol, 3-methyl-1-butanol and 2-phenylethanol.
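
The relationship between the two substrate-specific terms and their shared parent can be sketched as a small lookup. This is a minimal illustration under stated assumptions: the `IS_A` mapping encodes only the two links named in the text, and the gene product assignments in `ANNOTATED_TO` are hypothetical examples chosen for the sketch, not curated annotations.

```python
# Child term -> parent term, as described in the text.
IS_A = {
    "valine catabolic process to isobutanol": "isobutanol biosynthetic process",
    "glycine catabolic process to isobutanol": "isobutanol biosynthetic process",
}

# Hypothetical assignments, for illustration only.
ANNOTATED_TO = {
    "Bat2": "valine catabolic process to isobutanol",
    "Shm2": "glycine catabolic process to isobutanol",
    "Cha1": "glycine catabolic process to isobutanol",
}


def is_a_or_descendant(term, ancestor):
    """Walk up the is_a links from term and report whether ancestor is reached."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def gene_products_for(term):
    """Gene products annotated to the term itself or to any of its children."""
    return sorted(g for g, t in ANNOTATED_TO.items() if is_a_or_descendant(t, term))
```

Querying the parent term returns products from both routes, whereas querying a child term returns only those tied to one substrate, which is the distinction the two terms are meant to capture.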

# **FATTY ACID BIOSYNTHESIS AND FATTY ACID-DERIVED BIOFUELS**

Fatty acid metabolism has attracted the most attention as a biological route to convert sugars to liquid transportation fuels (Zhou and Zhao, 2011; Lennen and Pfleger, 2013; Janssen and Steinbuchel, 2014). It is particularly suited to provide precursors for advanced biofuels because of its high efficiency and the high energy content of the end product (Atsumi and Liao, 2008; Rude and Schirmer, 2009; Peralta-Yahya and Keasling, 2010; Wen et al., 2013). Additionally, natural metabolic pathways have been identified that convert these precursors into the biofuel product. The first committed step in fatty acid biosynthesis in *E. coli* is the conversion of acetyl-CoA to malonyl-CoA, catalyzed by acetyl-CoA carboxylase (AccABCD) "GO:0003989 acetyl-CoA carboxylase activity," which is a protein complex (GO:0009317 acetyl CoA carboxylase complex) comprising four subunits (Li and Cronan, 1992). The malonyl group is then transferred from malonyl-CoA to acyl carrier protein (ACP) by a malonyl-CoA:ACP transacylase (FabD, GO:0004314 malonyl-CoA:ACP transacylase activity; Campbell and Cronan, 2001; Oefner et al., 2006). Subsequently, cycles of fatty acid elongation are initiated by the condensation of malonyl-ACP and acetyl-CoA, catalyzed by β-ketoacyl-ACP synthase III (FabH, GO:0033818 beta-ketoacyl-ACP synthase III activity; Lai and Cronan, 2003). FabB and FabF use acyl-ACP as substrates to initiate successive chain elongation reactions (Feng and Cronan, 2009). A β-hydroxyacyl-ACP is the product of the second step in elongation, generated by β-ketoacyl-ACP reductase (FabG, GO:0004316 beta-ketoacyl ACP reductase activity) while expending one molecule of NADPH. FabA and FabZ catalyze formation of an enoyl-ACP (Mohan et al., 1994). Following enoyl-ACP formation, the last intermediate in the fatty acid elongation cycle, acyl-ACP is formed by FabI with the consumption of NADPH (Heath and Rock, 1995). Two thioesterases (TEs) in *E. coli*, TesA and TesB, release the fatty acid chains from the ACP to produce free fatty acids (Lee et al., 2006). In the biofuel field there are preferences for fatty acids of various chain lengths (Lennen and Pfleger, 2012; Torella et al., 2013). For example, longer chain products (C12–C20) fall within the diesel range and have high energy densities (Wen et al., 2013). The GO has terms describing the synthesis of fatty acids of different chain lengths: for short-chain fatty acids, "GO:0051790 short-chain fatty acid biosynthetic process"; for medium-chain fatty acids, "GO:0051792 medium-chain fatty acid biosynthetic process"; for long-chain fatty acids, "GO:0042759 long-chain fatty acid biosynthetic process"; and for very long-chain fatty acids, "GO:0042761 very long-chain fatty acid biosynthetic process." While free fatty acids are valuable, they cannot be used directly as fuels and must first be converted either to fatty acid alkyl esters (for biodiesel), or to fatty acid-derived alkanes, alkenes or fatty alcohols (Zhang et al., 2011; Peralta-Yahya et al., 2012; Wen et al., 2013).
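
The elongation cycle described above lends itself to a simple tally. The sketch below is a back-of-the-envelope illustration, not a metabolic model. It assumes a saturated, even-chain product built from an acetyl-CoA primer, two carbons added per condensation, and one NADPH for each of the two reductions per cycle, as stated in the text. The function name and dictionary keys are introduced here for illustration.

```python
def elongation_budget(chain_length: int) -> dict:
    """Tally what it takes to build a saturated, even-chain acyl-ACP."""
    if chain_length < 4 or chain_length % 2 != 0:
        raise ValueError("expected an even chain length of at least 4 carbons")
    # Acetyl-CoA supplies the first two carbons; every condensation
    # (FabH for the first, FabB/FabF afterwards) adds two more from malonyl-ACP.
    condensations = (chain_length - 2) // 2
    return {
        "malonyl_acp_units": condensations,
        "elongation_cycles": condensations,
        # Two reductions per cycle (FabG, then FabI), one NADPH each.
        "nadph": 2 * condensations,
    }


# A C16 chain (palmitate) as a worked example.
palmitate = elongation_budget(16)
```

Under these assumptions a C16 chain requires seven condensations and fourteen NADPH, which gives a sense of the reducing power a fatty acid-overproducing strain must supply.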

#### **FATTY ACID-DERIVED ALKANES AND ALKENES**

Alkanes, an integral part of fossil fuels (gasoline, diesel and jet fuel), are naturally found in diverse organisms including plants, insects and microbial species, but the genetic and biochemical bases behind the production of alkanes have been elusive. Alkanes with C4–C23 carbon chain lengths possess high energy densities, hydrophobic properties and compatibility with existing liquid fuel infrastructure. Most evidence supporting the decarbonylation of aldehydes (a fatty acid metabolite) as the primary mechanism for alkane production has been obtained in eukaryotic systems (Cheesbrough and Kolattukudy, 1984; Dennis and Kolattukudy, 1991, 1992). This in part informed the identification of the *Arabidopsis cer* gene as encoding a protein with decarbonylase activity (GO:0071771 decarbonylase activity) involved in alkane biosynthesis (Aarts et al., 1995). It is only recently that the pathway of alkane biosynthesis was elucidated in cyanobacteria (**Figure 5**; Schirmer et al., 2010; Li et al., 2011, 2012), making this class of hydrocarbons eligible to be categorized as a next-generation (advanced) biofuel. In this pathway, fatty acids are reduced to alkanes by a two-step reaction. Fatty acyl-ACP (a fatty acid metabolite) is reduced to a fatty aldehyde by a fatty acyl-ACP reductase. This is followed by a deformylation step catalyzed by an aldehyde deformylating oxidase "GO:1990465 aldehyde oxygenase (deformylating) activity," resulting in alkane production (Schirmer et al., 2010; Li et al., 2011, 2012; Warui et al., 2011; Coates et al., 2014). Commonly found hydrocarbons in cyanobacteria are heptadecane (GO:1900636 heptadecane biosynthetic process) and methyl heptadecane (Schirmer et al., 2010; Zhang et al., 2011), which have cetane numbers of 105 and 66, respectively (the minimum requirement for US diesel under the ASTM standard is 47; Rashid et al., 2008), making the hydrocarbon products from cyanobacteria ideal candidates for diesel fuel applications.
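
The comparison of cetane numbers with the diesel minimum is simple arithmetic, shown here as a short Python sketch. The values are those quoted in the text; the function and variable names are introduced here for illustration.

```python
ASTM_MIN_CETANE = 47  # minimum quoted in the text (Rashid et al., 2008)

# Cetane numbers quoted in the text for the two cyanobacterial hydrocarbons.
CETANE_NUMBERS = {"heptadecane": 105, "methyl heptadecane": 66}


def cetane_margin(cetane_number, minimum=ASTM_MIN_CETANE):
    """Margin by which a fuel exceeds (or, if negative, misses) the minimum."""
    return cetane_number - minimum


margins = {name: cetane_margin(cn) for name, cn in CETANE_NUMBERS.items()}
all_pass = all(margin >= 0 for margin in margins.values())
```

Both compounds clear the minimum, heptadecane by 58 units and methyl heptadecane by 19.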

Another group of hydrocarbons produced from fatty acid derivatives are olefins (alkenes, GO:0043450 alkene biosynthetic process). In *Jeotgalicoccus* sp., the terminal alkenes 8-methyl-1-nonadecene and 17-methyl-1-nonadecene have been identified. A terminal olefin-forming fatty acid decarboxylase belonging to the cytochrome P450 family of enzymes (OleT) has been identified as the key enzyme involved in the production of these olefins in *Jeotgalicoccus* sp. (Rude et al., 2011). Other studies proposed a second mechanism for generating long chain (C23–C33) olefins: a head-to-head condensation of fatty acids, which involves the formation of a carbon-to-carbon bond between the carboxyl carbon of one fatty acid and the α-carbon of another. OleA, a homolog of the condensing enzyme in fatty acid biosynthesis [FabH, 3-oxoacyl-ACP ketosynthase (KS) III], was identified as a key enzyme in long chain olefin production (Beller et al., 2010; Sukovich et al., 2010). In cyanobacteria, the olefin-producing pathway (OLS) involves a polyketide synthase (GO:0034081 polyketide synthase complex) that first elongates fatty acyl-CoA with two carbons from malonyl-CoA via KS and acyl transferase (AT) domains. This is followed by reduction to the hydroxyacid by a ketoreductase (KR, GO:0045703 ketoreductase activity; Mendez-Perez et al., 2011). In the final step, sulfotransferases (ST, GO:0008146 sulfotransferase activity) activate the β-hydroxy group via sulfonation, and then a TE acts on this substrate to catalyze decarboxylation and loss of sulfate to form the terminal alkene (McCarthy et al., 2012).

#### **FATTY ACID ALKYL ESTERS (BIODIESEL)**

Biodiesel is a substitute for petroleum-based diesel fuel. Like its counterparts described above, biodiesel has properties similar to those of diesel, and can therefore be used in diesel engines and stored and distributed using the existing infrastructure. Other advantages of biodiesel include reduced fuel toxicity and increased lubricity. Additionally, the use of biodiesel leads to lower carbon monoxide and soot emissions than conventional diesel fuels. Biodiesel is traditionally produced by the trans-esterification of mostly plant-derived triacylglycerols, yielding glycerol and fatty acid alkyl esters (FAAEs), particularly fatty acid methyl esters (FAMEs; "GO:1902899 FAME biosynthetic process"). However, biodiesel production is limited by the availability of inexpensive vegetable oil feedstocks. This has prompted a search for sustainable alternatives. Direct microbial production of FAAEs is an area of intense research because it bypasses the transesterification step, reducing cost and energy, and also avoids the use of methanol, which is an expensive and toxic feedstock (Kalscheuer et al., 2006; Nawabi et al., 2011).

# **ISOPRENOID-DERIVED BIOFUEL**

Isoprenoids, also called terpenes, have been evaluated for pharmaceutical, nutritional and fuel products (Bohlmann and Keeling, 2008; Peralta-Yahya and Keasling, 2010). Two independent biosynthetic pathways for isoprenoid production "GO:0008299 isoprenoid biosynthetic process" are found in nature (**Figure 6**). The methylerythritol phosphate (MEP) pathway "GO:1902768 isoprenoid biosynthetic process via 1-deoxy-D-xylulose 5-phosphate" and the mevalonate pathway "GO:1902767 isoprenoid biosynthetic process via mevalonate" have evolved for the production of the key five-carbon isoprenoid intermediates, isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP; Beytia and Porter, 1976; Edwards et al., 1992; **Figure 6**). The MEP pathway consists of seven steps that convert glyceraldehyde-3-phosphate and pyruvate to IPP and DMAPP, and it is found in most bacteria, chloroplasts, unicellular eukaryotes, and certain parasites (Zhao et al., 2013; Jarchow-Choy et al., 2014; **Figure 6**). The mevalonate pathway is responsible for all the isoprenoid production in archaea, some bacteria and most eukaryotes (Miziorko, 2011; **Figure 6**). It converts acetyl-CoA in six steps to IPP via the key intermediate mevalonate. Specifically, acetyl-CoA and acetoacetyl-CoA are condensed into 3-hydroxy-3-methylglutaryl-CoA (HMG-CoA), which is reduced to mevalonate by HMG-CoA reductase (HMGR, GO:0042282 HMG-CoA reductase activity). Two phosphorylation steps follow, which convert mevalonate to mevalonate-5-diphosphate via the actions of mevalonic acid kinase "GO:0004496 mevalonic acid kinase activity" and phosphomevalonate kinase "GO:0004631 phosphomevalonate kinase activity." An ATP-coupled decarboxylation step converts mevalonate-5-diphosphate to the C5 building block IPP. An IPP isomerase "GO:0004452 IPP isomerase activity" is responsible for the interconversion of IPP and DMAPP. Condensation of IPP and DMAPP by prenyltransferases "GO:0004659 prenyltransferase activity" yields several prenyl-pyrophosphates including farnesyl-pyrophosphate, geranyl-pyrophosphate and geranylgeranyl-pyrophosphate. The prenyl-pyrophosphates in turn are converted to diverse terpenes, including monoterpenes, sesquiterpenes and diterpenes, by terpene synthases. The prenyl-pyrophosphates can also be hydrolyzed by pyrophosphatases "GO:0016462 pyrophosphatase activity" to form fuel-like esters and alcohols such as isoamylacetate and isopentenol "GO:1902934 isopentenol biosynthetic process." One such pyrophosphatase "GO:0016462 pyrophosphatase activity," nudF from *B. subtilis*, was shown to produce isopentenol in *E. coli*. Curating the biosynthesis of the intermediates and the vast diversity of terpene synthases "GO:0010333 terpene synthase activity" will therefore provide a useful resource to inform decisions in the synthesis of terpene-based fuels. A new class of terpenes (C35), the sesquarterpenes, has recently been classified (Sato, 2013), and it will be interesting to know whether these compounds could provide precursors for terpene-based biofuels.
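
Because every isoprenoid is assembled from C5 units (IPP and DMAPP), carbon counts follow directly from the number of units. The sketch below illustrates this arithmetic. The C35 size of the sesquarterpenes comes from the text, whereas the unit counts for the other classes and for the prenyl-pyrophosphates are standard textbook values supplied here as assumptions, and all names are introduced for illustration.

```python
CARBONS_PER_ISOPRENE_UNIT = 5  # IPP and DMAPP are both C5

# Number of C5 units per prenyl-pyrophosphate (standard values, not from the text).
PRENYL_PYROPHOSPHATES = {
    "geranyl-pyrophosphate": 2,
    "farnesyl-pyrophosphate": 3,
    "geranylgeranyl-pyrophosphate": 4,
}

# Number of C5 units per terpene class; sesquarterpenes are C35 per the text.
TERPENE_CLASSES = {
    "monoterpene": 2,
    "sesquiterpene": 3,
    "diterpene": 4,
    "sesquarterpene": 7,
}


def carbon_count(units):
    """Carbon atoms in a skeleton built from the given number of C5 units."""
    return CARBONS_PER_ISOPRENE_UNIT * units


def precursor_for(terpene_class):
    """Prenyl-pyrophosphate of matching size, or None if none is listed here."""
    units = TERPENE_CLASSES[terpene_class]
    for name, n in PRENYL_PYROPHOSPHATES.items():
        if n == units:
            return name
    return None
```

The lookup returns `None` for sesquarterpenes simply because no C35 prenyl-pyrophosphate is listed in this sketch, not because no such precursor exists.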

# **POLYHYDROXYALKANOATE-DERIVED BIOFUEL**

Microbially produced PHAs have attracted attention as biodegradable polyesters and quite recently as biofuels (Zhang et al., 2009). PHAs are synthesized inside cells during oxygen, phosphorus or nitrogen starvation in the presence of excess carbon and are found as insoluble cytoplasmic inclusions called polyhydroxyalkanoate granules (GO:0070088 PHA granule; Pfeiffer and Jendrossek, 2012; Jendrossek and Pfeiffer, 2013). Synthesis of a PHA can occur through several metabolic pathways and from a variety of carbon precursors including glucose, glycerol, and fatty acids (Park et al., 2012; Shahid et al., 2013). PHAs comprise over 150 homo- and hetero-polymers, of which poly-3-hydroxybutyrate [poly(3HB), P(3HB) or PHB] is the most abundant naturally produced and most studied polyhydroxyalkanoate "GO:0042618 polyhydroxybutyrate biosynthetic process" (Laycock et al., 2014). On the basis of the carbon chain lengths of the monomers, PHAs are divided into two main groups: short chain length PHAs (scl-PHA) containing monomer units with 3–5 carbon atoms, and medium chain length PHAs (mcl-PHA) composed of monomer units with 6–18 carbon atoms. Depending on the bacterial strains, PHA synthases and carbon sources involved, products of PHA biosynthesis can be homopolymers, copolymers, block polymers or even blends (Lu et al., 2009). Functional groups such as unsaturated bonds, benzene, halogens, cyclic chemicals and epoxides can also modify PHA structures, opening up the potential for the production of a vast number of diverse PHAs. The biosynthesis of PHA "GO:190144 polyhydroxyalkanoate biosynthetic process" is quite complex and involves several enzymes that are directly or indirectly involved in PHA synthesis (Reddy et al., 2003; Park et al., 2012). PHA synthesis is well studied in the model organism *Ralstonia eutropha* (Riedel et al., 2014). In this organism, the carbon source is converted into coenzyme A thioesters of (R)-hydroxyalkanoic acids. Following this step, β-ketothiolase "GO:0003988 acetyl-CoA C-acyltransferase activity" catalyzes the condensation of two coenzyme A thioester monomers such as acetyl-CoA and propionyl-CoA. An (R)-specific reduction step involving acetoacetyl-CoA reductase "GO:0018454 acetoacetyl-CoA reductase activity" produces (R)-3-hydroxybutyryl-CoA or (R)-3-hydroxyvaleryl-CoA, which is then converted into PHA by the action of PHA synthase. With regard to biofuels, 3-hydroxyalkanoates (3HAs; mcl PHAs) are linked by ester bonds formed between the hydroxyl group (–OH) of one monomer and the carboxyl group (–COOH) of the other monomer through catalysis by various PHA synthases (Reddy et al., 2003). These hydroxyalkanoate esters (3HA esters) have been found to be similar to methyl esters of long chain fatty acids (in biodiesel). As such they can be used as fuel additives. Hydroxybutyrate and hydroxyalkanoate methyl esters (3HBME and 3HAME), generated from esterification of scl PHB and mcl PHA with methanol, respectively, are considered equivalent to ethanol (Gao et al., 2011). These esters can also be used as gasoline and biodiesel additives (Gao et al., 2011).
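
The two-group classification of PHAs by monomer size can be expressed as a small function. The carbon ranges are those given in the text; the function name and the example monomers are introduced here for illustration.

```python
def classify_pha_monomer(n_carbons: int) -> str:
    """Classify a PHA monomer by carbon count, using the ranges in the text."""
    if 3 <= n_carbons <= 5:
        return "scl-PHA"  # short chain length
    if 6 <= n_carbons <= 18:
        return "mcl-PHA"  # medium chain length
    raise ValueError(f"{n_carbons} carbons is outside the scl/mcl ranges")


# Example monomers and their carbon counts.
EXAMPLES = {
    "3-hydroxybutyrate": 4,
    "3-hydroxyvalerate": 5,
    "3-hydroxyoctanoate": 8,
}
classes = {name: classify_pha_monomer(n) for name, n in EXAMPLES.items()}
```

Raising an error for out-of-range values, rather than returning a default, keeps monomers that fit neither group from being silently misclassified.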

#### **TOLERANCE TO BIOFUELS AND BY-PRODUCTS**

Accumulation of biofuels and by-products often affects the integrity of the fuel-producing organism's cell membrane and also impairs metabolic pathways associated with cell growth, thus compromising product titers (Heipieper et al., 2007; Segura et al., 2012). Most native producers of biofuels are sensitive to the solvents they produce. Advances are being made toward removing this block through investigations of the mechanisms of tolerance to product accumulation (McEvoy et al., 2004; Fujita et al., 2006; Liu and Qureshi, 2009a; Ghiaci et al., 2013; Zingaro and Terry Papoutsakis, 2013), and several organisms are being investigated as better choices for biofuel production from this standpoint (Dunlop, 2011). These efforts will be substantially aided by curating all gene products associated with native or improved tolerance to biofuels and their by-products. Tolerance is considered a phenotype, not a process, and is therefore out of scope for the GO Biological Process ontology. To circumvent this problem, we created terms of the form "process resulting in tolerance to x" instead of "tolerance to x" (where x is any compound). Here we present some examples.

The common inhibitors of microbial growth that are found in lignocellulose hydrolysates include aldehydes, ketones, phenols and organic acids (Klinke et al., 2004; Jonsson et al., 2013). An *S. cerevisiae* gene, *ARI1*, encoding a novel NADPH-dependent aldehyde reductase has been shown to be involved in providing tolerance to inhibitors such as furfural, vanillin and cinnamaldehyde that arise from lignocellulose hydrolysis (Liu and Moon, 2009b). We annotated the gene product ARI1 with a GO term for its molecular role, "GO:0018455 aldehyde reductase (NADPH/NADH) activity," and a biological process term, "GO:1990370 process resulting in tolerance to aldehyde," to describe its ability to confer tolerance to several aldehydes. Other notable *S. cerevisiae* enzymes with efficient aldehyde reduction activities include ADHs (ADH1, ADH6 and ADH7; GO:0004022 ADH activity), aldehyde dehydrogenase (ALD4; GO:0004030 aldehyde dehydrogenase activity) and methylglyoxal reductases (GRE2 and GRE3; GO:0043892 methylglyoxal reductase activity; Petersson et al., 2006; Laadan et al., 2008).

Tolerance to alcohols such as butanol in *C. acetobutylicum* is influenced by the multifunctional Spo0A protein, which, as discussed in an earlier section, controls sporulation and solventogenesis (Alsaker et al., 2004). Kanno et al. (2013) obtained several aerobic and anaerobic bacterial isolates from soil, spanning diverse genera, that tolerate butanol and isobutanol levels greater than 2% (vol/vol). The *cfa* gene, which encodes cyclopropane fatty acid (CFA) synthase in one of these bacteria (belonging to the *Firmicutes* phylum), confers solvent tolerance in recombinant *E. coli* (Kanno et al., 2013). CFA synthase is annotated with the terms "GO:1990336 process resulting in tolerance to butanol" and "GO:1990337 process resulting in tolerance to isobutanol." A quantitative transcriptomic analysis revealed that in the cyanobacterium *Synechocystis* sp. PCC 6803, which uses solar energy and carbon dioxide as sole energy and carbon sources, over 250 genes are induced upon exposure to butanol (Zhu et al., 2013). Of these, three (*sll0690*, *slr0947*, and *slr1295*) were further characterized using knock-out mutants. Strains lacking any of these genes were sensitive to butanol, indicating that the genes are involved in resistance to butanol. Using a genome-scale analysis in *S. cerevisiae*, González-Ramos et al. (2013) demonstrated the role of protein degradation in tolerance to C3 and C4 alcohols (butanol, 2-butanol, isobutanol, and isopropanol). Specifically, the YLR224W gene was found to be associated with increased butanol tolerance; it encodes a subunit of the Skp-Cullin-F-box (SCF) ubiquitin ligase that recognizes damaged proteins. The efflux pumps involved in the extrusion of toxins from cells have been considered appropriate candidates for studies on solvent tolerance (Segura et al., 2012). Such a pump consists of multiple proteins forming a multicomponent complex "GO:1990281 efflux pump complex," which spans the inner to the outer membrane of bacterial cells (Andersen, 2003; Du et al., 2014). Exposure to solvents induces the expression of the *srpABC* genes in *P. putida* (Kieboom et al., 1998b). The SrpABC proteins share considerable sequence similarity with multi-drug efflux pump proteins (Kieboom et al., 1998a). Specifically, SrpA functions as a periplasmic linker protein, SrpB as an inner membrane transporter, and SrpC as an outer membrane channel. For these reasons, besides annotations for their roles in the efflux pump and their cellular locations ("GO:0015562 efflux transmembrane transporter activity" from the Molecular Function ontology; "GO:1990281 efflux pump complex," "GO:0042597 periplasmic space," "GO:0009276 Gram-negative-bacterium type cell wall," and "GO:0019867 outer membrane" from the Cellular Component ontology), we have qualified these proteins with a biological process term "GO:1990367 process resulting in tolerance to organic substance." Rojas et al. (2001) have identified three different solvent efflux pumps in *P. putida*, TtgABC, TtgDEF, and TtgGHI, which extrude toluene. TtgABC can also extrude antibiotics (Ramos et al., 1998; Mosqueda and Ramos, 2000; Rojas et al., 2001). All these multi-protein efflux pumps are assigned the GO term "GO:1990281 efflux pump complex." Hydrocarbons including nonane, decane, and undecane were shown to be toxic to *S. cerevisiae* when these solvents accumulated inside the cell. A transcriptomic analysis identified a modified cell membrane and efflux pumps as contributing to alkane export and tolerance. Specifically, the efflux pumps Snq2p and Pdr5 were shown to reduce intracellular levels of decane and undecane, thereby enhancing tolerance to the alkanes "GO:1990373 process resulting in tolerance to alkane" (Ling et al., 2013).
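
The naming convention for tolerance terms, together with the examples given in this section, can be summarized in a short lookup. This Python sketch is illustrative only. The GO identifiers and gene product assignments are those quoted in the text, while the dictionary layout, the function names, and the grouping of SrpA, SrpB and SrpC under a single `SrpABC` key are simplifications introduced here.

```python
# Compound class -> GO ID of "process resulting in tolerance to <compound>".
TOLERANCE_TERM_IDS = {
    "aldehyde": "GO:1990370",
    "butanol": "GO:1990336",
    "isobutanol": "GO:1990337",
    "organic substance": "GO:1990367",
    "alkane": "GO:1990373",
}

# Gene products and the compounds they confer tolerance to, per the text.
TOLERANCE_ANNOTATIONS = {
    "ARI1": ["aldehyde"],
    "CFA synthase": ["butanol", "isobutanol"],
    "SrpABC": ["organic substance"],
    "Snq2p": ["alkane"],
    "Pdr5": ["alkane"],
}


def tolerance_term(compound):
    """Return (GO ID, term name) following the 'process resulting in' pattern."""
    return TOLERANCE_TERM_IDS[compound], f"process resulting in tolerance to {compound}"


def products_conferring_tolerance(compound):
    """Gene products annotated with the tolerance term for the compound."""
    return sorted(g for g, cs in TOLERANCE_ANNOTATIONS.items() if compound in cs)
```

Phrasing each term as a process keeps it within the scope of the Biological Process ontology while still allowing a query by compound to retrieve the gene products of interest.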

#### **SYNTHETIC BIOLOGY**

The yields of most fuels or fuel derivatives in the native biological systems discussed above are insufficient to be cost-competitive with petroleum-derived fuels, and therefore the technologies of metabolic engineering and synthetic biology are being employed to increase production and/or generate entirely new fuels with superior qualities. As mentioned in the introduction, currently the GO is limited to terms for describing natural processes. The MENGO team is currently discussing with the community the creation of a complementary set of GO-like terms suitable for the annotation of synthetic biological systems, which we provisionally call SYNGO. To underscore the anticipated broad scope of SYNGO, we provide here brief descriptions of selected synthetic processes and parts that are driving the development of commercially viable routes for the production of advanced biofuels.

Based on the modular nature of cellulosomes and the availability of a variety of dockerin-cohesin pairs, efforts are underway to construct designer cellulosomes, xylanosomes, and cellulosome-like structures. In such a precision-engineered multienzyme complex, the molecular architecture and enzyme content are well controlled, and the result is enhanced synergistic deconstruction of biomass (Mitsuzawa et al., 2009; Nordon et al., 2009; Smith and Bayer, 2013; Srikrishnan et al., 2013; Vazana et al., 2013). For example, Mitsuzawa et al. (2009) modified a thermostable group II chaperonin (an 18-subunit self-assembling protein complex called a rosettasome) from the archaeon *Sulfolobus shibatae* for use as a scaffold to assemble selected hydrolytic enzymes. A cohesin module was fused to each of the eighteen subunits, and the resulting scaffold was combined with dockerin-containing cellulases from *C. thermocellum* to build an 18-enzyme cellulosome-like structure that they termed a rosettazyme. Truncated cellulosomes, also called minicellulosomes, have been constructed in an effort to fully understand the relationship between cellulosome structure and enzymatic activity (Arai et al., 2007; Cha et al., 2007; Mingardon et al., 2007). Examples are the mini-CipA and mini-CipC1 cellulosomes from *Clostridium cellulovorans* and *Clostridium cellulolyticum*, respectively (Murashima et al., 2002; Perret et al., 2004). The engineered minicellulosome platforms can be extended to build complex designer cellulosomes. Designer cellulosomes are not described in the GO, and a complementary term "synthetic multi-cellular complex" would be useful to capture designer multienzyme complexes such as the rosettazyme, mini-CipA and mini-CipC1 and their components.

Generally, yield is one of the focal points for improving alcohol production by means of synthetic biology (Peralta-Yahya et al., 2012; Gronenberg et al., 2013). For example, the low butanol yield of native *Clostridium* species is not economically competitive, and thus the clostridial CoA-dependent pathway has been heterologously expressed in genetically tractable industrial microbial hosts such as *E. coli* and *S. cerevisiae* (Inui et al., 2008; Steen et al., 2008) for facile manipulation toward better yield and productivity. In one case, an *E. coli* strain bearing the CoA-dependent pathway was further modified by substituting the reversible, flavin-dependent butyryl-CoA dehydrogenase (Bcd) with an irreversible trans-enoyl-CoA reductase (Ter) for the reduction of crotonyl-CoA (Bond-Watts et al., 2011). The increased flux to butanol improved product yield (Bond-Watts et al., 2011). While Bcd is associated with "butanol biosynthetic process" in the GO, Ter is not, as it is part of a non-natural (synthetic) pathway. This limitation in the GO could potentially be addressed with complementary SYNGO terms; for example, Ter could be assigned "enoyl-CoA reductase involved in increased butanol biosynthesis."

The natural ability of *E. coli* to produce fatty acids and the creation of new biochemical reactions through synthetic biology have provided the means to divert fatty acid metabolism toward the production of fuels (Clomburg and Gonzalez, 2010; Wen et al., 2013). This approach is a more sustainable alternative to, for example, the production of biodiesel from plant oils. Direct microbial production of fatty acid esters (biodiesel) eliminates the need for a subsequent chemical transesterification step. A pathway that includes an acyl-CoA ligase and the broad specificity acyltransferase (WS/DGAT; AtfA), together with the enzymes that provide ethanol for esterification [pyruvate decarboxylase (pdc) and alcohol dehydrogenase B (adhB)], was incorporated into a fatty acid-overproducing *E. coli* strain for the production of fatty acid ethyl esters (Stoveken et al., 2005; Steen et al., 2010). Curating the components of the engineered pathway with complementary SYNGO terms, associating them with fatty acid ethyl ester biosynthesis, should provide a powerful tool for researchers working to improve these processes.

#### **CONCLUSION**

The MENGO group has created and continues to create GO terms relevant to microbial bioenergy research. In addition, the team has used the terms to produce high quality "gold standard" annotations of gene products based on the experimental scientific literature, which are useful for engineering microbial strains to optimize bioenergy production. These resources provide easy access to information that is otherwise dispersed in the scientific literature and also aid in the computational analysis of large datasets. The GO annotations can also be used to uncover metabolic pathways and to understand bioenergy-relevant microbes at the systems biology level. Community involvement in MENGO term development and annotations is welcome and will create a more comprehensive resource for use by all. Since the GO is currently restricted to natural processes, a need has arisen for a complementary set of GO-like terms to describe engineered processes and the gene products that comprise them. Community involvement in this further development is also essential.

#### **ACKNOWLEDGMENTS**

We thank the editors at the GOC for reviewing the MENGO terms. This work was supported by grant DE-SC0005011 from the US Department of Energy. Jane Lomax is funded by European Molecular Biology Laboratories (EMBL) core funds. João C. Setubal is funded by CNPq and FAPESP. The authors wish to thank all participants in the MENGO workshops held at various venues in 2011 and 2012.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2014.00528/abstract

#### **REFERENCES**


known unknowns. *Trends Microbiol.* 17, 286–294. doi: 10.1016/j.tim.2009.04.005


on bacterial biosynthesis of polyhydroxyalkanoates: evidence of an atypical metabolism in *Bacillus megaterium* DSM 509. *J. Biosci. Bioeng.* 116, 302–308. doi: 10.1016/j.jbiosc.2013.02.017


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 28 July 2014; accepted: 22 September 2014; published online: 10 October 2014.*

*Citation: Torto-Alalibo T, Purwantini E, Lomax J, Setubal JC, Mukhopadhyay B and Tyler BM (2014) Genetic resources for advanced biofuel production described with the Gene Ontology. Front. Microbiol. 5:528. doi: 10.3389/fmicb.2014.00528*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Torto-Alalibo, Purwantini, Lomax, Setubal, Mukhopadhyay and Tyler. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Aromatic inhibitors derived from ammonia-pretreated lignocellulose hinder bacterial ethanologenesis by activating regulatory circuits controlling inhibitor efflux and detoxification

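*David H. Keating<sup>1†</sup>, Yaoping Zhang<sup>1†</sup>, Irene M. Ong<sup>1†</sup>, Sean McIlwain<sup>1†</sup>, Eduardo H. Morales<sup>1,2†</sup>, Jeffrey A. Grass<sup>1,3</sup>, Mary Tremaine<sup>1</sup>, William Bothfeld<sup>1</sup>, Alan Higbee<sup>1</sup>, Arne Ulbrich<sup>4</sup>, Allison J. Balloon<sup>4</sup>, Michael S. Westphall<sup>2,4</sup>, Josh Aldrich<sup>5</sup>, Mary S. Lipton<sup>5</sup>, Joonhoon Kim<sup>1,6</sup>, Oleg V. Moskvin<sup>1</sup>, Yury V. Bukhman<sup>1</sup>, Joshua J. Coon<sup>1,2,4</sup>, Patricia J. Kiley<sup>1,2</sup>, Donna M. Bates<sup>1</sup>\* and Robert Landick<sup>1,3,7</sup>\**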

*<sup>1</sup> Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA*

*<sup>2</sup> Department of Biomolecular Chemistry, University of Wisconsin-Madison, Madison, WI, USA*

*<sup>3</sup> Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA*

*<sup>4</sup> Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA*

*<sup>5</sup> Pacific Northwest National Laboratory, Richland, WA, USA*

*<sup>6</sup> Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA*

*<sup>7</sup> Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Carl James Yeoman, Montana State University, USA Helen Zgurskaya, University of Oklahoma, USA*

#### *\*Correspondence:*

*Donna M. Bates, Great Lakes Bioenergy Research Center, 4119 Wisconsin Energy Institute, University of Wisconsin - Madison, 1552 University Ave., Madison, WI 53726, USA e-mail: dbates@glbrc.wisc.edu; Robert Landick, Department of Biochemistry, 5441 Microbial Sciences, 1550 Linden Dr., University of Wisconsin, Madison, WI 53706, USA e-mail: landick@biochem.wisc.edu*

*†These authors have contributed equally to this work.*

Efficient microbial conversion of lignocellulosic hydrolysates to biofuels is a key barrier to the economically viable deployment of lignocellulosic biofuels. A chief contributor to this barrier is the impact on microbial processes and energy metabolism of lignocellulose-derived inhibitors, including phenolic carboxylates, phenolic amides (for ammonia-pretreated biomass), phenolic aldehydes, and furfurals. To understand the bacterial pathways induced by inhibitors present in ammonia-pretreated biomass hydrolysates, which are less well studied than acid-pretreated biomass hydrolysates, we developed and exploited synthetic mimics of ammonia-pretreated corn stover hydrolysate (ACSH). To determine regulatory responses to the inhibitors normally present in ACSH, we measured transcript and protein levels in an *Escherichia coli* ethanologen using RNA-seq and quantitative proteomics during fermentation to ethanol of synthetic hydrolysates containing or lacking the inhibitors. Our study identified four major regulators mediating these responses, the MarA/SoxS/Rob network, AaeR, FrmR, and YqhC. Induction of these regulons was correlated with a reduced rate of ethanol production, buildup of pyruvate, depletion of ATP and NAD(P)H, and an inhibition of xylose conversion. The aromatic aldehyde inhibitor 5-hydroxymethylfurfural appeared to be reduced to its alcohol form by the ethanologen during fermentation, whereas phenolic acid and amide inhibitors were not metabolized. Together, our findings establish that the major regulatory responses to lignocellulose-derived inhibitors are mediated by transcriptional rather than translational regulators, suggest that energy consumed for inhibitor efflux and detoxification may limit biofuel production, and identify a network of regulators for future synthetic biology efforts.

**Keywords:** *Escherichia coli***, lignocellulosic hydrolysate, aromatic inhibitors, transcriptomics, RNAseq, proteomics, ethanol, biofuels**

#### **INTRODUCTION**

Elucidation of metabolic and regulatory barriers in microbial conversion of lignocellulosic sugars to ethanol is crucial for both the immediate goal of economical cellulosic ethanol and for the long-term development of next-generation biofuels and sustainable chemicals from renewable biomass. Efficient conversion of lignocellulose (LC) hydrolysates is limited by multiple factors (Mills et al., 2009; Lau and Dale, 2010), including high osmolarity (Underwood et al., 2004; Purvis et al., 2005; Miller and Ingram, 2007), toxicity of the conversion products (Ingram and Buttke, 1984), and inhibitors of microbial metabolism and growth generated during the deconstruction of LC (Zaldivar et al., 1999; Wang et al., 2011a; Tang et al., submitted). Understanding and overcoming the barriers created by LC-derived inhibitors presents significant challenges as their composition can vary depending on the biomass source of LC, the methods used to deconstruct the LC, and the diverse metabolic and regulatory responses of microbes to inhibitors (Klinke et al., 2004; Liu, 2011). Synergy among the inhibitors, the high osmolarity inherent to hydrolysates, and toxicity of conversion products (e.g., ethanol) are additional factors that contribute to the complex molecular landscape of lignocellulosic hydrolysates (Klinke et al., 2004; Liu, 2011; Piotrowski et al., 2014).

Release of sugars from LC typically requires either acidic or alkaline treatment of biomass prior to or coupled with chemical or enzymatic hydrolysis (Chundawat et al., 2011). Acidic treatments generate significant microbial inhibitors by condensation reactions of sugars (e.g., furfural and 5-hydroxymethylfurfural). Microbes typically detoxify these aldehydes by reduction or oxidation to less toxic alcohols or acids (Booth et al., 2003; Herring and Blattner, 2004; Marx et al., 2004; Jarboe, 2011), but these conversions also directly or indirectly consume energy that otherwise would be available for biofuel synthesis (Miller et al., 2009a,b). The impact of these inhibitors is especially significant for C5 sugars like xylose, whose catabolism provides slightly less cellular energy (Lawford and Rousseau, 1995), and can be partially ameliorated by replacing NADPH-consuming enzymes with NADH-consuming enzymes (Wang et al., 2013).

Alkaline treatments, for instance with ammonia, are potentially advantageous in generating fewer toxic aldehydes, but the spectrum of inhibitors generated by alkaline treatments is less well characterized and their effects on microbial metabolism are less well understood. We have developed an approach to elucidate the metabolic and regulatory barriers to microbial conversion in LC hydrolysates using ammonia fiber expansion (AFEX) of corn stover, enzymatic hydrolysis, and a model ethanologen (GLBRCE1) engineered from the well-studied bacterium *E. coli* K-12 (Schwalbach et al., 2012). Our strategy is to compare anaerobic metabolic and regulatory responses of the ethanologen in authentic AFEX-pretreated corn stover hydrolysate (ACSH) to responses to synthetic hydrolysates (SynHs) designed to mimic ACSH with a chemically defined medium. GLBRCE1 metabolizes ACSH in exponential, transition, and stationary phases but, unlike growth in traditional rich media (Sezonov et al., 2007), GLBRCE1 enters stationary phase (ceases growth) long before depletion of available glucose but coincident with exhaustion of amino acid sources of organic nitrogen (Schwalbach et al., 2012). The growth-arrested cells remain metabolically active and convert the remaining glucose, but not xylose, into ethanol (Schwalbach et al., 2012).

Our first version of SynH (SynH1) matched ACSH for levels of glucose, xylose, amino acids, and some inorganics, overall osmolality, and the amino-acid-dependent growth arrest of GLBRCE1 (Schwalbach et al., 2012). However, gene expression profiling revealed that SynH1 cells experienced significant osmotic stress relative to ACSH cells, whereas ACSH cells exhibited elevated expression of efflux pumps, notably of *aaeAB* that acts on aromatic carboxylates (Van Dyk et al., 2004), relative to SynH1 cells (Schwalbach et al., 2012). Osmolytes found in ACSH (betaine, choline, and carnitine) likely explained the lower osmotic stress, whereas phenolic carboxylates derived from LC (e.g., coumarate and ferulate) likely explained efflux pump induction possibly *via* the AaeR and MarA/SoxS/Rob regulons known to be induced by phenolic carboxylates (Sulavik et al., 1995; Dalrymple and Swadling, 1997). We also observed elevated expression of *psp,* *ibp*, and *srl* genes associated with ethanol stress at ethanol concentrations three-fold lower than previously reported to induce expression (Yomano et al., 1998; Goodarzi et al., 2010) and thus consistent with a synergistic stress response with the LC-derived inhibitors. These findings led us to hypothesize that the collective effects of osmotic, ethanol, and LC-derived inhibitor stresses created an increased need for ATP and reducing equivalents that was partially offset in early growth phase by catabolism of amino acids, as N and possibly S sources. However, as these amino acids are depleted, cells transition to stationary phase where they continue to catabolize glucose for maintenance ATP and NAD(P)H but are unable to generate sufficient energy for cell growth or efficient xylose catabolism.

To test this hypothesis, we developed a new SynH formulation (SynH2) that faithfully replicates the physiological responses in ACSH and the effects of LC-derived inhibitors. Using SynH2 with and without the LC-derived inhibitors, we generated and analyzed metabolomic, gene expression, and proteomic data to define the effects of inhibitors on bacterial gene expression and physiology. The analysis allowed identification of key regulators that may provoke stress responses in the presence of LC-derived inhibitors and suggests that the coping mechanisms employed by *E. coli* to deal with lignocellulosic stress drain cellular energy, thus limiting xylose conversion.

# **MATERIALS AND METHODS**

#### **REAGENTS**

Reagents and chemicals were obtained from Thermo Fisher Scientific (Pittsburgh, Pennsylvania, USA) or Sigma Aldrich Co. (Saint Louis, Missouri, USA) with the following exceptions. 5-hydroxymethyl-2-furancarboxylic acid and 5-(hydroxymethyl)furfuryl alcohol were obtained from Toronto Research Chemicals Inc. (Toronto, Ontario, Canada). Deuterated compounds for HS-SPME-GC/IDMS were obtained from C/D/N Isotopes (Pointe-Claire, Quebec, Canada). D4-acetaldehyde and U13C6-fructose were obtained from Cambridge Isotope Labs (Andover, Massachusetts, USA).

#### **SYNTHESIS OF FERULOYL AND COUMAROYL AMIDES**

Twenty grams of ferulic or coumaric acid were dissolved in 200 ml of 100% ethanol in a 3-neck, 250 ml round-bottom flask equipped with a magnetic stir bar and a drying tube on one of the outside arms. Ten milliliters of acetyl chloride was added and the mixture was incubated with stirring at room temperature overnight. Ethanol was removed in a rotary evaporator at 40°C under modest vacuum; the syrup was re-dissolved in 250 ml 100% ethanol and re-evaporated twice. When the final syrup was reduced to <25 ml, ∼6 ml portions were transferred to heavy-wall 25 × 150 mm tubes containing ∼30 ml concentrated ammonium hydroxide and sealed with a Teflon-lined cap. The sealed tubes were incubated overnight at 95°C in a heating block covered with a safety shield. The tubes were cooled and then left open in a hood for 4–8 h to allow evaporation of ammonium hydroxide, during which the feruloyl or coumaroyl amide precipitated. The crystallized products were collected under vacuum on a glass filter and washed with 250 ml ice-cold 150 mM ammonium hydroxide. The product was allowed to air dry in a plastic weigh boat in the hood at room temperature for 2–3 days. Purity of the products was analyzed by silica gel TLC developed with 5% methanol in chloroform. Only preparations exceeding 90% purity were used for experiments.

#### **PREPARATION OF ACSH**

ACSH was prepared by one of two methods that differed in whether or not CS was autoclaved prior to enzymatic hydrolysis. Non-autoclaved CS hydrolysate more closely replicates an industrial process, was used by Tang et al. (submitted) for compositional analysis, and was used for some of our fermentation experiments. Autoclaved CS hydrolysate ensures sterility for bacterial fermentations and was used for our compositional analysis and for experiments to generate RNA-seq data. We did not observe a significant difference in GLBRCE1 behavior in non-autoclaved vs. autoclaved CS hydrolysates, although HMF was detectable in the former, but not the latter (**Table 2**). We observed minor variations in growth with CS harvested in different years. For autoclaved CS hydrolysate, AFEX-pretreated CS was mixed with water to 6–10 L final volume at 60 g glucan/L loading (18–22% solids, adjusted for moisture content) and autoclaved for 30–120 min in a 15 L Applikon bioreactor vessel (Schwalbach et al., 2012). For non-autoclaved CS hydrolysate, AFEX-pretreated CS was added to the vessel after the water was autoclaved for 30 min. For both, the sample was cooled to ∼70°C, adjusted to 10 L volume with water, and pH adjusted with ∼30 ml concentrated HCl. Hydrolysis was initiated by adding Novozymes CTec2 to 24 mg/g glucan and HTec2 to 6 mg/g glucan, followed by incubation for 5 days at 50°C with stir speed at 700 rpm. Some older batches of hydrolysate were prepared using Genencor Accellerase, Genencor Accellerase XY, and Multifect pectinase A in place of the Novozymes enzymes (Schwalbach et al., 2012). Solids were then removed by centrifugation (8200 × g, 4°C, 10–12 h) and the supernatant was filter-sterilized through 0.5 μm and then 0.2 μm filters. Prior to fermentation, the hydrolysate was adjusted to pH 7.0 using NaOH pellets and filtered again through a 0.2 μm filter to remove precipitates and to ensure sterility.
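As a rough worked example of the enzyme dosing described above, the following Python sketch converts a glucan loading and batch volume into total enzyme mass. The function name is hypothetical, and the calculation assumes that the stated doses (24 and 6 mg per g glucan) apply to the full 10 L batch at 60 g glucan/L; it is an arithmetic illustration, not a record of the amounts actually weighed out.

```python
# Illustrative arithmetic only; enzyme_mass_g is a hypothetical helper name.
def enzyme_mass_g(glucan_g_per_l, volume_l, dose_mg_per_g_glucan):
    """Return grams of enzyme needed for a batch at the given glucan loading."""
    total_glucan_g = glucan_g_per_l * volume_l
    return total_glucan_g * dose_mg_per_g_glucan / 1000.0

# Values quoted in the text: 60 g glucan/L, 10 L, CTec2 at 24 mg/g, HTec2 at 6 mg/g.
ctec2_g = enzyme_mass_g(60, 10, 24)
htec2_g = enzyme_mass_g(60, 10, 6)
print(f"CTec2: {ctec2_g:.1f} g, HTec2: {htec2_g:.1f} g")
```

Under these assumptions a 10 L batch contains 600 g glucan, which corresponds to 14.4 g of CTec2 and 3.6 g of HTec2.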

#### **PREPARATION OF SYNTHETIC HYDROLYSATE (SYNH2)**

SynH2 (**Table 1**) was prepared by combining the following ingredients per L final volume of SynH2. Water (∼700 ml) was mixed with 6.25 ml of 1.6 M KPO4 buffer, pH 7.2, 20 ml of 1.5 M ammonium sulfate, 20 ml of 2.25 M KCl, 1.25 M NaCl, 20 ml of a 50X amino acid stock giving the final concentrations shown in **Table 1** (except tyrosine), 20 ml of 8.75 mM tyrosine dissolved in 50 mM HCl, 50 ml of 1 mM each adenine, guanine, cytosine and uracil dissolved in 10 mM KOH, 10 ml of vitamin stock (1 mM each thiamine, calcium pantothenate, *p*-aminobenzoic acid, *p*-hydroxybenzoic acid, and 2,3-dihydroxybenzoic acid), 1 ml of a 1000X stock of micronutrients (ZnCl2, MnCl2, CuCl2, CoCl2, H3BO3, (NH4)6Mo7O24, and FeCl3) giving the final concentrations shown in **Table 1**, 1 ml of 1 M magnesium chloride, 1 ml of 90 mM CaCl2, 10 ml of 1 M sodium formate, 10 mM sodium nitrate, and 50 mM sodium succinate, 1 ml of 3 M glycerol, 1 ml of 500 mM lactic acid, 1 ml of 700 mM glycine betaine, 700 mM choline chloride, 200 mM DL-carnitine (osmolytes), 5.61 g acetamide, 2.71 g sodium acetate, 3.3 g sodium pyruvate, 2.94 g sodium citrate, 1.34 g DL-malic acid, 60 g D-glucose, 30 g D-xylose, 5.1 g D-arabinose, 1.48 g D-fructose, 1.15 g D-galactose, and 468 mg D-mannose. After adjusting to pH 7 with 10 N NaOH, the final volume was adjusted to 1 L. This base recipe corresponds to SynH2−. To create SynH2, the aromatic inhibitors were added as solids to the base recipe in the following quantities per L SynH2 and stirred until fully dissolved before filter sterilization: 531 mg feruloyl amide, 448 mg coumaroyl amide, 173 mg *p*-coumaric acid, 69 mg ferulic acid, 69 mg hydroxymethylfurfural, 59 mg benzoic acid, 15 mg syringic acid, 14 mg cinnamic acid, 15 mg vanillic acid, 2 mg caffeic acid, 20 mg vanillin, 30 mg syringaldehyde, 24 mg 4-hydroxybenzaldehyde, 3.4 mg 4-hydroxybenzophenone.
For some experiments (Figures S3, S4), feruloyl amide, coumaroyl amide, *p*-coumaric acid, ferulic acid, and hydroxymethylfurfural were added at up to twice these concentrations. The medium was filter-sterilized through a 0.2 μm filter.
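The recipe above gives stock volumes and solid masses rather than final concentrations. The Python sketch below shows the dilution arithmetic (C_final = C_stock × V_stock / V_final) for a few unambiguous single-component entries. The helper names are hypothetical, and the molar masses for D-glucose and D-xylose are standard values supplied here as assumptions rather than taken from the text.

```python
# Illustrative dilution arithmetic; helper names are hypothetical.
def final_mM_from_stock(stock_mM, stock_ml, final_volume_ml=1000.0):
    """C_final = C_stock * V_stock / V_final for a liquid stock."""
    return stock_mM * stock_ml / final_volume_ml

def final_mM_from_solid(mass_g, molar_mass_g_per_mol, final_volume_l=1.0):
    """Convert a mass added per batch into a millimolar concentration."""
    return 1000.0 * mass_g / molar_mass_g_per_mol / final_volume_l

# (stock concentration in mM, volume in ml per L of SynH2), as quoted in the recipe.
stocks = {
    "potassium phosphate": (1600.0, 6.25),
    "ammonium sulfate": (1500.0, 20.0),
    "magnesium chloride": (1000.0, 1.0),
    "CaCl2": (90.0, 1.0),
    "glycerol": (3000.0, 1.0),
    "lactic acid": (500.0, 1.0),
}
final = {name: final_mM_from_stock(c, v) for name, (c, v) in stocks.items()}

# Molar masses (g/mol) are assumptions: standard values, not given in the text.
final["D-glucose"] = final_mM_from_solid(60.0, 180.16)
final["D-xylose"] = final_mM_from_solid(30.0, 150.13)

for name, conc in final.items():
    print(f"{name}: {conc:.2f} mM")
```

Values computed this way can be cross-checked against the final concentrations listed in Table 1.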

#### **CHEMICAL ANALYSIS OF ACSH**

Carbohydrates, ethanol, and short chain acids in ACSH and fermentation media were quantified using HPLC-RID, NMR, and GC-MS as previously described (Schwalbach et al., 2012). ACSH osmolality was measured using a Vapro osmometer 5520 (Wescor Inc., Logan, Utah, USA). The synthetic hydrolysate medium used in these studies (SynH2) was based on a previously described synthetic hydrolysate medium (Schwalbach et al., 2012) that was modified to more closely approximate the composition of ACSH media, particularly with regard to the presence of alternative carbon sources and protective osmolytes. Concentrations of components in the modified SynH2 are described in Table S1.

#### **FERMENTATIVE GROWTH CONDITIONS**

Cell culture was conducted as described previously (Schwalbach et al., 2012), except that fermentations were carried out in 3 L bioreactors (Applikon Biotechnology) containing 2.45 L of ACSH or SynH media, and cultures were diluted into ACSH or SynH to an initial OD600 of 0.2, grown anaerobically overnight, and then inoculated into bioreactors to a starting OD600 of 0.2. Fermentation experiments to determine the effect of osmolytes were carried out in 0.5 L Sartorius BIOSTAT Qplus bioreactors (Sartorius Stedim Biotech, Bohemia, NY) containing 0.35 L of SynH2 media in the absence or presence of osmolytes or aromatic inhibitors. Culture density was measured using a Beckman Coulter DU720 in a 1 ml cuvette. Due to the high absorbance of ACSH at 600 nm, cells were diluted 1:10 in water prior to OD600 measurement, with diluted ACSH (1:10) as a blank. For SynH, diluted SynH (1:10) was used as a blank.
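The density bookkeeping implied above can be sketched as follows. Both function names are hypothetical, the overnight culture density used in the example (OD600 of 2.0) is an invented illustrative value, and the inoculum formula assumes that optical density mixes linearly with volume.

```python
# Illustrative sketch; function names and the example overnight OD are assumptions.
def corrected_od600(measured_od, dilution_factor=10):
    """Undo the 1:10 dilution applied before reading OD600."""
    return measured_od * dilution_factor

def inoculum_volume_l(target_od, medium_volume_l, culture_od):
    """Volume of overnight culture to add so the mixture starts at target_od.

    Solves culture_od * v / (medium_volume_l + v) = target_od for v.
    """
    if culture_od <= target_od:
        raise ValueError("overnight culture must be denser than the target OD")
    return target_od * medium_volume_l / (culture_od - target_od)

reading = 0.15                      # hypothetical reading of a 1:10 diluted sample
print(corrected_od600(reading))     # density of the undiluted culture
v = inoculum_volume_l(0.2, 2.45, 2.0)
print(f"{v * 1000:.0f} ml of overnight culture for 2.45 L of medium")
```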

#### **RNA-seq GENE EXPRESSION ANALYSES**

#### **Table 1 | Composition of ACSH, SynH1, SynH2−, and SynH2.**

*<sup>a</sup> ACSH data are from Schwalbach et al. (2012). Sugar concentrations are averages of HPLC-MS and NMR determinations.*

*<sup>b</sup> In the SynH2 recipe, D-Arabinose was substituted for the L-Arabinose present in ACSH to avoid AraC-mediated repression of xylose-utilization genes (Desai and Rao, 2010). In other contexts, use of L-Arabinose in SynH2 would be appropriate.*

*<sup>c</sup> –, not determined in ACSH or not added in SynH.*

*<sup>d</sup> n.d., not detectable by methods used.*

*<sup>e</sup> Aromatic compounds detected at less than 20 µM in ACSH are not reported here.*

*<sup>f</sup> The sets of acids, amides, and aldehydes used for supplemental studies in formulating SynH2 consisted of p-Coumaric acid, Ferulic acid, Benzoic acid, Syringic acid, Cinnamic acid, Vanillic acid, and Caffeic acid (acids), Feruloyl amide and Coumaroyl amide (amides), and HMF, Vanillin, Syringaldehyde, 4-Hydroxybenzaldehyde, and 4-Hydroxyacetophenone (aldehydes) at the concentrations listed for non-autoclaved ACSH or fractions thereof as described in the Supplemental Results.*

*<sup>g</sup> ACSH inhibitor concentrations for non-autoclaved CS hydrolysate are from Tang et al. (submitted). Hydrolysate preparations are described in Materials and Methods.*

Samples for RNA-seq were captured and RNA extracted as described previously (Schwalbach et al., 2012). FASTQ-formatted sequence files from strand-specific Illumina RNA-Seq reads were aligned to the GLBRCE1 reference genome using Bowtie version 0.12.7 (Langmead et al., 2009) with the "*--nofw*" strand-specific parameter and a maximal distance between the paired reads of 1000 bp. Nucleotide-level read quality information was used to weight each alignment at the subsequent probabilistic expression counting step using RNA-Seq by Expectation-Maximization (RSEM) version 1.2.4 (Li and Dewey, 2011). Posterior mean estimates of counts and FPKM values were used in the downstream analysis.
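For readers unfamiliar with the FPKM unit used in the RNA-seq analysis, the Python sketch below applies the textbook definition (fragments per kilobase of transcript per million mapped fragments). It is not a reimplementation of RSEM, which works from effective transcript lengths and posterior expected counts; the gene names, counts, and lengths are invented for illustration.

```python
# Textbook FPKM definition; gene names, counts, and lengths are made-up examples.
def fpkm(counts, lengths_bp):
    """Return FPKM per gene: count * 1e9 / (length_bp * total_mapped_fragments)."""
    total = sum(counts.values())
    return {gene: c * 1e9 / (lengths_bp[gene] * total) for gene, c in counts.items()}

counts = {"geneA": 500, "geneB": 1500}        # fragments assigned to each gene
lengths_bp = {"geneA": 1000, "geneB": 3000}   # transcript lengths in base pairs

values = fpkm(counts, lengths_bp)
for gene, value in values.items():
    print(gene, round(value))
```

In this toy case the two genes have equal FPKM because geneB collects three times as many fragments but is also three times as long, which is the length bias the unit is meant to remove.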

The program edgeR v.3.0.2 (Robinson et al., 2010) was used to compute differential expression following the procedures described in the package documentation, with median normalization used in all function calls in place of the default TMM procedure. We found that median normalization better adjusted for the particular biases within the dataset. Adjusted *p*-values for multiple hypothesis corrections were used as calculated by edgeR. Pairwise fold-changes and adjusted *p*-values were calculated between media types within each phase, and between phases within each media type.

**Table 2 | Growth, sugar uptake, and ethanol production by GLBRCE1 grown in ACSH, SynH2−, and SynH2<sup>a</sup>.**

*<sup>a</sup> Each value is from at least three biological replicates in different bioreactors.*

*<sup>b</sup> Exponential phase is between 4 and 12 h in all media. Unit for glucose uptake rate is mM·OD<sub>600</sub><sup>−1</sup>·h<sup>−1</sup>.*

*<sup>c</sup> Transition phase is between 12 and 30 h for SynH2−, and between 12 and 23 h for SynH2 and ACSH. Units for glucose and xylose uptake rate are mM·OD<sub>600</sub><sup>−1</sup>·h<sup>−1</sup>.*

*<sup>d</sup> Stationary phase when glucose is present (Glu-Stationary) is between 23 and 100 h for SynH2 and ACSH. However, there was no Glu-Stationary phase for SynH2− because it remained in transition phase until the glucose was gone.*

*<sup>e</sup> Stationary phase when glucose is gone (Xyl-Stationary) is between 47 and 78 h for SynH2−. The Xyl-Stationary rates for SynH2 and ACSH were measured in follow-up experiments carried out long enough to exhaust glucose in stationary phase.*

*<sup>f</sup> Calculated from the total ethanol produced and the total glucose and xylose consumed, assuming 2 ethanol per glucose and 1.67 ethanol per xylose.*
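The ethanol yield calculation described in footnote f of Table 2 can be sketched in Python as below. The stoichiometric factors (2 ethanol per glucose, 1.67 ethanol per xylose) are those stated in the footnote; the function name and the example concentrations are hypothetical and do not correspond to any measured value in the table.

```python
# Stoichiometric factors are from the Table 2 footnote; example inputs are invented.
ETHANOL_PER_GLUCOSE = 2.0
ETHANOL_PER_XYLOSE = 1.67

def percent_theoretical_yield(ethanol_mM, glucose_consumed_mM, xylose_consumed_mM):
    """Ethanol produced as a percentage of the stoichiometric maximum."""
    theoretical_mM = (ETHANOL_PER_GLUCOSE * glucose_consumed_mM
                      + ETHANOL_PER_XYLOSE * xylose_consumed_mM)
    return 100.0 * ethanol_mM / theoretical_mM

# Hypothetical fermentation: 300 mM glucose and 50 mM xylose consumed, 600 mM ethanol made.
example_yield = percent_theoretical_yield(600.0, 300.0, 50.0)
print(f"{example_yield:.1f}% of theoretical")
```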

To catalog the most significant effects, we examined the ratios using several different strategies. In addition to identifying the largest changes in expression of individual genes in SynH2 and ACSH relative to SynH2− (Table S2), we also used gene set enrichment analyses as described by Subramanian et al. (2005) and Varemo et al. (2013). We compiled gene sets for these analyses from pathways, transporters, and regulons documented in Ecocyc (Keseler et al., 2013) and KEGG.

#### **PROTEOMIC MEASUREMENTS**

Thirty-four *Escherichia coli* samples were processed for analysis by mass spectrometry at PNNL. Each sample was typically digested using a global urea digestion (Pasa-Tolic et al., 2004; Smyth, 2004) prior to isobaric labeling with an iTRAQ 4-plex labeling kit, following the manufacturer's directions (ABSciex, Redwood City, CA) (Ross et al., 2004; Bantscheff et al., 2008). Prior to high pH reverse phase fractionation with concatenated pooling (Wang et al., 2011b), the samples were desalted using C18 solid-phase extraction (SPE) (SUPELCO, Bellefonte, PA). All samples were processed with a custom LC system using reversed-phase C18 columns (unpublished variation of Maiolica et al., 2005) and the

Raw files were searched against a concatenated *Escherichia coli* K-12 database and contaminant database using MS-GF+ (v9018) with oxidation as a dynamic modification on methionine and 4-plex iTRAQ label as a static modification (Kim et al., 2008). The parent ion mass tolerance was set to 50 ppm. The resulting sequence identifications were filtered down to a 1% false discovery rate using target-decoy approach and MS-GF derived *q*-values. Reporter ion intensities were quantified using the tool MASIC (Monroe et al., 2008). Results were then processed with the MAC (Multiple Analysis Chain) pipeline, an internal tool which aggregates and filters data. Missing reporter ion channel results were retained. Degenerate peptides, i.e., peptides occurring in more than one protein, were filtered out. Proteins with one peptide detected were removed if they were not repeatable across at least two replicates. Redundant peptide identification reporter ions were summed across fractions and median central tendency normalization was applied to account for channel bias. Each 4-plex sample group was normalized using a pooled sample for comparison between groups. The final protein values were obtained by averaging their associated peptide intensity values and varied from ∼5000 to 350000. Finally, the protein values were then log2 transformed.
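The last steps of the quantification (channel normalization, peptide-to-protein roll-up, and log2 transformation) can be illustrated with the minimal Python sketch below. The MAC pipeline is an internal tool whose implementation is not described here, so this is only one plausible reading of "median central tendency normalization": each reporter channel is rescaled so that its median matches the mean of all channel medians. The channel labels, peptide names, and intensities are invented.

```python
import math
from statistics import mean, median

# One plausible reading of the normalization; not the MAC pipeline's actual code.
def median_normalize(channels):
    """Rescale each channel so its median equals the mean of all channel medians."""
    medians = {name: median(vals.values()) for name, vals in channels.items()}
    target = mean(medians.values())
    return {
        name: {pep: v * target / medians[name] for pep, v in vals.items()}
        for name, vals in channels.items()
    }

def protein_log2(peptide_intensities, peptide_to_protein):
    """Average peptide intensities per protein, then log2 transform."""
    grouped = {}
    for pep, value in peptide_intensities.items():
        grouped.setdefault(peptide_to_protein[pep], []).append(value)
    return {prot: math.log2(mean(vals)) for prot, vals in grouped.items()}

# Invented reporter-ion intensities for two iTRAQ channels.
channels = {
    "114": {"pep1": 8000.0, "pep2": 16000.0, "pep3": 32000.0},
    "115": {"pep1": 4000.0, "pep2": 8000.0, "pep3": 16000.0},
}
peptide_to_protein = {"pep1": "protA", "pep2": "protA", "pep3": "protB"}

normalized = median_normalize(channels)
proteins_114 = protein_log2(normalized["114"], peptide_to_protein)
print(proteins_114)
```

In this toy example channel 115 is uniformly half as intense as channel 114, so after normalization the two channels give identical values, which is the channel bias the step is meant to remove.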

All proteins that had missing values in their replicates were removed and the pair-wise protein expression level changes and significance *p*-values between the SynH2 and SynH2− cells at each growth phase were estimated using limma (Smyth, 2004; Smith, 2005), which fits a linear model across the replicates to calculate the fold changes, smooths the standard errors for significance and adjusts the *p*-values via the Benjamini-Hochberg method.
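The Benjamini-Hochberg adjustment mentioned above can be written compactly in Python. This is the standard step-up procedure applied to a plain list of *p*-values; limma performs the equivalent adjustment internally in R, and the function name and example *p*-values here are illustrative only.

```python
# Standard Benjamini-Hochberg step-up adjustment; the example p-values are invented.
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```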

#### **COMPARISON OF PROTEOMIC DATA TO TRANSCRIPTOMIC DATA**

Pair-wise RNA expression level changes and significance *p*-values were estimated using the edgeR package as previously discussed. The log2-fold-changes for the protein and RNA were z-score scaled separately to correct for the difference in dynamic ranges between the protein and RNA measurements.

Significantly discrepant protein/RNA ratios between SynH2 and SynH2− cells were identified using a two-sample *z*-test, and the corresponding *p*-values were adjusted for multiple comparisons using the Benjamini-Hochberg method. All protein/RNA ratios that are significant in either the RNA or protein ratio (*p* < 0.05) and that significantly disagree (*p* < 0.05) are tabulated in Table S7.
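A minimal Python sketch of the two ingredients named above, z-score scaling and a two-sample *z*-test, is given below. The exact test statistic and standard errors used for Table S7 are not spelled out in the text, so the form shown (difference of the two scaled fold-changes divided by the combined standard error, with a two-sided normal *p*-value) is an assumption, and all numbers are invented.

```python
import math
from statistics import mean, stdev

# Assumed form of the comparison; not necessarily the exact statistic used for Table S7.
def zscore(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]

def two_sample_z(x1, se1, x2, se2):
    """Return (z, two-sided p) for the difference between two estimates."""
    z = (x1 - x2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Invented log2 fold-changes, scaled separately for protein and RNA.
protein_scaled = zscore([0.2, 0.5, 0.8])
rna_scaled = zscore([1.0, 2.0, 3.0])
print(protein_scaled, rna_scaled)

# Invented scaled fold-changes and standard errors for one gene.
z, p = two_sample_z(1.0, 0.3, -1.0, 0.4)
print(f"z = {z:.2f}, p = {p:.2e}")
```

Scaling each data type separately puts both on a common unitless scale before they are compared, which is the stated purpose of the z-score step.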

#### **MEASUREMENT OF INTERNAL METABOLITE ABUNDANCES**

#### **PREPARATION OF INTRACELLULAR EXTRACTS**

Two ml of cell culture was rapidly removed from bioreactors with a 10 ml sterile syringe and cells were captured on Whatman 0.45 μm nylon syringe filters (GE Healthcare Bio-Sciences, Pittsburgh, Pennsylvania, USA) as described previously (Schwalbach et al., 2012). To reduce the background associated with metabolites present in ACSH and SynH, the cells on the filter were then rapidly washed with 5 ml of M9 medium (Neidhardt et al., 1974) lacking a carbon source. Acetonitrile-methanol-water (40:40:20; 2 ml) containing 0.1% formic acid was then applied to the filters, and the eluate captured in a 15 ml conical tube. The eluate was passed through the cells a second time to ensure complete cell lysis and then flash frozen in a dry ice/ethanol bath.

#### **DETECTION/QUANTIFICATION OF METABOLITES**

The concentrations of internal glycolytic and TCA cycle intermediates were determined using high performance anion exchange chromatography electrospray ionization tandem mass spectrometry (HPAEC-ESI-MS/MS). Reagents and non-labeled reference compounds were from Sigma Aldrich Co.

*HPAEC* was adapted from a previously reported method (Buescher et al., 2010), and was used for determination of pyruvate, citrate, α-ketoglutarate, glucose-6-phosphate, fructose-6-phosphate, fructose-1,6-bisphosphate, phospho(enol)pyruvate, and ATP. Chromatography was carried out on an Agilent 1200 series HPLC comprising a vacuum degasser, a binary pump, a heated column compartment, and a thermostated autosampler set to maintain 6°C. Mobile phase A was 0.5 mM NaOH and mobile phase B was 100 mM NaOH. Compounds were separated by gradient elution at 0.35 mL per minute starting at 10% B, increased to 15% B over 5 min and held at 15% B for 10 min, then increased to 100% B over 12 min and held for 10 min before returning to 10% B to be re-equilibrated for 5 min prior to the next injection. The column temperature was 40°C. The injection volume was 20 μL of intracellular extract or calibrant standard mixture.
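The HPAEC gradient program described above can be expressed as a piecewise-linear function of time, as in the Python sketch below. The breakpoints follow the times given in the text; treating the return from 100% B to 10% B as instantaneous is an assumption, and the names `GRADIENT` and `percent_b` are hypothetical.

```python
# (time in min, % mobile phase B); the step back to 10% B at 37 min is assumed instantaneous.
GRADIENT = [
    (0.0, 10.0),    # start
    (5.0, 15.0),    # ramp to 15% B over 5 min
    (15.0, 15.0),   # hold 10 min
    (27.0, 100.0),  # ramp to 100% B over 12 min
    (37.0, 100.0),  # hold 10 min
    (37.0, 10.0),   # return to 10% B
    (42.0, 10.0),   # re-equilibrate 5 min
]
FLOW_ML_PER_MIN = 0.35

def percent_b(t_min):
    """Linearly interpolate %B at time t_min within the programmed run."""
    if t_min < GRADIENT[0][0] or t_min > GRADIENT[-1][0]:
        raise ValueError("time outside the programmed run")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min < t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    return GRADIENT[-1][1]  # t_min equals the end of the run

run_time_min = GRADIENT[-1][0]
solvent_ml = FLOW_ML_PER_MIN * run_time_min
print(percent_b(2.5), percent_b(21.0), percent_b(40.0))
print(f"{run_time_min:.0f} min per injection, {solvent_ml:.1f} mL of mobile phase")
```

Under these assumptions each injection takes 42 min and consumes about 14.7 mL of mobile phase.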

#### **MEASUREMENT OF AROMATIC INHIBITORS IN ACSH AND SynH**

Samples of ACSH and SynH cultures were prepared by centrifugation as described previously (Schwalbach et al., 2012), and then were subjected to reverse phase HPLC high resolution/accurate mass spectrometry (RP-HPLC-HRAM MS) and headspace solid-phase microextraction gas chromatography-isotope dilution mass spectrometry (HS-SPME/IDMS) analysis.

The majority of phenolic compounds were determined by RP-HPLC-HRAM MS, which was carried out with a MicroAS autosampler (Thermo Scientific) equipped with a chilled sample tray and a Surveyor HPLC pump (Thermo Scientific) coupled to a Q-Exactive hybrid quadrupole/orbitrap mass spectrometer by electrospray ionization. The analytical column was an Ascentis Express column (150 × 2.1 mm × 2.7μm core-shell particles, Supelco, Bellefonte, PA) protected by a 5 mm C18 precolumn (Phenomenex, Torrance, CA). Mobile phase A was 10 mM formic acid adjusted to pH 3 with ammonium hydroxide and mobile phase B was methanol with 10 mM formic acid and the same volume of ammonium hydroxide as was added to mobile phase A. Compounds were separated by gradient elution. The initial composition was 95% A, which was held for 2 min after injection, then decreased to 40% A over the next 8 min, changed immediately to 5% A and held for 5 min, then changed back to 95% A for a column re-equilibration period of 7 min prior to the next injection. The flow rate was 0.3 mL/min.

The HPLC separation was coupled to the mass spectrometer via a heated electrospray (HESI) source (HESI II Probe, Thermo Scientific). The operating parameters of the source were: spray voltages: +3000, −2500; capillary temperature: 300°C; sheath gas flow: 20 units; auxiliary gas flow: 5 units; HESI probe heater: 300°C. Spectra were acquired with fast polarity switching to obtain positive and negative mode ionization chromatograms in a single analysis. In each mode, a full MS<sup>1</sup> scan was performed by the Orbitrap analyzer followed by a data dependent MS<sup>2</sup> scan of the most abundant ion in the MS<sup>1</sup> scan. The Q-Exactive parameters (both positive and negative modes) were: MS<sup>1</sup> range 85–500 Th, resolution: 17,500 (FWHM at 400 *m/z*), AGC target: 1e6, maximum ion accumulation time: 100 ms, S-lens level: 50. Settings for data dependent MS<sup>2</sup> scans were: isolation width: 1.8 Th, normalized collision energy: 50 units, resolution: 17,500, AGC target: 2e5, maximum ion accumulation time: 50 ms, underfill ratio: 1%, apex trigger: 5–12 s, isotope exclusion enabled, dynamic exclusion: 10 s.

HS-SPME/IDMS was used to quantify acetaldehyde, acetamide, furfural, furfuryl alcohol, 5-(hydroxymethyl)furfural (HMF), and bis(hydroxymethyl)furan ("HMF alcohol"/BHMF). Samples were thawed and briefly vortex mixed prior to measuring 500 microliters of sample, 500 microliters of stable isotope labeled internal standard mixture, and ∼300 mg of NaCl into a 20 mL screw-top headspace vial, which was quickly capped with a magnetic screw-top cap with a 4 mm PTFE-backed silicone rubber septum for SPME. Automated SPME sample processing and analysis was carried out using a Pegasus 4D GCxGC-TOF MS (Leco Corp., Saint Joseph, Michigan) with an Agilent 6890A gas chromatograph coupled to the ToF mass analyzer via a heated capillary transfer line, and a Gerstel-LEAP combi PAL autosampler and sample preparation system with Twister heated sample agitator fitted with an automated SPME holder containing a gray hub 50/30, 23 ga. Stabiliflex DVB/Carboxen/PDMS SPME fiber (Supelco, Inc.). ChromaTOF software (Leco Corp.) V. 4.50.8.0 was used for system control during acquisition and for data processing, calibration and calculation of final concentrations. Sample incubation temperature 95°C, agitation speed 100 rpm, during extraction time, 100 rpm, agitation on 4 s/off 15 s, sample extraction time (SPME fiber exposed to the sample headspace in heated agitator) 20 min, desorb time (SPME fiber inserted in hot GC inlet) 60 min. GC cycle time 40 min. Critical injector positions were determined empirically through trial, error, and careful measurement: vial penetration 11 mm, injector penetration 54 mm, injector penetration (needle) 40 mm. GC was carried out using a StabilWAX-DA column (Restek Corp, Bellefonte, Pennsylvania, USA) 0.25 mm ID × 30 m, *df* = 0.25 μm; carrier gas He, 1 mL/min; split 5:1; purge flow 3 mL/min; inlet temp 250°C; inlet liner type straight split/splitless deactivated glass 0.75 mm ID; equilibration time 1 min. Oven temperature program: initial temperature 30°C, hold 2 min; increase at 10°C/min to 250°C, hold 10 min; MS transfer line 250°C. ToF mass spectrometer (unit mass resolution): acquisition delay 85 s; start mass 10, end mass 500; acquisition 10 spectra/s; electron multiplier delta V 1475 (dependent on QC procedure); source temperature 200°C.
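As a quick consistency check on the oven temperature program, the Python sketch below computes its total run time. It assumes the ramp is read as a rate of 10°C/min from 30°C to 250°C; the function name is hypothetical.

```python
# Assumes the ramp is a rate of 10 degrees C per minute; helper name is hypothetical.
def oven_runtime_min(initial_c, initial_hold_min, ramp_c_per_min, final_c, final_hold_min):
    """Total time for a single-ramp GC oven program."""
    ramp_min = (final_c - initial_c) / ramp_c_per_min
    return initial_hold_min + ramp_min + final_hold_min

runtime = oven_runtime_min(30.0, 2.0, 10.0, 250.0, 10.0)
GC_CYCLE_MIN = 40.0  # GC cycle time quoted in the text
print(f"oven program: {runtime:.0f} min, leaving {GC_CYCLE_MIN - runtime:.0f} min of the cycle")
```

On this reading the oven program takes 34 min (2 min hold, 22 min ramp, 10 min hold), which fits within the stated 40 min GC cycle time.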

Quantification of organic acids in ACSH was carried out by HPAEC-MS/MS in a similar manner to that described for intracellular metabolites.

**Table 2 | Growth, sugar uptake, and ethanol production by GLBRCE1 grown in ACSH and SynH2−, and SynH2a.**


*aEach value is from at least three biological replicates in different bioreactors.*

*bExponential phase is between 4 and 12 h in all media. Unit for glucose uptake rate is mM*·*OD*−*<sup>1</sup> <sup>600</sup>*·*h*−*1.*

*cTransition phase is between 12 and 30 h for SynH2-, and between 12 and 23 h for SynH2 and ACSH. Units for glucose and xylose uptake rate are mM* ·*OD*−*<sup>1</sup> <sup>600</sup>*·*h*−*1.*

*dStationary phase when glucose is present (Glu-Stationary) is between 23 and 100 h for SynH2 and ACSH. However, there was no Glu-stationary phase for SynH2*− *because it remained in transition phase until the glucose was gone.*

*eStationary phase when glucose is gone (Xyl-Stationary) is between 47 and 78 h for SynH2*−*. The Xyl-Stationary rates for SynH2 and ACSH were measured in follow-up experiments carried out long enough to exhaust glucose in stationary phase.*

*fCalculated from the total ethanol produced and the total glucose and xylose consumed, assuming 2 ethanol per glucose and 1.67 ethanol per xylose.*

fold-changes and adjusted *p*-values are calculated between media types and within each phase and between phases within each media type.

To catalog the most significant effects, we examined the ratios using several different strategies. In addition to identifying the largest changes in expression of individual genes in SynH2 and ACSH relative to SynH2− (Table S2), we also used gene set enrichment analyses as described by Subramanian et al. (2005) and Varemo et al. (2013). We compiled gene sets for these analyses from pathways, transporters, and regulons documented in Ecocyc (Keseler et al., 2013) and KEGG.

#### **PROTEOMIC MEASUREMENTS**

Thirty-four *Escherichia coli* samples were processed for analysis by mass spectrometry at PNNL. Each sample was typically digested using a global urea digestion (Pasa-Tolic et al., 2004; Smyth, 2004) prior to isobaric labeling with an iTRAQ 4-plex labeling kit, following the manufacturer's directions (ABSciex, Redwood City, CA) (Ross et al., 2004; Bantscheff et al., 2008). Prior to high pH reverse phase fractionation with concatenated pooling (Wang et al., 2011b), the samples were desalted using C18 solid-phase extraction (SPE) (SUPELCO, Bellefonte, PA). All samples were processed with a custom LC system using reversed-phase C18 columns (unpublished variation of Maiolica et al., 2005) and the

Raw files were searched against a concatenated *Escherichia coli* K-12 database and contaminant database using MS-GF+ (v9018) with oxidation as a dynamic modification on methionine and 4-plex iTRAQ label as a static modification (Kim et al., 2008). The parent ion mass tolerance was set to 50 ppm. The resulting sequence identifications were filtered down to a 1% false discovery rate using a target-decoy approach and MS-GF derived *q*-values. Reporter ion intensities were quantified using the tool MASIC (Monroe et al., 2008). Results were then processed with the MAC (Multiple Analysis Chain) pipeline, an internal tool which aggregates and filters data. Missing reporter ion channel results were retained. Degenerate peptides, i.e., peptides occurring in more than one protein, were filtered out. Proteins with one peptide detected were removed if they were not repeatable across at least two replicates. Redundant peptide identification reporter ions were summed across fractions and median central tendency normalization was applied to account for channel bias. Each 4-plex sample group was normalized using a pooled sample for comparison between groups. The final protein values were obtained by averaging their associated peptide intensity values and varied from ∼5,000 to 350,000. Finally, the protein values were log2 transformed.
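The aggregation steps described above (summing reporter ions across fractions, median normalization of channels, averaging peptides per protein, and log2 transformation) can be sketched as follows. This is not the MAC pipeline, which is an internal tool. The record layout and function name are hypothetical, all intensities are assumed to be positive, and the pooled-sample normalization between 4-plex groups is omitted.

```python
import math
from collections import defaultdict
from statistics import mean, median


def protein_log2_values(records):
    """records: iterable of (protein, peptide, fraction, [i1, i2, i3, i4]).

    Returns {protein: [log2 value per iTRAQ channel]}.
    """
    # 1. Sum reporter ion intensities for each peptide across fractions.
    summed = defaultdict(lambda: [0.0] * 4)
    for protein, peptide, _fraction, intensities in records:
        acc = summed[(protein, peptide)]
        for channel, value in enumerate(intensities):
            acc[channel] += value

    # 2. Median central tendency normalization (one possible form):
    #    rescale each channel so all channels share the same median.
    channel_medians = [median(v[ch] for v in summed.values()) for ch in range(4)]
    target = mean(channel_medians)
    factors = [target / m for m in channel_medians]

    # 3. Average normalized peptide intensities per protein, then log2.
    per_protein = defaultdict(list)
    for (protein, _peptide), values in summed.items():
        per_protein[protein].append([v * f for v, f in zip(values, factors)])
    return {
        protein: [math.log2(mean(column)) for column in zip(*rows)]
        for protein, rows in per_protein.items()
    }
```

In this sketch a channel that is uniformly twice as intense as the others is scaled back to the common median, which is the kind of channel bias the normalization step is meant to remove.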

All proteins that had missing values in their replicates were removed and the pair-wise protein expression level changes and significance *p*-values between the SynH2 and SynH2− cells at each growth phase were estimated using limma (Smyth, 2004; Smith, 2005), which fits a linear model across the replicates to calculate the fold changes, smooths the standard errors for significance and adjusts the *p*-values via the Benjamini-Hochberg method.
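The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This illustrates only the multiple-testing step. It does not reproduce limma's linear model or its moderated standard errors, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A protein would then be called significant when its adjusted value falls below the chosen threshold (for example 0.05).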

#### **COMPARISON OF PROTEOMIC DATA TO TRANSCRIPTOMIC DATA**

Pair-wise RNA expression level changes and significance *p*-values were estimated using the edgeR package as previously discussed. The log2-fold-changes for the Protein and RNA were z-score scaled separately to correct for the difference in dynamic ranges between the protein and RNA measurements.

Significantly discrepant Protein/RNA ratios between SynH2 and SynH2− cells were identified using a two-sample *z*-test, and the corresponding *p*-values were adjusted for multiple comparisons using the Benjamini-Hochberg method. All Protein/RNA ratios that are significant in either the RNA or the protein ratio (*p* < 0.05) and that significantly disagree (*p* < 0.05) are tabulated in Table S7.
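A minimal sketch of the scaling and discrepancy test is given below. The text does not state how the variances entering the *z*-test were obtained, so this example assumes that each z-scaled value has unit variance and that the protein and RNA measurements are independent; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def z_scale(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def discrepancy_p_values(protein_lfc, rna_lfc):
    """Two-sided p-values for protein/RNA disagreement, gene by gene."""
    z_protein = z_scale(protein_lfc)
    z_rna = z_scale(rna_lfc)
    p_values = []
    for zp, zr in zip(z_protein, z_rna):
        # Assumption: the difference of two independent unit-variance
        # values has variance 2.
        z = (zp - zr) / math.sqrt(2.0)
        # Two-sided tail probability of the standard normal distribution.
        p_values.append(math.erfc(abs(z) / math.sqrt(2.0)))
    return p_values
```

The resulting *p*-values would then be adjusted for multiple comparisons before applying the 0.05 threshold.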

#### **MEASUREMENT OF INTERNAL METABOLITE ABUNDANCES**

#### **PREPARATION OF INTRACELLULAR EXTRACTS**

Two ml of cell culture was rapidly removed from bioreactors with a 10 ml sterile syringe and cells captured on Whatman 0.45 µm nylon syringe filters (GE Healthcare Bio-Sciences, Pittsburgh, Pennsylvania, USA) as described previously (Schwalbach et al., 2012). To reduce the background associated with metabolites present in ACSH and SynH, the cells on the filter were then rapidly washed with 5 ml of M9 medium (Neidhardt et al., 1974) lacking a

cells to cease growth before glucose was consumed, decreased the rate of ethanol production, and to a lesser extent decreased final amounts of ethanol produced.

#### **GLBRCE1 GENE EXPRESSION PATTERNS ARE SIMILAR IN SynH2 AND ACSH**

To test the similarity of SynH2 to ACSH and the extent to which LC-derived inhibitors impact ethanologenesis, we next used RNA-seq to compare gene expression patterns of GLBRCE1 grown in the two media relative to cells grown in SynH2− (Materials and Methods; **Table 1**). We computed normalized gene expression ratios of ACSH cells vs. SynH2− cells and SynH2 cells vs. SynH2− cells, and then plotted these ratios against each other using log10 scales for exponential phase (**Figure 2A**), transition phase (**Figure 2B**), and stationary phase (**Figure 2C**). For simplicity, we refer to these comparisons as the SynH2 and ACSH ratios. The SynH2 and ACSH ratios were highly correlated in all three phases of growth, although the correlations were lower in transition and stationary phases (Pearson's *r* of 0.84, 0.66, and 0.44 in exponential, transition, and stationary, respectively, for genes whose SynH2 and ACSH expression ratios both had corrected *p* < 0.05; *n* = 390, 832, and 1030, respectively). Thus, SynH2 is a reasonable mimic of ACSH.
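The correlation analysis can be sketched as follows. The example keeps only genes for which both ratios have a corrected *p* < 0.05 and computes Pearson's *r* on log10-transformed ratios. Whether the reported *r* values were computed on the log scale is an assumption made here because the ratios were plotted on log10 axes; the tuple layout and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def correlation_of_significant_ratios(genes, alpha=0.05):
    """genes: iterable of (synh2_ratio, synh2_padj, acsh_ratio, acsh_padj).

    Returns (r, n), where n is the number of genes passing the filter.
    """
    xs, ys = [], []
    for synh2_ratio, synh2_p, acsh_ratio, acsh_p in genes:
        if synh2_p < alpha and acsh_p < alpha:
            xs.append(math.log10(synh2_ratio))
            ys.append(math.log10(acsh_ratio))
    return pearson_r(xs, ys), len(xs)
```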

We used these data to investigate the gene expression differences between SynH2 and ACSH (Table S3). Several differences likely reflected the absence of some trace carbon sources in SynH2 (e.g., sorbitol, mannitol), the presence of others in SynH2 at higher concentrations than found in ACSH (e.g., citrate and malate), and the intentional substitution of D-arabinose for L-arabinose. Elevated expression of genes for biosynthesis or transport of some amino acids and cofactors confirmed or suggested that SynH2 contained somewhat higher levels of Trp, Asn, thiamine and possibly lower levels of biotin and Cu<sup>2+</sup> (Table S3). Although these discrepancies point to minor or intentional differences that can be used to refine the SynH recipe further, overall we conclude that SynH2 can be used to investigate physiology, regulation, and biofuel synthesis in microbes in a chemically defined, and thus reproducible, medium to accurately predict behaviors of cells in real hydrolysates like ACSH that are derived from ammonia-pretreated biomass.

#### **AROMATIC ALDEHYDES IN SynH2 ARE CONVERTED TO ALCOHOLS, BUT PHENOLIC CARBOXYLATES AND AMIDES ARE NOT METABOLIZED**

Before evaluating how patterns of gene expression informed the physiology of GLBRCE1 in SynH2, we first determined the profiles of inhibitors, end-products, and intracellular metabolites during ethanologenesis. The most abundant aldehyde inhibitor, HMF, quickly disappeared below the limit of detection as the cells entered transition phase with concomitant and approximately stoichiometric appearance of the product of HMF reduction, 2,5-bis-HMF (hydroxymethylfurfuryl alcohol; **Figure 3A**, Table S8). Hydroxymethylfuroic acid did not appear during the fermentation, suggesting that HMF is principally reduced by aldehyde reductases such as YqhD and DkgA, as previously reported for HMF and furfural generated from acid-pretreated biomass (Miller et al., 2009a, 2010; Wang et al., 2013). In contrast, the concentrations of ferulic acid, coumaric acid, feruloyl amide, and coumaroyl amide did not change appreciably over the course

**FIGURE 2 | Relative gene expression patterns in SynH2 and ACSH cells relative to SynH2− cells.** Scatter plots were prepared with the ACSH/SynH2− gene expression ratios plotted on the y-axis and the SynH2/SynH2− ratios on the x-axis (both on a log10 scale). GLBRCE1 was cultured in a bioreactor anaerobically (**Figure 1** and Figure S5); RNAs were prepared from exponential **(A)**, transition **(B)**, or stationary **(C)** phase cells and subjected to RNA-seq analysis (Materials and Methods). Dark gray dots represent genes for which *p* ≤ 0.05 for each expression ratio. Sets of genes with related functions that exhibited significant discrepant or parallel changes are color-coded and described in the legend at the top (see also Tables S3, S4, respectively).

**FIGURE 3 | Growth phase-dependent changes in SynH2 aromatic inhibitor levels.** GLBRCE1 was cultured under anaerobic conditions in SynH2 in bioreactors. Levels of the major LC-derived inhibitors in the culture medium were determined as described in Materials and Methods. "Hydrolysate" refers to medium immediately prior to inoculation; "Exp," "Trans," and "Stat" refer to samples collected during exponential, transition, and stationary phase growth, respectively. **(A)** Metabolic fate of hydroxymethylfurfural (HMF). Concentrations of HMF and 2,5-bis-HMF (2,5-bis-hydroxymethylfurfuryl alcohol) are represented. **(B)** Metabolic fates of the major aromatic acids and amides. Concentrations of ferulic acid, feruloyl amide, coumaric acid, and coumaroyl amide are shown. **(C)** Concentration of acetaldehyde in the culture medium when GLBRCE1 was grown in SynH2, SynH2−, or SynH2 with aromatic aldehydes only omitted.

of the experiment (**Figure 3B**, Table S8), suggesting that *E. coli* either does not encode activities for detoxification of phenolic carboxylates and amides, or that expression of such activities is not induced in SynH2.

Although HMF disappeared early in fermentation, acetaldehyde accumulated to >10 mM during exponential and transition phase in both SynH2 and ACSH (**Figure 3C**, Table S8). Elevated acetaldehyde relative to SynH2− was also observed upon omission of aromatic aldehydes from SynH2, demonstrating that LC-derived phenolic acids and amides alone can cause accumulation of acetaldehyde (**Figure 3C**). Thus, acetaldehyde accumulation was not simply a consequence of diverting reducing equivalents to detoxification of the aromatic aldehydes like HMF but likely resulted from a broader impact of LC-derived inhibitors on cellular energetics that decreased the pools of NADH available for conversion of acetaldehyde to ethanol.

#### **LIGNOCELLULOSE-DERIVED INHIBITORS NEGATIVELY IMPACT CARBON AND ENERGY METABOLISM, RESULTING IN ACCUMULATION OF PYRUVATE AND ACETALDEHYDE**

Examination of intracellular metabolites revealed that aromatic inhibitors decreased the levels of metabolites associated with glycolysis and the TCA cycle (**Figures 4B,E**; Table S1). Strikingly, metabolites associated with cellular energetics and redox state were also decreased in SynH2 cells relative to SynH2− cells (**Figures 4A,C,D,F**; Table S1). ATP was reduced by 30%; the NADH/NAD+ ratio decreased by 63%; and the NADPH/NADP+ ratio decreased by 56%. Together, these data indicate that the aromatic inhibitors dramatically decreased cellular energy pools and available reducing equivalents in SynH2 cells. The consequences of energetic depletion were readily apparent with an approximate 100-fold increase in the intracellular levels of pyruvate in SynH2 cells (to ∼14 mM), despite the disappearance of pyruvate from the growth medium (Table S1, **Figure 4B**, and data not shown). The increase in pyruvate and correspondingly in acetaldehyde (**Figures 3C**, **4B**) suggests that the reduced rate of glucose-to-ethanol conversion caused by aromatic inhibitors results from inadequate supplies of NADH to convert acetaldehyde to ethanol.

Transition-phase SynH2 vs. SynH2− cells exhibited similar trends in aromatic-inhibitor-dependent depletion of some glycolytic intermediates, some TCA intermediates, and ATP, along with elevation of pyruvate and acetaldehyde (Table S1; **Figure 3C**). Stationary phase cells displayed several differences, however. Glycolytic intermediates (glucose 6-phosphate, fructose 6-phosphate, fructose 1,6 diphosphate, and 2-, 3-phosphoglycerate) were approximately equivalent in SynH2 and SynH2− cells, whereas pyruvate concentrations dropped significantly (Table S1). The impact of the inhibitors was largely attributable to the phenolic carboxylate and amides alone, as removal of the aldehydes from SynH2 changed neither the depletion of glycolytic and TCA intermediates nor the elevation of pyruvate and acetaldehyde (data not shown). We conclude that phenolic carboxylates and amides in SynH2 and ACSH have major negative impacts on the rate at which cells grow and consequently can convert glucose to ethanol.

#### **AROMATIC INHIBITORS INDUCE GENE EXPRESSION CHANGES REFLECTING ENERGY STRESS**

Given the major impacts of aromatic inhibitors on ethanologenesis, we next sought to address how these inhibitors impacted gene expression and regulation in *E. coli* growing in SynH2.

To that end, we first identified pathways, transporters, and regulons with similar relative expression patterns in SynH2 and ACSH using both conventional gene set enrichment analysis and custom comparisons of aggregated gene expression ratios (Materials and Methods). These comparisons yielded a curated set of regulons, pathways, and transporters whose expression changed significantly in SynH2 or ACSH relative to SynH2− (aggregate *p* < 0.05; Table S4).

For many key pathways, transporters, and regulons, similar trends were seen in both SynH2 and ACSH vs. SynH2− (**Figure 2** and Table S4). The most upregulated gene sets reflected key impacts of aromatic inhibitors on cellular energetics. Anabolic processes requiring a high NADPH/NADP+ potential were significantly upregulated (e.g., sulfur assimilation and cysteine biosynthesis, glutathione biosynthesis, and ribonucleotide reduction). Additionally, genes encoding efflux of drugs and aromatic carboxylates (e.g., *aaeA*) and regulons encoding efflux functions (e.g., the *rob* regulon), were elevated. Curiously, both transport and metabolism of xylose were downregulated in all three growth phases in both media, suggesting that even prior to glucose depletion aromatic inhibitors reduce expression of xylose genes and thus the potential for xylose conversion. Currently the mechanism of this repression is unclear, but it presumably reflects either an indirect impact of altered energy metabolism or an interaction of one or more of the aromatic inhibitors with a regulator that decreases xylose gene expression.

During transition phase, a different set of genes involved in nitrogen assimilation were upregulated in SynH2 cells and ACSH cells relative to SynH2− cells (Table S5). Previously, we found that transition phase corresponded to depletion of amino acid nitrogen sources (e.g., Glu and Gln; Schwalbach et al., 2012). Thus, this pattern of aromatic-inhibitor-induced increase in the expression of nitrogen assimilation genes during transition phase suggests that the reduced energy supply caused by the inhibitors increased difficulty of ATP-dependent assimilation of ammonia. Interestingly, the impact on gene expression appeared to occur earlier in ACSH than in SynH2, which may suggest that availability of organic nitrogen is even more growth limiting in ACSH.

Of particular interest were the patterns of changes in gene expression related to the detoxification pathways for the aromatic inhibitors. Our gene expression analysis revealed inhibitor induction of genes encoding aldehyde detoxification pathways (*frmA*, *frmB*, *dkgA*, and *yqhD*) that presumably target LC-derived aromatic aldehydes (e.g., HMF and vanillin) and acetaldehyde that accumulates when NADH-dependent reduction to ethanol becomes inefficient (Herring and Blattner, 2004; Gonzalez et al., 2006; Miller et al., 2009b, 2010; Wang et al., 2013) as well as efflux pumps controlled by MarA/SoxS/Rob (e.g., *acrA* and *acrB*) and the separate system for aromatic carboxylates (*aaeA* and *aaeB*) (Van Dyk et al., 2004). Interestingly, we observed that expression of the aldehyde detoxification genes *frmA*, *frmB*, *dkgA*, and *yqhD* paralleled the levels of LC-derived aromatic aldehydes and acetaldehyde detected in the media (**Figure 3**). Initially high-level expression was observed in SynH2 cells, which decreased as the aldehydes were inactivated (**Figure 5A**). Conversely, expression of these genes increased in SynH2− cells, surpassing the levels in SynH2 cells in stationary phase when the level of acetaldehyde in the SynH2− culture spiked past that in the SynH2 culture. The elevation of *frmA* and *frmB* is particularly noteworthy as the only reported substrate for FrmAB is formaldehyde. We speculate that this system, which has not been extensively studied in *E. coli*, may also act on acetaldehyde. Alternatively, formaldehyde, which we did not assay, may have accumulated in parallel to acetaldehyde.

In contrast to the decrease in *frmA*, *frmB*, *dkgA*, and *yqhD* expression as SynH2 cells entered stationary phase, expression of *aaeA*, *aaeB*, *acrA*, and *acrB* remained high (**Figure 5B**). This continued high-level expression is consistent with the persistence of phenolic carboxylates and amides in the SynH2 culture (**Figure 3**), and presumably reflects the futile cycle of antiporter excretion of these inhibitors to compete with constant leakage back into cells.

#### **POST-TRANSCRIPTIONAL EFFECTS OF AROMATIC INHIBITORS WERE LIMITED PRIMARILY TO STATIONARY PHASE**

We next investigated the extent to which the aromatic inhibitors could exert effects on cellular regulation post-transcriptionally rather than *via* transcriptional regulators by comparing inhibitor-induced changes in protein levels to changes in RNA levels. For this purpose, we used iTRAQ quantitative proteomics to assess changes in protein levels (Materials and Methods). We then normalized the log2-fold-changes in protein levels in each of the three growth phases to changes in RNA levels determined by RNA-seq and plotted the normalized values against each other (**Figures 6A–C**; Tables S6, S7). Most proteome and transcriptome fold-changes fall within a factor of 2 of the diagonal, consistent with concordant changes in mRNA and protein and thus limited post-transcriptional effects of aromatic inhibitors. A small number of RNA-protein pairs exhibited a >2-fold change with *p* < 0.05. During exponential phase, four proteins were present at elevated levels relative to changes in RNA levels, which actually decreased (RpoS, TnaA, MalE, and GlnH; red circles, **Figure 6A**; Table S7A), whereas 26 RNAs increased or decreased significantly with little difference in protein levels (blue circles, **Figure 6A**; Table S7A). These disparate increases in RNA levels included some of the major transcriptional responses to the inhibitors (S assimilation and the FrmA aldehyde detoxification pathway), and these proteins were present at high levels both with and without inhibitors (Table S7D). Several observations led us to conclude that these discrepancies in protein and RNA levels between SynH2− and SynH2 cells reflect induction of expression in SynH2 cells but carryover of elevated protein levels in the inoculum of SynH2− cells not yet diluted in exponential phase. First, we sampled exponential phase between one and two cell doublings so that proteins elevated in stationary phase in the inoculum might still be present. Second, FrmRAB and S assimilation genes are elevated in stationary SynH2− cells relative to SynH2 cells (Table S7C), likely reflecting the greater accumulation of acetaldehyde in SynH2− cells in stationary phase (**Figure 3C**). Finally, RpoS and TnaA are markers of stationary phase (Lacour and Landini, 2004) and may reflect elevation of these proteins in SynH stationary cells carried over from the inoculum. In a similar

**gene expression.** Changes in RNA levels for genes that comprise the major regulatory response to aromatic inhibitors in SynH2. Shown are normalized RNA-seq measurements (top panel) from GLBRCE1 grown in SynH2 (solid

exponential, transition, and stationary phases of growth as indicated. **(A)** Aldehyde detoxification genes (*frmA, frmB, dkgA*, and *yqhC*). **(B)** Genes that encode efflux pumps (*aaeA, aaeB, acrA, acrB*).

vein, the apparent overrepresentation of PyrBI, GadABC, and MetEF proteins in SynH2 cells could reflect their greater abundance in stationary phase SynH2 cells that were carried over to early exponential phase.

Supporting this view, transition phase cells in which the inoculum was diluted *>*5-fold exhibited a higher correlation between protein and RNA levels and only limited evidence of post-transcriptional regulation caused by the aromatic inhibitors (**Figure 6B**). Three clusters of outliers reflected (i) reduced transcript levels for S assimilation genes in SynH2− without a corresponding drop in protein level (*cys* genes), (ii) higher levels of *glnAGHLQ* transcripts in SynH2 cells than SynH2− cells with high protein levels in both, and (iii) high induction of transcripts for the citrate assimilation system (*citDEFX*) in SynH2 with lesser induction of protein levels. These effects likely reflect adjustment of S assimilation gene expression during transition phase, a greater induction of N assimilation in the more rapidly growing SynH2 cells, and induction of citrate assimilation by the aromatic inhibitors.

The clearest evidence for post-transcriptional regulation caused by the aromatic inhibitors appeared in stationary phase (**Figure 6C**). A set of proteins involved in arginine, glutamate, lysine and citrate biosynthesis (ArgABCGI, GdhA, LysC, GltA) and periplasmic proteins involved in arginine high-affinity import (ArtJ), histidine high-affinity import (HisJ), molybdate import (ModA), and lysozyme inhibition (PliG) decreased dramatically in SynH2 cells relative to SynH2− cells without corresponding reductions of their transcripts. GdhA, other biosynthetic enzymes, and other periplasmic binding proteins are degraded by the ClpP protease during C or N starvation (Maurizi and Rasulova, 2002; Weichart et al., 2003); Lon protease also has been implicated in proteolysis upon C starvation (Luo et al., 2008). Thus, we suggest that aromatic inhibitors may enhance degradation of proteins involved in N and C metabolism in stationary phase cells. The periplasmic proteins must either be degraded as precursors or their loss must be mediated by an additional effect involving periplasmic proteases.
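The color categories used to classify RNA-protein pairs in Figure 6 can be expressed as a simple decision rule, sketched below. The thresholds follow the figure legend (*p* ≤ 0.05). The tie-breaking rule that assigns a significant discrepancy to whichever of protein or RNA changed more, and the treatment of pairs without a significant discrepancy as parallel, are assumptions made for illustration; the function name is hypothetical.

```python
def figure6_category(rna_lfc, rna_p, protein_lfc, protein_p,
                     discrepancy_p, alpha=0.05):
    """Assign one of the Figure 6 color categories to an RNA-protein pair."""
    rna_sig = rna_p <= alpha
    protein_sig = protein_p <= alpha
    if not rna_sig and not protein_sig:
        return "gray"
    if discrepancy_p <= alpha:
        # Assumption: attribute the discrepancy to the larger change.
        return "red" if abs(protein_lfc) > abs(rna_lfc) else "blue"
    if rna_sig and protein_sig:
        # Assumption: no significant discrepancy means parallel effects.
        return "green"
    return "light blue" if rna_sig else "light pink"
```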

#### **DISCUSSION**

Results of our investigation into the effects of LC-derived inhibitors on *E. coli* ethanologenesis support several key conclusions that will guide future work. First, a chemically defined mimic of ACSH (SynH2) that contained the major inhibitors found by chemical analysis of ACSH adequately replicated both growth and the rates of glucose and xylose conversion to ethanol by *E. coli*. SynH2-replication of ACSH required inclusion of osmolytes found in ACSH and established that, at the ratios present in ACSH, phenolic carboxylates and amides, which are not metabolized by *E. coli*, had a greater overall impact on cell growth than phenolic aldehydes and furfurals, which were metabolized. In both SynH2 and ACSH, *E. coli* entered a metabolically active stationary phase as cells exhausted organic sources of N and S (e.g., amino acids) and during which the inhibitors greatly reduced xylose conversion. The impact of inhibitors on cellular energetics reduced levels of ATP, NADH, and NADPH and was seen most dramatically for energetically challenging processes requiring NADPH (like SO<sub>4</sub><sup>2−</sup> assimilation and deoxyribonucleotide production), during transition to the stationary phase

**FIGURE 6 | Effects of aromatic inhibitors on protein levels compared to effects on cognate RNA levels.** Scatter plot comparing log2-fold RNA ratios (x-axis) to log2-fold protein ratios (y-axis) of GLBRCE1 genes and gene products for cells grown in SynH2 compared to the reference medium, SynH2−. Cells were collected and proteomic samples prepared from exponential **(A)**, transition **(B)**, and stationary **(C)** growth phases. The lines indicate boundaries beyond which changes exceed 2-fold. The dotted lines demarcate the area expected for parallel changes in protein and RNA levels. *Red*, genes for which changes in protein levels were not paralleled by changes in the corresponding RNA and for which the discrepancy had a *p* ≤ 0.05 (see Table S7). *Blue*, genes for which changes in RNA levels were not paralleled by changes in the corresponding protein and for which the discrepancy had a *p* ≤ 0.05. *Gray*, *p* > 0.05 for both RNA and protein ratios. *Light blue*, *p* ≤ 0.05 for RNA ratio but not for protein ratio. *Light pink*, *p* ≤ 0.05 for protein ratio but not for RNA ratio. *Green*, *p* ≤ 0.05 for both RNA and protein ratios and effects are parallel.

on ATP-dependent NH3 assimilation, and in elevated pyruvate levels presumably reflecting reduced NADH-dependent flux of pyruvate to ethanol (**Figure 7**). The direct effects of the inhibitors on cells appear to be principally mediated by transcriptional rather than translational regulators, with the MarA/SoxS/Rob network, AaeR, FrmR, and YqhC being the most prominent players. Although the effect of the inhibitors on transcriptional regulation of the efflux pumps was striking, increased efflux activity itself may perturb cellular metabolism. For example, Dhamdhere and Zgurskaya (2010) have shown that deletion of the AcrAB-TolC complex results in metabolic shutdown and high NADH/NAD+ ratios. By analogy, overexpression of efflux pumps may have the opposite effect (e.g., lowering of NADH/NAD+ ratios), which is consistent with observations in this study. In addition, recent work suggests that the *acrAB* promoter is upregulated in response to certain cellular metabolites (including those related to cysteine and purine biosynthesis), which are normally effluxed by this pump (Ruiz and Levy, 2014). Therefore, upregulation of AcrAB-TolC may impact homeostatic mechanisms of cellular biosynthetic pathways, resulting in continuous upregulation of pathways that require large amounts of reducing power in the form of NADPH. It is also possible that LC-derived inhibitors perturb metabolism directly in ways that generate additional AcrAB-TolC substrates, potentially increasing energy-consuming efflux further. Given these intricacies, further studies to unravel the mechanistic details of the effects of efflux pump activity on cellular metabolism, as a result of exposure to LC-derived inhibitors, are warranted.

The inability of cells to convert xylose in the presence of inhibitors appears to result from a combination of both effects on gene expression and some additional effect on transport or metabolism. The inhibitors lowered xylose gene expression (XylR regulon; *xylABFGH*) by a factor of 3–5 during all three growth phases (Table S4). This effect was not caused by the previously documented AraC repression (Desai and Rao, 2010), since it persisted in SynH2 when we replaced the AraC effector L-arabinose with D-arabinose, but might reflect lower levels of cAMP caused by the inhibitors (**Figure 4**); both the *xylAB* and *xylFGH* operons are also regulated by CRP·cAMP. Nonetheless, significant levels of XylA, B, and F were detected in the presence of inhibitors (Table S7D), even though xylose conversion remained inhibited even after glucose depletion (**Table 2**). Thus, the inability to convert xylose may also reflect either the overall impact of inhibitors on cellular energetics somehow making xylose conversion unfavorable or an effect on xylose transport or metabolism that remains to be discovered. Further studies of the impact of inhibitors on xylose transport and metabolism are warranted. It would be particularly interesting to test SynH formulations designed to compare the conversion efficiencies of xylose, arabinose, and C6 sugars other than glucose.

The central focus of this study was to understand the impact of inhibitors on gene expression regulatory networks. The apparent lack of involvement of post-transcriptional regulation suggests that *E. coli* mounts a defense against LC-derived inhibitors principally by controlling gene transcription, probably reflecting evolution of specific bacterial responses to LC-derived inhibitors. Although enteric bacteria do not ordinarily encounter industrial lignocellulosic hydrolysates, they likely encounter the same suite of compounds from digested plant material in the mammalian gut. Thus, evolution of specific responses is reasonable. A key question for future studies is whether phenolic amides, not ordinarily present in digested biomass, will also invoke these responses in the absence of carboxylates or aldehydes. We note that the apparent absence of a translational regulatory response in the cellular defense against LC-derived inhibitors does not preclude involvement of either direct or indirect post-transcriptional regulation in fine-tuning the response. Our proteomic measurements would likely not have detected fine-tuning. Additionally, we did detect an apparently indirect induction by inhibitors of protein degradation in stationary phase, possibly in response to C starvation (**Figure 6C**). Finally, we note that the sRNA micF, a known post-transcriptional regulator, is a constituent of the MarA/SoxS/Rob regulon and was upregulated by inhibitors. Although statistical confidence was low due to poor detection of sRNAs in RNA-seq data, the induction of micF was confirmed in a separate study of sRNAs (Ong and Landick, in preparation). Thus, a more focused study of the involvement of sRNAs in responses to LC inhibitors would likely be informative.

MarA/SoxS/Rob is a complex regulon consisting of the three inter-connected primary AraC-class regulators that bind as monomers to 20-bp sites in promoters with highly overlapping specificity and synergistically regulate ∼50 genes implicated in resistance to multiple antibiotics and xenobiotics, solvent tolerance, outer membrane permeability, DNA repair, and other functions (Chubiz et al., 2012; Duval and Lister, 2013; Garcia-Bernardo and Dunlop, 2013) (**Figure 7**). Twenty-three genes, including those encoding the AcrAB·TolC efflux pump, the NfsAB nitroreductases, the micF sRNA, superoxide dismutase, some metabolic enzymes (e.g., Zwf, AcnA, and FumC) and incompletely characterized stress proteins are controlled by all three regulators, whereas other genes are annotated as being controlled by only a subset of the regulators (Duval and Lister, 2013; www.ecocyc.org; Keseler et al., 2013). MarA and SoxS lack the C-terminal dimerization domain of AraC; this domain is present on Rob and appears to mediate regulation by aggregation that can be reversed by effectors (Griffith et al., 2009). Inputs capable of inducing these genes, either through the MarR and SoxR repressors that control MarA and SoxS, respectively, or by direct effects on Rob include phenolic carboxylates, Cu<sup>2+</sup>, a variety of organic oxidants, dipyridyl, decanoate, bile salts, Fis, and Crp·cAMP (Martin and Rosner, 1997; Rosner et al., 2002; Rosenberg et al., 2003; Chubiz and Rao, 2010; Duval and Lister, 2013; Hao et al., 2014) (**Figure 7**). Given these diverse inputs, it seems highly likely that ferulate and coumarate in ACSH induce the MarA/SoxS/Rob regulon via MarR. Indeed, LC-hydrolysate and ferulate induction of MarA has been reported (Lee et al., 2012). Interestingly, Cu<sup>2+</sup> recently was shown to induce MarR by oxidation to create MarR disulfide dimer (Hao et al., 2014). Given the elevated levels of Cu<sup>2+</sup> in ACSH reflected by induction of Cu<sup>2+</sup> efflux (**Figure 2**; Table S4), induction of MarA/SoxS/Rob in ACSH may result from synergistic effects of Cu<sup>2+</sup> and phenolic carboxylates, oxidants that affect SoxR, and yet-to-be-determined compounds that affect Rob. A second response to LC-derived inhibitors appears to be mounted by the LysR-type regulator AaeR, which controls the AaeAB aromatic carboxylate efflux system (Van Dyk et al., 2004) (**Figure 7**). Both phenolic and aryl carboxylates induce AaeAB through AaeR, but little is known about its substrate specificity or mechanism of activation.

regulators and signaling interactions that mediate the regulatory responses.

Two distinct regulators, YqhC and FrmR, control synthesis of the YqhD/DkgA NADPH-dependent aldehyde reductases and the FrmAB formaldehyde oxidase, respectively (Herring and Blattner, 2004; Turner et al., 2011). Even less is known about these regulators, although the DNA-binding properties of YqhC have been determined. In particular, it is unclear how aldehydes cause induction, although the current evidence suggests effects on YqhC are likely to be indirect. Given the central role of the regulators AaeR, YqhC, and FrmR in the cellular response to LC-derived inhibitors, further study of their properties and mechanisms is likely to be profitable. With sufficient understanding and engineering, they could be used as response regulators to engineer cells that respond to LC-inhibitors in ways that maximize microbial conversion of sugars to biofuels.


What types of responses would optimize biofuel synthesis? It appears the naturally evolved responses, namely induction of efflux systems and NADPH-dependent detoxification pathways, may not be optimal for efficient synthesis of biofuels. We infer this conclusion for several reasons. First, our gene expression results reveal that crucial pathways for cellular biosynthesis that are among the most energetically challenging processes in cells (S assimilation, N assimilation, and ribonucleotide reduction) are highly induced by LC-derived inhibitors (**Figures 2**, **7**; Table S4). A reasonable conjecture is that the diversion of energy pools, including NADPH and ATP, to detoxification makes S assimilation, N assimilation, and ribonucleotide reduction difficult, increasing expression of genes for these pathways indirectly. The continued presence of the phenolic carboxylates and amides (**Figure 3**) likely causes futile cycles of efflux. As both the AcrAB and AaeAB efflux pumps function as proton antiporters (**Figure 7**), continuous efflux is expected to decrease ATP synthesis by depleting the proton-motive force. Although this response makes sense evolutionarily because it protects DNA from damage by xenobiotics, it does not necessarily aid conversion of sugars to biofuels. Disabling these efflux and detoxification systems, especially during stationary phase when cell growth is no longer necessary, could improve rates of ethanologenesis. Indeed, Ingram and colleagues have shown that disabling the NADPH-dependent YqhD/DkgA enzymes or, better yet, replacing them with NADH-dependent aldehyde reductases (e.g., FucO) can improve ethanologenesis in furfural-containing hydrolysates of acid-pretreated biomass (Wang et al., 2011a, 2013). That simply deleting *yqhD* improves ethanologenesis argues that, in at least some cases, it is better to expose cells to LC-derived inhibitors than to spend energy detoxifying the inhibitors.

Some previous efforts to engineer cells for improved biofuel synthesis have focused on overexpression of selected efflux pumps to reduce the toxic effects of biofuel products (Dunlop et al., 2011). Although this strategy may help cells cope with the effects of biofuel products, our results suggest an added potential issue when dealing with real hydrolysates, namely that efflux pumps may also reduce biofuel production rates through futile cycling of LC-derived inhibitors. Thus, effective use of efflux pumps will require careful control of their synthesis (Harrison and Dunlop, 2012). An alternative strategy to cope with LC-derived inhibitors may be to devise metabolic routes to assimilate them into cellular metabolism.

In conclusion, our findings illustrate the utility of using chemically defined mimics of biomass hydrolysates for genome-scale study of microbial biofuel synthesis as a strategy to identify barriers to biofuel synthesis. By identifying the main inhibitors present in ammonia-pretreated biomass hydrolysate and using these inhibitors in a synthetic hydrolysate, we were able to identify the key regulators responsible for the cellular responses that reduced the rate of ethanol production and limited xylose conversion to ethanol. Knowledge of these regulators will enable design of new control circuits to improve microbial biofuel production.

#### **ACKNOWLEDGMENTS**

The authors thank Trey Sato and Jeff Piotrowski for critical reading of the manuscript, Fachuang Lu and John Ralph for advice on synthesis of feruloyl and coumaroyl amide, and Christa Pennacchio and colleagues at the Joint Genome Institute for cDNA library preparation and sequencing. This work was funded by the DOE Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-FC02-07ER64494). Portions of this research were enabled by the DOE GSP under the Pan-omics project. Work was performed in the Environmental Molecular Science Laboratory, a U.S. Department of Energy (DOE) national scientific user facility at Pacific Northwest National Laboratory (PNNL) in Richland, WA. Battelle operates PNNL for the DOE under contract DE-AC05-76RLO01830.

# **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2014.00402/abstract

# **REFERENCES**


activator and preventing degradation of Rob by Lon protease. *J. Mol. Biol.* 388, 415–430. doi: 10.1016/j.jmb.2009.03.023


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 15 April 2014; accepted: 17 July 2014; published online: 13 August 2014. Citation: Keating DH, Zhang Y, Ong IM, McIlwain S, Morales EH, Grass JA, Tremaine M, Bothfeld W, Higbee A, Ulbrich A, Balloon AJ, Westphall MS, Aldrich J, Lipton MS, Kim J, Moskvin OV, Bukhman YV, Coon JJ, Kiley PJ, Bates DM and Landick R (2014) Aromatic inhibitors derived from ammonia-pretreated lignocellulose hinder bacterial ethanologenesis by activating regulatory circuits controlling inhibitor efflux and detoxification. Front. Microbiol. 5:402. doi: 10.3389/fmicb.2014.00402*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Keating, Zhang, Ong, McIlwain, Morales, Grass, Tremaine, Bothfeld, Higbee, Ulbrich, Balloon, Westphall, Aldrich, Lipton, Kim, Moskvin, Bukhman, Coon, Kiley, Bates and Landick. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Genomic insights into the fungal lignocellulolytic system of *Myceliophthora thermophila*

#### *Anthi Karnaouri<sup>1,2†</sup>, Evangelos Topakas<sup>1†</sup>, Io Antonopoulou<sup>2</sup> and Paul Christakopoulos<sup>2</sup>\**

*<sup>1</sup> Biotechnology Laboratory, Department of Synthesis and Development of Industrial Processes, School of Chemical Engineering, National Technical University of Athens, Athens, Greece*

*<sup>2</sup> Biochemical Process Engineering, Chemical Engineering, Department of Civil, Environmental and Natural Resources Engineering, Luleå University of Technology, Luleå, Sweden*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Gianni Panagiotou, The University of Hong Kong, Hong Kong*

*Ulrika Rova, Luleå University of Technology, Sweden*

#### *\*Correspondence:*

*Paul Christakopoulos, Biochemical Process Engineering, Chemical Engineering, Department of Civil, Environmental and Natural Resources Engineering, Luleå University of Technology, C-Building, Universitetsområdet, Porsön, SE-97187 Luleå, Sweden*

*e-mail: paul.christakopoulos@ltu.se*

*†These authors have contributed equally to this work.*

The microbial conversion of solid cellulosic biomass to liquid biofuels may provide a renewable energy source for transportation fuels. Cellulolytic fungi represent a promising group of organisms, as they have evolved complex systems for adaptation to their natural habitat. The filamentous fungus *Myceliophthora thermophila* is an exceptionally powerful cellulolytic microorganism that synthesizes a complete set of enzymes necessary for the breakdown of the plant cell wall. The genome of this fungus has recently been sequenced and annotated, allowing systematic examination and identification of enzymes required for the degradation of lignocellulosic biomass. The genomic analysis revealed an expanded enzymatic repertoire including numerous cellulases, hemicellulases, and enzymes with auxiliary activities, covering most of the recognized CAZy families. Most of them were predicted to possess a secretion signal and to undergo post-translational glycosylation. These data offer a better understanding of the activities embedded in fungal lignocellulose decomposition mechanisms and suggest that *M. thermophila* could be developed as an industrial production host for cellulolytic and hemicellulolytic enzymes.

**Keywords:** *Myceliophthora thermophila***, plant biomass, lignocellulolytic enzymes, CAZy, biofuels**

### **INTRODUCTION**

Ethanol production from lignocellulosic biomass, composed primarily of cellulose and hemicellulose, is emerging as one of the most important technologies for sustainable development. Given its renewable nature, biomass is a potential raw material not only for the production of biofuels, but also of chemicals, energy and other materials of major industrial interest (Zhang et al., 2006). The monosaccharides contained in the cellulosic (glucose) and hemicellulosic fractions (xylose, arabinose, mannose, and galactose) represent substrates that can be used for ethanol production via fermentation. To initiate the degradation of these fractions, it is necessary to overcome the physical and chemical barriers presented by the cohesive combination of the main biomass components, which hinders the hydrolysis of cellulose and hemicellulose into fermentable sugars. These barriers include high substrate viscosity, poor mass transfer conditions and long reaction times, during which hydrolysis reactors are susceptible to contamination. Fungi are the main decomposers of lignocellulosic biomass in terrestrial ecosystems, and the enzymes they secrete to break down lignocellulose may be useful in industrial processes. Thermophilic fungi provide a potential source of plant cell wall degrading enzymes with higher levels of specific activity and better stability at higher temperatures, thus making it feasible to shorten the hydrolysis time and to reduce substrate viscosity and contamination levels (Margaritis and Merchant, 1986).

*Myceliophthora thermophila* (synonym *Sporotrichum thermophile*) is a thermophilic filamentous fungus, classified as an ascomycete, that was isolated from soil in eastern Russia. It is an exceptionally powerful cellulolytic organism that synthesizes a complete set of enzymes necessary for the breakdown of cellulose. The growth rate and cell density of this microorganism appear to be similar in media containing cellulose or glucose (Bhat and Maheshwari, 1987). The 38.7 Mbp genome of *M. thermophila*, comprising about 9500 genes organized in 7 chromosomes, has been sequenced and annotated (Joint Genome Institute, University of California, http://genome.jgi-psf.org; Berka et al., 2011). The annotation revealed a large number of genes putatively encoding industrially important enzymes, such as carbohydrate-active enzymes (CAZy), proteases, oxidoreductases, and lipases, while more than 200 sequences have been identified exclusively for plant cell-wall-degrading enzymes. These sequences encode a large number of glycoside hydrolases (GH) and polysaccharide lyases, covering most of the recognized families (**Table 1**). In addition, *M. thermophila* was developed into a proprietary mature enzyme production system with easy scaling (C1 strain; Visser et al., 2011). The main features of C1 are its high production levels (up to 100 g/L protein) and the low viscosity of the culture medium, which enables the fermentation process to reach very high densities.

*GHs, glycoside hydrolases; CEs, carbohydrate esterases; and PLs, polysaccharide lyases are included, covering most of the recognized families.*

*M. thermophila* exhibits an impressive number of accessory enzymes belonging to AA9 (previously described as GH61) and of family 1 carbohydrate binding modules (CBM), the highest numbers found in fungi (Berka et al., 2011). Family 1 CBM has a cellulose-binding function and is almost exclusively found in enzymes of fungal origin (http://www.cazy.org; Guillén et al., 2009). In addition, *M. thermophila* distinguishes itself from other cellulolytic fungi, such as *Aspergillus niger* and *Trichoderma reesei*, by the presence of a relatively high number of (glucurono)arabinoxylan degrading enzymes (Hinz et al., 2009). Eleven putative xylanases belonging to the GH10 and GH11 families were found, compared to five in both *A. niger* (Broad Institute of Harvard and MIT, http://www.broadinstitute.org) and *T. reesei* (Joint Genome Institute, University of California, http://genome.jgi-psf.org), while 14 arabinofuranosidases belonging to the GH43, GH51, and GH62 families were found, compared to 13 in *A. niger* and three in *T. reesei*, rendering *M. thermophila* a promising source of hemicellulolytic enzymes. The secretome of *M. thermophila* after 30 h of growth on barley and alfalfa straws was found to comprise 683 predicted proteins, 230 of which have unknown function (Berka et al., 2011). Based on transcriptome analysis, many secreted enzymes, including accessory enzymes, hypothetical proteins and proteins with unknown function, were upregulated when the fungus was grown on more complex substrates, such as agricultural straws, compared to glucose, indicating their crucial role in lignocellulose degradation (Berka et al., 2011).

*M. thermophila* grows at temperatures between 25 and 55°C, while a relative growth performance study on mycobroth agar plates indicated that the optimum is 45°C (Morgenstern et al., 2012). The temperature optima for several enzymes with the same specific activity, characterized from *M. thermophila*, range from 50 to 70°C. For example, *StEG5* endoglucanase, expressed in *A. niger*, exhibits a T<sub>opt</sub> of 70°C (Tambor et al., 2012), while recombinant *MtEG7* expressed in *Pichia pastoris* exhibited an optimal temperature of 60°C (Karnaouri et al., 2014). The same characteristic is also observed for *M. thermophila* xylanases expressed in *A. niger*, which show optimal activity at temperatures between 50 and 70°C (Berka et al., 2011), underpinning an enzymatic potential that is diverse not only in catalytic activities but also in properties that increase its efficiency at various temperatures.

Individual cellulolytic enzymes exhibit comparable activities on cellulose; however, synthetically composed multienzyme mixtures display a much higher performance than those from other lignocellulolytic thermostable fungi (Szijártó et al., 2011; Zhang et al., 2013). This can be attributed to a synergistic mode of action between the enzymes. For example, synergism has been demonstrated between a GH11 xylanase and a type C feruloyl esterase (Moukouli et al., 2010), as well as between cellobiohydrolases acting on the reducing and the non-reducing ends of cellulose molecules (Gusakov et al., 2007).

This review gives an overview of the cellulolytic and hemicellulolytic potential of *M. thermophila* regarding the degradation of plant cell wall material. The genomic potential of this thermophilic fungus demonstrates a strong enzymatic toolbox including hydrolytic, oxidative and accessory activities that may enhance its ability to decompose plant biomass. Many of these enzymes have been isolated from culture supernatant or selectively overexpressed in *M. thermophila* (C1 strain) or in other heterologous hosts and have been characterized. All sequences used in this study were extracted from the Genome Portal database (http://genome.jgi-psf.org) and the continually updated CAZy database (http://www.cazy.org/; Lombard et al., 2014). The conserved domains were found with Pfam/InterProScan (http://pfam.sanger.ac.uk/; Punta et al., 2012), while the theoretical molecular mass and isoelectric point for each protein were calculated using the ProtParam tool of ExPASy (http://web.expasy.org/protparam/). Post-translational glycosylation sites were predicted with the NetNGlyc 1.0 server (http://www.cbs.dtu.dk/services/NetNGlyc/) and the NetOGlyc 3.1 server (http://www.cbs.dtu.dk/services/NetOGlyc/). The predicted secretome was extracted using SignalP v4.0 (http://www.cbs.dtu.dk/services/SignalP/).
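To make the idea of a predicted *N*-glycosylation site concrete, the sketch below scans a protein sequence for the canonical N-X-S/T sequon (X being any residue except proline). This is only an illustrative first-pass filter and not the method used by NetNGlyc, which applies a trained predictor on top of sequon detection; the function name and the toy sequence are hypothetical.

```python
import re

# Canonical N-glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The pattern is wrapped in a lookahead so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_n_glyc_sequons(protein: str) -> list[tuple[int, str]]:
    """Return (1-based position of the Asn, tripeptide) for every N-X-S/T sequon."""
    seq = protein.upper()
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(seq)]


if __name__ == "__main__":
    toy_sequence = "MKNGSANPTLNNTSQ"  # hypothetical sequence, for illustration only
    for position, motif in find_n_glyc_sequons(toy_sequence):
        print(position, motif)
```

A sequon is necessary but not sufficient for glycosylation, so hits from such a scan are best read as candidate sites only.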

#### **CELLULOLYTIC SYSTEM**

Cellulose is composed of β-D-anhydroglucopyranose units linked by (1,4)-glycosidic bonds. Polymorphism or allotropy refers to the existence of more than one crystalline form differing in physical and chemical properties. Cellulose degradation is attributed to the synergistic action of three complementary enzyme activities: (1) endoglucanases (EGs, EC 3.2.1.4); (2) exoglucanases, including cellodextrinases (EC 3.2.1.74) and cellobiohydrolases (CBHs, EC 3.2.1.91 for the non-reducing end acting CBHs and EC 3.2.1.176 for the reducing end acting ones); and (3) β-glucosidases (BGs, EC 3.2.1.21) (Lynd et al., 2002). Amorphous regions of the polysaccharide chain are cleaved randomly by EGs, while CBHs processively remove cellooligosaccharides from chain ends. The latter are the most abundant enzymes in the secretome of cellulolytic fungi (Jun et al., 2011; Ribeiro et al., 2012). Their main representatives are GH family 7 enzymes (CBH I), which attack the reducing end of a cellulose chain, and GH family 6 enzymes (CBH II), which are specific toward the non-reducing end of the chain. Until very recently, CBHs were considered the main degraders of the crystalline part of cellulose (Sweeney and Xu, 2012).

EGs are widespread among GH families, with examples described for families 5–9, 12, 44, 45, 48, 51, and 74 in the continually updated CAZy database (http://www.cazy.org/; Lombard et al., 2014). Most of them show optimal activity at neutral or acidic pH and at temperatures below 50°C (Maheshwari et al., 2000). Exo-glucanases (or CBHs) act in a processive manner (Davies and Henrissat, 1995) and are classified into only two families, as noted above. One of the important features of all CBHs is that they can act on microcrystalline cellulose (Terri, 1997). BGs include enzymes of the GH1 and GH3 families that hydrolyze cellobiose and short (soluble) cellooligosaccharides to glucose, which can subsequently be fermented to ethanol; i.e., the hydrolysis reaction is performed in the liquid phase, rather than on the surface of the insoluble cellulose particles as with EGs and CBHs. The removal of cellobiose is an important step of the enzymatic hydrolysis process, as it reduces the inhibitory effect of cellobiose on EGs and CBHs. BG activity has often been found to be rate-limiting during enzymatic hydrolysis of cellulose (Duff and Murray, 1996; Tolan and Foody, 1999), and for this reason commercial cellulase preparations are often supplemented with BG activity.
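The inhibitory effect of cellobiose on EGs and CBHs can be illustrated with a simple competitive product-inhibition rate law. This is a sketch under the assumption that inhibition is competitive and that an apparent substrate concentration can be defined for insoluble cellulose; it is not a model taken from the studies cited above, and the symbols $v$, $V_{\max}$, $K_m$, $K_i$, $[S]$, and $[C]$ (cellobiose concentration) are introduced here only for illustration.

$$
v = \frac{V_{\max}\,[S]}{K_m\left(1 + \dfrac{[C]}{K_i}\right) + [S]}
$$

Under this assumption, accumulating cellobiose raises the apparent $K_m$ by the factor $1 + [C]/K_i$ and slows hydrolysis, whereas BG activity that keeps $[C]$ close to zero restores the uninhibited Michaelis–Menten form $v = V_{\max}[S]/(K_m + [S])$.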

Until recently, only hydrolytic enzymes were thought to play a role in the degradation of recalcitrant cellulose and hemicelluloses to fermentable sugars. Recent studies demonstrate that enzymes from GH family 61 show lytic polysaccharide monooxygenase (LPMO) activity and have an enhancing cellulolytic effect when combined with common cellulases (Horn et al., 2012). Together with cellobiose dehydrogenase (CDH; EC 1.1.99.18), they form an enzymatic system capable of oxidative cellulose cleavage, which increases the efficiency of cellulases and boosts the enzymatic conversion of lignocellulose. It had long been thought that the proteins of GH family 61 are accessory proteins enhancing cellulose decomposition. They were thus frequently referred to as the "cellulose enhancing factors" (Harris et al., 2010) and were previously thought to have no or only weak endoglucanase activity (Karlsson et al., 2001). These enzymes are now reclassified to the AA9 family of the CAZy database, and their mode of action provides a new dimension to the classical concept of cellulose degradation, as recently reviewed by Dimarogona et al. (2013). These copper-dependent enzymes were shown to cleave cellulose by an oxidative mechanism, provided that reduction equivalents from CDH or low molecular weight reducing agents (e.g., ascorbate) are available (Langston et al., 2011). In some genomes, AA9 genes even outnumber cellulase genes. It remains to be elucidated whether all of these encoded enzymes have LPMO activity, but their large number emphasizes the importance of oxidative cellulose cleavage. *M. thermophila'*s genome has 25 AA9 genes, encoding putative proteins acting as accessory LPMO enzymes (Berka et al., 2011). This number is outstanding in comparison to common lignocellulolytic organisms, such as *A. niger* (seven sequences) and *T. reesei* (nine sequences). This difference may explain the high efficiency of hydrolysis by *Myceliophthora* on natural substrates and points to the crucial role of these enzymes in the whole process.
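The gene-count comparisons quoted in this review can be collected in a small table and turned into fold differences. The sketch below uses only numbers stated in the text (AA9 counts from this section, xylanase and arabinofuranosidase counts from the Introduction); the dictionary layout and the function name are illustrative choices, not part of any cited analysis.

```python
# Gene counts per enzyme group as quoted in the text of this review.
GENE_COUNTS = {
    "AA9 LPMOs": {"M. thermophila": 25, "A. niger": 7, "T. reesei": 9},
    "GH10/GH11 xylanases": {"M. thermophila": 11, "A. niger": 5, "T. reesei": 5},
    "GH43/51/62 arabinofuranosidases": {"M. thermophila": 14, "A. niger": 13, "T. reesei": 3},
}


def fold_enrichment(counts: dict[str, int], reference: str = "M. thermophila") -> dict[str, float]:
    """Return the ratio of the reference species' gene count to each other species' count."""
    ref_count = counts[reference]
    return {species: ref_count / n for species, n in counts.items() if species != reference}


if __name__ == "__main__":
    for group, counts in GENE_COUNTS.items():
        ratios = fold_enrichment(counts)
        summary = ", ".join(f"{sp}: {ratio:.1f}x" for sp, ratio in ratios.items())
        print(f"{group} -> {summary}")
```

Such ratios only compare annotated gene numbers; they say nothing about expression levels or enzyme activity.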

Throughout the genome of *M. thermophila*, there are eight sequences encoding EG activity, seven sequences encoding CBH activity and nine sequences encoding BG activity (**Figure 1**; **Table 2**). The theoretical average molecular weight of the translated proteins is calculated at 51.05 ± 16.2 kDa and the theoretical pI at 5.58 ± 0.3. EGs are distributed among families GH5, 7, 12, and 45, all predicted to possess a secretion signal and several *N*- and *O*-glycosylation sites. Only two of them exhibit a CBM, which belongs to family 1. CBHs comprise three non-reducing-end acting enzymes of the GH6 family and four reducing-end acting enzymes of the GH7 family. All of these enzymes seem to be targeted to the secretion pathway and modified with glycans post-translationally. BGs are classified in the GH3 family, except for one GH1 sequence, and none of them exhibits a CBM, as expected. Four are secreted and have potential *N*- and *O*-glycosylation sites, showing the highest molecular weight compared to the other cellulases, with a theoretical average value of 85.18 ± 3.2 kDa.
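As an illustration of how a theoretical molecular weight and a "mean ± spread" summary of the kind quoted above can be obtained, the sketch below sums commonly tabulated average residue masses and adds one water molecule. It is an assumption-laden approximation of what tools such as ProtParam report, not their implementation: it ignores signal peptide cleavage and glycans, and it assumes that the ± values are sample standard deviations, which the text does not state. All function names are hypothetical.

```python
import statistics

# Average residue masses in Da (amino acid minus one water), as commonly tabulated.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added for the free N- and C-termini


def theoretical_mass_kda(protein: str) -> float:
    """Theoretical mass of an unmodified polypeptide in kDa (no glycans, no cleavage)."""
    total_da = sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER_MASS
    return total_da / 1000.0


def mean_and_sd(values: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation (assumed meaning of the '±' values)."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    toy_sequences = ["MKTAYIAKQR", "GAVLIPFWMS", "NDEQHKRSTY"]  # hypothetical peptides
    masses = [theoretical_mass_kda(s) for s in toy_sequences]
    mean, sd = mean_and_sd(masses)
    print(f"{mean:.2f} ± {sd:.2f} kDa")
```

Because secreted fungal enzymes are often heavily glycosylated, experimentally observed masses can exceed such sequence-based estimates.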

In total, 12 cellulases have been isolated and characterized (**Table 3**). The group of Bukhtojarov et al. (2004) investigated the properties of individual cellulases from the multienzyme complex produced by a mutant strain of *M. thermophila* C1 (Visser et al., 2011). Among EGs, the highest saccharification activity was displayed by EG60 and EG51, enzymes of 60 and 51 kDa with pI values of 3.6 and 5.0, respectively. It was later shown that EG51 and EG60 represent the GH5 and GH7 EGs from *M. thermophila*, respectively (Gusakov et al., 2011). A different EG (*St*Cel5A) displays a typical GH5 domain, exhibits optimal activity at pH 6.0 and 70°C, and retained more than 50% of its activity following 2 h of incubation at 55°C, diluted in 10 mM citrate buffer pH 4.5 (Tambor et al., 2012). A GH7 EG gene was functionally expressed in the methylotrophic yeast *P. pastoris* and subsequently characterized (Karnaouri et al., 2014). Substrate specificity analysis revealed that the enzyme is one of the most thermostable fungal enzymes reported to date and exhibits high activity on substrates containing β-1,4-glycosidic bonds as well as activity on xylan-containing substrates. Moreover, *Mt*EG7a was shown to liquefy pretreated wheat straw rapidly and efficiently, indicating the key role of EGs in the initial step of hydrolysis of high-solids lignocellulose substrates (Karnaouri et al., 2014). This change in the viscosity of these substrates is probably due to the gradual reduction of the average chain length of cellulose polysaccharides by endo-acting enzymes, such as endoglucanases.

In total, four CBHs and two BGs have been isolated from *M. thermophila* crude supernatant and studied. CBH IIb is the product of the MYCTH\_66729 gene; it is a GH6 family enzyme that attaches to the polysaccharide substrate through a CBM and exhibits high levels of activity in comparison to other CBHs (Gusakov et al., 2007). The same study reports the isolation of CBH Ib, a GH7 family enzyme (MYCTH\_2140736), which acts mainly against microcrystalline cellulose and CMC. Bukhtojarov et al. (2004) studied the properties of CBH Ia and CBH IIa, which are classified in the GH7 and GH6 families, respectively. CBH Ia is the product of the MYCTH\_109566 gene and seems to be expressed in two isoforms with distinct molecular weights, one comprising the catalytic domain and a CBM and the other only the catalytic domain and part of the linker, after proteolysis. This enzyme is produced as a major protein of the fungal secretome (20–25% of the total extracellular protein) and adsorbs strongly on microcrystalline cellulose. A significant synergism between CBH IIb and CBH Ia during substrate hydrolysis has been shown (Gusakov et al., 2007).

**Table 3 | Description of the characterized cellulolytic enzymes either isolated from the culture broth of a** *M. thermophila* **C1 mutant strain or expressed in a heterologous host.**

*Most of them exhibit optimum temperatures above 60°C and pH around 5.0.*

Recently, another type of specific activity was revealed. Xyloglucan-specific exo-β-1,4-glucanase (Xgl74A; EC 3.2.1.155) is classified in the GH74 family and catalyzes the hydrolysis of (1-4)-D-glucosidic linkages in xyloglucans, removing oligosaccharides from the chain end (Grishutin et al., 2004). Xyloglucan is a major structural polysaccharide found in the primary cell walls of higher plants that interacts with cellulose microfibrils via hydrogen bonds to form a structural network that is assumed to play a key role in cell wall integrity. It consists of a cellulose-like backbone of β-1,4-linked D-glucopyranose (D-Glcp) residues, most of which are substituted at C-6 with α-D-Xylp-(1→6) residues, to which other saccharides may be attached (most frequently, D-Galp and L-Fucp). *M. thermophila* was found to produce an exo-β-1,4-glucanase (Xgl74A) (Grishutin et al., 2004) with high specific activity toward tamarind xyloglucan, and very low or absent activity against carboxymethylcellulose (CMC) and barley β-glucan. Due to its unique substrate specificity, the enzyme was given a new number in the Enzyme Nomenclature (EC 3.2.1.155). Apart from Xgl74A, two out of the seven cellulases reported from *M. thermophila* (Cel12A and Cel45A) possess a notable activity against xyloglucan, together with their major activities toward CMC and barley β-glucan (Bukhtojarov et al., 2004).

# **HEMICELLULOLYTIC SYSTEM**

Hemicellulose polymers have a much more diverse structure than cellulose and consequently several enzymes are needed to completely degrade the polysaccharide into monosaccharides. Xylan, the major component of hemicellulose in the plant cell wall, consists of a β-D-(1,4)-linked xylopyranosyl backbone, which, depending on the origin, can be substituted with arabinofuranosyl, 4-*O*-methylglucopyranosyl, feruloyl and acetyl groups (Shibuya and Iwasaki, 1985). Feruloyl groups can form strong networks through peroxidase-catalyzed oxidative coupling, forming diferuloyl bridges (Topakas et al., 2007). The main enzymes needed for depolymerization are xylanases, assisted by accessory enzymes such as β-xylosidases and different arabinofuranosidases that make the xylan backbone more accessible (Sørensen et al., 2007). Other accessory enzymes that enhance xylan degradation are acetyl xylan esterases (Poutanen et al., 1990), ferulic acid esterases (Topakas et al., 2007), and α-glucuronidases (De Vries et al., 1998). *M. thermophila'*s hemicellulase genes are organized in 10 GH families (3, 10, 11, 30, 43, 51, 62, and 67) (**Figure 2**) and nine carbohydrate esterase (CE) families (1, 3, 4, 5, 8, 9, 12, 15, and 16) (**Figure 3**). Many of the encoded proteins have been isolated from the WT culture supernatant or expressed in heterologous hosts and characterized in terms of specific activity and physicochemical properties. The majority of them are predicted to follow the secretion pathway and to be modified with *N*- and/or *O*-glycans, comprising a total of 66 enzymes that act synergistically for the degradation of hemicellulose.

#### **XYLANASES/XYLOSIDASES**

The degradation of xylan requires the concerted action of a number of powerful enzymes with varying specific activities, including xylanases and β-xylosidases. Xylanases (endo-1,4-β-xylanases, EC 3.2.1.8) hydrolyze β-1,4-glycosidic linkages in the backbone of xylans, and most of them belong to GH family 10 or 11 based on amino acid similarities and structural features (Henrissat, 1991). GH10 xylanases exhibit less substrate specificity than GH11 enzymes and can hydrolyze different types of decorated xylans, while GH11 xylanases are highly specific and do not tolerate many decorations on the xylan backbone (Biely et al., 1997). β-Xylosidases (EC 3.2.1.37) hydrolyze the soluble xylo-oligosaccharides and xylobiose produced by the activity of xylanases from the non-reducing end, liberating xylose. These enzymes play an important role in xylan degradation by relieving the end product inhibition of endoxylanases (Knob et al., 2010). The genome of *M. thermophila* encodes a total of 12 xylanases with an endo- mode of activity, classified in GH10 and GH11, and four xylosidases, classified in the GH3 and GH43 families (**Table 4**). All of the aforementioned xylan-degrading translated sequences, apart from three, are predicted to carry a secretion signal. Xylanases possess 1–3 *N*-glycosylation and several *O*-glycosylation sites, whereas more *N*-sites are predicted for xylosidases, though not all of them are glycosylated during post-translational modification. GH family 30 contains two genes encoding xylanolytic enzymes with endo-exo activity and one sequence for a characterized xylobiohydrolase, releasing xylobiose units from the substrate (Emalfarb et al., 2012).

Ten xylanases have been purified and characterized from multienzyme preparations of *M. thermophila* modified strains (Ustinov et al., 2008; van Gool et al., 2013). Four of them, belonging to the GH10 family (**Table 5**), are the products of two genes, either carrying a family 1 CBM or displaying only the catalytic domain after partial proteolytic digestion (Ustinov et al., 2008). These enzymes, though classified in the same family, can hydrolyze different types of decorated xylans. They differ in their degradation of high- and low-substituted substrates, and the substitution pattern seems to be an important factor influencing their efficiency (van Gool et al., 2012). Six xylanases, belonging to the GH11 family, represent true xylanases, with high specific activities against glucuronoxylans and arabinoxylans. Four of these enzymes exhibit lower thermostability in comparison to GH10 xylanases, in which extended glycosylation has been noticed (Ustinov et al., 2008). One showed a substrate specificity pattern similar to GH10 enzymes and was secreted in two forms, with or without a CBM (van Gool et al., 2013).

# **ARABINOHYDROLASES**

L-arabinose is widely present in various hemicellulosic biomass components, such as arabinoxylan, where the main β-D-(1,4)-linked xylopyranosyl backbone is substituted with arabinose residues. α-L-arabinofuranosidases (AFase; EC 3.2.1.55) are enzymes that release arabinofuranose residues substituted at position *O*-2 or *O*-3 of mono- or di-substituted xylose residues (Gruppen et al., 1993). In addition, AFases act in synergism with other arabinohydrolases, the endo-(1,5)-α-L-arabinanases (ABNase; EC 3.2.1.99), for the decomposition of arabinan, a major pectin polysaccharide. Arabinan consists of a backbone of α-(1,5)-linked L-arabinofuranosyl residues, some of which are substituted with α-(1,2)- or α-(1,3)-linked arabinofuranosides (Weinstein and Albersheim, 1979). Degradation of the arabinan polymer to arabinose sugars is driven by the synergistic action of two major enzymes, AFases and ABNases (Kim, 2008). AFases specifically catalyze the hydrolysis of terminal non-reducing L-arabinofuranosyl residues from arabinan, while the resulting debranched backbone can be efficiently hydrolyzed by endo-acting ABNases, thus generating a variety of arabino-oligosaccharides with an inverting mode of action (Beldman et al., 1997). Throughout the CAZy families, arabinohydrolases belong to GH families 43, 51, 54, 62, and 93 (**Figure 3**).

**Table 5 | Description of the characterized** β**-1,4-xylanases isolated from the culture broth of a** *M. thermophila* **C1 mutant strain.**

*They display optimal activity at temperatures between 50 and 70°C, increasing the fungus's hydrolytic efficiency at various temperatures. Marked proteins (\*) were isolated in two different forms, with (high molecular weight enzyme) or without CBM (low molecular weight enzyme).*

The genome of *M. thermophila* encodes 14 enzymes that putatively release arabinose or arabinose oligomers from arabinan (Hinz et al., 2009). Eleven sequences contain a secretion signal peptide and are produced as extracellular or cell-bound proteins, and almost all of them exhibit an isoelectric point of around 4.6–5.6 (**Table 6**). Seven of them have been selectively overexpressed homologously in the *M. thermophila* C1 host and found to release arabinose from wheat arabinoxylan polymers and oligomers (Hinz et al., 2009). *M. thermophila* arabinofuranosidases are selective in releasing arabinose from either single or double substituted xylose residues in arabinoxylans. Eight enzymes, belonging to GH families 43, 51, 62, and 93 and showing different types of arabinolytic activity, have been purified and characterized (Hinz et al., 2009; Kühnel et al., 2011; Pouvreau et al., 2011b) (**Table 7**).

Abn7 and Abf3 are GH43 and GH51 arabinases, respectively, which were selectively produced in the C1 host. Abn7 was found to hydrolyze arabinofuranosyl residues at position *O*-3 of double substituted xylosyl residues in arabinoxylan-derived oligosaccharides, while Abf3 released arabinose from position *O*-2 or *O*-3 of single substituted xyloses. When these enzymes were incubated together with a GH10 endoxylanase for the hydrolysis of arabinoxylans, the combination produced a synergistic increase in arabinose release from the substrate (Pouvreau et al., 2011b). In addition, the α-L-arabinohydrolases Abn1, Abn2, and Abn4 were overexpressed in C1, and the resulting culture supernatant has been shown to produce neutral branched arabino-oligosaccharides from sugar beet arabinan by enzymatic degradation. As found by sugar analysis, the neutral arabino-oligosaccharides contained an α-(1,5)-linked backbone of L-arabinosyl residues and either carried single substituted α-(1,3)-linked L-arabinosyl residues or consisted of a double substituted α-(1,2,3,5)-linked arabinan structure within the molecule (Westphal et al., 2010). Abn4 belongs to the GH43 family, is more active toward branched polymeric arabinan, and releases arabinose monomers from single substituted arabinose residues, while Abn1 and Abn2 are active toward linear arabinan (Kühnel et al., 2010). Abn2 is a member of the GH93 family, which consists of exoarabinases acting on linear arabinan by hydrolyzing the α-1,5-linkages of arabinan polysaccharides present as side chains of pectin. Their mode of action was studied with Abn2, which binds two arabinose units at the subsites −1 and −2 and releases arabinose. Three more arabinohydrolases were also overexpressed in the C1 strain (Hinz et al., 2009). Abn5 was found to be specifically active toward arabinan, but not arabinoxylan. The arabinofuranosidases Abf1 and Abf2, members of the GH62 family, released *O*-2 or *O*-3 substituted arabinose or linked arabinofuranosyl from mono-substituted xylose. GH family 62 arabinofuranosidases are reported to be predominantly active toward arabinoxylan and are, therefore, also called arabinoxylan arabinofuranohydrolases (Beldman et al., 1997). Several of these enzymes contain either a CBM1, like Abf1, or a CBM43 (xylan)-binding domain.

#### **ESTERASES**

The role of esterases in the breakdown of lignocellulosic material is complex and includes the cleavage of bonds between the main hemicellulose chain and many types of side chains. Accordingly, closer examination of the *M. thermophila* genome sequences reveals a wide distribution of enzymatic activities across CE families. These enzymes are classified into nine families, and their main activities include, among others, the hydrolysis of feruloyl and acetyl ester bonds.

Feruloyl esterases (FAEs; EC 3.1.1.73) are enzymes responsible for cleaving the ester link between the polysaccharide main chain of xylans and monomeric or dimeric ferulates. They act synergistically with xylanases to release ferulic acid from cell-wall material and can be divided into four groups, namely A–D. The main differences among groups A–D are their substrate specificity toward synthetic substrates and their capability of liberating diferuloyl bridges (Crepin et al., 2004). One of the first FAEs reported from thermophilic fungi was produced from *M. thermophila* under solid-state fermentation (SSF) conditions. The esterase activity was isolated and partially characterized for its ability to release ferulic acid from a complex substrate, destarched wheat bran (Topakas et al., 2003). Two other FAEs, *St*FaeB, a protein with a molecular weight of 66 kDa (homodimers of 33 kDa) (Topakas et al., 2004), and *St*FaeC, 46 kDa (homodimers of 23 kDa) (Topakas et al., 2005), were purified to homogeneity from culture supernatants of *M. thermophila*. *St*FaeB hydrolyzed methyl *p*-coumarate, methyl caffeate and methyl ferulate and was active on substrates containing ferulic acid ester-linked at the C-5 and C-2 positions of arabinofuranose. *St*FaeC showed maximum catalytic efficiency on 4-hydroxy-3-methoxy cinnamate, a substrate with both hydroxyl and methoxy substituents, indicating that it may be the most promising type of FAE as a biocatalyst for the enzymatic feruloylation of aliphatic alcohols, oligo- and polysaccharides. Properties of the characterized FAEs are summarized in **Table 8**. Among the sequences registered in the Genome Portal, there are four sequences encoding proteins with FAE catalytic activity, all belonging to CE family 1. Two of them (MYCTH\_48379, MYCTH\_39279) seem to be identical to characterized FAEs secreted from the *M. thermophila* C1 strain (Kühnel et al., 2012). One sequence (JGI ID: 96478) has been heterologously expressed in *P. pastoris* and encoded a 39 kDa protein (*fae1A*; MtFae1a), which showed high activity toward methyl caffeate and *p*-coumarate and a strong preference for the hydrolysis of n-butyl and iso-butyl ferulate (Topakas et al., 2012). In addition, the MtFae1a esterase releases ferulic acid from destarched wheat bran only in synergy with an endo-xylanase (a maximum of 41% of total ferulic acid released after 1 h of incubation). The MYCTH\_2302953 sequence has not yet been characterized, but it shows 66% identity with a type B FAE from *Neurospora crassa* (CAC05587.1). All proteins encoded by the above sequences appear to be secreted and carry several *N*- and *O*-glycosylation sites, as shown in **Table 9**.

**Table 7 | Description of the characterized endoarabinases and arabinofuranosidases isolated from the culture broth of a** *M. thermophila* **C1 mutant strain.**

**Table 8 | Description of** *M. thermophila* **characterized esterases (FAEs, AcEs, and GEs).**

*\*All proteins are monomeric, while in the case of FaeB2, dimeric structures are detected after comparing the results of SDS-PAGE and native electrophoresis.*

About 60–70% of the xylose residues in hardwood xylan are acetylated at the C2 and/or C3 positions (Lindberg et al., 1973). The complete degradation of acetylated xylans by microbes requires the action of acetyl esterases (AcEs; EC 3.1.1.72), which cleave acetyl side groups from the heteroxylan backbone and act in synergy with other hemicellulases (Tenkanen et al., 1996). Eight sequences that encode proteins with AcE activity were detected in the genome of *M. thermophila* and showed identity with characterized enzymes. All of them are secreted, as predicted with SignalP, and belong to CE families 1, 3, 5, and 16 (**Table 9**). Two of them, Axe2 and Axe3, which are members of the CE5 and CE1 families, respectively, were isolated and characterized (Pouvreau et al., 2011a). The annotated genes encoding the putative enzymes were cloned into the specially designed *M. thermophila* C1 expression host (Verdoes et al., 2010) and over-produced in the culture medium. Axe2 and Axe3 are able to hydrolyze acetyl groups substituted at the *O*-2 and *O*-3 positions of acetylated xylo-oligosaccharides and complex insoluble polymeric substrates, with a preference for xylo-oligosaccharides (Pouvreau et al., 2011a).

Glucuronoyl esterases (GEs) are recently discovered enzymes that are suggested to play an important role in the dissociation of lignin from hemicellulose and cellulose by cleaving the ester bonds between the aromatic alcohols of lignin and the carboxyl groups of 4-*O*-methyl-D-glucuronic acid residues in glucuronoxylan (Špániková and Biely, 2006). Sequence alignment studies of these enzymes have revealed a novel conserved amino acid sequence, G-C-S-R-X-G, that features the characteristic serine residue involved in the mechanism of this esterase family. It has been shown that the mode of action probably involves a nucleophilic serine (Topakas et al., 2010). The genome of *M. thermophila* possesses two genes classified in family CE15 that encode proteins with 4-*O*-methyl-glucuronoyl esterase activity. Both putative enzymes are secreted and have potential glycosylation sites. The first GE (*St*GE1), isolated from the culture filtrate of *M. thermophila*, proved to be a thermophilic enzyme that carries a C-terminal CBM and is active on substrates containing glucuronic acid methyl ester (Vafiadi et al., 2009). Another CE15 protein, *St*GE2, was heterologously expressed in the yeast *P. pastoris* and was used to show, through site-directed mutagenesis studies (Topakas et al., 2010) and crystal structure determination (Charavgi et al., 2013), that the nucleophilic serine residue is responsible for the catalytic action of GEs.
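The conserved G-C-S-R-X-G sequence lends itself to a simple pattern search. The following is a minimal sketch, assuming that X stands for any single residue and that a plain regular expression is an adequate first-pass screen; the function name and the toy sequence are hypothetical and do not come from the cited studies.

```python
import re

# Consensus reported in the text: G-C-S-R-X-G, where X is assumed
# to stand for any single residue.
GE_MOTIF = re.compile(r"GCSR.G")


def find_ge_motif(protein_seq):
    """Return the 1-based start position of every G-C-S-R-X-G match."""
    return [m.start() + 1 for m in GE_MOTIF.finditer(protein_seq.upper())]


# Hypothetical toy sequence, not a real CE15 protein.
toy = "MKTAGCSRNGLLPAGCSRAG"
print(find_ge_motif(toy))  # [5, 15]
```

A match only flags a candidate; confirming the catalytic serine still requires experimental evidence such as the mutagenesis described above.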

#### **MANNAN-DEGRADING ENZYMES**

Mannan is a major component of hemicellulose; therefore, as expected, the lignocellulolytic toolbox of *M. thermophila* possesses a complete reservoir of genes encoding mannan-degrading enzymes. The mannan polymer primarily consists of a backbone composed of β-1,4-bound mannose residues or a combination of glucose and mannose residues, and can be hydrolyzed to its monomers by the synergistic action of β-mannanases (EC 3.2.1.78), β-mannosidases (EC 3.2.1.25), α-galactosidases (EC 3.2.1.22), and acetylmannan esterases (EC 3.1.1.6) (McCleary, 1988). The genome of *M. thermophila* encodes three enzymes that putatively catalyze random cleavage of the mannan polysaccharide and belong to GH families 5 and 26. One of these enzymes has been isolated from culture supernatant, characterized and classified as a GH5 endo-β-1,4-mannosidase (bMan2, Dotsenko et al., 2012). In addition, there are two genes encoding putative β-mannosidases belonging to the GH2 family, one of which has been characterized in terms of its specificity and physicochemical properties (bMann9, Dotsenko et al., 2012). Two GH27 and one GH26 α-galactosidases boost the efficiency of the fungal culture supernatant in hydrolyzing mannan substrates (Emalfarb et al., 2012), while two CE12 family genes encoding proteins with high similarity to known acetyl-mannan esterases have been found.

*All of them contain a secretion signal peptide and have a theoretical molecular weight of 28.51 ± 4.1 kDa, ranging between 23 and 39 kDa.*

#### **AUXILIARY ENZYMES**

In spite of the cooperative activity exhibited by the cellulolytic and hemicellulolytic enzymes, the impressive hydrolytic ability of various microorganisms in nature cannot be attributed only to this endo–exo mechanism. Apart from the hydrolytic system responsible for carbohydrate degradation, it seems that an oxidative system catalyzes lignin depolymerization and oxidation of plant cell wall components, yielding reactive molecules (e.g., H2O2). Recent evidence highlights the critical role of alternative enzymatic partners involved in the oxidation of cell wall components. Among these, LPMOs (originally described as cellulases), CDH, and multicopper enzymes such as laccases play an outstanding role during hydrolysis. The genome of *M. thermophila* possesses more than 30 genes that encode proteins with such auxiliary activities (**Figure 4**). Members of the LPMO family AA9 have been shown to be copper-dependent monooxygenases that enhance cellulose degradation in concert with classical cellulases, as mentioned above and reviewed by Dimarogona et al. (2013). These enzymes catalyze the cleavage of cellulose by an oxidative mechanism, provided that reduction equivalents are available. These equivalents either involve low molecular weight reducing agents (e.g., ascorbate) or are produced by CDH activity (Langston et al., 2011). CDHs are extracellular enzymes produced by various wood-degrading fungi that efficiently oxidize soluble cellodextrins, mannodextrins and lactose to their corresponding lactones by a ping-pong mechanism using a wide spectrum of electron acceptors (Henriksson et al., 2000). Throughout the genome of *M. thermophila*, two genes encoding proteins classified in the AA3 and AA8 families have been identified (**Figure 4**). Both of them are predicted to be secreted into the culture supernatant and have potential glycosylation sites. The translated CDH MYCTH\_111388 exhibits a C-terminal CBM, and a cDNA clone of this sequence has been isolated by screening an expression library of *M. thermophila* and biochemically characterized (Subramaniam et al., 1999). Canevascini et al. (1991) purified a monomeric (91 kDa) and a dimeric (192 kDa) form of CDH that differed not only in molecular weight, but also in amino acid composition and carbohydrate content. Both forms oxidized cellobiose in the presence of cytochrome c or dichlorophenol–indophenol.

Laccases (EC 1.10.3.2) are multicopper enzymes that catalyze the oxidation of a variety of phenolic compounds, with concomitant reduction of O2 to H2O. These polyphenol oxidases are produced by most ligninolytic basidiomycetes (Baldrian, 2006) and can degrade lignin and other recalcitrant compounds in the presence of redox mediators (Ruiz-Dueñas and Martínez, 2009). The genome of *M. thermophila* encodes eight putative enzymes with multicopper oxidase activity. Four of them have been annotated and one (MYCTH\_51627) matches the *lcc1* gene product, an extracellular laccase (Berka et al., 1997).

**activities, classified to AA3/8, AA9 families and multicopper oxidases.** *M. thermophila* distinguishes itself from other cellulolytic fungi, exhibiting an impressive number of LPMO accessory enzymes belonging to the AA9 family (previously described as GH61).

Four sequences are predicted to possess a secretion signal, while one appears to remain membrane-bound. The *lcc1* gene has been isolated from the fungal genome and heterologously expressed in *A. oryzae*, and the resulting 85 kDa enzyme (MtL) was characterized as a thermostable, low oxidation potential laccase with high reactivity in aqueous medium at room temperature and neutral pH. MtL was tested for its capacity to catalyze the enzymatic oxidation of several phenolic and polyphenolic compounds (ferulic acid, gallic acid, caffeic acid, and catechin) (Mustafa et al., 2005). *M. thermophila* laccases have been reported to oxidize the lignin surface, increasing the amount of radicals during bleaching of thermomechanical pulp fiber material (Grönqvist et al., 2003), and to promote oxidative polymerization of Kraft lignin from black liquor, which is the main by-product of the pulp and paper industry (Gouveia et al., 2013).

#### **LIGNOCELLULOSIC POTENTIAL—STATISTICS**

*M. thermophila* is a powerful lignocellulolytic organism, which secretes a complex system of carbohydrate hydrolases for the breakdown of cellulose and hemicellulose, as well as oxidoreductases involved in lignin degradation. Genome analysis in this review revealed 30 genes encoding cellulases classified in 10 GH families, 66 genes encoding hemicellulases classified in 10 GH and 9 CE families, and 35 genes encoding auxiliary enzymes. The latter include CDHs (AA3/AA8 family), LPMOs (AA9 family) and multicopper oxidases (laccases or laccase-like enzymes). Of all the *M. thermophila* sequences encoding proteins with putative lignocellulosic activity, 80.2% are predicted to have a secretion signal peptide. Almost 76% of the cellulases and hemicellulases and 88% of the accessory redox enzymes are targeted to the secretion pathway, while only a very small number remain inside the cell or represent membrane-bound macromolecules. Only 15.8% of the secreted enzymes in this review are predicted to possess a CBM, and the majority of them are auxiliary enzymes. The theoretical average molecular weight of the secreted enzymes is 41.36 ± 15.9 kDa, varying between 10 and 97 kDa. The majority of secreted enzymes have a molecular weight between 20 and 50 kDa, whereas β-xylosidases and β-glycosidases (GH3 family) and arabinofuranosidases (GH43 and GH51) appear to be high molecular weight proteins (**Figure 5**). The theoretical average isoelectric point of the secretory enzymes is calculated as 5.27 ± 0.8, within a range of 4.34–7.9. *In vivo* expression and study of these enzymes would give different results, as the proteins are glycosylated, so their size and pI values are likely to deviate from the theoretical ones.

#### **PROTEIN GLYCOSYLATION**

In total, 92.8% of the secreted proteins have putative *N*- or *O*-glycosylation sites. These proteins are often glycosylated owing to the existence of many Asn-Xaa-Ser/Thr sequons, which are known to be a prerequisite for *N*-glycosylation post-translational modifications. The molecules of many GHs and accessory enzymes have a modular structure consisting of a catalytic module, a flexible peptide linker, and a CBM. Flexible linker peptides, which are rich in Ser and Thr residues, are typically *O*-glycosylated (Gilkes et al., 1991). *N*-glycosylation seems to be restricted to the catalytic modules, and it is usually absent from other parts of the enzyme molecules. Various *N*-linked glycan structures have been found in different enzymes from *M. thermophila*, belonging to different enzyme classes and protein families (Gusakov et al., 2008). It has been noticed that glycosylation is heterogeneous: in some molecules, the same Asn residue was modified with oligosaccharides of different structure, while not all of the potential glycosylation sites were found to be occupied. The most frequently encountered *N*-linked glycan was (Man)3(GlcNAc)2, a pentasaccharide that represents a well-known conserved core structure forming mammalian-type high-mannose and hybrid/complex glycans in glycoproteins from different organisms (Dwek et al., 1993). Both types of glycosylation occur in 65% of secreted cellulases and 62.1% of secreted hemicellulases, while only *O*-glycosylation patterns appear in most of the accessory enzymes. The presence of *N*-linked glycans is common in the catalytic domain of the enzymes, while *O*-glycosylation usually occurs in the linker region. Even though glycosylation sites are predicted in them, non-secreted enzymes are not modified *in vivo* with glycans, since this process has been observed as a post-translational modification in proteins targeted to the secretory pathway of the cell (Blom et al., 2004).
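Putative *N*-glycosylation sites of the Asn-Xaa-Ser/Thr type can be located with a short pattern search. The sketch below is illustrative only: it assumes the commonly applied rule that Xaa may be any residue except proline (a restriction not stated in the text above), and the function name and toy sequence are hypothetical. It is not the prediction tool used for the statistics in this review.

```python
import re

# N, then any residue except proline (assumption), then S or T.
# The lookahead keeps overlapping sequons such as "NNTS" from being skipped.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(protein_seq):
    """Return 1-based positions of Asn residues in N-X-S/T sequons."""
    return [m.start() + 1 for m in SEQUON.finditer(protein_seq.upper())]


# Hypothetical toy sequence, not taken from any M. thermophila protein.
toy = "MNGSANPTQNNTS"
print(find_sequons(toy))  # [2, 10, 11]
```

As noted above, a sequon marks only a potential site: not every predicted position is occupied *in vivo*, and non-secreted proteins are not glycosylated even when sequons are present.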

#### **CONCLUSIONS**

Rapid depolymerization of lignocellulosic material is a distinguishing feature of thermophilic fungi, such as *M. thermophila*, which was isolated from soil and self-heating masses of composted vegetable matter (Domsch et al., 1993). However, the precise biochemical mechanisms and underlying genetics of this process are not completely understood. Systematic examination of the *M. thermophila* genome revealed a unique enzymatic system comprising an unusual repertoire of auxiliary enzymes, especially those classified in the AA9 family, and provided insights into its extraordinary capacity for protein secretion. The current review constitutes, to the best of our knowledge, the first genomic analysis of the lignocellulolytic system of *M. thermophila*. The genomic data, along with the observed enzymatic activity of several isolated and characterized enzymes, suggest that this fungus possesses a complete set of enzymes, including 30 cellulases, 66 hemicellulases, and 35 auxiliary enzymes, covering most of the recognized CAZy families. From its cellulases to its oxidoreductases and multicopper enzymes, the *M. thermophila* gene complement offers several avenues for further research, and its diverse array of enzymatic capabilities will contribute to the study of lignocellulose degradation and subsequent ethanol biofuel production.

#### **ACKNOWLEDGMENTS**

Anthi Karnaouri thanks the State Scholarships Foundation (Greece) for a Grant. Paul Christakopoulos thanks Bio4Energy, a strategic research environment appointed by the Swedish government, for supporting this work.

#### **REFERENCES**


β-glucosidase from *Myceliophthora thermophila*. *PeerJ* 1, e46. doi: 10.7717/peerj.46


**Conflict of Interest Statement:** The Review Editor Ulrika Rova declares that, despite being affiliated to the same institution as authors Anthi Karnaouri, Io Antonopoulou, and Paul Christakopoulos, the review process was handled objectively and no conflict of interest exists. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 22 April 2014; accepted: 22 May 2014; published online: 18 June 2014. Citation: Karnaouri A, Topakas E, Antonopoulou I and Christakopoulos P (2014) Genomic insights into the fungal lignocellulolytic system of Myceliophthora thermophila. Front. Microbiol. 5:281. doi: 10.3389/fmicb.2014.00281*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Karnaouri, Topakas, Antonopoulou and Christakopoulos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Comparative genomics and evolution of regulons of the LacI-family transcription factors

*Dmitry A. Ravcheev1†, Matvei S. Khoroshkin1†, Olga N. Laikova1, Olga V. Tsoy1,2, Natalia V. Sernova1, Svetlana A. Petrova1,2, Aleksandra B. Rakhmaninova2, Pavel S. Novichkov3, Mikhail S. Gelfand1 \* and Dmitry A. Rodionov1,4\**

*<sup>1</sup> Research Scientific Center for Bioinformatics, A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia*

*<sup>2</sup> Faculty of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia*

*<sup>3</sup> Lawrence Berkeley National Laboratory, Genomics Division, Berkeley, CA, USA*

*<sup>4</sup> Department of Bioinformatics, Sanford-Burnham Medical Research Institute, La Jolla, CA, USA*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Paul Alan Hoskisson, University of Strathclyde, UK John Alan Gerlt, University of Illinois, USA*

#### *\*Correspondence:*

*Mikhail S. Gelfand, Research Scientific Center for Bioinformatics, A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, 19-1, Bolshoi Karetny pereulok, Moscow 127994, Russia e-mail: gelfand@iitp.ru; Dmitry A. Rodionov, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA e-mail: rodionov@sanfordburnham.org*

*†These authors have contributed equally to this work.*

DNA-binding transcription factors (TFs) are essential components of transcriptional regulatory networks in bacteria. LacI-family TFs (LacI-TFs) are broadly distributed among certain lineages of bacteria. The majority of characterized LacI-TFs sense sugar effectors and regulate carbohydrate utilization genes. The comparative genomics approaches enable *in silico* identification of TF-binding sites and regulon reconstruction. To study the function and evolution of LacI-TFs, we performed genomics-based reconstruction and comparative analysis of their regulons. For over 1300 LacI-TFs from over 270 bacterial genomes, we predicted their cognate DNA-binding motifs and identified target genes. Using the genome context and metabolic subsystem analyses of reconstructed regulons, we tentatively assigned functional roles and predicted candidate effectors for 78 and 67% of the analyzed LacI-TFs, respectively. Nearly 90% of the studied LacI-TFs are local regulators of sugar utilization pathways, whereas the remaining 125 global regulators control large and diverse sets of metabolic genes. The global LacI-TFs include the previously known regulators CcpA in Firmicutes, FruR in Enterobacteria, and PurR in Gammaproteobacteria, as well as the three novel regulators—GluR, GapR, and PckR—that are predicted to control the central carbohydrate metabolism in three lineages of Alphaproteobacteria. Phylogenetic analysis of regulators combined with the reconstructed regulons provides a model of evolutionary diversification of the LacI protein family. The obtained genomic collection of *in silico* reconstructed LacI-TF regulons in bacteria is available in the RegPrecise database (http://regprecise.lbl.gov). It provides a framework for future structural and functional classification of the LacI protein family and identification of molecular determinants of the DNA and ligand specificity. 
The inferred regulons can be also used for functional gene annotation and reconstruction of sugar catabolic networks in diverse bacterial lineages.

**Keywords: bacteria, transcription factors, regulons, sugar metabolism, comparative genomics**

#### **INTRODUCTION**

Evolution of regulatory interactions in bacteria can be approached from three directions. The first approach is the comparative analysis of regulation of a functional system, e.g., a metabolic pathway, in a variety of species. Such analysis demonstrates high flexibility of regulatory interactions even in closely related species, with expansion, contraction, and merging of regulons or a complete change of regulators (Manson McGuire and Church, 2000; McCue et al., 2001; Tan et al., 2001; Gelfand, 2006; Rodionov et al., 2006; Ravcheev et al., 2007; Kazakov et al., 2009; Suvorova et al., 2011, 2012). The second approach is to consider a taxon of a relatively low level (genus or family) and to use comparative genomics to predict as many regulatory interactions as possible. This has been done for γ-Proteobacteria from the *Shewanella* genus (Rodionov et al., 2011); Firmicutes closely related to *Bacillus subtilis* (Leyn et al., 2013) and *Staphylococcus aureus* (Ravcheev et al., 2011); two families of lactic acid bacteria from the Lactobacillales order (Ravcheev et al., 2013b); hyperthermophilic bacteria related to *Thermotoga maritima* (Rodionov et al., 2013); and the human gut inhabitant *Bacteroides thetaiotaomicron* and related organisms (Ravcheev et al., 2013a). An important side product of such studies is functional annotation of hypothetical proteins by assigning them, via co-regulation, to known metabolic pathways and other functional subsystems (Rodionov, 2007; Gelfand and Rodionov, 2008).

The third approach, implemented here, is to consider a family of transcription factors (TFs) and then identify binding motifs for as many TFs as possible. This is mainly motivated by the desire to analyze the structure of protein–DNA interactions and the co-evolution of TFs and the motifs they recognize (Desai et al., 2009; Huang et al., 2009; Camas et al., 2010; Leyn et al., 2011; Ravcheev et al., 2012). An important issue in such studies is to connect TFs to the cognate TF binding sites (TFBSs) identified by phylogenetic footprinting and other computational techniques (Conlan et al., 2005; Wels et al., 2006; Liu et al., 2008). This problem is either solved experimentally or addressed computationally, for instance for regulons controlled by local TFs from specific protein families (Rigali et al., 2004; Francke et al., 2008; Sahota and Stormo, 2010; Ahn et al., 2012; Kazakov et al., 2013). Phylogenetic profiling of TF genes and motifs upstream of candidate regulon members is an alternative bioinformatics approach for assigning TFs to putative regulons (Rodionov and Gelfand, 2005). The comparative analysis of ligand-binding domains in TFs also helps identify ligand specificity determinants and propose models of functional diversification within large and functionally heterogeneous families of TFs (Kazanov et al., 2013).

Here we study the LacI family of bacterial transcription factors (LacI-TFs). The namesake of the family, the lactose repressor LacI of *E. coli*, has been the model object for the analysis of bacterial transcriptional regulation since the classical papers of Jacob and Monod (1959, 1961). The family was established by the analysis of similarity of protein sequences, and simultaneously the similarity of DNA motifs recognized by the family members was noted (Weickert and Adhya, 1992). At the same time, it was observed that the DNA-binding domains of LacI-family regulators are similar to the helix-turn-helix domains of other TFs (Nguyen and Saier, 1995), whereas the ligand-binding domain of LacI-TFs is homologous to the periplasmic proteins of ABC-transporters (Mauzy and Hermodson, 1992; Fukami-Kobayashi et al., 2003). Interestingly, this domain was also seen in combination with another DNA-binding domain, the winged helix-turn-helix of the GntR family (Franco et al., 2006). While early observations on a limited number of sequences suggested that the history of the family involved a series of duplications at the early stage and a low level of duplications later on (Nguyen and Saier, 1995), more data, derived from bacterial genomes as they became available, demonstrated that duplications were occurring throughout the history of the family (Fukami-Kobayashi et al., 2003). Given its large size and high level of structural similarity, which yield reliable multiple alignments, this family was widely used as a model for algorithms that identify functionally important residues, both uniformly conserved and specificity-determining (Mirny and Gelfand, 2002; Fukami-Kobayashi et al., 2003; Kalinina et al., 2004; Pei et al., 2006; Tungtur et al., 2007; Parente and Swint-Kruse, 2013). Some of these predictions were further tested experimentally (Meinhardt and Swint-Kruse, 2008; Camas et al., 2010; Tungtur et al., 2010, 2011).

The structural similarity of the DNA motifs recognized by the LacI-TFs was used to identify regulons in *Lactobacillus plantarum* (Francke et al., 2008) and *Dickeya dadantii* [*Erwinia chrysanthemi*] (Van Gijsegem et al., 2008). The structural aspects of interactions of the LacI-TFs with DNA and ligands and between subunits in dimers have been summarized in a recent review (Swint-Kruse and Matthews, 2009).

Here we report the results of a large-scale, manual comparative genomics analysis of LacI-TFs aimed at the identification of their binding sites, motifs, and regulons in bacterial genomes. By analyzing the genomic and metabolic context of the reconstructed regulons and by combining this analysis with information gathered from the literature, we inferred the biological roles and molecular effectors for a large number of the studied LacI-TFs. As a result, we made a number of observations on the distribution of orthologous LacI-TFs in genomes, the statistics of the binding sites' arrangement in regulatory regions, and the number and functional characteristics of the regulated genes. By combining the functional annotations with phylogenetic analysis, we proposed evolutionary models of functional diversification for a number of LacI-TF groups. The obtained reference dataset of 1281 regulons in 272 genomes was deposited in the RegPrecise database (Novichkov et al., 2013).

# **MATERIALS AND METHODS**

The genomes were downloaded from the MicrobesOnline database (Dehal et al., 2010). TFs from the LacI family were identified by similarity searches and domain predictions in the Pfam database (Finn et al., 2014). LacI-TFs consist of two characteristic domains, an N-terminal HTH DNA-binding domain (PF00356) and a C-terminal effector-binding domain, which is homologous to periplasmic binding proteins of sugar ABC transporters (PF00532, PF13377, or PF13407). Gene orthology was defined by the bidirectional best-hit criterion implemented in the GenomeExplorer software (Mironov et al., 2000) and validated by phylogenetic trees from the MicrobesOnline database (Dehal et al., 2010). Genes were considered as orthologs if they (i) formed a mono- or paraphyletic branch of the phylogenetic tree and (ii) demonstrated conserved chromosomal gene context. For genes in the reconstructed regulons, meeting both of these criteria was sufficient for them to be considered orthologs. For the studied LacI-family regulators, additional criteria of orthology were used. Thus, orthologous groups of LacI-TFs should (i) form a mono- or paraphyletic group on the phylogenetic tree; (ii) have a conserved gene context; (iii) have highly similar TFBS motifs; and (iv) have the same effector specificity (known or predicted based on the regulon content).
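The bidirectional best-hit criterion can be illustrated with a minimal sketch. This is not the GenomeExplorer implementation, whose details are not given here; the function names and the score dictionaries (pairwise similarity scores keyed by query and subject gene) are hypothetical and serve only to show the logic.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (query gene, subject gene) -> similarity score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its single highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Return gene pairs (a, b) that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b)   # genome A genes searched against genome B
    reverse = best_hits(b_vs_a)   # genome B genes searched against genome A
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores (invented for illustration only).
a_vs_b = {("a1", "b1"): 200.0, ("a1", "b2"): 50.0, ("a2", "b1"): 180.0, ("a2", "b2"): 30.0}
b_vs_a = {("b1", "a1"): 200.0, ("b1", "a2"): 180.0, ("b2", "a1"): 50.0, ("b2", "a2"): 30.0}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case only `a1` and `b1` are reciprocal best hits; `a2` also prefers `b1`, but `b1` prefers `a1`, so `a2` is left without an ortholog. Such candidates would then be checked against the phylogenetic tree and gene context, as described above.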

For the regulon reconstruction, we used a previously established comparative genomics approach (Rodionov, 2007) implemented in the RegPredict Web server (Novichkov et al., 2010). This approach includes prediction of putatively regulated genes, inference of TFBSs, construction of positional weight matrices (PWMs) for TFBS motifs, and a further search for additional regulon members on the basis of predicted TFBSs in gene promoter regions. Overall, three main strategies were used for the reconstruction of regulons: (1) construction of PWMs on the basis of known TFBSs, for regulons previously analyzed in model organisms; (2) prediction of novel TFBS motifs in promoter regions of regulated genes, for regulons with known regulated genes but unknown TFBSs; and (3) prediction of putatively co-regulated genes followed by the inference of putative TFBS motifs in their promoter regions and attribution of a candidate TF to a putative regulon. Presumably regulated genes were predicted by the analysis of conserved gene neighborhoods around an analyzed LacI-TF gene. Data about known regulated genes and LacI-TFBS motifs were extracted from the literature and from the RegTransBase (Cipriano et al., 2013), RegulonDB (Salgado et al., 2013), DBTBS (Sierro et al., 2008), and CoryneRegNet (Pauling et al., 2012) databases.
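The PWM construction and site search steps can be sketched as follows. The exact weighting formula used by RegPredict is not given in this section, so the sketch assumes a common log-odds form with a pseudocount and a uniform background; the function names, the pseudocount, and the toy sites are illustrative assumptions. The threshold follows the rule stated in these Methods: the lowest score observed in the training set.

```python
import math
from typing import Dict, List, Tuple

BASES = "ACGT"
PWM = List[Dict[str, float]]


def build_pwm(sites: List[str], pseudocount: float = 1.0, background: float = 0.25) -> PWM:
    """Build a log-odds PWM from equal-length aligned sites (assumed formula)."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all training sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm: PWM = []
    for pos in range(width):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score_site(pwm: PWM, site: str) -> float:
    """Sum the per-position weights of a candidate site."""
    return sum(column[base] for column, base in zip(pwm, site.upper()))


def scan(pwm: PWM, sequence: str, threshold: float) -> List[Tuple[int, str, float]]:
    """Return (start, window, score) for every window scoring at or above threshold."""
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if set(window) - set(BASES):
            continue  # skip windows with ambiguous bases such as N
        score = score_site(pwm, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Toy training set (invented); real LacI-TF sites are 14 to 24 bp long.
training_sites = ["ACGT", "ACGT", "ACGA"]
pwm = build_pwm(training_sites)
threshold = min(score_site(pwm, site) for site in training_sites)
hits = scan(pwm, "TTACGTTT", threshold)
print(hits)
```

Using the minimum training score as the cutoff guarantees that every training site is recovered by the search, at the cost of admitting weaker candidates that then need the cross-genome consistency check.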

Candidate motifs in upstream regions of regulated operons were identified by the Discover Profile tool in RegPredict (Novichkov et al., 2010). A search for palindromic DNA motifs of 14- to 24-bp length was carried out within putative promoter regions from −400 to +100 bp relative to the translational gene start. Motifs were further manually validated by phylogenetic footprinting, that is, analysis of conserved islands in multiple alignments of DNA fragments (Shelton et al., 1997). The constructed PWMs were further used to search for additional regulon members using the Run Profile tool in RegPredict. The lowest score observed in the training set of known and/or predicted TFBSs was used as the threshold for a site search in genomes. To eliminate false-positive TFBS predictions, the consistency check approach (Ravcheev et al., 2007; Rodionov, 2007) and/or functional relatedness of candidate target operons were used. In this approach, an operon can be considered as regulated when its upstream region contains putative TFBSs with a score higher than the threshold, and such sites can be found in a number of related genomes. Operons were defined as groups of genes satisfying the following criteria: same direction of transcription, intergenic distance up to 100 bp, absence of internal TF-binding sites, and conservation of the locus structure in a number of related genomes. All predicted TFs, motifs, and sites are publicly available in the RegPrecise database (http://regprecise.lbl.gov/) (Novichkov et al., 2013) within the TF family collections of regulons.
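The first two operon criteria (same direction of transcription and intergenic distance up to 100 bp) lend themselves to a short sketch. The `Gene` record, the 1-based inclusive coordinates, and the function name are assumptions made for illustration; the remaining two criteria (absence of internal TF-binding sites and conservation of the locus structure) require additional data and are deliberately not implemented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    name: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_into_operons(genes: List[Gene], max_gap: int = 100) -> List[List[Gene]]:
    """Chain adjacent genes that share a strand and lie within max_gap bp."""
    operons: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            previous = operons[-1][-1]
            gap = gene.start - previous.end - 1  # negative when genes overlap
            if gene.strand == previous.strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


# Toy locus (coordinates invented for illustration).
locus = [
    Gene("a", 1, 300, "+"),
    Gene("b", 350, 900, "+"),     # 49 bp after a, same strand: joins a
    Gene("c", 1200, 1500, "+"),   # 299 bp after b: starts a new operon
    Gene("d", 1520, 1800, "-"),   # close to c but opposite strand: new operon
]
operon_names = [[g.name for g in operon] for operon in group_into_operons(locus)]
print(operon_names)
```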

Functional gene annotations were extracted from the literature and uploaded from the SEED (Disz et al., 2010), UniProt (Magrane and Consortium, 2011), MicrobesOnline (Dehal et al., 2010), and KEGG (Kanehisa et al., 2012) databases. Known functional annotations for a particular gene were expanded to all orthologous genes. For prediction of gene functions, both the comparative genomics and context-based methods were used (reviewed in Osterman and Overbeek, 2003; Overbeek et al., 2005; Rodionov, 2007). Multiple alignments of protein and DNA sequences were built by MUSCLE (Edgar, 2004). Phylogenetic trees were constructed using a maximum-likelihood algorithm implemented in PhyML 3.0 (Guindon et al., 2010) and visualized via Dendroscope (Huson et al., 2007) and iTOL (Letunic and Bork, 2011). Sequence logos for DNA motifs were drawn with WebLogo (Crooks et al., 2004).

# **RESULTS**

The comparative genomics workflow for regulon reconstruction implemented in the RegPredict Web server (Novichkov et al., 2010) and the RegPrecise database (Novichkov et al., 2013) includes three steps: (i) selection of a taxonomic group of related bacteria; (ii) selection of a subset of diverse genomes that represent a given group; and (iii) reconstruction of regulons in the selected genomes. For the analysis of LacI-TF regulons, we selected a set of 344 representative genomes from 39 taxonomic groups from 7 bacterial phyla (Table S1 in Supplementary Material). Among the analyzed lineages, there are 19 taxonomic groups of Proteobacteria (183 genomes), 9 groups of Firmicutes (72 genomes), and 7 groups of Actinobacteria (57 genomes). The Bacteroides, Chloroflexi, Deinococcus-Thermus, and Thermotogae phyla are each represented by a single taxonomic group and have 32 genomes in total.

# **REPERTOIRE OF LacI-TF GENES IN BACTERIAL GENOMES**

To estimate the abundance of LacI-family TFs (LacI-TFs) in the studied genomes, we collected primary LacI-TF sets using a similarity search and the existing prokaryotic TF compilations. In total, 2572 proteins were found unevenly distributed in most (309/344; 90%) of the studied genomes, whereas 10% of the genomes do not encode putative LacI-TFs (Table S1 in Supplementary Material). The largest average numbers of LacI-TFs per genome were found in several lineages of the Actinobacteria phylum including Streptomycetaceae and Bifidobacteriaceae (from 32 to 17 regulators), in two lineages of Proteobacteria, Rhizobiales and Enterobacteriales (15 regulators in each group), and in two lineages of Firmicutes, Bacillales and Enterococcaceae (12 regulators in each group). The remaining taxonomic groups possess fewer than 10 LacI-TFs per genome on average. Notably, the Methylophilales, Neisseriales, Nitrosomonadales, Oceanospirillales, Magnetospirillum/Rhodospirillum, and Desulfovibrionales groups completely lack LacI-TFs in their genomes. The absence of LacI-TFs in these taxonomic groups of Proteobacteria may be related to (i) a relatively small proportion of sugar catabolic genes in their genomes (as LacI-TFs mostly control sugar catabolism, see below) and (ii) increased usage of TFs from other families to compensate for the contraction of the LacI-TF pool.

#### **STATISTICS OF RECONSTRUCTED REGULONS AND REGULOGS**

The entire set of identified LacI-TFs was broken into taxonomic group-specific orthologous groups that were subjected to further comparative genomics analysis using the RegPredict Web server (Novichkov et al., 2010). Normally, an orthologous group contained no more than one TF per genome. However, in some cases TFs formed by recent, mainly genome-specific, duplication were assigned to the same orthologous group (Table S1 in Supplementary Material). By analyzing orthologous groups of regulators in each taxonomic group, candidate motifs and binding sites were predicted for 1303 LacI-TFs (50% of all putative LacI-TFs) in 272 bacterial genomes (80% of studied genomes). The main outcome of this analysis is an annotated regulog, which is defined as a set of genome-specific regulons controlled by orthologous TFs. Overall, we inferred 1281 LacI-TF regulons that constitute 322 populated regulogs unevenly distributed across 39 studied taxonomic groups of genomes (Tables S1, S2 in Supplementary Material). The reconstructed regulons included 7465 candidate sites, 6076 operons, and 13,558 genes.

The taxonomical distribution of the reconstructed LacI-TF regulons is highly uneven but generally follows the distribution of all LacI-family TFs (**Figure 1**): 57% of regulons were from Proteobacteria, about 30% from Firmicutes, 7% from Actinobacteria, about 1–2% from each of Thermotogales, Bacteroides, Chloroflexi, and Deinococcus/Thermus. Yet, compared to the genomic distribution of all putative LacI-TFs (Table S1 in Supplementary Material), the Actinobacteria phylum is underrepresented. Based on the phylogenetic analysis of LacI-TF proteins, regulators from the reconstructed regulogs were merged into larger orthologous groups that were consistent with the taxonomy, had regulated orthologous genes, and had similar binding motifs. TFs were assigned to an orthologous group if they formed a mono- or paraphyletic branch (see below) in the phylogenetic tree (Figure S1 in Supplementary Material). As in most gene families containing multiple paralogs resulting from frequent duplications, losses, and horizontal transfers, the resolution of orthology in some cases was difficult and required arbitrary decisions that are supported by the genomic context and/or functional attributes of the reconstructed regulons.

As a result, the studied LacI-TFs were classified into 190 orthologous groups characterized by conserved DNA motifs and regulated pathways (Table S2 in Supplementary Material). Two-thirds of the obtained orthologous groups (125/190) contain TFs from a single regulog, which is defined as a set of orthologous regulons in a group of closely related genomes. Thirty-seven orthologous groups include two regulogs, whereas the remaining 28 groups were assigned to three or more regulogs (**Figure 2A**). The total number of regulons (and corresponding TFs) per orthologous group of LacI-TFs varies between 1 and 59, with the average being 6.7 (**Figure 2B**). Groups including two TFs were the most common (37 groups). Orthologous groups containing up to 5 and between 6 and 10 regulons constitute 60 and 20% of all groups, respectively. The most populated groups of LacI-TF orthologs were found for the global catabolite control regulator CcpA in Firmicutes (59 regulons, 6 regulogs), the ribose repressor RbsR in Proteobacteria (52 regulons, 10 regulogs), the maltose repressor MalR in Firmicutes (42 regulons, 6 regulogs), as well as sugar catabolism regulators FruR (40 regulons, 6 regulogs), GntR (37 regulons, 7 regulogs), and GalR (27 regulons, 5 regulogs) in γ-Proteobacteria.

#### **GLOBAL AND LOCAL REGULONS**

The reconstructed LacI-TF regulons demonstrate drastic differences in the numbers of predicted target genes and operons. The majority of regulons (1198/1288, 93%) include 20 or fewer genes (**Figure 3A**), and further, three-fourths of these regulons contain between 2 and 7 genes, whereas 26 regulons have only 1 target gene. With respect to the number of regulated operons (**Figure 3B**), the largest portion of the studied LacI-TFs regulates one (31%) or two (38%) operons. We divided all reconstructed LacI-TF regulons into two main categories depending on their size (number of regulated genes and operons) and functional diversity (number of regulated pathways). A total of 125 regulons (12 regulogs) were classified as global, since each of them (i) contained more than 15 target genes that were arranged in at least 7 operons and (ii) controlled multiple metabolic pathways. The remaining LacI-TF regulons (1163/1288, 90%) were classified as local, each having a smaller number of targets and controlling a single metabolic pathway, usually a particular carbohydrate catabolic pathway.
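The size and diversity criteria above amount to a simple decision rule, sketched below. The function and argument names are hypothetical; the cutoffs (more than 15 target genes, at least 7 operons, more than one metabolic pathway) are those stated in the text.

```python
def classify_regulon(n_genes: int, n_operons: int, n_pathways: int) -> str:
    """Label a regulon 'global' or 'local' using the cutoffs stated in the text."""
    is_large = n_genes > 15 and n_operons >= 7   # criterion (i): size
    is_diverse = n_pathways > 1                  # criterion (ii): multiple pathways
    return "global" if is_large and is_diverse else "local"


# Toy regulons (counts invented for illustration).
examples = {
    "large, multi-pathway": classify_regulon(n_genes=40, n_operons=15, n_pathways=4),
    "large, single pathway": classify_regulon(n_genes=20, n_operons=8, n_pathways=1),
    "small sugar regulon": classify_regulon(n_genes=5, n_operons=2, n_pathways=1),
}
print(examples)
```

Both conditions must hold together: a regulon with many genes packed into a few long operons, or one confined to a single pathway, is still treated as local.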

Almost one-half of the identified global regulons (59 regulons, 6 regulogs) are operated by orthologs of the *B. subtilis* catabolite control regulator CcpA in all analyzed taxonomic groups of Firmicutes. CcpA regulons were previously described in detail for bacteria from the Bacillaceae (Sonenshein, 2007; Fujita, 2009; Leyn et al., 2013), Staphylococcaceae (Seidl et al., 2009; Ravcheev et al., 2011), Lactobacillales (Mahr et al., 2000; Zheng et al., 2011, 2012; Zotta et al., 2012; Ravcheev et al., 2013b), and Clostridiaceae (Antunes et al., 2012) lineages. Another large group of global regulons (31 regulons, 3 regulogs) are operated by orthologs of the *E. coli* purine repressor PurR in three related taxonomic groups of γ-Proteobacteria: Enterobacteriales, Pasteurellales, and Vibrionales. PurR is a global transcriptional regulator of *E. coli*, controlling biosynthesis of purines, some steps of biosynthesis of pyrimidines, polyamine metabolism, and nitrogen assimilation (Ravcheev et al., 2002; Cho et al., 2011). The remaining three global regulogs (FruR in Enterobacteriales, PckR in Rhizobiales, and GapR in Rhodobacterales, totaling 35 regulons), which control the central and peripheral carbohydrate catabolic pathways in diverse groups of Proteobacteria, are described in more detail below.

Autoregulation was observed for 72% (943 TFs) of the studied LacI-TFs. The autoregulation of TFs was more typical for local regulons, as expected. Thus, less than one-half of the studied global regulators (55 TFs) demonstrated autoregulation, whereas more than three-fourths of the analyzed local regulators (888 TFs) were autoregulated.

#### **BINDING SITES AND MOTIFS**

Binding motifs of the considered LacI-TFs are palindromes formed by highly conserved inverted repeats, which is consistent with previous studies (Francke et al., 2008; Camas et al., 2010; Milk et al., 2010). The distance between the repeats is usually constant for a given orthologous group of TFs, although in rare cases there is some flexibility. The overwhelming majority of LacI-TF binding motifs (1251/1303 TFs) are even palindromes (16, 18, 20, or 22 nt long), whereas non-canonical palindromes (17, 19, or 21 nt long) were found for only 4% of regulators (e.g., LacR, GalR, and EbgR). The characteristic feature of even palindromes is the presence of a consensus CG pair in the center of the palindrome (1173 TFs). Nonetheless, in some even palindromes the central pair can be either different (49 TFs) or degenerate (29 TFs).
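The motif properties described above (palindromic symmetry, even or odd length, and the central CG pair) can be checked for an individual site with a short sketch. The function names and the two example sequences are invented for illustration and are not sites from the reconstructed regulons.

```python
from typing import Dict, Union

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def describe_site(site: str) -> Dict[str, Union[int, bool, str]]:
    """Summarize length parity, central base(s), and palindrome mismatches."""
    site = site.upper()
    n = len(site)
    half = n // 2
    rc = reverse_complement(site)
    # Compare each left-half position with its mirrored partner on the
    # opposite strand; the central base of an odd-length site is ignored.
    mismatches = sum(1 for a, b in zip(site[:half], rc[:half]) if a != b)
    even = n % 2 == 0
    center = site[half - 1:half + 1] if even else site[half]
    return {
        "length": n,
        "even": even,
        "center": center,
        "central_CG": even and center == "CG",
        "mismatches": mismatches,
    }


even_site = describe_site("TGTAAACGTTTACA")   # invented 14-nt even palindrome
odd_site = describe_site("TGTAAACAGTTTACA")   # invented 15-nt odd palindrome
print(even_site)
print(odd_site)
```

Counting mismatches rather than demanding perfect symmetry reflects that real binding sites are usually imperfect palindromes.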

More than 75% of identified LacI-TFBSs are located within the area between 140 and 30 bp upstream of the start codon (**Figure 4A**). Less than 1% of sites are localized within coding regions, including experimentally demonstrated *E. coli* sites of LacI deep within the *lacZ* gene (Lewis, 2005) and PurR in the *purB* gene (He et al., 1992). Approximately 7% of sites are localized far upstream (>200 bp), and while some of them might in fact regulate divergently transcribed genes, experimental examples of such localization are known, e.g., the PurR site upstream of *prsA*, again in *E. coli* (He et al., 1993).

About 20% of regulated operons are preceded by more than one binding site. Thus, tandem sites were observed upstream of 1118 operons. The multiplicity of sites within regulatory regions of divergently transcribed genes (i.e., divergons) increases with the divergon length. Groups of three and four adjacent sites were found for 116 and 5 regulated operons, respectively.

The histogram of intersite distances for pairs of sites localized in the same intergenic region has pronounced peaks at 13, 22, and 32 bp (**Figure 4B**). The latter two lengths are multiples of the DNA helix step and hence are clearly indicative of cooperative binding of two TF dimers. Several LacI-TF regulogs, e.g., RafR from Enterobacteriales and ScrR from Burkholderiales, have only double sites at the distance of 21–22 bp, suggesting that the cooperative binding in this case is obligatory. Overlapping sites, situated at a distance of about 13 bp, were observed upstream of 110 operons. Such an arrangement may be functional, as has been demonstrated for GntR-binding sites upstream of *gntKU* of *E. coli* (Tsunedomi et al., 2003a), and it is conserved for more than one-half of the operons regulated by GntR and its orthologs in γ-Proteobacteria.
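The helical-phasing argument can be made concrete with a small calculation. The sketch below assumes the textbook value of about 10.5 bp per turn of B-form DNA, which is not stated in the text; the tolerance of 1.5 bp, the function names, and the toy site positions are likewise illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List

HELIX_TURN_BP = 10.5  # assumed B-DNA helical repeat (textbook value)


def intersite_distances(site_starts: Iterable[int]) -> List[int]:
    """Distances between consecutive sites in one intergenic region."""
    ordered = sorted(site_starts)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def phase_offset(distance: float, turn: float = HELIX_TURN_BP) -> float:
    """Signed offset (bp) of a distance from the nearest whole number of turns."""
    return distance - round(distance / turn) * turn


def in_phase(distance: float, tolerance: float = 1.5) -> bool:
    """True when two sites lie on roughly the same face of the helix."""
    return abs(phase_offset(distance)) <= tolerance


# Toy intergenic regions with invented site start coordinates.
regions = [[100, 122], [40, 72], [10, 23], [200, 222, 254]]
histogram = Counter(d for region in regions for d in intersite_distances(region))
phasing = {d: in_phase(d) for d in (13, 22, 32)}
print(histogram)
print(phasing)
```

Under these assumptions, 22 and 32 bp fall within about 1 bp of two and three helical turns, so both dimers face the same side of the DNA, whereas 13 bp is about a quarter turn out of phase, which is consistent with treating the 13-bp peak as a separate phenomenon.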

A trivial explanation, that these observations are an artifact, does not seem plausible for two reasons. Firstly, such multiple sites are conserved and even preferred in some orthologous groups. Secondly, the artifact hypothesis does not explain the preferred distance of 13 bp. In the available tertiary structures of LacI-TFs complexed with DNA (Schumacher et al., 1994; Barbier et al., 1997; Kalodimos et al., 2002), the seven base pairs nearest to the site center form contacts between bases and side residues. Hence, the region shared by two overlapping sites largely coincides with the zone of specific TF-DNA contacts. While tetramer binding has been suggested, it is difficult to reconcile this with the structural data, as dimers bound at these distances cannot interact. It is possible that there exists a specific binding mode involving partial unwinding of the DNA strands.

#### **REGULATED METABOLIC PATHWAYS AND EFFECTORS**

By assessing the functional content of the reconstructed regulons, we tentatively predicted possible biological functions and effectors for 190 orthologous groups of LacI-TFs. As a result, metabolic pathways were predicted for 182 groups of LacI-TFs. These include 54 groups that were only assigned to the general category of sugar metabolism, and for them, a specific sugar catabolic pathway remained unknown (Table S2 in Supplementary Material). We compared the predicted regulon functions with previous results of experimental studies available for 24 selected LacI-TFs. The previously established functions of these regulators are in good agreement with the target-regulated pathways that were predicted in this work. Based on the metabolic pathway reconstruction and the knowledge of pathway metabolites, a range of possible molecular effectors was suggested for 108 groups of LacI-TFs (Table S2 in Supplementary Material). Of these, effectors were previously known for regulators from 21 LacI-TF groups.

As expected, the overwhelming majority of the studied LacI-TF orthologous groups control carbohydrate metabolism (176/182; 96% of groups with assigned pathways). Most of the orthologous groups containing local LacI-TFs are assigned to specific carbohydrate utilization pathways. In contrast, five orthologous groups of local regulators including AdeR, HpxR, and UriR control the nucleoside utilization pathways, whereas a local regulator NtdR controls the neotrehalosadiamine biosynthesis. Most global regulators from the LacI family are also involved in the control of carbohydrate metabolism (FruR, CcpA, PckR, GapR), whereas PurR regulates several key metabolic pathways including purine and pyrimidine biosynthesis.

In agreement with the observed tendency of LacI-TFs to control sugar metabolism, we report that carbohydrates constitute the largest class of effectors for these regulators (103/107; 96% of orthologous groups with assigned effectors). The majority of carbohydrate effectors assigned to the LacI-TF groups are monosaccharides and their derivatives (26/45), including hexoses (e.g., glucose, galactose, mannose), pentoses (e.g., ribose, xylose), sugar phosphates (e.g., fructose-1-phosphate, allose-6-phosphate), sugar acids (e.g., gluconate, galacturonate), sugar alcohols (e.g., ribitol), and amino sugars (e.g., N-acetylglucosamine). The second-largest category of carbohydrate effectors contains various oligosaccharides (14/26), including the common disaccharides cellobiose, maltose, sucrose, and trehalose or their phosphorylated derivatives (e.g., cellobiose-6-phosphate, sucrose-6-phosphate). Finally, non-carbohydrate effectors of LacI-TFs include nucleobases (e.g., guanine and hypoxanthine are co-repressors of PurR in *E. coli*; Schumacher et al., 1994), nucleosides (e.g., cytidine and adenosine are inducers of CytR in *E. coli*; Barbier et al., 1997), and proteins (e.g., phosphoprotein HPr-Ser46-P is a co-repressor of CcpA in *B. subtilis*; Schumacher et al., 2011).

Functional analysis of reconstructed LacI-TF regulons revealed that many sugar utilization pathways are regulated by two or more non-orthologous regulators. These include catabolic pathways for at least 10 distinct types of carbohydrates that are controlled by more than 90 orthologous groups of LacI-TFs (**Table 1**; Table S1 and Figure S1 in Supplementary Material). The observed large numbers of non-orthologous regulators for the glucoside and galactoside catabolic pathways correlate with structural diversity of glucose- and galactose-containing oligosaccharides that can be utilized by bacteria in diverse natural habitats. On the other hand, the diversity of regulators for several other sugars including ribose and sucrose suggests a high frequency of convergent evolutionary events when the same ligand specificity has evolved independently in different branches of the LacI family.

**Table 1 | Sugar utilization pathways controlled by non-orthologous LacI-TFs.**

For example, the sucrose utilization pathway is regulated by LacI-TFs from at least 11 orthologous groups from the phyla of Proteobacteria (11 lineages) and Firmicutes (5 lineages), as well as a single lineage of Actinobacteria (**Figure 5**). Analysis of the respective sucrose regulons revealed multiple distinct combinations of sucrose uptake transporters including permeases (*scrT*, *cscB*, *sut1*), phosphotransferase systems (PTSs) (*scrA*) and porins (*scrY*, *scrO*, *omp*), and sucrose catabolic enzymes including phosphorylases (*scrP*), hydrolases (*scrB*), and fructokinases (*scrK*). The effectors sucrose and sucrose-6-phosphate were assigned to the respective groups of ScrR regulators based on the type of regulated sucrose-specific transporters, i.e., a permease or a PTS, respectively. Interestingly, some lineages such as Bacillales and Enterobacteriales contain non-orthologous ScrR regulators with different effectors. As expected, DNA motifs of TFs controlling the sucrose catabolic pathway are well conserved within orthologous groups but are clearly different between non-orthologous regulators (**Figure 5**).

#### **GLOBAL REGULONS FOR CENTRAL SUGAR METABOLISM**

The LacI family contains a number of global regulators for central carbohydrate metabolic pathways, in particular the previously known regulators CcpA in Firmicutes and FruR in Enterobacteria. Here, we report a comparative genomics reconstruction of orthologous FruR regulons in γ-Proteobacteria, while the reconstructions of CcpA regulons in different lineages of Firmicutes have been reported previously (Ravcheev et al., 2011, 2013b; Antunes et al., 2012; Leyn et al., 2013) and are available in the RegPrecise database. Further, we report identification of three novel non-orthologous regulons (named PckR, GapR, and GluR) that control central carbohydrate metabolism in three lineages of α-Proteobacteria, namely, Rhizobiales, Rhodobacterales, and Caulobacterales. Below we provide functional analysis of regulons for each of these four LacI-family regulators in Proteobacteria.

The fructose repressor FruR (also known as the catabolism repressor and activator Cra) is a global regulator of central metabolism in *E. coli* (Saier and Ramseier, 1996). FruR/Cra coordinates the carbon flow by repressing glycolytic genes involved in the Embden–Meyerhof, Entner–Doudoroff, and pentose-phosphate pathways and by activating gluconeogenesis genes. The comparative genomics reconstruction of orthologous FruR regulons in γ-Proteobacteria revealed that the regulon size correlates with the taxonomy of studied groups and with the phylogeny of the FruR proteins (**Figure 6**). In Vibrionales and Pseudomonadales, the fructose utilization operon *fruBKA* and the *fruR* gene are the only members of the reconstructed FruR regulons; therefore, FruR operates as a local regulator in these lineages. In the Enterobacteriales, the FruR regulon is expanded to cover genes of the central glycolytic pathways, a part of the TCA cycle, and several fermentation and respiration pathways. Further regulon expansion to the genes of glyoxylate bypass is observed in the *Escherichia*, *Salmonella*, *Citrobacter*, *Enterobacter*, and *Klebsiella* species. Finally, in the closely related *Escherichia* and *Salmonella* species, the FruR/Cra regulon is expanded to include the Entner–Doudoroff pathway genes. Similar trends in the evolution of global regulons in closely related bacterial species were previously demonstrated for the PhoP regulon in enterobacteria (Perez and Groisman, 2009). On the other hand, in Pasteurellales, the regulon is degenerating: FruR is absent in the *Haemophilus* spp. and *Actinobacillus pleuropneumoniae*, whereas in *Pasteurella multocida*, *Mannheimia succiniciproducens*, *A. succinogenes*, and *A. aphrophilus* the *fruR* gene is seemingly intact, but no candidate FruR-binding sites could be detected.

The hypothetical TF PckR (SMc02975 in *Sinorhizobium meliloti*) was previously annotated as a putative regulator of the phosphoenolpyruvate carboxykinase *pckA* (EMBL accession number AF004316.1); however, it had not yet been studied experimentally. Orthologs of PckR were found in 10 out of 15 analyzed genomes from the Rhizobiales order including species from the Rhizobiaceae, Brucellaceae, Phyllobacteriaceae, and Xanthobacteraceae families. By using the comparative genomics approach, we identified the putative PckR binding motif and reconstructed the PckR regulons in each of these genomes (**Figure 7**). Furthermore, by analyzing the binding-site position within promoter regions, we predicted a negative or positive mode of PckR regulation. As a result, PckR was predicted to function as a dual transcriptional regulator that represses glycolytic genes from the Embden–Meyerhof and Entner–Doudoroff pathways (*glk*, *fba*, *pykA*, *zwf-pgl-edd*, *eda*) and activates genes from the gluconeogenesis and TCA cycle (*pckA*, *mdh-sucCDAB*, *sdhABCD*).

In the Xanthobacteraceae family, the reconstructed PckR regulon includes a minimal number of genes (*pckA*, *pckR*, *edd*, and *hpr*), whereas it is significantly expanded in the other three families of Rhizobiales and contains from 14 to 28 genes per genome. The most conserved members of the reconstructed PckR regulons include the *pckA* and *edd* genes (in 10 and 9 genomes, respectively), the *zwf-pgl* and *mdh-sucCDAB* operons (in 8 genomes), and the *fba*, *glk*, and *eda* genes (in 7 genomes). In most of the analyzed genomes, the *pckR* genes are not clustered with their target genes, which is a common feature of many global regulators in bacteria. However, in the genome of *Xanthobacter autotrophicus*, *pckR* is divergently transcribed with the 6-phosphogluconate dehydrogenase gene *gnd*, which is preceded by a putative PckR-binding site.

Orthologs of PckR and their cognate DNA motifs were found in Rhizobiales but not in other lineages, suggesting that the PckR regulon was introduced relatively recently in the evolution of α-Proteobacteria. PckR from Rhizobiales can be considered as a partial functional replacement of the enterobacterial Cra/FruR (see above) and the *Shewanella* HexR regulators (Leyn et al., 2011), both of which play a pleiotropic role in modulating the direction of carbon flow through different carbohydrate metabolic pathways. The molecular effector of PckR is unknown. A plausible hypothesis is that PckR dissociates from its DNA sites in response to an intermediate of the central glycolytic pathway in rhizobia.

A novel global transcriptional regulator for carbohydrate metabolism genes named GapR (RSP\_1663 in *Rhodobacter capsulatus*) was identified in all 13 studied genomes from the Rhodobacteraceae family (**Figure 7**). GapR was predicted to recognize a 20-bp palindromic DNA consensus, which is distinct from the PckR-binding consensus. The reconstructed GapR regulons contain from 7 to 18 genes per genome organized in 5–13 operons. GapR regulates glycolytic genes involved in the Embden–Meyerhof and Entner–Doudoroff pathways (*zwf-pgl-pgi*, *edd-eda*, *fba*, *gapA*, *gapB*, *eno*, *pykA*), gluconeogenesis (*pckA*, *pycA*), fructose utilization (*scrK*), pentose phosphate pathway (*tal*), and the TCA cycle (*mdh*). A similar DNA motif was identified upstream of the *gapR* genes in five genomes, suggesting their autoregulation. Similarly to PckR in Rhizobiales, the *gapR* genes do not cluster on the chromosome with their target genes in most genomes. However, *Roseobacter* possesses two copies of *gapR*. One of these paralogs is divergently transcribed with the *gapB* gene, which is preceded by a putative GapR-binding site. The molecular mechanism and effector for GapR regulators remain to be elucidated.

Although PckR and GapR regulate overlapping sets of genes from the central carbohydrate metabolism in two distinct lineages of α-Proteobacteria, these regulators are not orthologous to each other (Figure S1 in Supplementary Material) and recognize different DNA motifs (**Figure 7**). Another non-orthologous regulator from the LacI family named GluR (CC2053 in *Caulobacter crescentus*) was predicted to control the central carbohydrate metabolism in the Caulobacteraceae family (**Figure 7**). GluR recognizes a conserved 20-bp palindromic consensus, different from the PckR- and GapR-binding motifs above. The reconstructed GluR regulon in the *Caulobacter* spp. is composed of the glycolytic genes *zwf-pgl-edd-glk*, *pykA*, and *gnl-gfo*, as well as the gluconeogenic gene *ppdK* encoding pyruvate-phosphate dikinase and the *sucABCD* operon involved in the TCA cycle. The predicted regulatory gene *gluR* is located immediately downstream of the target operon *zwf-pgl-edd-glk* and is preceded by a conserved GluR-binding site in all three analyzed *Caulobacter* genomes.

Orthologs of GluR were not found in other α-Proteobacteria, although bacteria from the Caulobacterales lineage have three paralogs (BglR1-3) that are predicted to control the cognate operons involved in the β-glucoside utilization (Figure S1 in Supplementary Material). The molecular effector of GluR is not known. Based on its close similarity to the predicted BglR repressors that possibly respond to β-glucosides and/or glucose, we propose that GluR dissociates from its DNA sites in response to glucose. In confirmation of our hypothesis, it has been demonstrated that glucose induces expression of the Entner–Doudoroff pathway genes and that the *edd* and *glk* genes are essential for the glucose utilization in *C. crescentus* (Hottes et al., 2004).

# **DISCUSSION**

We used the comparative genomics reconstruction of regulons for analysis of the LacI family of bacterial TFs. This choice was based on the following features. First, the LacI family is large, varied, and broadly distributed in bacteria. Second, proteins from this family have a rigid domain structure and a highly conserved structure of TFBS motifs. This study resulted in a detailed reconstruction of 1281 LacI-TF regulons in 272 bacterial genomes. Most (∼90%) of the analyzed LacI-TF regulons are local, i.e., they control a small number of genes and operons that are involved in only one metabolic pathway. However, some LacI regulons are global, controlling tens to hundreds of genes involved in multiple metabolic pathways. In addition to the reconstruction of previously known global regulons, such as FruR/Cra, PurR, and CcpA, we identified three novel regulators for the central carbohydrate metabolism in α-Proteobacteria, PckR, GluR, and GapR, and reconstructed their corresponding regulons. For two of these global regulons, FruR and GluR, we reconstructed their possible evolutionary histories. Both these TFs likely originated from local regulators during a process of gradual regulon expansion.

A large-scale phylogenetic analysis of LacI-TFs reveals numerous examples of various evolutionary processes for regulators and their regulons including divergent evolution (diversification of TF functions and binding specificities after duplication), convergent evolution (appearance of the same function in distantly related branches of a phylogenetic tree), and formation of paraphyletic groups (origin of novel functions and specificities, noncharacteristic for a given branch of TFs). Below we discuss these evolutionary processes in more detail and provide examples of functional diversification within the LacI-TF family.

The LacI-TF phylogenetic tree demonstrates that some orthologous groups of TFs form branches that are consistent with the taxonomy, regulate orthologous genes, and recognize similar motifs, yet include an internal clade with demonstrated differences in the motif structure, effector specificity, or regulon content, i.e., these TFs form so-called paraphyletic groups. The most interesting example of paraphyly is the branch of ribose repressors RbsR in β- and γ-Proteobacteria that has the purine repressor PurR as an excluded clade. Phylogenetic analysis revealed the presence of PurR orthologs in only three bacterial orders, Enterobacteriales, Pasteurellales, and Vibrionales. The closest PurR paralogs in these groups are ribose repressors (RbsRs) (Figure S1 in Supplementary Material). Most probably, PurR originated by duplication of RbsR in the common ancestor of Enterobacteriales, Pasteurellales, and Vibrionales. The Pseudomonadales and β-Proteobacteria with a single RbsR repressor seemingly feature the ancestral state. The RbsR orthologs from the above three orders retain the ligand specificity but have a slightly modified DNA binding motif, compared to RbsR of Pseudomonadales and β-Proteobacteria (**Figure 8**). On the contrary, PurR retained the motif but changed the ligand and the default state, as it binds DNA in the presence of its ligand, whereas RbsR, like the majority of the LacI-TFs, binds DNA in the absence of the ligand.

This example illustrates a possible way of formation of paraphyletic branches by duplication of an ancestral gene with subsequent conversion of one of the copies, resulting in the origin of a novel function. In the case of RbsR and PurR, duplication and conversion are rather ancient, and we observe only the result of such evolution but cannot observe the intermediate states of the evolutionary process. On the contrary, for the α-glucoside utilization regulon AglR and the trehalose regulon ThuR, such an intermediate state is clearly observable. AglR regulators from Rhizobiales and Rhodobacteriales form a paraphyletic branch with ThuR regulators from Rhizobiales as an excluded clade (Figure S1 in Supplementary Material). Both AglR and ThuR are local regulators, each controlling expression of the regulator gene and one other operon that is divergently transcribed from it. The regulated operons contain homologous genes for kinases and ABC transporters, but non-homologous genes for hydrolases. The TFBS motif for ThuR (natcnAAAnCGnTTTngatt) is different from the one for AglR (nnntcAAAGCGCTTTgannn). Thus, during diversification, ThuR changed both the ligand and motif specificity. For the paraphyletic group CelR in γ-Proteobacteria, the AscG regulator in Enterobacteriales is an excluded clade. In the case of the CelR and AscG regulators, their TFBS motifs and sets of regulated genes are drastically changed after duplication, but the effector specificity (cellobiose-6P) and the target metabolic pathway (cellobiose utilization) are retained.
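The difference between the two consensus motifs quoted above can be checked with a short script. The following is a minimal sketch, not the method used in this study: it assumes that the two 20-bp consensus strings are aligned position by position, that `n` stands for any base, and that a position "conflicts" only when both motifs specify different bases. The function names are hypothetical.

```python
# Complement table; 'N' (any base) is treated as its own complement.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(motif):
    """Return the reverse complement of a consensus motif (upper case)."""
    return "".join(COMPLEMENT[base] for base in reversed(motif.upper()))


def compatible(a, b):
    """Two consensus symbols are compatible if equal or if either is 'N'."""
    return a == "N" or b == "N" or a == b


def is_palindromic(motif):
    """True if the motif is compatible with its own reverse complement."""
    m = motif.upper()
    return all(compatible(x, y) for x, y in zip(m, reverse_complement(m)))


def conflicting_positions(motif_a, motif_b):
    """0-based positions where both motifs specify different bases.

    Assumes the motifs are already aligned and of equal length.
    """
    if len(motif_a) != len(motif_b):
        raise ValueError("motifs must have the same length")
    pairs = zip(motif_a.upper(), motif_b.upper())
    return [i for i, (x, y) in enumerate(pairs) if not compatible(x, y)]


# Consensus strings as quoted in the text.
THUR = "natcnAAAnCGnTTTngatt"
AGLR = "nnntcAAAGCGCTTTgannn"

if __name__ == "__main__":
    print("ThuR palindromic:", is_palindromic(THUR))
    print("AglR palindromic:", is_palindromic(AGLR))
    print("conflicting positions:", conflicting_positions(THUR, AGLR))
```

Under these assumptions both consensus strings are palindromic, and the conflicts fall at two symmetric positions in the half-sites flanking the shared AAA/TTT core.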

The phylogenetic tree for the analyzed LacI-TFs (Figure S1 in Supplementary Material) is patchy, complicating the reconstruction of the evolutionary history. However, in some cases we can observe two or more adjacent branches, each corresponding to one orthologous group. A natural explanation is that these groups appeared as a result of initial duplication of a TF gene followed by diversification of copies. The LacI-TF tree contains multiple examples of monophyletic taxon-specific branches that consist of proteins from a single bacterial lineage. One such branch includes the *Bacteroides* UxaR, UxuR, and KdgR regulators that control the catabolic pathways for galacturonate, glucuronate, and 2-keto-3-deoxygluconate, respectively. Two other monophyletic branches include (i) kojibiose regulator KojR and the unknown sugar regulator Caur\_3448 from *Chloroflexus* and (ii) ribose regulator RbsR and uridine regulator UriR from *Corynebacteria* spp. Previously, a similar situation was observed for the ROK family of sugar-specific regulators in the deeply branched Thermotogales lineage (Kazanov et al., 2013).

Other examples of divergent evolution of regulator specificity are often demonstrated by adjacent branches in the phylogenetic tree. Thus, idonate repressor IdnR in Enterobacteriales and gluconate repressor GntR in multiple lineages of γ-Proteobacteria are the closest paralogs (Figure S1 in Supplementary Material). In the IdnR regulons, the *idnK* and *idnT* genes are the closest homologs of the GntR-regulated genes *gntK* and *gntU*, respectively. Thus, duplication affected not only a TF gene, but also some of the regulated genes. The ability of GntR to recognize IdnR-binding sites upstream of the *idnK* and *idnDOTR* operons in *E. coli* (Tsunedomi et al., 2003b) also confirms a recent duplication of these regulators. Structural similarity of sugar effectors for GntR and IdnR also points to a recent duplication and further specialization of IdnR in Enterobacteriales.

Another scenario of TF diversification is duplication of an ancestral TF gene followed by acquisition of novel regulated genes and, accordingly, new effector specificity. An example is provided by a branch containing fructose (FruR) and sucrose (ScrR) orthologous groups in γ-Proteobacteria. Sucrose is a fructose-containing disaccharide. Because the FruR- and ScrR-regulated genes are functionally and structurally different, the regulator was most probably duplicated alone, and then one copy acquired new regulated genes. The acquisition of a novel regulatory function was coupled with the changes in the cognate TFBS motif. Here, divergent TFs have structurally similar effectors, fructose-1,6-bisphosphate for FruR and sucrose-6-phosphate for ScrR.

Based on the analysis of paraphyletic branches and adjacent monophyletic branches, three main types of the origin of TF with novel functions can be described. The first type is duplication of both the TF gene and a regulated gene or operon followed by diversification, as in the case of ThuR in Rhizobiales. During diversification, the TF and regulated genes change their specificities, some regulated genes may be lost, and novel genes may be included in the regulon. The second type is duplication of a regulator followed by acquisition of novel regulated genes, as for PurR in γ-Proteobacteria or for AscG in Enterobacteriales. The third type is rare, with only one example—the SCO5692 regulon in Streptomycetaceae. In this case, novel specificities for TFs originated without duplication, probably resulting from the loss of regulated genes.

In the process of acquisition of a new function, three characteristics of a TF regulon can be changed: (i) a set of regulated genes, (ii) effector specificity, and (iii) a TFBS motif structure. In most cases of TFs with a novel function, we observed the change of at least two of these characteristics. Change of all three characteristics is rarer and is usually observed in TFs from deeply branched lineages of bacteria such as Thermotogales. Most probably, in these cases change of all three characteristics is a result of a long evolutionary history. Change of only one of these characteristics is observed for only recently duplicated TFs, for which the diversification process is not yet complete.

In summary, the obtained extensive dataset for the LacI-TF family provides numerous examples of various evolutionary processes for regulators and their regulons. These data are publicly available in the RegPrecise database within the LacI family collection, which will enable further detailed analysis of signature residues in both DNA- and ligand-binding domains of regulators and establishment of the correlations between these residues and specificities toward the DNA motifs and molecular effectors they recognize.

# **AUTHOR CONTRIBUTIONS**

Dmitry A. Rodionov, Pavel S. Novichkov, and Mikhail S. Gelfand conceived and designed the research project. Dmitry A. Ravcheev, Dmitry A. Rodionov, and Mikhail S. Gelfand wrote the manuscript. Dmitry A. Ravcheev, Dmitry A. Rodionov, Matvei S. Khoroshkin, Olga N. Laikova, Olga V. Tsoy, Natalia V. Sernova, and Svetlana A. Petrova performed comparative genomics analysis to reconstruct TF regulons. Dmitry A. Rodionov provided the quality control of annotated regulons in the RegPrecise database. Matvei S. Khoroshkin analyzed statistical properties of TF regulons. All authors read and approved the final manuscript.

# **ACKNOWLEDGMENTS**

This work was supported by the Russian Foundation for Basic Research (14-04-00870, 14-04-91154) and by the Russian Academy of Sciences via the programs "Molecular and Cellular Biology" and "Living Nature." This research was also partially supported by the Genomic Science Program (GSP), Office of Biological and Environmental Research (OBER), and U.S. Department of Energy (DOE) and is a contribution of the Pacific Northwest National Laboratory (PNNL) Foundational Scientific Focus Area. Preliminary reconstruction of regulons was done by students of the Spring 2013 Bioinformatics course at the Faculty of Bioengineering and Bioinformatics in the Lomonosov Moscow State University as a part of regular coursework.

# **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2014.00294/abstract

# **REFERENCES**


in proteobacterial genomes. *Nucleic Acids Res.* 29, 774–782. doi: 10.1093/nar/29.3.774


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 06 April 2014; paper pending published: 12 May 2014; accepted: 28 May 2014; published online: 11 June 2014.*

*Citation: Ravcheev DA, Khoroshkin MS, Laikova ON, Tsoy OV, Sernova NV, Petrova SA, Rakhmaninova AB, Novichkov PS, Gelfand MS and Rodionov DA (2014) Comparative genomics and evolution of regulons of the LacI-family transcription factors. Front. Microbiol. 5:294. doi: 10.3389/fmicb.2014.00294*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Ravcheev, Khoroshkin, Laikova, Tsoy, Sernova, Petrova, Rakhmaninova, Novichkov, Gelfand and Rodionov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Connecting lignin-degradation pathway with pre-treatment inhibitor sensitivity of *Cupriavidus necator*

*Wei Wang<sup>1</sup>\*†, Shihui Yang<sup>2</sup>\*†, Glendon B. Hunsinger<sup>2</sup>, Philip T. Pienkos<sup>2</sup> and David K. Johnson<sup>1</sup>*

*<sup>1</sup> National Renewable Energy Laboratory, Biosciences Center, Golden, CO, USA*

*<sup>2</sup> National Renewable Energy Laboratory, National Bioenergy Center, Golden, CO, USA*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Harold J. Schreier, University of Maryland Baltimore County, USA Claudio Avignone-Rossa, University of Surrey, UK*

#### *\*Correspondence:*

*Wei Wang and Shihui Yang, 15013 Denver West Parkway, Golden, 80401 CO, USA e-mail: wei.wang@nrel.gov; shihui.yang@nrel.gov*

*†These authors have contributed equally to this work.*

To produce lignocellulosic biofuels economically, the complete release of monomers from the plant cell wall components, cellulose, hemicellulose, and lignin, through pre-treatment and hydrolysis (both enzymatic and chemical), and the efficient utilization of these monomers as carbon sources, is crucial. In addition, the identification and development of robust microbial biofuel production strains that can tolerate the toxic compounds generated during pre-treatment and hydrolysis is also essential. In this work, *Cupriavidus necator* was selected due to its capabilities for utilizing lignin monomers and producing polyhydroxybutyrate (PHB), a bioplastic as well as an advanced biofuel intermediate. We characterized the growth kinetics of *C. necator* in pre-treated corn stover slurry as well as individually in the presence of 11 potentially toxic compounds in the saccharified slurry. We found that *C. necator* was sensitive to the saccharified slurry produced from dilute acid pre-treated corn stover. Five out of 11 compounds within the slurry were characterized as toxic to *C. necator,* namely ammonium acetate, furfural, hydroxymethylfurfural (HMF), benzoic acid, and p-coumaric acid. Aldehydes (e.g., furfural and HMF) were more toxic than the acetate and the lignin degradation products benzoic acid and p-coumaric acid; furfural was identified as the most toxic compound. Although toxic to *C. necator* at high concentration, ammonium acetate, benzoic acid, and p-coumaric acid could be utilized by *C. necator* with a stimulating effect on *C. necator* growth. Consequently, the lignin degradation pathway of *C. necator* was reconstructed based on genomic information and literature. The efficient conversion of intermediate catechol to downstream products of cis,cis-muconate or 2-hydroxymuconate-6-semialdehyde may help improve the robustness of *C. necator* to benzoic acid and p-coumaric acid as well as improve PHB productivity.

**Keywords:** *Cupriavidus necator***, pre-treatment inhibitor, saccharified slurry, deacetylation, lignin degradation, biofuel, polyhydroxybutyrate (PHB), genomics**

## **INTRODUCTION**

Lignocellulosic biomass is considered a renewable and sustainable source for energy production. The great environmental, energy security, and economic benefits of biofuels have driven research on biomass throughout the world, resulting in several pilot and demonstration scale projects, based primarily on cellulosic ethanol. One of the leading routes for cellulosic ethanol production is based on the deconstruction of biomass by chemical pre-treatment and enzymatic saccharification to a monomeric sugar solution termed a saccharified slurry (hereafter referred to as slurry), followed by fermentation of the sugars to ethanol. However, the physico-chemical properties of ethanol limit its further penetration into the current petroleum-based transportation fuel infrastructure (Serrano-Ruiz and Dumesic, 2011). The lower energy density of ethanol, due to its higher oxygen content compared to conventional petroleum hydrocarbons, is a significant disadvantage to using bioethanol as a fuel alternative.

Current transportation fuel infrastructure has been established for hydrocarbon-based petroleum. Many advanced biofuels under development exhibit advantages such as higher energy densities, longer carbon chains and lower oxygen numbers. For example, there are some fuels derived from fatty acids with carbon numbers in the range from C12 to C22 that can be upgraded to high-energy-density hydrocarbons via hydrotreating. Therefore, development of hydrocarbon fuels, which are compatible with the current fuel infrastructure as replacements or blendstocks for gasoline, jet, and diesel, has recently received great attention.

The primary routes leading to hydrocarbon synthesis from biomass are thermochemical, biochemical, and hybrid approaches which employ elements of both. Bacteria, yeasts, or fungi can naturally synthesize fatty acids, isoprenoids, and polyalkanoates for energy storage. Although these compounds are mostly exploited in pharmaceutical, nutritional, and packaging sectors, they also have great potential for production of hydrocarbon fuels (Zhang et al., 2011). However, several obstacles need to be overcome for future hydrocarbon production at the commercial scale. Firstly, although compounds such as acetate, furfural, and HMF that are produced during chemical pre-treatments (e.g., dilute acid pre-treatment) have been investigated extensively due to their toxicity to ethanologens (Olsson and Hahn-Hägerdal, 1996; Liu et al., 2004, 2005, 2008; Gorsich et al., 2006; Endo et al., 2008; Franden et al., 2009, 2013; Allen et al., 2010; Bowman et al., 2010; Yang et al., 2010a,b, 2012b; He et al., 2012; Ask et al., 2013; Bajwa et al., 2013; Wilson et al., 2013), there are few reports on the effect of these compounds on hydrocarbon producers (Huang et al., 2012). Various mechanisms and approaches have been proposed and applied for decreasing inhibition in cellulosic hydrolysates (Olsson and Hahn-Hägerdal, 1996; Larsson et al., 1999; Zaldivar and Ingram, 1999; Zaldivar et al., 1999; Petersson et al., 2006; Endo et al., 2008; Franden et al., 2009, 2013; Mills et al., 2009; Yang et al., 2010a,b, 2012a,b; Ask et al., 2013; Bajwa et al., 2013; Iwaki et al., 2013), but strategies for strain improvement have yet to be applied (Dunlop, 2011; Dunlop et al., 2011; Kang and Chang, 2012).

Moreover, the efficient utilization of lignin, a major plant cell wall component, would greatly improve the economics of advanced biofuel production. Because of its highly recalcitrant aromatic structure, lignin is seen largely as an impediment to biochemical processes: it reinforces the cellulose/hemicellulose matrix and thereby complicates the saccharification process (Zeng et al., 2014). Owing to the high energy state of its aromatic ring structures, lignin could have great value as a substrate for energy production, but so far very limited opportunities for microbial transformation of lignin to high value-added fuels or fuel intermediates have been identified. Some bacteria are known to degrade lignin monomers via the β-ketoadipate pathway (Harwood and Parales, 1996). Currently lignin is almost untouched in lignocellulosic ethanol fermentations and remains behind in the fermentation residue after distillation. These residues are mostly burned to provide power in current cellulosic ethanol production processes, which places a low value on this material. To reach the goal of utilizing lignin as a carbon source for advanced biofuel production, the composition of lignin degradation compounds as well as the toxic effects of these compounds on hydrocarbon-producing microorganisms should be investigated.

In this study we are seeking a biological understanding of the impact of model inhibitors, which include lignin degradation products as well as furans and acetate, on hydrocarbon-producing microorganisms. *Cupriavidus necator* (Syn. *Alcaligenes eutrophus, Ralstonia eutropha*) has been extensively studied for production of polyhydroxybutyrate (PHB), which consists of a C4 repeating unit that can be thermally depolymerized and then decarboxylated to propene (Fischer et al., 2011; Pilath et al., 2013), an intermediate, which can be upgraded to hydrocarbon fuels via commercial oligomerization technologies. *C. necator* has been reported to be able to utilize lignin monomers as a carbon source (Pérez-Pantoja et al., 2008).

The genome sequence of *C. necator* H16 has been published with *in silico* genome modeling and a developed genetics system (Pohlmann et al., 2006; Park et al., 2011; Brigham et al., 2012). In addition, several transcriptomic studies have recently been reported (Peplinski et al., 2010; Brigham et al., 2012), and the genome sequences for a number of other *Cupriavidus* spp. are also now available (Amadou et al., 2008; Pérez-Pantoja et al., 2008; Janssen et al., 2010; Lykidis et al., 2010; Poehlein et al., 2011; Cserhati et al., 2012; Hong et al., 2012; Van Houdt et al., 2012; Li et al., 2013). This information will facilitate future comparative genomics and systems biology studies to develop *C. necator* H16 as a robust and metabolically diverse hydrocarbon-intermediate production strain. Genomics is applied in this study to explore the metabolic pathways related to lignin utilization and response to toxic compounds in slurries, which will provide perspectives for strain metabolic engineering toward future economic hydrocarbon production using lignin.

# **MATERIALS AND METHODS**

### **STRAINS AND MEDIA**

The strain used in this study is a glucose-utilizing mutant of *C. necator* H16 (wild-type H16 is not able to metabolize glucose) (Orita et al., 2012), *C. necator* 11599, which was purchased from the NCIMB culture collection. It is routinely cultured in LB at 37°C. A minimal medium recipe was selected for this study (Cavalheiro et al., 2009). Specifically, the defined minimal medium for *C. necator* (per liter, pH 6.8) was: 10 g glucose, 1.0 g (NH4)2SO4, 1.5 g KH2PO4, 9 g Na2HPO4·12H2O, 0.2 g MgSO4·7H2O, and 1.0 mL trace element solution. The trace element solution contained (per liter): 10 g FeSO4·7H2O, 2.25 g ZnSO4·7H2O, 0.5 g MnSO4·5H2O, 2 g CaCl2·2H2O, 1 g CuSO4·5H2O, 0.23 g Na2B4O7·10H2O, 0.1 g (NH4)6Mo7O24, and 10 mL 35% HCl.
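For bench work, the per-liter recipe above has to be rescaled to the batch volume actually prepared. The snippet below is a minimal sketch of that arithmetic only; the dictionary layout and the function name `scale_recipe` are illustrative assumptions, and pH adjustment and sterilization are not covered.

```python
# Minimal medium components per liter, transcribed from the recipe in the text.
MINIMAL_MEDIUM_PER_L = {
    "glucose": (10.0, "g"),
    "(NH4)2SO4": (1.0, "g"),
    "KH2PO4": (1.5, "g"),
    "Na2HPO4·12H2O": (9.0, "g"),
    "MgSO4·7H2O": (0.2, "g"),
    "trace element solution": (1.0, "mL"),
}


def scale_recipe(recipe_per_l, volume_l):
    """Scale a per-liter recipe linearly to the requested batch volume (L)."""
    if volume_l <= 0:
        raise ValueError("volume must be positive")
    return {
        name: (amount * volume_l, unit)
        for name, (amount, unit) in recipe_per_l.items()
    }


if __name__ == "__main__":
    # Example: amounts needed for a 0.5 L batch.
    for name, (amount, unit) in scale_recipe(MINIMAL_MEDIUM_PER_L, 0.5).items():
        print(f"{name}: {amount:g} {unit}")
```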

#### **PRODUCTION OF SACCHARIFIED SLURRY AND MOCK MEDIA**

A deacetylated saccharified slurry, which was produced from the modified sulfuric acid pre-treatment and enzymatic hydrolysis of corn stover including an added deacetylation step before pre-treatment, was used in this study (Chen et al., 2012). The composition of the mock sugar media simulating the saccharified slurry is summarized in **Table 1**. The composition is based on the composition of the saccharified slurry in fermentation media at the level of 20% total solids.

### **GROWTH OF** *C. NECATOR* **ON SACCHARIFIED SLURRY**

*Cupriavidus necator* was first grown in 5 mL of LB in 125 mL baffled flasks at 200 rpm and 37°C. After 1 day, a 10% inoculum was added to 50 mL of fermentation media in a 250 mL flask and incubated in a shaker at 37°C and 180 rpm for 4 days. The fermentation media contained either mock sugar slurry as shown in **Table 1** or saccharified slurry supplemented with tryptone (10 g/L) and yeast extract (5 g/L) as nutrients. Mock slurry was added at a level to achieve the same sugar concentrations (e.g., the glucose concentration in the 2X-diluted mock medium was 50 g/L). All experiments were run in duplicate.
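The dilution arithmetic in this setup can be written out explicitly. The sketch below rests on two assumptions that the text does not state directly: the undiluted glucose concentration of 100 g/L is inferred from the quoted 50 g/L at 2-fold dilution, and the 10% inoculum is taken as 10% (v/v) of the 50 mL medium volume. The function names are hypothetical.

```python
def diluted_concentration(undiluted_g_per_l, fold):
    """Concentration after an n-fold dilution."""
    if fold <= 0:
        raise ValueError("dilution fold must be positive")
    return undiluted_g_per_l / fold


def inoculum_volume_ml(medium_ml, fraction):
    """Inoculum volume, expressed as a fraction of the medium volume."""
    return medium_ml * fraction


if __name__ == "__main__":
    undiluted = 100.0  # g/L glucose; inferred, not stated in the text
    for fold in (1, 2, 4):
        conc = diluted_concentration(undiluted, fold)
        print(f"{fold}-fold dilution: {conc:g} g/L glucose")
    print(f"inoculum: {inoculum_volume_ml(50.0, 0.10):g} mL")
```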

### **PHB ANALYSIS**

The PHB content of the bacterial cells was determined by a quantitative method that used HPLC analysis to measure the crotonic acid formed by acid-catalyzed depolymerization of PHB (Karr et al., 1983). Cell mass samples were freeze-dried before analysis. PHB-containing dried bacterial cells (15–50 mg) were then digested in 96% H2SO4 (1 mL) at 90°C for 1 h. The reaction vials were then cooled on ice, after which ice-cold 0.01 N H2SO4 (4 mL) was added, followed by rapid mixing. The samples were further diluted 20- to 150-fold with 0.01 N H2SO4 before analysis by HPLC.
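The back-calculation from a measured crotonic acid concentration to PHB content follows from the volumes given above. The function below is a minimal sketch under stated assumptions, not the calibration used in the study: the digest volume is taken as 5 mL (1 mL acid plus 4 mL diluent, ignoring volume contraction), and `conversion_yield` is a hypothetical parameter for the fraction of PHB recovered as crotonic acid, which in practice would come from PHB standards. Because the PHB repeating unit and crotonic acid share the formula C4H6O2, the two masses are equal at complete conversion.

```python
def phb_content_percent(
    crotonic_mg_per_ml,
    dilution_fold,
    dry_cell_mg,
    digest_volume_ml=5.0,   # assumed: 1 mL H2SO4 + 4 mL 0.01 N H2SO4
    conversion_yield=1.0,   # assumed: fraction of PHB recovered as crotonic acid
):
    """Estimate PHB as a percentage of dry cell mass from an HPLC reading."""
    if dry_cell_mg <= 0 or not 0 < conversion_yield <= 1:
        raise ValueError("invalid sample mass or conversion yield")
    # Total crotonic acid in the digest, corrected for the pre-HPLC dilution.
    crotonic_mg = crotonic_mg_per_ml * dilution_fold * digest_volume_ml
    # PHB repeating unit and crotonic acid are both C4H6O2 (equal mass).
    phb_mg = crotonic_mg / conversion_yield
    return 100.0 * phb_mg / dry_cell_mg


if __name__ == "__main__":
    # Illustrative numbers only: 0.02 mg/mL measured after a 100-fold
    # dilution of the digest from 20 mg of dried cells.
    print(f"{phb_content_percent(0.02, 100, 20.0):.1f}% PHB")
```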

The concentration of crotonic acid was measured at 210 nm using an HPLC equipped with a photodiode array detector (Agilent 1100, Agilent Technologies, Palo Alto, CA). A Rezex RFQ Fast Acids column (100 × 7.8 mm, 8 μm particle size, Phenomenex, Torrance, CA) and Cation H+ guard column (BioRad Laboratories, CA) operated at 85°C were used to separate the crotonic acid present in the reaction solutions. The eluent was 0.01 N H2SO4 at a flow rate of 1.0 mL min<sup>−1</sup>. Samples and crotonic acid standards were filtered through 0.45 μm pore size nylon membrane syringe filters (Pall Corp., NY) prior to injection onto the column. The HPLC was controlled and data were analyzed using Agilent ChemStation software (Rev.B.03.02).

### **CHARACTERIZATION OF POTENTIAL INHIBITORS**

The selection of chemicals for sensitivity assays was based on GC-MS and LC-MS analyses of saccharified slurry performed at NREL as well as ICP-MS analyses carried out by Huffman Labs (Golden, CO). The compounds identified with high concentrations in the slurry and their potential derivatives were selected for further investigation. The compounds included ammonium (added to neutralize pre-treated corn stover) with two common anions of acetate (released by hydrolysis of hemicelluloses) and sulfate (from the sulfuric acid pre-treatment); sugar degradation products furfural and HMF; lignin monomers vanillin, coumaric acid, ferulic acid, and 4-hydroxybenzaldehyde as well as benzoic acid, a common intermediate from lignin monomer (coumarate and cinnamate) degradation (**Figure 5**). We also included products from the oxidation of lignin monomers, vanillic acid and 4-hydroxybenzoic acid, in a similar concentration range to their aldehyde forms (**Table 2**). The 1X concentrations for some chemicals (e.g., benzoic acid) in the slurry used for testing are based on concentrations used in a previous study.

### **BIOSCREEN C HIGH THROUGHPUT TOXICITY ASSAY**

The high-throughput Bioscreen C assay was carried out as reported previously (Franden et al., 2009, 2013). Briefly, *C. necator* cells were revived from overnight LB culture with OD600nm adjusted to 3.0 using minimal medium. This cell suspension was used as seed culture to inoculate Bioscreen C plates containing 290 μL minimal medium per well at an initial OD600nm of 0.1. Growth was then monitored using the Bioscreen C instrument (GrowthCurves USA, NJ) with three technical replicates. The experiments were repeated at least two times.
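The seed volume implied by these numbers follows from a simple mass balance. The sketch below assumes that OD600nm scales linearly with cell density over this range and that the 290 μL refers to the medium in the well before the seed culture is added; neither point is stated explicitly in the text, and the function name is hypothetical.

```python
def seed_volume_ul(seed_od, target_od, medium_ul):
    """Seed volume v such that seed_od * v = target_od * (medium_ul + v)."""
    if not 0 < target_od < seed_od:
        raise ValueError("target OD must be positive and below the seed OD")
    return target_od * medium_ul / (seed_od - target_od)


if __name__ == "__main__":
    v = seed_volume_ul(seed_od=3.0, target_od=0.1, medium_ul=290.0)
    print(f"seed volume: {v:.1f} uL, final well volume: {290.0 + v:.1f} uL")
```

With the values quoted above, this works out to about 10 μL of seed culture per well, for a final well volume of about 300 μL.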

*Cupriavidus necator* grown in the absence of potential inhibitory compounds was used as the control, and inhibition studies utilized cultures of *C. necator* challenged with different concentrations of each compound ranging from 0.1- to 10-fold (0.1X to 10X, **Table 2**). Stock solutions of compounds at 10X concentrations were prepared by dissolving the compounds to be tested in the minimal medium. These stock solutions were then diluted in minimal medium for testing at lower concentrations. For certain compounds with low aqueous solubility, incubation at 55–60°C in a hot dry bath for several hours was needed for complete dissolution. The pH of the stock solutions was adjusted to the desired point of 6.8 using ammonium hydroxide (NH4OH) or sulfuric acid (H2SO4) and then filter sterilized before using.

# **GENOMIC INVESTIGATION FOR LIGNIN DEGRADATION PATHWAY RECONSTRUCTION**

To identify the enzymes related to lignin degradation, the protein sequences of *C. necator* H16 were extracted and reannotated functionally. Briefly, 6626 protein sequences were downloaded from NCBI (Genbank#: AM260479) and imported into CLC Genomics Workbench (V5.5) as the reference protein sequences for Blast search. In addition, the protein sequences were also imported into Blast2GO for the functional annotation (Gotz et al., 2008). The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were then extracted as well as the information of euKaryotic Orthologous Groups (KOG), enzyme code, and the reaction substrate(s) and product(s). The potential homologous gene(s) in *C. necator* H16 were identified by re-iterated BlastP searches. The information on protein products and conserved domains was examined, and the pathway was reconstructed with the enzyme and pathway information from a literature search.

### *Gas chromatographic analysis*

Analysis of samples was performed on an Agilent 7890 GC equipped with a 5975 MS (Agilent Technologies, Palo Alto, CA). Sample compounds were separated using a 30 m × 0.25 mm × 0.25 μm DB-FFAP column (Agilent). HP MSD Chemstation software (Agilent) equipped with NIST database Rev. D.03.00 was used to determine the identity of the unknown compounds found within the samples.

Each sample was placed on an auto-sampler (Agilent) and injected at a volume of 1 μL into the GC-MS (Agilent). The GC-MS method consisted of a front inlet temperature of 250°C, MS transfer line temperature of 280°C, and a scan range from 35 to 550 m/z. A starting temperature of 50°C was held for 1.5 min and then ramped at 11°C/min to a temperature of 165°C with no hold time, then continued at a ramped rate of 35°C/min to 250°C and held for 9.617 min. The method resulted in a run time of 24 min for each sample.
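The total run time of an oven program is the sum of its hold and ramp segments, which gives a quick consistency check on a method description. The helper below is a minimal sketch; the tuple layout `(rate, target temperature, hold time)` and the function name are illustrative assumptions rather than any instrument software format.

```python
def run_time_min(initial_temp_c, initial_hold_min, ramps):
    """Total oven program time in minutes.

    ramps: sequence of (rate_c_per_min, target_temp_c, hold_min) tuples,
    applied in order starting from initial_temp_c.
    """
    total = initial_hold_min
    temp = initial_temp_c
    for rate, target, hold in ramps:
        if rate <= 0:
            raise ValueError("ramp rate must be positive")
        total += abs(target - temp) / rate + hold
        temp = target
    return total


# Oven program as described in the text.
OVEN_RAMPS = [
    (11.0, 165.0, 0.0),    # 11 °C/min to 165 °C, no hold
    (35.0, 250.0, 9.617),  # 35 °C/min to 250 °C, hold 9.617 min
]

if __name__ == "__main__":
    print(f"run time: {run_time_min(50.0, 1.5, OVEN_RAMPS):.2f} min")
```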

# **RESULTS AND DISCUSSIONS**

# **GROWTH KINETICS OF** *C. NECATOR* **IN THE SACCHARIFIED SLURRY**

*Cupriavidus necator* did not grow in the original deacetylated saccharified slurry (**Figure 1A**) or even in the 2-fold diluted saccharified slurry (**Figure 1B**), though it was able to grow in media containing the mock slurry. The sugars provided by the mock slurry gave *C. necator* superior growth over LB alone at both 1-fold concentration and 2-fold dilution, with faster growth and a higher final OD600nm value. This complete inhibition of growth in real slurry suggests that *C. necator* is not tolerant to the combination of all toxic compounds present in the saccharified slurry.

### **KINETICS OF** *C. NECATOR* **IN THE SACCHARIFIED SLURRY AFTER ACTIVATED CARBON (AC) TREATMENT**

It is well-known that acid pre-treatment of corn stover can generate various kinds of potential toxic inhibitors including organic acids and aldehydes, and can also add to the toxicity through the build up of inorganic salts from the sulfuric acid used for pre-treatment and ammonium hydroxide added for neutralization (e.g., organic and inorganic acids of levulinic acid, vanillic acid, hydroxybenzoic acid, sulfuric acid, and aldehydes of HMF and furfural). Furfural is generally considered the most potent inhibitor to various microbial catalysts such as *E. coli*, *Z. mobilis*, and yeast (Zaldivar et al., 1999; Liu et al., 2004, 2005, 2008; Gorsich et al., 2006; Allen et al., 2010; Bowman et al., 2010; He et al., 2012; Huang et al., 2012; Franden et al., 2013). The maximum amount of furfural was 2 g/L in the slurries in this study. Shake flask growth experiments showed that furfural greatly inhibited cell viability when the concentration was above 2 g/L (data not shown).

*The concentration (mM) following the chemical abbreviation is the concentration in the slurry calculated from analytical results of GC-MS, LC-MS, or ICP-MS.*

**Table 3 | The compositions of deacetylated saccharified slurry before and after activated carbon (AC) treatment.**

*ND, Not detected; the detection limit for furfural is 2.6 mM and for HMF is 2.8 mM.*

A chemical treatment process using activated carbon (AC) was applied in this study in order to improve growth by removing potential inhibitors such as furfural from the slurry. AC was added to the slurry at 0.1 g/mL loading and incubated for 2 h at 130 rpm, 24°C. The compositions of saccharified slurry before and after AC-treatment indicated that all the furfural and about 30% of the acetate were removed by the AC treatment (**Table 3**).

The impact of AC treatment on the toxicity of diluted saccharified slurry (4-fold dilution), which provided about 25 g/L of glucose to the growth media, was tested. As shown in **Figure 2A**, untreated saccharified slurry was very toxic to *C. necator,* while AC-treated saccharified slurry was much less inhibitory to growth, although it was still inhibitory compared to the mock slurry control. It is noteworthy that the consumption of glucose from AC-treated slurry was far slower than that from mock slurry and at 24 h, only about 25% of the total glucose was removed from the culture medium compared to 85% in the mock slurry (**Figure 2B**). This indicates that inhibitors remained in AC-treated slurry although furfural was completely removed. The remaining inhibitors are either unidentified residual components in the slurry (other than furfural) or components released from pre-treated corn stover during saccharification. It is noteworthy that besides the removal of all furfural by AC-treatment, our preliminary data indicated that a significant fraction of the other potentially toxic compounds such as acetic acid (**Table 3**) and lignin degradation products (e.g., vanillin, 4-hydroxybenzaldehyde, and p-coumaric acid) were also removed by AC-treatment. The impact of these individual toxic compounds in the saccharified slurry will be evaluated in the following section.

The cell yield from the AC-treated slurry (4-fold diluted, 4X) was about one third that of the culture grown on the mock slurry (4X), with similar amounts of PHB accumulated in the cells from both growth conditions 96 h post-inoculation (**Figure 2C**). However, when 2-fold diluted slurry was used, increased inhibition led to a lower cell mass and accordingly a lower PHB yield, even with the carbon-treated slurry (**Figure 2C**). All these results indicate that the toxicity of the biomass-derived slurry had a major impact on PHB production. Moreover, the low cell viability on 2-fold diluted slurry indicates that the inhibitory effect on cells is likely due to the combined contribution of several inhibitors besides furfural. Although the efficacy of the AC treatment is obvious in terms of cell growth, it only partially mitigated the inhibition. Identification of the other inhibitors in the AC-treated slurry is needed to help eliminate or mitigate the observed inhibition.

#### **INHIBITOR SENSITIVITY INVESTIGATION USING THE HIGH-THROUGHPUT BIOSCREEN C ASSAY**

To systematically explore the impact of inhibitors in the saccharified slurry on *C. necator*, a high-throughput Bioscreen C growth assay was performed using growth medium augmented with the potentially toxic compounds that had been identified, at the concentration ranges discussed above (**Table 2**). Briefly, growth curves were generated by subtracting the OD readings of the background wells, which contained blank medium only, from the OD readings of the test wells in the Bioscreen C. Typical growth curves with *C.*

**Table 4 | The growth rates and responses of** *C. necator* **to various furfural concentrations (mM) supplemented in the minimal medium.**


**Table 5 | The IC50 values and the ratio of the IC50 value to Conc. value (the concentration of the toxic compound identified in the saccharified slurry, mM) for** *C. necator.*


**Table 6 | Response values of** *C. necator* **to 11 potentially toxic compounds identified in the saccharified slurry at different concentrations (mM) supplemented in the minimal medium.**


*The values in bold font indicate stimulation of growth by the compound at the corresponding concentration.*

**concentrations, 5 days post inoculation.** R1, R2, and R3 are three technical replicate wells in the Bioscreen C honeycomb plate.

*necator* grown with varied concentrations of furfural are shown in **Figure 3**. The high-throughput nature of this assay allowed for utilization of both biological replicates on multiple plates and technical replicates on the same plate. Generally speaking, correlation between growth curves in both technical and biological replicates was quite high (data not shown), emphasizing the power of this method. The results were consistent with previous shake flask experiments showing that furfural greatly

inhibited cell growth when the concentration was above 2 g/L (ca. 20 mM).
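The text moves between mass concentrations (g/L) and molar concentrations (mM) for furfural. The following minimal sketch shows the conversion, assuming a molar mass computed from standard atomic masses for C5H4O2; the function names are illustrative and not part of the original study.

```python
# Molar mass of furfural (C5H4O2) from standard atomic masses, about 96.08 g/mol.
FURFURAL_G_PER_MOL = 5 * 12.011 + 4 * 1.008 + 2 * 15.999


def g_per_l_to_mm(conc_g_per_l, molar_mass):
    """Convert a mass concentration (g/L) to a molar concentration (mM)."""
    return conc_g_per_l / molar_mass * 1000.0


def mm_to_g_per_l(conc_mm, molar_mass):
    """Convert a molar concentration (mM) to a mass concentration (g/L)."""
    return conc_mm / 1000.0 * molar_mass


# 2 g/L furfural corresponds to roughly 20 mM, as stated in the text.
furfural_threshold_mm = g_per_l_to_mm(2.0, FURFURAL_G_PER_MOL)
# 12.6 mM furfural corresponds to roughly 1.2 g/L.
furfural_slurry_g_per_l = mm_to_g_per_l(12.6, FURFURAL_G_PER_MOL)
```

The same two helpers apply to any of the other compounds once the appropriate molar mass is supplied.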

Growth rates (μ, in h<sup>−1</sup>) for each growth curve were then calculated as described previously (Franden et al., 2009, 2013). The response values, given as the percentage of the growth rate compared with the control in the absence of supplemented inhibitor, were then calculated for each concentration (**Table 4**). The response values were then used to determine the

concentration of inhibitor (IC50) that resulted in 50% growth compared to the control. In this study, the IC50 values of the different compounds were compared to identify the top inhibitors in the slurries. This approach has frequently been used as a general toxicity indicator for potential inhibitors (Franden et al., 2013). IC50 values for 4-hydroxybenzaldehyde, vanillin, ferulic acid, 4-hydroxybenzoic acid, and vanillic acid were above the highest values tested, and the IC50 values for the remaining compounds are listed in **Table 5**. The growth of *C. necator* was inhibited by 5 out of the 11 compounds tested. The aldehydes, furfural and HMF, were the most toxic compounds in the slurry. The IC50 value of 9 mM for furfural was lower than the concentration detected in the saccharified slurry (12.6 mM or 1.2 g/L). Although benzoic acid had the lowest IC50 value, its concentration in the slurry was only 0.037 mM, less than 10% of the IC50 value, making it the least toxic component in the slurry. The lignin degradation products were minimally toxic in the relevant concentration range, except for p-coumaric acid, which had an IC50 value about 4 times higher than its concentration in the saccharified slurry. Ammonium acetate was found to be more toxic than ammonium sulfate (**Table 5**).
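The calculation chain described above (background subtraction, growth rate, response value, IC50) can be sketched as follows. This is only an illustration under stated assumptions: the growth rate is taken as the least-squares slope of ln(OD) versus time over readings already restricted to exponential phase, and the IC50 is found by linear interpolation between the two tested concentrations that bracket a 50% response. The method of Franden et al. may differ in detail, and all function names are hypothetical.

```python
import math


def growth_rate(times_h, od_test, od_blank):
    """Estimate the specific growth rate (per hour) from one growth curve.

    The blank (medium-only) reading is subtracted from each test reading,
    then the slope of ln(OD) versus time is fitted by least squares.
    Assumption: the readings passed in cover exponential phase only.
    """
    points = [(t, math.log(od - blank))
              for t, od, blank in zip(times_h, od_test, od_blank)
              if od - blank > 0]
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    return sxy / sxx


def response_percent(mu_treated, mu_control):
    """Growth rate with inhibitor as a percentage of the control growth rate."""
    return 100.0 * mu_treated / mu_control


def ic50(concentrations, responses):
    """Interpolate the concentration at which the response falls to 50%.

    Returns None when the response never drops to 50% within the tested
    range, i.e. the IC50 is above the highest concentration tested.
    """
    pairs = sorted(zip(concentrations, responses))
    for (c_lo, r_lo), (c_hi, r_hi) in zip(pairs, pairs[1:]):
        if r_lo >= 50.0 >= r_hi and r_lo != r_hi:
            return c_lo + (r_lo - 50.0) * (c_hi - c_lo) / (r_lo - r_hi)
    return None
```

A response above 100% corresponds to growth stimulation, and a return value of `None` from `ic50` corresponds to the compounds whose IC50 was above the highest concentration tested.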

A few key points, however, must be made that are not captured in **Table 4**. One major observation is that some of the compounds found in saccharified slurry stimulate growth by as much as 50% (**Table 6**). Growth stimulation with some of these compounds (e.g., ammonium sulfate, ammonium acetate, benzoic acid, and coumaric acid) was seen at low concentrations, but these compounds became inhibitory at higher concentrations. This indicated that *C. necator* might utilize these compounds as supplemental carbon or nitrogen sources at low levels. The utilization of acetate by *C. necator* in shake flask experiments (data not shown) supports the hypothesis that the higher response values reflect growth stimulation through utilization of the supplemented substrates (**Figure 2B**). This is consistent with a previous report that addition of acetate as a supplementary substrate improved cell growth and PHB production in *C. necator* DSMZ 545 (Sharifzadeh Baei et al., 2009). The combination of growth stimulation by some compounds and inhibition by others complicated analysis of the impact of saccharified slurry on growth and productivities and should be further explored in the future.

An additional point concerning the bacterial toxicity profiles is that they can provide additional information that could be used to understand microbial physiology and to propose genetic targets for metabolic engineering. For example, as the concentration of benzoic acid increased and the response changed from growth stimulation to inhibition, the color of the culture medium also became darker (**Figure 4**). This color change was not seen

in un-inoculated control wells. The availability of the *C. necator* genome sequence will facilitate our understanding of this phenomenon using a genomics approach that will be explained in the following section.

#### *C. NECATOR* **LIGNIN-DEGRADATION PATHWAY RECONSTRUCTION**

*Cupriavidus necator* is capable of utilizing lignin monomers and recently a *Cupriavidus* sp. (*C. basilensis* B-8) has been reported to be able to utilize kraft lignin (Shi et al., 2013). With that information in hand, we then set out to reconstruct the lignin degradation pathway of *C. necator* based on genome information and literature reports to help understand the bottleneck for further development of *C. necator* as a lignocellulosic biofuel-production strain (**Figure 5**).

Although the enzymatic reactions from lignin to lignin monomers and most reactions from lignin monomers to the metabolic intermediates are not yet fully revealed, homologous genes encoding the enzymes BenA, B, C, and D, which would convert benzoate to catechol, were identified in *C. necator*. When the concentration of the supplemented benzoate is lower than the concentration characterized in the slurry, catechol will be further converted into cis,cis-muconate or 2-hydroxymuconate-6-semialdehyde through two different metabolic pathways, leading to the accumulation of the key intermediate acetyl-CoA for PHB production (**Figure 5**). A previous study indicated that catechol accumulated in *Pseudomonas mendocina* can be converted into 1,2-benzoquinone (Parulekar and Mavinkurve, 2006), which caused the culture medium to turn dark and at the same time inhibited cellular growth, since benzoquinone is a toxic agent against microorganisms such as *E. coli*, *Pseudomonas fluorescens,* and *Erwinia amylovora* (Beckman and Siedow, 1985). Our GC/MS study also indicated a correlation between the appearance of catechol and the disappearance of benzoic acid when 1 g/L benzoic acid was supplemented into the minimal medium (**Figure 6**). Although further experimental data are needed to quantify the disappearance of benzoic acid and the appearance of catechol and 1,2-benzoquinone, our current study suggests that at high benzoate concentrations, *C. necator* cannot convert catechol completely to cis,cis-muconate or 2-hydroxymuconate-6-semialdehyde. Catechol then begins to accumulate, which potentially causes the formation of 1,2-benzoquinone, turning the medium dark brown and inhibiting cell growth (**Figures 4**–**6**).

In addition, homologous genes were also identified in *C. necator* that could carry out the reaction from p-coumarate to 4-hydroxybenzoate, which then feeds into the TCA cycle (**Figure 5**). This indicates that *C. necator* can utilize the lignin monomer coumarate as a carbon source for cell growth and may explain the stimulatory effect of coumarate on *C. necator* growth in the lower concentration range. When a high concentration of p-coumarate was supplemented into the medium, a high concentration of catechol may have been produced, and the bottleneck reaction from catechol to cis,cis-muconate or 2-hydroxymuconate-6-semialdehyde (**Figure 5**) may have caused the inhibitory effect as discussed above. However, it is possible that other metabolic pathways that we have not covered are also involved in, and may even play a key role in, lignin monomer utilization and pre-treatment inhibitor sensitivity.

This study demonstrates the feasibility of using genomic information to explain microbial physiological phenomena. Although this approach could provide genetic targets for metabolic engineering to improve strain robustness (e.g., overexpression of catechol dioxygenase to drive catechol into the TCA cycle instead of allowing it to accumulate), other experimental approaches such as systems biology studies are needed to completely understand the mechanism of pre-treatment inhibitor sensitivity, and genetic studies are especially required to confirm the hypotheses generated by bioinformatics.

#### **AUTHOR CONTRIBUTIONS**

Wei Wang, Shihui Yang, Philip T. Pienkos, and David K. Johnson designed the experiments. Wei Wang carried out the flask assay and AC treatment. Shihui Yang carried out the Bioscreen C assay and genomic study. Glendon B. Hunsinger performed the GC/MS. Shihui Yang and Wei Wang did the spectrum scanning. Wei Wang, Shihui Yang, Philip T. Pienkos, and David K. Johnson analyzed the data, and wrote the manuscript.

# **ACKNOWLEDGMENTS**

This work was supported by the US Department of Energy, Bioenergy Technology Office (BETO) under contract number DE-AC36-08-GO28308 to NREL. NREL is a national laboratory of the US Department of Energy, Office of Energy Efficiency and Renewable Energy, operated by the Alliance for Sustainable Energy, LLC. We thank Michael Guarnieri and Thieny Trinh for preliminary toxicity experiments with *C. necator*, Ashutosh Mittal and William E. Michener for PHB and GC/MS analyses.

### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2014.00247/abstract

# **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 09 January 2014; accepted: 06 May 2014; published online: 27 May 2014. Citation: Wang W, Yang S, Hunsinger GB, Pienkos PT and Johnson DK (2014) Connecting lignin-degradation pathway with pre-treatment inhibitor sensitivity of Cupriavidus necator. Front. Microbiol. 5:247. doi: 10.3389/fmicb.2014.00247*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Wang, Yang, Hunsinger, Pienkos and Johnson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Elucidation of *Zymomonas mobilis* physiology and stress responses by quantitative proteomics and transcriptomics

# *Shihui Yang1,2,3\*, Chongle Pan4,5, Gregory B. Hurst 5, Lezlee Dice1,2, Brian H. Davison1,2 and Steven D. Brown1,2\**

*<sup>1</sup> Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA*

*<sup>5</sup> Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Harold J. Schreier, University of Maryland Baltimore County, USA Claudio Avignone-Rossa, University of Surrey, UK*

#### *\*Correspondence:*

*Shihui Yang, National Bioenergy Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, CO 80401, USA e-mail: shihui.yang@nrel.gov; Steven D. Brown, BioEnergy Science Center, Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN 37831, USA e-mail: brownsd@ornl.gov*

*Zymomonas mobilis* is an excellent ethanologenic bacterium. Biomass pretreatment and saccharification provide access to simple sugars, but also produce inhibitors such as acetate and furfural. Our previous work identified and confirmed a 1.5-kb deletion in the sodium acetate tolerant *Z. mobilis* mutant (AcR) that leads to constitutively elevated expression of the sodium proton antiporter encoding gene *nhaA,* which contributes to the sodium acetate tolerance of the AcR mutant. In this study, we further investigated the responses of AcR and wild-type ZM4 to sodium acetate stress in minimal medium using both transcriptomics and, for the first time, a metabolic labeling approach for quantitative proteomics. Proteomic measurements at two time points identified about eight hundred proteins, or about half of the predicted proteome. Extracellular metabolite analysis indicated that AcR overcame the acetate stress more quickly than ZM4, with a concomitant earlier onset of ethanol production in the AcR mutant, although the final ethanol yields and cell densities were similar between the two strains. Transcriptomic samples were analyzed for four time points and revealed that the response of *Z. mobilis* to sodium acetate stress is dynamic and complex, and involves about one-fifth of the total predicted genes from all functional categories. The modest correlations between proteomic and transcriptomic data may suggest the involvement of posttranscriptional control. In addition, the transcriptomic data of forty-four microarrays from four experiments for ZM4 and AcR under different conditions were combined to identify strain-specific, media-responsive, growth phase-dependent, and treatment-responsive gene expression profiles. Together, this study indicates that growth in minimal medium rather than rich medium has the most dramatic effect on gene expression, followed by growth phase, inhibitor, and strain background. Genes involved in protein biosynthesis, glycolysis and fermentation as well as ATP synthesis and stress response play key roles in *Z. mobilis* metabolism, with consistently strong expression levels under different conditions.

**Keywords:** *Zymomonas mobilis***, microarray, proteomics and metabolomics, acetate, pretreatment inhibitor, stress responses, quantitative proteomics, systems biology**

### **BACKGROUND**

Yeast strains are among the leading current generation industrial biocatalyst microorganisms for fuel production (Hahn-Hagerdal et al., 2006). However, engineered bacteria such as *Zymomonas mobilis*, *E. coli*, and *Bacillus subtilis* are also being developed and deployed to address commercial biofuel catalyst requirements (Dien et al., 2003; Inui et al., 2004; Romero et al., 2007; Alper and Stephanopoulos, 2009). *Z. mobilis* is an ethanologenic bacterium with many desirable industrial characteristics such as high specific productivity and yield, high ethanol tolerance, and a wide pH range (Gunasekaran and Raj, 1999; Panesar et al., 2006; Rogers et al., 2007). Recently, transformation efficiency has been improved by modifying the DNA restriction-modification systems (Kerr et al., 2010), and genes that improve pretreatment inhibitor tolerance have been identified from *Z. mobilis* (Yang et al., 2010a,b) or from *Deinococcus radiodurans* (Zhang et al., 2010). The genome sequences for strains ZM4, NCIMB 11163, 10988, 29291 and 29292 have been determined (Seo et al., 2005; Kouvelis et al., 2009, 2011; Pappas et al., 2011; Desiniotis et al., 2012), and the ZM4 genome annotation was improved recently (Yang et al., 2009a). Genome-scale *in silico* metabolic modeling analyses have been reported (Lee et al., 2010; Widiastuti et al., 2011; Rutkis et al., 2013), and recombinant strains have been engineered to express and secrete cellulase (Linger et al., 2010) or to ferment hexose and pentose sugars such as xylose and arabinose (Zhang et al., 1995; Deanda et al., 1996).

A core challenge in next-generation biomass-based cellulosic biofuel is the recalcitrance of biomass to breakdown into sugars (Himmel et al., 2007; Alper and Stephanopoulos, 2009).

*<sup>2</sup> BioEnergy Science Center, Oak Ridge National Laboratory, Oak Ridge, TN, USA*

*<sup>3</sup> National Bioenergy Center, National Renewable Energy Laboratory, Golden, CO, USA*

*<sup>4</sup> Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA*

Biomass pretreatment regimes are required to release the sugars, but these can create inhibitors such as furfural, acetate, and vanillin (Almeida et al., 2007; Pienkos, 2009). The presence of pretreatment inhibitors increases production costs due to lower production rates and decreased yields. The development and deployment of robust inhibitor-tolerant biocatalysts for efficient fermentation of biomass to biofuel will be a critical component for successful production of biofuels at industrial-scale quantities to meet sustainability and energy security challenges associated with fossil fuels (Almeida et al., 2007). Several recent transcriptomics studies have used a microarray approach to characterize mutant strains or *Z. mobilis* stress responses (Yang et al., 2009b, 2013; Hayashi et al., 2011; He et al., 2012a,b; Jeon et al., 2012; Skerker et al., 2013).

Acetic acid is an important inhibitor produced by the de-acetylation of hemicelluloses during biomass pretreatment. Unlike furfural, another major inhibitor, which is volatile and can be converted into the less toxic product furfuryl alcohol (Liu et al., 2005; Heer and Sauer, 2008; Franden et al., 2009; Agrawal and Chen, 2011), acetate is stable during fermentation and poses a constitutive stress on the growth and ethanol production of *Z. mobilis* (Yang et al., 2010a,b). A *Z. mobilis* mutant strain, designated AcR, generated by chemical mutagenesis and selection, is able to produce ethanol efficiently in the presence of 20 g/L sodium acetate (NaAc), while the parent ZM4 is inhibited significantly above 12 g/L (Joachimstahl et al., 1998). Through comparative genome sequencing and next-generation sequencing (NGS)-based genome resequencing, we characterized the AcR mutant and identified a 1.5-kb deletion in strain AcR, which likely truncated the promoter of the *nhaA* gene encoding a sodium proton antiporter. We carried out genetic studies to confirm the association of the 1.5-kb deletion in the AcR mutant with its sodium acetate tolerance phenotype. We further performed a microarray study to identify the differentially expressed genes between wild-type ZM4 and the AcR mutant, and found that the *nhaA* gene is consistently upregulated in the AcR mutant background compared to wild-type ZM4 under different conditions of NaCl and NaAc stress (Yang et al., 2010b).

Although we confirmed that the 1.5-kb deletion in the AcR mutant background leads to *nhaA* gene overexpression and an enhanced sodium acetate tolerance phenotype (Yang et al., 2010b), we have not systematically explored the global transcriptional profile differences between the AcR mutant and wild-type ZM4, especially in minimal medium (MM), which is potentially more relevant to industrial fermentation conditions and more stressful to *Z. mobilis* than the rich medium (RM) we used in the previous study (Yang et al., 2010b). In addition, no systems biology study of this important industrial strain in minimal medium has been reported yet. To further explore the acetate-tolerance differences between ZM4 and AcR in MM, comprehensive microarray-based transcriptomic profiles and <sup>14</sup>N/<sup>15</sup>N-labelled quantitative proteomic data were generated for ZM4 and AcR in MM, with a large number of *Z. mobilis* proteins detected and quantified, which will be useful for future biocatalyst development. In addition, data collected in this and previous studies were used to hypothesize condition-responsive (strain, media, growth phase, or treatment) genes.

# **RESULTS**

# **PHYSIOLOGICAL RESPONSE OF** *Z. MOBILIS* **TO SODIUM ACETATE IN MINIMAL MEDIUM (MM)**

The growth of *Z. mobilis* wild-type ZM4 and the AcR mutant in MM supplemented with 0, 12, or 16 g/L NaAc, or with 8.65 or 11.4 g/L NaCl to give the same corresponding Na<sup>+</sup> molar concentrations as NaAc, was assessed using a Bioscreen C instrument (Growth Curves USA, NJ) under anaerobic conditions. The aim was to determine the effect of NaCl and NaAc on *Z. mobilis* growth and to decide on an appropriate NaAc concentration for subsequent systems biology studies. *Z. mobilis* grew more slowly in MM and attained lower cell densities (Additional File 1) compared to RM conditions (Yang et al., 2010b). Consistent with earlier RM results, wild-type ZM4 growth was arrested when NaAc was added to minimal medium at 16 g/L, and differences were observed between ZM4 and AcR with NaAc supplemented at a concentration of 12 g/L (Additional File 1). The concentration of 10 g/L was chosen for NaAc treatment in subsequent systems biology studies.
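The NaCl controls are described as matching the Na<sup>+</sup> molar concentration of the NaAc treatments. A minimal sketch of that calculation follows, assuming anhydrous sodium acetate and molar masses computed from standard atomic masses; the function name is illustrative. Under these assumptions, 16 g/L NaAc corresponds to about 11.4 g/L NaCl, in agreement with the text, while 12 g/L NaAc corresponds to about 8.55 g/L NaCl, slightly below the reported 8.65 g/L. The text does not say which molar masses or rounding were used, so the small difference cannot be resolved here.

```python
# Molar masses from standard atomic masses (assumption: anhydrous salts).
SODIUM_ACETATE_G_PER_MOL = 22.990 + 2 * 12.011 + 3 * 1.008 + 2 * 15.999
SODIUM_CHLORIDE_G_PER_MOL = 22.990 + 35.45


def equimolar_nacl_g_per_l(naac_g_per_l):
    """NaCl mass concentration giving the same Na+ molarity as the NaAc dose."""
    sodium_mol_per_l = naac_g_per_l / SODIUM_ACETATE_G_PER_MOL
    return sodium_mol_per_l * SODIUM_CHLORIDE_G_PER_MOL


nacl_for_16 = equimolar_nacl_g_per_l(16.0)
nacl_for_12 = equimolar_nacl_g_per_l(12.0)
```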

Similar to previous reports in RM (Joachimstahl et al., 1998; Yang et al., 2010b), mutant strain AcR also outperformed wild-type strain ZM4 in MM supplemented with NaAc (**Figure 1**, Additional File 1). Strain AcR overcame the acetate stress and reached stationary phase about 130 h post-inoculation, while ZM4 reached stationary phase after approximately 166 h. Both strains achieved similar final cell densities based upon OD600nm readings. Strain ZM4 had a longer lag phase and began to consume glucose much later than AcR, with most of the glucose in AcR cultures consumed at 148 h post-inoculation, while cultures of ZM4 took 166 h to use the majority of the glucose in MM (**Figure 1**). As glucose was consumed earlier by AcR cultures, there was also a concomitant earlier onset of ethanol production compared to ZM4. Both strains produced similar yields of ethanol by the end of the experiment, and the acetate concentration remained steady throughout the experiment (**Figure 1**).

### *Z. MOBILIS* **WILD-TYPE ZM4 AND AcR MUTANT PROTEOMIC PROFILING DIFFERENCES IN MM**

Quantitative proteomics was used to compare the proteomic differences between *Z. mobilis* wild-type ZM4 and AcR mutant cells at 148 and 190 h post-inoculation. We identified 705 chromosomally encoded proteins and 6 plasmid-derived proteins for the 148 h comparison, and 728 chromosomal proteins and 7 plasmid proteins for the 190 h comparison. We thus identified 638 chromosomal proteins and 4 plasmid proteins in common for the two time points. Altogether, this study identified 795 unique chromosomal proteins and 9 unique plasmid proteins, which is about 46% of the predicted *Z. mobilis* ZM4 proteome based upon a reannotation of the genome (Yang et al., 2009a) (Additional Files 2, 3A). The pI and MW distributions for the identified proteins were similar to the predicted theoretical pI and MW distributions for all proteins (Additional File 4). This study represents, to our knowledge, the largest quantitative measurement of the *Z. mobilis* proteome published to date.
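The unique protein counts follow from the per-time-point counts and the overlap by inclusion-exclusion. The short check below uses only the numbers reported in the paragraph above; the function name is illustrative.

```python
def union_size(count_a, count_b, count_common):
    """Number of distinct items across two sets, given their overlap."""
    return count_a + count_b - count_common


# Counts at 148 h, at 190 h, and shared between the two time points.
chromosomal_unique = union_size(705, 728, 638)
plasmid_unique = union_size(6, 7, 4)
total_unique = chromosomal_unique + plasmid_unique
```

This reproduces the 795 chromosomal and 9 plasmid proteins stated in the text, for 804 proteins in total.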

The 148 h time point comparison between ZM4 and AcR strains revealed 120 proteins that had at least 1.5-fold differences in abundance and were significantly different based upon

the ProRata likelihood algorithm for quantitative shotgun proteomics and a confidence interval of 90% as described previously (Pan et al., 2006). This included 55 upregulated proteins and 65 downregulated proteins in AcR compared to ZM4, of which 11 were upregulated and 25 were downregulated more than 2-fold (Additional File 2). There were 107 proteins with a significant difference and at least a 1.5-fold change in levels for stationary phase comparison (190 h) using the same criteria. In strain AcR, 47 proteins were upregulated and 60 downregulated compared to ZM4, of which 10 were upregulated and 26 were downregulated and had changes of at least 2-fold (Additional File 2). Approximately half of the proteins that were identified as being different for one time point were also statistically different in the other time point (Additional Files 2, 3B–C).
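The selection criteria described above (a significant difference plus at least a 1.5-fold change) can be illustrated with a small filter. This is a hedged sketch, not the ProRata implementation: it assumes each protein comes with a log2 abundance ratio (AcR/ZM4) and the bounds of its 90% confidence interval, and it treats a protein as significant when that interval excludes zero. The function name and the example values are hypothetical.

```python
import math


def classify(log2_ratio, ci_low, ci_high, min_fold=1.5):
    """Label a protein as 'up', 'down', or 'unchanged' in AcR relative to ZM4.

    Assumption: a change is significant when the confidence interval of the
    log2 ratio lies entirely on one side of zero.
    """
    threshold = math.log2(min_fold)
    if ci_low > 0 and log2_ratio >= threshold:
        return "up"
    if ci_high < 0 and log2_ratio <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical proteins: (log2 ratio, CI lower bound, CI upper bound).
examples = {
    "protein_a": (1.2, 0.7, 1.6),
    "protein_b": (-0.8, -1.1, -0.6),
    "protein_c": (0.9, -0.1, 1.5),
}
labels = {name: classify(*values) for name, values in examples.items()}
```

Calling `classify` with `min_fold=2.0` gives the stricter 2-fold subsets mentioned in the text.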

The down-regulated or up-regulated proteins with at least a 1.5-fold change were also analyzed for previously documented interactions using the STRING database (Jensen et al., 2009). The 14 proteins consistently upregulated in AcR for both time points compared to ZM4 had fewer interactions (Additional Files 2, 3D). Among the 39 proteins downregulated in AcR at both the 148 and 190 h time points post-inoculation, stress-responsive proteins such as catalase (ZMO0918), glutathione synthetase (ZMO1913), glutaredoxin 2 (ZMO0070), and a glutaredoxin-related protein (ZMO1873) were downregulated, and most of them have been shown to interact with one another (Additional Files 2, 3E). Other downregulated and connected proteins included preprotein translocase (ZMO1896-8), the cofactor synthesis protein biotin synthase (ZMO0094), and amino acid biosynthesis proteins such as shikimate 5-dehydrogenase (AroB, ZMO0041), threonine synthase (ThrC, ZMO1891), and 3-isopropylmalate dehydratase large subunit (LeuC, ZMO0105) (Additional Files 2, 3E).

Our intention was to identify differentially expressed proteins more confidently by combining datasets from both time points, rather than relying only on the stationary-phase time point at 190 h, in order to exclude growth effects. However, the first comparison is potentially problematic because of the growth difference between the AcR mutant and wild-type ZM4 at the 148 h time point (**Figure 1**). Therefore, although the comparison for the second time point (stationary phase, 190 h) is reasonably accurate, caution should be taken in interpreting the proteomic comparison for the first time point because of the potential impact of growth effects. To overcome this problem, samples from the same growth phase of the AcR mutant and wild-type ZM4 were used for our transcriptomic comparisons, as discussed below.

### *Z. MOBILIS* **WILD-TYPE ZM4 AND AcR MUTANT TRANSCRIPTOMIC PROFILING DIFFERENCES IN MM**

Transcriptomic profiles for ZM4 and AcR in the presence of NaAc in MM were examined using NimbleGen high density expression arrays, essentially as reported previously (Yang et al., 2010b). The expression profiles generated in this study have been deposited in the GEO database (accession number GSE25443). About seventeen hundred genes were identified to be significantly differentially expressed using ANOVA modeling with strain (ZM4 and AcR) and time points post-inoculation as variables in MM (Additional File 5), which covered nearly all of the reannotated *Z. mobilis* ZM4 genes (Yang et al., 2009a). Hence dynamic gene expression changes were observed and for NaAc responses, 474 genes were significantly differentially expressed with at least a 2-fold change in expression values (Additional File 5).

Nine differentially expressed genes from different functional categories with a broad range of expression ratios were chosen for real-time quantitative PCR (RT-qPCR) validation (Additional File 6). RT-qPCR results indicated a high degree of concordance between microarray and RT-qPCR data with R-squared correlation coefficient values of 0.88, 0.87, 0.71, or 0.78 for the time points of 130, 148, 166, or 190 h respectively (Additional File 7).

Four genes were upregulated and 30 genes downregulated significantly with at least 2-fold changes in AcR when comparing AcR expression profiles to ZM4 profiles for all the time points (Additional File 5). Similar to the transcriptomics study in RM reported previously (Yang et al., 2010b), the sodium/proton antiporter gene (ZMO0119) was also upregulated and ZMO0117 was downregulated in AcR in MM conditions. The other three genes that were consistently upregulated in AcR were L-asparaginase (ZMO1683), beta-fructofuranosidase (ZMO0375), and aldose 1-epimerase (ZMO0889). The 29 remaining genes that were consistently downregulated encoded mostly ribosomal proteins as well as proteins related to chemotaxis and flagellar biosynthesis, electron transport, fatty acid biosynthesis and protein export (Additional File 5).

### **CORRELATIONS BETWEEN PROTEOMIC AND TRANSCRIPTOMIC DATA OF ZM4 AND AcR IN MM**

We also examined the correlations between the gene or protein expression ratios at two time points of 148 and 190 h post-inoculation. There were 632 common proteins identified in both the 148 and 190 h time points. When the relative log2 ratios (AcR/ZM4) were compared, a moderate correlation (*R*<sup>2</sup> = 0.47) was observed between the different sampling times (**Figure 2A**). Similarly, the correlation of transcriptomic profiles (AcR/ZM4) between 148 and 190 h was moderately high (*R*<sup>2</sup> = 0.59) for the 530 common genes that showed statistically significant differential expression, but had not undergone any fold-change filtering (**Figure 2B**). When transcriptomic and proteomic data from the same time point (either 148 or 190 h) were compared, the 460 gene-protein pairs for the 148 h time point correlated poorly (*R*<sup>2</sup> = 0.09) (**Figure 2C**), and the correlation between the 302 gene-protein pairs at 190 h post-inoculation was also relatively low (*R*<sup>2</sup> = 0.17) (**Figure 2D**).
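For readers who want to reproduce this kind of comparison on their own data, the sketch below computes log2 ratios (AcR/ZM4) and the squared Pearson correlation between two matched lists of ratios, for example the same proteins at two time points or matched gene-protein pairs. It assumes the *R*<sup>2</sup> values reported here are squared Pearson correlation coefficients, which the text does not state explicitly; the function names are illustrative.

```python
import math


def log2_ratios(acr_values, zm4_values):
    """Per-feature log2(AcR/ZM4) ratios from matched abundance lists."""
    return [math.log2(a / z) for a, z in zip(acr_values, zm4_values)]


def r_squared(x, y):
    """Squared Pearson correlation between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)
```

Only features present in both datasets should be passed in, in the same order, which corresponds to the "common" proteins, genes, or gene-protein pairs counted in the paragraph above.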

We have reported, in our recent *Z. mobilis* ethanol stress experiment using high-density microarrays and shotgun proteomics, that the correlation between transcriptomic and proteomic data increases when only the significant genes or proteins are included in the comparison, and increases further as the transcriptomic or proteomic ratios increase (Yang et al., 2013). Similarly, in this study, when only the genes or proteins with a significant and ≥1.5-fold change were used for comparison purposes, the correlation between transcriptomic and proteomic data increased to *R*<sup>2</sup> = 0.30 (*N* = 35) and *R*<sup>2</sup> = 0.25 (*N* = 32) for the 148 and 190 h comparisons, respectively (**Figure 2E**). Eighteen common gene-protein pairs were identified between proteomics and transcriptomics data at both 148 and 190 h post-inoculation with at least a 1.5-fold significant change, and all were downregulated in AcR (**Table 1**). The correlation for the proteomic study between 148 and 190 h rose to *R*<sup>2</sup> = 0.79 (*N* = 18) when significance and fold change were considered (**Table 1**, **Figure 2F**). However, the correlation for the transcriptomic study between 148 and 190 h decreased to 0.01 (*N* = 18) when the significance and fold-change criteria were applied (**Table 1**, **Figure 2F**). This result may suggest that change at the transcriptional level is rapid, which is consistent with our transcriptomic studies in *Clostridium thermocellum*, in which genes were differentially expressed as early as several minutes after ethanol or furfural shock (Yang et al., 2012; Wilson et al., 2013). Moreover, this study also suggests that changes at the protein level are relatively steady, with post-transcriptional control potentially involved.

### **METABOLIC PATHWAY ANALYSES OF PROTEOMIC AND TRANSCRIPTOMIC DATA OF ZM4 AND AcR IN MM**

The PathwayTools Omics Viewer (Karp et al., 2002, 2011) was used to further examine the transcriptomic and proteomic data and their relationships. Genes in the fatty acid and hopanoid biosynthetic pathways are upregulated in ZM4 compared to AcR (Additional File 8A). Hopanoids have been found in a variety of bacteria including *Z. mobilis* and are reported to protect against the toxic effects of ethanol (Flesch and Rohmer, 1989; Hermans et al., 1991; Horbach et al., 1991; Shigeri et al., 1991; Welander et al., 2009). They have also been reported to play a role in *Rhodopseudomonas palustris* membrane integrity and pH homeostasis (Welander et al., 2009). However, their role in *Z. mobilis* is debated (Moreau et al., 1997), and we did not identify obvious differential expression of these genes when *Z. mobilis* encountered ethanol stress (Yang et al., 2013). Further study is required to elucidate their role in *Z. mobilis* and their relative contribution, if any, to the sodium acetate sensitivity or tolerance of the strains used in this experiment.

In other pathway comparisons of AcR to ZM4 at 148 and 190 h, the strains had similar patterns of protein detection; the 120 and 107 differentially expressed proteins with a ≥1.5-fold change at 148 and 190 h, respectively, were used for these analyses (Additional Files 8B,C). We identified 509 chromosomal gene/protein pairs that showed no significant change between AcR and ZM4 in MM with NaAc at both 148 and 190 h post-inoculation (Additional File 2), the majority of which are involved in the ED and mixed-acid fermentation pathways and in tRNA charging (Additional File 8A–C). When the log2-transformed peptide hits of all proteins identified at 148 and 190 h post-inoculation were used for pathway analyses, proteins involved in the ED pathway and ethanol production were among the most abundant (Additional Files 2, 8D). The abundance of ED pathway and ethanol production enzymes, and the relative stability of their protein levels between ZM4 and AcR, indicate that these pathways for carbon and electron flow are core elements in the physiology of both strains.

Recently, a number of studies have added important details to our understanding of *Z. mobilis* physiology (Kouvelis et al., 2009; Yang et al., 2009a, 2010a,b; Kerr et al., 2010; Linger et al., 2010; Zhang et al., 2010; Widiastuti et al., 2011). However, the functions of many genes remain unknown, even though we now have proteomic evidence that the proteins they encode are expressed. The largest quantitative measurement of the *Z. mobilis* proteome, obtained in this study, may help elucidate *Z. mobilis* genes and proteins and their functions and regulation.

### **ZM4 AND AcR TRANSCRIPTOMIC PROFILES IN DIFFERENT CONDITIONS**

In this study, we also combined *Z. mobilis* microarray data collected in previous studies with expression data collected in this experiment to provide greater insight into *Z. mobilis* physiology and gene regulation. Forty-four microarrays were used in the analysis (Additional File 9) to compare gene expression differences across the variables of strain (AcR and ZM4), growth phase (exponential and stationary), medium (RM and MM), and treatment (NaCl, NaAc, and the RM-only control). Considering the growth differences between ZM4 and AcR in this experiment (**Figure 1**) and the fact that *Z. mobilis* shows dramatic transcriptional differences between exponential and stationary phase (Additional Files 5, 10) (Yang et al., 2009b), the relatively large number of genes significantly differentially expressed between AcR and ZM4 at 148 h post-inoculation, as well as the low transcriptomic correlation between 148 and 190 h post-inoculation, is likely associated with growth phase differences. To ensure that compatible comparisons were made among different time points in RM or MM under different conditions, the correlations among all forty-four microarrays were investigated using JMP Genomics 4.0 (SAS, NC). Based on hierarchical clustering and growth curve analyses (**Figure 3**), time points were assigned to either exponential or stationary phase for further statistical analyses (**Figure 3**, Additional File 9). Differences between the variables of strain, growth phase, media, and treatment were then analyzed by ANOVA, as described previously (Yang et al., 2009b, 2010b). The ANOVA indicated that nearly every gene in the *Z. mobilis* genome was differentially expressed under one or more of the many conditions tested in this global analysis (Additional File 11), which provides an opportunity to identify strain-, media-, growth phase-, or treatment-responsive genes, as discussed below.


**Table 1 | Gene-protein pairs identified in both transcriptomic and proteomic studies at both 148 and 190 h post-inoculation, and those only identified in either 148 or 190 h post-inoculation.**



*A148 and A190: array results comparing AcR to ZM4 at 148 and 190 h post-inoculation, respectively; P148 and P190: proteomics results comparing AcR to ZM4 at 148 and 190 h post-inoculation, respectively. All the gene-protein pairs are statistically significant with at least a 1.5-fold difference between AcR and ZM4 in both transcriptomic and proteomic studies. A negative number indicates that the gene/protein was downregulated in AcR compared to ZM4, and a positive number indicates upregulation.*

#### *Strain-specific genes*

Gene expression profiles for ZM4 and AcR were compared, and when all conditions (Additional File 9) were taken into account, only five genes were significantly differentially expressed between strains with at least a 2-fold change. These strain-specific genes included two with increased expression and three that were downregulated in strain AcR (Additional Files 11, 12). We have previously reported that *nhaA* (ZMO0119) is upregulated and ZMO0117 is downregulated in AcR compared to ZM4 under exponential and stationary growth conditions in RM (Yang et al., 2010b). In this study, which considered more conditions, we identified another gene (hypothetical protein ZMO1787, 82 aa) that was consistently upregulated in AcR under the conditions tested, and two genes, encoding a predicted permease (ZMO0055, 263 aa) and a conserved hypothetical protein (ZMO0025, 234 aa), that were consistently downregulated in AcR (Additional File 11).

#### *Media-responsive genes*

Gene expression differences between MM and RM in the presence of NaAc were compared (Additional File 9). When the responses of both AcR and ZM4 in both exponential and stationary phases were combined to examine media differences, 232 genes were identified as upregulated and 247 as downregulated in MM compared to RM (Additional File 13). Almost half of the upregulated genes encode hypothetical proteins of unknown function, and many of the remaining upregulated genes encode proteins involved in stress sensing and response. These included two-component signal transduction genes (7), transcriptional regulators (11), sigma factors (*rpoD*, *rpoH*), flagellar and chemotaxis genes, and other stress-responsive genes such as a putative operon containing the heat-inducible transcription repressor gene *hrcA* (ZMO0015) and the chaperone gene *grpE* (ZMO0016). Other genes included a cold shock gene (ZMO0925), a DnaJ-class chaperone gene (ZMO1069) and the chaperone gene *hspD* (ZMO0989), three glutaredoxin-related genes, three thioredoxin-like genes, and genes associated with DNA repair such as the UvrABC system gene *uvrB* (ZMO0362), *mutM* (ZMO1187), *mutS* (ZMO1907) and ZMO1426 (Additional File 13).

More previously documented interactions exist among the genes downregulated in MM relative to RM than among the upregulated genes (Additional File 14). These MM-downregulated genes are mostly related to central carbon metabolism and include the 6-phosphogluconolactonase gene *pgl* (ZMO1478), the phosphoglycerate mutase gene *pgmA* (ZMO1240), the gluconolactonase gene *gnl* (ZMO1649), the pyruvate decarboxylase gene *pdc* (ZMO1360), and the pyruvate dehydrogenase genes *pdhA* and *pdhB* (ZMO1606, ZMO1605) (Additional Files 13, 14B). Genes involved in amino acid biosynthesis, genes encoding ribosomal proteins, genes for nucleotide biosynthesis, and genes related to energy metabolism, such as those of the electron transport system and ATP generation, were also mainly downregulated in MM (Additional Files 13, 14B).

#### *Treatment-responsive genes*

The following conditions were used to examine treatment-specific responses: ZM4 grown in RM, and both ZM4 and AcR grown in RM with NaCl, RM with NaAc, or MM with NaAc (Additional File 9). These conditions were used to make the following comparisons: NaCl or NaAc treatment in RM vs. RM alone for strain ZM4, and NaAc vs. NaCl treatment in RM for both ZM4 and AcR (Additional File 13).

The NaCl-responsive genes in the ZM4 wild-type background included 88 downregulated genes and 47 upregulated genes (Additional Files 13, 15A,B). More genes were responsive in ZM4 grown in RM supplemented with NaAc, including 159 downregulated and 103 upregulated genes, approximately twice the number responsive to the less severe NaCl treatment (Additional Files 13, 15C,D). Differentially expressed genes shared between the NaCl and NaAc treatments in RM, compared to RM alone, included 27 upregulated and 66 downregulated genes (Additional Files 13, 15E,F). Genes upregulated under both NaCl and NaAc treatment were related to cysteine synthesis, with 6 genes clustered together (*cysD, I, J, K, N*, ZMO0006 and ZMO0055), and to tryptophan synthesis (*trpB, F*). Most of the downregulated genes encoded hypothetical proteins of unknown function, with the exception of several related to flagellar biosynthesis and sucrose metabolism (Additional Files 13, 15).

When the responses of *Z. mobilis* in RM with NaAc were compared to those with NaCl, there were 37 upregulated and 41 downregulated genes (Additional Files 13, 15I,J). Genes related to ribosomal proteins and amino acid biosynthesis were downregulated, whereas the upregulated genes were related to the Entner–Doudoroff (ED) pathway (ZMO1518 and *pgi*), energy metabolism [e.g., electron transport (ZMO0021, ZMO1851, ZMO1885) and the ATP synthesis gene *atpC* (ZMO0242)], cell wall formation (ZMO1724), the regulator gene *zrp* (ZMO0372), and several transport-related genes such as the signal peptidase I gene *lepB* (ZMO1710), a Sec-independent protein translocase tatA/E homolog gene (ZMO1220), the TolQ biopolymer transport gene (ZMO0161), the biopolymer transport gene *exbD* (ZMO1715), the Fe<sup>2+</sup> transport system gene *feoB* (ZMO1541), and the potassium transport system gene *kup* (ZMO1209).

#### *Growth phase-dependent genes*

When strain profiles in the presence of NaAc were considered together and a ≥2-fold change was applied, 331 genes were significantly differentially expressed in a phase-dependent manner in RM (Additional Files 11, 12). When the RM NaCl profiles were analyzed, 661 genes were identified (Additional Files 11, 12). Similar patterns were identified when the strains were considered separately (Additional Files 11, 12). In the presence of NaAc, the genes upregulated in stationary phase were less correlated than the downregulated ones (Additional File 16). The genes downregulated in stationary phase are related to ribosomal proteins, chemotaxis and flagellar systems, amino acid and nucleotide biosynthesis, and electron transport (Additional Files 11, 16A). Except for those encoding hypothetical proteins, genes upregulated in stationary phase are related to stress responses, such as the catalase gene (ZMO0918), glutaredoxin gene (ZMO0070), thioredoxin gene (ZMO1705), ferredoxin gene *fdxN* (ZMO1818), organic hydroperoxide resistance gene (ZMO0693), ATP-dependent Clp protease gene *clpB* (ZMO1424), RNA polymerase sigma-32 factor gene *rpoH* (ZMO0749), and integration host factor gene *ihfB* (ZMO1801) (Additional Files 11, 16B).

# **DISCUSSION**

#### **GENES CONTRIBUTING TO SODIUM ACETATE TOLERANCE PHENOTYPE OF AcR**

The AcR mutant had an advantage under NaAc stress in both RM (Yang et al., 2010b) and MM, with a shorter lag phase leading to earlier glucose consumption and ethanol production than in ZM4 cells (**Figure 1**). We have reported previously that a truncation within gene ZMO0117 in AcR causes the consistent upregulation of the sodium proton antiporter gene *nhaA* (ZMO0119), which is the determining factor for the sodium acetate tolerance of the AcR mutant in RM (Yang et al., 2010b). In this study, *nhaA* also had higher expression levels in AcR than in ZM4 in MM (Additional File 11). ZM4 *nhaA* expression in unamended RM was higher (∼2-fold) in stationary phase than in exponential phase (Additional File 11), and differential expression was also measured for other conditions and strain comparisons (Additional File 13). Acetate accumulates in stationary-phase *Z. mobilis* fermentations (Yang et al., 2009b), and greater expression of *nhaA* may help sustain cellular function and activity under such stressful conditions.

In addition, compared to ZM4, the ZMO0117 gene in AcR showed a profile of downregulation in MM (Additional File 11) similar to that in RM (Yang et al., 2010b). Proteomic data also showed that the peptide hits for ZMO0117 were less abundant in AcR than in ZM4 at 148 h post-inoculation (Additional File 2). However, although differential expression of *nhaA* was observed, the NhaA protein was not detected in any sample in this experiment (Additional File 2). This demonstrates the value of conducting integrated experiments and may point to a technical challenge in detecting hydrophobic proteins such as NhaA that have many transmembrane-spanning domains (Poetsch and Wolters, 2008; Gilmore and Washburn, 2010).

As discussed above, this study also identified other AcR-specific genes. For example, ZMO0055 was significantly downregulated in AcR compared to ZM4 in all conditions with NaAc and also responded to other conditions (Additional File 11). ZMO0055 is predicted to belong to the sulfite exporter TauE/SafE protein family (pfam01925), a family of integral membrane proteins involved in transporting anions across the cytoplasmic membrane during taurine metabolism as exporters of sulfoacetate (Weinitschke et al., 2007). Based on protein analysis using the String pre-computed database (Additional File 15), ZMO0055 is closely related to CysC, D, I, N (ZMO0003, 4, 5, 8), which were upregulated under both NaCl and NaAc treatment. In addition, recent work by Skerker et al. (2013) indicated that cysteine synthases, which are required for L-cysteine biosynthesis from sulfate, are related to hydrolysate tolerance. They proposed that the increased demand for sulfite and cysteine is due to the increased need for glutathione, which is formed from glutamate and cysteine, during *Z. mobilis* growth under stressful conditions (Skerker et al., 2013). Therefore, the downregulation of a sulfite exporter could potentially help *Z. mobilis* retain sulfite and thereby enhance sodium acetate tolerance. However, further detailed study is needed to understand the role and contribution of these AcR-specific genes, which will help reveal the complete set of differences between AcR and ZM4 and, at the same time, help assign functions to the AcR-specific genes that encode hypothetical proteins (ZMO1787 and ZMO0025).

#### **RELATIONSHIP AMONG MEDIA, INHIBITOR, GROWTH PHASE, AND GENE EXPRESSION DYNAMICS**

The mean gene expression intensity values of the significantly expressed genes (Additional File 17) were further used for hierarchical clustering to compare each gene across 14 different conditions. The result indicated that minimal medium had the most dramatic effect on gene expression relative to RM, followed by growth phase, inhibitor (NaAc > NaCl), and strain (Additional File 18A). Gene expression profiles in MM clustered together and were separated from those in RM (Additional File 18A). *Z. mobilis* grown in MM has a longer lag phase and reaches a lower final cell density than in RM (Additional File 1) (Yang et al., 2010b), which indicates that one or more factors limit *Z. mobilis* growth in MM. MM is possibly a more stressful environment for *Z. mobilis* cells, which is further supported by the upregulation of stress-responsive genes in MM compared to RM (Additional Files 13, 14).

The second level of clustering was between exponential and stationary growth phases, consistent with an earlier global profiling report for strain ZM4 showing that growth phase tends to have a more dramatic effect on gene expression than a single stressor (Yang et al., 2009b). The greatest number of differentially expressed genes between exponential and stationary phases was observed for strain ZM4 in RM, followed by ZM4 in RM with NaCl, RM with NaAc, and then ZM4 in MM (Additional File 12). In RM, *Z. mobilis* growth is inhibited by NaCl, but NaAc is more inhibitory (Yang et al., 2010b), and a similar trend was observed in MM (Additional File 1). In this experiment, the inhibitor (NaAc) was added to the medium before fermentation, allowing the strains to respond and adapt to the conditions for several generations before samples were taken for systems biology studies. Hence, when transcriptomic and proteomic profiles between strains are compared under the same growth condition, the differences likely represent differences in homeostasis that allow AcR to function better than ZM4. The two strains were in different growth phases at the first proteomics time point, which confounds strain comparisons for tolerance but still allows gene-protein relationships to be examined. In general, the large datasets reported in this study will allow others to investigate aspects such as codon bias and RNA secondary structures.

In addition, the microarray results for each gene under different conditions can help us understand microbial physiology and identify consistently strong or inducible promoters for metabolic engineering applications. For example, in this study, 52 genes and 6 other genetic features were among the top 2.5% of all genetic features with consistently strong expression intensity. Except for a dozen genes encoding hypothetical proteins, they are involved in protein biosynthesis, glycolysis and fermentation, ATP synthesis, and stress response, indicating their key roles in *Z. mobilis* metabolism (Additional File 17). Several papers have recently been published on the stress responses of *Z. mobilis* to different inhibitors (e.g., ethanol, furfural) using systems biology approaches (Yang et al., 2009b, 2010b, 2013; He et al., 2012a,b; Jeon et al., 2012; Skerker et al., 2013), and we are currently working on several transcriptomic studies using microarrays and next-generation sequencing-based strand-specific RNA-Seq. By combining all these systems biology datasets, we will revisit this topic to investigate the impact of media, carbon source (glucose or xylose), inhibitors, and growth phase on *Z. mobilis*. The transcriptional profiles will be compared further to help identify condition-specific promoters more confidently, and the discrepancies between transcriptomics and proteomics under different conditions will help elucidate post-transcriptional regulation mechanisms.

# **CONCLUSIONS**

Our study provides the first global profiling of the model ethanologenic bacterium *Z. mobilis* at both the transcriptomic and proteomic levels in minimal medium (MM). The results indicate that AcR has a similar advantage in MM as in RM. The AcR mutant overcame acetate stress earlier than wild-type ZM4, with a shorter lag phase and earlier glucose utilization and ethanol production, and the sodium proton antiporter gene (*nhaA*) also plays an important role in sodium acetate tolerance in MM, as we reported previously for RM (Yang et al., 2010b). In addition, this is the first attempt that we are aware of to combine massive transcriptomic data for condition-specific gene identification. The proteomic and transcriptomic data generated in this study, together with those we have reported previously, will provide massive datasets for future metabolic modeling and strain improvement.

# **MATERIALS AND METHODS**

#### **STRAINS AND GROWTH CONDITIONS**

Wild-type *Z. mobilis* ZM4 was obtained from the American Type Culture Collection (ATCC 31821). The *Z. mobilis* acetate-tolerant strain AcR has been described previously (Joachimstahl et al., 1998). ZM4 and AcR were cultured in RM (Glucose, 20.0 g/L; Yeast Extract, 10.0 g/L; KH2PO4, 2.0 g/L; pH 5.0) (Yang et al., 2009b) for routine strain maintenance. MM was similar to the medium used to isolate auxotrophic *Z. mobilis* mutants (Goodman et al., 1982): 20 g Glucose, 1 g KH2PO4, 1 g K2HPO4, 0.5 g NaCl, and 1 g (NH4)2SO4 were dissolved in 986.5 mL distilled H2O, with the pH adjusted to 5.0 before autoclaving. The sterilized MM broth was then cooled to below 55°C before 500 mg MgSO4·7H2O (2.5 mL of a 200 g/L stock), 25 mg Na2MoO4·2H2O (1 mL of a 25 g/L stock), and 10 mL Vitamin (ATCC MD-VS) were added to make 1 L of MM.
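The stock volumes and final volume in the MM recipe can be checked with simple arithmetic. The sketch below is only an illustration of that check; it assumes that the volume contributed by the dissolved solids is negligible, and the function name `stock_volume_ml` is hypothetical.

```python
def stock_volume_ml(target_mg, stock_g_per_l):
    """Volume of stock (mL) that delivers target_mg of solute.

    A concentration in g/L is numerically equal to mg/mL,
    so the volume in mL is the mass in mg divided by that value.
    """
    return target_mg / stock_g_per_l


water_ml = 986.5                          # distilled water used to dissolve the salts
mgso4_ml = stock_volume_ml(500, 200)      # 500 mg MgSO4·7H2O from a 200 g/L stock
molybdate_ml = stock_volume_ml(25, 25)    # 25 mg Na2MoO4·2H2O from a 25 g/L stock
vitamin_ml = 10.0                         # vitamin addition (ATCC MD-VS)

total_ml = water_ml + mgso4_ml + molybdate_ml + vitamin_ml

print(mgso4_ml, molybdate_ml, total_ml)
```

Under this assumption the additions after autoclaving bring the 986.5 mL of broth to the stated 1 L.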

Growth assays using a Bioscreen C instrument (GrowthCurves USA, NJ) were used to identify the effects of sodium acetate (NaAc) and sodium chloride (NaCl) on the growth of *Z. mobilis* in MM ahead of the subsequent fermentation experiments. Fermentations were conducted in 7.5-L BioFlo110 bioreactors (New Brunswick Scientific, NJ) fitted with agitation, pH, and temperature probes and controls, and bacterial growth was monitored turbidimetrically by measuring optical density at 600 nm with a model 8453 spectrophotometer (Hewlett-Packard, CA) as described previously (Yang et al., 2009b), except that the fermentation volume was 4 L. Samples were harvested during fermentation at different time points (**Figure 1**) as described previously (Yang et al., 2009b).

#### **HPLC**

HPLC was used to measure the concentrations of glucose, acetate, and ethanol in 0.2-µm-filtered samples taken at different time points during fermentation (**Figure 1**), as described previously (Yang et al., 2009b).

#### **PROTEOME SAMPLE PREPARATION**

Duplicate mixtures of microbial cells that were metabolically labeled with either 15N ammonium sulfate for the wild-type *Z. mobilis* ZM4 culture or 14N ammonium sulfate for the AcR mutant culture were prepared by mixing equal weights of cell pellets from duplicate cultures for each strain. Cell mixtures were lysed by sonication in ice-cold 50 mM Tris-HCl (pH 7.5) buffer, and unbroken cells were removed by centrifugation at 5000 × g for 10 min. Protein concentration for each sample was determined with the RC DC™ protein assay (Bio-Rad Lab, CA). The two fractions from each cell mixture were digested using the following protocol. The proteins were denatured and reduced with 6 M guanidine and 10 mM dithiothreitol (DTT) (Sigma-Aldrich, MO) at 60°C for 1 h. The denatured proteome fractions were diluted 6-fold with 50 mM Tris/10 mM CaCl2 (pH 7.6), and sequencing grade trypsin was added at a ratio of 1:100 (wt:wt). The first digestion was run overnight at 37°C and, after adding additional trypsin, the second digestion was run for 5 h at 37°C. The samples were then reduced with 20 mM DTT for 1 h at 60°C and were desalted using Sep-Pak Plus C-18 solid-phase extraction (Waters Co, MA).

#### **QUANTITATIVE PROTEOMICS MEASUREMENT**

The protein digests were examined in duplicate with LC-MS/MS using twelve-step split-phase MudPIT (MacCoss et al., 2002; McDonald et al., 2002). The samples were loaded via a pressure cell (New Objective, MA) onto a 250-µm-I.D. fused silica front column fritted into an M-520 filter union (Upchurch Scientific, WA). The column packing consisted of 2 cm strong cation exchange resin Luna® and 2 cm C18 reverse-phase resin Aqua (Phenomenex, CA). A 100-µm-I.D. PicoFrit column (New Objective, MA) was packed with 15 cm C18 reverse-phase resin. The front column was connected to the PicoFrit column and then placed in-line with a Dionex Ultimate quaternary HPLC. Two-dimensional LC separation was performed with twelve salt pulses, each of which was followed by a 2-h reverse-phase gradient. MS/MS analysis was performed on an LTQ linear ion trap instrument (ThermoFinnigan, CA) with dynamic exclusion enabled. Each full scan (400–1700 *m/z*) was followed by three data-dependent MS/MS scans at 35% normalized collision energy. The full scans were averaged from five microscans and the MS/MS scans were averaged from two microscans.

#### **QUANTITATIVE PROTEOMICS DATA ANALYSIS**

All MS/MS scans were searched in two iterations against the FASTA database containing all annotated *Z. mobilis* proteins using the SEQUEST program (Eng et al., 1994). In the first iteration, the molecular masses of amino acids containing 14N were used, and, in the second iteration, the masses of amino acids containing 15N were used. The peptide identifications from the two iterations were merged. The DTASelect program (Tabb et al., 2002) was used to filter the peptide identifications and to assemble the peptides into proteins using the following parameters: retaining the duplicate MS/MS spectra for each peptide sequence (DTASelect option: -t 0), fully tryptic peptides only, with a delCN of at least 0.08 and cross-correlation scores (Xcorrs) of at least 1.8 (for parent ion charge state, *z* = +1), 2.5 (*z* = +2), or 3.5 (*z* = +3). Selected ion chromatogram extraction, peptide abundance ratio estimation and protein abundance ratio estimation were completed with the ProRata program as described previously (Pan et al., 2006, 2008).
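The score thresholds listed above amount to a charge-dependent acceptance rule. The following sketch expresses only those stated criteria; it is not DTASelect's implementation, the function and constant names are hypothetical, and rejecting charge states other than +1, +2, and +3 is an assumption made here for illustration.

```python
MIN_DELCN = 0.08

# Minimum Xcorr by parent ion charge state, as stated in the text.
MIN_XCORR_BY_CHARGE = {1: 1.8, 2: 2.5, 3: 3.5}


def passes_filter(charge, xcorr, delcn, fully_tryptic):
    """Return True if a peptide identification meets the stated criteria."""
    threshold = MIN_XCORR_BY_CHARGE.get(charge)
    if threshold is None:
        # Assumption: charge states without a stated threshold are rejected.
        return False
    return fully_tryptic and delcn >= MIN_DELCN and xcorr >= threshold


# Hypothetical identifications: (charge, Xcorr, delCN, fully tryptic)
candidates = [
    (1, 1.9, 0.10, True),   # accepted
    (2, 2.4, 0.12, True),   # Xcorr below the +2 threshold
    (2, 3.1, 0.05, True),   # delCN too low
    (3, 3.6, 0.09, False),  # not fully tryptic
    (3, 3.5, 0.08, True),   # accepted, exactly at both thresholds
]

accepted = [c for c in candidates if passes_filter(*c)]
print(len(accepted))
```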

#### **MICROARRAY ANALYSIS AND qRT-PCR VALIDATION**

Microarray analysis was conducted essentially as described previously (Yang et al., 2010b). Briefly, total cellular RNA was extracted using the TRIzol reagent (Invitrogen, CA) followed by RNase-free DNase I (Ambion, TX) digestion. RNA quality and quantity were tested with a NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, DE) and an Agilent Bioanalyzer (Agilent, CA) before ds-cDNA synthesis using the Invitrogen ds-cDNA synthesis kit (Invitrogen, CA). The ds-cDNA was sent to NimbleGen for labeling, hybridization, and scanning following the company's protocols. Quality assessments, normalization, and statistical analyses were conducted using JMP Genomics 4.0 software (SAS Institute, Cary, NC) as described earlier (Yang et al., 2010b). An analysis of variance (ANOVA) determined differential expression levels between strains and time points using the FDR testing method (*p* < 0.05). The interactions among differentially regulated genes were investigated using the String 8.2 database (Jensen et al., 2009), available at http://string.embl.de/. The transcriptomic and proteomic data were also mapped to predicted metabolic pathways using the PathwayTools Omics Viewer (Karp et al., 2002, 2011) at http://biocyc.org/expression.html.
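The text does not state which FDR procedure was applied. Purely as an illustration of how a false discovery rate correction works, the sketch below implements the widely used Benjamini–Hochberg adjustment; treating it as the method used here is an assumption, and the *p*-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)

# Genes called differentially expressed at an FDR threshold of 0.05.
significant = [i for i, p in enumerate(adjusted_p) if p < 0.05]
print(adjusted_p, significant)
```

With many thousands of genes tested across several factors, a correction of this kind keeps the expected proportion of false positives among the reported genes below the chosen threshold.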

Microarray data were validated using real-time qPCR as described previously (Yang et al., 2009b, 2010b), except that the Bio-Rad MyiQ2 Two-Color Real-Time PCR Detection System (Bio-Rad Lab, CA) and Roche FastStart SYBR Green Master (Roche Applied Science, IN) were used for this study. Nine genes representing different functional categories and a range of gene expression values based on microarray hybridizations were analyzed by qPCR using cDNA derived from samples taken at different time points. Primer pairs were designed as described previously (Yang et al., 2009b), and the oligonucleotide sequences for the nine genes selected for qPCR analysis are listed in Additional File 6. The qRT-PCR ratios were plotted against the microarray ratios, and a regression analysis was conducted to obtain the R-squared correlation coefficient.

# **AUTHOR CONTRIBUTIONS**

Steven D. Brown, Shihui Yang, Gregory B. Hurst, and Chongle Pan designed the experiment. Shihui Yang carried out the fermentation, RNA extraction, and sample preparation for HPLC, microarray, and proteomics. Lezlee Dice performed the qRT-PCR. Chongle Pan performed the proteomic runs and generated proteomic raw data. Shihui Yang, Steven D. Brown, Chongle Pan, and Gregory B. Hurst analyzed the data. Shihui Yang and Steven D. Brown wrote the manuscript, and Brian H. Davison, Gregory B. Hurst, and Chongle Pan provided inputs on manuscript revision.

### **ACKNOWLEDGMENTS**

We thank Miguel Rodriguez Jr. for assistance with HPLC analyses. This work was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL) and concluded under the BioEnergy Science Center which is a U.S. Department of Energy Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science. This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2014.00246/abstract

# **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 28 February 2014; accepted: 06 May 2014; published online: 22 May 2014. Citation: Yang S, Pan C, Hurst GB, Dice L, Davison BH and Brown SD (2014) Elucidation of Zymomonas mobilis physiology and stress responses by quantitative proteomics and transcriptomics. Front. Microbiol. 5:246. doi: 10.3389/fmicb.2014.00246*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Yang, Pan, Hurst, Dice, Davison and Brown. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Metabolic engineering of yeasts by heterologous enzyme production for degradation of cellulose and hemicellulose from biomass: a perspective

# *William Kricka, James Fitzpatrick and Ursula Bond\**

*School of Genetics and Microbiology, Department of Microbiology, Trinity College Dublin, Dublin, Ireland*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Biswarup Sen, Feng Chia University, Taiwan*

*Ed Louis, University of Leicester, UK*

#### *\*Correspondence:*

*Ursula Bond, School of Genetics and Microbiology, Department of Microbiology, Trinity College Dublin, College Green, Dublin 2, Ireland e-mail: ubond@tcd.ie*

This review focuses on current approaches to metabolic engineering of ethanologenic yeast species for the production of bioethanol from complex lignocellulose biomass sources. The experimental strategies for the degradation of the cellulose and xylose components of lignocellulose are reviewed. Limitations to the current approaches are discussed and novel solutions proposed.

#### **Keywords: recombinant yeasts, cellulases, xylose-utilizing enzymes**

# **INTRODUCTION**

With dwindling fossil fuel resources and the necessity to combat climate change by reducing greenhouse gas emissions, there is a growing need to identify alternative environmentally sustainable energy sources. One potential green energy source is that derived from biomass, which can be converted to useable energy such as biofuels. The two most common examples of biofuels are bio-ethanol and bio-diesel. Biofuels are considered the cleanest liquid fuel alternative to fossil fuel, and it is estimated that replacing fossil fuels with biofuels could decrease CO2 emissions by 60–90% (Hasunuma and Kondo, 2012). Currently, over 100 billion liters of biofuels are produced annually, yet this accounts for only a fraction (2.7%) of total energy used in transportation. Bioethanol production reached 86 billion liters in 2010, with the United States and Brazil as the world's top producers, accounting together for 90% of global production (Buijs et al., 2013).

#### **GENERATIONS OF BIOETHANOL**

Bioethanol production has been classified into different generations based on biotechnology developments and also in terms of the feedstock used. First generation bioethanol is produced from either corn or sugarcane. The released sugars, glucose and sucrose, respectively, are readily fermentable into ethanol by microorganisms such as the yeast *Saccharomyces cerevisiae* (Buijs et al., 2013). A disadvantage of using these substrates as energy sources is that they are food crops, and hence controversy regarding the ethics of exploiting "food for fuel" surrounds the established bioethanol industry. Currently, over 90% of the world's bioethanol is produced from food crops; however, governmental directives are setting exacting limits on the generation of renewable energy from food-based crops. The US Renewable Fuel Standards mandate (RFS; US Energy Policy Act (EPAct) 2005) requires that 44% (16 billion gallons) of the renewable fuel blended into gasoline by 2022 be derived from non-food cellulosic biomass. Current negotiations at the European Council aim to limit the amount of biofuels from food-based crops that can be counted toward the 10% target for renewable energy in the transport sector by 2020 (EU Renewable Energy Directive, RED; 2009/28/EC).

Second generation biofuels are derived from more complex non-food based biomass, which can be grouped into roughly four categories, namely, wood residues, municipal solid waste, agricultural waste, and dedicated energy crops. The most abundant renewable form of biomass is lignocellulose, comprising between 50 and 90% of all plant matter. The global production of plant biomass amounts to approximately 2 × 10<sup>11</sup> Mt per annum, of which 8–20 × 10<sup>9</sup> Mt is potentially accessible for processing. Lignocellulose is composed of three major components, cellulose, hemicellulose, and lignin. The relative amount of each component varies in different plant types, with an average composition of cellulose 30–50%, hemicellulose 20–30%, and lignin 15–25%.

Cellulose is the most abundant polysaccharide on earth (Chandel and Singh, 2011). In its simplest chemical form, cellulose is a β-glucan linear polymer of D-glucose linked by β-1,4-glycosidic bonds. The basic repeating subunit is cellobiose, consisting of two glucose molecules. At the macroscopic level, cellulose exists as two distinct forms, tightly packed crystalline and non-organized amorphous regions. The amorphous regions of cellulose are believed to result from surface shaving caused by natural erosion. Both the amorphous and crystalline forms are made up of cellulosic fibers comprising microfibrils, which are composed of approximately 30 β-glucan chains. The highly accessible amorphous regions account for approximately 1% of the structure of cellulose (Ruel et al., 2012).

Hemicellulose is the second most abundant polysaccharide within lignocellulose (Peng et al., 2012). It is a highly branched heteropolymer composed of pentoses and hexoses such as xylose, arabinose, mannose, glucose, and galactose as well as sugar acids. The composition of hemicellulose is variable in nature and depends upon the plant source; however, the most abundant component of hemicellulose is generally xylose.

The third component of lignocellulose is lignin, a polymer of three aromatic alcohols, coniferyl, p-coumaryl, and sinapyl. Lignin links both hemicelluloses and cellulose together forming a physical barrier in the plant cell wall. Lignin is recalcitrant to degradation and is resistant to most microbial attacks and oxidative stress (Dashtban et al., 2009). Unlike cellulose and hemicellulose, hydrolysis of lignin does not generate fermentable sugars. Furthermore, phenolic compounds produced during lignin hydrolysis actively inhibit fermentation.

Biomass is not readily fermentable and expensive pretreatments, both physical (milling and steam explosion) and chemical (acid and alkaline hydrolysis), are required to increase access to the sugars within the biomass. The sugars released from pre-treatment must be further hydrolyzed by enzymatic actions to yield fermentable glucose.

Several different approaches have been developed to generate bioethanol from biomass. One such process, called separate hydrolysis and fermentation (SHF), as its name suggests, requires a two-step process in which biomass hydrolysis and fermentation of released sugars are performed in separate reaction vessels. The main advantage of this method is that the two processes can be performed under their own individual optimal conditions. Hydrolysis of cellulose by enzymes, referred to as cellulases, is most efficient at temperatures between 50 and 60°C, whereas fermentation reactions are generally performed at 25–35°C, the optimum temperature range for yeast growth and metabolism. Cellulase enzymes are not found naturally in fermenting microorganisms and must be supplied *ex vivo*. A disadvantage of SHF is that glucose and cellobiose released by the action of cellulases inhibit the subsequent activity of enzymes.

To streamline production, an alternative process of simultaneous saccharification and fermentation (SSF) was devised. In this process the saccharification and fermentation are performed together in a single vessel and glucose released by the action of cellulases is immediately metabolized, thereby reducing enzyme inhibition. However, a disadvantage of SSF is the need to use a compromise temperature that is optimal for neither hydrolysis nor fermentation.

The requirement for maximum efficiency to ensure economic viability of bioethanol production led to the development of consolidated bioprocessing (CBP) or third generation biofuels. In CBP, all steps are performed in the same reaction vessel by a single organism capable of both producing biomass hydrolyzing enzymes and fermenting the resultant sugars to ethanol. Thus, CBP avoids the need to supply exogenously produced cellulase enzymes.

#### **THE CELLULASES**

Cellulases belong to the O-glycoside hydrolases group of enzymes. There are three major classes of cellulases, endoglucanases (EG), cellobiohydrolases (CBH) or exoglucanases, and β-glucosidases (BGL). The concerted action of all three cellulases is required to efficiently convert cellulose into glucose. The generally accepted view is that cellulases act sequentially and synergistically. Endoglucanases randomly cleave the cellulose backbone at amorphous sites along the cellulose fiber. This leads to a rapid decrease in the degree of polymerization of the cellulose fiber and exposes new chain ends. Cellobiohydrolases act processively on reducing and non-reducing chain ends to release mainly cellobiose. β-Glucosidases hydrolyze the β-1,4 glycosidic bond of cellobiose and oligosaccharides to release glucose units.
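The sequential action of the three enzyme classes can be illustrated with a deliberately simplified toy model. The sketch below is not a kinetic model and is not drawn from any of the cited studies: the function names, the chain length of 40 glucose units, and the numbers of cuts and rounds are hypothetical choices made purely for illustration. It tracks only chain lengths, assumes that a cellobiohydrolase releases one cellobiose per chain per round, and assumes that all released cellobiose is hydrolyzed by β-glucosidase.

```python
import random


def endoglucanase_cut(chains, rng):
    """EG: cleave one randomly chosen internal bond, creating two new chain ends."""
    cuttable = [i for i, n in enumerate(chains) if n >= 4]
    if not cuttable:
        return chains
    i = rng.choice(cuttable)
    n = chains[i]
    cut = rng.randint(2, n - 2)  # keep both fragments at least 2 units long
    return chains[:i] + [cut, n - cut] + chains[i + 1:]


def cellobiohydrolase_step(chains):
    """CBH: release one cellobiose (2 glucose units) from every chain that can spare it."""
    cellobiose = 0
    remaining = []
    for n in chains:
        if n >= 2:
            cellobiose += 1
            n -= 2
        if n > 0:
            remaining.append(n)
    return remaining, cellobiose


def beta_glucosidase(cellobiose):
    """BGL: hydrolyze each cellobiose into two glucose molecules."""
    return 2 * cellobiose


def simulate(chain_length=40, eg_cuts=3, cbh_rounds=5, seed=0):
    """Return (remaining chain lengths, glucose released) for one toy run."""
    rng = random.Random(seed)
    chains = [chain_length]
    for _ in range(eg_cuts):
        chains = endoglucanase_cut(chains, rng)
    glucose = 0
    for _ in range(cbh_rounds):
        chains, cellobiose = cellobiohydrolase_step(chains)
        glucose += beta_glucosidase(cellobiose)
    return chains, glucose


if __name__ == "__main__":
    for cuts in (0, 3):
        _, glucose = simulate(eg_cuts=cuts)
        print(f"EG cuts: {cuts}, glucose released after 5 CBH rounds: {glucose}")
```

In this toy setting, prior endoglucanase cuts increase the number of chain ends available to the cellobiohydrolase, so more glucose is released in the same number of rounds. This is the qualitative sense in which the enzymes are said to act synergistically; real systems also involve substrate accessibility, product inhibition, and enzyme binding, none of which is modeled here.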

Several novel classes of enzymes such as the copper-requiring polysaccharide monooxygenases, for example GH61, contribute to cellulose degradation by acting in synergy with the exo- and endoglucanases (Leggio et al., 2012; Žifčáková and Baldrian, 2012). Elastin-like proteins such as swollenin and other cellulase-enhancing proteins contribute to the hydrolysis of cellulose by increasing access of the cellulase enzymes to the cellulose chain ends (Kubicek, 2013; Nakatani et al., 2013). Cellulase enzymes are naturally produced by a variety of filamentous fungi of different genera such as *Trichoderma*, *Aspergillus*, *Talaromyces*, and several anaerobic bacteria. Two different cellulase systems, referred to as complexed and non-complexed, have been described. Complexed cellulase systems are multi-enzyme complexes, referred to as the cellulosome, that remain tethered to the cell wall of cellulolytic bacteria (Lynd et al., 2002; Fontes and Gilbert, 2010). These cell wall tethered complexes are primarily encountered in anaerobic bacteria such as species of the *Clostridium* and *Ruminococcus* genera. Filamentous fungi and certain actinomycete bacteria such as *Cellulomonas* species use non-complexed cellulase systems. Non-complexed cellulases are secreted into the extracellular environment and are not attached to the cell surface. The most extensively studied cellulolytic organism is the filamentous fungus *Trichoderma reesei*. *T. reesei* synthesizes an array of cellulases, including at least five EGs, two CBHs, and two BGLs (Foreman et al., 2003; Martinez et al., 2008; Kubicek, 2013).

#### **STRATEGIES FOR HYDROLYSIS OF CELLULOSE AND FERMENTATION OF RELEASED SUGARS TO ETHANOL**

While clearly capable of degrading cellulose, *T. reesei* is not an efficient fermenter of sugars to alcohol, a prerequisite for a CBP microorganism. *T. reesei* can generate up to 4.8 g/L ethanol from growth in glucose while *Aspergillus oryzae* can produce 24.4 g/L ethanol (Skory et al., 1997). Natural ethanologens such as *Zymomonas mobilis* or *Saccharomyces cerevisiae* can produce 130–200 g/L under the right environmental conditions. One major disadvantage with using mycelial fungi for ethanol production is the slow bioconversion rate compared to that observed in yeasts. Furthermore, filamentous fungi show intolerance to high concentrations of ethanol. An analysis of conversion of glucose to ethanol by 19 strains of *Aspergillus* revealed efficiencies of 21–98% after 6 days compared to 100% conversion by *S. cerevisiae* within 48 h (Skory et al., 1997).

To date, no naturally occurring microorganism capable of CBP at the desired efficiency for industrial bioethanol production has been identified. Therefore, researchers have pursued two strategies (native and recombinant) to generate the ideal microorganism for CBP. The native strategy focuses on modifying natural cellulolytic organisms to improve ethanol yields. Several approaches have been pursued, including directed evolution using error-prone Polymerase Chain Reaction-based mutagenesis of cellulase genes, adaptive evolution using natural selection to specific environmental conditions or rational protein design to improve the enzymatic activity of cellulases or to expand the physiological conditions at which the enzymes are active (Elkins et al., 2010; Voutilainen et al., 2010; Liang et al., 2011; Anbar et al., 2012; Gefen et al., 2012; Wang et al., 2012). Challenges still remain for these types of approaches in order to upscale to industrial fermentation conditions.

The recombinant strategy involves genetic engineering of native cellulase-producing species to improve ethanol yields or heterologous expression of cellulase genes in natural ethanologens. Ethanol yields in *A. niger* can be increased by expression of a pyruvate decarboxylase gene from *Z. mobilis* (Skory et al., 1997). Likewise, heterologous gene expression of pyruvate decarboxylase and alcohol dehydrogenase genes from *Z. mobilis* in the cellulolytic bacterium *C. cellulolyticum* was found to increase ethanol production by 53% (Guedon et al., 2002).

The vast majority of research into producing a CBP candidate has however followed the recombinant strategy of heterologous expression of cellulase genes in natural ethanologens. In order to achieve complete hydrolysis of cellulose, at least one copy of each of the three classes of cellulase genes must be expressed in the host cell. The most commonly used host for heterologous expression of cellulase genes for CBP is the baker's yeast, *S. cerevisiae* (Fujita et al., 2004; Den Haan et al., 2007a; Tsai et al., 2010; Wen et al., 2010; Yamada et al., 2011; Fan et al., 2012; Nakatani et al., 2013), although the three classes of cellulase genes have also been expressed in other *Saccharomyces* species such as *S. pastorianus* (Fitzpatrick et al., 2014) as well as in bacterial species such as *Escherichia coli* (Ryu and Karim, 2011) (**Table 1**). Since the native promoters of cellulase genes are repressed by glucose, strategies of using inducible or constitutive promoters of the host have been pursued. Inducible promoters such as *S. cerevisiae* GAL1 or CUP1 promoters are extremely efficient but require the addition of an inducer, galactose or copper, respectively. The requirement for such inducers can be expensive and often incompatible with fermentation conditions for ethanol production. Moreover, the GAL promoters are repressed in the presence of glucose and therefore not suited for industrial CBP. Partow et al. (2010) tested the performance of several constitutive and inducible promoters for heterologous gene expression in *S. cerevisiae*. Their findings indicated that the constitutive promoters TEF1 and PGK1 produced the most constant expression profiles. These promoters have been used for heterologous cellulase gene expression in *S. cerevisiae* (Den Haan et al., 2007b; Yamada et al., 2011), however no more than a 2-fold difference in expression was observed in genes driven by these two promoters, with TEF1 generating the highest levels (Fitzpatrick et al., 2014).

Another strategy for increasing cellulase activity is to increase the gene copy number. Episomal plasmids have been extensively used to express cellulase genes (Fujita et al., 2004; Den Haan et al., 2007b; Tsai et al., 2010; Wen et al., 2010; Fan et al., 2012); however, there is an issue with their genetic stability. Under non-selective conditions, plasmids are lost within several generations, and constant selection, such as culturing in auxotrophic or antibiotic media, must be applied. This setup is not suited to industrial production due to the cost of such selection reagents. The expression of *T. reesei* endoglucanase gene EGI on a high copy number 2μ episomal vector (pRSH-series) was 50-fold greater than when expressed from an ARS/CEN vector (pGREG-series) (Fitzpatrick et al., 2014). A preferred solution is the integration of cellulase gene cassettes directly into the chromosome of the host microbe. Classic integration methods ensure stability of the genes; however, enzyme production is limited by gene copy number (Du Plessis et al., 2010; Yamada et al., 2013). Multi-copy integration of cellulase genes offers a solution. Yamada and co-workers constructed *S. cerevisiae* strains containing multiple copies of cellulase genes integrated into the delta (δ) repeat sites of transposable elements (Ty) in the host chromosome, leading to increased ethanol yields (Yamada et al., 2010).

Defining the optimum ratio of the different classes of cellulases is also important to achieve maximum cellulose hydrolysis. The true optimum ratio of cellulases used by natural cellulolytic microorganisms is not fully known (Yamada et al., 2013) but it has been estimated that the total secreted protein of *T. reesei* under inducing conditions is 60% CBHI, 20% CBHII, 10% EG, and 1% BGL (Takashima et al., 1998). Hence, a 1:1:1 ratio of the three classes of enzymes may not yield an optimum hydrolysis synergy. Yamada and colleagues performed repeated rounds of integration at delta sites to generate an *S. cerevisiae* strain with 24 cellulase genes integrated into the genome in a ratio of 16:6:2 for *egl2*, *cbh2*, and *bgl1*, respectively (Tien-Yang et al., 2012). Cellulose degradation activity of this strain was lower than in a strain which had the cellulase genes *egl2*, *cbh2*, and *bgl1* in a ratio of 13:6:1, respectively. Interestingly, a strain with *egl2*, *cbh2*, and *bgl1* in a ratio of 5:9:6, respectively, produced 1.3-fold less cellulose hydrolysis activity than a strain containing the genes in a 13:6:1 ratio, suggesting that the ratio of the cellulase enzymes as well as the copy number is crucial to ensure efficient cellulose hydrolysis.
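To make the copy-number ratios quoted above easier to compare, they can be normalized to the share of each gene in the total number of integrated copies. The following is a minimal sketch; the helper name `copy_fractions` and the dictionary layout are illustrative assumptions, and only the ratios themselves are taken from the text.

```python
from fractions import Fraction


def copy_fractions(copies):
    """Return each gene's share of the total integrated copy number."""
    total = sum(copies.values())
    return {gene: Fraction(n, total) for gene, n in copies.items()}


# Copy numbers of egl2, cbh2 and bgl1 for the three strains described in the text.
strains = {
    "16:6:2": {"egl2": 16, "cbh2": 6, "bgl1": 2},
    "13:6:1": {"egl2": 13, "cbh2": 6, "bgl1": 1},
    "5:9:6": {"egl2": 5, "cbh2": 9, "bgl1": 6},
}

for label, copies in strains.items():
    shares = copy_fractions(copies)
    summary = ", ".join(f"{gene} {float(share):.0%}" for gene, share in shares.items())
    print(f"{label} (total {sum(copies.values())} copies): {summary}")
```

Laid out this way, the 13:6:1 and 5:9:6 strains carry the same total number of copies (20) but differ in composition, which is consistent with the interpretation that the ratio, and not only the total copy number, affects hydrolysis activity.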

The choice of cellulase system for integration into the recipient ethanologenic host is important to consider. The cellulosome from *C. thermocellum* has been reconstituted in *S. cerevisiae* (Wen et al., 2010) as have the secretory cellulases from several filamentous fungi including *A. aculeatus*, *A. oryzae*, *Saccharomycopsis fibuligera*, *Thermoascus aurantiacus*, and *T. reesei* (**Table 1**). Genes encoding the different classes of cellulases from a single species or from different fungal origins have been co-expressed in *S. cerevisiae* (**Table 1**). Secretion of the expressed cellulases has been achieved using the native secretory signals or host signals, for example, the α-mating type secretory signal. In some cases, the recombinant cellulases have been tethered to the cell surface (**Table 1**, **Figure 1**) using anchor proteins such as *S. cerevisiae* α-agglutinin (Yanase et al., 2010) or cell wall protein (Van Rooyen et al., 2005). Cell wall tethering leads to an effective increase in concentration of the cellulases. On the other hand, expressing cellulases so that the proteins are untethered facilitates the binding of the enzymes at multiple sites along the length of the cellulose chain. Interestingly, very little difference in ethanol yields was observed in yeast strains in which the cellulases were secreted or cell wall tethered, indicating that one approach is not superior to the other (Yanase et al., 2010).

To date, haploid *S. cerevisiae* has been the host of choice for heterologous expression of cellulase genes, although a few studies have examined the expression of cellulase genes in other hosts. Yamada and colleagues constructed a diploid *S. cerevisiae* yeast strain with cellulase genes integrated at δ-integration sites. The diploid strain displayed 6-fold higher phosphoric acid swollen cellulose (PASC) degradation than the parental haploid (Yamada et al., 2011). Cellulases have also been expressed in the polyploid lager yeast *S. pastorianus* (Fitzpatrick et al., 2014). The latter study compared the expression of the three classes of cellulases from *T. reesei* in strains of *S. pastorianus* and *S. cerevisiae*. Enzymatic activity for all three classes of enzymes was up to 10-fold higher in *S. pastorianus* strains compared to *S. cerevisiae* strains. Thus for CBP, it will be important to explore the use of yeast species other than *S. cerevisiae* for cellulase production. This point is particularly important in light of the temperature difference between optimal growth conditions for *Saccharomyces* species (25–35°C) and for cellulase activity (50–60°C). Cellulase activity at 30°C is 3-fold less than at 50°C and therefore cellulose hydrolysis will always be compromised when *Saccharomyces* species are used as hosts. An organism that could bridge the gap between saccharification and fermenting temperatures is *Kluyveromyces* sp., which are more thermotolerant than *Saccharomyces* sp. (Fonseca et al., 2008). Expression of all three classes of cellulases in *K. marxianus* produced up to 43 g/L ethanol using the simple di-saccharide cellobiose as a sole carbohydrate source (Hong et al., 2007). The hydrolysis of more complex cellulose substrates such as PASC has yet to be tested with this recombinant host. One disadvantage with *K. marxianus* is that it is less ethanol tolerant than *S. cerevisiae*.

**Table 1 | Ethanol production from PASC by recombinant yeast strains expressing cellulases.**

# **THE CHICKEN AND EGG PROBLEM**

Ultimately, a key requirement of a recombinant host species for CBP is the ability to utilize complex cellulose substrates such as lignocellulose biomass as the sole carbohydrate source. It remains unclear how the hydrolysis of cellulose by recombinant microbes can be initiated. The host must produce enzymes to degrade cellulose into glucose to ensure cell growth while at the same time, the cell must be growing in order to produce and secrete enzymes, thus the classic chicken and egg conundrum. Cellulose cannot be transported directly into the cell and must be hydrolyzed extracellularly. In order to produce the cellulases, the cells must be actively dividing to ensure transcription from inducible or constitutive promoters. It may be therefore necessary to supply a residual amount of fermentable sugars, such as glucose or sucrose to kick-start the process. We have previously shown that yeast cell growth can be maintained with as little as 0.5 g/L glucose, although the addition of sugars to CBP will ultimately add costs to the process. The approach taken by several groups is to use very high cell numbers to carry out fermentations of cellulose substrates. The idea here is that residual cellulases synthesized in the pre-fermentation starter cultures will be released from cells upon incubation with fresh medium containing the cellulose substrate. A disadvantage to this approach is that the use of high cell densities limits the continued growth of the culture, which inevitably will quickly enter stationary phase. Tethering the cellulases to the cell surface may solve this dilemma, as cellulases produced in pre-fermentation starter cultures may remain active once the cells are switched to the biomass fermentation process.

Taking the known problems associated with CBP into account, we favor an SHF *in situ* approach in which a pre-hydrolysis of the cellulose substrate is performed prior to fermentation. Pre-hydrolysis can be carried out with recombinant cellulases secreted into the spent medium in yeast starter cultures. This spent medium is then used in a pre-fermentation step to begin the process of hydrolysis of the cellulose substrate at the optimal temperature of 50–60°C to release glucose. The medium can then be cooled, replenished with essential nutrients and cellulase-expressing yeasts, and fermentations carried out at 25–30°C (**Figure 3**). Interestingly, the majority of industrial fermentation facilities used for the production of potable alcohols (beer, ales, lagers, and spirits) incorporate a pre-fermentation hydrolysis step at high temperatures (50–63°C) in order to release fermentable sugars (maltose, sucrose etc.) from complex carbohydrate substrates (starches) such as wheat and barley. Therefore, an SHF *in situ* approach could easily be incorporated into current fermentation processes. SHF *in situ* differs from SHF in eliminating the requirement for the addition of costly commercially produced cellulases.

The ultimate test of any CBP host candidate, whether generated by native or recombinant strategies, is the amount of ethanol the microorganism can produce from a cellulosic substrate. The most commonly used cellulosic substrate tested by research groups is phosphoric acid swollen cellulose (PASC). **Table 1** summarizes the ethanol yields from fermentation of hydrolyzed PASC reported to date. Despite the myriad of approaches undertaken, ethanol levels are in the range of 7–16.5 g/L, far below what is achievable with more conventional complex carbohydrate substrates such as grains where yields of up to 200 g/L are commonplace. For industrial bioethanol production from cellulosic biomass, far greater yields are required. The low yields most likely reflect the slow bioconversion rate of cellulose-based substrates such as PASC. The results summarized in **Table 1** highlight the need for further improvements in this process.

It should be noted that it is difficult to compare ethanol yields from the various studies due to variations in experimental parameters such as starting substrate concentrations, incubation times, and the density of cells used for fermentations. Furthermore, the unit of activity for cellulase hydrolysis differs widely. For comparative purposes, going forward, it would be useful for research groups to standardize the experimental conditions used for PASC hydrolysis and to adopt a standard definition of a unit of cellulase activity, for example, grams of ethanol produced per gram of theoretical glucose present in the cellulose substrate per hour (g ethanol g glucose<sup>−1</sup> h<sup>−1</sup>). Furthermore, while current studies almost exclusively use PASC as an experimental cellulose substrate, it will be essential to expand this analysis to the use of more complex lignocellulose substrates such as straw, spent grains, and grasses.
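As an illustration of how such a standardized unit could be calculated, the sketch below converts a cellulose loading into its theoretical glucose content and then normalizes an ethanol titer by that value and by the fermentation time. It assumes pure cellulose (one water molecule added per anhydroglucose unit on hydrolysis) and the stoichiometric fermentation of one glucose to two ethanol and two CO2. The function names and the example run (10 g/L PASC, 3 g/L ethanol after 72 h) are hypothetical and are not taken from any of the studies in **Table 1**.

```python
# Molar masses in g/mol: anhydroglucose unit in cellulose, free glucose, ethanol.
M_ANHYDROGLUCOSE = 162.14
M_GLUCOSE = 180.16
M_ETHANOL = 46.07


def theoretical_glucose(cellulose_g_per_l):
    """Glucose (g/L) released on complete hydrolysis of the given cellulose loading."""
    return cellulose_g_per_l * M_GLUCOSE / M_ANHYDROGLUCOSE


def theoretical_ethanol(cellulose_g_per_l):
    """Maximum ethanol (g/L) if every glucose is fermented to 2 ethanol + 2 CO2."""
    return theoretical_glucose(cellulose_g_per_l) * 2 * M_ETHANOL / M_GLUCOSE


def normalized_productivity(ethanol_g_per_l, cellulose_g_per_l, hours):
    """Ethanol produced per gram of theoretical glucose per hour."""
    return ethanol_g_per_l / theoretical_glucose(cellulose_g_per_l) / hours


if __name__ == "__main__":
    # Hypothetical run, for illustration only.
    cellulose, ethanol, hours = 10.0, 3.0, 72.0
    print(f"theoretical glucose: {theoretical_glucose(cellulose):.2f} g/L")
    print(f"theoretical ethanol: {theoretical_ethanol(cellulose):.2f} g/L")
    print(f"productivity: {normalized_productivity(ethanol, cellulose, hours):.5f} "
          "g ethanol per g glucose per h")
```

Expressing results in this form removes the dependence on starting substrate concentration and incubation time, although differences in cell density would still need to be reported separately.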

### **METABOLISM OF XYLOSE FROM HEMICELLULOSE FOR BIOETHANOL PRODUCTION**

The pentose sugar xylose is a major component of hemicellulose. Its abundance and relative ease of extraction from biomass by pre-treatments described above make it an attractive source of fermentable sugar for the production of bioethanol. In nature, xylose can be utilized by various yeast species and bacteria that are found associated directly or indirectly with lignocellulose. The mining and isolation of yeast from environments where xylose is likely to be an abundant natural sugar has developed greatly over recent years. Several xylose-utilizing yeast species such as *Spathaspora passalidarum* are indirectly associated with lignocellulose, via their symbiotic relationship with wood boring beetles such as *Odontotaenius disjunctus* (Hou, 2012) or wood roaches such as *Cryptocercus* sp. (Urbina et al., 2013). Additionally xylose-utilizing species including *Candida* sp., *Geotrichum* sp., *Sporopachydermis* sp., *Trichosporon* sp., *Pichia* sp., and *Sugiyamaella* sp. have been isolated from buffalo feces (Wanlapa et al., 2013) or from soil (Zhang et al., 2014). Although some success has been achieved using natural xylose fermenting species, their industrial relevance has yet to be demonstrated.

Two major xylose-utilizing pathways have been identified. In xylose-fermenting fungi and yeasts, xylose utilization involves the action of two oxidoreductases, xylose reductase (XR) and xylitol dehydrogenase (XDH), which require the co-factors NADPH and NAD+, respectively, in the forward reactions (**Figure 2**, pathway 1). Most bacterial species use an alternative pathway requiring just a single enzyme, xylose isomerase (**Figure 2**, pathway 2). The product of both pathways, xylulose, is phosphorylated by xylulose kinase to xylulose-5-phosphate, which can enter the Pentose Phosphate Pathway (PPP), thus generating intermediates of the glycolytic pathway.
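The difference in co-factor usage between the two pathways can be made explicit with a simple bookkeeping exercise. The sketch below is an illustrative tally only; the data layout, the function name `net_cofactors`, and the abbreviation XI for xylose isomerase are assumptions made here, and the co-factor assignments are those stated above for the forward reactions.

```python
# Net co-factor changes per molecule of xylose converted to xylulose.
# Negative values are consumed, positive values are produced.
PATHWAYS = {
    "XR/XDH": [
        ("xylose -> xylitol (XR)", {"NADPH": -1, "NADP+": +1}),
        ("xylitol -> xylulose (XDH)", {"NAD+": -1, "NADH": +1}),
    ],
    "XI": [
        ("xylose -> xylulose (XI)", {}),
    ],
}


def net_cofactors(steps):
    """Sum co-factor changes over all steps, dropping co-factors with no net change."""
    totals = {}
    for _name, changes in steps:
        for cofactor, delta in changes.items():
            totals[cofactor] = totals.get(cofactor, 0) + delta
    return {c: d for c, d in totals.items() if d != 0}


if __name__ == "__main__":
    for name, steps in PATHWAYS.items():
        print(name, net_cofactors(steps))
```

Under these assumptions, the two-enzyme pathway consumes NADPH and generates NADH for every xylose converted, so the reduced co-factor used in the first step is not regenerated by the second, whereas the isomerase route is co-factor neutral. The subsequent ATP-dependent phosphorylation of xylulose is common to both routes and is therefore omitted from the tally.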

While a few xylose-utilizing *Saccharomyces* sp. have been identified (Wenger et al., 2010; Schwartz et al., 2012), in general, yeasts belonging to the *Saccharomyces sensu stricto* group cannot metabolize xylose despite harboring genes encoding putative xylose-utilizing enzymes. To facilitate the incorporation of xylose utilization into CBP, efforts have focussed on introducing xylose metabolic pathways from other species into natural ethanologenic *Saccharomyces* sp. (Karhumaa et al., 2007; Matsushika et al., 2009a,b; Fernandes and Murray, 2010; Bera et al., 2011; Hasunuma et al., 2011; Hector et al., 2011; Usher et al., 2011; Xiong et al., 2011; Cai et al., 2012; Fujitomi et al., 2012; Kim et al., 2012; Tien-Yang et al., 2012; De Figueiredo Vilela et al., 2013; Demeke et al., 2013; Hector et al., 2013; Ismail et al., 2013; Kato et al., 2013; Kim et al., 2013). This topic has been reviewed recently (Matsushika et al., 2009a; Fernandes and Murray, 2010; Cai et al., 2012) and so will not be extensively covered here. Instead, we summarize the various experimental approaches used, the heterologous genes that have been expressed and the ethanol yields generated (**Table 2**). As with the data on ethanol yields from cellulose (**Table 1**), it is difficult to compare results on ethanol yields from xylose, with different studies representing ethanol yields as ethanol (g L<sup>−1</sup>), grams ethanol g<sup>−1</sup> xylose or % yield (**Table 2**). Despite the variety of approaches, the levels of ethanol produced are remarkably consistent, ranging from 4 to 25 g/L, far below that which is achievable by native yeasts using carbohydrate sources such as glucose, maltose, or sucrose. These low yields reflect the poor uptake of xylose by *S. cerevisiae*, the species of choice for most of these studies, and the slow metabolism of the sugar due to redox imbalances in the cell. Below we summarize some of the approaches that have been taken to improve these processes.

Pentose sugars can naturally be transported into *S. cerevisiae* by hijacking various hexose sugar transporters (Hxt7/Hxt5/Hxt4/Hxt2 and Gal2). The major problem encountered here is the greater affinity of these transporters for their natural substrate glucose. Therefore, in media containing both hexose and pentose sugars, such as that generated by pretreatment of lignocellulose biomass, glucose is preferentially transported while xylose uptake is inhibited. Two distinct strategies, the overexpression of genes encoding native transporters from *S. cerevisiae* and the heterologous expression of genes encoding transporters from xylose-utilizing species, have been pursued. The genes encoding hexose transporters Hxt1, Hxt7, Hxt13, and Gal2 have been individually overexpressed in *S. cerevisiae*, with growth rates and xylose consumption varying between individual studies. Xylose transport was increased by the overexpression of HXT1 and GAL2 but not by overexpression of HXT7 (Tanino et al., 2012). These studies revealed a trade-off between transporter specificity and transporter efficiency, with transporters displaying high specificity (low Km) being less efficient at xylose transport (Young et al., 2012).
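The preferential transport of glucose can be pictured using the standard Michaelis-Menten expression for a transporter shared by two competing substrates. The sketch below is an assumption-laden illustration and not a fitted model: the rate law is the textbook form for competitive inhibition, the Km values are hypothetical numbers chosen only to represent a transporter that binds glucose far more tightly than xylose, and the inhibition constant of glucose is taken to equal its own Km.

```python
def uptake_rate(substrate, competitor, vmax, km, ki):
    """Michaelis-Menten uptake in the presence of a competitive inhibitor.

    v = vmax * S / (km * (1 + I / ki) + S)
    """
    return vmax * substrate / (km * (1.0 + competitor / ki) + substrate)


# Hypothetical parameters (concentrations in mM, rates in arbitrary units).
KM_GLUCOSE = 1.5
KM_XYLOSE = 150.0
VMAX = 1.0

if __name__ == "__main__":
    xylose = 100.0
    for glucose in (0.0, 5.0, 50.0):
        # Glucose competes for the same transporter; its Ki is set to its Km.
        v = uptake_rate(xylose, glucose, VMAX, KM_XYLOSE, ki=KM_GLUCOSE)
        print(f"glucose {glucose:5.1f} mM -> relative xylose uptake {v:.3f}")
```

With parameters of this kind, even low glucose concentrations sharply reduce the calculated xylose uptake rate, which is the qualitative behavior described above for mixed-sugar media.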

Based on these findings, the use of specific xylose transporters from xylose-utilizing species would seem like a more logical approach. Jaewoong et al. (2013) expressed genes encoding 6 xylose transporters from *Scheffersomyces stipitis* in *S. cerevisiae*, all of which demonstrated improved growth rates and ethanol production when grown solely on xylose, with the genes XUT7, RGT2, and SUT4 showing the most benefit. A directed mutagenesis of xylose transporters GXS1 and XUT3 from *Candida intermedia* and *S. stipitis*, respectively, led to improved growth rate on xylose (Young et al., 2012). As might be expected, the expression of heterologous xylose-specific transporter genes improves xylose uptake in mixed hexose and pentose medium (Runquist et al., 2010); however, in some strains in which xylose consumption was improved, overall sugar consumption (glucose and xylose) was lower than that achieved by overexpression of native hexose transporters such as HXT7 and GAL2 (Young et al., 2011).

Identifying the best xylose transporters for heterologous expression also poses problems. A comparative study on the heterologous expression of genes encoding different sugar transporters in *S. cerevisiae* demonstrated that of an initial 23 transporters tested, only five conferred the ability to grow on xylose (Young et al., 2011). Likewise, an analysis of 18 genes encoding putative xylose transporters led to the cloning of just two xylose-specific transporters, XYP29 and AN25 from *Pichia stipitis* and *Neurospora crassa*, respectively (Du et al., 2010).

In addition to problems associated with xylose transport, xylose utilization in *S. cerevisiae* is hampered by bottlenecks in the metabolism of xylose through the PPP (Fiaux et al., 2003). Several studies have been conducted to improve the metabolic fluxes through the PPP, including the overexpression of genes encoding key enzymes in the pathway such as transaldolase (TAL1), transketolase (TKL1), ribose-5-phosphate isomerase (RKI1), and ribulose 5-phosphate epimerase (RPE1) (Karhumaa et al., 2005). While overexpression of these genes individually produced mixed outcomes, the overexpression of all 4 PPP enzymes produced a 30-fold increase in growth rate on xylose; however, ethanol yields only increased fractionally (Bera et al., 2011). Although overexpression of TAL1 did not greatly improve ethanol production from synthetic xylose, improved ethanol yields were achieved when detoxified hemicellulosic hydrolysate was used as a sole carbohydrate source (Hasunuma et al., 2014).

Genome-wide expression analysis identified additional host genes required for optimum xylose utilization, including the PPP-associated SOL3 and GND1 as well as non-PPP-associated genes such as GAL1, GAL7, and GAL10 (Wahlbom et al., 2003; Bengtsson et al., 2008). Deletion of YLR042C, MNI1, and RPA49 resulted in improved growth rates on xylose (Bengtsson et al., 2008), while deletion of PHO13, ALP1, ISC1, RPL20B, BUD21, NQM1, and TKL2 led to increased ethanol yields from xylose (Van Vleet et al., 2008; Usher et al., 2011).

It is clear that the PPP is subject to complex regulation involving many gene products. Thus, major alterations in the PPP may be required to optimize xylose utilization by *S. cerevisiae*. While mutational and overexpression approaches have helped to identify key genes regulating the PPP, the industrial relevance of any such modifications must be considered. Given the complexity of the system, an adaptive evolutionary approach involving continued rounds of cell conditioning on xylose might be a more effective way to increase xylose utilization in *S. cerevisiae*.

#### **CONCLUSIONS AND PERSPECTIVES**

The ultimate goal of research into third generation biofuel is to create an organism capable of CBP. For the efficient exploitation of biomass as a source of bioenergy, both the cellulosic and hemicellulosic fractions of biomass must be used. The degradation of cellulose by cellulase-expressing ethanologenic yeast strains is now well established, with the caveat that to date only synthetic forms of cellulose have been tested (**Table 1**). The main stumbling block here is the poor enzyme activity of recombinant cellulases at fermentation conditions (30°C). Until activity can be increased at this low temperature, it is likely that there will be a need for an enzymatic pre-hydrolysis of biomass prior to fermentation, i.e., SHF *in situ* (**Figure 3**). To move the field forward, it is essential that the degradation of more complex biomass by engineered strains be tested, be it straw, grasses, or waste products such as spent grains. Progress on xylose fermentation has also been made; however, the two fields have been developing in parallel rather than merging into one. To date, only cellobiose and xylose co-utilizing strains have been generated (Katahira et al., 2006; Ha et al., 2011; Aeling et al., 2012).

To achieve CBP, many criteria have to be satisfied. The ideal host should be capable of metabolizing both cellulose and xylose, maintain maximum heterologous expression of enzymes at the optimum ratio, be resistant to high ethanol concentrations, and ideally be thermotolerant. Ideally, a single organism should possess all of these attributes; however, the heterologous production of so many enzymes can reduce the fitness of the host cell (Tsai et al., 2010). Co-culturing yeast strains expressing individual cellulases or xylanases offers an easy way to vary the ratio of cellulase and xylanase enzymes by simply altering the number of cells expressing each enzyme in the fermentation. With the proposed SHF *in situ* approach, various scenarios can be envisioned. Cellulose and xylose can be fermented separately from pre-treated biomass (**Figure 3A**). Alternatively, yeast strains co-expressing xylose- and cellulose-utilizing genes together, or co-cultures of strains expressing individual genes, can be cultured on hydrolysates derived from pre-treatment of biomass. The available xylose and glucose in the hydrolysate can be used as energy sources to allow the growth of yeast and thus the production and secretion of cellulases into the medium (**Figure 3B**). The spent medium (supernatant) can then be used for enzymatic pre-hydrolysis of cellulose followed by fermentation of the released glucose. This step-wise process is compatible with current industrial fermentation processes. Ultimately, the most efficient CBP would require a single microorganism capable of fermenting both xylose and cellulose in a single step at 30°C (**Figure 3C**). We eagerly await the development of this super strain!

In conclusion, every component involved in lignocellulosic bioethanol generation, such as the host organism, the hydrolyzing enzymes, and even the biomass substrate, is now the focus of bioengineering research in the pursuit of greater efficiency to reduce production costs. Thus, the future of biofuel utilization is dependent on the economics of its production, which is itself reliant on the science used to generate it.

#### **REFERENCES**


and furfural. *Bioresour. Technol.* 111, 161–166. doi: 10.1016/j.biortech.2012.01.161


and its application to cellulose hydrolysis and ethanol production. *Appl. Environ. Microbiol.* 76, 7514–7520. doi: 10.1128/AEM.01777-10


Zhang, G., Lin, Y., He, P., Li, L., Wang, Q., and Ma, Y. (2014). Characterization of the sugar alcohol-producing yeast *Pichia anomala*. *J. Ind. Microbiol. Biotechnol.* 41, 41–48. doi: 10.1007/s10295-013-1364-5

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 30 January 2014; paper pending published: 13 March 2014; accepted: 31 March 2014; published online: 22 April 2014.*

*Citation: Kricka W, Fitzpatrick J and Bond U (2014) Metabolic engineering of yeasts by heterologous enzyme production for degradation of cellulose and hemicellulose from biomass: a perspective. Front. Microbiol. 5:174. doi: 10.3389/fmicb.2014.00174*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Kricka, Fitzpatrick and Bond. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Comparison of transcriptional profiles of *Clostridium thermocellum* grown on cellobiose and pretreated yellow poplar using RNA-Seq

*Hui Wei<sup>1</sup>\*‡, Yan Fu<sup>2</sup>†‡, Lauren Magnusson<sup>1</sup>, John O. Baker<sup>1</sup>, Pin-Ching Maness<sup>1</sup>, Qi Xu<sup>1</sup>, Shihui Yang<sup>3</sup>, Andrew Bowersox<sup>1,3</sup>, Igor Bogorad<sup>1</sup>, Wei Wang<sup>1</sup>, Melvin P. Tucker<sup>3</sup>, Michael E. Himmel<sup>1</sup> and Shi-You Ding<sup>1</sup>\**

*<sup>1</sup> Biosciences Center, National Renewable Energy Laboratory, Golden, CO, USA*

*<sup>2</sup> Center for Plant Genomics, Iowa State University, Ames, IA, USA*

*<sup>3</sup> National Bioenergy Center, National Renewable Energy Laboratory, Golden, CO, USA*

#### *Edited by:*

*Nigel Peter Minton, University of Nottingham, UK*

#### *Reviewed by:*

*Patricia Coutinho Dos Santos, Wake Forest University, USA Marianne Guiral, Centre National de la Recherche Scientifique, France*

#### *\*Correspondence:*

*Hui Wei and Shi-You Ding, National Renewable Energy Laboratory, Biosciences Center, 15013 Denver W Pkwy, Golden, CO 80401, USA e-mail: hui.wei@nrel.gov; shi.you.ding@nrel.gov*

*†Present address: Yan Fu, Monsanto Company, St. Louis, USA ‡These authors have contributed equally to this work.*

The anaerobic, thermophilic bacterium *Clostridium thermocellum* secretes multi-protein enzyme complexes, termed cellulosomes, which synergistically interact with the microbial cell surface and efficiently disassemble plant cell wall biomass. *C. thermocellum* has also been considered a potential consolidated bioprocessing (CBP) organism due to its ability to produce the biofuel products hydrogen and ethanol. We found that *C. thermocellum* fermentation of pretreated yellow poplar (PYP) produced 30 and 39% of ethanol and hydrogen product concentrations, respectively, compared to fermentation of cellobiose. RNA-Seq was used to analyze the transcriptional profiles of these cells. The PYP-grown cells taken for analysis at the late stationary phase showed 1211 genes up-regulated and 314 down-regulated by more than two-fold compared to the cellobiose-grown cells. These affected genes cover a broad spectrum of specific functional categories. The transcriptional analysis was further validated by sub-proteomics data taken from the literature, as well as by quantitative reverse transcription-PCR (qRT-PCR) analyses of selected genes. Specifically, 47 cellulosomal protein-encoding genes, genes for 4 pairs of SigI-RsgI for polysaccharide sensing, 7 cellodextrin ABC transporter genes, and a set of NAD(P)H hydrogenase and alcohol dehydrogenase genes were up-regulated in cells growing on PYP compared to cellobiose. These genes could be potential candidates for future studies aimed at gaining insight into the regulatory mechanism of this organism as well as for improvement of *C. thermocellum* in its role as a CBP organism.

**Keywords:** *Clostridium thermocellum***, transcriptomics, RNA-Seq, pretreated yellow poplar (PYP), cellobiose, cellulosome, ethanol, hydrogen**

#### **INTRODUCTION**

Microbial conversion of biomass to biofuels is an attractive route for biofuel development, but an essential challenge is to increase the microbial capacity both for overcoming biomass recalcitrance and for converting biomass-derived sugars to biofuels (Himmel et al., 2007; Alper and Stephanopoulos, 2009). *Clostridium thermocellum*, a gram-positive, thermophilic, anaerobic bacterium, is one of the model consolidated bioprocessing (CBP) systems used to study the enzymatic hydrolysis of cellulosic biomass to produce fuels (Islam et al., 2006; Brown et al., 2012; Yang et al., 2012). The following features of *C. thermocellum* contribute to its suitability as a model cellulolytic, biofuel-producing bacterium: (1) It produces cellulosomes, a type of highly organized multi-protein enzyme complex, which have been shown to be highly efficient enzyme systems for deconstructing the plant cell wall, especially for degrading the recalcitrant substrate crystalline cellulose (Bayer et al., 1998, 2004). (2) It carries out mixed-product fermentation, producing ethanol, H2, and numerous organic acids including acetate, formate, and lactate (Demain et al., 2005; Islam et al., 2006). (3) It is suitable for both submerged and solid-state fermentation (Bayer et al., 2007), the latter having similarity to compost in the set-up of feedstock (Wei et al., 2012). (4) Its genome sequence is available (http://genome.jgi-psf.org/cloth/cloth.info.html) and many of the cellulolytic enzymes have been identified and biochemically characterized. The knowledge gained from studies of this species will benefit work on other clostridial species of industrial interest, such as *C. acetobutylicum*, known to produce the potential fuels acetone, ethanol, and butanol (Cooksley et al., 2012).

Fulfilling this potential will require a more in-depth understanding of the metabolic and genetic mechanisms by which *C. thermocellum* utilizes recalcitrant biomass substrate. So far, a number of studies have been carried out on *C. thermocellum*, such as transcriptomic analysis of stress responses to ethanol, furfural, and heat during growth on pure sugars (i.e., cellobiose) (Yang et al., 2012), time course studies of cell growth on pure crystalline cellulose (Avicel) (Raman et al., 2011), and comparisons of growth on pure crystalline cellulose (Sigmacel 50) and cellobiose (Riederer et al., 2011). However, despite reports on the sub-proteomic analyses of cellulosomes from *C. thermocellum* grown on Avicel (Gold and Martin, 2007) and on combinations of Avicel with pectin and/or xylan, or on pretreated switchgrass (Raman et al., 2009), there were no genome-wide transcriptomic studies reported for growth on pretreated woody plant biomass until, most recently, a transcriptomic analysis comparing *C. thermocellum* cells grown on pretreated switchgrass and the woody biomass cottonwood (*Populus trichocarpa* × *Populus deltoides*; black cottonwood × eastern cottonwood) was published on December 2, 2013 (Wilson et al., 2013a), after the submission of this manuscript.

**Abbreviations:** Abf, α-N-arabinofuranosidase; ADH, alcohol dehydrogenase; CBO, cellobiose-grown cells only; Cbp, carbohydrate-binding protein; CBP, consolidated bioprocessing; CDP, cellodextrin phosphorylase; CEP, cellobiose phosphorylases; COG, clusters of orthologous groups; emPAI, exponentially modified protein abundance index; FC, fold changes; Fdred, reduced ferredoxin; GH, glycoside hydrolases; NSAF, normalized spectral abundance factor; PF, passing filter; PTA, phosphotransacetylase; PYP, pretreated yellow poplar; PYPO, PYP-grown cells only; qRT-PCR, quantitative reverse transcription-PCR; RPKM, reads per kilobase of exon model per million mapped reads; RsgI, RsgI-like anti-sigma factors; SigI, sigma factor.

Woody biomass has been found to be more recalcitrant to enzyme digestion than herbaceous biomass. For example, whereas the Recalcitrance Index for switchgrass is ∼0.35, the index for hardwoods, such as yellow poplar, is 0.56, indicating that hardwoods are more difficult to degrade (Wei et al., 2009). Previous studies have shown that cellulolytic bacteria grown on different lignocellulosic substrates have different levels of glycoside hydrolase (GH) activities (Irwin et al., 2003). As such, *C. thermocellum* grown on pretreated woody plant biomass is likely to have distinctive responsive genes, as well as a composition of cellulolytic enzymes different from that of cells grown on herbaceous biomass. Recently, we found that species of the genus *Clostridium* (including *C. thermocellum*) were among the dominant species, comprising 6.3% of the total, in an anaerobic community decomposing yellow poplar wood chips (van der Lelie et al., 2012). In parallel, we studied the yellow poplar compost system (Wei et al., 2012), in which *Clostridium* was also found to be one of the dominant bacteria (data not shown). These results prompted us to explore *C. thermocellum* grown on dilute acid-pretreated yellow poplar (PYP) as a single-species model for plant cell wall degradation. PYP is a widely utilized feedstock in process development for the conversion of biomass to fuels and chemicals, and it is important to identify the specific enzymes that this organism calls into action to attack this form of the feedstock/substrate.

It has previously been demonstrated that 60% conversion of PYP to simple sugars can be achieved with a loading of ∼8.4 mg per g biomass cellulose of a mixture of *Trichoderma reesei* CBHI and *Acidothermus cellulolyticus* E1 (95%:5% on a molar basis) (Vinzant et al., 1994; Baker et al., 1997). In this study, *C. thermocellum* was bench fermented using PYP or cellobiose as the sole carbon source, and transcriptional profiles were analyzed. RNA-Seq is a recently developed technology for transcriptome profiling, which uses next-generation sequencing to reveal the presence and quantity of RNA in biological samples. The goals of this study are two-fold. First, we identified genes that respond to the degradation of recalcitrant biomass substrate by comparing PYP- vs. cellobiose-grown *C. thermocellum* cells. Second, we specifically focused on candidate genes related to the cellulosome, cellodextrin transport, polysaccharide signal transduction, and end-product synthesis. These candidate genes are likely to be valuable for mechanistic studies, as well as for protein engineering to further improve the abilities of this already potent organism to degrade the cell walls of recalcitrant biomass feedstocks.

# **MATERIALS AND METHODS**

**Figure 1A** shows the overall experimental approach we designed to investigate the *C. thermocellum* utilization of pretreated biomass substrates as reflected by changes in the cell's transcriptome. Details of the experimental approach are described in the following sections.

# **CARBON AND LIGNOCELLULOSIC SUBSTRATES**

Cellobiose and other chemical compounds were purchased from Sigma (St. Louis, MO). PYP was prepared as described previously (Tucker et al., 1998). Briefly, the milled yellow poplar biomass (20% solids loading) was pretreated in 0.21% w/w H2SO4 at 200°C for 4 min. The resultant PYP contained ∼65% cellulose, 4% xylan, and 31% lignin (dry weight basis). PYP was exhaustively washed with deionized water until the pH reached that of deionized water, prior to being used in medium preparation as described below.

# **MICROORGANISMS AND CULTURE CONDITIONS**

*C. thermocellum* ATCC 27405 was routinely cultured anaerobically at 60°C in 30-mL serum bottles containing 10 mL of ATCC medium 1191 with cellobiose, and was subcultured with a 2% inoculum taken at exponential growth phase. For this study, similarly to practices in the literature (Gold and Martin, 2007; Rydzak et al., 2009), 100-mL batch cultures in 250-mL serum bottles were used to grow the strain anaerobically at 58°C in ATCC medium 1191 containing 0.30% (w/v) cellobiose as the control carbohydrate substrate, or 0.44% PYP as biomass substrate (sugar content equivalent to 0.30% cellobiose), with agitation at 130 rpm. Cell growth was monitored based on either the measurement of optical density by spectrophotometry at 600 nm (OD600) for cellobiose-grown cultures, or the increase in pellet protein amount for PYP-grown cultures using the method described in the literature (Raman et al., 2009). Three independent sets of PYP- vs. cellobiose-grown cell cultures, harvested at the late stationary phase (20 h for cellobiose-grown cells and 36 h for PYP-grown cells), were carried through the downstream RNA extraction preparations. Two sets of RNA preparations were used for cDNA library construction and RNA-Seq, whereas all three sets of RNA preparations were used for qRT-PCR analysis to verify the RNA-Seq data for selected genes.

### **MEASUREMENT OF H<sub>2</sub> AND ETHANOL, AND ISOLATION OF BACTERIAL TOTAL RNA**

Cells were harvested for RNA isolation in late stationary phase after the hydrogen concentration in the headspace had been analyzed by GC, using an Agilent 7890A GC equipped with a Supelco 60/80 molsieve 5A column for the measurement of H2. The cell culture was centrifuged at high speed (8000 × g) for 5 min to pellet the bacterial cells. Cell-free culture supernatants were used to measure ethanol concentration by HPLC (Agilent) with a refractive index detector (Shimadzu, Kyoto, Japan). All samples were filtered through a 0.45-μm filter before HPLC analysis. The organic acids were separated on an Aminex HPX-87H column (Bio-Rad) running at a flow rate of 0.6 mL/min at 50°C, with 4 mM H2SO4 as eluent.

**FIGURE 1 | Flowchart for experimental procedures, along with growth curves for cellobiose- and PYP-grown** *C. thermocellum* **cells. (A)** Flowchart outlining the experimental procedures; RPKM, reads per kilobase of exon model per million mapped reads. **(B)** Growth of cellobiose-grown cells, measured by spectrophotometry at 600 nm (OD600). **(C)** Cell pellet protein concentration for pretreated yellow poplar (PYP)-grown cell culture. The doubling time for cellobiose- and PYP-grown cells was 3.3 and 7.8 h, respectively. Data are presented as the mean (±SE) of 3 replicates. Arrows indicate the time points sampled for transcriptomic analyses.

The cell pellets collected from 10 mL of culture were used for RNA extraction with a combined protocol (Yang et al., 2012), in which the aqueous phase of the TRIzol (Invitrogen, Carlsbad, CA) extraction was mixed with an equal volume of 70% ethanol and then applied to the column of the Qiagen RNeasy Mini Kit (Qiagen, Valencia, CA) for purification according to the manufacturer's instructions to obtain total RNA. Total RNA (7.5 μg) was mixed with 3 μL DNase I (Invitrogen; 1 U/μL) and 3 μL 10x DNase I reaction buffer, with RNase-free water added to a total volume of 30 μL, and incubated at room temperature for 15 min. The DNase I was inactivated by adding 3 μL of 25 mM EDTA and heating for 10 min at 65°C. The DNase I-treated total RNA was precipitated by the ethanol-glycogen method and redissolved in 15 μL of 1 mM EDTA. Since total RNA contains 75% rRNA (Chen and Duan, 2011), mRNA was enriched by using the MICROBExpress Bacterial mRNA Enrichment Kit (cat # AM1905, Ambion, Life Technologies, Austin, TX) to remove 16S and 23S rRNAs. The resultant RNAs were quantified and analyzed for integrity on the Agilent 2100 Bioanalyzer, and used for cDNA library construction as described below.

#### **cDNA LIBRARY CONSTRUCTION, SEQUENCING, AND READ MAPPING**

cDNA libraries were constructed with the RNA-Seq sample preparation kit (RS-100-0801, Illumina, San Diego, CA), using the procedure provided by the manufacturer. Briefly, the enriched mRNA samples described above were fragmented, annealed to random hexamers, and reverse transcribed. After second strand synthesis, end repair, and A-tailing, cDNA fragment ends were ligated to adapters that were complementary to the sequencing primers. The resultant cDNA libraries were size separated on agarose gels, and ∼200 bp fragments were excised and amplified by 15 cycles of PCR. The prepared cDNA libraries were sequenced on an Illumina Genome Analyzer II by the Michigan State University DNA facility, using the standard protocols and running for 75 cycles of data acquisition. Solexa reads were aligned to the reference genome sequence of *C. thermocellum* strain ATCC 27405 deposited at NCBI (Accession: NC\_009012.1) using Bowtie (Langmead et al., 2009). The reference genome is 3.8 Mb long with 3305 predicted genes, including 71 structural RNA genes, according to the NCBI website for the genome (http://www.ncbi.nlm.nih.gov/genome/?term=Clostridium+thermocellum) accessed on April 28, 2010. The best end-to-end alignment (ties broken by read quality) was used, with no more than two mismatches. Reads that could not be unambiguously mapped were excluded from further analysis. The above alignment approach and criteria were similar to those used in the literature (Jia et al., 2009; Lin et al., 2011).

#### **GENE EXPRESSION NORMALIZATION AND IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES**

After read mapping, the midpoint of the read-reference alignment was used to determine which gene each read belonged to (or was derived from), and the RNA-Seq read count for each gene was then calculated. To facilitate comparison of gene expression levels, both within and between the two samples, we quantified and normalized gene expression levels by calculating the reads per kilobase of exon model per million mapped reads (RPKM), i.e., the number of reads mapped to a gene divided by the length of its exon model in kilobases and by the total number of mapped reads in the whole dataset in millions (Mortazavi et al., 2008). As an indicator of gene expression level, the RPKM is widely recognized as a measure of Solexa read density that reflects the molar concentration of a transcript in the starting RNA sample, thus making the normalized gene expression levels comparable within and among samples.
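The two steps described above, midpoint-based assignment of reads to genes and RPKM normalization, can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names, the gene coordinates, and the assumption that genes are sorted and non-overlapping are all hypothetical.

```python
from bisect import bisect_right


def assign_read_to_gene(read_start, read_end, genes):
    """Return the ID of the gene containing the read's alignment midpoint.

    `genes` is a list of (start, end, gene_id) tuples sorted by start and
    assumed not to overlap (a simplifying assumption for this sketch).
    """
    midpoint = (read_start + read_end) / 2
    starts = [start for start, _, _ in genes]
    i = bisect_right(starts, midpoint) - 1  # last gene starting at or before the midpoint
    if i >= 0 and genes[i][0] <= midpoint <= genes[i][1]:
        return genes[i][2]
    return None  # midpoint falls in an intergenic region


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene model per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene coordinates and read counts, for illustration only.
genes = [(100, 500, "geneA"), (600, 900, "geneB")]
print(assign_read_to_gene(450, 524, genes))
print(rpkm(read_count=230, gene_length_bp=1000, total_mapped_reads=2_300_000))
```

Multiplying by 10<sup>9</sup> in a single step is equivalent to dividing by the length in kilobases and by the library size in millions.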

The fold change for the expression of an individual gene was calculated by taking the ratio of RPKM in PYP-grown cells to that in cellobiose-grown cells (RPKM<sub>PYP</sub>/RPKM<sub>cellobiose</sub>). The cutoff value for defining a gene as "differentially expressed" is either a two-fold increase (with the value of RPKM<sub>PYP</sub>/RPKM<sub>cellobiose</sub> larger than 2.0), or a two-fold decrease (with the value of RPKM<sub>PYP</sub>/RPKM<sub>cellobiose</sub> less than 0.5).
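The cutoff rule can be expressed as a small classification function. The sketch below is illustrative only: the function name and the example values are hypothetical, the inclusive bounds are an assumption, and the handling of genes detected in only one condition (labeled PYPO and CBO, following the abbreviations used in this paper) is one plausible way to avoid division by zero.

```python
def classify_gene(rpkm_pyp, rpkm_cellobiose, up_cutoff=2.0, down_cutoff=0.5):
    """Classify a gene by its RPKM ratio (PYP-grown / cellobiose-grown cells)."""
    if rpkm_pyp == 0 and rpkm_cellobiose == 0:
        return "not detected"
    if rpkm_cellobiose == 0:
        return "PYPO"  # detected in PYP-grown cells only
    if rpkm_pyp == 0:
        return "CBO"  # detected in cellobiose-grown cells only
    fold_change = rpkm_pyp / rpkm_cellobiose
    if fold_change >= up_cutoff:
        return "up"
    if fold_change <= down_cutoff:
        return "down"
    return "unchanged"


# Hypothetical RPKM pairs (PYP, cellobiose), for illustration only.
for pair in [(250.0, 100.0), (40.0, 100.0), (120.0, 100.0), (5.0, 0.0)]:
    print(pair, classify_gene(*pair))
```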

#### **STATISTICAL ANALYSIS**

The log2-transformed raw RPKM dataset was imported into JMP Genomics 6.0 (SAS Institute, NC), and the data for the PYP- and cellobiose-grown cell samples were normalized together using the LOWESS normalization algorithm within JMP Genomics. To determine whether the differences in expression levels between PYP- and cellobiose-grown cell samples were statistically significant, the normalized log2(RPKM) data were subjected to one-way analysis of variance (ANOVA) as described in the literature (Yang et al., 2012, 2013; Wilson et al., 2013b). The False Discovery Rate (FDR) testing method was used, with *p* < 0.05 considered statistically significant.
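For readers unfamiliar with FDR control, the sketch below shows the Benjamini-Hochberg adjustment, one widely used FDR procedure. It is an assumption that this particular procedure corresponds to the FDR method applied within JMP Genomics; the text does not specify it. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from an ANOVA, for illustration only.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [a < 0.05 for a in adj]
print(adj, significant)
```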

#### **BIOLOGICAL INTERPRETATION OF DIFFERENTIALLY EXPRESSED GENES**

The identified, differentially expressed genes were interpreted and discussed in the context of biological processes and functions, using clusters of orthologous groups (COG) and Carbohydrate-Active enZYmes (CAZy) analyses of proteins (www.cazy.org/).

#### **QUANTITATIVE REVERSE TRANSCRIPTION-PCR (qRT-PCR)**

Based on their potential functional importance, 12 genes were selected for validating the results of the RNA-Seq analysis. The primers for these genes were either based on the literature or designed with Primer Express 3.0 software (Applied Biosystems), and are described along with the PCR results in the Results and Discussion section. Total RNA was extracted from three sets of independent cultures grown on PYP vs. cellobiose as described above, and then converted to cDNA by random priming, using the SuperScript II kit (Invitrogen, San Diego, CA). PCR reactions were run in triplicate using the procedure previously described (Wei et al., 2012). The transcription level of genes was determined according to the 2<sup>−ΔΔCT</sup> method, using RecA as a reference gene for the normalization of gene expression levels (Stevenson and Weimer, 2005).
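The 2<sup>−ΔΔCT</sup> calculation can be written out as a short function. The sketch below assumes the standard form of the method (100% amplification efficiency, one reference gene); the function name and the threshold-cycle values are hypothetical and are not data from this study.

```python
def relative_expression(ct_target_test, ct_ref_test, ct_target_control, ct_ref_control):
    """Relative expression of a target gene by the 2^-ddCT method.

    'test' is the condition of interest (e.g., PYP-grown cells), 'control' the
    baseline (e.g., cellobiose-grown cells), and 'ref' the reference gene.
    """
    delta_ct_test = ct_target_test - ct_ref_test
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_test - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical CT values: the target crosses threshold 3 cycles earlier in the
# test condition, relative to the reference gene, than in the control.
print(relative_expression(22.0, 18.0, 25.0, 18.0))
```

A value above 1 indicates higher transcript abundance in the test condition, and a value below 1 indicates lower abundance.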

#### **CORRELATION ANALYSIS FOR CELLULOSOME-RELATED GENE EXPRESSION AND PROTEIN ABUNDANCE**

Two sets of quantitative cellulosomal protein data for cellobiose-grown *C. thermocellum* cells at late stationary phase were retrieved from the literature (Gold and Martin, 2007; Raman et al., 2009), and plotted against our sub-dataset of log2(RPKM) values of cellulosomal genes in cells grown on the same substrate at the same culture phase in this study. Pearson correlation coefficient values were calculated using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA), and used as an indicator of the degree of correlation for the compared pairs.
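The Pearson coefficient used for this comparison can be computed directly from its definition. The sketch below is an illustrative stand-in for the spreadsheet calculation; the function name and the paired values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log2(RPKM) values and protein abundance scores, for illustration only.
log2_rpkm = [6.1, 7.4, 8.0, 9.3, 10.2]
protein_abundance = [0.8, 1.1, 1.9, 2.4, 3.0]
print(round(pearson_r(log2_rpkm, protein_abundance), 3))
```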

# **COMPOSITIONAL ANALYSIS OF PRETREATED YELLOW POPLAR RESIDUES**

The PYP residues from PYP-grown *C. thermocellum* culture were collected by centrifugation at low speed (100 × g) for 2 min to sediment the insoluble substrate not consumed by the bacterial cells in the culture. This centrifugation speed has been used in the literature to remove insoluble substrate from *C. thermocellum* cultures (Dror et al., 2005). Compositional analysis of the collected PYP residue was performed by the National Bioenergy Center, National Renewable Energy Laboratory, using the method described in the literature (Templeton et al., 2010).

# **RESULTS AND DISCUSSION**

### *C. THERMOCELLUM* **GROWTH AND THE PRODUCTION OF HYDROGEN AND ETHANOL**

The first step of this study, as illustrated in **Figures 1A–C**, was the growth of *C. thermocellum* ATCC 27405 on two types of carbohydrate substrate, PYP being compared with cellobiose. GC analysis of the gas composition in the headspace of batch cultures at the late stationary phase revealed a production yield of 1.22 vs. 0.92 mole H2/mole glucose consumed in cellobiose- and PYP-grown *C. thermocellum*, respectively (**Table 1**). Furthermore, HPLC analysis of metabolite production in supernatants of harvested cell cultures revealed a production yield of 0.51 vs. 0.30 mole ethanol/mole glucose equivalent consumed in cellobiose- and PYP-grown *C. thermocellum*, respectively. Overall, the yields of hydrogen and ethanol from PYP-grown *C. thermocellum* were 75 and 58% of those from cellobiose-grown *C. thermocellum*, respectively (**Table 1**). Wet chemistry analysis of spent PYP solids revealed that 52% of the glucan in the starting PYP substrate was consumed during the fermentations, which can partially explain the lower absolute H2 and ethanol production compared to the culture on cellobiose substrate (in which nearly all cellobiose was depleted at late stationary phase).

### **RNA-Seq RESULTS**

The quality of the RNA-Seq cDNA libraries was assessed on a Bioanalyzer prior to sequencing on the Genome Analyzer II (**Figure 2**). The results showed that the size distribution of the cDNA libraries was between 200 and 300 bp, which met the requirement for Solexa sequencing.

For the two PYP-grown cell cDNA library samples, Solexa sequencing of the first cDNA library generated 20.8 million total raw reads, with 14.5 million passing filter reads (70% PF reads); sequencing of the second cDNA library generated 19.6 million total raw reads, with 14.1 million passing filter reads (72% PF reads). Out of these PYP-grown cell-derived reads, 2.3 and 2.1 million reads, respectively, were unambiguously mapped to the *C. thermocellum* genome with the criteria set in the Materials and Methods section. For the two cellobiose-grown cell cDNA library samples, Solexa sequencing of the first cDNA library generated 21.1 million total raw reads, with 15.7 million passing filter reads (74% PF reads); sequencing of the second cDNA library generated 19.7 million total raw reads, with 14.0 million passing filter reads (71% PF reads). Out of these cellobiose-grown cell-derived reads, 2.6 and 2.2 million reads, respectively, were unambiguously mapped to the *C. thermocellum* genome. The RPKM value for each gene in each condition was calculated as described in the Materials and Methods section:

$$\mathrm{RPKM} = \frac{\text{number of reads mapped to gene}}{\text{length of the gene (kb)} \times \text{total number of mapped reads (millions)}}$$

#### **Table 1 | Hydrogen, ethanol and acetate production at the late stationary phase of** *C. thermocellum* **culture.**

*The time point for measurements was 20 and 36 h for cellobiose- and PYP-grown cells, respectively. Values for the yields of hydrogen, ethanol and acetate are means* ± *s.e.m. of three independent experiments. PYP, acid-pretreated yellow poplar.*

Overall, sequence analysis successfully aligned the Solexa reads to 3081 protein-coding genes and 61 structural RNA genes in cellobiose- and/or PYP-grown *C. thermocellum* mRNA samples, accounting for 95% [i.e., (3081 + 61)/3305] of all *C. thermocellum* genes in the genome, indicating that RNA-Seq analysis in this study achieved comprehensive coverage of the *C. thermocellum* transcriptome. In addition, the above result is consistent with the reports that in the genomes of other microorganisms such as *Saccharomyces cerevisiae* and *S. pombe*, more than 90% of the genes are transcriptionally active and expressed (Nagalakshmi et al., 2008; Wilhelm et al., 2008; Wang et al., 2009). The detailed lists of genes for the RNA-Seq sequences identified in the cellobiose- and PYP-grown cells are presented in **Supplementary Data Sheet 1**. Each of the RPKM values of cellobiose-grown cells and RPKM of PYP-grown cells was the average of two replicates.

Our data from Solexa-read transcriptome measurement of PYP- vs. cellobiose-grown *C. thermocellum* illustrate some key characteristics of the results. The RPKM values obtained for most of the transcripts of the active protein-coding genes in PYP- and cellobiose-grown *C. thermocellum* are in the range of 1–50,000 (as can be seen by re-sorting **Supplementary Data Sheet 1**). This range of RPKM values is comparable to the range reported in RNA-Seq whole-transcriptome analyses of samples from other organisms (Tang et al., 2009).

The gene expression levels can be classified into four levels: low, moderate, high, and very high, as illustrated on the X axis in **Figure 3**. While the RNA-Seq data sets of PYP- vs. cellobiose-grown cells had comparable numbers of genes that fall in the moderate expression level (i.e., both 40% of the active protein-encoding genes), the PYP-grown cells had more highly expressed genes (41 vs. 33% in the cellobiose-grown culture; **Figure 3**). In contrast, cellobiose-grown cells had more lowly expressed genes (26 vs. 18% in the PYP-grown culture).

**FIGURE 3 | RPKM frequency histogram of transcripts from RNA-Seq of pretreated yellow poplar (PYP)- vs. cellobiose-grown** *C. thermocellum* **cells.** The diagram shows the distribution of the number of genes expressed at different RPKM levels. The total number of protein coding genes aligned with the RNA-Seq was 3081; the percentage value above each bar indicates the genes at specific expression level accounting for the proportion of total number of genes. The ∗ mark indicates that significantly different frequencies (i.e., numbers of genes) were observed between the two RNA-Seq data sets from PYP- vs. cellobiose-grown *C. thermocellum* cells.
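A frequency histogram of this kind can be sketched as a simple binning step. The bin edges below are hypothetical placeholders, since the actual boundaries are those drawn on the X axis of Figure 3 and are not restated in the text; the function names are likewise illustrative.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical RPKM bin edges; substitute the boundaries used in Figure 3.
EDGES = [10, 100, 1000]
LABELS = ["low", "moderate", "high", "very high"]


def expression_level(rpkm_value, edges=EDGES, labels=LABELS):
    """Map one RPKM value to its expression-level label."""
    return labels[bisect_right(edges, rpkm_value)]


def level_percentages(rpkm_values):
    """Percentage of genes falling into each expression level."""
    counts = Counter(expression_level(v) for v in rpkm_values)
    total = len(rpkm_values)
    return {label: 100.0 * counts[label] / total for label in LABELS}
```

Applying `level_percentages` separately to the PYP and cellobiose RPKM columns would give the two sets of bar heights that the figure compares.
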

#### **CHANGES IN GLOBAL GENE EXPRESSION AND CLUSTERS OF ORTHOLOGOUS GROUPS (COG) ANALYSIS**

To compare gene expression between PYP-grown cells and control cellobiose-grown cells, fold changes (FC) were computed as the ratio of the RPKM value of each gene in PYP-grown cells to its RPKM value in cellobiose-grown cells. Analysis of changes in global gene expression identified 1211 genes that either showed a two-fold or greater increase in expression (i.e., *FC* ≥ 2.0) or were detected in PYP-grown cells only (referred to as PYPO genes), and 314 genes that either showed a two-fold or greater decrease in transcript abundance (i.e., *FC* ≤ 0.5) or were detected in cellobiose-grown cells only (referred to as CBO genes). These genes are listed in **Supplementary Data Sheet 2**, together with the statistical analyses showing that the changes were statistically significant.
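The fold-change rule described above can be written as a small classifier. This is a minimal sketch under the thresholds stated in the text (FC ≥ 2.0 and FC ≤ 0.5); the function name and return format are assumptions, and the statistical significance test reported in Supplementary Data Sheet 2 is a separate step that is not reproduced here.

```python
def classify_gene(rpkm_pyp, rpkm_cellobiose, up=2.0, down=0.5):
    """Return (label, fold_change) for one gene.

    FC is RPKM in PYP-grown cells divided by RPKM in cellobiose-grown cells.
    Genes detected in only one condition have no finite FC and are labeled
    PYPO (PYP only) or CBO (cellobiose only).
    """
    if rpkm_pyp == 0 and rpkm_cellobiose == 0:
        return "not detected", None
    if rpkm_cellobiose == 0:
        return "PYPO", None
    if rpkm_pyp == 0:
        return "CBO", None
    fc = rpkm_pyp / rpkm_cellobiose
    if fc >= up:
        return "up", fc
    if fc <= down:
        return "down", fc
    return "unchanged", fc


# RecA RPKM values quoted in the text: 1166 (PYP) vs. 1237 (cellobiose).
print(classify_gene(1166, 1237))
# Cthe_0438 was detected only in cellobiose-grown cells, with RPKM 4.
print(classify_gene(0, 4))
```
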

The COG distribution of these up- and down-regulated genes was determined, and the result is shown in **Figure 4**. For both up- and down-regulated genes, the top two categories were "general function prediction, [R]" (equivalent to unclassified) and "inorganic ion transport and metabolism, [P]." A closer examination of the distribution chart reveals that, for the up-regulated genes, the categories "signal transduction mechanisms, [T]," "amino acid transport and metabolism, [E]," and "energy production and conversion, [C]" (which is important for the energy-consuming process of biofuel end-product production) were also among the best-represented categories in the transcriptome, with more than 200 induced genes in each category (**Figure 4**).

An important COG category for the degradation and utilization of lignocellulosic substrate is "carbohydrate transport and metabolism, [G]," which includes primarily cellulosome-related genes. In this category, the numbers of up- and down-regulated genes are 175 and 46, respectively (**Figure 4**). These differentially expressed genes are described in detail in later sections.

#### **VALIDATION OF RNA-Seq RESULTS WITH QUANTITATIVE REVERSE TRANSCRIPTION-PCR**

Quantitative reverse transcription-PCR (qRT-PCR) is a well-accepted method for verifying microarray (Ferreira et al., 2010; Yang et al., 2012) and RNA-Seq data (Cusick et al., 2012; Huang et al., 2012; Ji et al., 2013). We used this method to validate the expression patterns of 12 selected genes in independent biological replicates. The selected genes mainly represented functional categories involved in cellulosome function and in hydrogen and ethanol production, and covered a range of FC values based on RNA-Seq. In addition to these 12 genes, RecA was chosen as the reference gene for normalizing the real-time RT-PCR data because it has been commonly used as a reference gene for *C. thermocellum* (Stevenson and Weimer, 2005). As further support for this choice, RecA expression was similar in our cell samples grown on the two substrates (RPKM 1166 for PYP-grown cells vs. RPKM 1237 for cellobiose-grown cells). The primers for all 13 genes, along with gene-by-gene comparisons of the fold changes in expression measured by RNA-Seq and qRT-PCR, are listed in **Supplementary Data Sheet 3, section 1**. The qRT-PCR data largely matched the RNA-Seq-based FC values, with a correlation coefficient of 0.95 for the set of 12 selected genes, which indicates that our RNA-Seq results are accurate and that the conclusions drawn from them should be reliable.
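As an illustration of reference-gene normalization, the sketch below uses the widely used 2^(−ΔΔCt) calculation. This is an assumption made for the example: the text does not state which quantification formula was applied, and the Ct values shown are hypothetical. Only the RecA RPKM values come from the text.

```python
def relative_expression(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Fold change by the 2^(-ddCt) method, assuming ~100% primer efficiency.

    'test' is the PYP-grown sample and 'ctrl' the cellobiose-grown sample;
    the reference gene would be RecA.
    """
    delta_test = ct_target_test - ct_ref_test
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    return 2.0 ** -(delta_test - delta_ctrl)


# Reference-gene stability check using the RecA RPKM values quoted in the text.
reca_fc = 1166 / 1237
print(f"RecA FC (PYP/cellobiose): {reca_fc:.2f}")

# Hypothetical Ct values: the target is 2 cycles closer to RecA in the PYP sample.
print(relative_expression(20.0, 18.0, 22.0, 18.0))
```

A RecA fold change close to 1 is what makes it usable as the denominator in both conditions.
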

#### **TRANSCRIPTIONAL CHANGES OF CELLULOSOMAL COMPONENTS**

The cellulosome is an extracellular supramolecular machine that can efficiently degrade crystalline cellulosic substrates and associated plant cell wall materials. We were especially interested in how the genes encoding cellulosomal component proteins respond to cellulosic substrates, because this can lead to the identification of regulatory or rate-limiting components of cellulolysis.

The number of reported cellulosomal genes in the genome of ATCC 27405 strain has been increasing from the initial numbers of 71 (Zverlov et al., 2005) and 72 (Zverlov and Schwarz, 2006), to more recently, 81 genes (Raman et al., 2011). Note that the latter number is consistent with the updated genome annotation. Of these 81 genes, all were detected transcriptionally in this study. The list of 81 cellulosomal genes with their FC values is described in **Table 2**.

Analysis of this sub-transcriptome showed the following features. First, the cellulosome-associated genes overall were significantly up-regulated in PYP- vs. cellobiose-grown cells: out of the 81 cellulosomal genes, 47 (i.e., 58%) were up-regulated with *FC* ≥ 2.0. In contrast, only 4 out of 81 (i.e., about 5%) cellulosomal genes, namely a GH5 gene (Cthe\_2193), CelU (Cthe\_2360), XghA (Cthe\_1398), and Cthe\_0438, were down-regulated by two-fold or more, or detected in cellobiose-grown cells only. The overall average FC value for all 81 cellulosomal genes is 3.5 (see the bottom row in **Table 2**), suggesting that the whole cellulosome machinery was "geared up" at the late stationary phase on PYP substrate.
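The proportions quoted for the cellulosomal sub-transcriptome follow directly from the gene counts. The short check below uses only the counts given in the text; the variable names are illustrative, and the "other" group is simply the remainder, covering genes that did not meet either criterion.

```python
def percent(part, total):
    """Percentage of `total` represented by `part`."""
    return 100.0 * part / total


TOTAL_CELLULOSOMAL = 81
UP_REGULATED = 47    # FC >= 2.0
DOWN_REGULATED = 4   # FC <= 0.5, or detected in cellobiose-grown cells only

up_pct = percent(UP_REGULATED, TOTAL_CELLULOSOMAL)
down_pct = percent(DOWN_REGULATED, TOTAL_CELLULOSOMAL)
other = TOTAL_CELLULOSOMAL - UP_REGULATED - DOWN_REGULATED

print(f"up-regulated:   {up_pct:.0f}%")
print(f"down-regulated: {down_pct:.0f}%")
print(f"remaining genes: {other}")
```
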

Secondly, the primary scaffoldin, CipA, and the main secondary scaffoldins, OlpB, Orf2P, and SdbA, were significantly up-regulated and were among the most abundant at the transcriptional level. This implies that the cellulosomal system is crucial for the efficient degradation of PYP. Our data further support the notion that the *C. thermocellum* cellulosome is the main contributor to the extremely high activity observed in cellulose degradation. Additionally, one putative scaffoldin gene, Cthe\_0736 (cellulosome anchoring protein), was up-regulated 2.8-fold, i.e., FC 2.8 (**Table 2**, row 31). OlpC, which was recently identified as an important outer layer protein (cellulosome anchoring protein, cohesin subunit; Pinheiro et al., 2009), was also found to be up-regulated in this study (Cthe\_0452, FC 4.1; **Table 2**, row 19).

Thirdly, the major cellulosomal cellulases, such as CelS (Cthe\_2089, exo-; FC 4.5; RPKM 3848, the most abundant cellulosomal transcript in PYP-grown cells), CelA (Cthe\_0269, endo-; FC 1.7; RPKM 1885, the third most abundant transcript in PYP-grown cells), CbhA (Cthe\_0413, exo/endo; FC 7.5; RPKM 367), and Cel124A (Cthe\_0435, endo-; FC 2.0; RPKM 401), were up-regulated to varying degrees, as shown in **Table 2**. Exo- and endo- refer to the substrate site upon which the GHs act: exocellulases remove one or more sugar units from the ends of a polysaccharide chain, whereas endoglucanases randomly hydrolyze the internal glycosidic bonds of polysaccharides. Among the enzymes listed above, CelS was reported to display high synergy with the endo-acting Cel124A (Brás et al., 2011). Our observation that CelS had the highest RPKM value among cellulosomal genes in PYP-grown cells is consistent with a report that a knockout mutant of this gene showed a ∼60% reduction in cellulolytic performance (Olson et al., 2010), and with a more recent finding that the CelS gene was highly expressed in *C. thermocellum* cells grown on both pretreated switchgrass and cottonwood substrates (Wilson et al., 2013a). This study extends such transcriptional analysis by showing that the CelS transcript is not only highly abundant but also significantly responsive to biomass substrate, as its RPKM level in cellobiose-grown cells was 4.5 times lower (**Table 2**).

In addition, there was one cellulosome dockerin type I gene (Cthe\_0438) for which the transcript was detected only in cellobiose-grown cells, with a low RPKM value of 4 (**Table 2**, row 81). This observation is consistent with a previous report regarding its absence in the sub-proteome of cellulosomes from cells grown on pretreated switchgrass (Raman et al., 2009). The domain architecture of Cthe\_0438 is "DUF843-type I dockerin" (DUF indicates domain of unknown function). It is difficult to classify the association of this gene with any of the specific catalytic enzyme types, cellulases or hemicellulases, and thus this issue may warrant further studies.



#### **Table 2 | Continued**


*The table is sorted in order of FC values. The glycoside hydrolase (GH) and carbohydrate-binding module (CBM) families were determined according to the annotations in GenBank and the CAZy database (www.CAZY.org). Other abbreviations: Cbh, cellobiohydrolase; CBO, transcript detected only in cellobiose-grown cells; CE, carbohydrate esterase; Cel, cellulase; Chi, chitinase; CipA, cellulose-integrating protein A; Lic, lichenase; Man, mannanase; n/a, not applicable; OlpB, outer layer protein; ORF2p, open reading frame 2p; PL, polysaccharide lyase; Prot, protein; SdbA, scaffoldin-dockerin binding component A; XghA, xyloglucanase Xgh74A; Xyn, xylanase.*

#### **EXPLORING THE CORRELATION BETWEEN RNA-Seq AND PUBLISHED PROTEOMIC DATA FOR CELLULOSOMAL GENES**

Published reports have described quantitative sub-proteomic data for cellulosomes extracted from cell-free culture filtrates of *C. thermocellum* grown to late stationary phase on cellobiose, Avicel, and other cellulosic substrates (Gold and Martin, 2007; Raman et al., 2009). In one study (Gold and Martin, 2007), the exponentially modified protein abundance index (emPAI) showed a linear relationship with protein concentration and was normalized to the value obtained for CipA. Similarly, in another study (Raman et al., 2009), the normalized spectral abundance factor (NSAF), which represents the number of spectral counts divided by the number of amino acid residues in the protein, was also normalized to the value obtained for CipA. To determine whether a correlation exists between these published cellulosomal sub-proteomic data and the RNA-Seq RPKM data of the present study, we first retrieved the emPAI/CipA and NSAF/CipA data from the literature (**Supplementary Data Sheet 3, section 2**), and then plotted them against the Log2(RPKM) data for the genes encoding the same proteins in this study. Although the plotted data were obtained by three different research groups, the RNA-Seq RPKM values correlated strongly with the protein abundance indicators emPAI/CipA and NSAF/CipA, with Pearson correlation coefficients of 0.68 and 0.76, respectively (**Figures 5A,B**).
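The comparison described above (protein abundance normalized to CipA, plotted against Log2(RPKM), then summarized by a Pearson coefficient) can be sketched as follows. The function names and dictionary layout are assumptions made for illustration, and no values from Supplementary Data Sheet 3 are reproduced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def normalize_to_reference(abundance, reference="CipA"):
    """Divide every abundance index (emPAI or NSAF) by the CipA value."""
    ref = abundance[reference]
    return {protein: value / ref for protein, value in abundance.items()}


def rpkm_vs_protein(rpkm_by_gene, abundance_by_gene):
    """Correlate log2(RPKM) with protein abundance over the shared genes."""
    shared = sorted(set(rpkm_by_gene) & set(abundance_by_gene))
    xs = [math.log2(rpkm_by_gene[g]) for g in shared]
    ys = [abundance_by_gene[g] for g in shared]
    return pearson(xs, ys)
```

Restricting the calculation to genes present in both data sets matters here, because the proteomic studies report only the proteins detected in purified cellulosomes.
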

#### **NON-CELLULOSOMAL GLYCOSIDE HYDROLASE PROTEINS AND "FREE" CELLULASES**

In addition to the GHs in cellulosomes, the *C. thermocellum* genome includes non-cellulosomal genes coding for GHs, 12 of which were up-regulated more than two-fold (**Table 3**). Notably, both CelC (Cthe\_2807) and Lic16A (Cthe\_2809), two members of the putative CelC-GlyR3-Lic16A operon (Newcomb et al., 2011), were up-regulated and are predicted to be extracellular and/or associated with the bacterial cell wall (**Table 3**). This is consistent with a recent report that genes in this operon were mainly expressed in the stationary phase (and very little during exponential phase) of pure cellulose (Avicel) fermentation (Raman et al., 2011). In addition, Lic16A contains CBM54 and a tandem of four CBM4s, both of which are able to bind xylan as well as cellulose (Dvortsov et al., 2010). Remarkably, the grouping of four CBM4s in this protein has an ∼100-fold higher binding constant for xylan and cellulose than a protein with a single CBM4 module (Dvortsov et al., 2012).

It is surprising that Cel9I (Cthe\_0040), an important noncellulosomal processive endoglucanase that could digest crystalline cellulose with high efficiency, showed low abundance and FC value (RPKM 62 in PYP-grown cells, FC 1.3; see **Supplementary Data Sheet 1**, row no. 1875). Similarly, the only non-cellulosomal exo-cellulase CelY (Cthe\_0071, exo-) also showed both low abundance and FC value (RPKM 61, FC 1.4, see **Supplementary Data Sheet 1**, row no. 1741), which is consistent with a recent report that knocking out the CelY gene had no significant impact on the cellulolytic capacity of the strain (Olson et al., 2010). Based on the evidence above, it would seem that the free-enzyme system plays a less important role than does the cellulosome system in the degradation of PYP by *C. thermocellum*.

#### **RESPONSIVE SIGMA FACTORS COUPLED WITH MEMBRANE-ASSOCIATED ANTI-SIGMA FACTORS AS A MECHANISM FOR BIOSENSING BIOMASS SUBSTRATES**

Because COG analysis revealed that "signal transduction mechanisms, [T]" was among the most represented categories in the transcriptome of PYP-grown cells, we examined the genes related to signal transmission attributed to sensing cellulose and other polysaccharide substrates. Recently, a possible role was proposed for seven membrane-associated RsgI-like anti-sigma factors (referred to as RsgI) and their sigma factor partners (referred to as SigI) in extracellular carbohydrate sensing and glycosidase gene regulation in *C. thermocellum* (Kahel-Raifer et al., 2010; Nataf et al., 2010). In the proposed model, RsgI senses the presence of cellulose and other biomass components in the extracellular medium via its CBM domains, whereas SigI mediates the intracellular activation of different glycosidase genes (Kahel-Raifer et al., 2010; Nataf et al., 2010; Bahari et al., 2011). These predicted gene pairs include: σI1-RsgI1 (Cthe\_0058-Cthe\_0059), σI2-RsgI2 (Cthe\_0268-Cthe\_0267), σI3-RsgI3 (Cthe\_0315-Cthe\_0316), σI4-RsgI4 (Cthe\_0403-Cthe\_0404), σI5-RsgI5 (Cthe\_1272-Cthe\_1273), σI6-RsgI6 (Cthe\_2120-Cthe\_2119), and σ24C-Rsi24C (Cthe\_1470-Cthe\_1471). Among them, three SigI genes (Cthe\_0058, Cthe\_0268, Cthe\_0403) were up-regulated during later stages of pure cellulose (Avicel) fermentation (Raman et al., 2011).

In this study, while the transcripts of all seven pairs of genes had been detected (see **Supplementary Data Sheet 1**), three pairs of genes and an extra SigI-RsgI pair (which was not included in the initial literature prediction shown above) were up-regulated with FC values above 3.0 in the PYP- against cellobiose-grown cells at the late stationary phase, as described below:




*<sup>a</sup>GH family characterization was based on the CAZy database (http://www.cazy.org/).*

*<sup>b</sup>The cellular location was based on three sources: literature, the UniProt database (http://www.uniprot.org/), or prediction by the PSORTb program (http://www.psort.org/psortb/index.html). All listed genes were statistically significantly up-regulated (p < 0.05). n/a, not available.*

2010). Similar to σI3-RsgI3, the differential expression of this pair was also not previously reported in cellulose-grown *C. thermocellum*.

(4) Equally interesting, another SigI-RsgI pair (Cthe\_2521-Cthe\_2522) was detected in the transcriptome, with FC values of 3.9 and 4.8, respectively (**Supplementary Data Sheet 1**). The sigma factor protein encoded by Cthe\_2521 had been detected in the proteome of *C. thermocellum* cultured on cellobiose (Rydzak et al., 2012), but this pair was not among the initial seven SigI-RsgI pairs proposed for polysaccharide signal transmission in the literature (Nataf et al., 2010). The possible role of this pair warrants further investigation.
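The pair-wise screen used in this section (both the sigma factor and its anti-sigma partner with FC above 3.0) can be sketched as a lookup over the locus tags listed earlier. The dictionary keys are ASCII stand-ins for the σI names, and the function name is hypothetical. Only the FC values for Cthe_2521 and Cthe_2522 are taken from the text; values for the other pairs would come from Supplementary Data Sheet 1.

```python
# Locus tags as listed in the text: (sigma factor, anti-sigma factor).
SIGMA_PAIRS = {
    "SigI1-RsgI1": ("Cthe_0058", "Cthe_0059"),
    "SigI2-RsgI2": ("Cthe_0268", "Cthe_0267"),
    "SigI3-RsgI3": ("Cthe_0315", "Cthe_0316"),
    "SigI4-RsgI4": ("Cthe_0403", "Cthe_0404"),
    "SigI5-RsgI5": ("Cthe_1272", "Cthe_1273"),
    "SigI6-RsgI6": ("Cthe_2120", "Cthe_2119"),
    "Sig24C-Rsi24C": ("Cthe_1470", "Cthe_1471"),
    "additional pair": ("Cthe_2521", "Cthe_2522"),
}


def upregulated_pairs(fc_by_locus, threshold=3.0):
    """Names of pairs whose two genes both have FC above the threshold."""
    hits = []
    for name, (sigma, anti_sigma) in SIGMA_PAIRS.items():
        fc_sigma = fc_by_locus.get(sigma)
        fc_anti = fc_by_locus.get(anti_sigma)
        if fc_sigma is None or fc_anti is None:
            continue  # FC not available for one of the two genes
        if fc_sigma > threshold and fc_anti > threshold:
            hits.append(name)
    return hits


print(upregulated_pairs({"Cthe_2521": 3.9, "Cthe_2522": 4.8}))
```
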

#### **GENES RELATED TO CELLODEXTRIN TRANSPORT AND PHOSPHORYLATION**

*C. thermocellum* has been reported to use ABC-type transporters for the uptake of oligosaccharides derived from cellulose hydrolysis (Strobel et al., 1995). This is an important energy-conserving mechanism: because one ATP molecule is consumed per transport event, importing long cellodextrins reduces the cost of transport per glucose unit. So far, four cellodextrin ABC transporters (carbohydrate-binding proteins CbpA, B, C, and D) have been characterized for their substrate binding features (Nataf et al., 2009). Whereas CbpA (Cthe\_0393) binds only to cellotriose (G3), CbpB (Cthe\_1020) binds to G2-G5 cellodextrins, and CbpC (Cthe\_2128) and D (Cthe\_2446) bind to G3-G5 cellodextrins (Nataf et al., 2009; Rydzak et al., 2012). Several transcripts of these annotated genes have been detected in the transcriptome of *C. thermocellum* (Raman et al., 2011; Riederer et al., 2011). Previously, six cellodextrin ABC transporter genes (Cthe\_1862, Cthe\_0391-0393, and Cthe\_1019-1020, including CbpA and CbpB) were found to be expressed at high levels throughout the course of fermentation on Avicel alone (Raman et al., 2011). Most recently, Cthe\_0391-0393 were found to be highly expressed on both pretreated switchgrass and cottonwood substrates (Wilson et al., 2013a). This study extends such transcriptional analysis by comparing gene expression between PYP- and cellobiose-grown cells at late stationary phase. A total of 12 transcripts of cellodextrin ABC transporters were detected (**Supplementary Data Sheet 3, section 3**; also **Figure 6**), of which seven were up-regulated in PYP- against cellobiose-grown cells, with FC values in the range of 3.1–10.3, suggesting that the PYP-grown cells had a mechanism for enhancing the uptake and utilization of polysaccharides derived from biomass substrates.
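The energy argument for importing longer cellodextrins can be made concrete with a short calculation. It assumes, as stated in the text, that one ATP is consumed per transport event regardless of chain length; the function name is illustrative, and the chain lengths G2 to G5 are the range bound by the characterized Cbp proteins.

```python
def transport_atp_per_glucose(chain_length, atp_per_event=1.0):
    """ATP spent on uptake per glucose unit for a cellodextrin of given length."""
    if chain_length < 1:
        raise ValueError("chain_length must be at least 1")
    return atp_per_event / chain_length


for n in range(2, 6):
    cost = transport_atp_per_glucose(n)
    print(f"G{n}: {cost:.2f} ATP per glucose unit imported")
```

Under this assumption, importing cellopentaose (G5) costs less than half as much ATP per glucose unit as importing cellobiose (G2).
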

For the subsequent phosphorolytic cleavage of the imported oligosaccharides and cellobiose, this study identified the transcripts of one cellodextrin phosphorylase (CDP, Cthe\_2989) and two cellobiose phosphorylases (CEP, Cthe\_0275 and Cthe\_1221). Among these, the CEP Cthe\_1221 was up-regulated with an FC value of 3.0 (**Supplementary Data Sheet 3, section 3**; also **Figure 6**), suggesting that this gene may warrant further study.

#### **GENES RELATED TO GLYCOLYSIS, PYRUVATE CATABOLISM, AND END-PRODUCT SYNTHESIS**

The deduced pathway for cellulolysis, glycolysis, and ethanol and H2 production in *C. thermocellum* ATCC 27405 is illustrated in **Figure 6**, in accordance with accumulated literature findings (Demain et al., 2005; Rydzak et al., 2009, 2011; Riederer et al., 2011; Carere, 2013). The gene IDs for the enzymes involved in this pathway were retrieved from the KEGG PATHWAY database (http://www.genome.jp/kegg) (Kanehisa et al., 2008) and are shown in **Supplementary Data Sheet 3, section 3** with their FC values in PYP- vs. cellobiose-grown cells. Of the listed genes, 18 were significantly up-regulated and 12 significantly down-regulated, as shown in red and green text, respectively, in **Supplementary Data Sheet 3, section 3**.

For pyruvate catabolism and end-product synthesis, *C. thermocellum* may convert pyruvate into (1) formate via pyruvate formate lyase (PFL) and pyruvate formate lyase activating enzyme (PFL-AE), (2) lactate via lactate dehydrogenase (LDH), (3) CO2,

**FIGURE 6 | Diagram for the primary steps in the conversion of cellulose to fermentation products using** *C. thermocellum* **ATCC 27405.** SigI-RsgI (sigma factors coupled with membrane-associated RsgI-like anti-sigma factors) are also illustrated in this diagram as they have been proposed in literature for polysaccharide triggered signal transmission in regulation of GH family genes. Detailed nomenclature of enzymes and their associated gene IDs can be found in **Supplementary Data Sheet 3, section 3** and **Table 4**. For enzymes associated with a single gene ID, the gene ID is not included in the figure; instead, its fold change (FC) value is indicated directly beside the enzyme. For enzymes associated with multiple gene IDs, the gene IDs and their FC values are illustrated in sorted lists in the diagram, in order of either their FC values or locus IDs. Text in red and green represents the up-regulated and down-regulated genes, respectively, in PYP- against cellobiose-grown cells; in contrast, text in black represents the genes with no significant transcriptional changes between the two types of cells, i.e., 2.0 > *FC* value > 0.5. Nomenclature of metabolites: CoA, coenzyme A; ECH, energy conserving hydrogenase; Fd, ferredoxin; Fru, fructose; Glu, glucose; ox, oxidized; P, phosphate; PEP, phosphoenolpyruvate; Pi, inorganic phosphate; PYPO, transcript that was detected in PYP-grown cells only; Pyr, pyruvate; red, reduced.

**Table 4 |** *C. thermocellum* **genes encoding putative hydrogenases, sensory hydrogenases, and NADH:Fd oxidoreductases using ferredoxin and NAD(P)H as electron carriers.**


**TYPE 1. [NiFe] H2ase Fd-dependent ECH: Cthe\_3013-Cthe\_3024**

**TYPE 2. [FeFe] H2ases**

**Type 2.1. Bifurcating: Cthe\_0338-Cthe\_0342; Cthe\_0428-Cthe\_0430**

**Type 2.2. Sensory: Cthe\_0425-Cthe\_0426**

**Type 2.3. NAD(P)H dependent: Cthe\_3003-Cthe\_3004**


*In addition, hydrogenase maturation proteins are also listed. The table is arranged according to the hydrogenase classification described by two literature sources, as cited in the related Results and Discussion section; within each sub-category, genes are arranged in order of locus IDs. For each gene, the fold change (FC) value was calculated by dividing the RPKM of PYP-grown cells by the RPKM of cellobiose-grown cells, as listed in Supplementary Data Sheet 1. RPKM, reads per kilobase of exon model per million mapped reads. Text in black represents genes with no significant transcriptional changes between the two types of cells, i.e., 2.0 > FC value > 0.5. Text in red and green represents genes found to be significantly up-regulated and down-regulated, respectively, in PYP- against cellobiose-grown cells, based on the statistical analysis in Supplementary Data Sheet 2. ECH, energy conserving hydrogenase. PYPO, transcript detected in PYP-grown cells only.*

*aCthe\_3003 ([FeFe] H*2*ase, putative) was characterized as Fd-linked by Schut and Adams (2009), but as NAD(P)H dependent recently by Carere et al. (2012).*

Fdred, and acetyl-CoA, in which acetyl-CoA eventually leads to the production of acetate and ethanol, and Fdred leads to the formation of hydrogen (**Figure 6**). The up-regulation of two of the four alcohol dehydrogenase (ADH) genes, namely Cthe\_0423 (FC value 7.0) and Cthe\_0394 (FC value 5.3), raises the average FC value for ADHs to 3.6. In contrast, the phosphotransacetylase (PTA; Cthe\_1029), whose activity has been verified in *C. thermocellum* (Lamed and Zeikus, 1980), was significantly down-regulated in PYP- against cellobiose-grown cells, with an FC value of 0.2. Based on this mRNA profiling, these data suggest that the carbon flux from acetyl-CoA may be preferentially channeled to ethanol production in PYP-grown cells. The observed decrease in the acetate/ethanol ratio, from 1.08 in cellobiose-grown cells to 0.92 in PYP-grown cells (**Table 1**), supports this carbon shift at the late stationary phase of PYP-grown cells.

Lactate and formate were not monitored in this study because a previous study showed that, in the late growth phase of *C. thermocellum* cultures grown on cellobiose, lactate and formate together represent only a small fraction of the total end products (Islam et al., 2006). Future studies should also monitor the production of lactate, formate, and CO2, which could provide a complete picture of the carbon balance of cells utilizing biomass substrates.

#### **DYNAMICS FOR H<sub>2</sub> PRODUCTION WITH PYP AS SUBSTRATE**

*C. thermocellum* genes encoding putative hydrogenases and sensory hydrogenases that use ferredoxin and NAD(P)H as electron carriers are listed in **Table 4**, together with hydrogenase maturation proteins. The classification of these hydrogenases is mainly based on literature that systematically characterized the putative hydrogenases in *C. thermocellum* and other species (Schut and Adams, 2009; Carere et al., 2012). Briefly, there are two types of hydrogenases (H2ases) according to the metal content of their active sites:

The first type is [NiFe] H2ase, a putative Fd-dependent energy converting hydrogenase (ECH). Its hexameric structural subunits are encoded by Cthe\_3019-Cthe\_3024, with assembly of its active site assisted by a suite of maturation proteins encoded by Cthe\_3013-Cthe\_3018 (**Table 4**, **Figure 6**). This study showed that the FC values for 7 of the 12 detected Fd-H2ase and related genes (Cthe\_3013-Cthe\_3015, Cthe\_3017- Cthe\_3018, and Cthe\_3022-Cthe\_3023) were in the range of 0.6–1.6, suggesting an unchanged status at the transcriptional level for these genes. However, the remaining five genes (Cthe\_3019-Cthe\_3021, Cthe\_3023-Cthe\_3024) encoding the Fd-H2ase structural subunits were significantly down-regulated with FC values between 0.3 and 0.5, which might account for the observed lower H2 yield in PYP-grown cells, compared to the control cellobiose-grown cells. This finding implies that the [NiFe] ECH H2ase likely functions in the H2 production direction since its down-regulation led to less H2 accumulation in the culture headspace.

The second type is the [FeFe] H2ases, which include two bifurcating H2ases, both trimeric, encoded by Cthe\_0338-Cthe\_0342 and Cthe\_0428-Cthe\_0430, respectively. Only seven genes are listed for them in **Table 4** because Cthe\_0339 is annotated as a histidine kinase in the GenBank protein database. The second type also includes a sensory H2ase (Cthe\_0425-Cthe\_0426) and an NAD(P)H-dependent H2ase (Cthe\_3003-Cthe\_3004), as listed in **Table 4**. The data showed that the FC values for the genes of one trimeric bifurcating H2ase (Cthe\_0338, Cthe\_0340-Cthe\_0342) and for the NAD(P)H-dependent H2ase genes (Cthe\_3003-Cthe\_3004) were in the range of 0.6–1.5, indicating unchanged transcription (**Figure 6**). A homolog of this trimeric H2ase has been found in *Acetobacterium woodii*, where it functions in H2 consumption, yielding reduced Fd and NAD(P)H (Schut and Adams, 2009; Hess et al., 2013a,b). In contrast, genes encoding the other trimeric bifurcating H2ase (Cthe\_0428-Cthe\_0430) were significantly up-regulated in PYP-grown cells, with FC values of 6.4–7.9 (**Figure 6**). Based on the literature (Schut and Adams, 2009; Calusinska et al., 2010), the trimeric [FeFe] H2ase identified in *C. thermocellum* is a putative bifurcating hydrogenase. This cytoplasmic enzyme was initially characterized in *Thermotoga maritima*, where it uses both reduced ferredoxin and NAD(P)H as substrates and functions in H2 production (Schut and Adams, 2009; Hess et al., 2013a). Yet the up-regulation of Cthe\_0428-Cthe\_0430 observed in this work coincided with decreased H2 production in PYP- vs. cellobiose-grown cells. As such, the exact role of the *C. thermocellum* trimeric H2ase is unknown, which makes it hard to link the above transcriptional data to the observed lower H2 yield in PYP-grown cells. Further studies are warranted to determine the direction (production or consumption of H2) in which each of the above [FeFe] H2ases operates.

In addition, the genes of the sensory H2ase, Cthe\_0425 and Cthe\_0426, had FC values of 9.9 and 11.0, respectively (**Table 4**). Future studies are also needed to explore the implications of the significant up-regulation of these two genes in PYP-grown cells.

#### **GENES FOR DEGRADING HEMICELLULOSE AND LIGNIN IN FEEDSTOCK BIOMASS**

We noticed that some important hemicellulase genes, such as XynY (Cthe\_0912, FC 13.8), XynD (Cthe\_2590, FC 2.2), XynZ (Cthe\_1963, FC 1.4), and ManA (Cthe\_2811, FC 3.4), were up-regulated in PYP- against cellobiose-grown cells (**Table 2**; cellulosomal proteins). In addition, some auxiliary enzymes such as α-N-arabinofuranosidase (Cthe\_2548, FC 3.3; **Table 3**, non-cellulosomal protein) were also up-regulated. This probably reflects adaptation of the strain to digesting PYP, whose chemical composition and structure differ greatly from those of cellobiose. In contrast, we did not identify the primary genes involved in lignin modification and degradation, such as those encoding laccases, lignin peroxidases, or manganese peroxidases.

#### **GROWTH PHASE AND THE TRANSCRIPTOME OF CELLS**

The main focus of this study was to investigate the effects of different carbon sources (PYP vs. cellobiose) on the transcriptome of the cells. To attain this goal, the selection of the sampling point was crucial. The choice of the late stationary phase in the experimental design of this study is in accordance with the literature (Gold and Martin, 2007; Raman et al., 2009). One merit of this sampling point is that our transcriptomic data can be cross-analyzed with published protein data for cellulosomes extracted from cellobiose-grown cells, which serves as one means of validating the transcriptomic data in this study.

Another merit of this sampling point is that, compared with earlier growth stages (exponential or early stationary phase), the levels of simple, soluble sugars in both PYP- and cellobiose-grown cultures are very low, owing to the limited ability of bacterial enzymes to release sugars from PYP and to the depletion of cellobiose, respectively. In this sense, the late stationary phase is more suitable than other phases for comparing the transcriptomes of cells grown on different carbon substrates in this study.

Nevertheless, it is noteworthy that different growth phases may cause some shifts to some metabolic pathways; thus future studies comparing cells grown on cellobiose vs. cells grown on woody biomass at the exponential phase would provide another angle to investigate the effects of biomass substrate on the transcriptomes of *C. thermocellum* cells, which may lead to the identification of both growth phase- and biomass substrate-responsive genes. In addition, the transcriptional profiles of these two cell samples could also be affected by the distinct concentrations of carbohydrate sources as well as the residual concentrations of metabolites/nutrients present at time of sampling, which will remain a challenge for future studies in managing these variables—some of which are nearly uncontrollable.

# **CONCLUSIONS**

We conducted an RNA-Seq analysis of the transcriptional profiles of *C. thermocellum* grown on PYP and cellobiose, which differs from previous transcriptional studies that focused on the degradation of cellulose-only substrates and from sub-proteomic studies that focused on cellulosomal protein components. We found that nearly 60% of the genes encoding the protein components of the cellulosome, the core machinery for cellulose degradation, were up-regulated, whereas only 5% were down-regulated. The top up-regulated cellulosomal genes, along with the responsive SigI-RsgI and cellodextrin transporter genes, are promising candidates for engineering *C. thermocellum* strains with improved capacity to convert lignocellulosic biomass substrates. Furthermore, the identified differentially expressed NAD(P)H H2ase, ADH, and PTA genes may provide insight into how the cells regulate the production of H2 and ethanol under the carbon-limited condition.

# **ACKNOWLEDGMENTS**

This work was funded by the Laboratory Directed Research and Development (LDRD) program at the National Renewable Energy Laboratory (NREL), and by the U.S. Department of Energy's Office of Science through the BioEnergy Science Center (BESC), Bioenergy Technology Office (DOE-BETO) and the Fuel Cell Technologies Office. This work was supported by DOE under Contract No. DE-AC36-08-GO28308 with NREL. Andrew Bowersox and Igor Bogorad were sponsored by DOE Academies Creating Teacher Scientists (ACTS) and Science Undergraduate Laboratory Internship (SULI) programs, respectively. We thank Dr. Katherine Chou of NREL for discussion.

# **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2014.00142/abstract

**Supplementary Data Sheet 1 | List of 3081 genes with their RPKM and fold change values determined in PYP- vs. cellobiose-grown** *C. thermocellum* **cells.**

**Supplementary Data Sheet 2 | List of 1211 up-regulated and 314 down-regulated genes in PYP-grown** *C. thermocellum* **cells.**

**Supplementary Data Sheet 3 | Section 1.** Forward (F) and reverse (R) primer sequences for quantitative reverse transcription-PCR (qRT-PCR) analysis of selected genes in *C. thermocellum*. **Section 2.** RPKM values of cellulosomal genes based on this study and their protein abundance data retrieved from literature for cellobiose-grown *C. thermocellum* cells at the late stationary phase. **Section 3.** Fold change values for genes related to cellodextrin transport and phosphorylation, glycolysis, pyruvate catabolism and end-product synthesis in *C. thermocellum*.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 21 November 2013; accepted: 19 March 2014; published online: 11 April 2014.*

*Citation: Wei H, Fu Y, Magnusson L, Baker JO, Maness P-C, Xu Q, Yang S, Bowersox A, Bogorad I, Wang W, Tucker MP, Himmel ME and Ding S-Y (2014) Comparison of transcriptional profiles of Clostridium thermocellum grown on cellobiose and pretreated yellow poplar using RNA-Seq. Front. Microbiol. 5:142. doi: 10.3389/fmicb.2014.00142*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Wei, Fu, Magnusson, Baker, Maness, Xu, Yang, Bowersox, Bogorad, Wang, Tucker, Himmel and Ding. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# A mathematical model of metabolism and regulation provides a systems-level view of how *Escherichia coli* responds to oxygen

*Michael Ederer<sup>1</sup>\*, Sonja Steinsiek<sup>2</sup>, Stefan Stagge<sup>2</sup>, Matthew D. Rolfe<sup>3</sup>, Alexander Ter Beek<sup>4</sup>, David Knies<sup>1</sup>, M. Joost Teixeira de Mattos<sup>4</sup>, Thomas Sauter<sup>5</sup>, Jeffrey Green<sup>3</sup>, Robert K. Poole<sup>3</sup>, Katja Bettenbrock<sup>2</sup> and Oliver Sawodny<sup>1</sup>*

*<sup>1</sup> Institute for System Dynamics, University of Stuttgart, Stuttgart, Germany*

*<sup>2</sup> Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany*

*<sup>3</sup> Department of Molecular Biology and Biotechnology, The University of Sheffield, Sheffield, UK*

*<sup>4</sup> Molecular Microbial Physiology, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, Netherlands*

*<sup>5</sup> Life Sciences Research Unit, Université du Luxembourg, Luxembourg, Luxembourg*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Dong-Woo Lee, Kyungpook National University, Korea (South)*

*Armen Trchounian, Yerevan State University, Armenia*

#### *\*Correspondence:*

*Michael Ederer, Institute for System Dynamics, University of Stuttgart, Pfaffenwaldring 9, Stuttgart D-70569, Germany; e-mail: michael.ederer@isys.uni-stuttgart.de*

The efficient redesign of bacteria for biotechnological purposes, such as biofuel production, waste disposal or specific biocatalytic functions, requires a quantitative systems-level understanding of energy supply, carbon, and redox metabolism. The measurement of transcript levels, metabolite concentrations and metabolic fluxes *per se* gives an incomplete picture. An appreciation of the interdependencies between the different measurement values is essential for systems-level understanding. Mathematical modeling has the potential to provide a coherent and quantitative description of the interplay between gene expression, metabolite concentrations, and metabolic fluxes. *Escherichia coli* undergoes major adaptations in central metabolism when the availability of oxygen changes. Thus, an integrated description of the oxygen response provides a benchmark of our understanding of carbon, energy, and redox metabolism. We present the first comprehensive model of the central metabolism of *E. coli* that describes steady-state metabolism at different levels of oxygen availability. Variables of the model are metabolite concentrations, gene expression levels, transcription factor activities, metabolic fluxes, and biomass concentration. We analyze the model with respect to the production capabilities of central metabolism of *E. coli*. In particular, we predict how precursor and biomass concentration are affected by product formation.

**Keywords:** *Escherichia coli***, mathematical modeling, metabolism, regulation, respiration, fermentation, thermokinetic modeling**

#### **1. INTRODUCTION**

*Escherichia coli* is able to utilize a variety of electron and carbon donors, such as glucose or glycerol, and electron acceptors, such as oxygen or nitrate. Energy currencies in the form of the proton motive force (pmf) and the ATP/ADP ratio are supplied either by substrate level phosphorylation or by proton translocation against the pmf during membrane-associated electron transport (Lengeler et al., 1998). The membrane-associated electron transport chain transfers electrons from cytoplasmic metabolites, mostly the electron carriers NADH and FADH2, to quinones and from the quinones to the external electron acceptor (Ingledew and Poole, 1984). The thermodynamic force of these redox reactions can be used to translocate protons and thus contribute to the maintenance of the pmf. Depending on the extracellular medium, *E. coli* uses different strategies to match the balance of carbon, electrons, and energy. We focus on the use of glucose as electron and carbon donor and oxygen as the electron acceptor. If oxygen is available, the redox balance is maintained by transferring electrons to the external acceptor oxygen. Membrane-associated electron transfer is coupled to proton translocation. Glucose is oxidized to carbon dioxide or to partially oxidized products such as acetate and succinate, often referred to as overflow products, or it is converted into precursors for biosynthesis. If neither oxygen nor any other external electron acceptor is available, ATP is gained mainly by substrate level phosphorylation in glycolysis. Redox balance is maintained by the excretion of different metabolites (e.g., acetate, formate, ethanol, and succinate). Even in the absence of extracellular electron acceptors, the electron transport chain can be active, since intracellular electron acceptors, such as fumarate, are available (Lengeler et al., 1998).

The metabolism of bacteria needs to match the requirements of growth and maintenance for carbon, electrons and energy with the supply from the medium. Complex regulatory networks control this process, such as those that operate following a change in oxygen availability. The redesign of bacteria for biotechnological purposes, such as biofuel production, puts additional loads on the metabolism. A high level expression of a production pathway is often not sufficient for satisfactory production capabilities. For optimal production, metabolic regulation needs to be adapted accordingly (Shimizu, 2009).

The adaptation to a change of oxygen availability is controlled by transcriptional regulation centered around the transcription factors FNR and ArcA (Sawers, 1999). The activity of FNR depends directly on oxygen (Jordan et al., 1997) and the activity of ArcA depends on the quinols and quinones of the electron transport chain (Georgellis et al., 1997; Bekker et al., 2010; Alvarez et al., 2013; Sharma et al., 2013).

Alexeeva et al. (2000, 2002, 2003) introduced the aerobiosis scale that allows the reproducible adjustment of different microaerobic steady states in a continuous, glucose-limited chemostat culture. Anaerobic growth corresponds to an aerobiosis value of 0%. An aerobiosis value of 100% is defined as the steady state with the minimal oxygen input into the reactor where no fermentation products are excreted, i.e., where all carbon is either incorporated into biomass or respired to carbon dioxide. The aerobiosis scale is linear with the oxygen input into the reactor. The dependency of the biomass-specific oxygen uptake flux on the aerobiosis level is concave because at higher oxygen availability the steady state biomass concentration is higher. For the *E. coli* wild type, biomass-specific acetate production decreases linearly with aerobiosis until it vanishes at 100% aerobiosis units. The aerobiosis scale allows the reproducible analysis of microaerobic states on the physiological (Alexeeva et al., 2000, 2002, 2003; Bekker et al., 2010; Steinsiek et al., 2011) and transcriptional level (Partridge et al., 2006, 2007; Rolfe et al., 2011, 2012; Trotter et al., 2011). Using the aerobiosis scale, results of different laboratories using different reactors can be compared. The aerobiosis scale thus provides an ideal basis for mathematical modeling.

The analysis of measurement data of transcript levels, protein abundances, metabolite concentrations, and fluxes is a valuable tool to reveal bottlenecks of production pathways. In simple cases, such as a feedback inhibition in a linear path, the repressing conditions can be identified by the measurement of metabolite concentrations and reaction fluxes, and countermeasures can be implemented genetically. In more complex cases, the inherent correlation between different measured quantities may not always be so apparent. The metabolic pathways of the central metabolism form a strongly interconnected network with complex interdependencies. A thorough analysis of manipulations of this network requires integration of measurement data of different types, in particular transcript and protein levels, metabolite concentrations, fluxes, and transcription factor activities, by analyzing their dependencies within the network structure. Mathematical modeling of the central metabolism can provide tools for analyzing and predicting the effect of genetic intervention and thus provide guidance when redesigning organisms for biofuel production. The model-based integration of signal transduction, regulation, and metabolism is still not standard and most models are restricted to describing either metabolic processes, signal transduction or regulation (Gonçalves et al., 2013). The response of cellular metabolism to oxygen was studied previously using mathematical models. For example, Varma et al. (1993) used flux balance analysis to analyze the yield optimal behavior of *E. coli* for different oxygen availabilities. Peercy et al. (2006) presented a kinetic model of the respiratory chain of *E. coli* and its regulation via FNR and ArcA. They demonstrated that the model is able to show complex dynamic behavior such as oscillations and hysteresis. 
Beard (2005) described the electron transport chain of mitochondria by choosing a force for each reaction that is consistent with the requirements of thermodynamics. Similarly, Klamt et al. (2008) used linear relationships of affinity and flux to describe the kinetics of the electron transport chain of purple non-sulfur bacteria.

Here, we present a modeling approach that integrates several levels of information. The goal of the modeling approach was to provide a physically consistent systems-level view of the central carbon and energy metabolism of *E. coli* and its regulation. We show how the model is able to explain the steady state response of *Escherichia coli* to oxygen by comparing model simulation and measurement data for different values of aerobiosis. We demonstrate the utility of the model by making predictions on the effect of biofuel production pathways on bacterial metabolism.

# **2. MATERIALS AND METHODS**

#### **2.1. MODELING**

The complete model information is available in the supplementary files. In particular, **Supplementary Data Sheet 1** shows an overview of all model elements in alphabetical order and **Supplementary Data Sheet 2** shows the model definition file. The model definition file together with the Mathematica package TKMOD (Thermo-Kinetic Modeling) can be downloaded from https://seek.sysmo-db.org/models/23. Together they provide a runnable version of the model.

#### *2.1.1. Model of the metabolism*

We use the thermodynamic-kinetic modeling formalism (Ederer and Gilles, 2007; Ederer, 2010) to describe the metabolic reaction network. We assume an ideal, aqueous solution with the chemical potentials $\mu_i = \mu_i^\circ(T, p, a_{\mathrm{H_2O}}, \mathrm{pH}, I) + R^* \cdot T \cdot \log(c_i/c^\circ) + z_i \cdot F \cdot \varphi_i$. For $\mu_i^\circ$ we use the transformed Gibbs formation energy of metabolite *i*. A Legendre transformation is conducted to adapt the Gibbs energy to constant pH and water activity $a_{\mathrm{H_2O}}$ (Alberty, 2003). The Debye-Hückel equation is used to correct for the effect of ionic strength *I* (Alberty, 2003). Temperature *T* and pressure *p* are constant. The symbols $R^*$ and *F* denote the ideal gas constant and the Faraday constant, respectively. The charge number $z_i$ of biochemical species *i* is approximated by the charge of the dominant species at pH 7 and taken from Reed et al. (2003). The electrical potential of the compartment of metabolite *i* is denoted by $\varphi_i$. According to Ederer (2010), the relationship between thermokinetic potential $\xi_i$ and concentration $c_i$ is $c_i = C_i \cdot \xi_i$ with $C_i = c^\circ \cdot \exp\left(-\left(\mu_i^\circ + z_i \cdot F \cdot \varphi_i\right)/(R^* \cdot T)\right)$, where $C_i$ is the thermokinetic capacity.
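The relation between concentration, thermokinetic capacity and thermokinetic potential can be illustrated with a short Python sketch. This is not part of TKMOD; the function names, the temperature of 310.15 K and the standard concentration of 1 are assumptions chosen only for illustration.

```python
import math

R_GAS = 8.314462618    # ideal gas constant in J/(mol*K)
FARADAY = 96485.33212  # Faraday constant in C/mol


def thermokinetic_capacity(mu0, z=0, phi=0.0, T=310.15, c_std=1.0):
    """C_i = c_std * exp(-(mu0 + z*F*phi) / (R*T)).

    mu0: transformed Gibbs formation energy in J/mol
    z:   charge number, phi: compartment potential in V
    T and c_std are illustrative assumptions, not values from the paper.
    """
    return c_std * math.exp(-(mu0 + z * FARADAY * phi) / (R_GAS * T))


def thermokinetic_potential(c, mu0, z=0, phi=0.0, T=310.15, c_std=1.0):
    """xi_i = c_i / C_i, which equals exp(mu_i / (R*T))."""
    return c / thermokinetic_capacity(mu0, z, phi, T, c_std)
```

Written this way, the potential of a species is simply the exponential of its chemical potential in units of the thermal energy, so products of potentials correspond to sums of chemical potentials.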

Fluxes $J_j$ of metabolic reactions are modeled according to Ederer (2010) as $(R_j(\xi)/c_{\mathrm{E},j}) \cdot J_j = F_j(\xi)$, where $F_j(\xi)$ is the thermokinetic force, $R_j(\xi)$ is the enzyme-specific thermokinetic resistance and $c_{\mathrm{E},j}$ is the enzyme concentration. For example, the thermokinetic force of the reaction $A + B \rightleftharpoons C$ is $F = \xi_A \cdot \xi_B - \xi_C$. The above expression reflects three major effects that control metabolic fluxes: the thermokinetic force $F_j(\xi)$ describes the influence of reactants and products, the resistance $R_j(\xi)$ describes the specific enzyme activity, which may depend on further activators and inhibitors, and the enzyme concentration $c_{\mathrm{E},j}$ describes the influence of the amount of enzyme on the metabolic reaction. For most reactions the thermokinetic resistance $R_j(\xi)$ is assumed to be independent of the metabolite potentials $\xi$. For some reactions, where enzymatic regulation proved to be important for describing the experimental data, corresponding non-constant terms were included in $R_j(\xi)$ (see **Supplementary Data Sheet 2**).
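As a minimal sketch of this flux law, the following Python functions compute the thermokinetic force of a reaction and the resulting flux. They assume unit stoichiometric coefficients and a constant resistance; the function names and numbers are hypothetical and are not taken from the model definition file.

```python
def thermokinetic_force(xi_reactants, xi_products):
    """Force F = product of reactant potentials minus product of product potentials.

    Assumes unit stoichiometry, as in the example A + B <-> C.
    """
    forward = 1.0
    for xi in xi_reactants:
        forward *= xi
    backward = 1.0
    for xi in xi_products:
        backward *= xi
    return forward - backward


def flux(force, resistance, c_enzyme):
    """Solve (R / c_E) * J = F for the flux J."""
    return c_enzyme * force / resistance
```

For instance, `flux(thermokinetic_force([2.0, 3.0], [4.0]), resistance=10.0, c_enzyme=0.5)` gives a positive (forward) flux, while a force of zero corresponds to thermodynamic equilibrium and therefore to zero flux regardless of the enzyme level.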

#### *2.1.2. De novo synthesis of conserved moieties*

The *de novo* synthesis of AMP, NAD, NADP, and CoA was modeled such that the concentrations of these conserved moieties are constant despite dilution due to growth. The *de novo* synthesis of quinones is modeled as a function of the aerobiosis value in order to reproduce the observed changes in the total quinone concentrations. In the model, the pool concentration of ubiquinone and ubiquinol increases linearly with aerobiosis and the pool concentration of menaquinone and menaquinol decreases linearly with aerobiosis. At oxygenation levels higher than 100% aerobiosis these concentrations are constant. In order to reproduce the observation that even in the complete anaerobic case a substantial part of the quinone pool is oxidized, we introduced a constant pool of oxidized quinones that does not participate in any reaction.

#### *2.1.3. Transcription factor activity*

Transcription factors control gene expression by activating or repressing the expression of many genes. The observed activity of a transcription factor results from a complex interplay between its amount, which may in turn be controlled by other transcription factors, and its activation, which is often controlled by a metabolic signal, for example oxygen in the case of FNR. As a simplification, we introduce the activity $a_{\mathrm{TF},i}$ of transcription factor *i*. If the activity of transcription factor *i* is minimal, we write $a_{\mathrm{TF},i} = 0$, and if it is maximal, we write $a_{\mathrm{TF},i} = 1$. We assume a phenomenological Hill-type equation $a_{\mathrm{TF},i} = x_{\mathrm{TF},i}^{n_{\mathrm{TF},i}} / \left(x_{\mathrm{TF},i}^{n_{\mathrm{TF},i}} + k_{\mathrm{TF},i}^{n_{\mathrm{TF},i}}\right)$ that describes $a_{\mathrm{TF},i}$ as a function of the respective metabolic signal $x_{\mathrm{TF},i}$. For $n_{\mathrm{TF},i} > 0$ transcription factor *i* is activated by its metabolic signal $x_{\mathrm{TF},i}$. For $n_{\mathrm{TF},i} < 0$ it is inhibited. **Supplementary Data Sheet 3** lists the transcription factors with the respective metabolic signals.
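The Hill-type activity function can be sketched in a few lines of Python. The function name and the example values are illustrative assumptions; the actual signals and parameters are those listed in the supplementary material.

```python
def tf_activity(x, k, n):
    """Hill-type transcription factor activity a = x^n / (x^n + k^n).

    x: metabolic signal (must be positive), k: half-saturation constant,
    n: Hill exponent; n > 0 means activation, n < 0 means inhibition.
    """
    return x**n / (x**n + k**n)
```

With a positive exponent the activity rises from 0 to 1 as the signal increases, and with a negative exponent it falls from 1 to 0. In both cases the activity is 0.5 when the signal equals `k`.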

#### *2.1.4. Gene expression*

Gene expression is described by the equation $\dot{c}_{\mathrm{E},i} = J_{\mathrm{syn},i}(a_{\mathrm{TF}}) - \mu \cdot c_{\mathrm{E},i}$, where $c_{\mathrm{E},i}$ is the concentration of enzyme *i*, $\mu$ is the dilution rate due to growth and $J_{\mathrm{syn},i}$ is the synthesis rate of enzyme *i*, which depends on the activities of the transcription factors $a_{\mathrm{TF}}$. The complex dependency of the synthesis rate on the transcription factors is approximated by a phenomenological relationship. For example, the expression of a gene that is activated by transcription factors 1 and 2 and inhibited by transcription factor 3 is modeled by $J_{\mathrm{syn},i} = s(k_1, a_{\mathrm{TF},1}) \cdot s(k_2, a_{\mathrm{TF},2}) \cdot s(k_3, 1 - a_{\mathrm{TF},3})$, where $s(k, a_{\mathrm{TF}}) = 2^{-k} + (1 - 2^{-k}) \cdot a_{\mathrm{TF}}$. This means that each transcription factor is able to change the expression of a gene by a factor of $2^k$ and that the interaction of different transcription factors is multiplicative. As will be seen, this model allows the description of the measurement data, and therefore the use of more complex models, allowing for example for additive interactions, is not necessary at this stage.
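The multiplicative expression model can be sketched as follows in Python. The helper names are hypothetical, and the steady-state expression is simply what results from setting the time derivative of the enzyme concentration to zero; it is an illustration, not code from the model files.

```python
def s(k, a):
    """Regulatory factor s(k, a) = 2^-k + (1 - 2^-k) * a, ranging from 2^-k to 1."""
    return 2.0 ** (-k) + (1.0 - 2.0 ** (-k)) * a


def synthesis_rate(activators, repressors):
    """Multiply the regulatory factors of all transcription factors acting on a gene.

    activators, repressors: lists of (k, a_TF) pairs.
    Repressors enter with 1 - a_TF, as in the example in the text.
    """
    rate = 1.0
    for k, a in activators:
        rate *= s(k, a)
    for k, a in repressors:
        rate *= s(k, 1.0 - a)
    return rate


def steady_state_enzyme(j_syn, mu):
    """Steady state of dc_E/dt = J_syn - mu * c_E, i.e. c_E = J_syn / mu."""
    return j_syn / mu
```

Because `s(k, 1) / s(k, 0)` equals `2**k`, an integer `k` directly encodes the fold change, in powers of two, that a single transcription factor can impose on a gene.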

#### *2.1.5. Growth and maintenance*

Central metabolism provides biosynthesis with precursor molecules and with redox and energy equivalents. In order to describe the growth of the biomass we assume a stoichiometric relation for the reaction of precursors into biomass:

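$$\begin{aligned}
&\nu_{\mathrm{g6p}} \cdot \mathrm{G6P} + \nu_{\mathrm{f6p}} \cdot \mathrm{F6P} + \nu_{\mathrm{dhap}} \cdot \mathrm{DHAP} + \nu_{\mathrm{3pg}} \cdot \mathrm{3PG} + \nu_{\mathrm{pep}} \cdot \mathrm{PEP} + \nu_{\mathrm{pyr}} \cdot \mathrm{PYR} \\
&\quad + \nu_{\mathrm{accoa}} \cdot \mathrm{ACCOA} + \nu_{\mathrm{succoa}} \cdot \mathrm{SUCCOA} + \nu_{\mathrm{akg}} \cdot \mathrm{AKG} + \nu_{\mathrm{oaa}} \cdot \mathrm{OAA} + \nu_{\mathrm{r5p}} \cdot \mathrm{R5P} + \nu_{\mathrm{e4p}} \cdot \mathrm{E4P} \\
&\quad + \nu_{\mathrm{atp}} \cdot \mathrm{ATP} + \nu_{\mathrm{nadph}} \cdot \mathrm{NADPH} + \nu_{\mathrm{nad}} \cdot \mathrm{NAD} \\
&\rightarrow (\nu_{\mathrm{accoa}} + \nu_{\mathrm{succoa}}) \cdot \mathrm{COA} + \nu_{\mathrm{atp}} \cdot \mathrm{ADP} + \nu_{\mathrm{nad}} \cdot \mathrm{NADH} + \nu_{\mathrm{nadph}} \cdot \mathrm{NADP} \\
&\quad + \nu_{\mathrm{g3p}} \cdot \mathrm{G3P} + \nu_{\mathrm{succ}} \cdot \mathrm{SUCC} + \nu_{\mathrm{fum}} \cdot \mathrm{FUM} + \nu_{\mathrm{co2}} \cdot \mathrm{CO2} + \nu_{\mathrm{ac}} \cdot \mathrm{AC} + 1\,\mathrm{g\ DCW}
\end{aligned}$$

where the stoichiometric coefficients $\nu_i$ define how much precursor is needed to produce 1 gram of dry cell mass (see Neidhardt et al., 1990). We assume that the rate of this reaction depends on the concentrations of the reactants with linlog kinetics. Growth of *E. coli* can only occur if the adenylate energy charge is above a threshold (Chapman et al., 1971). To model this fact, the linlog kinetics are extended by a factor realizing a ramp function with saturation dependent on the ATP/ADP ratio:

$$\mu = \begin{cases} 0 & \text{if } c_{\text{atp}}/c_{\text{adp}} < k_{lo} \\ k_a \cdot \left(k_b + \sum_i \nu_i \cdot \log(c_i)\right) \cdot \dfrac{c_{\text{atp}}/c_{\text{adp}} - k_{lo}}{k_{hi} - k_{lo}} & \text{if } k_{lo} \le c_{\text{atp}}/c_{\text{adp}} \le k_{hi} \\ k_a \cdot \left(k_b + \sum_i \nu_i \cdot \log(c_i)\right) & \text{if } k_{hi} < c_{\text{atp}}/c_{\text{adp}} \end{cases}$$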

where *i* runs over the reactants. Cellular maintenance is modeled by assuming a hydrolysis rate of ATP to ADP that is not coupled to processes of the central metabolism or growth. It follows a ramp function with saturation dependent on the ATP/ADP ratio. The use of ramp functions assures that below a certain threshold no growth or maintenance occurs and that the rate of growth or maintenance saturates above a certain threshold.
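The growth law can be illustrated with the Python sketch below. It reads the lowest branch of the piecewise definition as applying when the ATP/ADP ratio is below the lower threshold, in line with the statement that no growth occurs below that threshold. All names and parameter values are hypothetical.

```python
import math


def ramp(ratio, k_lo, k_hi):
    """Ramp with saturation: 0 below k_lo, 1 above k_hi, linear in between."""
    if ratio < k_lo:
        return 0.0
    if ratio > k_hi:
        return 1.0
    return (ratio - k_lo) / (k_hi - k_lo)


def growth_rate(conc, nu, c_atp, c_adp, k_a, k_b, k_lo, k_hi):
    """Linlog growth kinetics scaled by an ATP/ADP-dependent ramp.

    conc: dict of reactant concentrations, nu: dict of stoichiometric
    coefficients keyed by the same reactant names.
    """
    linlog = k_a * (k_b + sum(nu[m] * math.log(conc[m]) for m in nu))
    return linlog * ramp(c_atp / c_adp, k_lo, k_hi)
```

The same `ramp` helper could be reused for the maintenance term, since the text describes ATP hydrolysis for maintenance with the same kind of saturating dependence on the ATP/ADP ratio.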

#### *2.1.6. Chemostat environment*

With the specific growth rate $\mu$, the chemostat equations (Smith and Waltman, 2008) give $\dot{c}_{\mathrm{X}} = \mu \cdot c_{\mathrm{X}} - D \cdot c_{\mathrm{X}}$ for the biomass concentration $c_{\mathrm{X}}$. The extracellular metabolites are described by $\dot{c}_i = J_i \cdot c_{\mathrm{X}} + D \cdot c_{\mathrm{in},i} - D \cdot c_i$, where $c_{\mathrm{in},i}$ is the concentration of *i* in the inflow, $J_i$ is the specific production (positive) or consumption (negative) rate of metabolite *i* by the biomass and *D* is the dilution rate. For the gaseous compounds oxygen and carbon dioxide the equation is modified to $\dot{c}_i = J_i \cdot c_{\mathrm{X}} + k_{\mathrm{in},i} - k_{\mathrm{out},i} \cdot c_i$ because these compounds are mainly exchanged via the gas phase rather than the liquid phase. The parameter $k_{\mathrm{in},i}$ describes the supply to the medium by the aeration flow. For oxygen this parameter depends on the aerobiosis value. The parameter $k_{\mathrm{out},i}$ describes the outgassing.
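The chemostat balances translate directly into right-hand-side functions. The Python sketch below uses hypothetical function names and also includes the steady-state concentration of a dissolved metabolite, obtained by setting its time derivative to zero.

```python
def biomass_rate(mu, c_x, D):
    """dc_X/dt = mu * c_X - D * c_X; zero at steady state when mu == D."""
    return mu * c_x - D * c_x


def dissolved_rate(J, c_x, c_in, c, D):
    """dc_i/dt for a metabolite exchanged via the liquid inflow and outflow."""
    return J * c_x + D * c_in - D * c


def gas_rate(J, c_x, k_in, k_out, c):
    """dc_i/dt for oxygen or carbon dioxide, exchanged mainly via the gas phase."""
    return J * c_x + k_in - k_out * c


def steady_state_dissolved(J, c_x, c_in, D):
    """Solve dissolved_rate(...) == 0 for c: c = c_in + J * c_X / D."""
    return c_in + J * c_x / D
```

The biomass balance shows why, in a steady-state chemostat with nonzero biomass, the specific growth rate has to equal the dilution rate.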

#### *2.1.7. Model reduction*

The resulting model spans the time scales from fast metabolic reactions up to steady state growth. The stiffness of the model calls for model reduction. For several reactions of the central metabolism it is known that they proceed near thermodynamic equilibrium (Kümmel et al., 2006a). We assume quasi-equilibrium for several reactions. For example, the reaction G6P ⇌ F6P is usually rapid such that we can assume that the concentrations of G6P and F6P are in equilibrium with each other. This allows the reduction of the order and the stiffness of the model. In thermokinetic modeling this can be achieved by assuming a vanishing resistance (*R<sub>j</sub>* = 0). The resulting differential-algebraic equation system has index 2 but can be simplified to index 1 (Ederer, 2010).

#### *2.1.8. TKMOD*

Modeling is done using the tool TKMOD (Ederer, 2010). TKMOD reads a model description (see **Supplementary Data Sheet 2**) in which stoichiometry, thermokinetic capacities and resistances can be defined. It then derives the model equations using the computer algebra system Mathematica (Wolfram Research, 2010). TKMOD automatically performs reduction steps for fast reactions with *R<sub>j</sub>* = 0 and writes FORTRAN code for the simulation equations. For simulation, TKMOD uses DASKR, a solver for differential-algebraic equation systems (Brown et al., 1994, 1998, 2007). A version of TKMOD including DASKR is packaged together with the model files and available from https://seek.sysmo-db.org/models/23.

#### *2.1.9. Parameters*

The model described above has different types of parameters. The stoichiometric parameters of the reactions are taken mainly from Reed et al. (2003). The amount of translocated protons during electron transport follows Borisov et al. (2011). The thermokinetic capacities are computed from the standard Gibbs energies of formation. Standard Gibbs formation energies for most metabolites are taken from Alberty (2003). Thermodynamic data for the ubiquinone/quinol and menaquinone/quinol pairs are from Alvarez et al. (2013). Data for metabolites in the pentose phosphate pathway are taken from Kümmel et al. (2006b). The parameters of gene expression, gene regulation and the thermokinetic resistances are manually adjusted in order to fit the experimental data. The thermokinetic resistance $R_j$ of reaction *j* is related to the thermokinetic capacities of the reactants $C_i$ and the specific forward rate constant $k_{+j}$ by $R_j = k_{+j}^{-1} \cdot \prod_{i \in E_j} C_i^{-|\nu_{ij}|}$. In order to restrict the search space, we assume that $k_{+j}$ can take only values of the type $10^x$ with an integer *x*. Similarly, we assume that the parameters $k_i$ of the gene expression model can take only integer values. The quality of the fit we achieve with this strong restriction suggests that the order of magnitude of the resistances determines most of the behavior of the model and that many features of the model are robust against uncertainties in the exact values.
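The relation between resistance, forward rate constant and capacities can be checked numerically. The Python sketch below assumes that the product runs over the reactants of the reaction, as the sentence above suggests; the function name and all numbers are illustrative only.

```python
def thermokinetic_resistance(k_plus, capacities, stoich):
    """R_j = (1 / k_plus) * product over reactants of C_i ** (-|nu_ij|).

    capacities: thermokinetic capacities C_i of the reactants
    stoich:     their stoichiometric coefficients nu_ij (sign is ignored)
    """
    resistance = 1.0 / k_plus
    for cap, nu in zip(capacities, stoich):
        resistance *= cap ** (-abs(nu))
    return resistance


# Restricting k_plus to powers of ten, as described in the text:
candidate_rate_constants = [10.0 ** x for x in range(-3, 4)]
```

With this definition the forward part of the flux, the product of the reactant potentials divided by the resistance, equals the mass-action term `k_plus` times the product of the reactant concentrations, because each concentration is the capacity times the potential.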

#### *2.1.10. Comparison of simulation and experimental data*

Measurement data on biomass concentration, yield factors and fluxes can be compared directly to the simulation results. Valgepea et al. (2010) observe a high correlation between transcript levels and enzyme concentrations for similar chemostat conditions. A high correlation was also observed for the gene expression data used in this study (Rolfe et al., 2011; Trotter et al., 2011). For this reason, we are able to compare measured transcript levels with the simulated enzyme concentrations. The direct use of measured metabolite concentrations in the mathematical model is subject to several uncertainties. The Gibbs formation energies used to parametrize the model are measured for dilute aqueous solutions, which differ from the crowded cytoplasm (Cossins et al., 2011). Systematic losses during quenching and sample preparation may prevent an absolute quantification. For these reasons, we treat the measured metabolite concentrations as relative values.

Relative measurement values $x_m$ (transcripts, metabolites) are scaled with a factor *f* before comparing them (in a plot) with the respective simulated variables $x_s$. The factor *f* is calculated such that the quadratic difference $\lVert (1/f) \cdot \mathbf{x}_s - \mathbf{x}_m \rVert_2^2$ between measured and simulated variables is minimal and can be computed as $f = \dfrac{\mathbf{x}_s^T \cdot \mathbf{x}_s}{\mathbf{x}_m^T \cdot \mathbf{x}_s}$. Here, $\mathbf{x}_s$ and $\mathbf{x}_m$ denote the vectors with corresponding pairs of simulated and measured data points. This means that $x_s$ is plotted together not with $x_m$ but with $f \cdot x_m$.
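The scaling factor is an ordinary one-parameter least-squares solution and can be sketched in plain Python. The function name and the example vectors are hypothetical.

```python
def scale_factor(x_s, x_m):
    """Return f = (x_s . x_s) / (x_m . x_s).

    This f minimizes the squared distance between x_s / f and x_m,
    so the measured values are plotted as f * x_m next to x_s.
    """
    numerator = sum(s * s for s in x_s)
    denominator = sum(m * s for m, s in zip(x_m, x_s))
    return numerator / denominator


simulated = [2.0, 4.0, 6.0]
measured = [1.0, 2.0, 3.0]  # same trend, different (relative) units
f = scale_factor(simulated, measured)
scaled_measurements = [f * m for m in measured]
```

In this constructed example the measured values are exactly proportional to the simulated ones, so the scaled measurements coincide with the simulation. With real data only the relative trend across aerobiosis values is compared.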

#### **2.2. EXPERIMENTAL DATA**

Experimental conditions and strain *E. coli* MG1655 were as described by Rolfe et al. (2012). Biomass and extracellular metabolite concentrations were determined as in Steinsiek et al. (2011). Transcript data were measured via DNA microarray technology and are taken from Rolfe et al. (2011). Measurements were complemented with RT-PCR data as described by Steinsiek et al. (2011).

For determination of intracellular metabolite concentrations, cells were first quenched following the method of Link et al. (2008), and afterwards a modified extraction procedure published by Ritter et al. (2008) was used. Following the method of Link et al. (2008), 10 ml of cell-containing medium from continuous cultivations were immediately quenched in 20 ml methanol-glycerol solution (60%/40% v/v) at −60 °C, thereby holding the temperature below −20 °C. Samples were thoroughly mixed and immediately transferred to dry ice and cooled to −50 °C. After centrifugation for 30 min at 10,000 g and −20 °C, the cell pellet was washed with methanol-glycerol solution (60%/40% v/v) at −20 °C. After a second centrifugation step, all the supernatant was removed and the pellet was kept at −80 °C until extraction. The cell pellet was extracted with 1 ml methanol, and immediately after resuspension 500 µl of trichloromethane were added and the solution was mixed vigorously. The sample was split into three aliquots and 450 µl trichloromethane pre-chilled on ice were added to each aliquot. Samples were thoroughly vortexed. Afterwards 900 µl of methanol/tricine buffer (9:10 parts; final concentration of tricine 1 mM, pH = 7.4) were added, and the sample was vortexed again and centrifuged for 10 min at 16,000 g at 4 °C. 800 µl of the upper (hydrophilic) phase were collected and stored. This step was repeated and the supernatant was collected, combined with the first sample and boiled for 4 min at 90 °C. The sample was again centrifuged at 16,000 g at 4 °C and the supernatant was evaporated to dryness under a nitrogen stream. Samples were afterwards analyzed by anion exchange chromatography using a BioLC type DX320 (Dionex) as described by Ritter et al. (2006).

Quinone/ol concentrations were determined as described in (Bekker et al., 2007).

# **3. RESULTS AND DISCUSSION**

We present a mathematical model of the oxygen response of an *E. coli* population in a glucose-limited chemostat. Modeling is facilitated by the restriction to steady state conditions and the use of the aerobiosis scale. The use of the aerobiosis scale allows the integration of experiments in different reactors with one unique parameter set. Due to the restriction to steady state conditions, differences in initial conditions (e.g., initial pH, cell density, gene expression levels) do not need to be considered. The overall model contains a thermokinetic model of the metabolic network. The model reproduces the effect of reactants, products, important activators and inhibitors, as well as the enzyme concentrations on the metabolic reaction rate by simplified kinetic laws. The metabolic model is complemented by a gene expression model. The synthesis rate of enzymes depends on the activities of transcription factors. The activities of transcription factors depend in turn on their respective metabolic signals. Information about the network structure is based on the EcoCyc database (Keseler et al., 2013). The metabolism and regulation models are embedded into a model describing the growth of the bacterial population and the chemostat environment. The final model is able to provide an integrated description of metabolic fluxes and concentrations, gene expression levels and genetic regulation. A further hallmark of the model is that the balances of the metabolites ATP, ADP, and AMP, as well as NADH, NAD, NADPH, and NADP are explicitly considered. In many models of smaller subnetworks the concentrations of these ubiquitous metabolites are assumed to be constant because only a small subset of all producing and consuming reactions is modeled. The present model seeks a complete description of the balance of these metabolites and thus reflects the constraints that arise from energy and redox requirements.

The parameters of the model fall into two classes: (1) The stoichiometric parameters and the Gibbs formation energies are largely organism-independent and can be taken from available databases. (2) The thermokinetic resistances and the parameters of gene expression and gene regulation are free and (within bounds) not subject to physical constraints. Their values depend on properties of enzymes, transcription factors, and consensus sequences that may vary between strains. By adjusting the different parameter values for the latter class of parameters, the model can describe different physically feasible behaviors of the cell population. In order to test the model, the free parameters are adapted to a data set describing a steady state chemostat at different levels of aerobiosis. The data set includes metabolite concentrations, gene expression data and uptake and excretion fluxes. The model reproduces the steady state values of most measured variables and predicts the values of many others at several values of aerobiosis. This means that the model is able to describe the steady state response of *E. coli* to oxygen.

Despite the complexity and size of the considered system, the use of simplifying assumptions keeps the model tractable. For most reactions a constant thermokinetic resistance is assumed. The resistance is allowed to vary only by integer powers of 10 to restrict the search space. Resistances of rapid reactions are assumed to be zero, allowing for a reduction of model size and stiffness. Gene expression and gene regulation are described with phenomenological equations. Parameters describing the influence of transcription factors on gene expression are allowed to take only integer values. Growth and maintenance are described by reactions with phenomenological kinetics. The rationale behind these simplifications is that they facilitate modeling and parameter adaptation while still preserving the basic physical and regulatory constraints on the cellular behavior.

The following section presents the comparison of model results and experimental data for several aerobiosis values. The subsequent section demonstrates the use of the model by providing predictions of the effects of production pathways on the central metabolism.

#### **3.1. BEHAVIOR ACROSS THE AEROBIOSIS SCALE**

The model is compared to measurement data of transcripts, metabolites and uptake/excretion fluxes for several values of aerobiosis. Under identical (Rolfe et al., 2011; Trotter et al., 2011) and similar experimental conditions (Valgepea et al., 2010), it was shown that transcript levels correlate well with protein levels. Thus, here we compare modeled enzyme levels with measured transcript levels. **Figures 1**–**4** show important parts of the model's results. **Supplementary Data Sheet 4** shows a more complete overview of the simulation results.

Most measurement data at the flux, metabolite, transcript, and gene regulatory level are reproduced in an integrated, thermodynamically consistent way by our model. The relative tendencies of metabolite concentrations are described well for glycolysis (**Supplementary Data Sheet 4**) and fermentation pathways (**Figure 1**). For low oxygen availability the biomass yield is low and glucose uptake is high to allow for a growth rate equal to the dilution rate. Consequently, the metabolite concentrations in the first steps of glycolysis (glucose 6-phosphate g6p, fructose 6-phosphate f6p, and fructose 1,6-bisphosphate fdp) before the enzymatic reaction catalyzed by glyceraldehyde-3-phosphate dehydrogenase GAPD decrease with oxygen availability. GAPD catalyzes a rapid reaction with NADH as a product. Because the concentration of NADH decreases strongly at high aerobiosis values, the thermodynamic pull of NADH reverses the pattern for 3-phospho-glycerophosphate 13dpg, glycerate 2-phosphate 2pg, 3-phospho-glycerate 3pg, and phosphoenolpyruvate pep. The pattern is again inverted after the essentially irreversible pyruvate kinase PYK such that pyruvate pyr and the metabolites of the fermentation pathways are high for low oxygen availability and low for high oxygen availability. This is consistent with the observed inverse correlation of fermentation product excretion and aerobiosis. As a test to check if the model could predict trends of metabolite concentrations, we computed the steady state solutions for different dilution rates that in a steady-state chemostat are equal to the growth rates. The results are qualitatively consistent with the experimental observation from Schaub and Reuss (2008) in that the concentrations of fructose 1,6-bisphosphate, dihydroxyacetone phosphate, and glyceraldehyde 3-phosphate increase with dilution rate, whereas the concentrations of phosphoenolpyruvate, glycerate 2-phosphate and 3-phospho-glycerate decrease with dilution rate.
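The link between low biomass yield and high glucose uptake at a fixed dilution rate can be illustrated with a short calculation. The sketch below is not part of the published model; it only uses the steady-state chemostat identity that the growth rate equals the dilution rate, so that the specific uptake is the dilution rate divided by the observed biomass yield. The function name and all numerical values are hypothetical and chosen for illustration only.

```python
def specific_substrate_uptake(dilution_rate, biomass_yield):
    """Specific glucose uptake (mmol/g/h) needed to sustain growth at the dilution rate.

    dilution_rate -- D in 1/h; equals the growth rate in a steady-state chemostat
    biomass_yield -- observed yield Y in g biomass per mmol glucose
    """
    if biomass_yield <= 0:
        raise ValueError("biomass yield must be positive")
    return dilution_rate / biomass_yield


# Hypothetical yields, for illustration only (not taken from the data set).
D = 0.2  # 1/h
uptake_low_oxygen = specific_substrate_uptake(D, biomass_yield=0.02)
uptake_high_oxygen = specific_substrate_uptake(D, biomass_yield=0.08)
```

With these placeholder numbers the low-yield case requires the higher specific uptake, which is the qualitative pattern described for low oxygen availability.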

The concentrations of the different quinone species in the electron transport chains are well matched, in particular the nonmonotone behavior of the ubiquinone redox state (**Figure 3**). The NADH redox state follows the expected pattern from high reduction potential at low aerobiosis values to low reduction potential at high aerobiosis values. The same holds for the menaquinone redox state, which however could not be measured experimentally. The level of oxidized ubiquinone q8 correlates with the concentration of its oxidant, oxygen, with low but almost constant levels in the microaerobic range and a high level in the fully aerobic state. The strongly differentially regulated *de novo* synthesis of ubiquinones leads to a strong increase of the total concentration (oxidized plus reduced) of ubiquinones. In the microaerobic range this leads to a seemingly paradoxical increase of the concentration of reduced ubiquinols such that maximal reduction is reached for intermediate aerobiosis levels and not for anaerobic growth.

Deviations occur for metabolites in the citric acid cycle (**Figure 2**) and the pentose phosphate pathway (**Supplementary Data Sheet 4**). Since the dynamic ranges of these metabolites are small in both measurement and simulation, and since transcript levels and uptake and production fluxes are described well by the model, these deviations are considered to be minor. Driven by the redox state of NADH, the citric acid cycle shows the switch from its branched form at low aerobiosis values to the cyclic form at high aerobiosis values. The branched mode is characterized by reaction fluxes directed from oxaloacetate (oaa) and citrate (cit) to succinate (succ) and 2-oxoglutarate (akg), whereas in the cyclic mode the 2-oxoglutarate dehydrogenase AKGDH couples both branches and flux occurs in the direction from succinate (succ) to oxaloacetate (oaa). Succinate dehydrogenase SUCDH and fumarate reductase FRD participate in the cyclic and branched modes, respectively. This selectivity is caused by differential gene expression and the different use of quinone species (menaquinone for FRD and ubiquinone for SUCDH). Thus, the model reflects the expected behavior of the citric acid cycle for different oxygen availabilities. The model comprises the anaplerotic reaction phosphoenolpyruvate carboxylase PPC and the reverse phosphoenolpyruvate carboxykinase PPCK as well as the anaplerotic glyoxylate shunt realized by isocitrate lyase ICL and malate synthase MALS. The model is able to fit the data for different distributions of anaplerosis over both possible pathways (PPC/PPCK and ICL/MALS), and the simulation results present only one possibility.

The activities of the two measured transcription factors in the simulation and measurement are in agreement (**Figure 4**). FNR and ArcA activities follow the expected pattern of high activity at low aerobiosis values and low activity at high aerobiosis values (Sawers, 1999). The model provides predictions for the behavior of several other transcription factors that could not be measured. An indirect validation of these predictions is provided by the good fits of the transcript data. Some glycolytic enzymes are repressed by FruR and the slight increase of FruR activity across the aerobiosis scale is likely. The increase of CRP can be seen for example in the increased expression of the *mgl* operon that is activated by CRP. The *mgl* operon encodes a methyl-β-D-galactoside and galactose ABC transporter. Since this transporter is also able to transport glucose and is active under glucose-limited conditions (Death and Ferenci, 1993; Ferenci, 1996; Hua et al., 2004; Steinsiek and Bettenbrock, 2012) it is modeled as a glucose transporter (GLCabc). The simulated CRP activity is also consistent with the inferred CRP activity from Rolfe et al. (2012). Since FruR activity is controlled by the concentration of fructose 1,6-bisphosphate (fdp) and CRP activity is controlled by the ratio of phosphoenolpyruvate (pep) to pyruvate (pyr), these predictions follow directly from the distribution of concentrations in glycolysis. The observation that PdhR is more active under aerobic conditions than under anaerobic conditions (**Supplementary Data Sheet 4**) is consistent with the experimental observation of Ogasawara et al. (2007). AppY is known to be a regulator active under anaerobic conditions (Brøndsted and Atlung, 1996; Atlung et al., 1997) as it is in the model (**Supplementary Data Sheet 4**). The transcript levels of both target genes of IclR (MALS and ICL in **Figure 2**) match the predicted course of IclR activity (**Supplementary Data Sheet 4**).

menaquinones and NADH and the ATP energy charge, respectively. The second row shows the total concentration (oxidized plus reduced form) of ubiquinone and menaquinone and the substrate-biomass yield *Y*. The labels dehydrogenase (ubiquinone-8) *nuo*, NADHImq NADH dehydrogenase (menaquinone-8) *nuo*, NADHII NADH dehydrogenase (ubiquinone-8) *ndh*, ATPS ATP synthase.

**FIGURE 4 | Steady-state simulation results of some transcription factor activities in comparison to measurement data.** The abscissa is aerobiosis in percent. The ordinate is in arbitrary units. FNR data from Rolfe et al. (2011). ArcA data was determined as described in Rolfe et al. (2012).

#### **3.2. ASSESSMENT OF PRODUCTION CAPABILITIES**

We use the above described model to assess the production capabilities of *E. coli* in a chemostat. For this purpose, we introduce a reaction into the model that represents the effect of a production pathway. For example, if a pathway uses the precursors A and B and 2 ATP to produce a product (possibly via some intermediates), we introduce the reaction A + B + 2·ATP → 2·ADP. For every production pathway, we show steady state values of several quantities depending on the enforced biomass-specific production flux *J*prod given in mmol/g/h. These quantities are the biomass concentration, concentrations of precursors of the production pathway and the overall productivity per reactor volume *q*prod in mmol/l/h. The productivity is given by *q*prod = *J*prod · *c*<sup>X</sup> where *c*<sup>X</sup> is the biomass concentration in g/l. In this way, we can assess the production capabilities independently of the kinetics of the production pathway.
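The relation *q*prod = *J*prod · *c*X can be sketched in a few lines. The example below is only an illustration of the bookkeeping: the function name is hypothetical, and the scan values for the biomass concentration are made-up placeholders rather than model output. It shows why the volume-specific productivity can peak at an intermediate biomass-specific flux when biomass falls as the enforced flux rises.

```python
def volumetric_productivity(j_prod, c_x):
    """Return q_prod (mmol/l/h) from J_prod (mmol/g/h) and biomass c_X (g/l)."""
    return j_prod * c_x


# Hypothetical steady-state scan: biomass declines as the enforced flux increases.
j_values = [0.0, 1.0, 2.0, 3.0, 4.0]    # enforced J_prod, mmol/g/h (placeholder values)
c_x_values = [2.0, 1.5, 1.0, 0.5, 0.0]  # steady-state biomass, g/l (placeholder values)

q_values = [volumetric_productivity(j, c) for j, c in zip(j_values, c_x_values)]

# Index of the scan point with the highest volume-specific productivity.
best_index = max(range(len(q_values)), key=q_values.__getitem__)
best_flux = j_values[best_index]
```

In this placeholder scan the optimum lies in the interior of the flux range, because the product of a rising flux and a falling biomass concentration is zero at both ends.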

**Figure 5** shows results for hypothetical production pathways where each pathway requires only a single precursor molecule. These results demonstrate the abilities and limitations of the central metabolism to provide biosynthesis with precursors. The aerobic and the anaerobic case are shown. For the aerobic case aeration is fixed to a level that results in 160% aerobiosis of the undisturbed case (*J*prod = 0). Plots of all twelve precursor molecules are shown in the **Supplementary Data Sheet 5**.

The concentration of glucose 6-phosphate (g6p) increases with production flux, i.e., the consumption rate of glucose 6-phosphate by the production pathway, under aerobic and anaerobic conditions (**Figure 5**). This seemingly paradoxical behavior is explained by the autocatalytic structure of glycolysis. Glucose is phosphorylated to glucose 6-phosphate using ATP (or phosphoenolpyruvate). ATP is regained by metabolism of glucose 6-phosphate. An additional consumption of glucose 6-phosphate leads to an increase of the glucose uptake flux. An increased glucose uptake requires more ATP for phosphorylation of glucose. Since a fixed flux of ATP is required to form biomass in the steady state chemostat, the flux away from glucose 6-phosphate increases and provides a higher ATP production flux. Because the ATP concentration is kept nearly constant and since glycolytic enzymes are nearly constitutively expressed, an increase of the glucose 6-phosphate concentration is needed to drive this flux in order to achieve a steady state. Depending on the kinetics of the production pathway, this behavior may cause dynamic instabilities such as oscillations.

Aerobically, the concentration of acetyl-CoA (accoa) and the biomass concentration drop with an increasing consumption of acetyl-CoA (**Figure 5** upper row). Anaerobically, the concentration of acetyl-CoA increases initially and then decreases slightly (**Figure 5** lower row). However, the biomass concentration decreases rapidly. In the model, up to 6 mmol/l/h acetyl-CoA can be produced aerobically and up to 5 mmol/l/h anaerobically. Due to the much smaller biomass concentration in the anaerobic case, this corresponds to a much higher biomass-specific flux in the anaerobic case.

Aerobically and anaerobically, the concentration of ATP (atp) decreases only slightly with a consumption of ATP (**Figure 5**). This is in line with the observation that the energy charge is kept constant over a wide range of conditions (Chapman et al., 1971). In the model this behavior emerges from the activation of the phosphofructokinase by ADP and from the high sensitivity of growth and maintenance with respect to the ATP/ADP ratio. Aerobically, we obtain the maximal volume-specific ATP production flux for intermediate values of the biomass-specific ATP production flux, whereas for the anaerobic case volume-specific and biomass-specific flux are related monotonically.

Aerobically, the oxidation of NADH (nadh) from the cell leads to a decrease of the (already low) NADH concentration and the biomass (**Figure 5** upper row). But anaerobically, the oxidation of NADH has initially a positive effect on the population size (**Figure 5** lower row), because it removes the requirement for the excretion of carbon in the form of ethanol to maintain the electron balance.

The above examples show that the model provides detailed predictions about the complex effects of production pathways on central metabolism. The synthesis of any biofuel requires a combination of precursor molecules. **Figure 6** shows four anaerobic example cases. (1) *E. coli* naturally produces ethanol via the aldehyde-alcohol dehydrogenase AdhE with stoichiometry AcCoA + 2 NADH → CoA + 2 NAD + ethanol. If the flux is increased (for example by overexpression of AdhE), the AcCoA and NADH concentrations drop rapidly. (2) Ohta et al. (1991) expressed the ethanol production pathway from *Zymomonas mobilis* in *E. coli*. The pathway uses the pyruvate decarboxylase Pdc and the alcohol dehydrogenase II AdhB with the stoichiometry pyruvate + ubiquinol → ubiquinone + ethanol. When enforcing an increasing flux via this pathway, the concentrations of the precursors ubiquinol and pyruvate drop initially only slightly. Only immediately before reaching the maximum productivity do the concentrations decrease. This demonstrates the superior properties of this pathway over the native pathway. **Figure 6** shows that pyruvate supply becomes limiting. (3) Inui et al. (2008) used a pathway from *Clostridium acetobutylicum* for the production of butanol in *E. coli*. Its total stoichiometry is 2 AcCoA + 4 NADH → 2 CoA + 4 NAD + butanol. The concentrations of the precursors AcCoA and NADH decrease strongly with the production flux. The biomass concentration initially rises because the removal of reducing power is advantageous under anaerobic conditions. (4) The last example is the production of isoprene via the methylerythritol phosphate (MEP) pathway (Zhao et al., 2011). The overall stoichiometry is glyceraldehyde 3-phosphate + pyruvate + 2 NADPH + NADH + 1 ATP → isoprene + 2 NADP + NAD + AMP + ADP + diphosphate + CO2. When enforcing an increasing flux over this pathway, the biomass concentration decreases strongly until it approaches zero. The ATP concentration stays almost constant, the concentrations of NADH and NADPH (not shown) drop slightly and the concentrations of pyruvate and glyceraldehyde 3-phosphate (not shown) increase. Since biomass is clearly the limiting resource in this example, supporting biomass production by providing a complex medium may increase productivity.

**FIGURE 5 | Steady-state response of the model to enforced precursor consumption rates.** The upper and lower rows show the results for the aerobic and anaerobic case, respectively. The abscissas show the biomass-specific production flux in mmol/g/h. The black solid lines show the biomass. The dashed lines show the intracellular concentration of the respective metabolite. The curves are scaled to the undisturbed case and start at unity. The blue lines show the productivity in mmol/l/h. Labels mean accoa acetyl-CoA, g6p glucose 6-phosphate, atp → adp + pi dephosphorylation of ATP, nadh → nad oxidation of NADH.
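The four anaerobic example pathways can be written down as net stoichiometries, which makes the precursor demand at a given enforced flux explicit. The sketch below is an illustration only and is not the model definition file: the coefficients are copied from the text as quoted, while the dictionary keys, the metabolite identifiers (for example `q8h2` for ubiquinol and `ppi` for diphosphate) and the function name are hypothetical.

```python
# Net stoichiometries as quoted in the text (negative = consumed, positive = produced).
# Identifiers are illustrative assumptions, not the names used in the published model.
PATHWAYS = {
    "ethanol_AdhE": {"accoa": -1, "nadh": -2, "coa": 1, "nad": 2, "ethanol": 1},
    "ethanol_Pdc_AdhB": {"pyr": -1, "q8h2": -1, "q8": 1, "ethanol": 1},
    "butanol": {"accoa": -2, "nadh": -4, "coa": 2, "nad": 4, "butanol": 1},
    "isoprene_MEP": {
        "g3p": -1, "pyr": -1, "nadph": -2, "nadh": -1, "atp": -1,
        "isoprene": 1, "nadp": 2, "nad": 1, "amp": 1, "adp": 1, "ppi": 1, "co2": 1,
    },
}


def precursor_drain(pathway, j_prod):
    """Consumption flux of each precursor (mmol/g/h) at production flux j_prod."""
    stoichiometry = PATHWAYS[pathway]
    return {met: -coeff * j_prod for met, coeff in stoichiometry.items() if coeff < 0}


# Example: demand placed on central metabolism by the butanol pathway.
butanol_demand = precursor_drain("butanol", j_prod=0.5)
```

Listing the drains side by side shows, for instance, that butanol draws twice as much acetyl-CoA and NADH per unit of product as the native ethanol pathway. How the cell responds to that drain is what the full model predicts.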

The above examples demonstrate the use of the model for assessing the production capabilities of *E. coli*. We enforce the consumption rate of precursor molecules in stoichiometric relations corresponding to the requirements for the synthesis of biofuel. The analysis of the steady state response of the precursor concentrations, the biomass concentration and the volume-specific productivity gives a detailed picture of the capacity of the metabolism in chemostat conditions. In an ideal situation a high production flux is possible and biomass and precursor concentrations are stable over a wide range of production rates. In real examples, either the decrease in biomass concentration or the decrease in concentration of a precursor limits the productivity of the pathway. The model gives detailed predictions about the responses of metabolism to specific production pathways. For example, the model predicts at which biomass-specific production flux the highest volume-specific productivity occurs. The model predicts how precursor concentrations change with varying production rate. For pathways with several precursors, the model shows which precursor concentrations decrease strongly and become limiting. In addition, complex phenomena can be observed. For example, it is possible that, due to autocatalytic effects, the concentration of a precursor increases despite an enforced consumption rate of the precursor. Depending on the kinetics of the production pathway, this may lead to dynamic instabilities that could manifest as oscillations.

These results are predictions of the model. The approach of dissecting the parameters into an experiment- and organism-independent class and a class that is variable within bounds ensures that the resulting model is generic but can be adapted to specific experimental situations. The validity of model predictions decreases when one departs from the experimental conditions that were used to determine the values of the free parameters. Under changed experimental conditions, other, currently unmodeled, regulatory systems may play a relevant role. Changed experimental conditions include the introduction of production pathways, as discussed here, and the use of different substrates, dilution rates, or strains. However, since the model by construction cannot violate the basic mass balance and thermodynamic constraints, the predictions will share the features enforced by these constraints.

#### **4. CONCLUSION**

To our knowledge, this model presents the first comprehensive, integrated model of steady state metabolism and regulation of the oxygen response of *E. coli*. A model of metabolism describing metabolite concentrations and reaction fluxes is complemented with a model of genetic regulation describing gene expression and transcription factor activity. This model of cellular metabolism is embedded into a model of growth describing the requirement of precursor for biomass formation, a model of maintenance describing the non-growth-associated needs for ATP, and a model of the chemostat describing the concentrations of extracellular compounds and biomass. The model provides a thermodynamically consistent view that integrates experimental data at the metabolite, flux, transcript, and regulator levels.

The combined model reflects the major constraints that restrict the behavior of *E. coli*. The use of the thermokinetic modeling formalism and the availability of the Gibbs formation energies of the metabolites ensure that the metabolic model is thermodynamically consistent. This means that the energetic constraints of the cellular metabolism are properly addressed. The mass balances of all relevant metabolites of central metabolism are considered. In particular, the model also includes the balances of the biosynthetic precursors and of important energy and redox carriers, such as ATP/ADP/AMP, NADH/NAD, and NADPH/NADP. Thus, the model integrates major mass balance constraints with the respective energetic constraints and the respective cellular regulation. The model is able to describe the competition of different pathways for energy and carbon and provides predictions of the effect of the introduction of production pathways on central metabolism.

# **FUNDING**

We thank the ERASysbio SysMO (Systems Biology of Microorganisms) initiative for funding the SUMO and SUMO2 consortia. The research was funded by the Biotechnology and Biological Sciences Research Council (BBSRC), the Bundesministerium für Bildung und Forschung (BMBF) and the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO). ME further acknowledges support by the Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg within the Ideenwettbewerb Biotechnologie und Medizintechnik Baden-Württemberg.

# **ACKNOWLEDGMENTS**

We would like to thank all members of the SysMO SUMO consortium for fruitful discussions and collaborations especially M. Bekker, F. Bruggemann, B. Cseke, K. J. Hellingwerf, M. Holcombe, E. D. Gilles, A. Graham, S. Henkel, W. Jia, A. Maleki-Dizaji, T. Nye, G. Sanguinetti.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2014.00124/abstract

**Supplementary Data Sheet 1 | Index of Model Elements.**

**Supplementary Data Sheet 2 | Model Definition File.**

**Supplementary Data Sheet 3 | Transcription Factors and Their Metabolic Signals.**

**Supplementary Data Sheet 4 | Map of Measurement Data and Model Results.**

**Supplementary Data Sheet 5 | Steady State Answer of the Model to enforced precursor effluxes.**

#### **REFERENCES**


not anaerobic or aerobic conditions. *J. Bacteriol.* 185, 204–209. doi: 10.1128/JB.185.1.204-209.2003


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 15 January 2014; accepted: 11 March 2014; published online: 27 March 2014. Citation: Ederer M, Steinsiek S, Stagge S, Rolfe MD, Ter Beek A, Knies D, Teixeira de Mattos MJ, Sauter T, Green J, Poole RK, Bettenbrock K and Sawodny O (2014) A mathematical model of metabolism and regulation provides a systems-level view of how Escherichia coli responds to oxygen. Front. Microbiol. 5:124. doi: 10.3389/fmicb.2014.00124*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Ederer, Steinsiek, Stagge, Rolfe, Ter Beek, Knies, Teixeira de Mattos, Sauter, Green, Poole, Bettenbrock and Sawodny. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Death by a thousand cuts: the challenges and diverse landscape of lignocellulosic hydrolysate inhibitors

# *Jeff S. Piotrowski\*, Yaoping Zhang, Donna M. Bates, David H. Keating, Trey K. Sato, Irene M. Ong and Robert Landick*

*DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA*

#### *Edited by:*

*Shihui (Shane) Yang, National Renewable Energy Laboratory, USA*

#### *Reviewed by:*

*Shiyong Peng, National Institute of Health, USA Qiang J. Fei, National Renewable Energy Lab, USA*

#### *\*Correspondence:*

*Jeff S. Piotrowski, DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, 1552 University Ave., WEI4152, Madison, WI 53726, USA e-mail: jpiotrowski@wisc.edu*

Lignocellulosic hydrolysate (LCH) inhibitors are a large class of bioactive molecules that arise from pretreatment, hydrolysis, and fermentation of plant biomass. These diverse compounds reduce lignocellulosic biofuel yields by inhibiting cellular processes and diverting energy into cellular responses. LCH inhibitors present one of the most significant challenges to efficient biofuel production by microbes. Development of new strains that lessen the effects of LCH inhibitors is an economically favorable strategy relative to expensive detoxification methods that also can reduce sugar content in deconstructed biomass. Systems biology analyses and metabolic modeling combined with directed evolution and synthetic biology are successful strategies for biocatalyst development, and methods that leverage state-of-the-art tools are needed to overcome inhibitors more completely. This perspective considers the energetic costs of LCH inhibitors and technologies that can be used to overcome their drain on conversion efficiency. We suggest academic and commercial research groups could benefit by sharing data on LCH inhibitors and implementing "translational biofuel research."

**Keywords: cellulosic biofuels, lignocellulosic hydrolysate inhibitors, systems biology, chemical genomics, metabolic modeling, ethanologens**

# **INTRODUCTION**

Lignocellulosic biofuels offer the promise of sustainable, domestically produced fuels with favorable carbon balances. Fast-growing grasses like *Miscanthus* and agricultural residues provide fermentable sugars at lower energy and fertilizer costs than grains (Schmer et al., 2008), making them preferable feedstocks for advanced biofuels. Cellulosic ethanol is an obvious next-generation biofuel to implement given its production and delivery infrastructures are compatible with existing fuels.

Central to the success of cellulosic ethanol is efficient conversion of biomass-derived sugars to ethanol by microbes such as *Saccharomyces cerevisiae*, *Escherichia coli*, and *Zymomonas mobilis* (Alper and Stephanopoulos, 2009; Lau et al., 2010; Yang et al., 2010a). Under optimal conditions, these microbes are powerful ethanologens; however, lignocellulosic hydrolysates (LCH) and industrial scale fermentation tanks are not optimal conditions. Thermal, osmotic, and ethanol stresses are just some of the environmental factors that inhibit fermentation and reduce yield (Attfield, 1997; Gibson et al., 2007; Jin et al., 2013). Industrial microbes are pushed to the limits of stress tolerance to make biofuel production energetically favorable.

Although environmental stressors limit yields in present-day ethanol facilities, cellulosic biomass conversion comes with new challenges. Specifically, LCH inhibitors, a group of small, bioactive molecules, can significantly reduce conversion efficiency. LCH inhibitors such as aliphatic acids, furans, and phenolics are released or condensed from cellulose and hemicellulose during pretreatment and hydrolysis (Larsson et al., 1999, 2000; Yang et al., 2010a); however, chemical residues from newer hydrolysis strategies and synergies with biofuel end products (ethanol, isobutanol) are less well studied. Removal of these inhibitors can be expensive and may reduce titers of fermentable sugars; some estimates suggest that detoxification can remove up to 26% of total fermentable sugars (Larsson et al., 1999). Thus, a preferred strategy is to develop microbial strains with properties that minimize the effects of LCH inhibitors on biofuel yields.

With commercially available industrial strains that are robust to thermal and ethanol stress (e.g., Ethanol Red, Fermentis, Milwaukee, WI, USA), recent attention has been directed to overcoming the challenge of LCH inhibitors. These compounds are ubiquitous in hydrolysates, and their abundance and composition depend on pretreatment (Chundawat et al., 2010), feedstock (Klinke et al., 2004; Almeida et al., 2007), and seasonality (Bunnell et al., 2013; Greenhalf et al., 2013). Given their chemical diversity, these compounds can target many cellular processes. LCH inhibitors can also generate a substantial cellular energy drain. Cells have evolved to detoxify, excrete inhibitors, or repair the resultant cellular damage fast enough to reproduce. However, evolved coping mechanisms may also negatively affect the efficiency of conversion by competing for cellular resources (Bellissimi et al., 2009; Miller et al., 2009). Although it is in the microbe's best interest to use its resources to limit the effects of LCH inhibitors and maintain cellular viability, this may reduce biofuel production. In this perspective, we consider the diversity and cellular costs of LCH inhibitors from traditional and novel pretreatment and hydrolysis strategies, describe new technologies and their application to strain development, and finally identify key needs of the cellulosic biofuel community that will empower "translational biofuel research" to take discoveries quickly to industrial scale.

# **DIVERSITY OF FERMENTATION INHIBITORS**

Prior to microbial conversion of lignocellulosic sugars into biofuel, biomass must be deconstructed into monomeric sugars by enzymatic or chemical hydrolysis. This hydrolysis step is often preceded by a pretreatment step that expands the plant fibers and allows cellulolytic enzymes access to the polysaccharide matrices. The resulting hydrolysates are complex, ill-defined mixtures that include sugars and a diversity of bioactive molecules (**Table 1**). Small acids and phenolic compounds are released from cellulose and hemicelluloses during hydrolysis and furans arise from the dehydration of pentose and hexose monomers (Klinke et al., 2004). Pretreatments such as acid hydrolysis, steam explosion or NH3 expansion each impart their own "profile" of LCH inhibitors. For example, AFEX (Ammonia Fiber EXpansion) uses high-pressure/temperature ammonia to alter the cellulose matrix to allow hydrolysis by cellulases (Lau and Dale, 2009), and this produces amide versions of inhibitors (e.g., feruloyl amide from ferulic acid) with potentially new biological properties (Chundawat et al., 2010).

#### **Table 1 | Classes of lignocellulosic hydrolysate (LCH) inhibitors and their described modes of toxicity.**

Besides these common inhibitors, residual pretreatment chemicals may complicate fermentation. Ionic liquids are pretreatment and hydrolysis solvents, but are toxic to many microorganisms (Docherty and Kulpa, 2005; Ouellet et al., 2011). Alkaline hydrogen peroxide (AHP) pretreatment limits the production of furans; however, this method introduces significant amounts of Na+ from NaOH, which can cause osmotic stress (Sato et al., 2014). Copper(II) 2,2′-bipyridine is a catalyst that enhances AHP pretreatment by reducing the H2O2 requirement (Li et al., 2013), but copper is toxic to most microbes. Next-generation pretreatments and hydrolysis methods like γ-valerolactone (Luterbacher et al., 2014), surfactants (Sindhu et al., 2013), zirconium phosphate catalysts (Gliozzi et al., 2014), and other incipient hydrolysis technologies may imbue hydrolysates with novel toxicities and synergisms with common inhibitors.

Biofuel end products themselves are inhibitory. Ethanol can directly damage cellular membranes and DNA, as well as inhibit enzymes (Nagodawithana and Steinkraus, 1976; Dombek and Ingram, 1984; Alexandre et al., 1994; Ibeas and Jimenez, 1997; Huffer et al., 2011). The ethanologens *S. cerevisiae* and *Z. mobilis* are not immune to ethanol toxicity at high concentrations (Carmona-Gutierrez et al., 2012; Yang et al., 2013). Advanced biofuels like isobutanol are toxic at significantly lower concentrations than ethanol (Brynildsen and Liao, 2009; Atsumi et al., 2010; Huffer et al., 2011; Minty et al., 2011). Inhibition by end products has been an area of research interest (Baez et al., 2011; McEwen and Atsumi, 2012; Zingaro et al., 2013), and ethanol tolerance is a prerequisite for all industrial yeast.

At a minimum, the effects of inhibitors can be additive, but an even greater concern is the potential for synergy between LCH inhibitors and fermentation conditions, including high osmolarity and absence of O2. Some studies have described synergies between acetic acid, furfural, and phenolics in yeast (Oliva et al., 2006; Ding et al., 2011), but a comprehensive evaluation of synergisms between compounds and conditions on both growth rate and fermentation will be essential. Such assessment will be a massive undertaking that will also require defined synthetic hydrolysate media to permit meaningful definition of minimum inhibitory concentrations (MICs) of the individual inhibitors, and of how these values change in combination with other LCH inhibitors (synergy/antagonism) and fermentation conditions. Nevertheless, documenting interactions between inhibitors on sugar conversion is crucial to prioritizing future research for improved biofuel microbes.
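One common way to turn paired MIC measurements into a synergy or antagonism call is the fractional inhibitory concentration (FIC) index used in antimicrobial checkerboard assays. The sketch below is offered as an assumption about how such an evaluation could be scored. It is not a method described in this perspective, and the cut-offs (at most 0.5 for synergy, above 4 for antagonism) are the conventional ones from that literature. Function names and MIC values are hypothetical.

```python
def fic_index(mic_a_alone, mic_b_alone, mic_a_combo, mic_b_combo):
    """Fractional inhibitory concentration index for inhibitors A and B.

    Each term is the MIC of one inhibitor in the combination divided by
    its MIC when tested alone (same concentration units in both).
    """
    return mic_a_combo / mic_a_alone + mic_b_combo / mic_b_alone


def classify_interaction(fici):
    """Apply the conventional checkerboard cut-offs (an assumed scoring rule)."""
    if fici <= 0.5:
        return "synergy"
    if fici > 4.0:
        return "antagonism"
    return "no interaction"


# Hypothetical MICs (same arbitrary units), for illustration only.
fici = fic_index(mic_a_alone=8.0, mic_b_alone=4.0, mic_a_combo=2.0, mic_b_combo=1.0)
call = classify_interaction(fici)
```

A scoring rule of this kind only becomes meaningful once MICs are measured in a defined synthetic hydrolysate, as argued above, because the index depends on stable single-compound reference values.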

### **SMALL MOLECULE INHIBITORS DEPLETE CELLULAR RESOURCES**

LCH inhibitors directly affect biofuel yield as well as the production rate, which can extend fermentation time and result in higher operating costs. In simplest terms, these inhibitors affect conversion efficiency by depleting cellular energy resources (e.g., ATP, NADH, NADPH; **Figure 1**). Some inhibitors can act broadly and damage key enzymes of fermentation pathways (Modig et al., 2002). The coping mechanisms available to the biofuel microbes fall into four main categories: (i) detoxification, (ii) efflux, (iii) repair, and (iv) tolerance. The first three are of most concern given their effects on cellular energy and resources.

Detoxification and efflux are the best-characterized mechanisms of inhibitor tolerance in microbes, with a deep literature drawn not only from biofuel research but also from medical research on antibiotics and pharmaceuticals. Detoxification is the major route of tolerance for aldehydes in both bacteria and yeast. Reduction of compounds like furfural and vanillin to less toxic alcohols by NADH/NADPH-dependent reductases occurs in ethanol fermentation using *S. cerevisiae* and *E. coli* (Jarboe, 2011). Oxidoreductase expression significantly increases in yeast and *E. coli* in the presence of aldehydes (Liu et al., 2008; Wang et al., 2011), depleting cellular NADH/NADPH. This results in inhibition of NADPH-dependent processes (e.g., assimilation of sulfur), leading ultimately to slower conversion of sugars (Miller et al., 2009). Interestingly, yields can be increased by disabling the detoxification pathway, suggesting that tolerance may be more energetically efficient than detoxification (Wang et al., 2013). Alternatively, changing the source of reducing equivalents for aldehyde detoxification from NADPH to NADH can also improve biofuel yield (Wang et al., 2013).

Efflux is mediated by ATP-dependent trans-membrane pumps that selectively or non-selectively pump out toxic compounds, usually at the cost of 1 ATP per molecule (Shapiro and Ling, 1998; Schmitt and Tampé, 2002). The yeast *S. cerevisiae* has 29 different ATP-binding cassette (ABC) efflux transporters (Decottignies and Goffeau, 1997) and 5% of the *E. coli* genome is composed of genes with ABC-transporter domains, many involved in efflux (Linton and Higgins, 1998). In both yeast and *E. coli*, expression of transporters increases in response to LCH inhibitors (Schüller et al., 2004; Lee et al., 2012; Schwalbach et al., 2012). The yeast weak acid response is mediated by increased expression of ABC-transporters (e.g., Pdr1p and Pdr5p) (Schüller et al., 2004; Pereira Rangel et al., 2010). As long as cells are exposed to LCH inhibitors, a significant fraction of cellular ATP will be diverted to efflux pumps. Of particular importance for overall efficiency of LCH conversion, ATP depletion may have a disproportionate effect on xylose conversion compared to glucose conversion because xylose produces less cellular energy per molecule transported (Matsushika et al., 2013). Bellissimi et al. (2009) found that acetic acid could specifically inhibit xylose fermentation, but that this effect could be reversed with glucose addition. The authors posit that ATP generated from xylose fermentation cannot match ATP depletion from the weak acid response and efflux/proton pumps used to maintain cellular pH. Inhibitors that require ATP-dependent coping mechanisms can therefore directly reduce xylose conversion. Some inhibitors can be particularly draining, affecting both NADPH and ATP pools. Ask et al. (2013) found that furfural and HMF not only reduced cellular NADPH in *S. cerevisiae*, but also elicited increased expression of the ATP-dependent efflux pumps Pdr5p and Yor1p. This suggests that coping with furans requires both NADPH-dependent detoxification and ATP-dependent efflux.

An unanswered question is whether the ATP-dependent action of efflux pumps has a net positive (by inhibitor removal) or net negative (by energy consumption) effect on biofuel production. Although the answer may vary by inhibitor, because some inhibitors like aldehydes are more damaging to cells than others, a general test of the positive or negative consequences of efflux pumps for biofuel yield will help advance strategies for biofuel microbe design.
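The energy side of this trade-off can be framed as a back-of-the-envelope ATP budget. The sketch below uses the cost of 1 ATP per exported molecule quoted above; everything else is an assumption made for illustration. The ATP yields (2 per glucose and 5/3 per xylose) are idealized theoretical fermentation yields, and the uptake and efflux rates are hypothetical numbers, not measurements.

```python
def net_atp_rate(sugar_uptake, atp_per_sugar, efflux_rate, atp_per_export=1.0):
    """Net ATP supply left after paying for inhibitor efflux (same units as the rates)."""
    return sugar_uptake * atp_per_sugar - efflux_rate * atp_per_export


# Hypothetical rates in mmol per g dry weight per hour.
glucose_budget = net_atp_rate(sugar_uptake=10.0, atp_per_sugar=2.0, efflux_rate=8.0)
xylose_budget = net_atp_rate(sugar_uptake=4.0, atp_per_sugar=5.0 / 3.0, efflux_rate=8.0)

print(glucose_budget, xylose_budget)
```

With these assumed numbers the same efflux load leaves a surplus on glucose but a deficit on xylose, which is the qualitative pattern the Bellissimi et al. (2009) interpretation describes. The budget deliberately ignores the benefit of inhibitor removal, so it cannot by itself answer the net-effect question.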

The cellular energy costs of maintenance and repair are more difficult to quantify but could account for significant energy loss. If cells can repair the damage caused by fermentation inhibitors quickly, then fermentation may proceed. Inhibitors can acidify cells (Verduyn et al., 1992) and damage cellular membranes (Russell, 1992), DNA (Allen et al., 2010), and individual proteins (Modig et al., 2002). Repairing structures requires biogenesis; this comes at the expense of ATP, NADPH, carbon, and nitrogen. Maintaining pH is mediated by the ATP-dependent proton pump Pma1p, and the ATP cost under acidic conditions is the primary cause of reduced cellular growth (Verduyn et al., 1992; Ullah et al., 2012). Biogenesis requires NADPH-dependent assimilation of nutrients like sulfur, and this NADPH is also drained by repair enzymes. Phenolics and furans can damage membranes, requiring more energy to maintain the proton gradient required for basic metabolism (Ding et al., 2012; Schwalbach et al., 2012; Stratford et al., 2013). Growth and sugar conversion will be slowed as resources are diverted to maintenance and repair. Given that hydrolysates contain a mixture of inhibitors with diverse modes of action requiring all of these coping mechanisms simultaneously, the energy drain from fermentation inhibitors is truly death by a thousand cuts.

#### **ADAPTING TO THE CHANGING LANDSCAPE OF LCH INHIBITORS**

Strain development is an economical route to deal with LCH inhibitors. Resistance to the suite of inhibitors requires a complex response of many cellular systems, and as such is not easily conferred by engineering of individual genes. Moreover, inhibitor pools can vary between hydrolysate preparations; thus even the most robust strains in ammonia-pretreated hydrolysate may wither in dilute-acid-pretreated hydrolysates. It is unlikely that one strain will be optimal for all conditions. The reality is that microbial strains will need to be tailored to specific hydrolysates through engineering and directed evolution. Accelerating this process is crucial to making new cellulosic technologies industrially viable.

The tools of systems biology can give a detailed view of the microbial stress response (Jozefczuk et al., 2010; Lee et al., 2011). Transcriptomic, proteomic, and metabolomic responses to inhibitors can be tracked in detail, and this "multiomic" approach can give high-resolution insight into the global cellular consequences (Miller et al., 2009; Yang et al., 2010b; Schwalbach et al., 2012; Skerker et al., 2013; Yang et al., 2013). Advanced techniques such as ribosome profiling can give a view into relationships between transcription and protein abundance in the presence of LCH inhibitors (Ingolia et al., 2009; Brar et al., 2012). Metabolic and flux-balance models are valuable in determining energy balances within cells (Varma and Palsson, 1994; Fiaux et al., 2003; Jin and Jeffries, 2004; Herrgård et al., 2008; Taymaz-Nikerel et al., 2010). These models, combined with a systems biology view of protein and gene expression, can be used to identify key energetic bottlenecks as targets for engineering. Recently, Wei et al. (2013) demonstrated an elegant way to overcome the redox cofactor imbalance in yeast designed to ferment xylose by engineering acetate metabolism from *E. coli* into *S. cerevisiae*. The authors combined an acetate utilization pathway that consumes NADH with a xylose utilization pathway that produces NADH to overcome the redox imbalance of engineered xylose fermentation. The resultant strain has both better xylose conversion and the ability to detoxify acetate (Wei et al., 2013). Detailed accounting of ATP and NAD(P)H in the presence of LCH inhibitors and industrial conditions will be necessary to disentangle and understand points for rational engineering of microbial catalysts.

Model biofuel microbes like *S. cerevisiae* and *E. coli* benefit from well-developed suites of functional genomics resources, such as deletion mutant or overexpression collections (Giaever et al., 2002; Baba et al., 2006; Kitagawa et al., 2006). These tools have revealed effects of some inhibitors such as furfural (Gorsich et al., 2006), vanillin (Endo et al., 2008; Iwaki et al., 2013), and acetic acid (Mira et al., 2010). Genome-wide mutant collections also allow powerful studies of inhibitors via "chemical genomics" (Giaever et al., 2004; Parsons et al., 2006; Ho et al., 2011). This new tool in the multiomic arsenal, when combined with the information in genetic interaction networks (Butland et al., 2008; Costanzo et al., 2010), can allow precise predictions of the cellular targets of fermentation inhibitors. Recently, Skerker et al. (2013) used a chemical genomics approach to discover a previously undescribed inhibitor in acid pretreated hydrolysate, methyl glyoxal (MG), and identified mutations that confer MG resistance. Chemical genomics can be used to identify the chemical biological signatures within hydrolysates and mutations conferring resistance, but more broadly can serve as a "biological fingerprinting" technique for hydrolysate to identify variation in production, and as a method to benchmark the biological properties of novel hydrolysates. Resources such as the MoBY-ORF collections (Ho et al., 2009; Magtanong et al., 2011), which are barcoded plasmid collections carrying nearly all *S. cerevisiae* genes and are used to assess the effects of increased gene dose and gene complementation, could be used with industrial, wild, and engineered yeast to identify genetic interactions within diverse yeast strains. Combined with traditional selection for resistance and directed evolution, systems biology tools offer great potential for strain development.
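At its core, a pooled chemical genomics screen compares the relative abundance of each barcoded mutant with and without the inhibitor. The sketch below shows one simple way such a score could be computed; it is not the pipeline of any study cited here. The mutant labels, barcode counts, and pseudocount are hypothetical.

```python
import math


def fitness_scores(control_counts, treated_counts, pseudocount=1.0):
    """Log2 ratio of each mutant's relative barcode abundance, treated vs. control."""
    control_total = sum(control_counts.values())
    treated_total = sum(treated_counts.values())
    scores = {}
    for mutant, control in control_counts.items():
        treated = treated_counts.get(mutant, 0)
        # The pseudocount avoids taking the log of zero for mutants that drop out.
        control_freq = (control + pseudocount) / control_total
        treated_freq = (treated + pseudocount) / treated_total
        scores[mutant] = math.log2(treated_freq / control_freq)
    return scores


# Hypothetical barcode read counts for three mutants.
control = {"mutant_a": 999, "mutant_b": 999, "mutant_c": 999}
treated = {"mutant_a": 999, "mutant_b": 249, "mutant_c": 1999}

scores = fitness_scores(control, treated)
for mutant, score in sorted(scores.items(), key=lambda item: item[1]):
    print(mutant, round(score, 2))
```

Strongly negative scores flag mutants that are sensitive to the treatment and positive scores flag resistant ones. The resulting vector of scores across the collection is the kind of profile that could act as a biological fingerprint of a hydrolysate.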

Further, new tools are now available that can accelerate strain development. Genome editing and optimization techniques such as CRISPR/Cas9 and MAGE allow rapid, detailed genome editing in both bacteria and eukaryotes (Wang et al., 2012; Cong et al., 2013; DiCarlo et al., 2013; Gilbert et al., 2013). Next steps in strain development will be tuning genomes to inhibitor tolerance by parallel disruption or activation of inhibitor-responsive genes using CRISPR/Cas9-based systems. Additionally, large-scale gene synthesis can be used to identify genes that not only aid xylose utilization, but also confer inhibitor tolerance. Systems biology tools for biocatalyst development are coalescing into a pipeline that can keep pace with the changing landscape of fermentation inhibitors, allowing for the rapid tailoring of ethanologens with robust and efficient sugar-to-biofuel conversion from any new hydrolysate.

# **NECESSARY TOOLS TO MEET A COMMON GOAL**

Much like the interface between commercial and academic drug discovery communities, applying next-generation biofuel technologies will require cooperative translational research. Academic biofuel communities are developing advanced systems biology techniques whereas the commercial community excels at scale-up commercialization. However, similar to drug discovery, these two groups are often isolated in their research. In both cases, public-private partnerships such as the NIH translational medicine initiatives (Zerhouni, 2003) and NCERC industrial partnerships (http://www.siue.edu/ethanolresearch/), as well as shared computational resources like KBase (http://www.kbase.us/) can help bridge the divide.

How can researchers and funding agencies best enhance collaboration? A key advance would be greater data sharing about LCH inhibitors. Diverse hydrolysates and their respective inhibitors are major variables in the field of cellulosic biofuel production, but detailed information on specific hydrolysate compositions is not broadly available. Comprehensive efforts to identify all major LCH components and inhibitors across feedstocks and hydrolysis treatments are needed and will require a central repository of LCH data that includes data on feedstocks, pretreatment, hydrolysis, nutrients, and inhibitors. Chemical genomic profiling could be used to generate "biological fingerprints" of hydrolysates to allow comparisons among hydrolysates and identify the biological effects of LCH components by standard analytical methods. A central resource would allow researchers to compare the composition and biological fingerprints of new hydrolysates with existing knowledge about tolerant microbes for further strain development. The DOE's Systems Biology Portal, KBase, offers the best outlet for community resources, and could serve as the authoritative resource on the response of biocatalysts to LCH inhibitors, with open-source, community-developed analytical tools for chemical genomics datasets.
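A repository of this kind would need some way to compare fingerprints. As a minimal sketch, and assuming that each fingerprint is stored as a mapping from mutant to fitness score, two hydrolysates could be compared by the Pearson correlation of their shared mutants. The function names and the profiles below are hypothetical, and a real resource would likely use more robust statistics.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def compare_fingerprints(profile_a, profile_b):
    """Correlate two fitness profiles over the mutants present in both."""
    shared = sorted(set(profile_a) & set(profile_b))
    return pearson([profile_a[m] for m in shared], [profile_b[m] for m in shared])


# Hypothetical fitness profiles for three hydrolysate batches.
batch_1 = {"mutant_a": -2.0, "mutant_b": 0.0, "mutant_c": 1.0}
batch_2 = {"mutant_a": -4.0, "mutant_b": 0.0, "mutant_c": 2.0, "mutant_d": 0.5}
batch_3 = {"mutant_a": 2.0, "mutant_b": 0.0, "mutant_c": -1.0}

print(compare_fingerprints(batch_1, batch_2))
print(compare_fingerprints(batch_1, batch_3))
```

A correlation near 1 would suggest that two batches stress cells in a similar way, while a low or negative value would flag a batch whose biological properties differ.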

# **CONCLUSIONS**

LCH inhibitors are major barriers for cellulosic biofuels. The cellular energy costs of coping with these compounds are a significant drain on the already thin margins of biofuel production. However, the increasingly powerful tools of systems biology can be used to gain a detailed understanding of the cellular consequences of individual fermentation inhibitors and their mixtures, which will serve as a basis for rational engineering of customizable microbes. The biofuel research community would benefit from shared computational and database resources that can improve communication between the academic and commercial sides of biofuels.

#### **ACKNOWLEDGMENTS**

All authors are funded by the DOE Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-FC02-07ER64494).

#### **REFERENCES**


phosphate pathway genes ZWF1, GND1, RPE1, and TKL1 in *Saccharomyces cerevisiae*. *Appl. Microbiol. Biotechnol.* 71, 339–349. doi: 10.1007/s00253-005-0142-3


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 16 November 2013; accepted: 18 February 2014; published online: 14 March 2014.*

*Citation: Piotrowski JS, Zhang Y, Bates DM, Keating DH, Sato TK, Ong IM and Landick R (2014) Death by a thousand cuts: the challenges and diverse landscape of lignocellulosic hydrolysate inhibitors. Front. Microbiol. 5:90. doi: 10.3389/fmicb.2014.00090*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Piotrowski, Zhang, Bates, Keating, Sato, Ong and Landick. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Modeling of Zymomonas mobilis central metabolism for novel metabolic engineering strategies

# *Uldis Kalnenieks<sup>1</sup>\*, Agris Pentjuss<sup>2</sup>, Reinis Rutkis<sup>1</sup>, Egils Stalidzans<sup>1,2,3</sup> and David A. Fell<sup>4</sup>*

<sup>1</sup> Institute of Microbiology and Biotechnology, University of Latvia, Riga, Latvia

<sup>2</sup> Department of Computer Systems, Latvia University of Agriculture, Jelgava, Latvia

<sup>3</sup> SIA TIBIT, Jelgava, Latvia

<sup>4</sup> Department of Biological and Medical Sciences, Oxford Brookes University, Oxford, UK

#### *Edited by:*

Katherine M. Pappas, University of Athens, Greece

#### *Reviewed by:*

Jason Warren Cooley, University of Missouri, USA

Patrick Hallenbeck, University of Montreal, Canada

#### *\*Correspondence:*

Uldis Kalnenieks, Institute of Microbiology and Biotechnology, University of Latvia, Kronvalda boulevard 4, Riga, LV-1586, Latvia e-mail: kalnen@lanet.lv

Mathematical modeling of metabolism is essential for rational metabolic engineering. The present work focuses on several types of modeling approach to the quantitative understanding of the central metabolic network and energetics in the bioethanol-producing bacterium *Zymomonas mobilis*. Combined use of Flux Balance, Elementary Flux Mode, and thermodynamic analysis of its central metabolism, together with dynamic modeling of the core catabolic pathways, can help to design novel substrate and product pathways by systematically analyzing the solution space for metabolic engineering, and yields insights into the function of the metabolic network that are hardly achievable without modeling tools.

**Keywords: stoichiometric modeling, elementary flux modes, kinetic modeling, systems biology, metabolic engineering, Entner–Doudoroff pathway, central metabolism,** *Zymomonas mobilis*

**INTRODUCTION**

*Zymomonas mobilis*, a member of the family of *Sphingomonadaceae*, is an unusual facultatively anaerobic Gram-negative bacterium, which has a very efficient homoethanol fermentation pathway. High ethanol yields, outstanding ethanol productivity (exceeding that of yeast by 3–5 fold; see Rogers et al., 1982), and tolerance to high ethanol and sugar concentrations have kept *Z. mobilis* in the focus of biotechnological research for over four decades. The complete genome sequence of *Z. mobilis* ZM4, consisting of a single circular chromosome of 2,056,416 bp, was reported by Seo et al. (2005), followed by the genomes of several other strains (Kouvelis et al., 2009; Pappas et al., 2011; Desiniotis et al., 2012). Its small genome size, together with high specific rate of sugar catabolism via the Entner–Doudoroff (ED) pathway, and a relatively simple central metabolic network, make *Z. mobilis* a promising candidate for metabolic engineering (Sprenger, 1996; Rogers et al., 2007). Currently, recombinant *Z. mobilis* capable of fermenting pentose sugars is regarded as a potential alternative to yeast and recombinant *Escherichia coli* for ethanol biofuel synthesis from agricultural and forestry waste (Dien et al., 2003; Panesar et al., 2006; Rogers et al., 2007; Lau et al., 2010).

In spite of the seeming simplicity of its metabolism, *Z. mobilis* is a bacterium with an interesting physiology (Kalnenieks, 2006), posing researchers some long-standing challenges. Its extremely rapid glucose catabolism, far exceeding the biosynthetic demands of the cell, and the presence of an active respiratory chain with a low apparent P/O ratio (Bringer et al., 1984; Strohdeicher et al., 1990; Kalnenieks et al., 1993) are major manifestations of its so-called uncoupled growth. There are serious gaps in our understanding of the mechanistic basis of uncoupled growth, and in particular, the reason for the low degree of coupling in the respiratory chain of *Z. mobilis*.

Mathematical modeling and *in silico* simulations are the most powerful tools of systems biology for understanding complex metabolic phenomena, and often lead to novel, counterintuitive conclusions. A quantitative picture of physiology and metabolism is key to rational, model-driven metabolic engineering. Some of the different metabolic modeling approaches that can support the design of novel metabolic engineering strategies are summarized in **Figure 1** (for reviews see: Schuster et al., 2000; Liu et al., 2010; Santos et al., 2011; Schellenberger et al., 2011; Rohwer, 2012). Compared with qualitative, pathway-oriented approaches, computational network analyses can enforce strict mass, energy and redox balancing and give an overall stoichiometric equation for predicted conversions (cf. de Figueiredo et al., 2009). Here we outline recent advances and perspectives from applying such systems biology approaches to the physiology of *Z. mobilis*. We discuss some recent results gained by stoichiometric and kinetic modeling of its central metabolism, and their potential application to the design of novel substrate pathways, synthesis of novel products, and to the study of the uncoupled growth phenomenon *per se*.

### **RECONSTRUCTION OF** *Z. mobilis* **CENTRAL METABOLIC NETWORK**

Two medium-scale (Tsantili et al., 2007; Pentjuss et al., 2013) and two genome-scale stoichiometric reconstructions of *Z. mobilis* (Lee et al., 2010; Widiastuti et al., 2010) have been reported so far, representing instances of the left and center panels of **Figure 1**, respectively. These reconstructions were based on the available genome annotation (Seo et al., 2005; Kouvelis et al., 2009) and provided an overall picture of *Z. mobilis* metabolism. The recent reconstruction made by Pentjuss et al. (2013) was focussed solely on the reactions of central metabolism and for the first time for *Z. mobilis* provided simulation-ready model files. That decreased the scale, yet allowed an improvement in the accuracy of reconstruction, by combining the genome-derived information with the preexisting biochemical evidence on *Z. mobilis*, available mostly for the reactions of catabolism and central metabolism.

Notably, several key reactions of central metabolism, common for the majority of the chemoheterotrophic, facultatively anaerobic bacteria, are absent in *Z. mobilis*. The Embden–Meyerhof–Parnas (EMP) glycolytic pathway is not operating in this bacterium. Absence of the EMP pathway has been confirmed by [1-<sup>13</sup>C]glucose experiments (Fuhrer et al., 2005), and furthermore, the gene for phosphofructokinase is lacking in the genome (Seo et al., 2005). *Z. mobilis* is the only known microorganism that uses the ED pathway anaerobically in place of the EMP glycolysis. Since the EMP pathway produces two ATP per glucose while the ED produces only one, it might seem that *Z. mobilis* suffers from ATP deficiency. However, it has been recently shown by means of thermodynamic analysis that, for a given glycolytic flux, the ED pathway requires significantly less enzymatic protein than the EMP pathway (Flamholz et al., 2013). On the other hand, the amount of the ED pathway enzymes in the *Z. mobilis* cell is reported to be very high, reaching 50% of the cell's soluble protein (Algar and Scopes, 1985; An et al., 1991). The high level of expression of the pathway together with its inherent speed, therefore, makes ATP production by the *Z. mobilis* ED pathway very rapid and, in fact, excessive for the needs of the cell. Energy dissipation in order to regenerate ADP is thus essential for its balanced operation (Kalnenieks, 2006).

The TCA cycle is truncated, and consists of two branches, leading to 2-oxoglutarate and fumarate as the end products (Bringer-Meyer and Sahm, 1989). The genes for the 2-oxoglutarate dehydrogenase complex and malate dehydrogenase are absent (Seo et al., 2005), and accordingly, <sup>13</sup>C-labeling patterns of 2-oxoglutarate and oxaloacetate do not support cyclic function of this pathway in *Z. mobilis* (de Graaf et al., 1999). Also, the pentose phosphate pathway is incomplete: transaldolase activity is lacking (Feldmann et al., 1992; de Graaf et al., 1999). The activity of 6-phosphogluconate dehydrogenase, the first reaction of the oxidative part of the pentose phosphate pathway, was reported to be very low (Feldmann et al., 1992). Subsequently, the corresponding gene (*gnd*) could not be identified in the sequenced genomes.

The aerobic redox cofactor balance and the function of the electron transport chain represent yet another part of *Z. mobilis* metabolism that differs from that typically found in other bacteria. *Z. mobilis* is one of the few known bacteria in which both NADH and NADPH can serve as electron donors for the respiratory type II NADH dehydrogenase (ZMO1113; Bringer et al., 1984; Strohdeicher et al., 1990; Kalnenieks et al., 2008). Because of the truncated Krebs cycle, the ED pathway is the only source of reducing equivalents in catabolism, and therefore the electron transport chain competes for the limited NADH with the highly active alcohol dehydrogenases (Kalnenieks et al., 2006). Withdrawal of NADH from the alcohol dehydrogenase reaction would cause accumulation of acetaldehyde, which inhibits growth of aerobic *Z. mobilis* culture (Wecker and Zall, 1987). Nevertheless, this bacterium possesses a respiratory chain with high rates of oxygen consumption. The apparent P/O ratio of its respiratory chain is low (Bringer et al., 1984; Kalnenieks et al., 1993) though the mechanistic basis for that is not clear. However, for metabolic engineering purposes, an active, yet energetically inefficient electron transport has advantages for the needs of redox balancing during synthesis of novel products via metabolic pathways for which regeneration of NAD(P)+ is essential, whereas the aerobic increase of biomass yield is unwanted.

# **QUEST FOR NOVEL SUBSTRATES AND PRODUCTS: STOICHIOMETRIC AND THERMODYNAMIC ANALYSIS**

Much of the metabolic engineering in *Z. mobilis* has been devoted to broadening its substrate spectrum and expanding its product range beyond bioethanol, with a particular focus on the pathway of pentose sugar utilization for synthesis of bioethanol (Sprenger, 1996; Rogers et al., 2007). Advanced pentose-assimilating strains of *Z. mobilis* have been developed during the last couple of decades that can, in several respects, compete with the analogous recombinant strains of *E. coli* and *S. cerevisiae* (Lau et al., 2010). We were interested in exploring the biotechnological potential of the low-efficiency respiratory chain of this bacterium for expanding its substrate and product spectrum.

Based on a medium-scale reconstruction of central metabolism (Pentjuss et al., 2013), stoichiometric modeling was used to search the whole solution space of the model, finding maximum product yields and the byproduct spectra with glucose, xylose, or glycerol as the carbon substrates for respiring cultures (**Figure 1**, left hand side). This was done by the Flux Balance Analysis approach, using the COBRA Toolbox (Schellenberger et al., 2011). The stoichiometric analysis suggested several metabolic engineering strategies for obtaining products such as glycerate, succinate, and glutamate that would use the electron transport chain to oxidize the excess NAD(P)H generated during synthesis of these metabolites. Oxidation of the excess NAD(P)H would also be needed for synthesis of ethanol from glycerol.

It is essential, however, to complement the stoichiometric analysis with estimation of the thermodynamic feasibility of the underlying reactions. Glycerol utilization can serve as an example. Being a cheap, renewable carbon source, a byproduct of biodiesel technology, glycerol represents an attractive alternative substrate for *Z. mobilis* metabolic engineering. It is not expected to have serious growth-inhibitory effects, and also, little genetic engineering seems to be needed to make it consumable by *Z. mobilis*, and to channel it into the rapid ED pathway. Conversion of glycerol to ethanol by *Z. mobilis* would require expression of a heterologous transmembrane glycerol transporter and a glycerol kinase. Its genome contains genes for the two subsequent conversion steps, glycerolphosphate dehydrogenase and triose phosphate isomerase, leading to the ED intermediate glyceraldehyde-3-phosphate, although their overexpression might be needed. The further reactions from glyceraldehyde-3-phosphate to ethanol represent a part of the natural ethanologenic pathway of *Z. mobilis*, and should be both rapid and redox-balanced. The extra NAD(P)H generated by the glycerolphosphate dehydrogenase reaction could be oxidized by the respiratory chain. If succinate is the desired product (Pentjuss et al., 2013), the extra reducing equivalents could be used for reduction of fumarate by the respiratory fumarate reductase.
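The redox bookkeeping behind this argument can be checked with a few lines of code. The sketch below lumps the glycerol-to-ethanol route described above into six reactions and evaluates the steady-state balance of each internal metabolite, which is the constraint that Flux Balance Analysis enforces. It is a hand-built toy, not the Pentjuss et al. (2013) model or the COBRA Toolbox: the reaction names are invented, NADH and NADPH are pooled as NAD(P)H, and ATP, CO2, and external metabolites are left out.

```python
# Lumped reactions: negative coefficients are consumed, positive are produced.
REACTIONS = {
    "glycerol_uptake_kinase": {"G3P": 1},
    "g3p_dehydrogenase": {"G3P": -1, "DHAP": 1, "NAD(P)H": 1},
    "triose_p_isomerase": {"DHAP": -1, "GAP": 1},
    "gap_to_pyruvate": {"GAP": -1, "PYR": 1, "NAD(P)H": 1},
    "pdc_adh_to_ethanol": {"PYR": -1, "NAD(P)H": -1},
    "respiration": {"NAD(P)H": -1},
}


def net_balance(fluxes):
    """Net production rate of every internal metabolite for a flux distribution."""
    balance = {}
    for name, flux in fluxes.items():
        for metabolite, coeff in REACTIONS[name].items():
            balance[metabolite] = balance.get(metabolite, 0.0) + coeff * flux
    return balance


# One glycerol per unit time through every step, without and with respiration.
without_respiration = {name: 1.0 for name in REACTIONS}
without_respiration["respiration"] = 0.0
with_respiration = {name: 1.0 for name in REACTIONS}

print(net_balance(without_respiration))
print(net_balance(with_respiration))
```

Without respiration the balance leaves one NAD(P)H per glycerol unaccounted for, so no steady state exists. Setting the respiratory flux equal to the glycerol uptake flux closes the balance, in line with the text.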

The pathway from glycerol to glyceraldehyde-3-phosphate via phosphorylation and the following oxidation and isomerization steps is presented in biochemistry textbooks as the pathway of glycerol catabolism after breakdown of triacylglycerols in higher animals and humans (see e.g., Lehninger Principles of Biochemistry, 6th edition, Fig. 17-4). Though feasible from the stoichiometric point of view, it reveals problems when subjected to thermodynamic analysis. The equilibrium of the glycerolphosphate dehydrogenase reaction appears to be shifted very much toward formation of glycerol phosphate. *In silico* kinetic simulations of glycerol uptake for a putative engineered *Z. mobilis* demonstrate a dramatic accumulation of glycerol-3-phosphate, reaching concentrations of several molar even at a high rate of NAD(P)H withdrawal by the respiratory chain (Rutkis et al., unpublished). Apparently, while the estimated overall stoichiometry of aerobic glycerol conversions is correct, thermodynamic analysis suggests the need to search for alternative reaction sequences to avoid excessive intracellular accumulation of metabolites.
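The direction of this effect follows from the equilibrium condition alone. For glycerol-3-phosphate + NAD(P)+ ⇌ dihydroxyacetone phosphate + NAD(P)H, the equilibrium ratio of glycerol-3-phosphate to dihydroxyacetone phosphate equals the NAD(P)H/NAD(P)+ ratio divided by the equilibrium constant. The sketch below evaluates that relation. The standard free energy change, the cofactor ratio, and the temperature are assumed placeholders chosen for illustration; they are not values from this article or from the unpublished simulations.

```python
import math

R_KJ_PER_MOL_K = 8.314e-3  # gas constant in kJ/(mol K)


def equilibrium_g3p_to_dhap_ratio(delta_g0_kj, reduced_to_oxidized_ratio,
                                  temperature_k=303.15):
    """Equilibrium [glycerol-3-P]/[DHAP] for the oxidation written above.

    delta_g0_kj is the standard free energy change of the oxidation direction.
    """
    keq = math.exp(-delta_g0_kj / (R_KJ_PER_MOL_K * temperature_k))
    return reduced_to_oxidized_ratio / keq


# Assumed illustrative inputs: an unfavourable oxidation and an oxidized cofactor pool.
ratio = equilibrium_g3p_to_dhap_ratio(delta_g0_kj=25.0, reduced_to_oxidized_ratio=0.1)
print(ratio)
```

Under these assumptions the ratio is far above 1 even with a strongly oxidized cofactor pool, which illustrates why glycerol-3-phosphate must build up before the reaction can carry net flux toward dihydroxyacetone phosphate.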

#### **AEROBIC ELEMENTARY FLUX MODES OF THE PENTOSE PHOSPHATE PATHWAY**

A metabolic network can function according to many different pathway options. Elementary flux mode (EFM) analysis has emerged as a systems biological tool that dissects a metabolic network into its basic building blocks, the EFMs (Schuster et al., 2000, 2002). All metabolic capabilities in steady states represent a weighted average of the EFMs, which are the minimal sets of enzymes that can each generate a valid steady state. The EFM approach has proved to be efficient for designing sets of knock-out mutations in order to minimize unwanted metabolic functionality in the producer strains. For example, in engineered *E. coli*, EFM-based mutation analysis helped to eliminate catabolite repression and to increase carbon flux toward the target product ethanol (Trinh et al., 2008).
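The statement that steady states are weighted combinations of EFMs can be illustrated with a toy branched network. The sketch below assumes a single internal metabolite fed by one uptake reaction and drained by two irreversible product branches, so the two EFMs can be written down by hand. The network and the weights are hypothetical, and the helper functions are not part of ScrumPy or any other package mentioned here.

```python
# One internal metabolite A; columns are uptake, branch_1, branch_2.
STOICHIOMETRY = [[1, -1, -1]]

# Each mode uses the uptake plus exactly one branch; neither can lose a reaction.
EFMS = [
    [1, 1, 0],
    [1, 0, 1],
]


def is_steady_state(stoichiometry, flux, tol=1e-9):
    """True if every internal metabolite is produced as fast as it is consumed."""
    return all(
        abs(sum(coeff * v for coeff, v in zip(row, flux))) < tol
        for row in stoichiometry
    )


def combine(modes, weights):
    """Non-negative weighted sum of flux modes."""
    n_reactions = len(modes[0])
    return [sum(w * mode[i] for mode, w in zip(modes, weights))
            for i in range(n_reactions)]


flux = combine(EFMS, [2.0, 1.0])
print(flux, is_steady_state(STOICHIOMETRY, flux))
```

Knocking out `branch_2` in this toy removes the second EFM and leaves only the first, which is the logic behind EFM-guided knock-out design.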

By decomposing a network of highly interconnected reactions, the EFM analysis may reveal unexpected flux options. Recently we applied EFM analysis to the interaction between the ED pathway, the pentose phosphate pathway and the respiratory chain in an engineered *Z. mobilis*, which expresses heterologous *gnd* and enzymes for pentose conversion, using the metabolic modeling package ScrumPy (Poolman, 2006). We were interested in the EFMs that such non-growing engineered *Z. mobilis* might employ for aerobic catabolism of glucose and xylose. Analysis revealed several EFMs in respiring cells (**Figure 2**) that are of considerable interest for the study of aerobic energy-coupling in this bacterium. With both monosaccharides, knocking out *edd* (encoding 6-phosphogluconate dehydratase), and overexpressing heterologous *gnd* (encoding 6-phosphogluconate dehydrogenase), would lead to generation of additional NAD(P)H and CO2 in the pentose phosphate pathway, while lowering the ethanol yield. Yet, most importantly, decrease of the ethanol yield would not be accompanied by accumulation of acetaldehyde and acetoin.

Thus, a simple EFM analysis suggests how to modify *Z. mobilis* aerobic metabolism so that its electron transport chain would receive more reducing equivalents without accumulation of inhibitory byproducts. Strains with such metabolic modifications might be very useful for study of the mechanisms underlying the uncoupled mode of oxidative phosphorylation in this bacterium.

#### **KINETIC MODELING OF THE ENTNER–DOUDOROFF PATHWAY**

Despite the diverse studies of *Z. mobilis* physiology and genetics, little has been done so far to combine the accumulated knowledge in the form of a kinetic model of central metabolism that would be comparable to the existing models for *E. coli* and yeast, and could be used to develop efficient metabolic engineering strategies (cf. **Figure 1**, right-hand panel). A kinetic model reported by Altintas et al. (2006) focussed mainly on the interaction between the heterologous enzymes of the pentose phosphate pathway and the native *Z. mobilis* ED glycolysis. Providing predictions for optimization of expression levels of the heterologous genes, this study contributed to strategies for maximizing xylose conversion to ethanol. However, the authors assumed constant intracellular concentrations of all adenylate cofactors. Since the ED pathway itself is a major player in ANP and NAD(P)(H) turnover, this might lead to erroneous conclusions on the pathway kinetics and restrict the range of model application. The recent kinetic model by Rutkis et al. (2013): (i) treated the cofactor levels as variables, making the interplay between adenylate cofactor levels and the pathway kinetics explicit, and (ii) introduced equilibrium constants in the kinetic equations to account for the reversibility of reactions more correctly. Metabolic control analysis (MCA) carried out with the model pointed to the ATP turnover as a major bottleneck, showing that the ATP consumption (dissipation) exerts a high level of control over glycolytic flux under various conditions (Rutkis et al., 2013).
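For readers unfamiliar with MCA, a flux control coefficient is the fractional change in steady-state flux per fractional change in the activity of one enzyme. The sketch below estimates such coefficients by finite differences for a hypothetical two-step pathway with linear kinetics. It is a toy with invented rate constants, unrelated to the Rutkis et al. (2013) model, and is meant only to show how control can be distributed unevenly between steps.

```python
def steady_state_flux(e1, e2, x0=1.0, k1=1.0, k1_rev=0.5, k2=2.0):
    """Flux through X0 -> S -> X1 with v1 = e1*(k1*x0 - k1_rev*s) and v2 = e2*k2*s."""
    s = e1 * k1 * x0 / (e1 * k1_rev + e2 * k2)  # steady-state intermediate level
    return e2 * k2 * s


def flux_control_coefficient(step, e1=1.0, e2=1.0, rel_change=1e-6):
    """Central finite-difference estimate of d ln J / d ln e for step 0 or 1."""
    up = [e1, e2]
    down = [e1, e2]
    up[step] *= 1.0 + rel_change
    down[step] *= 1.0 - rel_change
    reference = steady_state_flux(e1, e2)
    return (steady_state_flux(*up) - steady_state_flux(*down)) / (
        2.0 * rel_change * reference
    )


c_supply = flux_control_coefficient(0)
c_demand = flux_control_coefficient(1)
print(c_supply, c_demand, c_supply + c_demand)
```

In this toy the two coefficients add up to 1, as the summation theorem of MCA requires, so a large coefficient for one step necessarily leaves little control for the other.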

Indeed, experimental studies of the ED pathway flux have shown that moderate overexpression of the ED pathway and alcohol dehydrogenase genes does not affect the glycolytic flux (Arfman et al., 1992; Snoep et al., 1995). Larger increases of the expression levels even caused a decrease in flux, and also had a negative impact on the *Z. mobilis* growth rate (Snoep et al., 1995). This clearly indicated that glycolytic flux in *Z. mobilis* must be controlled at some point(s) outside the ED pathway itself. The negative effects of overexpression apparently did not result from intrinsically negative flux control coefficients of the ED enzymes, but were attributable to the protein burden effect (Snoep et al., 1995), whereby overexpression of an enzyme with a small flux control coefficient caused reductions in the expression of other enzymes that have a greater influence on the flux. These results together with MCA studies on the kinetic model suggested that, due to the negligible flux control coefficients for the majority of reactions, single enzymes of the ED pathway should not be considered as prime targets for overexpression to increase the glycolytic flux in *Z. mobilis* (Rutkis et al., 2013). The calculated effects of overexpressing several glycolytic enzymes (*gap, pgk, pgm*) and both alcohol dehydrogenase isoenzymes (*adhA* and *adhB*) predicted little or no increase of glycolytic flux, in accordance with previous experimental observations (Arfman et al., 1992; Snoep et al., 1995). The somewhat higher flux control coefficient for the pyruvate decarboxylase (*pdc*) reaction suggested that overexpression of this enzyme by more than 3-fold might lead to an increase of glycolytic flux of almost 23% (Rutkis et al., 2013). However, quite the opposite was observed experimentally: an approximately 10-fold increase of *pdc* was shown to slow down glycolysis by up to 25%, thereby implying that the protein burden might be a serious side effect of catabolic enzyme overexpression in *Z. mobilis*. Usually the effects of protein burden are of minor importance in the optimization of catabolic fluxes, due to the relatively low concentrations of the enzymes in catabolic routes. This is not the case for *Z. mobilis* catabolism, however, since over 50% of the cell protein is already engaged in the function of the ED pathway (Algar and Scopes, 1985). Fortunately, flux control coefficient estimations still indicate a certain solution space for flux improvement: simultaneous overexpression of *pdc, eno, pgm* within the 3-fold range of initial enzyme activities (which most probably would be below the putative protein burden threshold) has the potential to increase the glycolytic flux by up to 25% (to reach 6.6 g glucose g dry wt<sup>−1</sup> h<sup>−1</sup>; Rutkis et al., 2013).

**FIGURE 2 | Elementary flux modes of aerobic glucose and xylose catabolism for a strain with engineered pentose phosphate pathway enzymes.** Elementary flux mode for catabolism of glucose in cells with knocked-out edd and overexpressed heterologous gnd via the Entner–Doudoroff and pentose phosphate pathway, involving both NADH- and NADPH-oxidizing activity of the respiratory chain, is shown. ScrumPy modeling software EFM drawing algorithm (Pentjuss et al., unpublished) was used for visualization. Inset: complete list of elementary flux modes of glucose and xylose catabolism in Z. mobilis, involving the Entner–Doudoroff pathway, pentose phosphate pathway and the respiratory chain, with ethanol and carbon dioxide as the sole products. The explicitly shown elementary flux mode is shaded in gray.

Obviously, another option would be to raise ATP dissipation. That could be done by overexpression of the H+-dependent F0F1-ATPase, a major ATP-dissipating activity. Reyes and Scopes (1991) estimated that the F0F1-ATPase contributes over 20% of the total intracellular ATP turnover. It should be noted, however, that overexpression of ATP-dissipating reaction(s) might disturb the intracellular ATP homeostasis, with subsequent suspension of glycolysis (by slowing down the first reaction of the ED pathway, phosphorylation of glucose). Co-response analysis (Rutkis et al., 2013) indicates that, at the highest glycolytic flux considered (4.6 g/g/h), the cellular capacity to maintain the ATP homeostasis is close to its limit, since even a 1% further increase of glycolytic flux due to a rise of ATP dissipation would be associated with a 4% decrease in ATP concentration.
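The co-response figures quoted above can be turned into a small worked calculation. The sketch below is only an illustration: the function names are hypothetical, and the linear extrapolation is an assumption that holds at best for small perturbations around the reference state; it is not part of the published analysis.

```python
def co_response(rel_change_y, rel_change_x):
    """Ratio of the relative change in y to the relative change in x."""
    return rel_change_y / rel_change_x


def projected_atp_fraction(flux_increase, coefficient):
    """First-order estimate of the remaining ATP fraction (assumption:
    the co-response coefficient stays constant over the perturbation)."""
    return 1.0 + coefficient * flux_increase


# Values from the text: a 1% rise in flux goes with a 4% drop in ATP.
ATP_FLUX_CO_RESPONSE = co_response(-0.04, 0.01)

if __name__ == "__main__":
    print(f"co-response coefficient: {ATP_FLUX_CO_RESPONSE:.1f}")
    # Hypothetical 5% flux increase, extrapolated linearly.
    print(f"ATP fraction left: {projected_atp_fraction(0.05, ATP_FLUX_CO_RESPONSE):.2f}")
```

A coefficient of about −4 means that ATP concentration falls roughly four times faster, in relative terms, than the flux rises, which is why the text describes the ATP homeostasis as being close to its limit.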

#### **CONCLUSION**

Although *Z. mobilis* metabolism has been subject to extensive research, and genome sequence data for several strains are now also available, it is only quite recently that modeling of its central metabolic network has started to gain momentum. These latest results of modeling *Z. mobilis* illustrate the relevance of combined stoichiometric, thermodynamic and kinetic analysis of central metabolism at different scales for microorganisms producing biorenewables. Concerted application of structural and dynamic modeling will help to identify targets for future metabolic engineering in a systematic manner, and provide novel insights into the biotechnological potential of this bacterium.

#### **AUTHOR CONTRIBUTIONS**

All authors have equally contributed to the manuscript and have accepted the final version to be published.

#### **ACKNOWLEDGMENTS**

This work was supported by the Latvian ESF projects 2009/027/1DP/1.1.1.2.0/09/APIA/VIAA/128 and 2009/0138/1DP/1.1.2.1.2/09/IPIA/VIAA/004, and by the Latvian Council of Science project 536/2012.

#### **REFERENCES**


Reyes, L., and Scopes, R. K. (1991). Membrane-associated ATPase from *Zymomonas mobilis*; purification and characterization. *Biochim. Biophys. Acta* 1068, 174–178.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 16 November 2013; paper pending published: 29 December 2013; accepted: 21 January 2014; published online: 05 February 2014.*

*Citation: Kalnenieks U, Pentjuss A, Rutkis R, Stalidzans E and Fell DA (2014) Modeling of Zymomonas mobilis central metabolism for novel metabolic engineering strategies. Front. Microbiol. 5:42. doi: 10.3389/fmicb.2014.00042*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2014 Kalnenieks, Pentjuss, Rutkis, Stalidzans and Fell. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Comparative genomics and functional analysis of rhamnose catabolic pathways and regulons in bacteria

*Irina A. Rodionova<sup>1</sup>, Xiaoqing Li<sup>1</sup>, Vera Thiel<sup>2</sup>, Sergey Stolyar<sup>3†</sup>, Krista Stanton<sup>4</sup>, James K. Fredrickson<sup>3</sup>, Donald A. Bryant<sup>2,5</sup>, Andrei L. Osterman<sup>1</sup>, Aaron A. Best<sup>4</sup>\* and Dmitry A. Rodionov<sup>1,6</sup>\**

*<sup>1</sup> Sanford-Burnham Medical Research Institute, La Jolla, CA, USA*

*<sup>2</sup> Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA*

*<sup>4</sup> Department of Biology, Hope College, Holland, MI, USA*

*<sup>5</sup> Department of Chemistry and Biochemistry, Montana State University, Bozeman, MT, USA*

*<sup>6</sup> A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia*

#### *Edited by:*

*Katherine M. Pappas, University of Athens, Greece*

#### *Reviewed by:*

*Akos T. Kovacs, Friedrich Schiller University of Jena, Germany; Sacha A. F. T. Van Hijum, UMC St. Radboud, Netherlands*

#### *\*Correspondence:*

*Aaron A. Best, Department of Biology, Hope College, 35 E 12th Street, Holland, MI 49423, USA e-mail: best@hope.edu; Dmitry A. Rodionov, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA e-mail: rodionov@burnham.org*

#### *†Present address:*

*Sergey Stolyar, Institute for Systems Biology, Seattle, USA*

L-rhamnose (L-Rha) is a deoxy-hexose sugar commonly found in nature. L-Rha catabolic pathways were previously characterized in various bacteria including *Escherichia coli*. Nevertheless, homology searches failed to recognize all the genes for the complete L-Rha utilization pathways in diverse microbial species involved in biomass decomposition. Moreover, the regulatory mechanisms of L-Rha catabolism have remained unclear in most species. A comparative genomics approach was used to reconstruct the L-Rha catabolic pathways and transcriptional regulons in the phyla Actinobacteria, Bacteroidetes, Chloroflexi, Firmicutes, Proteobacteria, and Thermotogae. The reconstructed pathways include multiple novel enzymes and transporters involved in the utilization of L-Rha and L-Rha-containing polymers. Large-scale regulon inference using bioinformatics revealed remarkable variations in transcriptional regulators for L-Rha utilization genes among bacteria. A novel bifunctional enzyme, L-rhamnulose-phosphate aldolase (RhaE) fused to L-lactaldehyde dehydrogenase (RhaW), which is not homologous to previously characterized L-Rha catabolic enzymes, was identified in diverse bacteria including Chloroflexi, Bacilli, and Alphaproteobacteria. By using *in vitro* biochemical assays we validated both enzymatic activities of the purified recombinant RhaEW proteins from *Chloroflexus aurantiacus* and *Bacillus subtilis.* Another novel enzyme of the L-Rha catabolism, L-lactaldehyde reductase (RhaZ), was identified in Gammaproteobacteria and experimentally validated by *in vitro* enzymatic assays using the recombinant protein from *Salmonella typhimurium*. *C. aurantiacus* induced transcription of the predicted L-Rha utilization genes when L-Rha was present in the growth medium and consumed L-Rha from the medium. This study provided comprehensive insights into L-Rha catabolism and its regulation in diverse Bacteria.

#### **Keywords: L-rhamnose catabolism, metabolic reconstruction, regulon, comparative genomics,** *Chloroflexus*

# **INTRODUCTION**

L-rhamnose (L-Rha) is a deoxy-hexose sugar commonly found in plants as a part of complex pectin polysaccharides and in many bacteria as a common component of the cell wall (Buttke and Ingram, 1975; Giraud and Naismith, 2000). Many microorganisms including the *Enterobacteriaceae* and *Rhizobiaceae* are capable of utilizing L-Rha as a carbon source (Eagon, 1961). Plant-pathogenic species (such as *Erwinia* spp.) and saprophytic species (e.g., *Bacillus subtilis*) are able to degrade rhamnogalacturonans and other L-Rha-containing polysaccharides by a set of extracellular enzymes including rhamnogalacturonate lyases (termed RhiE in *Erwinia* spp.) and α-L-rhamnosidases (RhmA, RamA) (Laatu and Condemine, 2003; Ochiai et al., 2007; Avila et al., 2009). The resulting L-Rha and unsaturated rhamnogalacturonides can enter the cells by specific transport systems, the L-rhamnose permease RhaT in Enterobacteriaceae (Muiry et al., 1993), and the RhiT transporter in *Erwinia chrysanthemi* (Hugouvieux-Cotte-Pattat, 2004). In the latter species, the unsaturated galacturonyl hydrolase RhiN is used to release L-Rha and unsaturated galacturonate residues to promote their further catabolism in the cytoplasm (Hugouvieux-Cotte-Pattat, 2004; Rodionov et al., 2004).

The canonical phosphorylated catabolic pathway for L-Rha described in enterobacteria is comprised of three enzymes, L-Rha isomerase (RhaA), L-rhamnulose kinase (RhaB), and L-rhamnulose-1-phosphate aldolase (RhaD) (Schwartz et al., 1974), which convert L-Rha to dihydroxyacetone phosphate (DHAP) and L-lactaldehyde (Akhy et al., 1984; Badia et al., 1989) (**Figure 1**). In addition, L-Rha mutarotase (RhaM) facilitates the interconversion of α and β anomers of L-Rha, providing the stereochemically less-favored anomer for the subsequent catabolic reactions (Richardson et al., 2008). The structures and reaction mechanisms of each of these four enzymes from *Escherichia coli* have been determined (Korndorfer et al., 2000; Kroemer et al., 2003; Ryu et al., 2005; Grueninger and Schulz, 2006). Another L-Rha isomerase with broad substrate specificity (RhaI, 17% sequence identity to RhaA from *E. coli*) has been characterized in *Pseudomonas stutzeri* (Leang et al., 2004; Yoshida et al., 2007). L-lactaldehyde is a common product of both the L-rhamnose and L-fucose catabolic pathways and is further metabolized to L-lactate by the aldehyde dehydrogenase AldA or to 1,2-propanediol by the lactaldehyde reductase RhaO/FucO under certain conditions (Baldoma and Aguilar, 1988; Zhu and Lin, 1989; Patel et al., 2008). An alternative nonphosphorylated catabolic pathway for L-Rha, comprising four metabolic enzymes (L-rhamnose-1-dehydrogenase, L-rhamnono-γ-lactonase, L-rhamnonate dehydratase, and L-2-keto-3-deoxyrhamnonate aldolase) by which L-Rha is converted to pyruvate and L-lactaldehyde, has been identified in fungi and two bacterial species, *Azotobacter vinelandii* and *Sphingomonas* sp. (Watanabe et al., 2008; Watanabe and Makino, 2009).

*<sup>3</sup> Pacific Northwest National Laboratory, Biological Sciences Division, Richland, WA, USA*

Induction of the L-Rha utilization genes in *E. coli* is mediated by two rhamnose-responsive positive transcription factors (TFs) from the AraC family, RhaS, and RhaR (Tobin and Schleif, 1990; Egan and Schleif, 1993; Via et al., 1996). RhaR activates the *rhaSR* genes via binding to the inverted repeat of two 17 bp half sites separated by a 17 bp spacer. RhaS activates the *rhaBAD* and *rhaT* genes via binding to another inverted repeat of two sites whose sequence differs from the RhaR consensus binding site. In another bacterium, the plant pathogen *Erwinia chrysanthemi* from the order *Enterobacteriales*, the expanded RhaS regulon includes a similar set of genes involved in L-Rha utilization, as well as the rhamnogalacturonides utilization genes *rhiTN* (Hugouvieux-Cotte-Pattat, 2004). The L-Rha catabolic gene cluster in *Bacteroides thetaiotaomicron* is positively controlled by another AraC-family TF, which is non-orthologous to *E. coli* RhaR (16% identity) (Patel et al., 2008). In *Rhizobium leguminosarum* bv. trifolii, a novel negative TF of the DeoR family has been implicated in control of the L-Rha utilization regulon, which contains two divergently transcribed operons, *rhaRST-PQUK* and *rhaDI*, encoding an ABC transporter for L-Rha uptake (RhaSTPQ), an alternative kinase (RhaK, 19% identity to RhaB from *E. coli*), an isomerase (RhaI), and a mutarotase (RhaU, 41% identity to RhaM from *E. coli*) (Richardson et al., 2004, 2008; Richardson and Oresnik, 2007).

Our initial genome analysis suggested the presence of a novel variant of the L-Rha utilization pathway in anoxygenic phototrophic bacteria from the *Chloroflexi* phylum. Indeed, the existence of such a pathway was implied by the presence of *rhaA* and *rhaB* gene orthologs and the absence of *rhaD* and *rhaO* genes in *Chloroflexus aurantiacus.* Moreover, the L-Rha catabolic pathway is not completely understood in many other bacterial species, including *Bacillus subtilis* and *Streptomyces coelicolor*. Mechanisms of transcriptional regulation of L-Rha utilization genes are also poorly understood in many species beyond the models. With the availability of hundreds of sequenced bacterial genomes, it is possible to use comparative genomics to reconstruct metabolic pathways and regulatory networks in individual taxonomic groups of Bacteria (Rodionov et al., 2010, 2011; Ravcheev et al., 2011, 2013; Leyn et al., 2013). Genome context-based techniques, including the analysis of chromosomal gene clustering, protein fusion events, phylogenetic co-occurrence profiles, and the genomic inference of metabolic regulons, are highly efficient methods for elucidation of novel sugar catabolic pathways. In our previous studies, we combined the genomic reconstruction of metabolic and regulatory networks with experimental testing of selected bioinformatic predictions to map sugar catabolic pathways systematically in two diverse taxonomic groups of bacteria, *Shewanella* and *Thermotoga* (Rodionov et al., 2010, 2013). Furthermore, we have applied the integrated bioinformatic and experimental approaches to predict and validate novel metabolic pathways and transcriptional regulons involved in utilization of arabinose (Zhang et al., 2012), xylose (Gu et al., 2010), N-acetylglucosamine (Yang et al., 2006), N-acetylgalactosamine (Leyn et al., 2012), galacturonate (Rodionova et al., 2012a), and inositol (Rodionova et al., 2013) in diverse bacterial lineages.

In this work, we combined genomics-based reconstruction of L-Rha utilization pathways and RhaR transcriptional regulons in bacteria from diverse taxonomic lineages with the experimental validation of the L-Rha utilization system in *C. aurantiacus* and two other microorganisms. A novel bifunctional enzyme (named RhaEW) catalyzing two consecutive steps in L-Rha catabolism, L-rhamnulose-phosphate aldolase and L-lactaldehyde dehydrogenase, was identified in diverse bacterial lineages including Actinobacteria, α-proteobacteria, Bacilli, Bacteroidetes, and Chloroflexi. The predicted dual function of RhaEW was validated by *in vitro* enzymatic assays with recombinant proteins from *C. aurantiacus* and *B. subtilis*. Another enzyme involved in L-lactaldehyde utilization in γ-proteobacteria, L-lactaldehyde reductase RhaZ, was identified and experimentally confirmed in *Salmonella* spp. Comparative analyses of upstream regions of the L-Rha utilization genes allowed identification of candidate DNA motifs for various groups of regulators from different TF families and reconstruction of putative rhamnose regulons. L-Rha-specific transcriptional induction and the predicted DNA binding motif of a novel DeoR-family regulator of the *rha* genes were experimentally confirmed in *C. aurantiacus*.

#### **MATERIALS AND METHODS**

### **GENOMIC RECONSTRUCTION OF RHAMNOSE UTILIZATION PATHWAYS AND REGULONS**

The comparative genomic analysis of the L-Rha utilization subsystem was performed using the SEED genomic platform (Overbeek et al., 2005), which allowed annotation and capture of gene functional roles, their assignment to metabolic subsystems, identification of non-orthologous gene displacements, and projection of the functional annotations across microbial genomes, as previously described for other sugar catabolic subsystems (Rodionov et al., 2010, 2013; Leyn et al., 2012; Rodionova et al., 2012a, 2013). The obtained functional gene annotations were captured in the SEED subsystem available online at http://pubseed.theseed.org/SubsysEditor.cgi?page=ShowSubsystem&subsystem=L-rhamnose\_utilization and are summarized in Table S1 in the Supplementary Material.

For reconstruction of RhaR regulons we used an established comparative genomics approach based on identification of candidate regulator-binding sites in closely related bacterial genomes implemented in the RegPredict Web server tool (regpredict.lbl.gov) (Novichkov et al., 2010). First, we identified potential *rhaR* transcription factor genes that are located within the conserved neighborhoods of the L-Rha catabolic genes in bacterial genomes from each studied taxonomic lineage. Identification of orthologs in closely related genomes and gene neighborhood analysis were performed in MicrobesOnline (http://microbesonline.org/) (Dehal et al., 2010). To find the conserved DNA-binding motifs for each group of orthologous RhaR regulators, we used initial training sets of genes that are co-localized with *rhaR* orthologs (putative operons containing at least one candidate L-Rha utilization gene and that are located in the vicinity of a maximum of ten genes from a *rhaR* gene), and then we updated each set by the most likely RhaR-regulated genes confirmed by the comparative genomics tests as well as functional considerations (i.e., involvement of candidate target genes in the L-Rha utilization pathway). Using the Discover Profile procedure in RegPredict, common DNA motifs with palindromic or direct repeat symmetry were identified and their corresponding position weight matrices (PWMs) were constructed. The initial PWMs were used to scan the reference genomes and identify additional RhaR-regulated genes that share similar binding sites in their upstream regions. The conserved regulatory interactions were included in the reconstructed RhaR regulons using the clusters of co-regulated orthologous operons in RegPredict. Candidate sites associated with new members of the regulon were added to the training set, and the respective lineage-specific PWM was rebuilt to improve search accuracy. Sequence logos for the derived DNA-binding motifs were built using the Weblogo package (Crooks et al., 2004). The details of all reconstructed regulons are displayed in the RegPrecise database of regulons (Novichkov et al., 2013) available online at http://regprecise.lbl.gov/RegPrecise/collection\_pathway.jsp?pathway\_id=34.
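The PWM-based search described above can be sketched generically as follows. This is not the RegPredict implementation: the log-odds scoring with pseudocounts, the uniform background, the threshold, and the training sites and upstream sequence are all hypothetical choices made only to illustrate how a matrix is built from aligned sites and then used to scan a region for further candidate sites.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all training sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        window_score = score(pwm, window)
        if window_score >= threshold:
            hits.append((start, window, window_score))
    return hits


# Hypothetical training set and upstream region (not taken from the study).
TRAINING_SITES = ["TTGACAAT", "TTGACTAT", "TTGTCAAT", "ATGACAAT"]
UPSTREAM = "CCGCGGTTGACAATGCGCGC"

if __name__ == "__main__":
    matrix = build_pwm(TRAINING_SITES)
    for start, site, site_score in scan(matrix, UPSTREAM, threshold=10.0):
        print(start, site, round(site_score, 2))
```

Adding newly found sites to `TRAINING_SITES` and calling `build_pwm` again mirrors, in a simplified way, the iterative rebuilding of the lineage-specific PWM described in the text.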

#### **GENE CLONING AND PROTEIN PURIFICATION**

The *rhaEW* (*Caur\_2283*) and *rhaR* (*Caur\_2290*) genes from *C. aurantiacus* J-10-fl, the *rhaEW* (*yuxG*) gene from *B. subtilis*, and the *rhaZ* (*STM4044*) and *rhaD* (*STM4045*) genes from *Salmonella enterica* serovar Typhimurium LT2 were amplified by PCR from genomic DNA using specific primer pairs (see Table S2 in Supplementary Material). A pET-derived vector, pODC29 (Gerdes et al., 2006), containing a T7 promoter and an N-terminal His6 tag, or a similar vector, pProEX HTb (Invitrogen), with a *trc* promoter was used for cloning and protein expression. The *rhaR* gene was cloned into the pSMT3 expression vector (Mossessova and Lima, 2000) (a kind gift of Dr. Lima from Cornell University). The obtained plasmid encodes a fusion between the RhaR protein and an N-terminal hexa-histidine Smt3 polypeptide (a yeast SUMO ortholog), which enhances protein solubility. The resulting plasmids were transformed into *E. coli* BL21/DE3 or BL21 (Gibco-BRL, Rockville, MD). Recombinant proteins were overexpressed as fusions with an N-terminal His6 tag and purified to homogeneity using Ni<sup>2+</sup>-chelation chromatography. Cells were grown in LB medium (50 ml), induced by addition of 0.2 mM isopropyl-β-D-thiogalactopyranoside, and harvested after 4 h of additional shaking at 37°C (for Caur\_2283 and Caur\_2290) or 16 h of shaking at 25°C (for YuxG, STM4044, and STM4045). Harvested cells were resuspended in 20 mM HEPES buffer (pH 7) containing 100 mM NaCl, 0.03% Brij-35, 2 mM β-mercaptoethanol, and 2 mM phenylmethylsulfonyl fluoride (Sigma-Aldrich). Cells were lysed by incubation with lysozyme (1 mg/ml) for 30 min, followed by a freeze-thaw cycle and sonication. After centrifugation, Tris-HCl buffer (pH 8) was added to the supernatant (50 mM, final concentration), which was loaded onto a Ni-nitrilotriacetic acid (NTA) agarose minicolumn (0.3 ml) from Qiagen Inc. (Valencia, CA). After washing with starting buffer containing 1 M NaCl and 0.3% Brij-35, bound proteins were eluted with 0.3 ml of the same buffer supplemented with 250 mM imidazole. The purified proteins were electrophoresed on a 12% (w/v) sodium dodecyl sulfate-polyacrylamide gel to monitor size and purity (>90%). Protein concentration was determined by the Quick Start Bradford Protein Assay kit from Bio-Rad.

#### **ENZYME ASSAYS**

Aldolase/dehydrogenase activities of the purified recombinant RhaEW proteins from *C. aurantiacus* (*Ca*\_RhaEW) and *B. subtilis* (*Bs*\_RhaEW), and the *St*\_RhaD and *St*\_RhaZ proteins from *Salmonella typhimurium* were tested by a direct NADH detection assay. Because L-rhamnulose-1-P is not commercially available, we used an enzymatic coupling assay with two upstream catabolic enzymes for the conversion of L-Rha to L-rhamnulose-1-P. The L-Rha isomerase RhaA from *E. coli* (*Ec*\_RhaA) and the L-rhamnulose kinase RhaB from *Thermotoga maritima* (*Tm*\_RhaB) were expressed in *E. coli* and purified as described previously (Rodionova et al., 2012b). For *Ca*\_RhaEW assays, the purified recombinant enzymes, *Ec*\_RhaA (2 μg) and *Tm*\_RhaB (2 μg), were pre-incubated for 20 min at 37°C in 100 μl of reaction mixture containing 150 mM Tris-HCl (pH 8), 20 mM MgCl2, 10 mM ATP, 1.4 mM NAD<sup>+</sup>, 10 μM ZnSO4, and 8 mM L-Rha. Then *Ca*\_RhaEW (0.5 μg) was added to the assay mixture and the reduction of NAD<sup>+</sup> was followed by the increase in absorbance at 340 nm at different temperatures (30–70°C) in the spectrophotometer. For *Bs*\_RhaEW, *St*\_RhaD and *St*\_RhaZ assays, *Ec*\_RhaA and *Tm*\_RhaB were pre-incubated in a ratio of 40:1 (RhaA:RhaB) at 25°C for 40 min in a reaction mixture containing 50 mM Tris-HCl (pH 7.5), 20 mM MgCl2, 1 mM ATP, 50 mM KCl, 2.5 mM NAD<sup>+</sup> or 0.25 mM NADH, 5 μM ZnCl2, and 2 mM L-Rha. Subsequently, either *Bs*\_RhaEW (2.8 μg) or *St*\_RhaD (10 μg) and *St*\_RhaZ (2.8 μg) enzymes were added and the reduction of NAD<sup>+</sup> or oxidation of NADH was monitored by the increase or decrease in absorbance at 340 nm, respectively, at 25°C in a final reaction volume of 200 μl.
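As a worked illustration of how an absorbance trace at 340 nm translates into enzyme activity, the sketch below applies the Beer–Lambert law with the commonly used NADH extinction coefficient of 6.22 mM<sup>−1</sup> cm<sup>−1</sup>. The 1-cm path length, the example slope, and the function names are assumptions for illustration; the text does not state how the authors converted their readings.

```python
# Commonly used molar extinction coefficient of NADH at 340 nm.
NADH_EXTINCTION_PER_MM_CM = 6.22


def nadh_rate_um_per_min(delta_a340_per_min, path_length_cm=1.0):
    """Convert an absorbance slope (A340/min) into a NADH rate in uM/min.

    The absolute value is used so that NAD+ reduction (rising A340) and
    NADH oxidation (falling A340) both give a positive rate.
    """
    rate_mm = abs(delta_a340_per_min) / (NADH_EXTINCTION_PER_MM_CM * path_length_cm)
    return rate_mm * 1000.0


def specific_activity(delta_a340_per_min, reaction_volume_ul, enzyme_ug,
                      path_length_cm=1.0):
    """Return specific activity in umol per min per mg of enzyme."""
    rate_um = nadh_rate_um_per_min(delta_a340_per_min, path_length_cm)
    umol_per_min = rate_um * reaction_volume_ul * 1e-6  # uM x litres = umol
    return umol_per_min / (enzyme_ug * 1e-3)            # ug -> mg


if __name__ == "__main__":
    # Hypothetical slope of 0.05 A340/min with the 200 ul volume and
    # 2.8 ug of enzyme mentioned in the text.
    print(round(specific_activity(0.05, 200, 2.8), 3))
```

With microplate readers the optical path is usually shorter than 1 cm and depends on the fill volume, so `path_length_cm` would need to be set accordingly.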

#### **GC-MS ANALYSIS**

Four-step biochemical conversions of L-Rha to L-lactate and DHAP by mixtures of the three L-Rha catabolic enzymes were monitored by GC-MS. Samples from enzymatic assay mixtures (10 μl) were dried in a vacuum centrifuge at room temperature, and derivatized at 80°C for 20 min with 75 μl of pyridine containing 50 mg ml<sup>−1</sup> methoxylamine or ethylhydroxylamine (for lactate detection). The solution was then incubated at 80°C for 60 min with 75 μl of N,O-*bis*-(trimethylsilyl)trifluoroacetamide or N-*tert*-butyldimethylsilyl-N-methyltrifluoroacetamide (for lactate detection). After derivatization, the samples were centrifuged for 1 min at 14,000 r.p.m. and the supernatant (1 μl) was transferred to vials for GC-MS analysis. The QP2010 Plus GC-MS instrument was from Shimadzu (Columbia, MD). GC-MS analyses were performed as previously described in Rodionova et al. (2012a, 2013).

#### **BACTERIAL STRAINS AND GROWTH CONDITIONS**

The *yuxG* (*rhaEW*) and *yceI* (*niaP*) disruption strains of *B. subtilis* were obtained from the joint Japanese and European *B. subtilis* consortium (Kobayashi et al., 2003). The latter strain with an insertion in the niacin transporter *niaP* was used as an isogenic negative control. Both strains were grown overnight at 37°C in chemically defined medium containing D-glucose (4 g/l), L-tryptophan (50 mg/l), L-glutamine (2 g/l), K2HPO4 (10 g/l), KH2PO4 (6 g/l), sodium citrate (1 g/l), MgSO4 (0.2 g/l), K2SO4 (2 g/l), FeCl3 (4 mg/l), and MnSO4 (0.2 mg/l) in the presence of erythromycin (0.5 mg/l) (pMUTIN2 marker). Overnight cultures were diluted ∼10-fold to yield the same cell density (optical density at 600 nm of 0.05) in the defined medium lacking glucose and washed three times to remove residual glucose. Cells were grown in triplicate in one of two versions of the defined medium, containing either L-Rha (4 g/l) or no additional carbon source. *C. aurantiacus* J-10-fl was grown at 52°C in 25 ml screw-capped glass tubes completely filled with BG-11 medium (Stanier et al., 1971) supplemented with 0.02% (w/v) of NH4Cl and 2 mM of NaHCO3. Yeast extract (YE; 0.2%) or pyruvate (35 mM), each with and without an additional 20 mM L-Rha, was used as the main carbon source, and cultures were grown under microaerobic starting conditions in the light. Cultures were constantly mixed on a rotation wheel during incubation. Growth of cultures was monitored at 600 nm using an ELX-808IU microplate reader from BioTek Instruments Inc. (Winooski, VT). The concentration of L-Rha in culture fluids was determined on an HPLC equipped with an HPX 78 (Bio-Rad) column.

#### **RT-PCR**

Individual transcript levels were measured for seven genes from *C. aurantiacus*: *rhaB* (*Caur\_2282*), *rhaF* (*Caur\_2286*), *rhaR* (*Caur\_2290*), *rhmA* (*Caur\_0361*), and *Caur\_0839* (NADH-flavin oxidoreductase/NADH oxidase). The latter housekeeping gene was used as a positive control since it was found to be highly expressed under both photoheterotrophic and chemoheterotrophic conditions in a previous proteome study (Cao et al., 2012). Total RNA was isolated from cells grown on BG-11 medium supplied with YE, YE plus L-Rha, pyruvate, and pyruvate plus L-Rha under suboxic conditions in the light, and collected after 3 days at optical densities at 650 nm of 1.3, 0.9, 0.4, and 0.6, respectively. RNA was isolated using a phenol-chloroform extraction method adapted from Aiba et al. (1981) and Steunou et al. (2006). Cell pellets were resuspended in 250 μl of 10 mM sodium acetate (pH 4.5) and 37.5 μl of 500 mM Na2EDTA (pH 8.0), then mixed with 375 μl of lysis buffer (10 mM sodium acetate, 2% SDS, pH adjusted to 4.5). Hot (65°C) acidic (pH 4.5) phenol (700 μl) was added, and the sample was vortexed and incubated at 65°C for 3 min. After centrifugation (17,000 × g, 2 min), the RNA was further purified by one phenol-chloroform-isoamyl alcohol (25:24:1) extraction and one chloroform extraction. RNA was precipitated by adding 0.1 volume of 10 M LiCl and 2.5 volumes of 100% EtOH and incubating at −80°C for at least 30 min, washed with 80% EtOH, and resuspended in DEPC-treated H2O. The RNA solution was treated with DNase I (New England Biolabs Inc.) and re-precipitated after an additional chloroform:isoamyl alcohol (24:1) extraction. The purified RNA was dissolved in DEPC-treated water. Semi-quantitative RT-PCR was conducted using a Bioline Tetro one-step RT-PCR kit following the manufacturer's protocol. The gene-specific primers for each gene tested are shown in Table S2 in Supplementary Material. For each reaction, one control for DNA contamination (same template as for RT-PCR, but started at the RT-inactivation step) and one PCR positive control (using 10 ng of whole genome DNA from *C. aurantiacus* as template) were included. PCR conditions were the same for each primer pair used. All reactions started with a 30 min RT step at 42°C followed by an RT-inactivation step at 95°C. Then a single-step PCR for amplification of the genes from cDNA was conducted using 30 cycles of 30 s denaturation at 95°C, 30 s annealing at 60°C, and 90 s elongation at 72°C before cooling down to 10°C.
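The thermal program described above can be written down as a small data structure, which makes the stated durations easy to check. The sketch below is illustrative only: the class and variable names are hypothetical, and durations that the text does not give (the RT-inactivation step and the final hold) are deliberately left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    name: str
    temp_c: float
    seconds: Optional[int]  # None where the text gives no duration


RT_STEP = Step("reverse transcription", 42.0, 30 * 60)
RT_INACTIVATION = Step("RT inactivation", 95.0, None)
CYCLE = (
    Step("denaturation", 95.0, 30),
    Step("annealing", 60.0, 30),
    Step("elongation", 72.0, 90),
)
N_CYCLES = 30
FINAL_HOLD = Step("hold", 10.0, None)


def known_runtime_seconds():
    """Sum only the durations stated in the text (no ramp times,
    no RT-inactivation time, no final hold)."""
    cycle_seconds = sum(step.seconds for step in CYCLE)
    return RT_STEP.seconds + N_CYCLES * cycle_seconds


if __name__ == "__main__":
    print(f"stated program time: {known_runtime_seconds() / 60:.0f} min")
```

The DNA-contamination control described in the text corresponds to running the same program from `RT_INACTIVATION` onward, skipping `RT_STEP`.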

#### **DNA BINDING ASSAYS**

The interaction of the purified recombinant *C. aurantiacus* RhaR protein with its cognate DNA binding site in *C. aurantiacus* was assessed using an electrophoretic mobility-shift assay (EMSA). The His6-Smt3-tag was cleaved from the purified RhaR protein by digestion with Ulp1 protease. Complementary DNA fragments, containing the predicted 38-bp RhaR binding site from the *Caur\_2290* promoter region and flanked on each side by five guanosine residues (Table S2 in Supplementary Materials) were synthesized by Integrated DNA Technologies. One oligonucleotide was 3′-labeled with biotin, whereas the complementary oligonucleotide was unlabeled. Double-stranded labeled DNA fragments were obtained by annealing the labeled oligonucleotides with unlabeled complementary oligonucleotides at a 1:10 ratio. The biotin-labeled 48-bp DNA fragment (0.2 nM) was incubated with increasing concentrations of the purified RhaR protein (10–1000 nM) in a total volume of 20 μl of the binding buffer containing 50 mM Tris-HCl (pH 8.0), 150 mM NaCl, 5 mM MgCl2, 1 mM DTT, 0.05% NP-40, and 2.5% glycerol. Poly(dIdC) (1 μg) was added as a nonspecific competitor DNA to reduce non-specific binding. After 25 min of incubation at 50°C, the reaction mixtures were separated by electrophoresis on a 1.5% (w/v) agarose gel at room temperature. The DNA was transferred by electrophoresis onto a Hybond-N+ membrane and fixed by UV-cross-linking. The biotin-labeled DNA was detected with the LightShift chemiluminescent EMSA kit (Thermo Fisher Scientific Inc, Rockford, IL, USA). An additional DNA fragment from the *Caur\_0003* gene upstream region (Table S2 in Supplementary Materials) was used as a negative control. The effect of D-glucose, L-Rha, and L-rhamnulose (obtained by enzymatic conversion of L-Rha by *Ec*\_RhaA) was tested by their addition to the incubation mixture.
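The probe design described above (a 38-bp site flanked by five G residues on each side, giving a 48-bp fragment, plus its complementary strand) can be sketched as follows. The site sequence used here is a placeholder of the right length, not the real RhaR site, which is listed in Table S2 and is not reproduced in this section; the function names are likewise hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_probe(site, flank="GGGGG"):
    """Flank a binding site on both sides with the same short sequence."""
    return flank + site + flank


# Placeholder 38-nt sequence standing in for the predicted binding site.
PLACEHOLDER_SITE = "ACGT" * 9 + "AC"

TOP_STRAND = build_probe(PLACEHOLDER_SITE)       # strand that would carry the label
BOTTOM_STRAND = reverse_complement(TOP_STRAND)   # unlabeled complementary strand

if __name__ == "__main__":
    print(len(PLACEHOLDER_SITE), len(TOP_STRAND))
    print(TOP_STRAND)
    print(BOTTOM_STRAND)
```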

# **RESULTS**

#### **COMPARATIVE GENOMICS OF L-RHAMNOSE UTILIZATION IN BACTERIA**

To reconstruct catabolic pathways and transcriptional regulons involved in L-Rha utilization in bacteria, we utilized the subsystem-based comparative genomics approach implemented in the RegPredict and the SEED Web resources (Overbeek et al., 2005; Novichkov et al., 2010). As a result, the L-Rha metabolic pathway genes and transcriptional regulons were identified in complete genomes of 55 representatives of diverse taxonomic groups of bacteria including the *Actinomycetales*, *Bacteroidales*, *Chloroflexales*, *Bacillales*, *Rhizobiales*, *Enterobacteriales*, and *Thermotogales*. The distribution of genes encoding the L-Rha catabolic enzymes and associated transporters and regulators across the studied species is summarized in Table S1 in Supplementary Material. The studied bacterial species possess many variations in key enzymes from the L-Rha catabolic pathway, as well as in mechanisms of sugar uptake and transcriptional regulation. Some of these variations are briefly described below, together with novel functional variants of the L-Rha catabolic pathway and novel transcriptional regulons for these pathways.

#### *L-rhamnose catabolic regulons*

The transcriptional regulator RhaS in *E. coli* belongs to the AraC protein family and controls the L-Rha transporter *rhaT* and the catabolic operon *rhaBADU* (Egan and Schleif, 1993; Via et al., 1996). Orthologs of *rhaS* and these catabolic genes for L-Rha utilization are present in other *Enterobacteriales*, as well as in *Tolumonas* and *Mannheimia* spp. RhaS in *E. chrysanthemi* was additionally shown to regulate the *rhiTN* operon involved in the uptake and catabolism of rhamnogalacturonides, L-rhamnose containing oligosaccharides (Hugouvieux-Cotte-Pattat, 2004). The analysis of upstream regions of RhaS-controlled genes and their orthologs in γ-proteobacteria resulted in identification of the putative RhaS-binding motif, which was used for identification of additional RhaS targets in the analyzed genomes (**Figures 2B**, **3B**).

Analysis of other taxonomic groups outside the γ-proteobacteria identified previously uncharacterized members of the LacI, DeoR, and AraC families as alternative transcriptional regulators of the L-rhamnose catabolic pathways (**Figure 2**). To infer novel L-Rha regulons in each taxonomic group, we applied the comparative genomics approach that combines identification of candidate regulator-binding sites with cross-genomic comparison of regulons. The upstream regions of L-Rha utilization genes in each group of genomes containing an orthologous TF were analyzed using a motif-recognition program to identify conserved TF-binding DNA motifs (**Figure 3**). The deduced palindromic DNA motifs of novel LacI-family regulators are characteristic of DNA-binding sites of LacI family regulators. The predicted binding motifs of DeoR-family RhaR regulators in four distinct taxonomic groups are characterized by unique sequences; however, each of them has a similar structure that includes two imperfect direct repeats with a periodicity of 10–11 bp. Novel AraC-family regulators of L-Rha metabolism in the *Bacillales*, *Bacteroides,* and *Enterococcus* groups are also characterized by unique DNA motifs with a common structure of a direct repeat with 21-bp periodicity. Among this large set of predicted L-Rha catabolic regulators, only two transcription factors, an AraC-type activator in *B. thetaiotaomicron* and a DeoR-type repressor in *R. leguminosarum*, have been shown experimentally to mediate the transcriptional control of L-Rha utilization genes in previous studies (Richardson et al., 2004; Patel et al., 2008), although specific DNA operator motifs of these two regulators had not been reported.
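The two motif architectures mentioned above, palindromes (typical of LacI-family operators) and direct repeats with a fixed periodicity (as proposed for the DeoR- and AraC-family regulators), can be told apart with simple sequence checks. The Python sketch below is a minimal illustration and not the motif-recognition program used in this study; the function names and both test sequences are hypothetical and were invented for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def palindrome_mismatches(seq: str) -> int:
    """Number of positions at which a site differs from its own reverse complement.

    Zero means a perfect palindrome (inverted repeat).
    """
    seq = seq.upper()
    return sum(a != b for a, b in zip(seq, reverse_complement(seq)))


def find_sites(seq: str, half_site: str, max_mismatch: int = 0) -> list:
    """Start positions of windows matching half_site with at most max_mismatch differences."""
    seq, half_site = seq.upper(), half_site.upper()
    k = len(half_site)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if sum(a != b for a, b in zip(window, half_site)) <= max_mismatch:
            hits.append(i)
    return hits


def spacings(hits: list) -> list:
    """Distances between consecutive site starts, i.e., the repeat periodicity."""
    return [b - a for a, b in zip(hits, hits[1:])]


# Hypothetical sequences, invented for illustration only.
toy_palindrome = "TGTAAACGTTTACA"
toy_direct_repeat = "TCGAAAGGCCTCGAAACCGGATCGAAA"

print(palindrome_mismatches(toy_palindrome))
print(spacings(find_sites(toy_direct_repeat, "TCGAAA")))
```

A site with few or no palindrome mismatches is consistent with an inverted-repeat operator, whereas regularly spaced hits of the same half-site on one strand indicate a direct-repeat arrangement.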

A detailed description of the reconstructed L-Rha catabolic regulons is available in the RegPrecise database within the collection of regulons involved in L-Rha utilization (Novichkov et al., 2013). Overall, most of these TF regulons are local and control from one to several target operons per genome (**Figure 2**). In the *Bacillales,* RhaR and RhgR control genes involved in the utilization of L-rhamnose and rhamnogalacturonan, respectively (Leyn et al., 2013). In the *Thermotogales*, the DeoR-family RhaR regulator co-regulates genes involved in the utilization of L-Rha mono- and oligosaccharides (Rodionov et al., 2013). In the *Rhizobiales*, RhaR from the DeoR family negatively controls the L-Rha catabolic operon (Richardson et al., 2004), whereas RhiR from the LacI family is predicted to regulate the rhamnogalacturonide utilization gene cluster (named *rhi*). An orthologous LacI-family regulator controls the similar *rhi* gene locus in *Erwinia* spp. In the *Actinomycetales*, a novel LacI-type regulator (termed RhaR) co-regulates genes involved in the uptake and catabolism of L-Rha and L-Rha-containing oligosaccharides. In the *Chloroflexales*, two unique TFs control L-Rha metabolism: the DeoR-family regulator RhaR controls the L-Rha utilization operons in both *Chloroflexus* and *Roseiflexus* spp., while the LacI-family regulator RhmR controls the *rhm* operon involved in the L-Rha oligosaccharide utilization in *C. aurantiacus*.

In summary, at least seven non-orthologous types of TFs appear to regulate the L-rhamnose utilization (*rha*) genes in diverse bacterial lineages. Uptake and catabolism of L-Rha-containing oligosaccharides are either co-regulated with *rha* genes by the same TFs (e.g., RhaRs in *Actinomycetales* and *Thermotogales*; RhaS in *Enterobacteriales*), or are under control of other specialized TFs (RhgR in *Bacillales*, RhiR in *Rhizobiales* and *Erwinia*, RhmR in *Chloroflexus*). In the third part of this study, we experimentally validated the predicted DNA binding sites of the RhaR regulator in *C. aurantiacus*.

#### *L-rhamnose catabolic pathways*

Analysis of L-Rha regulons revealed various sets of genes that are presumably involved in the L-rhamnose utilization subsystem (Table S1 in Supplementary Material). By analyzing protein similarities and genomic contexts for these genes, we inferred their potential functional roles and reconstructed the pathways (**Figure 1**). All four enzymatic steps of the reconstructed catabolic pathways occur in many alternative forms. The most conserved enzyme in the L-Rha subsystem is the L-rhamnulose kinase RhaB, which is substituted by a non-orthologous kinase from the same protein family in γ-proteobacteria (Rodionova et al., 2012b). Two alternative types of L-rhamnose isomerase (RhaA and RhaI) are almost equally distributed among the studied genomes. All analyzed lineages except the *Bacillales* possess L-rhamnose isomerases of a single type. Among the *Bacillales*, all studied genomes have the RhaA isomerase, whereas only *B. licheniformis* has the non-orthologous isozyme RhaI.

The canonical form of L-rhamnulose-1-P aldolase (RhaD) was found in γ-proteobacteria, *Bacteroidales*, *Thermotogales*, and *Lactobacillales*. Instead of RhaD, the L-Rha catabolic gene clusters in *Actinomycetales*, α-proteobacteria, *Bacillales*, and *Chloroflexales* contain a chimeric gene encoding a two-domain protein (e.g., *yuxG* in *Bacillus subtilis*). The uncharacterized protein YuxG and its orthologs have an N-terminal class II aldolase domain (PF00596 protein family in PFAM) fused to a C-terminal short-chain dehydrogenase domain (PF00106). We used DELTA\_BLAST to search for distant homologues of YuxG among proteins with experimentally determined functions. The N-terminal domain of YuxG (named RhaE) is distantly homologous to three *E. coli* enzymes, L-ribulose-5-phosphate epimerase (15% identity, *E*-value 1e−39), L-fuculose-1-phosphate aldolase (11% identity, *E*-value 4e−36), and the canonical RhaD enzyme (14% identity, *E*-value 1e−25). These relationships suggest that it represents a non-orthologous substitution of aldolase RhaD. The C-terminal domain of YuxG (named RhaW) is homologous to various NADH- and NADPH-dependent sugar dehydrogenases including sorbose dehydrogenase from fungi (29% identity, *E*-value 3e−14), 2,3-butanediol dehydrogenase from *Corynebacterium glutamicum* (25% identity, *E*-value 4e−10), and sorbitol-6-phosphate dehydrogenase from *E. coli* (22% identity, *E*-value 9e−09). The phylogenetic occurrence profile suggests that RhaW may represent the missing L-lactaldehyde dehydrogenase/reductase. Thus, the bifunctional protein RhaEW is tentatively predicted to catalyze the two final reactions in the L-Rha catabolic pathway (**Figure 1**).

Downstream enzymes for utilization of L-lactaldehyde varied the most among the analyzed species. Reconstruction of the RhaS regulon in γ-proteobacteria identified various genes that are likely involved in utilization of L-lactaldehyde. The rhamnose operons in *S. typhimurium* and five other species include an additional gene (named *rhaZ*) that encodes a hypothetical iron-containing alcohol dehydrogenase (PF00465). *E. carotovora* has a single RhaS-regulated gene, *aldA*, encoding an aldehyde dehydrogenase from another protein family (PF00171). In contrast, the RhaS regulons in *E. chrysanthemi* and *Mannheimia* spp. include the L-lactaldehyde reductase *rhaO*, whereas *aldA* and *rhaZ* are absent from their genomes. These observations suggest that γ-proteobacteria use three different enzymes and two different pathways for the final stage of the L-rhamnose pathway (**Figure 1**).

In summary, the subsystem reconstruction and genome context analyses allowed us to predict the following novel candidate genes: L-rhamnulose-1-P aldolase (RhaE) and two variants of L-lactaldehyde utilizing enzymes (RhaW and RhaZ) in diverse bacterial genomes. In the second part of this study, we experimentally validated the predicted functions of RhaEW from *B. subtilis* and *C. aurantiacus* and RhaZ from *S. typhimurium.*

#### *L-rhamnose transporters and upstream hydrolytic pathways*

Uptake of L-Rha in *E. coli* is mediated by the L-Rha–proton symport protein, RhaT (Baldoma et al., 1990) that belongs to the Drug/Metabolite Transporter (DMT) superfamily. An orthologous L-Rha transporter was found in the genome context of L-Rha utilization genes/regulons in other γ-proteobacteria and in the *Bacteroidales* (Table S1 in Supplementary Material). Another L-Rha transporter belonging to the ABC superfamily, RhaSTPQ (designated RhaFGHJ here) was described in *R. leguminosarum* (Richardson et al., 2004). In this study, we identified orthologs within the L-Rha operons/regulons in all other α-proteobacteria, as well as in several genomes from the *Chloroflexales*, *Actinomycetales*, and *Enterobacteriales* orders. A different putative L-Rha transporter (termed RhaY), which belongs to the Sugar Porter (SP) family of the Major Facilitator Superfamily (MFS), was identified in certain *Bacillales* and *Actinomycetales* genomes. This functional assignment is supported by the conserved co-localization on the chromosome (in *Mycobacterium*/*Nocardia* spp.) and by predicted co-regulation (via upstream RhaR-binding site in *Saccharopolyspora erythraea*) with other *rha* genes.

The predicted L-Rha regulons in many bacteria include several glycoside hydrolases and transport systems involved in the uptake of L-Rha-containing oligosaccharides into the cytoplasm and their subsequent degradation to form L-Rha monosaccharides. The RhaS-activated operon *rhiTN* is involved in the uptake and hydrolysis of oligosaccharides produced during rhamnogalacturonan catabolism in the plant-pathogenic species from the order *Enterobacteriales* (Hugouvieux-Cotte-Pattat, 2004). Another enterobacterium, *S. typhimurium,* possesses a different RhaS-regulated transport system (named *rhiABC*), which is similar to the C4-dicarboxylate transport system Dcu (**Figure 2**). Based on the gene occurrence pattern and candidate coregulation, *rhiABC* is tentatively predicted to encode an alternative transporter for rhamnogalacturonides, which replaces RhiT in *S. typhimurium*. A different transport system from the ABC family (named *rhiLFG*) and putative α-L-rhamnosidases (*ramA*, *rhmA*) were detected within the RhaR regulons in several *Actinomycetales*. In the *Bacillales* and *Rhizobiales* groups, as well as in the *Erwinia* and *Chloroflexus* spp., homologous ABC transporters and rhamnohydrolases are controlled by the novel lineage-specific transcriptional regulators RhgR, RhiR, and RhmR, respectively.

In summary, the comparative genomics analysis of the L-Rha catabolic subsystem in bacteria revealed extensive variation in the components of the transport machinery. L-Rha transport systems belong to at least three protein families. In addition to L-Rha transporters, many L-Rha-utilizing bacteria possess systems for active uptake of L-Rha-containing oligosaccharides.

#### **EXPERIMENTAL VALIDATION OF NOVEL RHAMNOSE CATABOLIC ENZYMES**

#### *Novel aldolase/dehydrogenase RhaEW*

To provide biochemical evidence for the novel bifunctional aldolase/dehydrogenase enzyme involved in L-Rha catabolism, the recombinant protein RhaEW from *C. aurantiacus* (termed *Ca*\_RhaEW) was overexpressed in *E. coli* with the N-terminal His6 tag, purified using Ni-NTA affinity chromatography, and characterized *in vitro* by a coupled enzymatic assay using spectrophotometry and GC-MS.

Bioinformatics analysis suggested that RhaEW is a bifunctional enzyme catalyzing two sequential activities, L-rhamnulose-1-P aldolase and L-lactaldehyde dehydrogenase (**Figure 1**). We assayed the biochemical activity of the recombinant *Ca*\_RhaEW protein by monitoring the conversion of NAD+ to NADH at 340 nm as a result of the predicted L-lactaldehyde dehydrogenase reaction. The peak of *Ca*\_RhaEW activity (*Vmax* 2.9 U mg<sup>−1</sup>) was observed at 60–70°C (**Figure 4A**), which is in agreement with the optimal temperature range for growth of *C. aurantiacus* (Hanada and Pierson, 2006). Additionally, we tested the possibility that *Ca*\_RhaEW acts as an aldolase/reductase by supplying NADH rather than NAD+ in the reaction; no activity was seen under these conditions (data not shown). Thus, *Ca*\_RhaEW acts *in vitro* to convert L-rhamnulose-1-P to L-lactate and DHAP, which is consistent with the prediction made through comparative genomics analyses.
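For readers unfamiliar with NADH-coupled assays, the conversion from an absorbance change at 340 nm to a specific activity follows from the Beer–Lambert law. The Python sketch below illustrates that arithmetic only; it is not the authors' analysis script. The function name and the example input values are hypothetical, and the extinction coefficient is the commonly used value for NADH at 340 nm (6.22 mM<sup>−1</sup> cm<sup>−1</sup>).

```python
EPSILON_NADH_340 = 6.22  # mM^-1 cm^-1, commonly used extinction coefficient of NADH at 340 nm


def specific_activity(delta_a340_per_min: float,
                      assay_volume_ml: float,
                      protein_mg: float,
                      path_cm: float = 1.0) -> float:
    """Specific activity in U/mg (umol NADH formed per min per mg protein)."""
    # Beer-Lambert: concentration change (mM/min) = dA/min / (epsilon * path length)
    rate_mM_per_min = delta_a340_per_min / (EPSILON_NADH_340 * path_cm)
    # mM is equivalent to umol/ml, so multiplying by the volume in ml gives umol/min (U)
    units = rate_mM_per_min * assay_volume_ml
    return units / protein_mg


# Hypothetical example values, not data from this study.
activity = specific_activity(delta_a340_per_min=0.0622,
                             assay_volume_ml=1.0,
                             protein_mg=0.01)
print(f"{activity:.2f} U/mg")
```

The same relation applies with the opposite sign when NADH consumption (a decrease in absorbance at 340 nm) is monitored.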

The formation of *Ca*\_RhaEW reaction products was directly confirmed by GC-MS profiling of reaction mixtures obtained by overnight incubation of L-Rha with the *Ca*\_RhaEW protein taken alone or in combination with the upstream catabolic enzymes. While incubation of L-Rha with *Ca*\_RhaEW alone did not produce any new peaks on the chromatogram, the addition to the mixture of the *Ec*\_RhaA and *Tm*\_RhaB proteins led to a decrease of two peaks corresponding to L-Rha (retention times 9.28 and 9.39 min) and the appearance of a series of novel peaks (Figure S1 in Supplementary Material). By comparison with standards and the analysis of electron ionization mass spectra (*m/z* 299), the first two peaks with retention times 9.27 and 9.37 min were attributed to DHAP, whereas the peak at retention time 7.75 min was assigned as lactate. Additional peaks appearing in the coupled enzymatic assay were attributed to the upstream intermediates of the L-Rha catabolic pathway, L-rhamnulose (retention times 8.85 and 8.92 min) and L-rhamnulose-1-P (13.04 min). The moderate consumption of L-Rha observed when only *Ec*\_RhaA and *Tm*\_RhaB enzymes were added increased substantially after addition of *Ca*\_RhaEW to the reaction mixture. Finally, neither DHAP nor lactate was detected in the reaction mixture after exclusion of NAD+ which is an essential cofactor of L-lactaldehyde dehydrogenase. These results suggest that the activity of the L-lactaldehyde dehydrogenase domain RhaW is essential for the L-rhamnulose-1-P aldolase activity of the second domain in this bifunctional enzyme.

dehydrogenase activity. **(B)** Growth studies of *B. subtilis* knockout mutants for *yuxG* (*rhaEW*) and *yceI* (*niaP* gene used as a control) grown in defined medium in the presence of L-rhamnose, D-glucose, and no additional carbon source (N.C.). Growth studies were conducted in triplicate.

In order to test the hypothesis that RhaEW from *B. subtilis* functions in the catabolism of L-Rha *in vivo*, we performed growth experiments in defined medium for two mutant *B. subtilis* strains. One strain carried a knockout mutation in the gene *yuxG* (*rhaEW*), whereas the second strain carried an intact version of *yuxG* but had a knockout mutation in an unrelated gene, *yceI* (encoding a niacin transporter), to serve as an isogenic control. We expected that the growth of the *B. subtilis yuxG* mutant strain would not be stimulated by the addition of L-Rha as a carbon source when compared to the *yceI* mutant strain. The results clearly demonstrate that the *B. subtilis yuxG* knockout mutant is non-responsive to added L-Rha when compared to the *yceI* knockout strain and to both strains grown in the absence of an additional carbon source (**Figure 4B**). These data confirm that RhaEW is required for L-Rha utilization in *B. subtilis*. The *B. subtilis* RhaEW protein (*Bs*\_RhaEW) was cloned, purified, and tested by the same coupled enzymatic assay as described above for *Ca*\_RhaEW. The *Bs*\_RhaEW protein showed weak, but reproducible activity, measured at 0.0127 ± 0.001 μmol mg protein<sup>−1</sup> min<sup>−1</sup> at 25°C. Controls removing starting substrate (L-Rha), *Bs*\_RhaEW, or *Ec*\_RhaB (effectively removing rhamnulose-1-P) from the reaction yielded no measurable activity (Figure S2A in Supplementary Materials).

#### *RhaZ functions as an L-lactaldehyde reductase in vitro*

We used the reconstituted L-Rha catabolic pathway to test the prediction that *Salmonella* spp. harbor a novel L-lactaldehyde dehydrogenase, distinct from that of *E. coli* and shared among a subgroup of the γ-proteobacteria. We cloned, overexpressed and purified the recombinant proteins *St*\_RhaD (predicted aldolase) and *St*\_RhaZ (predicted novel dehydrogenase) from *S. typhimurium* to complete the *in vitro* pathway (**Figure 1**). *St*\_RhaD is 99% identical at the amino acid level to *E. coli* RhaD, for which an aldolase function has been demonstrated (Schwartz et al., 1974). To ensure that *St*\_RhaD acts as an aldolase in L-Rha catabolism, we performed two control assays to confirm the production of DHAP and L-lactaldehyde. To test for the production of DHAP, we used purified glycerol-3-P dehydrogenase (GPDH) (Sigma) in an assay containing *Ec*\_RhaA, *Tm*\_RhaB, and *St*\_RhaD. If *St*\_RhaD acts as an L-rhamnulose-1-P aldolase, then the DHAP produced would be converted to glycerol-3-P by GPDH, with the oxidation of NADH to NAD+ monitored as a decrease in absorbance at 340 nm. Likewise, it was expected that if *St*\_RhaD produced L-lactaldehyde, then the known *E. coli* L-lactaldehyde dehydrogenase, AldA, should be active in a reaction containing all three L-Rha catabolic enzymes, producing L-lactate and converting NAD+ to NADH. The results of both controls confirmed the activity of *St*\_RhaD as an L-rhamnulose-1-P aldolase (data not shown), making it possible to test the prediction for *St*\_RhaZ. The purified *St*\_RhaZ protein was included in an assay containing *Ec*\_RhaA, *Tm*\_RhaB, and *St*\_RhaD, using NAD+ as a cofactor. This reaction mixture should lead to the conversion of L-lactaldehyde to L-lactate (as with the *E. coli* AldA enzyme). Under these conditions, *St*\_RhaZ did not show activity as an L-lactaldehyde dehydrogenase.
In order to assess the alternative fate for L-lactaldehyde, which is conversion to L-1,2-propanediol, we repeated the assay under identical conditions with the exception of supplying NADH as the cofactor. *St*\_RhaZ was active under these conditions (Figure S2B in Supplementary Materials), converting L-lactaldehyde to L-1,2-propanediol with a specific activity of 0.13 ± 0.02 μmol mg protein<sup>−1</sup> min<sup>−1</sup>. This indicates that RhaZ functions as an L-lactaldehyde reductase, rather than an L-lactaldehyde dehydrogenase.

#### **EXPERIMENTAL VALIDATION OF RHAMNOSE UTILIZATION AND REGULON IN CHLOROFLEXUS AURANTIACUS**

The anoxygenic phototroph *C. aurantiacus* can grow heterotrophically using various organic compounds under either oxic conditions or anoxic conditions in light (Hanada and Pierson, 2006). However, the ability of *C. aurantiacus* and other species from the order *Chloroflexales* to utilize L-Rha has not been previously investigated. In *C. aurantiacus*, the L-Rha utilization genes are organized into a nine-gene *rha* operon, which is predicted to be transcriptionally controlled by a novel DeoR-family regulator RhaR (**Figure 2**). An additional gene, termed *rhmA*, encoding a potential α-L-rhamnosidase (*Caur\_0361*) is potentially involved in the utilization of L-Rha oligosaccharides by *C. aurantiacus*. A novel LacI-family transcription factor, termed RhmR, potentially regulates the RhmA-encoding operon, which also encodes a potential transport system for uptake of L-Rha-containing oligosaccharides, termed RhmEFG (**Figure 2**). In contrast to the L-Rha utilization operon, which has orthologs in all sequenced genomes of *Chloroflexus* and *Roseiflexus* spp., the *rhmR/A* gene locus is only conserved in the closely-related *Chloroflexus* spp. strain, Y-400-fl, but is absent in the other *Chloroflexales*. We assessed the L-Rha utilization and regulon in *C. aurantiacus* by a combination of *in vivo* and *in vitro* experimental approaches.

To validate L-Rha-specific induction of the predicted L-Rha utilization genes *in vivo*, we performed RT-PCR with specific primers designed for three *rha* operon genes, *rhaR*, *rhaF*, and *rhaB*. Total RNA was isolated from *C. aurantiacus* grown in media containing YE or pyruvate, with and without addition of L-Rha. All three genes demonstrated elevated transcript levels in the cells grown on either YE or pyruvate media supplied with L-Rha compared to that of the cells grown in the absence of L-Rha (Figure S3 in Supplementary Materials). In addition to the *rha* operon genes, *rhmA* transcription was also highly elevated in pyruvate-grown cells supplied with L-Rha. These results confirm that the *rha* and *rhm* operons, that are predicted to be controlled by RhaR and RhmR transcription factors, respectively, are transcriptionally induced by L-Rha. Additionally, the L-Rha grown culture samples of *C. aurantiacus* were analyzed by HPLC to monitor the L-Rha consumption from the culture fluids. The results confirm a high rate of L-Rha consumption in the samples (Figure S3 in Supplementary Materials), thus confirming that the L-Rha uptake and utilization system is functional *in vivo*.

The interaction of the predicted RhaR regulator with the *Caur\_2209* (*rhaR*) upstream DNA fragment containing candidate RhaR-binding sites in *C. aurantiacus*, and the influence of potential sugar effectors on protein-DNA interaction were assessed *in vitro* by EMSA (**Figure 5**). The synthetic 38-bp DNA region containing a tandem repeat of four individual RhaR sites (a consensus sequence TCGAAA) was incubated with increasing concentrations of the purified recombinant RhaR protein. The incubation was performed at 50°C, which is close to the optimal growth temperature of 55°C for *C. aurantiacus*. The EMSA results (Figure S4 in Supplementary Material) are consistent with the *in silico* predicted DNA operator region of RhaR. The addition of D-glucose and L-Rha had no effect on RhaR-DNA interaction, whereas L-rhamnulose abolished the specific DNA-binding ability of RhaR. The obtained results suggest that the RhaR repressor binds to the operator region at the *rha* operon in the absence of a sugar inducer, and that L-rhamnulose serves as a negative regulator for RhaR in *C. aurantiacus*.

# **DISCUSSION**

L-Rha is the most common deoxy-hexose sugar in nature. In plants, it is a component of many glycosides and polysaccharides such as pectins and hemicelluloses (Peng et al., 2012). Among bacteria, L-Rha is found in the cell wall and as a part of the glycosylated carotenoids (Takaichi and Mochimaru, 2007; Takaichi et al., 2010). Utilization of L-Rha and rhamnose-containing polysaccharides has previously been studied in several free-living and plant pathogenic microbial species from the phylum *Proteobacteria*, including members of the genera *Escherichia*, *Erwinia*, *Rhizobium*, *Azotobacter*, and *Sphingomonas*. Due to significant variations in sugar catabolic pathways in bacteria, the projection of this knowledge to the genomes of more distant species, including many species important for prospective bioenergy applications, is a challenging problem (Rodionov et al., 2010, 2013). In this study, we used comparative genomics to reconstruct novel variants of catabolic pathways and novel transcriptional regulons for L-Rha utilization in the genomes of bacteria from ten taxonomic groups.

Using bioinformatics analyses of L-Rha utilization genes, we identified twelve groups of rhamnose-related transcriptional regulators from different protein families, AraC, DeoR, and LacI, and proposed binding site motifs for these regulators within tentatively reconstructed regulons (Figure S5 in Supplementary Material). Prior to this study, only four types of bacterial transcriptional regulators related to L-Rha metabolism had been identified. The AraC family includes at least five groups of non-orthologous regulators of L-Rha metabolism. These regulators have unique DNA motifs with a tandem repeat symmetry. Activators from three AraC groups have been characterized previously: RhaR and RhaS from *E. coli* and *Erwinia* spp., with previously known DNA motifs, and RhaR from *Bacteroides*, with a previously unknown DNA motif. The DeoR family includes at least four non-orthologous groups of RhaR regulators that are characterized by distinct DNA motifs with a tandem repeat symmetry. Among them, only RhaR in *Rhizobium* spp. was described previously (Richardson et al., 2004); however, its DNA binding motif was not known before this study. All LacI-family regulons of L-Rha utilization genes were analyzed for the first time in this study. They are characterized by 20-bp palindromic DNA motifs of four different consensus sequences. In summary, the results of this comparative genomics study demonstrate significant variability in the design and composition of transcriptional regulons for L-Rha metabolism in bacteria. This study has substantially increased our knowledge of the types and operator sequences of transcriptional regulators for L-Rha utilization.

**(A)** Conservation of predicted RhaR binding sites (boxed) identified in the promoter regions of *rha* operons in the *C. aurantiacus* J-10-fl (Caur), *C.* sp. Y-400-fl (Chy400), *C. aggregans* DSM 9485 (Cagg), *Roseiflexus* sp. RS-1 (RS-1), and *R. castenholzii* DSM 13941 (Rcas). Distance to a start codon of *rhaR* is indicated. A 38-bp fragment from *C. aurantiacus* used for DNA binding assays is underlined. **(B)** Summary of the EMSA experiments assessing the potential interaction between the recombinant RhaR protein and its predicted DNA motif at the *Caur\_2209* (*rhaR*) gene. The disappearance of unbound DNA band (shown by "+") was observed upon the addition of increasing concentrations of RhaR protein (0.25–1 μM). Addition of 2 mM of L-rhamnose or D-glucose to the reaction mixture containing 1 μM of RhaR did not change this pattern, whereas addition of 2 mM of L-rhamnulose led to re-appearance of the unbound DNA band (shown by "–"). As a negative control, incubation of RhaR protein (0.5 μM) with upstream DNA fragment of *Caur\_0003* did not reveal the disappearance of unbound DNA band (shown by "–"). The EMSA gel pictures are presented in Figure S4 in Supplementary Material. Asterisks indicate the conserved nucleotides in the multiple alignment.

Based on genomic context analyses of the reconstructed regulons, we have identified several novel enzymes and transporters involved in L-Rha utilization (**Figure 1**). A novel enzyme with two domains, termed RhaEW, encoded by the *yuxG* gene in *B. subtilis* and its orthologs in other bacterial lineages, was found to catalyze the last two steps in the catabolism of L-Rha, namely cleavage of L-rhamnulose-1-P to produce DHAP and L-lactaldehyde and oxidation of L-lactaldehyde to L-lactate. Thus, the RhaE domain functions as a non-orthologous substitute for the classical RhaD aldolase, whereas the function of the RhaW domain is analogous to the aldehyde dehydrogenase AldA from *E. coli*. A novel L-lactaldehyde reductase involved in L-Rha catabolism, termed RhaZ, that is not homologous to previously characterized RhaO/FucO, was identified in many γ-proteobacteria. Both functional predictions were experimentally validated *in vitro* by enzymatic assays with the purified recombinant proteins from *C. aurantiacus* and *B. subtilis* (for RhaEW), and *S. typhimurium* (for RhaZ). The function of RhaEW in L-Rha utilization *in vivo* was also confirmed by genetic techniques in *B. subtilis*. Interestingly, genes encoding L-lactate dehydrogenases (*lldD*, *lldEFG*) belong to the reconstructed RhaR regulons in certain genomes of the *Actinomycetales* and *Rhodobacterales* that encode RhaEW. Thus, the L-Rha utilization pathways in these species are probably extended to produce pyruvate as one of the final products.

Orthologs of the novel aldolase/dehydrogenase RhaEW are broadly distributed among diverse bacterial phyla including Proteobacteria (α-subdivision), Actinobacteria, Chloroflexi, Bacteroidetes, and Firmicutes (*Bacillales*), in which they are always encoded within the *rha* gene loci (Figure S6 in Supplementary Material). The L-rhamnulose-1-P aldolase domain in RhaE is distantly homologous to class II aldolases including the analogous enzyme, RhaD, and the L-fuculose-1-P aldolase, FucA, from *E. coli*. The tertiary structures and catalytic mechanisms for these enzymes have been determined (Dreyer and Schulz, 1996; Grueninger and Schulz, 2008). We aligned the amino acid sequences of all three enzymes using the multiple protein sequence and structure alignment server PROMALS3D (Pei et al., 2008) (Figure S7 in Supplementary Material). Class II aldolases are zinc-dependent enzymes, in which the metal ion is used for enolate stabilization during catalysis. In RhaD, the Zn<sup>2+</sup> ion is chelated by three histidines, His141, His143, and His212, which are conserved in all RhaE proteins. An Asp residue in RhaE replaces the catalytically important Glu<sup>117</sup> in RhaD, which performs the nucleophilic attack of the C3 atom of DHAP. This conservative substitution suggests that this Asp may play a similar role in RhaE. The Gly28, Asn29, and Gly44 residues that are involved in phosphate binding in FucA (Dreyer and Schulz, 1996) are conserved in both RhaD and RhaE enzymes. Conservation of the catalytically important amino acids in both types of L-rhamnulose-1-P aldolases suggests a similar position of the active site and a similar catalytic mechanism.

In summary, the phosphorylated catabolic pathway for L-Rha contains a large number of alternative enzymes including RhaI/RhaA, RhaB/RhaK, RhaD/RhaE, RhaO/RhaZ, and RhaW/AldA (**Figure 1**) and is widely distributed among diverse bacterial phyla. An alternative pathway for the nonphosphorylated L-Rha catabolism that utilizes a unique subset of catabolic enzymes was found only in a small number of proteobacteria (Table S1 in Supplementary Material). In addition to numerous variations among enzymes and transcriptional regulators associated with the L-Rha catabolic pathway, a similarly high level of variation and non-orthologous displacement is observed for the components of the transport machinery. The L-Rha permease, RhaT, which is characteristic of members of the *Enterobacteriales* and *Bacteroidales*, appears to be functionally replaced by either a permease from a different family in some *Actinomycetales* and *Bacillales* or an ABC cassette in α-proteobacteria and *Chloroflexales*. In other genomes, no candidate transporter specific for L-Rha was detected; however, the reconstructed L-Rha pathways and regulons in these species include transport systems and hydrolytic enzymes for L-Rha oligosaccharides (e.g., rhamnogalacturonides). Some of the latter species are known to grow on L-Rha, such as *B. subtilis* (this study) and *T. maritima* (Rodionov et al., 2013); thus, we propose that the predicted L-Rha oligosaccharide transporters in these species are also capable of L-Rha uptake.

Previous studies of L-Rha catabolism in *E. coli* and *Salmonella* revealed a differential fate for L-Rha under aerobic and anaerobic conditions in *E. coli*, but not in *Salmonella* (Baldoma et al., 1988; Obradors et al., 1988). *E. coli* oxidizes L-lactaldehyde to L-lactate via the activity of AldA under aerobic conditions and reduces L-lactaldehyde to L-1,2-propanediol via the activity of FucO under anaerobic conditions (**Figure 1**). In contrast, *Salmonella* produces L-1,2-propanediol under both aerobic and anaerobic conditions when metabolizing L-Rha. The identification of *Salmonella* RhaZ as an L-lactaldehyde reductase is consistent with these observations. *Salmonella* produces 1:1 molar equivalents of L-1,2-propanediol from the catabolism of L-Rha under both aerobic and anaerobic conditions, with growth yields higher than those of *E. coli* under anaerobic conditions (Baldoma et al., 1988). The production of L-1,2-propanediol through renewable, biological methods is of high importance given the current chemical-based processes of production and the wide use of L-1,2-propanediol in many commercial products (Cameron et al., 1998). There are several examples of recent bioengineering strategies to improve L-1,2-propanediol production in *E. coli* (Clomburg and Gonzalez, 2011), cyanobacteria (Li and Liao, 2013), and *Saccharomyces* (Jung et al., 2011), in which each strategy uses glycerol as a starting substrate. The observation of differential fates for L-Rha in *E. coli* and *Salmonella*, the identification of the activity of RhaZ, putative transport systems for rhamnogalacturonides, and predicted regulatory mechanisms in *Salmonella* raise possibilities for exploring alternative biological production strategies of the commercially important L-1,2-propanediol from L-Rha-containing substrates, though L-Rha itself remains an expensive substrate (Cameron et al., 1998).

*C. aurantiacus* and other filamentous anoxygenic phototrophic bacteria from the *Chloroflexaceae* family are commonly found in the upper layers of microbial mats in hot springs (50–62°C), with cyanobacteria growing together with chloroflexi. Although *Chloroflexus* spp. can grow heterotrophically on various organic carbon sources, their sugar utilization pathways had remained largely unknown before this work. Here, we identified and characterized a novel variant of the L-Rha catabolic pathway in *C. aurantiacus*, which includes the L-Rha isomerase RhaA, kinase RhaB, and a novel bifunctional enzyme, RhaEW, that catalyzes the last two steps of the pathway. *C. aurantiacus* transcribed genes for L-Rha utilization when L-Rha was present in the growth medium and consumed L-Rha from the medium. The ecophysiological importance of the L-Rha utilization pathway in members of the *Chloroflexales* is yet to be elucidated. One possibility is that cyanobacteria commonly co-occurring with chloroflexi in hot spring microbial mats may provide them with L-Rha. In such microbial mats, cyanobacteria are primary producers that are thought to cross-feed low-molecular-weight organic compounds (e.g., lactate, acetate, glycolate) to members of the *Chloroflexales* (van der Meer et al., 2003, 2007). There are several potential sources of L-Rha in cyanobacteria including lipopolysaccharides in the outer membrane (Buttke and Ingram, 1975) and glycosylated carotenoids in the cytoplasmic and outer membranes that protect the cell against photooxidative damage (Takaichi and Mochimaru, 2007; Graham and Bryant, 2009). The exact source of L-Rha from a primary producer and its significance for possible metabolite exchange in the mat community require further investigation.

# **ACKNOWLEDGMENTS**

We would like to thank P. Novichkov (LBNL, Berkeley, CA) for help with visualization of regulons in the RegPrecise database, Dave Kennedy (PNNL, Richland, WA) for help with HPLC analysis, David Scott (SBMRI, La Jolla, CA) for help with GC-MS analysis, and E. Dervyn (INRA, France) for *B. subtilis* knockout strains. This research was supported by the Genomic Science Program (GSP), Office of Biological and Environmental Research (OBER), U.S. Department of Energy (DOE), and is a contribution of the Pacific Northwest National Laboratory (PNNL) Foundational Scientific Focus Area. Additional funding was provided by the Russian Foundation for Basic Research (12-04-33003) and the Towsley Foundation (Midland, MI) through the Towsley Research Scholar program at Hope College.

# **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fmicb.2013.00407/abstract.

#### **REFERENCES**


from Pseudomonas stutzeri in Escherichia coli. *Appl. Environ. Microbiol.* 70, 3298–3304. doi: 10.1128/AEM.70.6.3298-3304.2004


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 01 November 2013; paper pending published: 25 November 2013; accepted: 09 December 2013; published online: 23 December 2013.*

*Citation: Rodionova IA, Li X, Thiel V, Stolyar S, Stanton K, Fredrickson JK, Bryant DA, Osterman AL, Best AA and Rodionov DA (2013) Comparative genomics and functional analysis of rhamnose catabolic pathways and regulons in bacteria. Front. Microbiol. 4:407. doi: 10.3389/fmicb.2013.00407*

*This article was submitted to Microbial Physiology and Metabolism, a section of the journal Frontiers in Microbiology.*

*Copyright © 2013 Rodionova, Li, Thiel, Stolyar, Stanton, Fredrickson, Bryant, Osterman, Best and Rodionov. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*