# CURRENT CHALLENGES IN MODELING CELLULAR METABOLISM

EDITED BY: Daniel Machado, Kai H. Zhuang, Nikolaus Sonnenschein and Markus J. Herrgård

PUBLISHED IN: Frontiers in Bioengineering and Biotechnology and Frontiers in Cell and Developmental Biology

#### *Frontiers Copyright Statement*

*© Copyright 2007-2016 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

*All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88919-754-5 DOI 10.3389/978-2-88919-754-5


## **CURRENT CHALLENGES IN MODELING CELLULAR METABOLISM**

Topic Editors:

**Daniel Machado,** University of Minho, Portugal

**Kai H. Zhuang,** Technical University of Denmark, Denmark

**Nikolaus Sonnenschein,** Technical University of Denmark, Denmark

**Markus J. Herrgård,** Technical University of Denmark, Denmark

Mathematical modeling and computational methods provide an essential framework to unravel the complexity of cellular metabolism.

Image adapted from: https://pixabay.com/en/koli-bacteriaescherichia-coli-123081/ and https://pixabay.com/en/monitorisolated-display-white-313011/

Mathematical and computational models play an essential role in understanding cellular metabolism. They are used as platforms to integrate current knowledge on a biological system and to systematically test and predict the effect of manipulations to such systems. Recent advances in genome sequencing techniques have facilitated the reconstruction of genome-scale metabolic networks for a wide variety of organisms, from microbes to human cells. These models have been successfully used in multiple biotechnological applications.

Despite these advancements, modeling cellular metabolism still presents many challenges. The aim of this Research Topic is not only to expose and consolidate the state of the art in metabolic modeling approaches, but also to push this frontier beyond the current edge through the introduction of innovative solutions.

The articles presented in this e-book address some of the main challenges in the field, including the integration of different modeling formalisms, the integration of heterogeneous data sources into metabolic models, explicit representation of other biological processes during phenotype simulation, and standardization efforts in the representation of metabolic models and simulation results.

**Citation:** Machado, D., Zhuang, K. H., Sonnenschein, N., Herrgård, M. J., eds. (2016). Current Challenges in Modeling Cellular Metabolism. Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-754-5

# Table of Contents

*Concepts, challenges, and successes in modeling thermodynamics of metabolism*

William R. Cannon

*Metabolic network discovery by top-down and bottom-up approaches and paths for reconciliation*

Tunahan Çakır and Mohammad Jafar Khatibipour

*Microalgal metabolic network model refinement through high-throughput functional metabolic profiling*

Amphun Chaiboonchoe, Bushra Saeed Dohai, Hong Cai, David R. Nelson, Kenan Jijakli and Kourosh Salehi-Ashtiani

Daniel Machado, Markus J. Herrgård and Isabel Rocha

*Analysis of genetic variation and potential applications in genome-scale metabolic modeling*

João G. R. Cardoso, Mikael Rørdam Andersen, Markus J. Herrgård and Nikolaus Sonnenschein

Ali Khodayari, Anupam Chowdhury and Costas D. Maranas

*Improving collaboration by standardization efforts in systems biology*

Andreas Dräger and Bernhard Ø. Palsson

## **Editorial: Current Challenges in Modeling Cellular Metabolism**

*Daniel Machado<sup>1</sup>\*, Kai H. Zhuang<sup>2</sup>, Nikolaus Sonnenschein<sup>2</sup> and Markus J. Herrgård<sup>2</sup>*

*<sup>1</sup> Centre of Biological Engineering, University of Minho, Braga, Portugal, <sup>2</sup> The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Hørsholm, Denmark*

#### **Keywords: metabolism, modeling formalisms, metabolic networks, genome-scale modeling, kinetic modeling**

Metabolism is a core process of every cell, providing the energy and building blocks for all other biological processes. Mathematical models and computational tools have become essential for unraveling the complexity of cellular metabolism (Heinemann and Sauer, 2010). Models integrate current knowledge on a biological system in an unambiguous manner and allow simulating cellular responses to genetic and environmental perturbations. Advances in genome sequencing and annotation have facilitated the reconstruction of genome-scale metabolic models for hundreds of organisms, which are currently used in various applications ranging from human health to industrial biotechnology (Bordbar et al., 2014).

Despite these advancements, there are still major challenges in modeling cellular metabolism at the genome scale. These include the reconciliation of different modeling approaches, the integration of metabolic models with models of other biological processes, the interpretation of heterogeneous data sources using models, and the adoption of suitable standards for model sharing. This Research Topic presents state-of-the-art methods that aim to overcome these challenges and push this frontier to a new edge.

Starting from the most fundamental aspect of biochemical reactions, Cannon (2014) reviews the historical perspective of thermodynamics as a major driving force in the evolution of life and presents a primer on statistical thermodynamics. The author then provides examples of thermodynamic analysis of small metabolic pathways, highlighting future directions for integration of thermodynamics and large-scale modeling.

The most common approach to build a metabolic model is bottom-up reconstruction, where individual reactions for a given organism are identified (through genome annotation and literature data) and retrieved from biochemical databases. This approach is mostly limited by the current knowledge on enzymes with annotated functions. The alternative (termed top-down) approach is to infer the underlying network structure by reverse engineering of metabolome data. Çakir and Khatibipour (2014) compare these two approaches, reviewing available methods for both cases and providing pointers toward the reconciliation of these strategies.

Once a model is built, it can be used to simulate the metabolic phenotype under different conditions and subsequently compared with *in vivo* results for validation and refinement. Phenotype microarrays currently allow high-throughput assessment of metabolic responses to multiple experimental conditions. Chaiboonchoe et al. (2014) present an optimization of the Biolog phenotyping protocol for metabolic profiling of microalgae. The experimental results are used to expand and refine a genome-scale model of the alga *Chlamydomonas reinhardtii* to include the utilization of carbon and nitrogen sources not present in the original model.

Choosing a modeling formalism requires a compromise between model size and detail (Machado et al., 2011). Constraint-based models have gained popularity for their scalability to the genome scale. However, when insight into intracellular dynamics is required, kinetic models become the obvious choice. Petri nets, with their varied extensions, offer an intermediate level of compromise, allowing structural network analysis and, to some extent, dynamic analysis. Hartmann and Schreiber (2015) present a unified graph formalism and implement transformation operations to convert from the unified model to any specific formalism. The authors provide an example of integrated analysis using different formalisms in a unified model of sucrose breakdown in the potato tuber.

#### *Edited and reviewed by:*

*Raina Robeva, Randolph-Macon College, USA; Sweet Briar College, USA*

*\*Correspondence: Daniel Machado, dmachado@deb.uminho.pt*

#### *Specialty section:*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology*

*Received: 29 October 2015; Accepted: 09 November 2015; Published: 26 November 2015*

#### *Citation:*

*Machado D, Zhuang KH, Sonnenschein N and Herrgård MJ (2015) Editorial: Current Challenges in Modeling Cellular Metabolism. Front. Bioeng. Biotechnol. 3:193. doi: 10.3389/fbioe.2015.00193*

Current *omics* technologies allow unprecedented quantification of different types of cellular components including RNA transcript, protein, and metabolite levels. Machado et al. (2015) use a multi-*omics* dataset of *Escherichia coli* to analyze the contribution of allosteric regulation in controlling central carbon metabolism. Given the role of this type of control in response to different perturbations, the authors present a new simulation method to account for allosteric interactions in the determination of steady-state flux distributions. This is the first constraint-based method to account for allosteric regulation.

Next-generation sequencing is another example of technology pushing the limits of biological discovery. Understanding how genetic variants affect metabolic phenotype is fundamental in diverse areas, such as the study of disease mechanisms and the engineering of microbial cell factories. Cardoso et al. (2015) review available methods to predict the effect of genetic variations in protein function and expression. Integrating these methods with genome-scale metabolic modeling creates the potential for mechanistically predicting the consequences of genetic variation in the cellular phenotype, which is currently not possible with the statistical approaches used in genome-wide association studies.

Microbial strain design is a common application of genome-scale models as the combinatorial explosion of possible genetic manipulations demands efficient optimization methods. Stanford et al. (2015) address the problem of butanol production in *E. coli* using a new strain design method, RobOKoD, that combines gene over/underexpression with gene knockouts, showing good agreement with experimental data. Khodayari et al. (2015) analyze the case of succinate overproduction in *E. coli* using k-OptForce, the first strain design method that accounts for integrated simulation of kinetic and constraint-based models. This enables strain design at the genome scale while accounting for regulation mechanisms in central carbon pathways, such as feedback inhibition. The authors observe decreased prediction accuracy when the kinetic model is applied in experimental conditions that differ from those used for parameter estimation, highlighting the importance of reparameterization of kinetic models for the conditions used in the production setting.

Last but not least, modeling the complexity of cellular metabolism is an iterative refinement process that cannot be accomplished without a community effort. The ability to share models using suitable standards is of paramount importance (Ebrahim et al., 2015). Dräger and Palsson (2014) present a comprehensive review of standardization efforts in Systems Biology, including standards for model representation, model visualization, minimum information requirements, and suitable ontologies. This review also covers public model databases, conversion tools, simulation software, and standards for publication of simulation results. Adoption of these standards is essential to ensure reusability of models and reproducibility of results.

The work presented in this Research Topic addresses many of the current gaps in the field with innovative solutions. Closing these gaps provides a stepping stone for the challenges to come. The future of metabolic modeling already holds exciting opportunities with a new generation of models that include protein structures, gene expression pathways, and even whole-cell models (King et al., 2015).

## **AUTHOR CONTRIBUTIONS**

All authors have read and revised the manuscript.

## **ACKNOWLEDGMENTS**

The authors thank the Novo Nordisk Foundation for financial support.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Machado, Zhuang, Sonnenschein and Herrgård. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Concepts, challenges, and successes in modeling thermodynamics of metabolism

## **William R. Cannon\***

Computational Biology and Bioinformatics Group, Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA

#### **Edited by:**

Daniel Machado, University of Minho, Portugal

#### **Reviewed by:**

Keng Cher Soh, Ciris Energy, USA

Daniel Craig Zielinski, University of California San Diego, USA

#### **\*Correspondence:**

William R. Cannon, Computational Biology and Bioinformatics Group, Biological Sciences Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA

e-mail: william.cannon@pnnl.gov

The modeling of the chemical reactions involved in metabolism is a daunting task. Ideally, the modeling of metabolism would use kinetic simulations, but these simulations require knowledge of the thousands of rate constants involved in the reactions. The measurement of rate constants is very labor intensive, and hence rate constants for most enzymatic reactions are not available. Consequently, constraint-based flux modeling has been the method of choice because it does not require the use of the rate constants of the law of mass action. However, this convenience also limits the predictive power of constraint-based approaches in that the law of mass action is used only as a constraint, making it difficult to predict metabolite levels or energy requirements of pathways. An alternative to both of these approaches is to model metabolism using simulations of states rather than simulations of reactions, in which the state is defined as the set of all metabolite counts or concentrations. While kinetic simulations model reactions based on the likelihood of the reaction derived from the law of mass action, states are modeled based on likelihood ratios of mass action. Both approaches provide information on the energy requirements of metabolic reactions and pathways. However, modeling states rather than reactions has the advantage that the parameters needed to model states (chemical potentials) are much easier to determine than the parameters needed to model reactions (rate constants). Herein, we discuss recent results, assumptions, and issues in using simulations of state to model metabolism.

**Keywords: statistical thermodynamics, metabolism, simulations, fluctuation theory, molecular motors, tricarboxylic acid cycle, adaptation, biological**

#### **INTRODUCTION**

Since the time of Boltzmann, it has been recognized that living organisms are thermodynamic entities. Lotka (1922a) paraphrased Boltzmann's thinking, "that the fundamental object of contention in the life-struggle, in the evolution of the organic world, is available energy." Lotka went on, "in accord with this observation is the principle that, in the struggle for existence, the advantage must go to those organisms whose energy-capturing devices are most efficient in directing available energy into channels favorable to the preservation of the species." Lotka (1922b) proposed that natural selection is at its most fundamental level a physical principle. Schrödinger (1945) famously expanded on this concept with *What is Life?*, and used the concept of entropy to describe how order, in the form of high energy compounds in the environment, drives organization within organisms. Organisms dissipate that energy into lower forms. The concept of life as a non-equilibrium process has resonated with others as well, including Prigogine, who described living organisms as dissipative structures that self-organize in response to large non-equilibrium driving forces (Prigogine, 1978). Abiotic examples of dissipative structures include tornadoes, hurricanes, and convection cells. The non-equilibrium driving forces "pay" for the self-organization that allows the resulting structures to dissipate energy rapidly. In biological systems, energy comes into the system in the form of sunlight or high energy compounds, typically highly reduced carbon compounds, and this energy is dissipated into the environment according to the second law of thermodynamics. In biological systems, some of the energy is harvested to pay for the creation of additional dissipative structures (growth and reproduction), or to create large amounts of stored energy in the form of lower energy byproducts.

The ecologist H. T. Odum was certainly convinced of the role of statistical thermodynamics in systems ecology. Writing in the American Scientist (Odum and Pinkerton, 1955), Odum sought to understand the diverse scale of rates of natural processes, and proposed that each biological system works at an efficiency that allows the maximum efficiency and power, similar to Lotka's concept that the advantage goes to organisms whose metabolism is most efficient at channeling energy for the purpose of reproduction. Odum took natural selection to mean "the persistence of those forms, which can command the greatest useful energy per unit time."

Morowitz also proposed that the far-from-equilibrium natural environment was responsible for self-organization of biological systems. As a consequence, Morowitz proposed that life was not only a consequence of energy flow in natural systems, but also that it is highly probable. From this perspective, natural selection is a random process, and in the words of Dewar (2005), species "are selected because they are characteristic of each of the overwhelming majority of ways in which energy and matter could flow under the constraints imposed by local energy and mass conservation". Such concepts have led to the metabolism-first hypothesis of the emergence of life on Earth (Smith and Morowitz, 2004).

While an excellent collection of discussions of entropy production and self-organization of natural systems has been presented in the literature (Kleidon et al., 2010), for the most part the recognition by physical scientists of the role of thermodynamics as a causal factor in the operation of biological systems stands in stark contrast to the lack of discussion of thermodynamics in the experimental life sciences literature. A major reason for this may be the abstract nature of statistical thermodynamics and the lack of tools to model and evaluate the thermodynamic aspects of living systems. After all, since its conceptualization, developments in thermodynamics have had mostly to do with equilibrium processes, and biological systems are highly non-equilibrium.

However, in the last 20 years, statistical thermodynamics and fluctuation theorems have allowed for significant progress in understanding non-equilibrium systems. Fluctuation theorems are starting to be used to model biological systems, allowing us to begin to understand how cellular machinery operates. These theorems tell us that there is an important difference between thermodynamic models of macroscopic processes and the statistical thermodynamic models of the microscopic processes such as those that make up cells. The second law of thermodynamics describes macroscopic processes and states that the entropy of a spontaneous process never decreases. The second law is silent, however, about the microscopic events that make up the macroscopic process. These microscopic events may be, for instance, sets of coupled reactions that lead to some observable change of state – a different phenotype in the parlance of biology. These microscopic events involve enzyme complexes and coupled reaction pathways in cells, which are not just scaled-down versions of beaker-sized laboratory systems. Components of small systems can in fact run in reverse at times. A number of excellent reviews of fluctuation theorems exist in the literature (Harris and Schutz, 2007; Sevick et al., 2008; Seifert, 2012) and we will only give an in-a-nutshell perspective here.

In this report, we will focus on issues and challenges in thermodynamically modeling biological systems of coupled reactions, such as those that occur in metabolism. We will first discuss probability density functions based on Boltzmann probabilities and the relationship to free energy. Closely related to free energy is the concept of entropy. We will discuss different formulations of entropy and their meanings in order to provide a clear overview of entropy production. Next, fluctuation theorems will be briefly discussed using this conceptual framework. While fluctuation theorems have not yet been used to extensively simulate metabolism, they have great promise, and have been used to examine single molecule dynamics and the dynamics of coupled biochemical reactions on multiple scales. Finally, the application of statistical thermodynamics to model biological reactions that are far from equilibrium is discussed.

#### **THEORETICAL BACKGROUND**

Understanding the foundational concepts of modeling thermodynamics is essential for understanding the challenges that the field faces. The mathematical concepts presented in the literature are often too abstract to be readily accessible to those outside the specialty field of statistical thermodynamics. A case in point is that it may seem like the literature contains a zoo of seemingly unrelated statistics all going by the name of entropy. Understanding which entropy is being used is critical for understanding and applying thermodynamic modeling and fluctuation theorems, as will become evident below.

However, a tremendous amount of physical insight into fluctuation theorems and thermodynamic modeling can be obtained if one understands the multinomial distribution function, which is simply a generalization of the common binomial distribution function when more than two outcomes are possible. With regard to reaction kinetics, more than two outcomes are possible when we have more than two interconverting species present. The mathematical form of a multinomial distribution is,

$$\Pr\left(n_1,\ldots,n_m \mid N_{\rm total},\theta_1,\ldots,\theta_m\right) = N_{\rm total}! \prod_{\text{objects } j} \frac{1}{n_j!}\, \theta_j^{n_j}.$$

The multinomial probability density above is the probability that *n*<sub>*j*</sub> objects of type *j* will be present when there are *N*<sub>total</sub> = Σ<sub>*j*</sub> *n*<sub>*j*</sub> objects present. In the equation above, θ<sub>*j*</sub> is the probability of object *j* independent of the other objects. According to frequentist statistics, this probability is simply the long-term proportion of objects of type *j* that are present, θ<sub>*j*</sub> = *n*<sub>*j*</sub>/*N*<sub>total</sub>. The probability density is not simply Pr = Π<sub>*j*</sub> θ<sub>*j*</sub><sup>*n*<sub>*j*</sub></sup> because each individual object of type *j* is indistinguishable from all the other objects of type *j*. Thus, the probability density has to be corrected for the number of permutations and combinations of each object type, which is accounted for by the factorial terms in the multinomial distribution function.
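As a minimal sketch of the multinomial distribution above, the following Python snippet evaluates the probability of one set of counts and compares it with the naive product of the θ<sub>*j*</sub> terms. The function name `multinomial_pmf` and the numerical values are illustrative assumptions, not part of the original article.

```python
from math import factorial, prod

def multinomial_pmf(counts, thetas):
    """Pr(n_1..n_m | N_total, theta_1..theta_m) for the multinomial distribution."""
    n_total = sum(counts)
    # Multinomial coefficient N_total! / (n_1! ... n_m!): the factorial terms
    # that correct for indistinguishable objects of the same type.
    coeff = factorial(n_total)
    for n in counts:
        coeff //= factorial(n)
    return coeff * prod(t ** n for t, n in zip(thetas, counts))

# Hypothetical example: 4 objects of 3 types with assumed probabilities.
thetas = [0.5, 0.3, 0.2]
counts = [2, 1, 1]
p_state = multinomial_pmf(counts, thetas)
p_naive = prod(t ** n for t, n in zip(thetas, counts))  # ignores the factorial terms
print(p_state, p_naive, p_state / p_naive)
```

For these counts the factorial correction is 4!/(2! 1! 1!) = 12, so the multinomial probability is twelve times the naive product.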

Now consider a system consisting of three chemical species *A*, *B*, and *C* in aqueous solution in a container of fixed volume. Each of the three species can interconvert to one of the other two species, but the total number of particles is fixed such that *n*<sub>A</sub> + *n*<sub>B</sub> + *n*<sub>C</sub> = *N*<sub>total</sub>. The Boltzmann probability θ<sub>*i*</sub> of species *i* is related to the Helmholtz free energy of solvation Δ𝒜<sub>*i*</sub><sup>0</sup> by,

$$\theta_i = \frac{e^{-\Delta \mathcal{A}_i^0/k_{\rm B}T}}{\sum_{\text{species } j}^{m} e^{-\Delta \mathcal{A}_j^0/k_{\rm B}T}}, \tag{1}$$

where *k*<sub>B</sub> is Boltzmann's constant and *T* is the temperature. For simplicity, we will disregard the internal degrees of freedom for each species. In this case, the numerator exp(−Δ𝒜<sub>*i*</sub><sup>0</sup>/*k*<sub>B</sub>*T*) is referred to as the molecular partition function, *q*<sub>*i*</sub>. The denominator is simply a normalization function, usually denoted as *q* = Σ<sub>*i*</sub> *q*<sub>*i*</sub>, the log of which is the Boltzmann average energy of the system, −⟨*E*⟩<sub>B</sub>/*k*<sub>B</sub>*T*. Statistically, the distribution of the particles is characterized by the multinomial Boltzmann probability density function,
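To make Eq. 1 concrete, the sketch below computes the molecular partition functions *q*<sub>*i*</sub>, the normalization *q*, and the Boltzmann probabilities θ<sub>*i*</sub> for three species. The free energies are hypothetical values expressed directly in units of *k*<sub>B</sub>*T*, and the function name is an assumption made for illustration.

```python
import math

def boltzmann_probabilities(energies_kt):
    """Return (thetas, q_values, q) for free energies given in units of k_B*T."""
    q_values = [math.exp(-e) for e in energies_kt]  # molecular partition functions q_i
    q = sum(q_values)                               # normalization q = sum_i q_i
    thetas = [q_i / q for q_i in q_values]          # Boltzmann probabilities, Eq. 1
    return thetas, q_values, q

# Hypothetical solvation free energies for species A, B, C (units of k_B*T).
energies_kt = [0.0, 1.0, 2.0]
thetas, q_values, q = boltzmann_probabilities(energies_kt)
print(thetas, q)
```

Species with lower free energy receive a larger probability, and the ratio of any two probabilities depends only on the free energy difference between the two species.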

$$\Pr\left(n_1, \dots, n_m \mid N_{\rm total}, \theta_1, \dots, \theta_m\right) = N_{\rm total}! \prod_{\text{species } j}^{m} \frac{1}{n_j!}\, \theta_j^{n_j}$$

where *n*<sub>*j*</sub> is the number of particles of species *j*, and there are *N*<sub>total</sub> particles. In analogy to the macroscopic free energy from statistical thermodynamics, an unnormalized mass density for a microscopic state can be defined that is a function of the molecular partition functions *q*<sub>*i*</sub> instead of the Boltzmann probabilities,

$$\frac{-A\left(n_1,\ldots,n_m \mid N_{\rm total}, q_1,\ldots,q_m\right)}{k_{\rm B}T} = \log\left(N_{\rm total}!\prod_{j}^{m}\frac{1}{n_j!}\,q_j^{n_j}\right) \tag{2}$$

For brevity, we will write *A*(*n*<sub>1</sub>, …, *n*<sub>*m*</sub> | *N*<sub>total</sub>, *q*<sub>1</sub>, …, *q*<sub>*m*</sub>) as *A*(*n̄* | *N*<sub>T</sub>, *q̄*) or simply *A*. The value *A* in Eq. 2 is not a free energy because it is not an average over all possible values for each of the *n*<sub>*j*</sub>. The relationship between *A* and the probability density of that microscopic state is,

$$-A/k_{\rm B}T = \log \Pr\left(n_1, \dots, n_m \mid N_{\rm total}, \theta_1, \dots, \theta_m\right) + N_{\rm total} \cdot \log q,$$

or equivalently,

$$-\log \Pr\left(n_1, \dots, n_m \mid N_{\rm total}, \theta_1, \dots, \theta_m\right) = A/k_{\rm B}T + N_{\rm total} \cdot \log q.$$

Since log *q* = −⟨*E*⟩<sub>B</sub>/*k*<sub>B</sub>*T*, we have the relationship

$$\begin{aligned} S_g &= A/k_{\rm B}T - N_{\rm T} \langle E \rangle_{\rm B} / k_{\rm B}T\\ S_g &= -\log \Pr\left(n_1, \dots, n_m \mid N_{\rm T}, \theta_1, \dots, \theta_m\right) \end{aligned} \tag{3}$$

This function on the right hand side is strictly a log likelihood, not an entropy. However, the average log likelihood is an entropy, and in fact is the Gibbs entropy for a system with a fixed number of total particles,
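The relationship between *A* and the probability density, −*A*/*k*<sub>B</sub>*T* = log Pr + *N*<sub>total</sub> log *q*, can be checked numerically. The following sketch does so for one microstate; the free energies, the counts, and the helper names are hypothetical choices for illustration, and log-gamma functions are used only to evaluate the factorials in log space.

```python
import math

def log_multinomial_coeff(counts):
    """log( N_total! / prod_j n_j! ) evaluated with log-gamma functions."""
    n_total = sum(counts)
    return math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in counts)

def minus_a_over_kt(counts, q_values):
    """Right-hand side of Eq. 2: log( N_total! prod_j q_j^n_j / n_j! )."""
    return log_multinomial_coeff(counts) + sum(
        n * math.log(q_j) for n, q_j in zip(counts, q_values))

def log_prob(counts, thetas):
    """Log of the multinomial Boltzmann probability density."""
    return log_multinomial_coeff(counts) + sum(
        n * math.log(t) for n, t in zip(counts, thetas))

# Hypothetical free energies (units of k_B*T) and one microstate of 10 particles.
energies_kt = [0.0, 1.0, 2.0]
q_values = [math.exp(-e) for e in energies_kt]
q = sum(q_values)
thetas = [q_j / q for q_j in q_values]
counts = [5, 3, 2]

lhs = minus_a_over_kt(counts, q_values)
rhs = log_prob(counts, thetas) + sum(counts) * math.log(q)
print(lhs, rhs)  # the two values agree to floating-point precision
```

The identity holds for any microstate because θ<sub>*j*</sub> = *q*<sub>*j*</sub>/*q*, so replacing each θ<sub>*j*</sub> by *q*<sub>*j*</sub> changes the log density by exactly *N*<sub>total</sub> log *q*.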

$$\begin{array}{rcl} S_G &=& -\sum_{\text{microstates } J} \Pr(J) \log \Pr(J) \\ &=& \left\langle A(\bar{n} \mid N_{\rm T}, \bar{q}) \right\rangle - N_{\rm total} \langle E \rangle_{\rm B} \end{array} \tag{4}$$

where Pr(*J*) is shorthand for Pr(*n*<sub>1</sub> = *n*<sub>1</sub>(*J*), …, *n*<sub>*m*</sub> = *n*<sub>*m*</sub>(*J*) | *N*<sub>total</sub>, θ<sub>1</sub>, …, θ<sub>*m*</sub>), energies are expressed in units of *k*<sub>B</sub>*T*, and ⟨*A*(*n̄* | *N*<sub>T</sub>, *q̄*)⟩ is the free energy of the macroscopic state with parameter *N*<sub>T</sub>. Because the Gibbs entropy is an average over microstates, it is the entropy related to macroscopic observations (Jaynes, 1965).
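As an illustration of the average over microstates, the sketch below enumerates every microstate of a small closed system and averages *S*<sub>*g*</sub> = −log Pr (Eq. 3) with weights Pr(*J*). The system size, the probabilities, and the function names are assumptions chosen so that exhaustive enumeration is feasible; this is not a method taken from the article.

```python
import math

def microstates(n_total, m):
    """Yield all tuples of m non-negative counts that sum to n_total."""
    if m == 1:
        yield (n_total,)
        return
    for first in range(n_total + 1):
        for rest in microstates(n_total - first, m - 1):
            yield (first,) + rest

def log_prob(counts, thetas):
    """Log of the multinomial probability of one microstate (thetas must be > 0)."""
    n_total = sum(counts)
    log_coeff = math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_coeff + sum(n * math.log(t) for n, t in zip(counts, thetas))

def gibbs_entropy(n_total, thetas):
    """Average of S_g = -log Pr(J) over all microstates J, weighted by Pr(J)."""
    entropy = 0.0
    for state in microstates(n_total, len(thetas)):
        lp = log_prob(state, thetas)
        entropy -= math.exp(lp) * lp
    return entropy

# Hypothetical closed system: 6 particles, 3 species, assumed probabilities.
thetas = [0.5, 0.3, 0.2]
s_gibbs = gibbs_entropy(6, thetas)
print(s_gibbs)
```

Exhaustive enumeration scales combinatorially with the number of particles and species, so for realistic systems the average would have to be estimated by sampling instead.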

Adding confusion to the definition of entropy is the related microstate relationship,

$$S_B = A/k_{\rm B}T - N_{\rm total} \left\langle E/k_{\rm B}T \right\rangle_U \tag{5}$$

where now ⟨*E*/*k*<sub>B</sub>*T*⟩<sub>*U*</sub> is the average energy of the microstate under the uniform distribution instead of the Boltzmann distribution. The entropy term is also given by *S* = −Σ<sub>*j*</sub> *p*<sub>*j*</sub> log *p*<sub>*j*</sub> where again the probabilities *p*<sub>*j*</sub> = *n*<sub>*j*</sub>/*N*<sub>total</sub> are from the uniform distribution (Davidson, 1962; Cannon, 2014). The subscript indicates that this is the Boltzmann entropy because it is derived from log *W* where *W* is the multinomial coefficient. This entropy is also sometimes referred to as the configurational entropy (Davidson, 1962). The difference between the Gibbs and Boltzmann entropies of course has to do with intermolecular potentials and microscopic vs. macroscopic perspectives (Jaynes, 1965).
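The link between log *W* and −Σ *p*<sub>*j*</sub> log *p*<sub>*j*</sub> can be illustrated numerically. By Stirling's approximation, log *W* ≈ −*N*<sub>total</sub> Σ *p*<sub>*j*</sub> log *p*<sub>*j*</sub> for large counts; the sketch below compares the two quantities for a hypothetical microstate. The counts and function names are assumptions for illustration.

```python
import math

def log_w(counts):
    """Exact log of the multinomial coefficient W = N_total! / prod_j n_j!."""
    n_total = sum(counts)
    return math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in counts)

def n_times_shannon(counts):
    """-N_total * sum_j p_j log p_j with p_j = n_j / N_total (empty species skipped)."""
    n_total = sum(counts)
    return -sum(n * math.log(n / n_total) for n in counts if n > 0)

# Hypothetical microstate with 1000 particles distributed over 3 species.
counts = [500, 300, 200]
exact = log_w(counts)
approx = n_times_shannon(counts)
print(exact, approx, (approx - exact) / exact)
```

The exact value never exceeds the approximation, and the relative gap shrinks as the counts grow, which is why the two forms are often used interchangeably for macroscopic particle numbers.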

When the total number of particles is not fixed, adjustments need to be made to the equations above. Typically, the adjustment is to remove the normalization of the Boltzmann probabilities in Eq. 1, such that the resulting quantity *e*<sup>−*A*/*k*<sub>B</sub>*T*</sup> is an unnormalized probability mass function, or an odds of *e*<sup>−*A*/*k*<sub>B</sub>*T*</sup> : 1. The multinomial probability distribution now becomes a multinomial odds distribution, the main difference being that a probability mass function over all of state space sums to 1, while the new multinomial distribution sums to a value >1.

If the total number of particles is allowed to vary due to the system being open, then Eq. 4 gives

$$S_{G} = \langle A - N_{\text{total}}(J) \log q \rangle$$

Notice that this definition is different from one common thermodynamic definition of entropy, which defines entropy as the difference between the free energy and the average energy,

$$\begin{aligned} S &= \langle A \rangle - \langle E \rangle \\ &= \langle A \rangle - \langle N_{\text{total}}(J) \log q \rangle \end{aligned}$$

Since we know from the triangle inequality, ‖log *x* − log *y*‖ ≥ ‖log *x*‖ − ‖log *y*‖, it follows that *S*<sub>G</sub> ≥ *S*.

For a set of coupled reactions such as,

$$A \rightleftharpoons B \rightleftharpoons C$$

a change of the microscopic state from *K* to *J* is described by the likelihood ratio,

$$-\Delta S_{g,JK} = \log \left( \frac{\Pr \left( J \right)}{\Pr \left( K \right)} \right), \tag{6}$$

or equivalently,

$$\frac{\Pr\left(J\right)}{\Pr\left(K\right)} = e^{-\Delta S_{g,JK}} \tag{7}$$

which has the basic mathematical form of a fluctuation theorem, but in this case is an identity due to the definition of *S<sup>g</sup>* in Eq. 3. If we average over all states *J* and *K* and the system is at equilibrium,

$$\begin{aligned} \left\langle \frac{\Pr\left(J\right)}{\Pr\left(K\right)} \right\rangle &= \left\langle e^{-\Delta S_{g,JK}} \right\rangle \\ &= 1 \end{aligned} \tag{8}$$

where the angular brackets denote an equilibrium average. The average value is unity since the log likelihood of Eq. 6 is zero, on average. Relation 8 simply says that, on average, the system returns to equilibrium. While Eq. 7 is exact for microscopic processes, the challenge in employing it to model time-dependent processes is that the core probabilities available for use in Eq. 1 are stationary Boltzmann probabilities. If the individual rates of the reactions vary enough in a system of coupled reactions, the core probabilities will not be Boltzmann probabilities, which are based solely on the energy levels of the reactants and products. At equilibrium, Eq. 7 can be used for time-dependent probabilities because of detailed balance (Eq. 8). However, away from equilibrium, Eq. 7 no longer holds because detailed balance no longer exists. Instead, the true probabilities will be a function of the entire energy surface of the system, including the reaction barriers. Fluctuation theorems relate the ratio of these time-dependent probabilities to a function that is related to the time-dependent Δ*S*<sub>g</sub>(*t*), or if ensemble averages are used, the time-dependent Δ*S*<sub>G</sub>(*t*).
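The statement that the log likelihood of Eq. 6 is zero on average can be illustrated with a small numerical sketch. The Python example below is an assumption-laden toy: the four energy levels are hypothetical, and the Metropolis-style transition matrix is simply one convenient way to build dynamics that obey detailed balance, not the scheme used in the cited studies. It averages log(Pr(*J*)/Pr(*K*)) over transitions *K*→*J* drawn from the stationary Boltzmann distribution.

```python
import math

def boltzmann(energies):
    """Normalized Boltzmann probabilities for energies given in units of k_B*T."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def metropolis_matrix(p):
    """Transition matrix (uniform proposals, Metropolis acceptance) that obeys
    detailed balance with respect to the distribution p."""
    n = len(p)
    t = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            if j != k:
                t[k][j] = min(1.0, p[j] / p[k]) / (n - 1)
        t[k][k] = 1.0 - sum(t[k])  # remaining probability: stay in state k
    return t

def mean_log_ratio(p, t):
    """Average of log(Pr(J)/Pr(K)) over transitions K -> J sampled at stationarity."""
    return sum(
        p[k] * t[k][j] * (math.log(p[j]) - math.log(p[k]))
        for k in range(len(p))
        for j in range(len(p))
    )

energies = [0.0, 0.5, 1.2, 2.0]  # hypothetical energy levels, in k_B*T
p = boltzmann(energies)
t = metropolis_matrix(p)
print(mean_log_ratio(p, t))  # zero up to floating-point rounding
```

The average vanishes because the distribution of states before and after a transition is the same at stationarity, so the two logarithmic terms cancel.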

For example, at a non-equilibrium steady state the average fluctuations of a system can still be characterized at times without knowing the actual probabilities of each state. Consider the fluctuation away from a steady state *J* to the new state *K* with some transition probability. We know that the system will eventually return to the steady state *J*, we just do not know specifically how. For the most part, a fluctuation away from the steady state will be along the direction of the non-equilibrium driving force. When the system returns to the steady state, an amount of energy will have been dissipated from the system. Note that if the system were to return to the steady state along the same path, no energy would have been dissipated; that is, the average likelihood of returning along the same path is not 1 as in the case for equilibrium (Eq. 8). Thus, fluctuation theorems for non-equilibrium steady state take the form,

$$\Omega = \left\langle \log \left( \frac{\pi_{\rm JK} \left( t \right)}{\pi_{\rm KJ} \left( t \right)} \right) \right\rangle_{\rm J, K} \tag{9}$$

where π<sub>*KJ*</sub> is the probability of trajectory *J*→*K*, and Ω is related to the dissipation of energy due to the non-equilibrium steady state. For instance, the Evans–Searles fluctuation theorem relates the time-dependent probabilities to a trajectory-specific dissipation function, Ω(*t*), which is a measure of how far the system is away from detailed balance,

$$\frac{\pi_{\rm KJ}\left(\Omega\left(t\right)=-q_{\rm D}/k_{\rm B}T\right)}{\pi_{\rm JK}\left(\Omega\left(t\right)=q_{\rm D}/k_{\rm B}T\right)} = e^{-q_{\rm D}/k_{\rm B}T} \tag{10}$$

If *q*<sub>D</sub> represents the dissipated energy due to the lack of detailed balance, then the odds of regaining that energy through a reversal of the trajectory are exponentially small. One could even think of the RHS of Eq. 10 as representing the energy of a hypothetical particle (a "dissipation") that has a Boltzmann factor of *e*<sup>−*q*<sub>D</sub>/*k*<sub>B</sub>*T*</sup>. Developments in fluctuation theories over the last two decades (reviewed by Sevick et al., 2008; Seifert, 2012) have pushed the envelope into the far-from-equilibrium domain. Many biochemical reactions are in this domain.
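To give a feel for how quickly the odds in Eq. 10 fall off, the following Python sketch evaluates the Boltzmann factor for a few dissipated energies. It assumes molar units (the gas constant *R* in place of *k*<sub>B</sub>) and a temperature of 298.15 K; the function name and the energy values are illustrative choices, not values from the text.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)

def reversal_odds(q_d_kj_per_mol, temperature_k=298.15):
    """Odds exp(-q_D/RT) of a trajectory that regains the dissipated energy q_D."""
    return math.exp(-q_d_kj_per_mol / (R_KJ_PER_MOL_K * temperature_k))

for q_d in (1.0, 10.0, 20.0):  # hypothetical dissipated energies, kJ/mol
    print(q_d, reversal_odds(q_d))
```

Because the dependence is exponential, doubling the dissipated energy squares the odds of a reversal.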

#### **ENTROPY PRODUCTION**

When the time-dependent flux of material through reactions can be determined, the entropy production rate can be defined in several related ways (Oster et al., 1973; Ge et al., 2006; Ge and Qian, 2010). Using Eq. 6, the microscopic entropy production can be defined for a reaction *i* in the + direction as,

$$\text{microscopic entropy production rate} = J_{\rm i+} \, \Delta S_{\rm g,i}$$

and the net entropy production through the reaction is *J*<sub>i,net</sub>Δ*S*<sub>g,i</sub>, where *J*<sub>i,net</sub> = *J*<sub>i+</sub> − *J*<sub>i−</sub>. Taking the ratio of the entropy production due to the forward and the reverse reaction, the odds of entropy being produced at reaction *i* are,

$$\begin{split} O\left(\Delta S_{\rm g,i}\right) &= \frac{J_{\rm i+} \cdot \Delta S_{\rm g,i}}{J_{\rm i-} \cdot \Delta S_{\rm g,i}} \\ &= \frac{J_{\rm i+}}{J_{\rm i-}} \end{split} \tag{11}$$

Although the ratio of the forward and reverse flux gives us the odds of thermodynamic entropy production, the ratio itself cannot tell us the value of the thermodynamic entropy change or even if the entropy change is positive or negative; in coupled systems the flux through any specific reaction is not deterministically related to the entropy or free energy change of that reaction. The second law of thermodynamics only tells us that for macroscopic processes, the entropy must always increase; the second law does not address what might be happening on the microscopic level in individual reactions. This is an important aspect of stochastic systems: even though a reaction has a free energy change above zero or equivalently an odds below one, it can still occur given enough time. For example, if a set of coupled reactions has a large enough overall favorable change in free energy, an individual reaction can have a net positive flux even if the reaction free energy is unfavorable. Flux is an emergent property of the entire system. However, as indicated by the fluctuation theorems, the less likely the reaction, the less likely it will have a net flux in the direction of decreasing entropy change.

Several studies have asserted that the relationship between flux and free energy is Δ*G* = −*RT* log(*J*<sup>+</sup>/*J*<sup>−</sup>). This relationship was originally proposed in discussions of reversible systems and discussed in the context of deterministic kinetics (Beard and Qian, 2007). For coupled, stochastic non-equilibrium reactions, the relationship is strictly speaking an assumption. However, it is reasonable to expect in the vast majority of situations that Δ*G* and −*RT* log(*J*<sup>+</sup>/*J*<sup>−</sup>) are concordant. The relationship can be used to gain insight if used carefully. For instance, Noor et al. (2014) have used the assumption as a framework for evaluating flux statistics at individual reactions. They correctly pointed out that reactions near equilibrium act as kinetic bottlenecks in pathways that are overall far from equilibrium. This is a valid use of the assumption in that reactions at equilibrium in an otherwise nonequilibrium system are those for which the relation is approximately correct even for stochastic systems.
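As a concrete reading of the flux-force relation and of Eq. 11, the Python sketch below converts a pair of one-way fluxes into a free energy change and into the odds of entropy production. The function names, the default temperature of 298.15 K, and the example fluxes are assumptions made for illustration; as noted above, the relation itself is an assumption for coupled stochastic systems.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)

def delta_g_from_fluxes(j_forward, j_reverse, temperature_k=298.15):
    """Delta G = -RT log(J+/J-), in kJ/mol, under the flux-force assumption."""
    if j_forward <= 0 or j_reverse <= 0:
        raise ValueError("both one-way fluxes must be positive")
    return -R_KJ_PER_MOL_K * temperature_k * math.log(j_forward / j_reverse)

def entropy_production_odds(j_forward, j_reverse):
    """Odds of entropy production at a reaction, J+/J- (Eq. 11)."""
    return j_forward / j_reverse

# Hypothetical one-way fluxes in arbitrary but identical units.
print(delta_g_from_fluxes(10.0, 1.0), entropy_production_odds(10.0, 1.0))
```

A reaction with equal forward and reverse fluxes gives Δ*G* = 0, matching the near-equilibrium case discussed for Noor et al. (2014).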

So far the question of how to find the steady states has been left open. A steady state could be determined by the textbook approach of solving the set of differential rate equations. However, for biological systems the required rate parameters are rarely available. In principle, a steady state can be defined based on experimental measurement of all relevant chemical species, which can be used to define the chemical potential of each species. While this task is much easier than determining all the appropriate rate constants, it is still formidable. Yet, significant progress is being made (Bennett et al., 2009).

Alternatively, one can assume that the steady state is one that corresponds to an optimal thermodynamic process. A thermodynamically optimal process is one in which a maximal amount of energy can be extracted from the environment with a minimal amount of dissipation of heat (Sivak and Crooks, 2012). Equivalently, a thermodynamically optimal path is one that requires the least work to maintain the steady state. In either case, the thermodynamically optimal steady state can be found by maximizing a steady state version of Eq. 4 in which the Gibbs entropy *S*<sub>G</sub> in a state space neighborhood Γ measures the probability density of states reachable from an initial state S due to a series of *Z* reactions involving a change of state δS<sub>*i*</sub> (Cannon, 2014),

$$S_{g}\left(\Gamma\left(\mathbb{S}\right)\right) = -\sum_{\text{Rxn } i=1}^{Z} \Pr\left(\mathbb{S}_{i-1} + \delta \mathbb{S}_i\right) \log \Pr\left(\mathbb{S}_{i-1} + \delta \mathbb{S}_i\right) \tag{12}$$

In a system moving toward equilibrium through a trajectory of *Z* reactions, the state entropy increases as the system stabilizes, and reaches a maximum at equilibrium since equilibrium requires that each respective reaction is equally likely. In a non-equilibrium system, the neighborhood Γ is a reaction path and Eq. 12 is the path entropy described by Dewar, from which the fluctuation theorem, the selection principle of maximum entropy production, and self-organized criticality can be derived (Dewar, 2003). An analogous Gibbs entropy can be defined by averaging *S*<sub>g</sub>[Γ(S)] over many trajectories such that *S*<sub>G</sub>[Γ(S)] = ⟨*S*<sub>g</sub>[Γ(S)]⟩. If the entropy change from equilibrium is Δ*S*<sub>G</sub>(Γ(S)) = *S*<sup>0</sup><sub>G</sub> − *S*<sub>G</sub>(Γ(S)), then the rate of production of thermodynamic entropy can be defined as,

$$\text{thermodynamic entropy production rate} = J_{\rm net}(\Gamma) \, \Delta S_{G}(\Gamma(\mathbb{S}))$$
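The claim that the entropy of Eq. 12 is largest when every reaction is equally likely can be checked with a short Python sketch. It assumes that the reaction probabilities in the neighborhood are normalized to sum to 1; the function name and the two example distributions are hypothetical.

```python
import math

def state_entropy(reaction_probs):
    """-sum_i Pr_i log Pr_i over the reactions in the neighborhood (form of Eq. 12)."""
    if abs(sum(reaction_probs) - 1.0) > 1e-9:
        raise ValueError("reaction probabilities are assumed to be normalized")
    return -sum(p * math.log(p) for p in reaction_probs if p > 0)

equally_likely = [0.25, 0.25, 0.25, 0.25]
one_dominant = [0.70, 0.10, 0.10, 0.10]  # hypothetical, far from equally likely
print(state_entropy(equally_likely), state_entropy(one_dominant))
```

For *Z* reactions the maximum value is log *Z*, reached only by the uniform distribution.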

While it is likely that no individual organism is at the apex of thermodynamic optimality, it is also likely, as discussed in the section "Introduction," that natural selection is at some fundamental level based on filtering out individuals that are thermodynamically inefficient such that too little energy is extracted from the environment or too much of the extracted energy is simply dissipated back to the environment; such a system would not be able to channel sufficient energy into growth to compete against more efficient individuals. In this scenario of natural selection, thermodynamically optimal steady states would serve as useful models.

#### **Applications**

Beyond atomistic simulations, the application of statistical thermodynamics and fluctuation theory to biological systems is truly a frontier. To date, applications are mostly in the physics literature and include (but are not limited to) the study of molecular motors, mostly ATP synthase (Andrieux and Gaspard, 2006; Hayashi et al., 2010; Zimmermann and Seifert, 2012), small metabolic networks (Cannon, 2014), bifurcation dynamics of reaction pathways (Xiao et al., 2009), and models of the response of bacteria to changes in the environment (Barato et al., 2014). These examples were chosen to represent a hierarchy of scales in which statistical thermodynamic simulations have been applied to biology. Because the dynamics of each system are represented using different equations, it is not possible to describe in detail the form of the fluctuation theorem used other than to say that all are in some way represented by Eq. 9, except where noted. Details on the theorems are best obtained from the original literature. Below, we briefly summarize the findings for this representative selection from the literature.

#### **SINGLE MOLECULE DYNAMICS OF ATPase F1 ROTARY MOTOR**

The F<sub>0</sub>F<sub>1</sub>–ATP synthase complex is an example of a highly nonequilibrium nanomotor. The rotary motor of F<sub>0</sub>F<sub>1</sub>–ATP synthase is powered by proton flow across a gradient producing a free energy difference of 10–20 kJ/mol of protons. This free energy difference is significantly greater than the ambient energy at room temperature of about 2.45 kJ/mol. The motor operates over a large range of scales; rate constants for the various processes making up the motor vary over 12 orders of magnitude. Andrieux and Gaspard used fluctuation theory and generating functions to evaluate statistical distributions of mean rotation of the F<sub>1</sub> rotor, the dissipated work, and the probability flux across the system (Andrieux and Gaspard, 2006). The analysis showed that the ATPase motor has a highly non-linear response to chemical fuel: the mean velocity of the F<sub>1</sub> rotor as a function of the thermodynamic driving force is a sigmoid-like curve. Despite the microscopic nature of the motor, the operation of the motor is highly robust in this nonlinear regime: successive rotations are statistically correlated and remain essentially unaffected by the fluctuations. Nevertheless, it was shown that the fluctuation theorem held even in the highly non-linear regime.

#### **MULTIPLE MOLECULES: PATHWAY BIFURCATION DYNAMICS OF A CIRCADIAN CLOCK**

When multiple reactions are coupled, non-intuitive behavior can result. The Lotka–Volterra oscillator and the Brusselator are famous early examples where feedback or feed-forward interactions control the oscillatory behavior. At the cell level, an important oscillatory phenomenon is the circadian clock of organisms as diverse as fruit flies and fungi. In the circadian clock, negative feedback controls the rate of transcription and translation of specific proteins that in turn dictate the cellular circadian oscillation cycle (Dunlap, 1999).

Using a stochastic thermodynamics approach pioneered by Seifert and colleagues, Xiao et al. (2009) used a chemical Langevin equation to evaluate dynamic bifurcations that occur in the circadian clock. An explicit expression for the mean entropy production in the stationary state was formulated based on available kinetic data. On either side of the bifurcation in the circadian dynamics, the shape of the distribution of the entropy production was similar and highly skewed such that the probability of observing dynamics with negative entropy production was quite small. Thus, like the F1 motor of ATP synthase, the operation of the molecular circadian clock studied by Xiao et al. is robust despite the stochastic nature of small systems.

Although the time dependence of the entropy production in the fluctuation theorem used in this study ultimately came from rate constants, the approach demonstrated that statistical thermodynamic simulations are capable of producing similar bifurcation dynamics as stochastic kinetic simulations. Understanding the entropy production rates of metabolism is important for quantitating the capacity for organisms to adapt to their changing environment, which is discussed next.

#### **CELLULAR INFORMATION PROCESSING AND ADAPTATION**

Philosophically, one can adopt either of two opposing perspectives about the relationship between simple biological systems such as bacteria and their environment. One can take the perspective that cells make decisions based on their external environment, which is the most discussed perspective in the literature, or one can take the perspective that the external environment determines cellular response. While the former perspective imbues autonomy to the cell, the latter perspective takes the view that regulation is ultimately a function of the external environment. Who is driving – the cell or the environment? While the former perspective is correct on short time periods such as the diurnal cycles, the latter perspective is more correct on longer time periods over which the cell has adapted and evolved.

Barato et al. (2014) evaluated models of how much information cells can extract from their environment based on their thermodynamic efficiency. Although Barato et al. use the metaphor of learning for the ability to extract information, one is equally justified in using the concept of self-organization. The study found that the degree to which a cell can self-organize in response to the environment is bounded by the thermodynamic entropy production rate. A bacterium in a slowly changing environment dissipates much more energy than it harnesses for the purpose of self-organization. That is, the bacterium, once organized to respond to a particular environment, has a limited ability to further harness energy from the environment for further adaptation.

Although Barato et al. (2014) used quite simple physical models to generate hypotheses, clearly coupling this framework with more extensive thermodynamic models of metabolism has the potential to provide insight into how cells respond internally to changes in environmental driving forces on both short time scales and longer evolutionary time scales. However, modeling efforts will require more sophisticated models of metabolism in order to understand the multitude of paths that cell behavior can take. Next, early efforts that have been taken to expand the application of statistical thermodynamics to more detailed metabolic models are discussed.

#### **DETAILED METABOLIC MODELS**

The models and systems discussed above are small systems compared to the metabolism of even the smallest bacterium. Can statistical thermodynamics and fluctuation theories also be applied to more extensive biological systems such as genome-scale models of metabolism? The issue mostly pertains to whether sufficient parameters can be estimated. Large-scale estimates of thermodynamic parameters are available from sources such as the Biochemical Reactions Thermodynamics Database at University of Michigan (Li et al., 2011), the Thermodynamics of Enzyme-Catalyzed Reactions Database at NIST (Goldberg et al., 2004), and the eQuilibrator web server (Flamholz et al., 2012).

We have been developing such an approach and to date have applied it to relatively small metabolic pathways of various bacteria (Cannon, 2014). In these initial studies, the reaction rates are assumed to be proportional to the thermodynamic driving force of the reaction, which is quantified by a probability of a reaction in a Markov model based on Eq. 7.
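One way to picture a Markov model built on the likelihood ratio of Eq. 7 is sketched below in Python for the chain A ⇌ B ⇌ C. This is a toy under stated assumptions and not necessarily the implementation used in Cannon (2014): the odds of each reaction are taken as Pr(*J*)/Pr(*K*) for a multinomial Boltzmann distribution, which reduces to a Boltzmann factor times the count ratio *n*<sub>reactant</sub>/(*n*<sub>product</sub> + 1), and the selection probabilities are obtained by normalizing the odds over all possible reactions. The counts, energies, and function names are hypothetical.

```python
import math

# Reversible steps of the chain A <=> B <=> C as (reactant, product) pairs.
REACTIONS = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")]

def reaction_odds(counts, energies, reactant, product):
    """Pr(J)/Pr(K) for converting one reactant molecule into product under a
    multinomial Boltzmann distribution; energies are in units of k_B*T."""
    boltzmann_ratio = math.exp(-(energies[product] - energies[reactant]))
    return boltzmann_ratio * counts[reactant] / (counts[product] + 1)

def reaction_probabilities(counts, energies):
    """Normalize the odds of all possible reactions into selection probabilities."""
    odds = {rxn: reaction_odds(counts, energies, *rxn) for rxn in REACTIONS}
    total = sum(odds.values())
    return {rxn: value / total for rxn, value in odds.items()}

counts = {"A": 90, "B": 5, "C": 5}           # hypothetical molecule counts
energies = {"A": 0.0, "B": 0.0, "C": -1.0}   # hypothetical energies, in k_B*T
probs = reaction_probabilities(counts, energies)
print(max(probs, key=probs.get))
```

With equal counts and equal energies all four reactions come out equally likely, which is the condition the text associates with maximal state entropy.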

Initial studies have focused on the tricarboxylic acid (TCA) cycles of bacteria. These cycles are central to the metabolism of most organisms and may be as close to a universal pathway as there is (Smith and Morowitz, 2004). TCA cycles are capable of consuming acetyl-CoA to either produce high energy compounds necessary for cell function (e.g., ATP, NADPH) or carbon backbones that serve as synthetic precursors for many reactions of secondary metabolism and amino acid and nucleic acid synthesis.

Shown in **Figure 1** is the TCA cycle of *E. coli* and in **Figure 2** are the free energy, energy, and entropy profiles under metabolic conditions observed for exponential growth on glucose (Bennett et al., 2009). The cycle was simulated using a statistical thermodynamics formulation of a Markov model based on a local equilibrium assumption (Cannon, 2014). As one proceeds from acetyl-CoA clockwise around the cycle to oxaloacetic acid, the free energy change across the reactions (**Figure 2**) varies smoothly, as one would expect from a maximum entropy perspective (Eq. 12). However, the change for the conversion of oxaloacetate and acetyl-CoA to citrate catalyzed by citrate synthase and the change for the conversion of 2-oxoglutarate to succinyl CoA catalyzed by 2-oxoglutarate dehydrogenase are somewhat abrupt compared to changes at the other reactions of the cycle. The reason for this is that the cofactor concentrations, which serve as boundary conditions, are held fixed at values that prevent the system from relaxing further. As a result, the system is not quite thermodynamically optimal: the entropy defined by Eq. 12 is not quite maximal compared to the value that would be obtained if each reaction were equally likely.

Clearly, information about the thermodynamics of biosynthetic pathways is important for engineering metabolism to overproduce target compounds such as reduced carbon compounds for biofuels. While much attention has been directed at redirecting carbon flow by knocking out pathways competing for precursors, less attention has been directed at engineering redox pairs such as NADH:NAD<sup>+</sup> levels that would thermodynamically drive these reactions. Likewise, much attention has focused on the use of riboswitches to up-regulate the production of enzymes involved in the biosynthesis of target compounds (Wittmann and Suess, 2012), but switching on the catalytic machinery to synthesize a compound is not useful unless the thermodynamics of the pathway are favorable. Modeling metabolic systems thermodynamically would be of enormous value for metabolic engineering.

As an example of the potential use of statistical thermodynamics for both engineering and understanding organisms in the context of their natural habitats, we compared three different versions of the TCA cycle used in three very different ecological niches: a typical heterotrophic TCA cycle from *E. coli* involved in extracting energy and biosynthetic precursors from glucose; the cyanobacterial TCA cycle of *Synechococcus* sp. PCC 7002, which is required to produce biosynthetic precursors despite already high levels of ATP from photosynthesis; and the TCA cycle of *Chlorobium tepidum*, a green sulfur bacterium that also must produce biosynthetic precursors in the presence of photosynthesis and simultaneously fix CO<sub>2</sub>, which it does by running the TCA cycle in the reductive direction. As above, each TCA cycle was simulated using a Markov model based on a local equilibrium assumption. The free energy profiles for these organisms are shown in **Figure 3**. Clearly, each pathway is very different thermodynamically. The cycles for *E. coli* and *Synechococcus* have similar profiles except for the conversion of 2-oxoglutarate to succinate. In the *E. coli* TCA cycle, this reaction has ATP as a product. *Synechococcus* and other cyanobacteria cannot use the same reaction for converting 2-oxoglutarate to succinate because their cycles must operate in an environment in which ATP concentrations are quite high due to concomitant photosynthesis. Instead, the cyanobacteria use a TCA cycle that employs a ferredoxin coenzyme for this conversion, and thus high levels of ATP do not retard the production of succinate and other carbon compounds that are necessary for growth. The free energy profile of the TCA cycle for *Chlorobium* is very different from both the *E. coli* and *Synechococcus* cycles. Instead of having a highly favorable free energy profile for operation in the oxidative direction (citrate→oxaloacetate), the free energy changes are highly unfavorable. The TCA cycle of *Chlorobium* and other green sulfur bacteria, in fact, runs in the opposite direction (oxaloacetate→citrate), and these organisms use the cycle to fix CO<sub>2</sub> and produce acetyl-CoA. Not only does a thermodynamic model allow us to understand each organism in its environment, but clearly designing an optimal pathway for metabolic engineering using statistical thermodynamics would be very useful.

In comparing the free energy profiles for *E. coli* in **Figures 2** and **3**, it is clear that they differ significantly. In **Figure 2**, the free energy profile changes relatively smoothly as one traverses the cycle, while in **Figure 3** the free energy profile changes abruptly at times. The reason for these differences has to do with the conditions used in the respective simulations. In **Figure 2**, the simulations used the published experimentally measured values for *E. coli* (Bennett et al., 2009). In **Figure 3**, the level of each intermediate in the cycle was initially set to ~20 µM instead of using the published experimental values for *E. coli* (Bennett et al., 2009), which otherwise might bias the comparison between the three organisms. Although each cycle is materially open in that two carbons come in as acetyl-CoA and carbons leave as CO<sub>2</sub>, the total number of intermediates is fixed by the stoichiometry of the overall reaction for completion of the cycle. For *E. coli*, the overall stoichiometry is,

$$\begin{aligned} &\text{Acetyl-CoA} + \text{ADP} + 3\,\text{NAD}^+ + \text{P}_{\text{i}} + \text{Q} + 2\,\text{H}_2\text{O} \\ &\quad \rightleftharpoons \text{CoA} + \text{ATP} + 3\,\text{NADH} + 2\,\text{CO}_2 + \text{QH}_2, \end{aligned}$$

where Q and QH<sub>2</sub> represent oxidized and reduced electron carriers, respectively. Although the cycles are open, the sum of the count of all intermediates will only vary by ±1.

The free energy profiles of the *E. coli* TCA cycle as a function of the total concentration of the intermediates are shown at the top of **Figure 4**. The total concentration values are 1.0-fold, 0.1-fold, 0.01-fold, and 0.001-fold of the values reported by Bennett et al. (2009). If there are only a few total intermediates, then these will be transformed into the metabolites with lowest chemical potentials, which in the case of the *E. coli* TCA cycle are citrate and succinyl CoA. At very low levels of intermediates, the cycle will not operate and citrate and succinyl CoA will simply pool. For the lowest level of intermediates, there will be flux through the entire cycle only over relatively long time periods.

As the total number of metabolic intermediates is raised, the number of citrate and succinyl CoA molecules increases, as shown in **Figure 4** (bottom). Eventually, product builds up as well with a concomitant increase in the free energies of reactions producing citrate and succinyl CoA. Meanwhile, the increase in citrate decreases the free energy for the citrate to isocitrate reaction, and likewise, the increase in succinyl CoA decreases the free energy for the succinyl CoA to succinate reaction.

Eventually, metabolite levels build up to the point where all reactions become equally likely, in agreement with Eq. 12. This is the thermodynamically optimal configuration since the state entropy (Eq. 12) has been maximized with respect to the non-equilibrium boundary conditions.

However, for the cell there is also a thermodynamic penalty to obtain this configuration. In order to handle a greater number of reactants, the enzymatic load on the cell must likewise increase. The self-organized structures needed to dissipate energy rapidly (or store the harvested energy for growth) must be paid for by the non-equilibrium driving forces.

Enzymes catalyzing reactions far from equilibrium will need to increase the least since material flow is unidirectional. This is clearly the case for the enzyme-catalyzed reactions for transformation of oxaloacetate to citrate and 2-oxoglutarate to succinate: as the total metabolite pool increases, the concentrations of the reactants oxaloacetate and 2-oxoglutarate do not change markedly.

If enzymes near equilibrium are expressed at a level just sufficient to catalyze their current load, then increasing the total pool of metabolites may require increased expression of these enzymes. However, these reactions are not likely to remain at equilibrium. This is apparent in **Figure 4** (top), in which the last four enzyme-catalyzed reactions of the TCA cycle, transforming succinyl CoA to oxaloacetate, are close to equilibrium when the total pool of metabolites is 0.001-fold of the values reported by Bennett et al. (2009). As the total metabolite pool grows, the reactions do not remain at equilibrium.

**FIGURE 4 | (Top) The cumulative free energy profile of the E. coli TCA cycle as a function of the total concentration of the reaction intermediates**. Although carbon can enter the cycle as acetyl-CoA and leave as CO<sub>2</sub>, the total number of intermediates is constrained by the overall reaction (see text). The concentrations used are 1-fold, 0.1-fold, 0.01-fold, and 0.001-fold of those reported by Bennett et al. (2009) for exponential growth on glucose. (Bottom) The distribution of reaction intermediates as a function of total concentration.

When metabolite levels are greater than the respective Michaelis constant (*K*<sub>M</sub>), then enzyme levels need to increase in order to maintain a steady state. This is the situation described by Flamholz et al. (2013). That enzymes catalyzing reactions far from equilibrium do not increase significantly has been experimentally observed; the degree to which enzyme expression will need to increase for reactions near equilibrium will be situation dependent but generally will need to increase with increased flux (Hochachka et al., 1998).
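The saturation argument can be made concrete with a minimal Python sketch. It assumes simple irreversible Michaelis–Menten kinetics, which is a simplification not stated in the text, and the kinetic constants and metabolite levels are hypothetical. Once the substrate level is well above *K*<sub>M</sub>, the enzyme needed for a given flux stops falling as substrate rises, so any further increase in flux has to come from more enzyme.

```python
def michaelis_menten_rate(enzyme, substrate, k_cat, k_m):
    """Irreversible Michaelis-Menten rate v = k_cat * E * S / (K_M + S)."""
    return k_cat * enzyme * substrate / (k_m + substrate)

def enzyme_required(flux, substrate, k_cat, k_m):
    """Enzyme level needed to carry a given flux at a given substrate level."""
    return flux * (k_m + substrate) / (k_cat * substrate)

K_CAT, K_M = 100.0, 0.1  # hypothetical turnover number (1/s) and K_M (mM)
for substrate in (0.01, 0.1, 1.0, 10.0):  # hypothetical metabolite levels, mM
    print(substrate, enzyme_required(1.0, substrate, K_CAT, K_M))
```

In the saturated limit the required enzyme approaches flux/*k*<sub>cat</sub>, so doubling the flux doubles the enzyme demand.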

Moreover, if the turnover rates for the enzymes in the pathway differ dramatically, then there must also be a differential level of expression of the enzymes in the pathways. It would make sense for the organism to have high intrinsic enzyme turnover rates for costly enzymes, either those that have many amino acids or require high energy co-factors, such that the thermodynamic cost to the cell can be minimized (Flamholz et al., 2013).

Considering **Figure 4** (top), the data reported by Bennett et al. (2009) imply that the TCA cycle of the laboratory strain of *E. coli* is operating near optimal efficiency with regard to Eq. 12 during exponential growth on glucose. In Lotka's words, in "the struggle for existence, the advantage must go to those organisms whose energy-capturing devices are most efficient in directing available energy into channels favorable to the preservation of the species."

How close are biological systems to optimal efficiency? There appear to be situations when this ideal is not achieved. For example, if glycolysis were left unchecked such that each reaction were equally likely thermodynamically, then the large free energy change for conversion of fructose 6-phosphate to fructose 1,6-bisphosphate would result in cellular concentrations of fructose 1,6-bisphosphate several orders of magnitude higher than is observed, which would most likely have detrimental effects on the cell. In fact, the enzyme catalyzing this step is highly regulated to prevent overproduction of fructose 1,6-bisphosphate. The regulation can be regarded as a self-organized and emergent property of the pathway, and one that is necessary for the organism to remain viable. Considering the framework for adaptation laid out by Barato et al. (2014), this would imply that for *E. coli* species that are adapted to growth on high levels of glucose, there are very few opportunities for learning alternative ways of regulating this enzyme, or conversely, that the regulatory circuit is evolutionarily stable in this regard.

#### **Future Directions**

Determining a rate constant for an enzyme of interest is a straightforward task if the reactant or product has a distinct spectroscopic signature. However, scaling the process up to obtain all of the rate constants necessary for large-scale simulations of metabolism of any specific organism is simply not feasible. Mixing and matching rate constants from orthologous enzymes from different species can result in incorrect energetics, unless one constrains the rate constants to match the equilibrium constant for the same reaction. Moreover, *ad hoc* adjustment of a rate constant to obtain the correct equilibrium constant is likely not better than assuming rates are proportional to the thermodynamic driving force. As a result of the difficulty in obtaining rate constants, constraint-based flux models have been the method of choice for large-scale modeling of biological processes such as metabolism. However, constraint-based methods at best use the thermodynamic constraints to narrow down the solution space. Unfortunately, this limits the predictive power of these approaches.

Several promising and fundamentally sound approaches that include proper thermodynamics have been proposed to move beyond constraint-based flux modeling. One approach is to model systems using mass action kinetics for those reactions for which rate parameters are available, and to use constraint-based flux modeling of other reactions (Chowdhury et al., 2014). In this case, the fluxes modeled using mass action kinetics limit the range of fluxes that are possible for those reactions modeled with constraint-based flux modeling.

A second approach is to use available kinetic parameters where one can, and then infer the remaining parameters based on prior knowledge, including balancing rate parameters to ensure that the correct thermodynamics are obtained (Stanford et al., 2013). An alternative is to reduce the kinetic complexity of the rate equation of each reaction based on analysis of the reaction likelihood as a function of the net flux of the reaction (Canelas et al., 2011). For some reactions, the rate parameters can be eliminated altogether and replaced by the thermodynamic likelihood of the reaction without compromising the fidelity of the model.

Finally, if one knows the reaction directionality, such as from an experimentally based metabolic flux analysis, then a set of feasible metabolite concentrations and reaction free energies can be determined using optimization methods (De Martino et al., 2012). The ability to map out the energy landscape of metabolism could be very powerful and could inform us on whether the conjectures by Lotka, Odum, and others about natural selection discussed in the section "Introduction" are correct. The criteria used by De Martino et al. may actually be too stringent in that the optimization constraints required that the entropy production for each reaction be positive. As indicated in the section "Discussion" around Eq. 11, the second law only requires that the entropy production for the overall macroscopic process be positive. An individual reaction may have a positive flux and also a positive free energy change, but the chance of such an event decreases exponentially with increases in the free energy (Evans and Searles, 1994). The analysis requires the input of flux configurations or reaction directionality. However, this is where fluctuation theories can play a role if they can provide flux values as well.

The use of detailed fluctuation theorems will depend on whether theorems can be developed for non-equilibrium steady states that do not use rate constants and are instead based on chemical potentials and thermodynamic driving forces. If so, then one can set the chemical potentials based (ideally) on metabolomics measurements and carry out large-scale simulations of metabolism that would be identical to kinetic simulations based on rate constants. Experimentally measuring metabolite concentrations is an emerging area of great interest. Key to making the measurements useful for interpretation and modeling is reducing the uncertainty that the measured values reflect *in vivo* concentrations (Noack and Wiechert, 2014).

An alternative statistical thermodynamic approach is to model the process as thermodynamically optimal in which the rates are proportional to the thermodynamic driving force. In a thermodynamically optimal process, the maximum amount of energy is extracted from the environment with a minimal amount of dissipation of heat (Sivak and Crooks, 2012). A model based on this assumption would be roughly consistent with the historical perspectives of the physical basis of biological systems. An analogous approach has been used to analyze metabolomics data, in which the free energies of reactions are minimized with respect to available metabolomics data in order to infer sites of enzyme regulation (Kummel et al., 2006).

As mentioned above, a challenge to using simulations based on statistical thermodynamics is determining accurate standard free energies of reaction or formation of each metabolite. Standard free energies based on group contribution methods are available *en masse* (Jankowski et al., 2008; Noor et al., 2013), but group contribution methods can be inaccurate at times. One must be careful when estimating a standard reaction free energy from group contribution estimates of standard formation free energies in that the errors in estimates are additive; one must ensure when taking the difference between two chemical species that any approximations used for group energies cancel out. The use of electronic structure calculations with an appropriate solvent model is an attractive alternative for determining standard free energies and chemical potentials. Such calculations have been done on a large scale for chlorinated hydrocarbons (Bylaska, 2006) and it is feasible to carry these out for many metabolites. Larger molecules from secondary metabolism, such as those from plants, may present a challenge in that they may have multiple minima that contribute to their free energy of solvation.
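The caution about additive errors can be made concrete with a short sketch. It computes a standard reaction free energy as the stoichiometric sum of formation free energies and, under the assumption that the individual estimation errors do not cancel, the worst-case error bound on the result. The species names, energies, and error magnitudes are hypothetical.

```python
def reaction_energy_with_error(stoich, dgf, err):
    """Return (dG0_reaction, worst_case_error) in the units of dgf.

    stoich: species -> stoichiometric coefficient (negative for reactants)
    dgf:    species -> estimated standard formation free energy
    err:    species -> absolute error of that estimate
    """
    dg = sum(nu * dgf[m] for m, nu in stoich.items())
    # If errors do not cancel, they add up over all participating species.
    worst = sum(abs(nu) * err[m] for m, nu in stoich.items())
    return dg, worst

# Hypothetical reaction A -> B with large formation energies (kJ/mol).
stoich = {"A": -1, "B": 1}
dgf = {"A": -1000.0, "B": -1010.0}
err = {"A": 5.0, "B": 5.0}

dg, worst = reaction_energy_with_error(stoich, dgf, err)
print(dg, worst)
```

In this example a 0.5% error on each formation energy becomes an uncertainty as large as the reaction energy itself, which is why cancellation of shared group approximations matters.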

#### **ACKNOWLEDGMENTS**

This work was supported by the Laboratory Directed Research Program at the Pacific Northwest National Laboratory and the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). EMSL is a national scientific user facility operated by PNNL for the Office of Biological and Environmental Research at the U.S. Department of Energy. PNNL is operated by Battelle for the U.S. Department of Energy under Contract DE-AC06-76RLO.

### **REFERENCES**


Davidson, N. (1962). *Statistical Mechanics*. New York, NY: McGraw.


Wittmann, A., and Suess, B. (2012). Engineered riboswitches: expanding researchers' toolbox with synthetic RNA regulators. *FEBS Lett.* 586, 2076–2083. doi:10.1016/j.febslet.2012.02.038

Xiao, T. J., Hou, Z. H., and Xin, H. W. (2009). Stochastic thermodynamics in mesoscopic chemical oscillation systems. *J. Phys. Chem. B* 113, 9316–9320. doi:10.1021/jp901610x

Zimmermann, E., and Seifert, U. (2012). Efficiencies of a molecular motor: a generic hybrid model applied to the F-1-ATPase. *New J. Phys.* 14, 20. doi:10.1088/1367-2630/14/10/103023

**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 09 September 2014; accepted: 27 October 2014; published online: 26 November 2014.*

*Citation: Cannon WR (2014) Concepts, challenges, and successes in modeling thermodynamics of metabolism. Front. Bioeng. Biotechnol. 2:53. doi: 10.3389/fbioe.2014.00053*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology.*

*Copyright © 2014 Cannon. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Metabolic network discovery by top-down and bottom-up approaches and paths for reconciliation

## **Tunahan Çakır <sup>1</sup>\* and Mohammad Jafar Khatibipour <sup>1,2</sup>**

<sup>1</sup> Computational Systems Biology Group, Department of Bioengineering, Gebze Technical University (formerly known as Gebze Institute of Technology), Gebze, Turkey

<sup>2</sup> Department of Chemical Engineering, Gebze Technical University (formerly known as Gebze Institute of Technology), Gebze, Turkey

#### **Edited by:**

Daniel Machado, University of Minho, Portugal

#### **Reviewed by:**

Pierre Millard, University of Manchester, UK; Corrado Priami, University of Trento and COSBI, Italy; Huub C. J. Hoefsloot, University of Amsterdam, Netherlands

#### **\*Correspondence:**

Tunahan Çakır, Department of Bioengineering, Gebze Technical University, Gebze 41400, Kocaeli, Turkey; e-mail: tcakir@gyte.edu.tr

The primary focus in the network-centric analysis of cellular metabolism by systems biology approaches is to identify the active metabolic network for the condition of interest. Two major approaches are available for the discovery of condition-specific metabolic networks. One approach starts from genome-scale metabolic networks, which cover all possible reactions known to occur in the related organism in a condition-independent manner, and applies methods such as the optimization-based Flux-Balance Analysis to elucidate the active network. The other approach starts from the condition-specific metabolome data, and processes the data with statistical or optimization-based methods to extract the information content of the data such that the active network is inferred. These approaches, termed bottom-up and top-down, respectively, are currently employed independently. However, considering that both approaches have the same goal, they can benefit from each other, paving the way for novel integrative analysis methods of metabolome data and flux-analysis approaches in the post-genomic era. This study reviews the strengths of constraint-based analysis and network inference methods reported in the metabolic systems biology field, and then elaborates on the potential paths to reconcile the two approaches to shed better light on how metabolism functions.

**Keywords: constraint-based models, metabolic network inference, active metabolic state, metabolome, network biology, reverse engineering, flux-balance analysis**

#### **INTRODUCTION**

The metabolic network is the outermost layer of cellular activity from the genome. The genome of a cell is a comprehensive and condensed information base, defining a boundary for the biochemical capacity of the cell. The processing of genetic information passes through several layers of fabrication and regulation before reaching its end products. This path leads from information to function, from genotype to phenotype. Metabolic enzymes account for a significant percentage of the end products of genes, and their activity sets the physiology of the cell. Since metabolic network activity is the major representative of cell functionality, it is of great importance to gain as much knowledge as possible on the active metabolic network at a specific cellular state.

The systems-based approach to molecular biology has contributed to an increased knowledge of metabolic pathways for an increasing number of organisms, and led to almost complete metabolic networks for a number of major organisms, from yeast to human. Such static networks are available in a condition-independent manner through web-based databases such as KEGG or MetaCyc (Altman et al., 2013), or reconstructed in a format suitable for simulation by several researchers at genome scale (Oberhardt et al., 2009; Kim et al., 2012). There are several mathematical approaches to process such networks to come up with condition-specific networks, the most common one being the Flux-Balance Analysis (FBA) framework (Orth et al., 2010). This is a bottom-up direction toward the active network since already-known "parts," interactions, are used as inputs (Bruggeman and Westerhoff, 2007; Petranovic and Nielsen, 2008).

In parallel to the developments on the knowledge of metabolic networks, techniques to measure metabolite levels at high throughput, termed metabolomics, have arisen (Kell, 2004; Dunn et al., 2005). Quantitative or semi-quantitative metabolome data, although among the most challenging to obtain compared with other omics data, have come a long way in a decade, from the detection and quantification of about 50 metabolites (Devantier et al., 2005) to more than 1000 metabolites (Psychogios et al., 2011). Metabolome data are a snapshot of the condition-specific status of the investigated organisms. Reverse-engineering metabolome data to discover the underlying network structure is the goal behind metabolic network inference approaches (Srividhya et al., 2007; Çakır et al., 2009). The information content of metabolome data is revealed by processing it with correlation or optimization-based methods (Weckwerth et al., 2004; Hendrickx et al., 2011; Öksüz et al., 2013). Such an approach to discover metabolic network structure is termed the top-down approach since the parts, interactions, are not known *a priori*, and are predicted from the whole set of available biomolecules (Bruggeman and Westerhoff, 2007; Petranovic and Nielsen, 2008).

In this review, we will cover the basic developments in bottom-up and top-down approaches to discover the active metabolic network, and then ponder over the possible ways of reconciling these two approaches for a better prediction of active network structure. **Figure 1** illustrates the two alternative network discovery approaches.

### **BOTTOM-UP APPROACHES TO DISCOVER CONDITION-SPECIFIC METABOLIC NETWORKS**

Different methods and algorithms have been used for the discovery and characterization of active metabolic networks at different states of cells and culture environments. In the bottom-up approach, everything starts from an already available network of biochemical transformations that covers all possible scenarios in the distribution of metabolic fluxes, and sets an upper bound for the existence of reactions in the active metabolic network. Such a network is termed a static metabolic network. A static metabolic network can be provided either by a previously reconstructed genome-scale stoichiometric model or by a collection of all reactions whose existence in the organism of interest has been certified in literature and databases. Most popular among such databases are KEGG (Kanehisa et al., 2014), MetaCyc (Caspi et al., 2014), and Reactome (Croft et al., 2014). Other efforts with more curated databases such as Rhea (Alcántara et al., 2012) and MetRxn (Kumar et al., 2012) are also available. A genome-scale stoichiometric model is reconstructed based on the annotation of all genes in the genome of one organism to their end products and then to the corresponding reactions, leading to a list of gene-protein-reaction rules (Thiele and Palsson, 2010). In this way, the minimum information content of a genome-scale model is (i) a list of reactions, and (ii) a list of gene-protein-reaction rules. The presence of gene-protein-reaction rules in stoichiometric models has enabled transcriptome and proteome data to be incorporated into the discovery methods of active metabolic networks (Blazier and Papin, 2012).

Given a genome-scale reaction network, the aim is to find the active reaction network at a specific condition or for a specific cell type in a multicellular organism (**Box 1**). The core of all such discovery approaches is a stoichiometric matrix. Each row of the stoichiometric matrix represents a metabolite and each column stands for a reaction, the corresponding element being the stoichiometric coefficient of that metabolite in that reaction.

#### **Box 1 Different levels of Metabolic Network Structure Information**.

Our understanding of an active metabolic network can be sorted into several stages of information.

The relationship between the reaction rates in the network and the dynamic change in the concentration of metabolites is represented as given below:

$$\frac{d\mathbf{C}}{dt} = \mathbf{S} \times \mathbf{v} \tag{1}$$

where **S** is the stoichiometric matrix, **C** is the vector of intracellular metabolite concentrations, and **v** is a column vector of metabolic reaction rates (fluxes) to be determined. Under the assumption of steady state, the concentration of each intracellular metabolite does not change with time, meaning that the sum of the rates of reactions producing that metabolite is equal to the sum of the rates of reactions consuming that metabolite (metabolic fluxes around each metabolite are balanced). This is represented mathematically as follows:

$$\mathbf{S} \times \mathbf{v} = \mathbf{0} \tag{2}$$

This is an algebraic system of linear equations with all fluxes being zero as a trivial solution. In order to escape from the trivial solution, the value of at least one of the fluxes must be set to a nonzero value, that flux usually being an exchange flux between the intracellular and extracellular environment since the experimental measurement of exchange fluxes is relatively easier. The system is almost always underdetermined with a large solution space, mainly because of the existence of branch points in the metabolic network. There are both experimental and computational approaches to estimate a condition-specific network for such a system.
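A minimal sketch of Eqs. 1 and 2 may help to see why a branch point leaves the system underdetermined. The five-reaction toy network below is hypothetical: one uptake reaction supplies metabolite A, which branches to B and C, each of which is then secreted. Two different flux vectors with the same uptake rate both satisfy the steady-state condition.

```python
# Hypothetical toy network: uptake -> A, A -> B, A -> C, B -> out, C -> out
# Rows are metabolites, columns are reactions (v1..v5).
S = [
    [1, -1, -1,  0,  0],   # A
    [0,  1,  0, -1,  0],   # B
    [0,  0,  1,  0, -1],   # C
]

def concentration_rates(S, v):
    """Return dC/dt = S v (Eq. 1) as a list, one entry per metabolite."""
    return [sum(s * f for s, f in zip(row, v)) for row in S]

def is_steady_state(S, v, tol=1e-9):
    """True if S v = 0 (Eq. 2) within a numerical tolerance."""
    return all(abs(rate) <= tol for rate in concentration_rates(S, v))

uptake = 10                      # the one flux fixed to a non-zero value
v_a = [uptake, 6, 4, 6, 4]       # 60/40 split at the branch point
v_b = [uptake, 3, 7, 3, 7]       # 30/70 split, equally valid
v_c = [uptake, 6, 6, 6, 6]       # consumes more A than is supplied

for v in (v_a, v_b, v_c):
    print(v, concentration_rates(S, v), is_steady_state(S, v))
```

Fixing the uptake flux removes the trivial all-zero solution, but the split at the branch point remains free, which is the gap that labeling experiments or additional constraints are meant to close.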

The experimental approach is based on stable-isotope (mostly <sup>13</sup>C) labeling of the major carbon source, and then tracing the propagation of the labeled carbon atoms down to protein-bound amino acids at isotopic steady state by using mass spectrometry or NMR spectroscopy (Wiechert et al., 2001; Sauer, 2006; Mueller and Heinzle, 2013). The qualitative isotopic labeling information is then used as an input to two alternative methods. In one method, termed isotopomer modeling, a total flux distribution is estimated based on the experimental labeling results through a computationally demanding non-linear optimization formulation, which employs global iterative fitting and statistical analysis (Wiechert et al., 2001; Antoniewicz et al., 2007). The other <sup>13</sup>C-labeling-assisted method is based on the estimation of the local ratios of fluxes emerging from a branch point (Sauer, 2006; Zamboni et al., 2009) rather than the absolute quantification of all fluxes. These experimental flux split ratios can be used to shrink the solution space of Eq. 2 in a complementary flux calculation, leading to the discovery of a condition-specific network (Schuetz et al., 2007; Tarlak et al., 2014). Software tools are available for the rather sophisticated calculation of experimental fluxes (or flux ratios) from carbon labeling data for both methods (Zamboni et al., 2005; Quek et al., 2009; Weitzel et al., 2013). A new trend in this area is to collect data at the non-stationary phase of isotopic labeling rather than at the isotopic steady state, which was shown to be more informative in terms of predicting the flux-weighted active metabolic network structure (Schaub et al., 2008; Young et al., 2008; Wiechert and Nöh, 2013).
Studies that trace intracellular metabolites rather than only 10–15 protein-bound amino acids have also appeared, motivated by the higher coverage of metabolic pathways despite the inherent experimental difficulties in terms of higher turnover rates as well as stability issues (Van Winden et al., 2005; Toya et al., 2007; Millard et al., 2014).

The computational approach for the discovery of the condition-specific metabolic network based on Eq. 2 is known as constraint-based modeling. Constraint-based modeling methods aim to shrink the solution space of the equation as much as possible by putting relevant constraints on the system. The most common method, FBA, treats the problem in Eq. 2 as an optimization problem and linear programming is applied to solve it. The stoichiometry of metabolic reactions (stoichiometric matrix), reaction directionality information, a physiologically relevant objective function, and the value of at least one of the exchange fluxes are all that are required for FBA to return a condition-specific flux distribution. The flux distribution returned by FBA is not necessarily unique, and there may be a variety of flux distributions all leading to the same optimum value of the objective function. Therefore, Flux Variability Analysis (FVA) must be used together with FBA, to determine the variability, if any, on each metabolic flux in regard to the condition of interest (Mahadevan and Schilling, 2003; Müller and Bockmayr, 2013). The maximization of biomass production has been successfully applied as a reliable objective function for FBA to predict flux distributions in a variety of microorganisms (Varma and Palsson, 1994; Feist and Palsson, 2010). In some studies, it has been hypothesized that one objective function alone may not capture the metabolic behavior of the cell comprehensively. Therefore, multi-objective optimization platforms have been designed and utilized to come up with more specific flux distributions. Several modified versions of FBA including parsimonious FBA, pFBA (Lewis et al., 2010), and flexible-optimality FBA, flexoFBA (Tarlak et al., 2014), have been developed in this manner. On the other hand, some research groups have developed methods based on the availability of additional omics data, which are discussed below. For a thorough review of a number of FBA-derived flux calculation methods, the readers are referred to Lewis et al. (2012).
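The FBA and FVA ideas described above can be sketched on a small hypothetical network using the general-purpose solver `scipy.optimize.linprog`. The network, bounds, and function names are illustrative assumptions, not a published model: one uptake reaction feeds metabolite A, two parallel routes convert A to B, and a final reaction drains B as "biomass."

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical network: v1: -> A, v2: A -> B, v3: A -> B (parallel route), v4: B -> biomass
S = np.array([[1, -1, -1,  0],    # A
              [0,  1,  1, -1]],   # B
             dtype=float)
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]  # uptake limited to 10
BIOMASS = 3  # column index of the objective reaction

def fba(S, bounds, objective_index):
    """Maximize one flux subject to S v = 0 and flux bounds."""
    c = np.zeros(S.shape[1])
    c[objective_index] = -1.0  # linprog minimizes, so negate to maximize
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x, -res.fun

def fva(S, bounds, objective_index, optimum, tol=1e-9):
    """Min and max of every flux while the objective stays at its optimum."""
    fixed = list(bounds)
    fixed[objective_index] = (optimum - tol, bounds[objective_index][1])
    b_eq = np.zeros(S.shape[0])
    ranges = []
    for j in range(S.shape[1]):
        c = np.zeros(S.shape[1])
        c[j] = 1.0
        low = linprog(c, A_eq=S, b_eq=b_eq, bounds=fixed, method="highs")
        high = linprog(-c, A_eq=S, b_eq=b_eq, bounds=fixed, method="highs")
        ranges.append((low.fun, -high.fun))
    return ranges

fluxes, optimum = fba(S, bounds, BIOMASS)
ranges = fva(S, bounds, BIOMASS, optimum)
print("optimal biomass flux:", optimum)
for name, (lo, hi) in zip(["v1", "v2", "v3", "v4"], ranges):
    print(name, round(lo, 6), round(hi, 6))
```

In this toy case the optimum is unique in value, yet FVA shows that the two parallel routes can each carry anything from none to all of the flux, which is exactly the non-uniqueness that makes a single FBA solution hard to interpret on its own.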

#### **CONSTRAINTS BASED ON TRANSCRIPTOME OR PROTEOME DATA**

The rate of an enzymatic reaction inside the cell is a function of several different factors, such as the concentration of substrates, products, and regulators of the enzyme and also the amount of available active enzyme for that reaction. Among these factors, the concentration of active enzymes can be related to the activity of genes through layers of transcription, translation, and post-translational modifications. Transcriptome data are much more accessible and comprehensive compared to the other omics data. Several research groups have developed different strategies to incorporate transcriptome data into constraint-based models. The idea behind this is that the amount of mRNAs (gene activities) may be correlated with the concentration of active enzymes, and hence this can be utilized to provide additional constraints on metabolic fluxes. At a minimum, if an enzyme-coding gene is not transcribed at steady state, the corresponding reaction should be inactive at that steady state, if there is no other enzyme catalyzing that reaction. This idea was first used by Åkesson et al. to set the flux values to zero for those reactions whose corresponding genes were expressed at low levels (Åkesson et al., 2004). More sophisticated and structured versions of this approach appeared later, under the names of GIMME (Becker and Palsson, 2008) and iMAT (Shlomi et al., 2008). These approaches classify some reactions as inactive based on the low expression levels of their associated genes. Then, they employ a computational framework which minimizes the contradiction between the classification and an active physiological flux distribution since some of these classifications may render the flux state unrealistic (such as zero growth rate). Several other alternative methods appeared recently to incorporate transcriptome data into the prediction of active metabolic network and flux distribution.
In an interesting study, for example, mRNA levels from transcriptome data were used as weights for the corresponding reactions to predict a flux distribution without using a conventional objective function such as the maximization of biomass growth (Lee et al., 2012). A recent study (Machado and Herrgård, 2014) evaluated these methods systematically for the prediction of flux distributions, and the results were compared to those of parsimonious FBA as a reference method that does not consider the transcriptome data. In general, none of the methods could significantly improve the results of pFBA and none of them outperformed the others for the tested cases (*S. cerevisiae* and *E. coli*). Although they do not improve the prediction of flux distributions, these methods may significantly help in the discovery of active metabolic networks in context/tissue-specific cells and in the conditions where a relevant objective function cannot be hypothesized.
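The simplest form of this idea, switching off a reaction when every gene that can encode its enzyme is weakly expressed, can be sketched as follows. This is a deliberately simplified illustration and not the GIMME or iMAT algorithm: gene-protein-reaction rules are reduced to lists of alternative genes (OR logic only), and the expression threshold, gene names, and reaction names are hypothetical.

```python
def constrain_by_expression(bounds, gpr, expression, threshold):
    """Return new flux bounds with weakly expressed reactions forced to zero.

    bounds:     reaction -> (lower, upper)
    gpr:        reaction -> list of alternative genes (isoenzymes, OR logic)
    expression: gene -> measured expression level
    threshold:  level below which a gene is treated as not expressed (assumed)
    """
    new_bounds = dict(bounds)
    for reaction, genes in gpr.items():
        # Only act when every associated gene is measured and below threshold;
        # reactions without gene annotation or with missing data stay untouched.
        if genes and all(g in expression and expression[g] < threshold for g in genes):
            new_bounds[reaction] = (0.0, 0.0)
    return new_bounds

# Hypothetical example data.
bounds = {"R1": (0.0, 10.0), "R2": (0.0, 10.0), "R3": (-5.0, 5.0)}
gpr = {"R1": ["g1"], "R2": ["g2", "g3"], "R3": []}
expression = {"g1": 0.1, "g2": 0.2, "g3": 5.0}

constrained = constrain_by_expression(bounds, gpr, expression, threshold=1.0)
print(constrained)
```

Here R1 is closed because its only gene is below the threshold, R2 stays open because one isoenzyme gene is expressed, and R3 is left alone because it has no gene annotation. The resulting bounds could then be passed to a flux calculation, where overly strict closures may make the model infeasible, which is the problem the more structured methods address.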

Transcriptome data are not necessarily correlated with the rate of corresponding reactions. Inconsistency between mRNA levels and reaction rates is a result of the influence of several other factors in the regulation of enzymatic reactions. Therefore, if proteome data are available, they can be used instead of transcriptome data as a better representative for the concentration of active enzymes since the proteome is hierarchically closer to the enzyme states than the transcriptome. The methods that are developed to integrate transcriptome data with the FBA method can all be used for the purpose of integrating proteome data. For example, GIMMEp (Bordbar et al., 2012) is the proteome equivalent version of GIMME. Some of such integrative methods were primarily tested with proteome data. INIT (Agren et al., 2012), for example, was developed by using proteome abundance data from the Human Protein Atlas database. However, it was shown that utilizing proteome data instead of transcriptome data could not improve the prediction of flux distributions for the tested cases (*S. cerevisiae* and *E. coli*) (Machado and Herrgård, 2014). In a study which used metabolome and proteome data in the flux calculation method, on the other hand, even the use of proteome data alone was shown to improve the results compared to the traditional FBA (see the next section for more details) (Yizhak et al., 2010).

Substrate concentrations, the concentration of enzyme regulators, the turnover number of the catalyzing enzyme, and the concentration of the active enzyme all play significant roles in the determination of reaction rates, and among them only the concentration of the active enzyme may be represented by the corresponding protein or mRNA concentration. Translated proteins are not necessarily active enzymes, and they may need to undergo post-translational modifications (e.g., phosphorylation/acetylation) to become capable of catalyzing the reactions. This is one of the main reasons behind inconsistency between protein levels and reaction rates. On the other hand, the turnover number (catalytic power) of one enzyme may differ by several orders of magnitude from the turnover number of another enzyme (Hoppe, 2012). It means that although the concentration of one enzyme may be much less than the others in the network, the reaction catalyzed by that enzyme can proceed much faster than others. For this reason, the use of the absolute concentrations of proteins or mRNAs to constrain reaction rates does not seem promising. However, the turnover number of an enzyme is an intrinsic parameter of the enzyme that does not change from one condition to another except by effective mutations that rarely occur. Because of this, the relative levels of proteins or mRNAs can be utilized to overcome the problem of big differences in turnover numbers. One steady state with available data on flux values and protein/mRNA levels can be taken as the reference state, and then the relative/differential levels of proteins/mRNAs to the reference state can be used to predict the flux distributions at the new conditions. Based on this approach, algorithms have been developed to incorporate relative/differential transcriptome data into metabolic-flux analysis, among which are MADE (Jensen and Papin, 2011) and GX-FBA (Navid and Almaas, 2012). One other main reason for the inconsistency between protein levels and reaction rates is the distribution of flux control among different layers from genotype to phenotype. Metabolic fluxes can be regulated hierarchically (through gene expression levels) or metabolically (through metabolic interactions) (Daran-Lapujade et al., 2007; Postmus et al., 2008; Nikerel et al., 2012; Chubukov et al., 2013). Use of transcriptome or proteome data will not be helpful if the metabolic fluxes are controlled at the metabolic level.

#### **CONSTRAINTS BASED ON METABOLOME DATA**

One approach to find more specific and physiologically relevant flux distributions is to provide additional constraints by specifying the directionality of reversible reactions. This can be done by taking Gibbs free energies of metabolites into consideration. The Gibbs free energy change of a reversible biochemical transformation (one reaction or a series of reactions) determines the direction of that transformation and its departure from reversibility. The earlier studies assumed standard conditions (all metabolite concentrations were assumed to be 1 M), and did not explicitly consider metabolite concentrations in the calculation of Gibbs energy changes of reactions due to the scarcity of metabolome data (Henry et al., 2006). Recent studies, however, take the concentration of metabolites into account, when available, to perform thermodynamic-based metabolic-flux analysis, leading to more reliable predictions (Hoppe et al., 2007; Bennett et al., 2009; Soh and Hatzimanikatis, 2010; Hamilton et al., 2013).
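The effect of metabolite concentrations on reaction directionality follows from the standard relation ΔG = ΔG° + RT ln Q, where Q is the reaction quotient. The sketch below applies it to a hypothetical reaction A → B with invented values, to show how a reaction that appears unfavorable under standard conditions (all concentrations 1 M) can run forward at plausible cellular concentrations.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)

def reaction_gibbs_energy(dg0_kj, stoich, conc, temperature=298.15):
    """dG = dG0 + RT ln Q, with Q built from stoichiometry and concentrations (M).

    stoich: metabolite -> coefficient (negative for substrates, positive for products)
    conc:   metabolite -> concentration in mol/L
    """
    ln_q = sum(nu * math.log(conc[m]) for m, nu in stoich.items())
    return dg0_kj + R_KJ * temperature * ln_q

def direction(dg_kj):
    """Thermodynamically feasible direction for a given free energy change."""
    if dg_kj < 0:
        return "forward"
    if dg_kj > 0:
        return "reverse"
    return "equilibrium"

# Hypothetical reaction A -> B with an assumed standard free energy of +5 kJ/mol.
stoich = {"A": -1, "B": 1}
dg0 = 5.0
standard = reaction_gibbs_energy(dg0, stoich, {"A": 1.0, "B": 1.0})
cellular = reaction_gibbs_energy(dg0, stoich, {"A": 1e-2, "B": 1e-5})

print(direction(standard))   # concentrations ignored
print(direction(cellular))   # substrate in 1000-fold excess over product
```

A direction assigned this way can be imposed on a constraint-based model as a sign constraint on the corresponding flux. The sketch ignores activity coefficients, pH, and ionic strength corrections, all of which matter in practice.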

Extracellular metabolome data can be used to constrain genome-scale metabolic models for the calculation of intracellular flux distributions by simply constraining the secretion and uptake rates of extracellular metabolites based on such data (Çakır et al., 2007; Mo et al., 2009). In a different approach, Michaelis–Menten-based kinetics was used for the estimation of reaction rates for the reactions for which appropriate intracellular metabolome (and proteome) data are available (Yizhak et al., 2010). The FBA framework was designed in such a way that the calculated fluxes are as consistent as possible with the kinetically derived reaction rates, if available. The simultaneous use of metabolome and proteome data for this purpose significantly improved the results. The use of metabolome data alone also resulted in better predictions than the traditional FBA. In a recent study, a kinetic platform was established based on the Michaelis–Menten equation to bridge gene expression levels, metabolite concentrations, and metabolic fluxes without requiring the knowledge of kinetic parameters (Zelezniak et al., 2014). The authors showed that changes in metabolite concentrations relative to a reference steady state can be predicted by their formulation, which includes information on network connectivity in addition to differential mRNA expression levels. All those works utilizing kinetic information demonstrate the necessity of dynamic models for a more comprehensive analysis of metabolic networks.

Kinetic models of biochemical reactions not only provide a rational platform for omics data, especially metabolomics, to be incorporated in the estimation of metabolic fluxes, but they also enable the prediction and study of the dynamics of metabolic networks far beyond the steady state (**Box 1**). Such models were only possible for small-scale metabolic networks until recently (Teusink et al., 2000; Chassagnole et al., 2002), since they require detailed information on the enzyme kinetics of each individual reaction. Estimation of kinetic parameters is a major obstacle in the applicability of dynamic modeling of metabolic networks. New platforms and algorithms were established to circumvent this problem so that the estimation of explicit kinetic parameters is not a prerequisite to study the dynamic capacity and behavior of the system (Link et al., 2014). Approximative kinetic models (lin-log, power-law, mass-action), on the other hand, try to fit a standard rate expression formula to all reactions of the network to increase the range of their applicability to larger networks (Visser et al., 2004; Sorribas et al., 2007). Thanks to approximative kinetics, attempts to reconstruct large-scale kinetic metabolic models with more than 100 reactions were recently presented (Smallbone et al., 2010; Chakrabarti et al., 2013; Stanford et al., 2013), but their predictive power is limited to conditions adequately close to the corresponding steady state.

As a better alternative to approximative kinetics, an approach was established and utilized based on the concept of parametric Jacobian, which covers the behavior of all possible kinetic models that are consistent with an experimentally observed operating point (Steuer et al., 2006). This approach provides an opportunity to detect and analyze bifurcation characteristics of the metabolic network without the need for explicit determination of kinetic parameters. Ensemble modeling of metabolic networks (Tran et al., 2008) is an elegant idea for large-scale kinetic modeling of biochemical reaction networks. In this method, each enzymatic reaction is broken down to its elementary reactions that all follow mass-action kinetics. An ensemble of thermodynamically consistent kinetic models with different dynamic behavior that all converge to a reference steady state is collected with the help of intracellular metabolome data. This ensemble is then screened against the results of perturbation experiments to remove inconsistent models and to increase the predictive power of the remaining models. The approach was successfully applied, among others, to construct kinetic models of *E. coli* (Khodayari et al., 2014) and cancer metabolism (Khazaei et al., 2012), leading to promising flux predictions.

#### **TOP-DOWN APPROACHES TO DISCOVER CONDITION-SPECIFIC METABOLIC NETWORKS**

Time series of metabolite concentrations in response to a perturbation, and also replicates of metabolome data at a specific steady state, both implicitly contain information on the structure of the active metabolic network. Reverse engineering of these data to infer the condition-specific metabolic network, without necessarily requiring prior knowledge of the genome of the organism and its static metabolic network, is an alternative to all bottom-up approaches, which are based on the availability of a large-scale stoichiometric model of the organism. Although promising, these top-down approaches have received less attention than bottom-up approaches, mainly because of the technical obstacles in gathering reliable metabolome data on a large scale. This limitation will be removed with future advancements in the detection and quantification of intracellular metabolites, such as higher coverage and temporal resolution. At this stage, however, several research groups have established algorithms and methods for reverse engineering of metabolic networks by using either time series or steady-state replicates of metabolite concentrations (Crampin et al., 2004; Chou and Voit, 2009; Hendrickx et al., 2011; Lecca and Priami, 2013).

#### **NETWORK DISCOVERY BASED ON TIME-SERIES DATA**

The use of time-series metabolite concentration data to predict the underlying network connectivity first appeared in the literature about two decades ago. Time-lagged correlations combined with a projection technique called multidimensional scaling were shown to reconstruct the structure of generic biochemical networks with few nodes (Arkin and Ross, 1995). Correlation between time-series profiles of metabolites, with consideration of the delay in the influence of one metabolite on the next, is the basis of the time-lagged correlation method for the inference of metabolic networks. The approach, called correlation metric construction, was later experimentally verified *in vitro* by inferring the first steps of the glycolytic pathway in a 14-metabolite system (Arkin et al., 1997). Modified versions of the approach appeared later (Samoilov et al., 2001; Lecca et al., 2012). In the latter, the metabolic pathway of an anticancer drug was deduced from the time-lagged correlations of the corresponding metabolite concentration measurements. The modification introduced by the former work was recently improved by using a mutual-information similarity score rather than simple linear correlation (Villaverde et al., 2014). The authors compared their method, called MIDER, with several other methods by applying it to different types of cellular networks, including *in vitro* glycolytic pathway data. The approach outperformed the other methods.
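The core idea of time-lagged correlation can be illustrated with a minimal sketch. It is not the correlation metric construction algorithm itself (which also involves multidimensional scaling); it only scans a range of delays and keeps the one with the strongest Pearson correlation. The function names and the toy sine-wave data are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_lagged_correlation(x, y, max_lag):
    """Return (lag, r) with the largest |r| between x(t) and y(t + lag), lag >= 0."""
    best = (0, pearson(x, y))
    for lag in range(1, max_lag + 1):
        # Pair x at time t with y at time t + lag.
        r = pearson(x[:-lag], y[lag:])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Hypothetical toy data: metabolite b repeats metabolite a with a delay of 3 samples.
a = [math.sin(0.3 * t) for t in range(60)]
b = [math.sin(0.3 * (t - 3)) for t in range(60)]
lag, r = best_lagged_correlation(a, b, max_lag=10)
```

In this toy case the scan recovers the imposed delay, which hints at a direction of influence (a precedes b) that a zero-lag correlation alone would not reveal.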

Another method to reconstruct a network using time-series data is based on perturbation experiments around steady state. The initial curve of concentration changes of metabolites in response to a pulse change in the concentration of one metabolite is processed with the method of zero initial slopes (Vance et al., 2002). The method successfully inferred the structure of glycolysis based on *in vitro* experimental data (Torralba et al., 2003). A performance comparison of the method with the correlation metric construction approach was later provided based on *in silico* data of *S. cerevisiae* and *E. coli* central metabolic networks (Hendrickx et al., 2011). An approach also based on perturbation experiments, but with a different formulation aiming to calculate the Jacobian matrix from time derivatives of concentration data, was first applied to gene networks (Schmidt et al., 2005). A modified version of the approach recently used *in vivo* metabolite concentration measurements from tomato seedlings to reconstruct the quercetin glycosylation pathway (Astola et al., 2011).

Apart from such model-free structure identification methods, model-based methods use time-series metabolite concentration data not only to identify network structure but also to estimate model parameters such as the rate constants of kinetic expressions (Chou and Voit, 2009). The majority of these approaches use the power-law (also called S-system) formulation (Savageau and Voit, 1987) to approximate reaction kinetics. One approach, for example, used S-system modeling with multi-objective optimization, simultaneously minimizing the number of interactions and the fitting error (Liu and Wang, 2008). The authors applied their method to the major metabolites involved in ethanol fermentation. An earlier work analyzed a small three-metabolite network of phospholipid metabolism by combining S-system modeling with an evolutionary modeling method, genetic programing (Ando et al., 2002). Later, a new representation of the S-system approach, called S-trees, was combined with genetic programing to reverse-engineer the yeast fermentation pathway more efficiently by using *in silico* time-series concentration data of five metabolites (Cho et al., 2006). In a sophisticated approach, others used symbolic regression based on genetic programing to infer both the structure and the model of yeast glycolytic oscillations from *in silico* data (Schmidt et al., 2011). Their use of acyclic graph encoding rather than tree-based encoding, together with the symbolic regression approach, ensured the identification of parsimonious (sparse) models. Rather than the S-system formulation, mass-action kinetics can also be used to infer pathway connectivity and reaction mechanism (Srividhya et al., 2007). This minimizes the computational burden on the algorithm since only rate constants are to be estimated as parameters in the mass-action formulation. The authors tested their method with real time-course experimental metabolome data of *Lactococcus lactis* glycolysis. A graphical user interface was later made available by the same group to ease the inference of kinetics and network architecture from dynamic data of biochemical pathways (Mourão et al., 2011). Genetic programing was also combined with mass-action kinetics in an algorithm that ensures the estimation of biochemically more plausible models (Gormley et al., 2013). The small phospholipid network of Ando et al. (2002) was inferred in a more compact way by this algorithm.
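For reference, the S-system form that underlies several of the methods above writes the balance of each metabolite as the difference of two power-law terms. The symbols follow the common convention in the S-system literature and are introduced here for illustration only; they do not appear elsewhere in this text:

$$\frac{dX_i}{dt} = \alpha_i \prod_{j=1}^{n} X_j^{g_{ij}} - \beta_i \prod_{j=1}^{n} X_j^{h_{ij}}, \qquad i = 1, \ldots, n$$

Here $X_i$ is the concentration of metabolite $i$, $\alpha_i$ and $\beta_i$ are non-negative rate constants, and $g_{ij}$ and $h_{ij}$ are kinetic orders. A zero kinetic order means that $X_j$ has no direct influence on the production or consumption of $X_i$, so identifying the non-zero exponents amounts to inferring the network structure. This is why minimizing the number of interactions appears as an objective alongside the fitting error.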

#### **NETWORK DISCOVERY BASED ON STEADY-STATE DATA**

The use of steady-state metabolome data to infer metabolic network structure has also drawn attention in the last decade. Biological variability in the metabolism of organisms around steady state is a known phenomenon, caused by slight variations in enzyme levels or by slight natural or environment-induced fluctuations within cellular processes. Slight variations in the steady-state measurements of metabolite levels can be informative on the network structure (Steuer et al., 2003; Camacho et al., 2005; Çakır et al., 2009). The most common approach here is to use similarity measures such as Pearson correlation to assign edges between metabolites. One should note that such correlations are not necessarily strong among neighboring metabolites, whereas there could be strong correlations among distant metabolites in the network (Camacho et al., 2005). In a comprehensive study, different alternative similarity measures (linear vs. non-linear, and full vs. partial) were applied to *in silico* metabolome data belonging to two microorganisms to systematically analyze method performances (Çakır et al., 2009). The results revealed no clear superiority between linear (Pearson correlation) and non-linear (mutual information) similarity measures. The best performing method was identified as nth order partial Pearson correlation, also known as graphical Gaussian modeling. Graphical Gaussian modeling was also applied to metabolome data from blood serum samples to reconstruct human fatty acid metabolism (Krumsiek et al., 2011). Others (Nemenman et al., 2007) analyzed *in silico* metabolome data of red blood cell metabolism with the ARACNE approach (Margolin et al., 2006), which is based on pruning mutual information scores. An elegant improvement on ARACNE-based reverse engineering of metabolic profiling data was suggested later (Bandaru et al., 2011). The approach puts a constraint on the possible metabolic transformations to satisfy mass conservation between the connected metabolites. Synthetic data covering up to about 200 metabolites were generated to test the approach. One issue in such similarity-based approaches is that they only aim to find pairwise interactions, whereas a metabolic reaction can involve more than two metabolites. Based on this reasoning, an attempt to also deduce triple interactions by using ternary mutual information was suggested (Diệp et al., 2011). Analysis of synthetic yeast glycolysis data and red blood cell data showed the success of this approach in capturing higher order interactions.
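The difference between full and partial correlation can be seen in a small sketch. It uses only first-order partial correlation on three variables, whereas the study cited above used nth order partial correlation on full metabolome data. The linear chain a → b → c and its Gaussian noise are hypothetical toy data, not a metabolic simulation.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y after conditioning on z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical replicates of a linear chain a -> b -> c with independent noise.
random.seed(1)
n = 5000
a = [random.gauss(0, 1) for _ in range(n)]
b = [ai + random.gauss(0, 1) for ai in a]
c = [bi + random.gauss(0, 1) for bi in b]

r_ab, r_bc, r_ac = pearson(a, b), pearson(b, c), pearson(a, c)
# a and c are not directly connected; conditioning on b should remove their correlation.
r_ac_given_b = partial_corr(r_ac, r_ab, r_bc)
```

In this toy chain the full correlation between a and c is substantial even though they are not neighbors, while the partial correlation given b is close to zero. This is the property that lets partial correlation prune indirect edges.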

A different approach to discover active metabolic networks from steady-state data is based on the Lyapunov equation. In Eq. 1, the rate vector, **v**, is a complex non-linear function of concentrations, **C**. For systems around steady state, the equation can be expressed in terms of the Jacobian matrix, **J**, by means of a linear approximation:

$$\frac{d\mathbf{X}}{dt} \approx \mathbf{J}\mathbf{X} \tag{3}$$

with **X** = **C** − **C<sup>s</sup>**, where **C<sup>s</sup>** denotes the steady-state metabolite concentrations. The Jacobian matrix stores detailed information on the structure of the underlying network, such as the directionality, strength, and regulation type of each interaction. For small fluctuations around steady state, the time derivative on the left-hand side of Eq. 3 vanishes on average, and the right-hand side can be expressed in such a way that a link between the covariance matrix of metabolome data, Γ, and the Jacobian matrix is provided. The details of the derivation are given elsewhere (Van Kampen, 1992; Steuer et al., 2003).

$$\mathbf{J}\boldsymbol{\Gamma} + \boldsymbol{\Gamma}\mathbf{J}^{\mathsf{T}} = -2\mathbf{D} \tag{4}$$

**D** in the equation represents the extent of fluctuations. Eq. 4, known as the Lyapunov equation, can be used to infer metabolic network structure since it provides a link between the data-based covariance matrix and the network connectivity stored in **J**. Reverse-engineering metabolome data by using the Lyapunov equation was first discussed via a hypothetical three-metabolite system (Steuer et al., 2003). A recent work provided a theoretical analysis of the use of the Lyapunov equation to infer network structure from steady-state metabolome data (Öksüz et al., 2013). The authors used a rearranged version of the Lyapunov equation:

$$\mathbf{A}\mathbf{j} = 2\mathbf{d} \tag{5}$$

Here, **j** and **d** are vectorized versions of the **J** and **D** matrices. **A** is a matrix based on the covariance of the data. In that work, directed networks were inferred from *in silico* metabolome data of *S. cerevisiae* glycolysis, *E. coli* central carbon metabolism, and brain glycolysis by solving Eq. 5 for **j** using a genetic-algorithm-based formulation. In the optimization formulation, the dual objective was the simultaneous maximization of sparsity and minimization of the residual norm of the equation. When compared to the inference results based on nth order partial Pearson correlation, a much higher prediction accuracy was reported. Another advantage of the optimization-based approach is that Eq. 5 infers a directed network, whereas correlation-based approaches cannot predict the directions of interactions. The Lyapunov equation was recently used to infer differential changes in the Jacobian matrix, rather than to infer network structure by predicting the Jacobian matrix itself (Sun and Weckwerth, 2012; Kügler and Yang, 2014; Nägele et al., 2014).
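How the Lyapunov equation can be rearranged into a linear system in the entries of **J** is sketched below for a hypothetical two-metabolite example. The construction of **A**, the row-major vectorization, and the numerical values are assumptions for illustration and are not taken from the cited work. With the sign convention of the Lyapunov equation as JΓ + ΓJᵀ = −2D, direct vectorization gives A·j = −2·d; the sign can be absorbed into the definition of **A** or **d**.

```python
def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def lyapunov_design_matrix(gamma):
    """Build A so that A . vec(J) = vec(J G + G J^T), using row-major vec."""
    n = len(gamma)
    A = [[0.0] * (n * n) for _ in range(n * n)]
    for i in range(n):
        for l in range(n):
            row = i * n + l
            for k in range(n):
                A[row][i * n + k] += gamma[k][l]  # term from (J G)[i][l]
                A[row][l * n + k] += gamma[i][k]  # term from (G J^T)[i][l]
    return A

# Hypothetical Jacobian and covariance matrix (symmetric, positive definite).
J = [[-1.0, 0.5], [0.0, -2.0]]
gamma = [[1.0, 0.2], [0.2, 0.5]]

# Fluctuation matrix D implied by the Lyapunov equation for this J and covariance.
JG = matmul(J, gamma)
GJt = matmul(gamma, transpose(J))
D = [[-(JG[i][l] + GJt[i][l]) / 2 for l in range(2)] for i in range(2)]

A = lyapunov_design_matrix(gamma)
j = [x for row in J for x in row]
d = [x for row in D for x in row]
Aj = [sum(A[r][c] * j[c] for c in range(4)) for r in range(4)]
residual = max(abs(Aj[r] + 2 * d[r]) for r in range(4))
```

Because the covariance matrix is symmetric, the rows of **A** for the index pairs (i, l) and (l, i) coincide, so the system has fewer independent equations than unknowns. This is one reason an additional criterion such as sparsity is needed to single out a Jacobian.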

#### **PATHS TO RECONCILE BOTTOM-UP AND TOP-DOWN METABOLIC NETWORK DISCOVERY APPROACHES**

The previous sections reviewed bottom-up and top-down metabolic network discovery approaches from the literature. Top-down approaches depend on intracellular metabolome data, whereas some bottom-up approaches aim to use omics data as additional constraints. The simultaneous use of both approaches to discover better condition-specific networks has not been a focus in the scientific community. Here, we elaborate on ways to reconcile these two approaches when intracellular metabolome data for the condition in question are available.

All model-based top-down approaches using time-series data also infer a Jacobian matrix of the model. Many other top-down approaches are based on correlations between metabolites. There is a significant relationship between the correlation strengths and the strengths of interactions implied by Jacobian entries (Çakır et al., 2009). Therefore, correlation strengths or Jacobian-interaction strengths of the inferred edges can be used as edge scores that serve as additional constraints in bottom-up constraint-based modeling, for a better identification of the active metabolic network, as follows. All edges inferred by a top-down approach from metabolome data are ranked with respect to their edge scores. Afterward, cut-off values for high and low scores are determined. If a high-score edge also appears in the corresponding static genome-scale stoichiometric model, that reaction is assigned a high weight. If a high-score edge does not have a corresponding connection in the genome-scale model, this could imply a novel or a regulatory interaction. Genome-scale metabolic models are known not to account for regulatory interactions of metabolites with enzymes; top-down approaches, however, do not have this limitation since they are purely data-based. If the edge score is low, the corresponding reaction in the stoichiometric model is assigned a low weight. Similarly, if the top-down approach assigns no edge between two metabolites that are linked by a reaction in the stoichiometric model, such reactions are also assigned a low weight. All other reactions can be assigned a medium weight. Then, a mixed-integer programing based optimization framework can be used with Eq. 2 such that the resulting condition-specific flux distribution is as consistent as possible with the edge scores, with the maximum possible number of high-weight reactions and the minimum possible number of low-weight reactions active. Thereby, the strength of top-down predictions can be used for better bottom-up flux predictions.
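The weighting rules described above can be sketched as follows. This covers only the weight assignment, not the mixed-integer optimization step. Representing each reaction by a single metabolite pair is a simplification, and the function name, metabolite and reaction identifiers, scores, and cut-offs are all hypothetical.

```python
def reaction_weights(edge_scores, model_edges, high_cut, low_cut):
    """Map top-down edge scores onto weights for reactions of a static model.

    edge_scores: {frozenset({met1, met2}): score} inferred from metabolome data
    model_edges: {reaction_id: frozenset({met1, met2})} taken from the model
    """
    weights = {}
    for rxn, pair in model_edges.items():
        score = edge_scores.get(pair)
        if score is None or score <= low_cut:
            weights[rxn] = "low"      # no inferred edge, or only a weak one
        elif score >= high_cut:
            weights[rxn] = "high"
        else:
            weights[rxn] = "medium"
    # High-score edges missing from the model: candidate novel or regulatory links.
    known = set(model_edges.values())
    candidates = [pair for pair, score in edge_scores.items()
                  if score >= high_cut and pair not in known]
    return weights, candidates

# Hypothetical edge scores (e.g., absolute correlation strengths).
edge_scores = {
    frozenset({"g6p", "f6p"}): 0.9,
    frozenset({"f6p", "fbp"}): 0.1,
    frozenset({"pep", "pyr"}): 0.5,
    frozenset({"atp", "pep"}): 0.8,
}
model_edges = {
    "PGI": frozenset({"g6p", "f6p"}),
    "PFK": frozenset({"f6p", "fbp"}),
    "PYK": frozenset({"pep", "pyr"}),
    "ENO": frozenset({"2pg", "pep"}),
}
weights, candidates = reaction_weights(edge_scores, model_edges,
                                       high_cut=0.7, low_cut=0.3)
```

The resulting weights could then enter the objective of the optimization problem, while the candidate list flags high-score edges that have no counterpart in the stoichiometric model.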

The use of transcriptome or proteome data as constraints in metabolic-flux calculations has resulted in several alternative methods such as GIMME, iMAT, and INIT. These approaches remove reactions from the static metabolic reaction set if the controlling gene or protein is not active. However, a recent work comparing all these methods could not identify a method with clear superiority over parsimonious FBA (Machado and Herrgård, 2014). Such methods can be combined with edge-score information (inferred Jacobian-interaction strength or calculated correlation strength) to yield better network identification. GIMME-like approaches remove reactions from the model, which also means the removal of metabolites. Two rules can be used: (i) removed reactions whose main substrates and products show high edge scores must be retained in the reaction set, implying an active edge; (ii) reactions whose main substrates and products show very low and insignificant correlations must be candidates for removal from the reaction set, implying an inactive edge, if their removal does not hamper the objective function. Such a flux calculation powered by the top-down inference of network edges can lead to a more refined network.

Another reconciliation approach is the integrative use of the flux-balance equation (Eq. 2) and the rearranged Lyapunov equation (Eq. 5). The flux-balance equation has been widely used in the last two decades because of its simplicity: it requires only the stoichiometric coefficients of reactions and a few measurement constraints. The rearranged Lyapunov equation bears a similar simplicity since it is based only on the covariances of metabolome measurements. The only major issue, as is the case for the flux-balance equation, is a proper choice of objective function to solve the equation. Since both **J** and **v**, the unknowns in the two equations, represent the active network structure, the coupled use of these two equations can be beneficial in two respects: (i) a better flux distribution can be found thanks to the metabolome-based constraint provided by Eq. 5; (ii) the information stored in the stoichiometric matrix, which reveals all possible non-interacting pairs, provides a constraint for a better estimate of the Jacobian matrix by setting the edge scores of some pairs to zero.

An increasingly popular approach to construct genome-scale kinetic models is ensemble modeling. This approach starts from an ensemble of kinetic models and filters out the inconsistent ones by using the results of perturbation experiments (Tran et al., 2008; Khodayari et al., 2014). On the other hand, a number of methods infer networks from time-series data by using a model-based approach. The output of such methods is both the network structure and the dynamic kinetic model with estimated parameters (Srividhya et al., 2007; Liu and Wang, 2008). A number of alternative models are scanned in these methods to infer the most suitable one. Therefore, the strengths of model-based network inference and ensemble-based kinetic model reconstruction can be combined to yield better frameworks.

In summary, both bottom-up and top-down discovery of metabolic networks have come a long way in the last 20 years, providing the scientific community with a number of computational methods, as reviewed here. Considering the ongoing improvements in both the coverage and the precision of metabolome data, the coming decade will witness an exponential increase in the number of metabolome datasets, similar to what was experienced with transcriptome data in the last decade. This review aimed to draw attention to this point, as ways to reconcile the two major metabolic network discovery approaches will gain increasing importance.

#### **ACKNOWLEDGMENTS**

The financial support by TUBITAK, The Scientific and Technological Research Council of Turkey, through a career grant (Project Code: 110M464) is gratefully acknowledged.

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 24 September 2014; accepted: 14 November 2014; published online: 03 December 2014.*

*Citation: Çakır T and Khatibipour MJ (2014) Metabolic network discovery by top-down and bottom-up approaches and paths for reconciliation. Front. Bioeng. Biotechnol. 2:62. doi: 10.3389/fbioe.2014.00062*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology.*

*Copyright © 2014 Çakır and Khatibipour. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Microalgal metabolic network model refinement through high-throughput functional metabolic profiling

#### **Amphun Chaiboonchoe<sup>1,2†</sup>, Bushra Saeed Dohai<sup>1,2†</sup>, Hong Cai<sup>1,2†</sup>, David R. Nelson<sup>1,2</sup>, Kenan Jijakli<sup>1,2,3</sup> and Kourosh Salehi-Ashtiani<sup>1,2</sup>\***

<sup>1</sup> Division of Science and Math, New York University Abu Dhabi, Abu Dhabi, UAE

<sup>2</sup> Center for Genomics and Systems Biology (CGSB), New York University Abu Dhabi Institute, Abu Dhabi, UAE

<sup>3</sup> Engineering Division, Biofinery, Manhattan, KS, USA

#### **Edited by:**

Kai Hua Zhuang, Technical University of Denmark, Denmark

#### **Reviewed by:**

Eduardo S. Zeron, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional, Mexico Eleftherios Pilalis, National Hellenic Research Foundation, Greece

#### **\*Correspondence:**

Kourosh Salehi-Ashtiani, Laboratory of Algal, Systems, and Synthetic Biology (LASSB), Division of Science and Math, Center for Genomics and Systems Biology (CGSB), P.O. Box 129188, New York University Abu Dhabi, Abu Dhabi, UAE e-mail: ksa3@nyu.edu

†Amphun Chaiboonchoe, Bushra Saeed Dohai and Hong Cai have contributed equally to this work.

Metabolic modeling provides the means to define metabolic processes at a systems level; however, genome-scale metabolic models often remain incomplete in their description of metabolic networks and may include reactions that are experimentally unverified. This shortcoming is exacerbated in reconstructed models of newly isolated algal species, as there may be little to no biochemical evidence available for the metabolism of such isolates. The phenotype microarray (PM) technology (Biolog, Hayward, CA, USA) provides an efficient, high-throughput method to functionally define cellular metabolic activities in response to a large array of entry metabolites. The platform can experimentally verify many of the unverified reactions in a network model as well as identify missing or new reactions in the reconstructed metabolic model. The PM technology has been used for metabolic phenotyping of non-photosynthetic bacteria and fungi, but it has not been reported for the phenotyping of microalgae. Here, we introduce the use of PM assays in a systematic way to the study of microalgae, applying it specifically to the green microalgal model species *Chlamydomonas reinhardtii*. The results obtained in this study validate a number of existing annotated metabolic reactions and identify a number of novel and unexpected metabolites. The obtained information was used to expand and refine the existing COBRA-based *C. reinhardtii* metabolic network model iRC1080. Over 254 reactions were added to the network, and the effects of these additions on flux distribution within the network are described. The novel reactions include the support of metabolism by a number of D-amino acids, L-dipeptides, and L-tripeptides as nitrogen sources, as well as support of cellular respiration by cysteamine-S-phosphate as a phosphorus source. The protocol developed here can be used as a foundation to functionally profile other microalgae such as known microalgae mutants and novel isolates.

**Keywords: microalgae, Chlamydomonas reinhardtii, flux balance analysis, phenotype microarray, metabolic network refinement**

#### **INTRODUCTION**

Optimization of algal metabolism toward improved bioproduct production while maintaining strain robustness remains a challenge that requires experimental strategies informed by systems-level analyses of metabolism. The use of metabolic network models can guide the development of optimization strategies that would otherwise be difficult to arrive at through rational design (Oberhardt et al., 2009; Schmidt et al., 2010; Koskimaki et al., 2013; Koussa et al., 2014). While an increasing number of algal species are being isolated and sequenced for biofuel or other applications, to date, there are only a handful of reconstructed algal networks available (Koussa et al., 2014). A major obstacle in the reconstruction of high-quality network models for algae remains the inability to obtain rapid and high-throughput metabolic phenotypic data to guide and validate reconstruction efforts.

One potential high-throughput phenotypic analysis technology is the Biolog OmniLog® phenotype microarray (PM) (Biolog, Hayward, CA, USA) (Bochner et al., 2001; Bochner, 2003, 2009). By assaying cellular metabolism in response to thousands of metabolites, signaling molecules, and effector molecules (as well as osmolytes), the Biolog PM assays have greatly boosted functional metabolic profiling by providing insight into function, metabolism, and environmental sensitivity (Bochner et al., 2001; Bochner, 2003, 2009). Biolog PM assays rely on the measurement of metabolite utilization by cells in 96-well microplates. Each well contains different nutrients, metabolites, and pH and osmolarity solutes. Other bioactive molecules such as antibiotics and hormones may also be assayed. Utilization is assessed in the form of cell respiration, determined by the amount of color development produced by the NADH reduction of a tetrazolium-based redox dye (Bochner et al., 2001; Bochner, 2003, 2009). Plates can be monitored automatically over time with the OmniLog platform. A common set of 20 96-well microplates is designed to measure carbon, nitrogen, sulfur, and phosphorus utilization phenotypes, along with osmotic/ion and pH effects. This high-throughput and standardized approach provides a quick and convenient method for the phenotypic comparison of different strains and organisms, leading to insights into the metabolic state of the cell. While the PM technology has been used for metabolic phenotyping of various microbial species including bacteria and fungi, it has not been reported for the phenotyping of microalgae. Likewise, the technology has been successfully used for verification and expansion of a number of existing microbial metabolic network models (Bochner et al., 2001; Bochner, 2003, 2009; Bartell et al., 2014), yet its use for improvement of microalgal models remains unreported.

The goal of the present study is to establish a reliable method for characterizing metabolic phenotypes of microalgae that can be used to expand existing network models or guide the reconstruction of new algal metabolic models. We present the implementation of the PM platform for metabolic phenotyping of microalgae using *Chlamydomonas reinhardtii* as a model organism, and then expand a well-curated existing metabolic network model of *C. reinhardtii* accordingly.

## **MATERIALS AND METHODS**

#### **PHENOTYPE MICROARRAY EXPERIMENTS**

Phenotyping was done using standard Biolog assay plates and the OmniLog instrument. In total, 190 substrate utilization assays for carbon sources (PM01 and PM02), 95 substrate utilization assays for nitrogen sources (PM03), 59 nutrient utilization assays for phosphorus sources and 35 nutrient utilization assays for sulfur sources (PM04), along with peptide nitrogen sources (PM06–08), were utilized. A defined tris-acetate-phosphate (TAP) medium (Gorman and Levine, 1965) containing 0.1% tetrazolium violet dye "D" (Biolog, Hayward, CA, USA) was used for the PM tests. The carbon, nitrogen, phosphorus, or sulfur component was omitted from the defined medium when it was applied to the respective PM microplates that tested for each of those sources.

*Chlamydomonas reinhardtii* strain CC-503 was obtained from the *Chlamydomonas* Resource Center at the University of Minnesota, USA. Cells were grown in fresh TAP medium to mid-log phase, spun down at 2,000 × *g* for 10 min, and then resuspended in fresh medium to a final concentration of 1 × 10<sup>6</sup> cells before inoculation into Biolog's 96-well plates. A 100 µL aliquot of cell-containing medium was inoculated into each well before the plates were inserted into the OmniLog system. A final concentration of 400 µL/mL timentin® (GlaxoSmithKline, New Zealand) was used to inhibit bacterial growth in all plates. In addition, ampicillin and kanamycin were occasionally used at 50–100 µg/mL. Bacterial contamination was monitored by streaking cells on yeast extract/peptone plates and performing Gram stains before and after Biolog assays. All microplates were incubated at 30°C for up to 7 days, and the dye color change (in the form of absorbance) was read with the OmniLog system every 15 min. As the OmniLog instrument does not provide a source of continuous light during incubation, the alga is assumed to be carrying out heterotrophic respiration.

#### **DATA ANALYSIS**

The Biolog PM data analysis was carried out using an OmniLog phenotype microarray (OPM) software package (Vaas et al., 2012, 2013) that runs within the R software environment. The raw kinetic data were exported as CSV files to the OPM package and then the biological information was added as metadata (e.g., strain designation, growth media, temperature, etc.). Kinetic curves were plotted from the raw data in the form of *xy* and level plots, and a statistical analysis was carried out to visualize the metabolic properties and generate OmniLog values. An OmniLog value, or the curve parameter "A," simply lists the maximum height of the growth curve.

Duplicate assays were carried out for all the plates that were tested to assess reproducibility of the data. An assay was considered positive when the absorbance (OmniLog value) remained positive after subtraction of the values of the negative control well and the respective blank well. The sum of these two control values represents the abiotic reaction of the dye with the media in the presence of the tested compound.
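The positivity criterion can be written as a short sketch under one reading of the text, namely that the background is the sum of the negative control and blank readings. The function name, the well labels, and the numbers are hypothetical, and how the duplicates were combined is not specified here, so replicates are not modeled.

```python
def call_positive(a_test, a_negative_control, a_blank):
    """Background-correct one OmniLog value and report whether it stays positive.

    Assumption: background = negative control + blank (abiotic dye reactivity).
    """
    background = a_negative_control + a_blank
    corrected = a_test - background
    return corrected, corrected > 0

# Hypothetical maximum-height ("A") readings for three wells of one plate.
readings = {"well_B03": 180.0, "well_C07": 95.0, "well_D11": 60.0}
negative_control, blank = 70.0, 25.0

calls = {well: call_positive(a, negative_control, blank)
         for well, a in readings.items()}
positives = sorted(well for well, (_, is_pos) in calls.items() if is_pos)
```

With these made-up values, a well whose reading exactly equals the background is not called positive, since the corrected value must be strictly greater than zero.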

#### **IDENTIFICATION OF REACTIONS AND GENES ASSOCIATED WITH NEW METABOLITES**

Gene-to-reaction associations for compounds were established as follows. Assignment of a compound's enzyme commission (EC) number and relevant reactions was performed by searching KEGG<sup>1</sup> and MetaCyc<sup>2</sup>. The genomic evidence for each reaction was then recovered by using the identified EC numbers as a search basis in multiple available algal annotation resources, such as the Joint Genome Institute (JGI), Phytozome<sup>3</sup>, and peer-reviewed publications. When the query returned no genomic evidence for a given EC number, the relevant associated proteins in other organisms were identified, and then a profile-based search was carried out using the NCBI PSI-BLAST server with default settings against non-redundant protein sequences (nr) in *C. reinhardtii* (taxid: 3055). PSI-BLAST hits with E values of ≤0.05 were manually curated for relevance to the searched EC number, either through the evaluation of their described enzymatic activity or by querying those BLAST hits through the EMBL-EBI Pfam<sup>4</sup> or InterPro<sup>5</sup> protein domain prediction servers.

#### **MODEL REFINEMENT AND EVALUATION**

Identified reactions with their associated genes were added to iRC1080 using the COBRA Toolbox functions addReaction and changeGeneAssociation. In addition, transport reactions for the new metabolites were incorporated into the model as transport by passive diffusion from the extracellular medium into the cytosol. The behavior of the resulting model, iBD1106, was tested by carrying out flux balance analyses under light and dark conditions with the maximization of biomass as the objective function. The comparison of the two models was based on reported shadow prices (sensitivity of the objective function to changes in system variables) of the metabolites. The biomass function was defined previously (Chang et al., 2011) for growth under dark and light conditions. The revised model can be found in the supplementary file iBD1106.xml in SBML format.

<sup>1</sup>http://www.genome.jp/kegg/

<sup>2</sup>http://metacyc.org/

<sup>3</sup>http://www.phytozome.net

<sup>4</sup>http://pfam.xfam.org/search

<sup>5</sup>http://www.ebi.ac.uk/interpro/

#### **RESULTS**

#### **PHENOTYPE MICROARRAY SCREENING OF MODEL ALGA CHLAMYDOMONAS REINHARDTII**

To implement the use of the PM platform for algal metabolic phenotyping, we used *C. reinhardtii* as a model. The single-cell green alga *C. reinhardtii* is a model organism that has been widely used for basic and applied biological research. Its genome was sequenced and publicly released by JGI in 2007 (Merchant et al., 2007) and genome-scale models of its metabolism have been reconstructed (May et al., 2009; Chang et al., 2011; Dal'Molin et al., 2011). The ability to grow phototrophically or heterotrophically, along with rapid growth and scalability, are features that make this alga an attractive model system for algal-based biofuel studies.

Our pipeline (**Figure 1**) integrates the high-throughput PM assays, applied to the alga of interest, with genomic searches to provide experimental evidence that can lead to the refinement of an existing metabolic network model. The pipeline may also be applied for a new reconstruction if an existing model is not available. The PM assays test the ability of the alga to utilize various carbon, nitrogen, sulfur, and phosphorus sources in a minimal medium. When a new compound tests positive for utilization, the compound's relevant reaction profiles are defined using metabolic knowledge bases such as KEGG (see text footnote 1) or MetaCyc (see text footnote 2). This step defines all potential reactions and pathways that can be associated with a metabolite to provide EC numbers. The next step is to find supporting genetic evidence from genetic databases specific to the alga, such as databases from the JGI, Phytozome (see text footnote 3), or peer-reviewed publications. If genetic evidence is available, the reactions and metabolites are added to the model to expand and refine the model. If, on the other hand, genomic evidence is not found in support of the EC number, a profile-based search, such as PSI-BLAST, can be performed to identify candidate genes associated with the reaction. The results of such searches are then manually evaluated; those passing this QC step are added to the network model. In exceptional cases, if genes are not identified for reactions but compelling biochemical evidence exists, reactions may be provisionally added to the network pending future investigations.
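The decision logic of the pipeline can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions: the `Evidence` record, its field names, and the outcome labels are hypothetical and were introduced only for illustration; the order of the checks follows the description above.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Evidence gathered for one PM test compound (hypothetical field names)."""
    pm_positive: bool          # compound tested positive in the PM assay
    annotated_gene: bool       # EC number found in the alga's annotation resources
    psi_blast_curated: bool    # a PSI-BLAST hit passed manual curation
    biochemical_support: bool  # compelling biochemical evidence without a gene


def decide(e: Evidence) -> str:
    """Return the pipeline outcome for a compound, checking evidence in order."""
    if not e.pm_positive:
        return "skip"               # no utilization observed, nothing to add
    if e.annotated_gene:
        return "add"                # direct genomic evidence
    if e.psi_blast_curated:
        return "add"                # profile-based evidence that passed QC
    if e.biochemical_support:
        return "add provisionally"  # exceptional case, pending future work
    return "hold"                   # positive assay but no supporting evidence
```

A compound that is positive in the assay but has no annotation, no curated hit, and no biochemical support ends in the "hold" state, which corresponds to the metabolites that were observed but not added to the model.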

#### **IMPLEMENTATION AND VALIDATION**

We optimized the PM assays for metabolic profiling of *C. reinhardtii* by modifying the standard Biolog protocol with respect to inoculum concentration, type of dye, and pre-inoculation growth conditions (Materials and Methods). We used plates 1–4 and 6–8 of the PM platform, which test the utilization of carbon, nitrogen, sulfur, and phosphorus sources and a variety of di- and tripeptides. The summary kinetics of selected plates (PM01 and PM03) are shown in **Figure 2**. Spline-based curve fitting was implemented to extract the curve parameters [the lag phase (λ), the respiration (or growth) rate (µ, the steepness of the slope), the maximum cell respiration (A), and the area under the curve (AUC)]. The maximum cell respiration values (A) of the blank and negative controls of each microwell plate (which represent abiotic reactivity of the dye with the medium and the test metabolite) were used as background subtraction values to identify positive metabolites. The "*xy*-plots" show the respiration measurements over time mapped to the 96-well assay plates, in terms of the raw measurement values (*y*-axis) and time (*x*-axis). In addition, the data were transformed to a heat map format to allow a quick comparative overview of the kinetic data. The heat map presents the kinetic values in different colors (ranging from light yellow to dark orange or brownish; **Figures 2B,D**).
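To make the four curve parameters concrete, the following Python sketch estimates them from a single well's time series. It is an illustration only: it uses finite differences and the trapezoidal rule rather than the spline fit used in the study, and the function names and the toy readings in the usage note are assumptions.

```python
def curve_parameters(times, values):
    """Estimate A (maximum), mu (steepest slope), lag, and AUC for one well."""
    A = max(values)
    # Slope of each segment between consecutive readings.
    slopes = [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]
    mu = max(slopes)
    k = slopes.index(mu)  # first segment with the steepest slope
    # Lag: where the tangent through the steepest segment meets the baseline.
    t_mid = 0.5 * (times[k] + times[k + 1])
    y_mid = 0.5 * (values[k] + values[k + 1])
    lag = t_mid - (y_mid - values[0]) / mu if mu > 0 else float("nan")
    # Area under the curve by the trapezoidal rule.
    auc = sum(
        0.5 * (values[i] + values[i + 1]) * (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    )
    return {"A": A, "mu": mu, "lag": lag, "AUC": auc}


def background_corrected_max(values, control_maxima):
    """Maximum signal minus the largest control maximum (blank, negative control)."""
    return max(values) - max(control_maxima)
```

For hypothetical readings `[0, 0, 50, 100, 100]` taken every 24 h, the sketch gives A = 100, a lag of 24 h, and an AUC of 4800 unit-hours. A well would then be called positive when its background-corrected maximum is clearly above zero; the exact cutoff is not specified here and would need to be chosen by the analyst.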

To assess the level of combined experimental and biological noise and systematic errors and biases from Biolog's PM measurements, the data from two independent replicate experiments were plotted against one another (**Figure 3**). This figure visually assesses the reproducibility of the PM data obtained from PM01–04 and PM10 plates. **Figure 3** shows that the majority of the data were identical as they fall on the 45° line with only a few outliers. This plot confirms the quality and reproducibility of the experiments for this alga.
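The replicate comparison can also be expressed numerically. The sketch below is an assumed, simplified companion to the scatter plot: it computes a Pearson correlation between the per-well maxima of two replicates and lists wells whose maxima differ by more than a tolerance. The function names are hypothetical, and the default tolerance of 50 OmniLog units is only an illustrative choice.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def replicate_outliers(rep1, rep2, tolerance=50.0):
    """Indices of wells whose replicate maxima differ by more than `tolerance`."""
    return [i for i, (a, b) in enumerate(zip(rep1, rep2)) if abs(a - b) > tolerance]
```

Wells returned by `replicate_outliers` correspond to the points that fall away from the 45° line and are natural candidates for re-testing.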

#### **IDENTIFICATION OF NEW METABOLITES**

We compared the number of metabolites that can be identified by Biolog's PM (662 chemical compounds from seven plates {PM01–PM04, and PM06–PM08}) with the iRC1080 metabolites and the metabolites measured using gas chromatography time-of-flight (GC-TOF) (Bölling and Fiehn, 2005) (**Figure 4**). Only six metabolites overlapped among the three sets (adenine, glycerol, glycine, myo-inositol, putrescine, and uracil), while 149 were common between iRC1080 and the Biolog set under investigation. This shows that while each technology/tool has its strength in metabolic profiling research, the Biolog set can be a significant source of new metabolic information.
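A comparison of this kind reduces to set intersections once metabolite names are normalized. The following is a minimal sketch, assuming the three metabolite lists are available as plain name lists; the function names and the normalization rule (trimming and lowercasing) are assumptions, and real lists would usually need matching by database identifier rather than by name.

```python
def normalize(names):
    """Lowercase and trim names so that 'myo-Inositol' matches 'myo-inositol'."""
    return {n.strip().lower() for n in names}


def three_way_overlap(pm, model, gctof):
    """Sizes of the pairwise and three-way intersections of three name lists."""
    pm, model, gctof = normalize(pm), normalize(model), normalize(gctof)
    return {
        "pm_model": len(pm & model),
        "pm_gctof": len(pm & gctof),
        "model_gctof": len(model & gctof),
        "all_three": len(pm & model & gctof),
    }
```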

After subtracting the background signal, we observed acetic acid as the only positive assay for carbon utilization (in PM01 plate). Detection of acetate as the only carbon source from this plate is consistent with the *Chlamydomonas* literature (e.g., Harris, 2009) and provides evidence for specificity of our assays. Four positive reactions for sulfur utilization (sulfate, thiosulfate, tetrathionate, d,l-lipoamide) and four positive assays for phosphorus utilization (thiophosphate, dithiophosphate, d-3-phospho-glyceric acid and cysteamine-*S*-phosphate) were detected. *C. reinhardtii* showed positive results for several nitrogen sources including both l-amino and d-amino acids, and less common amino acids such as l-homoserine, l-pyroglutamic acid, methylamine, ethylamine, ethanolamine, and d,l-α-amino-butyric acid. Furthermore, a large number of dipeptides and a few tripeptides assayed positive (**Table 1**).

Altogether, we identified 128 new metabolites from the PM data that were not present in our iRC1080 metabolic model: eight d-amino acids, tetrathionate, thiophosphate, dithiophosphate, cysteamine-*S*-phosphate, l-pyroglutamic acid, and ethylamine, 108 dipeptides, and 5 tripeptides. We note that sequence specificity was observed for utilization of both di- and tripeptides. The identified metabolites are summarized in **Table 1** and Table S2 in Supplementary Material.

We searched KEGG and MetaCyc to define all possible reactions and EC numbers associated with the identified new metabolites. Forty-nine unique EC numbers were associated with the newly identified metabolites. Table S2 in Supplementary Material includes pathways, reactions, EC numbers, proteins, and *Chlamydomonas* annotation sources for each of the metabolites. Five different sources were used to obtain genomic evidence for the reactions. These included Phytozome Version 10.0.2 (Goodstein et al., 2012), JGI Version 4 (Ghamsari et al., 2011), AUGUSTUS 5.0 and 5.2 (Chang et al., 2011), annotations from Manichaikul et al. (2009), and KEGG (Kanehisa et al., 2014). Out of 49 searched ECs, 15 transcripts could be found with annotations matching the searched ECs (**Table 1**; Tables S1 and S2 in Supplementary Material).

The metabolic reactions and their respective EC numbers for which no genomic evidence was found (using the aforementioned resources) were then entered into the Universal Protein Resource website (UniProt)<sup>6</sup> (Apweiler et al., 2004; Consortium, 2014). There, sequences that are related to the metabolites but are from other organisms were identified. Those sequences were then used to run Position-Specific Iterated BLAST (PSI-BLAST) queries<sup>7</sup>

<sup>6</sup>http://www.uniprot.org/

<sup>7</sup>https://blast.ncbi.nlm.nih.gov/Blast.cgi

**FIGURE 2 |** Respiration (or growth) *xy*-plots and level plots of the PM01 [carbon sources; **(A,B)**] and PM03 [nitrogen sources; **(C,D)**] assay plates. Each panel is an 8 × 12 array in which each cell represents one well of the plate and, thus, a given metabolite or growth environment. Within each cell, curves represent dye conversion by reduction (*y*-axis) as a function of time (*x*-axis). PM respiration curves from CC-503 and the blank are both shown, distinguished by color. The level plot represents each respiration curve as a thin horizontal line changing color (or remaining unchanged) over time. Shading changes from light yellow to dark orange or brownish with the level of the respiration measurement, with the brownish color representing higher respiration values. Metabolites utilized by *C. reinhardtii* (CC-503) and the blank plates are shown.

**FIGURE 3 | Reproducibility of PM tests**. OmniLog values were collected over a 168 h period and the maximum values were plotted for two replicate studies. Each axis represents the maximum OmniLog values for each study (the x-axis being one replicate study and the y-axis another). Identical values fall on a 45° line; there are a few deviating test values (some deviations were by more than 50 units). Each point represents a single maximum OmniLog value.

**FIGURE 4 | Venn diagram of metabolites**. The Venn diagram is a representation of metabolites common to Biolog's PM plates, the iRC1080 metabolic model and Gas Chromatography time-of-flight (GC-TOF) experiments. Each circle indicates the total number of metabolites that exists in each respective method of study, while the overlapping regions represent the number of metabolites shared between those methods of study. The iRC1080 metabolic model contains a total of 1,068 unique metabolites, the GC-TOF identified a total of 77 metabolites (Bölling and Fiehn, 2005), while there are a total of 662 metabolites tested using Biolog's PM plates.

#### **Table 1 | List of identified positive substrate utilization metabolites (C, P, S, N) not present in the iRC1080 model**.




<sup>a</sup>Reaction was not included if no gene was identified.

<sup>b</sup>Phytozome version 10.0.2 (http://phytozome.jgi.doe.gov/pz/portal.html#!info? alias=Org\_Creinhardtii).

<sup>c</sup>JGI version 4 (Ghamsari et al., 2011).

<sup>d</sup>Augustus version 5 (Chang et al., 2011).

<sup>e</sup>KEGG (http://www.genome.jp/kegg/kegg1.html).

<sup>f</sup>JGI version 3.1 (Manichaikul et al., 2009).

from the NCBI website to identify homologous sequences in *C. reinhardtii*. Only the sequences that produced significant alignments were considered; specifically, results with an E-value below 0.005 were retained. The final step before integrating the genes from the PSI-BLAST results with the iRC1080 metabolic model was to check whether the genes' associated reactions related to the new metabolites; only hits with relevant annotated enzymatic reactions were kept. The PSI-BLAST searches yielded four additional transcripts (**Table 1**; Table S2 in Supplementary Material).
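The E-value filter described above is simple to express in code. The sketch below assumes that PSI-BLAST results have already been parsed into dictionaries with an `evalue` key; that record layout, the function name, and the example hit identifiers are hypothetical. The automatic step only applies the cutoff, and the manual check of enzymatic annotation still follows it.

```python
def filter_hits(hits, max_evalue=0.005):
    """Keep hits with an E-value below the cutoff, most significant first."""
    kept = [h for h in hits if h["evalue"] < max_evalue]
    return sorted(kept, key=lambda h: h["evalue"])


# Hypothetical parsed hits; identifiers are placeholders, not real accessions.
example_hits = [
    {"id": "hit_A", "evalue": 2e-30},
    {"id": "hit_B", "evalue": 0.02},
    {"id": "hit_C", "evalue": 1e-4},
]
```

With the default cutoff, `filter_hits(example_hits)` keeps `hit_A` and `hit_C` and drops `hit_B`. Passing a different `max_evalue` makes it easy to see how sensitive the candidate list is to the chosen threshold.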

#### **MODEL REFINEMENT**

The metabolites identified as new to the network were categorized and annotated in the model based on their utilization into nitrogen sources, phosphate sources, and sulfur sources. The nitrogen source metabolites were 8 d-amino acids, 2 l-amino acids, 108 dipeptides, and 5 tripeptides. The phosphate sources were cysteamine-*S*-phosphate, thiophosphate, and dithiophosphate.

#### **Table 2 | Contents of iRC1080 and iBD1106**.


#### **Table 3 | Summary of new reactions in iBD1106**.


The only new sulfur source metabolite was tetrathionate. No genomic evidence for tetrathionate was found in the databases, and its PSI-BLAST E values did not pass the threshold of 0.005; thus, no reaction for this metabolite was added to the model. In addition, l-pyroglutamic acid, thiophosphate, dithiophosphate, and ethylamine were not added to the model due to lack of genomic evidence.

To expand the existing model, reactions associated with the new metabolites and the genes associated with the new reactions were added to the iRC1080 model to generate an expanded network, iBD1106. iBD1106 accounts for 2,445 reactions, 1,959 metabolites, and 1,106 genes (**Table 2**). The 254 newly added reactions are distributed as follows: 20 amino acid reactions, 108 dipeptide reactions, 5 tripeptide reactions, and 120 transport reactions (**Table 3**). The 20 new amino acid reactions were associated with 4 new genes (Cre02.g096350.t1.3, au.g14655\_t1, e\_gwW.1.243.1, Cre12.g486350.t1.3). The d-amino acids are oxidized into ammonium and a 2-oxo-carboxylate via the following reaction (EC 1.4.3.3; associated gene Cre02.g096350.t1.3):

$$\text{D-amino acid} + \text{O}_2 + \text{H}_2\text{O} \rightarrow \text{NH}_4^{+} + \text{H}_2\text{O}_2 + \text{2-oxo carboxylate} \tag{1}$$

Equation 1 is a general reaction for all d-amino acids. However, some d-amino acids contribute to different reactions in addition to their own oxidation reactions. For example, d-serine reacts with ATP, producing ADP and phospho-d-serine. Moreover, the chirality of d-amino acids can also be inverted into L forms and vice versa through annotated racemases (Table S2 in Supplementary Material).
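When adding reactions such as Eq. 1 to a model, it is useful to confirm that each instance is balanced in elements and charge. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the parser handles only simple formulas without brackets, and the worked instance (d-alanine oxidized to pyruvate, with formulas for the protonation states expected near neutral pH) was chosen here as an example and is not taken from the model files.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C3H7NO2' (no brackets)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_balanced(substrates, products):
    """Check element and charge balance; each side is a list of (formula, charge)."""
    def totals(side):
        atoms, charge = Counter(), 0
        for formula, q in side:
            atoms += parse_formula(formula)
            charge += q
        return atoms, charge

    return totals(substrates) == totals(products)


# Illustrative instance of Eq. 1: D-alanine + O2 + H2O -> NH4+ + H2O2 + pyruvate
d_alanine_oxidation = (
    [("C3H7NO2", 0), ("O2", 0), ("H2O", 0)],
    [("H4N", 1), ("H2O2", 0), ("C3H3O3", -1)],
)
```

Calling `is_balanced(*d_alanine_oxidation)` returns `True`; dropping the water from the substrate side makes the check fail, which is the kind of curation error this test is meant to catch.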

Four genes identified by PSI-BLAST were added into the model and account for the d-alanine transaminase reaction (Eq. 2); XP\_001698572.1, XP\_001693532.1, XP\_001701890.1, XP\_001700930.1:

$$\text{2-oxoglutarate} + \text{D-alanine} \leftrightarrow \text{D-glutamate} + \text{pyruvate} \tag{2}$$

In addition, XP\_001692123.1, a PSI-BLAST-identified gene, was associated with the d-asparagine oxidation reaction shown in Eq. 1.

A total of 113 new reactions account for the hydrolysis of dipeptides and tripeptides. The hydrolysis of dipeptides and tripeptides is associated with two genes, one for dipeptides (Cre02.g078650.t1.3) and one for tripeptides (Cre16.g675350.t1.3). The dipeptides and tripeptides are decomposed into their constituent l-amino acids; for instance, Leu–Pro decomposes into l-leucine and l-proline.
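Because the peptide hydrolysis reactions all follow one pattern, they can be generated rather than typed by hand. The following sketch is hypothetical: the `RESIDUES` lookup covers only a few residues for illustration, the function name is an assumption, and tripeptides are written as a single lumped step that consumes one water per peptide bond, which may differ from how the reactions are encoded in the model.

```python
# Partial three-letter code lookup, for illustration only.
RESIDUES = {
    "Leu": "l-leucine",
    "Pro": "l-proline",
    "Ala": "l-alanine",
    "Asp": "l-aspartate",
    "Gly": "glycine",
}


def hydrolysis_reaction(peptide):
    """Build a hydrolysis equation for a peptide written like 'Leu-Pro'."""
    parts = peptide.replace("–", "-").split("-")  # accept en dash or hyphen
    products = [RESIDUES[p] for p in parts]
    bonds = len(parts) - 1  # one water is consumed per peptide bond
    water = "H2O" if bonds == 1 else f"{bonds} H2O"
    return f"{peptide} + {water} -> " + " + ".join(products)
```

For example, `hydrolysis_reaction("Leu-Pro")` gives `Leu-Pro + H2O -> l-leucine + l-proline`, matching the example in the text.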

With respect to sources of phosphorus, a reaction for hydrolysis of cysteamine-*S*-phosphate into cysteamine and phosphate was added according to the following reaction that is associated with the gene JLM\_162926:

$$\text{Cysteamine-}S\text{-phosphate} + \text{H}_2\text{O} \rightarrow \text{cysteamine} + \text{phosphate} \tag{3}$$

In order to specify the cellular compartment where the new reactions occur, we used the WoLF PSORT tool (Horton et al., 2007)<sup>8</sup> and the results reported by Ghamsari et al. (2011). By providing protein sequences that are associated with the new reactions, WoLF PSORT predicted that the new reactions are localized to the cytosol.

In metabolic models, incomplete biochemical information may create gaps that form discontinuities in the network. To determine whether any new gaps were introduced in the new model, gapFind, a COBRA command that lists root gaps, was used. The root gaps are defined as metabolites that cannot be produced in the metabolic model (Becker et al., 2007; Schellenberger et al., 2011). Using this command, we found that both the iRC1080 and iBD1106 models contain the same 91 root gaps. This indicates that the addition of the new metabolites and their associated reactions did not introduce any new gaps. We note that transport reactions for the import of new metabolites into the cytosol were added.
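The idea of a root gap can be illustrated with a purely structural check on a toy network. This sketch is not the COBRA implementation, whose internal formulation may differ; the function name, the input layout, and the three-reaction example are assumptions made for illustration.

```python
def root_no_production(reactions):
    """Return metabolites consumed by some reaction but produced by none.

    `reactions` maps a reaction id to (stoichiometry, reversible), where
    stoichiometry maps metabolite -> coefficient (negative means consumed).
    A reversible reaction can both produce and consume its metabolites.
    """
    produced, consumed = set(), set()
    for stoich, reversible in reactions.values():
        for met, coeff in stoich.items():
            if coeff > 0 or reversible:
                produced.add(met)
            if coeff < 0 or reversible:
                consumed.add(met)
    return sorted(consumed - produced)


# Toy network: A is imported, A -> B, and B + C -> D; nothing produces C.
toy = {
    "EX_A": ({"A": 1}, False),
    "R1": ({"A": -1, "B": 1}, False),
    "R2": ({"B": -1, "C": -1, "D": 1}, False),
}
```

Here `root_no_production(toy)` reports `C`. Adding a transport reaction that imports `C` removes the gap, which mirrors why transport reactions were added for the new metabolites.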

The metabolic behavior of iBD1106 was tested under light conditions (no acetate) and dark conditions (with acetate) by carrying out flux balance analyses with the biomass as the maximized objective function. To assess the contribution that each metabolite makes to the set objective function, shadow prices for all metabolites were obtained (Tables S3 and S4 in Supplementary Material). The shadow price of a metabolite is defined as the change in an objective function with respect to flux changes of a metabolite (Varma et al., 1993; Orth et al., 2010). Shadow price allows the determination of whether a metabolite is in "excess" or is "limiting" the objective function, e.g., biomass production. Negative values are for metabolites that will decrease the objective function, positive values are for those that will increase the objective function, and values of 0 are for metabolites that will have no effect on the objective function. The comparison of shadow prices between iBD1106 and iRC1080 indicates that, for most metabolites, a large change is not observed (**Figure 5**; Tables S3–S5 in Supplementary Material); however, differences are observed in 105 and 70 cases under light and dark growth, respectively. Instances of such metabolites are provided in **Table 4**.
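The shadow price can be stated compactly in linear-programming terms. The notation below is generic and assumed for illustration: $S$ is the stoichiometric matrix, $v$ the flux vector with bounds $v^{\min}$ and $v^{\max}$, $c$ the objective vector selecting biomass, and $b$ a hypothetical right-hand side of the mass-balance constraints that equals zero at steady state. The shadow price of metabolite $i$ is then the sensitivity of the optimal objective value to a small relaxation of that metabolite's balance:

$$\pi_i = \left.\frac{\partial Z^{*}}{\partial b_i}\right|_{b=0}, \qquad Z^{*}(b) = \max_{v}\,\left\{\, c^{\mathsf T} v \;:\; S v = b,\; v^{\min} \le v \le v^{\max} \right\}$$

Sign conventions differ between solvers and toolboxes, so the reading of positive and negative values should follow the convention stated in the text above.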

#### **DISCUSSION**

Algae are a diverse group of photosynthetic eukaryotes that are polyphyletic in origin (Pröschold and Leliaert, 2007). Algal lineages include the Viridiplantae, to which the green algae (Chlorophyta) belong; stramenopiles, which include brown, golden, and yellow algae and diatoms; Rhodophyta, or the red algae; and photosynthetic alveolates, which include dinoflagellates (Barton et al., 2007). Given the evolutionary distances between these lineages, significant differences in genome size and coding potential, environmental niche, and metabolic properties can be expected. Members of green algae may be aquatic or soil organisms with mixotrophic or autotrophic modes of metabolism (Pröschold and Leliaert, 2007). In addition, microalgae may or may not require co-factors for their growth. Studies on microalgal growth requirements have indicated that more than half require cobalamin (vitamin B12), while 22% require thiamine and 5% need biotin (Croft et al., 2006). Interestingly, these requirements are not reflected in algal phylogeny (Helliwell et al., 2011).

metabolic behaviors of metabolites of iRC1080 and iBD1106 under dark growth conditions.

<sup>8</sup>http://www.genscript.com/psort/wolf\_psort.html

Genomic approaches powered by next-generation sequencing technologies help to improve the understanding of the encoded algal metabolic potential; however, the full characterization of algal metabolism requires phenotypic data. For instance, the metabolome of *C. reinhardtii* has been studied under a number of conditions, including sulfur deprivation (Matthew et al., 2009; Shu and Hu, 2012; Aksoy et al., 2013), nitrogen deprivation (Blaby et al., 2013; Courant et al., 2013), and response to irradiance (Mettler et al., 2014) to provide insight into regulatory and metabolic responses of the species to environmental perturbations. In addition, transcriptomics, proteomics, and metabolomics studies have guided non-targeted profiling approaches for the detection and quantification of metabolites. Those non-targeted profiling approaches have included nuclear magnetic resonance (NMR), liquid chromatography mass spectrometry (LC-MS/MS), and gas chromatography mass spectrometry (GC/MS) (Veyel et al., 2014; Wase et al., 2014). The ability to study functional responses and phenotypes has been classically limited to targeted serial studies that usually employ mutagenesis, genetic knockouts, genetic over-expression, and physiological studies (Bochner, 2003; Dent et al., 2005; Morgan et al., 2009; Tshikhudo et al., 2013; Greetham, 2014). The wealth of phenotypic information gained from the PM technology, as demonstrated in this study, can help provide more complete systems-level knowledge when combined with other omics data, and help develop and refine metabolic models.

Genome-scale metabolic networks provide predicted genotype-phenotype relationships through metabolic flux optimization-based approaches. We previously reconstructed a genome-scale model for *C. reinhardtii* (iRC1080) (Chang et al., 2011) based on literature evidence (entailing ~250 sources), structurally verified genomic evidence, and predicted gene function and cellular localization information. This model has 1,706 metabolites with 2,191 reactions. Through the pipeline that we have described in this work, we were able to expand the network significantly to include 1,959 metabolites, 2,445 reactions, and 1,106 associated genes. A clear advantage that the PM provides is functional assays for entry metabolites to inform model refinement. Whereas mass spectrometry approaches give information on intermediate- and final-level metabolites, PM assays have the unique capability, because they account for entry-level metabolites, to inform more complete models from the start of metabolic pathways. PM assays and mass spectrometry can therefore be considered complementary approaches when characterizing organisms' metabolic profiles, with each technology refining and filling in specific gaps in metabolic models. Moreover, PM's contribution to a metabolic model's refinement is made in a rapid, high-throughput, and convenient manner, with an entire set of metabolites assayed in 5–7 days.

#### **NEW METABOLITES**

We have identified a number of di- and tripeptides and d-amino acids that significantly expand the list of nitrogen utilization compounds in *C. reinhardtii*. While we found that d-amino acids can support metabolism of *C. reinhardtii*, they may be involved in additional functions. A serine-type d-alanyl-d-alanine carboxypeptidase was found in *C. reinhardtii*'s genome that could potentially be involved in d-alanine metabolism. Serine-type d-alanyl-d-alanine carboxypeptidases have been shown to play a variety of protective roles, including protection against ionic and hyperosmotic stress (Príncipe et al., 2009). A d-alanine ligase was found in *C. reinhardtii*'s genome that is potentially involved in d-alanine multimerization. Recent research using <sup>15</sup>N NMR spectroscopy found that d-alanine accumulated in plants during UV exposure, and this finding is supported by previous research under various stress signals (Monselise et al., 2014). Therefore, the possibility that d-amino acids might have additional cellular functions in *C. reinhardtii*, aside from providing a source of nitrogen, can be a subject of future investigations.

*Chlamydomonas reinhardtii* is known to be able to use a variety of amino acids as a sole nitrogen source as long as acetate is present (Munoz-Blanco et al., 1990). In *C. reinhardtii*, arginine is the only amino acid known to be imported with high affinity; the rest are believed to be deaminated extracellularly (Kirk and Kirk, 1978) or transported passively (Zuo et al., 2012). We note that a literature search for d-amino acid transport has not provided any information on the mode of transport for this class of amino acids in *C. reinhardtii*, nor is it known whether the *C. reinhardtii* deaminase can deaminate d-amino acids. However, *C. reinhardtii* has been shown to exhibit amino acid racemase activity (Takahashi et al., 2012), which could explain the ability to assimilate d-amino acids intracellularly. This also provides indirect evidence that these amino acids may be absorbed or transported into the cell for conversion to their L counterparts. A biological function for d-amino acids has not been clearly defined; however, d-alanine and d-aspartate were detected in algae using reversed-phase HPLC; d-alanine was present in some marine diatoms, while d-aspartate was found in all the selected freshwater green microalgae and marine diatoms (Yokoyama et al., 2003).

In many microbes, dipeptides are imported into the intracellular compartment before they are eventually hydrolyzed. For instance, *Francisella tularensis* relies on an amino acid transporter of the major facilitator superfamily of secondary transporters for transporting amino acids intracellularly. Furthermore, dipeptides containing asparagine were effective at restoring cellular multiplication in the infection cycle of a *F. tularensis* mutant that lacked that essential amino acid transporter (Gesbert et al., 2014). In this study, a variety of dipeptides were found to promote heterotrophic respiration in *C. reinhardtii*. The latest version of *C. reinhardtii*'s genome contains a gene annotated as a peptide hydrolase Cre02.g078650.t1.3. We note that the detected utilization of the dipeptides is not without sequence specificity as 159 out of 267 of the dipeptides and 9 out of 14 of the tripeptides did not result in positive assay results.

From these newly identified metabolites, three phosphorus compounds were found: (1) cysteamine-*S*-phosphate (C2H7NO3PS), which is an organic phosphorothioate anion that is derived from deprotonation of thiophosphate OH groups and protonation of the amino group, (2) thiophosphate (or phosphorothioate), and (3) dithiophosphate, which is the product of the reaction of a base with phosphorus pentasulfide.

The only new sulfur source that was identified, tetrathionate, is a sulfur oxoanion that is derived from the compound tetrathionic acid and is commonly found in soils. It is a key intermediate in the oxidation of various reduced inorganic sulfur compounds. Several species of bacteria including *Salmonella enterica* (Winter et al., 2010) and *Acidithiobacillus ferrooxidans* (Rohwerder et al., 2003; Holmes and Bonnefoy, 2007; Chen et al., 2012) are known to be able to assimilate tetrathionate. Strains of *A. ferrooxidans* overexpressing tetrathionate hydrolase (tetH) were found to grow better on both sulfur and tetrathionate. In the archaeon *Acidianus hospitalis*, tetrathionate is secreted to form filaments from tetrathionate homomultimers (Krupovic et al., 2012). These remarkable filaments are believed to play a role in sulfur metabolism and adaptation to *A. hospitalis*'s extreme environment. Prokaryotes have also been shown to use tetrathionate as an electron acceptor in cobalamin (coenzyme B12) synthesis (Roth et al., 1996). Sulfur is commonly assimilated as reduced sulfur by most living organisms, but bacteria are known to reduce tetrathionate, thiosulfate, sulfite, sulfur, and dimethyl sulfoxide in dissimilatory reactions as well (Barrett and Clark, 1987). Tetrathionate is often used as an electron sink for oxidative phosphorylation (Chen et al., 2012). Bacteria that are known to respire using tetrathionate are often found to have the capability of reducing thiosulfate as well, but thiosulfate is not found to be reduced among organisms that do not respire thiosulfate (Rohwerder et al., 2003). Considering that *C. reinhardtii* is a soil organism, the ability to assimilate this compound is likely to provide an important survival advantage in *Chlamydomonas*' natural life cycle.

#### **iBD1106 MODEL VS. iRC1080**

Different behaviors can be observed for iBD1106 than those for iRC1080 under different conditions. When the biomass production was set as the objective function, a differential change can be noticed as a result of growth conditions. The addition of the new nitrogen sources (d-amino acids, dipeptides, and tripeptides) has a significant and differential effect on the shadow prices of metabolites under light and dark conditions for biomass production (**Figures 5A,B**, respectively).

Under light growth, d-aspartate in iBD1106 had a significant effect on the behavior of the chloroplastic metabolites of the riboflavin pathway. In iBD1106, d-aspartate is converted into l-aspartate through a racemase, and l-aspartate can be produced through hydrolysis of its dipeptides (Asp–Leu, Asp–Phe, Pro–Asp, Asp–Ala, Asp–Gln, and Asp–Gly). Also, the oxidation of d-asparagine produces d-aspartate as oxocarboxylate (Eq. 1). The addition of l-aspartate increases its consumption in purine metabolism, which leads to greater production of 2,5-diamino-6-hydroxy-4-(5′-phosphoribosylamino)-pyrimidine (25dhpp). The latter can be converted into 5-amino-6-(5′-phosphoribosylamino)uracil (5apru) in riboflavin metabolism, resulting in an excess of 4-(1-d-ribitylamino)-5-aminouracil (4r5au) and 5aprbu, with shadow prices of 0.168 and 0.158, respectively. Those results were not observed in iRC1080.

Another example of model discrepancy under light growth is the effect of adding d-serine reactions in iBD1106. Addition of d-serine limited the availability of the metabolite 1-(9Z)-octadecenoyl,2-(11Z)-octadecenoyl-sn-glycerol-3-phosphate (pa1819Z18111Z) (shadow price −0.009 in iRC1080 and −0.65 in iBD1106). This metabolite is produced and consumed by the reactions of glycerolipid metabolism for the production of palmitoyl-CoA (n-C16:0CoA) (pmtcoa). The addition of l-serine in iBD1106 results in more consumption of pmtcoa in sphingolipid metabolism through the reaction serine C-palmitoyltransferase (SERPT), which produces 3-dehydrosphinganine.

Under dark growth conditions, 4-aminobutanoate was in excess in iRC1080 and became limiting in iBD1106, with shadow price values of 0.18 and −0.05, respectively. The reason for this limiting availability is the addition of d-histidine and d-glutamate dipeptide hydrolysis reactions, e.g., Ala–His, and inversion into l-histidine and l-glutamate through a racemase. This addition increases the consumption of l-glutamate and l-histidine along with 4-aminobutanoate in glutamate metabolism and in arginine and proline metabolism, respectively. Moreover, the dark growth condition did not affect the behavior of 4-aminobutanoate significantly in iBD1106; however, in iRC1080 it shifted from a limiting metabolite (−0.07) to an excess metabolite (0.18) (**Table 4**). The excess of 4-aminobutanoate in iRC1080 under dark conditions might be related to the high consumption of NADPH under dark growth conditions. In proline metabolism, NADPH and 4-aminobutanoate are consumed more rapidly under dark than under light conditions. As such, the addition of d-histidine and d-glutamate compensates for the effect of dark growth on proline metabolism.

#### **CONCLUSION**

Phenotypic profiling has tremendous utility in modeling and understanding algal metabolism and is essential in elucidating genotypic differences in algae and the effects of environmental conditions on metabolism. The method presented here demonstrates the first reproducible study utilizing PM assays in profiling microalgae using *C. reinhardtii* as a model. We observed positive growth on 148 nutrients (one positive assay for C-source utilization, four positive assays each for S-source and P-source utilization, and 139 positive assays for N-source utilization). The wealth of phenotypic data can be used along with other references to compare organisms with known mutants or unknown isolates. This wealth of information will also shed light on novel metabolic pathways. The substrate utilization information and the newly identified metabolites were used for metabolic network expansion and refinement of the iRC1080 metabolic model. The study also provides a framework to bridge the missing links between genomics and metabolomics in microalgae. The described work provides an excellent method for the initial characterization of newly isolated or uncharacterized strains of algae. This combination of high-throughput phenotypic screening with metabolic modeling can allow for rapid refinement of existing metabolic network models, as demonstrated, and also provides biochemical evidence to support *de novo* reconstruction of new algal models.

#### **ACKNOWLEDGMENTS**

Major support for this work was provided by New York University Abu Dhabi Institute grant G1205-1205i and NYU Abu Dhabi Faculty Research Funds (AD060). We thank Basel Khraiwesh for help in designing some of the figures.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at http://www.frontiersin.org/Journal/10.3389/fbioe.2014.00068/abstract

#### **Data Sheet S1:**

**Table S1 | PM assays for PM01 to PM04 and PM06 to PM08, positive assays are shaded with yellow**.

**Table S2 | A table of new metabolites that were added to generate iBD1106 along with their genetic annotations**.

**Table S3 | Shadow prices for metabolites of iRC1080 and iBD1106 when biomass was set as objective function under growth with light and no acetate**.

**Table S4 | Shadow prices for metabolites of iRC1080 and iBD1106 when biomass was set as objective function with growth under dark with acetate**.

**Table S5 | Shadow prices for new metabolites in iBD1106 when Biomass was set as the objective function (growth under light without acetate and under dark with acetate)**.

**Data Sheet S2 | Chlamydomonas reinhardtii metabolic network model, iBD1106, in SBML format**.

**Figure S1 | Phenotype microarray results for plates 1–4, and 6–8**.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 22 October 2014; paper pending published: 15 November 2014; accepted: 24 November 2014; published online: 10 December 2014.*

*Citation: Chaiboonchoe A, Dohai BS, Cai H, Nelson DR, Jijakli K and Salehi-Ashtiani K (2014) Microalgal metabolic network model refinement through high-throughput functional metabolic profiling. Front. Bioeng. Biotechnol. 2:68. doi: 10.3389/fbioe.2014.00068*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology.*

*Copyright © 2014 Chaiboonchoe, Dohai, Cai, Nelson, Jijakli and Salehi-Ashtiani. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Integrative analysis of metabolic models – from structure to dynamics

## **Anja Hartmann<sup>1</sup>\* and Falk Schreiber<sup>2,3</sup>**

<sup>1</sup> Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany

<sup>2</sup> Monash University, Melbourne, VIC, Australia

<sup>3</sup> Martin-Luther-University Halle-Wittenberg, Halle, Germany

#### **Edited by:**

Daniel Machado, University of Minho, Portugal

#### **Reviewed by:**

Rafael Costa, Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa (INESC-ID/IST), Portugal

Ina Koch, Goethe University Frankfurt am Main, Germany

#### **\*Correspondence:**

Anja Hartmann, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Corrensstr. 3, OT Gatersleben, Stadt Seeland 06466, Germany

e-mail: hartmann@ipk-gatersleben.de

The characterization of biological systems with respect to their behavior and functionality based on versatile biochemical interactions is a major challenge. To understand these complex mechanisms at the systems level, modeling approaches are investigated. Different modeling formalisms allow metabolic models to be analyzed depending on the question to be solved, the biochemical knowledge, and the availability of experimental data. Here, we describe a method for an integrative analysis of the structure and dynamics represented by qualitative and quantitative metabolic models. Using various formalisms, the metabolic model is analyzed from different perspectives. Determined structural and dynamic properties are visualized in the context of the metabolic model. Interaction techniques allow exploration and visual analysis, thereby leading to a broader understanding of the behavior and functionality of the underlying biological system. The System Biology Metabolic Model Framework (SBM<sup>2</sup> – Framework) implements the developed method and, as an example, is applied to the integrative analysis of the crop plant potato.

**Keywords: metabolic modeling, integrative analysis, kinetic analysis, flux balance analysis, petri net analysis, topological analysis**

## **1. INTRODUCTION**

Metabolic models have been reconstructed for an increasing number of organisms to understand complex biochemical processes. At least 54 bacterial, 6 archaeal, and 16 eukaryotic reconstructions are available to date, while many others are under development (Xu et al., 2013). In addition, resources such as Path2Models (Büchel et al., 2013) provide draft models for a large number of organisms. Such metabolic models are composed of biochemical reactions and associated experimental parameters of the biological system under investigation. Different metabolic models can be reconstructed depending upon the completeness of knowledge about the detailed interaction mechanisms in a biological system. The metabolism is thereby roughly represented in large and mostly qualitative models and smaller, but more quantitative models (Steuer and Junker, 2008). Different model sizes and knowledge details allow the structural and dynamic properties to be analyzed using different modeling formalisms. For further details on modeling formalisms in Systems Biology the reader is referred to Machado et al. (2011). Different modeling formalisms entail different analysis techniques, facilitating the investigation of a metabolic model from different perspectives and thus revealing complementary insights.

A couple of review papers evaluated modeling formalisms (Wiechert, 2002; Steuer and Junker, 2008; Hübner et al., 2011; Koch et al., 2011; Machado et al., 2011; Pfau et al., 2011; Dandekar et al., 2012) and revealed, among others, kinetic, Petri net, stoichiometric, and topological modeling methods as well-established. The strengths and weaknesses of each formalism are summarized in **Figure 1**.

Kinetic modeling using ordinary differential equations (ODEs) includes detailed quantitative descriptions of the biochemical processes and therefore requires kinetic rate equations and parameters, which are often difficult to obtain. Due to this, kinetic modeling is generally limited to smaller models, but leads to quantitative predictions and reveals dynamic behavior of the underlying biological system (Resat et al., 2009). Petri net modeling is powerful due to several Petri net extensions for qualitative and quantitative analysis. The stochastic effects involved in quantitative predictions and system dynamics can be accounted for by using, for example, the stochastic Petri net (SPN) simulation. However, these extensions complicate the qualitative analysis (Baldan et al., 2010). Stoichiometric modeling using optimization-based analysis such as flux balance analysis (FBA) (Orth et al., 2010) allows for quantitative predictions due to the steady-state assumption. A static description of the biochemical processes is therefore sufficient when including stoichiometric, thermodynamic, and enzyme capacity constraints. Thus, stoichiometric modeling is applicable for large models, but is limited in revealing the dynamic behavior of the underlying biological system (Lewis et al., 2012). Topological modeling considers only the topological information of models (not limited in model size) and can identify structures and robustness against disturbances. Using, for example, centrality analysis (Koschützki and Schreiber, 2008), different importance concepts provide insights into key elements based on metabolite or reaction graphs (Steuer and Junker, 2008).


Some of the introduced metabolic modeling formalisms have already been combined in different approaches to analyze metabolic models at the system level and to overcome problems due to the lack of experimental data. The described methods either extend qualitative models with obtained analysis results to enable a follow-up quantitative analysis, or reduce models so that less data has to be assigned for quantitative analysis. In most cases, such as Birch et al. (2014) and Chowdhury et al. (2014), the stoichiometric formalism FBA is used to obtain flux distributions, which are utilized to derive ODEs for kinetic analysis (Resat et al., 2009). Methods using the Petri net formalism for model reduction to integrate less data for kinetic analysis are described by Chen et al. (2011), Gilbert and Heiner (2006), and Koch and Heiner (2008). An advanced method is presented by Machado et al. (2010), whereby the Petri net formalism is applied to integrate both of the aforementioned methods for model reduction and a follow-up kinetic analysis. Grafahrend-Belau et al. (2013) combined overview kinetic models (household models) with FBA toward a quasi-dynamic FBA. Heiner et al. (2012) and Nagasaki et al. (2010) propose a unifying Petri net framework comprised of a family of related Petri net types. In this approach, qualitative, stochastic, and continuous Petri net analyses are conducted by converting different Petri net types into each other.

Here, we introduce an integrated approach, which complements the presented approaches through a formalization leading to a standardized, transformable, and extensible abstraction of metabolism. This method allows the investigated metabolic models to be integrated, utilizing different well-established modeling formalisms and at the same time maintaining a standardized visualization. Moreover, the integration of analysis results with corresponding elements of the metabolic model leads to a combination of model structure and model dynamics. Several interaction techniques support the exploration and interpretation of the gained analysis results to provide a comprehensive understanding of the underlying biological system.

## **2. MATERIALS AND METHODS**

In general, metabolic models are networks consisting of different elements such as metabolites and reactions with relations between these elements and additional attributes. Thus, a suitable data structure for metabolic models is a graph. Dependent upon the modeling formalism, graphs with different structure and attributes are able to represent kinetic, Petri net, stoichiometric, or topological models. Each of these graphs contains nodes (metabolites and/or reactions), which are related to each other through edges.

Following the concept of generalization, different *specific graphs* representing qualitative and quantitative metabolic models (**Figure 2C**) are generalized into a *unified graph* (**Figure 2A**). This concept allows a standard graphical representation to be maintained (**Figure 2B**) and, additionally, the *unified graph* to be transformed into *specific graphs* to apply different modeling formalisms. Some formalisms utilize a reduced structure and attribute set of the *unified graph* to perform analyses (this will be described in detail in the Transformation section). Using our method, the analysis results from different formalisms are visualized in the context of the metabolic model through data assignment functions (**Figure 2D**). Thus, the underlying biological system is characterized from different perspectives providing complementary insights. Using interaction techniques, the subsequent visual analysis is conducted. Furthermore, analysis results can be integrated in other formalisms to constrain this analysis and thereby make them either feasible or more precise.

The following sections introduce the concept depicted in **Figure 2** in detail.

### **2.1. FORMALIZATION**

With the aim to formally represent qualitative and quantitative metabolic models a directed, attributed, bipartite graph (called the *unified graph*) is defined as follows.

**Definition 2.1** (unified graph). The *unified graph* $G_{\text{Unified}} = (M, R, E, A)$ is a directed, attributed, bipartite graph consisting of two finite, non-empty sets, $M$ of metabolites and $R$ of reactions, whereby both sets are disjoint ($M \cap R = \emptyset$). Other finite sets are the directed edges $E \subseteq (M \times R) \cup (R \times M)$ and the attributes $A = \{\mathit{type}, \mathit{stoichiometry}, \mathit{localization}, \mathit{label}, \mathit{concentration}, \mathit{capacity}, \mathit{rate}, \mathit{boundaries}, \mathit{objective\ function}\}$, which are assigned to nodes and edges using the following functions:


Furthermore, the following requirements must be fulfilled:

For all reactions *r* ∈ *R* the following applies: (1) there exists at least one incoming and one outgoing edge (whereby the incoming edge is not of type *i*) and (2) if one incoming or outgoing edge is reversible (irreversible), then all incoming and outgoing edges are reversible (irreversible). With this rule, a reaction is connected either to reversible edges or to irreversible edges, but not to a combination of them.

Between a metabolite $m \in M$ and a reaction $r \in R$ there are at most two edges $e, e' \in E$ of different types. If two edges $e$ and $e'$ connect $m$ with $r$, the type of $e$ is $ci$ and the type of $e'$ is $i$. This case describes a substrate inhibition at high substrate concentrations, whereby a metabolite is substrate and inhibitor at the same time.

**Figure 3 | Basic elements of the *unified graph* and the corresponding SBGN-PD visualization (right)**: **(A)** irreversible reactions, **(B)** inhibition of irreversible reactions, **(C)** localization (compartment) of … reactions, **(F)** export reactions (top irreversible and bottom reversible), and **(G)** import reactions (top irreversible, bottom reversible).

If one edge $e$ connects $r$ with $m$ and another edge $e'$ connects $m$ with $r$, the type of $e$ is $pi$ and the type of $e'$ is $i$. In this case, a product inhibition is modeled with a metabolite as product and, at the same time, inhibitor of a reaction.

An explicit formulation of both cases for reversible reactions is not needed because the reaction mechanisms already provide implicit substrate- and product inhibition.
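The two structural requirements on reactions can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding of edges as `(source, target, type)` tuples with the edge types `ci`, `pi`, `cr`, `pr`, and `i` used in this article. The function name `check_reaction` and the toy node names are illustrative only and are not part of the described framework.

```python
# Edge types as used in the text: ci/pi = irreversible consumption/production,
# cr/pr = reversible consumption/production, i = inhibition.
IRREVERSIBLE = {"ci", "pi"}
REVERSIBLE = {"cr", "pr"}


def check_reaction(reaction, edges):
    """Return a list of violated requirements for one reaction node.

    `edges` is a list of (source, target, edge_type) tuples
    (a hypothetical encoding chosen for this sketch).
    """
    problems = []
    incoming = [e for e in edges if e[1] == reaction]
    outgoing = [e for e in edges if e[0] == reaction]

    # Requirement (1): one non-inhibitory incoming and one outgoing edge.
    if not any(edge_type != "i" for _, _, edge_type in incoming):
        problems.append("no non-inhibitory incoming edge")
    if not outgoing:
        problems.append("no outgoing edge")

    # Requirement (2): reversible and irreversible edges must not be mixed.
    kinds = {t for _, _, t in incoming + outgoing if t != "i"}
    if kinds & IRREVERSIBLE and kinds & REVERSIBLE:
        problems.append("mixes reversible and irreversible edges")
    return problems
```

For example, a reaction `r1` with edges `("A", "r1", "ci")`, `("r1", "B", "pi")`, and the product-inhibition edge `("B", "r1", "i")` passes both checks, whereas a reaction that combines a `cr` edge with a `pi` edge is reported as mixing edge kinds.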

Moreover, the following sets are defined to simplify the transformation of the *unified graph* into *specific graphs* for analysis. The edge set $E$ is composed of three subsets, $E = E_i \cup E_{ir} \cup E_r$. The subset of inhibitory edges is $E_i = \{e \in E \mid \mathit{type}(e) = i\}$, the subset of irreversible edges is $E_{ir} = \{e \in E \mid \mathit{type}(e) = ci \lor \mathit{type}(e) = pi\}$, and the subset of reversible edges is $E_r = \{e \in E \mid \mathit{type}(e) = cr \lor \mathit{type}(e) = pr\}$. The set of metabolites $M$ contains a subset $M_{cp}$ of metabolites that are either consumed or produced in reactions: $M_{cp} = \{m \in M \mid \exists r \in R: (m, r) \in E_r \lor (m, r) \in E_{ir}\} \cup \{m' \in M \mid \exists r \in R: (r, m') \in E_r \lor (r, m') \in E_{ir}\}$.

To assign analysis results to nodes and edges of the *unified graph*, data assignment functions that integrate calculated structural and dynamic data are used (this will be described in detail in the Integration section).

Because the *unified graph* is defined with a rich attribute set, qualitative and quantitative metabolic models can be represented and, additionally, visualized using standards. **Figure 3** illustrates the basic elements of the *unified graph* and the corresponding visualization in *SBGN-PD*.

### **2.2. VISUALIZATION**

In order to derive a standardized graphical representation of the *unified graph* the Systems Biology Graphical Notation (Le Novère et al., 2009) (*SBGN*) is utilized. *SBGN* has been developed to interpret biological models easily without the need for extensive descriptions using three sub-languages. *SBGN-PD* (Moodie et al., 2011) is the *Process Description* sub-language visualizing the temporal dependencies of biological interactions in detail and is thus suited for the metabolic models encoded in the *unified graph*.

The translation of the *unified graph* into an *SBGN-PD* visualization is based on the following schema. All elements of the metabolite set *m* ∈ *M* (reaction set *r* ∈ *R*) are visualized using *simple chemicals* ∈ *entity pool nodes* (*process* ∈ *process nodes*). All elements of the edge set *e* ∈ *E* are visualized using arcs of the set *connecting arcs* based on the assigned type. Edges of type *ci* are visualized using *consumption arc*, *pi* using *production arc*, *cr* using *production arc* in the opposite direction, *pr* using *production arc*, and *i* using *inhibition arc*.

The edge attribute *stoichiometry* is visualized using *cardinality* and the metabolite attribute *localization* is visualized using *compartment*, which is a container for metabolites defined for this location. The localization of reactions is independent of a *compartment*, hence, a reaction could be located within, outside or on top of the border of a *compartment*. Import or export reactions in *SBGN-PD* are defined using the additional symbol *source and sink* ∈ *entity pool nodes*, see **Figure 3**.
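The edge-type schema described above amounts to a small lookup table. The sketch below is illustrative only: the string labels for the arc classes and the function name `sbgn_arc` are assumptions made for this example and do not reproduce the framework's implementation.

```python
# Mapping from unified-graph edge types to SBGN-PD arc classes, following
# the schema in the text. The string labels are chosen for this sketch.
EDGE_TYPE_TO_ARC = {
    "ci": "consumption",
    "pi": "production",
    "cr": "production",  # drawn in the opposite direction
    "pr": "production",
    "i": "inhibition",
}


def sbgn_arc(source, target, edge_type):
    """Return (arc_class, arc_source, arc_target) for one unified-graph edge."""
    arc_class = EDGE_TYPE_TO_ARC[edge_type]
    if edge_type == "cr":
        # A reversible consumption edge becomes a production arc that
        # points from the process back to the metabolite.
        return arc_class, target, source
    return arc_class, source, target
```

With this mapping, a reversible reaction ends up with production arcs on both sides, which is how the text describes the drawing of *cr* and *pr* edges.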

Furthermore, interaction techniques allow the exploration and subsequent visual analysis leading to a broader understanding of the behavior and functionality of the underlying biological system (which will be described in detail in the Results and Discussion section).

### **2.3. TRANSFORMATION**

Overall, five transformations from the *unified graph* ($G_{\text{Unified}}$) into the *specific graphs* ($G_{\text{Kinetic}}$, $G_{\text{Petri net}}$, $G_{\text{Stoichiometric}}$, $G_{\text{Metabolite}}$, $G_{\text{Reaction}}$) have to be performed as a prerequisite to analyze a metabolic model using different modeling formalisms. The different models, modeling formalisms, and the transformation from $G_{\text{Unified}}$ into $G_{\text{Stoichiometric}}$ are described in the following. The transformations from $G_{\text{Unified}}$ into $G_{\text{Kinetic}}$, $G_{\text{Petri net}}$, and both of the topological graphs $G_{\text{Metabolite}}$ and $G_{\text{Reaction}}$ are defined in the Supplementary Material.

#### **2.3.1. Kinetic model**

A kinetic metabolic model (ODE model) consists of a structural description of relations between metabolites and reactions and is extended with detailed kinetic data including rate equations, metabolite concentrations, and additional kinetic parameters. The kinetic model is represented by the *kinetic graph* ($G_{\text{Kinetic}}$), which is transformed from the *unified graph* ($G_{\text{Unified}}$); see **Figure 2C** and, for details, Definition 1.1 in Supplementary Material. This transformation results in no structural differences, but in a reduced attribute set.

To analyze the kinetic metabolic model, its *kinetic graph* is converted into ODEs, which are numerically solved (Resat et al., 2009). Changes in metabolite concentrations and reaction rates over a period of time are obtained as the results of the analysis.
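As an illustration of this step, the following sketch integrates a deliberately small toy pathway (constant import, then S → P, then export of P) with a classical fourth-order Runge-Kutta scheme. The pathway, the mass-action rate laws, and the parameter names `v_in`, `k1`, and `k2` are assumptions made for this example. They are not taken from the potato model, and the article does not state which solver is used.

```python
def rates(conc, params):
    """Right-hand side of the ODE system for the toy pathway -> S -> P ->."""
    s, p = conc
    v_import = params["v_in"]      # constant import flux
    v_convert = params["k1"] * s   # mass-action conversion S -> P
    v_export = params["k2"] * p    # mass-action export of P
    return (v_import - v_convert, v_convert - v_export)


def rk4_step(conc, params, dt):
    """Advance the concentrations by one classical Runge-Kutta step."""
    def shifted(slope, factor):
        return tuple(c + factor * s for c, s in zip(conc, slope))

    s1 = rates(conc, params)
    s2 = rates(shifted(s1, dt / 2), params)
    s3 = rates(shifted(s2, dt / 2), params)
    s4 = rates(shifted(s3, dt), params)
    return tuple(
        c + dt / 6 * (a + 2 * b + 2 * d + e)
        for c, a, b, d, e in zip(conc, s1, s2, s3, s4)
    )


def simulate(conc, params, dt=0.01, steps=5000):
    """Return the time course as a list of concentration tuples."""
    course = [conc]
    for _ in range(steps):
        conc = rk4_step(conc, params, dt)
        course.append(conc)
    return course
```

For this toy system the time course converges toward the analytical steady state S = `v_in`/`k1` and P = `v_in`/`k2`, which mirrors the kind of time-course result described in the text.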

#### **2.3.2. Petri net model**

A Petri net metabolic model can be defined using different Petri net types. Here, we refer to extended qualitative place/transition Petri nets (*eP/T nets*) and extended quantitative stochastic Petri nets (*eSPNs*). The extension includes continuous tokens (to model metabolite concentrations), continuous arc weights (to model non-integer stoichiometry), continuous place capacities (to model limited resources), and inhibitor arcs (to model inhibition). An inhibition is modeled using an inhibitor arc from a place to a transition, meaning that the transition can only fire if no token is on that place.
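The firing rule with inhibitor arcs can be sketched in a few lines. This is a simplified illustration, assuming markings and arc weights are stored in plain dictionaries. Place capacities are left out for brevity, and the function names are hypothetical.

```python
def is_enabled(marking, pre, inhibitors):
    """Check whether a transition may fire under the given marking.

    marking    -- dict place -> token amount (continuous tokens allowed)
    pre        -- dict place -> arc weight consumed by the transition
    inhibitors -- set of places connected to it by inhibitor arcs
    """
    enough_tokens = all(marking.get(p, 0.0) >= w for p, w in pre.items())
    not_inhibited = all(marking.get(p, 0.0) == 0 for p in inhibitors)
    return enough_tokens and not_inhibited


def fire(marking, pre, post):
    """Return the marking after firing: consume `pre`, produce `post`."""
    new_marking = dict(marking)
    for place, weight in pre.items():
        new_marking[place] = new_marking.get(place, 0.0) - weight
    for place, weight in post.items():
        new_marking[place] = new_marking.get(place, 0.0) + weight
    return new_marking
```

A transition that consumes one token from `S` and is inhibited by `I` is enabled for the marking `{"S": 2.0, "I": 0.0}` but not for `{"S": 2.0, "I": 1.0}`.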

Both Petri net types share the same structure, but *eSPNs* are specialized by weights for the exponentially distributed random variable (firing time) assigned to transitions. For further details on Petri nets for modeling metabolic models the reader is referred to Baldan et al. (2010). The Petri net model is represented by the *Petri net graph* ($G_{\text{Petri net}}$), which is transformed from the *unified graph* ($G_{\text{Unified}}$); see **Figure 2C** and, for details, Definition 1.2 and Figure S1 in Supplementary Material. This transformation results in structural differences (reversible reactions are represented using a pair of irreversible reactions for both directions) and a reduced attribute set.

A Petri net metabolic model can be analyzed qualitatively or quantitatively. For the qualitative analysis, the *Petri net graph* is converted into a linear equation system, which can be solved to derive invariants describing main pathways (T-invariants) or metabolite conservation (P-invariants) of a metabolic model [more details in Murata (1989), Baldan et al. (2010), and Reisig (2013)]. Furthermore, all possible states are calculated using the reachability analysis; if the reachability graph cannot be constructed (infinite state space), the coverability graph is calculated instead. The main purpose of the quantitative analysis (simulation) of a Petri net metabolic model is to include stochastic effects. The reactions can additionally be weighted with reaction rates to conduct a more constrained stochastic simulation revealing changes in metabolite concentrations over a number of simulation steps.
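The invariant conditions can be made concrete with the incidence matrix *C* of the net (rows are places, columns are transitions). A non-zero, non-negative vector *x* is a T-invariant if *C x* = 0, and a non-zero vector *y* is a P-invariant if *yᵀ C* = 0. The sketch below only verifies candidate vectors and does not enumerate invariants. The function names and the toy net are assumptions for illustration.

```python
def is_t_invariant(incidence, firing_counts):
    """True if firing transition j `firing_counts[j]` times restores the marking."""
    if not any(firing_counts) or any(x < 0 for x in firing_counts):
        return False
    return all(
        sum(c * x for c, x in zip(row, firing_counts)) == 0
        for row in incidence
    )


def is_p_invariant(incidence, place_weights):
    """True if the weighted token sum is conserved by every transition."""
    if not any(place_weights):
        return False
    n_transitions = len(incidence[0])
    return all(
        sum(w * row[j] for w, row in zip(place_weights, incidence)) == 0
        for j in range(n_transitions)
    )
```

For a reversible reaction A ⇌ B represented as a forward and a backward transition, the incidence matrix is `[[-1, 1], [1, -1]]`. The firing-count vector `[1, 1]` is then a trivial T-invariant, and the place weights `[1, 1]` form a P-invariant expressing conservation of A + B.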

#### **2.3.3. Stoichiometric model**

Compared to both of the aforementioned models a stoichiometric model consists of stoichiometric reactions without quantities, such as metabolite concentrations, or reaction rates. Due to the steady-state assumption, the regulatory effects resulting from enzymes or inhibitors are neglected; see Orth et al. (2010) for more details.

**Definition 2.2 (stoichiometric graph)**. The *unified graph* $G_{\text{Unified}}$ is transformed into a directed, attributed, bipartite *stoichiometric graph* $G_{\text{Stoichiometric}} = (M_S, R_S, E_S, A_S)$ with a metabolite set $M_S = M_{cp}$, which is a subset of the set $M$ in $G_{\text{Unified}}$. Metabolites with only inhibitory interactions to reactions are not considered. The reaction set $R_S = R$ equals the reaction set $R$ in $G_{\text{Unified}}$, and the edge set $E_S = E_{ir} \cup E_r$ is a subset of the set $E$ in $G_{\text{Unified}}$. Edges of type $i$ are excluded. The attribute set $A_S \subseteq A$ is a subset of the set $A$ in $G_{\text{Unified}}$ with $A_S = \{\mathit{type}, \mathit{stoichiometry}, \mathit{localization}, \mathit{label}, \mathit{boundaries}, \mathit{objective\ function}\}$.

**Figure 2C** (and, in more detail, Figure S4 in Supplementary Material) depicts the transformation of inhibited reactions from $G_{\text{Unified}}$ into $G_{\text{Stoichiometric}}$, thereby detailing the difference between both graphs. This transformation results in structural differences (no inhibitions) and a reduced attribute set. Thereby, all regulatory information and quantitative data are lost.
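Definition 2.2 can be read as a filtering step on the unified graph. The following sketch assumes a hypothetical encoding in which edges are `(source, target, type)` tuples and attributes are given as a set of attribute names. The function name `to_stoichiometric` is illustrative and does not reproduce the implementation in the framework.

```python
STOICHIOMETRIC_ATTRIBUTES = {
    "type", "stoichiometry", "localization", "label",
    "boundaries", "objective function",
}


def to_stoichiometric(metabolites, reactions, edges, attributes):
    """Derive (M_S, R_S, E_S, A_S) from the parts of a unified graph.

    edges      -- set of (source, target, edge_type) tuples
    attributes -- set of attribute names present in the unified graph
    """
    # E_S: keep reversible and irreversible edges, drop inhibition edges.
    e_s = {e for e in edges if e[2] != "i"}
    # M_S = M_cp: metabolites still taking part in at least one edge.
    connected = {node for s, t, _ in e_s for node in (s, t)}
    m_s = set(metabolites) & connected
    # R_S = R: the reaction set is unchanged.
    r_s = set(reactions)
    # A_S: only the attributes needed for stoichiometric analysis remain.
    a_s = set(attributes) & STOICHIOMETRIC_ATTRIBUTES
    return m_s, r_s, e_s, a_s
```

A metabolite that acts only as an inhibitor disappears from the result together with its inhibition edge, while attributes such as *rate* and *concentration* are dropped from the attribute set.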

Using the *stoichiometric graph*, a metabolic model can be validated utilizing the *Dead-End* analysis or *Gap-Finding* analysis revealing blocked reactions or dead-end metabolites. To examine the flow of metabolites through a metabolic model the *stoichiometric graph* is converted into a system of mass balance equations at steady-state, which are solved by minimizing or maximizing an objective function. This optimization can be conducted using a linear optimization instead of a non-linear optimization to handle the problem of alternate optimal solutions. Applicable optimization-based methods are *FBA*, flux variability analysis (*FVA*), robustness analysis (*RA*), and knockout-analyses (*KA*) resulting in a flux distribution, minimal and maximal fluxes, sensitivity curves, and sensitivity values, respectively. For a detailed description of optimization-based methods the reader is referred to Lewis et al. (2012).
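For orientation, the optimization problems behind *FBA* and *FVA* are commonly written as follows. The symbols are assumptions introduced here, since the article does not fix a notation: $S$ is the stoichiometric matrix, $v$ the flux vector, $c$ the vector of objective coefficients, $v^{\mathrm{lb}}$ and $v^{\mathrm{ub}}$ the flux boundaries, and $Z^{*}$ the optimal objective value found by FBA.

```latex
\begin{aligned}
\text{FBA:}\quad & \max_{v}\; c^{\mathsf{T}} v
  \quad \text{s.t.}\quad S\,v = 0,\;\; v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}} \\
\text{FVA (reaction } j\text{):}\quad & \min_{v} \text{ and } \max_{v}\; v_j
  \quad \text{s.t.}\quad S\,v = 0,\;\; v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}},\;\; c^{\mathsf{T}} v = Z^{*}
\end{aligned}
```

The equality $S\,v = 0$ is the steady-state mass balance, and the *boundaries* and *objective function* attributes of the *stoichiometric graph* supply the bounds and $c$, respectively. Some FVA variants relax the last constraint to a fraction of $Z^{*}$.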

#### **2.3.4. Topological models**

Metabolic models are analyzed according to topological properties in order to understand the importance of key elements, structure, and robustness against disturbances. Since the *metabolite graph* (nodes represent metabolites, edges reactions) and *reaction graph* (nodes represent reactions, edges metabolites) are predominantly used for topological analysis (Steuer and Junker, 2008), the *unified graph* $G_{\text{Unified}}$ is transformed into both, see **Figure 2C** (for details see Definition 1.3 and Figure S2 in Supplementary Material for the *metabolite graph* and Definition 1.4 and Figure S3 in Supplementary Material for the *reaction graph*). This transformation results in structural differences (unipartite graphs) and a reduced attribute set. Thereby, all regulatory information and quantitative data are lost.

Topological analysis of the metabolic model based on its *metabolite graph* or *reaction graph* is conducted using the corresponding adjacency matrix. A shortest path analysis results in paths (subgraphs which could be the graph itself). Furthermore, centrality analysis with different centrality measures leads to a ranking of graph elements according to different importance concepts. For further details on different centrality measures the reader is referred to Koschützki and Schreiber (2008).
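One widely used centrality measure is shortest path betweenness, which can be computed with Brandes' algorithm. The sketch below is a generic implementation for unweighted graphs given as adjacency lists. It is not taken from the framework, and the normalization (ordered source-target pairs are counted, so an undirected graph counts each pair twice) is a choice made for this example.

```python
from collections import deque


def shortest_path_betweenness(adjacency):
    """Brandes' algorithm for unweighted graphs.

    adjacency -- dict node -> list of neighbour nodes; every node must be
                 a key, and an undirected graph lists each edge both ways.
    Returns a dict node -> betweenness score.
    """
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search counting shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    n_paths[neighbour] += n_paths[node]
                    predecessors[neighbour].append(node)
        # Accumulate pair dependencies in reverse BFS order.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = n_paths[pred] / n_paths[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    return score
```

On the path graph A, B, C the middle node B lies on the only shortest path between A and C and therefore receives the highest score, while both end nodes score zero.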

### **2.4. INTEGRATION**

To integrate structural and dynamic analysis results in the *unified graph*, which have been computed using *specific graphs*, data assignment functions are applied. To focus on several analysis methods, we chose typical examples from a number of analysis methods comprised in the different modeling formalisms. Using these analysis methods, two sets of data types are generated: vectors of numeric values and graph elements, which are assigned to different graph elements of the *unified graph*, see **Table 1**.

Numeric values of the vector ($nv \in NV$) are assigned to elements of the *unified graph* ($M$ metabolite, $R$ reaction, and $E$ edge) using the assignment function $z_n: M, R, E \to NV$, whereby the vector could comprise numeric values (e.g., sensitivity values), pairs of numeric values (e.g., min and max fluxes), and a set of time, step, and flux value dependent numeric values (e.g., metabolite concentrations over time, steps and sensitivity curves, respectively).

**Table 1 | Summary of typical examples of analysis methods and corresponding results produced with different modeling formalisms grouped in data types, which will be assigned to different graph elements [metabolite nodes (M), reaction nodes (R), and edges (E)] of the unified graph.**

<sup>a</sup>Analysis results from forward and backward reactions of the Petri net are integrated into the corresponding reversible reactions in the unified graph.

<sup>b</sup>Analysis results from edges of the metabolite graph or reaction graph correspond to several edges and nodes in the unified graph.

Another type of analysis result data are the elements of graphs, which are assigned to the *unified graph* using the assignment function $z_g: M, R, E \to M_x, R_x, E_x$, whereby $x$ can be replaced with $P$ (Petri net), $S$ (stoichiometric), $K$ (kinetic), $M$ (metabolite), or $R$ (reaction) to define the *specific graphs*. As an example, *Gap-Finding* analysis results in a set of metabolites of the *stoichiometric graph*, which must be assigned to metabolites in the *unified graph* using $z_g: M \to M_S$.

These assignment functions provide the basis for the visualization of the analysis results in the context of the metabolic model. Furthermore, interaction techniques such as *brushing* & *linking* and *animation* support the exploration, for example, of different Petri net invariants in the context of the metabolic model [for more details concerning interaction techniques see Von Landesberger et al. (2011)]. An integrated visualization by means of an application using the developed method is presented in the Results and Discussion section.
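One practical detail of the assignment is that a reversible reaction of the *unified graph* corresponds to a forward and a backward transition in the Petri net, so results from both have to be mapped back to the same reaction. The sketch below assumes that this correspondence was recorded during transformation in a dictionary called `origin`. The bookkeeping structure, the function name, and the choice to keep both values side by side are assumptions for illustration, since the article does not specify how the values are combined.

```python
def assign_to_unified(petri_results, origin):
    """Group Petri net transition results by their unified-graph reaction.

    petri_results -- dict transition -> numeric value (e.g. firing count)
    origin        -- dict transition -> (reaction, direction), recorded when
                     the unified graph was transformed (hypothetical)
    """
    assigned = {}
    for transition, value in petri_results.items():
        reaction, direction = origin[transition]
        assigned.setdefault(reaction, {})[direction] = value
    return assigned
```

With `origin = {"r1_fwd": ("r1", "forward"), "r1_bwd": ("r1", "backward")}`, firing counts for both transitions are collected under the single reaction `r1`, where they can be displayed together in the context of the metabolic model.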

## **3. RESULTS AND DISCUSSION**

In conclusion, the developed method allows previously separated well-established modeling formalisms to be combined into one application using one workflow, supported by interaction techniques and integrated visualizations in the context of the metabolic model. The method mitigates the weaknesses (−) of independent modeling formalisms as explained in the Introduction section and leads to major (+) or minor (±) improvements of an integrated analysis as already depicted in **Figure 1**.

In detail, using the integrated approach it is not required to define detailed kinetics to derive quantitative predictions and reveal dynamic behavior of the underlying biological system. Instead, using some parameters the Petri net simulation or stoichiometric modeling method FBA could be performed to approximate kinetic simulations. Thus, larger models are applicable in the integrated approach leading to analysis results, which could be again integrated to analyze the model further. Additionally, qualitative analysis can be conducted for extended Petri nets using another integrated formalism such as Dead-End analysis or centrality analysis. Quantitative predictions can be revealed for a qualitative model with a static description using stoichiometric analysis.

Hence, different modeling formalisms complement each other even though overlaps between the introduced metabolic modeling formalisms exist. For example, the stoichiometric matrix used in the stoichiometric modeling formalism to derive mass balance equations corresponds to the incidence matrix of the Petri net formalism used to derive an equation system solved for, e.g., invariant analysis. In the case of structural analysis, both the stoichiometric and the Petri net formalism could be utilized to reveal, for example, *Dead-End* metabolites. Additionally, Petri net T-invariants correspond to flux modes, which could be directly calculated using the stoichiometric analysis method elementary flux modes (not presented here).
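The shared matrix and a simple *Dead-End* check can be sketched as follows. The edge encoding `(source, target, coefficient)`, the function names, and the toy network are assumptions for illustration. The check uses the simple criterion that a metabolite is a dead end if it is only produced or only consumed, and it treats any metabolite touched by a reversible reaction as connected in both directions.

```python
def stoichiometric_matrix(metabolites, reactions, edges):
    """Build the matrix (rows: metabolites, columns: reactions).

    edges -- iterable of (source, target, coefficient); metabolite -> reaction
             edges consume, reaction -> metabolite edges produce.
    """
    row = {m: i for i, m in enumerate(metabolites)}
    col = {r: j for j, r in enumerate(reactions)}
    matrix = [[0.0] * len(reactions) for _ in metabolites]
    for source, target, coefficient in edges:
        if source in row:   # consumption
            matrix[row[source]][col[target]] -= coefficient
        else:               # production
            matrix[row[target]][col[source]] += coefficient
    return matrix


def dead_end_metabolites(metabolites, matrix, reversible_columns=()):
    """Metabolites that are only produced or only consumed."""
    dead = []
    for name, coefficients in zip(metabolites, matrix):
        uses_reversible = any(coefficients[j] != 0 for j in reversible_columns)
        produced = any(c > 0 for c in coefficients)
        consumed = any(c < 0 for c in coefficients)
        if not uses_reversible and not (produced and consumed):
            dead.append(name)
    return dead
```

In a linear toy chain in which A is imported and converted via B into C, with no reaction consuming C, the check reports C as the only dead-end metabolite. The same matrix, read as a Petri net incidence matrix, is the input for the invariant analysis.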

The described method is implemented as an Add-on for the *VANTED* system (Rohn et al., 2012), called the System Biology Metabolic Model Framework (*SBM*<sup>2</sup> – Framework). It utilizes and extends *VANTED*'s functionality for the interpretation of experimental data and for analyzing metabolic models with different modeling formalisms.

In order to characterize the metabolic functionality and behavior of the crop plant potato (*Solanum tuberosum*), an integrative analysis is performed using the described method. Due to its main component, starch in the potato tuber, potato is of great importance as food and in industry, for example, for the production of fuel. Therefore, a major aim of plant breeding is to improve the distribution of biomass within the plant in favor of harvestable plant parts. Based on the homogeneous tissue of the potato tuber, the main flux of metabolites is from sucrose to starch (Geigenberger et al., 2004). Sucrose degradation is therefore a suitable subject of investigation. Almost all genes of this pathway are already known and thus provide the basis for the reconstruction of a metabolic model of the potato tuber.

Using a kinetic model representing the sucrose breakdown in the developing potato tuber (Junker, 2004), the integrative analysis is performed and analysis results are shown in **Figure 4A**. The model comprises 15 reactions and 17 metabolites located in the cytosol. Sucrose (*Suc*) is converted into hexose phosphates (e.g., glucose-6-phosphate, *G6P*) utilized in glycolysis (*Glyc*) and as precursors for starch synthase (*StaSy*). The pathways *Glyc*, starch biosynthesis, and energy consumption (*ATPcons*) are modeled as summarized reactions. This is a necessary simplification to avoid unknown transport processes into additional compartments. To describe the environment, the model is extended through sucrose import (*Imp*) and starch export reactions (*Exp*).

The kinetic analysis results in time-course diagrams converging toward a steady-state producing starch, which can be increased by an overexpression of the enzyme invertase (*Inv*) as described in Junker (2004). The consequence of the overexpression can be compared and visually analyzed to investigate both situations side by side in the model, see **Figures 4A,B**.

To perform a stochastic simulation the steady-state reaction rates generated by the kinetic analysis are used to weight the reactions of the eSPN. The stochastic simulation results in increasing and decreasing metabolite concentrations, which oscillate with different amplitudes (data not shown). The results indicate the production of starch and the utilization of reactions with different probabilities.
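The idea of weighting transitions with rates can be illustrated with a small race-policy simulation: every enabled transition draws an exponentially distributed delay from its rate, and the transition with the shortest delay fires. This sketch is a generic simplification and not the simulator used in the framework. It uses fixed rates, discrete tokens, and hypothetical names and values, and it omits inhibitor arcs and capacities.

```python
import random


def simulate_spn(marking, transitions, steps, seed=0):
    """Race-policy simulation of a small stochastic Petri net.

    marking     -- dict place -> token count
    transitions -- dict name -> (pre, post, rate); `pre` and `post` map
                   places to arc weights, `rate` parameterizes the
                   exponential firing delay (hypothetical values)
    """
    rng = random.Random(seed)
    marking = dict(marking)
    trace = [dict(marking)]
    for _ in range(steps):
        enabled = [
            name for name, (pre, _, _) in transitions.items()
            if all(marking.get(p, 0) >= w for p, w in pre.items())
        ]
        if not enabled:
            break  # dead marking: no transition can fire
        # Each enabled transition draws a delay; the fastest one fires.
        winner = min(enabled, key=lambda n: rng.expovariate(transitions[n][2]))
        pre, post, _ = transitions[winner]
        for place, weight in pre.items():
            marking[place] = marking.get(place, 0) - weight
        for place, weight in post.items():
            marking[place] = marking.get(place, 0) + weight
        trace.append(dict(marking))
    return trace
```

For a reversible pair of transitions between two places A and B, the token counts fluctuate from step to step while the sum A + B stays constant, which corresponds to a P-invariant of the net.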

Additionally, the invariant analysis reveals, besides 3 P-invariants (reflecting substance conservation), 19 T-invariants, which can be grouped into trivial and non-trivial T-invariants. Each of the seven trivial T-invariants corresponds to a reversible reaction. The non-trivial T-invariants can be divided into a group of nine representing the cleavage of sucrose by invertase and another group of three where the sucrose is cleaved by sucrose synthase. These T-invariants reflect the main processes, that is, the pathways that take place in the metabolic model in reality (Koch et al., 2005). One of the T-invariants is illustrated in **Figure 4A** by adding numbers (firing counter) to the corresponding reactions. Sucrose is initially cleaved by invertase, leading to the production of hexose phosphates, which are metabolized in *Glyc* and starch biosynthesis.

The stoichiometric analysis (irrespective of regulatory processes), using only three steady-state reaction rates (*Inv* = 0.16 µM/FW/s, *SuSy* = 4.89 µM/FW/s, *ATPcons* = 100 µM/FW/s) to constrain the fluxes for these reactions, results in a flux distribution that is comparable to the kinetic analysis results. In **Figure 4A**, the edge thickness corresponds to flux values. The flux through the starch biosynthesis reaction, 6.42 µM/FW/s, is equal to that of the kinetic analysis. Additionally, the reaction *AdK* is not utilized, as can be seen in the results of the kinetic and Petri net analysis.

Using the *metabolite graph* (see **Figure 4C**), the structure of the potato model is investigated. To identify and rank important metabolites that occur on the shortest paths between two nodes, the *shortest path betweenness (SPB)* centrality analysis is conducted. As a result, the table in **Figure 4D** lists *Suc* and *G6P*, which are selected to be highlighted in **Figures 4A,C**. Both metabolites are very important in the model, indicating that without these metabolites the reactions of starch biosynthesis and *Glyc* could not be processed.
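As a hedged illustration of shortest path betweenness, the sketch below counts, for every pair of nodes, the fraction of shortest paths passing through each intermediate node. It assumes an undirected, unweighted toy graph with hypothetical node names and returns unnormalized scores; it is not the implementation used in the *VANTED* extension.

```python
from collections import deque


def bfs_counts(graph, source):
    """Distances and numbers of shortest paths from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:            # first visit: w is one step beyond u
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def shortest_path_betweenness(graph):
    """Unnormalized SPB for an undirected graph given as {node: set_of_neighbors}."""
    info = {s: bfs_counts(graph, s) for s in graph}
    nodes = sorted(graph)
    spb = {v: 0.0 for v in nodes}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue                 # s and t are not connected
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    spb[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return spb


# Hypothetical toy graph: a diamond A-B-D / A-C-D with a tail D-E.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
scores = shortest_path_betweenness(toy_graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores)
```

In this toy graph, node `D` ranks first because every shortest path to `E` passes through it.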

In summary, using the integrative analysis allows different modeling formalisms to be investigated in one workflow. An integrated and interactive visualization of the analysis results leads to an advantage over the use of each modeling formalism independently. This helps to compare analysis results from different formalisms within one metabolic model and allows for the investigation of analysis results from one formalism in another, as mentioned in the use case.

#### **4. CONCLUSION**

We described a method that brings together different metabolic modeling formalisms. The integration is realized by a *unified graph*, which enables graph transformations and a visualization in a standardized and formalized way. The *unified graph* supports user interaction and thereby allows different analysis results to be explored in the context of the metabolic model. Applying the integrative analysis to a model of the crop plant potato reveals structural and dynamic properties of that model. The method has been implemented as an extension of the *VANTED* system and could also be applied to other model types, but we have focused here on metabolic models as an application area.

Combining different modeling formalisms opens many possibilities for future research. Additional analysis algorithms can be added to study metabolic models in more detail. We plan to extend the method to different types of models, such as gene regulatory models, to investigate further cellular processes. This extension requires adapting the *unified graph* and adding appropriate modeling formalisms and corresponding transformations. Furthermore, the visualization has to be adapted to represent different types of models in *SBGN* using, for example, the sub-language *SBGN-AF* for gene regulatory models.

#### **AUTHOR CONTRIBUTIONS**

Anja Hartmann developed the theoretical framework, the use case, and implemented the *SBM*<sup>2</sup> – Framework software. Falk Schreiber supervised the project and gave conceptual advice. Both authors wrote the manuscript.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at http://www.frontiersin.org/Journal/10.3389/fbioe.2014.00091/abstract

#### **REFERENCES**

Baldan, P., Cocco, N., Marin, A., and Simeoni, M. (2010). Petri nets for modelling metabolic pathways: a survey. *Nat. Comput.* 9, 955–989. doi:10.1007/s11047-010-9180-6


Xu, C., Liu, L., Zhang, Z., Jin, D., Qiu, J., and Chen, M. (2013). Genome-scale metabolic model in guiding metabolic engineering of microbial improvement. *Appl. Microbiol. Biotechnol.* 97, 519–539. doi:10.1007/s00253-012-4543-9

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 12 September 2014; accepted: 30 December 2014; published online: 26 January 2015.*

*Citation: Hartmann A and Schreiber F (2015) Integrative analysis of metabolic models – from structure to dynamics. Front. Bioeng. Biotechnol. 2:91. doi: 10.3389/fbioe.2014.00091*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology.*

*Copyright © 2015 Hartmann and Schreiber. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## **Modeling the contribution of allosteric regulation for flux control in the central carbon metabolism of** *E. coli*

#### *Daniel Machado<sup>1</sup> \*, Markus J. Herrgård<sup>2</sup> and Isabel Rocha<sup>1</sup>*

*<sup>1</sup> Centre of Biological Engineering, University of Minho, Braga, Portugal, <sup>2</sup> The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Hørsholm, Denmark*

#### *Edited by:*

*Firas H. Kobeissy, University of Florida, USA*

#### *Reviewed by:*

*Osbaldo Resendis-Antonio, Instituto Nacional de Medicina Genomica (INMEGEN), Mexico*

*Jeffrey Varner, Purdue University, USA*

#### *\*Correspondence:*

*Daniel Machado, Centre of Biological Engineering, University of Minho, Braga 4710-057, Portugal; dmachado@deb.uminho.pt*

#### *Specialty section:*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology*

> *Received: 16 July 2015 Accepted: 22 September 2015 Published: 08 October 2015*

#### *Citation:*

*Machado D, Herrgård MJ and Rocha I (2015) Modeling the contribution of allosteric regulation for flux control in the central carbon metabolism of E. coli. Front. Bioeng. Biotechnol. 3:154. doi: 10.3389/fbioe.2015.00154*

Modeling cellular metabolism is fundamental for many biotechnological applications, including drug discovery and rational cell factory design. Central carbon metabolism (CCM) is particularly important as it provides the energy and precursors for other biological processes. However, the complex regulation of CCM pathways has still not been fully unraveled and recent studies have shown that CCM is mostly regulated at post-transcriptional levels. In order to better understand the role of allosteric regulation in controlling the metabolic phenotype, we expand the reconstruction of CCM in *Escherichia coli* with allosteric interactions obtained from relevant databases. This model is used to integrate multi-*omics* datasets and analyze the coordinated changes in enzyme, metabolite, and flux levels between multiple experimental conditions. We observe cases where allosteric interactions have a major contribution to the metabolic flux changes. Inspired by these results, we develop a constraint-based method (arFBA) for simulation of metabolic flux distributions that accounts for allosteric interactions. This method can be used for systematic prediction of potential allosteric regulation under the given experimental conditions based on experimental data. We show that arFBA allows predicting coordinated flux changes that would not be predicted without considering allosteric regulation. The results reveal the importance of key regulatory metabolites, such as *fructose-1,6-bisphosphate*, in controlling the metabolic flux. Accounting for allosteric interactions in metabolic reconstructions reveals a hidden topology in metabolic networks, improving our understanding of cellular metabolism and fostering the development of novel simulation methods that account for this type of regulation.

**Keywords: metabolism, systems biology, constraint-based modeling, allosteric regulation,** *Escherichia coli*

## **1. Introduction**

Mathematical models of metabolism have become a fundamental tool for understanding cellular behavior and for designing genetic or environmental modifications to change that behavior toward a specific purpose (Heinemann and Sauer, 2010). Metabolic models have found applications in both biomedical research and industrial biotechnology. Examples of applications in biomedicine include using metabolic models of human cells to analyze the altered behavior of cancer cells and to suggest potential drug targets (Folger et al., 2011). In the context of industrial biotechnology, models of microbial metabolism are widely used for rational design of microbial cell factories (Zomorrodi et al., 2012).

There are two major approaches for modeling cellular metabolism, namely, kinetic modeling and constraint-based modeling (Machado et al., 2012). The former, based on kinetic rate laws, requires extensive experimental data for determination of the enzymatic mechanisms and respective kinetic parameters. For that reason, these models have been limited to central pathways of well-studied organisms, such as *Escherichia coli* and *Saccharomyces cerevisiae* (Teusink et al., 2000; Chassagnole et al., 2002). Constraint-based modeling, on the other hand, only accounts for the stoichiometry and directionality of biochemical reactions, which can be obtained from genome annotations and limited other information for the organism (Bordbar et al., 2014). With the increasing number of fully sequenced genomes for multiple organisms, the number of genome-scale metabolic reconstructions suitable for constraint-based modeling is also rapidly increasing, with over a hundred reconstructions currently available (Monk et al., 2014).

Constraint-based models can be used to estimate the steady-state flux distribution of a metabolic network, using the so-called Flux Balance Analysis (FBA) approach (Orth et al., 2010). Since the flux solution is not unique with only stoichiometric and directionality constraints, in FBA a single solution is selected based on the assumption of an evolutionary principle of optimality, such as maximization of cellular growth. Methods have been developed to refine metabolic flux predictions by integration of metabolic models with models of other biological processes, such as signaling and transcriptional regulatory networks (Gonçalves et al., 2013). However, some limitations of these methods, such as the reduction of gene expression levels to Boolean states, hamper the predictive ability of the integrated models. More recently, several approaches were developed to directly integrate gene expression data into metabolic models. These methods are based on the assumption that reaction fluxes should be proportional to their respective gene expression levels. However, a recent systematic evaluation of a wide range of these methods showed little improvement in simulation accuracy when gene or protein expression data are used for flux prediction (Machado and Herrgård, 2014). One of the conclusions from this study is that the assumption of proportionality between gene expression levels and reaction rates is not valid for many reactions.
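For reference, the FBA problem mentioned above is commonly written as a linear program. The formulation below is the standard textbook form rather than a quotation from this article; the objective vector *c* (for example, selecting the biomass reaction) and the flux bounds *v*<sup>lb</sup> and *v*<sup>ub</sup> are symbols introduced here for illustration, while *S* is the stoichiometric matrix and *v* the flux distribution.

$$\max_{v} \; c^{T} v \quad \text{subject to} \quad S v = 0, \quad v^{lb} \le v \le v^{ub}.$$

The steady-state constraint *Sv* = 0 and the bounds encode stoichiometry and directionality, and the objective selects one flux distribution among the many that satisfy them.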

The conclusion that transcriptional or translational regulation does not significantly control metabolic fluxes is consistent with recent experimental observations in multiple organisms showing that central carbon metabolism is mostly regulated at post-transcriptional levels (Daran-Lapujade et al., 2007; Chubukov et al., 2013; Kochanowski et al., 2013a). Regulation analysis is a method introduced by ter Kuile and Westerhoff (2001) for quantitatively decomposing flux regulation into *hierarchical* and *metabolic* coefficients. The former accounts for transcriptional and translational regulation as well as post-translational modifications, whereas the latter accounts for allosteric regulation and thermodynamics. The application of this method to three parasitic protists showed that regulation of glycolytic fluxes is never completely hierarchical, being mostly metabolic in many cases. Similar conclusions were obtained by applying this method to *S. cerevisiae*, where it was observed that metabolic regulation contributed to 50–80% of the flux change in glycolytic enzymes for the given cultivation conditions (Daran-Lapujade et al., 2007).

The partial contribution of transcriptional regulation for flux control in central carbon metabolism can be explained by the cellular trade-off between lowering the investment of protein synthesis (keeping enzymes saturated), and the need to achieve fast regulatory responses and maintain metabolic homeostasis under environmental changes (Fendt et al., 2010; Wessely et al., 2011). In fact, metabolite measurements in *E. coli* and *S. cerevisiae* have shown that most enzymes in central carbon metabolism are not saturated, with substrate levels being close to their respective *K<sup>M</sup>* values (Bennett et al., 2009; Fendt et al., 2010). A recent study in *B. subtilis* showed that transcriptional regulation is insufficient to explain the observed flux change for growth in different carbon sources (Chubukov et al., 2013). Interestingly, the authors observed that the changes in substrate concentrations were also insufficient to explain the observed flux change, leaving an important contribution for post-translational modifications and allosteric regulation.

Learning how allosteric regulation controls the metabolic flux is fundamental for understanding cellular metabolism. Given the growing scope of the constraint-based modeling approach, we propose to expand this formalism with an explicit representation for allosteric interactions. In this work, we build a constraint-based model of allosteric regulation in the central carbon metabolism of *E. coli* and use it to analyze the role of this type of regulation for controlling the metabolic flux under different perturbations.

Allosteric interaction data are collected from relevant databases and used to build a constraint-based model expanded with allosteric interactions. We analyze how this new layer of interactions affects the network topology in terms of node connectivity and identify relevant metabolic hubs. The model is used as a scaffold to perform regulation analysis using multiple omics data for *E. coli*. Finally, a new method for constraint-based simulation accounting for allosteric interactions is proposed and used for model-based prediction of regulatory effects on flux control.

## **2. Results**

## **2.1. Model Reconstruction**

In order to analyze the effects of allosteric regulation in the central carbon metabolism, we expanded a constraint-based model of the core metabolism of *E. coli* (Orth et al., 2009) with allosteric interactions obtained from relevant sources (see Figure S3 in Supplementary Material and Methods section for details). The expanded model is presented in **Figure 1**. It can be observed that the integration of regulatory interactions reveals an intricate topology that is not captured by the stoichiometric reconstruction alone. In this case, the connections represent signal flow rather than mass flow. Much like in the case of signaling pathways, it is possible to observe a highly complex *crosstalk* between different subpathways. This includes multiple feedback links between upper and lower glycolysis, upper glycolysis and the oxidative part of the pentose-phosphate (PP) pathway, lower glycolysis and the TCA cycle, and a positive feedback link from citrate to upper glycolysis.

**Figure 1** shows that most regulatory interactions are inhibitory. It is possible that some of these inhibitory interactions are competitive rather than allosteric (i.e., the binding site of the effector coincides with the catalytic site). Since the binding mechanisms are not generally reported in the databases, and the regulatory effect is similar, this distinction will be disregarded for the purpose of this work.

Topological analysis in terms of connectivity degree shows an increased connectivity for several metabolites when allosteric regulation is considered (**Figure 2**). However, the median value of connectivity remains the same (4 connections per metabolite). Unsurprisingly, there is an increased connectivity for metabolites that were previously known metabolic hubs. For instance, phosphoenolpyruvate (*pep*) is now connected to a total of 13 reactions (previously 8), reinforcing the importance of this glycolytic compound as a metabolic hub (Link et al., 2013; Matsuoka and Shimizu, 2015). However, changes are also observed for poorly connected metabolites. A notable case is *fructose-1,6-bisphosphate* (*fdp*), which can now be considered a hub metabolite (with a total of 6 connections), although its connectivity is below the median if regulation is not considered. This metabolite was recently identified as a key flux-signaling metabolite in the glycolytic flux-sensing mechanism of *E. coli* (Kochanowski et al., 2013b).
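The connectivity comparison described above can be sketched as follows. The example assumes that connectivity is the number of distinct reactions a metabolite is linked to, either as a reactant or as an effector; the four-reaction network, its names, and the helper `connectivity` are hypothetical and unrelated to the *E. coli* core model.

```python
from statistics import median

# Hypothetical toy network; reaction and metabolite names are illustrative only.
stoichiometry = {   # reaction -> metabolites taking part as substrate or product
    "R1": {"a", "b"},
    "R2": {"b", "c"},
    "R3": {"c", "d"},
    "R4": {"a", "d"},
}
regulation = {      # reaction -> metabolites acting as allosteric effectors
    "R1": {"c"},
    "R4": {"c"},
}


def connectivity(*layers):
    """Count, per metabolite, the distinct reactions it is linked to in any layer."""
    links = {}
    for layer in layers:
        for reaction, metabolites in layer.items():
            for metabolite in metabolites:
                links.setdefault(metabolite, set()).add(reaction)
    return {metabolite: len(reactions) for metabolite, reactions in links.items()}


before = connectivity(stoichiometry)
after = connectivity(stoichiometry, regulation)
print("median before:", median(before.values()), "after:", median(after.values()))
print("changed:", {m: (before[m], after[m]) for m in after if after[m] != before[m]})
```

In this toy case the median connectivity is unchanged while a single metabolite doubles its degree, which mirrors the qualitative pattern reported in the text.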

## **2.2. Omics Data-Based Analysis of Allosteric Regulation**

In order to understand how the coordination between hierarchical and metabolic regulation drives the metabolic flux, we used the reconstructed model to integrate and analyze a multi-*omics* dataset for *E. coli* (Ishii et al., 2007). This dataset contains transcript, protein, metabolite, and flux data for *E. coli* strains growing aerobically in a chemostat. It comprises several experiments, including variations of dilution rate for the wild-type strain (0.1–0.7 h<sup>−1</sup>) and 24 single knockout mutants growing at the reference dilution rate (0.2 h<sup>−1</sup>). Herein, we will refer to the wild-type strain growing at 0.2 h<sup>−1</sup> as the reference condition, and the remaining as the perturbed conditions (28 in total).

The data were analyzed using the concept of *regulation analysis* introduced by ter Kuile and Westerhoff (2001) to decompose the contribution of hierarchical (*ρ<sub>h</sub>*) and metabolic (*ρ<sub>m</sub>*) control coefficients during flux change between two experimental conditions (*ρ<sub>h</sub>* + *ρ<sub>m</sub>* = 1). We applied the generalization proposed in Chubukov et al. (2013) to simultaneously compare multiple conditions (see Methods). This generalization assumes that the coefficients are conserved across conditions. The results are presented in Figure S4 in Supplementary Material. It can be observed that in many cases the slopes are close to zero or even negative, indicating poor evidence of transcriptional control. Only three reactions (*PGI*, *CS*, *FUM*) present an estimated hierarchical control coefficient above 0.5. Hence, only these reactions are likely to be predominantly regulated at the transcriptional level.
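A minimal sketch of the decomposition may help. If a rate is written as the product of enzyme level and a metabolite-dependent term, then the log change in flux splits into an enzyme part and a metabolic part, and the hierarchical coefficient is the ratio of the log change in enzyme to the log change in flux. The multi-condition estimate below uses a least-squares slope through the origin, which is an assumption made for this illustration (the exact regression is described in the Methods of the article); the function names and the numbers are hypothetical.

```python
import math


def log_changes(values, reference):
    """Natural-log change of each value relative to the reference condition."""
    return [math.log(value / reference) for value in values]


def hierarchical_coefficient(flux, enzyme, flux_ref, enzyme_ref):
    """Estimate (rho_h, rho_m) as the slope of delta log(e) against delta log(v).

    Assumes a least-squares fit through the origin, because the reference
    condition maps to (0, 0); rho_m follows from rho_h + rho_m = 1.
    """
    x = log_changes(flux, flux_ref)       # delta log v for each perturbed condition
    y = log_changes(enzyme, enzyme_ref)   # delta log e for each perturbed condition
    rho_h = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return rho_h, 1.0 - rho_h


# Made-up values: the flux doubles and quadruples while the enzyme level grows
# only with the square root of the flux, so half of the change is hierarchical.
rho_h, rho_m = hierarchical_coefficient(
    flux=[2.0, 4.0], enzyme=[2.0 ** 0.5, 2.0], flux_ref=1.0, enzyme_ref=1.0
)
print(round(rho_h, 3), round(rho_m, 3))
```

A slope near zero (enzyme level unchanged while the flux varies) would give a hierarchical coefficient near 0 and a metabolic coefficient near 1.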

Given the lack of evident hierarchical control for most enzymes, one can try to analyze the allosteric control exerted by single effectors in a similar fashion (see Methods). The results are presented in Figure S5 in Supplementary Material. In order to observe active flux control, positive slopes would be expected for enzyme activators and negative slopes for enzyme inhibitors. However, this behavior can only be observed in a few cases. The flux of *FBA* positively correlates with its two activators, citrate and *pep*. Some correlation is also observed between ATP levels and two of its inhibition targets, *GND* and *PFK*.

Given the large number of reactions without evident transcriptional or allosteric control, we hypothesize that the assumption of constant control coefficients across all conditions does not hold for the given experimental conditions. It is likely that, during different perturbations, different kinds of control are predominant for each reaction. This has also been observed in previous studies in *S. cerevisiae* (Rossell et al., 2006).

We analyzed the flux change for each reaction at each perturbed condition individually, by comparing the logarithmic change of enzyme, flux, and metabolite levels between all 28 perturbed conditions and the reference condition. Although this would result in a total of 672 potential case studies (24 regulated reactions times 28 perturbations), due to the sparsity of the data (especially the metabolome data), this study was restricted to the reaction-condition pairs with sufficient data to perform a meaningful analysis (see Methods). This reduced the number of case studies to 38 (see Figure S6 in Supplementary Material for details). We then analyzed the evidence of allosteric control for these cases (see Methods) and observed a total of 8 cases where allosteric regulation seems to play a role in controlling the reaction flux for the given perturbation (Figure S6 in Supplementary Material). These 8 cases will be analyzed in detail below.

The regulation mechanisms of the three reactions involved (*PFK*, *PPC*, and *PYK*) are depicted in **Figure 3A**. The intricate regulation of these enzymes is evident, in particular for *PFK* and *PYK*, which are catalyzed by multiple isozymes and regulated by multiple effectors. The logarithmic change of flux and all measured intervening molecules for the selected reaction-condition pairs is presented (**Figure 3B**). It can be observed that, in most cases, the change in enzyme concentration is in the opposite direction of the flux change. For *PFK*, only one of the isozymes is measured. In the case of *PYK*, where both isozymes are measured, it can be observed that the level of one isozyme increases while the other decreases. In the few cases where the flux change follows the direction of the enzyme level, the magnitude of enzyme change is still insufficient to explain the flux change (since the reaction rate would be directly proportional to the enzyme concentration). Regarding the change in substrate levels, it can be observed that, in most cases, it is also opposite to the direction of the flux change.

The effect of allosteric control is evident in some scenarios. For instance, in the ∆*ppsA* mutant, the flux of *PPC* largely increases, despite the decrease of its only enzyme (*ppc*) and its main substrate (*pep*). This increase can be explained by the increased concentration of its allosteric activator (*fdp*). There are cases where the different allosteric regulators have a cooperative effect in flux control (e.g., *PYK* at 0.7 h<sup>−1</sup>) and cases where there is a competing effect (e.g., *PFK* at 0.4 h<sup>−1</sup>). One can observe that flux change is not always controlled by the same combination of effectors. For instance, at high dilution rates (0.4–0.5 h<sup>−1</sup>) the flux of *PFK* increases with the decrease of its inhibitors (ATP and *pep*), despite the decrease of its activator (ADP). However, at an even higher dilution rate (0.7 h<sup>−1</sup>), the flux increase coincides with higher levels of the activator, whereas the two inhibitors change in opposite directions.

The interpretation of the results is hampered by the lack of protein and metabolite measurements for many experimental conditions. One cannot exclude the possibility that some flux changes are also driven by changes in unmeasured isozymes, cofactors, or reaction products.

## **2.3. Model-Based Prediction of Allosteric Regulation**

Given the scarcity of multi-*omics* datasets with all the data required to perform a quantitative analysis of allosteric regulation, we developed a constraint-based approach for model-based predictions. This method is based on the assumption that, if a reaction is activated (respectively, inhibited) by a compound present in a pathway, then its flux change should be positively (respectively, negatively) correlated with the flux change in that pathway (see Supplementary Material for details). It has been proposed that allosteric intermediates function as flux-signaling metabolites that directly translate flux information to metabolite concentration (Kotte et al., 2010; Matsuoka and Shimizu, 2015). The method, named allosteric regulation FBA (arFBA), is a variation of parsimonious FBA (pFBA) (Lewis et al., 2010) where the objective function is extended as follows:

$$\min_{v} \sum_{j} |v_{j}| + \sum_{R_{ij} > 0} w_{ij} \left| \frac{v_{j}}{v_{j}^{0}} - \frac{t_{i}}{t_{i}^{0}} \right| + \sum_{R_{ij} < 0} w_{ij} \left| \frac{v_{j}}{v_{j}^{0}} + \frac{t_{i}}{t_{i}^{0}} - 2 \right|.$$

Here, *v* is the flux distribution to be estimated, *v*<sup>0</sup> is the flux distribution for a given reference condition, and *t<sub>i</sub>* is the turnover rate of metabolite *i*. The allosteric interactions are represented in a new matrix *R*, which has a structure similar to the stoichiometric matrix, with *R<sub>ij</sub>* = 1 (respectively, −1) if metabolite *i* activates (respectively, inhibits) reaction *j*, and 0 otherwise (note that the stoichiometric matrix *S* is not changed). The *w<sub>ij</sub>* parameters are arbitrary weights that represent the strength of the interaction between effector *i* and reaction *j*. If all *w<sub>ij</sub>* are close to zero, then the method defaults to a simple pFBA simulation. The minimization of the extra terms in the objective function affects the respective fluxes when regulation is active. For an activation, the subtraction forces the flux and turnover ratios to be the same. For an inhibition, the term forces a change in the turnover to be compensated by an opposite change in the flux. A detailed justification for these terms is given in the Supplementary Material. The full implementation of the method is slightly more complex due to the presence of reversible reactions and reactions without flux in the reference condition (see Supplementary Material for a complete description).
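To make the extra terms concrete, the sketch below evaluates the objective for a given candidate flux vector; it does not solve the optimization problem and ignores the special cases mentioned above (reversible reactions and zero reference fluxes). It assumes that each regulation term is multiplied by its weight and that the turnover of a metabolite is the sum of the fluxes producing it. The three-reaction pathway, the matrices, and the function names are hypothetical.

```python
def turnover(S, v):
    """Turnover per metabolite, taken here as the total producing flux (assumption)."""
    return [sum(max(s_ij * v_j, 0.0) for s_ij, v_j in zip(row, v)) for row in S]


def arfba_objective(S, R, W, v, v0):
    """Evaluate the pFBA term plus the weighted activation and inhibition terms."""
    t, t0 = turnover(S, v), turnover(S, v0)
    total = sum(abs(v_j) for v_j in v)            # parsimony (pFBA) term
    for i, row in enumerate(R):
        for j, r_ij in enumerate(row):
            if r_ij == 0:
                continue                           # metabolite i does not regulate j
            flux_ratio = v[j] / v0[j]
            turnover_ratio = t[i] / t0[i]
            if r_ij > 0:                           # activation: ratios should match
                total += W[i][j] * abs(flux_ratio - turnover_ratio)
            else:                                  # inhibition: ratios should offset
                total += W[i][j] * abs(flux_ratio + turnover_ratio - 2.0)
    return total


# Hypothetical linear pathway: R0: -> A, R1: A -> B, R2: B ->
S = [[1, -1, 0],
     [0, 1, -1]]
R = [[0, 0, 1],      # A activates R2 (feedforward)
     [-1, 0, 0]]     # B inhibits R0 (feedback)
W = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0]]
v_ref = [1.0, 1.0, 1.0]
v_new = [2.0, 2.0, 2.0]

print(arfba_objective(S, R, W, v_ref, v_ref))   # only the parsimony term remains
print(arfba_objective(S, R, W, v_new, v_ref))   # the inhibition term is now non-zero
```

Doubling all fluxes leaves the activation term at zero, because flux and turnover ratios stay equal, whereas the inhibition term penalizes the simultaneous increase of the inhibitor turnover and the inhibited flux.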

In general, it is not possible to know the strength of the allosteric interactions beforehand. Therefore, we implemented an ensemble modeling approach in order to find the most plausible models (**Figure 4**). The approach is similar, though not identical, to the ensemble modeling approach used for kinetic modeling (Tran et al., 2008). A model ensemble was built by randomly sampling the *w<sub>ij</sub>* parameters (see Methods). The simulated flux distributions are then compared with the intracellular flux data from Ishii et al. (2007). The accuracy of each model is given by the (*L*1-norm) distance between the predicted and measured flux distributions. The original ensemble is split into two groups containing the models with prediction accuracy above and below the median. We then perform enrichment analysis by comparing the distributions of each parameter between the two ensembles. For a particular experimental condition, if a parameter *w<sub>ij</sub>* has systematically higher values in the ensemble with higher predictive accuracy, then the assumption of allosteric control between effector *i* and reaction *j* results in improved flux predictions for that condition.
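The ensemble procedure above can be outlined in a few lines. This is only a sketch under stated assumptions: `toy_simulate` is a hypothetical stand-in for an arFBA simulation, the weights are sampled uniformly between 0 and 1, and the comparison of parameter distributions uses Welch's *t* statistic, which may differ from the exact test used in the article.

```python
import random
from statistics import mean, median, variance


def l1_distance(predicted, measured):
    """L1-norm distance between predicted and measured flux vectors."""
    return sum(abs(p - m) for p, m in zip(predicted, measured))


def welch_t(a, b):
    """Welch's t statistic for two samples (assumed variant of the t-test)."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5


def enrichment(samples, distances):
    """Per-parameter t statistic: accurate half versus inaccurate half of the ensemble."""
    cut = median(distances)
    good = [w for w, d in zip(samples, distances) if d < cut]    # closer to the data
    poor = [w for w, d in zip(samples, distances) if d >= cut]
    n_params = len(samples[0])
    return [welch_t([w[k] for w in good], [w[k] for w in poor])
            for k in range(n_params)]


def toy_simulate(weights):
    """Hypothetical simulator: only the first weight influences the prediction."""
    return [1.0 - 0.5 * weights[0], 2.0]


rng = random.Random(0)
measured = [0.5, 2.0]   # made-up flux measurements
samples = [[rng.random(), rng.random()] for _ in range(200)]
distances = [l1_distance(toy_simulate(w), measured) for w in samples]
t_values = enrichment(samples, distances)
print([round(t, 2) for t in t_values])
```

A large positive statistic for a parameter indicates that higher weights are concentrated in the more accurate half of the ensemble, which the text interprets as evidence that the corresponding interaction improves the predictions for that condition.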

**Figure 5** shows *t*-test values for all parameters across all experimental conditions. Although there are not clearly defined clusters in the clustered heatmap, some general patterns can be observed. About one-quarter of the interactions are positively enriched for most experimental conditions, representing probable cases of active allosteric control for those conditions. On the other hand, almost half of the parameters are negatively enriched for a majority of conditions. These represent allosteric constraints that, in most cases, hamper the predictive ability of the models. Finally, there is a subset of allosteric interactions which are neither positively nor negatively enriched. Accounting for these interactions has very little effect in the prediction of flux distributions for the given experimental conditions.

The most frequent positively enriched interactions include inhibition of the oxidative phase of the pentose-phosphate pathway (PPP) by reducing agents NADH and NADPH; mutual inhibition between PPP and upper glycolysis; feedforward activation of *PPC* and *PYK* by *fdp*; and inhibition of the glyoxylate shunt by multiple effectors. Interestingly, several parameters that are positively enriched for a subset of conditions are also negatively enriched for some of the remaining conditions. Hence, although the respective interactions improve the flux predictions in some conditions, in other conditions they make predictions worse.

In order to test the predictive ability of our *in silico* approach, we analyzed the enrichment results for the potential cases of allosteric control previously detected by data-driven analysis (**Figure 3B**). Some of the allosteric interactions were significantly enriched, namely the activation of *PFK* by ADP at the highest dilution rate (*t* = 4.28, *p* = 1.88e-5), activation of *PPC* by *fdp* in the ∆*ppsA* mutant (*t* = 19.0, *p* = 8.17e-79), and activation of *PYK* by *fdp* in the ∆*gnd* mutant (*t* = 4.70, *p* = 2.63e-6) and the ∆*galM* mutant (*t* = 7.09, *p* = 1.44e-12).

It should be noted that we are using our simulation method (arFBA) in the reverse direction, i.e., a model ensemble is compared with experimental data to find which parameters (weighting factors) result in improved predictions. Although, in theory, one could use the method in the forward direction, i.e., to perform simulations with improved flux predictions, this would require finding a "universal" parameter configuration that fits all conditions. The previous results show that such a universal configuration cannot be found due to the condition-specific nature of allosteric regulation. Nonetheless, we tested the accuracy of arFBA by measuring the distance between simulated and experimental flux distributions. **Figure 6** shows the frequency distribution of the distances obtained by random sampling of the weighting factors for each experimental condition. The distance obtained with FBA is shown for comparison. It can be observed that, for most experimental conditions, the average distance obtained with arFBA is lower than that obtained with FBA, indicating a higher accuracy of the former. Finally, we tested the accuracy of arFBA with *a posteriori* calibration of the weighting factors (see Methods). It can be observed that, after calibration, the accuracy of arFBA is higher than FBA for 26 of the 28 conditions.

## **3. Discussion**

In this work, we analyzed the role of allosteric regulation for flux control in the central carbon metabolism of *E. coli*. For this, we extended a constraint-based metabolic model of *E. coli* with allosteric regulation. The application of such a model is twofold. First, it can be used as an integrative scaffold for multi-*omics* dataset analysis, revealing the coordination between enzyme, metabolite, and flux levels. Second, it can be used for *in silico*-based predictions that account for allosteric regulation in the simulation of the metabolic phenotype. For that purpose, we implemented an FBA variant, named arFBA, that accounts for allosteric interactions in the determination of the flux distribution.

Using the expanded model and a multi-*omics* dataset for *E. coli* (Ishii et al., 2007), we analyzed the impact of allosteric regulation in controlling the metabolic flux under multiple environmental and genetic perturbations. We implemented a generalized form of regulation analysis (ter Kuile and Westerhoff, 2001) in order to find which reactions are predominantly under transcriptional or allosteric control. The results reveal that most reactions are generally not controlled by the same mechanism across all conditions. This led us to analyze the effects of perturbations in single reactions for each experimental condition. This analysis is hampered by missing protein and metabolite measurements, which does not allow accounting for all participating compounds in the reactions analyzed. Although we neglected the effect of missing isozyme and cofactor measurements, as well as product concentrations for irreversible reactions, only 38 out of 672 possible case studies (24 reactions *×* 28 perturbations) could be analyzed in a meaningful way (Figure S6 in Supplementary Material). Nonetheless, it was possible to identify 8 (out of 38) cases where the reaction flux is predominantly controlled by allosteric mechanisms.

Considering that the dataset published by Ishii et al. (2007) is one of the most comprehensive multi-*omics* datasets for a model organism published so far, we can conclude that purely data-driven analysis is very limited for studying metabolic regulation. Therefore, we applied our simulation method using an ensemble modeling approach to identify which allosteric interactions result in improved flux predictions. Enrichment analysis of the weighting factors in our model revealed that several allosteric interactions were significantly enriched when the models were filtered by their agreement with experimental flux data. A comparison between the *in silico* results and the data-driven analysis showed that 4 of the 8 cases of allosteric control previously identified were also detected by the computational approach.

Given the very limited scope of the cases analyzed in detail, the cross-comparison between the data-driven and *in silico* results can hardly be considered a validation of the latter. In order to determine the accuracy of the simulation method it would be necessary to estimate the number of false positive and false negative results for the whole dataset. Instead, the two approaches should be seen as complementary methods to guide the analysis of allosteric regulation. Furthermore, the data analysis revealed that the predominant mode of regulation for each reaction is condition dependent. This was also observed in the *in silico* analysis, hampering the determination of a universal set of weighting factors for arFBA. Given the interplay between different regulation mechanisms, the approach developed herein could be suitable for integration with other methods for identification of regulation mechanisms (Bordel et al., 2010).

An ensemble modeling approach was also employed by Link et al. (2013) for systematic identification of allosteric interactions in *E. coli*. The authors measured metabolite concentrations using rapid sampling and <sup>13</sup>*C*-labeled substrates (glucose and fructose) to determine the transient profile of glycolytic intermediates in dynamic cultures switching between glycolysis and gluconeogenesis. A kinetic ensemble model for glycolysis was used to test 126 putative interactions. The results not only confirmed previously known interactions but also predicted new interactions that had not been previously reported. Although the model used in this study differs from ours, the results regarding interactions common to both models are consistent. In particular, both studies revealed the importance of *PFK* as an active regulation target for controlling the glycolytic flux, and the role of *fdp* as key regulator of *PPC* and *PYK* to control *pep* consumption.

At the end of our data-driven analysis, some flux changes remain unexplained by hierarchical or metabolic control. One main reason for this is the lack of coverage of the metabolomics data, which only accounts for approximately half of the metabolites in the model. Another possibility is that the regulatory mechanisms for the respective enzymes are not fully known or the relevant allosteric interactions were not included in the model. It is also possible that the enzyme concentrations do not correlate with the respective enzymatic activity due to post-translational modifications (PTMs). It has been shown that PTMs, such as acetylation, have important regulatory functions in *E. coli* (Castaño-Cerezo et al., 2014).

The generation of high-quality multi-*omics* datasets will be necessary for a deeper understanding of metabolic regulation. Herein, we used a previously published dataset for chemostat cultures. However, steady-state data may be insufficient to analyze regulatory responses. It has been observed that fast metabolic responses precede the slower transcriptional response during metabolic adaptation (Ralser et al., 2009). Since allosteric regulation operates on a faster time-scale compared to transcriptional regulation, transient profiles on short time scales should be particularly informative (Link et al., 2013).

## **4. Conclusion**

In this work, we focused on the role of allosteric regulation in central carbon metabolism. The reconstruction of an allosteric model revealed that allosteric information is inconsistent among different data sources even for these highly studied pathways. The allosteric interactions added a new layer to the network topology, changing the overall network connectivity and revealing metabolic hubs that would otherwise be ignored (e.g., *fdp*). Hierarchical and allosteric regulation analysis using a multi-*omics* dataset revealed that there is no predominant mechanism of regulation across all experimental conditions. Nonetheless, situations of predominant allosteric control could be identified for some reactions under particular conditions. Our new method for model-based prediction of allosteric control was able to capture at least a few of these situations. However, the assessment of the predictive ability of this method is hampered by the lack of more comprehensive data.

For central carbon metabolism, it would have been feasible to perform this analysis using a kinetic modeling approach [similarly to Link et al. (2013)]. However, as we move toward regulatory analysis at the genome-scale, the constraint-based approach should become especially useful. Building a genome-scale model of allosteric regulation is a daunting task that will require literature mining, extensive manual curation, and prediction of putative interactions. Our knowledge of the *allosterome* is currently limited by the lack of high-throughput screening methods for detecting metabolite–enzyme interactions. It is likely that the vast majority of allosteric interactions are yet to be discovered (Lindsley and Rutter, 2006). Recent experimental methods have been developed toward systematic identification of metabolite–protein interactions (Gallego et al., 2010; Li et al., 2010; Orsak et al., 2011; Feng et al., 2014). However, we are still far from a genome-scale screening of the hundreds of thousands of potential interactions between all metabolites and enzymes in an organism.

Notebaart et al. (2014) have recently unraveled the *underground* metabolism of *E. coli* by expanding a genome-scale metabolic model with reactions resulting from promiscuous enzyme activity. With the *allosterome*, we can unravel yet another hidden layer in the network topology of cellular metabolism. New expanded models of metabolism will certainly be useful for applications, such as drug discovery and rational strain design, as we slowly move toward what has been called the "second secret of life" (Fenton, 2008).

A Python implementation of arFBA as well as the allosteric model in SBML format are available on GitHub: https://github.com/cdanielmachado/arfba.

## **5. Materials and Methods**

## **5.1. Model Reconstruction**

The original model of the core metabolism of *E. coli* (Orth et al., 2009) was extended with allosteric interactions obtained from BRENDA (Schomburg et al., 2002), EcoCyc (Keseler et al., 2011), and two previously published kinetic models (Chassagnole et al., 2002; Kotte et al., 2010). We searched for evidence of regulatory interactions for each possible combination of enzymes and metabolites in the model. A total of 148 regulatory interactions were found (Figure S3 in Supplementary Material). Since the majority of these interactions can only be found in one data source, for the sake of curation we only included in the model the interactions that are reported in at least two different sources. In a few cases the same metabolite is reported as activator and inhibitor of an enzyme (e.g., *phosphoenolpyruvate* binding to *fructose-bisphosphatase*). In these cases, we used the most frequently reported effect.
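The curation rule described above (keep an interaction only if at least two sources report it, and resolve conflicting signs by the most frequently reported effect) can be illustrated with a minimal sketch. The function name, the record layout, and the example records are assumptions made for illustration only; they are not taken from the published implementation or from the actual database contents.

```python
from collections import Counter, defaultdict


def curate_interactions(records, min_sources=2):
    """Filter regulatory interactions by number of supporting sources.

    records: iterable of (enzyme, metabolite, effect, source) tuples, where
    effect is "activator" or "inhibitor" (hypothetical layout).
    Returns {(enzyme, metabolite): effect} for interactions reported in at
    least `min_sources` distinct sources, using the most frequent effect.
    """
    sources = defaultdict(set)
    effects = defaultdict(Counter)
    seen = set()
    for enzyme, metabolite, effect, source in records:
        if (enzyme, metabolite, effect, source) in seen:
            continue  # count each (effect, source) report only once
        seen.add((enzyme, metabolite, effect, source))
        key = (enzyme, metabolite)
        sources[key].add(source)
        effects[key][effect] += 1

    curated = {}
    for key, supporting in sources.items():
        if len(supporting) >= min_sources:
            # Ties are broken by first occurrence; a real curation step
            # would likely resolve them manually.
            curated[key] = effects[key].most_common(1)[0][0]
    return curated


# Made-up records, for illustration only.
example_records = [
    ("PFK", "pep", "inhibitor", "BRENDA"),
    ("PFK", "pep", "inhibitor", "EcoCyc"),
    ("FBP", "pep", "activator", "BRENDA"),
    ("FBP", "pep", "inhibitor", "EcoCyc"),
    ("FBP", "pep", "inhibitor", "Kotte2010"),
    ("PYK", "fdp", "activator", "BRENDA"),  # single source: dropped
]
curated_example = curate_interactions(example_records)
```

In this toy input, the single-source interaction is discarded and the conflicting reports are resolved by majority.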

## **5.2. Regulation Analysis**

## 5.2.1. Cross-Condition Analysis

The metabolic flux of a reaction (*J<sub>i</sub>*) can be generically described in terms of the concentrations of the respective enzyme(s) (*E<sub>i</sub>*) and all the intervening metabolites (substrates, products, effectors):

$$J_i = k_{\text{cat}} E_i f(M)$$

where *k*<sub>cat</sub> is the turnover rate of the enzyme, and *f*(*M*) represents a non-linear function of the metabolite concentrations. *Regulation analysis*, introduced by ter Kuile and Westerhoff (2001), decomposes the contribution from hierarchical and metabolic control by considering the logarithmic change between two experimental conditions:

$$
\Delta \log(J_i) = \Delta \log(E_i) + \Delta \log(f(M))
$$

and estimating the respective contribution coefficients:

$$1 = \frac{\Delta \log(E_i)}{\Delta \log(J_i)} + \frac{\Delta \log(f(M))}{\Delta \log(J_i)} = \rho_h + \rho_m.$$

Since *f*(*M*) is generally unknown, one can estimate *ρ<sub>h</sub>* (and consequently *ρ<sub>m</sub>*) by measuring the enzyme and flux levels across different conditions. Chubukov et al. (2013) generalized this comparison from two to multiple conditions in order to decrease the effects of experimental error. The estimation is performed by linear regression between log(*E<sub>i</sub>*) and log(*J<sub>i</sub>*) across all experimental conditions using a robust linear regression method (Theil–Sen estimator).
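As an illustration of the multi-condition estimate, the sketch below computes a Theil–Sen slope (the median of all pairwise slopes) in plain Python. It assumes that log(*E*) is regressed on log(*J*), so that the slope corresponds to the hierarchical coefficient, and that all fluxes and enzyme levels are strictly positive; the function names are hypothetical and the code is not the published implementation.

```python
import math
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of the slopes between all pairs of points with distinct x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
        if x2 != x1
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    return median(slopes)


def regulation_coefficients(enzyme_levels, fluxes):
    """Estimate (rho_h, rho_m) for one reaction across several conditions.

    Assumes strictly positive values and takes the slope of log(E)
    against log(J) as rho_h, so that rho_m = 1 - rho_h.
    """
    log_e = [math.log(e) for e in enzyme_levels]
    log_j = [math.log(j) for j in fluxes]
    rho_h = theil_sen_slope(log_j, log_e)
    return rho_h, 1.0 - rho_h
```

For example, if the enzyme level scales with the square root of the flux across conditions, `regulation_coefficients` returns a hierarchical coefficient of about 0.5, meaning that roughly half of the flux change would be attributed to metabolic control.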

We further generalized this concept to the study of allosteric regulation, by decoupling the effect of allosteric regulators in the reaction flux from the non-linear *f*(*M*) component, using a power-law approximation:

$$f(M) \approx g(S, P) \prod_{j} A_j^{\gamma_{ij}} \prod_{j} I_j^{-\gamma_{ij}}$$

where *S*, *P*, *A*, *I* represent, respectively, the set of substrates, products, activators and inhibitors of reaction *i*, and *γ<sub>ij</sub>* is the apparent kinetic order of effector *j* in reaction *i*, as defined in Biochemical Systems Theory (Voit, 2013). This allows us to estimate individual allosteric regulation coefficients (*ρ<sub>a</sub>*) for each effector as:

$$\rho_{a(j)} = \begin{cases} \gamma_{ij} \frac{\Delta \log(A_j)}{\Delta \log(J_i)} & \text{if } j \text{ is an activator of } i \\ -\gamma_{ij} \frac{\Delta \log(I_j)}{\Delta \log(J_i)} & \text{if } j \text{ is an inhibitor of } i \end{cases}$$

With the exception of effectors exhibiting cooperative binding, we can assume that the kinetic orders are close to or below unity (*γ<sub>ij</sub>* ≤ 1). Hence, the allosteric control coefficient is bounded by the slope of the linear regression.

Regulation analysis was performed for all allosterically regulated reactions with available fluxomics and proteomics data. A total of 18 (out of 24) regulated reactions were experimentally measured. Due to gaps in the proteomics dataset, we restricted the analysis to enzymes with available data for at least 10 (out of 29) experimental conditions.

## 5.2.2. Single-Condition Analysis

Allosteric effects were analyzed for each perturbation individually by comparing the logarithmic change of enzyme, flux, and metabolite levels between all 28 perturbed conditions and the reference condition. Due to the sparsity of the data (especially the metabolome data), this analysis was restricted to all reaction–condition combinations where the following criteria were satisfied: (1) at least one associated enzyme was measured; (2) all main substrates (excluding cofactors) were measured; (3) at least one effector was measured. Furthermore, we excluded flux changes that were not significant (i.e., the perturbed flux falls within a 95% confidence interval of the reference flux).

Evidence of allosteric control was detected by selecting conditions where the flux change is not fully explained by changes in enzyme concentration (∆log(*E*)/∆log(*J*) *<* 0.5) or substrate abundance (∆log(*S*)/∆log(*J*) *<* 0.5), and is at least partly related to changes in one allosteric activator (∆log(*A*)/∆log(*J*) *>* 0.25) or inhibitor (*−*∆log(*I*)/∆log(*J*) *>* 0.25). For reversible reactions, the effect of flux changes arising from changes in the thermodynamic driving force cannot be excluded. Therefore, for these reactions we only considered cases where the products were experimentally measured (excluding cofactors) and the flux change cannot be fully explained by the change in product abundance (*−*∆log(*P*)/∆log(*J*) *<* 0.5).
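The selection criteria for a single reaction–condition pair can be written as a short predicate. The sketch below is one possible reading of those criteria: it assumes one enzyme measurement per reaction, applies the substrate and product thresholds to every measured substrate and product, and takes all arguments as precomputed log-changes relative to the reference condition. The function name and argument layout are hypothetical.

```python
import math


def shows_allosteric_control(d_flux, d_enzyme, d_substrates,
                             d_activators=(), d_inhibitors=(), d_products=()):
    """Apply the 0.5 / 0.25 thresholds to log-changes versus the reference.

    All arguments are Delta-log values (perturbed vs. reference).
    d_products should be supplied only for reversible reactions.
    """
    if d_flux == 0:
        raise ValueError("flux change must be non-zero")
    not_enzyme = d_enzyme / d_flux < 0.5
    not_substrate = all(ds / d_flux < 0.5 for ds in d_substrates)
    not_product = all(-dp / d_flux < 0.5 for dp in d_products)
    effector_related = (
        any(da / d_flux > 0.25 for da in d_activators)
        or any(-di / d_flux > 0.25 for di in d_inhibitors)
    )
    return not_enzyme and not_substrate and not_product and effector_related


# Illustrative values: flux doubles, enzyme and substrate barely change,
# and an inhibitor drops to half of its reference concentration.
example_call = shows_allosteric_control(
    d_flux=math.log(2.0),
    d_enzyme=math.log(1.1),
    d_substrates=[math.log(1.05)],
    d_inhibitors=[math.log(0.5)],
)
```

Here the inhibitor term gives −∆log(*I*)/∆log(*J*) = 1, above the 0.25 threshold, while the enzyme and substrate ratios stay below 0.5, so the case would be flagged.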

## 5.2.3. Ensemble Modeling with arFBA

For each experimental condition, an ensemble of 10<sup>4</sup> models was built by sampling the weighting factors (*w<sub>ij</sub>* parameters) from a log-normal distribution. Each model is constrained with the experimentally measured glucose and oxygen uptake rates, and the growth rate, which is given by the dilution rate.

## 5.2.4. Calibration of Weighting Factors in arFBA

Condition-specific weighting factors were calibrated for each experimental condition as follows: an ensemble of 10<sup>4</sup> arFBA models was built as described above; the accuracy of each model was determined by the *L*<sub>1</sub>-norm distance between the experimental and simulated flux distributions; the calibrated weighting factors were calculated as the average of the 10% most accurate models.
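The calibration loop can be sketched as follows. This is a schematic outline under several assumptions: `simulate` is a hypothetical stand-in for an arFBA solve that maps a list of weighting factors to a flux vector, the log-normal parameters (`mu`, `sigma`) are placeholders because the text does not state them, and "average" is taken to be the arithmetic mean.

```python
import random


def l1_distance(simulated, measured):
    """L1-norm distance between two flux vectors of equal length."""
    return sum(abs(s - m) for s, m in zip(simulated, measured))


def calibrate_weights(simulate, measured_fluxes, n_interactions,
                      n_models=10_000, top_fraction=0.10,
                      mu=0.0, sigma=1.0, seed=0):
    """Sample an ensemble and average the weights of the best models.

    simulate: callable(weights) -> simulated flux vector (hypothetical
    placeholder for the arFBA simulation of one condition).
    """
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_models):
        weights = [rng.lognormvariate(mu, sigma) for _ in range(n_interactions)]
        distance = l1_distance(simulate(weights), measured_fluxes)
        ensemble.append((distance, weights))

    ensemble.sort(key=lambda item: item[0])       # most accurate first
    n_best = max(1, round(n_models * top_fraction))
    best = [weights for _, weights in ensemble[:n_best]]
    # Arithmetic mean of each weighting factor over the selected models.
    return [sum(column) / n_best for column in zip(*best)]
```

A geometric mean might be a reasonable alternative for log-normally distributed parameters; the arithmetic mean is used here only because the text says "average".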

## **Acknowledgments**

DM and IR would like to thank the FCT Strategic Project PEst-OE/EQB/LA0023/2013 and the Project "BioInd – Biotechnology and Bioengineering for improved Industrial and Agro-Food processes", REF. NORTE-07-0124-FEDER-000028, co-funded by the Programa Operacional Regional do Norte (ON.2 – O Novo Norte), QREN, FEDER. MJH would like to thank the Novo Nordisk Foundation for support.

## **Supplementary Material**

The Supplementary Material for this article can be found online at http://journal.frontiersin.org/article/10.3389/fbioe.2015.00154




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Machado, Herrgård and Rocha. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Analysis of genetic variation and potential applications in genome-scale metabolic modeling

#### **João G. R. Cardoso<sup>1</sup> , Mikael Rørdam Andersen<sup>2</sup> , Markus J. Herrgård<sup>1</sup> and Nikolaus Sonnenschein<sup>1</sup>\***

<sup>1</sup> The Novo Nordisk Foundation Center of Biosustainability, Technical University of Denmark, Hørsholm, Denmark <sup>2</sup> Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark

#### **Edited by:**

Natalia Polouliakh, Sony Computer Science Laboratories Inc., Japan

#### **Reviewed by:**

Guanglong Jiang, Indiana University School of Medicine, USA Nathan Price, Institute for Systems Biology, USA

#### **\*Correspondence:**

Nikolaus Sonnenschein, The Novo Nordisk Foundation Center of Biosustainability, Technical University of Denmark, Kogle Allé 6, Hørsholm DK-2970, Denmark e-mail: niso@biosustain.dtu.dk

Genetic variation is the motor of evolution and allows organisms to overcome the environmental challenges they encounter. It can be both beneficial and harmful in the process of engineering cell factories for the production of proteins and chemicals. Throughout the history of biotechnology, there have been efforts to exploit genetic variation in our favor to create strains with favorable phenotypes. Genetic variation can either be present in natural populations or it can be artificially created by mutagenesis and selection or adaptive laboratory evolution. On the other hand, unintended genetic variation during a long term production process may lead to significant economic losses and it is important to understand how to control this type of variation. With the emergence of next-generation sequencing technologies, genetic variation in microbial strains can now be determined on an unprecedented scale and resolution by re-sequencing thousands of strains systematically. In this article, we review challenges in the integration and analysis of large-scale re-sequencing data, present an extensive overview of bioinformatics methods for predicting the effects of genetic variants on protein function, and discuss approaches for interfacing existing bioinformatics approaches with genome-scale models of cellular processes in order to predict effects of sequence variation on cellular phenotypes.

**Keywords: genetic variation, SNP, next-generation sequencing, constraint-based modeling, metabolic engineering, adaptive laboratory evolution, metabolism, high-throughput analysis**

### **1. INTRODUCTION**

Genetic engineering has been used for several decades to manipulate microorganisms in order to allow production of valuable products, including primary metabolites (e.g., amino-acids and organic acids), secondary metabolites (e.g., antibiotics), and enzymes or other recombinant proteins (Adrio and Demain, 2010). Genetic engineering is thus a central part in the quest to establish sustainable and efficient processes for the production of fuels, chemicals, food ingredients, and pharmaceutical products.

Most of these achievements would not have been possible without sequencing technologies that allowed us to identify the genetic sequences and validate the genetic manipulations in microorganisms. More recently, Next-Generation Sequencing (NGS) technologies have provided us with the capability of fast and cheap sequencing of DNA at an unprecedented scale. NGS has allowed *de novo* assembly of the genomes of thousands of organisms for which no genome sequences were previously available, ranging from complex multicellular organisms (Li et al., 2010; Nakamura et al., 2013; Pegadaraju et al., 2013; Kelley et al., 2014) to microorganisms (Soares-Castro and Santos, 2013; Yamamoto et al., 2014). NGS technologies also provide us with the means to re-sequence organisms (Atsumi et al., 2010; Wang et al., 2014), i.e., the sequencing of genetically distinct strains that are close enough to a reference strain with a sequenced genome. Re-sequencing is used to determine genetic variants ranging from single nucleotide variants (SNV) to more complex structural variants such as large deletions, inversions, and translocations. The falling cost of sequencing allows routine re-sequencing of strains isolated from the wild, monitoring the genetic stability of production strains during genetic engineering and fermentation processes, and determining the genetic basis of adaptive laboratory evolution (ALE) (Herrgård and Panagiotou, 2012). In addition to biotechnological applications, re-sequencing of microbial strains also plays a key role in other areas such as epidemiology of infectious diseases caused by bacterial and fungal pathogens, and in understanding the effects of human activity on microbial diversity and evolution in the environment.

Genome-scale metabolic models (GSMs), consisting of biochemical reactions and their relations to the genome and proteome of a cell [through gene–protein-reaction (GPR) associations], are a proven framework for the *in silico* analysis of the metabolic physiology of microbes. Genome-scale metabolic models have also been used successfully for the design of metabolically engineered strains with improved production of commercially valuable proteins and metabolites: recombinant antibodies, food additives (e.g., vanillin), organic acids, ethanol, among others (Tepper and Shlomi, 2009; Brochado et al., 2010). These models have become increasingly popular over the past decade, and more than 100 models for different organisms have been published to date (http://optflux.org/models). The greatest strength of GSMs lies in their simplicity and computational efficiency; new GSMs can be readily built from genomic annotations complemented with limited experimental data, and predictions from GSMs can be obtained using standard mathematical optimization methods (Varma and Palsson, 1993; Segrè et al., 2002; Shlomi et al., 2005) allowing phenotypic predictions within minutes.

Genetic variation that entails a complete loss of function – commonly referred to as gene knockout – has been successfully used to tailor GSMs to a specific genotype to improve the production of valuable compounds [e.g., biobutanol (Lee et al., 2008), sesquiterpene (Asadollahi et al., 2009), vanillin (Brochado et al., 2010), polyhydroxyalkanoates (Puchałka et al., 2008), or L-valine (Park et al., 2007)], but so far no methodological framework has been developed that would allow the incorporation of other types of genetic variants systematically. In this work, we review existing tools for analyzing genetic variants that capture more subtle changes such as synonymous and non-synonymous SNVs in coding regions or variants in promoter or other regulatory regions. We will focus on outlining the challenges of combining more subtle genetic variant information with GSMs in order to use models to predict strain-specific phenotypes.

#### **2. UNVEILING THE EFFECTS OF GENETIC VARIATION**

#### **2.1. GENETIC VARIABILITY**

Genetic variants, including SNVs and larger structural variants, are commonly seen when natural or engineered strains are resequenced (**Figure 1**). SNVs can be found across the genome in different functional regions: (i) protein coding sequences, (ii) promoters and other regulatory elements such as ribosome binding sites, (iii) splice sites and other regions affecting transcript structures, and (iv) other genomic regions with unknown direct connections to any given protein function. Moreover, insertions or deletions of nucleotides (indels) within a coding region can cause a shift in the open reading frame, usually denoted as frameshift mutations (**Figure 1A**). At the genome structure level, chromosomal rearrangements, e.g., swaps, inversions, deletions, and insertions, can affect the function of one or more proteins (**Figure 1B**).

The spectrum of the resulting effects caused by these genetic variations on individual gene or protein function or expression is very broad. Non-synonymous SNVs or in-frame indels in protein coding sequences can disrupt, enhance, or modify the activity of the protein depending on the exact amino-acid change introduced. Introduction or removal of a stop codon by specific SNVs or out-of-frame indels would be expected to result in more drastic changes of protein function. For example, the appearance of a stop codon might lead to the separation of a multi-domain protein into multiple individual single-domain proteins. The removal or replacement of a stop codon could cause translational read-through leading to an elongated protein with potential new functions (Long et al., 2003). SNVs and indels in regulatory regions such as promoters can affect the transcription or translation processes giving rise to variation in expression levels in specific proteins. In eukaryotes, variants within introns can also affect transcript structures by introducing new exons or removing existing ones. Some variations can also be completely silent with no change of phenotype, for example, a change in a stop codon location might not change the protein activity. Ideally, we should be able to predict the degree to which single and multiple genetic variants within or near a coding locus affect the relevant protein function or expression. This would allow us to rapidly make sense of the vast quantities of re-sequencing data that is becoming available without having to test the effects of all variants experimentally.

Larger-scale structural variations, such as duplications, deletions, translocations, and inversions, can have significant effects on the expression or activity of individual proteins. For example, there can be a complete loss of one or more genes, or a duplication of genomic regions can modify the expression of multiple genes within or nearby these regions (Blount et al., 2012). Very large-scale genomic changes, such as duplication of entire chromosomes, can change the activity of hundreds of proteins at once and have been reported in both natural microbial strains (Gordon et al., 2009) and in strains created by ALE (Caspeta et al., 2014). The effects of structural genomic variation are often more systemic than the effects of smaller scale variations, but any framework attempting to predict the phenotypic effects of genetic variation needs to consider both small- and large-scale variation.

#### **2.2. IN SILICO: PREDICTING THE EFFECT OF GENETIC VARIANTS**

A major challenge to understanding the phenotypic consequences of genetic variation lies in our ability to predict the mechanistic consequences of mutations. Proteins are very complex structures that fall into different functional categories and can be characterized by many distinct properties. For example, how protein activities are measured depends on their functional category: transcription factors can be characterized by their binding strength to a certain promoter region while metabolic enzymes would typically be characterized by their catalytic activity and specificity for a certain substrate. Moreover, proteins do not operate in isolation but interact with each other and with metabolites, and these interactions have consequences on the activities of proteins. Here, we provide a non-exhaustive review of the types of methods that are commonly used to predict the effects of genetic variants on protein function.

The study of single nucleotide polymorphisms (SNPs) that affect human health is one of the major focus areas of modern medical research. In human genetics, SNPs are single nucleotide substitutions found in more than 1% of a population. Several algorithms were implemented to determine the effect of SNPs, mostly specialized to the analysis of human genotyping data (see **Table 1** and **Figure 2**). One limitation of most of these algorithms is that they are binary classifiers – deleterious or neutral, disease-causing or neutral, and tolerant or intolerant. This means that the genetic changes will either be predicted to have no effect or to cause some measurable, negative impact on the phenotype. This may not be an issue in the context of human diseases as SNP data are primarily used in diagnostics. However, fine tuning engineered microbial strains requires more than a black and white approach for predicting variant effects on protein function. This is because many genetic variants can yield proteins with either increased or decreased activity, requiring methods that are also able to predict potential gains or modifications of function. In particular, when mutagenesis and selection or ALE methods are applied, one commonly sees gain of function mutations of specific genes that are crucial for the adaptation to, for example, new carbon sources (Conrad et al., 2011).

Of the existing algorithms (**Table 1**), *SIFT* (**S**orting **I**ntolerant **F**rom **T**olerant) (Ng and Henikoff, 2001) is often used as a gold standard to compare the performance of new algorithms or as a foundation for novel prediction strategies. SIFT and related approaches are based on the notion that evolutionary conservation can be used to predict the functional importance of each amino-acid in a protein and the impact of specific amino-acid substitutions. These methods typically use multiple sequence alignments of related proteins to determine a probabilistic description of what amino-acid substitutions are allowed in specific sites within the target protein. These descriptions can be used to determine the probability that non-synonymous coding SNPs observed in a resequencing data set will be tolerated by the protein; substitutions with a probability score smaller than a threshold are assumed to be deleterious (Kumar et al., 2009).
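The idea of scoring a substitution by the conservation of its alignment column can be illustrated with a deliberately simplified sketch. This is not the SIFT algorithm: SIFT selects and weights homologous sequences and uses more elaborate pseudocounts, whereas the code below uses raw column frequencies with a flat pseudocount. The normalization by the most frequent residue and the 0.05 cutoff mirror the general scheme; the pseudocount value, the function names, and the toy alignment are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(column, pseudocount=0.1):
    """Estimate residue probabilities for one alignment column.

    Gaps and unknown characters are ignored; a flat pseudocount keeps
    unobserved residues from getting a probability of exactly zero.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def classify_substitution(alignment, position, mutant_aa, threshold=0.05):
    """Return (score, label) for substituting `mutant_aa` at `position`.

    The score is the mutant residue probability divided by that of the
    most frequent residue in the column, so it lies in (0, 1].
    """
    column = [sequence[position] for sequence in alignment]
    probabilities = column_probabilities(column)
    score = probabilities[mutant_aa] / max(probabilities.values())
    label = "deleterious" if score < threshold else "tolerated"
    return score, label


# Toy alignment of five hypothetical homologs (position 2 fully conserved).
toy_alignment = ["MKGL", "MRGL", "MKGI", "MRGV", "MKGL"]
conserved_site = classify_substitution(toy_alignment, 2, "W")
variable_site = classify_substitution(toy_alignment, 1, "R")
```

In the toy alignment, a tryptophan at the fully conserved glycine position is flagged as deleterious, while an arginine at the variable lysine/arginine position is tolerated.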

SIFT provides only a binary deleterious/non-deleterious classification, and other methods have been developed to allow predicting cases where SNPs improve protein function. The *Polyphen* (Ramensky, 2002) and *PolyPhen2* (Adzhubei et al., 2010) approaches provide the means to discriminate three states when analyzing the effect of a SNP: benign, neutral, or deleterious. *Polyphen* uses a list of predetermined rules that combine the output of multiple algorithms using combinations of structural and sequence-based measures of mutation impact. *PolyPhen2* uses a machine-learning approach (a naive Bayes model) to predict an overall score for the variant effect, and the classification into three categories is based on thresholds. Although the algorithm is trained with human datasets, similar methods could potentially be used to build predictive models for variant effects in microorganisms. The overall variant effect score could also be exploited in more advanced methods that combine scores from different variants affecting different proteins to make phenotypic predictions.

Most studies on genetic variation focus on SNPs and disregard indels, which are also commonly observed when related microbial strains are compared to each other. The *PROVEAN* (Choi et al., 2012) and *Mutation taster 2* (Schwarz et al., 2014) approaches are capable of analyzing both SNPs and indels. *PROVEAN* uses substitution matrix scores (i.e., BLOSUM62) with gap and extension penalties to compute a variation score between the wild-type and mutant. The more recent *Mutation taster 2* computes several features (structural and evolutionary properties) for the mutated sequence and evaluates them with a Bayes classifier.

One possible approach for improving our ability to predict variant effects on protein function would be to predict effects of amino-acid changes on protein stability and folding (Khan and Vihinen, 2010). There are a number of tools available for these tasks (Khan and Vihinen, 2010), and stability predictions could be used to predict variant effects on protein function, as strongly destabilizing mutations would result in complete loss of function for the protein. Methods for predicting variant effects on protein stability have only been found to be moderately accurate in independent evaluation studies (Khan and Vihinen, 2010). For this reason, stability predictors should be combined with other variant effect prediction approaches to improve their predictive power for general variant effect analysis. The application of these types of stability prediction methods will be discussed in Section 3.2 in more detail together with the applications of metabolic modeling.

The majority of algorithms (53%) for variant effect prediction listed in **Table 1** rely on machine-learning approaches [e.g., AUTO-MUTE (Masso and Vaisman, 2010), FunSAV (Wang et al., 2012), or HANSA (Acharya and Nagarajaram, 2011)], which is a practical strategy given the huge amount of data available for human diseases. Regarding the selection of features, most methods use evolutionary conservation information (92%) and more than half rely on structural properties (69%). The selection of sufficient features is a challenge in itself; no matter what approach is used, it is necessary to define which properties and attributes of proteins are capable of discriminating the phenotypes of interest. The improvements in the prediction capabilities provided by sequence-, evolution-, or structure-based features have been previously studied, and these studies have shown that the inclusion of structural properties leads to significant improvements in predictive power (Saunders and Baker, 2002). This has been recently confirmed by a benchmark performance test that includes several of the existing algorithms (Thusberg et al., 2011). Another effort to benchmark and improve different approaches is the Critical Assessment of Genome Interpretation (CAGI) community, which organizes a benchmark competition on predicting the effect of genetic variants on known disease phenotypes.

#### **Table 1 | A summary of the available software tools for predicting the effect of the genetic variants**.



While the majority of algorithms aim to predict variant effects on individual proteins, a different objective is followed by the SNP-IN method, which predicts how protein–protein interactions (PPIs) are affected by a SNP (Zhao et al., 2014). This is achieved by a set of features that includes the relative free energy change between wildtype and mutant PPI, the energy of all interactions in a protein complex, and other physicochemical properties, e.g., hydrophobic solvation or water bridges. Using these features, supervised and semi-supervised machine-learning approaches are used to predict how deleterious SNPs are. This approach is very interesting, as changes in PPIs could be used to explain epistatic interactions between multiple variants. Like some previously mentioned prediction algorithms, SNP-IN requires an existing 3D model of the protein structure and, in addition, knowledge of the PPIs a given protein is involved in.

At a larger scale, genome-wide association studies are used to identify genetic differences between hundreds of thousands of individuals and to link genotypes to phenotypic consequences. These approaches work as black boxes and make use of statistical and machine-learning methods that require huge datasets. The current work and applications (e.g., clinical risk assessment) have been recently reviewed (Okser et al., 2014).

#### **2.3. IN VIVO: DEEP MUTATIONAL SCANNING AND TN-SEQ**

Next-generation sequencing has enabled studying the effects of genetic variation on individual proteins or regulatory elements *in vivo* and *in vitro*. Deep mutational scanning (DMS) is an effective high-throughput method to measure the effects of mutations on protein stability and function (Fowler and Fields, 2014). The space of all possible amino-acid substitutions in a protein is exhaustively screened by first constructing a library of sequence variants using standard techniques like error-prone PCR, then by using a high-throughput assay to select variants based on a fitness measure (e.g., growth rate, ligand binding, or product fluorescence), and finally by applying deep sequencing to the selected and unselected sequence variant pools. This approach results in a matrix that contains fitness values for each amino-acid substitution discovered in the selected pool. Depending on the method used for creating sequence diversity and sequencing depth, DMS can also be used to measure epistatic effects between substitutions at different sites.
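As a rough illustration of how fitness values might be derived from the two sequenced pools, the sketch below computes a log2 enrichment ratio for each variant relative to wild type. The scoring scheme, the pseudocount, the function name `enrichment_scores`, and all read counts are illustrative assumptions and are not taken from Fowler and Fields (2014); published DMS pipelines differ in their normalization and error models.

```python
import math

def enrichment_scores(unselected, selected, wild_type="WT", pseudocount=0.5):
    """Return the log2 enrichment of each variant relative to wild type.

    unselected, selected: dicts mapping a variant label to its read count.
    The pseudocount (an assumption of this sketch) avoids division by zero
    for variants that drop out of the selected pool.
    """
    def ratio(label):
        return (selected.get(label, 0) + pseudocount) / (unselected[label] + pseudocount)

    wt_ratio = ratio(wild_type)
    return {
        label: math.log2(ratio(label) / wt_ratio)
        for label in unselected
        if label != wild_type
    }

# Hypothetical read counts for three substitutions and the wild-type sequence.
unselected = {"WT": 1000, "A12G": 200, "L45P": 200, "K7R": 100}
selected = {"WT": 2000, "A12G": 400, "L45P": 25, "K7R": 400}

scores = enrichment_scores(unselected, selected)
for variant, score in sorted(scores.items()):
    print(f"{variant}: {score:+.2f}")
```

Under this scheme, a score near zero suggests wild-type-like fitness, a negative score suggests depletion under selection, and a positive score suggests enrichment.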

The applicability of DMS is primarily limited by the lack of high-throughput functional assays for most proteins and, so far, DMS has not been applied to metabolic enzymes. When DMS can be applied at a broader scale, the results obtained from the assay could increase the predictive power of bioinformatic tools for genetic variation analysis by providing more complete training datasets for the types of predictive methods discussed in the previous section. Methods similar to DMS can also be used to systematically study effects of genetic variation in regulatory regions on protein expression using fluorescence protein-based assays.

Here, we will highlight a few case studies using DMS and related methods to study protein or regulatory element function. In the analysis of *Saccharomyces cerevisiae* poly(A)-binding protein (Melamed et al., 2013), strong epistatic effects between substitutions at specific sites were discovered. Although epistasis was not widespread, this is worrying from a computational modeling perspective, as modeling approaches usually do not account for epistasis. Another important highlight is the identification of alternative start codons. Although these had been analyzed in previous studies, DMS has shown that some amino-acids can be replaced by methionine and still yield functional proteins (Kim et al., 2013). This biological information can be extrapolated to other studies and is highly relevant when developing strategies to understand the effect of mutations, either *in vivo* or *in silico*. Strategies similar to DMS have also been used to systematically study the effects of variation in transcription factor binding sites and other regulatory elements such as ribosomal binding sites (Kosuri et al., 2013). These studies will build the foundation for predicting effects of non-coding sequence variants on protein expression.

The methods described above allow us to systematically study the effects of a large number of variants in individual proteins or regulatory regions. In microorganisms, it is also possible to use a next-generation sequencing-based method called Tn-seq to systematically study the effect of disruption of a large number of genomic loci on cellular phenotypes (van Opijnen and Camilli, 2013). Transposons are mobile DNA elements that can disrupt a genetic locus by integrating themselves into it (**Figure 1B**). Tn-seq, using high density transposon insertion libraries, can be used to interrogate the function of, for example, regulatory elements and specific protein domains in a single genome-wide assay (van Opijnen and Camilli, 2013). Tn-seq has found many applications in microbiology, and it has been used for the identification of gene function, understanding genome organization, mapping genetic interactions, or assessing gene essentiality (van Opijnen and Camilli, 2013; Yang et al., 2014). Tn-seq does not offer resolution at the single base-pair level, but the method can be rapidly used to generate sub-gene-level information relating, for example, to the essentiality of specific domains in a protein. This information in turn could be used to improve variant effect predictions, as variants in essential domains of a protein would be more likely to be predicted to be deleterious than variants in non-essential domains of the same protein.

#### **3. PREDICTING PHENOTYPES FROM GENOTYPES AT THE GENOME-SCALE**

#### **3.1. STATISTICAL AND NETWORK-ORIENTED APPROACHES FOR PREDICTING PHENOTYPES FROM GENOTYPES**

Section 2 focused on the task of predicting the effects of genetic variation on individual protein function or expression. However, this is only a small part of a much larger problem, which is that of predicting the cellular or organismal phenotypic effects of all the genetic variants present in a genome. This requires combining the effects of variation on the function and expression of all proteins. So far, there have been surprisingly few efforts to take all genetic variants discovered in an individual (either a human or a microbial strain) and attempt to predict how certain phenotypes would be affected by all these variants together (Burga and Lehner, 2013; Lehner, 2013).

One of the first systematic attempts toward this goal was the pioneering study by Jelier et al. in *S. cerevisiae*, where growth phenotypes of selected yeast strains under different conditions were predicted from genetic differences between a reference strain and the strain of interest (Jelier et al., 2011). This was achieved by first predicting effects of coding and regulatory variants on protein function and expression using approaches similar to the ones outlined in the previous section. These variant effect predictions were then combined into a single phenotypic prediction for the strain, using published single gene deletion growth phenotyping data for a yeast reference strain under the same condition. This approach can be considered highly simplistic, as the effects of multiple genetic variants acting on separate proteins were treated as cumulative. Despite this, the approach still allowed accurate prediction of growth phenotypes across a broad range of conditions. There have also been a number of other approaches for predicting broader phenotypic consequences of single variants by mapping the variant data onto biological networks such as PPI or genetic networks (Carter et al., 2013). However, these approaches have typically not attempted to use the whole genotype of an individual (i.e., more than one variant at a time) to predict specific phenotypes.

#### **3.2. USING GENOME-SCALE METABOLIC MODELS FOR INTERPRETING GENETIC VARIANTS**

The phenotype prediction methods described above are data-driven and use statistical models to predict the effects of genetic variants in the context of biological networks. However, for metabolic networks we can go beyond statistical models and graph-based descriptions to constraint-based models that are scalable to the genome-level and incorporate physicochemical, flux capacity, and reaction directionality constraints [see Price et al. (2004) for a review of constraint-based modeling]. This type of mechanistic modeling approach is very useful for understanding genetic changes that affect specific metabolic phenotypes. For example, the study of SNPs that affect mitochondrial metabolism (Jamshidi and Palsson, 2006) shows how variant data can be mapped onto metabolic networks in order to explain the mechanistic basis of disease phenotypes.

Genome-scale metabolic models are composed of biochemical reactions, collected from the literature and the genome annotation of an organism. This system of reactions is encoded as a matrix of stoichiometric coefficients that is usually referred to as the stoichiometry matrix<sup>1</sup>. Assuming metabolism is in a steady-state, i.e., metabolite concentrations do not change over time, all fluxes have to balance each other. These flux-balances constitute linear constraints that can easily be analyzed using methods from linear algebra.
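The steady-state condition can be illustrated with a minimal sketch on a toy network. The three-reaction pathway, the metabolite and reaction names, and the helper functions below are hypothetical and chosen only for illustration. Rows of the matrix correspond to metabolites and columns to reactions.

```python
# Toy network (hypothetical): uptake -> A, A -> B, B -> secretion.
METABOLITES = ["A", "B"]
REACTIONS = ["uptake_A", "A_to_B", "secrete_B"]

# Rows are metabolites, columns are reactions; negative entries denote
# consumption and positive entries denote production.
S = [
    [1, -1, 0],   # A
    [0, 1, -1],   # B
]

def net_production(S, v):
    """Compute S . v, the net rate of change of each metabolite."""
    return [sum(coeff * flux for coeff, flux in zip(row, v)) for row in S]

def is_steady_state(S, v, tol=1e-9):
    """True if every metabolite is produced as fast as it is consumed."""
    return all(abs(rate) <= tol for rate in net_production(S, v))

balanced = [2.0, 2.0, 2.0]
unbalanced = [2.0, 1.0, 1.0]   # A would accumulate at rate 1

print(is_steady_state(S, balanced))
print(net_production(S, unbalanced))
```

Any flux vector for which `net_production` returns only zeros satisfies the flux-balance constraints; in a genome-scale model the same check involves thousands of reactions.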

Furthermore, after inclusion of further constraints, e.g., known uptake and secretion rates and knowledge about reaction directionality, linear optimization methods can compute biologically relevant flux vectors that maximize defined objective functions. For example, growth can be simulated by maximizing the consumption of biomass precursors in empirically determined proportions. This type of analysis is usually referred to as flux balance analysis [FBA; see Orth et al. (2010) for a comprehensive introduction to this method].
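In its commonly used form, this optimization can be written as the linear program below. The notation is an assumption introduced here for illustration: $\mathbf{S}$ is the stoichiometry matrix, $\mathbf{v}$ the vector of reaction fluxes, $\mathbf{c}$ a weight vector selecting the objective (e.g., the biomass reaction), and $\mathbf{v}^{\mathrm{lb}}$, $\mathbf{v}^{\mathrm{ub}}$ the lower and upper flux bounds that encode uptake rates, secretion rates, and reaction directionality.

$$
\begin{aligned}
\max_{\mathbf{v}} \quad & \mathbf{c}^{\mathsf{T}} \mathbf{v} \\
\text{subject to} \quad & \mathbf{S}\,\mathbf{v} = \mathbf{0}, \\
& \mathbf{v}^{\mathrm{lb}} \le \mathbf{v} \le \mathbf{v}^{\mathrm{ub}}
\end{aligned}
$$

An irreversible reaction $j$ corresponds to $v^{\mathrm{lb}}_j = 0$, and a measured uptake rate can be imposed by fixing the bounds of the corresponding exchange reaction.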

Global optimal solutions to these linear optimization problems can be calculated very efficiently using linear programming (computation times are in the millisecond to second range for genome-scale models). Thus, one can compute thousands of phenotypes in a few minutes, simply by changing the constraints of the problem [see Lewis et al. (2012) for a comprehensive list of available *in silico* methods and Bordbar et al. (2014) for a review of their applications].

Since the relationship between reactions, enzymes, and genes (usually referred to as GPR associations) is usually known and encoded in these models, the effect of a gene knockout can readily be mapped to the associated reactions by constraining their fluxes to be zero or by removal from the model. This way FBA can be used to compute the metabolic phenotype associated with a metabolic gene deletion, making it suitable for the analysis of genetic variation data that involves deletions or other mutations that lead to the complete loss of function of enzymes.
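The mapping from a gene deletion to constrained reactions can be sketched as follows. The GPR encoding (a list of alternative enzyme complexes per reaction, each a set of required genes), the gene and reaction identifiers, and the function names are assumptions made for this example; modeling toolboxes typically store GPRs as Boolean expressions instead.

```python
# Hypothetical GPR associations in disjunctive form: a reaction is catalyzed
# by any one of several isozymes/complexes (OR), and each complex needs all
# of its subunit genes (AND).
GPR = {
    "R1": [{"geneA"}],               # single enzyme
    "R2": [{"geneB"}, {"geneC"}],    # two isozymes
    "R3": [{"geneD", "geneE"}],      # two-subunit complex
}

def blocked_reactions(gpr, deleted_genes):
    """Return reactions that lose every catalyzing enzyme after the deletion."""
    deleted = set(deleted_genes)
    return sorted(
        rxn for rxn, complexes in gpr.items()
        # Reactions without any GPR (empty list) are never blocked.
        if complexes and all(genes & deleted for genes in complexes)
    )

def apply_knockout(bounds, gpr, deleted_genes):
    """Copy the flux bounds and force blocked reactions to carry zero flux."""
    new_bounds = dict(bounds)
    for rxn in blocked_reactions(gpr, deleted_genes):
        new_bounds[rxn] = (0.0, 0.0)
    return new_bounds

bounds = {"R1": (0.0, 10.0), "R2": (0.0, 10.0), "R3": (-10.0, 10.0)}
print(blocked_reactions(GPR, ["geneB"]))           # isozyme geneC remains
print(blocked_reactions(GPR, ["geneB", "geneC"]))
print(apply_knockout(bounds, GPR, ["geneE"]))
```

The modified bounds would then be passed to an FBA solver to compute the phenotype of the deletion strain.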

<sup>1</sup>The rows and columns of the stoichiometry matrix correspond to metabolites and reactions, respectively; negative (positive) coefficients represent consumption (production) of substrates (products).

Flux balance analysis assumes that knockout strains can recover to an optimal growth phenotype, which might be unrealistic in cases where regulatory mechanisms (not modeled explicitly in these models) are unable to accommodate the desired state. Other methodologies [e.g., ROOM (Shlomi et al., 2005), MoMA (Segrè et al., 2002), MiMBl (Brochado et al., 2012), and RELATCH (Kim and Reed, 2012)] employ more plausible assumptions and have been shown to improve the accuracy of knockout predictions. For example, MoMA minimizes the Euclidean distance between the wild-type and mutant flux distributions, assuming that a mutant reaches the closest feasible flux distribution, which is not necessarily optimal. The predictive power of FBA and these other approaches has been extensively assessed using genome-wide gene knockout assays (Snitkin et al., 2008) and transposon insertion libraries (Yang et al., 2014), and these assessments have generally shown a high degree of accuracy (Monk and Palsson, 2014).
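The MoMA idea can be stated compactly as a quadratic program. The symbols are assumptions introduced for illustration: $\mathbf{S}$ is the stoichiometry matrix, $\mathbf{v}$ the mutant flux vector, $\mathbf{v}^{\mathrm{wt}}$ a reference wild-type flux distribution (for example, one obtained by FBA), and $\mathbf{v}^{\mathrm{lb}}$, $\mathbf{v}^{\mathrm{ub}}$ the flux bounds.

$$
\begin{aligned}
\min_{\mathbf{v}} \quad & \lVert \mathbf{v} - \mathbf{v}^{\mathrm{wt}} \rVert_2^2 \\
\text{subject to} \quad & \mathbf{S}\,\mathbf{v} = \mathbf{0}, \\
& \mathbf{v}^{\mathrm{lb}} \le \mathbf{v} \le \mathbf{v}^{\mathrm{ub}}, \\
& v_j = 0 \quad \text{for each reaction } j \text{ disabled by the deletion}
\end{aligned}
$$

Minimizing the squared distance yields the same optimal flux vector as minimizing the distance itself, while keeping the objective quadratic and the constraints linear.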

Constraint-based models have also been applied to predict epistatic interactions by simulating effects of pairwise gene deletions, but with a significantly reduced accuracy in comparison to single deletions (Szappanos et al., 2011). Furthermore, simulations of multiple gene deletions have been successfully applied in developing design strategies for metabolic engineering by redirecting flux to desired products (Milne et al., 2009; Blazeck and Alper, 2010).

A number of limiting factors can diminish the ability of constraint-based models to predict phenotypic effects of loss of function mutations: (i) missing reactions and erroneous GPRs, (ii) erroneous flux constraints due to the lack of thermodynamic or regulatory information, and (iii) the assumption of a fixed biomass composition that is known to change across growth conditions. Even with these limitations, constraint-based models still outperform statistical models in predicting consequences of gene deletions (Szappanos et al., 2011).

Since constraint-based models have demonstrated good ability to predict phenotypic outcomes of single and multiple gene deletions, these models should also be useful for predicting effects of other genetic variants. An SNV or indel that is predicted to reduce the maximal flux rate of an enzyme can be used to constrain the upper bound of a flux. FBA and similar methods can be used to compute the effects of these variations on the phenotype, providing a system-wide overview of the effects caused by the substitution (Jamshidi et al., 2007). This is a fast and effective way of predicting phenotypes, but it requires that one can estimate the effect the variant has on the maximum flux rate. Nevertheless, cases of complete loss of function fall into the same category as gene knockouts, and combining the bioinformatic prediction tools discussed in Section 2.2 with modeling capabilities can be used to integrate variant data. This approach can also be extended to any number of variants and genes, with the caveat that epistatic interactions are currently not captured accurately by the models.
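A minimal sketch of this idea is shown below, assuming that a variant effect predictor supplies, for each affected reaction, the fraction of wild-type maximal activity that is retained. The function name `constrain_by_variants`, the reaction identifiers, the bounds, and the predicted fractions are hypothetical, and upper bounds are assumed to be non-negative.

```python
def constrain_by_variants(bounds, variant_effects):
    """Scale the upper flux bound of each affected reaction.

    bounds: dict mapping reaction -> (lower, upper), with upper >= 0.
    variant_effects: dict mapping reaction -> predicted fraction of
        wild-type maximal activity retained (1.0 = no effect,
        0.0 = complete loss of function, equivalent to a knockout).
    """
    new_bounds = dict(bounds)
    for rxn, fraction in variant_effects.items():
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction for {rxn} must lie in [0, 1]")
        lower, upper = bounds[rxn]
        new_upper = upper * fraction
        # Keep the bounds consistent if the lower bound exceeds the new upper bound.
        new_bounds[rxn] = (min(lower, new_upper), new_upper)
    return new_bounds

# Hypothetical bounds and predicted variant effects.
bounds = {"PGI": (0.0, 10.0), "PFK": (0.0, 8.0), "ENO": (0.0, 12.0)}
variant_effects = {"PGI": 0.25, "PFK": 0.0}

constrained = constrain_by_variants(bounds, variant_effects)
print(constrained)
```

The constrained bounds would then be used in FBA or a related method. The quality of the resulting phenotype prediction depends directly on how well the retained-activity fraction can be estimated.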

There is currently only a limited number of studies that use GSMs to systematically explore the effects of genetic variants on phenotypes. Chang et al. (2013) conducted a study where GSMs coupled with protein structures of metabolic enzymes (GEM-PRO<sup>2</sup>) were used to interpret genetic variant data of *Escherichia coli* strains evolved to tolerate high temperatures. In this study, a GSM of *E. coli* was constrained using experimentally or bioinformatically determined thermostabilities of metabolic enzymes. Since the maximum flux capacity of a reaction is proportional to the concentration of active enzyme, temperature changes can be modeled by varying the flux constraints accordingly. This enables the prediction of enzymatic steps that are disproportionately temperature sensitive. For the evolved strains, flux balance analysis was used to explore the adaptation of the mutated enzymes; constraints associated with mutated proteins were relaxed to explain the experimentally measured growth rates (Chang et al., 2013). The study did not include separate predictions of variant effects on protein function, but rather treated all variants observed in a protein as potentially affecting its activity.
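To make the link between thermostability and flux capacity concrete, the sketch below scales a reference upper bound by the fraction of folded enzyme estimated from a simple two-state unfolding model. This is not the procedure of Chang et al. (2013); the two-state model, the neglect of heat-capacity effects and of the intrinsic temperature dependence of catalysis, the function names, and all parameter values are assumptions made for illustration.

```python
import math

GAS_CONSTANT = 8.314e-3  # kJ / (mol K)

def fraction_folded(temp_k, melting_temp_k, unfolding_enthalpy_kj):
    """Two-state estimate of the folded (active) enzyme fraction.

    Assumes dG_unfold(T) = dH * (1 - T / Tm), i.e., heat-capacity effects
    are ignored. This simplification is an assumption of the sketch.
    """
    delta_g = unfolding_enthalpy_kj * (1.0 - temp_k / melting_temp_k)
    return 1.0 / (1.0 + math.exp(-delta_g / (GAS_CONSTANT * temp_k)))

def temperature_adjusted_upper_bound(vmax_ref, temp_k, ref_temp_k,
                                     melting_temp_k, unfolding_enthalpy_kj):
    """Scale a reference flux capacity by the change in active fraction."""
    active = fraction_folded(temp_k, melting_temp_k, unfolding_enthalpy_kj)
    active_ref = fraction_folded(ref_temp_k, melting_temp_k, unfolding_enthalpy_kj)
    return vmax_ref * active / active_ref

# Hypothetical enzyme: Tm = 320 K, unfolding enthalpy = 400 kJ/mol.
for temp in (310.0, 315.0, 320.0, 325.0):
    bound = temperature_adjusted_upper_bound(10.0, temp, 310.0, 320.0, 400.0)
    print(f"{temp:.0f} K -> upper bound {bound:.2f}")
```

Enzymes whose adjusted bounds fall sharply within the temperature range of interest would be candidates for the disproportionately temperature-sensitive steps mentioned above.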

A more recent study by Nam et al. (2014) describes the use of GSMs for understanding the metabolic effects of cancer mutations. In particular, Nam et al. use genetic mutation information, gene expression profile data, and a human GSM (Thiele et al., 2013) to construct context-specific models for different cancer types. Loss and gain of function were systematically analyzed. Loss of function was modeled as described above (i.e., constraining affected reactions' fluxes to 0). Gain of a function, on the other hand, was modeled by adding novel promiscuous activities as predicted by chemoinformatic approaches. This approach allowed the prediction of potential oncometabolites.

#### **3.3. KINETIC MODELING OF GENETIC VARIANTS**

As mentioned in the previous section, constraint-based modeling does not provide any information about the dynamic behavior of a metabolic system. A full kinetic description of a biochemical reaction network can be formulated using ordinary differential equations (Heinrich and Schuster, 1996). The major advantage of using kinetic models to study effects of genetic variation lies in their ability to account for mutations affecting catalytic or regulatory sites of an enzyme, causing either a gain or loss of catalytic activity, or binding sites of allosteric regulators.
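As a toy illustration of how a variant that alters a kinetic parameter changes dynamic behavior, the sketch below integrates a single irreversible Michaelis-Menten reaction with forward Euler and compares a wild-type enzyme with a hypothetical variant that has a tenfold higher $K_m$. The rate law, the integration scheme, and all parameter values are assumptions chosen for illustration, not taken from the cited studies.

```python
def simulate(vmax, km, s0=10.0, t_end=10.0, dt=0.001):
    """Integrate dS/dt = -vmax * S / (km + S) with forward Euler.

    Returns the final substrate and product concentrations (S, P).
    """
    s, p = s0, 0.0
    for _ in range(int(round(t_end / dt))):
        rate = vmax * s / (km + s)
        s -= rate * dt
        p += rate * dt
    return s, p

wild_type = simulate(vmax=1.0, km=0.5)
mutant = simulate(vmax=1.0, km=5.0)   # hypothetical variant with weaker substrate binding

print(f"wild type: S = {wild_type[0]:.2f}, P = {wild_type[1]:.2f}")
print(f"mutant:    S = {mutant[0]:.2f}, P = {mutant[1]:.2f}")
```

The same pattern extends to networks of coupled equations, where a mutation at an allosteric site would be represented by changing an inhibition or activation constant rather than $K_m$ or $V_{max}$.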

Previous studies of red blood cell metabolism provide an overview on how SNPs can alter kinetic parameters and how kinetic models can be used to explain metabolic syndromes caused by enzyme deficiencies (Jamshidi, 2002; Jamshidi and Palsson, 2009). A disadvantage of using kinetic models is that kinetic parameters are not available for most enzymes and measuring the parameters can be challenging. For this reason, building predictive genome-scale kinetic models remains a challenge (Stanford et al., 2013). Kinetic models are a viable tool for interpreting genetic variant data only in specific cases like, for example, the red blood cell that harbors a relatively simple metabolism.

#### **4. CONSIDERATIONS AND FUTURE DIRECTIONS**

#### **4.1. METHODS AND TOOLS TO PREDICT THE EFFECT OF GENETIC VARIANTS**

Many approaches have been explored in the past decade to understand and analyze the effects of genetic variation. In particular, the most active field has been the application of NGS techniques to characterize genetic variation in the context of human disease. The amount of disease-related information makes machine-learning approaches very suitable for the purpose of predicting effects of single genetic variants. Since most prediction methods have been trained and tested with human data, many of the existing methods do not perform as well or are simply not suited for the analysis of microbial genetic variants.

<sup>2</sup>Genome-scale metabolic models are sometimes also referred to as GEMs.

The other area where the study of microbial genetic variation lags behind human genetics is the systematic collection of variant and phenotyping data. Efforts to collect human genotype and phenotype data in a standardized way are currently underway with databases such as dbSNP and the European Variation Archive. The UniProt database also collects variants found in protein sequences when this information is available. Every day thousands of new environmental or pathogenic isolates and laboratory-developed microbial strains are sequenced around the world, but there is no centralized repository for this data in common use. We argue that it is of utmost importance to collect genetic variant data together with associated phenotypic data in a standard way for microbes as well.

All the existing algorithms for variant effect prediction are used to classify variants into preassigned categories (for example, deleterious or non-deleterious). Variants predicted to be deleterious can already be handled as knockouts when modeling their phenotypic effects using GSMs, but more subtle effects of mutations are missed by this approach. In order to improve our ability to predict phenotypes, there is a need to move beyond classification toward quantitative measures of variant effects on individual protein function. There are numerous features related to protein function that may be relevant for predicting variant effects: evolutionary conservation, physicochemical properties (e.g., charge, polarity, or free energy), and structural properties (e.g., secondary structures, spatial distances between amino-acids, or B-factors).

Existing methods for predicting variant effects have been primarily focused on generic predictors for all proteins irrespective of their function (e.g., enzymes, transcription factors, transporters, chaperones, etc.) and of how they behave in their environment (i.e., interaction with other elements: proteins, metabolites, DNA, etc.). This limits the predictive power of the methods in cases where additional information is readily available, such as the relatively well studied field of microbial metabolism. For example, for metabolic enzymes, information on how kinetic parameters are affected by mutations and how these parameters vary between enzymes from different species is systematically collected in databases such as BRENDA. This type of information could be used to build improved variant effect predictors specifically for metabolic enzymes.

#### **4.2. MODELING AND HIGH-THROUGHPUT DATA ANALYSIS**

Improvements in genome-wide variant effect prediction can also come from improving or extending genome-scale modeling approaches. Recent innovations like GEM-PRO, as discussed in Section 3.2, fulfill the requirement of 3D protein structures to predict the effects of genetic variation at the protein level and could be used to systematically analyze the effect of genetic variation on a genome-scale for metabolism.

Approximately 10–30% of the genes encoded in a microbial genome are represented in metabolic GSMs, limiting the utility of these models for interpreting genomic variant data. Metabolic GSMs can be extended in a number of ways to increase coverage of the overall set of genes. The transcriptional regulatory network, represented as interactions between transcription factors and target genes, can help extend the coverage of predictive models and can be integrated with metabolic GSMs in a number of ways (Covert et al., 2004; Chandrasekaran and Price, 2010). These integrated models have been successfully used to make phenotypic predictions.

Another recent extension of GSMs is ME-Models<sup>3</sup> . These models account for the entire machinery needed for gene and protein expression, providing a higher coverage of cellular functions and a higher resolution of cellular composition (O'Brien et al., 2013). ME-models have also been extended further to incorporate protein translocation from the cytoplasm to the periplasm (Liu et al., 2014). Currently, most of these extensions of GSMs have only been developed for *E. coli* and significant efforts will be required to build these extended models for other bacteria as well as eukaryotic model organisms such as *S. cerevisiae*.

The development of accurate kinetic models of metabolism, which could be useful for investigating the effects of mutations on allosteric regulation and catalytic activity, is still a tedious process. These models are usually limited to small parts of metabolism, focusing on central carbon metabolism (Chassagnole et al., 2002; Peskov et al., 2012; Machado et al., 2014). There are two main reasons for these limitations: the models become huge in size and kinetic information for many enzymes is still unknown. Protocols (Stanford et al., 2013) and methodologies (Chowdhury et al., 2014) are being developed to bring kinetic modeling to the genome-scale, but the resulting models have not yet reached a sufficiently mature stage for use in variant effect prediction.

At a more comprehensive level, a strategy for building whole-cell models by combining multiple individual models of different cellular processes, including cell cycle, metabolism, transcription, and transport, has been proposed (Karr et al., 2012). This strategy, which also allows combining models using different representations (constraint-based, kinetic, and stochastic), was used to build a functioning whole-cell model of one of the simplest prokaryotes, *Mycoplasma genitalium*. Efforts toward building more complete genome-scale models of microbes will continue as more and more information is collected and computing power increases. These models will bring us closer to the goal of genome-wide prediction of phenotypes from genotyping data.

#### **4.3. OPPORTUNITIES**

Genetic engineering tools, such as MAGE (Wang et al., 2009) or CRISPR/Cas9 (Xu et al., 2014), already allow us to quickly edit genomes in a precise and accurate fashion at the single base-pair resolution level at multiple loci simultaneously. These methods will allow us to map epistatic interactions of variants within a single gene and between multiple genes more comprehensively than before. On the other hand, new *in silico* tools for predicting variant effects on phenotypes outlined above open the way to a new style of modeling at the scale of single nucleotides. These new modeling tools will greatly benefit from better training datasets that can be obtained using MAGE, CRISPR/Cas9 or other genome editing methods systematically to map epistatic interactions. The application of these novel strategies provides a way to fine tune activities of proteins in the context of complete cellular networks. For example, we envision that in the future we will have predictive models of how engineering of multiple enzymes at the single amino-acid level would affect the production of a desired metabolite.

<sup>3</sup>Metabolism and Expression models.

To achieve the maximum potential of genome-scale biochemical network modeling and genetic variant analysis, a link must be created between these two fields. The necessary information to connect both worlds is already there: we know the genes, the proteins, and the reactions. The major limitations are in the current methods and data sources. On the one hand, we must overcome the limitations of the tools available to predict variant effects by allowing more fine-grained predictions of how a variant may affect any given protein function or expression. The usage of protein folding predictions, for example, has already been established in metabolic modeling (Chang et al., 2013), and it should be possible to use tools that predict variant effects on protein stability together with genome-scale models. On the other hand, we need to improve biochemical network modeling techniques: this is an evolving field, and in the past decade there have been efforts to standardize the construction of models (Thiele and Palsson, 2010) and to improve prediction methods by including high-throughput data (Machado and Herrgård, 2014).

Finally, it should be acknowledged that there will always be limitations in using solely genomic variant data as the basis for making phenotypic predictions for specific strains. We may also need to measure intermediate phenotypes such as transcript, protein, or metabolite levels for these strains in order to make predictions of how a given genotype affects a specific phenotype (Burga and Lehner, 2013). Fortunately, comprehensive multi-omic datasets are currently being collected for wild-type microbial strains, allowing refinement of modeling and bioinformatic approaches for phenotypic prediction (Ishii et al., 2007; Skelly et al., 2013). Hopefully, systematizing such datasets and a concerted action between modelers, geneticists, microbiologists, and bioinformaticians will allow us to achieve the prediction of changed and novel metabolic capabilities of a microbial strain from genomic re-sequencing data.

#### **ACKNOWLEDGMENTS**

JC, MH, and NS acknowledge support by the Novo Nordisk Foundation through the Novo Nordisk Foundation Center for Biosustainability. MA acknowledges funding from a Biotechnology-based Synthesis and Production Research grant from the Novo Nordisk Foundation.

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 25 November 2014; paper pending published: 25 December 2014; accepted: 22 January 2015; published online: 16 February 2015.*

*Citation: Cardoso JGR, Andersen MR, Herrgård MJ and Sonnenschein N (2015) Analysis of genetic variation and potential applications in genome-scale metabolic modeling. Front. Bioeng. Biotechnol. 3:13. doi: 10.3389/fbioe.2015.00013*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology.*

*Copyright © 2015 Cardoso, Andersen, Herrgård and Sonnenschein. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## RobOKoD: microbial strain design for (over)production of target compounds

#### Natalie J. Stanford<sup>1,2</sup>\*, Pierre Millard<sup>1,2,3,4,5</sup> and Neil Swainston<sup>1,2</sup>

<sup>1</sup> Manchester Institute of Biotechnology, University of Manchester, Manchester, UK, <sup>2</sup> School of Computer Science, University of Manchester, Manchester, UK, <sup>3</sup> INSA, UPS, INP, LISBP, Université de Toulouse, Toulouse, France, <sup>4</sup> INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, Toulouse, France, <sup>5</sup> Centre National de la Recherche Scientifique, UMR5504, Toulouse, France

Sustainable production of target compounds such as biofuels and high-value chemicals for pharmaceutical, agrochemical, and chemical industries is becoming an increasing priority given their current dependency upon diminishing petrochemical resources. Designing these strains is difficult, with current methods focusing primarily on knocking out genes, dismissing other vital steps of strain design including the overexpression and dampening of genes. The design predictions from current methods also do not translate well into successful strains in the laboratory. Here, we introduce RobOKoD (Robust, Overexpression, Knockout and Dampening), a method for predicting strain designs for overproduction of targets. The method uses flux variability analysis to profile each reaction within the system under differing production percentages of target-compound and biomass. Using these profiles, reactions are identified as potential knockout, overexpression, or dampening targets. The identified reactions are ranked according to their suitability, providing flexibility in strain design for users. The software was tested by designing a butanol-producing Escherichia coli strain, and was compared against the popular OptKnock and RobustKnock methods. RobOKoD shows favorable design predictions, when predictions from these methods are compared to a successful butanol-producing experimentally-validated strain. Overall RobOKoD provides users with rankings of predicted beneficial genetic interventions with which to support optimized strain design.

#### Edited by:

Markus J. Herrgard, Technical University of Denmark, Denmark

#### Reviewed by:

Monika Heiner, Brandenburg University of Technology Cottbus-Senftenberg, Germany

Osbaldo Resendis-Antonio, Instituto Nacional de Medicina Genomica, Mexico

#### \*Correspondence:

Natalie J. Stanford, School of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK natalie.stanford@manchester.ac.uk

#### Specialty section:

This article was submitted to Systems Biology, a section of the journal Frontiers in Cell and Developmental Biology

Received: 03 November 2014; Accepted: 25 February 2015; Published: 24 March 2015

#### Citation:

Stanford NJ, Millard P and Swainston N (2015) RobOKoD: microbial strain design for (over)production of target compounds. Front. Cell Dev. Biol. 3:17. doi: 10.3389/fcell.2015.00017

Keywords: synthetic biology, systems biology, metabolic engineering, strain design, constraint-based modeling

## Introduction

The sustainable production of target compounds such as biofuels and high-value chemicals for pharmaceutical, agrochemical, and chemical industries is becoming an increasing priority given their current dependency upon diminishing petrochemical resources. The challenge of producing such compounds from microbial cells straddles both systems and synthetic biology. The development of microbial cell factories first requires a comprehensive understanding of host cell metabolic functions through metabolic model construction, and subsequent in silico experimentation, using systems biology methods. This in silico experimentation can suggest host cell manipulations that can be applied in vitro using synthetic biology techniques, leading to increased production of the target compound (Koide et al., 2009).

Target-producing microbial strains are typically designed using combinations of gene manipulations. These manipulations include gene additions (often recombinant genes from other organisms) and removal of genes via knockouts. Furthermore, over-expression or inhibition of host genes can either increase or dampen metabolic flux through the reactions that their expressed proteins catalyze. Successful application of such strategies can be used to overproduce host-native targets (Ng et al., 2012; Li et al., 2014) or produce non-host-native targets (Atsumi et al., 2009; Angermayr et al., 2014; Yuan et al., 2014). Identifying successful gene manipulation combinations has traditionally relied on static network inspection, and experimental trial and error to test the strategies (Varman et al., 2011). This approach is not optimal as it limits the amount of network information that can be used, discounts metabolic complexity, and therefore prevents predictions of less intuitive metabolic modifications (Kitano, 2002).

Through modeling approaches, strain predictions can be improved by taking into account full metabolic complexity during the design phase. Designed strains can also be screened in silico before they are engineered and tested in the laboratory. The process involves iterative application of the following steps: (i) characterization of the host metabolic network; (ii) identification of gene additions to bridge native metabolism to the target; (iii) optimization of the modified metabolic network through gene addition, deletion, overexpression or dampening; (iv) trialing successful predictions in the laboratory. This process affords the potential to develop successful strains in a more cost-effective and time-efficient manner. This work focuses on step (iii), which involves elements of network characterization in order to identify suitable optimization strategies.

To characterize the metabolic network, genome-scale models (GEMs) can be used in conjunction with constraint-based techniques. GEMs are computer-analyzable, structured knowledge bases of genes, proteins, and metabolites present within a given organism (Thiele and Palsson, 2010). GEMs therefore encode the complexity of host cell metabolism and are available for an increasingly large number of organisms (Büchel et al., 2013). Constraint-based techniques, including flux balance analysis (FBA) and flux variability analysis (FVA), provide quantitative predictions of cellular behavior such as metabolic flux patterns and cellular growth rates. These are computed by applying constraints, which can be assigned from experimentally measured nutrient uptake rates (Orth et al., 2010) and intracellular fluxes (Sauer, 2006), or inferred through interpretation of gene expression data (Lee et al., 2012). These predictions provide insights into the metabolic pathways active under different growth conditions (Liao et al., 2011), gene essentiality (Joyce and Palsson, 2008; Dobson et al., 2010; Heavner et al., 2012), and as a result, the fitness optimality of a given strain (Harcombe et al., 2013). More detailed introductions to these techniques can be found in **Boxes 1** and **2**.

Optimization of microbial strains is complex, requiring a balance between target production and cell viability (Lo et al., 2013). This makes the problem a multi-objective optimization problem, whereby metabolic flux of cellular growth and target production must be considered simultaneously. Successful optimization strategies therefore include gene modifications (knockouts, overexpression, dampening) which re-route flux toward the target product whilst minimizing the effect on flux toward synthesis of metabolites required for cellular maintenance.

Amongst the more prominent methods used for identifying knockout targets are OptKnock (Burgard et al., 2003) and RobustKnock (Tepper and Shlomi, 2010). OptKnock aims to optimize the maximum flux toward the target product whilst retaining cell viability, using up to five reaction knockouts to generate the strain solution. The method does not take flux variability into consideration; therefore, whilst there may be a reasonable maximal flux yield toward the target product, it is possible that the minimal flux toward the target product could be zero. RobustKnock was developed to address this shortcoming by optimizing the minimal flux toward the target product, again by applying up to five reaction knockouts. Limitations of these methods include the prediction of only a single knockout strategy and the lack of consideration of over-expression or dampening targets, which are key aspects of successful strain design (Dellomonaco et al., 2011). A complementary method, OptGene [later updated to OptFlux (Rocha et al., 2010)], can be used for overexpression analysis. Flux variability analysis has been used in a number of studies for identifying overexpression targets (Choi et al., 2010; Park et al., 2012), as well as more comprehensive strategies (Pharkya and Maranas, 2006; Feist et al., 2010), although these have not been extensively used. Elementary modes have also been used to identify suitable knockout targets (Ballerstein et al., 2012; von Kamp and Klamt, 2014).

To integrate the requirements of predicting both knockouts and over-/under-expressions, we introduce RobOKoD (Robust Overexpression, Knockout and Dampening). RobOKoD takes into consideration metabolite centrality and flux variability in order to comprehensively identify potential knockouts and gene over-/under-expressions, ranked by significance, following the schematic presented in **Figure 1**. This ranking is a strength, as it allows further manual analysis of the system to be used for strain design.

The performance of RobOKoD was tested against that of OptKnock and RobustKnock in their ability to predict an engineering strategy for production of butanol from Escherichia coli using the reverse β-oxidation cycle. The predictions were validated against a successful, experimentally validated butanol-producing strain developed by Dellomonaco et al. (2011).

## Materials and Methods

## Escherichia coli model

The model used in this study is a derivation of a core metabolism model derived from the iAF1260 reconstruction of E. coli metabolism proposed by Feist et al. (2007). The core metabolism model of 95 native reactions was modified to include the β-oxidation pathway (a total of eight genes catalyzing 30 additional reactions) to produce the model iNS142 (see **Table 1**). This model contains 142 genes, 125 reactions, and 93 metabolites (**Figure 2**). The model is available in Supplementary Folder 1 in SBML format (Hucka et al., 2003).

#### BOX 1

Flux Balance Analysis (FBA) allows the computation of fluxes, and cellular growth, by using a set of constraints. FBA uses the stoichiometric matrix (S), which is a matrix consisting of rows of metabolites (m), and columns of reactions (n). An example based on the toy network in Figure B1a can be seen in Table B1a. The matrix is usually sparse and filled with positive (negative) coefficients for metabolites produced (consumed) by a reaction. Linear programming is used to compute feasible fluxes (v) through the network ensuring that a steady state is satisfied (Equation i), subject to a set of constraints (Equation ii) and optimizing (Z) a specific function (Equation iii, where c is a vector of weights, typically a vector of zeros with biomass production set to 1). The minimum solutions of Equation (i) are elementary modes, which are minimal sets of enzymes that can operate at steady state, also known as minimal functional units (de Figueiredo et al., 2009). If Equation (i) cannot be satisfied, then FBA cannot be computed on the system.

$$\begin{aligned} \mathbf{S}\mathbf{v} &= 0 \quad (i) \\ lb_i \le v_i \le ub_i, \quad i &= 1, \dots, n \quad (ii) \\ Z &= \mathbf{c}^T \mathbf{v} \quad (iii) \end{aligned}$$

In the example network below (Figure B1a), the uptake rate of metabolite a is constrained to 10 units. In the center network Z = Target, and in the right-hand network Z = Biomass. Reaction bounds are all assigned as $lb_i = 0$ and $ub_i = 1000$, meaning that every reaction in the network is irreversible. Computing FBA for Z = Target gives 10 units of flux flowing through v2 and v3, producing v\_Target = 10 units. For Z = Biomass, 10 units of flux flow through v3, v7, and v9, producing v\_Biomass = 10 units.
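The constraints of Box 1 can be checked by hand on a small network. The following minimal sketch uses a hypothetical four-reaction network (an assumption for illustration, not the network of Figure B1a, whose stoichiometry is not reproduced here). It verifies that a candidate flux vector satisfies the steady-state condition (Equation i) and the bounds (Equation ii), and evaluates the objective (Equation iii). It does not solve the linear program itself, for which an LP solver would be needed.

```python
# Hypothetical toy network (assumed for illustration only):
#   v_in: -> a      v1: a -> b      v2: b -> target      v3: a -> biomass
# Rows of S are metabolites (a, b); columns are reactions (v_in, v1, v2, v3).
S = [
    [1, -1, 0, -1],  # a: produced by v_in, consumed by v1 and v3
    [0, 1, -1, 0],   # b: produced by v1, consumed by v2
]
lb = [0, 0, 0, 0]            # all reactions irreversible
ub = [10, 1000, 1000, 1000]  # uptake limited to 10 units


def is_steady_state(S, v, tol=1e-9):
    """Equation (i): every row of S.v must be zero."""
    return all(abs(sum(s * x for s, x in zip(row, v))) <= tol for row in S)


def within_bounds(v, lb, ub):
    """Equation (ii): lb_i <= v_i <= ub_i for every reaction."""
    return all(lo <= x <= hi for x, lo, hi in zip(v, lb, ub))


def objective(c, v):
    """Equation (iii): Z = c^T v."""
    return sum(ci * vi for ci, vi in zip(c, v))


c_target = [0, 0, 1, 0]   # weight 1 on the target-producing reaction v2
c_biomass = [0, 0, 0, 1]  # weight 1 on the biomass reaction v3

v_all_to_target = [10, 10, 10, 0]
v_all_to_biomass = [10, 0, 0, 10]

for v in (v_all_to_target, v_all_to_biomass):
    assert is_steady_state(S, v) and within_bounds(v, lb, ub)

z_target = objective(c_target, v_all_to_target)
z_biomass = objective(c_biomass, v_all_to_biomass)
```

In this assumed network both objectives reach 10 units because all of the substrate can be routed down either branch, mirroring the independent optimization of target and biomass described above.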

FIGURE B1a | Illustrating FBA for independent optimisation of target and biomass.

TABLE B1a | Stoichiometric matrix (S).


## RobOKoD

The RobOKoD method is based on the two following assumptions:


A simplified schematic of the method based on these two assumptions can be seen in **Figure 1** and additional details are given in the next sections. First, a metabolite consumption test (MCT) is applied which computes whether a given metabolite in the target production pathway demonstrates flux loss to biomass production. If flux loss is identified, all reactions that consume that metabolite are flagged as potentially favored targets. Second, flux variability analysis profiling (FVAp) is performed to determine the flux variability of each reaction, at increments of maximum biomass flux and then at increments of maximum target product flux. The profiles of each reaction are used to calculate a score from which the importance of each reaction for growth and target production can be estimated. Finally, MCT and FVAp results are combined to rank potential modifications.

#### BOX 2

Flux Variability Analysis (FVA). Box 1 showed an example of FBA, where a single set of fluxes was identified, which can maximize biomass production (Z). It can be seen in the central network of Figure B2a that this set of fluxes was just one of two possible solutions that could be selected to maximize Z: route A and route B. FVA allows us to garner this additional information by identifying the minimum and maximum flux that each reaction can carry (Equation i). FVA can be implemented at the optimal state whereby y = 1 (Equation ii), subject to flux constraints for each reaction (Equation iii) as demonstrated in the right-hand network in Figure B2a (Gudmundsson and Thiele, 2010). Here the main information identified is which reactions are interchangeable. It is also common to compute FVA under suboptimal conditions (i.e., y = 0.95 as used in RobOKoD), which introduces a small amount of flexibility in the system and reduces the chance that the optimal pathways are unrealistic when compared with in vivo behavior.

Modifications can consist of (i) gene deletions; (ii) changes of environmental conditions; (iii) gene over-expressions; and (iv) gene dampenings.

This strategy ensures that reactions that are vital for either growth or target product production, or those that produce key metabolites, are not selected as potential knockouts. Conversely, reactions that (i) significantly divert carbon away from target production; and (ii) consume a metabolite known to promote flux loss from target production; are selected preferentially. Once the first knockout is predicted, the model is modified to block this reaction, and the same selection process is used to select the second reaction to delete. This method can be applied iteratively to predict a number of modifications that should enhance target production whilst maintaining growth.
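The iterative selection described above can be sketched as a greedy loop. This is a minimal illustration, not the authors' Matlab implementation: `score_fn` and `is_lethal` are hypothetical callbacks standing in for the scoring and viability checks, and stopping when no candidate has a positive score is an assumption.

```python
def iterative_knockouts(score_fn, is_lethal, max_knockouts=5):
    """Greedy knockout selection: score, block the best reaction, repeat.

    score_fn(blocked)      -> {reaction: (score, mct_flag)} (hypothetical callback)
    is_lethal(blocked, r)  -> True if blocking r abolishes growth (hypothetical)
    """
    blocked = []
    for _ in range(max_knockouts):
        scores = score_fn(frozenset(blocked))
        candidates = [
            (score, flag, rxn)
            for rxn, (score, flag) in scores.items()
            if rxn not in blocked and score > 0 and not is_lethal(blocked, rxn)
        ]
        if not candidates:  # assumed stopping rule: nothing left worth blocking
            break
        # Highest score wins; among equal scores, MCT-flagged reactions win.
        best = max(candidates, key=lambda t: (t[0], t[1]))
        blocked.append(best[2])  # "modify the model" by blocking this reaction
    return blocked


# Stub callbacks with made-up scores, purely to exercise the loop.
def toy_scores(blocked):
    base = {"v3": (2.0, 1), "v4": (2.0, 0), "v6": (1.0, 0), "v_bm": (5.0, 0)}
    return {r: s for r, s in base.items() if r not in blocked}


def toy_lethal(blocked, rxn):
    return rxn == "v_bm"  # the biomass reaction must never be removed


selected = iterative_knockouts(toy_scores, toy_lethal, max_knockouts=2)
```

In a real application the scoring callback would recompute the profiles on the modified model at every iteration, so the ranking can change after each knockout.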

TABLE 1 | Reactions and genes added to the core iAF1260 model to implement the β-oxidation cycle.


All code was developed in Matlab to maintain compatibility with the COBRA Toolbox (Schellenberger et al., 2011), and is available in Supplementary Folder 1.

#### Metabolite Consumption Test (MCT)

The Metabolite Consumption Test (MCT) identifies metabolites within the optimal target production pathway that are also consumed to produce biomass. The MCT score is assigned in two steps: first, the flux change ($X_m$) is calculated for each metabolite ($m$); then an MCT value of 1 is given to all reactions that consume a metabolite with a negative $X_m$. $X_m$ is calculated according to Equation (1). For each metabolite that features in the optimal target-producing pathway (metabolites **a**, **b**, and **e** in the example network in **Figure 3**), all producing and consuming reactions are identified. For each identified reaction, a unitary constant $c$ marks the reaction as a producer (+1) or consumer (−1) of the metabolite during biomass production, thereby indicating whether there is a potential flux gain or loss from that reaction. Each reaction is then weighted ($w$) according to whether it is vital for both target and biomass (0), potentially used for biomass production (1), or not used for biomass production (0). $v_{\max}$ is the maximum flux through the reaction during biomass production. All reactions that consume a metabolite $m$ with a negative $X_m$ value are flagged with a 1 in the corresponding column (see MCT column in **Table 2**).

$$X_m = \sum_{i=1}^{n} c^{(i)} \cdot w^{(i)} \cdot v_{\max}^{(i)} \tag{1}$$
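As a worked instance of Equation (1), the sketch below computes the flux change for a single metabolite and derives the MCT flags. The reaction names, weights, and fluxes are hypothetical values chosen for illustration, and the dictionary layout is an assumption, not the data structure used in the published code.

```python
def metabolite_flux_change(reactions):
    """Equation (1): X_m = sum over reactions of c * w * v_max."""
    return sum(r["c"] * r["w"] * r["vmax"] for r in reactions)


def mct_flags(reactions):
    """Flag (1) every consumer of the metabolite when X_m is negative."""
    x_m = metabolite_flux_change(reactions)
    return {r["id"]: int(x_m < 0 and r["c"] == -1) for r in reactions}


# Hypothetical reactions touching one metabolite of the target pathway.
# c: +1 producer / -1 consumer; w: 0 if vital for both target and biomass
# or unused for biomass, 1 if potentially used for biomass;
# vmax: maximum flux during biomass production.
reactions_of_b = [
    {"id": "v1", "c": +1, "w": 0, "vmax": 10},  # producer, vital for both
    {"id": "v2", "c": -1, "w": 0, "vmax": 10},  # consumer, unused for biomass
    {"id": "v3", "c": -1, "w": 1, "vmax": 10},  # consumer, feeds biomass
]

x_b = metabolite_flux_change(reactions_of_b)  # 0 + 0 - 10
flags_b = mct_flags(reactions_of_b)
```

Here only v3 contributes to $X_m$, which is negative, so both consumers of the metabolite (v2 and v3) receive an MCT flag, in line with the rule that all consuming reactions are flagged.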

#### FVA Reaction Profile (FVAp)

Prior to FVAp, FBA is applied to predict the maximal theoretical yield of both biomass ($y_{bm}$) and target product ($y_{target}$). FVAp is then performed, which computes the flux variability of each reaction: (1) at different percentages (0–100%) of $y_{bm}$ whilst optimizing target product; and (2) at different percentages (0–100%) of $y_{target}$ whilst optimizing biomass. By computing FVAp, the flux capacity of each reaction is profiled over a range of target constraints. The key areas of interest are the extremes of target production and biomass production. It can be seen in **Figure 5** that, for all examples, the first and last quartiles of the x axis hold the key information from which beneficial genetic interventions can be inferred.

#### Knockout Scoring

Knockouts were selected by computing a knockout ranking score. The ranking score is calculated for each reaction using FVAp at different percentages (0–100%) of $y_{bm}$ whilst optimizing target product (red shaded area). Let $(v_{\max}^{(i)})^{\text{target}}|_p$ and $(v_{\min}^{(i)})^{\text{target}}|_p$ denote the maximal and minimal flux, respectively, of reaction $i$ obtained through FVAp when requiring a percentage $p$ of $y_{bm}$ to be produced while maximizing for product. Likewise, let $(v_{\max}^{(i)})^{\text{biomass}}|_p$ and $(v_{\min}^{(i)})^{\text{biomass}}|_p$ denote the maximal and minimal flux of reaction $i$ obtained through FVAp when requiring a percentage $p$ of $y_{target}$ to be produced while maximizing for biomass. Note that the percentage $p$ refers to either the biomass or the target product production requirement, depending on the objective function.

A suitable knockout target displays the key characteristics shown in **Figure 5A**, where the first quartile of the x axis, 0–25% of $y_{bm}$ (red shaded area), carries a lower $(v_{\max}^{(i)})^{\text{target}}$ than 75–100% of $y_{bm}$, which shows that the reaction is required to carry a higher flux to sustain optimal biomass production. This characteristic is captured in Equation (2) (biomass reaction activation). A reduced variability in the fourth quartile also demonstrates a stronger constraint on the flux to produce $y_{bm}$; this is captured in Equation (3) (product variability area). The final knockout score $R^{i}_{KOr}$ for each reaction was computed according to Equation (4), which takes into account both the biomass reaction activation and the product variability area.

Biomass reaction activation:

$$\sum_{p_1=75\%}^{100\%} \left(v_{\max}^{(i)}\right)^{\text{target}}\Big|_{p_1} - \sum_{p_2=0\%}^{25\%} \left(v_{\max}^{(i)}\right)^{\text{target}}\Big|_{p_2} \tag{2}$$

Product variability area:

$$\sum_{p=75\%}^{100\%} \left[ \left(v_{\max}^{(i)}\right)^{\text{target}}\Big|_{p} - \left(v_{\min}^{(i)}\right)^{\text{target}}\Big|_{p} \right] \tag{3}$$

$$R_{KOr}^{i} = \frac{\text{biomass reaction activation}}{\text{product variability area}} \tag{4}$$
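Equations (2) to (4) reduce to simple sums over a sampled FVA profile. The sketch below assumes a profile stored as a dictionary from percentage `p` to a `(v_min, v_max)` pair; the sampling step, the example values, and the handling of a zero denominator are all assumptions, since the text does not specify them.

```python
def knockout_score(target_profile):
    """Knockout score for one reaction from its fixed-biomass FVA profile.

    target_profile maps p (percent of y_bm held fixed while the target is
    optimized) to a (v_min, v_max) pair.
    """
    top = [(lo, hi) for p, (lo, hi) in target_profile.items() if 75 <= p <= 100]
    bottom = [(lo, hi) for p, (lo, hi) in target_profile.items() if 0 <= p <= 25]

    # Equation (2): biomass reaction activation.
    activation = sum(hi for _, hi in top) - sum(hi for _, hi in bottom)
    # Equation (3): product variability area.
    variability = sum(hi - lo for lo, hi in top)

    # Assumed convention for a fully constrained fourth quartile.
    if variability == 0:
        return float("inf") if activation > 0 else 0.0
    # Equation (4).
    return activation / variability


# Hypothetical profile sampled at 0, 25, 50, 75 and 100% of y_bm.
profile = {
    0: (0.0, 2.0),
    25: (0.0, 2.0),
    50: (1.0, 4.0),
    75: (3.0, 5.0),
    100: (6.0, 6.0),
}
score = knockout_score(profile)  # activation 7.0, variability 2.0
```

The example reaction carries more flux as the biomass requirement rises and its range narrows at high biomass, which is the pattern described for **Figure 5A**; a flat profile would give an activation of zero and therefore a score of zero.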

Reactions that obtain a high $R^{i}_{KOr}$ are identified as putative knockout targets, provided that the knockout is not lethal to the cell. Identified knockout targets are first ordered by $R^{i}_{KOr}$, before secondary sorting by MCT flags. An example of this sorting can be seen in **Table 2**, based on the toy network presented in **Figures 3** and **4**.
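The two-level ordering can be expressed as a sort on a composite key. In this sketch the scores are hypothetical placeholders (the values of Table 2 are not reproduced here); only the reaction names and the fact that v3 and v4 carry an MCT flag follow the toy network. The final alphabetical tie-break is an assumption added to make the order deterministic.

```python
def rank_knockouts(scores, mct_flags):
    """Order reactions by knockout score, then by MCT flag, then by name."""
    return sorted(
        scores,
        key=lambda rxn: (-scores[rxn], -mct_flags.get(rxn, 0), rxn),
    )


# Hypothetical scores: five reactions tie, one scores lower.
scores = {"v6": 1.0, "v3": 1.0, "v8": 1.0, "v4": 1.0, "v7": 1.0, "v5": 0.2}
flags = {"v3": 1, "v4": 1}  # reactions consuming a target-pathway metabolite

ranking = rank_knockouts(scores, flags)
```

With these inputs the flagged reactions v3 and v4 move to the head of the tied group, ahead of v6, v7, and v8, and the lower-scoring v5 comes last.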

#### Over-Expression Ranking

The characteristics of a strong over-expression target can be seen in the lower quartile of the x axis in **Figure 5B**, where at 0–25% of $y_{bm}$ (red shaded area) $(v_{\min}^{(i)})^{\text{target}}$ has a higher flux capacity than $(v_{\min}^{(i)})^{\text{biomass}}$ at 75–100% of $y_{target}$ (blue shaded area) (target extra flux, see Equation 5). A lower variability is also desirable for optimizing target subject to 0–25% of $y_{bm}$ (target variability, Equation 6), as it ensures that the minimum flux the reaction can carry is close to optimum. The final ranking ($R^{i}_{OEx}$) is determined using Equation (7), where reactions with the highest $R^{i}_{OEx}$ are the most likely over-expression targets. An example of a weaker over-expression target (corresponding to a lower $R^{i}_{OEx}$) is shown in **Figure 5C**, which illustrates an over-expression that will increase flux to both target and biomass. Negative $R^{i}_{OEx}$ values represent potential dampening targets (see **Figure 5D**), which display the opposite characteristics.

… reactions, and green, blue, and orange circles represent extracellular metabolites, intracellular metabolites involved in carbon transfers, and … show reversible reactions. Water is not shown for clarity of the layout.

Target extra flux:

$$\sum_{p_1=0\%}^{25\%} \left(v_{\max}^{(i)}\right)^{\text{target}}\Big|_{p_1} - \sum_{p_2=75\%}^{100\%} \left(v_{\max}^{(i)}\right)^{\text{biomass}}\Big|_{p_2} \tag{5}$$

This means that reactions with an equal $R^{i}_{KOr}$ can be differentiated by a secondary sorting according to whether they directly consume a metabolite that is important for target production (see **Table 2**).

TABLE 2 | MCT score and $R^{i}_{KOr}$ of the intracellular reactions, computed using the toy network presented in Figures 3, 4.

v3, v4, v6, v7, and v8 all have the same FVAp profiles and therefore the same $R^{i}_{KOr}$ scores. Of the top-ranking reactions within this network, v3 and v4 consume a metabolite that is important for target production. These reactions are therefore given higher priority as knockout targets within the set of equally ranked reactions.

Target variability:

$$\sum_{p=0\%}^{25\%} \left[ \left(v_{\max}^{(i)}\right)^{\text{target}}\Big|_{p} - \left(v_{\min}^{(i)}\right)^{\text{target}}\Big|_{p} \right] \tag{6}$$

$$R_{OEx}^{i} = \frac{\text{target extra flux}}{\text{target variability}} \tag{7}$$

### OptKnock and RobustKnock

The OptKnock algorithm (Burgard et al., 2003) is available in the COBRA Toolbox for Matlab, and the RobustKnock algorithm is available as a Matlab script from the original paper (Tepper and Shlomi, 2010). Both are repackaged in Supplementary File 1, allowing the following results to be reproduced.

## Results

As a case study, RobOKoD was applied to design an E. coli strain with a reverse β-oxidation cycle for butanol production. These results can be recreated by unzipping the code in Supplementary File 1, and running the test script iNS142\_butanol.m in Matlab [requires the COBRA Toolbox (Schellenberger et al., 2011), and if RobustKnock is to be tested, the Tomlab solver (Tomlab Optimization Inc., Västerås, Sweden)]. This test script runs RobOKoD over a maximum of five iterations of knockout scoring, implementing the highest scoring knockout, generating a results document and reaction FVA profile plots for each iteration in the directory iNS142\_butanol\_results, and outputting an updated SBML model in which the knockouts have been implemented. It subsequently runs over-expression ranking, again generating output in the iNS142\_butanol\_results directory. OptKnock and RobustKnock are then run in order to compare predictions from each method. Knockout scoring, over-expression rankings, and FVA profiles for all relevant reactions (such as those illustrated in **Figure 3**) can then be inspected manually.

MCT allows the identification of reactions that consume metabolites present in the optimal target production pathway that demonstrate flux loss toward biomass. These reactions are flagged in the listing of potential knockouts with a value of 1, allowing them to be selected preferentially out of a set of reactions with the same knockout score. In this network, pyruvate was identified as a key metabolite where flux loss to biomass production could occur; 11 reactions that consume pyruvate were then identified.

FVA profiles representative of the different situations commonly encountered are shown in **Figure 5**. Knockout targets (**Figure 5A**) are identified based on the fixed-biomass, optimal-target FVAp (red profile). As the percentage of fixed biomass increases, the flux through the reaction increases to accommodate a higher biomass requirement, and the variability of the flux narrows. Strong overexpression targets (**Figure 5B**) show the opposite behavior to knockouts, whereby the flux through the reaction reduces as the percentage of fixed target is reduced while biomass is optimized (blue profile). Weak overexpression targets (**Figure 5C**) show similar characteristics, but are not required to carry a flux for the target to be optimized. Dampening targets (**Figure 5D**) are characterized by their ability to carry a higher flux through the reaction at a low percentage of fixed target with optimized biomass than at both a high percentage of fixed target with optimized biomass and a low percentage of fixed biomass with optimized target.

It is noted that some reactions obtain identical scores, hence their deletions are predicted to have the same impact on the system. This is, for instance, the case for two consecutive reactions of an unbranched, linear pathway. More generally, this is observed for subsets of reactions that carry perfectly correlated fluxes (Heiner, 2009; Feist et al., 2010). A feature of RobOKoD is therefore its ability to identify such subsets of reactions. The corresponding knockouts are expected to result in a similar phenotype, hence the modification to perform for such subsets of reactions should be evaluated in the light of technical considerations. The most practical modifications should be selected, whilst the resulting strain should still be amongst the optimal producers.

For comparison, the well-established algorithms OptKnock and RobustKnock were applied to the same model to predict the optimal strain for butanol production. For each method, the maximum number of modifications was fixed to five, since constructing such a strain can still be managed experimentally. The optimal producer strains predicted by each method are listed in **Table 3** and are compared to the most efficient producer strain that has been experimentally validated (Dellomonaco et al., 2011). OptKnock and RobustKnock predicted strains that were theoretically unable to produce butanol during growth and, in the case of OptKnock, not viable for growth.

**Table 4** compares the functional modifications of the predicted in silico cells with those of the experimental strain. It appears that RobOKoD automatically captures most of the functional modifications carried out experimentally. In particular, it predicted that fermentation pathways (pfl, ldhA) should be knocked out to avoid diversion of carbon and reduced cofactors toward by-products of poor interest. Moreover, by highlighting the competing interests of the oxygen uptake pathway between the production of biomass and butanol, RobOKoD was able to indicate an anoxic condition change, similar to the experimental strain, in which fumarate reductase was knocked out and which was grown under microaerobic conditions.

In addition to the knockout predictions, RobOKoD was also able to predict over-expression and dampening targets. It predicted that enzymes catalyzing the reactions associated with the reverse β-oxidation cycle should be over-expressed, consistent with the experimental strain, where the activity of transcriptional inhibitors of this pathway is dampened (fadR, atoC(c), crp<sup>∗</sup>, and ΔarcA strains). Moreover, RobOKoD also predicts that a number of transport reactions (or rather genes encoding the relevant transport proteins) should be dampened, hence providing additional modifications that could enhance butanol production. These less intuitive dampening predictions were not carried out in the experimental strain and have not been experimentally verified.

**Table 5** compares the molar production of butanol per mole of glucose uptake, when the objective of the cell is to optimize biomass. It shows that RobOKoD predicted the most successful butanol strain design, with molar ratio values similar to that achieved in the experimental strain. Neither OptKnock nor RobustKnock predicted successful strains, and in the case of OptKnock, the strain was predicted to be no longer viable.

The strain predicted by RobOKoD was developed iteratively by automatically knocking out the highest ranked suggested knockout target that was also flagged by MCT as a potential route for flux loss from the butanol production pathway. This was to prevent selection bias for trialing its validity. It is strongly recommended to use the method more flexibly, looking at the FVAp graphs that are produced for the reactions, knowledge of the organism, and the scorings in order to decide on suitable knockouts.

FIGURE 5 | Typical FVA profiles characteristic of knockout targets (A), strong overexpression targets (B), weak overexpression targets (C), and dampening targets (D). The red profiles show FVAp of each reaction at different percentages (0–100%) of $y_{bm}$ whilst optimizing target product. The blue profiles show FVAp at different percentages (0–100%) of $y_{target}$ whilst optimizing biomass. Knockout targets (A) are identified using 75–100% of $y_{bm}$ (corresponding to the fourth quartile of the x axis) with target optimization, where $(v_{\max}^{(i)})^{\text{target}}$ increases as $y_{bm}$ increases, coupled with a reduced variability between $(v_{\max}^{(i)})^{\text{target}}$ and $(v_{\min}^{(i)})^{\text{target}}$. Strong overexpression targets (B) are identified using 0–25% of $y_{bm}$ optimizing target, and 75–100% of $y_{target}$ optimizing biomass (corresponding to the first quartile of the x axis), where $(v_{\max}^{(i)})^{\text{target}}$ (red) has a higher flux-carrying capacity than $(v_{\max}^{(i)})^{\text{biomass}}$ (blue), again with reduced variability between $(v_{\max}^{(i)})^{\text{target}}$ and $(v_{\min}^{(i)})^{\text{target}}$. Weak overexpression targets (C) show similar characteristics, with a smaller difference between $(v_{\min}^{(i)})^{\text{target}}$ and $(v_{\min}^{(i)})^{\text{biomass}}$ and a larger variability between $(v_{\max}^{(i)})^{\text{target}}$ and $(v_{\min}^{(i)})^{\text{target}}$. Profiles of dampening targets (D) are the reverse of overexpression targets.

#### TABLE 3 | Gene modifications, based on the reactions predicted by the three computational methods, and their comparison with those successfully applied experimentally (Dellomonaco et al., 2011).

#### TABLE 4 | Functional similarities captured in the gene manipulations predicted by each method.

TABLE 5 | Molar ratio of glucose:butanol produced in predicted strains.


## Discussion

These results illustrate two limitations of OptKnock and RobustKnock. First, the knockout predictions are deterministic, not ranked, and a unique set of knockouts is predicted. As shown by these results, different knockouts which may give similar phenotypes cannot be identified by these algorithms. With RobOKoD, a score is attributed to each modification, and one can readily check whether some modifications are expected to result in similar phenotypes and select those that can be more easily implemented experimentally. Secondly, OptKnock and RobustKnock are unable to predict over-expression or dampening strategies, which are of prime interest to increasing or decreasing flux down key pathways, respectively. However, it is argued that using a range of available techniques may help to build up a more comprehensive understanding of the system, and comparing the results obtained by different methods (e.g., Burgard et al., 2003; Choi et al., 2010; Tepper and Shlomi, 2010; Park et al., 2012) would be the most valuable strategy for designing producing strains.

It is also important to note that constraint-based modeling is not appropriate in all instances for prediction of suitable strains for target molecule production. FBA, a key method of assessing the functionality of a given strain, has the flaw that side reactions are not predicted to carry flux in silico, as this would reduce the resources that are routed to growth. For example, FBA run on yeast does not predict ethanol production under an intuitively appealing set of constraints (Westerhoff et al., 2009). This means that only solutions for target production pathways that are heavily coupled with growth can be identified. This is not an issue in most cases, since a viable strain is desired, but it limits the applicability of this framework in particular cases, for example when there is a need to decouple production from growth. It also means that the false negative rate for in silico strain predictions is high, with many successful laboratory strains not appearing so when translated to an in silico model. In future, the field needs to look more toward different ways of predicting metabolic fluxes. Combining kinetic and stoichiometric models of the metabolic system (Chowdry et al., 2014) provides additional levels of constraints (including enzyme inhibition and activation) and is expected to improve the prediction of effective interventions. A longer term goal is therefore the production of detailed, large-scale kinetic models of the whole metabolic system (Stanford et al., 2013).

When running OptKnock and RobustKnock, it was clear that OptKnock was more user friendly, owing to it being made available in the COBRA Toolbox for Matlab and therefore applicable to a number of MILP (mixed integer linear programming) solvers. This was not the case for RobustKnock, which required a non-standardized model structure and the use of a specific solver, Tomlab, which has limited free access. An additional goal of designing RobOKoD was therefore to ensure its accessibility and robustness by reusing freely-accessible solvers, extensively validated COBRA Toolbox methods, and standardized model formats such as SBML.

A necessary future direction for both RobOKoD and existing tools such as OptKnock and RobustKnock will be to move to making predictions regarding knockouts, over-expressions, etc. at the level of the gene, rather than, as currently, at the level of the reaction. Due to the presence of both isoenzymes and promiscuous enzymes, it is clear that there is not a 1:1 mapping between gene and reaction. Consequently, manipulation of a given gene is likely to affect a number of reactions. Modification of this method to consider the gene-protein-reaction (GPR) relationships that are present in many genome-scale metabolic models will be a priority for future development.

To summarize, RobOKoD provides an additional tool to aid the task of designing strains for the (over)production of target products. It is able to predict and rank knockouts, overexpressions, and dampening targets. While predicting an optimized set of gene modifications to implement, unlike other methods, RobOKoD also provides lists of candidate modifications, along with graphical flux variability profiles, allowing the user to manually validate the set of predictions. Such a flexible approach—particularly when used in conjunction with other analysis methods mentioned previously—will allow for sensible gene manipulation approaches to be taken into the laboratory.

## Author Contributions

NJS conceived the study, led the project, developed the method and the code, and wrote and led the writing of the manuscript. PM contributed to the method conception and development, and writing of the manuscript. NS contributed to the code development and the writing of the manuscript.

## Acknowledgments

NJS is grateful for funding under grant code BB/M013189/1, and the European Union under the Preparatory Phase Projects in the framework of FP7 (project reference 312455). PM received the support of the INRA (program CJS) and of the European Union in the framework of the Marie-Curie FP7 COFUND People Program, through the award of an AgreenSkills fellowship (under grant agreement no. 267196). NS is grateful to the BBSRC for funding under grant code BB/K019783/1. All authors thank both Ettore Murabito and Michael Howard for useful discussions whilst developing this work.

## Supplementary Material

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fcell.2015.00017/abstract

## References


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Stanford, Millard and Swainston. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Reaction Abbreviations


## Succinate overproduction: a case study of computational strain design using a comprehensive Escherichia coli kinetic model

#### **Ali Khodayari†, Anupam Chowdhury† and Costas D. Maranas\***

Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, USA

#### **Edited by:**

Daniel Machado, University of Minho, Portugal

#### **Reviewed by:**

Cong T. Trinh, The University of Tennessee Knoxville, USA Joonhoon Kim, University of Wisconsin-Madison, USA Kiran Raosaheb Patil, European Molecular Biology Laboratory, Germany

#### **\*Correspondence:**

Costas D. Maranas, Department of Chemical Engineering, The Pennsylvania State University, 112 Fenske Laboratory, University Park, PA 16802, USA e-mail: costas@psu.edu

†Joint first authors

Computational strain-design prediction accuracy has been the focus of many recent efforts through the selective integration of kinetic information into metabolic models. In general, kinetic model prediction quality is determined by the range and scope of genetic and/or environmental perturbations used during parameterization. In this effort, we apply the k-OptForce procedure to a kinetic model of E. coli core metabolism constructed using the Ensemble Modeling (EM) method and parameterized using data from multiple mutant strains under aerobic respiration with glucose as the carbon source. Minimal interventions are identified that improve succinate yield under both aerobic and anaerobic conditions to test the fidelity of model predictions under both genetic and environmental perturbations. Under aerobic condition, k-OptForce identifies interventions that match existing experimental strategies while pointing at a number of unexplored flux re-directions such as routing glyoxylate flux through the glycerate metabolism to improve succinate yield. Many of the identified interventions rely on kinetic descriptions and would not be discoverable by a purely stoichiometric description. In contrast, under fermentative (anaerobic) condition, k-OptForce fails to identify key interventions including up-regulation of anaplerotic reactions and elimination of competitive fermentative products. This is because the pathways activated under anaerobic condition were not properly parameterized, as only aerobic flux data were used in the model construction. This study sheds light on the importance of condition-specific model parameterization and provides insight on how to augment kinetic models so as to correctly respond to multiple environmental perturbations.

**Keywords: computational strain design, kinetic model, bilevel optimization, succinate overproduction, model parameterization**

#### **INTRODUCTION**

Engineered microorganisms are increasingly being used as cellular factories for the bio-production of chemicals of interest (Curran and Alper, 2012; Hong and Nielsen, 2012; Lee et al., 2012). Keeping pace with genome editing techniques for strain design, several computational tools have been developed to identify system-wide genetic modification strategies that improve the yield of targeted biochemicals (Pharkya et al., 2004; Kim et al., 2011; Xu et al., 2011; Maia et al., 2012; Cotten and Reed, 2013a). In general, these tools rely on a stoichiometric representation of a metabolic network and solve bilevel optimization problems to suggest prioritized intervention strategies that divert metabolic flux towards the chemical of interest (Segre et al., 2002; Burgard et al., 2003; Kim and Reed, 2010; Rocha et al., 2010; Tepper and Shlomi, 2010). The methodology and comparative benefits of each procedure are discussed in detail elsewhere (Zomorrodi et al., 2012). However, key methodological impediments of these approaches are the stoichiometry-only representation of metabolism and the on–off representation of regulation. This may lead to intervention strategies that are agnostic to metabolite concentrations, enzymatic activities, and metabolic regulation. Therefore, identified flux re-direction predictions (especially up/down flux modulation) are sometimes difficult to translate into actionable genetic interventions. For example, it is unclear if a desired metabolic flux up-regulation is achievable or even consistent with enzyme kinetics or physiological metabolite concentrations.

Some of the shortcomings of genome-scale stoichiometric models in quantifying the effect of concentration and enzyme levels on reaction throughput and regulation can be addressed by kinetic models of metabolism (Mahadevan et al., 2002; Fleming et al., 2010; Jamshidi and Palsson, 2010; Smallbone et al., 2010; Feng et al., 2012). Kinetic models yield a system of ordinary differential equations (ODEs) that describe the time evolution of metabolite concentrations, enzyme activities, and reaction fluxes. Several efforts have been made in recent years to improve the accuracy of stoichiometry-based tools by partially integrating kinetic information (Nikolaev, 2010; Song and Ramkrishna, 2012; Angermayr and Hellingwerf, 2013; Almquist et al., 2014). However, most of these procedures are aimed towards improved metabolic phenotype prediction through *ad hoc* constraints (Cotten and Reed, 2013b) rather than strain design. The k-OptForce procedure (Chowdhury et al., 2014) extends the previously developed strain-design OptForce algorithm (Ranganathan et al., 2010) by integrating all available mechanistic details afforded by kinetic models within a constraint-based optimization framework tractable even for genome-scale models. Reactions with available kinetic descriptions yield (generally unique) steady-state flux values while the remaining reactions are only constrained by stoichiometric relations. Genetic intervention strategies consistent with restrictions imposed by maximum enzyme activity, bounds on metabolite concentrations, and kinetic expressions are identified using a bilevel Mixed Integer Nonlinear Program (MINLP) optimization framework (Chowdhury et al., 2014). Examples addressed in Chowdhury et al. (2014), however, accounted for only a handful of reactions with kinetic expressions.

In this paper, we apply the k-OptForce procedure to the recently published large-scale kinetic model of *E. coli* core metabolism (Khodayari et al., 2014). The kinetic model includes 138 reactions, 93 metabolites, and 60 substrate-level regulatory interactions and accounts for glycolysis/gluconeogenesis, pentose phosphate (PP) pathway, TCA cycle, major pyruvate metabolism, anaplerotic reactions, glyoxylate shunt, Entner–Doudoroff (ED) pathway, and a number of reactions in other parts of the metabolism. The model was parameterized using the ensemble modeling (EM) formalism (Tran et al., 2008) by simultaneously satisfying normalized flux data per 100 mmol of glucose uptake (for approximately 25 reactions per mutant) for the wild-type and seven single gene deletion mutants, under aerobic condition (Ishii et al., 2007). The EM approach decomposes all reactions into elementary steps, bypassing the need for detailed kinetic expressions. First, an ensemble of kinetic models is generated by uniformly sampling reaction reversibilities and enzyme fractions; the models follow different time trajectories but all reach the same steady-state flux values (Tan and Liao, 2012). Next, a Genetic Algorithm (GA) implementation is used to "swap" kinetic parameterizations between models in the ensemble so as to minimize the deviations from all sets of mutant network fluxes. Models constructed using flux data for a single strain do not always perform well in predicting deletion strain metabolic phenotypes (Jouhten, 2012; Villaverde et al., 2014). Unlike stoichiometric models that could reveal physiologically relevant flux re-directions in response to perturbations by re-optimizing biomass yield, kinetic models must be endowed beforehand with all known substrate-level regulatory interactions to capture metabolic responses to genetic/environmental perturbations (Jouhten, 2012; Heijnen and Verheijen, 2013; Villaverde et al., 2014). 
Note that while the EM based elementary mode analysis was used for strain design in an earlier effort (Flowers et al., 2013), the limited scope of the model may fail to capture genome-scale flux re-directions.
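The sampling step of the EM formalism can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Khodayari et al. (2014): it treats a single enzyme with a cyclic mechanism of `n_steps` elementary steps, assumes that the reversibility of a step is the ratio of its reverse to its forward rate (for a positive net flux), normalizes metabolite concentrations to one at the reference state, and uses a simple normalize-after-sampling scheme for the enzyme fractions. All function names are hypothetical.

```python
import random


def sample_enzyme_fractions(n_states, rng):
    """Fractions of total enzyme in each state; they sum to one."""
    raw = [rng.uniform(0.01, 1.0) for _ in range(n_states)]
    total = sum(raw)
    return [x / total for x in raw]


def sample_model(v_net, n_steps, seed):
    """One ensemble member: every elementary step carries the same net flux."""
    rng = random.Random(seed)
    fractions = sample_enzyme_fractions(n_steps, rng)
    steps = []
    for i in range(n_steps):
        reversibility = rng.uniform(0.0, 0.99)  # 0 means irreversible
        v_fwd = v_net / (1.0 - reversibility)   # so that v_fwd - v_rev == v_net
        v_rev = reversibility * v_fwd
        # The forward step consumes enzyme state i; the reverse step consumes
        # the next state in the cycle. With concentrations normalised to one,
        # each rate constant is the rate divided by the consumed fraction.
        k_fwd = v_fwd / fractions[i]
        k_rev = v_rev / fractions[(i + 1) % n_steps]
        steps.append({"v_fwd": v_fwd, "v_rev": v_rev,
                      "k_fwd": k_fwd, "k_rev": k_rev})
    return {"fractions": fractions, "steps": steps}


# Five models with different parameters but the same reference flux.
ensemble = [sample_model(v_net=1.0, n_steps=3, seed=s) for s in range(5)]
for model in ensemble:
    print([round(step["k_fwd"], 2) for step in model["steps"]])
```

Each sampled model reproduces the reference flux by construction, so members of the ensemble differ only in how they respond once enzyme levels or concentrations are perturbed, which is what the subsequent screening against mutant flux data exploits.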

The k-OptForce procedure (Chowdhury et al., 2014) was used to identify the minimal interventions that maximize the yield of succinate production using a hybrid kinetic (Khodayari et al., 2014) and stoichiometric *i*AF1260 (Feist et al., 2007) description of *E. coli* metabolism. Succinate was chosen as the target bioproduct as there exist numerous experimental strain-engineering studies against which to compare the suggestions of the k-OptForce procedure (Lee et al., 2005; Cao et al., 2011; Tan et al., 2011). This study was carried out under both aerobic and anaerobic conditions to assess the fidelity of the kinetic model when used to make predictions for a different environmental condition (i.e., anaerobic) than the one parameterized for (i.e., aerobic). The goal was to quantify the reduction in prediction quality moving from aerobic to anaerobic under glucose minimal condition and suggest model modifications that remedy these shortcomings. k-OptForce recapitulated existing strategies while also pointing at promising but currently unexplored interventions. In addition, results under anaerobic condition indicate that the kinetic model needs to be reparameterized with mutant flux information involving a reversed TCA cycle routing flux towards succinate. A number of regulatory modifications of the kinetic model are also found to be necessary to better reflect metabolic fluxes associated with anaerobic succinate production. These include activation of fermentation pathways and pyruvate formate lyase (PFL) by key regulatory proteins FNR (fumarate and nitrate reductase regulation) and ArcA (aerobic respiratory control).

## **MATERIALS AND METHODS**

Using k-OptForce, the genome-scale stoichiometry matrix is divided into two parts: reactions with stoichiometric information only (*J*<sup>stoic</sup>), and those having additional kinetic information (*J*<sup>kin</sup>). A schematic representation of the framework is depicted in **Figure 1**. The kinetic information was extracted from the kinetic model of *E. coli* central metabolism developed in Khodayari et al. (2014). The number of reactions in the kinetic representation is a compromise between reduction of solution space using kinetic data and run time for solving the non-linear expressions of mass conservations. Upon exclusion of the exchange/transport reactions and elimination of reactions not involved in succinate synthesis (such as glycogen pathway), a subset of the kinetic model was selected containing 36 reactions and 31 metabolites. The resulting model includes reactions from glycolysis/gluconeogenesis, PP pathway, TCA cycle, anaplerotic reactions, glyoxylate shunt, and ED pathway with available experimental data during model parameterization. This model was finally supplemented with the stoichiometric *i*AF1260 model of *E. coli* (Feist et al., 2007).

Glucose minimal conditions were simulated by restricting the glucose uptake flux (which serves as a basis for the fluxes in the metabolic network) to −100 mmol gDW<sup>−1</sup> h<sup>−1</sup>. Oxygen uptake was limited to −200 mmol gDW<sup>−1</sup> h<sup>−1</sup> for the aerobic condition and set to zero for the fermentative condition. Regulatory information for both aerobic and anaerobic conditions was imported from the supplementary material of the *i*AF1260 model (Feist et al., 2007). The minimum production level of succinate was set at 90% of its theoretical maximum for each condition (i.e., 135 mmol gDW<sup>−1</sup> h<sup>−1</sup> in aerobic and 149 mmol gDW<sup>−1</sup> h<sup>−1</sup> in anaerobic conditions) while a minimum level of biomass production equal to 10% of its theoretical maximum was simultaneously imposed (i.e., 0.965 h<sup>−1</sup> in aerobic and 0.303 h<sup>−1</sup> in anaerobic conditions). The k-OptForce algorithm was implemented in the same stepwise procedure as described previously [see Methods in Chowdhury et al. (2014) for details]. First, we identify all reactions that must depart (hence called MUST sets) from the reference phenotype to realize the targeted levels of overproduction of the desired chemicals under stoichiometric and kinetic constraints. Subsequently, we solve a bilevel optimization formulation (see **Figure 1E**) where we maximize the target flux by gradually increasing the total number (κ) of enzymatic interventions (for reactions in *J*<sup>kin</sup>) and/or flux manipulations (for reactions in *J*<sup>stoic</sup>) from the MUST sets. Starting from a single intervention, we stop this procedure when the target flux does not improve appreciably with additional interventions. The optimization formulations for the characterization of the overproducing network and identification of the FORCE sets were altered from the original procedure to incorporate the kinetic information of each reaction in *J*<sup>kin</sup> as a function of the decomposed expressions of its elementary steps (see **Figure 1**) instead of directly manipulating the reaction enzyme activities (*v*<sup>max</sup>). Additional constraints were imposed to express the flux of each reaction in *J*<sup>kin</sup> as the difference of the forward and reverse reactions for each elementary step. The sum of individual enzyme fractions *e* is represented by *e*<sup>tot</sup> (i.e., normalized total enzyme concentration) that is equal to one in the reference (wild-type) strain, but varies when up/down-regulated in mutant strains. Here, we allowed the *e*<sup>tot</sup> of each reaction to vary between zero (i.e., removal of its activity) and a 10-fold up-regulation in its expression to account for individual enzymatic perturbations in mutant strains. Likewise, the same limits of variation were set for the individual enzyme fractions *e* for each reaction.

#### **FIGURE 1 | Continued**

**(A)** The reactions with kinetic descriptions are shown in blue. **(B)** The reactions are first decomposed into their elementary steps. **(C)** Elementary kinetic parameters are expressed as a function of reaction reversibilities and enzyme fractions. Reaction reversibilities and enzyme fractions are sampled to construct an ensemble of models, for any given reaction. **(D)** A genetic algorithm (GA) implementation identifies the optimal combination of the sampled parameters by minimizing the deviation from experimentally measured flux data for multiple mutant strains [see Methods of Khodayari et al. (2014)]. **(E)** The k-OptForce procedure identifies a minimal set of interventions that maximizes the yield of targeted product [see Methods of Chowdhury et al. (2014)].

The metabolite concentrations were allowed to vary within fivefold from their steady-state values in the reference phenotype. The FORCE set of interventions was identified in a two-step procedure [see Methods of Chowdhury et al. (2014)]. The first step identified the reactions in *J*<sup>kin</sup> (using binary variables *y*<sup>kin</sup>) whose enzymatic activity (i.e., *e*<sup>tot</sup>) must be altered from their reference level to achieve the required flux re-distribution towards succinate. The lower and upper bounds on *e*<sup>tot</sup> (i.e., *e*<sup>tot,lb</sup> and *e*<sup>tot,ub</sup>) are functions of *y*<sup>kin</sup> and the maximum fold-change *z*, as follows:

$$e_j^{\text{tot,lb}} = \begin{cases} 1, & \text{if } j \in J^{\text{kin}} \setminus \text{MUST}^L \\ 1 - y_j^{\text{kin}}, & \text{if } j \in J^{\text{kin}} \cap \text{MUST}^L \end{cases}$$

$$e_j^{\text{tot,ub}} = \begin{cases} 1, & \text{if } j \in J^{\text{kin}} \setminus \text{MUST}^U \\ (z - 1)\, y_j^{\text{kin}} + 1, & \text{if } j \in J^{\text{kin}} \cap \text{MUST}^U \end{cases}$$

If selected for down-regulation (i.e., when the reaction is part of MUST<sup>*L*</sup>), *e*<sup>tot</sup> is allowed to vary from zero (*e*<sup>tot,lb</sup> = 0 for *y*<sup>kin</sup> = 1) to its reference expression. Otherwise, *e*<sup>tot</sup> is set to one. Likewise, if selected for up-regulation (i.e., when the reaction is part of MUST<sup>*U*</sup>), *e*<sup>tot</sup> is allowed to vary from one to a *z*-fold overexpression (*e*<sup>tot,ub</sup> = *z* for *y*<sup>kin</sup> = 1). The MINLP formulation for the first step was initially solved using a local solver [DICOPT (Grossmann et al., 2002)], and the results were used as inputs to find the global optimum using the BARON solver (Sahinidis, 1996). Subsequently, by fixing the fluxes in *J*<sup>kin</sup>, the second step identified additional flux manipulations in *J*<sup>stoic</sup> (using binary variables *y*<sup>stoic</sup>) while assuming a worst-case scenario for the inner objective function. The relation of the modified bounds *v*<sub>*j*</sub><sup>lb</sup>, *v*<sub>*j*</sub><sup>ub</sup> on the reaction fluxes in *J*<sup>stoic</sup> with *y*<sup>stoic</sup> is similar to that explained for the first step of FORCE set identification for the implementation of up/down-regulations and/or reaction removals [see Methods of Chowdhury et al. (2014)].
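The bound definitions for the normalized total enzyme level given in the Methods can be written as a small function. This is a sketch only: the function name and argument names are hypothetical, and the default fold-change of 10 is taken from the 10-fold limit stated in the text.

```python
def etot_bounds(in_must_l, in_must_u, y_kin, z=10.0):
    """Return (lower, upper) bounds on the normalised total enzyme level.

    in_must_l, in_must_u: whether the reaction is in MUST^L / MUST^U.
    y_kin: binary selection variable (0 or 1) for the reaction.
    z: maximum fold-change allowed for up-regulation.
    """
    if y_kin not in (0, 1):
        raise ValueError("y_kin must be binary")
    # Down-regulation candidate: lower bound drops from 1 to 0 when selected.
    lower = 1.0 - y_kin if in_must_l else 1.0
    # Up-regulation candidate: upper bound rises from 1 to z when selected.
    upper = (z - 1.0) * y_kin + 1.0 if in_must_u else 1.0
    return float(lower), float(upper)


print(etot_bounds(in_must_l=True, in_must_u=False, y_kin=1))   # selected for down-regulation
print(etot_bounds(in_must_l=False, in_must_u=True, y_kin=1))   # selected for up-regulation
print(etot_bounds(in_must_l=True, in_must_u=True, y_kin=0))    # not selected: fixed at reference
```

When the binary variable is zero, both bounds collapse to one, so the enzyme stays at its reference level; the binary variable therefore acts as a switch that releases one side of the interval.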

## **RESULTS**

#### **EXAMINING THE PREDICTIVE PERFORMANCE OF THE KINETIC MODEL**

The perturbed phenotype prediction accuracy of the parameterized kinetic model was first assessed for five different engineered strains under aerobic condition. The experimentally reported product yield was compared against the kinetic model and FBA predictions (see **Table 1**). A twofold up-regulation for a small fold-change, and a 10-fold up-regulation for a high fold-change, are used to express enzyme up-regulation whenever such information is not available in the relevant literature. The enzyme level manipulation in the kinetic model is achieved by changing *e*<sup>tot</sup> for each particular enzyme. Gene deletions are implemented by setting the *e*<sup>tot</sup> of the encoded enzyme to zero.
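The way strain designs are encoded as enzyme-level changes can be sketched as below. The helper name and the example design are hypothetical (the reaction abbreviations are those defined with Table 1, but the particular combination is for illustration only); the fold-change defaults of 2 and 10 and the zero level for deletions follow the text.

```python
SMALL_FOLD = 2.0   # assumed when a small up-regulation is reported without a value
HIGH_FOLD = 10.0   # assumed when a high up-regulation is reported without a value


def perturb_enzymes(reference, deletions=(), small_up=(), high_up=()):
    """Return a copy of the reference enzyme levels with interventions applied."""
    e_tot = dict(reference)
    for rxn in small_up:
        e_tot[rxn] = reference[rxn] * SMALL_FOLD
    for rxn in high_up:
        e_tot[rxn] = reference[rxn] * HIGH_FOLD
    for rxn in deletions:
        e_tot[rxn] = 0.0  # gene deletion removes all enzyme activity
    return e_tot


# Reference (wild-type) strain: every normalised enzyme level equals one.
wild_type = {"SUCD": 1.0, "ICL": 1.0, "PPC": 1.0, "PDH": 1.0}
mutant = perturb_enzymes(wild_type, deletions=["SUCD"],
                         small_up=["PPC"], high_up=["ICL"])
print(mutant)
```

The perturbed levels would then be passed to the kinetic model in place of the reference values; solving the resulting model is outside the scope of this sketch.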

The kinetic model closely matches the yield of the succinate-producing strain, while FBA over-estimates it, because the former captures the feed-forward inhibition on glyoxylate shunt by intermediates phosphoenolpyruvate (pep) (MacKintosh and Nimmo, 1988; Ogawa et al., 2007) and isocitrate (icit) (Hoyt et al., 1988). For both L-serine and L-threonine, FBA directs all carbon flux towards biomass, predicting little to no product formation. The kinetic model over-estimates L-serine yield as product inhibition of the phosphoglycerate dehydrogenase (PGCD) (Grant, 2012; Li et al., 2012; Wang et al., 2014) is not captured in the kinetic model (see **Figure 2A**). In contrast, the kinetic model under-estimates the yield of L-phenylalanine production. A possible reason is that the feed-forward activation of pep on 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) (Gruys et al., 1992) is absent in the kinetic model (see **Figure 2B**). In addition, due to lack of experimental data during parameterization, the model over-estimates the inhibitory effect of phosphate on transaldolase (TALA) activity (Sprenger et al., 1995), which further restricts flux towards L-phenylalanine production. The productivity of the naringenin-engineered strain is better reflected by the kinetic model as FBA does not capture the feedback inhibition of acetyl-CoA on phosphoglucomutase (PGM) activity (Sanwal et al., 1972; Duckworth et al., 1973) that limits flux towards the flavanone pathway.

#### **OVERPRODUCTION OF SUCCINATE UNDER AEROBIC CONDITION**

Both OptForce and k-OptForce adopt similar strategies for redirecting flux towards succinate under aerobic condition by routing more flux through isocitrate lyase (ICL), increasing flux through phosphoenolpyruvate carboxylase (PPC), and converting the intermediate glyoxylate back to glycerate 2-phosphate (2pg) using glycerate metabolism (see **Figure 3**). However, the number of required interventions varies. While OptForce suggests that only three interventions are required to achieve a succinate yield of 90% of its theoretical maximum, k-OptForce suggests that additional direct up-regulations in the glycolysis and TCA cycle are necessary. For example, k-OptForce suggests at least ninefold up-regulation of ICL enzyme activity to pull TCA cycle flux from icit towards succinate. Likewise, up-regulation of enolase (ENO) enzyme by twofold of its reference activity is required to push more glycolytic flux towards succinate precursors oxaloacetate (oaa) and acetyl-CoA. Regular OptForce suggests that up-regulation of aconitase (ACONT) and down-regulation of isocitrate dehydrogenase (ICDH) are sufficient to indirectly increase flux through PPC and ICL. In contrast, k-OptForce suggests that PPC and ICL must be directly up-regulated to improve succinate yield. In addition, up-regulation of ENO pulls glyoxylate flux towards 2pg through the glycerate pathway to compensate for the pep depletion. OptForce does not require any enzymatic intervention to route metabolic flux towards acetyl-CoA, sending a significant portion (58 mmol gDW<sup>−1</sup> h<sup>−1</sup>) from oaa towards acetyl-CoA using the threonine pathway. k-OptForce reveals that such a high flux cannot be routed through the threonine pathway. Even with maximum (i.e., 10-fold) up-regulation of the aspartate transaminase (ASPTA), only 20 mmol gDW<sup>−1</sup> h<sup>−1</sup> can be diverted towards threonine. In addition, k-OptForce suggests up-regulation of PPC enzyme activity (by 50% of its reference activity) to ensure availability of equal amounts of acetyl-CoA and oaa for the production of citrate, thus preventing the accumulation of intermediates.

#### **Table 1 | A comparison between model predictions and experimental yields for five different products in E. coli under aerobic condition**.

The engineered strains are simulated using both the kinetic model and FBA (max biomass).

SUCD, succinate dehydrogenase; ICL, isocitrate lyase; PPC, phosphoenolpyruvate carboxylase; PDH, pyruvate dehydrogenase; PGCD, phosphoglycerate dehydrogenase; PGK, phosphoglycerate kinase; PYK, pyruvate kinase; DDPA, 3-deoxy-D-arabino-heptulosonate 7-phosphate synthetase; TKT, transketolase; SUCOAS, succinyl-CoA synthetase; FUM, fumarase; ACCOAC, acetyl-CoA carboxylase; GAPD, glyceraldehyde-3-phosphate dehydrogenase.

The abovementioned interventions suggested by k-OptForce are geared towards circumventing upper bounds on maximum enzyme activities (i.e., no more than 10-fold). However, limits on metabolite concentrations also play a significant role in restricting flux towards succinate. The maximum yield of succinate suggested by k-OptForce (1.2 mol/mol glucose, 80% of theoretical maximum) is less than the one suggested by OptForce (1.3 mol/mol glucose, 90% of theoretical maximum). This is because, as ICL is up-regulated, the concentrations of intermediates pep and icit increase, reaching twice their reference values. As these metabolites are competitive inhibitors of ICL, the maximum flux through the pathway towards succinate is restricted. In addition, to alleviate the regulatory effect of malate (mal) on the activity of PPC, k-OptForce also proposed a 10-fold down-regulation of the enzymes that catalyze mal production, fumarase (FUM), or succinate dehydrogenase (SUCD). Likewise, k-OptForce suggests removal of transketolase (TKT2) to alleviate the inhibition of 6-phospho-D-gluconate (6pgc) on glucose-6-phosphate isomerase (PGI) to improve the glycolytic flux towards succinate, which also reduces the production of biomass precursors.
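Why a rising inhibitor concentration caps the benefit of enzyme up-regulation can be seen with a textbook competitive-inhibition rate law. The sketch below is a generic illustration and not the elementary-step description used in the kinetic model; the parameter values (all set to one in arbitrary units, with a ninefold increase in maximal rate and a doubled inhibitor level) are hypothetical and chosen only to echo the fold-changes mentioned above.

```python
def competitive_rate(vmax, substrate, km, inhibitor, ki):
    """Michaelis-Menten rate with a competitive inhibitor."""
    return vmax * substrate / (km * (1.0 + inhibitor / ki) + substrate)


# Reference state (arbitrary units; values are illustrative assumptions).
reference = competitive_rate(vmax=1.0, substrate=1.0, km=1.0, inhibitor=1.0, ki=1.0)

# Ninefold enzyme up-regulation with the inhibitor held at its reference level.
upregulated = competitive_rate(vmax=9.0, substrate=1.0, km=1.0, inhibitor=1.0, ki=1.0)

# Same up-regulation, but the inhibitor concentration has doubled.
inhibited = competitive_rate(vmax=9.0, substrate=1.0, km=1.0, inhibitor=2.0, ki=1.0)

print(upregulated / reference)  # fold-change in flux without inhibitor build-up
print(inhibited / reference)    # smaller fold-change once the inhibitor doubles
```

With these numbers the flux gain falls from ninefold to 6.75-fold, which shows in miniature how concentration limits, and not only enzyme capacity, can hold the kinetic prediction below the stoichiometric one.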

Most of the k-OptForce interventions were consistent with engineering efforts aimed at improving succinate production under aerobic condition. For example, up-regulation of ICL and removal of SUCD and ICDH activities improved succinate yield in *E. coli* to 0.5 mol/mol glucose (Lin et al., 2005b). Further improvements in succinate production (up to 0.7 mol/mol glucose) have been achieved by up-regulation of PPC (Lin et al., 2005a). Notably, the same interventions improved aerobic succinate production in *C. glutamicum* to 0.5 mol/mol glucose (Litsanov et al., 2012). Similar to proportional up-regulation of ENO and PPC that fixes the branching ratio of the metabolic flux at pep, regulation of pep to pyruvate in the phosphotransferase system (PTS) reaction for glucose uptake was suggested to reduce the accumulation of intermediates (pyruvate and acetate) and improve succinate yield (Lin et al., 2005a). k-OptForce, however, fails to capture the accumulation of acetate upon up-regulation of PPC and glyoxylate shunt (Lin et al., 2005a; Zhu et al., 2013). This may be because no fluxomic data for mutant strains with anaplerotic/glyoxylate shunt up-regulations were included during kinetic model parameterization. As a result, the kinetic model is unaware of the up-regulation that leads towards increased acetate production. Interestingly, k-OptForce routes glyoxylate (formed by the ICL reaction) back to 2pg using the glycerate pathway instead of the malate synthase (MALS) reaction. This pathway improves the yield of succinate since it reduces the overall loss of carbon flux to carbon dioxide. This pathway was engineered in *E. coli* (Hubbard et al., 1998; Osterhout et al., 2011) for the production of ethylene glycol and glucarate consumption, respectively, but remains to be explored for succinate overproduction.

#### **OVERPRODUCTION OF SUCCINATE UNDER ANAEROBIC CONDITION**

Under fermentative condition the electron transport chain is not active, thus preventing the oxidation of the cofactor NADH, generated primarily in the glyceraldehyde 3-phosphate dehydrogenase (GAPD) reaction in glycolysis, back to NAD. Without an adequate NADH sink, a significant amount of metabolic flux is routed towards fermentative products such as ethanol, acetate, lactate, and formate to restore redox balance and cellular growth. Therefore, the general strategy for succinate overproduction is to eliminate all competitive fermentative pathways while pushing more flux towards succinate through the glyoxylate shunt and reversing the reductive branch of the TCA cycle (see **Figure 4**). This flux re-direction also regenerates NAD, thus simultaneously coupling succinate production with biomass generation.

In contrast to the aerobic case, k-OptForce suggestions for the anaerobic overproduction of succinate are less accurate compared to OptForce predictions. OptForce requires only five interventions to achieve a succinate yield of 1.42 mol/mol glucose. However, k-OptForce suggests a maximum yield of only 1.08 mol/mol glucose even after nine interventions. While k-OptForce recapitulates some of the interventions identified by OptForce (e.g., threefold up-regulation of the glyoxylate pathway enzymes ICL and MALS), the remaining suggestions deviate from OptForce and proven engineering strategies. The sources of these discrepancies can be traced back to incompatible parameterization of the kinetic model for the anaerobic case. First, due to absence of sufficient flux data in the parameterization procedure, the kinetic model was not tuned to capture reversal of the reductive branch of the TCA cycle necessary for succinate overproduction. k-OptForce suggests up-regulation of all three enzymes of the reductive branch [i.e., malate dehydrogenase (MDH), FUM, and fumarate reductase (FRD)]. However, even after a 6.5-fold up-regulation in MDH activity and 10-fold up-regulation in FUM, only 80% of the anaplerotic flux (57 mmol gDW<sup>−1</sup> h<sup>−1</sup>) goes towards succinate, while the remaining amount (11 mmol gDW<sup>−1</sup> h<sup>−1</sup>) uses the aspartate metabolism to bypass MDH and FUM (see **Figure 4B**).

The kinetic model also fails to capture the metabolic transition of *E. coli* central metabolism from aerobic to anaerobic condition due to lack of regulatory information (Salmon et al., 2003, 2005). Under anaerobic condition, PP pathway, PPC, and TCA cycle are repressed, while glycolysis and, in particular, fermentative pathways are up-regulated (Perrenoud and Sauer, 2005; Cho et al., 2006). In addition, pyruvate dehydrogenase (PDH) is deactivated while PFL carries most of the flux from pyruvate to acetyl-CoA (Partridge et al., 2006). Even though the kinetic model captures down-regulation of TCA cycle upon removal of oxygen, it cannot capture the remaining changes. Unable to capture the repression of PPC [anaerobic PPC flux is one-tenth of aerobic flux (Choudhary et al., 2011)], k-OptForce does not suggest any up-regulation in its activity to push more flux from pep towards oaa, contrary to the OptForce suggestion of a minimum 15-fold up-regulation in PPC flux (8.4–133.3 mmol gDW<sup>−1</sup> h<sup>−1</sup>). In contrast, failing to recognize the regulatory activation of PFL under anaerobic condition, k-OptForce suggests a minimum eightfold up-regulation in its activity, while OptForce requires no such intervention. Unable to recognize the up-regulation of the enzyme activities in the fermentative pathways in the reference (non-engineered) strain, k-OptForce does not suggest any down-regulations since the parameterization of the enzymes does not allow a significant amount of flux towards ethanol, acetate, and lactate. In contrast, OptForce requires the removal of lactate dehydrogenase (LDH), alcohol dehydrogenase (ALCD), and acetaldehyde dehydrogenase (ACALD) to prevent diverting pyruvate flux away from succinate. Surprisingly, k-OptForce suggests a fivefold up-regulation in ACALD activity to maintain NAD/NADH redox balance. A large fraction of the produced acetaldehyde is reduced to ethanol (46 mmol gDW<sup>−1</sup> h<sup>−1</sup>), while the rest is exported out of the cell (3 mmol gDW<sup>−1</sup> h<sup>−1</sup>). However, we note that as no information capturing the effect of acetaldehyde on cell fitness was included in the kinetic model, it is unable to capture the chemical's toxicity. k-OptForce also suggests a minimum 1.5-fold up-regulation in triose phosphate isomerase (TPI) activity and a twofold up-regulation in GAPD or phosphoglycerate kinase (PGK) activity to route additional PP pathway flux through glycolysis, even though the PP pathway is negligibly active in anaerobic condition (Choudhary et al., 2011). It is to be noted here that down-regulation of TKT2 for aerobic overproduction of succinate and up-regulation of GAPD for the anaerobic case are not equivalent interventions even though both strategies do increase glycolytic flux. This is because the flux distribution in the pay-off phase of glycolysis, which is different in the two cases, affects the metabolite concentrations of the preparatory phase of glycolysis. Up-regulation of ENO in the aerobic overproduction study pulls additional metabolic flux down from upper glycolysis in addition to TKT2 removal. In the absence of ENO up-regulation, removal of TKT2 cannot reroute the entire amount of PP flux towards glycolysis. As a result, up-regulation of both GAPD and PGK (and TPI) is necessary. It is also to be noted that the inactivation of PDH (and the subsequent activation of PFL) in anaerobic condition affects the reactions preceding it.

Comparison with experimental studies shows that, unlike in the aerobic case, most of the verified engineering strategies are consistent with OptForce suggestions. k-OptForce overlooks key interventions, such as up-regulation of PPC and removal of fermentative pathways, that were identified to have the largest impact in enhancing succinate yield (Millard et al., 1996; Zhang et al., 2009). In addition, even in cases where k-OptForce correctly identifies interventions, such as up-regulation of MDH, FUM, and FRD, inaccurate parameterization results in yield predictions far below the experimentally observed succinate yield [1.08 vs. 1.2–1.6 mol/mol glucose with fewer interventions (Cao et al., 2013)]. In other cases, untested interventions such as up-regulation of PFL most likely will not improve succinate yield, considering that the deletion of *pflB* was found to improve succinate yield (Sanchez et al., 2005; Wu et al., 2007).

## **DISCUSSION**

In this study, we compared the performance of k-OptForce in predicting interventions for overproduction of succinate in *E. coli* under both aerobic and anaerobic conditions. k-OptForce predictions under aerobic conditions were found to be much more consistent with experimental strain-design strategies than the stoichiometry-only OptForce predictions. In contrast, interventions for succinate overproduction under anaerobic conditions by k-OptForce led to significantly less promising strategies that were largely inconsistent with experimental observations. This indicates that kinetic models have the potential to substantially outperform FBA predictions when parameterized under the same (or similar) conditions, but they may perform worse than FBA when asked to predict a significantly different metabolic phenotype. Note that the two-step strategy of the k-OptForce procedure does not affect the optimality of the results for the aerobic case, as all interventions were identified from the kinetic part of the model. The flux distribution in the stoichiometric part of the model, which is determined by the worst-case inner problem, was effectively locked by the kinetic expressions. In general, however, we may miss better intervention strategies (for example, in the anaerobic case study) when implementing the two-step approach as a tradeoff for improving computational performance.

The kinetic model was successful in capturing the underlying kinetic regulation when the flux re-distribution was consistent with the mutant flux information used for parameterizing the kinetic model. For example, the effects of enzymatic interventions around glycolysis and the TCA cycle were identified with reasonable accuracy in both anaerobic and aerobic cases. Under aerobic conditions, the kinetic model successfully captures the need for equimolar amounts of acetyl-CoA and oaa to supply the TCA cycle while preventing accumulation of intermediates (Lin et al., 2005a). Even when the kinetic model failed to correctly quantify fluxes, it provided a qualitative basis for making the right interventions. For example, k-OptForce correctly identifies that up-regulation of MDH, FUM, and FRD improves succinate production under anaerobic conditions, even though it over-estimates the kinetic bottleneck towards such a flux reversal, resulting in poorer yields than experimental observations. Note that the developed kinetic model cannot capture changes in glucose uptake rate for different environmental and/or genetic backgrounds, as all mutant fluxes used to train the model were scaled with the corresponding glucose uptake. Shortcomings in the model could be rectified by re-parameterizing the model using additional fluxomic information of mutant strains that allow for pathway reversal [e.g., using metabolic flux analysis information of a ∆SUCD strain (Li et al., 2006)]. In general, the re-parameterization is a compromise between model scope and accuracy. The observations showed that parameterizing the kinetic model with mutant data located in the proximity of a target product provides more accurate flux distribution predictions and consequently results in the identification of more targeted interventions using the k-OptForce procedure. In contrast, integration of a wide range of conditions with limited experimental data for model training may provide a better global qualitative agreement. While one could use separate kinetic models for aerobic and anaerobic conditions, ideally we would like a single model parameterization that could reproduce both aerobic and anaerobic responses. With two separate aerobic and anaerobic models, it becomes unclear which model to use under micro/partial aerobic conditions (Partridge et al., 2007).

This study shows that the model does not retain fidelity of predictions when growth is switched from aerobic to anaerobic conditions. The aerobic to anaerobic metabolic transition is mainly controlled at the transcriptional level (Kochanowski et al., 2013) by the activities of the global regulatory proteins FNR and ArcA (see **Table 2**). In the absence of such regulatory interactions, the kinetic model could not capture the activation of PFL and the fermentative pathways, or the deactivation of PPC and (to a small extent) the PP pathway. As a result, k-OptForce failed to identify key down-regulations (e.g., LDH, ALCD) in the former case, while suggesting unnecessary up-regulations for the latter. These shortcomings are harder to address and require the incorporation of adequate regulatory information into the model (see **Table 2** for details) to capture the aerobic to anaerobic transition.



SUCOAS, succinyl-CoA synthetase; SUCD, succinate dehydrogenase; FUM, fumarase; MDH, malate dehydrogenase; PDH, pyruvate dehydrogenase; ACONT, aconitase; CS, citrate synthase; ICDH, isocitrate dehydrogenase; PFL, pyruvate formate lyase; NDH, NADH dehydrogenase.

In general, this study revealed some of the strengths and limitations of kinetic model-driven strain design. It demonstrated the need to carry out model parameterization for a diverse range of genetic/environmental perturbations (Khodayari et al., 2014) and the need for tight integration of transcriptional-level along with substrate-level regulatory interactions. At a fundamental level, kinetic models must be provided *a priori* with a quantitative description of as many as possible of the regulatory switches that become active in response to genetic or environmental perturbations. This richness in mechanistic information enables a detailed description of metabolism that captures dynamics, enzyme activities, and metabolite concentrations, but it can lead to erroneous predictions due to missing and/or incorrect modeling assumptions. Nevertheless, by studying failure modes of kinetic models, valuable information can be uncovered for restoring prediction consistency for new phenotypes.

## **AUTHOR CONTRIBUTIONS**

Conceived and designed experiments: Costas D. Maranas, Anupam Chowdhury, Ali Khodayari. Performed the experiments: Anupam Chowdhury and Ali Khodayari. Analyzed the data: Anupam Chowdhury, Ali Khodayari, Costas D. Maranas. Contributed reagents/materials/analysis tools: Anupam Chowdhury, Ali Khodayari, Costas D. Maranas. Wrote paper: Ali Khodayari, Anupam Chowdhury, Costas D. Maranas.

## **ACKNOWLEDGMENTS**

The authors gratefully acknowledge funding from the NSF (http://www.nsf.gov/) award no. EEC-0813570 and the DOE (http://www.energy.gov/) grant no. DE-SC10822882. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 03 September 2014; accepted: 05 December 2014; published online: 05 January 2015.*

*Citation: Khodayari A, Chowdhury A and Maranas CD (2015) Succinate overproduction: a case study of computational strain design using a comprehensive Escherichia coli kinetic model. Front. Bioeng. Biotechnol. 2:76. doi: 10.3389/fbioe.2014.00076*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology.*

*Copyright © 2015 Khodayari, Chowdhury and Maranas. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Improving collaboration by standardization efforts in systems biology

## **Andreas Dräger<sup>1,2</sup>\* and Bernhard Ø. Palsson<sup>1</sup>**

<sup>1</sup> Systems Biology Research Group, Department of Bioengineering, University of California, San Diego, La Jolla, CA, USA

<sup>2</sup> Cognitive Systems, Center for Bioinformatics Tübingen (ZBIT), Department of Computer Science, University of Tübingen, Tübingen, Germany

#### **Edited by:**

Daniel Machado, University of Minho, Portugal

#### **Reviewed by:**

Padraig Gleeson, University College London, UK; Taishin Nomura, Osaka University, Japan; Herbert M. Sauro, University of Washington, USA

#### **\*Correspondence:**

Andreas Dräger, University of California, Atkinson Hall, Room 2506, 9500 Gilman Drive # 0412, La Jolla, CA 92093-0412, USA; e-mail: andraeger@eng.ucsd.edu

Collaborative genome-scale reconstruction endeavors of metabolic networks would not be possible without a common, standardized formal representation of these systems. The ability to precisely define biological building blocks together with their dynamic behavior has even been considered a prerequisite for upcoming synthetic biology approaches. Driven by the requirements of such ambitious research goals, standardization itself has become an active field of research on nearly all levels of granularity in biology. In addition to the originally envisaged exchange of computational models and tool interoperability, new standards have been suggested for an unambiguous graphical display of biological phenomena, to annotate, archive, and rank models, and to describe the execution and the outcomes of simulation experiments. The spectrum now even covers the interaction of entire neurons in the brain, three-dimensional motions, and the description of pharmacometric studies. Thereby, the mathematical description of systems and approaches for their (repeated) simulation are clearly separated from each other and also from their graphical representation. Minimum information definitions constitute guidelines and common operation protocols in order to ensure reproducibility of findings and a unified knowledge representation. Central database infrastructures have been established that provide the scientific community with persistent links from model annotations to online resources. A rich variety of open-source software tools thrives for all data formats, often supporting a multitude of programing languages. Regular meetings and workshops of developers and users lead to continuous improvement and ongoing development of these standardization efforts. This article gives a brief overview of the current state of the growing number of operation protocols, mark-up languages, graphical descriptions, and fundamental software support with relevance to systems biology.

**Keywords: model formats, modeling guidelines, ontologies, model databases, network visualization, software support**

#### **1. INTRODUCTION**

**Abbreviations:** ANSI, American National Standards Institute; API, application programing interface; BRAIN, brain research through advancing innovative neurotechnologies; CAD, computer-aided design; COPASI, complex pathway simulator; CSS, cascading style sheets; DAE, differential-algebraic equation; DIN, Deutsches Institut für Normung; FBA, flux balance analysis; fbc, flux balance constraints; GO, gene ontology; HTML, hyper text mark-up language; IEEE, Institute of Electrical and Electronics Engineers; IETF, internet engineering task force; ISML, *in silico* mark-up language; JSON, JavaScript object notation; KiSAO, kinetic simulation algorithm ontology; LEMS, low entropy model specification; MAMO, mathematical modeling ontology; MIASE, minimum information about a simulation experiment; MIBBI, minimal information for biological and biomedical research; MIRIAM, minimal information required in the annotation of models; NCBI, National Center for Biotechnology Information; NuML, numerical mark-up language; OBO, open biomedical ontologies; ODE, ordinary differential equation; OMEX, open modeling exchange format; OMG, object management group; OSB, open-source brain; OWL, web ontology language; PDE, partial differential equation; PharmML, pharmacometrics mark-up language; PHML, physiological hierarchy mark-up language; RDF, resource description framework; SBGN, systems biology graphical notation; SBGN-ML, systems biology graphical notation mark-up language; SBML, systems biology mark-up language; SBOL, synthetic biology open language; SBRML, systems biology result mark-up language; SBW, systems biology workbench; SED-ML, simulation experiment description mark-up language; SVG, scalable vector graphics; SWIG, simplified wrapper and interface generator; TEDDY, terminology for the description of dynamics; URI, uniform resource identifier; W3C, world wide web consortium; XML, eXtensible mark-up language.

Since its emergence in the 1960s, systems biology has always been tightly related to the availability of powerful computational resources. While quick and simple script-based solutions were sufficient at the beginning of research in the field and its applications, the bar for publication and review has been drastically raised (Sauro et al., 2003). It has been realized that individual scripts that are specific to certain computational environments and not readily reproducible are of little benefit to the scientific community and to the progress of the field (Lloyd et al., 2004). The development of standardized data formats, models, and computational methods has paved the way toward the evolution and maturation of systems biology into a main-stream field of research (Macilwain, 2011). Sufficient annotation and metadata of models, experiments, and other data enhance the reproducibility of results (Wolstencroft et al., 2011). For individual areas of research, different models are required, and hence different standards for their encoding. Research in constraint-based modeling (Bordbar et al., 2014) deals with the encoding of the stoichiometric matrix and flux bounds, whereas dynamic metabolic modeling (Dräger and Planatscher, 2013a) is usually based on building ordinary differential equation systems, model calibration, and parameter estimation (Dräger et al., 2009a; Kronfeld et al., 2009; Dräger and Planatscher, 2013b). Spatial-temporal simulations require encoding three-dimensional geometries and partial differential equation systems (Moraru et al., 2008).
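To make the difference in encoding requirements concrete, the two model classes contrasted above can be written in their common textbook forms. This is only a notational sketch: the symbols $S$ (stoichiometric matrix), $v$ (flux vector), $c$ (objective weights), $v^{\mathrm{lb}}$ and $v^{\mathrm{ub}}$ (flux bounds), $x$ (concentrations), and $p$ (kinetic parameters) are assumptions introduced here and are not taken from the cited works.

$$
\begin{aligned}
\text{constraint-based (FBA):}\quad & \max_{v}\; c^{\top} v \quad \text{s.t.}\quad S\,v = 0,\quad v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}} \\
\text{dynamic (ODE):}\quad & \frac{\mathrm{d}x}{\mathrm{d}t} = S\,v(x, p)
\end{aligned}
$$

In the first case a format only has to store $S$, the bounds, and the objective; in the second it additionally has to store a rate expression $v(x, p)$ for every reaction together with parameter values and initial concentrations.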

It can hence be observed that the modeling community in systems biology has diversified. One reason for this development is that the main parts of funding for these standardization attempts originate from ambitious large-scale projects, each having different requirements. These efforts include, for example, the goal of reconstructing all reactions in specific organisms, such as human or yeast, resulting in giant reaction networks (Duarte et al., 2007; Herrgård et al., 2008; Rolfsson et al., 2011; Thiele et al., 2013), or systematically representing the complete knowledge about biochemical reactions available today (Büchel et al., 2013a). Trans-European projects like SysMO (Booth, 2007) want to comprehensively record and describe dynamic molecular processes in unicellular microorganisms and to present all processes in the form of computerized mathematical models. The German Virtual Liver Network (Holzhütter et al., 2012) aims to mathematically explain all phenomena in the human liver across multiple cell types and levels of organization. The Physiome project attempts to achieve a full quantitative description of all physiological dynamics and functional behaviors of the intact human body (Hunter and Borg, 2003). The US BRAIN (Brain Research through Advancing Innovative Neurotechnologies) Initiative aims to support the development of new technologies for classifying the anatomical constituents of the brain and to allow simultaneous recording from an unprecedented number of neurons. The EU's Human Brain Project seeks to develop the infrastructure for creating computational models of brain regions at multiple scales on high-performance computing platforms (Shepherd et al., 1998; Markram et al., 2011; Kandel et al., 2013). Thereby, medical applications become increasingly important (Büchel et al., 2013b; Grillner, 2014).

Common to all these consortia is that, with the increasing number of active researchers and collaborators, the exchange, reproduction, and accessibility of models, data, and further information in specific online databases play a major role (Brazma et al., 2006; Schellenberger et al., 2010; Wolstencroft et al., 2011; Yu et al., 2011; Chelliah et al., 2013). Just like the documentation of source code, the careful annotation of models and data is also necessary to achieve a fruitful collaboration. The more meta information that is provided, the easier the model can be comprehended, modified, simulated, and analyzed (Waltemath et al., 2013). The use of standard formats is highly recommended for the publication of results even if not required by the prospective journal.

In addition, new fields and areas of application are emerging, for instance, pharmacometric models or synthetic biology (Endler et al., 2009; Galdzicki et al., 2011; Müller and Arndt, 2012). There is therefore no one-size-fits-all solution that would be equally suitable for all fields of research. The standardization community therefore needs to continuously catch up with these developments in the actual modeling community and to reinvent itself over and over again. Recent approaches have suggested modularizing modeling languages by introducing highly specialized *packages* for modeling aspects that can otherwise not be represented in the main data format (Chaouiya et al., 2013).

The structure of how standards are defined has also matured. Brazma et al. (2006) describe that four steps are required for the development of a standard: (i) data and information need to be collected about the domain of interest that are relevant for an unambiguous transfer and interpretation as well as conceptual model design, (ii) the model needs to be formalized, (iii) an exchange format must be defined, and (iv) software support must be implemented. Nearly all modeling formats described in this article now follow this suggestion and are based on a minimum information requirement description (Taylor et al., 2008). These documents define what kind of information has to be stored in a respective model in order to guarantee that the model can be reused and understood by other researchers. In this way, the information requirement and the corresponding modeling standard are decoupled, exchangeable, and independent. The minimum information requirement is usually complemented with a specific ontology, i.e., a hierarchical collection of field-specific terms and their definitions (Courtot et al., 2011). These terms can be associated with model components and descriptions. In addition, elaborate and persistent annotation frameworks have been developed, which allow the modeler to precisely express what individual model components are and how they are to be understood (Juty et al., 2012, 2013). The development of standards, minimal information requirements, and ontologies needs to be orthogonal to existing respective standards. **Table 1** and **Figure 1** give an overview of the relationships among the various standards discussed in this article.

The structural representation of the model [for instance, SBML by Hucka et al. (2004) or CellML by Cuellar et al. (2006)], its application and analysis [SED-ML by Waltemath et al. (2011b)], its (graphical) display [SBGN guidelines by Le Novère et al. (2009)], and features should be accurately discriminated and encoded in distinct formats. Depending on the concrete modeling format, structural models can also include mathematical formulations, but not their interpretation framework (such as the algorithm to solve the model or the simulation end time). Recently, a new archive format has been proposed in order to link and distribute these independent modeling aspects all together in a single file (Bergmann et al., 2014).

**Table 1 | Standards with relevance for modeling in systems biology**.

**Figure 1 |** […] vocabularies, so-called ontologies and modeling guidelines build the basis for model encoding formats. These formats can refer to terms from ontologies and their organization is in accordance with the modeling guidelines. Recommendations for a visual representation of models as well as the execution of individual models in numerical simulation or optimization are separated from the structural models. Numerical results can be encoded in further standard data formats.

Much effort has also been invested in software support and the creation of infrastructures for diverse standards. For each data format, a specific library has been implemented for reading and writing files as well as for manipulating components of the format in memory (Bornstein et al., 2008; Miller et al., 2010; Demir et al., 2013). Often, language-bindings for diverse programing environments are provided, but sometimes specific libraries have been developed in order to support certain programing languages (Dräger et al., 2011). These parsing libraries help developers to use and exploit the individual standards. Often these libraries provide interfaces to corresponding ontologies and controlled vocabulary annotations (Courtot et al., 2011). However, the interpretation, analysis, drawing, etc. of models cannot be facilitated by these libraries. Higher level software has been implemented to support model building, display, simulation, etc. (Deckard et al., 2006; Keller et al., 2013). Sometimes, this is done in the form of plug-ins to more general frameworks, and often there are diverse standalone or web-based tools for various purposes (König et al., 2012; Krause et al., 2013).

When the first XML- or OWL-based exchange formats for models were proposed, developers of existing software tools were often involved, and their individual software was adapted in order to fit the standard. Nowadays, with many standards being well established, software is specifically tailored with respect to the standards. The stringent elaboration and clear distinction between models, purpose, simulation, and annotation can also be a source of inspiration for young researchers who enter the field. In the long term, using standard formats can lower the expenses for software development because they allow the reuse of existing tools in new applications. Moreover, with the many available tools for standard formats, less research time is needed for the interconversion of tool-specific files, making it much easier to collect information from diverse sources (Demir et al., 2010).

While international and national standardization bodies, such as OMG, W3C, IEEE, ANSI, IETF, DIN, etc., usually approve standards and release specifications, the situation is different in systems biology, where *de facto* standards are established by the scientific community (Brazma et al., 2006). The fast-moving nature and ongoing development of research make this approach necessary.

However, keeping track of the growing number of model formats and standards for diverse purposes has become more and more difficult. This review article gives a broad overview of a wide range of currently existing modeling standards, formats, and online repositories, and a selection of software solutions for systems biology and related fields of research. The aim of this article is to highlight specific standards, their usability, and application in order to give the reader an up-to-date picture of model definition, encoding, and availability in systems biology.

### **2. MATERIAL AND METHODS**

#### **2.1. MODELING GUIDELINES**

Modeling formats give us the syntax of models (Juty et al., 2012). In order to enhance accessibility of data and to facilitate the reuse of models, several modeling guidelines have been proposed, which are discussed in this section. These guidelines are often called "Minimum Information of/for," which should express that without at least this form of information optimal use and reproducibility of results cannot be guaranteed. More information can always be provided on top of the minimal requirements. The guidelines are hence a form of checklist that describes which kind of information to include, and they often go back to the idea of the MIBBI project (Minimal Information for Biological and Biomedical research) proposed by Taylor et al. (2008). The open biomedical ontologies (OBO) foundry<sup>1</sup> maintains orthogonal (non-overlapping) collections of controlled vocabularies, which provide the semantics for models. The most well-known ontology is probably the gene ontology (GO) by Ashburner et al. (2000).

## **2.1.1. Minimum information required in the annotation of models**

<sup>1</sup>http://obofoundry.org

Reuse of models can be compromised if inconsistent identifier systems are used for individual components. For instance, when merging models, it is necessary to match overlapping components. If a molecule is identified as water in one model and as H<sub>2</sub>O in another, such a matching is already difficult for automated procedures. To solve such problems, the minimum information required in the annotation of models (MIRIAM) guidelines have been proposed as a general model curation checklist (Le Novère et al., 2005). The MIRIAM registry (Laible and Le Novère, 2007) goes further and provides a connection between controlled vocabularies (Courtot et al., 2011) and formats, tools, and databases. Most modeling standards provide mechanisms to attach MIRIAM annotations to their components. These annotations are structured based on a subject-predicate-object scheme. Here, the subject is the identifier of the model element. The predicate is one of several predefined qualifiers, e.g., hasPart or is. The object should be a web resource pointing to an identifiers.org address (Juty et al., 2012, 2013), for instance http://identifiers.org/kegg.compound/C00001 for water. This Uniform Resource Identifier (URI) is therefore composed of the prefix identifiers.org/, the definition of the data collection (in this example kegg.compound), followed by the delimiter / and finally the record identifier (here C00001). Using such an identifiers.org address instead of directly pointing to an entry in ChEBI (Brooksbank et al., 2013; Hastings et al., 2013), MetaCyc (Caspi et al., 2014), KEGG (Kanehisa and Goto, 2000), or any other of the more than 30 currently supported data collections has several advantages. Should the original resource location or address schema change, the identifiers.org site will point to the new location. identifiers.org also measures the uptime of mirror servers for identical records and preferably directs to the most reliable mirror.
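The URI scheme and the subject-predicate-object structure described above can be sketched in a few lines of Python. This is an illustrative sketch only: the class `MiriamAnnotation`, the helper functions, and the element identifier `species_h2o` are hypothetical names introduced here, and real tools would normally rely on the annotation support of a library such as libSBML or JSBML rather than on hand-written string handling.

```python
from typing import NamedTuple, Tuple

# Prefix used in the example URI quoted in the text.
IDENTIFIERS_ORG_PREFIX = "http://identifiers.org/"


class MiriamAnnotation(NamedTuple):
    """Hypothetical container for one subject-predicate-object triple."""
    subject: str    # identifier of the model element
    predicate: str  # qualifier such as "is" or "hasPart"
    obj: str        # identifiers.org URI of the referenced record


def build_uri(collection: str, record: str) -> str:
    """Compose prefix + data collection + "/" + record identifier."""
    return f"{IDENTIFIERS_ORG_PREFIX}{collection}/{record}"


def parse_uri(uri: str) -> Tuple[str, str]:
    """Split an identifiers.org URI into (data collection, record identifier)."""
    if not uri.startswith(IDENTIFIERS_ORG_PREFIX):
        raise ValueError(f"not an identifiers.org URI: {uri!r}")
    remainder = uri[len(IDENTIFIERS_ORG_PREFIX):]
    collection, delimiter, record = remainder.partition("/")
    if not (collection and delimiter and record):
        raise ValueError(f"missing collection or record in: {uri!r}")
    return collection, record


# "species_h2o" is a made-up element identifier used for illustration.
water = MiriamAnnotation("species_h2o", "is",
                         build_uri("kegg.compound", "C00001"))
print(water.obj)
print(parse_uri(water.obj))
```

Matching two model components then reduces to comparing the parsed (collection, record) pairs instead of free-text names such as "water" and "H2O".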

#### **2.1.2. Minimum information about a simulation experiment**

The minimum information about a simulation experiment (MIASE) project (Waltemath et al., 2011a) aims to unambiguously define how to reproduce the results of a model simulation. For stochastic models, the results should be within an acceptably small range of the original results, and for deterministic models, the results should be identical. This requirements checklist also supports the review process of scientific publications. Relevant ontologies (Courtot et al., 2011) for MIASE are the kinetic simulation algorithm ontology (KiSAO), which defines the method to use, the terminology for the description of dynamics (TEDDY), and the mathematical modeling ontology (MAMO).
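The reproducibility criterion stated above can be illustrated with a small check. The following is a sketch under stated assumptions, not part of MIASE itself: the function name `reproduces` and the tolerance values are hypothetical, and MIASE does not prescribe what counts as an "acceptably small range".

```python
import math
from typing import Sequence


def reproduces(original: Sequence[float],
               repeated: Sequence[float],
               stochastic: bool,
               rel_tol: float = 0.05,
               abs_tol: float = 1e-9) -> bool:
    """Compare a repeated simulation result against the original one.

    Deterministic runs must match exactly; stochastic runs only need to
    agree within the (assumed) tolerances at every time point.
    """
    if len(original) != len(repeated):
        return False
    if not stochastic:
        return all(a == b for a, b in zip(original, repeated))
    return all(math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
               for a, b in zip(original, repeated))


# Hypothetical time-course values for one species.
print(reproduces([1.0, 2.0], [1.0, 2.0], stochastic=False))
print(reproduces([1.0, 2.0], [1.01, 1.98], stochastic=True))
```

In practice, exact equality for deterministic runs may be too strict across platforms and solvers, so tools often compare against reference results with small tolerances in that case as well.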

### **2.1.3. Ontologies**

*2.1.3.1. Kinetic simulation algorithm ontology.* The KiSAO gathers computational methods that can be used to simulate a model in a certain way (Courtot et al., 2011). It contains, for instance, definitions of several differential equation solvers for numerical calculations. Organizing these algorithms in a hierarchical structure allows tools to automatically select the most similar solver within their collection of implemented methods.
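How a hierarchy allows a tool to fall back to the most similar solver can be sketched as follows. The toy hierarchy below is invented for illustration: its term names are not actual KiSAO entries, and measuring similarity as the path length through the closest common ancestor is one plausible choice assumed here, not a rule defined by KiSAO.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical toy hierarchy (child -> parent); not real KiSAO terms.
PARENT: Dict[str, Optional[str]] = {
    "simulation algorithm": None,
    "ODE solver": "simulation algorithm",
    "stochastic solver": "simulation algorithm",
    "explicit Runge-Kutta": "ODE solver",
    "implicit multistep": "ODE solver",
    "RK4": "explicit Runge-Kutta",
    "RK45": "explicit Runge-Kutta",
    "BDF": "implicit multistep",
    "Gillespie direct": "stochastic solver",
}


def path_to_root(term: str) -> List[str]:
    """Return the term followed by all of its ancestors up to the root."""
    path = []
    node: Optional[str] = term
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def distance(a: str, b: str) -> int:
    """Number of edges between two terms via their closest common ancestor."""
    steps_from_b = {term: i for i, term in enumerate(path_to_root(b))}
    for steps_from_a, term in enumerate(path_to_root(a)):
        if term in steps_from_b:
            return steps_from_a + steps_from_b[term]
    raise ValueError("terms share no common ancestor")


def most_similar(requested: str, implemented: Iterable[str]) -> str:
    """Pick the implemented solver closest to the requested one."""
    # Sorting first makes ties deterministic (alphabetical order).
    return min(sorted(implemented), key=lambda t: distance(requested, t))


# A tool that lacks RK45 but implements three other methods.
print(most_similar("RK45", {"RK4", "BDF", "Gillespie direct"}))
```

With this toy hierarchy, a request for an unavailable explicit Runge-Kutta variant is served by a sibling method rather than by an implicit or stochastic solver.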

*2.1.3.2. SBO.* The Systems Biology Ontology is a collection of terms that describe the structure of a model, its components, modeling frameworks, and processes (Courtot et al., 2011). By using terms from this ontology, the semantics of individual parts of a model can be made explicit. This is often of particular importance if elements can participate in processes where they can have multiple roles, such as catalysts or inhibitors.

*2.1.3.3. Mathematical modeling ontology.* The recently developed ontology (MAMO, see http://bioportal.bioontology.org/ontologies/MAMO) has complemented and refined the *modeling framework* branch of SBO. Both ontologies are intended to cross link each other. While SBO mainly focuses on the entities and parameters in the model and describes the relationships among them, MAMO has been developed in order to precisely define and categorize *types* of mathematical models (e.g., *ODE*) and their characteristics (e.g., *discrete* vs. *continuous*) as well as types of readouts (such as *time-course analysis*) and variables (such as *dependent variable*).

*2.1.3.4. Terminology for the description of dynamics.* The TEDDY defines a formal way to specify how the numerical results of a dynamic system behave when a simulation experiment is conducted (Knüpfer et al., 2006; Courtot et al., 2011). In this way, a machine-readable representation of such a description can be automatically generated upon simulation and be stored along with the model. When querying a database of numeric results, this terminology can help to find models with a desired behavior, such as ongoing oscillations.

## **2.2. MODELING FORMATS**

Reconstructing computational models based on a textual description in a publication can be difficult, because required information, such as a clear definition of the units of all components, can be lacking, the language might be imprecise or ambiguous, or a combined explanation of simulation procedure and actual model can hamper the implementation of the model (Cooling, 2010; Dräger et al., 2010). In cases where models are distributed in the form of source code implemented for a specific run-time environment or programing language, executing these programs can be a challenge because of diverse dependencies on operating systems or required third-party libraries (Lloyd et al., 2004). In this section, we will discuss several formats that encode systems biological models in different ways with the aim to overcome this problem.

### **2.2.1. Systems biology mark-up language**

The Systems Biology Mark-up Language (Finney and Hucka, 2003; Hucka et al., 2003, 2004)<sup>2</sup> is a hierarchical XML-based format consisting of several lists of components, such as compartments (finite spaces), (reactive) species, parameters (constants or variables), reactions with kinetic laws, user-defined functions and rules, events, units, and many more. SBML has been developed as a model exchange format that covers a wide range of modeling approaches used today (Hucka et al., 2004), including dynamic and steady-state metabolic networks as well as gene-regulatory and signaling networks (Lambeck et al., 2010; Vlaic et al., 2013). The term *reaction* should no longer be seen as a strict (bio-)chemical reaction. It is rather a process with inputs and outcomes. Specific annotations with SBO terms and MIRIAM identifiers clarify the purpose of all elements. The reactions implicitly define a differential equation system, whose explicit structure needs to be assembled at simulation time or prior to simulation. The rationale behind this design decision is that the same model can be interpreted in terms of a different modeling framework, such as stochastic simulation, etc.
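The statement that reactions only implicitly define the equation system can be illustrated with a short sketch that reads the lists of species and reactions from a pared-down SBML-like document and assembles the stoichiometric matrix. Several assumptions apply: the toy model is invented for illustration and omits attributes and elements that a valid SBML file requires (for example compartments and kinetic laws), the rate values are made up, and only the Python standard library is used here, whereas real applications would normally use libSBML or JSBML.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Namespace of SBML Level 3 Version 1 core.
SBML_NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}

# Invented toy model: R1 converts A to B, R2 converts B to 2 C.
TOY_MODEL = """
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy">
    <listOfSpecies>
      <species id="A"/>
      <species id="B"/>
      <species id="C"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="R1">
        <listOfReactants><speciesReference species="A" stoichiometry="1"/></listOfReactants>
        <listOfProducts><speciesReference species="B" stoichiometry="1"/></listOfProducts>
      </reaction>
      <reaction id="R2">
        <listOfReactants><speciesReference species="B" stoichiometry="1"/></listOfReactants>
        <listOfProducts><speciesReference species="C" stoichiometry="2"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def stoichiometric_matrix(sbml_text: str) -> Tuple[List[str], List[str], List[List[float]]]:
    """Return species ids, reaction ids, and the matrix (species x reactions)."""
    model = ET.fromstring(sbml_text.strip()).find("sbml:model", SBML_NS)
    species = [s.get("id") for s in
               model.findall("sbml:listOfSpecies/sbml:species", SBML_NS)]
    reactions = model.findall("sbml:listOfReactions/sbml:reaction", SBML_NS)
    row = {sid: i for i, sid in enumerate(species)}
    matrix = [[0.0] * len(reactions) for _ in species]
    for j, reaction in enumerate(reactions):
        # Reactants are consumed (negative sign), products are formed.
        for list_name, sign in (("listOfReactants", -1.0), ("listOfProducts", 1.0)):
            path = f"sbml:{list_name}/sbml:speciesReference"
            for ref in reaction.findall(path, SBML_NS):
                coefficient = float(ref.get("stoichiometry", "1"))
                matrix[row[ref.get("species")]][j] += sign * coefficient
    return species, [r.get("id") for r in reactions], matrix


def rhs(matrix: List[List[float]], rates: List[float]) -> List[float]:
    """Right-hand side dx/dt = N * v for given reaction rates v."""
    return [sum(n * v for n, v in zip(row, rates)) for row in matrix]


species, reaction_ids, matrix = stoichiometric_matrix(TOY_MODEL)
print(species, reaction_ids)
print(matrix)
print(rhs(matrix, [2.0, 0.5]))  # made-up rates for R1 and R2
```

The same matrix could instead be handed to a stochastic simulator or a constraint-based solver, which is the reinterpretation across modeling frameworks that the design decision described above is meant to allow.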

The libraries libSBML (Bornstein et al., 2008) and JSBML (Dräger et al., 2011) facilitate the implementation of import and export functions of SBML models in customized software solutions. While libSBML provides bindings to a large variety of programming languages based on the wrapper generator SWIG (Beazley, 1996), the JSBML library has been specifically developed for the platform-independent Java™ language. Both libraries strive to attain a high degree of compatibility. Specific API libraries have also been implemented for working with SBML under MATLAB™ (Keating et al., 2006) and Mathematica™ (Shapiro et al., 2004, 2007).

<sup>2</sup>http://sbml.org

It has been recognized that the interpretation and simulation of SBML models can be quite challenging and that different simulation environments can yield divergent results on identical input files (Bergmann and Sauro, 2008). For this reason, a comprehensive test suite of manually created SBML models has been established including reference results. This test suite can be used as a benchmark test case for simulation routines.
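A benchmark of this kind boils down to comparing a simulated time course against reference values within a tolerance. The sketch below shows one common way to do this with a combined absolute and relative tolerance. The tolerance formula, the default tolerance values, and the function names are assumptions made for illustration and are not taken from the test suite itself.

```python
def within_tolerance(simulated, reference, abs_tol=1e-6, rel_tol=1e-4):
    """Accept a value if it deviates by at most abs_tol + rel_tol * |reference|."""
    return abs(simulated - reference) <= abs_tol + rel_tol * abs(reference)


def compare_time_courses(simulated, reference, abs_tol=1e-6, rel_tol=1e-4):
    """Return (species, time index) pairs at which the simulation disagrees."""
    failures = []
    for species, ref_values in reference.items():
        sim_values = simulated[species]
        if len(sim_values) != len(ref_values):
            raise ValueError(f"time courses for {species} differ in length")
        for index, (sim, ref) in enumerate(zip(sim_values, ref_values)):
            if not within_tolerance(sim, ref, abs_tol, rel_tol):
                failures.append((species, index))
    return failures


if __name__ == "__main__":
    reference = {"S1": [1.0, 0.5, 0.25]}
    simulated = {"S1": [1.0, 0.50001, 0.26]}
    print(compare_time_courses(simulated, reference))
```

A simulator would pass such a check only if the returned list is empty for every model of the benchmark.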

SBML handles the increasing diversification of modeling approaches and community requirements with the development of several specific and orthogonal packages, which can be used in addition to or separately from the core format. The following extension packages have already been released: hierarchical model composition (comp) (Smith et al., 2013b), flux balance constraints (fbc) (Orth et al., 2010; Bergmann and Olivier, 2013), three-dimensional arrangement of elements in diagrams (layout) (Gauges et al., 2006), and qualitative relationships (qual) (Chaouiya et al., 2013). Draft specifications are available for the following extensions: arrays, sampling of values from statistical distributions (distrib), dynamic creation and destruction of structures during a simulation (dyn), grouping of elements (groups), entity pools with multiple states and complex composition of species (multi), drawing graphical representations of a model (render), indication of those model elements that are changed by packages (req), and spatial processes and geometries (spatial). For an up-to-date list and more detailed explanation of available extension packages, see http://sbml.org/Community/Wiki.

#### **2.2.2. CellML**

The XML-based model storage and exchange format CellML<sup>3</sup> has been developed for the IUPS Physiome project with the aim to facilitate reuse of models or their components in a software-independent manner (Lloyd et al., 2004; Cooling, 2010). CellML eases the creation of new models based on parts of existing models and hence accelerates the cumbersome model building process (Cooling et al., 2010). CellML models contain structural information about the organization of the model (components, connections, and units), mathematical equations (arbitrary MathML) to quantitatively describe biological processes, and metadata that link model components to online resources. An important design feature of CellML allows components and parameters to be shared across models via import statements and well-defined interfaces. This also allows users to structure their models into multiple files, similar to what can be done with HTML pages, and increases reusability of individual black-box models, but also requires a strict decoupling of components. CellML uses RDF tags for semantic annotations and allows for hierarchical groupings of components. A set of software tools is available to edit CellML models, including an API implementation (Miller et al., 2010) and the graphical modeling environments OpenCell (Lloyd, 2013) and OpenCOR (Nickerson et al., 2013). CellML can be inter-converted from and to SBML and to the scripting language Antimony (Schilstra et al., 2006; Smith et al., 2013a). The rates of change of all components are explicit in CellML. When adding components or connections to a model, these rates of change would need to be updated. With the help of interfaces, modelers can avoid this cumbersome update process (Cooling, 2010).

#### **2.2.3. FieldML**

FieldML<sup>4</sup> is an XML-based model interchange standard, which has been developed with a focus on the euHeart and Physiome projects and is currently available in version 0.5 (Britten et al., 2013). The main purpose of the format is to encode geometric models in explicit or implicit mathematical form with respect to biological and medical phenomena with spatial-temporal variation, such as the simulation of power fields and gradients. FieldML focuses on fields over multiple discrete indices and multivariate fields with discrete or continuous variables as well as interpolation functions. With these approaches, it is possible to model muscle contraction as part of cardiac mechanics, blood flows, and other multi-scale processes. Other applications include the modeling of patient-specific clinical images with the help of specific annotations and fitting of models to fields. Similar approaches are also planned for the spatial extension for SBML (Schaff et al., 2013). A powerful C++ API with wrappers for Java, Fortran, and Python as well as a software plug-in for the physiome model repository (PMR) support FieldML and provide several high-level functions for model building and simulation (Yu et al., 2011). Version 0.5 already includes model composition over multiple files and data sources.

#### **2.2.4. BioPAX**

The motivation for the creation of the BioPAX<sup>5</sup> format (Biological Pathway Exchange) was the aim to unify the various co-existing pathway encoding formats of numerous online databases (Demir et al., 2010). This format is intended to facilitate the communication between diverse software systems and also serves as a common knowledge representation of pathways. With BioPAX the structure of metabolic, signaling, and gene-regulatory pathways can be encoded, including relationships between elements (such as genes or molecules) as well as diverse states (such as post-translational modifications). A growing number of pathway databases and software tools provide BioPAX files as import or export formats (Shannon et al., 2003; Funahashi et al., 2008; Demir et al., 2010; Kelder et al., 2011), and BioPAX is useful to integrate information from heterogeneous sources and to support visualization and analyses. The definition of BioPAX is the result of a continuous community effort. The BioPAX language is organized in levels that increasingly add features to the language definition. BioPAX is based on OWL and it is implemented as an ontology. An online validator can be used to check the correctness of BioPAX files. All elements within a BioPAX file can be annotated using controlled vocabularies and MIRIAM (Laible and Le Novère, 2007; Juty et al., 2012). For writing, reading, manipulating, and analyzing BioPAX files, the API library Paxtools (Demir et al., 2013)<sup>6</sup> has been created and is freely available. Quantitative relationships and temporal sequences of events do not belong to the objectives of BioPAX. However, since it is also possible to encode qualitative relationships in SBML (Chaouiya et al., 2013), BioPAX can be converted to SBML without loss of information (Büchel et al., 2012).

<sup>3</sup>http://www.cellml.org

<sup>4</sup>http://physiomeproject.org/software/fieldml

<sup>5</sup>http://www.biopax.org
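Since BioPAX is serialized as OWL in RDF/XML, even a generic XML parser can give a first impression of what a BioPAX file contains. The following sketch counts the classes in a small hand-written fragment and lists the participants of its reaction. The fragment is a toy example written for this illustration, the element names follow our reading of BioPAX Level 3 and should be checked against the specification, and real applications would rather rely on a dedicated library such as Paxtools.

```python
import xml.etree.ElementTree as ET
from collections import Counter

BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hand-written toy fragment (not taken from any database).
BIOPAX_SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
  <bp:SmallMolecule rdf:ID="glucose">
    <bp:displayName>glucose</bp:displayName>
  </bp:SmallMolecule>
  <bp:SmallMolecule rdf:ID="g6p">
    <bp:displayName>glucose 6-phosphate</bp:displayName>
  </bp:SmallMolecule>
  <bp:BiochemicalReaction rdf:ID="hexokinase_reaction">
    <bp:left rdf:resource="#glucose"/>
    <bp:right rdf:resource="#g6p"/>
  </bp:BiochemicalReaction>
</rdf:RDF>"""


def summarize(xml_text):
    """Count top-level BioPAX classes and list reaction participants."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    names = {}
    for element in root:
        counts[element.tag.split("}")[1]] += 1  # strip the namespace part
        label = element.findtext(f"{{{BP}}}displayName")
        if label is not None:
            names[element.get(f"{{{RDF}}}ID")] = label

    def participants(reaction, side):
        refs = reaction.findall(f"{{{BP}}}{side}")
        return [names[ref.get(f"{{{RDF}}}resource").lstrip("#")] for ref in refs]

    reactions = [
        (rxn.get(f"{{{RDF}}}ID"), participants(rxn, "left"), participants(rxn, "right"))
        for rxn in root.findall(f"{{{BP}}}BiochemicalReaction")
    ]
    return dict(counts), reactions


if __name__ == "__main__":
    print(summarize(BIOPAX_SNIPPET))
```

Note that the fragment carries no kinetic information at all, which reflects the statement above that quantitative relationships are outside the scope of BioPAX.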

#### **2.2.5. NeuroML**

The object-oriented mark-up language NeuroML (Gleeson et al., 2010)<sup>7</sup> has been developed as a standard to specifically encode, share, and store computational models of information transfer in neurosciences (Goddard et al., 2001). The aim of the language is to cover diverse structural levels beginning at individual neuron cell membranes and ranging to entire neural networks. This XML-based language encodes biophysically detailed neuronal and network models including ion channels, synapses, and the anatomical connectivity of neurons and how these elements underlie the complex electrical behavior of the brain (Gleeson et al., 2010). Therefore, from the very beginning, modularity, portability, and clarity were the main language requirements (Goddard et al., 2001). Supporting high-performance simulations and creating software frameworks for neuroinformatics are the aims of the language (Beeman, 2013). To this end, NeuroML 2 has been built on the Low Entropy Model Specification (LEMS) language (Cannon et al., 2014), which hierarchically defines structure and dynamics of a large variety of biological models. For parsing, writing, and manipulating NeuroML and LEMS files, the Python APIs libNeuroML and PyLEMS as well as the Java™ APIs jNeuroML and jLEMS are available (Vella et al., 2014). The original idea to link sub-modules of processes in NeuroML to models encoded in SBML or CellML (Gleeson et al., 2010) has since been further elaborated. The LEMS libraries allow users to import SBML models and can also export SED-ML (Waltemath et al., 2011b) files for reproducible simulation experiments. The main repository for NeuroML is Open-Source Brain (Gleeson et al., 2013).

#### **2.2.6. ISML and PHML**

The XML-based language ISML (*insilicoML*) allows users to describe biophysiological models that cross multiple scales and levels. This format is fully compatible with CellML 1.0, but incorporates a specific ontology of physiological functions (Asai et al., 2008). A large collection of models in ISML can be obtained from an online database at http://www.physiome.jp. The physiological hierarchy mark-up language (PHML) has been designed as the successor of ISML (Asai et al., 2013). PHML defines each biological or biophysical element as a module, which can be encapsulated and linked through ports. This concept hierarchically structures the language. Furthermore, PHML can integrate SBML models as sub-cellular phenomena (Asai et al., 2012).

#### **2.2.7. PharmML**

The Pharmacometrics Mark-up Language PharmML (Moodie et al., 2013)<sup>8</sup> belongs to the most recent languages in the family of XML-based standards for biomedical computation and is currently under development. The purpose of this new language is to exchange and store pharmacometric models, which includes studies, trials, simulations, estimation, and exploration. It will support metadata, non-linear mixed effects models, and model-based analysis, and it will serve as an encoding platform for new approaches and elements. The developers want to ensure backwards compatibility with existing relevant standards in order to use existing software tools. Use-case scenarios are, for instance, the kinetics of tumor growth, observation models, or trial design for treatment-dosing-related data.

#### **2.2.8. Synthetic biology open language**

The Synthetic Biology Open Language<sup>9</sup> also belongs to the latest modeling standards (Galdzicki et al., 2014). This RDF-based format has been designed in a community process in order to facilitate the creation of synthetic biology components by providing an exchange format for software tools. As a specialty, SBOL comes with a specific graphical representation for promoters, their regulators, and many additional genetic structures (see **Figure 2**).

#### **2.3. STANDARDS FOR MODEL SIMULATION PROCEDURES**

Defining the structure of a model does not give any information about reproducible simulation experiments. In order to perform the identical simulation of the model as described in a corresponding research article, the exact name of the numerical solving algorithm, step size, error tolerance, etc. must be precisely defined. The Simulation Experiment Description Mark-up Language (SED-ML)<sup>10</sup> provides a standardized, machine-readable, platform-independent data format for this purpose (Waltemath et al., 2011b). SED-ML follows the MIASE guidelines (Waltemath et al., 2011a) and hence enables users to attach both a model and the description of its intended use to a publication, which could also simplify review processes. It therefore contributes to the reproducibility aspect in science, where only stochastic approaches might diverge within a small range from published data. The XML-based language SED-ML is organized in levels and can describe multiple simulation experiments within the same file. Language components can be annotated using MIRIAM resources (Laible and Le Novère, 2007). A key idea of SED-ML is not to distribute concrete implementations of simulation procedures, but rather to use ontologies such as KiSAO (Courtot et al., 2011) to refer to the method and its settings. Since this ontology has a hierarchical structure, it is possible to apply related simulation algorithms in case a required method is not implemented in a certain software tool. Structural model changes prior to simulation and post-processing steps of the results (such as converting between amounts and concentration units) as well as the presentation of the output can also be defined (Waltemath et al., 2013). The model can, in principle, be encoded in an arbitrary standardized format and addressed through URI links. SED-ML does not provide an encoding of the simulation results themselves, but can be used in combination with numerical mark-up language (NuML) or SBRML (Dada et al., 2010). An extension to SED-ML has been proposed in order to also support sampling sensitivity analysis simulation experiments (Miller et al., 2012). Some simulation environments have already adopted this young format (Olivier et al., 2005; Myers et al., 2009; Kolpakov et al., 2011; Keller et al., 2013). A workflow editor (SED-ED), API libraries (libSedML, jlibSEDML), and a simplified scripting language (Antimony) are also available (Smith et al., 2009; Adams, 2012).

**FIGURE 2 | SBOL visual**. The horizontal bar represents a DNA molecule to which various features can be visually attached. Here, a few examples are applied for demonstration purposes. A full specification and an exhaustive list of all available symbols can be found online at http://www.sbolstandard.org/visual.

<sup>6</sup>http://www.biopax.org/paxtools/

<sup>7</sup>http://www.neuroml.org

<sup>8</sup>http://www.ddmore.eu

<sup>9</sup>http://www.sbolstandard.org

<sup>10</sup>http://sed-ml.org
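The benefit of referring to simulation algorithms through a hierarchical ontology, as SED-ML does with KiSAO, can be sketched with a small lookup: if the requested method is unavailable, a tool may fall back to an implemented method that shares a close ancestor. The hierarchy below is a hypothetical miniature with illustrative labels. It does not reproduce actual KiSAO terms or identifiers, and the function names are our own.

```python
# Hypothetical miniature hierarchy (child -> parent); labels are illustrative
# and are NOT actual KiSAO terms or identifiers.
PARENT = {
    "implicit multistep method": "deterministic ODE solver",
    "explicit Runge-Kutta method": "deterministic ODE solver",
    "deterministic ODE solver": "simulation algorithm",
    "exact stochastic method": "stochastic solver",
    "stochastic solver": "simulation algorithm",
}


def ancestors(term):
    """Return the chain of ancestors from the direct parent up to the root."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def choose_algorithm(requested, implemented, max_levels=1):
    """Pick the requested method or a related one within max_levels ancestors."""
    if requested in implemented:
        return requested
    for ancestor in ancestors(requested)[:max_levels]:
        related = sorted(t for t in implemented if ancestor in ancestors(t))
        if related:
            return related[0]
    return None  # no sufficiently related method is available


if __name__ == "__main__":
    available = {"explicit Runge-Kutta method", "exact stochastic method"}
    print(choose_algorithm("implicit multistep method", available))
```

Limiting the search with `max_levels` is a design choice of this sketch: climbing all the way to the root would allow a stochastic method to replace a deterministic one, which is rarely what a reproducible experiment intends.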

#### **2.4. GRAPHICAL MODEL REPRESENTATION FORMATS**

The visual representation of biochemical pathways has a long tradition. Displays of biological circuit diagrams and reaction pathways can be found in numerous textbooks and a plethora of publications. Databases such as KEGG (Kanehisa and Goto, 2000) or MetaCyc (Caspi et al., 2014) take this up and provide displays of biological networks in their specific layout and style, which follows many traditional aspects. In order to display and draw similar maps, several programs have been developed, for instance, CellDesigner (Funahashi et al., 2008), JDesigner (Sauro et al., 2003), TinkerCell (Chandran et al., 2009), VCell (Resasco et al., 2012), or Cytoscape (Shannon et al., 2003) with its diverse plug-ins (König et al., 2012; Gonçalves et al., 2013). We now discuss recommendations for the display of pathways and standardized data formats for exchanging these maps.

#### **2.4.1. SBGN and SBGN-ML**

The myriad of graphical notations that are being used can lead to confusion or ambiguity. The development of a unified and standardized notation has thus become necessary (Le Novère et al., 2009). The Systems Biology Graphical Notation<sup>11</sup> effort aims to make the display of biological networks exchangeable between software tools and at the same time to clearly define the meaning of specific nodes and arcs in such networks in order to ease their interpretation and automated processing. Therefore, the number of graphical symbols is intentionally limited in order to keep the learning curve flat and to create a visually, syntactically, and semantically consistent schema, which is modular in size and complexity (Le Novère et al., 2009). The SBGN neither defines layout (placement and adjustment) nor style (such as line thickness or color) of objects. In order to represent the current needs for such a display, it is organized in levels, so that in the future new versions can be proposed. The specifications of the SBGN are organized in three different languages, each of which has been designed for certain use-case scenarios and has inherent strengths and weaknesses. (i) In process-description diagrams (Kitano et al., 2005; Funahashi et al., 2008), the level of detail is very high and these maps show sequences of processes, which also involve temporal causality (see **Figure 3A**). These maps are well suited for metabolic pathways, but not for the consistent display of the combinatorial complexity of several proteins with many phosphorylation states (van Iersel et al., 2012). (ii) Activity flow charts (van Iersel et al., 2012) are much more abstract and neglect many molecular mechanisms. By design, these maps introduce a certain ambiguity and can hence be used to describe effects whose precise underlying mechanisms are either not known or not relevant (see **Figure 3B**). In this type of diagram, stimulation and inhibition, effects of perturbation, and the activity of components can be displayed. Activity flow charts are thus suitable for the display of causality chains (van Iersel et al., 2012). (iii) The entity-relationship diagrams (Kohn et al., 2006) are particularly useful when the temporal sequence of events does not play the main role, but precise molecular interactions are to be displayed (see **Figure 3C**). These maps are more concise than process diagrams for protein modifications and interactions, but less capable of representing reactions (van Iersel et al., 2012).

In order to specifically store and exchange SBGN maps in XML files, the mark-up language SBGN-ML has been developed (van Iersel et al., 2012). The main requirement for this format is its simplicity, i.e., it should be easy to draw and to interpret. Most significantly, SBGN-ML is not tied to any of the network representation standards. While this format does not include rendering information, it has been proposed to incorporate a rendering extension, similar to what can be done with SBML files. In contrast to the SBML layout extension, this format is focused on the concepts of SBGN only and can be validated against the SBGN specifications. The API library libSBGN<sup>12</sup> facilitates the import and export of SBGN-ML files. The code of libSBGN has been automatically created from an XML Schema Definition file (XSD), which significantly reduces the implementation effort, makes native language implementations in C++ and Java™ possible, and can be used for Schematron validation. A growing number of libSBGN-based software tools support the SBGN-ML format, such as the VANTED (Junker et al., 2006) plug-in SBGN-Ed (Czauderna et al., 2010), the Cytoscape (Shannon et al., 2003) plug-in CySBGN (Gonçalves et al., 2013), the online tool BioGrapher (Krause et al., 2013), the model generator KEGGtranslator (Wrzodek et al., 2013), or the visual editor CellDesigner (Funahashi et al., 2008).
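The following sketch gives an impression of how little is needed to read a map stored in an SBGN-ML-like XML file and to perform a simple consistency check, namely that every arc connects existing glyphs. The document is a hand-written toy example. Its namespace, element names, and class names reflect our understanding of SBGN-ML and are assumptions that should be verified against the official schema, and production code would use libSBGN instead.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; check the version against the official SBGN-ML schema.
NS = {"sbgn": "http://sbgn.org/libsbgn/0.2"}

# Hand-written toy map in process-description style.
SBGN_ML = """<sbgn xmlns="http://sbgn.org/libsbgn/0.2">
  <map language="process description">
    <glyph id="glc" class="simple chemical"><label text="glucose"/></glyph>
    <glyph id="g6p" class="simple chemical"><label text="G6P"/></glyph>
    <glyph id="hk" class="macromolecule"><label text="hexokinase"/></glyph>
    <glyph id="p1" class="process"/>
    <arc id="a1" class="consumption" source="glc" target="p1"/>
    <arc id="a2" class="production" source="p1" target="g6p"/>
    <arc id="a3" class="catalysis" source="hk" target="p1"/>
  </map>
</sbgn>"""


def check_map(xml_text):
    """Return the map language, its edges, and arcs with unknown end points."""
    root = ET.fromstring(xml_text)
    sbgn_map = root.find("sbgn:map", NS)
    glyph_ids = {glyph.get("id") for glyph in sbgn_map.findall("sbgn:glyph", NS)}
    edges, problems = [], []
    for arc in sbgn_map.findall("sbgn:arc", NS):
        for end in ("source", "target"):
            if arc.get(end) not in glyph_ids:
                problems.append((arc.get("id"), end, arc.get(end)))
        edges.append((arc.get("source"), arc.get("class"), arc.get("target")))
    return sbgn_map.get("language"), edges, problems


if __name__ == "__main__":
    print(check_map(SBGN_ML))
```

This check covers only referential integrity. Validation against the SBGN specifications, as mentioned above, additionally concerns which glyph and arc classes may be combined.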

#### **2.4.2. Visualization of CellML**

For CellML, a specialized interactive framework has been developed for the display of models (Wimalaratne et al., 2009). This framework can either depict the physical model, i.e., the actual components of the CellML format, or the biological interpretation. CellML hence provides its own two-dimensional visual language for both concepts, which can be used in programs to link between the display and the underlying data structure, and also for dynamic image manipulation. For both kinds of displays, a small set of distinct glyphs is defined: entities, processes, and roles. While the physical display tends to be very complex, the biological view is much more straightforward. The developers of the CellML visualization scheme interact with the SBGN team (Wimalaratne et al., 2009). In the longer term, it is intended to combine ideas from SBGN (Le Novère et al., 2009) and the CellML display. Currently, not all concepts of the CellML display can be expressed in SBGN (Wimalaratne et al., 2009).

<sup>11</sup>http://sbgn.org

<sup>12</sup>http://libsbgn.sourceforge.net

**FIGURE 3 | (A)** The glycolysis in human erythrocytes, simplified from Dräger (2011). This example network depicts the reaction steps from extracellular glucose to intracellular lactate as a chain of enzyme-catalyzed reactions in SBGN PD notation. Metabolites that occur multiple times in the map, such as ATP or NAD<sup>+</sup>, have darker clone markers on the bottom. Simple molecules are displayed as circles, whereas macromolecules appear as rounded rectangles. Reactions are indicated as square process nodes. **(B)** This activity flow diagram displays the interaction of two mammalian signaling pathways that are stimulated by epidermal growth factor (EGF) and tumor necrosis factor alpha (TNFα) and their influence on the nuclear factor κ-light-chain-enhancer of activated B cells (NFκB) and mitogen-activated protein kinases (MAPK) cascades. Adapted from Chaouiya et al. (2013) and generated with the program CellNOpt (Terfve et al., 2012). Here, external stimuli are colored in green. **(C)** This figure displays an example for an entity-relationship diagram, in which Ca<sup>2+</sup>/calmodulin-dependent protein kinase II (CaMKII) is precluded if it dimerizes or if it binds to the protein calmodulin. Adapted from Le Novère et al. (2011).

#### **2.4.3. Layouts in SBML**

Layouts can directly be stored in SBML models with the help of the layout extension (Gauges et al., 2006). With this extension, it is possible to attach information about position and size of objects, such as reactive species, compartments, or reaction arcs. Text labels can also be placed. The SBML layout package is based on bounding boxes and defines neither shapes nor colors of objects, but it can be further extended with additional rendering information (Deckard et al., 2006; Shen et al., 2010). Tools such as SBML2LaTeX or SBML2Ti*k*Z (Dräger et al., 2009b; Shen et al., 2010) can interpret layouts stored in this extended SBML to be consistent with SBGN process-diagram maps. In general, these two SBML extensions allow users to store arbitrary forms of network representations. Programs such as KEGGtranslator (Wrzodek et al., 2011, 2013) use the layout extension to preserve initial layouts from the KEGG database in SBML files. In combination with the SBML extension for qualitative models (Chaouiya et al., 2013), it is also possible to create activity flow networks. In contrast to the SBML layout extension, no standardized way has been proposed to directly store SBGN-ML layouts inside of SBML files. However, the recent COMBINE format (Bergmann et al., 2014) allows users to store files of diverse forms all together within one archive file (see Section 2.6).
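A layout that stores only position and size per object, as the SBML layout extension does, can be modeled with a very small data structure. The sketch below is not the layout package's actual schema. The class `BoundingBox`, its fields, and the example coordinates are hypothetical and merely illustrate what can be computed from boxes alone, such as the canvas size or overlapping elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Position (upper-left corner) and size of one displayed object."""
    x: float
    y: float
    width: float
    height: float

    def overlaps(self, other):
        """True if the two boxes share some interior area."""
        return (self.x < other.x + other.width
                and other.x < self.x + self.width
                and self.y < other.y + other.height
                and other.y < self.y + self.height)


def enclosing_box(boxes):
    """Smallest box containing all given boxes, e.g. the canvas size."""
    boxes = list(boxes)
    left = min(b.x for b in boxes)
    top = min(b.y for b in boxes)
    right = max(b.x + b.width for b in boxes)
    bottom = max(b.y + b.height for b in boxes)
    return BoundingBox(left, top, right - left, bottom - top)


# Hypothetical layout of two species glyphs and one text label.
layout = {
    "glucose": BoundingBox(10, 10, 60, 30),
    "G6P": BoundingBox(150, 10, 60, 30),
    "label": BoundingBox(50, 20, 40, 15),
}

if __name__ == "__main__":
    print(enclosing_box(layout.values()))
    print(layout["glucose"].overlaps(layout["label"]))
```

Shapes, colors, and line styles are deliberately absent from this structure, which mirrors the division of labor between layout and rendering information described above.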

#### **2.5. REPRESENTATION OF NUMERICAL RESULTS**

In order to store the results of numerical simulation, specific file formats have been proposed. The Systems Biology Result Mark-up Language (SBRML, Dada et al., 2010) has been succeeded by the NuML<sup>13</sup>. This new format has been developed as a standardized exchange and archiving format for the results of numerical methods. This new language has been designed as a format that is usable in various disciplines besides systems biology. The C++ library libNUML can be used for parsing, manipulating, and writing the information of NuML data structures.

#### **2.6. COMBINE FORMAT**

The COMBINE format aims to distribute diverse modeling, documentation, and data files together within one single Open Modeling Exchange format (OMEX) file (Bergmann et al., 2014). The format is basically a ZIP archive, i.e., a compressed datatype, which contains an XML-based manifest file and an optional metadata file in RDF format. While the structure of the manifest file is well-defined, there are only recommendations for the metadata file. If present, metadata should at least include information about the author of the OMEX file in the form of a vCard and follow the structure proposed by the Dublin Core Metadata Initiative. The manifest file contains structured links to all included files together with a definition URI that describes the file type. Thus, diverse types of files can be included, even publications, plots, models, graph definitions, etc. Just for the sake of significant data compression, it is already recommended to store models inside of OMEX files (file extension \*.omex). Even though the COMBINE archive format belongs to the most recent datatypes of the systems biology community, it is already supported by a number of tools and by the library libCombineArchive (Java™ and C#).

<sup>13</sup>http://code.google.com/p/numl/

**Table 2 | Relevant online databases**.
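Because a COMBINE archive is essentially a ZIP file with an XML manifest, its basic structure as described in Section 2.6 can be reproduced with standard tools. The sketch below builds such an archive in memory and reads the manifest back. The manifest namespace, the element and attribute names, and the format URIs follow our reading of the OMEX specification and are assumptions to be verified against it. The optional RDF metadata file is omitted, and the function names are our own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumed identifiers; verify against the OMEX specification before reuse.
MANIFEST_NS = "http://identifiers.org/combine.specifications/omex-manifest"
SPEC = "http://identifiers.org/combine.specifications/"


def build_archive(files):
    """Create an OMEX-like ZIP archive; files maps name -> (format URI, bytes)."""
    manifest = ET.Element(f"{{{MANIFEST_NS}}}omexManifest")
    entries = {".": SPEC + "omex", "./manifest.xml": SPEC + "omex-manifest"}
    entries.update({"./" + name: fmt for name, (fmt, _) in files.items()})
    for location, fmt in entries.items():
        ET.SubElement(manifest, f"{{{MANIFEST_NS}}}content",
                      {"location": location, "format": fmt})
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.xml", ET.tostring(manifest, encoding="utf-8"))
        for name, (_, payload) in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def read_manifest(archive_bytes):
    """Return {location: format URI} as listed in the archive's manifest."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("manifest.xml"))
    return {content.get("location"): content.get("format")
            for content in root.findall(f"{{{MANIFEST_NS}}}content")}


if __name__ == "__main__":
    omex = build_archive({"model.xml": (SPEC + "sbml", b"<sbml/>")})
    print(read_manifest(omex))
```

The format URI attached to each entry is what lets a receiving tool decide which of the bundled files it can interpret without inspecting their content.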

#### **2.7. ONLINE MODEL REPOSITORIES AND DATABASES**

One important aspect of model exchange and reusability is the availability and distribution of models that have already been published or that are currently under review. Since a growing number of journals require the online availability of models along with a publication, it is important to be familiar with a number of online resources that are now available. In this section, we will discuss the different aims and features of selected online model repositories, which are summarized in **Table 2**.

#### **2.7.1. BiGG**

An important resource for Biochemically, Genetically, and Genomically structured genome-scale metabolic network reconstruction is the BiGG database (Schellenberger et al., 2010). The main focus of this knowledge-base is to facilitate the bottom-up genome-scale reconstruction of metabolic networks. Inclusion of every known reaction of an entire organism constitutes the ultimate goal of BiGG. To this end, it integrates published genome-scale metabolic networks into one resource and applies a standard nomenclature for all of their components. Among these networks are several important model organisms, such as *E. coli* and *H. sapiens*, as well as further main branches of life (Duarte et al., 2007; Feist et al., 2007; Thiele et al., 2013). All models are manually curated and all reactions are atom-balanced. These networks also include gene–protein associations, which can be used to relate the activity of genes via Boolean logic to reactions and hence to perform knock-out or knock-down experiments *in silico*. BiGG offers various options to search, browse, and display networks. Manually curated maps can be downloaded in SVG format for a multitude of pathways. There are often several such maps available for one organism. Various built-in functions (such as decompartmentalization, orphan detection, gap filling, etc.) support the modeling process. With its SBML export function, it provides the basis for further steps in the modeling pipeline, particularly constraint-based analyses by the COBRA platform (Becker et al., 2007; Ebrahim et al., 2013). As the first database specific to constraint-based models, it precedes the SBML extension for fbc, but provides COBRA-specific model extensions that can be easily converted (Bornstein et al., 2008).
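The Boolean logic that relates genes to reactions can be illustrated with a short evaluator: a reaction stays available as long as its gene–protein association evaluates to true after the knocked-out genes are set to false. The rule syntax (`and`, `or`, parentheses), the gene and reaction names, and the function names are hypothetical choices for this sketch, which also assumes that gene identifiers are valid Python names. It is not the implementation used by BiGG or COBRA.

```python
import ast


def gpr_is_active(rule, knocked_out):
    """Evaluate a rule such as '(geneA and geneB) or geneC' under a knock-out."""

    def evaluate(node):
        if isinstance(node, ast.BoolOp):
            values = [evaluate(value) for value in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.Name):
            return node.id not in knocked_out  # a gene is active unless deleted
        raise ValueError("unsupported element in gene-protein association")

    # Only the syntax tree is inspected; the rule is never executed as code.
    return evaluate(ast.parse(rule, mode="eval").body)


def blocked_reactions(rules, knocked_out):
    """Return the reactions that lose all catalyzing gene products."""
    return sorted(rxn for rxn, rule in rules.items()
                  if not gpr_is_active(rule, knocked_out))


# Hypothetical associations: an enzyme complex with an alternative isozyme,
# two isozymes, and a reaction that depends on a single gene.
rules = {
    "RXN1": "(geneA and geneB) or geneC",
    "RXN2": "geneD or geneE",
    "RXN3": "geneA",
}

if __name__ == "__main__":
    print(blocked_reactions(rules, {"geneA"}))
    print(blocked_reactions(rules, {"geneA", "geneC"}))
```

In a constraint-based analysis, the reactions returned by `blocked_reactions` would typically have their flux bounds set to zero before the model is optimized again.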

#### **2.7.2. BioModels database**

BioModels database (Chelliah et al., 2013) is an open-source project, whose license model allows free commercial and academic use. Individual authors can submit their models to this database. A team of curators further improves the models, for instance by making the annotations in the model consistent with respect to MIRIAM guidelines (Juty et al., 2012). Large parts of the database content have been imported from collaborative repositories, such as the CellML model repository (Yu et al., 2011). The web interface of BioModels database provides a large variety of services based on embedded tools, e.g., for the simulation or graphical display of models. The main format of BioModels database is SBML, but models can be downloaded in a wide variety of formats, most of which have been automatically converted from the SBML files. It is also possible to obtain an exhaustive model report about each model (Dräger et al., 2009b) that describes the details of each model component in a human-readable way. Since the database was launched in 2005, it has been observed that not only the number of models but also their complexity is increasing significantly. It now contains numerous models that describe the same biological process at increasingly higher levels of detail. With the growing size of the database, the search for a model of interest has become a problem in itself (Schulz et al., 2011). With the help of metadata stored along with each model and the actual content of the models, sophisticated ranking procedures have been designed based on information theory aiming to retrieve models from the database for a given query (Henkel et al., 2010). The metadata include the submission and modification dates, the authors of the model, and references. The user can browse through the models based on several characteristics, including the model name, publication identifier, or a GO-based (Ashburner et al., 2000) classification. Besides the curation of models, a main purpose of this repository is the reproduction of model simulation results as given in the original publication (Waltemath et al., 2013).

#### **2.7.3. CellML physiome model repository 2**

The CellML physiome model repository 2 (PMR2) is the most important resource for CellML models at different states of their curation (Yu et al., 2011). It uses a Plone-based model management system that is organized in workspaces. This allows its users to collaboratively develop models based on a version-control system and also facilitates the modular development of models. The models stored in this database cover a large variety of processes, including signal transduction and metabolic pathways, electrophysiological and cell cycle models, immunological models, and models describing muscle contraction or mechanical phenomena. The idea of collaborative model development brings with it one important feature: PMR2 keeps track of a detailed version history of all models. Plug-ins to the system facilitate the presentation of models in various ways and also enable the import and export of diverse modeling formats, including SBML or FieldML besides the native database format CellML. In addition, the plug-in technique makes the database extendable. A search function returns models of all curation states. The main focus of this database is to provide a version-controlled repository for the collaborative model development and presentation of model information, here called *exposures*.

#### **2.7.4. JWS online model repository**

Another popular model resource is the JWS Online Model Repository (Snoep and Olivier, 2003). When JWS was launched as the first central model database in 2003, the standards SBML and CellML were still in early development and not yet well established. The repository itself is tightly related to the JWS online simulator (Olivier and Snoep, 2004), a particularly useful resource for educational purposes. Since then, the database has been continuously extended. Its native data format is SBML. Models can be queried based on a list of predefined characteristics (Waltemath et al., 2013), including metadata such as author, publication, organism, or model type as well as a list of categories (for instance, cell cycle or metabolism). The purpose of JWS is to provide a user-friendly online repository of kinetic models of biological systems in combination with an application that facilitates the simulation of these models. The aim of this infrastructure is to ease the review process of papers describing these kinds of models. As a result of its integration into the SEEK platform (Wolstencroft et al., 2011), a large number of collaborative projects use JWS as their default modeling platform.

#### **2.7.5. SEEK platform**

The open-source SEEK platform benefits from the ability to offer JWS as its integrated simulation tool to its users. The SEEK platform goes beyond just being a model database. This web-based tool has been designed as a pragmatic data management solution for the exchange of very diverse kinds of data relevant for research in systems biology. Besides mathematical models, it also covers the exchange of experimental data, scientific protocols, and personal information about members of large research consortia (Wolstencroft et al., 2011). It allows its users to record the outcomes of experiments. One of its most important features is the ability to link between data, models, and publications, as well as to tag all uploaded items. This platform was originally developed for the European SysMO consortium (Booth, 2007) and is also used in several other national and European projects, such as the German Virtual Liver Network (Holzhütter et al., 2012). The preferred modeling data format of SEEK is SBML with MIRIAM annotations.

### **2.7.6. ModelDB**

ModelDB (Migliore et al., 2003; Hines et al., 2004) is one of the seven databases of SenseLab (NeuronDB, CellPropDB, ModelDB, Olfactory Receptor Database, OdorDB, OdorMapDB, and BrainPharm). SenseLab aims to provide a neural, genomics/genetics, proteomics, and imaging information resource for the neuroscience community and the interested public (Crasto et al., 2007). The database does not explicitly require a standard data format. Instead, authors are welcome to upload their models in arbitrary formats. As a result, the database is very flexible, but reusing a model can require extra time to convert it into a format suitable for a particular execution environment (Waltemath et al., 2013).

### **2.7.7. Open-source brain**

Inspired by the open-source movement, the collaboration-oriented open-source brain (OSB) repository has been established (Gleeson et al., 2013). All models in this repository can be commented, debugged, and extended by registered users. This platform therefore complements repositories such as ModelDB (Hines et al., 2004), which focus on distributing published models, with the aim of driving the advance of models at all stages of their development. An integrated WebGL-based 3D explorer allows users to view cells and networks in NeuroML 2 format within their browser. OSB is well integrated with, and links to, ongoing research projects such as OpenWorm<sup>14</sup>.

### **2.7.8. WikiPathways**

The WikiPathways project (Kelder et al., 2011) provides a Web 2.0 wiki-based platform for the online curation of biological pathways. The idea behind this platform is that manually curated pathways are of higher quality than automatically created ones. Motivating the scientific community to share knowledge would thus increase the quality of available pathway information. To this end, WikiPathways provides an interactive zoomable pathway viewer that comes with a pathway diagram description, hyperlinks, and detailed information as well as literature references. Users can also annotate the pathways with ontology terms. It is possible to submit private pathway information that is shared with the public later, for instance, as part of the review process, or if current knowledge about certain processes is limited. As a major feature, WikiPathways provides stable hyperlinks to all pathways, which makes the platform useful as a reference. Its content can be downloaded in many export formats under the terms of the Creative Commons license. Among these, the BioPAX standard (Demir et al., 2010) is the most important format. Internally, it uses GPML, an XML standard that is compatible with many modeling tools, including Cytoscape (Shannon et al., 2003).

## **3. RESULTS**

## **3.1. INTEROPERABILITY OF STANDARDS**

### **3.1.1. Path2Models**

An important driving force for improved interoperability and exchange of diverse data formats and standards was the community project path2models (Büchel et al., 2013a). The aim of this project was to automatically create draft models of biological processes based on the knowledge stored in the databases KEGG (Kanehisa and Goto, 2000), MetaCyc (Caspi et al., 2014), SABIO-RK (Wittig et al., 2014), and PID (Schaefer et al., 2009). The extraction of information from these databases required the development of new algorithms in order to capture a large variety of special cases (Wrzodek et al., 2011, 2013; Büchel et al., 2012) due to the different scope of the source databases. In order to also encode qualitative networks in SBML, the standard needed to be extended (Chaouiya et al., 2013). The draft SBML models had to be quality controlled and enriched with further kinetic information for reactions for which the SABIO-RK database did not yet provide experimentally determined rate laws (Dräger et al., 2008, 2010).

Drafts of whole organism models were created by combining individual organism-specific pathway models (Swainston et al., 2011).

The main purpose of the KEGG databases is to provide a comprehensive, textbook-like educational view of the knowledge about a large variety of biological pathways. For modeling purposes, however, the information needs to be presented in a different way (Wrzodek et al., 2013). Reactions cannot be lumped together for the purpose of a better visual presentation, but have to be made explicit. The model must be as specific as possible, i.e., organism-specific variations must be reflected in pathways.

New algorithms also needed to be proposed in order to generate SBGN-ML files directly from KEGG (Czauderna et al., 2013). On the one hand, the manually created pathway maps in KEGG are much easier for human readers to comprehend than automatic layouts. On the other hand, in order to obtain an unambiguous representation of knowledge, the initial KEGG layout needs to be modified, subject to several constraints with respect to the esthetics of the result.

Such a large-scale endeavor, which resulted in more than 140,000 pathway maps that are all available from BioModels Database (Chelliah et al., 2013), was only feasible with the help of automatic procedures. Overall, this effort can be seen as a showcase application, which demonstrated the usefulness of data standardization, source code exchange, and software development in a large collaborative community project.

### **3.1.2. Workbench and workflow approaches**

Even though several data storage and exchange formats have been defined and software has been developed to import and export those formats, it is still difficult to work with a large number of different programs and in diverse environments. It can be of particular interest to process intermediate results from one program in another software package or to work with software on different computers with different operating systems. Furthermore, software is often written in diverse programming languages and compiled in diverse environments. Code reuse is still quite limited. All this can hamper building complex analysis pipelines. To address these problems, the systems biology workbench (SBW, Sauro et al., 2003) and the Garuda effort (Ghosh et al., 2011) have been launched. SBW is a software framework for communication between heterogeneous application components. It provides a broker with which each SBW-enabled program needs to register. This broker enables the software to be executed on different machines. Information can be sent from one program to another through a specific protocol, which provides a fast binary-encoded message system. SBW therefore allows programs to use each other's capabilities. In contrast, Garuda is similar to an "App Store" for systems biology (see http://www.garuda-alliance.org/). It provides a common platform from which diverse applications (gadgets) can be launched (see **Figure 4**). Garuda gadgets can call each other and send their output to the next gadget or receive input from other gadgets. A powerful workflow would be to create a model with KEGGtranslator (Wrzodek et al., 2011, 2013), which can forward its result to the rate law generator SBMLsqueezer (Dräger et al., 2008, 2010), which in turn launches SBMLsimulator (Keller et al., 2013) in order to run a simulation and parameter calibration on the resulting model. Garuda provides an easily understandable user interface, the dashboard, from which applications can be launched.

<sup>14</sup>http://openworm.org
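The gadget chaining described above can be pictured as a simple pipeline in which each stage consumes the output of the previous one. The following Python sketch is purely illustrative: the three stage functions, the dictionary layout, and the pathway identifier are hypothetical stand-ins, and nothing here calls KEGGtranslator, SBMLsqueezer, SBMLsimulator, SBW, or the Garuda API.

```python
from functools import reduce
from typing import Callable, Dict, List

# A "gadget" is modeled here as a function from one document (a dict) to the next.
Gadget = Callable[[Dict], Dict]


def translate_pathway(doc: Dict) -> Dict:
    """Hypothetical stand-in for a pathway-to-model converter."""
    return {**doc, "reactions": ["R1", "R2"],
            "history": doc["history"] + ["translate"]}


def assign_rate_laws(doc: Dict) -> Dict:
    """Hypothetical stand-in for a rate law generator."""
    laws = {reaction: "mass_action" for reaction in doc["reactions"]}
    return {**doc, "rate_laws": laws,
            "history": doc["history"] + ["rate_laws"]}


def simulate(doc: Dict) -> Dict:
    """Hypothetical stand-in for a simulator; it only records that it ran."""
    return {**doc, "simulated": True,
            "history": doc["history"] + ["simulate"]}


def run_pipeline(gadgets: List[Gadget], start: Dict) -> Dict:
    """Forward the output of each gadget to the next one in the list."""
    return reduce(lambda doc, gadget: gadget(doc), gadgets, start)


result = run_pipeline(
    [translate_pathway, assign_rate_laws, simulate],
    {"pathway_id": "example-pathway", "history": []},
)
print(result["history"])
```

The point of the sketch is only that each stage needs to agree on a shared exchange format, which is the role that SBML and related standards play between real tools.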

## **3.2. SOFTWARE SUPPORT**

A large variety of software has been developed for many kinds of model building, analysis, drawing, simulation, and format interconversion. In this section, we will only discuss a small number of conceptual categories and particularly important tools. Several reviews specifically focus on available software (e.g., Dandekar et al., 2012; Hamilton and Reed, 2013; Fernández-Castané et al., 2014; Gostner et al., 2014; Koussa et al., 2014; Kramer et al., 2014). **Table 3** gives an overview of selected software. For an up-to-date list and comprehensive information, see, for instance, the dynamic software matrix at http://sysbioapps.dyndns.org/pivot-software-matrix.html.

### **3.2.1. Visualization and model building**

Several tools provide interactive graph-based user interfaces and facilitate import or creation, manipulation, or export of complex pathway structures. Some programs can be extended via plug-ins, e.g., the Biological Network Analyzer BiNA (Gerasch et al., 2014), CellDesigner (Funahashi et al., 2008), or Cytoscape (Shannon et al., 2003). The flexible stand-alone application BiNA (Gerasch et al., 2014) is based on a hierarchical graph concept and provides highly configurable styles for the visualization of regulatory and metabolic network data as well as access to the BN++ pathway data warehouse (Küntzer et al., 2007). The web-modeling tool BioGrapher (Krause et al., 2013) is implemented with HTML5, CSS, and JavaScript and can be used to create SBGN maps. BioGrapher can import several standard formats, including SBML and SBGN-ML, and export SBGN maps in a JSON file format or as images. The VANTED plug-in SBGN-ED supports all three kinds of SBGN maps and is therefore useful for designing and modifying SBGN-ML files (Czauderna et al., 2010). The framework program Cytoscape supports creation, import, and export of SBML and SBGN through plug-ins (König et al., 2012; Gonçalves et al., 2013). The main purpose of the straightforward and user-friendly process-diagram editor CellDesigner is the creation, manipulation, and simulation of SBML models (Matsuoka et al., 2014) with export functions to BioPAX (Mi et al., 2011) and SBGN-ML (van Iersel et al., 2012). CellDesigner can be extended through plug-ins, such as the kinetic law generator SBMLsqueezer (Dräger et al., 2008, 2010). The draft model generator KEGGtranslator (Wrzodek et al., 2011, 2013) automatically downloads contents of the pathway database KEGG (Kanehisa and Goto, 2000) and converts the content to diverse output formats, including SBML with extensions for layout (Gauges et al., 2006) and qual (Chaouiya et al., 2013), SBGN-ML (van Iersel et al., 2012), BioPAX (Demir et al., 2010), and many more. TinkerCell (Chandran et al., 2009) has been developed as a computer-aided design (CAD) tool and provides visual representations for systems biology and synthetic biology. The open-source, cross-platform tool OpenCOR for working with CellML files can be used through a command-line or graphical user interface (Nickerson et al., 2013). It supports various aspects of modeling, including editing, simulation, and analysis. As a plug-in based program, OpenCOR can be easily extended. One of its most recent plug-ins facilitates the annotation of CellML.

#### **Table 3 | Selected relevant software for systems biology**.

### **3.2.2. Constraint-based modeling**

The most important toolboxes for Constraint-Based Reconstruction and Analysis (Bordbar et al., 2014) are the COBRA Toolbox for MATLAB (Schellenberger et al., 2011) and its Python implementation COBRApy (Ebrahim et al., 2013). These toolboxes provide state-of-the-art implementations of flux balance analysis methods, including gene deletions, flux variability analysis, sampling, and batch simulations. Both versions of COBRA incorporate tools to read in and manipulate constraint-based models, which requires a specific extension of the SBML standard. The Mathematica-based Mass-Toolbox (Sonnenschein and Palsson, 2013)<sup>15</sup> is a complex framework for constraint-based model building and simulation, which can calculate steady-state solutions for complex enzyme reactions and even solve ODE and DAE systems with delays and events. Further important tools for FBA are FASIMU (Hoppe et al., 2011), the VANTED (Junker et al., 2006) plug-in FBA-SimVis (Grafahrend-Belau et al., 2009), and PySCeS (Olivier et al., 2005).
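For orientation, flux balance analysis is commonly written as a linear program. The notation in the following sketch is an assumption introduced here for illustration and does not appear in the text above: $S$ denotes the stoichiometric matrix, $v$ the vector of reaction fluxes, $c$ the objective weights, and $v^{\min}$, $v^{\max}$ the lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{subject to} \quad & S\,v = 0, \\
& v^{\min} \le v \le v^{\max}.
\end{aligned}
```

In this formulation, a gene deletion can be simulated by setting the bounds of the affected reactions to zero, and flux variability analysis minimizes and maximizes each flux in turn while the objective is held at or near its optimal value.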

### **3.2.3. Dynamic simulation**

The Mass-Toolbox (Palsson, 2011; Sonnenschein and Palsson, 2013) concentrates on kinetic modeling, in particular on mass-action rate laws and elementary reaction systems. It supports a large variety of analysis methods and high-level plotting commands, for example, for phase portraits.
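To illustrate what a mass-action rate law looks like in a dynamic simulation, the following minimal Python sketch integrates the elementary reaction A + B ⇌ C with a classical Runge-Kutta scheme. It does not use the Mass-Toolbox; the reaction, the rate constants `kf` and `kr`, the initial amounts, and the step size are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def mass_action_rhs(y: Sequence[float], kf: float, kr: float) -> List[float]:
    """Time derivatives of [A, B, C] for A + B <-> C with mass-action kinetics."""
    a, b, c = y
    net_rate = kf * a * b - kr * c  # forward rate minus reverse rate
    return [-net_rate, -net_rate, net_rate]


def rk4_step(y: Sequence[float], dt: float, kf: float, kr: float) -> List[float]:
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return [bi + h * si for bi, si in zip(base, slope)]

    k1 = mass_action_rhs(y, kf, kr)
    k2 = mass_action_rhs(shifted(y, k1, dt / 2), kf, kr)
    k3 = mass_action_rhs(shifted(y, k2, dt / 2), kf, kr)
    k4 = mass_action_rhs(shifted(y, k3, dt), kf, kr)
    return [yi + dt / 6 * (s1 + 2 * s2 + 2 * s3 + s4)
            for yi, s1, s2, s3, s4 in zip(y, k1, k2, k3, k4)]


def simulate(y0: Sequence[float], kf: float, kr: float,
             dt: float = 0.01, steps: int = 2000) -> List[float]:
    """Integrate from y0 for `steps` steps of size `dt` and return the final state."""
    y = list(y0)
    for _ in range(steps):
        y = rk4_step(y, dt, kf, kr)
    return y


a, b, c = simulate([1.0, 1.0, 0.0], kf=2.0, kr=1.0)
print(a, b, c)
```

With these assumed parameters, the equilibrium condition kf·A·B = kr·C together with the conservation relation A + C = 1 gives C = 0.5, which the simulated state approaches. Dedicated simulators use adaptive and implicit solvers instead of a fixed-step scheme.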

The SBToolbox2 (see http://sbtoolbox2.org, Schmidt and Jirstrand, 2006; Schmidt, 2007) provides a powerful and extensible variety of simulation and analysis functions, which smoothly integrate into the MATLAB environment. SBToolbox2 supports SBML and parameter estimation with EvA2 (Kronfeld, 2008).

CellDesigner integrates several third-party tools for interactive model simulation, such as SOSlib (Machné et al., 2006), the Simulation Core Library (Keller et al., 2013), or COPASI (Hoops et al., 2006).

The SBW-enabled complex pathway simulation program COPASI is primarily a stand-alone program, but provides API language bindings for several programming languages. COPASI can read, write, and understand SBML, but has its own specific modeling language and supports several other export formats. It comprises methods for simulation and analysis of biochemical networks and their dynamics based on ODEs and stochastic systems. Parameter estimation and the visualization of data as well as animated pathways are among its strengths.
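To make the distinction between ODE-based and stochastic simulation concrete, the following Python sketch applies Gillespie's direct method, a standard stochastic simulation algorithm, to a birth-death process. It does not call COPASI, and the text above does not state which stochastic methods COPASI implements; the process, the rate constants, and the function name are assumptions for illustration.

```python
import random
from typing import List, Tuple


def gillespie_birth_death(k_prod: float, k_deg: float, x0: int,
                          t_end: float, seed: int = 0) -> Tuple[List[float], List[int]]:
    """Simulate 0 -> X (rate k_prod) and X -> 0 (rate k_deg * x) until t_end."""
    rng = random.Random(seed)  # seeded generator for reproducible runs
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_prod = k_prod        # propensity of the production reaction
        a_deg = k_deg * x      # propensity of the degradation reaction
        a_total = a_prod + a_deg
        if a_total == 0.0:
            break              # no reaction can fire any more
        t += rng.expovariate(a_total)  # exponentially distributed waiting time
        if t > t_end:
            break
        # Choose which reaction fires, with probability proportional to its propensity.
        if rng.random() * a_total < a_prod:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


times, counts = gillespie_birth_death(k_prod=10.0, k_deg=1.0, x0=0, t_end=20.0)
print(len(times), counts[-1])
```

Each run yields one random trajectory of integer molecule counts, whereas the corresponding ODE model yields a single smooth concentration curve. Repeating the run with different seeds shows the variability between trajectories.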

The tool SBMLsimulator combines the Simulation Core Library, a comprehensive Java™ API for solving SBML models (Keller et al., 2013) with the optimization framework EvA2 (Kronfeld, 2008) in a self-explanatory user interface and provides a complete implementation of the SBML standard in terms of an ODE framework.

The stand-alone desktop tool BioUML (Kolpakov et al., 2011) is among the few tools that provide a full implementation of the SBML standard in terms of ODE systems and also provides its functions as a JavaScript API.

The stand-alone tool iBioSim (Myers et al., 2009) for modeling, analysis, and design of genetic circuits has been developed as an editor and simulator (ODE and stochastic) with applications in systems biology as well as synthetic biology. Besides SBML, it also understands labeled Petri net (LPN) models and has import access to model databases. Experimental data can also be used to infer models in iBioSim.

SOSlib (Machné et al., 2006) is an ODE-based C-API library implementation of SBML that internally uses CVODE (Hindmarsh et al., 2005). The newer C-implementation libSBMLSim (Takizawa et al., 2013) supports even more recent versions of SBML, explicit and implicit integration methods, and bindings to several programming languages. Another alternative is libRoadRunner, a high-performance C++ library for the simulation of SBML models, which provides automatically generated language bindings to Python (Sauro et al., 2013).

<sup>15</sup>http://opencobra.github.io/MASS-Toolbox/

The Java-based tool JSim has been designed for building quantitative numeric models as well as the analysis of these models based on given experimental data (Butterworth et al., 2014). It supports ODEs and PDEs, discrete events, and implicit methods. JSim can import and export SBML and import CellML (Smith et al., 2013a).

The Virtual Cell suite VCell is a powerful simulation toolbox for complex biological phenomena (Moraru et al., 2008; Resasco et al., 2012). It includes sophisticated methods for: (i) molecular interactions and transport, (ii) various sub-cellular compartments, (iii) dynamics of membrane potentials, and (iv) arbitrary fluxes and passive cross-membrane transport mechanisms, and supports PDEs in addition to ODEs. It is one of very few tools to incorporate physicochemical and electro-physiological processes and can apply quasi-steady-state approximations to fast reactions. It is also an image processing tool for experimental images.

The simulation environment MOOSE was developed as a reimplementation of the GENESIS neural simulator, and initially used that simulator's model description format. More recently, it has gained support for NeuroML models and is also capable of dealing with systems biology models (Gleeson et al., 2010; Dudani et al., 2013). New simulation algorithms can be added to MOOSE through a generic framework. It has also been developed with a focus on multi-scale models and simulation at diverse levels of detail (Dudani et al., 2013). For a more comprehensive overview of recent simulation tools with a focus on neuroscience, we refer the interested reader to the review by Gleeson (2013).

The stand-alone modeling framework PhysioDesigner (Asai et al., 2013) provides several functions for the creation and analysis of PHML models. SBML models can be incorporated as submodels through PhysioDesigner (Asai et al., 2014), aiming at integrating dynamics at sub-cellular and cellular levels. The simulator Flint can efficiently solve PHML models and provides a cloud service, which allows users to remotely solve their models (Asai et al., 2012). PhysioDesigner uses Flint and submits jobs to this cloud service.

### **3.2.4. Regulatory networks**

The inference of regulatory networks is a challenge for many areas of research. The program ModuleMaster (Wrzodek et al., 2010) identifies *cis*-regulatory modules (CRMs) in sets of co-expressed genes based on transcription factor binding information and multivariate functional relationships between regulators and target genes. It takes data from microarray and clustering experiments as input and produces SBML models as output. In order to make the results of network inference procedures such as Net*Gene*rator (Töpfer et al., 2007) reusable in further analysis tools, the program GRN2SBML (Vlaic et al., 2013) has been developed as a converter to SBML. It provides a graphical user interface and access to BioMart Central, and can also be used as an R-package. The program GINsim has been developed for the analysis and simulation of logical models of gene interaction networks (Gonzalez Gonzalez et al., 2006) and has recently been adapted to the SBML qual extension (Chaouiya et al., 2013). The program CellNOpt can be useful for the creation of signal transduction networks based on a logical approach (Terfve et al., 2012), and it also supports SBML qual.
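A logical model of the kind mentioned above assigns each gene a Boolean update rule. The following Python sketch simulates a three-gene circuit under synchronous updating until a state repeats. It is not based on GINsim or CellNOpt; the genes `A`, `B`, `C`, their rules, and the synchronous update scheme are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]

# Hypothetical three-gene circuit: one logical rule per gene.
RULES: Dict[str, Callable[[State], bool]] = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B jointly activate C
}


def synchronous_step(state: State) -> State:
    """Update all genes at once, each from the current state."""
    return {gene: rule(state) for gene, rule in RULES.items()}


def trajectory(state: State, max_steps: int = 20) -> Tuple[List[State], State]:
    """Follow the dynamics until a state repeats or max_steps is reached.

    Returns the visited states and the state that would come next.
    """
    seen: List[State] = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = synchronous_step(state)
    return seen, state


visited, revisited = trajectory({"A": True, "B": False, "C": False})
for s in visited:
    print(s)
```

For this assumed circuit and start state, the negative feedback from C onto A produces a cyclic attractor: the trajectory returns to its start state after five updates. Asynchronous updating, which logical modeling tools often support as well, can lead to different dynamics.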

## **3.3. REGULAR COMMUNITY MEETINGS**

Many standards described in this paper are based on community efforts. For this reason, community meetings have been required from their inception. In October 2010, separate workshops were combined in order to better coordinate individual developments and to reduce the amount of traveling required of individual researchers. This resulted in two regular annual meetings that bring the community together. The COMBINE (Computational Modeling in Biology Network) meeting is a workshop with scientific presentations, poster sessions, and several break-out sessions, which are used to discuss and coordinate the further development of the "COMBINE Standards" BioPAX, CellML, SBGN, SBML, SBOL, and SED-ML, as well as associated and related standards. The idea of the spring Hackathon on resources for modeling in biology (HARMONY) is to provide room and time for community members to sit down, share code and ideas, program, and discuss. In contrast to the fall event, HARMONY usually has only very few talks and is much more a hands-on practical event, where participants develop new approaches and ideas. For more information about previous meetings, see the meeting reports by Le Novère et al. (2011) and Waltemath et al. (2014) as well as the COMBINE homepage<sup>16</sup>. This alternating sequence of complementary meetings leads to a very efficient and progressive development of software and standards.

## **4. DISCUSSION**

In this review article, we have examined diverse modeling standards and data formats that are currently in use within the scientific community, together with databases from which these formats can be obtained. We discussed a selection of useful software packages and modeling approaches for systems biology and related fields. The structuring of individual standards is at present very elaborate: there is usually a modeling, annotation, or documentation recommendation that forms the theoretical basis for a corresponding machine-readable data format and involves specific controlled vocabulary terms for unambiguous specification of individual model components.

Aiming to keep even highly elaborate standards flexible and able to incorporate new findings, the specifications are becoming more and more abstract and modularized. For example, the original *reaction* element in SBML is now seen as a generic *process* whose inputs and outputs no longer strictly have to *represent* substrates and products of biochemical reactions. The idea to develop specific packages for certain needs rather than one monolithic modeling language also follows this trend. The development of all standards involves numerous people, detailed discussions, and careful consideration. This overall procedure ensures that standards mature in an open fashion and allows interested researchers to participate and to contribute to this development. At the same time, it also increases the chance that potential conflicts or inaccuracies can be discovered in early stages of development. With increased use of standards, the individual formats are steadily improved, and current limitations are detected and resolved. Thanks to the regular meetings and ongoing exchange between the developers of the diverse standards, the individual formats are mutually adopting more and more of each other's features. It can therefore be expected that the exchange between different model and pathway representation standards will further increase.

<sup>16</sup>http://co.mbine.org
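As a concrete picture of the SBML *reaction* element discussed above, the following Python sketch reads a hand-written SBML Level 3 Version 1 fragment with the standard library XML parser and lists the inputs and outputs of each reaction. The model content (`toy_model`, `glc`, `g6p`, `hexokinase`) is invented for illustration, the fragment has not been validated against the SBML schema, and a dedicated library such as libSBML or JSBML would normally be used instead of raw XML parsing.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Namespace of SBML Level 3 Version 1 core.
SBML_NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}

# Hypothetical toy model, written by hand for this example.
SBML_DOC = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy_model">
    <listOfCompartments>
      <compartment id="cell" constant="true"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glc" compartment="cell" hasOnlySubstanceUnits="false"
               boundaryCondition="false" constant="false"/>
      <species id="g6p" compartment="cell" hasOnlySubstanceUnits="false"
               boundaryCondition="false" constant="false"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="hexokinase" reversible="false" fast="false">
        <listOfReactants>
          <speciesReference species="glc" stoichiometry="1" constant="true"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="g6p" stoichiometry="1" constant="true"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def list_reactions(sbml_text: str) -> Dict[str, Tuple[List[str], List[str]]]:
    """Map each reaction id to its (reactant ids, product ids)."""
    root = ET.fromstring(sbml_text)
    result = {}
    for reaction in root.findall("sbml:model/sbml:listOfReactions/sbml:reaction", SBML_NS):
        reactants = [ref.get("species") for ref in
                     reaction.findall("sbml:listOfReactants/sbml:speciesReference", SBML_NS)]
        products = [ref.get("species") for ref in
                    reaction.findall("sbml:listOfProducts/sbml:speciesReference", SBML_NS)]
        result[reaction.get("id")] = (reactants, products)
    return result


reactions = list_reactions(SBML_DOC)
print(reactions)
```

Nothing in this structure forces the referenced species to be biochemical substrates and products, which is what allows the element to be read as a generic process.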

For end-user applications, the goal is that users would no longer have to care about the underlying data format used by a specific software tool. More and more details of the internal structure and organization of underlying formats could be hidden and no detailed knowledge about these formats would be required. Plugins for platforms such as Cytoscape (Shannon et al., 2003) or CellDesigner (Funahashi et al., 2008) can provide complementary functionality for export or import of certain data formats based on a common underlying data structure (König et al., 2012; Gonçalves et al., 2013). The SBW or the Garuda framework provides further ways to increase the interoperability of tools with little effort (Sauro et al., 2003; Ghosh et al., 2011). Many tools could also benefit from the ability of the new COMBINE archive format to bridge separately stored representations or applications of the same model (Bergmann et al., 2014).
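A COMBINE archive is, in essence, a ZIP file that bundles related files with a manifest describing them. The following Python sketch builds such a bundle in memory using only the standard library. The file names, the placeholder file contents, and the exact manifest layout and format identifiers shown here are assumptions that should be checked against the COMBINE archive specification before use.

```python
import io
import zipfile
from typing import Dict

# Manifest listing every entry of the archive together with its format.
# Layout and format URIs are assumptions; verify them against the specification.
MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">
  <content location="." format="http://identifiers.org/combine.specifications/omex"/>
  <content location="./manifest.xml" format="http://identifiers.org/combine.specifications/omex-manifest"/>
  <content location="./model.xml" format="http://identifiers.org/combine.specifications/sbml"/>
  <content location="./simulation.sedml" format="http://identifiers.org/combine.specifications/sed-ml"/>
</omexManifest>
"""


def build_archive(files: Dict[str, str]) -> bytes:
    """Write the manifest and the given files into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.xml", MANIFEST)
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Placeholder contents; real archives would hold a full SBML model and SED-ML file.
archive_bytes = build_archive({
    "model.xml": "<!-- placeholder for an SBML model -->",
    "simulation.sedml": "<!-- placeholder for a SED-ML description -->",
})

with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    print(sorted(archive.namelist()))
```

Keeping the model, its simulation description, and the manifest in one file is what lets a tool or a reviewer pick up all representations of the same model at once.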

The distribution and curation of standardized models, their simulation descriptions, and expected results by centralized databases play a prominent role. These knowledge bases constitute valuable resources of available information about biological processes and reproducible experiments. They can therefore significantly reduce the time and effort needed for the assembly of extended models and create the basis for further research. The ability to easily reproduce new scientific findings with existing simulation workflows facilitates the fast adoption and integration of these findings into new and more elaborate works. If other researchers are able to run simulations and to comprehend models with minimal effort, it can be expected that these studies will receive higher recognition and more citations than studies whose models yield outcomes that are difficult to reproduce. Distributing models and data in standard formats within project working groups benefits not only collaboration partners: the fine-grained structure of standards for diverse aspects of modeling workflows that is now available can even simplify the review process of scientific papers. If a model is uploaded along with a publication in a standard format, accompanied by a simulation experiment description file and a graphical representation, reviewers can quickly obtain an overview of the structure and organization of the model, and can even easily check whether the findings described in the paper can be reproduced. The reviewer can select any numerical tool that supports these data formats and is not restricted to any particular environment.

The development of a standard can be seen as a long-term investment. Unlike in other fields, the community-based bottom-up development of exchange formats is very common in systems biology. Depending on the structure of the field, it can therefore take a long time before the overhead of developing a new standard pays off; on the other hand, standards exist as long as the community has a requirement for them (Brazma et al., 2006). It also seems that the development of standards has become a field of research in itself and is sometimes even seen as the central aspect in modeling (Waltemath et al., 2013). Models and their evaluation are certainly valuable tools for progress in research, but permanently keeping track of all emerging standards can become difficult. The proposed concepts and approaches can only be successful if they are well known. If standard data formats are developed that are not adopted by the community, the standard will disappear and a simpler solution will gain acceptance. Over time, new modeling techniques and new findings are established and adopted by the research community (Lerman et al., 2012; O'Brien et al., 2013). Approaches for model encoding and standardization therefore need to continuously evolve with the domain of research that they represent. It is therefore important for the standardization community to continue to closely interact with the modeling community in order to catch up with novel approaches, needs, and requirements. The solutions given to the modeling community must be simple enough to be easily adopted, implemented, and applied, but they must also be sophisticated enough to capture the complexity of the described systems. Participation of the community in proposing encoding schemes and guideline checklists is essential for the success of the respective standard. Large-scale reconstructions and community projects require data standards and at the same time push their development (Büchel et al., 2013a; Thiele et al., 2013).

While in the past even quick computation in active research required the implementation of some data structures from scratch in customized scripts, the rich variety of software libraries and modeling-specific scripting languages now available drastically simplifies these tasks. If an existing software solution cannot be directly applied to solve a specific task, it is at least possible to use standards-compliant data structures from the very beginning of a project. The quality of available software solutions is also progressively increasing. For the distribution of final results, standard formats should be used as the preferred exchange and storage medium in order to ensure reusability and reproducibility of results and findings.

#### **AUTHOR CONTRIBUTIONS**

Andreas Dräger developed the conceptual idea for this review and drafted the manuscript. Bernhard Ø. Palsson supervised this work and critically revised the manuscript. Both authors approve the final manuscript.

### **ACKNOWLEDGMENTS**

The authors thank Marc Abrahams, UCSD, for carefully proofreading the manuscript and acknowledge the contribution of Nikolaus Sonnenschein, Novo Nordisk Foundation Center for Biosustainability (CFB). *Funding*: This work was funded by a Marie Curie International Outgoing Fellowship within the European Commission's 7th Framework Program for Research and Technological Development (project AMBiCon, grant number 332020) and a National Institutes of Health grant for the continued development of essential SBML software support (NIH, United States, award number GM070923).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 01 September 2014; accepted: 14 November 2014; published online: 08 December 2014.*

*Citation: Dräger A and Palsson BØ (2014) Improving collaboration by standardization efforts in systems biology. Front. Bioeng. Biotechnol. 2:61. doi: 10.3389/fbioe.2014.00061*

*This article was submitted to Systems Biology, a section of the journal Frontiers in Bioengineering and Biotechnology.*

*Copyright © 2014 Dräger and Palsson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
