Introduction
Numerous approaches to understanding and utilizing gene expression measurements attempt to classify them into one of two states: active (roughly speaking, the gene product is part of an active cellular mechanism) or inactive (the cellular mechanism is not active) (Ferrell, 2002; Abel et al., 2013; Gallo et al., 2015). We label this classification a determination of the gene activity state. These approaches are becoming more and more relevant with continued dramatic increases in the quantity and diversity of transcriptomics data as prices to obtain data continue to decline. In particular, recent approaches to metabolic modeling (MM) have focused on the integration of multiple sources of genetic information including transcriptomics data (Pfau et al., 2011; Lewis et al., 2012; Bordbar et al., 2014; Chubukov et al., 2014; Machado and Herrgård, 2014; Monk et al., 2014; Rezola et al., 2014). In these approaches, gene activity states are usually incorporated into constraints on the fluxes through reactions associated with the gene products. For example, GIMME (Becker and Palsson, 2008) applies a user-specified expression level threshold to classify gene activity states in any given experiment, then computes a penalty for flux through any reaction associated with an inactive gene; Flux Balance Analysis (FBA) is then constrained by minimizing the sum of penalties across all reactions in the model. PROM (Chandrasekaran and Price, 2010) uses a version of this approach in which the researcher finds the user-specified expression level threshold by assuming that a pre-defined percentage of all genes in an experiment are active (e.g., 33% Chandrasekaran and Price, 2010 or 50% Dunman, Personal Communication.). Others have proposed approaches in a similar spirit (Jerby and Ruppin, 2012), while some have allowed for more uncertainty through the addition of an “unsure” state, which yields no corresponding flux constraint (Shlomi et al., 2008), or by examining relative expression values (Jensen and Papin, 2011). Some (Jensen et al., 2011) suggest that large relative changes in gene expression (above a threshold) signal a shift from one state to the other, while others (Van Berlo et al., 2011) use both absolute and relative changes. Most recently, some have proposed continuous approaches whereby larger expression values for an experiment are classified as more likely to be active, and lower expression values for an experiment are classified as less likely to be active (e.g., GIM3E, Schmidt et al., 2013). Constraints on FBA via the estimated gene expression states are “soft” in that they can be violated to allow for uncertainty in expression state classification and also allow for potential post-transcriptional control; precise handling of such violations varies among approaches but typically involves a penalty term in the linear programming problem. While PROM classifies gene states in the standard manner, PROM does not directly constrain FBA based on the states.
While not all MM approaches to integration of transcriptomics data attempt to classify gene activity measurements into two states (Colijn et al., 2009; Moxley et al., 2009; Fang et al., 2012; Kim and Reed, 2012; Lee et al., 2012; Navid and Almaas, 2012), integrated MM approaches which do classify gene states suffer from at least four major limitations. First, many of the existing methods do not allow for different activity state thresholds between genes (e.g., gene A is assigned a threshold of 9.25 on a log-scale between active and inactive states; whereas gene B is assigned a threshold of 10). Second, many existing methods do not allow differences in the proportion of genes that are active from one experiment to another (e.g., a higher proportion of genes are expected to be classified as active in a minimal media condition than in a rich media condition). Third, existing methods do not leverage a priori knowledge about potential gene co-regulation to improve activity state classification (e.g., given that two genes A and B are co-regulated, if A is classified as active, B should also be classified as active). Finally, almost all existing methods do not meaningfully estimate statistical uncertainty in the classification process; the typical approaches to classifying genes using Boolean rules (every gene is either active or inactive) do not attempt to incorporate uncertainty in the classification. While some have attempted to incorporate uncertainty into the classification process (e.g., some have an “unsure” classification Shlomi et al., 2008), all approaches (including PROM Chandrasekaran and Price, 2010) incorporate post-hoc uncertainty adjustments by allowing violations of the gene activity state using penalties which apply similarly across all genes—in essence uniformly down-weighting the impact of expression data to account for upstream processing uncertainty. Here we propose improvements to gene activity state classification that address these limitations. Our work is motivated by the observation that many researchers work under operational definitions of genes as “active” in some conditions and “inactive” in others. Our goal is to provide guidance to researchers who regularly put this intuition into practice, by assessing their methods for classifying genes into activity states based on gene expression data, and proposing statistical models for data analysis that lead to improved classifications. Thus, we propose a flexible Bayesian approach that uses parametric mixture models as a platform for meaningfully estimating gene activity states through the integration of expression data and knowledge of operon structure. We quantify confidence in gene state estimates, which subsequently then can be incorporated into downstream analyses. We assess the performance of the model against other common approaches of estimating gene activity states using both simulated data and real E. coli transcriptome data; we compare our activity estimates to predictions of gene activity derived from reaction fluxes in a metabolic model of E. coli as well as to experimentally measured reaction fluxes in E. coli (Ishii et al., 2007; Machado and Herrgård, 2014).
Methods
General mixture modeling framework
Throughout this paper, we consider a set of m bacterial genes from a single organism whose expression levels ϵ have been observed across n different experimental settings. We define ϵij to be the expression level recorded for the ith gene in the jth experimental sample, where ϵij is the background corrected, normalized, logarithm of recorded amount of mRNA from an expression array. In the spirit of Becker and Palsson (2008) and Chandrasekaran and Price (2010), for each gene, i, we consider these observed ϵij values to come from one of two possible (unobserved) gene states: inactive (gene i is producing only basal levels of product) and active (gene i is producing product involved in a functioning cellular process).
A natural probabilistic model for the observed gene expression levels for genes in each state is a conditional Gaussian model, which follows earlier work in other genomic contexts (Gamba et al., 2015; Morfopoulou and Plagnol, 2015). Namely, ϵi|active ~ N(μ1, σ1) and ϵi|inactive ~ N(μ0, σ0). In other words, the distribution of expression values for a gene in the active state follows a Gaussian distribution with an underlying mean, μ1, and standard deviation, σ1 where the standard deviation captures the underlying measurement (Ohtaki et al., 2010) and biological variability (Losick and Desplan, 2008; Chalancon et al., 2012) in expression measurements across settings. Similar assumptions and definitions hold for gene expression measurements from the inactive state. Since in most settings, the true state of the gene is unknown a priori, the resulting observed gene expression values, ϵi = (ϵi1, ϵi2, ϵi3, …, ϵin), can be modeled as coming from a Gaussian mixture distribution, ϵi ~ (1 − π)N (μ0, σ0) + πN(μ1, σ1), where the mixing parameter π represents the proportion of the time that the given gene i is active across the set of experiments.
Univariate inference overview
Our goal is to use ϵi = (ϵi1, ϵi2, ϵi3, …, ϵin), and the Gaussian mixture model described above to make inferential statements about αi = (αi1, αi2, αi3, …, αin), where αij ∈ {0, 1} indicates whether gene i = 1,… m is in the active state (αij = 1) or the inactive state (αij = 0) in experimental setting j = 1,…,n. In particular, since we cannot observe αi directly, we wish to generate ai = (ai1, ai2, ai3, …, ain), such that aij is the posterior probability that gene i is active in experiment j. We will use a Bayesian approach to generate ai for each gene i.
For each gene i, we start with prior distributions on the five unknown parameters from the Gaussian mixture model: π, μ0, σ0, μ1, σ1. In practice, we reduce to four unknown parameters by requiring σ0 = σ1. This assumption provides increased model convergence and robustness to outliers (Fraley and Raftery, 2007), and assumes that similar amounts of biological and measurement variability will be present in expression values for the inactive and active states. The prior distributions for each of the four unknowns are given as: μ0 ~ N(μ=8,σ=3), μ1~N(μ=9,σ=3), σ02=σ12~InverseWishart(Ψ=3,ν=1) and π ~ Beta(α = 5, β = 5). Briefly, these choices of prior distributional shapes make later mathematical computation of posterior distributions straightforward and are standard in statistical practice (Murphy, 2007). The corresponding parameter values reflect reasonable experimental and biological assumptions [e.g., E(π) = 0.5; E(μ0) < E(μ1), etc.]. The choices of 8 and 9 for the prior means of RMA normalized data (See Section Real Data Sets), represent values near the overall “average” ϵ across all genes and all experiments (the range of which tends to be between 4 and 16). We note that, for this application, our analysis of the robustness of parameter choices indicates that these choices appear to have little bearing on resulting downstream aij generation (detailed results not shown).
We use a Gibbs sampler to generate ai as follows:
Step 1. Let μ^0,k=1=E(μ0)=8,μ^1=E(μ1,k = 1)=9, σ^0,k = 1=σ^1,k = 1=E(σ0)=1, and π^k = 1=E(π)=0.5, where k = 1 indicates that this is the initial pass through the Gibbs sampler.
Step 2. Use the four estimated parameter values from Step 1 to find the estimated Gaussian mixture model ϵ^i,k = 1~(1-π^k = 1)N(μ^0,k = 1,σ^0,k = 1) + π^k=1N(μ^1,k=1,σ^1,k=1).
Step 3. Use the estimated Gaussian mixture model, ϵ^i,k=1, to find bij, k = 1(ϵij), the conditional probability that an expression value, ϵij, is from the active state, where bij,k=1=π^k=1f1,k=1(ϵij)π^k=1f1,k=1(ϵij) + (1-π^k=1)f0,k = 1(ϵij) where f1,k=1(ϵij)=1σ^1,k=12πe−(ϵij−μ^1,k=1)22σ^12, k=1 and f0,k = 1(ϵij)=1σ^0,k = 12πe −(ϵij−μ^0,k = 1)22σ^02, k = 1
Step 4. Generate a random vector Ii, k = 1, where Iij, k = 1is a single random value {0 or 1} drawn from Bernoulli(p = bij, k = 1), indicating whether gene i is active or inactive in experimental setting j, for the k = 1 iteration of the Gibbs sampler.
Step 5. Update the prior distributions of the four parameters by incorporating the prior distributions with Ii, k = 1. In particular, let Cactive and Cinactive be the set of expression values currently assigned to the active and inactive clusters, respectively, according to Ii. Then μ0~ N((nσ0−2+3−1)−1(83−1+nσ0−2ϵ¯0), (nσ0−2+3−1)−1), μ1~N((nσ1−2+3−1)−1(93−1+nσ1−2ϵ1¯), (nσ1−2+3−1)−1), σ02=σ12~InverseWishart(Ψ+∑j∈Cactive(ϵij−μ1)T (ϵij−μ1)+ ∑j ∈ Cinactive(ϵij−μ0)T(ϵij−μ0)) ν + n) and π ~ Beta(α + |Cactive|, β + |Cinactive|), where |Cactive| represents the cardinality (size) of the set of expression values in Cactive. And, where ϵ¯0 and ϵ¯1 are the means of the classified inactive and active gene expression data, respectively.
The Gibbs sampler then repeats Steps 2–5 for k = K times. In our case, we used K = 500, with values less than 500 tending to give less robust results (detailed results not shown).
To generate ai, values of Ii, k are averaged across the K runs of the Gibbs sampler, ignoring an initial set of burn-in runs, b. In our case we used b = 50, which yielded robust ai values (detailed results not shown). In particular, aij=∑k = bKIij,kK-b, for all j.
Multivariate inference overview
While the univariate Gaussian mixture model and associated Gibbs sampler provide a standard way to generate ai values gene by gene, this approach fails to account for other a priori known biological information which may be able to further improve ai estimates. For example, in bacteria, operons are sets of contiguous genes that are co-regulated and therefore are generally active or inactive simultaneously. Thus, knowledge about which genes are in operons should allow us to improve gene activity estimates.
If there are p genes (i1,i2,…,ip) located within a given operon, r, then we can extend the univariate Gaussian mixture model described earlier to a multivariate Gaussian mixture model as follows:
ϵr=ϵi1,i2,…,ip~(1-π)N(μ⃗0,Σ0)+πN(μ⃗1,Σ1),
Where, μ⃗0=(μ0,i1,μ0,i2,…,μ0,ip),μ⃗1=(μ1,i1,μ1,i2,…,μ1,ip), Σ02=(σ0,i12⋯0⋮⋱⋮0⋯σ0,ip2) and Σ12=(σ1,i12⋯0⋮⋱⋮0⋯σ1,ip2), where we assume that within each state (active or inactive) the biological and measurement co-variability between gene expression measurements is zero.
Our approach in the multivariate case is very similar to our approach in the univariate case, and so is only outlined here. For each operon r, we start with prior distributions on the four unique and unknown parameters/vectors from the Gaussian mixture model: π,μ⃗0,μ⃗1,Σ0=Σ1. The prior distributions for each of the four unknowns are given as: μ⃗0~N(μ⃗=8⃗,Σμ⃗0=(3⋯0⋮⋱⋮0⋯3)), μ⃗1~N(μ⃗=9⃗,Σμ⃗1=(3⋯0⋮⋱⋮0⋯3)),Σ02=Σ12~InverseWishart(df=p+2,scale=(1⋯0⋮⋱⋮0⋯1)) and π ~ Beta(5, 5). These distributions and initial prior parameter values are mainly explained above. Off-main diagonal values of 0 in Σμ⃗0andΣμ⃗1 suggest that there is no correlation between the means of the active (or inactive) state distributions across the genes in an operon, with Σ0 = Σ1 suggesting no co-variability in the state expression variances of genes in an operon. As before, our analysis of the robustness of parameter choices indicates that these choices appear to have little bearing on resulting downstream aij generation (detailed results not shown). The Gibbs sampler is performed as described above by simply replacing the univariate parameters with the multivariate parameters. Notably, this yields ai values which are identical across all p genes located within an operon, since inference about activity states is occurring for the operon as a whole, not gene-by gene.
Implementation details for all methods being compared
In order to compare the performance of the two methods proposed above to current best practices, we implemented five approaches to estimating aij: median thresholding, trichotomous thresholding, rank based estimation, and our proposed univariate mixture modeling and multivariate mixture modeling approaches. We now briefly describe the methods compared here, along with some relevant implementation notes:
Median threshold (MT)
This approach dichotomizes expression values such that aij = {0 if ϵij<Mj1 if ϵij≥Mj}, where Mj is median(ϵij) for all m genes in experiment j. This approach is a special case of that proposed by GIMME (Becker and Palsson, 2008), which allows users to, a priori, select any threshold (median or otherwise). In practice, some users select the median (?). We note that the GIMME software program uses the mean as the default value for the threshold1. Due to the typical symmetry of gene expression values within an experiment, the mean is nearly identical to the median.
Trichotomous threshold (TT)
This approach trichotomizes expression values such that aij = { 0 if ϵij < Plow,j0.5 if Plow,j < ϵij≤Phigh,j 1 if ϵij≥Phigh,j}, where Plow, j is the 40th percentile for all i genes in experiment j and Phigh, j is the 60th percentile for all i genes in experiment j and is in the spirit of GIMME, but allowing for an uncertain region as proposed by Shlomi et al. (2008).
Rank based approach (RB)
This approach is a continuous analog to the MT approach. In particular, aij=rank(ϵij)m, where rank(ϵij) is the rank within experiment j. This approach is in the spirit of GIM3E (Schmidt et al., 2013) which assigns the equivalent of aij = 1 to max(ϵij), and a monotone and continuously changing decreasing confidence as ϵij decreases.
Univariate mixture model (UniMM)
This approach, described above, uses a Bayesian approach to infer aij values according to a Gaussian mixture model.
Multivariate mixture model (MultiMM)
This approach, described above, uses a Bayesian approach to infer aij values according to a Gaussian mixture model while incorporating knowledge of operon structure.
Screening and imputation methods for mixture model approaches
When implementing UniMM and MultiMM we first assessed the quality of the fit of a 2-component mixture model to the observed expression data for each gene or operon. In particular, some genes/operons may not change states (between active and inactive) across the available set of expression values (all n experiment settings), thus making a 2-component mixture distribution invalid. With this in mind we used a screening method to determine which genes/operons had strong evidence that they were 1-component instead of 2-component.
The screening method uses the Bayesian Information Criterion (BIC) to assess the fit of a 1-component (univariate or multivariate) Gaussian mixture distribution vs. a 2-component mixture distribution using the R package Mclust2. Following Raftery et al. (Raftery, 1995) we require the BIC to be at least 12 points better for the 1-component model to be chosen vs. the 2-component model. We note, however, that if we were to simply choose the best BIC between the 1 and 2 component models there would be little impact on the number of genes screened as being from a single component (detailed results not shown).
The aij values for genes which were screened as coming from only a single component (all active or all inactive) were estimated using a multiple imputation approach (MI) in order to identify similar genes/operons and impute aij values. Multiple imputation is a well-known statistical procedure for estimation of missing values (Rubin, 1987). When a gene, i, was screened as being from a single component, we used the R package Mclust2 to fit a single component Gaussian distribution to the data and estimate the corresponding mean and standard deviation of the model. Results of the UniMM approach for all genes from 2-component mixtures were then evaluated to identify “similar” genes, where similar genes had (μ^0∈x¯ϵi±0.1 and σ^0∈sϵi±0.1) or (μ^1∈x¯ϵi±0.1 and σ^1∈sϵi±0.1), where 0.1 is an arbitrary threshold. The multiple imputation approach then computes ai's for each of the similar genes as if the ϵi, j came from each of the similar genes. The final ai's for each imputed gene are computed by averaging across the imputed ai's from each similar gene. In the case of operons identified as coming from a single component (MultiMM approach), each gene in the operon is first considered separately using the same MI approach as for the UniMM method, and then the resulting ai's for each gene in the operon are averaged in order to yield consistent ai values for all genes in the operon. For single component operons which are identified as always active or inactive, π^=1 or 0, respectively. Finally, we note that if no similar genes are identified, then the MI approach returns aij = 0.5 for all j.
Real data sets
For most of our analyses, we used genome-wide gene expression data from 907 different microarray data sets collected on 4329 Escherichia coli genes via the M3D data repository (Faith et al., 2007, 20083). Raw data from Affymetrix4 CEL files were normalized using RMA (Irizarry et al., 2003). Details of data processing are described elsewhere (Tintle et al., 2012; Powers et al., 2015). We also performed analysis on gene expression and fluxomics data from Ishii et al. (Ishii et al., 2007) comprising 79 E. coli genes in 29 experimental conditions. E. coli operon predictions for 2648 operons, including 1895 single gene operons, were obtained from Microbes Online (Price et al., 2005).
Simulated data sets
We also simulated expression data with “known” gene activity states (active/inactive). The simulation of expression data was informed by the E. coli expression data described above. We first ran the Screening Method described above (see Section Screening and Imputation Methods for Mixture Model Approaches) and dropped all operons, including single gene operons, for which the two-component model did not yield the highest BIC (n = 697 dropped). We then randomly selected 26.3% (= 697/2648) of the remaining 1951 operons to be single component in the simulated data, with each of the single component operons having an equal likelihood of being always active or always inactive.
We used two different methods to calculate the mixing parameter, π, used in the simulation for the 1438 two-component operons. The Uniform Method (Unif) chooses a random value for π between 0.2 and 0.8. The Fitted Method (Fit) uses the MultiMM estimate of π (details given above). Values for μ⃗0,μ⃗1, Σ0 = Σ1 are all as estimated by the MultiMM method computed on the real expression data. To generate simulated expression values, ϵijs, we drew 907(πi) random values from a multivariate normal distribution (μ⃗1i,Σ1i) and 907(1 − πi) random values from a multivariate normal distribution (μ⃗0i,Σ0i). Thus, we generated a 907 by 3435 matrix of ϵijs values.
Validation of real gene activity calls using flux variability analysis
In order to generate alternative predictions of gene activity we used Flux Variability Analysis (Mahadevan and Schilling, 2003) on the E. coli iJO1366 metabolic model (Orth et al., 2011). In particular, we ran flux variability analysis on the E. coli metabolic model yielding flux bounds vlow and vhigh for each reaction in the model. Media conditions and gene mutations were accounted for in a model maximizing biomass. The following rules were then used to determine predictions rij of reaction activity for each reaction in the model and each of the 907 experiments.
rij={0 if vij,low=01 if vij,low > 0}
These reaction-level predictions were converted to gene-level predictions, pij, accounting for isozymes and multi-gene complexes as follows. If rij = 1 and the reaction is associated with a single gene or multi-gene complex, pij for all genes involved is 1. If rij = 0 and the reaction is associated with a single gene or multiple isozymes, pij for all genes involved is 0. If rij = 1 but the reaction is associated with multiple isozymes, we cannot be sure which isozyme is responsible for enabling the reaction, so we make no prediction (assign no pij value) for any of the genes involved. If rij = 0 but the reaction is associated with a multi-gene complex, we cannot be sure which subunit is responsible for thwarting reaction activity, so we make no prediction (assign no pij value) for any of the genes involved. If the previous four rules result in contradictory pij values for any given gene (e.g., both pij = 0 and pij = 1), then we let pij = 1 for that gene. This assumes that a gene with multiple roles can be active but perform only some of its roles; for instance, a gene product may be associated with an active reaction, but also be an isozyme on an inactive reaction, resulting in contradictory pij values. In these cases pij = 1 is the correct prediction. Finally, we note that all but three of the 907 experiments resulted in a growth prediction by the metabolic model, and that, following these rules, pij values could be obtained for approximately 845,000 of the 1.2 million gene-by-experiment combinations in the metabolic model.
Statistical analysis
We used two primary approaches to evaluate the quality of aij's resulting from different methods applied to both simulated and real data. The squared deviation approach quantifies the difference between aij's and true or predicted gene activity states. We computed dij2=(aij-αij)2 for simulated data, where αij is the true gene activity state, and dij2=(aij-pij)2 for the real expression data where pij is the predicted gene activity state based on the flux variability analysis of the metabolic model. We note that dij is only computed on the 1353 genes in the metabolic model for the real expression data, since other genes have no pij values. The average deviation can then be computed by experiment, by gene or for various other subsets of the data.
The consistency approach indicates that an aij is consistent with pij or αij if a dichotomized version of the aij is consistent with pij or αij. In particular, cij is computed as follows:
cij={1 if aij >0.5 and pij =10 if aij >0.5 and pij =00 if aij <0.5 and pij =11 if aij <0.5 and pij =00.5 if aij =0.5}orcij={1 if aij >0.5 and αij =10 if aij >0.5 and αij =00 if aij <0.5 and αij =11 if aij <0.5 and αij =00.5if aij =0.5}
A consistency score can then be computed by summing cij across various subsets of the data (e.g., all genes within an experiment; all genes in an operon, etc.).
Lastly, we evaluated the differential expression of metabolic pathway components (DeJongh et al., 2007; Aziz et al., 2008; Henry et al., 2010) among different subsets of experiments using a modified gene set analysis approach (Tintle et al., 2008). These pathway components have been previously shown to demonstrate strong consistency with gene expression data (Tintle et al., 2012). Briefly, we found the average aij value for all genes within each pathway component in each of the 907 experiments, and then ran a two-sample t-test comparing the mean activity scores between different subsets of the 907 experiments.
Validation using experimentally measured fluxes
Lastly, we evaluated gene activity estimates inferred from expression data vs. experimentally measured reaction fluxes on a published set of 79 genes in 29 separate experimental conditions (Ishii et al., 2007; Machado and Herrgård, 2014). In short, we computed gene activity estimates (aij's) for each gene-experiment combination using methods described above. After generating the gene-level aij's, we mapped them to reaction-level predictions qij, accounting for isozymes and multi-gene complexes as follows. For each reaction, if the reaction is associated with a single gene, qij for that reaction is equal to the aij for that gene. If the reaction is associated with a multi-gene complex, qij for that reaction is equal to the minimum of the aij's for all the genes involved; and if the reaction is associated with isozymes, qij for that reaction is equal to the maximum of the aij's for all the genes involved. In any of these three cases, if any gene has no aij (because it was not one of the 79 genes for which expression data was measured), it is ignored for the purposes of taking the minimum or maximum; or if all of the genes associated with a given reaction have no aij, that reaction is dropped from the analysis. We repeated this procedure with each aij generation method, and also with the gene-level ϵij's for comparison purposes, to create alternate reaction-level predictions based directly on expression value.
We compared the correlations we observed between each set of qij's with the (absolute values of the) experimentally measured fluxes for those reactions in those experimental conditions. A square-root transformation was applied to both the fluxes and the ϵij-based qij's to normalize these skewed distributions to ensure robust correlation estimates were obtained. Multiple linear regression models were used to predict fluxes using gene activity method estimates and expression values in order to evaluate the explanatory ability of different gene activity state estimates with flux values.
Software
R scripts and an example implementation of the approaches considered here (MT, TT, RB, UniMM, and MultiMM) are freely available on the Software page at http://www.dordt.edu/statgen.
Discussion
We have presented a Bayesian framework for the classification of microbial gene activity states (active or inactive), based on a compendium of genome-wide gene expression data. Our approach first uses the Bayesian Information Criterion to identify genes that likely have a mixture of both active and inactive states present in the data. A Gibbs sampler is then used to provide estimates of the posterior probability that a gene is active in each condition, based on a Gaussian normal mixture model. Our approach addresses four key limitations of existing approaches for classifying gene activity states: (a) different activity thresholds for different genes (Figure 4A vs. Figure 4B), (b) different proportions of gene activity between different experiments (Results, Overall patterns of gene activity), (c) benefits of leveraging a priori evidence of co-regulation (Figures 4C–E) and (d) benefits of quantified statistical uncertainty (e.g., Figure 1, among others). Specific figures and tables in the manuscript provide visual intuition about how the proposed method addresses these limitations.
By addressing these limitations, the Mixture Model approaches (MultiMM and UniMM) show less deviation from true gene activity states on simulated data, more consistency with metabolic model flux predictions and operon structure on real data, and stronger association with experimentally measured fluxes, with MultiMM doing best. Results from model fitting on the real data by the mixture model approaches showed great variance in the overall means, standard deviations and mixing proportions suggesting empirically that the limitations stated above are, in fact, realistic concerns for existing approaches which can be addressed by these mixture model approaches. Furthermore, association between inferred gene activity states using the MultiMM approach yielded stronger correlations with observed flux data, while other methods (MT, TT, and RB, as well as the raw expression data itself) left substantial variation in flux values as unexplained. Finally, pathway components associated with synthesis activities were significantly more expressed in the absence of a rich media condition (yeast extract) than in the presence of a rich medium as expected.
The improved performance of MultiMM over UniMM highlights the utility of incorporating genome-based operon predictions in activity state estimation. The Bayesian approach we have developed acts as a general framework for future innovation via the inclusion of other –omics data sources. For example, transcriptional regulatory networks (TRNs) can be actively incorporated into the analysis pipeline by expanding the MultiMM approach to utilize regulons in addition to operons. However, full integration of TRN information will require explicitly incorporating TRN uncertainty into the Bayesian framework. For example, due to TRN uncertainty we might be only 90% sure that two genes are co-regulated, in contrast to our current approach, which requires gene sets (operons) to be defined explicitly and with 100% certainty. Furthermore, this same uncertainty approach can likely be applied to operons when, for example, an operon comprises several transcription units. We believe the Bayesian framework provided here provides a flexible platform for this future innovation, in order to continue to reduce overall deviation of activity state estimates from true gene activity states. A similar model is also being explored by our group to incorporate (a) flux profiles, (b) functional information, (c) cross-organism gene homology and other sources of genetic information which, when incorporated into the activity state estimates, may further improve their accuracy and precision.
Downstream applications of improved gene activity state measurements are numerous, though we focus our discussion here mainly on metabolic modeling. In particular, our improved gene activity state measurements will allow us to incorporate gene activity information into subsequent metabolic models simulations through statistically informed penalties, rather than arbitrary or loose penalties, which often serve to down-weight gene expression data to the point where it has little to no real impact on downstream modeling results (e.g., Chandrasekaran and Price, 2010). We are currently exploring metabolic modeling advances that incorporate aij's.
Some limitations of our approach and our analysis here are worth noting. Gene activity state estimates likely improve as the number and diversity of experimental conditions increases, though we saw promising results even in a small follow-up study of expression data in 29 conditions. Given the dramatically reduced price of RNA-sequencing, continued rapid growth in the size and diversity of expression data is expected to make this limitation less of an issue. Our analysis here focuses on E. coli, though an important area of future work involves application of our approaches to a variety of other microbes, both to demonstrate transferability of the approach, but also in order to explore methods of improving activity state inference by leveraging information from multiple organisms simultaneously (e.g., gene homology, regulatory and metabolic homology, etc.). Further work is needed to evaluate the performance of these methods on RNA-seq data, though we anticipate the performance should be similar after standard normalization and transformation procedures are applied to the data. Finally, numerous additional validation studies are possible (e.g., an expanded comparison to observed flux data) and should be considered in future work. An initial small-scale evaluation presented here showed promising initial results.