Methods for Determining the Statistical Significance of Enrichment or Depletion of Gene Ontology Classifications under Weighted Membership

Iacucci, Ernesto; Zingg, Hans H.; Perkins, Theodore J.

doi:10.3389/fgene.2012.00024

ORIGINAL RESEARCH article

Front. Genet., 23 February 2012

Sec. Systems Biology Archive

Volume 3 - 2012 | https://doi.org/10.3389/fgene.2012.00024

Methods for Determining the Statistical Significance of Enrichment or Depletion of Gene Ontology Classifications under Weighted Membership

Ernesto Iacucci^1,2*

Hans H. Zingg³

Theodore J. Perkins^4,5

¹ Department of Electrical Engineering ESAT-SCD, Katholieke Universiteit Leuven, Leuven, Belgium
² IBBT-KU Leuven Future Health Department, Katholieke Universiteit Leuven, Leuven, Belgium
³ Department of Physiology, McGill University, Montreal, QC, Canada
⁴ Department of Biochemistry, Immunology and Microbiology, University of Ottawa, Ottawa, ON, Canada
⁵ Ottawa Hospital Research Institute, Ottawa, ON, Canada

High-throughput molecular biology studies, such as microarray assays of gene expression, two-hybrid experiments for detecting protein interactions, or ChIP-Seq experiments for transcription factor binding, often result in an “interesting” set of genes – say, genes that are co-expressed or bound by the same factor. One way of understanding the biological meaning of such a set is to consider what processes or functions, as defined in an ontology, are over-represented (enriched) or under-represented (depleted) among genes in the set. Usually, the significance of enrichment or depletion scores is based on simple statistical models and on the membership of genes in different classifications. We consider the more general problem of computing p-values for arbitrary integer additive statistics, or weighted membership functions. Such membership functions can be used to represent, for example, prior knowledge on the role of certain genes or classifications, differential importance of different classifications or genes to the experimenter, hierarchical relationships between classifications, or different degrees of interestingness or evidence for specific genes. We describe a generic dynamic programming algorithm that can compute exact p-values for arbitrary integer additive statistics. We also describe several optimizations for important special cases, which can provide orders-of-magnitude speed up in the computations. We apply our methods to datasets describing oxidative phosphorylation and parturition and compare p-values based on computations of several different statistics for measuring enrichment. We find major differences between p-values resulting from these statistics, and that some statistics recover “gold standard” annotations of the data better than others. Our work establishes a theoretical and algorithmic basis for far richer notions of enrichment or depletion of gene sets with respect to gene ontologies than has previously been available.

Introduction

High-throughput studies often generate large sets of “interesting” genes that must be further researched from available annotation resources. This research may encompass literature reviews and querying various databases. This process leads to a multitude of information to consider for each gene as well as inconsistent gene descriptions from one resource to another. The application of a statistic to a particular annotation resource can identify the statistically significant classifications common to the set of interesting genes (Man et al., 2000; Berriz et al., 2003; Castillo-Davis and Hartl, 2003; Diaz-Uriarte and Dopazo, 2003; Doniger et al., 2003; Draghici et al., 2003; Hosack et al., 2003; King et al., 2003; Al-Shahrour et al., 2004; Zhang et al., 2004; Khatri and Draghici, 2005). In this paper we generalize previous work by allowing genes to have weighted memberships in classifications.

Weighted membership functions offer far more expressive power than traditional binary membership functions, as we demonstrate below. They allow us to represent complicated relationships within an ontology, distinguish good from weak representatives of a classification, account for the strength of evidence from a high-throughput study, or more generally account for relationships between different data sets. Thus, the extension of gene ontology (GO) analysis to weighted membership is an important one.

The problem of determining statistical significance under weighted membership is more complicated than under binary membership. There is a computational “cost” to the generality we gain. In this sense, it is reminiscent of the well-known knapsack problem. The knapsack problem is trivially solvable if all items have unit weight. If items have arbitrary weights, it is NP-complete, although a pseudopolynomial-time dynamic program can usually solve realistic instances quite efficiently. In similar fashion, we derive a pseudopolynomial-time dynamic programming algorithm for computing exact p-values for the statistical significance of enrichment or depletion of classifications under weighted membership. Moreover, we show that for several specific weighted membership functions, the algorithm can be specialized to gain greater computational efficiency. We apply these methods to two test data sets, where we show that weighted membership functions are better able to extract biologically meaningful classifications than traditional binary analysis.

Gene Ontology

Our methods are not restricted to any particular ontology, but because of its popularity, we focus on the well-known GO. GO (The Gene Ontology Consortium, 2000) is a controlled, hierarchically organized vocabulary that aims to provide a consistent description of genes to the biological community, along with relationships between these descriptions. GO includes three principal classifications: molecular function, biological process, and cellular component, as well as many more specific classifications. Molecular functions are the biochemical roles of genes, for example, GTPase activity. Biological processes involve one or more molecular functions over multiple distinguished steps. An example of this would be mitosis, where each interaction of proteins in the progression of the cell cycle would be considered a step. Cellular components are described as the localization of a protein in the cell or its membership in a complex. An example would be the proteins that are members of the ribosomal complex.

The relationships between classifications define a directed acyclic graph (DAG), in which vertices in the DAG correspond to classifications and a directed edge from classification X to classification Y means that Y is a direct specialization of X. Some authors assume that there is no more general classification than the three principal classifications. We assume a root vertex gene ontology, with the principal classifications as children. Having defined a root, we therefore discuss the GO DRAG (Directed Rooted Acyclic Graph) in the rest of this paper. All vertices except the root have at least one parent, and may have more. Hexose biosynthesis, for example, has two parents, hexose metabolism and monosaccharide biosynthesis (The Gene Ontology Consortium, 2000).

Genes may be mapped to one or more classifications. A mapping is the association of a gene to a classification. Cytochrome c is one such example; it is mapped to the molecular function classification electron transporter activity, the biological process classifications oxidative phosphorylation and induction of cell death, and cellular component classifications mitochondrial matrix and mitochondrial inner membrane. As parent vertices are less specific classifications than their child vertices, genes mapped to the child vertices are understood to be also described by the parent vertices. We say that gene X is a member of classification Y if X is directly mapped to Y or if it is mapped to any classification that specializes Y – that is, to any classification which is reachable in the DRAG from Y.

While GO provides classifications that can be used to describe the genes in an interesting set, it does not provide guidance on which of the classifications are of interest to researchers for further investigation. The number of classifications to consider can be reduced by focusing on classifications that are enriched or depleted as defined in a statistical framework (Draghici et al., 2003; Zeeberg et al., 2003).

Weighted Membership in Classifications

Suppose a gene g has been flagged as interesting based on a laboratory experiment. For example, gene g may be differentially expressed in a microarray experiment or may have a specific post-transcriptional modification. Now consider a classification v. Does g provide evidence, and if so, how much, for the involvement of classification v in the condition under study? The simplest heuristic would be to say that g provides evidence for v if and only if g is a member of v (Draghici et al., 2003; Zeeberg et al., 2003). However, one may also consider weighted membership scores, which might account for relationships between classifications, degrees of differential expression, or other knowledge. For example, consider the DRAG and gene mappings depicted in Figure 1. Gene g₂ is mapped to classifications v₇ and v₈. If g₂ is flagged as interesting, we do not know if it is because of the involvement of v₇, v₈, or both. In any case, however, the flagging of g₂ indicates the involvement of classification v₄. So, we might interpret g₂ as providing stronger evidence, or greater weight, for v₄ than either of v₇ and v₈. Then again, if g₅ is also flagged, and has much greater differential expression than g₂, we might decide the strongest evidence is for classification v₆. Conversely, some genes in a classification many never be differentially expressed, and that should not be taken as evidence against the classification. For example, in signaling or metabolic pathways, it is common for many proteins to be ubiquitously expressed at a basal rate, whether the pathway is active or not. One might want to give such genes small or even zero membership weight in that pathway – for the purpose of determining which pathways are represented in the interesting set of genes – so that when the genes that are regulated change their expression, the activity in the pathway is more apparent. Thus, we wish to be able to determine significance for more general notions of evidence, or weighted membership functions.

FIGURE 1

Figure 1. Schematic of GO. Abstract example of an ontology illustrating the principles of the GO DRAG and the scoring functions Boxes labeled with Root or n correspond to classifications Sets represent the genes directly mapped to each classification. The broken line indicates the subgraph rooted at v − i.

Given a DRAG and a population set of genes G, we define a weighted membership function Φ:G × V → Z⁺, where V is the set of all classifications and Z⁺ is the set of non-negative integers. Φ(g, v) is always zero if g is not a member of v, but otherwise Φ(g, v) is unconstrained. The standard membership function is an important special case, which we will denote by Φ₁(g, v) = {1 if g ∈ v; 0 if g ∉ v}. We will consider two other weighted membership functions, although many interesting choices are possible. We define Φ₂(g, v) as the number of paths in the ontology DRAG from v to any vertex v′ to which g is mapped. This is intended to account for situations such as for g₂ in the example of the previous paragraph. In that example, Φ₂(g₂, v₇) = Φ₂(g₂, v₈) = 1, but Φ₂(g₂, v₄) = 2, indicating that, all other things being equal, we have greater confidence that g₂ is representative of classification v₄ than it is of either v₇ or v₈ under any particular circumstance. For Φ₃ we assume access to an integer-valued vector X, where X(g) might represent, for example, a discretized measure of differential expression from a high-throughput gene expression experiment. We define Φ₃(g, v) = {X(g) if g ∈ v; 0 if g ∉ v}.

Ultimately, we want to assess the enrichment or depletion of a set of genes with respect to a classification. For this purpose, any weighted membership function Φ can be generalized to take as input a set of genes H and a classification, simply by summing the scores of the individual genes. That is, Φ(H, v) = Σ_g∈HΦ(g, v).

P-Values for the Significance of Enrichment or Depletion

We assume now that a laboratory experiment has flagged a set of interesting genes H, and we want to know whether the set H provides evidence for a classification v. We choose a weighted membership function Φ_A, and want to know whether Φ_A(H, v) is significantly greater than would be expected by chance. To define this precisely, we must account for properties of the interesting set, H. If H is large, for example, we generally expect higher classification scores, and thus the threshold for significant enrichment ought to be higher. More generally, we allow conditioning not just on the size of H, but on the score of the set H under any weighted membership function Φ_B evaluated at the root classification. Thus, let H′ represent a random subset of G, which we take to be uniformly distributed over all subsets of G satisfying Φ_B(H′, r) = Φ_B(H, r), where r is the root classification. In general, we consider p-values for enrichment of classification v of the form:

where Φ_A and Φ_B can be any two weighted membership functions and r is the root classification. We use the shorthand notation Φ_A | Φ_B to denote this p-value. (See Figure A1 in Appendix for some worked examples based on Figure 1.) The p-value for depletion simply has the inequality reversed.

In the special case that Φ_A = Φ_B = Φ₁, the condition Φ_B(H′, r) = Φ_B(H, r) implies that H′ is restricted to sets of the same size as H and Φ_A(H′, v) is just the number of genes in H′ that are members of v. The p-value is then just the widely used membership p-value, which can be computed via the hypergeometric equation as (Draghici et al., 2003; Zeeberg et al., 2003):

where N is the number of genes that are members of v, M is the number of genes in G that are not members of v, c = |H|, and m = c − n. We can also write this without reference to N, M, or c using the substitutions N = Φ₁(G, v), M = Φ₁(G, r) − Φ₁(G, v) and c = Φ₁(H, r). More generally, if Φ_B = Φ₁ but Φ_A is not equal to Φ₁, then the p-value represents the chance of getting a score Φ_A(H′, v) ≥ Φ_A(H, v) if H′ is uniformly random from all sets of genes of the same size as H.

The simple formula for Φ₁ | Φ₁ is often introduced in the context of ball-and-urn models. One imagines N red balls and M blue balls mixed together in an urn, and then c balls are drawn from the urn randomly but without replacement. This metaphor fails for general Φ_A | Φ_B for two reasons. First, because membership is weighted, the balls, corresponding to genes, do not have equal significance. Second, because a gene g may have Φ_A(g, v) > 0 and Φ_A(g, v) ≠ Φ_B(g, r) > 0, the balls may be not just strictly red or blue but some combination of both. There is no simple formula for Φ_A | Φ_B in general, but in the next section we describe dynamic programming algorithms for computing this p-value.

Due to representation of GO as a DRAG and the mapping of genes to GO, an exact method for measuring such p-values has never been proposed. More precisely, the main challenges presented by the GO model are the lack of independent and unique genes. As genes can be mapped to one or more classifications, they are not unique. For example, in Figure 1, g₆ is mapped to both v₂ and v₅. Due to this feature of the model, genes can be reachable in both a subgraph rooted at v and in its complement yes Thus, such genes are not independent as selecting them in v will also require that they be selected in yes For example, in Figure 1, g₁ is reachable in both the subgraph rooted at v₄ (mapped to v₇) and at its complement (mapped to v₂).

Looking at Figure 2, one can better appreciate the difference in an unweighted versus a weighted model. Unweighted values can be seen as the balls in urn A. As these balls are independent, the possible observations can be easily modeled by the hypergeometric. Under a weighted model, this is not possible due to the interrelated nature of the balls. The balls in urn B are dependent as they are attached by a string, hence the hypergeometric model cannot be used to model observations of randomly picking balls from urn B. Thus, the probability of observing two gray balls given four balls were selected from urn B is 0.16.

FIGURE 2

Figure 2. Urn representation of Figure 1 diagram illustrating the principles of independent versus interrelated models. Gray balls (genes) are those found at or below v₄ and white balls (white) are those found in the rest of the DRAG. (A) Gray and white balls (genes) in urn arc independent. (B) Gray and while (genes) in urn are interrelated, the GO DRAG and the scoring functions. The solid lines indicate that balls (genes) are related.

Our primary technical contribution is to describe how to determine p-vales given a model where one cannot assume gene weights to be independent or unique. Precise computation of the p-values allows greater confidence in biological assertions made based on the statistical modeling.

Materials and Methods

Exact computation of the p-value (Eq. 1) is NP-complete, for example, by reduction from the 0 to 1 knapsack problem. Like the 0–1 knapsack problem, however, it can be solved in pseudopolynomial-time by a dynamic programming method. We first describe a generic dynamic programming method for computing the p-value, and then describe several improvements and special cases that can significantly speed the computation.

Naïve Dynamic Programming Approach

We construct a two-dimensional table T, where T(a, b) represents the number of subsets H′ of G for which Φ_A(H′, v) = a and Φ_B(H′, r) = b. The main loop is over genes g in G. The update rule for the table (see line five below), T(a + Φ_A(g, v), b + Φ_B(g, r)) ← T(a, b) + T(a + Φ_A(g, v), b + Φ_B(g, r)), is based on the observation that any subset H′ satisfying the constraints Φ_A(H′, v) = a and Φ_B(H′, r) = b either includes the gene g or does not include g. Values for a range from 0 to A_max = Φ_A(G, v), and values for b range from 0 to B_max = Φ_B(G, r). Line 8 of the code below accounts for a set H′ made up of only the gene g. To simplify the presentation of the code, we have allowed it to access T outside of these ranges; such values should always be taken to be zero.

(1) Initialize: T(a, b) ← 0 for all a, b
(2) For each g ∈ G
(3)     For a = 0 to A_max
(4)          For b = 0 to B_max
(5)               T(a + Φ_A(g, v), b + Φ_B(g, r)) ← T(a, b) +
                       T(a + Φ_A(g, v), b + Φ_B(g, r))
(6)          End
(7)     End
(8)     T(Φ_A(g, v),Φ_B(g, r)) ← T(Φ_A(g, v),Φ_B(g, r)) + 1
(9) End

The desired p-value, Φ_A | Φ_B, is the computed as:

The time complexity of this algorithm, measured by the number of table updates, is O(|G | A_max B_max). The table entries can easily become larger than integer or long integer precision, and so it is necessary to use arbitrary precision arithmetic. Nevertheless, this computation is feasible for typical problems in which |G| may be of the order 10³ or 10⁴ and A_max and B_max may be of the order 10³–10⁵. This gives the p-value for enrichment of only a single classification. If p-values for multiple, even all, of the classifications are sought, the time complexity is multiplied by the number of classifications being tested. For particular choices of Φ_A and Φ_B, more efficient dynamic programs can be developed. This is the subject of the following subsections.

Improvements to the Naïve Dynamic Program

Several changes to the naïve dynamic program can drastically increase its computational and storage efficiencies. First, one only needs the first Φ_B(H, r) columns of the table T, as it is the last column that is used in the computation of the p-value. Although a simple observation, this can be a dramatic improvement if Φ_B(H, r) is small. For example, if Φ_B = Φ₁ and considering that typically |H| = |G|, one could expect to save at least an order of magnitude computation and storage.

Second, many genes g may have the same weights, Φ_A(g, v) and Φ_B(g, r). We define the weight count W_i,j to be the number of genes g with Φ_A(g, v) = i and Φ_B(g, r) = j. These groups of genes can be treated simultaneously – if k of these genes are included in a subset H′, then the contribution to Φ_A(H′, v) and Φ_B(H′, r) is the same regardless of precisely which k are chosen. If we add an innermost loop over k to our dynamic program, the main update rule above (step 5) can thus be restated as:

A third optimization is possible in the case that Φ_A(g, v) = 0 for many genes g. We can split the dynamic program into two tables, T¹ and T². T¹(a, b) is the number of subsets H′ of genes with Φ_A(H′, v) = a and Φ_B(H′, r) = b, where H′ is restricted to contain only genes for which Φ_A(g, v) > 0. T² is a one-dimensional table in which T²(b) is the number of subsets H′ of genes with Φ_B(H′, r) = b, where H′ is restricted to contain only genes for which Φ_A(g, v) = 0. From these two tables, the original T(a, b) can be reconstructed as T(a, b) = Σ_b′ T¹(a, b′)*T²(b − b′).

These improvements can be used independently or combined for the greatest effect, and may even be further improved in some particular cases. In the next subsection we discuss three particular p-values we use in our experiments in relation to these improvements to the dynamic program.

Dynamic Programs for Φ₂ | Φ₁, Φ₂ | Φ₂ and Φ₃ | Φ₃

There are many possible choices for Φ, and many combinations, resulting in different p-values for measuring the statistical significance of enrichment or depletion of GO classifications. Recall that in Section “P-Values for the Significance of Enrichment or Depletion” we defined several possible choices of Φ (see also Table 1 for a notational reminder): Φ₁ is the standard binary membership function, Φ₂(g, v) measured the number of paths in the DRAG from v to any classification v′ to which g is mapped, and Φ₃(g, v) is equal to some global “score” (e.g., differential expression) assigned to g if g is mapped to v, and otherwise is equal to zero. Here, we derived specialized, more efficient dynamic programs for computing p-values involving these three choices of Φ. The first is Φ₂ | Φ₁. Conditioning on Φ₁ is perhaps most natural when H is determined experimentally (say by a microarray study), because the size of H often depends on some arbitrary significance cut-off or other practical considerations. Secondly, we consider Φ₂ | Φ₂, which may be more natural when genes of interest are not determined experimentally but based on some other property – for example, sharing a particular transmembrane domain. Finally we consider Φ₃ | Φ₃, which reflects an intuitively compelling situation where one wishes to make use of expression level measurements, which are common to many genomics studies. We describe the implementations of these cases below.

TABLE 1

Table 1. Summary of notation used in this paper.

We use all three optimizations described in Section “Improvements to the Naïve Dynamic Program” to compute Φ₂ | Φ₁. In the two-table decomposition, the second table does not actually need to be computed, as T²(b) is just M choose b, where M is the number of genes that are not members of classification v. The weight counts W_i,j are zero for j≠1, so we can drop the j subscript and refer simply to W_i. Let I be the maximum value of i for which W_i is positive. The pseudocode for computing table T₁ is as follows.

(1) Initialize T₁(a, b) = 0 for all a, b
(2) For i = 1 to I
(3)     For b = 1 to |H|
(4)          For a = 1 to A_max
(5)               For k = 1 to W_i
(6)                     yes
(7)               End
(8)          End
(9)     End
(10) For k = 1 to W_i
(11)      yes
(12)     End
(13) End

To compute Φ₂ | Φ₁, the summation Σ_b′T¹(a, b′)*(M choose (b − b′)) can be substituted for T(a, b) in Eq. 3. However, the denominator is simply (|G| choose |H|) in this case, so we can simply compute Φ₂ | Φ₁ = Σ_aΣ_b′T¹(a, b′)*(M choose (b − b′)), where the a in the sum goes from Φ₂(H, v) to A_max, and divide by (|G| choose |H|).

For Φ₂ | Φ₂ we need to compute tables T¹ and T² explicitly. The straightforward computation of T¹ would be wasteful because for any subset of genes H′ we have Φ₂(H′, r) ≥ Φ₂(H′, v). Thus, all entries of T¹ below the main diagonal would be zero. Instead, we slightly redefine T¹ so that the columns correspond to Φ₂(H′, r) − Φ₂(H′, v) instead of just Φ₂(H′, r). For simplicity, we omit the weight count optimization, so that the pseudocode for computing T¹ is then:

(1) Initialize T¹(a, b) ← 0 for all a, b
(2) For each gene g with Φ₂(g, v) > 0
(3)     For b = 0 to Φ₂(G, r) − Φ₂(G, v)
(4)          For a = 1 to Φ₂(G, v)
(5)             T¹(a + Φ₂(g, v), b + (Φ₂(g, r) − Φ₂(g, v))) ←
                     T¹(a, b) + T¹(a + Φ₂(g, v), b + (Φ₂(g, r) − Φ₂(g, v)))
(6)          End
(7)     End
(8)     T¹(Φ₂(g, v),Φ₂(g, r) − Φ₂(g, v)) ←
            T¹(Φ₂(g, v),Φ₂(g, r) − Φ₂(g, v)) + 1
(9) End

The computation of T² is straightforward. T²(b) is the number of ways of making Φ₂(H′, r) using only genes g, for which Φ₂(g, v) = 0. We use the weight count trick to speed to computation, defining J to be the maximum j for which W_0,j > 0. The pseudocode for creating the table is as follows:

(1) Initialize T²(b) ← 0 for all b
(2) For j = 1 to J
(3)     For b = 1 to Φ₂(G, r)
(4)          For k = 1 to W_j
(5)                yes
(6)          End
(7)     End
(8)     For k = 1 to W_j
(9)           yes
(10)   End
(11) End

The p-value Φ₂ | Φ₂ can be computed using Eq. 3 with the substitution T(a, b) = Σ_b′T¹(a, b′ − a)*T²(b − b′). The p-value Φ₃ | Φ₃ can be computed similarly.

Data Sources

To test our approaches, we used the OXPHOS-CR (oxidative phosphorylation co-regulated) dataset (53 genes) from (Subramanian et al., 2005). The dataset can be characterized as containing OXPHOS genes that are tightly co-regulated across many tissues. Three other datasets were derived from this dataset, each assigned 100, 300, and 500 additional random genes to mimic noisy datasets.

We also applied the four methods of determining significance described above (Φ₁ | Φ₁, Φ₂ | Φ₁, Φ₃ | Φ₃, and Φ₂ | Φ₂) to a publicly available dataset concerning changes in gene expression in the rat uterus at different stages of parturition (Girotti and Zingg, 2003). The Girotti study used rat Genome U34A GeneChips (Affymetrix) to measure five time-points in rat pregnancy: day 0, day 20, at term but not in labor, at term and in labor, and postpartum. The Girotti study presents a literature search for genes showing differential expression in at least one time-point comparison. Filtering criteria applied in the study resulted in 4137 genes (Girotti erroneously reported the number as 4138) forming the population set of genes, of which 431 interesting genes (non-redundant genes that were differentially expressed between day 0 and any other time-point) were annotated using various annotation resources (SwissProt, ExPASy, and other databases) into 17 user-defined classifications. In our study, the GO DRAG was limited to vertices to which the 4137 genes (the set G) are mapped. The 431 genes were used to define the interesting set H. The GO definitions were supplied by GO for the U34A GeneChip (Affymetrix; (The Gene Ontology Consortium, 2000). For Φ₃, X(g) = 2 for average fold change over the first two time-points, 1 for less than average fold change (less than X − 2 SD), and 3 for higher than average fold change (greater than X + 2 SD).

Results and Discussion

In order to evaluate the quantitative performance of the methods, we ran them on the OXPHOS-CR dataset and took the 35 classifications that were annotated to this dataset to be a “golden standard.” In practice, a set of interesting genes will not only be composed of purely interesting genes but also noise (false positives). We therefore ran our methods on the “noisy” datasets that were derived from the golden standard by the addition of random genes (+100 genes, +300 genes, +500 genes).

The results, shown in Table 2, show that the Φ₂ | Φ₂ method was able to maintain the ranking of golden standard classifications better than the Φ₁ | Φ₁ (traditional hypergeometric) or the Φ₂ | Φ₁ methods when applied to the noisy datasets. Figure 3 provides the ROC plot of the rankings of the classifications for the “noisy” datasets resulting from the application of all three methods. Looking at the added 500 random genes dataset (Figure 3A), it is clear that the Φ₂ | Φ₂ method provides the best ranking of the classifications. Looking again at Table 2, one can see that the methods differ with respect to specificity and sensitivity (with statistical significance threshold set at p ≤ 0.05). Overall, the Φ₂ | Φ₁ and Φ₁ | Φ₁ had good specificity but poor sensitivity. In contrast the Φ₂ | Φ₂ had much better sensitivity than the other two methods but variable specificity. Likewise, Φ₂ | Φ₂ consistently had a better AUC than the other two methods.

TABLE 2

Table 2. Comparison of methods with added noise.

FIGURE 3

Figure 3. Maintenance of rank. The methods were compared based on their ability to maintain the rank of the “golden standard” classifications. (A) ROC plot of the rankings of the classifications from the added 500 random genes dataset. (B) ROC plot of the rankings of the classifications from the added 300 random genes dataset. (C) ROC plot of the rankings of the classifications from the added 100 random genes dataset.

In order to further evaluate the performance of the methods, we used the Girotti and Zingg (2003) study, which identified several user-defined classifications without the use of any statistical framework. In Table 3, we look at the analogous GO classifications and compare p-values from the four methods described above. Many of these classifications are significantly enriched at p ≤ 0.05 according to at least one of the methods, or are weakly significant. With regard to the biological relevance of the classifications, defense response, cell motility, extracellular matrix, and cytoskeleton maintained low p-values across all methods. This is consistent with the parturition process, as it is involves immune suppression and cell remodeling. There are, however, some surprising disagreements between the methods. For example, p-values for protein modification range over four orders-of-magnitude, going from 4.2 × 10⁻⁵ (Φ₂ | Φ₂) to 0.923 (Φ₃ | Φ₃). (See Figure A2 in Appendix for scatter plots of different pairs of p-values across all classifications.) It was surprising that p-values could differ so greatly between the methods. Table 4 shows how many classifications are identified as significantly enriched by each method individually (main diagonal) and by each pair of methods (off diagonal) using the p-value threshold of p < 0.05.

TABLE 3

Table 3. Comparison of methods on the Girotti and Zingg (2003) data.

TABLE 4

Table 4. Agreement of significance calls.

A conspicuously different value for cytoskeleton, between Φ₂ | Φ₂ and Φ₁ | Φ₁, can be observed in Table 3. While we have discussed above (see Introduction) the theoretical difference in how significance is computed under Φ₂ | Φ₂ and Φ₁ | Φ₁, it may serve as an informative exercise to look at some of the numbers that give rise to the difference in p-value for cytoskeleton under these two settings. Under the setting of Φ₁, 4 significant genes are mapped to cytoskeleton out of a total possible 25. Under the setting of Φ₂, 10 significant paths are mapped to cytoskeleton out of a possible 49. That is, while there are only 4 of 25 cytoskeleton genes in the interesting set, they are multifunctional – accounting for a total of 10 paths to mappings (an average 2.5 per gene), versus an average of slightly less than 2 paths to mappings for cytoskeleton genes over all. Thus, while these paths may describe different precise functions, they collectively implicate the overall cytoskeleton classification. Following up on the relevance of the cytoskeleton classification with respect to this experimental setting, it becomes clear that interesting genes mapped to it represent a biological reality worthy of distinguishing this category as significant. One of these genes is TPM1, which is an Actin binding protein and provides stability to Actin filaments. Accordingly, Actin is also present. Actin is a component of the cytoskeleton, which provides mechanical support to the cell. Actin also works with Myosin during muscle contraction in a calcium dependent interaction mechanism. Myosin is also one of the interesting genes. Myosin is a contractile protein whose function depends on the hydrolysis of ATP. Together, the presence of these genes reflects the physical demand placed on cytoskeleton with respect to the contraction and remodeling of the uterus, which occur during pregnancy (Slater et al., 2002; Wang and Hirsch, 2003; Salomonis et al., 2005).

Computing p-values for all 397 classifications by Φ₂ | Φ₁, Φ₂ | Φ₂, and Φ₃ | Φ₃ takes, on average, approximately 2 h. This is longer than for Φ₁ | Φ₁ and other statistics in the literature (Khatri and Draghici, 2005). However, it is still much more rapid than annotation by hand (Girotti took weeks to construct her annotation), and is readily justified by the time and expense that goes into collecting the data. Indeed, although microarray data analysis is typically “fast,” other modern types of analysis (e.g., analysis of high-throughput sequencing data) typically takes hours or even days, depending on their complexity. Using repeated weights (W_i,j), which allowed us to treat genes with equal scores simultaneously, is an important factor in the speed of the algorithm. For determining the value for a = Φ₂(H′, r), from table T, |W| = 54, T had 5581 entries, and required 8,172,874 operations. With 4137 genes and A_max as large as 22771, the naïve dynamic program would have required a table with approximately 500 million entries (22771*22771), and would have required approximately 2 trillion updates (4137*22771*22771).

One may wonder whether exact computation of the p-values is necessary as described here. While it is intuitive that approximations will differ from exact p-values, it is not readily apparent that such approximations are inutile. For this reason, we used the hypergeometric equation to approximate Φ₂ | Φ₂ and found that values do indeed differ significantly. When compared, the two were found to be only weakly correlated (R² = 0.2610). This is a concern not only for ontological analysis, but for broader, integrative analyses as well. Increasingly, the experimental design of high-throughput experimentation has grown to include probabilistic frameworks in which knowing the exact probabilities or p-values is important. An example would be Bayesian modeling (Segal et al., 2001). In large Bayesian networks, multiple classifications may be used as factors for the inference. Small errors in p-values can propagate when being used as conditioning values. Thus, exact computation of p-values is important not just for determining significance, but also for the sake of larger integrative studies.

Our dynamic programming approaches, for Φ₂ | Φ₂ in particular, can easily be generalized to compute p-values for other scoring functions. For example, one might define the score of a gene X with respect to a classification Y by the number of classifications reachable from Y to which X is mapped. Of course, interesting genes could also be selected based on features that have no direct connection to expression or ontology structure, such as presence of an amino-acid motif or of a common upstream motif. Our methods are also applicable to other annotation resources available to the bioinformatics community. For example, databases such as Kyoto Encyclopedia of Genes and Genomes (KEGG; Kanehisa and Goto, 2000) or Munich Information Center for Protein Sequences (MIPS; Mewes et al., 2000), have organized annotations to which statistics can be applied in a similar manner as described here.

While the methods presented here compute individual p-values exactly, one must consider the use of such statistics in context. Naïve application of these methods to a large number of candidate classifications creates a multiple hypothesis testing problem and risks the identification of related classifications. These problems can be addressed by adjusting the p-value threshold and by sampling-based methods (Zhong et al., 2004).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors wish to thank Marie-Eve Carrier, Dominic Devost, Karen Hellauer, and Elisa Monaco for their advice and assistance. Funding: This material is based upon work supported by grants from NSERC and CIHR. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the funding agencies.

References

Al-Shahrour, F., Diaz-Uriarte, R., and Dopazo, J. (2004). FatiGO: a web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics 20, 578–580.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Berriz, G. F., King, O. D., Bryant, B., Sander, C., and Roth, F. P. (2003). Characterizing gene sets with FuncAssociate. Bioinformatics 19, 2502–2504.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Castillo-Davis, C. I., and Hartl, D. L. (2003). GeneMerge – postgenomic analysis, data mining, and hypothesis testing. Bioinformatics 19, 891–892.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Diaz-Uriarte, R., and Dopazo, J. (2003). “Use of GO terms to understand the biological significance of microarray differential gene expression data,” in Methods of Microarray Data Analysis III, Papers from Camda ’02, Chap. 15, eds K. F. Johnson, and S. M. Lin (Durham: Kluwer Inc.), 233–247.

Doniger, S. W., Salomonis, N., Dahlquist, K. D., Vranizan, K., Lawlor, S. C., and Conklin, B. R. (2003). MAPPFinder: using geneontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 4, R7.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Draghici, S., Khatri, P., Bhavsar, P., Shah, A., Krawetz, S. A., and Tainsky, M. A. (2003). Onto-tools, the toolkit of the modern biologist: onto-express, onto-compare, onto-design and onto-translate. Nucleic Acids Res. 31, 3775–3781.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Girotti, M., and Zingg, H. H. (2003). Gene expression profiling of rat uterus at different stages of parturition. Endocrinology 144, 2254–2265.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Hosack, D. A., Dennis, G. Jr., Sherman, B. T., Lane, H. C., and Lempicki, R. A. (2003). Identifying biological themes within lists of genes with EASE. Genome Biol. 4, R70.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Kanehisa, M., and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Khatri, P., and Draghici, S. (2005). Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21, 3587–3595.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

King, O. D., Foulger, R. E., Dwight, S. S., White, J. V., and Roth, F. P. (2003). Predicting gene function from patterns of annotation. Genome Res. 13, 896–904.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Man, M. Z., Wang, Z., and Wang, Y. (2000). Power sage: comparing statistical tests for sage experiments. Bioinformatics 16, 953–959.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Mewes, H. W., Frishman, D., Gruber, C., Geier, B., Haase, D., Kaps, A., Lemcke, K., Mannhaupt, G., Pfeiffer, F., Schüller, C., Stocker, S., and Weil, B. (2000). MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 28, 37–40.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Salomonis, N., Cotte, N., Zambon, A. C., Pollard, K. S., Vranizan, K., Doniger, S. W., Dolganov, G., and Conklin, B. R. (2005). Identifying genetic networks underlying myometrial transition to labor. Genome Biol. 6, R12.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Segal, E., Taskar, B., Gasch, A., Friedman, N., and Koller, D. (2001). Rich probabilistic models for gene expression. Bioinformatics 17(Suppl. 1), S243–S252.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Slater, M. D., Zervou, S., and Thornton, S. (2002). Prostaglandins and prostanoid receptors in human pregnancy and parturition. J. Soc. Gynecol. Investig. 9, 118–124.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Elbert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., and Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

The Gene Ontology Consortium. (2000). Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Wang, H., and Hirsch, E. (2003). Bacterially-induced preterm labor and regulation of prostaglandin-metabolizing enzyme expression in mice: the role of toll-like receptor 4. Biol. Reprod. 69, 1957–1963.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Zeeberg, B. R., Feng, W., Wang, G., Wang, M. D., Fojo, A. T., Sunshine, M., Narasimhan, S., Kane, D. W., Reinhold, W. C., Lababidi, S., Bussey, K. J., Riss, J., Barrett, J. C., and Weinstein, J. N. (2003). GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 4, R28.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Zhang, B., Schmoyer, D., Kirov, S., and Snoddy, J. (2004). GOTree machine (GOTM): a web-based platform for interpreting sets of interesting genes using gene ontology hierarchies. BMC Bioinformatics 5, 16.

CrossRef Full Text

Zhong, S., Tian, L., Li, C., Storch, K.-F., and Wong, W. H. (2004). Comparative analysis of gene sets in the gene ontology space under the multiple hypothesis testing framework. Proc. IEEE Comput. Syst. Bioinform. Conf. 2004, 425–35.

Appendix

FIGURE A1

Figure A1. Explicit examples of weighted membership functions and p-value computations.

FIGURE A2

Figure A2. Method comparisons (A–F) pair-wise p-value comparisons between methods, these scatter pots show the degree of agreement between methods. R² values are provided as a summary of the agreement.

Keywords: gene ontology, weighted membership, enrichment, depletion, dynamic programming

Citation: Iacucci E, Zingg HH and Perkins TJ (2012) Methods for determining the statistical significance of enrichment or depletion of gene ontology classifications under weighted membership. Front. Gene. 3:24. doi: 10.3389/fgene.2012.00024

Received: 04 November 2011; Paper pending published: 19 November 2011;
Accepted: 06 February 2012; Published online: 23 February 2012.

Edited by:

Radhakrishnan Nagarajan, University of Arkansas for Medical Sciences, USA

Reviewed by:

Chi Zhang, University of Nebraska Lincoln, USA
Preetam Ghosh, Virginia Commonwealth University, USA

Copyright: © 2012 Iacucci, Zingg and Perkins. This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.

*Correspondence: Ernesto Iacucci, IBBT-KU Leuven Future Health Department, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Box 2446, 3001 Leuven, Belgium. e-mail: ernesto.iacucci@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

ORIGINAL RESEARCH article

Methods for Determining the Statistical Significance of Enrichment or Depletion of Gene Ontology Classifications under Weighted Membership

Introduction

Gene Ontology

Weighted Membership in Classifications

P-Values for the Significance of Enrichment or Depletion

Materials and Methods

Naïve Dynamic Programming Approach

Improvements to the Naïve Dynamic Program

Dynamic Programs for Φ2 | Φ1, Φ2 | Φ2 and Φ3 | Φ3

Data Sources

Results and Discussion

Conflict of Interest Statement

Acknowledgments

References

Appendix

Dynamic Programs for Φ₂ | Φ₁, Φ₂ | Φ₂ and Φ₃ | Φ₃