Skip to main content

ORIGINAL RESEARCH article

Front. Genet., 13 July 2016
Sec. Statistical Genetics and Methodology

Analysis of Variance Components for Genetic Markers with Unphased Genotypes

  • Division of Biostatistics, Institute for Health and Society, Medical College of Wisconsin, Milwaukee, WI, USA

An ANOVA type general multi-allele (GMA) model was proposed in Wang (2014) on analysis of variance components for quantitative trait loci or genetic markers with phased or unphased genotypes. In this study, by applying the GMA model, we further examine estimation of the genetic variance components for genetic markers with unphased genotypes based on a random sample from a study population. In one locus and two loci cases, we first derive the least square estimates (LSE) of model parameters in fitting the GMA model. Then we construct estimators of the genetic variance components for one marker locus in a Hardy-Weinberg disequilibrium population and two marker loci in an equilibrium population. Meanwhile, we explore the difference between the classical general linear model (GLM) and GMA based approaches in association analysis of genetic markers with quantitative traits. We show that the GMA model can retain the same partition on the genetic variance components as the traditional Fisher's ANOVA model, while the GLM cannot. We clarify that the standard F-statistics based on the partial reductions in sums of squares from GLM for testing the fixed allelic effects could be inadequate for testing the existence of the variance component when allelic interactions are present. We point out that the GMA model can reduce the confounding between the allelic effects and allelic interactions at least for independent alleles. As a result, the GMA model could be more beneficial than GLM for detecting allelic interactions.

1. Introduction

Typically, there are two different ways in assessing the statistical association of a categorical variable with a continuous outcome. We can either make a direct comparison of the group means among groups defined by the categorical variable or assess the variation that the categorical variable may contribute to the total variance of the continuous outcome. The former approach is usually conducted via fitting a general linear model (GLM) with or without an adjustment for other covariates. The latter approach is referred as the analysis of variance (ANOVA), which examines a quantitative outcome variable by partitioning its total variance into variance components attributable to different sources of variation. The original ANOVA model was proposed initially by Fisher (1918) and formalized later in Fisher (1925). The traditional ANOVA approach on estimation of variance components was mainly based on Henderson's method I-III by equating the observed sums of squares to their expected values (Henderson, 1953). Currently, for predictors with observed group levels, their variance components can be estimated via ANOVA tables, which are mostly based on the sequential (Type I) sums of squares or partial reductions in (Type III) sums of squares via fitting GLM (Searle, 1971; Searle et al., 1992). For predictors with observed or unobserved group levels, their variance components could also be evaluated via maximum likelihood estimation (MLE) or restricted maximum likelihood (REML) (see Searle et al., 1992).

In genetic association studies, the standard GLM or ANOVA approaches may not be directly applicable to genetic markers with unphased genotypes. In humans, a marker genotype is a combination of one paternal and one maternal allele. But the phase information (i.e., the origin of parental alleles) is often missing from most of the current genotype typing technologies. Appropriate modification on the classical GLM or ANOVA methods is therefore needed in order to overcome this unknown phase problem. In Wang (2011), we introduced several coding schemes for unphased marker genotypes in constructing the dummy-variable based GLM. In Zeng et al. (2005) and Wang (2014), some revised Fisher's ANOVA models were also proposed for ANOVA analysis of quantitative trait loci or genetic markers with unphased genotypes, which were referred as the general biallelic (G2A) or general multi-allelic (GMA) models. The estimation of genetic variance components based on the G2A and GMA models were also briefly explored.

In this study, by applying the GMA model, we further examine estimation of genetic variance components for genetic markers with unphased genotypes from a random sample of individuals in a study population. First, for a single locus GMA model, we derive the least square estimates (LSE) of model parameters in fitting the GMA model based on the genotypic group means and allele frequency estimates. Than we construct estimators of the variance components for one marker in Hardy-Weinberg disequilibrium (HWD). Next, we consider a fully parameterized two-locus GMA model. We derive the LSE of model parameters in fitting the GMA model and develop estimators of the variance components for two genetic markers in an equilibrium population. In both one locus and two loci cases, we also explore the difference between the GLM and GMA in association analysis of genetic markers. We show that the GMA models can provide the same partition on the genetic variance components as the original Fisher's ANOVA models but the GLM cannot. We clarify that the F-statistics based on the partial reductions in sums of squares from the standard ANOVA table for association testing of the fixed allelic effects in a GLM could be inadequate for testing the existence of variance components when higher order fixed allelic interactions are present. In addition, we point out that the GMA model can reduce the confounding between allelic effects and allelic interactions at least when the inheritance of alleles are independent. As the result, the GMA model could be more beneficial than GLM for detecting allelic interactions. Finally, a simulation example is presented to show the difference between GLM and GMA models on partitioning variance components. The performance in model selection between using GLM and GMA models is also examined.

2. Models and Results

Consider a random sample of size N from a study population. Let yi, i = 1, … , N, be their observed phenotypic values for a quantitative trait Y, and gi, i = 1, … , N, be their observed unphased genotypes at certain genetic marker loci. The quantitative trait Y is assumed to be affected by both genetic and environmental effects. Let G denote the true (unobservable) genotypic value, which could be affected by many genetic factors. If we ignore the genetic by environmental interactions, the relationship between the quantitative trait and marker genotypes can be modeled via a GLM

yi=βzi+E(G|gi)+ϵi, i=1,,N,    (1)

where zi is a vector of the adjusted environmental covariates with fixed effects β, E(G|gi) represents the expected genotypic value of the i-th individual given his/her observed marker genotypes gi, and ϵi is a model residual error contributed by other environmental and genetic factors that cannot be captured by the covariates zi and marker genotypes gi. We also assume that ϵi, i = 1, … , N, are independent and identically distributed (i.i.d) with E(ϵi) = 0 and V(ϵi) = Vϵ. Besides, {ϵi, i = 1, … , N} are independent of {gi, i = 1, … , N}. Then, the total phenotypic variance VY = V(E(G|g)) + E(V(G|g)) = V(E(G|g)) + Vϵ. In the rest of the paper, we focus on comparing the GLM and GMA modeling on the expected genotypic values E(G|g) and assessing the genetic variance components contributed by the allelic effects and allelic interactions at the marker loci to the expected genotypic variance V(E(G|g)).

2.1. One-Locus Models

Consider a single marker locus with multiple alleles A1, …, Am (m ≥ 2). Let pj, j = 1, … , m, be the allele frequencies (j=1mpj=1), and pjk = P(AjAk), j, k = 1, … , m (jk), be the genotype frequencies (jkpjk=1), in a study population. For a random sample of size N from the study population, suppose that there are njk individuals carrying genotypes AjAk for jk, j, k = 1, … , m (N=j=1mk=jmnjk). For notation simplicity, we also let pkj = pjk and nkj = njk for k > j, j, k = 1, … , m. Then, pj=pjj+kjpjk/2, for j = 1, … , m. Let nj·=2njj+kjnjk, which is the total number of alleles Aj carried by the sampled individuals (j = 1, … , m). Assume that the observed marker genotypes gi, i = 1, … , N, are i.i.d. and follow a multinomial distribution. Then the MLE of the genotype frequencies are given by p^jk=njk/N, and the MLE of the allele frequencies are given by p^j=nj·/(2N)=p^jj+kjp^jk/2, for j = 1, … , m. Also, let yjki,i = 1, … , njk be the observed phenotypic values of the individuals carrying genotypes AjAk. We define the observed genotypic group means ȳjk·=ȳkj·=i=1njkyjki/njk for j, k = 1, … , m, the allele averaged means ȳj··=ȳ·j·=(2njjȳjj·+kjnjkȳjk·)/nj· for j = 1, … , m, and the grand mean ȳ···=j=1mk=jmnjkȳjk·/N. In addition, we define the allele weighted means ȳj·*=ȳ·j*=k=1mp^kȳjk· for j = 1, … , m, and the weighted overall mean ȳ··*=j=1mk=1mp^jp^kȳjk·=j=1mp^j2ȳjj·+j=1m1k=j+1m2p^jp^kȳjk·. Note that in a classical two-way factorial ANOVA model, each individual can receive one and only one level assignment from each of the two factors. At a marker locus, however, an individuals can carry a homozygous genotype “AjAj” (j = 1, … , m) with two Aj alleles indistinguishable. In general, the allele weighted means ȳj·* may not be the same as the allele averaged means ȳj··, and the weighted overall mean ȳ··* may also differ from the grand mean ȳ···.

To assess the allelic effects on the expected genotypic values (or phenotypic values), let us first take a brief review of Fisher's one-locus ANOVA model (see Kempthorne, 1957; Weir and Cockerham, 1977). Since the marker genotypes are unphased, we usually assume that the paternal and maternal gametes share the same set of alleles, have the same allele frequencies, and contribute the same allelic effects at the marker locus. The fact that each allele contributes the same genetic effect regardless of its parental origins implies that the expected genotypic values Gjk = E(G|g = AjAk) should satisfy the symmetric property: Gjk = Gkj, for jk. So there are totally m(m + 1)/2 possible distinctive expected genotypic values Gjk, j, k = 1, …, m. By treating the paternal and maternal gametes as two independent risk factors, the traditional Fisher's one-locus ANOVA model for the expected genotypic values Gjk can be written as

Gjk=μ+αj+αk+δjk, j,k=1,,m,    (2)

where αj is referred as the average additive effect of the paternal or maternal allele Aj (j = 1, …, m), and δjk the average allelic interaction between two parental alleles Aj and Ak (j, k = 1, …, m). It is known that not all the parameters in the above model are estimable due to an over-parameterization of the model on the expected genotypic values. Typically, the following constraints can be added on the model parameters (see Kempthorne, 1957).

j=1mpjαj=0,  j=1mpjδjk=0  for k=1,,m.    (3)

From the symmetric property of Gjk, we also assume that δjk = δkj, for jk. With these constraints, if we further incorporate model (2) into GLM (1) and ignore the adjusted covariates, the LSE of parameters are given by μ^=ȳ··*, α^j=ȳj·*ȳ··* for j = 1, … , m, and δ^jk=ȳjk·ȳj·*ȳk·*+ȳ··* for j, k = 1, … , m. It has been known that, under Hardy-Weinberg equilibrium (HWE), the above model (2) can also provide an orthogonal partition of the expected genotypic variance as V(E(G|g)) = VA + VD, where VA=2jpjαj2 and VD=j,kpjpkδjk2 are the so-called additive and dominance variance components. Weir and Cockerham (1977) also explored model (2) on partitioning the expected genotypic variance in HWD. In practice, as pointed out in Wang (2014), the symmetric property of δjk's and the irregular constraints (3) could make it difficult to fit model (2) using standard statistical software especially when the adjusted covariates are involved. Besides, the random variables that are responsible for the additive and dominance variance components are not explicitly defined in model (2).

The expected genotypic values can also be modeled via a classical dummy-variable based GLM. As shown in Wang (2011), we can introduce the following indicator variables to describe the inheritance of the two parental alleles for each individual

z1j={1, the inherited paternal allele is Aj0, the inherited paternal allele is not Aj,z2j={1, the inherited maternal allele is Aj0, the inherited maternal allele is not Aj

for j = 1, …, m. Though z1j and z2j cannot be defined on unphased heterozygous genotypes, we can define the following genotype coding variables for unphased genotypes

wj(g)=z1j+z2j={2,  if g=AjAj1,   if g=AjAjc0,  if g=AjcAjc

for j = 1, …, m, and

vjj(g)=z1jz2j={1,   if g=AjAj0,  otherwise,vjk(g)=z1jz2k+z1kz2j={1,   if g=AjAk0,  otherwise

for jk, j, k = 1, …, m. Here, Ajc denotes any other allele rather than Aj. By choosing Am as a reference allele, we can construct the following GLM

E(G|gi)=μ0+j=1m1ajwj(gi)+j=1m1k=jm1djkvjk(gi),    (4)

for i = 1, …, N, where aj is usually referred as the fixed additive allelic effect of the paternal or maternal allele Aj, and djk the fixed allelic interaction between two parental alleles Aj and Ak, with respect to the reference allele Am. In terms of the expected genotypic values, we can show that μ0 = Gmm, aj = GjmGmm and djk = (GjkGkm) − (GjmGmm), for j = 1, …, m − 1 and k = j, …, m − 1.

Model (4) provides a full re-parameterization of the m(m + 1)/2 expected genotypic values. Suppose that there are no empty genotype groups for the observed random sample; i.e., njk > 0 for any j, k = 1, … , m. When we incorporate model (4) into GLM (1) and ignore the adjusted covariates, the LSE of the expected genotypic values are given by: Ĝjk = ȳjk, for j, k = 1, … , m − 1 and j < k. Therefore, the LSE of parameters in model (4) can be easily derived in terms of the observed genotypic group means as μ^0=ȳmm· and âj = ȳjm· − ȳmm· for j = 1, … , m − 1, and d^jk=ȳjk·ȳjm·ȳkm·+ȳmm· for j, k = 1, … , m − 1 and jk. Note that these LSE could be sensitive to phenotypic outliers especially for small genotypic groups. For example, a few individuals may have genotype “AjAm” (j ≠ m) for a rare allele “Aj.” If the genotypic group mean ȳjm· is much larger than ȳmm·, the LSE âj will be large. By choosing a common allele “Am” as the reference, we could improve the accuracy of LSE. Through incorporating model (4) into a GLM, we could also perform hypothesis tests on its model parameters via contrasts. In analysis of genetic variance components, however, it has been known that wj's are often correlated with vjk's as random variables over the random individuals even when the inheritance of paternal and maternal alleles are independent (Wang and Zeng, 2009). As the result, model (4) leads to a different partition of the expected genotypic variance V(E(G|g)) from the one defined in Fisher's ANOVA model (2).

To further dissect the confounding between the additive and dominance effects in model (4), we can make mean corrections on the indicator variables {z1i} and {z2j}, and introduce the following mean-corrected index variables (see Wang, 2014)

x1j(g)=z1j(g)E[z1j(g)]={1pj,the paternal allele is Ajpj,the paternal allele is not Ajx2j(g)=z2j(g)E[z2j(g)]={1pj,the maternal allele is Ajpj,the maternal allele is not Aj

Then, we define the following modified genotype coding variables

wj*(g)=x1j+x2j={2(1pj),if g=AjAj12pj,if g=AjAjc2pj,if g=AjcAjc,vjj*(g)=x1jx2j={(1pj)2,if g=AjAjpj(1pj),if g=AjAjcpj2,if g=AjcAjc

for j = 1, … , m, and

vjk*(g)=x1jx2k+x1kx2j={(1pj)(1pk)+pjpk, if g=AjAk2pk(1pj), if g=AjAj2pj(1pk), if g=AkAkpk(12pj), if g=AjAjkcpj(12pk), if g=AkAjkc2pipj, if g=AjkcAjkc

for jk, j, k = 1, …, m. Here, Ajkc denotes an allele which is different from both Aj and Ak. Note that the modified genotype coding variables wj*(g)=wj(g)2pj, vjj*(g)=vjj(g)pjwj(g)+pj2, and vjk*(g)=vjk(g)pjwk(g)pkwj(g)+2pjpk are well defined on unphased genotypes, although the mean-corrected index variables x1j, x2k cannot be defined on unphased heterozygous genotypes. By choosing Am as a reference allele, we can construct the following GMA model (Wang, 2014)

E(G|gi)=μ*+j=1m1αj*wj*(gi)+j=1m1k=jm1δjk*vjk*(gi),    (5)

for i = 1, …, N. Still, in model (5), we refer the parameters αj* as the average additive allelic effects and δjk* (jk) as the average allelic interactions, with respect to the reference allele Am. Both the GMA model (5) and the GLM (4) can provide a full parameterization of the expected genotypic values Gjk. Comparing to model (4), one major advantage of model (5) is that it can retain the same partition on the genetic variance components as the one from Fisher's ANOVA model (2). In fact, under the constraints (3) plus the symmetric property of δjk, model (2) can be re-written as

E(G|gi)=μ+j=1mαjwj*(gi)+j=1mk=jmδjkvjk*(gi),    (6)

for i = 1, …, N. Model (5) is a simplified version of model (6) by further replacing the redundant variables wm*, vjm* and vmm* by wm*=j=1m1wj*, vjm*=vjj*k=1m1vjk* for j = 1, … , m − 1, and vmm*=j=1m1k=jm1vjk*. We can see that both models (5) and (2) share exactly the same additive and dominance variance components. They become equivalent when we take μ* = μ, αj*=αjαm, and δjk*=δjkδjmδkm+δmm, for j, k = 1, … , m − 1 and jk. Note that model (5) does not contain redundant parameters. Therefore, it does not require constraints on its model parameters. Besides, the random variables that are responsible for the additive and dominance variance components are explicitly defined in model (5). In practice, similar to enforcing the constraints (3) on model (2), we can create the variables wj* and vjk* by replacing the allele frequencies pj's by their MLE p^j's. Then, by incorporating model (5) into GLM (1), we can treat these modified genotype coding variables as regular fixed covariates and fit the model using the ordinary LS approach. The hypothesis of H0:VA = 0 for existence of the additive variance component can also be tested via testing H0:αj*=0,j = 1,… ,m1 for the average additive allelic effects.

The GLM (4) and GMA model (5) can be transformed easily from one to the other. From the relationship between their genotype coding variables, we can show that

{μ*=γ+2j=1m1pjaj+j=1m1k=1m1pjpkdjkαj*=aj+k=1m1pkdjk,  j=1,,m1

and δjk*=djk for jk, j, k = 1, … , m − 1. Here, for notation simplicity, we define dkj = djk, for j < k. To fit model (5), instead of solving its normal equations directly, we can derive the LSE of its model parameters from the LSE of model (4) as the following.

{μ^*=j=1mp^j2y¯jj·+j=1m1k=j+1m2p^jp^ky¯jk·=y¯··*α^j*=y¯j·*y¯m·* , j=1,,m1

and δ^jk*=d^jk=ȳjk·ȳjm·ȳkm·+ȳmm· for jk, j, k = 1, … , m − 1. Note that the LSE of the average additive allelic effects are calculated from allele weighted means, which could be more robust to phenotypic outliers than the LSE of fixed additive allelic effects. It is also interesting to see that both the fixed additive allelic effects aj's and the fixed allelic interactions djk's may affect the average additive allelic effects αj*'s, though the fixed allelic interactions keep the same as the average allelic interactions. In general, the hypothesis test of H0 : aj = 0, j = 1, … , m − 1, for the fixed additive effects in a GLM (4) is not equivalent to the hypothesis test of H0:αj*=0,j=1,,m1, for the average additive effects in an equivalent GMA model (5). Therefore, the standard F-statistics for testing the fixed additive effects in model (4) could be inadequate for testing the existence of the additive variance component VA when significant fixed (or average) allelic interactions are present.

In ANOVA, we treat {x1j, j = 1, … , m − 1} and {x2k, k = 1, … , m − 1} as random variables over the sampled individuals. Model (5) provides an approximation to the expected genotypic values E(G|g) by a linear combination of the random variables x1j, j = 1, … , m − 1 and x2k, k = 1, … , m − 1, plus their cross-product terms for allelic interactions. To model the genotypic distributions, let gi*=(z1j(gi)z2k(gi),j,k=1,,m,jk) be a vector that describes the unphased genotypic categories for i = 1, … , N. As usual, we assume that the marker genotypes g1*,,gN* are i.i.d. from a multinomial distribution of Mult(1, {pjk, j, k = 1, … , m, jk}). Under this assumption, we can show that the vectors g1i = (z11(gi), … , z1m(gi)), i = 1, … , N, of paternal allele indicator variables are i.i.d. from Mult(1, {p1, … , pm}); and the vectors g2i = (z21(gi), … , z2m(gi)), i = 1, … , N, of maternal allele indicator variables are also i.i.d. from Mult(1, {p1, … , pm}). When the inheritance of paternal and maternal alleles are independent, the study population is under HWE and the mean corrected index variables {x1j(gi), j = 1, … , m} are independent of {x2j(gi), j = 1, … , m} for any randomly sampled individual i. In HWD, however, the mean corrected index variables {x1j, j = 1, … , m} could be correlated with {x2j, j = 1, … , m}. Let us define

{Djj=Cov(z1j,z2j)=E(x1jx2j)=E(vjj*)=pjjpj2Djk=E(x1jx2k+x1kx2j)/2=E(vjk*)/2=pjk/2pjpk.

for jk, j, k = 1, … , m. Then, {Djk, j, k = 1, … , m} measure the departures from HWE, which satisfy j=1mDjk=k=1mDjk=0. From the observed genotypes, we have MLE D^jj=njj/Np^j2 for j = 1, … , m, and D^jk=njk/(2N)p^jp^k for j, k = 1, … , m and jk.

It is more convenient to use GMA model (5) rather than GLM (4) for estimation of genetic variance components. In model (5), let A=j=1m1αj*wj*(g) and D=j=1m1k=jm1δjk*vjk*(g) represent the additive and dominance components, respectively. In general, we can partition the expected genotypic variance as V(E(G|g)) = VA + VD + 2Cov(A, D). In Wang (2014), we have derived formulas for estimating the variance components VA, VD and covariance component Cov(A, D) based on the parameters in model (5). By plugging in LSE of the parameters, we obtain estimators of the variance components VA, VD and the covariance component Cov(A, D) as shown in Appendix A in Supplementary Material. It should be pointed out that these estimators of the variance and covariance components are different from the traditional ANOVA estimators when the marker genotypes are in HWD. Unlike the ANOVA estimators of variance components which could be negative when data are unbalanced, our estimators of the variance components are guaranteed to be non-negative. Similar to the ANOVA estimators, when the model residuals are normally distributed, these estimators become MLE of the variance and covariance components and likely possess the asymptotic normality property (see Searle, 1995). Meanwhile, with variations from both the genotypic group means and the MLE estimates of allele frequencies, these estimators are only asymptotically unbiased, while the original ANOVA estimators are unbiased.

Under HWE, we would expect an orthogonal partition of the expected genotypic variance as V(E(G|g)) = VA + VD. By taking D^jk=0 for j, k = 1, … , m in Appendix A in Supplementary Material, we obtain the following estimators

V^A=2j=1mp^j(y¯j·*y¯··*)2,V^D=j=1mk=1mp^jp^k(y¯jk·y¯··*)22j=1mp^j(y¯j·*y¯··*)2

If we further incorporate model (5) into GLM (1) and ignore the adjusted covariates, the estimator of residual variance is given by V^ϵ=j=1mk=jmi=1njk(yjkiȳjk·)2/N. The estimator of total phenotypic variance is given by V^Y=j=1mk=jmi=1njk(yjkiȳ···)2/N. We can show that V^Y=V^A+V^D+V^ϵ. Note that when D^jk=0 for j, k = 1, … , m, the allele weighted means and the allele averaged means become the same; i.e., ȳj·*=ȳj·· for j = 1, … , m. Besides, the weighted overall mean ȳ··* is the same as the grand mean ŷ···. However, due to possible sampling variations from the sampled individuals, under HWE we could still have some D^jk0, j, k = 1, … , m, which may lead to a slight deviation from the orthogonal partition of the expected genotypic variance.

When D^jk=0 for j, k = 1, … , m, another nice feature of the GMA model (5) is that the LSE {α^j*,j=1,,m1} of its main effects will keep the same in a reduced GMA model when we ignore the allelic interactions. Let us represent the full GMA model (5) in a matrix form with the design matrix X=(1N,Xα*,Xδ*), where 1N is a N by 1 vector with all its elements being 1, Xα*=(w1*(gi),,wm1*(gi);i=1,,N)N×(m1) and Xδ*=(v11*(gi),,v1,m1*(gi),,vm1,m1*(gi);i=1,,N)N×m(m1)/2, which correspond to the main effects α*=(α1*,,αm1*) and allelic interactions δ*=(δ11*,,δ1,m1*,,δm1,m1*), respectively. We can show that XX=diag(N,Xα*Xα*,Xδ*Xδ*), which is a block diagonal matrix when D^jk=0, for j, k = 1, … , m (see proof in Appendix B in Supplementary Material). Therefore, the LSE μ^*=ȳ··· and α^*=(Xα*Xα*)1Xα*Y, which do not depend on Xδ*. In other words, ignoring the average allelic interactions in model (5) does not affect the LSE of the intercept and the average additive effects in this case. The GLM (4) does not have such a property. In a reduced GLM (4) without the fixed allelic interactions, the LSE of its additive effects are âj=α^j*=ȳj·*ȳm·* for j = 1, … , m − 1, while in a full GLM (4) the LSE become âj = ȳjm· − ȳmm·, for j = 1, … , m − 1.

We can also estimate the genetic variance components with an adjustment for other covariates. First, by incorporating the GMA model (5) into GLM (1), we can fit GLM (1) using the ordinary LS approach. Next, based on the fitted model, we can calculate the fitted additive and dominance components Âi=j=1m1α^j*wi(gi) and D^i=j=1m1k=jm1δ^jk*vjk(gi), for each individual i = 1, … , N. Then, we can estimate VA and VD as the sample variances of {Âi, i = 1, … , N} and {D^i,i=1,,N}, respectively. Meanwhile, the covariance component Cov(A, D) can be estimated as the sample covariance between {Âi, i = 1, … , N} and {D^i,i=1,,N}.

2.2. Two-Locus Models

We consider an extension of the previous one-locus models to two loci. Assume that marker 1 has alleles A11, …, A1m1 (m1 ≥ 2) with p11, … , p1m1 being the allele frequencies, and marker 2 has alleles A21, …, A2m2 (m2 ≥ 2) with p21, … , p2m2 being the allele frequencies. Let pjkrs = P(A1jA1k, A2rA2s) denote the joint genotype frequencies at the two marker loci in a study population. For a random sample of size N from the study population, let njkrs be the number of individuals who carry unphased genotypes A1jA1k (jk) at locus 1, and A2rA2s (rs) at locus 2, for j, k = 1, … , m1 and r, s = 1, … , m2 (N=j=1m1k=jm1r=1m2s=rm2njkrs). Then, the MLE of genotype frequencies p^jkrs=njkrs/N. Let yjkrs, i,i = 1, … , njkrs, be the observed phenotypic values of individuals carrying the joint genotypes (A1jA1k, A2rA2s). We define the observed genotypic group means ȳjkrs·=i=1njkrsyjkrs,i/njkrs, for j, k = 1, … , m1 and r, s = 1, …, m2. Without distinguishing the origin of parental alleles, we assume that nkjrs = njkrs for k > j, j, k = 1, … , m1, and njksr = njkrs for r > s, r, s = 1, … , m2. Let njk··=r=1m2s=rm2njkrs for j, k = 1, … , m1, and n··rs=j=1m1k=jm1njkrs for r, s = 1, … , m2. We have the MLE of allele frequencies p^1j=(2njj··+kjnjk··)/(2N) for j = 1, … , m1 at locus 1, and p^2r=(2n··rr+srn··rs)/(2N) for r = 1, … , m2 at locus 2. For notation simplicity, we define ȳkjrs· = ȳjkrs· for k > j (j, k = 1, … , m1) and ȳjksr· = ȳjkrs· for r > s (r, s = 1, … , m2). In addition, we define the weighted genotypic group means ȳj·rs*=ȳ·jrs*=k=1m1p^1kȳjkrs, ȳjkr·*=ȳjk·r*=s=1m2p^2sȳjkrs, ȳjk··*=ȳkj··*=r,s=1m2p^2rp^2sȳjkrs, ȳ··rs*=ȳ··sr*=j,k=1m1p^1jp^1kȳjkrs, ȳj···*=ȳ·j··*=k=1m1r,s=1m2p^1kp^2rp^2sȳjkrs, and ȳ··r·*=ȳ···r*=j,k=1m1s=1m2p^1jp^1kp^2sȳjkrs, for j, k = 1, … , m1 and r, s = 1, … , m2. The weighted overall mean ȳ····*=j,k=1m1r,s=1m2p^1jp^1kp^2rp^2sȳjkrs.

Let Gjkrs = E(G(g)|g = A1jA1k, A2rA2s) be the expected genotypic value of individuals with the joint genotypes (A1jA1k, A2rA2s), for j, k = 1, …, m1 and r, s = 1, …, m2. Without distinguishing the origin of parental alleles, we assume that {Gjkrs} satisfy the symmetric properties: Gjkrs = Gkjrs = Gjksr = Gkjsr, for j < k and r < s. In general, there are totally m1m2(m1 + 1)(m2 + 1)/4 possible distinctive expected genotypic values. By treating the paternal and maternal gametes as two independent risk factors, the Fisher's two-locus ANOVA model for the expected genotypic values Gjkrs can be written as (see Kempthorne, 1957)

Gjkrs=μ+α1j+α1k+δ1jk+α2r+α2s+δ2rs+(αα)jr               +(αα)js+(αα)kr+(αα)ks+(αδ)j,rs+(αδ)k,rs               +(δα)jk,r+(δα)jk,s+(δδ)jk,rs    (7)

for j, k = 1, …, m1 and r, s = 1, …, m2. The parameters α1j and δ1jk are referred as the average additive and dominance effects at locus 1, α2r and δ2rs the average additive and dominance effects at locus 2, (αα)jr the additive by additive interactions, (αδ)j, rs the additive by dominance interactions, (δα)jk, r the dominance by additive interactions, and (δδ)jk, rs the dominance by dominance interactions. Still, due to an over-parameterization of the model for the expected genotypic values, not all the model parameters are estimable. From the symmetric property of the expected genotypic values, we usually assume that δ1jk = δ1kj, δ2rs = δ2sr, (αδ)j, rs = (αδ)j, sr, (δα)jk, r = (δα)kj, r, (δδ)jk, rs = (δδ)kj, rs = (δδ)jk, sr. In addition, the following constraints need to be added on the model parameters (Gallais, 1974; Weir and Cockerham, 1977).

j=1m1p1jα1j=0, r=1m2p2rα2r=0, j=1m1p1jδ1jk=0, r=1m2p2rδ2rs=0,j=1m1p1j(αα)jr=0, r=1m2p2r(αα)jr=0, j=1m1p1j(αδ)j,rs=0, r=1m2p2r(αδ)j,rs=0,j=1m1p1j(δα)jk,r=0, r=1m2p2r(δα)jk,r=0, j=1m1p1j(δδ)jk,rs=0,r=1m2p2r(δδ)jk,rs=0    (8)

In other words, a weighted sum of the average genetic effects of a genetic component is zero over any index. It has been known that model (7) under constraints (8) can provide an orthogonal partition on the genetic variance components when the inheritance of four paternal and maternal alleles at the two loci are independent (Kempthorne, 1957; Weir and Cockerham, 1977). Still, as pointed out in Wang (2014), it is difficult to estimate the parameters in model (7) under the complicated constraints (8) when other adjusted covariates are involved. Besides, the random variables that constitute the random sources of the genetic variance components are not explicitly defined in model (7).

To model the expected genotypic values through a GLM, we can re-list all the sampled individuals by an index variable k = 1, … , N. Similar to the one-locus case, we introduce the indicator variables z1j(1) and z2k(1) for inheritance of the paternal and maternal alleles at locus 1, and z1r(2) and z2s(2) for inheritance of the paternal and maternal alleles at locus 2. Then, for phase-unknown genotypes, we define the genotype coding variables w1j, v1jk for j, k = 1, … , m1 (jk) at locus 1, and w2r, v2rs for r, s = 1, … , m2 (rs) at locus 2, in the same way as we did in the one-locus case. By including the within locus additive and dominance effects as well as the locus-by-locus interactions (i.e., epistases), a fully parameterized two-locus GLM model can be written as (Wang, 2011),

E(G|gi)=μ0+j=1m11a1jw1j+j=1m11k=jm11d1jkv1jk+r=1m21a2rw2r                       +r=1m21s=rm21d2rsv2rs+j=1m11r=1m21(aa)jrw1jw2r                       +j=1m11r=1m21s=rm21(ad)j,rsw1jv2rs                       +j=1m11k=jm11r=1m21(da)jk,rv1jkw2r                       +j=1m11k=jm11r=1m21s=rm21(dd)jk,rsv1jkv2rs    (9)

for i = 1, …, N. A nice feature of the above GLM is that we can easily establish the relationship between its model parameters and the expected genotypic values by starting from the lowest order parameter μ0 = Gm1m1m2m2. Suppose that there are no empty genotypes; i.e., njkrs > 0 for any j, k = 1, … , m1 and r, s = 1, … , m2. When we incorporate the above model (9) into a GLM (1) and ignore the adjusted covariates, the LSE of parameters in model (9) can be derived as shown in Appendix C in Supplementary Material. Similar to the one-locus GLM, the LSE of these fixed allelic effects are highly dependent upon the genotypic group means, which could be sensitive to phenotypic outliers in small genotypic groups. More importantly, model (9) cannot provide the same partition of the expected genotypic variance V(E(G|g)) as the one defined in the original Fisher's ANOVA model (7).

To construct a two-locus GMA model for analysis of genetic variance components, we can introduce the mean-corrected index variables x1j(1),x2k(1) for j, k = 1, … , m1 at locus 1, and x1r(2),x2s(2) for r, s = 1, … , m2 at locus 2. Then we define the modified genotype coding variables w1j*, v1jk* for j, k = 1, … , m1 at locus 1, and w2r*, v2rs* for r, s = 1, … , m2 at locus 2 in the same way as we did in the one-locus case. By including the within locus additive and dominance effects as well as the locus-by-locus interactions (i.e., epistases), we can build a fully parameterized two-locus GMA model as the following.

E(G|gi)=μ*+j=1m11α1j*w1j*+j=1m11k=jm11δ1jk*v1jk*+r=1m21α2r*w2r*                        +r=1m21s=rm21δ2rs*v2rs*+j=1m11r=1m21(αα)jr*w1j*w2r*                        +j=1m11r=1m21s=rm21(αδ)j,rs*w1j*v2rs*                        +j=1m11k=jm11r=1m21(δα)jk,r*v1jk*w2r*                        +j=1m11k=jm11r=1m21s=rm21(δδ)jk,rs*v1jk*v2rs*    (10)

for i = 1, …, N. Similar to the one-locus case, the original Fisher's ANOVA model (7) under constraints (8) and the symmetric properties of the dominance effects can be re-written as

E(G|gi)=μ+j=1m1α1jw1j*+j=1m1k=jm1δ1jkv1jk*+r=1m2α2rw2r*                     +r=1m2s=rm2δ2rsv2rs*+j=1m1r=1m2(αα)jrw1j*w2r*                     +j=1m1r=1m2s=rm2(αδ)j,rsw1j*v2rs*                      +j=1m1k=jm1r=1m2(δα)jk,rv1jk*w2r*                      +j=1m1k=jm1r=1m2s=rm2(δδ)jk,rsv1jk*v2rs*.    (11)

Model (10) is actually a simplified version of model (11) by further removing the redundant parameters. The two models (10) and (11) are equivalent when we take

           μ*=μ,  α1j*=α1jα1m1,  α2k*=α2kα2m2        δ1jk*=δ1jkδ1jm1δ1km1+δm1m1        δ2rs*=δ2rsδ2rm2δ2sm2+δm2m2     (αα)jr*=(αα)jr(αα)jm2(αα)m1r+(αα)m1m2  (αδ)j,rs*=[(αδ)j,rs(αδ)j,rm2(αδ)j,sm2+(αδ)j,m2m2]                [(αδ)m1,rs(αδ)m1,rm2(αδ)m1,sm2+(αδ)m1,m2m2]  (δα)jk,r*=[(δα)jk,r(δα)jm1,r(δα)km1,r+(δα)m1m1,r]                 [(δα)jk,m2(δα)jm1,m2(δα)km1,m2+(δα)m1m1,m2] (δδ)jk,rs*=[(δδ)jk,rs(δδ)jk,rm2(δδ)jk,sm2+(δδ)jk,m2m2]                 [(δδ)jm1,rs(δδ)jm1,rm2(δδ)jm1,sm2+(δδ)jm1,m2m2]                 [(δδ)km1,rs(δδ)km1,rm2(δδ)km1,sm2+(δδ)km1,m2m2]                 +[(δδ)m1m1,rs(δδ)m1m1,rm2(δδ)m1m1,sm2                 +(δδ)m1m1,m2m2]

for j, k = 1, … , m1 − 1 (jk) and r, s = 1, … , m2 − 1 (rs). Therefore, both model (10) and (11) share exactly the same genetic components as the ones defined in the original Fisher's two-locus ANOVA model (7). In model (10), let A1=j=1m11α1j*w1j*, D1=j=1m11k=jm11δ1jk*v1jk*, A2=r=1m21α2r*w2r*, D2=r=1m21s=rm21δ2rs*v2rs*, A1A2=j=1m11r=1m21(αα)jr*w1j*w2r*, A1D2=j=1m11r=1m21s=rm21(αδ)j,rs*w1j*v2rs*, D1A2=j=1m11k=jm11r=1m21(δα)jk,r*v1jk*w2r* and D1D2=j=1m11k=jm11r=1m21s=rm21(δδ)jk,rs*v1jk*v2rs*, which represent the additive and dominance at locus 1, additive and dominance at locus 2, additive by additive, additive by dominance, dominance by additive and dominance by dominance genetic components, respectively. Each genetic component consists of a set of average allelic effects (additive or interaction) contributed by alleles from the same locus or the same set of loci, and at each locus at most two alleles could be involved. We refer the number of alleles involved in each average allelic effect of a genetic component as the order of the genetic component, and the number of non-redundant average allelic effects involved in a genetic component as the degrees of freedom (df) of the genetic component. Note that genetic components of the same order may have different df. For example, the additive by additive component A1A2 has its order being 2 and df = (m1 − 1)(m2 − 1), while the dominance component D1 at locus 1 has its order being 2 and df=(m11)2.

Both the GLM (9) and GMA model (10) provide a re-parameterization of the expected genotypic values. From the relationship between their genotype coding variables, these two models can be transformed equivalently from one to the other. We can represent the parameters in GMA model (10) in terms of the parameters in its equivalent GLM (9) as shown in Appendix D in Supplementary Material. It is interesting to see that an average allelic effect can be affected by not only its corresponding fixed allelic effect but also other higher-order fixed allelic effects. As a result, the hypothesis test of an average allelic effect in the GMA model (10) is not equivalent to the hypothesis test of the corresponding fixed allelic effect in an equivalent GLM (9) when higher order fixed allelic effects are present. From Appendices C, D in Supplementary Material, we can also derive the LSE of parameters for the GMA model (10) as the following

           μ^*=y¯····*,  α^1j*=y¯j···*y¯m1···*,  α^2r*=y¯··r·*y¯··m2·*        δ^1jk*=y¯jk··*(y¯jm1··*+y¯m1k··*)+y¯m1m1··*         δ^2rs*=y¯··rs*(y¯··rm2*+y¯··m2s*)+y¯··m2m2*       (αα)^jr*=y¯j·r·*(y¯j·m2·*+y¯m1·r·*)+y¯m1·m2·*    (αδ)^j,rs*=y¯j·rs*(y¯m1·rs*+y¯j·rm2*+y¯j·m2s*)                      +(y¯j·m2m2*+y¯m1·rm2*+y¯m1·m2s*)y¯m1·m2m2*    (δα)^jk,r*=y¯jkr·*(y¯jkm2·*+y¯jm1r·*+y¯km1r·*)                      +(y¯jm1m2·*+y¯km1m2·*+y¯m1m1r·)y¯m1m1m2·*

and (δδ)^jk,rs*=(dd)^jk,rs, for j, k = 1, …, m1 − 1, jk; and r, s = 1, …, m2 − 1, rs. Similar to the one-locus GMA model, the LSE of these average allelic effects depend on the weighted genotypic group means. Except the LSE of the highest order effects for dominance by dominance interactions, the LSE of other lower order average allelic effects could be more robust to phenotypic outliers in small genotypic groups than the LSE of their corresponding fixed allelic effects.

When the two markers are unlinked and in HWE, the inheritance of four paternal and maternal alleles at the two loci are independent. Therefore, the four sets of indicator variables {z1j(1),j=1,,m1}, {z2k(1),k=1,,m1}, {z1r(2),r=1,,m2}, and {z2s(2),s=1,,m2} are independent with each other, although the variables within each set could still be correlated. In this case, all the genetic components are independent of each other, and the GMA model (10) can provide an orthogonal partition of the expected genotypic variance. In Wang (2014), we have derived formulas for computing the genetic variance components based on the model parameters in a general multi-locus GMA model. For the two-locus GMA model (10), by plugging in the LSE of its model parameters, we obtain estimators of the genetic variance components as shown in Appendix E in Supplementary Material. It should be pointed out that Weir and Cockerham (1977) also derived estimates of the genetic variance components based on the model parameters in Fisher's two-locus ANOVA model (7). But they did not construct the estimators of genetic variance components in terms of the weighted genotypic group means. Note that the orthogonal partition on the genetic variance components implies that the genetic components constitute independent random sources contributing to the expected genotypic variance. Therefore, we could estimate and test for each genetic component separately. Bonferroni criterion can also be applied to correct for the association testing of multiple genetic components.

When the inheritance of paternal and maternal alleles at the two loci are dependent, non-zero covariances among different genetic components may present. The dependency among inheritance of the paternal and maternal alleles at the two markers can be measured by Djkrs = pjkrs/(2 − 1{j = k})(2 − 1{r = s})−pjpkprps, for j, k = 1, … , m1 and r, s = 1, … , m2. Let D^jkrs=p^jkrs/(21{j=k})(21{r=s})p^jp^kp^rp^s. Similar to the one-locus case, when D^jkrs=0 for j, k = 1, … , m1 and r, s = 1, … , m2, we can show that the LSE of the average additive, dominance, additive by additive, additive by dominance, dominance by additive and dominance by dominance effects in the full model (10) can keep consistent when some components are excluded from the model.

In general, we can always incorporate the two-locus GMA model (10) into a regression model (1). By treating the modified genotype coding variables w1j*, w2r*, v1jk*, and v2rs* as regular fixed covariates, we can fit the regression model using the ordinary LS approach. Based on the fitted model, we can calculate the fitted genetic components for each individual i = 1, … , N. Then the genetic variance components and their possible covariances can be estimated as the sample variances and covariances from the fitted values of these genetic components. Similar to the one-locus case, these estimators of the variance and covariance components are different from the traditional ANOVA estimators when the two marker are linked or their genotypes are in HWD. As the estimators of variance components coming from the sample variances, they are guaranteed to be non-negative. When the model residuals are normally distributed, these estimators of the variance and covariance components become MLE and likely possess the asymptotic normality property. Besides, these variance and covariance estimators are likely asymptotically unbiased.

2.3. A Simulation Example

Consider two biallelic marker loci with allele frequencies p1 = 0.4 and q1 = 0.6 for alleles “1” and “0” at locus 1, and p2 = 0.2 and q2 = 0.8 for alleles “1” and “0” at locus 2. Assume that the two marker loci are in linkage and gametic equilibria. The expected genotypic values at the two marker loci are simulated based on a two-locus GLM model (9) with μ0 = 10 and a11 = a21 = d111 = d211 = (aa)11 = 1 and (ad)1, 11 = (da)11, 1 = (dd)11, 11 = 0. From Appendix D in Supplementary Material, we can show that this GLM is equivalent to a GMA model (10) with μ* = 11.72 and α11*=1.8, α21*=2, δ111*=1, δ211*=1, (αα)11*=1, (αδ)1,11*=0, (δα)11,1*=0, and (δδ)11,11*=0. Based on the true expected genotypic values and the genotypic distribution, we also know that the total expected genotypic variance, which is VG = 3.09, has an orthogonal partition which consists of eight variance components contributed by w1*,v1*,w2*,v2*,ww*,wv*,vw*,vv*. Besides, these eight components w1*,v1*,w2*,v2*,ww*,wv*,vw*,vv* contribute 50.74, 1.87, 41.53, 0.83, 5.03, 0.00, 0.00, 0.00% of the total expected genotypic variance VG. For each random sample of size n, we first generate genotypes of individuals independently based on the genotypic distribution. Then, based on the genotypes, we create dummy variables w1 = w11, v1 = v111, w2 = w21, v2 = v211, ww = w11*w21, wv = w11*v211, vw = v111*w21, vv = v111*v211 and mean-corrected index variables w1*=w11,v1*=v111,w2*=w21,v2*=v211,ww*=w11*w21,wv*=w11*v211,vw*=v111*w21,vv*=v111*v211. The phenotypic values is a sum of the genotypic values and the residuals with the residuals being independent of the genotypes. To generate phenotypic values, we assume that the residuals ϵi, i = 1, … , n, are i.i.d. and normally distributed with the residual variance being Vϵ = 17.51, which account for 85% of the total phenotypic variance (i.e., the broad sense heritability is about 15%).

First, we show that the GLM and GMA models provide different partitions of the expected genotypic variance. To minimize the sampling variation, we simulate a random sample with n = 100,000. We fit both GLM and GMA models to the random sample using SAS “proc glm” (SAS Institute, INC.). For the fitted GLM, we have LSE μ^0=9.99, â11 = 1.01, â21 = 1.05, d^111=0.97, d^211=0.95, (aa)^11=0.97, (ad)^1,11=0.02, (da)^11,1=0.03 and (dd)^11,11=0.07, which are close to the true values. From the fitted GLM, we calculate estimates of the variance components contributed by w1, v1, w2, v2, ww, wv, vw, vv as 0.4894, 0.1274, 0.3533, 0.0345, 0.4153, 0.00, 0.00, 0.00, respectively. The summation of these variance components is 1.4199, which is about 46% of the total expected genotypic variance estimate VG = 3.0896. In other words, in the GLM based partition of the expected genotypic variance, about 54% is contributed by the covariance components due to the collinearity among the variables w1, v1, w2, v2, ww, wv, vw, vv. For the fitted GMA, we have LSE μ^*=11.72, α^11*=1.78, α^21*=2.02, δ^111*=0.96, δ^211*=0.98, (αα)^11*=0.97, (αδ)^1,11*=0.05, (δα)^11,1*=0.02 and (δδ)^11,11*=0.07, which are also close to the true values. From the fitted GMA model, we compute estimates of the variance components contributed by w1*,v1*,w2*,v2*,ww*,wv*, vw*, vv* as 1.5321, 0.0534, 1.3079, 0.0244, 0.1462, 0.00, 0.00, 0.00, respectively. The summation of these variance components is 3.064, which is about 99% of the total expected genotypic variance estimate VG = 3.0896. This indicates that the GMA model leads to a partition of the expected genotypic variance, which is almost orthogonal with only about 1% coming from the covariance components. Note also that these estimates of the variance components contributed by w1*,v1*,w2*,v2*,ww*,wv*, vw*, vv* account for approximately 49.59, 1.73, 42.33, 0.79, 4.73, 0.00, 0.00, 0.00% of the expected genotypic variance estimate VG = 3.0896, which are close to the true proportions 50.74, 1.87, 41.53, 0.83, 5.03, 0.00, 0.00, 0.00% of the variables w1*,v1*,w2*,v2*,ww*,wv*, vw*, vv* contributed to the expected genotypic variance.

We also look at the difference between the partition of the expected genotypic variance and the Type III sums of squares for this random sample. Both the GLM and GMA models have the same regression sum of squares SSR = 307788.89 and mean square error MSE = 17.52. In the fitted GLM, the Type III sums of squares for the eight variables w1, v1, w2, v2, ww, wv, vw, vv are 13415.25, 3475.46, 8472.60, 840.54, 4149.25, 0.24, 1.04, and 0.67, respectively. The summation of these Type III sums of squares is 30355.05, which is less than 10% of the total SSR. In the fitted GMA, the Type III sums of squares for the eight variables w1*,v1*,w2*,v2*,ww*,wv*,vw*,vv* are 153184.54, 5340.61, 130751.17, 2438.75, 14620.38, 2.92, 0.44, and 0.67, respectively. The summation of these Type III sums of squares is 306339.50, which is about 99.5% of the total SSR. This indicates that these Type III sums of squares also provide an orthogonal partition of the SSR. Or, in other words, the hypothesis tests of the eight genetic components are approximately orthogonal.

Next, we examine the performance in model selection between using the GLM and GMA models. We consider varied sample sizes of n = 500, 1000, 2000, and 5000. Under each simulation scenario, we run 1000 simulation. For each simulation sample, we first apply one commonly used method—the forward stepwise selection for model selection on {w1, v1, w2, v2, ww, wv, vw, vv} and {w1*,v1*,w2*,v2*,ww*,wv*,vw*,vv*}, separately. We run “proc glmselect” in SAS with the criterion p=0.05 for both entry and stay in the model. For the GMA model selection on index variables {w1*,v1*,w2*,v2*,ww*,wv*,vw*,vv*}, as these variables are independent in the underlying true model, we can rank them in the order of w1*,w2*,ww*,v1*,v2*,vw*,wv*, and vv* according to their variance contributions from the largest to the smallest. Intuitively, we expect that the selected models would include the higher ranked variables more often than the lower ranked ones. We mainly focus on the top five variables w1*,w2*,ww*,v1*,v2* and classify the selected models into 10 categories: I = all the five variables w1*,w2*,ww*,v1*,v2* are selected; II = w1*,w2*,ww*,v1* are selected but not v2*; III = w1*,w2*,ww*,v2* are selected but not v1*; IV= w1*,w2*,ww* are selected but not v1*,v2*; V= w1*,w2* are selected but not ww*,v1*,v2*; VI = w1*,ww* are selected but not w2*; VII = w2*,ww* are selected but not w1*; VIII = ww* is selected but w1*,w2* are missed; IX=w1* is selected but miss w2*,ww*, or w2* is selected but miss w1*,ww*; X=w1*,w2*,ww* are missed. For the GLM model selection on the dummy variables {w1, v1, w2, v2, ww, wv, vw, vv}, we also define the similar 10 categories I-VIII based on the variables w1, v1, w2, v2, ww, wv, vw, vv. Under each simulation scenario, the counts of selected GLM and GMA model types for stepwise selection are listed in Table 1.

TABLE 1
www.frontiersin.org

Table 1. Counts of selected GLM and GMA model types for “Stepwise Selection”.

For the stepwise selection, the GMA model shows a clear advantage over the GLM on choosing the Type I (the true model), Type I+II, Type I+II+III, Type I+II+III+IV, or Type I-V models. Meanwhile, the GLM tends to miss one of the main fixed effects (Type VI, VII, or VIII) when the sample size ≤ 2000, even though a Type VI, VII, or VIII GLM could imply an equivalent GMA model of Type IV. Overall, as the sample size increases, the model selection improves using either GLM or GMA models. But the selected GLM have a bigger variety, while the selected GMA models are more predictable.

It has been known that the conventional stepwise model selection has problems such as the stepwise F-statistics may no longer follow the F-distributions and the aggravated collinearity among the selected variables (Harrell, 2001). Several regularized regression methods have been developed to deal with the model selection problem especially for high-dimensional data. Here we also apply one of the regularized regression methods - the adaptive LASSO, using “proc glmselect” in SAS. We adopt the Bayesian information criterion (BIC) for model selections. The standardization of the variables is also made prior to the model selection. Note that the standardization of the variables {w1, v1, w2, v2, ww, wv, vw, vv} or {w1*,v1*,w2*,v2*,ww*,wv*,vw*,vv*} can only affect the scales of the regression coefficients but does not change the correlation structure among each set of variables, while making mean-corrections on the indicator variables {z1j(1),z2k(1)} and {z1r(2),z2s(2)} can lead to different correlation structures between the dummy variables {w1, v1, w2, v2, ww, wv, vw, vv} and the index variables {w1*,v1*,w2*,v2*,ww*,wv*,vw*,vv*}. The counts of selected GLM and GMA model types for adaptive LASSO are listed in Table 2.

TABLE 2
www.frontiersin.org

Table 2. Counts of selected GLM and GMA model types for “Adaptive LASSO”.

For the adaptive LASSO, it appears that the GMA model performs slightly better on choosing Type I (the true model), Type I+II, Type I+II+III, Type I+II+III+IV, or Type I-V models when sample size=500. However, this advantage diminishes for larger sample sizes. Still, the selected GLM appear to have a bigger variety, while the selected GMA models are more predictable. On the other hand, it seems that the selected GMA models could miss all the three top variables w1*,w2*,ww* more likely than the GLM models for w1, w2, ww. A brief comparison between Tables 1, 2 also reveals that the stepwise selection has a better performance than the adaptive LASSO under our simulation setting. But this may not be a fair comparison as the selected variables in our simulation setting is very limited. Besides, except BIC, many other criteria are available for model selections in “proc glmselect.” Further exploration on those criteria is needed.

3. Discussion

In this study, we applied the GMA models on analysis of variance components for genetic markers with unphased genotypes. We pointed out that the traditional Fisher's ANOVA model does not explicitly specify the random variables that contribute to the various genetic variance components. The constraints on their model parameters also complicate the model fitting. Meanwhile, the classical dummy-variable based GLM does not provide the same partition on the genetic variance components as the original Fisher's ANOVA model. The hypothesis tests of the fixed allelic effects in a GLM based on the partial (Type III) reduction in sums of squares could also be inadequate for testing the existence of variance components when allelic interactions are present. Alternatively, the GMA model can retain the same partition on the genetic variance components as the traditional Fisher's ANOVA model. Similar to the classical GLM, the GMA model does not require constraints on its model parameters and can be fitted using the ordinary least square approach. As the result, the GMA model allows us to estimate and test for the genetic variance components more conveniently than the classical GLM.

The classical GLM is appealing in genetic association studies due to its simplicity in interpretation of the model parameters, which often represent certain comparisons of the expected genotypic values in different genotype groups. However, the GLM-based approach faces challenges in dealing with allelic interactions because the fixed lower order allelic effects are often confounded with the higher order allelic interactions in a GLM even when the paternal and maternal alleles are independently inherited. In order to test for a particular fixed allelic interaction based on the classical GLM, we need to include all its lower order effects in the model to make this allelic interaction interpretable. Besides, ignoring certain higher order interactions in the model could also affect the definition of this particular allelic interaction due to their potential confounding, as pointed out in Zeng et al. (2005). On the other hand, analysis of genetic variance components using ANOVA type models such as GMA provides an alternative way on assessing the allelic effects and interactions. A nice feature of the GMA model is that the genetic components are independent of each other when multiple genetic markers are unlinked and in equilibrium populations. Therefore, at least in equilibrium populations, each genetic component can be treated as a random source of variation and tested individually regardless of its lower or higher order genetic components. A statistically significant higher-order genetic variance component implies a variation contributed by varied allele types from a set of loci, which could be more perceivable than a significant fixed allelic interaction for claiming allelic interactions.

As shown in Wang (2014), the extension of GMA to multiple loci is straightforward. The GMA model could also be applied to other phenotypic outcomes. For example, under the generalized linear mixed model framework, we can consider the genetic components contributing to a g-transformed (g - a link function) expected outcomes. For survival outcomes, we can examine the genetic components contributing to the hazard functions as well. However, it should be pointed out that the estimation of genetic variance components for genetic markers relies on the genotypic distribution of the markers in a study population. When a sample does not come from the simple random sampling, an adjustment for the sampling strategy is needed in order to make appropriate statistical inference in the general population. Currently, the genome-wide association studies (GWAS) often adopt the case-control design. When the case and control sampling rates are known, we could assess the variance components of genetic markers based on their odds ratio estimates. A thorough exploration on applying the GMA type models to a case-control based GWAS could be a research topic in the future.

Author Contributions

TW planned the study, conducted the derivation, and wrote the manuscript.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer ZY and handling editor declared their shared affiliation, and the handling editor states that the process nevertheless met the standards of a fair and objective review.

Acknowledgments

The author would like to thank his colleague Dr. Kwang Woo Ahn in Division of Biostatistics at Medical College of Wisconsin for his advice on implementation of the adaptive LASSO method.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fgene.2016.00123

References

Fisher, R. A. (1918). The correlation between relatives on the supposition of mendelian inheritance. Trans. R. Soc. Edinburgh 52, 399–433.

Google Scholar

Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh; London: Oliver & Boyd.

Google Scholar

Gallais, A. (1974). Covariances between arbitrary relatives with linkage and epistasis in the case of linkage disequilibrium. Biometrics 30, 429–446.

PubMed Abstract | Google Scholar

Harrell, F. (2001). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. New York, NY: Springer-Verlag.

Google Scholar

Henderson, C. R. (1953). Estimation of variance and covariance components. Biometrics 9, 226–252.

Google Scholar

Kempthorne, O. (1957). An Introduction to Genetic Statistics. New York, NY: Wiley.

Google Scholar

Searle, S. R. (1971). Linear Models. New York, NY: John Wiley & Sons, Inc.

Google Scholar

Searle, S. R. (1995). An overview of variance component estimation. Metrika 42, 215–230.

Google Scholar

Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance Components. New York, NY: John Wiley & Sons, Inc.

Google Scholar

Wang, T. (2011). On coding genotypes for genetic markers with multiple alleles in genetic association study of quantitative traits. BMC Genet. 12:82. doi: 10.1186/1471-2156-12-82

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, T. (2014). A revised fisher model on analysis of quantitative trait loci with multiple alleles. Front. Genet. 5:328. doi: 10.3389/fgene.2014.00328

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, T., and Zeng, Z. B. (2009). Contribution of genetic effects to genetic variance components with epistasis and linkage disequilibrium. BMC Genet. 10:52. doi: 10.1186/1471-2156-10-52

PubMed Abstract | CrossRef Full Text | Google Scholar

Weir, B. S., and Cockerham, C. C. (1977). “Two-locus theory in quantitative genetics,” in Proceedings of the International Conference on Quantitative Genetics, eds E. Pollak, O. Kempthorne, and T. B. Bailey (Ames: Iowa State University Press), 247–269.

Google Scholar

Zeng, Z.-B., Wang, T., and Zou, W. (2005). Modeling quantitative trait loci and interpretation of models. Genetics 169, 1711–1725. doi: 10.1534/genetics.104.035857

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: Fisher's ANOVA model, analysis of variance, general linear model, general multi-allelic model, genetic variance components, orthogonal partition, allelic interactions, least square estimates

Citation: Wang T (2016) Analysis of Variance Components for Genetic Markers with Unphased Genotypes. Front. Genet. 7:123. doi: 10.3389/fgene.2016.00123

Received: 25 March 2016; Accepted: 22 June 2016;
Published: 13 July 2016.

Edited by:

Steven J. Schrodi, Marshfield Clinic Research Foundation, USA

Reviewed by:

Zhan Ye, Marshfield Clinic, USA
Charles M. Rowland, Quest Diagnostics, USA

Copyright © 2016 Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Tao Wang, taowang@mcw.edu

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.