Edited by: Miguel Arenas, Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP), Portugal
Reviewed by: Gregory Bruce Ewing, École Polytechnique Fédérale de Lausanne, Switzerland; Stefano Mona, EPHE (Ecole Pratique des Hautes Etudes), France
*Correspondence: Sadoune Ait Kaci Azzou, Département de Mathématiques, Équipe de Modélisation Stochastique Appliquée (EMOSTA), Université du Québec à Montréal, Case postale 8888, succursale Centre-Ville, Montréal, QC H3C 3P8, Canada
This article was submitted to Evolutionary and Population Genetics, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The effective population size over time (demographic history) can be retraced from a sample of contemporary DNA sequences. In this paper, we propose a novel methodology based on importance sampling (IS) for exploring such demographic histories. Our starting point is the
The demographic history of a population leaves its signature in the genome, which means that DNA sequences contain information about the demographic history of the population from which they are sampled. Therefore, it is possible to use genetic data to infer demographic parameters, an issue with important implications in many fields such as public health, epidemiology and conservation biology (Minin et al.,
The first methods for estimating the demographic history from gene sequences were parametric and used coalescent theory. Such methods require a simple demographic model in order to describe the changes in the population size over time in terms of one or more parameters. They are based on importance sampling, e.g., (Slatkin and Hudson,
Usually, in practice, it is not known in advance which demographic model fits the sampled gene sequences. Further, population histories are often more complex than those described by simple parametric models. This has motivated the development of non-parametric and semi-parametric methods for inferring the demographic history from sequence data or from an estimated genealogy (e.g., Fu,
Our method is nonparametric and is closely related to the family of
Thus, Pybus et al. (
To improve the
Strimmer and Pybus (
In particular, Drummond et al. (
Finally, Heled and Drummond (
In order to estimate the effective population size, we propose a new method in a likelihood-based perspective. Unlike some skyline methods that use a single estimated phylogeny of the sequences, or others that use MCMC approaches, we resort to an efficient importance sampling scheme and our estimate comes to an weighted average over a large number of simulated genealogies, each with a different set of coalescence times. The methodology is described in detail in Section The Skywis Method.
In this section, we present the basic ideas behind the standard coalescent, as well as its extension to the case of fluctuating population size. An introduction to coalescent theory can be found in Nordborg (
In short, the ancestry of a sample of sequences is modeled back in time, starting from the current sample and until the most recent common ancestor (MRCA) of the sample is found. At each step in the genealogical tree, one of the following events can occur: (1) two sequences coalesce if they share a common ancestor; (2) one sequence mutates. In the coalescent framework, time is measured in units of
In the classical coalescent process, and in the presence of
The classical coalescent framework can be extended to include simple deviations from the idealized Wright-Fisher model, like recombination, fluctuating population size, population structure, and selection. In our paper, we focus on a single extension of the coalescent, namely variable population size.
In the case of non-constant population size, the number of descendants of a sequence in one generation does not follow the Poisson distribution with intensity one (Hein et al.,
Let
Let
and let Λ(
where ν(
The waiting time until the next event depends only on the time of the previous event by the Markov property. The survival function of the time
where
We note that when replacing Λ(
It is precisely from this equation that Pybus et al. (
Parameter estimation in population genetic models require optimization of the likelihood of the data given the parameters,
where θ is the collection of parameters (such as population size and migration rates) for the population process. Typically, the objective of the analysis is to estimate these parameters by averaging the likelihood over all possible genealogies. A naive Monte Carlo method for the integral in Equation (6) is given by
where
Importance Sampling (IS) allows us to improve the efficiency of the Monte Carlo integration. The main idea of the IS approach is to reduce the inefficiency of the approximation (Equation 7) by concentrating the simulation on the trees that are more likely with the observed data. Instead of choosing histories from the distribution
The Monte Carlo approximation of Equation (8) gives
where
Importance sampling (IS) was first used in this context by Griffiths and Tavaré (
In this section, we describe our estimation method of the effective population size, when
simulate
compute
compute the weights
estimate
For example, with a variable population size that is expanding from the past to the present, as we progress toward the MRCA one can expect the population size to be smaller, or coalescence times to be shorter, than in the case of a constant population size. This fact, of shorter coalescence times, should be reflected more faithfully by the most probable genealogies. Since such genealogies receive the largest weights, one can see that through the weighting system the estimator is adapting itself to the information contained in the data.
In what follows we describe our method in full detail, namely:
how to simulate genealogies;
how to set the weights;
how to estimate the effective population size.
In order to generate genealogies we use the proposal distribution
Let:
In the Stephens and Donnelly (
where
Stephens and Donnelly (
where
In the proposed reconstruction, when
In our approach, the genealogies are simulated backwards in time by the following algorithm based on Equation (12):
initialize
simulate the time to the next event,
randomly choose a sequence from
for each type β∈
compute the quantities
Then, choose:
a coalescence event with probability
a mutation event (to β) with probability
depending on the type of event chosen in step 5, we continue as follows:
if there is a coalescence event, choose another sequence of type α randomly, and let
if there is a mutation event, mutate the sequence α into a sequence β, without changing
let
After implementing the above algorithm, the coalescence times that are at the core of our method can be deduced. In the genealogy
After generating genealogies using the Stephens and Donnelly (
where
with
and
When building genealogies backwards in time, as we move backwards in time, fewer coalescence events occur. As a result, coalescence times close to the present are very short and become larger gradually going back in time. These short coalescence times create an undesirable variability in the estimation of the effective population size. Therefore, we propose to cumulate small coalescence times in order to improve the estimation of the effective population size. These cumulated time intervals are called epochs. To define epochs that get larger as we go backwards in time, we followed (Durbin and Li,
Finally, we note that the idea of cumulating small coalescence times in order to smooth the graph of the estimator of the effective population size was first proposed by Strimmer and Pybus (
Once the genealogies have been simulated using the method described in Section 3.1.1, we cumulate the coalescence times as follows:
we fix the total number of epochs,
for each simulated genealogy
we use formula (Equation 18) proposed by Durbin and Li (
where
For each genealogy, formula (Equation 18) gives the boundaries of the epochs, measured from the present to the past where
The
because
The expectation of the accumulated waiting time in order to pass from
where
where
where
where
The distribution of importance weights of genealogies described by the Equation (15) is an approximation to the posterior distribution
In our case, we are interested in the estimation of
Therefore, to estimate the integral Equation (26) by a weighted average of estimates from
For illustration, in Figure
[ |
The algorithm described in Section 3.1 can be generalized to the case of serially sampled sequences i.e., sequences sampled at different moments in time. Such samples are also called heterochronous. Figure
In the presence of serially sampled sequences, we have to adapt the method of Stephens and Donnelly (
Let
In other words, the probabilities
In order to present our results, we introduce these additional notations:
Our proposal distribution is an adapted version of the Stephens and Donnelly (
Case 1:
Case 2:
Normally (i.e., in homochronous sampling), the waiting time
where
In the case of heterochronous sequences, the algorithm for simulating the genealogies backward in time is the following:
initialize
simulate the time to the next event,
be the observed value;
compute
if
let
let
let
otherwise, go to step 5;
let
compute the quantities
Then, choose:
a coalescence event with probability
a mutation event (to β) with probability
depending on the result in step 6:
if there is a coalescence event, choose another sequence of type α randomly, and let
if there is a mutation event, mutate the sequence α into a sequence β, without changing
let
After the definition of how to build a genealogy in the case of serially sampled sequences, and the proposal distribution
Case 1:
Case 2:
where:
the probability that there are no events in the interval
As in the case of homochronous sequences, after computing the probabilities
For heterochronous sequences, the method for producing a
for each simulated genealogy
we fix the number of epochs at
in order to define the epochs, the time cutting points in a genealogy
where
For each genealogy and for each time interval (
In practice, we performed minor smoothing at times
To test the ability of our method to capture the demographic signal contained in the DNA sequences, we simulated several demographics scenarios. Further, we compared the results of the
The DNA sequences were simulated using the
The
From the DNA sequences generated by
Based on the estimated tree produced by PHYLIP, we used the APE package (Paradis et al.,
The
Below, we present our results according to the demographic models we considered.
In this case, we consider 50 simulated DNA sequences with parameters:
number of nucleotides: 10,000;
constant effective population size: 2000 generations;
no recombination and no population structure;
mutation rate equals to 2·10−7: therefore (θ = 8);
JC69 (Jukes and Cantor,
The estimate of the effective population size (
In Figure
The
In this section, we present results where 25 DNA sequences of length 10,000 nucleotides and mutation rate μ = 5·10−4 were simulated under the JC69 mutation model. We assume that the effective population size follows the piecewise constant model function:
where
Figure
In Figure
The
In this section, we suppose that the effective population growth is exponential assuming an instantaneous growth rate that is proportional to the current population size according to the equation
Using the
Number of nucleotides: 1000;
no recombination, and no population structure;
mutation rate: 5·10−7 (θ = 1);
JC69 finite sites model;
β = 1 (in generations).
The
The result given in Figure
In Figure
In Figure
In order to test the methodology proposed in Section 3.2, we use the same parameters as in Section 4.3, but by assume that the 50 sequences were collected at different moments in time such as:
The result given in Figure
The
We present a framework of the new method and test by simulation its ability to capture the demographic signal contained in the DNA sequences under several demographic scenarios. Moreover, we consider both homochronous and heterochronous data using a simple substitution model, JC69 (Jukes and Cantor,
For illustration we present the results given by the generalized skyline plot that uses a single genealogy, and those obtained by the
As a future development, we expect the method to be improved by considering an iterative procedure, in which the present approach would be the first estimation step. As a new approach, the
The Ph.D. studies of SA were supported in part by scholarships awarded by the ISM (Institut des Sciences Mathématiques) and stipends out of NSERC (Natural Sciences and Engineering Research Council) research grants.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We acknowledge the stimulating discussions within the ÉMoStA group.
1The reason we changed the way we define the epochs is that the number of sequences rises at the instants of the serial sampling, so the method used in Section Simulation of Genealogies is not appropriate.
2The simulation of the genealogies was performed using MATLAB programming language (MATLAB,