Multi-locus Analysis of Genomic Time Series Data from Experimental Evolution

A growing number of experimental biologists are generating “evolve-and-resequence” (E&R) data in which the genomes of an experimental population are repeatedly sequenced over time. The resulting time series data provide important new insights into the dynamics of evolution. This type of analysis has only recently been made possible by next-generation sequencing, and new statistical procedures are required to analyze this novel data source. We present such a procedure here, and apply it to both simulated and real E&R data.

Published in the journal: PLoS Genet 11(4): e32767. doi:10.1371/journal.pgen.1005069
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1005069

Introduction

A common study design in population genetics consists of collecting genomic variation data from living organisms to make inferences about unobserved evolutionary and biological phenomena. The many areas where this design has been applied include demographic inference (see [1] for a recent review), recombination rate estimation [2–6], and detection of natural selection [7–13]. Recently, there has been much interest in utilizing time series genetic data—e.g., from ancient DNA [14–21], experimental evolution of a population under controlled laboratory environments [22–26], or direct measurements in fast evolving populations [27]—to enhance our ability to probe into evolution. In particular, understanding the genetic basis of adaptation to changes in the environment can be significantly facilitated by such temporal data. Specifically, the dynamics of allele frequencies in an evolving population potentially convey added information about how the genome functions [28], information which is inaccessible to methods which operate only on a static snapshot of that genome.

An experimental methodology which serially interrogates the genomes of a controlled population over time could potentially yield new insights. In fact, this methodology can now be realized thanks to the advent of next-generation sequencing. By sequencing successive generations of model organisms raised in a controlled environment, genetic time series data can be generated which describe evolution at nucleotide resolution [24, 25, 28, 29]. This so-called evolve-and-resequence (henceforth, E&R) methodology is fundamentally different from the observational approach described above, and new inference procedures are needed to analyze this type of data.

In this paper, we present such a procedure and study its ability to perform a number of testing and estimation tasks relevant to population genetics. Our method is based on an approximation to the multi-locus Wright-Fisher process, and is well-suited to the small population, discrete generation, and random mating setting in which many E&R experiments are conducted. Furthermore, because it is based on a canonical population genetic model of genome evolution, our method can directly estimate population genetic quantities such as fitness, dominance, recombination rate, and effective population size. It can also be used to design future experiments with sufficient power to reliably infer these quantities.

We first use simulated data to demonstrate the utility of our method. We then apply it to genome-wide data from a real E&R experiment on D. melanogaster, designed to study adaptation to a novel laboratory environment over tens of generations.

Related work

There is a small but growing literature on the analysis of evolve-and-resequence data. Feder et al. [30] present a statistical test for detecting selection at a single biallelic locus in time series data. (Although it is not a major focus, their method can also be used to estimate the selection parameter.) As in our method, they model the sample paths of the Wright-Fisher process as Gaussian perturbations around a deterministic trajectory in order to obtain a computable test statistic. However, their aim is slightly different from ours: they analyze yeast and bacteria data sets in which the population size is large and must be estimated from data. Here we focus on population sizes which are smaller and more typical of experiments performed on higher organisms, for example mice or Drosophila. We generally assume that the effective population size is known, but we also test our ability to estimate it from data. Also, because of the increased amount of drift present in the small population regime, we necessarily restrict our attention to selection coefficients which are somewhat larger than those considered by Feder et al. Finally, although Feder et al. do study the performance of their method when time series data are corrupted by noise due to finite sampling (as in, e.g., a next-generation sequencing experiment), they do not model this effect. Here we account for the effect of sampling directly, by integrating over the latent space of population-level frequencies when computing the likelihood.
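To make the idea of integrating over latent population frequencies concrete, the following is a minimal single-locus sketch, not the multi-locus approximation developed in this paper. It treats the population allele count as the hidden state of a Markov chain and sums it out with the standard forward recursion. The function names, the genic selection scheme (relative fitness 1 + s for A₁), the uniform prior on the initial count, and the binomial read-sampling model are all assumptions made for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def wf_transition_matrix(two_n, s):
    """Single-locus Wright-Fisher transition matrix on allele counts 0..2N.

    Assumption: genic selection, so frequency x becomes
    x(1+s) / (1 + s*x) before binomial resampling of 2N alleles.
    """
    matrix = []
    for i in range(two_n + 1):
        x = i / two_n
        x_sel = x * (1.0 + s) / (1.0 + s * x)
        matrix.append([binom_pmf(j, two_n, x_sel) for j in range(two_n + 1)])
    return matrix


def likelihood(observations, two_n, s, gens_between):
    """P(data | s), summing over all latent population allele counts.

    observations: list of (a1_reads, depth) pairs, one per sampled time point.
    gens_between: generations elapsed between consecutive samples.
    """
    step = wf_transition_matrix(two_n, s)
    states = range(two_n + 1)
    # Assumption: uniform prior over the initial allele count.
    forward = [1.0 / (two_n + 1)] * (two_n + 1)
    for idx, (k, depth) in enumerate(observations):
        if idx > 0:
            # Propagate the latent count through the unobserved generations.
            for _ in range(gens_between):
                forward = [
                    sum(forward[i] * step[i][j] for i in states) for j in states
                ]
        # Weight each latent count by the probability of the observed reads.
        forward = [forward[j] * binom_pmf(k, depth, j / two_n) for j in states]
    return sum(forward)


# Hypothetical data: A1 read counts out of 50 reads at three time points.
reads = [(10, 50), (25, 50), (40, 50)]
for s in (0.0, 0.15):
    print(s, likelihood(reads, two_n=40, s=s, gens_between=10))
```

The exact recursion costs on the order of (2N)² operations per generation at a single locus, and the state space grows rapidly once several linked loci are tracked jointly, which is one reason an approximation to the transition function becomes necessary.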

Another related work is Baldwin-Brown et al. [31], which presents a thorough study of the effects of sequencing effort, replicate count, strength of selection, and other parameters on the power to detect and localize a single selected locus segregating in a 1 Mb region. Results are obtained by simulating data under different experimental conditions and comparing the resulting distributions of allele trajectories under selection and neutrality using a modified form of t-test. Because it is not model-based, this method is incapable of performing parameter estimation. As a result of their study, Baldwin-Brown et al. present a number of design recommendations to experimenters seeking to attain a given level of power to detect selection. In a related work, Kofler and Schlötterer [32] carried out forward simulations of whole genomes to provide guidelines for designing E&R experiments to maximize the power to detect selected variants.

Illingworth et al. [33] derive a probabilistic model for time series data generated from large, asexually reproducing populations. The population size is sufficiently large (on the order of ∼ 10⁸) that population allele frequencies evolve quasi-deterministically. The deterministic trajectories are governed by a system of differential equations describing the effect of a selected (“driver”) mutation on nearby linked neutral (“passenger”) mutations. Randomness arises due to the finite sampling of alleles by sequencing. The main difference between the setting of Illingworth et al. and our own concerns genetic drift. While drift may be ignored when studying a large population of microorganisms, we show that it confounds our ability to detect and estimate selection in populations of order ∼ 10³. Thus, for E&R studies on (smaller) populations of macroscopic organisms, methods which assume that allele frequencies evolve deterministically may not perform as well as those which explicitly take drift into account.
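A rough back-of-the-envelope comparison illustrates why drift matters at N ∼ 10³ but not at N ∼ 10⁸. The sketch below compares the expected one-generation frequency change due to selection with the approximate standard deviation of the change due to drift. The genic selection model and the values p = 0.1 and s = 0.01 are hypothetical choices for illustration, not values taken from the paper.

```python
from math import sqrt


def per_generation_scales(p, s, n_diploid):
    """Return (expected selection shift, approximate drift sd) for one generation."""
    # Deterministic change under genic selection: x' - x = s x (1 - x) / (1 + s x).
    selection_shift = s * p * (1.0 - p) / (1.0 + s * p)
    # Standard deviation from binomial resampling of 2N alleles at frequency p.
    drift_sd = sqrt(p * (1.0 - p) / (2.0 * n_diploid))
    return selection_shift, drift_sd


for n in (10**3, 10**8):
    shift, sd = per_generation_scales(p=0.1, s=0.01, n_diploid=n)
    print(f"N={n:>9}: selection shift {shift:.2e}, drift sd {sd:.2e}")
```

Under these assumed values the drift standard deviation exceeds the selection shift when N = 10³ and is far smaller than it when N = 10⁸, consistent with the quasi-deterministic behavior described for very large populations.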

Topa et al. [34] present a Bayesian model for single-locus time series data obtained by next-generation sequencing. In each time period, the allele count is modeled as a draw from a binomial distribution with number of trials equal to the depth of sequencer coverage, and success probability equal to the population-level allele frequency. The posterior allele frequency distribution is used to test for selection by comparing a neutral model to one in which unobserved allele frequencies are allowed to depend on time. In the non-neutral case, a Gaussian process is used to allow for directional selection acting on the posterior allele frequency distributions.

Finally, Lynch et al. [35] derive a likelihood-based method for estimating population allele frequency at a single locus in pooled sequencing data. The method allows for the possibility of sequencing errors as well as subsampling the population prior to sequencing. Using theoretical results as well as simulations, the authors give guidelines on the (subsampled) population size and coverage depth needed to reliably detect a difference in allele frequency between two populations. Unlike the other methods surveyed here, the approach of Lynch et al. is not designed to analyze time series data. Hence the data requirements needed to reliably detect allele frequency changes using their method (for example, sequencing coverage depth of at least 100 reads) are potentially greater than for methods that are informed by a population-genetic model of genome evolution over time.

Novelty of our method

Our method differs from the above-mentioned approaches in several regards. To the best of our knowledge, ours is the first method capable of analyzing time series data from multiple linked sites jointly. We find that this is advantageous when studying selection in E&R data. Furthermore, it enables us to analyze features of these data which cannot be studied using single-locus models, such as local levels of linkage disequilibrium and the effect of a recombination hotspot. Additionally, because our model is based on a principled approximation to the Wright-Fisher process, it can numerically estimate the selection coefficient, dominance parameter, recombination rates, and other population genetic quantities of interest. In this way it is distinct from the aforementioned simulation-based methods [31, 32], methods which only focus on testing for selection [30, 31, 34], or methods based on general statistical procedures which are not specific to population genetics [34, 35].

Software and data availability

Source code implementing the method described in this paper is included in S1 Code. The experimental data analyzed in the section “Analysis of a real E&R experiment” are from Franssen et al. [36] and are available from the Dryad digital repository at http://dx.doi.org/10.5061/dryad.403b2.

Results

As described above, the primary methodological advance of this paper is to derive a tractable approximation to the discrete, multi-locus Wright-Fisher model with selection. This approximation enables us to perform statistical inference on time-series data generated in E&R experiments. Before studying how our approximation performs on both simulated and real data, we give a brief overview of its motivation and derivation.

A brief overview of the method

We consider the following model of an E&R experiment. A sexually reproducing population of N diploid individuals is evolved in discrete, non-overlapping generations. Pooled DNA sequencing [37, 38] is performed T times at generations t₁ < t₂ < ⋯ < t_T. At each segregating site in the resulting data set, we assume that there are two alleles, denoted A₀ and A₁. (As will be seen below, up to a change in the sign of the selection coefficient associated with each site, the model is agnostic to which allele is called A₀ or A₁.) Let L and R denote the number of loci and the number of experimental replicates, respectively. The array D ∈ [0, 1]^{T×L×R} records the relative frequency with which the A₁ allele was observed for each combination of generation, locus, and replicate.
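The shape and meaning of the data array D can be illustrated with a small forward simulation. This is a simplified sketch only: it assumes loci evolve independently (no linkage or recombination, unlike the multi-locus model of this paper), genic selection with a common coefficient s at every locus, a common starting frequency p0, and binomial read sampling at a fixed coverage depth. The function name and all parameter values are hypothetical.

```python
import random


def simulate_er_data(n_diploid, sample_gens, n_loci, n_reps, s, depth, p0=0.2, seed=0):
    """Return D as nested lists with D[t][l][r] = observed fraction of A1 reads.

    Assumptions (not from the paper): independent loci, genic selection with
    the same s everywhere, every replicate starts at frequency p0, and read
    counts are binomial with fixed coverage `depth`.
    """
    rng = random.Random(seed)
    two_n = 2 * n_diploid

    def binomial(n, p):
        # Number of successes in n Bernoulli(p) trials.
        return sum(rng.random() < p for _ in range(n))

    data = [[[0.0] * n_reps for _ in range(n_loci)] for _ in sample_gens]
    for locus in range(n_loci):
        for rep in range(n_reps):
            x, gen = p0, 0
            for t, target in enumerate(sample_gens):
                # Evolve the population up to the next sampling generation.
                while gen < target:
                    x_sel = x * (1.0 + s) / (1.0 + s * x)  # selection
                    x = binomial(two_n, x_sel) / two_n     # drift
                    gen += 1
                # Pooled sequencing: observe a noisy read fraction.
                data[t][locus][rep] = binomial(depth, x) / depth
    return data


# T = 3 sampled generations, L = 4 loci, R = 3 replicates.
D = simulate_er_data(n_diploid=100, sample_gens=[0, 10, 20],
                     n_loci=4, n_reps=3, s=0.05, depth=50)
print(len(D), len(D[0]), len(D[0][0]))
```

Each entry of D is a read fraction rather than the population frequency itself, so two sources of randomness are layered: drift in the population and finite sampling by the sequencer.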

Given D and a vector of underlying population-genetic parameters θ, let ℙ(D∣θ) denote the model likelihood. In an idealized E&R experiment, generations are discrete and non-overlapping, mating is random, and the population size is fixed, so the likelihood is well approximated by the classical Wright-Fisher model of genome evolution [39]:

where ℙ_θ(G_i ∣ G_{i−1}) is the transition function of the discrete, many-locus Wright-Fisher Markov chain from genomic configuration G_{i−1} to G_i given parameters θ,

References

1. Veeramah KR, Hammer MF (2014) The impact of whole-genome sequencing on the reconstruction of human population history. Nature Reviews Genetics 15: 149–162. doi: 10.1038/nrg3625. PMID: 24492235

2. McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, et al. (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584. doi: 10.1126/science.1092500. PMID: 15105499

3. Myers S, Bottolo L, Freeman C, McVean G, Donnelly P (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321–324. doi: 10.1126/science.1117196. PMID: 16224025

4. Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, et al. (2012) A fine-scale chimpanzee genetic map from population sequencing. Science 336: 193–198. doi: 10.1126/science.1216872. PMID: 22422862

5. Chan AH, Jenkins PA, Song YS (2012) Genome-wide fine-scale recombination rate variation in Drosophila melanogaster. PLoS Genetics 8: e1003090. doi: 10.1371/journal.pgen.1003090. PMID: 23284288

6. Auton A, Li YR, Kidd J, Oliveira K, Nadel J, et al. (2013) Genetic recombination is targeted towards gene promoter regions in dogs. PLoS Genetics 9: e1003984. doi: 10.1371/journal.pgen.1003984. PMID: 24348265

7. Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, et al. (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biology 3: e170. doi: 10.1371/journal.pbio.0030170. PMID: 15869325

8. Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Hubisz MT, et al. (2005) Natural selection on protein-coding genes in the human genome. Nature 437: 1153–1157. doi: 10.1038/nature04240. PMID: 16237444

9. Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, et al. (2006) Positive natural selection in the human lineage. Science 312: 1614–1620. doi: 10.1126/science.1124309. PMID: 16778047

10. Nielsen R, Hellmann I, Hubisz M, Bustamante C, Clark AG (2007) Recent and ongoing selection in the human genome. Nature Reviews Genetics 8: 857–868. doi: 10.1038/nrg2187. PMID: 17943193

11. Sella G, Petrov DA, Przeworski M, Andolfatto P (2009) Pervasive natural selection in the Drosophila genome? PLoS Genetics 5: e1000495. doi: 10.1371/journal.pgen.1000495. PMID: 19503600

12. Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, et al. (2011) Classic selective sweeps were rare in recent human evolution. Science 331: 920–924. doi: 10.1126/science.1198878. PMID: 21330547

13. Langley CH, Stevens K, Cardeno C, Lee YCG, Schrider DR, et al. (2012) Genomic variation in natural populations of Drosophila melanogaster. Genetics 192: 533–598. doi: 10.1534/genetics.112.142018. PMID: 22673804

14. Hummel S, Schmidt D, Kremeyer B, Herrmann B, Oppermann M (2005) Detection of the CCR5-Delta32 HIV resistance gene in Bronze Age skeletons. Genes and Immunity 6: 371–374. doi: 10.1038/sj.gene.6364172. PMID: 15815693

15. Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, et al. (2010) A draft sequence of the Neandertal genome. Science 328: 710–722. doi: 10.1126/science.1188021. PMID: 20448178

16. Reich D, Green RE, Kircher M, Krause J, Patterson N, et al. (2010) Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468: 1053–1060. doi: 10.1038/nature09710. PMID: 21179161

17. Ludwig A, Pruvost M, Reissmann M, Benecke N, Brockmann GA, et al. (2009) Coat color variation at the beginning of horse domestication. Science 324: 485. doi: 10.1126/science.1172750. PMID: 19390039

18. Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, et al. (2012) A high-coverage genome sequence from an archaic Denisovan individual. Science 338: 222–226. doi: 10.1126/science.1224344. PMID: 22936568

19. Orlando L, Ginolhac A, Zhang G, Froese D, Albrechtsen A, et al. (2013) Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse. Nature 499: 74–78. doi: 10.1038/nature12323. PMID: 23803765

20. Sankararaman S, Mallick S, Dannemann M, Prüfer K, Kelso J, et al. (2014) The genomic landscape of Neanderthal ancestry in present-day humans. Nature 507: 354–357. doi: 10.1038/nature12961. PMID: 24476815

21. Steinrücken M, Bhaskar A, Song YS (2014) A novel spectral method for inferring general diploid selection from time series genetic data. Annals of Applied Statistics 8: 2203–2222. doi: 10.1214/14-AOAS764. PMID: 25598858

22. Wiser MJ, Ribeck N, Lenski RE (2013) Long-term dynamics of adaptation in asexual populations. Science 342: 1364–1367. doi: 10.1126/science.1243357. PMID: 24231808

23. Lang GI, Rice DP, Hickman MJ, Sodergren E, Weinstock GM, et al. (2013) Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500: 571–574. doi: 10.1038/nature12344. PMID: 23873039

24. Burke MK, Dunham JP, Shahrestani P, Thornton KR, Rose MR, et al. (2010) Genome-wide analysis of a long-term evolution experiment with Drosophila. Nature 467: 587–590. doi: 10.1038/nature09352. PMID: 20844486

25. Orozco ter Wengel P, Kapun M, Nolte V, Kofler R, Flatt T, et al. (2012) Adaptation of Drosophila to a novel laboratory environment reveals temporally heterogeneous trajectories of selected alleles. Molecular Ecology 21: 4931–4941. doi: 10.1111/j.1365-294X.2012.05673.x

26. Tenaillon O, Rodríguez-Verdugo A, Gaut RL, McDonald P, Bennett AF, et al. (2012) The molecular diversity of adaptive convergence. Science 335: 457–461. doi: 10.1126/science.1212986. PMID: 22282810

27. Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, et al. (1999) Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. Journal of Virology 73: 10489–10502. PMID: 10559367

28. Burke MK (2012) How does adaptation sweep through the genome? Insights from long-term selection experiments. Proceedings of the Royal Society B: Biological Sciences 279: 5029–5038. doi: 10.1098/rspb.2012.0799. PMID: 22833271

29. Parts L, Cubillos FA, Warringer J, Jain K, Salinas F, et al. (2011) Revealing the genetic structure of a trait by sequencing a population under selection. Genome Research 21: 1131–1138. doi: 10.1101/gr.116731.110. PMID: 21422276

30. Feder AF, Kryazhimskiy S, Plotkin JB (2014) Identifying signatures of selection in genetic time series. Genetics 196: 509–522. doi: 10.1534/genetics.113.158220. PMID: 24318534

31. Baldwin-Brown JG, Long AD, Thornton KR (2014) The power to detect quantitative trait loci using resequenced, experimentally evolved populations of diploid, sexual organisms. Molecular Biology and Evolution 31: 1040–1055. doi: 10.1093/molbev/msu048. PMID: 24441104

32. Kofler R, Schlötterer C (2014) A guide for the design of evolve and resequencing studies. Molecular Biology and Evolution 31: 474–483. doi: 10.1093/molbev/mst221. PMID: 24214537

33. Illingworth CJR, Parts L, Schiffels S, Liti G, Mustonen V (2012) Quantifying selection acting on a complex trait using allele frequency time series data. Molecular Biology and Evolution 29: 1187–1197. doi: 10.1093/molbev/msr289. PMID: 22114362

34. Topa H, Jónás Á, Kofler R, Kosiol C, Honkela A (2014) Gaussian process test for high-throughput sequencing time series: application to experimental evolution. arXiv:1403.4086 [q-bio.PE].

35. Lynch M, Bost D, Wilson S, Maruki T, Harrison S (2014) Population-genetic inference from pooled-sequencing data. Genome Biology and Evolution 6: 1210–1218. doi: 10.1093/gbe/evu085. PMID: 24787620

36. Franssen SU, Nolte V, Tobler R, Schlötterer C (2015) Patterns of linkage disequilibrium and long range hitchhiking in evolving experimental Drosophila melanogaster populations. Molecular Biology and Evolution 32: 495–509. doi: 10.1093/molbev/msu320

37. Futschik A, Schlötterer C (2010) The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics 186: 207–218. doi: 10.1534/genetics.110.114397. PMID: 20457880

38. Schlötterer C, Tobler R, Kofler R, Nolte V (2014) Sequencing pools of individuals—mining genome-wide polymorphism data without big funding. Nature Reviews Genetics 15: 749–763. doi: 10.1038/nrg3803. PMID: 25246196

39. Ewens WJ (1979) Mathematical Population Genetics. Springer Verlag.

40. Hazel JR (1995) Thermal adaptation in biological membranes: is homeoviscous adaptation the explanation? Annual Review of Physiology 57: 19–42. doi: 10.1146/annurev.ph.57.030195.000315. PMID: 7778864

41. Comeron JM, Ratnappan R, Bailin S (2012) The many landscapes of recombination in Drosophila melanogaster. PLoS Genetics 8: e1002905. doi: 10.1371/journal.pgen.1002905. PMID: 23071443

42. Singh ND, Stone EA, Aquadro CF, Clark AG (2013) Fine-scale heterogeneity in crossover rate in the garnet-scalloped region of the Drosophila melanogaster X chromosome. Genetics 194: 375–387. doi: 10.1534/genetics.112.146746. PMID: 23410829

43. Cutler DJ, Jensen JD (2010) To pool, or not to pool? Genetics 186: 41–43. doi: 10.1534/genetics.110.121012. PMID: 20855575

44. Gautier M, Foucaud J, Gharbi K, Cézard T, Galan M, et al. (2013) Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping. Molecular Ecology 22: 3766–3779. doi: 10.1111/mec.12360. PMID: 23730833

45. Lynch M, Bost D, Wilson S, Maruki T, Harrison S (2014) Population-genetic inference from pooled-sequencing data. Genome Biology and Evolution 6: 1210–1218. doi: 10.1093/gbe/evu085. PMID: 24787620

46. Kirkpatrick M, Johnson T, Barton N (2002) General models of multilocus evolution. Genetics 161: 1727. PMID: 12196414

47. Barton NH, Otto SP (2005) Evolution of recombination due to random drift. Genetics 169: 2353–2370. doi: 10.1534/genetics.104.032821. PMID: 15687279

48. Stephan W, Song YS, Langley CH (2006) The hitchhiking effect on linkage disequilibrium between linked neutral loci. Genetics 172: 2647–2663. doi: 10.1534/genetics.105.050179. PMID: 16452153

49. Hudson RR (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337–338. doi: 10.1093/bioinformatics/18.2.337. PMID: 11847089

50. Li H, Stephan W (2006) Inferring the demographic history and rate of adaptive substitution in Drosophila. PLoS Genetics 2: e166. doi: 10.1371/journal.pgen.0020166. PMID: 17040129

51. Peng B, Kimmel M (2005) simuPOP: a forward-time population genetics simulation environment. Bioinformatics 21: 3686–3687. doi: 10.1093/bioinformatics/bti584. PMID: 16020469