Machine learning swallowing population genetics = understanding patterns in population genomics

Dan Schrider and Andy Kern have a new review preprint out, Machine Learning for Population Genetics: A New Paradigm. On Twitter there has already been a little snark to the effect of “oh, you mean regression?” That’s fair enough, and the preprint would probably benefit from a lower-key title, though those are exactly the sort of titles journals seem to love.

I would recommend this preprint to two large groups of my readers. First, those with strong computational skills who are curious about biology: it makes clear why population genomics benefits from machine learning methods. Second, those who are interested or trained in genetics but have less of a computational and pop gen background.

Yes, all models are wrong. But some give insight, and some are just not salvageable. In population genomics some of the model-building is obviously starting to yield really fragile results.

Carving nature at its joints more realistically

If you are working on phylogenetic questions on a coarse evolutionary scale (that is, “macroevolutionary,” though I know some evolutionary geneticists will shoot me the evil eye for using that word) generating a tree of relationships is quite informative and relatively straightforward, since it has a comprehensible mapping onto what really occurred in nature. When your samples are different enough that the biological species concept works well and gene flow doesn’t occur between nodes, then a tree is a tree (one reason Y and mtDNA results are so easy to communicate to the general public in personal genomics).

Everything becomes more problematic when you are working on a finer phylogenetic scale (or in taxa where inter-species gene flow is common, as is often the case with plants). And I’m using problematic here in the way that denotes a genuine substantive analytic issue, as opposed to connoting something that one has moral or ethical objections to.

It is intuitively clear that there is often genetic population structure within species, but how to summarize and represent that variation is not a straightforward task.

In 2000 the paper Inference of Population Structure Using Multilocus Genotype Data in Genetics introduced the sort of model-based clustering most famously implemented with Structure. The paper illustrates limitations with the neighbor-joining tree methods which were in vogue at the time, and contrasts them with a method which defines a finite set of populations and assigns proportions of each putative group to various individuals.

The model-based methods were implemented in numerous packages over the 2000s, and today they’re pretty standard parts of the phylogenetic and population genetic toolkits. The reason for their popularity is obvious: they are quite often clear and unambiguous in their results. This may be one reason they emerged to complement visualization methods like PCA and MDS, which make fewer a priori assumptions.

But of course, crisp clarity is not always reality. Sometimes nature is fuzzy and messy. The model-based methods take inputs and will produce crisp results, even if those results are not biologically realistic. They can’t be utilized in a robotic manner without attention to the assumptions and limitations (see A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots).

This is why it is exciting to see a new preprint which addresses many of these issues, Inferring Continuous and Discrete Population Genetic Structure Across Space*:

A classic problem in population genetics is the characterization of discrete population structure in the presence of continuous patterns of genetic differentiation. Especially when sampling is discontinuous, the use of clustering or assignment methods may incorrectly ascribe differentiation due to continuous processes (e.g., geographic isolation by distance) to discrete processes, such as geographic, ecological, or reproductive barriers between populations. This reflects a shortcoming of current methods for inferring and visualizing population structure when applied to genetic data deriving from geographically distributed populations. Here, we present a statistical framework for the simultaneous inference of continuous and discrete patterns of population structure….

The whole preprint should be read by anyone interested in phylogenomic inference, as there is extensive discussion of, and attention to, many problems and missteps that occur when researchers attempt to analyze variation and relationships across a species’ range. Basically, the sort of thing that might be mentioned in peer review feedback, but isn’t likely to be included in any final write-ups.

As noted in the abstract, the major issue being addressed here is that many clustering methods do not include within their model the reality that genetic variation within a species may be present due to continuous gene flow governed by isolation-by-distance dynamics. This goes back to the old “clines vs. clusters” debates. Many of the model-based methods assume pulse admixtures between random-mating population clusters. This is not a terrible assumption when you consider what occurred in the New World when Europeans came in contact with the native populations and introduced Africans. But it is not so realistic when it comes to the North European plain, which seems to have become genetically differentiated only within the last ~5,000 years, and has likely seen extensive gene flow.

The figure below shows the results from the conStruct method (left), and the more traditional fastStructure (right):

There are limitations to the spatial model they use (e.g., ring species), but that’s true of any model. The key is that it’s a good first step to account for continuous gene flow, and not shoehorning all variation into pulse admixtures.

Though in beta, the R package is already available on GitHub (easy enough to download and install). I’ll probably have more comment when I test drive it myself….

* I am friendly with the authors of this paper, so I am also aware of their long-held concerns about the limitations and/or abuses of some phylogenetic methods. These concerns are broadly shared within the field.

After agriculture, before bronze

 

The above plot shows genetic distance/variation between highland and lowland populations in Papua New Guinea (PNG). It is from a paper in Science that I have been anticipating for a few months (I talked to the first author at SMBE), A Neolithic expansion, but strong genetic structure, in the independent history of New Guinea.

What does “strong genetic structure” mean? Basically, Fst measures the proportion of genetic variation which is partitioned between groups. Intuitively it is easy to understand: if ~1% of the genetic variation is partitioned between groups in one case, and ~10% in another, then it is reasonable to suppose that the genetic distance between groups is larger in the second case than in the first. On a continental scale Fst between populations is often on the order of ~0.10. That is the value, for example, when you pool the variation amongst Northern Europeans and Chinese, and assess how much of it can be apportioned in a manner which differentiates the populations (about ~10% of the variation).
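To make the statistic concrete, here is a minimal sketch (my own toy illustration, not the estimator used in the paper) of Wright’s Fst for a single biallelic locus in two equally weighted populations:

```python
# Toy Wright's Fst = (HT - HS) / HT for one biallelic locus in two
# equal-sized populations; illustration only, not a production estimator.

def fst_two_pops(p1, p2):
    """Fst from the allele frequencies p1, p2 of two equally weighted groups."""
    p_bar = (p1 + p2) / 2                              # pooled allele frequency
    h_t = 2 * p_bar * (1 - p_bar)                      # pooled expected heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-group heterozygosity
    return (h_t - h_s) / h_t

print(round(fst_two_pops(0.50, 0.60), 2))  # modest difference: Fst ~0.01
print(round(fst_two_pops(0.20, 0.80), 2))  # big difference: Fst ~0.36
```

With real data you would average over many loci, but the logic is the same: Fst is the fraction of total heterozygosity not accounted for within groups.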

This is why ancient DNA results which reported that Mesolithic hunter-gatherers and Neolithic farmers in Central Europe, who coexisted in rough proximity for thousands of years, exhibited differences on the order of ~0.10 elicited surprise. These are values we expect from continental-scale comparisons. Perhaps an appropriate analogy might be the coexistence of Pygmy groups and Bantu agriculturalists? Though there is some gene flow, the two populations exist in symbiosis and exhibit local ecological segregation.

In PNG, continental-scale Fst values are also seen among indigenous people. The differences between the peoples who live in the highlands and lowlands of PNG are equivalent to those between huge regions of Eurasia. This is not entirely surprising, because there has been non-trivial gene flow into lowland populations from Austronesian groups, such as the Lapita culture. Many lowland groups even speak Austronesian languages today.

Using standard ADMIXTURE analysis, the paper shows that many lowland groups have significant East Asian ancestry (red), while none of the highland groups do (the few individuals with East Asian admixture seem to reflect very recent gene flow). But even within the highlands the genetic differences are striking. The Fst values between Finns and Southern European groups such as Spaniards are very high in a European context (due to Finnish Siberian ancestry as well as drift through a bottleneck), but most comparisons among the highland groups in PNG still exceed this.

The paper also argues that genetic differences between Papuans and the natives of Australia pre-date the rising sea levels at the beginning of the Holocene, when Sahul was divided into its various constituents. This is not entirely surprising considering that the ecology of the highlands during the Pleistocene would have been considerably different from Australia to the south, resulting in sharp differences in hunter-gatherer lifestyles. Additionally, there does not seem to have been a genetic cline: Papuans are symmetrically related to all the Australian groups they had samples from.

Using coalescence-based genomic methods they inferred that separation between highlands and some lowland groups occurred ~10-20,000 years ago. That is, after the Last Glacial Maximum. For the highlands, the differences seem to date to within the last 10,000 years. The Holocene. Additionally, they see population increases in the highlands, correlating with the shift to agriculture (cultivation of taro).

None of the above is entirely surprising, though I would take the date inferences with a grain of salt. The key is to observe that large genetic differences, as well as cultural differences, accrued in the highlands of PNG during the Holocene. In the paper they have a social and cultural explanation for what’s going on:

  Fst values in PNG fall between those of hunter-gatherers and present-day populations of west Eurasia, suggesting that a transition to cultivation alone does not necessarily lead to genetic homogenization.

  A key difference might be that PNG had no Bronze Age, which in west Eurasia was driven by an expansion of herders and led to massive population replacement, admixture, and cultural and linguistic change (7, 8), or Iron Age such as that linked to the expansion of Bantu-speaking farmers in Africa (24). Such cultural events have resulted in rapid Y-chromosome lineage expansions due to increased male reproductive variance (25), but we consistently find no evidence for this in PNG (fig. S13). Thus, in PNG, we may be seeing the genetic, linguistic, and cultural diversity that sedentary human societies can achieve in the absence of massive technology-driven expansions.

Peter Turchin in books like Ultrasociety has argued that one of the theses in Steven Pinker’s The Better Angels of Our Nature is incorrect: violence has not decreased monotonically, but rather peaked in less complex agricultural societies. PNG is clearly a case of this, as endemic warfare was a feature of highland societies when they encountered Europeans. Lawrence Keeley’s War Before Civilization: The Myth of the Peaceful Savage gives so much attention to highland PNG because it is a contemporary illustration of a Neolithic society which until recently had not developed state-level institutions.

What papers like these are showing is that cultural and anthropological dynamics strongly shape the nature of genetic variation among humans. Simple models which assume as a null hypothesis that gene flow occurs through diffusion processes across a landscape where only geographic obstacles are relevant simply do not capture enough of the dynamic. Human cultures strongly shape the nature of interactions, and therefore the genetic variation we see around us.

Quantitative genomics, adaptation, and cognitive phenotypes

The human brain uses ~20% of the calories you take in per day. It’s a large and metabolically expensive organ. Because of this fact there are lots of evolutionary models which focus on the brain. In Catching Fire: How Cooking Made Us Human Richard Wrangham suggests that our need for calories to feed our brain is one reason we started to use fire to pre-digest our food. In The Mating Mind Geoffrey Miller seems to suggest that all the things our big complex brain does allow for a signaling of mutational load. And in Grooming, Gossip, and the Evolution of Language Robin Dunbar suggests that it’s social complexity which is driving our encephalization.

These are all theories. Interesting hypotheses and models. But how do we test them? A new preprint on bioRxiv is useful because it shows how cutting-edge methods from evolutionary genomics can be used to explore questions relating to cognitive neuroscience and psychopathology, Polygenic selection underlies evolution of human brain structure and behavioral traits:

…Leveraging publicly available data of unprecedented sample size, we studied twenty-five traits (i.e., ten neuropsychiatric disorders, three personality traits, total intracranial volume, seven subcortical brain structure volume traits, and four complex traits without neuropsychiatric associations) for evidence of several different signatures of selection over a range of evolutionary time scales. Consistent with the largely polygenic architecture of neuropsychiatric traits, we found no enrichment of trait-associated single-nucleotide polymorphisms (SNPs) in regions of the genome that underwent classical selective sweeps (i.e., events which would have driven selected alleles to near fixation). However, we discovered that SNPs associated with some, but not all, behaviors and brain structure volumes are enriched in genomic regions under selection since divergence from Neanderthals ~600,000 years ago, and show further evidence for signatures of ancient and recent polygenic adaptation. Individual subcortical brain structure volumes demonstrate genome-wide evidence in support of a mosaic theory of brain evolution while total intracranial volume and height appear to share evolutionary constraints consistent with concerted evolution…our results suggest that alleles associated with neuropsychiatric, behavioral, and brain volume phenotypes have experienced both ancient and recent polygenic adaptation in human evolution, acting through neurodevelopmental and immune-mediated pathways.

The preprint takes a kitchen-sink approach, throwing a lot of tests for selection at the phenotypes of interest. Also, there is always the issue of cryptic population structure generating false-positive associations, though they try to address it in the preprint. I am somewhat confused by this passage though:

Paleobiological evidence indicates that the size of the human skull has expanded massively over the last 200,000 years, likely mirroring increases in brain size.

From what I know, human cranial size leveled off in its growth ~200,000 years ago, peaked ~30,000 years ago, and has declined ever since. That being said, they find signatures of selection around genes associated with ‘intracranial volume.’

There are loads of results using different methods in the paper, but I was curious to note that schizophrenia had hits for both ancient and recent adaptation. A friend who is a psychologist pointed out to me that when you look within families, “unaffected” siblings of schizophrenics often exhibit deviations from the norm in various ways too; so even if they are not impacted by the disease, they are somewhere along a spectrum from ‘wild type’ to schizophrenic. In any case, in this paper they found recent selection for alleles ‘protective’ against schizophrenia.

There are lots of theories one could spin out of that singular result. But I’ll just leave you with the fact that when you have a quantitative trait with lots of heritable variation, it seems unlikely it’s been subject to a long period of unidirectional selection. Various forms of balancing selection seem to be at work here, and we’re only in the early stages of understanding what’s going on. Genuine comprehension will require:

– attention to population genetic theory
– large genomic data sets from a wide array of populations
– novel methods developed by population genomicists
– and functional insights which neuroscientists can bring to the table

Why you should learn some population genetics

From reader surveys I know a substantial portion of the people who will see this post are financially well off (of those who aren’t, a large number are students). Therefore, you can invest in some books.

Often people ask me questions related to population genetics in the comments (sometimes I get emails). That is all well and good. But it is always better to be able to fish than to have to ask for fish. Additionally, learning some population and quantitative genetics allows you to develop some tacit schemas through which you can process information coming at you, and through which you can develop some general intuition.

If you have a modest level of mathematical fluency and the disposable income, here are three indispensable books which are like the keys to the kingdom:

* Elements of Evolutionary Genetics
* Principles of Population Genetics
* Introduction to Quantitative Genetics.

If you don’t have the cash to spare, there are online notes which are pretty good:

* Graham Coop’s Population Genetics notes
* Joe Felsenstein’s Theoretical Evolutionary Genetics

There are other online resources, but they are not as comprehensive. John Gillespie’s Population Genetics: A Concise Guide is good as very gentle introductions go, but if you are going to spend money, I think plumping down for a more comprehensive textbook (which will have more genomics in it) is better over the long run.

The goal of getting these books isn’t to make you a population geneticist, but, if you are interested in evolutionary questions it gives you a powerful toolkit. Really nothing in evolutionary process makes sense except in the light of population genetics.

Dost thou know the equilibrium at panmixia?

If you read a blog about Biblical criticism from a Christian perspective it would probably be best if you were familiar with the Bible. You don’t have to have read much scholarly commentary, rather, just the New Testament. Barring that, at least the synoptic gospels!

At this point, with over 400 individuals responding to the reader survey, it is strange to consider that more people believe they have a handle on what Fst is than on the Hardy-Weinberg equilibrium. First, Fst is a more subtle concept than people often think it is. Second, the HWE is so easy, important, and foundational to population genetics. I mean, p^2 + 2pq + q^2 = 1. Could it be simpler???
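For anyone who wants to convince themselves how little machinery HWE requires, here is a minimal sketch (my own example, not from any particular textbook):

```python
# Hardy-Weinberg in one line of arithmetic: for allele frequency p (A) and
# q = 1 - p (a), random mating gives genotype frequencies p^2, 2pq, q^2.

def hwe_genotype_freqs(p):
    q = 1 - p
    return p**2, 2 * p * q, q**2

aa, het, bb = hwe_genotype_freqs(0.7)
print(round(aa, 2), round(het, 2), round(bb, 2))  # 0.49 0.42 0.09
assert abs(aa + het + bb - 1) < 1e-12             # the three always sum to 1
```

Comparing these expected proportions against observed genotype counts is the basic sanity check behind everything from paternity testing to genotyping quality control.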

So a quick ask. If you are one of the people who doesn’t understand HWE or why it is important, please get yourself a copy of John Gillespie’s Population Genetics: A Concise Guide. I understand that not everyone has the time, interest, or money for Principles of Population Genetics, or any of the more “hardcore” texts. But Population Genetics: A Concise Guide will surely suffice to follow anything on this blog.

Or, barring that, please review the online resources which you have available. Two examples:

Graham Coop’s Notes on Population Genetics or Joe Felsenstein’s unpublished textbook Theoretical Evolutionary Genetics.

Why the rate of evolution may only depend on mutation

Sometimes people think evolution is about dinosaurs.

It is true that natural history plays an important role in inspiring and directing our understanding of evolutionary process. Charles Darwin was a natural historian, and evolutionary biologists often have strong affinities with the natural world and its history. Though many people exhibit a fascination with the flora and fauna around us during childhood, often the greatest biologists retain this wonderment well into adulthood (if you read W. D. Hamilton’s collections of papers, Narrow Roads of Gene Land, which have autobiographical sketches, this is very evidently true of him).

But another aspect of evolutionary biology, which began in the early 20th century, is the emergence of formal mathematical systems of analysis. So you have fields such as phylogenetics, which have gone from intuitive and aesthetic trees of life, to inferences made using the most new-fangled Bayesian techniques. And, as told in The Origins of Theoretical Population Genetics, in the 1920s and 1930s a few mathematically oriented biologists built much of the formal scaffold upon which the Neo-Darwinian Synthesis was constructed.

The product of evolution

At the highest level of analysis evolutionary process can be described beautifully. Evolution is beautiful, in that its end product generates the diversity of life around us. But a formal mathematical framework is often needed to clearly and precisely model evolution, and so allow us to make predictions. R. A. Fisher’s aim when he wrote The Genetical Theory of Natural Selection was to create for evolutionary biology something equivalent to the laws of thermodynamics. I don’t really think he succeeded, though there are plenty of debates around something like Fisher’s fundamental theorem of natural selection.

But the revolution of thought that Fisher, Sewall Wright, and J. B. S. Haldane unleashed has had real yields. As geneticists they helped us reconceptualize evolutionary process as more than simply heritable morphological change, but an analysis of the units of heritability themselves, genetic variation. That is, evolution can be imagined as the study of the forces which shape changes in allele frequencies over time. This reduces a big domain down to a much simpler one.

Genetic variation is the concrete currency with which one can track evolutionary process. Initially this was done via inferred correlations between marker traits and particular genes in breeding experiments. Ergo, the origins of “the fly room”.

But with the discovery of DNA as the physical substrate of genetic inheritance in the 1950s, the scene was set for the revolution in molecular biology, which also touched evolutionary studies through an explosion of more powerful assays. Lewontin & Hubby’s 1966 paper triggered an order-of-magnitude increase in our understanding of molecular evolution through both theory and results.

The theoretical side took the form of the development of the neutral theory of molecular evolution, which also gave birth to the nearly neutral theory. Both of these theories hold that most of the variation within and between species at polymorphic sites is due to random processes, in particular genetic drift. As a null hypothesis neutrality has been dominant for the past generation, though in recent years some researchers have suggested that selection has been undervalued as a parameter, for various reasons.

Setting aside the live scientific debates, which continue to this day, one of the predictions of neutral theory is that the rate of evolution depends only on the rate of mutation. More precisely, the rate of substitution of new mutations (where an allele goes from a single copy to fixation at ~100%) is simply equal to the rate of mutation to new alleles. Population size doesn’t matter.

The algebra behind this is straightforward.

First, remember that the frequency of a new mutation within a population is \frac{1}{2N}, where N is the population size (the 2 is because we’re assuming diploid organisms, with two gene copies each). This is also the probability of fixation of a new neutral mutation: its probability of fixation is simply its initial frequency (drift is a random walk between frequencies of 0 and 1.0). The rate of mutation is \mu, the expected number of mutations at a given site per generation (a pretty small value; for humans it’s on the order of 10^{-8}). Again, there are 2N gene copies, so the expected number of new mutations per generation is 2N\mu.

The probability of fixation of a new mutation multiplied by the number of new mutations is:

    \[ \frac{1}{2N} \times 2N\mu = \mu \]

So there you have it. The rate of fixation of these new mutations is just a function of the rate of mutation.
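The claim that a new neutral mutation fixes with probability 1/(2N) is easy to check by brute force. Here is a minimal Wright-Fisher sketch (my own illustration, not a research-grade simulator):

```python
# Wright-Fisher check: a neutral mutation starting as one copy among 2N
# gene copies should fix with probability ~1/(2N). Stdlib only.
import random

def fix_prob_neutral(n_diploid, trials, seed=1):
    random.seed(seed)
    two_n = 2 * n_diploid
    fixed = 0
    for _ in range(trials):
        count = 1                                # one new mutant gene copy
        while 0 < count < two_n:                 # run drift to absorption
            p = count / two_n
            # binomial sampling of gene copies into the next generation
            count = sum(random.random() < p for _ in range(two_n))
        fixed += (count == two_n)
    return fixed / trials

n = 50
print(fix_prob_neutral(n, trials=5000), "expected ~", 1 / (2 * n))
```

Halve or double N and the per-mutation fixation probability shifts as 1/(2N), but the number of new mutations per generation, 2N\mu, shifts in the opposite direction, which is exactly why N cancels out of the substitution rate.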

Simple formalisms like this are extended by, and derived from, a lot more gnarly math. But they’re often pretty useful for gaining a general intuition about evolutionary processes. If you are genuinely curious, I would recommend Elements of Evolutionary Genetics. It’s not quite a core dump, but it is a way to borrow the brains of two of the best evolutionary geneticists of their generation.

Also, you will be able to answer the questions on my survey better the next time!

Why overdominance probably isn’t responsible for much polymorphism

Hybrid vigor is a concept that many people have heard of, because it is very useful in agricultural genetics, and makes some intuitive sense. Unfortunately it often gets deployed in a variety of contexts, and its applicability is often overestimated. For example, many people seem to think (from personal communication) that it may somehow be responsible for the genetic variation around us.

This is just not so. As you may know, each human carries tens of millions of genetic variants within their genome. Populations have various levels of polymorphism at particular positions in the genome. How’d they get there? In the early days of population genetics there were two broad schools, the “balance” and the “classical.” The former made the case for the importance of balancing selection in maintaining variation. The latter suggested that the variation we see around us is simply transient: a favored mutation on its way from low frequency to fixation, or a disfavored variant on its way to extinction (perhaps environmental conditions changed and a formerly high-frequency variant is now disfavored). Arguably the rise of neutral theory and empirical results from molecular evolution supported the classical model more than the balance framework (at least this was Richard Lewontin’s argument, and I follow his logic here).

But even in relation to alleles which are maintained at polymorphism through balancing selection, overdominance isn’t going to be the major player.

Sickle cell disease is a classic consequence of overdominance: the heterozygote is more fit than either the wild-type homozygote or the mutant homozygote, which suffers the disease. Obviously polymorphism is maintained despite the decreased fitness of the mutant homozygote because the heterozygote is so much more fit than the wild type. The final proportions of the alleles segregating in the population will be conditional on the fitness drag of the mutant homozygote, which per HWE will be present in the population at frequency ~q^2.
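The textbook algebra here (a standard result, not specific to this post) says the balanced allele frequency depends only on the relative fitness costs of the two homozygotes:

```python
# Heterozygote advantage with fitnesses 1-s (wild-type homozygote),
# 1 (heterozygote), 1-t (mutant homozygote) has a stable interior
# equilibrium at q* = s / (s + t) for the mutant allele.

def balanced_freq(s, t):
    """Equilibrium mutant-allele frequency under overdominance."""
    return s / (s + t)

# Illustrative sickle-cell-like numbers (invented for the example): mild
# malaria cost to wild-type homozygotes, severe cost to mutant homozygotes.
q = balanced_freq(s=0.1, t=0.9)
print(q, "-> mutant homozygotes at ~q^2 =", round(q**2, 2))
```

Note the logic: the bigger the cost to the wild-type homozygote relative to the mutant homozygote, the higher the mutant allele is pushed, regardless of how sick the mutant homozygote is.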

The problem is that this is clearly not going to scale across loci. Even if the fitness drag at any one locus is much smaller than at the sickle cell locus, the costs accumulate across loci: the segregation load is just going to be too high. Overdominance is probably a transient strategy which fades away as populations evolve more efficient ways to adapt that don’t carry such a fitness load.
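To see why the load blows up, multiply the per-locus cost across loci. A minimal sketch with made-up selection coefficients:

```python
# At the overdominant equilibrium each locus lowers mean fitness by the
# segregation load s*t/(s+t); with independent loci the loads multiply.
# The coefficients below are invented purely for illustration.

def mean_fitness(n_loci, s=0.01, t=0.01):
    load_per_locus = s * t / (s + t)        # here 0.005 per locus
    return (1 - load_per_locus) ** n_loci

for n in (1, 100, 1000, 10_000):
    print(n, mean_fitness(n))
```

Even a 0.5% load per locus drives mean fitness toward zero once thousands of loci are held polymorphic this way, which is the classic segregation-load argument against ubiquitous overdominance.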

So how does balancing selection still lead to variation without heterozygote advantage? W. D. Hamilton argued that much of it was due to negative frequency-dependent selection. Co-evolution with pathogens is the best case of this. As strategies become common, pathogens adapt, so rare strategies encoded by rare alleles gain in fitness. As those alleles increase in frequency, their fitness decreases as pathogens adapt to them. Their frequency then declines, the pathogens eventually lose their ability to target them, and their frequency increases again.
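Hamilton’s verbal argument can be sketched in a few lines. This is my own minimal haploid caricature (the fitness function is invented for illustration), but it shows the key behavior: the allele neither fixes nor is lost, it settles at an interior frequency:

```python
# Negative frequency-dependent selection: allele A's fitness falls as A
# becomes common, so its frequency is pushed toward an interior
# equilibrium (here p_opt) instead of 0 or 1.

def nfds_trajectory(p0, generations, s=0.5, p_opt=0.5):
    p, traj = p0, [p0]
    for _ in range(generations):
        w_a = 1 + s * (p_opt - p)          # A favored when rare, penalized when common
        w_b = 1.0                          # alternative allele as the baseline
        p = p * w_a / (p * w_a + (1 - p) * w_b)
        traj.append(p)
    return traj

traj = nfds_trajectory(p0=0.05, generations=100)
print(round(traj[0], 2), "->", round(traj[-1], 2))   # starts rare, settles near 0.5
```

Add drift and a lagging pathogen response and you get the oscillations described above rather than smooth convergence, but the restoring force is the same.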

Why only one migrant per generation keeps divergence at bay

The best thing about population genetics is that because it’s a way of thinking and modeling the world it can be quite versatile. If Thinking Like An Economist is a way to analyze the world rationally, thinking like a population geneticist allows you to have the big picture on the past, present, and future, of life.

I have some personal knowledge of this as a transformative experience. My own background was in biochemistry before I became interested in population genetics as an outgrowth of my lifelong fascination with evolutionary biology. It’s not exactly useless knowing all the steps of the Krebs cycle, but such knowledge lacks generality. In his autobiography I recall Isaac Asimov stating that one of the main benefits of his background as a biochemist was that he could rattle off the names on medicine bottles with fluency. Unless you are an active researcher in biochemistry, your specialized knowledge is quite abstruse. Population genetics tends to be more applicable to general phenomena.

In a post below I made a comment about how one migrant per generation or so is sufficient to prevent divergence between two populations. This is an old heuristic which goes back to Sewall Wright, and is encapsulated in the approximation Fst ≈ 1/(4Nm + 1). Basically, the divergence as measured by Fst is inversely proportional to 4 times the migration rate times the total population size, plus 1. Nm is the number of migrants per generation (the proportion of migrants times the total population size). As Nm becomes very large, Fst converges to zero.

The intuition is pretty simple. Imagine you have two populations which separate at a specific time. For example, sea level rises, so now you have a mainland and an island population. Since before the sea level rise the two populations were one random-mating population, their initial allele frequencies are the same at t = 0. But once they are separated, random drift begins to subject them to divergence, so that more and more of their genes exhibit differences in allele frequencies (ergo Fst, the between-population proportion of genetic variation, increases from 0).

Now add to this the parameter of migration. Why is one migrant per generation sufficient to keep divergence low? The two extreme scenarios are like so:

  1. Large populations change allele frequency very slowly due to drift, so only a small proportion of migration is needed to prevent them from diverging
  2. Small populations change allele frequency very fast due to drift, so a larger proportion of migration is needed to prevent them from diverging

Within a large population one migrant is a small proportion, but drift is occurring very slowly. Within a small population drift is occurring fast, but one migrant is a relatively large proportion of a small population.
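Wright’s island-model heuristic can be written out directly. A quick sketch of the standard approximation (nothing here is specific to any one study):

```python
# Wright's approximation: at drift-migration equilibrium, Fst ~ 1/(4Nm + 1),
# where Nm is the absolute number of migrants per generation. Only the
# product Nm matters, which is the "one migrant" intuition in formula form.

def fst_island(n_migrants):
    return 1 / (4 * n_migrants + 1)

for nm in (0.1, 1, 10):
    print(nm, round(fst_island(nm), 3))
# one migrant per generation already caps Fst at 0.2;
# ten migrants push it down near 0.024
```

Because N and m enter only as their product, a big population with a tiny migration rate and a tiny population with a big migration rate end up equally constrained.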

Obviously this is a stylized fact with many details which need elaborating. Some conservation geneticists believe that the focus on one migrant is wrongheaded, and the number should be set closer to 10 migrants.

But it still gets at a major intuition: gene flow is extremely powerful and effective at reducing differences between groups. This is why most geneticists are skeptical of sympatric speciation. Though the focus above is on drift, the same intuition applies to selective divergence. Gene flow between populations works at cross-purposes with selection which drives two groups toward different equilibrium frequencies.

This is why it was surprising when results showed that Mesolithic hunter-gatherers and farmers in Europe remained extremely genetically distinct in close proximity for on the order of 1,000 years. That being said, strong genetic differentiation persists between Pygmy peoples and their agriculturalist neighbors, despite a long history of living near each other (Pygmies do not have their own indigenous languages, but speak the tongues of their farmer neighbors). In the context of animals, physical separation is often necessary for divergence, but for humans cultural differences can enforce surprisingly strong taboos. Culture is as strong a phenomenon as mountains or rivers….