Synergistic epistasis as a solution for human existence

Epistasis is one of those terms in biology which has multiple meanings, to the point that even biologists can get turned around (see this 2008 review, Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems, for a little background). Most generically epistasis is the interaction of genes in terms of producing an outcome. But historically its meaning is derived from the fact that early geneticists noticed that crosses between individuals segregating for a Mendelian characteristic (e.g., smooth vs. curly peas) produced results conditional on the genotype of a secondary locus.

Molecular biologists tend to focus on a classical, often mechanistic, view, whereby epistasis can be conceptualized as biophysical interactions across loci. But population geneticists utilize a statistical or evolutionary definition, where epistasis describes the extent of deviation from additivity and linearity, with the “phenotype” often being fitness. This goes back to early debates between R. A. Fisher and Sewall Wright. Fisher believed that in the long run epistasis was not particularly important. Wright eventually put epistasis at the heart of his enigmatic shifting balance theory, though according to Will Provine in Sewall Wright and Evolutionary Biology even he had a difficult time understanding the model he was proposing (e.g., Wright couldn’t always remember what the different axes on his charts actually meant).

These different definitions can cause problems for students. A few years ago I was a teaching assistant for a genetics course, and the professor, a molecular biologist, asked a question about epistasis. The only answer on the key was predicated on a classical/mechanistic understanding. But some of the students were obviously giving the definition from an evolutionary perspective (e.g., they were bringing up non-additivity and fitness)! Luckily I noticed this early on and the professor approved the alternative answer, so that graders would not mark down those giving a non-molecular answer.

My interest in epistasis was fed to a great extent in the mid-2000s by my reading of Epistasis and the Evolutionary Process. Unfortunately not too many people read this book. I believe this is so because when I just went to look at the Amazon page it told me that “Customers who viewed this item also viewed” Robert Drews’ The End of the Bronze Age. As it happened I read this book at about the same time as Epistasis and the Evolutionary Process…and to my knowledge I’m the only person who has a very deep interest in statistical epistasis and Mycenaean Greece (if there is someone else out there, do tell).

In any case, when I was first focused on this topic genomics was in its infancy. Papers with 50,000 SNPs in humans were all the rage, and the HapMap paper had literally just been published. A lot has changed.

So I was interested to see this come out in Science, Negative selection in humans and fruit flies involves synergistic epistasis (preprint version). Since the authors are looking at humans and Drosophila and because it’s 2017 I assumed that genomic methods would loom large, and they do.

And as always on the first read through some of the terminology got confusing (various types of statistical epistasis keep getting renamed every few years it seems to me, and it’s hard to keep track of everything). So I went to Google. And because it’s 2017 a citation of the paper and further elucidation popped up in Google Books in Crumbling Genome: The Impact of Deleterious Mutations on Humans. Weirdly, or not, the book has not been published yet. Since the author is the second to last author on the above paper it makes sense that it would be cited in any case.

So what’s happening in this paper? Basically they are looking for reduced variance in the count of really bad mutations, because a particular type of epistasis amplifies their deleterious impact (fitness is almost always really hard to measure, so you want to look at proxy variables).

De novo mutations are rare; the authors estimate that about 7 per individual fall in functional regions of the genome (I think this may be high actually), and that their distribution should be Poisson. This distribution just tells you that the mean number of mutations and the variance of the number of mutations should be the same (e.g., a mean of 7 implies a variance of 7).
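The mean-equals-variance property of the Poisson is easy to check with a toy simulation (my own sketch, using the paper's figure of ~7 mutations only as an illustrative mean):

```python
import math
import random

random.seed(1)

def poisson_sample(lam):
    """One Poisson draw via Knuth's multiplication algorithm."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# simulate per-individual de novo mutation counts with mean 7
counts = [poisson_sample(7.0) for _ in range(100_000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(round(mean, 2), round(var, 2))  # both should come out close to 7
```

Under the null (no epistasis shaping the distribution), the two printed numbers match; the paper's test is whether the empirical variance falls short of this.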

Epistasis refers (usually) to interactions across loci. That is, different genes at different locations in the genome. Synergistic epistasis means that the total cumulative fitness after each successive mutation drops faster than the sum of the negative impact of each mutation. In other words, the negative impact is greater than the sum of its parts. In contrast, antagonistic epistasis produces a situation where new mutations on the tail of the distributions cause a lower decrement in fitness than you’d expect through the sum of its parts (diminishing returns on mutational load when it comes to fitness decrements).
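One common way to write this down (a standard parameterization, not necessarily the paper's) is log fitness as a quadratic in the mutation count n: log w(n) = −a·n − b·n²/2, where b > 0 gives synergistic epistasis, b < 0 antagonistic, and b = 0 pure additivity. The parameter values below are my own toy choices:

```python
import math

def fitness(n, a=0.02, b=0.0):
    """Fitness after n deleterious mutations under a quadratic log-fitness model."""
    return math.exp(-a * n - b * n * n / 2)

for n in (5, 10, 20):
    additive = fitness(n)                 # independent (multiplicative) effects
    synergistic = fitness(n, b=0.002)     # worse than the sum of its parts
    antagonistic = fitness(n, b=-0.0005)  # diminishing returns on fitness decrements
    print(n, round(additive, 3), round(synergistic, 3), round(antagonistic, 3))
```

For every n the synergistic curve sits below the additive one, and the antagonistic curve above it, which is exactly the "greater/less than the sum of its parts" contrast in the text.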

These two dynamics have an effect on the linkage disequilibrium (LD) statistic. This measures the association of two different alleles at two different loci. When populations are recently admixed (e.g., Brazilians) you have a lot of LD because racial ancestry results in lots of distinctive alleles being associated with each other across genomic segments in haplotypes. It takes many generations for recombination to break apart these associations so that allelic state at one locus can’t be used to predict the odds of the state at what was an associated locus. What synergistic epistasis does is disassociate deleterious mutations. In contrast, antagonistic epistasis results in increased association of deleterious mutations.
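The basic LD statistic is simple to compute (this is the textbook definition, D = freq(AB) − freq(A)·freq(B), not anything specific to this paper); the toy haplotype data below is made up for illustration:

```python
def ld_D(haplotypes):
    """D statistic for two biallelic loci; haplotypes are (allele1, allele2) tuples of 0/1."""
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n
    p_b = sum(h[1] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n
    return p_ab - p_a * p_b

# perfectly associated alleles, as on an intact admixture-derived haplotype block
print(ld_D([(1, 1)] * 5 + [(0, 0)] * 5))       # 0.25
# loci in linkage equilibrium: allelic state at one locus predicts nothing
print(ld_D([(1, 1), (1, 0), (0, 1), (0, 0)]))  # 0.0
```

Recombination drives D toward zero over generations; synergistic epistasis pushes D between deleterious alleles negative, antagonistic epistasis positive.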

Why? Because of selection. If a greater number of mutations means huge fitness hits, then there will be strong selection against individuals who randomly segregate out with higher mutational loads. This means that the variance of the mutational load is going to be lower than the value of the mean.

How do they figure out mutational load? They focus on the distribution of loss-of-function (LoF) mutations. These are extremely deleterious mutations which are the most likely to be a major problem for function and therefore a huge fitness hit. What they found was that the distribution of LoF mutations exhibited a variance which was 90-95% of that of a null Poisson distribution. In other words, there was stronger selection against high mutation counts, as one would predict due to synergistic epistasis.
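This underdispersion is easy to reproduce in a toy simulation (my own construction, not the paper's method): draw Poisson mutation counts, let individuals survive with a synergistic fitness (the quadratic term makes each extra mutation hurt more than the last), and check the variance-to-mean ratio among survivors. The selection coefficients below are arbitrary toy values:

```python
import math
import random

random.seed(2)

def poisson_sample(lam):
    """One Poisson draw via Knuth's multiplication algorithm."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

survivors = []
while len(survivors) < 50_000:
    n = poisson_sample(7.0)
    # synergistic log-fitness: -a*n - b*n^2/2 with toy values a = b = 0.01
    w = math.exp(-0.01 * n - 0.01 * n * n / 2)
    if random.random() < w:  # viability selection
        survivors.append(n)

mean = sum(survivors) / len(survivors)
var = sum((x - mean) ** 2 for x in survivors) / len(survivors)
print(round(var / mean, 3))  # ratio below 1: underdispersed relative to Poisson
```

Without the quadratic term the survivors would still be Poisson (ratio 1); it is specifically the synergistic term that trims the high-load tail and pulls the variance below the mean.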

They conclude:

Thus, the average human should carry at least seven de novo deleterious mutations. If natural selection acts on each mutation independently, the resulting mutation load and loss in average fitness are inconsistent with the existence of the human population (1 − e^{−7} > 0.99). To resolve this paradox, it is sufficient to assume that the fitness landscape is flat only outside the zone where all the genotypes actually present are contained, so that selection within the population proceeds as if epistasis were absent (20, 25). However, our findings suggest that synergistic epistasis affects even the part of the fitness landscape that corresponds to genotypes that are actually present in the population.

Overall this is fascinating, because evolutionary genetic questions which were still theoretical a little over ten years ago are now being explored with genomic methods. This is part of why I say genomics did not fundamentally revolutionize how we understand evolution. There were plenty of models and theories. Now we are testing them extremely robustly and thoroughly.

Addendum: Reading this paper reinforces to me how difficult it is to keep up with the literature, and how important it is to know the literature in a very narrow area to get the most out of a paper. Really the citations are essential reading for someone like me who just “drops” into a topic after a long time away….

Citation: Science, Negative selection in humans and fruit flies involves synergistic epistasis.

Why the rate of evolution may only depend on mutation

Sometimes people think evolution is about dinosaurs.

It is true that natural history plays an important role in inspiring and directing our understanding of evolutionary process. Charles Darwin was a natural historian, and evolutionary biologists often have strong affinities with the natural world and its history. Though many people exhibit a fascination with the flora and fauna around us during childhood, often the greatest biologists retain this wonderment well into adulthood (if you read W. D. Hamilton’s collections of papers, Narrow Roads of Gene Land, which have autobiographical sketches, this is very evidently true of him).

But another aspect of evolutionary biology, which began in the early 20th century, is the emergence of formal mathematical systems of analysis. So you have fields such as phylogenetics, which have gone from intuitive and aesthetic trees of life, to inferences made using the most new-fangled Bayesian techniques. And, as told in The Origins of Theoretical Population Genetics, in the 1920s and 1930s a few mathematically oriented biologists constructed much of the formal scaffold upon which the Neo-Darwinian Synthesis was constructed.

The product of evolution

At the highest level of analysis evolutionary process can be described beautifully. Evolution is beautiful, in that its end product generates the diversity of life around us. But a formal mathematical framework is often needed to clearly and precisely model evolution, and so allow us to make predictions. R. A. Fisher’s aim when he wrote The Genetical Theory of Natural Selection was to create for evolutionary biology something equivalent to the laws of thermodynamics. I don’t really think he succeeded in that, though there are plenty of debates around something like Fisher’s fundamental theorem of natural selection.

But the revolution of thought that Fisher, Sewall Wright, and J. B. S. Haldane unleashed has had real yields. As geneticists they helped us reconceptualize evolutionary process as more than simply heritable morphological change, but an analysis of the units of heritability themselves, genetic variation. That is, evolution can be imagined as the study of the forces which shape changes in allele frequencies over time. This reduces a big domain down to a much simpler one.

Genetic variation is the concrete currency with which one can track evolutionary process. Initially this was done via inferred correlations between marker traits and particular genes in breeding experiments. Ergo, the origins of “the fly room”.

But with the discovery of DNA as the physical substrate of genetic inheritance in the 1950s the scene was set for the revolution in molecular biology, which also touched evolutionary studies with the explosion of more powerful assays. Lewontin & Hubby’s 1966 paper triggered an order of magnitude increase in our understanding of molecular evolution through both theory and results.

The theoretical side occurred in the form of the development of the neutral theory of molecular evolution, which also gave birth to the nearly neutral theory. Both of these theories hold that most of the polymorphism within and between species is due to random processes. In particular, genetic drift. As a null hypothesis neutrality was very dominant for the past generation, though in recent years some researchers are suggesting that selection has been undervalued as a parameter for various reasons.

Setting aside the live scientific debate, which continues to this day, one of the predictions of neutral theory is that the rate of evolution will depend only on the rate of mutation. More precisely, the rate of substitution of new mutations (where the allele goes from a single copy to fixation at ~100%) is proportional to the rate of mutation of new alleles. Population size doesn’t matter.

The algebra behind this is straightforward.

First, remember that the frequency of a new mutation within a population is \frac{1}{2N}, where N is the population size (the 2 is because we’re assuming diploid organisms with two gene copies). This is also the probability of fixation of a new mutation in a neutral scenario; its probability of fixing is just proportional to its initial frequency (it’s a random walk process between proportions of 0 and 1.0). The rate of mutation is defined by \mu, the number of expected mutations at a given site per generation (this is a pretty small value; for humans it’s on the order of 10^{-8}). Again, there are 2N gene copies, so 2N\mu counts the expected number of new mutations.

The probability of fixation of a new mutation multiplied by the number of new mutations is:

    \[ \frac{1}{2N} \times 2N\mu = \mu \]

So there you have it. The rate of fixation of these new mutations is just a function of the rate of mutation.
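The 1/(2N) fixation probability that the derivation rests on can be checked directly with a bare-bones Wright-Fisher simulation (my own sketch; N = 50 and the trial count are arbitrary toy choices): start one copy among 2N, resample binomially each generation, and count how often it fixes.

```python
import random

random.seed(3)

def fixes(N):
    """Track one new neutral mutation under Wright-Fisher drift until loss or fixation."""
    copies, total = 1, 2 * N
    while 0 < copies < total:
        p = copies / total
        # next generation: each of the 2N gene copies is drawn from the current pool
        copies = sum(1 for _ in range(total) if random.random() < p)
    return copies == total

N = 50
trials = 20_000
rate = sum(fixes(N) for _ in range(trials)) / trials
print(rate)  # should come out near 1/(2N) = 0.01
```

Most new mutations are lost within a few generations; the rare fixations occur at almost exactly the initial frequency, which is why N cancels out of the substitution rate.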

Simple formalisms like this come with a lot of gnarlier math that extends them and from which they derive. But they’re often pretty useful to gain a general intuition of evolutionary processes. If you are genuinely curious, I would recommend Elements of Evolutionary Genetics. It’s not quite a core dump, but it is a way you can borrow the brains of two of the best evolutionary geneticists of their generation.

Also, you will be able to answer the questions on my survey better the next time!

Fisherianism in the genomic era

There are many things about R. A. Fisher that one could say. Professionally he was one of the founders of evolutionary genetics and statistics, and arguably the second greatest evolutionary biologist after Charles Darwin. With his work in the first few decades of the 20th century he reconciled the quantitative evolutionary framework of the school of biometry with mechanistic genetics, and formalized evolutionary theory in The Genetical Theory of Natural Selection.

He was also an asshole. This is clear in the major biography of him, R.A. Fisher: The Life of a Scientist. It was written by his daughter. But The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century also seems to indicate he was a dick. And W. D. Hamilton’s Narrow Roads of Gene Land portrays Fisher as rather cold and distant, despite the fact that Hamilton idolized him.

Notwithstanding his unpleasant personality, R. A. Fisher seems to have been a veritable mentat in his early years. Much of his thinking crystallized in the first few decades of the 20th century, when genetics was a new science and mathematical methods were being brought to bear on a host of topics. It would be decades until DNA was understood to be the substrate of heredity. Instead of deriving from molecular first principles, which were simply not known in those days, Fisher and his colleagues constructed a theoretical formal edifice which drew upon patterns of inheritance that were evident in lineages of organisms that they could observe around them (Fisher had a mouse colony which he utilized now and then to vent his anger by crushing mice with his bare hands). Upon that observational scaffold they placed a sturdy superstructure of mathematical formality. That edifice has been surprisingly robust down to the present day.

One of Fisher’s frameworks which still gives insight is the geometric model of the distribution of fitness of mutations. If an organism is near its optimum of fitness, then large jumps in any direction will reduce its fitness. In contrast, small jumps have some probability of getting closer to the optimum of fitness. In plainer language, mutations of large effect are bad, and mutations of small effect are not as bad.
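The geometric model is concrete enough to simulate (my own sketch; the distance d = 1, dimensionality n = 10, and step sizes are toy choices): place the organism at distance d from the optimum in n trait dimensions, fire a mutation of magnitude r in a uniformly random direction, and ask whether it lands closer to the optimum.

```python
import math
import random

random.seed(4)

def p_beneficial(r, d=1.0, n=10, trials=20_000):
    """Estimate the probability that a mutation of size r moves the phenotype closer to the optimum."""
    hits = 0
    for _ in range(trials):
        # uniform random direction: normalize n independent Gaussians to length r
        g = [random.gauss(0, 1) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in g))
        step = [r * x / norm for x in g]
        # start at (d, 0, ..., 0) with the optimum at the origin
        new_sq = (d + step[0]) ** 2 + sum(x * x for x in step[1:])
        if new_sq < d * d:
            hits += 1
    return hits / trials

for r in (0.1, 0.5, 1.0):
    print(r, round(p_beneficial(r), 3))
```

Small mutations are beneficial nearly half the time, while large ones almost never are, and raising the dimensionality n (one crude proxy for organismal complexity) makes every step size worse, which is the intuition the paper below leans on.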

A new paper in PNAS loops back to this framework, Determining the factors driving selective effects of new nonsynonymous mutations:

Our study addresses two fundamental questions regarding the effect of random mutations on fitness: First, do fitness effects differ between species when controlling for demographic effects? Second, what are the responsible biological factors? We show that amino acid-changing mutations in humans are, on average, more deleterious than mutations in Drosophila. We demonstrate that the only theoretical model that is fully consistent with our results is Fisher’s geometrical model. This result indicates that species complexity, as well as distance of the population to the fitness optimum, modulated by long-term population size, are the key drivers of the fitness effects of new amino acid mutations. Other factors, like protein stability and mutational robustness, do not play a dominant role.

In the title of the paper itself is something that would have been alien to Fisher’s understanding when he formulated his geometric model: the term “nonsynonymous” to refer to mutations which change the amino acid corresponding to the triplet codon. The paper is understandably larded with terminology from the post-DNA and post-genomic era, and yet comes to the conclusion that a nearly blind statistical geneticist from about a century ago correctly adduced the nature of mutation’s effects on fitness in organisms.

The authors focused on two primary species with different histories, but which are well characterized in the evolutionary genomic literature: humans and Drosophila. The models they tested are as follows:


Basically they checked the empirical distribution of the site frequency spectra (SFS) of the nonsynonymous variants against expected outcomes based on particular details of demographics, which were inferred from synonymous variation. Drosophila have effective population sizes orders of magnitude larger than humans, so if that is not taken into account, then the results will be off. There are also a bunch of simulations in the paper to check for robustness of their results, and they also caveat the conclusion with admissions that other models besides the Fisherian one may play some role in their focal species, and more in other taxa. A lot of this strikes me as accruing through the review process, and I don’t have the time to replicate all the details to confirm their results, though I hope some of the reviewers did so (again, I suspect that the reviewers were demanding some of these checks, so they definitely should have in my opinion).

In the Fisherian model more complex organisms are more fine-tuned due to pleiotropy and other such dynamics. So new mutations are more likely to deviate away from the optimum. This is the major finding that they confirmed. What does “complex” mean? The Drosophila genome is less than 10% the size of the human genome, but the migratory locust has a genome twice as large as ours, while wheat’s sequence is more than five times as large. But organism to organism, it does seem that Drosophila is less complex than humans. And they checked with other organisms besides their two focal ones…though the genomes there are presumably not as complete.

As I indicated above, the authors believe they’ve checked for factors such as background selection, which may confound selection coefficients on specific mutations. The paper is interesting as much for the fact that it illustrates how powerful analytic techniques developed in a pre-DNA era were. Some of the models above are mechanistic, and require a certain understanding of the nature of molecular processes. And yet they don’t seem as predictive as a more abstract framework!

Citation: Christian D. Huber, Bernard Y. Kim, Clare D. Marsden, and Kirk E. Lohmueller, Determining the factors driving selective effects of new nonsynonymous mutations PNAS 2017 ; published ahead of print April 11, 2017, doi:10.1073/pnas.1619508114