The logic of human destiny was inevitable 1 million years ago

Robert Wright’s best book, Nonzero: The Logic of Human Destiny, was published nearly 20 years ago. At the time I was moderately skeptical of his thesis. It was too teleological for my tastes. And, it does pander to a bias in human psychology whereby we look to find meaning in the universe.

But this is 2017, and I have somewhat different views.

In the year 2000 I broadly accepted the thesis outlined a few years later in The Dawn of Human Culture. That our species, our humanity, evolved and emerged in rapid sequence, likely due to biological changes of a radical kind, ~50,000 years ago. This is the thesis of the “great leap forward” of behavioral modernity.

Today I have come closer to models proposed by Michael Tomasello in The Cultural Origins of Human Cognition and Terrence Deacon in The Symbolic Species: The Co-evolution of Language and the Brain. Rather than a punctuated event, an instant in geological time, the emergence of humanity as we understand it was a gradual process, driven by general dynamics and evolutionary feedback loops.

The conceit at the heart of Robert J. Sawyer’s often overly preachy Neanderthal Parallax series, that if our own lineage went extinct but theirs did not they would have created a technological civilization, is I think in the main correct. It may not be entirely coincidental that hyper-drive cultural flexibility evolved in African modern humans first. There may have been sufficient biological differences to make this likely. But I believe that if African modern humans were removed from the picture Neanderthals would have “caught up” and been positioned to begin the trajectory we find ourselves in during the current Holocene inter-glacial.

Luke Jostins’ figure showing across-the-board encephalization

The data indicate that all human lineages were subject to increased encephalization. That process trailed off ~200,000 years ago, but it illustrates the general evolutionary pressures, ratchets, or evolutionary “logic”, that applied to all of them. Overall there were some general trends in the hominin lineage that began to characterize us about a million years ago. We pushed into new territory. Our rate of cultural change seems to have gradually increased across our whole range.

One of the major holy grails I see now and then in human evolutionary genetics is to find “the gene that made us human.” The scramble is definitely on now that more and more whole genome sequences from ancient hominins are coming online. But I don’t think any such gene will ever be found. There isn’t “a gene,” but a broad set of genes which were gradually selected upon in the process of making us human.

In the lingo, it wasn’t just a hard sweep from a de novo mutation. It was as much, or even more, soft sweeps from standing variation.

Aryan marauders from the steppe came to India, yes they did!

It seems every post on Indian genetics elicits dissents from loquacious commenters who are woolly on the details of the science, but convinced of their opinions (yes, they operate through uncertainty and obfuscation in their rhetoric, but you know where the axe is lodged). This post is an attempt to answer some questions so I don’t have to address this in the near future, as ancient DNA papers will finally start to come out soon, I hope (at least earlier than Winds of Winter).

In 2001’s The Eurasian Heartland: A continental perspective on Y-chromosome diversity Wells et al. wrote:

The current distribution of the M17 haplotype is likely to represent traces of an ancient population migration originating in southern Russia/Ukraine, where M17 is found at high frequency (>50%). It is possible that the domestication of the horse in this region around 3,000 B.C. may have driven the migration (27). The distribution and age of M17 in Europe (17) and Central/Southern Asia is consistent with the inferred movements of these people, who left a clear pattern of archaeological remains known as the Kurgan culture, and are thought to have spoken an early Indo-European language (27, 28, 29). The decrease in frequency eastward across Siberia to the Altai-Sayan mountains (represented by the Tuvinian population) and Mongolia, and southward into India, overlaps exactly with the inferred migrations of the Indo-Iranians during the period 3,000 to 1,000 B.C. (27). It is worth noting that the Indo-European-speaking Sourashtrans, a population from Tamil Nadu in southern India, have a much higher frequency of M17 than their Dravidian-speaking neighbors, the Yadhavas and Kallars (39% vs. 13% and 4%, respectively), adding to the evidence that M17 is a diagnostic Indo-Iranian marker. The exceptionally high frequencies of this marker in the Kyrgyz, Tajik/Khojant, and Ishkashim populations are likely to be due to drift, as these populations are less diverse, and are characterized by relatively small numbers of individuals living in isolated mountain valleys.

In a 2002 interview with the Indian site Rediff, the first author was more explicit:

Some people say Aryans are the original inhabitants of India. What is your view on this theory?

The Aryans came from outside India. We actually have genetic evidence for that. Very clear genetic evidence from a marker that arose on the southern steppes of Russia and the Ukraine around 5,000 to 10,000 years ago. And it subsequently spread to the east and south through Central Asia reaching India. It is on the higher frequency in the Indo-European speakers, the people who claim they are descendants of the Aryans, the Hindi speakers, the Bengalis, the other groups. Then it is at a lower frequency in the Dravidians. But there is clear evidence that there was a heavy migration from the steppes down towards India.

But some people claim that the Aryans were the original inhabitants of India. What do you have to say about this?

I don’t agree with them. The Aryans came later, after the Dravidians.

Over the past few years I’ve gotten to know the above first author Spencer Wells as a personal friend, and I think he would be OK with me relaying that to some extent he was under strong pressure to downplay these conclusions. Not only were, and are, these views not popular in India, but the idea of mass migration was in bad odor in much of the academy during this period. Additionally, there was later work which was less clear, and perhaps supported an Indian origin for R1a1a. Spencer himself told me that it was not impossible for R1a to have originated in India, but a branch eventually back-migrated to southern Asia.

But even researchers from the group at Stanford where he had done his postdoc did not support this model by the middle 2000s; see Polarity and Temporality of High-Resolution Y-Chromosome Distributions in India Identify Both Indigenous and Exogenous Expansions and Reveal Minor Genetic Influence of Central Asian Pastoralists. In 2009 a paper out of an Indian group was even stronger in its conclusion for a South Asian origin of R1a1a: The Indian origin of paternal haplogroup R1a1* substantiates the autochthonous origin of Brahmins and the caste system.

By 2009 one might have admitted that perhaps Spencer was wrong. I was certainly open to that possibility. There was very persuasive evidence that the mtDNA lineages of South Asia had little to do with Europe or the Middle East.

Yet a closer look at the above papers reveals two major systematic problems.

First, ancient DNA has made it clear that there has been major population turnover during the Holocene, but this was not the null hypothesis in the 2000s. Looking at extant distributions of lineages can give one a distorted view of the past. Frankly, the 2009 Indian paper was egregious in this way because they included Turkic groups in their Central Asian data set. Even in 2009 there was a whole lot of evidence that Central Asian Turkic groups were likely very different from Indo-European Turanian populations which would have been the putative ancestors of Indo-Aryans. Honestly the authors either consciously loaded the dice to reduce the evidence for gene flow from Central Asia, or they were ignorant (the nature of the samples is much clearer in the supplements than the primary text for what it’s worth).

Second, Y chromosomal marker sets in the 2000s were constrained to fast mutating microsatellite regions or fewer than 100 variant SNPs on the Y. Because it is so repetitive the Y chromosome is hard to sequence, and it really took the technologies of the last ten years to get it done. Both the above papers estimate the coalescence of extant R1a1a lineages to be 10-15,000 years before the present. In particular, they suggest that European and South Asian lineages date back to this period, pushing back any possible connection between the groups, and making it possible that European R1a1a descended from a South Asian founder group which was expanding after the retreat of the ice sheets. The conclusions were not unreasonable based on the methods they had. But now we have better methods.*

Whole genome sequencing of the Y, as well as ancient DNA, seems to falsify the above dates. Though microsatellites are good for very coarse grain phylogenetic inferences, one has to be very careful about them when looking at more fine grain population relationships (they are still useful in forensics to cheaply differentiate between individuals, since they accumulate variation very quickly). They mutate fast, and their clock may be erratic.

Additionally, diversity estimates were based on a subset of SNPs that were clearly not robust. R1a1a is not diverse anywhere, though basal lineages seem to be present in ancient DNA on the Pontic steppe in some cases.

To show how lacking in diversity R1a1a is, here are the results of a 2016 paper which performed whole genome sequencing on the Y. Instead of relying on the order of 10 to 100 SNPs, this paper discovered over 65,000 Y variants worldwide. Notice how little difference there is between different South Asian groups below, indicative of a massive population expansion relatively recently in time which didn’t even have time to exhibit regional population variation. They note that “The most striking are expansions within R1a-Z93 [the South Asian clade], ~4.0–4.5 kya. This time predates by a few centuries the collapse of the Indus Valley Civilization, associated by some with the historical migration of Indo-European speakers from the western steppes into the Indian sub-continent.”


Oxford Nanopore finally giving hope to biologist’s dreams

I don’t talk too much about genomic technology because it changes so fast. Being up-to-date on the latest machines and tools often requires regular deep-dives right now, though I believe at some point technological improvements will plateau as the data returned will be cheap and high quality enough that there won’t be much to gain on the margin.

Of course we’ve already come a long way. Fifteen years ago a “whole human genome” cost on the order of billions of dollars. Today a high quality whole human genome will run you on the order of $1,000. This is fundamentally a technology driven change, with big metal machines automatically generating reads and powerful computers to process them. One couldn’t imagine such a scenario 30 years ago because the technology wasn’t there.

I’ve stated before that I don’t think genomics fundamentally alters what we know and understand about evolution. At least so far. But it is a huge change in the domain of medicine. Clearly the human genomicists, especially Francis Collins, overhyped the yield of the technology in relation to healthcare in the 2000s. But with cheap and ubiquitous sequencing we may see the end of Mendelian diseases in our lifetime (through screening and possibly at some point CRISPR therapy).

This has been driven by technological innovation in the private sector around a few firms. The famous chart showing the massive decline in the cost of genomic sequencing over the past 15 years is due in large part to the successes of Illumina. But, Illumina has also had a quasi-monopoly on the field over the past five years (or more), and that shows with the leveling off of the decline in cost. Until the past year….

What gives? Many people believe that Illumina is moving again in part because a genuine challenger is emerging, or at least the flicker of a challenge, in the form of Oxford Nanopore. Oxford Nanopore has been around since 2005, but it really came into the public eye around 2010 or so. But like many tech companies it overpromised in the early years. I remember skeptically listening to a friend in the fall of 2011 talk about how quickly Nanopore was going to change the game…. I didn’t put too much stock into these sorts of presentations to hopeful researchers because I remember Pacific Biosciences making the same sort of pitch to amazed biologists in 2008. Pac Bio is still around, but has turned out to be a bit player, rather than a challenger to Illumina.

But I have to admit that Nanopore has really started to step up its game of late. Probably one of the major things it has accomplished is that it’s made us reimagine what sequencing technology should look like. Rather than refrigerators of various sizes, Oxford Nanopore allows us to imagine sequencing technology which exhibits a form factor more analogous to a USB thumb drive. The first time I saw a Nanopore machine in the flesh I knew intellectually what I was going to see…but because of my deep intuitions I still overlooked the two Nanopore machines lying on the workbench in front of me.

Despite their amazing form factor, these early Nanopore machines had limited application. They didn’t generate much data, and so were utilized by researchers who worked with smaller genomes. Scientists who worked with bacteria seem to have been using them a lot, for example. Additionally the machines were error prone and people were working out their kinks in real time in laboratories (one tech told me early on they were so small that he swore they were affected by ambient vibrations so he found ways to dampen that source of error).

A new preprint suggests we may be turning the corner though, Nanopore sequencing and assembly of a human genome with ultra-long reads:

Nanopore sequencing is a promising technique for genome sequencing due to its portability, ability to sequence long reads from single molecules, and to simultaneously assay DNA methylation. However until recently nanopore sequencing has been mainly applied to small genomes, due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/Ceph human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (~30x theoretical coverage) from 39 flowcells. De novo assembly yielded a highly complete and contiguous assembly (NG50 ~3Mb). We observed considerable variability in homopolymeric tract resolution between different basecallers. The data permitted sensitive detection of both large structural variants and epigenetic modifications. Further we developed a new approach exploiting the long-read capability of this system and found that adding an additional 5x-coverage of “ultra-long” reads (read N50 of 99.7kb) more than doubled the assembly contiguity. Modelling the repeat structure of the human genome predicts extraordinarily contiguous assemblies may be possible using nanopore reads alone. Portable de novo sequencing of human genomes may be important for rapid point-of-care diagnosis of rare genetic diseases and cancer, and monitoring of cancer progression. The complete dataset including raw signal is available as an Amazon Web Services Open Dataset at:

30x just means that each base is sampled about 30 times on average, so that you have a very accurate and precise read on its state. 30x has become the default standard in medical genomics. If Nanopore can do 30x on human genomes at reasonable cost it won’t be a niche player much longer.
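To make the coverage figure concrete, here is a minimal back-of-the-envelope sketch using the numbers quoted in the abstract above. The genome size is an assumed round figure of about 3 Gb, and the function name is just illustrative; this is arithmetic, not how any sequencing pipeline actually reports depth.

```python
# Rough depth-of-coverage arithmetic from the figures quoted in the abstract.
# GENOME_SIZE_BP is an assumed round number (~3 Gb), not a value from the preprint.
GENOME_SIZE_BP = 3.0e9


def mean_coverage(total_bases: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Average number of times each base in the genome is sampled."""
    return total_bases / genome_size


total_bases = 91.2e9  # 91.2 Gb of sequence data, per the abstract
flowcells = 39        # number of flowcells, per the abstract

coverage = mean_coverage(total_bases)
per_flowcell_gb = total_bases / flowcells / 1e9

print(f"mean coverage: ~{coverage:.1f}x")
print(f"average yield per flowcell: ~{per_flowcell_gb:.2f} Gb")
```

The result lands right around the "~30x theoretical coverage" the authors report. Real coverage is uneven across the genome, so the mean is only a rough guide to how well any particular base is sampled.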

The read length is important because last I checked the human genome still had large holes in it. The typical Illumina machine produces average read lengths in the low hundreds of base pairs. If you have large repetitive regions of the human genome (and you do have these), you’re never going to span them with such short yardsticks. Additionally, these short reads have to be tiled together when you assemble a genome from raw results, and this is a computationally really intensive task. It’s good when you have a reference genome you can align to as a scaffold. But researchers who don’t work on humans or model organisms may not have a good reference genome, or in many cases a reference genome at all.

Pac Bio occupies a space where it provides really long reads for a high price point. Most of the time this isn’t necessary, but imagine you work on a disease which is caused by large repetitive regions. You are likely willing to pay the price that is asked. And because Pac Bio generates very long reads it makes de novo assembly much easier, as your algorithm has to tile together far fewer contiguous sequences, and long sequences are less likely to have lots of repetitive matches in the genome.

But Pac Bio machines are expensive and huge. The abstract above alludes to “Portable de novo sequencing of human genomes.” This is a huge deal. The dream, as whispered by some genomicists I have known, is that at a point in the future biologists would carry portable sequencers which would produce very long reads so that they could de novo assemble sequences on the spot. A concrete example might be a health inspector checking on the sorts of microbes found on the counter of a restaurant, or a field ecologist who might sample various fungi to discover cryptic species.

Obviously this is still a dream. The preprint above makes it clear that to do what they did required a lot of novel techniques and development of new tools. This isn’t beta technology, it’s early alpha. But because it’s 2017 the outlines of the dream are coming into public view.

Citation: Nanopore sequencing and assembly of a human genome with ultra-long reads
Miten Jain, Sergey Koren, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes, Sunir Malla, Hannah Marriott, Karen H Miga, Tom Nieto, Justin O’Grady, Hugh E Olsen, Brent S Pedersen, Arang Rhie, Hollian Richardson, Aaron Quinlan, Terrance P Snutch, Louise Tee, Benedict Paten, Adam M. Phillippy, Jared T Simpson, Nicholas James Loman, Matthew Loose
bioRxiv 128835; doi:

Mouse fidelity comes down to the genes

While birds tend to be at least nominally monogamous, this is not the case with mammals. This strikes some people as strange because humans seem to be monogamous, at least socially, and often we take ourselves to be typically mammalian. But of course we’re not. Like many primates we’re visual creatures, rather than relying on smell and hearing. Obviously we’re also bipedal, which is not typical for mammals. And, our sociality scales up to massive agglomerations of individuals.

How monogamous we are is up for debate. Desmond Morris, who is well known to many from his roles in television documentaries, has been a major promoter of the idea that humans are monogamous, with a focus on pair-bonds. In contrast, other researchers have highlighted our polygamous tendencies. In The Mating Mind Geoffrey Miller argues for polygamy, and suggests that pair-bonds in a pre-modern environment were often temporary, rather than lifetime (Miller is now writing a book on polyamory).

The fact that in many societies high status males seem to engage in polygamy, despite monogamy being more common, is one phenomenon which confounds attempts to quickly generalize about the disposition of our species. What is preferred may not always be what is practiced, and norms that are adhered to in public may be routinely violated in private.

Adducing behavior is simpler in many other organisms, because their range of behavior is more delimited. When it comes to studying mating patterns in mammals voles have long been of interest as a model. There are vole species which are monogamous, and others which are not. Comparing the diverged lineages could presumably give insight as to the evolutionary genetic pathways relevant to the differences.

But North American deer mice, Peromyscus, may turn out to be an even better bet: there are two lineages which exhibit different mating patterns and which are phylogenetically close enough that they can interbreed. That is crucial, because it allows one to generate crosses and see how the characteristics distribute themselves across subsequent generations. Basically, it allows for genetic analysis.

And that’s what a new paper in Nature does, The genetic basis of parental care evolution in monogamous mice. In figure 3 you can see the distribution of behaviors in parental generations, F1 hybrids, and the F2, which is a cross of F1 individuals. The widespread distribution of F2 individuals is likely indicative of a polygenic architecture of the traits. Additionally, they found that some traits are correlated with each other in the F2 generation (probably due to pleiotropy, the same gene having multiple effects), while others were independent.

With the F2 generation they ran a genetic analysis which looked for associations between traits and regions of the genome. They found 12 quantitative trait loci (QTLs), basically zones of the genome associated with variation on one or more of the six traits. From this analysis they immediately realized there was sexual dimorphism in terms of the genetic architecture; the same locus might have a different effect in the opposite sex. This is evolutionarily interesting.

Because the QTLs are rather large in terms of physical genomic units the authors looked to see which were plausible candidates in terms of function. One of their hits was vasopressin, which should be familiar to many from vole work, as well as some human studies. Though the QTL work as well as their pup-switching experiment (which I did not describe) is persuasive, the fact that a gene you’d expect shows up as a candidate really makes it an open and shut case.

The extent of the variation explained by any given QTL seems modest. In the extended figures you can see it’s mostly in the 1 to 5 percent range. In Carl Zimmer’s excellent write up he ends:

But Dr. Bendesky cautioned that the vasopressin gene would probably turn out to be just one of many that influence oldfield mice. Though it is strongly linked to parental behavior, the vasopressin gene accounts for 6.7 percent of the variation in nest building among males, and only 2.9 percent among females.

The genetic landscape of human parenting will turn out to be even more rugged, Dr. Bendesky predicted.

“You cannot do a 23andMe test and find out if your partner is going to be a good father,” he said.

Sort of. The genetic architecture above is polygenic…but not incredibly diffuse. The proportion of variation explained by the largest effect allele is more than for height, and far more than for education. If human research follows up on this, I wouldn’t be surprised if you could develop a polygenic risk score.

But I don’t have a good intuition on how much heritable variation in humans there really is for these sorts of traits. I assume some. But I don’t know how much. And how much of the variance in behavior might be explained by human QTLs? Humans don’t lick or build nests, or retrieve pups. Also, as one knows from Genetics and Analysis of Quantitative Traits, sexually dimorphic traits take a long time to evolve. These are two deer mice species. Within humans there may not have been enough time for this sort of heritable complexity of behavior to evolve.

There are a lot of philosophical issues here about translating to a human context.

Nevertheless, this research shows that ingenious animal models can powerfully elucidate the biological basis of behavior.

Citation: The genetic basis of parental care evolution in monogamous mice. Nature (2017) doi:10.1038/nature22074

Genetic variation in human populations and individuals

I’m old enough to remember when we didn’t have a good sense of how many genes humans had. I vaguely recall numbers around 100,000 at first, which in hindsight seems rather like a round and large number. A guess. Then it went to 40,000 in the early 2000s and then further until it converged to some number just below 20,000.

But perhaps more fascinating is that we have a much better catalog of the variation across the whole human genome now. Often friends ask me questions of the form: “so DTC genomic company X has about 800,000 SNPs, is that enough to do much?” To answer such a question you need some basic numbers in your head, as well as what you want to “do.”

First, the human genome has about 3 billion base pairs (3 Gb). That’s a lot. But most of the genome famously doesn’t code for proteins. The exome, the proportion of the genome where bases directly translate into a protein, accounts for 1% of the whole genome. That’s 30 million bases (30 Mb). But this small region of the genome is very important, as the vast majority of major disease mutations are found in the exome.

When it comes to a standard 800K SNP chip, which samples 800,000 positions across the 3 Gb genome, it is likely that the designers enriched the marker set for functional positions relevant to diseases. Not all marker positions are created equal. Though even outside of those functional positions there are often nearby SNPs that can “tag” them, so you can infer one from the state of the other.

But are 800,000 positions enough to make good ancestry inference? (to give one example) Yes. 800,000 is actually a substantial proportion of the polymorphism in any given genome. There have been some papers which improved on the numbers in 2015’s A global reference for human genetic variation, but it’s still a good comprehensive review to get an order-of-magnitude sense. The table below gives you a sense of individual variation:

Median autosomal variant sites per genome

When it comes to single nucleotide polymorphisms (SNPs), what SNP chips are getting at, an 800K array should get a substantial proportion of your genome-wide variation. More than enough for ancestry inference or forensics. The singleton column shows mutations specific to the individual. When focusing on new mutations specific to an individual that might cause disease, singleton large deletions and nonsynonymous SNPs are really where I’d look.

But what about whole populations? The plot to the left shows the count of variants as a function of alternative allele frequency. When we say “SNP”, we really mean variants which exhibit polymorphism at a particular cut-off frequency for the minor allele (often 1%). It is clear that as the minor allele frequency increases in relation to the human reference genome the number of variants decreases.

From the paper:

The majority of variants in the data set are rare: ~64 million autosomal variants have a frequency <0.5%, ~12 million have a frequency between 0.5% and 5%, and only ~8 million have a frequency >5% (Extended Data Fig. 3a). Nevertheless, the majority of variants observed in a single genome are common: just 40,000 to 200,000 of the variants in a typical genome (1–4%) have a frequency <0.5% (Fig. 1c and Extended Data Fig. 3b). As such, we estimate that improved rare variant discovery by deep sequencing our entire sample would at least double the total number of variants in our sample but increase the number of variants in a typical genome by only ~20,000 to 60,000.

An 800K SNP chip will be biased toward the 8 million or so variants with a frequency above 5%. This number gives you a sense of the limited scope of variation in the human genome. Those 8 million sites amount to about 0.27% of the genome, yet they capture a lot of the polymorphism.
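For anyone who wants to check the orders of magnitude in this post, the arithmetic can be sketched in a few lines. All inputs are the rounded figures quoted above (the frequency-bin counts are the paper's approximate "~" values), and the variable names are my own, purely for illustration.

```python
# Order-of-magnitude bookkeeping with the rounded figures quoted in this post.
GENOME_BP = 3_000_000_000   # ~3 Gb human genome
EXOME_PERCENT = 1           # exome is roughly 1% of the genome
CHIP_SNPS = 800_000         # positions on a typical "800K" SNP chip

# Approximate autosomal variant counts by frequency bin, from the quoted passage.
variants_by_frequency = {
    "<0.5%": 64_000_000,
    "0.5-5%": 12_000_000,
    ">5%": 8_000_000,
}

exome_bp = GENOME_BP * EXOME_PERCENT // 100
total_variants = sum(variants_by_frequency.values())
common_variants = variants_by_frequency[">5%"]

common_share_of_genome = common_variants / GENOME_BP  # fraction of all bases
chip_share_of_common = CHIP_SNPS / common_variants    # fraction of common sites
rare_share_of_variants = variants_by_frequency["<0.5%"] / total_variants

print(f"exome size: {exome_bp / 1e6:.0f} Mb")
print(f"common variants as share of genome: {common_share_of_genome:.2%}")
print(f"800K chip as share of common variants: {chip_share_of_common:.0%}")
print(f"rare variants as share of all variants: {rare_share_of_variants:.0%}")
```

The chip-versus-common-variant ratio is an upper bound on what the chip samples directly, since it assumes every chip position is a common variant. In practice tagging through linkage disequilibrium means a chip can capture more than it genotypes.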

Citation: 1000 Genomes Project Consortium. “A global reference for human genetic variation.” Nature 526.7571 (2015): 68-74.

Why humans have so many pulse admixtures

The Blank Slate is one of my favorite books (though I’d say The Language Instinct is unjustly overshadowed by it). There is obviously a substantial biological basis in human behavior which is mediated by genetics. When The Blank Slate came out in the early 2000s one could envisage a situation in 2017 when empirically informed realism dominated the intellectual landscape. But that was not to be. In many ways, for example in sex differences, we’ve gone backward, while there is still undue overemphasis in our society on the environmental impact parents have on children (as opposed to society more broadly).

But genes do not determine everything, obviously. Several years after reading The Blank Slate I read Not by Genes Alone: How Culture Transformed Human Evolution. In this work Peter Richerson and Robert Boyd outline their decades long project of modeling cultural variation and evolution formally in a manner reminiscent of biological evolution. Richerson and Boyd’s program does not start from a “blank slate” assumption. Rather, it is focused on broad macro-social dynamics where cultural variation “swamps” out biological variation.

Recall that in classic population genetic theory a major problem with group level selection is that gene flow between adjacent groups quickly removes between group variation. One migrant between two groups per generation is enough for them not to diverge genetically. For group selection to occur the selective effect has to be very strong or the between group difference has to be very high. Rather than talking about genetics though, where the debate is still live, and the majority consensus is still that biological group selection is not that common (depending on how you define it), let’s talk about human culture.
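The "one migrant per generation" rule of thumb comes out of Wright's island model. Here is a minimal sketch, assuming the standard textbook approximation Fst ≈ 1/(4Nm + 1) for neutral diploid loci at equilibrium with low migration rates; the function name is my own, and the values of Nm are arbitrary illustrations.

```python
def equilibrium_fst(migrants_per_generation: float) -> float:
    """Wright's island-model approximation: Fst ~ 1 / (4*N*m + 1).

    N*m is the effective number of migrants entering a deme each generation.
    Assumes neutrality, diploidy, many demes, and drift-migration equilibrium.
    """
    return 1.0 / (4.0 * migrants_per_generation + 1.0)


# Illustrative values of N*m, chosen arbitrarily for the table.
for nm in (0.1, 0.5, 1.0, 5.0, 10.0):
    print(f"Nm = {nm:>4}: expected Fst ~ {equilibrium_fst(nm):.3f}")
```

Under these assumptions between-group differentiation falls off quickly once Nm reaches about one and keeps shrinking with more migration, which is why gene flow is such a problem for group selection on genes. Culture is not bound by this arithmetic.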

Here the group level differences are extreme and the boundaries can be sharp. Historically it seems likely that most groups which were adjacent to each other looked rather similar because of gene flow and similar selective pressures. Even though in medieval Spain there was a generality, probably true, that Muslims were swarthier than Christians*, there was a palpable danger in battle of failing to distinguish friend from foe because the two groups overlapped too much in appearance.

This brings up how one might delineate differences culturally. In battle opposing armies wear distinct uniforms and colors so that the distinction can be made. But obviously one can change uniforms surreptitiously (perhaps taking the garb from the enemy dead). This is why physical adornments such as tattoos are useful, as they are “hard to fake.” Perhaps the clearest illustration of this dynamic is the Biblical story for the origin of the term shibboleth. Even slight differences in accent are clear to all, and often difficult to mimic once in adulthood.

Biological evolution mediated through genes is relatively slow and constrained compared to cultural evolution. Whole regions of central and northern Europe shifted from adherence to Roman Catholicism to forms of Protestantism on the order of 10 years. Of course religion is an aspect of culture where change can happen very rapidly, but even language shifts can occur in only a few generations (e.g., the decline of regional German and Italian dialects in the face of standard forms of the language).

Cultural evolution as a formally modeled neofunctionalism is credibly outlined in works such as Peter Turchin’s Ultrasociety: How 10,000 Years of War Made Humans the Greatest Cooperators on Earth. That’s not what I want to focus on here. Rather, I contend that the reality of massive pulse admixtures evident in the human genome over the past 10,000 years, at minimum, is a function of the fact that human cultural evolutionary processes result in winner-take-all genetic consequences.

A concrete example of what I’m talking about would compare the peoples of the Italian peninsula and the Iberian peninsula around 1500. The two populations are not that different genetically, and up to that point shared many cultural traits (and continue to do so). But a combination of geography and history resulted in Iberian demographic expansion in the several hundred years after 1500, whereby today there are probably many more descendants of Iberians than Italians. This is not a function of any deep genetic difference between the two groups. There aren’t deep genetic differences in fact. Rather, the social and demographic forces which propelled Iberia to imperial status redounded upon the demographic production of Iberians in the future. In addition, the New World underwent a massive pulse admixture between Iberians and native Amerindians, as well as Africans, usually brought over as slaves, due to the cultural and political history of the period.

The pulse admixture question is rather interesting academically. To some extent current methods are biased toward detection of pulse admixtures, and even fit continuous gene flow as pulse admixtures. A rapid exchange of genes, followed by recombination breaking apart associations between ancestrally informative markers, is something you can test for. But I think we can agree that the gene flow triggered by the Columbian Exchange was a pulse admixture, and there’s too much concurrent evidence from uniparental lineage turnover in the ancient DNA to dismiss the non-historically corroborated signatures of pulses as simply artifacts.
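To make the "something you can test for" concrete, here is a minimal sketch in Python. It assumes the textbook single-pulse expectation that admixture-induced association between two markers decays roughly as exp(-t × d), where t is generations since the pulse and d is the genetic distance in Morgans. The function names and the 20-generation figure are hypothetical, chosen only for illustration; this is not the procedure of any particular software package.

```python
import math

def expected_admixture_ld(distance_morgans, generations, amplitude=1.0):
    """Expected admixture LD under a single-pulse model: A * exp(-t * d)."""
    return amplitude * math.exp(-generations * distance_morgans)

def estimate_generations(distances, ld_values):
    """Least-squares slope of log(LD) against distance; the slope is -t."""
    logs = [math.log(v) for v in ld_values]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    cov = sum((d - mean_d) * (l - mean_l) for d, l in zip(distances, logs))
    var = sum((d - mean_d) ** 2 for d in distances)
    return -cov / var

# Hypothetical pulse 20 generations ago (an assumed value, not an estimate)
distances = [0.005 * i for i in range(1, 41)]  # 0.5 cM to 20 cM
curve = [expected_admixture_ld(d, generations=20) for d in distances]
recovered = estimate_generations(distances, curve)
```

Continuous gene flow would instead produce a mixture of such exponentials, and a single-exponential fit can still return a plausible-looking date, which is one way to see why continuous flow can end up fitted as a pulse.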

Nevertheless continuous gene flow does occur. That is, normal exchange of individuals between neighboring demes as a slow simmer over time. But the idea that we are a clinal ring species or something like that isn’t right in my opinion. Part of the story is strong geographical barriers. But another major part is that cultural revolutions and advantages introduce huge short-term demographic advantages to particular groups, and the shake-out of inter-group competition can be dramatic.

Therefore, I make a prediction: the more cultural evolutionary dynamics a species is subject to, the more pulse admixture you’ll be able to detect. For example, pulse admixture should be more important in social insects than their solitary relatives.

* Not only was some of the ancestry of Muslims North African, Muslim rule was longest in the southern and southeastern regions, where people were not as fair as in the north.

Sex bias in migration from the steppe (revisited)

Last fall I blogged a preprint which eventually came out as a paper in PNAS, Ancient X chromosomes reveal contrasting sex bias in Neolithic and Bronze Age Eurasian migrations. The upshot is that the authors found that there was far less steppe ancestry on the X chromosomes of Bronze Age Central Europeans than across the whole genome. The natural inference here is that you had migrations of males into territory where they had to find local wives.
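The logic of that inference can be sketched with a little arithmetic. Women carry two X chromosomes and men one, so mothers contribute two-thirds of the X chromosomes in the next generation but only half of the autosomes. The Python sketch below uses that simple weighting; it is an illustrative approximation with made-up fractions, not the model fitted in the paper.

```python
def expected_ancestry(female_fraction, male_fraction):
    """Return (autosomal, X) ancestry expected from a source population.

    female_fraction: share of the admixed population's mothers from the source
    male_fraction:   share of the admixed population's fathers from the source
    """
    autosomal = (female_fraction + male_fraction) / 2
    # Two of every three X chromosomes are transmitted by mothers
    x_chrom = (2 * female_fraction + male_fraction) / 3
    return autosomal, x_chrom

# Hypothetical scenarios with the same 30% autosomal steppe ancestry
auto_even, x_even = expected_ancestry(0.30, 0.30)  # no sex bias
auto_male, x_male = expected_ancestry(0.05, 0.55)  # strongly male-biased
```

With no sex bias the X and autosomal ancestry agree; with male-biased migration the X chromosome shows less ancestry from the source than the genome as a whole, which is the pattern the authors reported.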

But the story does not end there. Iosif Lazaridis and David Reich have put out a short note on bioRxiv, Failure to Replicate a Genetic Signal for Sex Bias in the Steppe Migration into Central Europe. It’s short, so I suggest you read the note yourself, but the major issue seems to be that on X chromosomes ADMIXTURE in supervised mode seems to behave really strangely. Lazaridis and Reich find that there seems to be a downward bias of steppe ancestry. Ergo, the finding was an artifact.

Goldberg et al. almost immediately responded, Reply To Lazaridis And Reich: Robust Model-Based Inference Of Male-Biased Admixture During Bronze Age Migration From The Pontic-Caspian Steppe. Their response seems to be that yes, ADMIXTURE does behave strangely, but the overall finding is still robust.

With these uncertainties I do wonder if it’s hard at this point to evaluate the alternative models. But, we do have archaeology and mtDNA. What do those say? On that basis, from what little I know, I am inclined to suspect a strong male bias of migration.

Citation: Reply To Lazaridis And Reich: Robust Model-Based Inference Of Male-Biased Admixture During Bronze Age Migration From The Pontic-Caspian Steppe, Amy Goldberg, Torsten Gunther, Noah A Rosenberg, Mattias Jakobsson, bioRxiv 122218; doi:

Citation: Failure to Replicate a Genetic Signal for Sex Bias in the Steppe Migration into Central Europe, Iosif Lazaridis, David Reich, bioRxiv 114124; doi:

How Tibetans can function at high altitudes

About seven years ago I wrote two posts about how Tibetans manage to function at very high altitudes. And it’s not just physiological functioning, that is, fitness straightforwardly understood. High altitudes can cause a sharp reduction in reproductive fitness because women cannot carry pregnancies to term. In other words, high altitude is a very strong selection pressure. You adapt, or you die off.

For me there have been two things of note since those original papers came out. First, one of those loci seems to have been introgressed from a Denisovan genetic background. I want to be careful here, because the initial admixture event may not have been into the Tibetans proper, but into earlier hunter-gatherers who descend from Out of Africa groups, who were assimilated into the Tibetans as they expanded 5-10,000 years ago. Second, it turns out that dogs have also been targeted for selection on EPAS1 (the locus with the “Denisovan” introgression) for altitude adaptation.

This shows that in mammals at least there are a few genes which show up again and again. The fact that EPAS1 and EGLN1 were hits on relatively small sample sizes also reinforces their powerful effect. When the EPAS1 results initially came out they were highlighted as the strongest and fastest instance of natural selection in human evolutionary history. One can quibble about whether this was literally true, but that it was a powerful selective event no one could deny.

A new paper in PNAS, Genetic signatures of high-altitude adaptation in Tibetans, revisits the earlier results with a much larger sample size (the research group is in China) comparing Han Chinese and Tibetans. They confirm the earlier results, but, they also find other loci which seem likely targets of selection in Tibetans. Below is the list:

SNP          A1  A2  Freq. of A1 (Tibetan)  Freq. of A1 (EAS, Han)  P value   FST    Nearest gene
rs1801133    A   G   0.238                  0.333                   6.30E-09  0.021  MTHFR
rs71673426   C   T   0.102                  0.013                   1.50E-08  0.1    RAP1A
rs78720557   A   T   0.498                  0.201                   4.70E-08  0.191  NEK7
rs78561501   A   G   0.599                  0.135                   6.10E-15  0.414  EGLN1
rs116611511  G   A   0.447                  0.003                   3.60E-19  0.57   EPAS1
rs2584462    G   A   0.211                  0.549                   3.90E-09  0.203  ADH7
rs4498258    T   A   0.586                  0.287                   1.70E-08  0.171  FGF10
rs9275281    G   A   0.095                  0.365                   1.10E-10  0.162  HLA-DQB1
rs139129572  GA  G   0.316                  0.449                   5.80E-09  0.036  HCAR2
P value indicates the P value from the MLMA-LOCO analysis. FST is the FST value between Tibetans and EASs. Nearest gene indicates the nearest annotated gene to the top differentiated SNP at each locus except EGLN1, which is known to be associated with high-altitude adaptation; rs139129572 is an insertion SNP with two alleles: GA and G. A1, allele 1; A2, allele 2.

Many of these genes are familiar. Observe the allele frequency differences between the Tibetans and other East Asians (mostly Han). The sample sizes are on the order of thousands, and the SNP-chip had nearly 300,000 markers. What they found was that the between population Fst of Han to Tibetan was ~0.01. So only 1% of the SNP variance in their data was partitioned between the two groups. These alleles are huge outliers.
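To see how far out these alleles sit, one can compute a rough per-SNP FST directly from the two frequency columns in the table. The sketch below uses a Hudson-style large-sample formula, which is an assumption on my part: the paper's FST column presumably comes from a different estimator with sample-size corrections, so these numbers will not reproduce it exactly. The point is only the comparison with the genome-wide ~0.01.

```python
def hudson_fst(p1, p2):
    """Hudson-style FST from two population allele frequencies (large-sample form)."""
    numerator = (p1 - p2) ** 2
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator

# (Tibetan, Han) frequencies of allele A1, copied from the table above
table = {
    "MTHFR": (0.238, 0.333),
    "RAP1A": (0.102, 0.013),
    "NEK7": (0.498, 0.201),
    "EGLN1": (0.599, 0.135),
    "EPAS1": (0.447, 0.003),
    "ADH7": (0.211, 0.549),
    "FGF10": (0.586, 0.287),
    "HLA-DQB1": (0.095, 0.365),
    "HCAR2": (0.316, 0.449),
}

fst_by_gene = {gene: hudson_fst(*freqs) for gene, freqs in table.items()}
# Genes ordered from most to least differentiated
ranked = sorted(fst_by_gene, key=fst_by_gene.get, reverse=True)
```

Even with this cruder estimator, EPAS1 and EGLN1 come out on top, and every locus in the list exceeds the genome-wide background.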

The authors used some sophisticated statistical methods to correct for exigencies of population structure, drift, admixture, etc., to converge upon these hits, but even through inspection the deviation on these alleles is clear. And as they note in the paper it isn’t clear all of these genes are selected simply for hypoxia adaptation. MTHFR, which is quite often a signal of selection, may have something to do with folate production (higher altitudes have more UV). ADH7 is part of a set of genes which always seem to be under selection, and HLA is never a surprise.

Rather than get caught up in the details it is important to note here that expansion into novel habitats results in lots of changes in populations, so that two groups can diverge quite fast on functional characteristics. The PCA makes it clear that Tibetans and Han have very little West Eurasian admixture, and the Fst-based analysis puts their divergence on the order of 5,000 years before the present. The authors admit honestly that this is probably a lower bound value, but I also think it is quite likely that Tibetans, and probably Han too, are compound populations, and a simple bifurcation model from a common ancestral population is probably shaving away too many realistic edges. In plainer language, there has been gene flow between Han and Tibetans probably <5,000 years ago, and Tibetans themselves probably assimilated more deeply diverged populations in the highlands as they expanded as agriculturalists. An estimate of a single divergence quite often fits a complex history to too simple a model.
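For intuition about how an Fst value turns into a date, here is a sketch of one textbook drift-only relation, Fst ≈ 1 − exp(−T / 2Ne), solved for T. This is not necessarily the method used in the paper, and the effective population size of 10,000 and the 25-year generation time are assumptions picked purely for illustration.

```python
import math

def drift_divergence_years(fst, effective_size, years_per_generation=25):
    """Invert Fst = 1 - exp(-T / (2 * Ne)) for T, then convert to years.

    Assumes a clean split with pure drift and no later gene flow.
    """
    generations = -2 * effective_size * math.log(1 - fst)
    return generations * years_per_generation

# Fst of ~0.01 from the text; Ne = 10,000 is an assumed value
years = drift_divergence_years(0.01, effective_size=10_000)
```

Because later gene flow pulls Fst down, any date read off this way is biased toward being too recent, which is the sense in which a single-split estimate is a lower bound.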

The take home: understanding population history is probably important to get a better sense of the dynamics of adaptation.

Citation: Jian Yang, Zi-Bing Jin, Jie Chen, Xiu-Feng Huang, Xiao-Man Li, Yuan-Bo Liang, Jian-Yang Mao, Xin Chen, Zhili Zheng, Andrew Bakshi, Dong-Dong Zheng, Mei-Qin Zheng, Naomi R. Wray, Peter M. Visscher, Fan Lu, and Jia Qu, Genetic signatures of high-altitude adaptation in Tibetans, PNAS 2017 ; published ahead of print April 3, 2017, doi:10.1073/pnas.1617042114

The future shall, and should, be sequenced

Last fall I talked about a preprint, Human demographic history impacts genetic risk prediction across diverse populations. It’s now published in AJHG, with the same informative title, Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. Even though I talked about this before, I thought it would be useful to highlight again.

To recap, GWAS is a pretty big deal, but only in the last 15 years or so. With genome-wide data researchers began to explore associations between diseases and population genetic variation. In some cases they discovered strong associations between characteristics and genetic variants, but in many cases it turned out that though a trait is highly heritable (e.g., schizophrenia) the causal variants are either not common or do not explain much of the variation in the population (or both).

But as the second decade of GWAS proceeds the sample sizes are getting larger, and researchers are moving from SNP-chips, with their various biases, to high quality whole-genome sequences. One of the major sorts of low hanging fruit in the minds of many people are rare variants. Basically SNP-chips are geared toward finding common variations within large populations, since they have a finite number of markers they are going to interrogate. Sequencing though is a comprehensive catalog of the genome in a relative sense. If you have high coverage (so you sample the site many times) you can easily discover rare mutations within an individual genome that make them distinct from almost all of the rest of the human race (these may be de novo mutations, or they could be mutations private to their extended pedigree).

But context matters. Martin et al. find that confirmed GWAS hits in Europeans tend to exhibit decreased portability as a function of genetic distance. This isn’t entirely surprising, especially if rarer variants are part of the explanation. Rare variants usually emerged later in history, after the differentiation between geographic races.

A solution would be to have a diverse panel of populations in your studies. For many reasons this was not to be. Northwest Europeans are enormously enriched in current data sets. Martin et al. observe that recently this has diminished somewhat, from 95% European to less than 80%. But they observe that this is mostly due to the inclusion of “Asian” samples, as opposed to African and Native Americans, who remain as underrepresented as they were several years ago.

The African and Native American samples present somewhat different problems. The Native American groups are quite drifted due to bottlenecks. Likely they have their own variants due to the combined effects of mutation and selection through 15 to 20,000 years of isolation from other human populations. In contrast, the African groups have lots of diversity with a high time depth due to their ancestral histories, which are less subject to bottleneck effects. The prediction ability into Africans of current GWAS looks to be rather pathetic. This is reasonable because their diversity is poorly captured in Eurocentric study designs, and they are more genetically diverged from Europeans than Asians are.

Ultimately I think, and hope, this portability question will be a short-term concern. As sequencing gets cheap, and studies become more numerous, we’ll fill in the gaps of understudied populations. Finally, ethics is above my paygrade, but I do hope those who demand a strenuous bar on consent keep in mind that that will result in slower growth of these study populations. Academics want to do a good job, but they also want to stay on the good side of IRB.

Citation: Martin, Alicia R., et al. “Human demographic history impacts genetic risk prediction across diverse populations.” bioRxiv (2016): 070797.

Adaptation is ancient: the story of Duffy

Anyone with a passing familiarity with human population genetics will know of the Duffy system, and the fact that there is a huge difference between Sub-Saharan Africans and other populations on this locus. Specifically, the classical Duffy allele exhibits a nearly disjoint distribution from Africa to non-Africa. It was naturally one of the illustrations in The Genetics of Human Populations, a classic textbook from the 1960s.

Today we know a lot more about human variation. On most alleles we don’t see such sharp distinctions. Almost certainly the detection of these very differentiated alleles early on in human genetics was partly a function of selection bias. The methods, techniques, and samples, were underpowered and limited, so only the largest differences would be visible. Today we often use single base pair variations, single nucleotide polymorphisms, and the frequency differences are much more modest on average. Ergo, the reality that only a minority of genetic variation is partitioned across geographic races.

Why is Duffy different? Obviously it could be random. Assuming you have a polymorphism, you’ll get a range of frequencies across populations, and in some cases those frequencies will map onto different geographic zones just by chance. Imagine constant mutation, and highly structured bottlenecks. You could get a sequence of derived mutations fixing in populations one after the other, just by chance.
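The "just by chance" scenario is easy to play with. Below is a minimal neutral Wright-Fisher sketch in Python: several isolated populations start at the same allele frequency and drift independently, with no selection at all. The population size, number of generations, and seed are arbitrary illustrative choices.

```python
import random

def drift(p0, pop_size, generations, rng):
    """Neutral Wright-Fisher drift: binomial resampling of 2N allele copies."""
    p = p0
    copies = 2 * pop_size
    for _ in range(generations):
        # Each copy in the next generation is drawn from the current frequency
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
    return p

rng = random.Random(1)
# Ten small, isolated populations, all starting at a frequency of 0.5
final_freqs = [drift(0.5, pop_size=50, generations=200, rng=rng) for _ in range(10)]
```

In small populations run for long enough, many replicates end up at or near 0 or 1, so sharp between-population differences can arise without any selection. The question for Duffy is whether drift alone is a plausible explanation for a difference that extreme.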

This is probably not the case with Duffy. I’ll quote from Wikipedia:

The Duffy antigen is located on the surface of red blood cells, and is named after the patient in which it was discovered. The protein encoded by this gene is a glycosylated membrane protein and a non-specific receptor for several chemokines. The protein is also the receptor for the human malarial parasites Plasmodium vivax and Plasmodium knowlesi. Polymorphisms in this gene are the basis of the Duffy blood group system.

Malaria is one of the strongest selection pressures known to humanity. The balancing selection which results in sickle-cell disease is well known even among the general public. But the likely selection pressures due to the vivax variety are less commonly talked about, partly because they don’t as a side-effect induce a serious disease. Duffy may be canonical if you are a human population geneticist, but it is of less interest more generally.

But a recent paper in PLOS GENETICS shows just how dynamic the evolutionary genetic past of our species was, through the lens of the Duffy system, Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans. Here’s the author summary:

Infectious diseases have undoubtedly played an important role in ancient and modern human history. Yet, there are relatively few regions of the genome involved in resistance to pathogens that show a strong selection signal in current genome-wide searches for this kind of signal. We revisit the evolutionary history of a gene associated with resistance to the most common malaria-causing parasite, Plasmodium vivax, and show that it is one of regions of the human genome that has been under strongest selective pressure in our evolutionary history (selection coefficient: 4.3%). Our results are consistent with a complex evolutionary history of the locus involving selection on a mutation that was at a very low frequency in the ancestral African population (standing variation) and subsequent differentiation between European, Asian and African populations.

Why is it that regions of the genome subject to selection due to co-evolution with pathogens are hard to detect in scans for selection? My response would be that it’s because selection and adaptation are always happening in these regions, constantly erasing their own footprints.

You may be familiar with the fact that the major histocompatibility complex (MHC) loci are some of the most diverse regions of the genome. That’s because negative frequency dependent selection makes it so that rare variants never go extinct, as the rarer they get the more favored they are.
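A toy deterministic model shows the mechanism. In the Python sketch below, each allele's fitness is assumed to fall linearly with its own frequency (fitness = 1 − s × frequency); that functional form and the value of s are illustrative assumptions, not a model of real MHC dynamics.

```python
def nfds_step(freqs, s):
    """One generation: fitness of each allele falls as its own frequency rises."""
    fitness = [1 - s * p for p in freqs]
    mean_fitness = sum(p * w for p, w in zip(freqs, fitness))
    return [p * w / mean_fitness for p, w in zip(freqs, fitness)]

# Start with one common allele and two rare ones
freqs = [0.97, 0.02, 0.01]
for _ in range(2000):
    freqs = nfds_step(freqs, s=0.1)
```

The rare alleles have above-average fitness and climb, the common one falls, and the system settles with all three at equal frequency. Nothing is lost, which is how diversity gets maintained.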

Many classical and modern techniques for detecting selection require less protean dynamics in the model which they attempt to detect. Basically, many of the standard selection detection methods are looking for a simple perturbation in the pattern of variation that’s expected. A strong powerful recent sweep on a single mutation is like the spherical cow of evolutionary genetics. It happens. And it’s easy to model and detect. But it may not be nearly as important as our ability to detect these “hard sweeps” may suggest to us.

In contrast, if selection targets a larger number of independent mutations, then you get a “soft sweep,” which is harder to detect, because it is no singular event. Complexity is the enemy of detection. As a thought experiment, if you selected for height within a population you may catch some large effect alleles that would leave strong signals, but most of the dynamic would leave a polygenic footprint, distributed across innumerable genes.

The Duffy locus is somewhat in the middle. The authors distinguish between selection on standing variation (the allele frequency is higher than a single new mutation within the population) and a soft sweep, where multiple variants against different haplotypes are subject to selection. Their models and results strongly support selection on standing variation for the FY*O variant, and perhaps selection for the FY*A variant.

These selection events were very old, and very strong. Selection coefficients on the order of 4% are hard to believe in a natural environment. Curiously the coalescence times for the haplotypes of some of these alleles indicate that selection was contemporaneous with the emergence of modern humans out of Africa, ~50,000 years ago. From their sequence data analysis the different alleles have been segregating for a long time in the collective human population, and powerful sweeps fixed FY*O in both the ancestors of the Bantu and Pygmies before they diverged from each other. In contrast the Khoisan samples suggest that FY*O introgressed into their population from newcomers, while variants of FY*A are ancestral.
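To get a feel for what a coefficient of that size means, here is a back-of-the-envelope Python sketch. It assumes the simplest haploid (genic) selection model, in which the odds p/(1 − p) of the favored allele grow by a factor of (1 + s) each generation. The starting frequency of 0.1% and the end point of 99% are hypothetical, and dominance is ignored, which matters for a real diploid locus.

```python
import math

def generations_to_sweep(s, start_freq, end_freq):
    """Haploid (genic) selection: the odds p/(1-p) grow by (1+s) each generation."""
    start_odds = start_freq / (1 - start_freq)
    end_odds = end_freq / (1 - end_freq)
    return math.log(end_odds / start_odds) / math.log(1 + s)

# s = 4.3% is the coefficient quoted in the author summary; frequencies are assumed
gens = generations_to_sweep(0.043, start_freq=0.001, end_freq=0.99)
```

Under these assumptions the sweep takes on the order of a few hundred generations, a blink against a time depth of tens of thousands of years.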

The big picture here is that selection is ancient, that it is powerful, and it was a dynamic even before our species diversified into various lineages.

If you read the paper, and you should, it’s pretty clear that a lot of the adaptive story was suspected. It’s just with modern genomics and fancy ABC methods you can put point estimates and intervals on these hunches. But another issue, as they note in the piece, is that we have a better grasp of African population structure today than in the past, and this allows for better framing.

But it is here I have some caution to throw. At one point citing a 2012 paper the authors suggest “The KhoeSan peoples are a highly diverse set of southern African populations that diverged from all other populations approximately 100 kya.” I can tell you that some credible researchers who have access to whole genome sequences and have been looking at this question peg the divergence date closer to 200,000 years. Some of the issue here is that you need to decompose later gene flow, which will reduce the distance between populations. Easier said than done.

The genetic prehistory of the African continent is almost certainly much more complex than what is presented in the paper, largely due to lack of ancient DNA within Africa. Northern Eurasia turned out to be far more complex than had earlier been guessed…and it is likely that Northern Eurasia has had a simpler history because of its much shorter time of habitation.

If I had to guess I suspect that the ancestors of the Khoisan as we understand them were a separate and distinct group who diverged between ~100,000 and ~200,000 years ago from other extant African populations. But I suspect our clarity is very low in relation to the sort of structure which eventually resulted in the shake-out of only a few large groups of Sub-Saharan Africans aside from the Khoisan.

Citation: Population genetic analysis of the DARC locus (Duffy) reveals adaptation from standing variation associated with malaria resistance in humans.