Why do percentage estimates of “ancestry” vary so much?

When looking at the results in , , and my “East Asian” percentage is:

– 19%
– 13%
– 6%

What’s going on here? In science we often make a distinction between precision and accuracy. Precision is how much your results vary when you re-run an experiment or measurement. Basically, can you reproduce your result? Accuracy refers to how close your measurement is to the true value. A measurement can be quite precise, but consistently off. Conversely, a measurement may be imprecise but bounce around the true value, so it is reasonably accurate if you take enough measurements to cancel out the errors (which are random).
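To make the distinction concrete, here is a minimal sketch in Python. The measurement values and the "true" value are made up for illustration; nothing here comes from a real genotyping run.

```python
from statistics import mean, stdev

TRUE_VALUE = 10.0  # hypothetical "true" quantity being measured

# Hypothetical repeated measurements from two instruments.
precise_but_biased = [12.1, 11.9, 12.0, 12.2, 11.8]    # tight spread, wrong center
imprecise_but_unbiased = [7.0, 13.0, 9.0, 11.5, 9.5]   # wide spread, right center

def summarize(measurements, true_value):
    """Return (bias, spread): mean error vs. the truth, and sample standard deviation."""
    bias = mean(measurements) - true_value
    spread = stdev(measurements)
    return bias, spread

biased_bias, biased_spread = summarize(precise_but_biased, TRUE_VALUE)
noisy_bias, noisy_spread = summarize(imprecise_but_unbiased, TRUE_VALUE)

print(f"precise but biased:     bias={biased_bias:+.2f}, spread={biased_spread:.2f}")
print(f"imprecise but unbiased: bias={noisy_bias:+.2f}, spread={noisy_spread:.2f}")
```

The first instrument is reproducible but consistently off; the second is noisy, but its errors average out around the true value.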

The values above are precise. That is, if you got re-tested on a different chip, the results aren’t going to be much different. The tests are using as input variation on 100,000 to 1 million markers, so a small proportion will give different calls than in the earlier test. But that’s not going to change the end result in most instances, even though these methods often have a stochastic element.

But what about accuracy? I am not sure that old chestnuts about accuracy apply in this case, because the percentages that these services provide are summaries and distillations of the underlying variation. The model of precision and accuracy that I learned would be more applicable to the DNA SNP array which returns calls on the variants; that is, how close are the calls of the variant to the true value (last I checked these arrays are around 99.5% accurate in terms of matching the true state).

What you see when these services pop out a percentage for a given ancestry is the outcome of a series of conscious choices that designers of these tests made keeping in mind what they wanted to get out of these tests. At a high level here’s what’s going on:

  1. You have a model of human population history and dynamics with various parameters
  2. You have data that varies that you put into that model
  3. You have results which come back with values which are the best fit of that data to the model you specified

Basically you are asking the computational framework a question, and it is returning its best answer to the question posed. To ask whether the answer is accurate or not is almost not even wrong. The frameworks vary because they are constructed by humans with different preferences and goals.

Almost, but not totally wrong. You can for example simulate populations whose histories you know, and then test the models on the data you generated. Since you already know the “truth” about the simulated data’s population structure and history, you can see how well your framework can infer what you already know from the patterns of variation in the generated data.
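As a rough illustration of that kind of check, the sketch below simulates one admixed individual with a known ancestry fraction and then tries to recover it. Everything in it is an assumption made for the example: the allele frequencies are random numbers, markers are treated as independent, and the estimator is a simple least-squares fit on allele dosage, not the method any particular company or software package uses.

```python
import random

def simulate_genotypes(freq_a, freq_b, frac_b, rng):
    """Draw diploid genotypes (0, 1 or 2 allele copies) for one admixed person.

    Each of the two allele copies at a marker comes from population B with
    probability frac_b, otherwise from population A.
    """
    genotypes = []
    for pa, pb in zip(freq_a, freq_b):
        copies = 0
        for _ in range(2):
            p = pb if rng.random() < frac_b else pa
            copies += 1 if rng.random() < p else 0
        genotypes.append(copies)
    return genotypes

def estimate_frac_b(genotypes, freq_a, freq_b):
    """Least-squares estimate of the population-B fraction from allele dosages."""
    num = 0.0
    den = 0.0
    for g, pa, pb in zip(genotypes, freq_a, freq_b):
        diff = pb - pa
        num += (g / 2 - pa) * diff
        den += diff * diff
    return min(1.0, max(0.0, num / den))

rng = random.Random(42)  # fixed seed so the run is repeatable
n_markers = 20000        # hypothetical marker count

# Made-up reference allele frequencies for two source populations.
freq_a = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]
freq_b = [rng.uniform(0.05, 0.95) for _ in range(n_markers)]

true_frac_b = 0.15       # the "truth" we built into the simulation
genotypes = simulate_genotypes(freq_a, freq_b, true_frac_b, rng)
estimated = estimate_frac_b(genotypes, freq_a, freq_b)

print(f"true = {true_frac_b:.3f}, estimated = {estimated:.3f}")
```

Because the truth is built into the simulation, the gap between the true and estimated fractions is a direct measure of how well the framework performs under the assumptions of the model.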

Going back to my results, why do my East Asian percentages vary so much? The short answer is that one of the major variables in the model alluded to above is the nature of the reference population set and the labels you give them.

Looking at Bengalis, the ethnic group I’m from, it is clear that in comparison to other South Asian populations they are East Asian shifted. That is, it seems clear I do have some East Asian ancestry. But how much?

The “simple” answer is to model my ancestry as a mix of two populations, an Indian one and an East Asian one, and then see what the values are for my ancestry across the two components. But here is where semantics becomes important: what is Indian and East Asian? Remember, these are just labels we give to groups of people who share genetic affinities. The labels aren’t “real”, the reality is in the raw read of the sequence. But humans are not capable of really getting anything from millions of raw SNPs assigned to individuals. We have to summarize and re-digest the data.

The simplest explanation for what’s going on here is that the different companies have different populations put into the boxes which are “Indian/South Asian” and “East Asian.” If you are using fundamentally different measuring sticks, then there are going to be problems with doing apples to apples comparisons.

My personal experience is that 23andMe tends to give very high percentages of South Asian ancestry for all South Asians. Because “South Asian” is a very diverse category, when tests come back saying that someone is 95-99% South Asian…it’s not really telling you much. In contrast, some of the other services may be using a small subset of South Asians, who they define as “more typical”, and so giving lower percentages to people from Pakistan and Bengal, who have admixture from neighboring regions to the west and east respectively.*

Something similar can occur with East Asian ancestry. If the “donor” ancestral groups are South Asian and East Asian for me, then the proportion assigned to each is going to vary by how close the donor groups selected by the company are to the true ancestral groups. If, for example, one service chose a more Northeastern Asian population than another, then my East Asian percentage would vary between the two services, because I know my East Asian ancestry is more Southeast Asian.
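A toy calculation shows the effect of the reference choice. The allele frequencies below are invented, the population labels are just names for the example, and the two-source least-squares fit is a stand-in for whatever the real services do. The individual is constructed as 85% "South Asian" and 15% "Southeast Asian", then fitted against two different East Asian reference panels.

```python
def estimate_fraction(individual, ref_main, ref_donor):
    """Least-squares fraction of ancestry assigned to ref_donor in a two-source model."""
    num = sum((x - a) * (b - a) for x, a, b in zip(individual, ref_main, ref_donor))
    den = sum((b - a) ** 2 for a, b in zip(ref_main, ref_donor))
    return min(1.0, max(0.0, num / den))

# Hypothetical allele frequencies at six markers (made-up numbers).
south_asian     = [0.10, 0.80, 0.30, 0.60, 0.50, 0.20]
southeast_asian = [0.40, 0.50, 0.60, 0.30, 0.70, 0.50]
northeast_asian = [0.55, 0.35, 0.70, 0.20, 0.85, 0.60]

# Build an individual whose true East Asian source is the Southeast Asian panel.
true_east_fraction = 0.15
individual = [
    (1 - true_east_fraction) * s + true_east_fraction * e
    for s, e in zip(south_asian, southeast_asian)
]

# Same person, same data, two different "East Asian" measuring sticks.
with_southeast_panel = estimate_fraction(individual, south_asian, southeast_asian)
with_northeast_panel = estimate_fraction(individual, south_asian, northeast_asian)

print(f"East Asian, Southeast Asian reference: {with_southeast_panel:.1%}")
print(f"East Asian, Northeast Asian reference: {with_northeast_panel:.1%}")
```

With the matching panel the fit recovers 15%. With the more distant Northeast Asian panel the same genome gets a lower East Asian percentage, even though nothing about the individual has changed.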

The moral of the story is that the values you obtain are conditional on the choices you make, and those choices emerge from the process of reducing and distilling the raw genetic variation into a manner which is human interpretable. If the companies decided to use the same model, they would come out with the same results.

* I helped develop an earlier version of MyOrigins, and so can attest to this firsthand.

Ancient Europeans: isolated, always on the edge of extinction

A few years ago I suggested to the paleoanthropologist Chris Stringer that the first modern humans who arrived in Europe did not contribute appreciable ancestry to modern populations in the continent (appreciable as in 1% or more of the genome).* It seems I may have been right according to results from a 2016 paper, The genetic history of Ice Age Europe. The very oldest European ancient genome samples “failed to contribute appreciably to the current European gene pool.”

Why did I make this claim? Two reasons:

1) 40,000 years is a long time, and there was already substantial evidence of major population turnovers across northern Eurasia by this point. You go far enough into the future and it’s not likely that a local population leaves any descendants. So just work that logic backward.

2) There was already evidence of low population sizes and high isolation levels between groups in Pleistocene and Mesolithic/Neolithic Europe. This would again argue in favor of a high likelihood of local extinctions given enough time.

This does not apply only to modern humans, descendants of southern, likely African, populations. Neanderthals themselves show evidence of high homogeneity, and expansions through bottlenecks over the ~600,000 years of their flourishing.

The reason that these dynamics characterized modern humans and earlier hominins in northern Eurasia is what ecologists would term an abiotic factor: the Ice Age. Obviously humans could make a go of it on the margins of the tundra (the Neanderthals seem less adept at penetrating the very coldest of terrain in comparison to their modern human successors; they likely frequented the wooded fringes). We have the evidence of several million years of continuous habitation by our lineage. But many of the ancient genomes from these areas, whether they be Denisovan, Neanderthal, or Mesolithic European hunter-gatherer, show indications of being characterized by very low effective population sizes. Things only change with the arrival of farming and agro-pastoralism.

For two obvious reasons we happen to have many ancient European genomes. First, many of the researchers are located in Europe, and the continent has a well developed archaeological profession which can provide well preserved samples with provenance and dates. And second, Europe is cool enough that degradation rates are going to be lower than if the climate were warmer. But if Europe, as part of northern Eurasia, is subject to peculiar, exceptional demographic dynamics, we need to be cautious about generalizing in terms of the inferences we make about human population genetic history. Remember that ancient Middle Eastern farmers already show evidence of having notably larger effective population sizes than European hunter-gatherers.

Two new preprints confirm the long term population dynamics typical of European hunter-gatherers, Assessing the relationship of ancient and modern populations and Genomics of Mesolithic Scandinavia reveal colonization routes and high-latitude adaptation. The first preprint is rather methods heavy, and seems more of a pathfinder toward new ways to extract more analytic juice from ancient DNA results. Those who have worked with population genomic data are probably not surprised at the emphasis on collecting numbers of individuals as opposed to single genome quality. That is, for the questions population geneticists are interested in “two samples sequenced to 0.5x coverage provide better resolution than a single sample sequenced to 2x coverage.”

I encourage readers (and “peer reviewers”) to dig into the appendix of Assessing the relationship of ancient and modern populations. I won’t pretend I have (yet). Rather, I want to highlight an interesting empirical finding when the method was applied to extant ancient genomic samples: “we found that no ancient samples represent direct ancestors of modern Europeans.”

This is not surprising. The ‘hunter-gatherer’ resurgence of the Middle Neolithic notwithstanding, Northern Europe was subject to two major population replacements, while Southern Europe was subject to one, but of a substantial nature. Recall that the Bell Beaker paper found that “spread of the Beaker Complex to Britain was mediated by migration from the continent that replaced >90% of Britain’s Neolithic gene pool within a few hundred years.” This means that less than 10% of modern Britons’ ancestry derives from a combination of hunter-gatherers and Neolithic farmers.

And yet if you look at various forms of model-based admixture analyses it seems as if modern Europeans have substantial dollops of hunter-gatherer ancestry (and hunter-gatherer U5 mtDNA and Y chromosomal lineages I1 and I2, associated with Pleistocene Europeans, are found at ~10% frequency in modern Europe in the aggregate; though I suspect this is a floor). What gives? Let’s look at the second preprint, which is more focused on new empirical results from ancient Scandinavian genomes, Genomics of Mesolithic Scandinavia reveal colonization routes and high-latitude adaptation. From early on in the preprint:

Based on SF12’s high-coverage and high-quality genome, we estimate the number of single nucleotide polymorphisms (SNPs) hitherto unknown (that are not recorded in dbSNP (v142)) to be c. 10,600. This is almost twice the number of unique variants (c. 6,000) per Finnish individual (Supplementary Information 3) and close to the median per European individual in the 1000 Genomes Project (23) (c. 11,400, Supplementary Information 3). At least 17% of these SNPs that are not found in modern-day individuals, were in fact common among the Mesolithic Scandinavians (seen in the low coverage data conditional on the observation in SF12), suggesting that a substantial fraction of human variation has been lost in the past 9,000 years (Supplementary Information 3). In other words, the SHGs (as well as WHGs and EHGs) have no direct descendants, or a population that show direct continuity with the Mesolithic populations (Supplementary Information 6) (13–17). Thus, many genetic variants found in Mesolithic individuals have not been carried over to modern-day groups.

The gist of the paper in terms of archaeology and demographic history is that Scandinavian hunter-gatherers were a compound population. One component of their ancestry is what we term “Western hunter-gatherers” (WHG), who descended from the late Pleistocene Villabruna cluster (see paper mentioned earlier). Samples from Belgium, Switzerland, and Spain all belong to this cluster. The second element is “Eastern hunter-gatherers” (EHG). These samples derive from the Karelia region, to the east of modern Finland, bound by the White Sea to the north. EHG populations exhibit affinities to both WHG as well as Siberian populations who contributed ancestry to Amerindians, the “Ancestral North Eurasians” (ANE). There is a question at this point whether EHG are the product of a pulse admixture between an ANE and WHG population, or whether there was a long existent ANE-WHG east-west cline which the EHG were situated upon. That is neither here nor there (the Tartu group has a paper addressing this leaning toward isolation-by-distance from what I recall).

Explicitly testing models against the genetic data, the authors conclude that there was a migration of EHG populations with a specific archaeological culture around the north fringe of Scandinavia, down the Norwegian coast. Conversely, a WHG population presumably migrated up from the south and somewhat to the east (from the Norwegian perspective).

And yet the distinctiveness of the very high quality genome, as inferred from the unique SNPs it carries, suggests to them that very little of the ancestry of modern Scandinavians (and Finns to be sure) derives from these ancient populations. Very little does not mean none. There is a lot of functional analysis in the paper and supplements which I will not discuss in this post, and one aspect is that it seems some adaptive alleles for high latitudes might persist down to the present in Nordic populations as a gift from these ancient forebears. This is no surprise; not all regions of the genome are created equal (a more extreme case is the Denisovan derived high altitude adaptation haplotype in modern Tibetans).

Nevertheless, there was a great disruption. First, the arrival of farmers whose ultimate origins were Anatolia ~6,000 years ago to the southern third of Scandinavia introduced a new element which came in force (agriculture spread over the south in a few centuries). A bit over a thousand years later the Corded Ware people, who were likely Indo-European speakers, arrived. These Indo-European speakers brought with them a substantial proportion of ancestry related to the hunter-gatherers because they descended in major fraction from the EHG (and later accrued more European hunter-gatherer ancestry from both the early farmers and likely some residual hunter-gatherer populations who switched to agro-pastoralism**).
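The bookkeeping behind this can be sketched with a few lines of Python. All of the proportions below are hypothetical, chosen only to show the logic; they are not estimates from the preprint or from any other paper. The point is that ancestry which looks "hunter-gatherer" in an admixture analysis can be much larger than the ancestry that descends directly from the local Mesolithic population.

```python
def admix(resident, migrant, migrant_fraction):
    """Mix two ancestry profiles; migrant_fraction of the result comes from the migrants."""
    keys = set(resident) | set(migrant)
    return {
        k: (1 - migrant_fraction) * resident.get(k, 0.0)
           + migrant_fraction * migrant.get(k, 0.0)
        for k in keys
    }

# All proportions below are hypothetical, chosen only for illustration.
mesolithic_locals = {"local_hg": 1.0}                   # direct local continuity
farmers = {"hg_related": 0.10, "other": 0.90}           # migrants with a little HG-related ancestry
steppe_derived = {"hg_related": 0.50, "other": 0.50}    # migrants with a lot of it

after_farmers = admix(mesolithic_locals, farmers, 0.80)       # first turnover
after_steppe = admix(after_farmers, steppe_derived, 0.70)     # second turnover

direct_continuity = after_steppe["local_hg"]
total_hg_like = after_steppe["local_hg"] + after_steppe["hg_related"]

print(f"ancestry from the local Mesolithic population: {direct_continuity:.1%}")
print(f"ancestry that looks hunter-gatherer overall:   {total_hg_like:.1%}")
```

With these made-up inputs only 6% of ancestry traces to the local hunter-gatherers, while about 43% would register as hunter-gatherer-related, most of it carried in by the migrants themselves.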

For several years I’ve had discussions with researchers whose daily bread & butter are the ancient DNA data sets of Europe. I’ve gotten some impressions implicitly, and also from things they’ve said directly. It strikes me that the Bantu expansion may not be a bad analogy in regards to the expansion of farming in Europe (and later agro-pastoralism). Though the expanding farmers initially mixed with hunter-gatherers on the frontier, once they got a head of steam they likely replaced small hunter-gatherer groups in totality, except in areas like Scandinavia and along the maritime fringe where ecological conditions were such that hunter-gatherers were at an advantage ( seems to describe a massive farmer vs. coastal forager war on the North Sea).

But this is not the end of the story for Norden. At SMBE I saw some ancient genome analysis from Finland on a poster. Combined with ancient genomic analysis from the Baltic, along with deeper analysis of modern Finnish mtDNA, it seems likely that the expansion of Finno-Samic languages occurred on the order of ~2,000 years ago, after the initial expansion of Corded Ware agro-pastoralists.

The Sami in particular seem to have followed the same path along the northern fringe of Scandinavia that the EHG blazed. Though they herd reindeer, they were also Europe’s last indigenous hunter-gatherers. Genetically they exhibit the same minority eastern affinities in their ancestry that the Finns do, though to a greater extent. But their mtDNA harbors some distinctive lineages, which might be evidence of absorption of ancient Scandinavian substrate.

I’ll leave it to someone else to explain how and why the Finns and Sami came to occupy the areas where they currently dominate (note that historically Sami were present much further south in Norway and Sweden than they are today). But note that in Latvia and Lithuania the N1c Y chromosomal lineage is very common, despite no language shift, indicating that there was a great deal of reciprocal mixing on the Baltic.

Overall the story is of both population and cultural turnover. This should not surprise when one considers that northern Eurasia is on the frontier of the human range. And perhaps it should temper the inferences we make about other areas of the world.

* You may notice that this threshold is lower than the Neanderthal admixture proportions in the non-African genome. Why is this old admixture still detectable while modern human lineages go extinct? Because it seems to have occurred when non-African humans had a very small effective population size, and it was mixed in thoroughly. Because of the even genomic distribution this ancestry has not been lost in any of the daughter populations.

** Haplogroup I1, which descends from European late Pleistocene populations, exhibits a star phylogeny of similar time depth as R1b and R1a.

The great Bantu expansion was massive

Lots of stuff at SMBE of interest to me. I went to the Evolution meeting last year, and it was a little thin on genetics for me. And I go to ASHG pretty much every year, but there’s a lot of medical stuff that is not to my taste. SMBE was really pretty much my style.

In any case one of the more interesting talks was given by Pontus Skoglund (soon of the Crick Institute). He had several novel African genomes to talk about, in particular from Malawi hunter-gatherers (I believe dated to 3,000 years before the present), and one from a pre-Bantu pastoralist.

At one point Skoglund presented a plot showing what looked like an isolation by distance dynamic between the ancient Ethiopian Mota genome and a modern day Khoisan sample, with the Malawi population about two-thirds of the way toward the Khoisan from the Ethiopian sample. Some of my friends from a non-human genetics background were at the talk and were getting quite excited at this point, because there is a general feeling that the Reich lab emphasizes the stylized pulse admixture model a bit too much. Rather than expansion of proto-Ethiopian-like populations and proto-Khoisan-like populations they interpreted this as evidence of a continuum or cline across East Africa. I’m not sure if this is the right interpretation of the plot presented, but it’s a reasonable one.

Malawi is considerably to the north of modern Khoisan populations. This is not surprising. From what I have read Khoisan archaeological remains seem to be found as far north as Zimbabwe, while others have long suggested a presence as far afield as Kenya. Perhaps more curiously: the Malawi hunter-gatherers exhibit no evidence of having contributed genes to modern Bantu residents of Malawi.

Surprising, but not really. If you look at a PCA plot of Bantu genetic variation it only really starts showing evidence of local substrate (Khoisan) in South Africa. From Cameroon to Mozambique it looks like the Bantu simply overwhelmed local populations, they are clustered so tightly. Though it is true that African populations harbor a lot of diversity, that diversity is not necessarily partitioned between the populations. The Bantu expansion is why.

Of more interest from the perspective of non-African history is the Tanzanian pastoralist. This individual is about 38% West Eurasian, and that ancestry has the strongest affinities with Levantine Neolithic farmers. Specifically, the PPN, which dates to between 8500-5500 BCE. More precisely, this individual was exclusively “western farmer” in the Lazaridis et al. formulation. Additionally, Skoglund also told me that the Cushitic (and presumably Semitic) peoples to the north and east had some “eastern farmer.” I immediately thought back to Hodgson et al., Early Back-to-Africa Migration into the Horn of Africa, which suggested multiple layers. Finally, Pagani et al. (2012) suggested that admixture in the Ethiopian plateau occurred on the order of ~3,000 years ago.

Bringing all of this together, it suggests to me two things:

  1. The migration back from Eurasia occurred multiple times, with an early wave arriving well before the Copper/Bronze Age east-west and west-east gene flow in the Near East (also, there was backflow to West Africa, but that’s a different post….).
  2. The migration was patchy; the Mota sample dates to 4,500 years ago, and lacks any Eurasian ancestry, despite the likelihood that the first Eurasian backflow was already occurring.

Skoglund will soon have the preprint out.