Ancestry inference won’t tell you things you don’t care about (but could)

The figure above is from Noah Rosenberg’s relatively famous paper, Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure. The context of the publication is that it was one of the first prominent attempts to use genome-wide data on a variety of human populations (specifically, from the HGDP data set) for model-based clustering. There are many details of the model, but the one that will jump out at you here is the parameter K, which defines the number of putative ancestral populations you are hypothesizing. Individuals then shake out as proportions of each of the K elements. Remember, this is a model in a computer, and you select the parameters and the data. The output is not “wrong,” it’s just the output based on how you set up the program and the data you input yourself.

These sorts of computational frameworks are innocent, and may give strange results if you want to engage in mischief. For example, let’s say that you put in 200 individuals, of whom 95 are Chinese, 95 are Swedish, and 10 are Nigerian. From a variety of disciplines we know that, to a first approximation, non-Africans form a monophyletic clade in relation to Africans. In plain English, all non-Africans descend from a group of people who diverged from Africans more than 50,000 years ago. That means if you imagine two populations (K = 2), the first division should be between Africans and non-Africans, to reflect this historical demography. But if you skew the sample sizes, the program, which looks for the split that accounts for the maximal amount of variation in the data set, may decide that taking Chinese and Swedes as the two ancestral populations is the most likely model given the data.

This is not wrong as such. As the number of Africans in the data converges on zero, obviously the dividing line is between Swedes and Chinese. If you overload particular populations within the data, you may marginalize the variation you’re trying to explore, and the history you’re trying to uncover.
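To make the sample-size effect concrete, here is a toy sketch. It is not how the model-based clustering programs actually work: it swaps in a simple k-means-style criterion (pick the two-way split with the least leftover within-cluster variation), and the coordinates in `CENTROIDS` are hypothetical numbers I chose so that Nigerians sit farther from both other groups than those groups sit from each other. Variation within each population is ignored, since it adds the same constant to every split that keeps populations whole.

```python
# Toy illustration only: hypothetical 2-D "genetic" coordinates, chosen so that
# Nigerians sit farther from both other groups than those groups sit from each other.
CENTROIDS = {
    "Nigerian": (2.0, 0.0),
    "Chinese": (0.0, 1.0),
    "Swedish": (0.0, -1.0),
}


def within_ss(pops, counts):
    """Within-cluster sum of squares when whole populations share one cluster."""
    n = sum(counts[p] for p in pops)
    mean = [sum(counts[p] * CENTROIDS[p][d] for p in pops) / n for d in range(2)]
    return sum(
        counts[p] * sum((CENTROIDS[p][d] - mean[d]) ** 2 for d in range(2))
        for p in pops
    )


def split_off_at_k2(counts):
    """Return the population that the cheapest two-cluster split sets apart."""
    costs = {}
    for lone in counts:
        rest = [p for p in counts if p != lone]
        costs[lone] = within_ss([lone], counts) + within_ss(rest, counts)
    return min(costs, key=costs.get)


balanced = {"Nigerian": 70, "Chinese": 70, "Swedish": 70}
skewed = {"Nigerian": 10, "Chinese": 95, "Swedish": 95}

print("balanced sample splits off:", split_off_at_k2(balanced))
print("skewed sample splits off:", split_off_at_k2(skewed))
```

With the balanced sample the Nigerians are set apart, which matches the known history. With the 95/95/10 sample the cheapest split falls between Chinese and Swedes, and the ten Nigerians are simply lumped into one side, because there are too few of them for their distinctiveness to outweigh the Chinese-Swedish difference.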

I’ve written all of this before. But I’m writing this in the context of the earlier post, Ancestry Inference Is Precise And Accurate(Ish). In that post I showed that consumers drive genomics firms to provide results where the grain of resolution and inference varies a lot as a function of space. That is, there is a demand that Northern Europe be divided very finely, while vast swaths of non-European continents are combined into one broad cluster.

Less than 5% Ancient North Eurasian

Another aspect though is time. These model-based admixture frameworks can implicitly traverse time as one moves up and down the number of K’s. It is always important to explain to people that the K’s may not correspond to real populations which all existed at the same time. Rather, they’re just explanatory instruments which illustrate phylogenetic distance between individuals. In a well-balanced data set for humans K = 2 usually separates Africans from non-Africans, and K = 3 then separates West Eurasians from other populations. Going across K’s it is easy to imagine that one is traversing successive bifurcations.

A racially mixed man, 15% ANE, 30% CHG, 25% WHG, 30% EEF

But today we know that it’s more complicated than that. Three years ago Pickrell et al. published Toward a new history and geography of human genes informed by ancient DNA, where they report the result that more powerful methods and data imply most human populations are relatively recent admixtures between extremely diverged lineages. What this means is that the origin of groups like Europeans and South Asians is very much like the origin of the mixed populations of the New World. Since then this insight has become only more powerful, as ancient DNA has shed light on massive population turnovers over the last 5,000 to 10,000 years.

These are to some extent revolutionary ideas, not well known even among the science press (which is too busy doing real journalism, i.e. the art of insinuation rather than illumination). As I indicated earlier, direct-to-consumer genomics firms use national identities in their cluster labels because these are comprehensible to people. Similarly, they can’t very well tell Northern Europeans that they are an outcome of a successive series of admixtures between diverged lineages from the late Pleistocene down to the Bronze Age. Though Northern Europeans, like South Asians, Middle Easterners, Amerindians, and likely Sub-Saharan Africans and East Asians, are complex mixes between disparate branches of humanity, today we view them as indivisible units of understanding, to make sense of the patterns we see around us.

Personal genomics firms therefore provide results which are historically comprehensible. As a trivial example, the genomic data makes it rather clear that Ashkenazi Jews emerged in the last few thousand years via a process of admixture between antique Near Eastern Jews and the peoples of Western Europe. After the initial admixture this group became an endogamous population, so that most Ashkenazi Jews share many common ancestors in the recent past with other Ashkenazi Jews. This is ideal for the clustering programs above, as Ashkenazi Jews almost always fit onto a particular K with ease. Assuming there are enough Ashkenazi Jews in your data set you will always be able to find the “Jewish cluster” as you increase the value of K.

But the selection of a K which satisfies this comprehensibility criterion is a matter of convenience, not necessity. Most people are vaguely aware that Jews emerged as a people at a particular point in history. In the case of Ashkenazi Jews they emerged rather late in history. At certain K’s Ashkenazi Jews exhibit mixed ancestral profiles, placing them between Europeans and Middle Eastern peoples. What this reflects is the earlier history of the ancestors of Ashkenazi Jews. But for most personal genomics companies this earlier history is not something that they want to address, because it doesn’t fit into the narrative that their particular consumers want to hear. People want to know if they are part-Jewish, not that they are part antique Middle Eastern and Southwest European.

Perplexment of course is not just for non-scientists. When Joe Pickrell’s TreeMix paper came out five years ago there was a strange signal of gene flow between Northern Europeans and Native Americans. There was no obvious explanation at the time…but now we know what was going on.

It turns out that Northern Europeans and Native Americans share common ancestry from Pleistocene Siberians. The relationship between Europeans and Native Americans has long been hinted at in results from other methods, but it took ancient DNA for us to conceptualize a model which would explain the patterns we were seeing.

An American with recent Amerindian (and probably African) ancestry

But in the context of the United States shared ancestry between Europeans and Native Americans is not particularly illuminating. Rather, what people want to know is if they exhibit signs of recent gene flow between these groups. In particular, many white Americans are curious if they have Native American heritage. They do not want to hear an explanation which involves the fusion of an East Asian population with Siberians that occurred 15,000 to 20,000 years ago, and then the emergence of Northern Europeans through successive amalgamations between Pleistocene, Neolithic, and Bronze Age Eurasians.

In some of the inference methods Northern Europeans, often those with Finnic ancestry or relationship to Finnic groups, may exhibit signs of ancestry from the “Native American” cluster. But this is almost always a function of circumpolar gene flow, as well as the aforementioned Pleistocene admixtures. One way to avoid this would be to simply not report proportions which are below 0.5%. That way, people with higher “Native American” fractions would receive the results, and the proportions would be high enough that it was almost certainly indicative of recent admixture, which is what people care about.
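As a rough sketch of what that reporting rule might look like in practice (the function name, the cluster labels, and the choice to pool whatever falls below the cutoff into an “Unassigned” bucket are all my assumptions here, not any company’s actual pipeline):

```python
def reportable(proportions, threshold=0.005, other_label="Unassigned"):
    """Hide clusters below `threshold`; pool the hidden share under `other_label`.

    `proportions` maps a cluster label to a fraction of ancestry (summing to 1).
    The 0.005 default is the 0.5% cutoff suggested in the text; pooling the
    remainder into an "Unassigned" bucket is an assumption of this sketch.
    """
    kept = {name: p for name, p in proportions.items() if p >= threshold}
    dropped = sum(p for p in proportions.values() if p < threshold)
    if dropped > 0:
        kept[other_label] = kept.get(other_label, 0.0) + dropped
    return kept


# Hypothetical customers: one with a trace-level signal, one with a larger share.
trace_signal = {"Northern European": 0.997, "Native American": 0.003}
recent_admixture = {"Northern European": 0.94, "Native American": 0.06}

print(reportable(trace_signal))
print(reportable(recent_admixture))
```

The first customer never sees a “Native American” line at all, while the second still does, and in both cases the reported fractions continue to sum to one.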

Why am I telling you this? Because many journalists who report on direct-to-consumer genomics don’t understand the science well enough to grasp what’s being sold to the consumer (frankly, most biologists don’t know this field well either, even if they might use a barplot here and there).

And, the reality is that consumers have very specific parameters of what they want in terms of geographic and temporal information. They don’t want to be told true but trivial facts (e.g., they are Northern European). But neither do they want to know things which are so novel and at such far remove from their interpretative frameworks that they simply can’t digest them (e.g., that Northern Europeans are a recent population construction which threads together very distinct strands with divergent deep time histories). In the parlance of cognitive anthropology consumers want their infotainment the way they want their religion, minimally counterintuitive. Consume some surprise. But not too much.

  1. Love the photos and associated captions!

    In terms of what consumers want, I think your narrative is certainly an accurate explanation of what consumer genomics company executives think that their customers want.

    I’m not convinced, however, that consumers couldn’t find value in a deeper and more accurate model of their genetic ancestry that requires them to invest more effort for more knowledge. A significant subset of the market is pretty high brow and quite open to new scientifically rooted ideas. Certainly, the notion of a report on Neanderthal ancestry isn’t something that consumer genomics customers have found off putting.

    The presentation of the common backstory to provide context for the rest is tricky – it has to find the middle ground between understandable and unequivocal and being accurate – comparable to the information signs in national parks and zoos, and it has to find a way to break it into manageable chunks. But, I think a lot of people who invested the time to rebuild their worldview based upon that while seeing how they personally fit into that framework would find it to be very rewarding.

    You can’t buy the consumer loyalty that comes from that kind of experience with money. A consumer genomics company that invested in that kind of “user interface” could be the iPhone of consumer genomics – able to charge a premium for its very comparable services to competitors that could make it the dominant profit maker in the field.

  2. I think that one of the main reasons why consumers wouldn’t value the older stories of genetics is that they are much more universal than recent admixture.

    When people do traditional genealogy they are generally looking for some aspect of their family history which is unique or at least rare. Perhaps they hope to find a mildly famous ancestor. Or maybe they had an ethically ambiguous grandparent. What hooks them into the process is the idea that there is something potentially unique and special about their own lineage.

    Neolithic-era ancestry from the “four races” in various proportions in contrast, is essentially universal to those of European descent. Your own particular admixture proportion from these groups does not paint you or your ancestors as being “special” in any sense.

    Using my own example, as I said in the other thread, I heard from an aunt in adulthood that my great-great grandfather, who passed himself off as Portuguese when he moved to the U.S., was actually from an English Sephardic family. DNA testing showed the proportion of “Ashkenazic” DNA one would expect from being 1/16th Sephardic ancestry. But if we step back and look at things from the perspective of the early Neolithic races, what does my genetic breakdown mean? A slightly elevated level of the more “southern” ancestry groups compared to other people of largely Northern European descent, and nothing more. It’s interesting, but there’s no personal narrative that can be drawn from it – certainly nothing that sets me out as particularly distinctive, which means it lacks the emotional appeal most consumers are looking for.

  3. My comment on your previous post maybe speaks to the “shared ancestry between Europeans and Native Americans”. I wrote that my 23andMe results showed an almost imperceptible less than one tenth of a percent of “Native American” ancestry. 23andMe clusters(?) East Asian and Native American together.

    Anyhow, speaking to what consumers want to see and what a supposedly scientific analysis shows, 23andMe put Native American on my timeline graphic, if you are familiar with that feature. For myself, I am not new to this area, but I am just a consumer and have read a little, including Dr. Wells. I also am fortunate to know where my recent ancestors hailed from. I am not upset about what any of the results are.

    Did 23andMe choose Native American rather than East Asian (which shows a higher % at 0.2) on my timeline because they thought I would like that more? I don’t have a college degree, but I’ve gathered enough to know that East Asians and Native Americans are similar genetically. I understand the connection, but ancient DNA that has stayed on the Eurasian continent is not “Native American”. How could it be when they did not make the trip to “Turtle Island” or Columbus’s America?

    This kind of product being sold to the consumer could cause confusion as many white Americans yearn for a Native American connection. Your recommendation of not reporting proportions which are below 0.5% makes sense.

    What I would like to know is where is the difference between Southern Euros, or Mediterranean and Northern Euros? They couldn’t all have been farmers.

  4. “A racially mixed man, 15% ANE, 30% CHG, 30% WHG, 30% EEF”

    This adds up to more than 100%. Which one is 5% off?

    [fixed -razib]

  5. It looks like the company Living DNA does provide information about a user’s genotype by time period. Has anyone here used this service?

    “View your ancestry through history

    We put your ancestry into context showing your breakdown today (going back up to 10 generations), and also the spread of your ancestors at different points in history, showing how we are all connected.”

  6. I seem to be unusual in that, once I had solved one ‘hidden’ mystery in my genealogy, I ceased to find it interesting. My paternal grandfather’s father was a brick maker; my paternal grandmother’s father was a plumber. Whoopee. I am pretty confident I can trace my male line back to some Norman French guy who invaded England, but so what? The Normans very successfully made themselves disappear as an identifiable ethnic group within a few hundred years of the invasion, and in any case it says zero about me.

    I find the period from the end of the LGM to the Bronze Age much more interesting; and then the Bronze Age collapse around the Mediterranean – the American anthropologist Eric Cline is a very interesting speaker on that subject, and it carries some warning messages for modern global trade – break a vital link, and the whole system goes down (with the exception of Egypt, who successfully held out the second invasion of the Sea People, but were weakened as a consequence).

    And I am getting very impatient wanting to learn more about the origins of modern East Asians. Come on, Q. Fu et al. – summon your courage and publish it.

  7. Karl, the example of your great-great-grandfather reinforces to me the importance of family ties and connection between generations. Your DNA testing results are a pleasing confirmation of family oral tradition. If you had not had that conversation with your aunt, the DNA results alone would not have been especially enlightening.

    I agree with you, that many consumers are looking for evidence of a famous ancestor or other aspect that connotes rarity or being unique in some way. I have spent a great deal of time on Ancestry dot com, combing through public records, trying to learn more about my heritage. I hit a dead end at the point that my ancestors emigrated to the United States from shtetls in Zhytomyr and Warsaw in the late 19th century. Most of my (American Christian) acquaintances in the genealogical research community can trace their lineage for hundreds of years. I will try genomic testing next. To me, it is sufficient to know what portion of my DNA is Ashkenazi or Sephardic. My expectations about specificity are low, now that I realize how minimal the historic record was for most Jews, and perhaps for eastern Europeans in general.

    I am not complaining about oppression or minority status! Instead, I am rather amused by the wealth of information that retail genomic testing consumers expect. Most of us, including Ashkenazi Jews, become scientifically indistinguishable prior to 5000 years ago.

  8. Not used it, but 23andMe now do the same thing with their revised ‘user experience’. The time periods make sense in my case. I haven’t checked my daughter’s yet, but will do today. The big surprise in my wife’s results came in the late 17th/early 18th Century, Qing Dynasty, so far enough back to have been lost from the collective family memory.

    They lost a couple of things I liked when they changed the site, like the global PCA plots. I would have wanted to keep those, but evidently most of their customers were not interested in that feature. No surprise, really, given their customer base. I downloaded screenshots of the maps that showed me and my daughter, but my wife submitted her sample too late, and by the time she got her results, that feature had gone.

    If there is some software I can access to use to recreate those maps, I’d be grateful to be told about it.

    Don’t know what Joe Pickrell is doing at DNA Land – his analysis makes no sense for me. I’m pretty darned sure I’m not 4.4% Sardinian. But it’s free, so I’m not complaining. But the newly added trait prediction feature is pretty unimpressive – not much use telling me I have a high probability of having brown eyes when I have just told them twice that my eyes are green.

  9. Correction: late 18th/early 19th Century, in my wife’s case. Just failed on non-verbal IQ.

    It was not a total surprise, because it had already shown up in my daughter’s results. But it’s still a surprising result.

  10. Meant to say – that journalistic piece was, in my view, a very nasty and premeditated hit piece, and it made me vicariously very angry on your behalf; perhaps irrationally so, but I have been reading all of the various mutations of GNXP since 2002, and the inferences of that piece were very definitely not fair on you, from my reader’s perspective. I don’t want to prolong it, but if I had a way to kick back at that journalist, I surely would.

  11. Sandgroper, I feel similarly. I was more than a little surprised to read the linked piece characterizing razib’s work as scientific racism. It was short on facts and denied freedom of association in general. Describing razib as an alt-right enabler is ridiculous, further compounded by the fact that taxonomic descriptions of the alt-right are more elusive than consumer genomics ancestry clusters.

    (The New York Times can choose whomever they want as columnists for the newspaper. Extending an offer and then retracting it is tacky though.)

  12. I too find’s findings about me strange… they show me as having more eastern European and Eurasian ancestry than I actually have. They say I am 1% Tubalar! My genealogy reveals me to be almost completely colonial American/northwest European, which 23andme’s results align much better with.

  13. It’s not for want of capability in China, which is what is making me so suspicious.

    I don’t know this, I’m just suspicious of it. But there are a fair number of Chinese anthropologists/paleoanthropologists who have some weird, apparently ideologically/politically driven hyper-nationalistic theories about the origins of the ‘Han race’, and how this ‘unique’ branch of modern humans evolved in situ in China. I just have a suspicion that modern geneticists like Q. Fu and others are concerned about the reaction they would get if they publish research that shows that these ‘racial purity’ theories are a load of nonsense.

    There is at least one geneticist in Shanghai (name escapes me, but it’s a guy, not Qiaomei Fu ) who just says straight out that Chinese are descended from a group of anatomically modern humans who migrated out of Africa, just like everyone else, and that the genomic data very clearly demonstrate that, but whenever people like him make some public statements about it, there is an immediate aggressive reaction from these apparently ideologically driven anthropologists.

    To muddy the water, there have been some finds of ancient human remains in China that appear to be ambiguous (whether anatomically modern human or with some archaic features), and with dating that would confound a “single migration out of Africa” event. But reports about those remains need to be treated with caution (including both the interpretation of the cranial morphology and the reliability of the dating) given what seems to me to be a clear ideological agenda of some of the Chinese anthropologists.

    *shrug* I dunno. But it’s frustrating.

  14. Could just be due to a lack of finds/remains. China has always been humid, going back 50,000 years and more, so not such a good preservational environment.

    I need to stop serial posting/thinking aloud.

  15. I wrote that my 23andMe results showed an almost imperceptible less than one tenth of a percent of “Native American” ancestry. 23andMe clusters(?) East Asian and Native American together.

    i would generally avoid reading too much into such low %s. i have a japanese friend who is 0.1% ashkenazi.

  16. it uses PoBI data set, and has great ppl involved. but, it is really british isles focused.

    my belief is these inferences are useful for ppl from latin america. not so much for others.

  17. At the risk of manifesting the “intractable nightmare” of your other recent Ancestry blog (“Ancestry inference is precise and accurate(ish)”) as the genetic lay-person who takes these tests too literally, would you be interested in taking a stand on whether it is me, or the FTDNA Family Finder algorithm, that just doesn’t get my intra-family results? My parent is nearly 50% of an ethnicity (Scandinavian) vs. me 0%? I suspect that your blog comment addresses this matter fully to more erudite consumers than me. If I recall correctly, you consulted in the development of the FF admixture tool.

    “In Northern Europe there is very little genetic variation from the British Isles all the way the Baltic.”

    However, is a 50% admixture variance a bit extreme even if the populations are inter-continental or even if allowing for uneven inheritance patterns? To carry my specific example a little further, Scandinavian results that I have seen on Gedmatch calculators such as Eurogenes K36 show much higher levels of whatever N. European population is included than my father or me. On the Eurogenes K13 calculator, we have the same top Single Mode population. Is not Gedmatch a similarly referenced time-frame as FTDNA?

    We buy these tests blindly, in my opinion, based on commercials that promise to tell you “how Irish you are” if it’s St. Patrick’s Day. If these companies would feature model ancestry reports of proto-typical populations before purchase, e.g. showing up-front the broad categories such as “British / Irish” or “French / German,” that are typical results (apologies for Euro-centric examples), they might mitigate much of the forum Q&A and far-flung testing.

    I respect the FTDNA product / experience overall, especially the MtDNA and Y DNA research, but admit that my parent’s AuDNA variances with mine really challenged my comfort-level with the state of the art (or science). FTDNA has a stated mission on its website of providing ancestry services to match family members. Yet, it either offers the ethnicity function as “fluff” to sell more services or believes that it proffers some utility in making family connections. Unless my intra-family results are an anomaly, it seems that the FTDNA admixture algorithm cuts a broad swath whereas 23andMe tries hard to find any possible minute admixture, misleading naive customers like me who have only partially documented family trees. Add to that the alleged different time periods measured by commercial DNA tests … hoping that you continue these blogs.

  18. We buy these tests blindly, in my opinion, based on commercials that promise to tell you “how Irish you are” if it’s St. Patrick’s Day. If these companies would feature model ancestry reports of proto-typical populations before purchase, e.g. showing up-front the broad categories such as “British / Irish” or “French / German,” that are typical results (apologies for Euro-centric examples), they might mitigate much of the forum Q&A and far-flung testing.

    this is correct.

    so i was a consultant for ftDNA, and we saw these weird results in particular with scandinavians. without giving away trade secrets i can tell u the mix of scandinavian in your reference set impacts the outcomes a lot.

    i think a bayesian system to take into account family relatedness might help with these problems.

