Ancestry inference is precise and accurate(ish)

For about three years I consulted for Family Tree DNA. It was a great experience, and I met a lot of cool people through that connection. But perhaps the most interesting aspect was that I came to understand the various pressures that direct-to-consumer genomics firms face from the demand side. The science is one thing, but when you are working on a consumer-facing product, other variables come into play which you are not cognizant of when you are thinking of it from the point of view of pure analysis. I’m pretty sure that my insights from working with Family Tree DNA generalize to the other firms as well (23andMe, Ancestry, and Genographic*).

The science behind the ancestry inference elements of the product on offer is not particularly controversial or complex, but the customer aspect of how these results are received can become an intractable nightmare. The basic theory was outlined in the year 2000 in Pritchard et al.’s Inference of Population Structure Using Multilocus Genotype Data. You have lots of data thanks to better genomic technology (e.g., 300,000 SNPs). You have computers to analyze that data. And you have scientific models of population history and dynamics which you can test that data against. The shape of the data will determine the parameters of the model, and it is those parameters that yield “your ancestry.”
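To make "the data determine the parameters" concrete, here is a minimal sketch in Python. It is not the method of Pritchard et al., who fit an unsupervised model by MCMC; it swaps in a much simpler supervised version, where the allele frequencies of the reference populations are assumed to be known and an expectation-maximization loop estimates one person's ancestry proportions. The function name, the two reference populations, and all frequencies are made up for illustration.

```python
def estimate_admixture(genotype, ref_freqs, n_iter=200):
    """Estimate one person's ancestry proportions by expectation-maximization.

    genotype  -- list of 0/1/2 counts of a chosen allele, one per SNP
    ref_freqs -- ref_freqs[k][l] is that allele's frequency at SNP l in
                 reference population k (assumed known, strictly in (0, 1))
    """
    n_pops = len(ref_freqs)
    n_loci = len(genotype)
    q = [1.0 / n_pops] * n_pops  # start from equal proportions
    for _ in range(n_iter):
        expected = [0.0] * n_pops
        for l, g in enumerate(genotype):
            # Allele frequency implied by the current mixture of populations
            f = sum(q[k] * ref_freqs[k][l] for k in range(n_pops))
            f = min(max(f, 1e-9), 1.0 - 1e-9)  # guard against division by zero
            for k in range(n_pops):
                p = ref_freqs[k][l]
                # Expected number of this person's allele copies drawn from k
                expected[k] += g * q[k] * p / f
                expected[k] += (2 - g) * q[k] * (1.0 - p) / (1.0 - f)
        # Each person carries 2 allele copies per SNP
        q = [e / (2 * n_loci) for e in expected]
    return q


# Made-up frequencies for two hypothetical reference populations at four SNPs.
pop_a = [0.9, 0.1, 0.8, 0.2]
pop_b = [0.1, 0.9, 0.2, 0.8]

print(estimate_admixture([2, 0, 2, 0], [pop_a, pop_b]))  # genotype resembles pop_a
print(estimate_admixture([1, 1, 1, 1], [pop_a, pop_b]))  # genotype is intermediate
```

Note how the answer depends on how different `pop_a` and `pop_b` are. If the two rows of frequencies were nearly identical, almost any set of proportions would fit the genotype about equally well.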

In broad sketches the results make sense for most people. It’s in the finer details that the confusions emerge. To the left you see my son’s ancestry deconvolution. The color coding is such that you can tell that his maternal and paternal chromosomes have very different ancestry profiles (mostly Northern European and South Asian, respectively).

But his “Northern European” chromosomes also are more richly colored, with alternating segments denoting ancestry from different parts of Northern Europe. So in terms of proportions I am told my son is about 15 percent French and German, 10 percent Scandinavian, and 10 percent British and Irish. This is reasonable. On the other side he’s nearly 50 percent “broadly South Asian.” The balance is accounted for by my East Asian ancestry, which is correct, as my South Asian ethnicity is from Bengal, where there is a fair amount of East Asian ancestry (my family’s origin is on the eastern edge of Bengal itself).

And it is here that the non-scientific concerns of consumer genomics come into focus. The genetic differences and distances between various South Asian groups are far higher than those between various Northern European groups. Depending on the statistical measure you use, intra-South Asian variation is about one order of magnitude greater than intra-Northern European variation. This is due to geographic partitioning, the caste system, and differential admixture in South Asians between extremely diverged ancestral elements (about half of South Asian ancestry is very similar to Europeans and Middle Easterners, and half of it is extremely different, so how far you are from the 50 percent mark determines a lot).
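The post does not say which statistic is meant, but one common measure of between-group differentiation is Wright's F_ST, the share of total heterozygosity that is due to differences between populations. A rough sketch for two equally weighted populations follows. The allele frequencies are invented purely to show the mechanics and are not real South Asian or European data.

```python
def fst(freqs_1, freqs_2):
    """Wright's F_ST for two equally weighted populations, pooled over SNPs."""
    h_within = 0.0  # average heterozygosity inside each population
    h_total = 0.0   # heterozygosity if the two populations were pooled
    for p1, p2 in zip(freqs_1, freqs_2):
        p_bar = (p1 + p2) / 2.0
        h_within += (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2.0
        h_total += 2 * p_bar * (1 - p_bar)
    if h_total == 0.0:
        return 0.0  # no variation at all, so nothing to partition
    return (h_total - h_within) / h_total


# Made-up allele frequencies at three SNPs (illustration only).
base = [0.50, 0.30, 0.70]
similar = [0.52, 0.28, 0.71]    # a closely related hypothetical group
divergent = [0.70, 0.15, 0.85]  # a more diverged hypothetical group

print(fst(base, similar))
print(fst(base, divergent))
```

The second value comes out much larger than the first. That gap is the kind of "raw material" an ancestry calculator needs in order to tell two groups apart.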

Broadly South Asian

In Northern Europe there is very little genetic variation from the British Isles all the way to the Baltic. The reason for this is historical: massive population turnover in the region 4,500 years ago means that much of the genetic divergence between the groups dates only to the Bronze Age. It is this genetic divergence, the variation, that is the raw material for the inferences and proportions you see in ancestry calculators. There’s just not that much raw material for Northern Europeans.

Remember, the methods require lots of variation in the data as a raw input. You’re making the inference machine work really hard to produce a reasonably robust result if you don’t have that much variation. In contrast to the situation with Northern Europeans, with South Asians the companies are leaving raw material on the table, and just combining diverse groups together.

What’s going on here? As you might have guessed this is an economically motivated decision. Most South Asians know their general heritage due to caste and regional origins (though many Bengalis exhibit some lacunae about their East Asian ancestry). In contrast, many Americans of Northern European ancestry with an interest in genealogy are extremely curious about explicit proportional breakdowns between Northern European nationalities. The direct-to-consumer genomic firms attempt to cater to this demand as best as they can.

As I have stated many times, racial background is to various extents both biological and social. When it comes to the difference between Lithuanians and Nigerians the biological differences due to evolutionary history are straightforward, and clear and distinct. You can generate a phylogenetic history and perform a functional analysis of the differences. Additionally, you also have to note that the social differences exist, but are not straightforward. Like Lithuanians, Nigerians of Igbo background are generally Roman Catholic, while most other Nigerians are not. The linguistic differences between Nigerian languages are great enough that it is defensible to suggest that Hausa, an Afro-Asiatic language, is closer to Lithuanian in its phylogenetic history than it is to the language of the Yoruba.

A Lithuanian American

Contrast this to the situation where you differentiate Lithuanians from French. To any European the differences here are incredibly huge. The history of France, what was Roman Gaul, goes back 2,000 years. After the collapse of the Western Roman Empire, by any measure the people who became French were at the center of European history. In contrast, Lithuanians were a marginal tribe who did not enter Christian civilization until the late 14th century. In social-cultural terms, due to history, the differences between French and Lithuanians are extremely salient to people of French and Lithuanian ancestry. But genetically the differences are modest at best.

If a direct-to-consumer genetic testing company tells you that you are 90 percent Northern European and 10 percent West African, that is a robust result that has a clear historical genetic interpretation. The two elements of one’s ancestry have been relatively distinct for on the order of 100,000 years, with the Northern European element really just a proxy for non-Africans (though it is easy to drill down within Eurasia). In contrast, notice how 23andMe, with some of the best scientists in the business, tells people they are “French-German,” and not French or German. What the hell is a “French-German”? Someone from Alsace-Lorraine? A German descendant of Huguenots? Obviously not.

“French-German” is a cluster almost certainly because there are no clear and distinct genetic differences between French and Germans. Yes, there is a continuum of allele frequencies between these two groups, but having looked at a fair number of people of French and German background in Family Tree DNA’s database I can tell you that France and Germany have a lot of local structure even among people of indigenous ancestry. Germans from the Rhineland are quite often genetically closer to French from Normandy than they are to Germans from eastern Saxony. Some of this is due to gene flow between neighboring regions, but some of this is due to cultural fluidity as to who exactly is German. It is clear that some Germans from the eastern regions are Germanized Slavs. Some Germans from the north exhibit strong affinities to Scandinavians, while Germans from Bavaria and Austria are classically Central European (whatever that means). The average German is distinct from the average French person, but the genetic clustering of the two groups is not clear and distinct.

Remember earlier I explained that the science is predicated on aligning data and models. The cultural model of Northern Europeans is conditioned on diversity and difference which has been very salient for the past few thousand years since the rise and fall of Rome. But the evolutionary genetic history is one where there are far fewer differences. The data do not fit a model that makes much sense to the average consumer (e.g., “you descend from a mix of Bronze Age migrants from the west-central steppe of Eurasia and Mesolithic indigenous hunter-gatherers and Neolithic farmers”). What makes sense to the average American consumer are histories of nationalities, so direct-to-consumer genetic companies try to satisfy this need. Because the needs of the consumer and their cultural expectations are poorly served by the data (genetic variation) and models of population history, you have a lot of awkward kludges and strange results.

A Saxon

Imagine, for example, you want to estimate how “German” someone is. What do you use for your reference population of Germans? Looking at the data there are clearly three major clusters within Germany when you weight the numbers appropriately, with affinities to the northern French, Slavs, and Scandinavians, and various proportions in between. Your selection of your sample is going to mean that some Germans are going to be more German than other Germans. If you select an eastern German sample then western Germans whose ancestors have been speaking a Germanic language far longer than eastern Germans are going to come out as less German. Or, you could just pick all of these disparate groups…in which case, lots of Northern Europeans become “German.”
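The reference-panel problem can be shown with a toy sketch. The code below is not how any company scores ancestry. It simply measures how far a group's allele frequencies sit from a "German" reference, under two choices of panel. Every number is invented, and the cluster names are labels for hypothetical groups, not real data.

```python
def mean_freqs(clusters):
    """Average allele frequencies across the clusters chosen as the panel."""
    n = len(clusters)
    return [sum(c[l] for c in clusters) / n for l in range(len(clusters[0]))]


def distance(freqs_1, freqs_2):
    """Mean squared difference in allele frequency across SNPs."""
    return sum((a - b) ** 2 for a, b in zip(freqs_1, freqs_2)) / len(freqs_1)


# Invented allele frequencies at three SNPs for four hypothetical clusters.
west_german = [0.30, 0.60, 0.45]
east_german = [0.40, 0.50, 0.55]
north_german = [0.35, 0.65, 0.40]
north_french = [0.29, 0.61, 0.46]

# Two ways to define the "German" reference population.
eastern_panel = mean_freqs([east_german])
pooled_panel = mean_freqs([west_german, east_german, north_german])

print(distance(west_german, eastern_panel))  # western Germans vs. eastern panel
print(distance(west_german, pooled_panel))   # western Germans vs. pooled panel
print(distance(north_french, pooled_panel))  # northern French vs. pooled panel
```

With these made-up numbers, western Germans sit far from an eastern-only panel, while the pooled panel is broad enough that the northern French cluster sits close to it. That is the trade-off described above: a narrow panel makes some Germans "less German," and a broad one makes non-Germans "German."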

Consumers want genetic tests to reflect strong cultural memories which were forged in the fires of the rapidly protean and distinction-making process of cultural evolution. But biological and cultural evolution exhibit different modes (the latter generates huge between-group differences) and tempos (those differences emerge fast). The ancestry results many people get are the outcomes of compromises to thread the needle and square the circle.

All the above is half the story. Next I’ll explain why “deep history” has to be massaged to make recent history informative and comprehensible….

* Also, I have a little historical perspective because of my friendship with the person who arguably created this sector, .

15 thoughts on “Ancestry inference is precise and accurate(ish)”

  1. Very interesting read Razib. I once checked admixture results of for an individual from India. It showed more detailed predictions as compared to the original AncestryDNA info.

  2. My 23andMe results on ancestry are not too surprising. It finds Ashkenazic ancestry from my father’s side, which is where a Sephardic ancestor was located. It finds Scandinavian ancestry, which is mostly from my mother’s side, which again isn’t surprising, since that’s where I had a Swedish ancestor. I have 0.1% Native American via my mother (and ultimately my maternal grandfather) which was a bit surprising, but he had one grandmother whose family was in the U.S. since the 1700s, so it’s the plausible place to have that result turn up.

    The one exception is British/Irish and French/German. Judging by my family tree, I certainly have more ancestors from Germany than I do from Ireland/England. But at every confidence level, 23andme finds I have about twice as much British/Irish than French/German. More oddly, my maternal grandmother (who is still alive, and has been sequenced) is only found to be 31% French/German. This is odd because she is not only 100% German, but both of her parents were from a German-speaking town in Austria-Hungary. Their ancestors came from all over German-speaking Europe – from Bavaria to Wurzburg to Alsace – but there’s not a non-German among them judging by surnames.

    Regardless, I learned to take these results with a grain of salt. Which is why when I see on various forums people using 23andme data to “prove” that all of the U.S. (including the Midwest) has more “German” ancestry than “British” I bring up the anecdote of my grandmother.

  3. I like the 23andMe chromosome painting feature, and the overall results are close to what I know about my family.
    European 93.9%, Middle Eastern & North African 5.3% etc. make sense to southern Italian/Sicilian origins.
    The Sub-Saharan African 0.1% is not a stretch, although at such low percent, is it even present?
    East Asian & Native American 0.3% is where in my opinion they get confused.
    East Asian comes in at 0.2% (broadly East Asian 0.2%) and Native American at <0.1%. How do they arrive at any NA ancestry with such a low content? Less than one percent does not seem detectable in a consumer test.
    East Asian ancestry content is not impossible, who knows what ancestry was brought into the highly mixed Mediterranean, but on the 23andMe "Your Ancestry Timeline" feature, they chose to list Native American and propose that I had a 100% NA ancestor (7th great or greater) rather than showing it as an East Asian ancestor.
    All my grandparents arrived here in the US from Italy. There was no time for any Native American ancestors. It is almost impossible for that, and infidelity would have shown up with less Italian 77.1% and more North Western Euro 0.3% as Italians were not that common when the colonists may have mixed with Native Americans. I am aware East Asians are similar to Native Americans.

  4. I liked the explanation that consumer expectations are that cultural and national differences will be revealed but that DNA isn’t aligned (i.e., doesn’t “map”) to that purpose. Nature and nurture can proceed in different directions, on different time scales.

  5. Seems I can’t do anything right today. I meant to respond to my own post that people use 23andme data to prove less German than British.

  6. While this is true for the rest of the world, the prominence of Europe makes this even more salient: tiny genetic differences can make for huge differences in cultural and behavioral profile, as the Hajnal line makes most clear. Ultimately (if mostly unwittingly) this is why White North Americans are interested in their precise ancestry.

  7. tiny genetic differences can make for huge differences in cultural and behavioral profile, as the Hajnal line makes most clear.

    this is literally begging the question. there is no evidence for the genetic differences you take as a given. asserting that they exist won’t convince. (they may exist, but that’s no reason to simply assume that they do)

  8. Razib (dada bhai),

    Thank you for a clearly articulated account of the admixture constellation.

    Having tested across all of the companies and being a pioneer with FTDNA, my biggest concern is how the MyOrigins algorithm can produce results so disparate from other companies’.

    My family is highly admixed across the four major anthropological groups. MyO has me at 29% Southern European, my brother at 5%, our mother at 29%, yet every other company has that component <5%!

    They also seem to under represent the Sub Saharan and Asian components compared to the other companies (including Gedmatch). I have not had a perfunctory response from them in all of the years that I have been asking.

    Are you able to provide any clarity?


  9. I thought FTDNA’s Asian clusters in myOrigins were great. The Central Asia component reliably extracted a lot of admixture from Indians representing post-Neolithic West/Central Asian ancestry.

    The South Asia cluster was not really linked to any historical time or place in its description though. It should have said something like “it represents pre-Bronze Age India” or something.

    I think ideally an ethnic composition feature should have a recent timeframe (comparison to modern populations) and an ancient one. The ancient one should feature, for Europeans, Neolithic farming populations from Anatolia and the Levant, and a Loschbour-like hunter gatherer population. The tricky aspect is to create a Steppe component that will accurately pull the right proportion for Europeans and other Asians alike. A neolithic Iranian cluster would help differentiate a lot in Asia as well.

    So something including:

    Neolithic Anatolia

    Neolithic Levant

    Western Hunter Gatherer

    Steppe + Eastern Hunter Gatherer (Yamnaya may be best for drawing out a sensible admixture proportion even if it may historically not be the right fit)

    Neolithic Iranian

    Either Caucasus Hunter Gatherers or something akin to Chalcolithic Iranian, like an ancient Armenian group to represent “indigenous”, non-Steppe Caucasian ancestry that should peak in Caucasian populations

    Native American

    East Eurasian

    Aboriginal Australian / Southeast Asian (and perhaps a separate Oceanian component if needed)

    If and when ancient DNA from India is sequenced, it can simply be added to fill the void and capture admixture that would otherwise split up into the pre-existing categories.

    I feel like this approach of having two phases to a calculator would achieve the goal of giving people a digestible breakdown of what they’re looking for (i.e, am I German or Scottish?) but also introduce them to the valuable knowledge being uncovered by the hard working scientists as to our origins and dispel many myths/legends (particularly racial ones).

    I hope whoever’s at FTDNA now is taking an approach like this one.

  10. Also an interesting idea is to have a feature which “averages” results from a family. For instance, even siblings will have variance in their results due to the nature of genetic recombination and the algorithms used to assess allele frequency distribution.

    So if someone wants a “clearer” result, they could get as many people tested from their family as they can and the tool would spit out a “calculated” score. You could have a visual depiction of a family tree, click any person in it, and it would calculate their “weighted” numbers based on the results of neighboring individuals in the tree. So all the siblings get the same result.

    It would be more like “Ancestry Composition” rather than “Ethnic Composition”. It makes sense for siblings to have varying ethnic composition because of genetic recombination, but “ancestry” composition shouldn’t change… they have the exact same ancestors.

  11. They also seem to under represent the Sub Saharan and Asian components compared to the other companies (including Gedmatch). I have not had a perfunctory response from them in all of the years that I have been asking.

    that is very strange. can i ask your specific background?

  12. At the risk of manifesting the “intractable nightmare” of your other recent Ancestry blog (“Ancestry inference is precise and accurate(ish)”) as the genetic lay-person who takes these tests too literally, would you be interested in taking a stand on whether it is me, or the FTDNA Family Finder algorithm, that just doesn’t get my intra-family results? My parent is nearly 50% of an ethnicity (Scandinavian) vs. me 0%? I suspect that your blog comment addresses this matter fully to more erudite consumers than me: If I recall correctly, you consulted in the development of the FF admixture tool.

    “In Northern Europe there is very little genetic variation from the British Isles all the way the Baltic.”
    However, is a 50% admixture variance a bit extreme even if the populations are inter-continental or even if allowing for uneven inheritance patterns? To carry my specific example a little further, Scandinavian results that I have seen on Gedmatch calculators such as Eurogenes K36 show much higher levels of whatever N. European population is included than my father or me. On the Eurogenes K13 calculator, we have the same top Single Mode population. Is not Gedmatch a similarly referenced time-frame as FTDNA?
    We buy these tests blindly, in my opinion, based on commercials that promise to tell you “how Irish you are” if it’s St. Patrick’s Day. If these companies would feature model ancestry reports of proto-typical populations before purchase, e.g. showing up-front the broad categories such as “British / Irish” or “French / German,” that are typical results (apologies for Euro-centric examples), they might mitigate much of the forum Q&A and far-flung testing.
    I respect the FTDNA product / experience overall, especially the MtDNA and Y DNA research, but admit that my parent’s AuDNA variances with mine really challenged my comfort-level with the state of the art (or science). FTDNA has a stated mission on its website of providing ancestry services to match family members. Yet, it either offers the ethnicity function as “fluff” to sell more services or believes that it proffers some utility in making family connections. Unless my intra-family results are an anomaly, it seems that the FTDNA admixture algorithm cuts a broad-swath whereas 23andMe tries hard to find any possible minute admixture, misleading naive customers like me who have only partially documented family trees. Add to that the alleged different time periods measured by commercial DNA tests … hoping that you continue these helpful blogs.
