I generated the plot above using the 1000 Genomes data set. BEB = Bangladeshis from Dhaka, STU are Sri Lankan Tamils, ITU are Telugus, while PJL are Punjabis from Lahore, and GIH are Gujaratis (collected in Houston). These are big categories. The South Indian population sets exhibit some structure in terms of caste; there are a few Brahmins, as well as some Dalits. The Bengalis are strangely coherent for a South Asian population, shifted toward Cambodians. The Gujaratis split between a large number of Patels and various other groups. To my surprise the Punjabi samples are very diverse.

To a great extent it recapitulates the results of the 2009 paper Reconstructing Indian Population History. What you see to the left is the “ANI-ASI cline.” Basically South Asians, from Pashtuns all the way to Paniyas, fall along a spectrum of genetic distance from West Asian and European populations. A secondary element is that some groups, such as Bengalis and many Austro-Asiatic tribes, are shifted toward East Asians. An old hypothesis of the ethnogenesis of South Asian peoples is that they are a variegated mix of “Caucasoid” populations intrusive to the subcontinent, which was originally inhabited by an “Australoid” element. Though these terms are somewhat archaic, the general point seems to get at something visually clear: some South Asians look nearly Mediterranean in appearance, while others are hard to distinguish from Australian Aboriginals (at least superficially). And of course, most of us are somewhere in the middle.

The insight of the Reich group was to use Andaman Islanders as a proxy for a primal indigenous population, and infer that the admixture alluded to above consisted of a very West Eurasian-like population, the Ancestral North Indians (ANI), and an indigenous group closer to East Eurasians, though very diverged, the Ancestral South Indians (ASI). Ergo, the ANI-ASI cline. Using the most closely related population to infer the “ghost population,” they were able to infer admixture proportions even though no “pure” ASI group was available as a reference against which they could judge. Clever strategies like this are important, because the reference populations you use to adduce admixture events (or lack thereof) strongly impact the nature of your results. Using simple PCA or model-based clustering, as with ADMIXTURE, one would fix South Indian Dalits and tribal populations as the “purest” aboriginal people. ~100% “Australoid.” And other groups could be modeled as a “Caucasoid/Australoid” mix. But this model was not satisfactory because even low caste South Indian groups were more shifted toward West Eurasians than you’d expect.

Using a statistic called the F4 ratio, they estimated that ANI ranges from 65-75% in the Northwest Indian populations, down to 15-30% in the lower caste South Indian ones. A 2013 paper, Genetic Evidence for Recent Population Mixture in India, attempted to infer an admixture period (two to four thousand years before the present), as well as a possible secondary pulse in some Indo-European groups. This stands to reason today when you note that most Indian groups share the most unique drift trajectory with the ancient Caucasian hunter-gatherer found in Kotias, but a minority, mostly upper caste, are closer to the Sintashta steppe culture.
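For readers who haven't seen an F4 ratio in action, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the allele frequencies are simulated, the population names (`west_a`, `west_b`, `indigenous`, `outgroup`) and the `drift` helper are hypothetical, and this is not the particular set of populations the 2009 paper used. The mixed population is built as an exact blend with no drift after mixing, which is why the ratio recovers the mixing proportion so cleanly. Real data are noisier, and the estimator depends on the assumed tree being right.

```python
import numpy as np

def f4(a, b, c, d):
    """f4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    return float(np.mean((a - b) * (c - d)))

def drift(freqs, sd, rng):
    """Hypothetical helper: add random drift, keeping frequencies inside (0, 1)."""
    return np.clip(freqs + rng.normal(0.0, sd, freqs.shape), 0.01, 0.99)

rng = np.random.default_rng(1)
root = rng.uniform(0.1, 0.9, 5000)       # made-up ancestral allele frequencies
outgroup = drift(root, 0.05, rng)        # distant outgroup
west_stem = drift(root, 0.10, rng)       # branch shared by the two western references
west_a = drift(west_stem, 0.03, rng)     # reference A
west_b = drift(west_stem, 0.03, rng)     # reference B, used here as the mixing source
indigenous = drift(root, 0.10, rng)      # stand-in for the second source

true_alpha = 0.7                         # made-up mixing proportion
mixed = true_alpha * west_b + (1 - true_alpha) * indigenous

# Ratio of two f4 statistics that differ only in swapping "mixed" for "west_b".
alpha_hat = (f4(west_a, outgroup, mixed, indigenous)
             / f4(west_a, outgroup, west_b, indigenous))
print(round(alpha_hat, 6))
```

The point of the ratio is that the shared drift between `west_a` and the western source cancels, leaving only the fraction of that source in the mixed group.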

I’m putting this post up because people are asking me about a paper profiled in ArsTechnica, The caste system has left its mark on Indians’ genomes. Actually the 2009 Reich lab paper already concluded this. So what’s the major finding of this paper that makes it unique? We’ll start with the abstract, Genomic reconstruction of the history of extant populations of India reveals five distinct ancestral components and a complex structure:

India, harboring more than one-sixth of the world population, has been underrepresented in genome-wide studies of variation. Our analysis reveals that there are four dominant ancestries in mainland populations of India, contrary to two ancestries inferred earlier. We also show that (i) there is a distinctive ancestry of the Andaman and Nicobar Islands populations that is likely ancestral also to Oceanic populations, and (ii) the extant mainland populations admixed widely irrespective of ancestry, which was rapidly replaced by endogamy, particularly among Indo-European–speaking upper castes, about 70 generations ago. This coincides with the historical period of formulation and adoption of some relevant sociocultural norms.

So the two major results which warrant this paper being published are that instead of two ancestral populations they posit four, and that the admixture between some of these is considerably more recent than in the 2013 paper. I think the first conclusion is wrong, and the second is too strong.

The authors make much of the fact that they have new samples. And their SNP-chip has a high density. But I’m confused why they didn’t integrate the 1000 Genomes data. The paper was received in early July 2015, and I know there was 1000 Genomes data from all the above groups by then. They didn’t even bother to use the HapMap GIH sample, which was definitely there!

The figure to the right shows the crux of their results. They used ADMIXTURE to break apart the ancestries of their Indian data set into four clusters. Through cross-validation they established that K = 4 was the optimal parameter fit for their data. Two of the populations are previously known: ANI and ASI. But they also find that there is an “Ancestral Austro-Asiatic,” and “Ancestral Tibeto-Burman,” cluster, AAA and ATB respectively. Because they did not use full labels, it can be hard to decipher, but they use this plot to assert that people of the Khatri caste are nearly 100% ANI, while Paniyas are nearly 100% ASI. Additionally, they found several groups which were nearly 100% AAA and 100% ATB.

Long-time readers will see the immediate problem: you can’t use ADMIXTURE like this! There is no guarantee that a group which comes out at 100% for a cluster actually corresponds to a genuine discrete ancestral population that existed in reality. That is, these sorts of models posit a certain number of ancestral populations, and force individuals into being combinations of those. The model is constrained by the data you are putting into it to generate the results. For example, if I took Uygurs and Europeans, and did a K = 2, the Uygurs may form one cluster, and the Europeans another, at 100% levels. But we know from history and other methodologies that the Uygurs are a recently mixed group (within the last 2,000 years). Nevertheless if you tell the package to assume K = 2 with Uygurs and Northern Europeans, then it will place these into two distinct groups. And in fact, the result tells you something real and significant about the relatedness of the individuals in the data…but it doesn’t necessarily tell you anything about the real population history.
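To make the point concrete, here is a small Python sketch with simulated data (all names and numbers are made up). It does not run ADMIXTURE itself. It only uses the form of the model, in which each individual's expected allele frequency is the Q-weighted average of the cluster frequencies P. Given a data set containing only a "western" group and a 50/50 mixed group, the description "the mixed group is 100% its own cluster" and the description "the mixed group is half western, half eastern" predict identical allele frequencies, so the genotypes cannot tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps, n_per_group = 1000, 10

# Made-up source allele frequencies; the mixed group is an even blend of the two.
p_west = rng.uniform(0.05, 0.95, n_snps)
p_east = rng.uniform(0.05, 0.95, n_snps)
p_mixed = 0.5 * p_west + 0.5 * p_east

# The data set holds only western and mixed individuals (no eastern samples).
true_freqs = np.vstack([np.tile(p_west, (n_per_group, 1)),
                        np.tile(p_mixed, (n_per_group, 1))])
genotypes = rng.binomial(2, true_freqs)   # 0/1/2 allele counts per SNP

def log_likelihood(G, Q, P):
    """Binomial log-likelihood under the Q @ P model (constant term dropped)."""
    F = Q @ P
    return float(np.sum(G * np.log(F) + (2 - G) * np.log(1 - F)))

west_rows = np.tile([1.0, 0.0], (n_per_group, 1))

# Description 1: clusters are "west" and "mixed"; mixed individuals are 100% cluster 2.
Q_own = np.vstack([west_rows, np.tile([0.0, 1.0], (n_per_group, 1))])
P_own = np.vstack([p_west, p_mixed])

# Description 2: clusters are "west" and "east"; mixed individuals are 50/50.
Q_mix = np.vstack([west_rows, np.tile([0.5, 0.5], (n_per_group, 1))])
P_mix = np.vstack([p_west, p_east])

ll_own = log_likelihood(genotypes, Q_own, P_own)
ll_mix = log_likelihood(genotypes, Q_mix, P_mix)
print(ll_own, ll_mix)
```

Both descriptions fit the genotypes equally well, so a 100% bar in the plot is a statement about how the samples relate to each other, not evidence that the group is unmixed.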

There’s a fair amount of evidence that Austro-Asiatic populations in India are not indigenous, nor are they pure. A major hole in this paper is the total lack of acknowledgement that Austro-Asiatic languages are much more common in Southeast Asia, and it seems likely that they were intrusive to India. If so, modern Austro-Asiatic peoples can be thought of as a compound of migrants with the local substrate.

The ATB element is found only in Austro-Asiatic tribes and Bengali Brahmins. That’s reasonable, because both populations exhibit a relationship to East Asian groups. While the Brahmins of South India absorbed a minor element of local Dravidian ancestry, those of Bengal absorbed Tibeto-Burman and Austro-Asiatic, which is found in higher concentrations among Bengalis proper.

To repeat, ADMIXTURE does not necessarily give you real population combinations!!! In fact, populations are to some extent a social construct, insofar as they’re just really collapsing the genetic variation which is the result of a particular demographic and pedigree history. The “ANI” group proffered here is an artifact. The Khatri are not a representative of a pure population which is similar to the ancestral ANI. The Paniya are not 100% ASI, they are just the most ASI. The Birhor are not 100% Ancestral Austro-Asiatic, they are just the most distinctively Austro-Asiatic. The Jamatia are not pure Ancestral Tibeto-Burman; most of these Northeastern tribes have some ANI/ASI admixture. They’re just the most Tibeto-Burman.

Instead of relying on ADMIXTURE so much, they should have also utilized D-stats and f-stats (not as sensitive to drift), as well as TreeMix. I think that would have quickly shown that some of these “pure” groups were mixed.
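As a rough illustration of the kind of check I mean, here is a toy Python sketch of one f-statistic, the f3 admixture test, on simulated allele frequencies. The populations and numbers are invented. f3(target; A, B) averages (target - A) * (target - B) across SNPs, and a negative value is a signal that the target is a mix of groups related to A and B. The mixed target below has no drift after mixing. In real populations drift pushes f3 upward, so a positive value does not rule out admixture.

```python
import numpy as np

def f3(target, src1, src2):
    """f3(target; src1, src2): mean over SNPs of (target - src1) * (target - src2)."""
    return float(np.mean((target - src1) * (target - src2)))

rng = np.random.default_rng(2)
n_snps = 5000

# Made-up allele frequencies for two differentiated source populations.
src1 = rng.uniform(0.05, 0.95, n_snps)
src2 = rng.uniform(0.05, 0.95, n_snps)

# A target that is an exact 60/40 blend of the two sources.
admixed = 0.6 * src1 + 0.4 * src2

# A target that simply drifted away from src1, with no contribution from src2.
unmixed = np.clip(src1 + rng.normal(0.0, 0.05, n_snps), 0.01, 0.99)

f3_admixed = f3(admixed, src1, src2)
f3_unmixed = f3(unmixed, src1, src2)
print(f3_admixed, f3_unmixed)
```

For the blended target every per-SNP term is -0.6 * 0.4 * (src1 - src2)^2, so the statistic is negative by construction. Applied to a supposedly "pure" group, a negative f3 with two plausible sources would be a quick red flag.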

Second, there is the issue of time-since-admixture. They obtained lower values than the 2013 paper. Why? Because they use source populations (and probably the methodology) which are somewhat different from that earlier work. Honestly if some of these populations are compounds, then it doesn’t make sense to necessarily use them as idealized donors in an admixture event. The AAA tracts are most definitely artifacts in my opinion, since the tracts are the outcome of a previous admixture event.
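For intuition on where a number of generations comes from, here is a minimal Python sketch of one common family of dating approaches, which reads the date off how fast admixture-induced linkage disequilibrium decays with genetic distance. I am not claiming this is the method either paper used. The curve is synthetic and noise-free, the function name is hypothetical, and the 29 years per generation is an assumed conversion factor.

```python
import numpy as np

def generations_since_admixture(distance_morgans, weighted_ld):
    """Fit log(LD) = log(A) - t * d by least squares and return t in generations."""
    slope, _intercept = np.polyfit(distance_morgans, np.log(weighted_ld), 1)
    return -slope

# Synthetic decay curve for a single pulse 70 generations ago (0.5 cM to 20 cM).
distance = np.linspace(0.005, 0.20, 40)
true_generations = 70
curve = 0.02 * np.exp(-true_generations * distance)

t_hat = generations_since_admixture(distance, curve)
years_per_generation = 29   # assumed generation time
print(t_hat, t_hat * years_per_generation)
```

A single-exponential fit like this assumes one pulse of mixing between two clean sources. It says nothing about what happens when a source is itself the product of an earlier admixture event, which is exactly the concern here.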

Finally, the authors allude to a “Southern Route” out of Africa, and imply that the Austro-Asiatic peoples arrived with it. The best work today suggests that Austro-Asiatic peoples expanded with an agricultural wave ~4,000 years ago, with a locus of origin in the uplands of South China. Therefore, they are not primal. A simple inspection of the map of Austro-Asiatic languages forces one to ask the question of direction of migration.

I offer this critique in the spirit of post-publication review. Perhaps the authors will clarify, as I’m genuinely puzzled by the interpretations they offered.

