South Asian Genotype Project


It’s been a few years since I’ve done any serious “Genome Blogging.” Mostly I’ve been very busy and there isn’t much low-hanging fruit left as it is. But today I want to announce that I’ll be running the generically titled “South Asian Genotype Project.”

The way it works is simple: send me a 23andMe, Ancestry, or Family Tree DNA raw genotype file to contactgnxp -at- gmail.com (though 23andMe’s new chip has far less overlap with other platforms than the earlier ones did, so it’s probably best if you were typed before August 2017).

In the subject please put:

  1. “South Asian Genotype Project”
  2. The state/province your family is from
  3. Ethnolinguistic group
  4. If applicable, caste

In the body of the email you can put Y and mtDNA and any other information you want. Obviously your data is confidential and I won’t identify you by name, just ethnolinguistic group and such.

Since the last time I did this I have some scripts that make this a lot easier, so hopefully I’ll be adding individuals to this spreadsheet every few days. I’ll give project members an ID and try to email them when the results are up.

The main motivator for this project on my part is that people still ask me questions about Sinhalese, Nasrani Christians, and other assorted groups which we don’t have answers to because current research projects haven’t focused on them.

Since Zack worked on the Harappa Ancestry Project we know a lot more about South Asian ancestry. Basically, there is an ANI-ASI cline, and some South Asians have exogenous ancestry off this cline. Indian Jews have Middle Eastern ancestry, while Bengalis have East Asian ancestry, and some groups in Pakistan have African ancestry. With that in mind I’ll be testing a smaller number of populations. The marker set is 240,000 SNPs by the way.
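To get a feel for why platform overlap matters when merging raw files from different services into one marker set, here is a minimal sketch in Python. It is not the script used for this project. The function names `read_rsids` and `overlap` are hypothetical, and the toy inputs are only loosely modeled on a tab-separated export with `#` comment lines and a comma-separated export with a header row, which is an assumption about the file layouts.

```python
import io


def read_rsids(handle):
    """Collect marker IDs from the first column of a raw genotype export."""
    rsids = set()
    for line in handle:
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        # Treat commas as tabs so both CSV-style and TSV-style rows split the same way.
        first = line.replace(",", "\t").split("\t")[0].strip('"')
        # Skip a header row if the export has one.
        if first.lower() == "rsid":
            continue
        rsids.add(first)
    return rsids


def overlap(a, b):
    """Return (shared marker count, shared fraction of the smaller set)."""
    shared = a & b
    return len(shared), len(shared) / min(len(a), len(b))


# Toy data (hypothetical markers, not real genotypes).
chip_a = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs1\t1\t100\tAA\n"
    "rs2\t1\t200\tAG\n"
    "rs3\t2\t300\tCC\n"
)
chip_b = io.StringIO(
    'RSID,CHROMOSOME,POSITION,RESULT\n'
    '"rs2","1","200","AG"\n'
    '"rs3","2","300","CC"\n'
    '"rs9","3","50","TT"\n'
)

shared_count, shared_fraction = overlap(read_rsids(chip_a), read_rsids(chip_b))
print(shared_count, round(shared_fraction, 2))
```

The fewer markers two chips share, the fewer SNPs survive once everyone is merged onto a common set, which is why a chip with little overlap is harder to include.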

Below are some representative results. You can see that my results from three DTC services are basically the same. Also, some South Indian groups (see Pulliyar) show “Dai” ancestry, though I’m pretty sure that’s just because I didn’t sample as much on the extreme portion of the ASI cline.

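| ID | Armenians | Belorussian | C_India | Dai | Nigerian | NWIndia | S_India | YemeniteJews |
|---|---|---|---|---|---|---|---|---|
| Balochi | 34% | 1% | 0% | 0% | 0% | 66% | 0% | 0% |
| Bangladesh_Razib (23andMe) | 0% | 0% | 14% | 14% | 0% | 15% | 57% | 0% |
| Bangladesh_Razib (Ancestry) | 0% | 0% | 14% | 14% | 0% | 15% | 57% | 0% |
| Bangladesh_Razib (ftDNA) | 0% | 0% | 13% | 14% | 0% | 15% | 58% | 0% |
| Chenchus | 0% | 0% | 1% | 1% | 0% | 0% | 98% | 0% |
| Dharkars | 0% | 0% | 16% | 2% | 0% | 38% | 44% | 0% |
| Dusadh | 0% | 0% | 21% | 1% | 0% | 2% | 76% | 0% |
| Iranians | 65% | 2% | 1% | 2% | 0% | 20% | 0% | 10% |
| Kallar | 0% | 0% | 0% | 0% | 0% | 0% | 100% | 0% |
| Kurumba | 0% | 0% | 0% | 0% | 0% | 4% | 96% | 0% |
| Meghawal | 0% | 0% | 10% | 0% | 0% | 26% | 64% | 0% |
| MumbaiJews | 18% | 0% | 4% | 0% | 0% | 39% | 28% | 11% |
| Naga | 0% | 0% | 0% | 90% | 0% | 0% | 10% | 0% |
| NorthKannadi | 0% | 0% | 0% | 2% | 0% | 0% | 98% | 0% |
| Pakistani | 3% | 7% | 19% | 6% | 0% | 38% | 23% | 4% |
| Pathan | 12% | 3% | 1% | 1% | 0% | 80% | 3% | 0% |
| TamilNadu_Iyer | 0% | 1% | 2% | 0% | 0% | 42% | 54% | 0% |
| TamilNadu_Nadar | 0% | 0% | 0% | 1% | 0% | 0% | 99% | 0% |
| UP_Kayastha | 0% | 0% | 17% | 2% | 0% | 42% | 39% | 0% |
| WestBengal_Kayastha | 0% | 2% | 15% | 6% | 0% | 14% | 64% | 0% |
| Pulliyar | 0% | 0% | 0% | 7% | 0% | 0% | 93% | 0% |
| DalitTN | 0% | 0% | 0% | 1% | 0% | 0% | 99% | 0% |
| Velama | 0% | 0% | 9% | 0% | 0% | 22% | 68% | 0% |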
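
The claim that the three DTC services give basically the same result can be checked with a little arithmetic. Below is a minimal sketch in Python using only the three Bangladesh_Razib rows from the results above; the helper name `max_abs_diff` is hypothetical and this is not part of the project's actual scripts.

```python
# Component order as in the results above.
COMPONENTS = ["Armenians", "Belorussian", "C_India", "Dai",
              "Nigerian", "NWIndia", "S_India", "YemeniteJews"]

# Percentages for the same individual typed on three platforms.
razib = {
    "23andMe":  [0, 0, 14, 14, 0, 15, 57, 0],
    "Ancestry": [0, 0, 14, 14, 0, 15, 57, 0],
    "ftDNA":    [0, 0, 13, 14, 0, 15, 58, 0],
}


def max_abs_diff(rows):
    """Largest spread, in percentage points, of any component across platforms."""
    return max(max(col) - min(col) for col in zip(*rows.values()))


# Each row should add up to roughly 100% (rounding can shift it by a point).
row_sums = {platform: sum(values) for platform, values in razib.items()}
spread = max_abs_diff(razib)

print(row_sums)
print("largest cross-platform difference:", spread, "percentage point(s)")
```

The only disagreement is a one-point shift between C_India and S_India in the ftDNA row, which is within what rounding alone could produce.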

36 thoughts on “South Asian Genotype Project”

  1. Hi Razib,

    Given that ADMIXTURE components at low resolution in W Eurasia have proven to match the ancestral gene pools in that region pretty well (“Mediterranean” to Anatolian Neolithic, “West Asian” to CHG/Iranian Neolithic, “SW Asian” to Natufian, “NE European” to Yamnaya or mix of Yamnaya and WHG) what do you make of the components that repeatedly appear in South Asia? Intuitively, I would associate the “South Asian” component dominating the West Coast of peninsular India with a South Indian Neolithic associated with ashmounds and Zebu pastoralism that spread Dravidian languages, which grades over from “West Asian” in the Indus Valley associated with Iranian Neolithic, and then the “Oceanian” percentages in Southernmost Indian tribals with local HG substrates, and a mix of “Onge” and “East Asian” with the Austroasiatic contribution in the East. What do you think?

    Is there any component associated with Gangetic populations?

  2. years ago the model i was promoting based on admixture was like so:

    west asian farmer + steppe + native eastern eurasian.

    some of the components might be picking up well admixed groups like ONLY west asian farmer + native eastern eurasian. some of the south indian non-brahmins have very little steppe influence.

  3. I have sent my myheritage raw data (which is similar to ftdna format). But I was confused regarding the format of e-mail but I have added all of my info in subject anyway(I hope I did it right).

  4. Thanks for your answer. Several more questions.

    Don’t you find it odd that a single “South Asian” component seems to dominate the whole of the Subcontinent? Ordinarily you would expect multiple Neolithic events, which did occur in S Asia (South Indian Neolithic in the C Deccan, Gangetic Neolithic, plus Indus Neolithic which is the oldest) to produce their own autosomal signals segmented by ADMIXTURE even if the underlying genetic composition of each component is the same. The population movements/expansions associated with each component would have occurred several thousand years ago, which is more than enough time to produce an autosomal component through drift.

    What does the single South Asian component suggest to you? That the entire continent was populated by Dravidian lookalikes when the Indus Neolithic began, even though the Dravidian expansion in the South Indian Neolithic and Iron Age had not yet taken place, and in either case did not reach the Indus and Ganges, which therefore most likely did not speak Dravidian languages even after the Dravidian expansion? How was this homogeneity achieved?

  5. “Ordinarily you would expect multiple Neolithic events, which did occur in S Asia (South Indian Neolithic in the C Deccan, Gangetic Neolithic, plus Indus Neolithic which is the oldest) to produce their own autosomal signals segmented by ADMIXTURE even if the underlying genetic composition of each component is the same.”

    I think that ANI is probably a single well mixed combination of an indigenous N to S cline (some of which shows up as ASI and some of which shows up as ANI), an Iranian Neolithic sourced Indus Valley Neolithic and Indo-Aryan mix. ADMIXTURE would not distinguish between successive Neolithic events that all affected the same population except at very high K and even then only if there is a more pure reference population to correspond to them. The Ganges Neolithic is probably derivative of the Indus River Neolithic in terms of population.

    “That the entire continent was populated by Dravidian lookalikes when the Indus Neolithic began, even though the Dravidian expansion in the South Indian Neolithic and Iron Age had not yet taken place, and in either case did not reach the Indus and Ganges, which therefore most likely did not speak Dravidian languages even after the Dravidian expansion? How was this homogeneity achieved?”

    The Indus Neolithic almost surely involves migration into an area with indigenous Paleolithic South Asians and the genetic cline across Paleolithic South Asia almost surely involved a less extreme internal range of variation than the differences between Paleolithic South Asians and the Iranian Neolithic, Indo-Aryan and Austro-Asiatic immigrants, respectively. (It is also very clear to me based upon the anthropology showing very thin trade ties between IVC and the South Asian Neolithic, which came later and used different crops, that the IVC culture was not linguistically Dravidian and instead probably spoke a language related to that of Iranian farmers.)

    One of the puzzles of South Asia linguistically is that the time depth of the Dravidian language family is very young; younger even than the languages derived from Sanskrit, yet the time depth of private South Asian Y-DNA and mtDNA haplogroups (e.g. Y-DNA H) is very old (tens of thousands of years) and those private South Asian uniparental DNA haplogroups are more common in Dravidians than in linguistically Indo-European South Asians.

    My conjecture regarding what happened is that members of the pre-Dravidian language family (and possibly also extinct language isolates) once covered a range of South Asia that covered pretty much everyplace that the Indus River Valley Civilization did not extend. Then, the Indo-Aryan invasion’s initial success wiped out those languages everywhere except a small relict area probably about midway North to South on the eastern coast of the Deccan Peninsula. But, within a couple of generations, the last outpost of Dravidians surged out and recovered most, but not all of the territory taken by the Indo-Aryans in a manner not unlike the Reconquista of the Iberian Peninsula after it fell to the Moors. Their Dravidian dialect became the proto-Dravidian dialect of rebooted Dravidian and is the most recent common ancestor of all Dravidian languages today due to this bottleneck even though the language family is much, much older than its most recent common ancestor.

    Among the data favoring this conjecture is that most of linguistically Indo-European South Asia shows signs of two waves of Indo-European admixture, while most of linguistically Dravidian South Asia shows only one, which is older than the Indo-European wave to the North despite the fact that the Indo-Aryan invasion was a north to south process.

    Now, how old the language family is that experienced the Bronze Age bottleneck that gave rise to the modern Dravidian language family is another conundrum. I think that there is a good probability that this language family coincided with the emergence of the South Asian Neolithic ca. 2500 BCE and may have replaced the previous hunter-gatherer languages of South Asia, and that it may have arrived with the same people who brought the outside domesticates from the African Sahel that were crucial to that agricultural revolution and taught South Asians how to cultivate them. But, either way, all other Dravidian and/or pre-Dravidian languages were wiped out in a bottleneck event that probably took place ca. 1500 BCE to 1100 BCE.

  6. What is the interpretation of C_India component, which is the highest in Pakistani and Dusadh (I forgot who these people are, except I know some gurkha-watchmen type who used to claim Dusadh)?

  7. Aren’t West Bengal Kayasthas (also Bengalis in general) technically closer to UP Kayasthas or UP-Bihar mid-castes, excluding East Asian parts?

  8. “Aren’t West Bengal Kayasthas (also Bengalis in general) technically closer to UP Kayasthas or UP-Bihar mid-castes, excluding East Asian parts?”

    i only have 1 sample. they look like bengalis in general.

    (i am 1/4 east bengal kayastha by ancestry, though conversion to islam happened on my maternal grandfather’s side centuries ago supposedly)

  9. “What is the interpretation of C_India component, which is the highest in Pakistani and Dusadh (I forgot who these people are, except I know some gurkha-watchmen type who used to claim Dusadh)?”

    they’re mostly guju patels. so i don’t know, do you have a good interp? just wanted something btwn south indian and punjabi/pashtun

  10. “i only have 1 sample. they look like bengalis in general.

    (i am 1/4 east bengal kayastha by ancestry, though conversion to islam happened on my maternal grandfather’s side centuries ago supposedly)”

    Yeah, that makes sense. Bengali Muslims are local converts from Hindu/Buddhist castes, and WB_Kayastha clustering with Bangladeshis is logical. But I’m curious to know: who are the closest Indian groups for us Bangladeshis (and also West Bengalis), excluding our East Asian genes? UPites-Biharis? Oriyas? Telegus? Or Gujjus? Based on geographical location it must be UPites-Biharis, right?

  11. UPites-Biharis? Oriyas? Telegus? or Gujjus?

    more telugus than gujus i think. i guess i could check.

    hard to say re: UPites-Biharis cuz there aren’t the samples 🙁

  12. My paternal ancestor was supposedly a Bengali Kayastha many centuries ago too.

    More resolution on W Bengali non brahmins would be interesting.. One would think the founding population for Bangladeshis must be Eastern Bihari / West Bengali on some sort of cline..

    Ah just seen your reply

  13. “more telugus than gujus i think. i guess i could check.
    hard to say re: UPites-Biharis cuz there aren’t the samples ”

    In harappadna calculator’s population reference, UP-Biharis and also UP_Kayasthas (Srivastava) are in my top 10, so I think they are our closest Indian groups; both Gujaratis and Telegus are quite far from us. Another point is that our ASI is related to Austroasiatic/Santhals, and the same goes for UP-Biharis, but the Telegus’ ASI is related to South Indian tribals. A N-C.India component with UP-Bihar samples would be ideal.
    If the C.India is Gujarati Patel they are 50-50 ANI-ASI? The S.India is Kallar so 40-60 ANI-ASI?

  14. two friends—a Telugu Kamma and a Bombay-origin part-Bhatia, putatively part-Afghan (who shares an above-threshold segment on 23andMe with my Ashkenazi mother?)—will be sending theirs.

  15. The really interesting component here is C_India. You have suggested it could be from Gujarati Patels. Why not include some Patels so that we can see? Another suggestion I have is that it might be Austro-Asiatic ancestry that has bled into non Austro-Asiatics in Central India.

    I presume that you included Armenians and Belorussians as proxies for Iran Neolithic and for Ancient Steppe populations. But most of the South Asian populations have no component of either Armenian or Belorussian. This must be because of the Anatolia Neolithic ancestry in Armenians and Belorussians and its lack in South Asia. The only exception are the Balochi who have 34% of the Armenian component. This can be explained by the considerable Middle Eastern ancestry in this region due to Muhammad bin Qasim’s conquest in the eighth century.

    Can you repeat your analysis with aDNA from Iran Neolithic and Yamnaya instead of Armenian and Belorussian?

  16. Thank you very much Razib for including my ADNA results. I would really appreciate if you could please try and include Living DNA as well. Thank you very much. I personally think my ADNA underestimates my Dai type ancestry and all tests have possibly inflated my South Indian ancestry in relation to my East Asian ancestry, thank you very much

  17. Sending in mine and my brother’s. Our dad is [Christian] Tamil Sri Lankan from Jaffna, and our mom is [Hindu-Muslim] Indo-Guyanese (Bihari?). They met — and we were born — here in Canada.

  18. Thanks for running these results. For the Bangladeshi samples, the Dai seems pretty consistent but some variation in NW/C/S Indian and Belorussian. Leaking into one another? Presumably Iran_Neo is being mostly represented by the NW and C Indian components?

    My father’s results (sylhet 3) are always relatively more S Indian heavy compared to other family members, but there’s a good 20pc difference with my mother (sylhet 4) which seems a bit excessive.

    Interesting to compare these with the West Bengali Kayastha sample you have.

    The 1% Nigerian is clearly noise, but 2% Yemeni jew in my maternal cousin (Sylhet 2) is interesting as he and my mother seem to often score minor SW Asian in other calculators and on nmonte too.

  19. @Reza
    Hi Reza, I’m Zayd. I think your maternal side has some distant Middle Eastern ancestry. In my opinion NW/C/S Indian components should have both ANI/ASI components. S.Indian peaks in Kallars and Nadars and they are around 40% ANI. NW.Indian also should be something like 30-40% ASI; for example the Jatt sample didn’t score any S.Indian, not even C.Indian. I could be wrong though.

  20. i will post PCA. so i think i may redo some of these clusters and add ancient ones. perhaps something besides ADMIXTURE. some of the early results look non-robust to me (fwiw i’ll rerun everyone when i figure out where i’m going fwd).

  21. Dear Razib,
    Sent you my DNA data to the email address you provided. Please acknowledge at convenience. Would be interested to see what your analysis says. Thanks for starting this project.
    Regards,
    MB

  22. @Reza: I would say you could make a better case for your SW Asian ancestry WITH the 1% Nigerian ancestry as it would account for the SSA that’s common among Middle-Easterners. This is why Razib hypothesized that Pakistanis would score more Nigerian.

    As for me, I’m pretty proud of having the highest Dai so far. What are the representative populations for this group?

  23. Hi Reza, Zayd, Arlus, Jortita

    Bmoney here – Agreed Razib if you could split the ANI out – Punjabis and Gujaratis have strong Zagros Farmer ancestry but also significant Paniya ancestry >15% compared to populations outside South Asia or west of the Indus

  24. Hi all,

    I’d quite like to work out what some of the intra group variability is like rather than necessarily labelling it as recent ancestry from elsewhere etc.

    PCA would be great from that point of view.

    I wonder which other ethnicities have sent in their info, but mid north / central India would be great from an IMBY point of view, as well as what appears to be Bengalis from various parts of the region.

    Separating South Indian Tribals from Central Indian Tribals also seems smart in the absence of ancient dna. Works well with Lukasz’ calculator
