Introducing the Harappa Ancestry Project

A few weeks ago I hinted at a South Asian equivalent to Dodecad & Eurogenes BGA. It is now public and in the data collection phase. You can read the whole thing here:

This is the feed:

If your ancestry is from these nations:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • Burma
  • India
  • Iran
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka
  • Tibet

Read on! If not, “for entertainment purposes only”….

I have been griping in public and in private about the “reference” populations used for South Asian genomics for years. Because of the Permit Raj, the HGDP had to use Pakistani populations. Additionally, because of the HGDP’s mandate to focus on smaller groups which might harbor genetic uniqueness, you have some very obscure tribes, but only one sample set from an Indo-Aryan-speaking population. And even there, it was a minority, not the Punjabi-speaking majority of Pakistan.

Some of this has changed in recent years. Papers such as “Reconstructing Indian History” and “Genetic diversity in India and the inference of Eurasian population expansion” have added more populations to the mix. The current phase of the HapMap has Gujaratis from Houston. But there is always a problem when you take a small population set to be representative of a broader group. There are ~1.3 billion South Asians. Using Gujaratis from Houston, who are likely to be of a narrow range of castes, is still problematic. Because of the long history of endogamy and the likelihood of fine-grained caste and geographical structure, good population coverage is of the essence for South Asians. Taking the Beijing HapMap sample as representative of Han Chinese is not optimal, but this sort of thing would be far less optimal in South Asia.

So when Dienekes began the Dodecad Ancestry Project I was very curious. I had had ADMIXTURE for a while, but his project prompted me to start playing around with it myself. My plan was to wait to see how Dienekes fared, and in particular to see what didn’t pan out in terms of fruitful use of labor. Mine is finite, like everyone’s. My medium-term plan was to start up a South Asian equivalent to Dodecad at some point in the first half of 2011.

Then Zack approached me. I have known Zack online since 2003, through the blogs. His blogging was primarily about Pakistani culture and liberal politics (he’s Pakistani American and a liberal). But he also has a doctorate in electrical engineering, so he has some technical skillz. Because of his own peculiar genetic background (he’s 1/4 Egyptian), Zack kept asking me questions. Eventually it became clear that he was interested in starting something similar to Dodecad…and I told him my own future plans, and encouraged him to take up the torch immediately. I knew Zack had the technical chops, and could probably also devote more time and energy at that point than I could.

I immediately gave him my 23andMe sample. Since I had Dienekes already run my genome we kind of knew what to expect. And it looks like Zack has the software running well. He included a Nepali sample, and it turns out that in an MDS clustering I fell 71% into the dominant Nepali cluster. This is kind of what I expected.

In any case, the details:

Please do not send samples from close relatives. I define close relatives as 2nd cousins or closer. If you have data from yourself and your parents, it might be better to send the samples from your parents (assuming they are not related to each other) and not send your own sample.

If you are unsure if you are eligible to participate, please send me an email to inquire about it before sending off your raw data.

What to send?
Please send your All DNA raw data text file (zipped is better), downloaded from 23andMe, along with ancestral background information about you and all four of your grandparents. Background information would include where they were born, mother tongue, caste/community to which they belonged, etc. Please provide as much ancestry information as possible and try to be specific. Do especially include information about any ancestry from outside South Asia.
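If you want to sanity-check your export before emailing it, something like the following Python sketch may help. It is not part of the project's own tooling. It assumes the usual 23andMe raw data layout (comment lines starting with `#`, then tab-separated rsid, chromosome, position, and genotype columns, with `--` marking a no-call). The function name `summarize_raw_data` and the sample rows are made up for illustration.

```python
import io
from collections import Counter


def summarize_raw_data(lines):
    """Summarize a 23andMe-style raw data export given as an iterable of text lines.

    Assumed layout: '#' comment lines, then rsid<TAB>chromosome<TAB>position<TAB>genotype.
    """
    per_chromosome = Counter()
    no_calls = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 tab-separated fields, got: {line!r}")
        rsid, chromosome, position, genotype = fields
        int(position)  # raises ValueError if the position is not numeric
        per_chromosome[chromosome] += 1
        if "-" in genotype:
            no_calls += 1  # '--' (or a half call) means the chip gave no result
    return {
        "snps": sum(per_chromosome.values()),
        "no_calls": no_calls,
        "per_chromosome": dict(per_chromosome),
    }


# Tiny made-up sample; the rsids, positions, and genotypes are placeholders.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\t--\n"
    "rs0000003\tX\t3000\tCT\n"
)
summary = summarize_raw_data(sample)
print(summary)
```

For a real file you would pass an open file handle instead of the `io.StringIO` sample. A SNP count far below what you expect, or a `ValueError`, suggests the download was truncated or is not the raw data text file.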

Data Privacy
The raw genetic data and ancestry information that you send me will not be shared with anyone.

Your data will be used only for ancestry analysis. No analysis of physical or health/medical traits will be performed.

The individual ancestry analysis published on this blog will be done using an ID of the form HRPnnnn known to only you and me.

What do you get?
All results of ancestry analysis (individual and group) will be posted on this blog under the Harappa Ancestry Project category. This will include admixture analysis as well as clustering into population groups etc.

I suggest you read about Dienekes’ analysis on South Asians for an idea about what to expect.

You can access all blog posts related to this project from the Harappa Ancestry Project link on the navigation menu on every page of my website. You can also subscribe to the project feed.

If you’re South Asian, Iranian, Burmese, or Tibetan, and have had 23andMe genotyping done, you know what to do. If you know someone from these groups who has had that done, please forward this on.
