Genomic ancestry tests are not cons, part 2: the problem of ethnicity

The results to the left are from 23andMe for someone whose paternal grandparents were immigrants from southern Germany. Their mother had a father who was of English American background (his father was a Yankee American with an English surname and his mother was an immigrant from England), and grandparents who were German (Rhinelander) and French Canadian respectively on their maternal side.

Looking at the results from 23andMe one has to wonder: why is this individual only a bit under 25% French & German, when genealogical records show places of birth that indicate they should be 75% French & German (more precisely, 62.5% German and 12.5% French)? And though their ancestry is 25% English, only 13% of their ancestry is listed as such.
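For anyone who wants to check the genealogical arithmetic, here is a minimal sketch. It assumes each great-grandparent contributes an expected 1/8 of the genome (real inheritance varies around that because of recombination), and the labels are my simplification of the pedigree described above: the Yankee great-grandfather is counted as "English" and the French Canadian great-grandparent as "French".

```python
from collections import defaultdict
from fractions import Fraction

# Eight great-grandparents, each contributing an expected 1/8.
# Grandparents of a single background are entered as two identical
# great-grandparents. Labels are a simplification of the pedigree.
great_grandparents = [
    "German", "German",    # paternal grandfather's parents
    "German", "German",    # paternal grandmother's parents
    "English", "English",  # maternal grandfather's parents (Yankee father, English mother)
    "German", "French",    # maternal grandmother's parents (Rhinelander, French Canadian)
]

def expected_fractions(ancestors):
    """Sum the equal expected share of each ancestor by label."""
    share = Fraction(1, len(ancestors))
    totals = defaultdict(Fraction)  # Fraction() is 0
    for label in ancestors:
        totals[label] += share
    return dict(totals)

fractions = expected_fractions(great_grandparents)
for label, frac in sorted(fractions.items()):
    print(f"{label}: {float(frac):.1%}")
```

This reproduces the 62.5% German, 12.5% French, and 25% English expectation, which is the baseline the 23andMe estimates are being compared against.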

First, notice that nearly half of their ancestry is “Broadly Northwestern European.” Last I checked, 23andMe uses phased haplotypes to detect segments of ancestry. This is a very powerful method and is often quite good at zeroing in on people of European ancestry. But for Americans of predominant, but mixed, Northern European background, rather than precise proportions you often obtain results of the form “Broadly…”, presumably because recombination has generated novel haplotypes in white Americans.

But this isn’t the whole story. Why, for example, are many of the Finnish people I know on 23andMe assigned as >90% Finnish, while a Danish friend is 40% Scandinavian?

The issue here is that to be “Finnish” and “Scandinavian” are not equivalent units in terms of population genetics. Finns are a relatively homogeneous ethnic group who seem to have undergone a recent population bottleneck. In contrast, Scandinavia encompasses several different, albeit related, ethnicities which are geographically widely distributed.

Ethnic identities are socially and historically constructed. Additionally, they are often clear and distinct. This is not always the case for population genetic classifications. On a continental scale, racial classification is trivial, and feasible with only a modest number of genetic markers. Why? Because the demographic and evolutionary histories of Melanesians and West Africans, to give two concrete examples, have been distinct for tens of thousands of years. Population genetic analyses which attempt to identify or differentiate these groups have a lot of raw material to work with.


South Asian Genotype Project


It’s been a few years since I’ve done any serious “Genome Blogging.” Mostly I’ve been very busy and there isn’t much low-hanging fruit left as it is. But today I want to announce that I’ll be running the generically titled “South Asian Genotype Project.”

The way it works is simple: send me a 23andMe, Ancestry, or Family Tree DNA raw genotype file to contactgnxp -at- gmail.com (though 23andMe’s new chip has far less overlap with other platforms than the earlier one, so it is probably best if you were typed before August 2017).

In the subject please put:

  1. “South Asian Genotype Project”
  2. The state/province your family is from
  3. Ethnolinguistic group
  4. If applicable, caste

In the body of the email you can put Y and mtDNA and any other information you want. Obviously your data is confidential and I won’t identify you by name, just ethnolinguistic group and such.

Since the last time I did this I have written some scripts that make this a lot easier, so hopefully I’ll be adding individuals to this spreadsheet every few days. I’ll give project members an ID and try to email them when the results are up.

The main motivator for this project on my part is that people still ask me questions about Sinhalese, Nasrani Christians, and other assorted groups which we don’t have answers to because current research projects haven’t focused on them.

Since Zack worked on the Harappa Ancestry Project we know a lot more about South Asian ancestry. Basically, there is an ANI-ASI cline, and some South Asians have exogenous ancestry off this cline. Indian Jews have Middle Eastern ancestry, while Bengalis have East Asian ancestry, and some groups in Pakistan have African ancestry. With that in mind I’ll be testing a smaller number of populations. The marker set is 240,000 SNPs by the way.

Below are some representative results. You can see that my results from three DTC services are basically the same. Also, some South Indian groups (see Pulliyar) show “Dai” ancestry, when I’m pretty sure it’s just that I didn’t sample as much on the extreme portion of the ASI-cline.


Razib Khan’s raw genotype data on 23andMe, Family Tree DNA, Geno 2.0 and Ancestry

It has been a while since I posted an update on my genotype. Since then I’ve been tested on most of the major platforms. I don’t see any harm in releasing this to the public or researchers who want to look at it (though I don’t know why anyone would).

You can download all the files here.

Having my genotypes public is pretty useful for me. If I inquire about someone’s genetics, oftentimes people get weirdly defensive and ask “what about you?” I just invite them to look at my raw data and analyze it for themselves! I’m not a hypocrite about this.

Over the years I’ve had researchers inquire about my ethnicity when they stumble upon my genotype on platforms such as openSNP. So in full disclosure, most of my ancestry is pretty standard eastern Bengali. I’m more East Asian shifted than most Bangladeshi samples in the 1000 Genomes project, but then my family is from Comilla, in the far east of eastern Bengal (anyone who cares, my Y is of course R1a1a-Z93 and my mtDNA U2b).

As before I’ll put the genotype under a Creative Commons license.

Bank your exome with Helix for free ($0.00) [update, SALE ENDED!]

Update: Sale over!

I wasn’t going to do this again, but I’ve decided to promote Helix’s special discount. It ends at 2:59 AM EDT November 10th. Eight hours from when I push this post.

Obviously, there is a conflict of interest as I work for one of Helix’s partners. What does that mean?

  • Helix does an exome+ sequence and stores your data.
  • Then, you buy applications which use that data.
  • The company I work for is one of the application providers.
  • “Exome” means that Helix does a very accurate medical grade sequence of all your genes. The “+” points to the fact that they include a substantial number of positions which are not within genes (in the “junk DNA”). That totals up to 30,000,000+ markers (the exome is 1% of your whole genome). This is not trivial. Current direct-to-consumer genomics companies are looking at 500,000 to 1,000,000 markers with SNP arrays.
  • Helix keeps this data. Within a few months, you can buy the data at cost (it won’t be cheap!). But the model is that you buy à la carte apps, which will be affordable (our products are affordable).

I’m laying this all out very plainly because many people are asking me about these details right now as the sale winds down, and this includes people who are pretty savvy about personal genomics. Here is why I think you should get the kits now:

  1. It gets my company more customers. That’s the self-interested part, and less important for the target audience.
  2. For you, it gets you an exome that you can buy later without any upfront cost. For the next eight hours, Helix is basically waiving the kit costs by dropping the price $100.

Our Neanderthal product is now $9.99. Our Metabolism product is $19.99. These products are great, as they give you functional information in a very user-friendly manner. But a lot of my readers can analyze their own data, so what’s the incentive then? Again, the incentive is that you get an exome for free, and can later buy it if you want, or, perhaps even a savvy personal genomics consumer will find an app they’ll want to purchase. Normally the kit is $80, so buying it now means you’ll never have to pay this cost. If you are the type of person who has qualms about a private company keeping your data, this may not be for you.

Of course, there are other app developers in the Helix store, so just buy whatever you want. This is a way to get your exome sequenced for free now. I will tell you that the Insitome apps are among the cheapest.

Finally, a lot of people are buying “family-pack” quantities. I got four kits for example for my immediate family. Unfortunately, there are some issues with the Helix site and the extra purchases. You can buy more than one easily at Amazon right now. Our Neanderthal product is not in low stock. The Metabolism product has only a few left, though I don’t know what that means.

Note: The discount is client-side, so you may need to switch browsers if you are going to the Helix site to buy (or turn off ad-block). From what I can see Amazon does not have these issues.

The issue is with the model, not precision!

The Wirecutter has a thorough review of direct-to-consumer ancestry testing services. Since I now work at a human personal genomics company I’m not going to comment on the merits of any given service. But, I do want to clarify something in regards to the precision of these tests. Before the author quotes Jonathan Marks, he says:

For Jonathan Marks, anthropology professor at University of North Carolina at Charlotte, the big unknown for users is the margin for error with these estimates….

The issue I have with this quote is that the margin of error on these tests is really not that high. Margin of error itself is a precise concept. If you sample 1,000 individuals you’ll have a lower margin of error than if you sample 100 individuals. That’s common sense.

But for direct-to-consumer genomic tests you are sampling 100,000 to 1 million markers on SNP arrays (the exact number used for ancestry inference is often lower than the total number on the array). For ancestry testing you are really interested in the 10 million or so (order of magnitude) markers which vary between populations, and a random sampling of 100,000 to 1 million is going to be pretty representative (consider that election year polling usually surveys a few thousand people to represent an electorate of tens of millions).
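To put rough numbers on the polling analogy, here is a small sketch using the textbook margin of error for a proportion estimated from a simple random sample. The function name and the worst-case default p = 0.5 are my choices for illustration. Treating markers as independent draws is also an assumption: nearby SNPs are correlated through linkage disequilibrium, so the effective sample size of an array is smaller than its raw marker count.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Poll-sized samples versus SNP-array-sized samples.
for n in (100, 1_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: +/- {margin_of_error(n):.2%}")
```

Going from 100 to 1,000 samples cuts the margin from about 10 points to about 3, and at array scale the sampling error is a small fraction of a percentage point, which is why sampling error is not where the differences between companies come from.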

If you run a package like Admixture you can repeat the calculation for a given individual multiple times. In most cases there is very little variation between replicates in the percentage breakdowns, even though a random seed initializes the process before it begins to stochastically explore the parameter space (the variance is going to be higher if you try to resolve clusters which are extremely phylogenetically close, of course).
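The replicate idea can be illustrated with a toy simulation. To be clear, this is not the Admixture program: it is a simplified sketch in which the reference allele frequencies are simulated and held fixed, and only one individual's ancestry proportions are estimated by expectation-maximization from different random starting points. All names and parameter values here are made up for illustration.

```python
import random

def simulate(n_markers, true_q, rng):
    """Draw reference allele frequencies and one admixed diploid genotype."""
    k = len(true_q)
    freqs = [[rng.uniform(0.05, 0.95) for _ in range(k)] for _ in range(n_markers)]
    genotypes = []
    for f in freqs:
        p = sum(q * fk for q, fk in zip(true_q, f))
        # Two allele copies, each carrying the counted allele with probability p.
        genotypes.append(sum(rng.random() < p for _ in range(2)))
    return freqs, genotypes

def estimate_q(freqs, genotypes, seed, n_iter=200):
    """EM estimate of ancestry proportions from a seed-dependent random start."""
    rng = random.Random(seed)
    k = len(freqs[0])
    raw = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    q = [r / sum(raw) for r in raw]  # random point on the simplex
    for _ in range(n_iter):
        new_q = [0.0] * k
        for f, g in zip(freqs, genotypes):
            p = sum(qk * fk for qk, fk in zip(q, f))
            for i in range(k):
                # Expected number of this marker's two alleles drawn from population i.
                new_q[i] += g * q[i] * f[i] / p + (2 - g) * q[i] * (1 - f[i]) / (1 - p)
        q = [v / (2 * len(genotypes)) for v in new_q]
    return q

data_rng = random.Random(0)
freqs, genotypes = simulate(1000, [0.5, 0.3, 0.2], data_rng)
seeds = (1, 2, 3)
replicates = [estimate_q(freqs, genotypes, seed) for seed in seeds]
for seed, q in zip(seeds, replicates):
    print(seed, [round(v, 3) for v in q])
```

In this toy setup the replicates should agree closely, because with fixed reference frequencies the likelihood is concave in the proportions. Real unsupervised runs estimate the frequencies too, so they can wander more, especially for closely related clusters.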

As I have stated before, the reason these different companies offer varied results is that they start out with different models. When I learned the basic theory around phylogenetics in graduate school the philosophy was definitely Bayesian; vary the model parameters and the model and see what happens. But you can’t really vary the model all the time between customers, can you? It starts to become a nightmare in relation to customer service.

There are certain population clusters that customers are interested in. To provide a service to the public a company has to develop a model that answers those questions which are in demand. If you are designing a model for purely scientific purposes then you’d want to highlight the maximal amount of phylogenetic history. That isn’t always the same, though, as the history that customers want to know about. This means that direct-to-consumer ethnicity tests, in terms of the specification of their models, deviate from pure scientific questions, and result in a lot of judgment calls based on company evaluations of their client base.

Addendum: There is a lot of talk about the reference population sets. The main issue is representativeness, not sample size. You don’t really need more than 10-100 individuals from a given population in most cases. But you want to sample the real population diversity that is out there.

10 million DTC dense marker genotypes by end of 2017?


Today I got an email from 23andMe that they’d hit the 2 million customer mark. Since they reached their goal of 1 million kits purchased the company seems to have taken its foot off the pedal of customer base growth to focus on other things (in particular, how to get phenotypic data from those who have been genotyped). In contrast Ancestry has been growing at a faster rate of late. After talking to Spencer Wells (who was there at the beginning of the birth of this sector) we estimated that the direct-to-consumer genotyping kit business is now north of 5 million individuals served. Probably closer to 6 or 7 million, depending on the numbers you assume for the various companies (I’m counting autosomal only).

This is pretty awesome. Each of these firms genotypes in the range of 100,000 to 1 million variant markers, or single nucleotide base pairs. 20 years ago this would have been an incredible achievement, but today we’re all excited about long-read sequencing from Oxford Nanopore. SNP-chips are almost ho-hum.

But though sequencing is the cutting edge, the final frontier and terminal technology of reading your DNA code, genotyping in humans will be around for a while because of cost. At ASHG last year a medical geneticist was claiming price points in bulk for high density SNP-chips are in the range of the low tens of dollars per unit. A good high coverage genome sequence is still many times more expensive (perhaps an order of magnitude or more depending on who you believe). It also can impose more data processing costs than a SNP-chip in my experience.

Here’s a slide from Spencer:

I suspect genotyping will go S-shaped before 2025 after explosive growth. Some people will opt out: a minority of the population, but a substantial proportion. At the other extreme of the preference distribution you will have those who will start getting sequenced. Researchers will begin to talk about genotyping platforms like they talk about microarrays (yes, I know at places like the Broad they already talk about genotyping like that, but we can’t all be like the Broad!).

Here’s an article from 2007 on 23andMe in Wired. They’re excited about paying $1,000 for genotyping services…the cost now of the cheapest high quality (30x) whole genome sequences. Though 23andMe has a higher price point for its medical services, many of the companies are pushing their genotyping+ancestry below $100, a value it had stabilized at for a few years. Family Tree DNA has a Father’s Day sale for $69 right now. Ancestry looks to be $79. The Israeli company MyHeritage is also pushing a $69 sale price (the CSO there is advertising that he’s hiring human geneticists, just so you know). It seems very likely that a $50 price point is within sight in the next few years as SNP-chip costs become trivial and all the expenses are on the data storage/processing and visualization costs. I think psychologically for many people paying $50 is not cheap, but it is definitely not expensive. $100 feels expensive.

Ultimately I do wonder if I was a bit too optimistic that 50% of the US population will be sequenced at 30x by 2025. But the dynamic is quite likely to change rapidly because of a technological shift as the sector goes through a productivity uptick. We’re talking about exponential growth, which humans have weak intuition about….

Addendum: Go into the archives of Genomes Unzipped and reach the older posts. Those guys knew where we were heading…and we’re pretty much there.

Direct-to-consumer genomics, it’s back on!

The past three and a half years, and arguably longer, there has been something of a dark night passing over direct-to-consumer (DTC) personal genomics. The regulatory issues have been unclear to unfavorable. If you have read this blog you know 23andMe’s saga with the Food and Drug Administration.

It looks like in 2017 DTC is finally turning a regulatory corner, with some clarity and freedom to operate. From FDA Opens Genetic Floodgates with 23andMe Decision:

Today, the U.S. Food and Drug Administration told gene-testing company 23andMe that it will be allowed to directly tell consumers whether their DNA puts them at higher risk for 10 different diseases, including late-onset Alzheimer’s disease and Parkinson’s.

The decision to allow these direct-to-consumer tests is a big vindication for 23andMe, which in 2013 was forced to cease marketing such results after the FDA said they could be inaccurate and risky to consumers, and that they required regulatory approval.

I still agree with my assessment in 2013: this won’t mean anything in the long run. DTC is here to stay, and if the decentralization of medical testing and services doesn’t happen in the USA, it’ll happen elsewhere, and at some point medical tourism will get cheap enough that any restrictions in this nation won’t be of relevance. But this particular decision alters the timeline in the grand scheme of things, and matters a great deal for specific players.

It’s on!