Size (sample) matters more than coverage


We live in an age where it’s almost anachronistic to talk about “-omics.” When a technology becomes seamless in our day-to-day life it becomes unworthy of notice. That being said, we’re still in the phase of genomics where a lot of the details of “best practices” are being hashed out (the proliferation of “pipelines” for relatively pedestrian tasks makes that clear). Recently I stumbled upon two papers which I thought it would be useful to give a little more coverage to: “Population genomics based on low coverage sequencing: how low should we go?” and “Assessing the Effect of Sequencing Depth and Sample Size in Population Genetics Inferences.”

At issue here is coverage versus sample size. By coverage I mean the expected number of reads that will hit a given nucleotide. If you have 100× you expect 100 reads covering a base on average, and if you have 1× you expect only one. Because of random variation, lots of positions are going to be above or below your expected coverage. Why this matters on the most prosaic level is that there is going to be error in the results you get back from sequencing, and if you have many reads covering the same position you can distinguish true from false polymorphism. For many projects people today seem to prefer on the order of 30×.
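To put rough numbers on how much positions scatter around the expected coverage, here is a minimal sketch. It assumes reads land uniformly at random, so per-site depth is approximately Poisson with mean equal to the coverage; real data are usually more uneven than this. The function names and the depth threshold of 10 are mine, chosen purely for illustration.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k reads at a site when depth ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def fraction_below(min_depth, mean_coverage):
    """Expected fraction of sites covered by fewer than min_depth reads."""
    return sum(poisson_pmf(k, mean_coverage) for k in range(min_depth))

# Assumption: idealized Poisson depth; the threshold of 10 reads is arbitrary.
for coverage in (1, 2, 10, 30):
    no_reads = fraction_below(1, coverage)
    shallow = fraction_below(10, coverage)
    print(f"{coverage:>3}x  no reads: {no_reads:.4f}  fewer than 10 reads: {shallow:.4f}")
```

Under this idealized model, at 1× roughly 37% of sites (e^-1) get no reads at all, which is why any single low-coverage genome is so noisy on its own.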

Citation: Fumagalli M (2013) Assessing the Effect of Sequencing Depth and Sample Size in Population Genetics Inferences. PLoS ONE 8(11): e79667. doi:10.1371/journal.pone.0079667

As the second paper is open access I’ll refer to its results, which are broadly in agreement with the first. When attempting to estimate population genetic statistics or population substructure from simulated data (so the author knows the “true” values), increasing sample size at even 1-2× coverage gave much more bang for the buck than ratcheting up coverage. The methodology employed a trade-off between sample size and coverage, so that (sample size) × (coverage) remained invariant. It actually wasn’t totally surprising to me in relation to population structure, since noisy and error-prone data can still be quite useful assuming there isn’t a systematic bias (i.e., the error is random, so you’re left with thousands of useful markers after employing stringent quality control). But it did surprise me how much of an effect there was on standard population genetic statistics of diversity. And the problems in that domain only increase when you have a rapidly growing population with an excess of rare variants (like humans), rather than a constant population size.*
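The fixed-budget trade-off can be made concrete with a toy calculation. To be clear, this is not the paper's simulation. It is a back-of-the-envelope sketch under assumptions of my own: each of the 2n chromosomes carries a variant independently with probability equal to its frequency, each chromosome is covered by Poisson(coverage / 2) reads at the site, and sequencing error is ignored entirely. The names `designs` and `prob_variant_seen`, the budget of 120, and the 1% allele frequency are all hypothetical.

```python
import math

def designs(budget, coverages):
    """(sample_size, coverage) pairs with sample_size * coverage == budget."""
    return [(budget // c, c) for c in coverages if budget % c == 0]

def prob_variant_seen(n_individuals, coverage, allele_freq):
    """Chance that at least one read in the whole sample carries the variant.

    Toy model (my assumption): diploid individuals, Poisson depth split
    evenly between the two chromosomes, and no sequencing error.
    """
    p_covered = 1.0 - math.exp(-coverage / 2.0)   # chromosome has >= 1 read
    p_hit = allele_freq * p_covered               # carries variant and is read
    return 1.0 - (1.0 - p_hit) ** (2 * n_individuals)

# Assumption: a budget of 120 "sample x coverage" units and a 1% variant.
for n, c in designs(120, [1, 2, 10, 30]):
    print(f"n={n:>3} at {c:>2}x  P(variant seen) = {prob_variant_seen(n, c, 0.01):.3f}")
```

In this toy setup, 120 individuals at 1× turn up a 1% variant with probability of roughly 0.6, against well under 0.1 for 4 individuals at 30×. The sketch leaves out exactly what depth buys you, namely telling a real variant from a sequencing error, so treat it as intuition for why rare variants reward sample size and not as a design calculator.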

Finally, this is obviously a conclusion geared toward biologists focusing on population-scale dynamics, whether they be molecular ecologists or population geneticists. But as sequencing becomes more ubiquitous, and money remains finite, these sorts of balancing acts between coverage and sample size will come more to the fore.

* Also, the author observes that if, instead of employing a hard cutoff of some sort in variant calling, you utilize a probabilistic model such as the one in ANGSD, you can get a lot more juice out of low coverage.
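To illustrate the difference between a hard cutoff and a probabilistic treatment, here is a minimal genotype-likelihood sketch. It is not ANGSD's implementation. It assumes a biallelic site, independent reads, and a single symmetric error rate that flips a reference read to the alternate allele or vice versa; the function names and the 1% error rate are hypothetical.

```python
def genotype_likelihoods(n_ref, n_alt, error_rate=0.01):
    """P(observed reads | genotype) for genotypes with 0, 1, or 2 alt alleles.

    Simplified model (my assumption): each read is drawn from one of the two
    chromosomes at random and is misread as the other allele with
    probability error_rate.
    """
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac_alt = alt_copies / 2.0
        p_alt_read = frac_alt * (1 - error_rate) + (1 - frac_alt) * error_rate
        likelihoods.append(p_alt_read ** n_alt * (1 - p_alt_read) ** n_ref)
    return likelihoods

def normalized(likelihoods):
    """Rescale to sum to 1 (a posterior only under a flat genotype prior)."""
    total = sum(likelihoods)
    return [x / total for x in likelihoods]

# A site covered by a single read, and that read shows the alternate allele.
print(normalized(genotype_likelihoods(n_ref=0, n_alt=1)))
```

With one alternate read and nothing else, a hard cutoff either drops the site or forces a call. The likelihoods instead record that a heterozygote or an alternate homozygote are both plausible and a reference homozygote is unlikely, and that partial information can be pooled across many individuals.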

 

 
