When conquered pre-Greece took captive her rude Hellene conqueror

When I was a child in the 1980s I was captivated by Michael Wood’s documentary In Search of the Trojan War (he also wrote a book with the same name). I had read a fair amount of Greek mythology, prose translations of the Iliad, as well as ancient history. The contrast between the Classical Greeks and the strangeness of their mythology was always something that on the surface of my mind. The reality that Bronze Age Greeks were very different from Classical Greeks resolved this issue to some extent, as the mythos no doubt drew from the alien world of the former.

Though Classical Greeks were very different from us (e.g., slavery), to some extent Western civilization began with them, and they are very familiar to us for this reason. Rebecca Goldstein’s Plato at the Googleplex was predicated on the thesis that the ancient Greek philosopher had something to tell us, and that if he was alive today he would be a prominent public speaker.

I’m going to dodge the issue of Julian Jaynes’ bicameral mind, and just assert that people of the Bronze Age were fundamentally different from us in a way Plato was not. And that difference is preserved in aspects of Greek mythology. Though it is fashionable, and correct, to assert that Homer’s world was not that of Mycenaeans, but the barbarian period of the Greek Dark Age, it is not entirely true. Homer clearly preserved traditions where citadels such as Mycenae and Pylos were preeminent. Details such as the boar’s tusk helmets are also present in the Iliad. His corpus of oral history clearly preserved some ancient folkways which had fallen out of favor.

But aesthetic details or geopolitics are not what struck me about Greek mythology, but events such as the sacrifice of Iphigenia. Like Abraham’s near sacrifice of his son, this plot element seems to moderns cruel, barbaric, and unthinking. And though the Classical Greeks did not have our conception of human rights, they had turned against human sacrifice (and the Romans suppressed the practice when they conquered the Celts) on the whole. But it seems to have occurred in earlier periods.

The rupture between the world of the Classical Greeks and the strange edifices of Mycenaean Greece were such that scholars were shocked that the Linear B tablets of the Bronze Age were written in Greek when they were finally deciphered. In fact many of the names and deities on these tablets would be familiar to us today; the name Alexander and the goddess Athena are both attested to in Mycenaean tablets.

Preceding the Mycenaeans, who  emerge in the period between 1400-1600 BCE, are the Minoans, who seem to have developed organically in the Aegean in the 3rd millennium. This culture had relations with Egypt and the Near East, their own system of writing, and deeply influenced the motifs of the successor Mycenaean Greek civilization. The aesthetic similarities between Mycenaeans and Minoans is one reason that many were surprised that the former were Greek, because the Minoan language was likely not.

Mycenaean civilization seems to have been a highly militarized and stratified society. There is a reason that this is sometimes referred to as the “age of citadels.” Allusions to the Greeks, or Achaeans, in the diplomatic missives of the Egyptians and Hittites suggests that the lords of the Hellenes were reaver kings. In 1177 B.C. Eric Cline repeats the contention that a fair portion of the “sea peoples” who ravaged Egypt in the late Bronze Age were actually Greeks.

So when did these Greeks arrive on the shores of Hellas? In The Coming of the Greeks Robert Drews argued that the Greeks were part of a broader movement of mobile charioteers who toppled antique polities and turned them into their own. The Hittites and Mitanni were two examples of Indo-European ruling elites who took over a much more advanced civilizational superstructure. While the Hittites and other Indo-Europeans, such as the Luwians and Armenians, slowly absorbed the non-Indo-European substrate of Anatolia, the Indo-Aryan Mitanni elite were linguistically absorbed by their non-Indo-European Hurrian subjects. Indo-Aryan elements persisted only their names, their gods, and tellingly, in a treatise on training horses for charioteers.

Drews’ thesis is that the Greek language percolated down from the warlords of the citadels and their retinues over the Bronze Age, with the relics who did not speak Greek persisting into the Classical period as the Pelasgians. Set against this is the thesis of Colin Renfrew that Greek was one of the first Indo-European languages, as Indo-European languages began in Anatolia.

The most recent genetic data suggest to me that both theses are likely to be wrong. The data are presented in two preprints The Population Genomics Of Archaeological Transition In West Iberia and The Genomic History Of Southeastern Europe. The two papers cover lots of different topics. But I want to focus on one aspect: gene flow from steppe populations into Southern Europe.

We know that in the centuries after 2900 BCE there was a massive eruption of individuals from the steppe fringe of Eastern Europe, and Northern Europe from Ireland to to Poland was genetically transformed. Though there was some assimilation of indigenous elements, it looks to be that the majority element in Northern Europe were descended from migrants.

For various reasons this was always less plausible for Southern Europe. The first reason is that Southern Europeans shared a lot of genetic similarities to Sardinians, who resembled Neolithic farmers. Admixture models generally suggested that in the peninsulas of Southern Europe the steppe-like ancestry was the minority component, not the majority, as was the case in Northern Europe.

These data confirm it. The Bronze Age in Portugal saw a shift toward steppe-inflected populations, but it was not a large shift. There seems to have been later gene flow too. But by and large the Iberian populations exhibit some continuity with late Neolithic populations.  This is not the case in Northern Europe.

In The Genomic History Of Southeastern Europe the authors note that steppe-like ancestry could be found sporadically during early periods, but that there was a notable increase in the Bronze Age, and later individuals in the Bronze Age had a higher fraction. Nevertheless, by and large it looks as if the steppe-like gene flow in the southerly Balkans (focusing on Bulgarian samples) was modest in comparison to the northern regions of Europe. Unfortunately I do not see any Greece Bronze Age samples, but it seems likely that steppe-like influence came into these groups after they arrived in Bulgaria, which is more northerly.

Down to the present day a non-Indo-European language, Basque, is spoken in Spain. Paleo-Sardinian survived down to the Common Era, and it too was not Indo-European. Similarly, non-Indo-European Pelasgian communities continued down to the period of city-states in Greece.

These long periods of coexistence point to the demographic equality (or even superiority) of the non-Indo-European populations. The dry climate of the Mediterranean peninsulas are not as suitable for cattle based agro-pastoralism. This may have limited the spread and dominance of Indo-Europeans. Additionally, the Mediterranean peninsulas were likely touched by Indo-European migrations relatively late. Much of the early zeal for expansion may have already dissipated by them. The high frequency of likely Indo-European R1b lineages among the Basques is curious, and may point to the spreading of male patronization networks, and their assimilation into non-Indo-European substrates where necessary. R1b is also found in Sardinia, and in high frequencies in much of Italy.

The interaction and synthesis between native and newcomer was likely intensive in the Mediterranean. For example, of the gods of the Greek pantheon only Zeus is indubitably of Indo-European origin. Some, such as Artemis, have clear Near Eastern antecedents. But other Greek gods may come down from the pre-Greek inhabitants of what became Greece.

Ultimately these copious interactions and transformations should not be a great surprise. The sunny lands of the Mediterranean attracted Northern European tribes during Classical antiquity. The Cimbri invasion of Italy, Galatians in Thrace and Anatolia, the folk wandering of Vandals and Goths into Iberia, are all instances of population movements southward. These likely moved the needle ever so slightly toward convergence between Northern and Southern Europe in terms of genetic content.

In relation to the more general spread of Indo-Europeans, I believe there are a few areas like Northern Europe, where replacement was preponderant (e.g., the Tarim basin). But I also believe there were many more which presented a Southern European model of synthesis and accommodation.

Synergistic epistasis as a solution for human existence

Epistasis is one of those terms in biology which has multiple meanings, to the point that even biologists can get turned around (see this 2008 review, Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems, for a little background). Most generically epistasis is the interaction of genes in terms of producing an outcome. But historically its meaning is derived from the fact that early geneticists noticed that crosses between individuals segregating for a Mendelian characteristic (e.g., smooth vs. curly peas) produced results conditional on the genotype of a secondary locus.

Molecular biologists tend to focus on a classical, and often mechanistic view, whereby epistasis can be conceptualized as biophysical interactions across loci. But population geneticists utilize a statistical or evolutionary definition, where epistasis describes the extend of deviation from additivity and linearity, with the “phenotype” often being fitness. This goes back to early debates between R. A. Fisher and Sewall Wright. Fisher believed that in the long run epistasis was not particularly important. Wright eventually put epistasis at the heart of his enigmatic shifting balance theory, though according to Will Provine in Sewall Wright and Evolutionary Biology even he had a difficult time understanding the model he was proposing (e.g., Wright couldn’t remember what the different axes on his charts actually meant all the time).

These different definitions can cause problems for students. A few years ago I was a teaching assistant for a genetics course, and the professor, a molecular biologist asked a question about epistasis. The only answer on the key was predicated on a classical/mechanistic understanding. But some of the students were obviously giving the definition from an evolutionary perspective! (e.g., they were bringing up non-additivity and fitness) Luckily I noticed this early on and the professor approved the alternative answer, so that graders would not mark those using a non-molecular answer down.

My interested in epistasis was fed to a great extent in the middle 2000s by my reading of Epistasis and the Evolutionary Process. Unfortunately not too many people read this book. I believe this is so because when I just went to look at the Amazon page it told me that “Customers who viewed this item also viewed” Robert Drews’ The End of the Bronze Age. As it happened I read this book at about the same time as Epistasis and the Evolutionary Process…and to my knowledge I’m the only person who has a very deep interest in statistical epistasis and Mycenaean Greece (if there is someone else out there, do tell).

In any case, when I was first focused on this topic genomics was in its infancy. Papers with 50,000 SNPs in humans were all the rage, and the HapMap paper had literally just been published. A lot has changed.

So I was interested to see this come out in Science, Negative selection in humans and fruit flies involves synergistic epistasis (preprint version). Since the authors are looking at humans and Drosophila and because it’s 2017 I assumed that genomic methods would loom large, and they do.

And as always on the first read through some of the terminology got confusing (various types of statistical epistasis keep getting renamed every few years it seems to me, and it’s hard to keep track of everything). So I went to Google. And because it’s 2017 a citation of the paper and further elucidation popped up in Google Books in Crumbling Genome: The Impact of Deleterious Mutations on Humans. Weirdly, or not, the book has not been published yet. Since the author is the second to last author on the above paper it makes sense that it would be cited in any case.

So what’s happening in this paper? Basically they are looking to reduced variance of really bad mutations because a particular type of epistasis amplifies their deleterious impact (fitness is almost always really hard to measure, so you want to look at proxy variables).

Because de novo mutations are rare, they estimate about 7 are in functional regions of the genome (I think this may be high actually), and that the distribution should be Poisson. This distribution just tells you that the mean number of mutations and the variance of the the number of mutations should be the same (e.g., mean should be 5 and variance should 5).

Epistasis refers (usually) to interactions across loci. That is, different genes at different locations in the genome. Synergistic epistasis means that the total cumulative fitness after each successive mutation drops faster than the sum of the negative impact of each mutation. In other words, the negative impact is greater than the sum of its parts. In contrast, antagonistic epistasis produces a situation where new mutations on the tail of the distributions cause a lower decrement in fitness than you’d expect through the sum of its parts (diminishing returns on mutational load when it comes to fitness decrements).

These two dynamics have an effect the linkage disequilibrium (LD) statistic. This measures the association of two different alleles at two different loci. When populations are recently admixed (e.g., Brazilians) you have a lot of LD because racial ancestry results in lots of distinctive alleles being associated with each other across genomic segments in haplotypes. It takes many generations for recombination to break apart these associations so that allelic state at one locus can’t be used to predict the odds of the state at what was an associated locus. What synergistic epistasis does is disassociate deleterious mutations. In contrast, antagonistic epistasis results in increased association of deleterious mutations.

Why? Because of selection. If a greater number of mutations means huge fitness hits, then there will strong selection against individuals who randomly segregate out with higher mutational loads. This means that the variance of the mutational load is going to lower than the value of the mean.

How do they figure out mutational load? They focus on the distribution of LoF mutations. These are extremely deleterious mutations which are the most likely to be a major problem for function and therefore a huge fitness hit. What they found was that the distribution of LoF mutations exhibited a variance which was 90-95% of a null Poisson distribution. In other words, there was stronger selection against high mutation counts, as one would predict due to synergistic epistasis.

They conclude:

Thus, the average human should carry at least seven de novo deleterious mutations. If natural selection acts on each mutation independently, the resulting mutation load and loss in average fitness are inconsistent with the existence of the human population (1 − e−7 > 0.99). To resolve this paradox, it is sufficient to assume that the fitness landscape is flat only outside the zone where all the genotypes actually present are contained, so that selection within the population proceeds as if epistasis were absent (20, 25). However, our findings suggest that synergistic epistasis affects even the part of the fitness landscape that corresponds to genotypes that are actually present in the population.

Overall this is fascinating, because evolutionary genetic questions which were still theoretical a little over ten years ago are now being explored with genomic methods. This is part of why I say genomics did not fundamentally revolutionize how we understand evolution. There were plenty of models and theories. Now we are testing them extremely robustly and thoroughly.

Addendum: Reading this paper reinforces to me how difficult it is to keep up with the literature, and how important it is to know the literature in a very narrow area to get the most out of a paper. Really the citations are essential reading for someone like me who just “drops” into a topic after a long time away….

Citation: ScienceNegative selection in humans and fruit flies involves synergistic epistasis.

Africa’s great demographic transformation

Stonehenge has been a preoccupation for moderns since the Victorian period. It was built over 5,000 years ago, and its usage in some fashion continued down to about 2,500 years ago. For a long while it had been associated with the Celts, but more recently there has been some suspicion that its roots must be pre-Celtic.

And that is almost certainly true. The original site of Stonehenge had a wooden structure. But during the arrival of the Bell Beaker culture it was extensively rebuilt, and eventually stone monoliths were erected in the fashion we are used to seeing today.

Bernard Cornwell’s novel Stonehenge deals with this period. There is no major focus on physical conflict between the native populations, and the Bell Beaker groups. Rather, the plot centers around the cultural tumult and innovation that was triggered by the arrival of the newcomers.

In Stonehenge the Bell Beakers occupied more marginal, out of the way, territory. The novel presumed that ultimately there would be cultural fusion between the two groups, as there was a lot of interaction inter-personally among the characters of the two groups. We now know that the reality was likely one of near total replacement. From the abstract to be presented on shortly on the Bell Beakers:

British individuals associated with Beakers are genetically indistinguishable from continental individuals associated with the same material culture and genetically nearly completely discontinuous with the previously resident population.

This is not entirely surprising. Ancient Ireland seems to have been characterized by discontinuity with the arrival of Bell Beakers genetically.

Ancient DNA is not magic. But it can literally put some flesh on the bones of cultural shifts that archaeologists have seen in the material culture. One key element here is that the predominant ancestry across the British Isles today derives from migrations that date to the early Bronze Age.* I do not know if this has any relevance as to the arrival of the Celtic languages to the Britain and Ireland, but I suspect it does.

This was percolating in my mind because there’s a new paper which attempts to explore in more detail the Bantu expansions which occurred between 1000 BCE and 500 CE. It’s pretty incredible that from Gabon to Capetown Africans speak one language family, with similarities at least as close as that of the Romance language family.

But then is it incredible? Indo-European languages span the North Sea to the Bay of Bengal. The Bantu expansion in some ways serves as a template for the argument in First Farmers, as an agricultural revolution triggered a demographic expansion which did not stop until they reached the their geographic limits.

The paper in Science, which is open access, Dispersals and genetic adaptation of Bantu-speaking populations in Africa and North America, focuses on two issues. First, the demographic history and phylogenomics of the Bantu populations. Second, using population genomic methods it explores the dynamics of natural selection in these peoples. They utilize and extensive SNP data set, with more than 500,000 markers in their core analyses.

In general I think there are lots of interesting results in this paper. But the one angle I was unsatisfied by was their purported increase in coverage. As you can see it’s highly localized to a few countries. This is probably common sense since much of Africa is not accessible due to political issues (e.g., sampling in the Democratic Republic of Congo is treacherous right now). But one always has to be careful of the limitations of the data when making inferences. Though they have samples from the southwest (Angola, Namibia), the the African Great Lakes region around Uganda, and in South Africa, huge zones between are missing. And, they are highly over sampled in and around Gabon.

With all that said, I think with a variety of methods they probably have confirmed a major aspect of Bantu migration. I’ll quote:

Two hypotheses have been proposed concerning the dispersal of Bantu-speaking populations across sub-Saharan Africa (2–4). According to the “early-split” hypothesis, the western and eastern branches split early, within the Bantu heartland, into separate migration routes. By contrast, the “late-split” model suggests an initial spread southward from the Bantu homeland into the equatorial rainforest (i.e., Gabon/Angola), followed by expansions toward the rest of the subcontinent. We tested these hypotheses by determining whether eBSPs and seBSPs were genetically closer to wBSPs from the southern part, relative to wBSPs from the northern part, of western central Africa….

…Although additional sampling of African populations may further refine these patterns, our results, together with previous genetic data supporting the late-split model (2, 3), indicate that BSPs [Bantu-speaking peoples] first moved southward through the rainforest before migrating toward eastern and southern Africa, where they admixed with local populations. This model is further supported by linguistics (15) and archaeoclimate data (16), suggesting that a climatic crisis ~2500 years ago fragmented the rainforest into patches and facilitated the early movements of BSPs farther southward from their original homeland.

That being said, their sample limitations produce interesting assertions. E.g., “The GLOBETROTTER method estimated that eBSPs resulted from two consecutive admixture events (P < 0.05) occurring 1000 to 1500 years ago and 150 to 400 years ago between a wBSP (~75% contribution) and an Afroasiatic-speaking population from Ethiopia (~10% contribution).” GLOBETROTTER is powerful, but too often people use it in a manner where they assume that the inferences it generates from the data it has are the truth, as opposed to the closest GLOBETROTTER can get to the truth with the tools its given.

In this case I would contend that because there aren’t any Nilotic samples it leaves a major hole in their power to be able to accurately infer what really happened. The presence of pastoralist Nilotic people in close proximity to Bantu agriculturalists has been one of the major dynamics which define the East African landscape. The admixture into eastern Bantu agriculturalists therefore is almost certainly from Nilotic peoples, though there has been Afro-Asiatic (Cushitic) influence as far south as Tanzania, evident in enigmatic peoples such as the Sandawe.

The point here is that just because the GLOBETROTTER method inferred gene flow from a population in the sample set, it does not mean that the gene flow was necessarily from that population. The sampling of the region is sparse, so obviously this is only a first approximation. To some extent I assume the authors assume the readers will connect the dots, but often this sort of thing gets lost in translation, and then it gets into the media….

Though it is difficult to make in the admixture plot above, there are subtle differences in the eastern Bantu groups. The Luyha, who are from Kenya, do not show evidence of the blue component which is clearly Eurasian, while the Bakiga from Rwanda do. But even in the Bakiga the ratio of the violet element that seems to be associated with an indigenous African component which is distinct from that of the Bantu and the blue Eurasian is far higher than in the Afro-Asiatic populations in their data set (this does not mean they don’t have Eurasian ancestry, since admixture plots aren’t perfect proxies).

Because of the nature of the sampling and the utilization of admixture to frame their results I do feel that we don’t get a good sense of the variation among the Bantu across their full range. Granted, the between population genetic distance is actually quite low across this zone, on the order of 0.01, because of the recent shared ancestry. Africans may have much greater total diversity than Eurasians in their genomes, but their between population distance is actually not much different or even lower than Eurasians because of the recent demographic expansions. But did the Bantu expand into empty lands? The Khoisan, Pygmy and Nilotic (I’m sure that’s what it is) contribution to the Bantus across their range is clear, but that’s because we have close enough reference populations to model this contribution. What about areas like Tanzania? Or Mozambique? Were they empty? I suspect the issue here is that we don’t have any non-Bantu indigenous groups as they’ve all been absorbed.

But it is in the selection component that they offer a possible way to ascertain non-Bantu ancestry from ghost populations in the future. They found lots and lots of selection around immune genes. This is not surprising. There were local diseases which they had to adapt to. Therefore, “the HLA region in wBSPs showed a strong excess of ancestry from rainforest hunter-gatherers, at 38%, 6.74 SD higher than the genome-wide average of 16%…..”

In places like Mozambique it would be curious if the regions known to be under selection or enriched for indigenous ancestry in other areas where there are still indigenous populations exhibited a higher Fst against other groups. That is, the Mozambique ghost populations should leave an inordinate impact on regions of the genome associated with immunological function.

Which brings me back to Stonehenge. We do have ancient genomes. But not that many. Especially further back. Apparently the names of rivers and mountains often have very deep histories. For example, the river Humber has a name which may date back to pre-Celtic times (consider the Mississippi river, which has an American Indian origin). These serve as shadows of cultures long gone and replaced. The Bantu expansion is close enough to the margins of history that we don’t have so much time interposed between it and concrete records. We can skein out its outlines with more rigor and surety. And the patterns we see among the Bantus can give us a sense of how past demographic-cultural expansions may have occurred.

* The papers coming out of the PoBI project suggest that a significant minority of the ancestry in eastern England is Anglo-Saxon. But only there.

Addendum: I can’t find the data to download and test some things myself.

So what’s point of demographic models which leave you scratching your head

There’s a new paper on Tibetan adaptation to high altitudes, Evolutionary history of Tibetans inferred from whole-genome sequencing. The focus of the paper is on the fact that more genes than have previously been analyzed seem to be the targets of natural selection. And I buy most of their analyses (not sure about the estimate of Denisovan ancestry being 0.4%…these sorts of things can be tricky).

But they fancy it up with a ∂a∂i model of population history, as well as using MSMC to account for gene flow. I don’t understand why they didn’t use something simpler like TreeMix, which can also handle more complex models. I guess because they wanted to focus on only a few populations?

Years ago I asked the developer of MSMC, Stephan Schiffels, if assuming an admixed population is not admixed might cause weird inferences. Why yes, it would. For example, admixed populations might show higher effective population since they’re pooling the histories of two separate populations. As for ∂a∂i, the model above leaves me literally scratching my head.

…predicted that the initial divergence between Han and Tibetan was much earlier, at 54kya (bootstrap 95% C.I 44 kya to 58 kya). However, for the first 45ky, the two populations maintained substantial gene flow (6.8×10-4 and 9.0×10-4 per generation per chromosome). After 9.4 kya (bootstrap 95% C.I 8.6 kya to 11.2 kya), the gene flow rate dramatically dropped (1.3×10-11 and 4×10-7 per generation per chromosome), which is consistent with the estimate from MSMC.

Mystifying. The separation between Chinese and Tibetans is pretty much immediately after modern humans arrive in East Asia. Then there’s a lot of reciprocal gene flow…which ends during the Holocene.

We’re being told here that there are two populations which persisted in some form for ~45,000 years. Is this believable? That these two populations maintained some sort of continuity, and, remained in close proximity to engage in gene flow. And then ~10,000 years ago the ancestors of the Tibetans separated from the ancestors of the modern Han Chinese.

The latter scenario I can imagine. It’s this ~45,000 year dance I’m confused by. If there is substantial gene flow between the two groups why did they keep enough distinctive drift to be separate populations?

With what we know about ancient DNA from Europe if we posited such a model for that continent we’d be way off. There’s been too many population turnovers. Is East Asia different? I’m moderately skeptical of that. I think perhaps researchers should be very aware of the limitations of ∂a∂i when it comes to fine-grained population genomic analyses.

Note: This is a cool paper, and this small section is not entirely relevant. Which is why I’m confused about it since it seems the weakest part of the analysis in terms of originality, and the least believable.

Beyond “Out of Africa” and multiregionalism: a new synthesis?

For several decades before the present era there have been debates between proponents of the recent African origin of modern humans, and the multiregionalist model. Though molecular methods in a genetic framework have come of the fore of late these were originally paleontological theories, with Chris Stringer and Milford Wolpoff being the two most prominent public exponents of the respective paradigms.

Oftentimes the debate got quite heated. If you read books from the 1990s, when multiregionalism in particular was on the defensive, there were arguments that the recent out of Africa model was more inspirational in regards to our common humanity. As a riposte the multiregionalists asserted that those suggesting recent African origins with total replacement was saying that our species came into being through genocide.

Though some had long warned against this, the dominant perception outside of population genetics was that results such the “mitochondrial Eve” had given strong support to the recent African origin of modern humans, to the exclusion of other ancestry. 2002’s Dawn of Human Culture took it for granted that the recent African origin of modern humans to the total exclusion of other hominin lineages was established fact.

In 2008 I went to a talk where Svante Paabo presented some recent Neanderthal ancient mtDNA work. It was rather ho-hum, as Paabo showed that the Neanderthal lineages were highly diverged from modern ones, and did not leave any descendants. Though of course most modern human lineages did not leave any descendants from that period, Paabo took this evidence supporting the proposition that Neanderthals did not contribute to the modern human gene pool.

When his lab reported autosomal Neanderthal admixture in 2010, it was after initial skepticism and shock internally. I know Milford Wolpoff felt vindicated, while Chris Stringer began to emphasize that the recent African origin of modern humanity also was defined by regional assimilation of other lineages. The data have ultimately converged to a position somewhere between the extreme models of total replacement or balanced and symmetrical gene flow.

This is not surprising. Extreme positions are often rhetorically useful and popular when there’s no data. But reality does not usually conform to our prejudices, so ultimately one has to come down at some point.

The data for non-Africans is rather unequivocal. The vast majority of (>90%) of the ancestry of non-Africans seems to go back to a small number of common ancestors ~60,000 years ago. Perhaps in the range of ~1,000 individuals. These individuals seem to be a node within a phylogenetic tree where all the other branches are occupied by African populations. Between this period and ~15,000 years ago these non-Africans underwent a massive range expansion, until modern humans were present on all continents except Antarctica. Additionally, after the Holocene some of these non-African groups also experienced huge population growth due to intensive agricultural practice.

To give a sense of what I’m getting at, the bottleneck and common ancestry of non-Africans goes back ~60,000 years, but the shared ancestry of Khoisan peoples and non-Khoisan peoples goes back ~150,000-200,000 years. A major lacunae of the current discussion is that often the dynamics which characterize non-Africans are assumed to be applicable to Africans. But they are not.

A 2014 paper illustrates one major difference by inferring effective population from whole genomes: African populations have not gone through the major bottleneck which is imprinted on the genomes of all non-African populations. The Khoisan peoples, the most famous of which are the Bushmen of the Kalahari, have the largest long term effective populations of any human group. The Yoruba people of Nigeria have a history where they were subject to some population decline, but not to the same extent as non-Africans.

What do we take away from this?

One thing is that we have to consider that the assimilationist model which seems to be necessary for non-Africans, also applies to Africans. For years some geneticists have been arguing that some proportion of African ancestry as well is derived from lineages outside of the main line leading up to anatomically modern humans. Without the smoking gun of ancient genomes this will probably remain a speculative hypothesis. I hope that Lee Berger’s recent assertion that they’ve now dated Homo naledi to ~250,000 years before the present may offer up the possibility that ancient DNA will help resolve the question of African archaic admixture (i.e., if naledi is related to the “ghost population”?).

The second dynamic is that the bottleneck-then-range-expansion which is so important in defining the recent prehistory of non-Africans is not as relevant to Africans during the Pleistocene. The very deep split dates being inferred from whole genome analysis of African populations makes me wonder if multiregional evolution is actually much more important within Africa in the development of modern humans in the last few hundred thousand years. Basically, the deep split dates may highlight that there was recurrent gene flow over hundreds of thousands of years between different closely related hominin populations in Africa.

Ultimately, it doesn’t seem entirely surprising that the “Out of Africa” model does not quite apply within Africa.

Addendum: Over the past ~5,000 years we have seen the massive expansion of agricultural populations within the continent. The “deep structure” therefore may have been erased to a great extent, with Pygmies, Khoisan, and Hadza, being the tip of the iceberg in terms of the genetic variation which had characterized the Africa during the Pleistocene.

“Out of Africa” bottleneck is what really matters for mutations

At least in relation to mutational load, if you read a new preprint in biorxiv, The demographic history and mutational load of African hunter-gatherers and farmers:

The distribution of deleterious genetic variation across human populations is a key issue in evolutionary biology and medical genetics. However, the impact of different modes of subsistence on recent changes in population size, patterns of gene flow, and deleterious mutational load remains to be fully characterized. We addressed this question, by generating 300 high-coverage exome sequences from various populations of rainforest hunter-gatherers and neighboring farmers from the western and eastern parts of the central African equatorial rainforest. We show here, by model-based demographic inference, that the effective population size of African populations remained fairly constant until recent millennia, during which the populations of rainforest hunter-gatherers have experienced a ~75% collapse and those of farmers a mild expansion, accompanied by a marked increase in gene flow between them. Despite these contrasting demographic patterns, African populations display limited differences in the estimated distribution of fitness effects of new nonsynonymous mutations, consistent with purifying selection against deleterious alleles of similar efficiency in the different populations. This situation contrasts with that we detect in Europeans, which are subject to weaker purifying selection than African populations. Furthermore, the per-individual mutation load of rainforest hunter-gatherers was found to be similar to that of farmers, under both additive and recessive modes of inheritance. Together, our results indicate that differences in the subsistence patterns and demographic regimes of African populations have not resulted in large differences in mutational burden, and highlight the role of gene flow in reshaping the distribution of deleterious genetic variation across human populations.

There’s two major moving parts in this preprint. First, they using phylogenomic methods to explicitly model population history. Second, they integrated their demographic results in generation and interpreting the distribution of mutations within the exomes of these populations. That is, they combined phylogenomics to gain insight into population genomics, as the latter focuses more on the parameters which define variation with a population.

The data they worked with was from the exome. The regions of the genome which translate into genes. That’s ~30 million bases. They get really good precision due to high coverage, hitting site about 70 times. Their sample was about 300 Africans and 100 Europeans, and they got ~500,000 polymorphisms or variants for their trouble.

The populations were labeled by subsistence and provenance. The Europeans were Belgians. For the Africans they had two groups of hunter-gatherer Pymgies, and two groups of Bantu agriculturalists, sampled from western and eastern locations as you see on the map above.

The admixture plots, which separate out individuals into K numbers of populations break out in a way that makes sense. First, Europeans separate, and the eastern agriculturalist populations have a little bit of evidence of European-like ancestry. This is almost certainly Middle Eastern farmer, which has been found in many East African populations, and those populations which have mixed with them. Then the hunter-gathers separate from the agriculturalists. This is in line with expectation and earlier research; the hunter-gatherers of Africa seem very different from the agriculturalists, and are actually more closely related to each other than the agriculturalists in their neighboring regions.

The exception to this pattern is caused by recent gene flow, which is clearly evident above. Due to population size differences it looks like there is more agricultural ancestry in the Pygmies than vice versa. I wish that they had sampled Mbuti Pygmies. I’m told that this group has the least agricultural admixture.

But then they decided to get fancy and explicitly model demographic histories with fastsimcoal2. What does this do? From the website for the software:

While preserving all the simulation flexibility of simcoal2, fastsimcoal is now implemented under a faster continous-time sequential Markovian coalescent approximation, allowing it to efficiently generate genetic diversity for different types of markers along large genomic regions, for both present or ancient samples. It includes a parameter sampler allowing its integration into Bayesian or likelihood parameter estimation procedure.

fastsimcoal can handle very complex evolutionary scenarios including an arbitrary migration matrix between samples, historical events allowing for population resize, population fusion and fission, admixture events, changes in migration matrix, or changes in population growth rates. The time of sampling can be specified independently for each sample, allowing for serial sampling in the same or in different populations.

The models you see that were tested are pretty simple, and they all seem plausible I suppose. Their simulations suggested that the three above scenarios, with alternative branching patterns and various gene flows, were all of equal likelihood. That is, given the models and the data that they had (4-fold synonymous sites which are likely to be neutral) you can’t distinguish which is right.

In all the models hunter-gatherers diverged relatively recently and so did the agriculturalists. Europeans, who are stand-ins for all non-Africans in this scenario, diverged pretty early from the Africans. But how the Africans relate to each other and Europeans is not totally clear. Why? Because ancient population structure. It is becoming rather obvious now that ~100,000 years ago, and earlier, there were many different modern human lineages which had already diversified. The Khoisan seem to have diverged from other human lineages closer to 200,000 thousand than 100,000 years ago. What this means is that for most of the history of anatomically modern humans population structure  existed between distinct lineages. And some of that persists down to today within Africa.

I’ll bullet point some of their inferences from these models (verbatim quotes below):

  1. Our results suggest that the ancestors of the contemporary RHG, AGR and EUR populations diverged between 85 and 140 thousand years ago (kya), from an ancestral population that underwent demographic expansion between 173 and 191 kya
  2. After the initial population splits, the Ne of AGR and RHG (NaAGR and NaRHG) remained within a range extending from 0.55 to 2.2 times the ancestral African Ne (NHUM), whereas EUR (NaEUR) experienced a decrease in Ne by a factor of three to seven.
  3. The ancestors of the wRHG and eRHG populations diverged 18 to 20 kya (TRHG), and underwent a decreased in Ne by a factor of 3.8 to 5.7 for the wRHG (NwRHG) and 7.1 to 11 for the eRHG (NeRHG), regardless of the branching model considered.
  4. The ancestors of the AGR (NaAGR) split into western and eastern populations 6.7 to 11 kya (TAGR), and underwent a mild expansion, by a factor of 2.3 to 3.1 for the wAGR (NwAGR) and 1.2 to 2.2 for the eAGR (NeAGR).
  5. The EUR population experienced a 7.1- to 8.3-fold expansion (NEUR) 12 to 22 kya (TEUR).

No results are perfect. But some of these dates do not make sense. There’s a lot of circumstantial evidence that the ancestors of European populations began to expand over the last 10,000 years. The dates above suggest there was a Pleistocene expansion. Basically you can divide that value by half, and then you get a reasonable range.

Second, both the agriculturalists sampled here are Bantu speaking, and there’s a good amount of cultural and genetic data for recent shared ancestry of the Bantu over the last 3,000 years. I understand that admixture with a very diverged lineage (e.g., eastern Bantu agriculturalist samples mixing with Nilotic populations, which is how they got some non-African ancestry, as well as local Pygmy groups) can inflate these divergence dates. If that’s the case, they should note that in the text.

We don’t have much historical or archaeological clarity from what I know about divergences between Pygmy groups. This particular group has studied the topic and published on it before, so I’m inclined to trust them more than anyone else. But, the above dates for groups we do know make me a bit more skeptical of a simple divergence around the Last Glacial Maximum.

Then there are the earliest divergences. And 85 to 140,000 year interval is huge for when non-Africans split off from Africans. If closer to 140 than 85, then that means that non-African divergence from Africans preserves ancient African diversity. That is, non-Africans descend from an African group that no longer exists (or has not been sampled in this study at least!). I’ve poked around this question, and when you take into account recent gene flow, it is hard to find the specific African group that non-Africans descend from, though there is some consensus that they branched off from the non-Khoisan Africans later than from the Khoisan.

But there is also a lot of archaeological and some ancient genetic DNA now that indicates that the vast majority of non-African ancestry began to expand rapidly around 50-60,000 years ago. This is tens of thousands of years after the lowest value given above. Therefore, again we have to make recourse to a long period of separation before the expansion. This is not implausible on the face of it, but we could do something else: just assume there’s an artifact with their methods and the inferred date of divergence is too old. That would solve many of the issues.

I really don’t know if the above quibbles have any ramification for the site frequency spectrum of deleterious mutations. My own hunch is that no, it doesn’t impact the qualitative results at all.

Figure 3 clearly shows that Europeans are enriched for weak and moderately deleterious mutations (the last category produces weird results, and I wish they’d talked about this more, but they observe that strong deleterious mutations have issues getting detected). Ne is just the effective population size and s is the selection coefficient (bigger number, stronger selection).

Why are the middle two values enriched? Presumably it’s the non-African bottleneck. This is where another non-African population would have been a nice check to make sure that it was the “Out of Africa” bottleneck…but it’s probably asking a bit much to sequence more individuals to 70x coverage.

The lack of difference between the African populations is an indication that recent demography is not shaping the distribution much. Additionally, they note that gene flow between the African groups probably increased diversity in some ways, so that as long as a group is connected with other populations it will probably be rescued (note that none of these in their data were particular inbred as judging by runs of homozygosity).

Finally, they found that the number of homozygote mutations that were deleterious is higher in their model results for Europeans than the African groups. This is not surprising, and what one expects. But, they found that this is a function likely of continuous gene flow between the African groups. Without gene flow homozygosity would have been much higher. This gets back to the fact that gene flow is a powerful homogenizing tool, and the lack of gene flow has to be pretty extreme for divergence to occur.

Which brings us back to the “Out of Africa” event. The next ten years are going to see a lot of investigation of African phyologenomics and population genomics. Basically, the relationships, and selection pressures. It is totally implausible that Bantu groups in Kenya and Tanzania did not absorb local non-Nilotic populations. We’ll figure that out. Additionally, selection pressures are probably different between different groups. We’ll know more about that. But, ancient DNA will probably give us some understanding of why non-Africans went through such a massive demographic sieve. We know in broad sketches. But most people want to fill in the details.

Citation: The demographic history and mutational load of African hunter-gatherers and farmers, Marie Lopez, Athanasios Kousathanas, Helene Quach, Christine Harmant, Patrick Mouguiama-Daouda, Jean-Marie Hombert, Alain Froment, George H Perry, Luis B Barreiro, Paul Verdu, Etienne Patin, Lluis Quintana-Murci, doi: https://doi.org/10.1101/131219

The logic of human destiny was inevitable 1 million years ago

Robert Wright’s best book, Nonzero: The Logic of Human Destiny, was published nearly 20 years ago. At the time I was moderately skeptical of his thesis. It was too teleological for my tastes. And, it does pander to a bias in human psychology whereby we look to find meaning in the universe.

But this is 2017, and I have somewhat different views.

In the year 2000 I broadly accepted the thesis outlined a few years later in The Dawn of Human Culture. That our species, our humanity, evolved and emerged in rapid sequence, likely due to biological changes of a radical kind, ~50,000 years ago. This is the thesis of the “great leap forward” of behavioral modernity.

Today I have come closer to models proposed by Michael Tomasello in The Cultural Origins of Human Cognition and Terrence Deacon in The Symbolic Species: The Co-evolution of Language and the Brain. Rather than a punctuated event, an instance in geological time, humanity as we understand it was a gradual process, driven by general dynamics and evolutionary feedback loops.

The conceit at the heart of Robert J. Sawyer’s often overly preachy Neanderthal Parallax series, that if our own lineage went extinct but theirs did not they would have created a technological civilization, is I think in the main correct. It may not be entirely coincidental that the hyper-drive cultural flexibility of African modern humans evolved in African modern humans first. There may have been sufficient biological differences to enable this to be likely. But I believe that if African modern humans were removed from the picture Neanderthals would have “caught up” and been positioned to begin the trajectory we find ourselves in during the current Holocene inter-glacial.

Luke Jostins’ figure showing across board encephalization

The data indicate that all human lineages were subject to increased encephalization. That process trailed off ~200,000 years ago, but it illustrates the general evolutionary pressures, ratchets, or evolutionary “logic”, that applied to all of them. Overall there were some general trends in the hominin lineage that began to characterized us about a million years ago. We pushed into new territory. Our rate of cultural change seems to gradually increased across our whole range.

One of the major holy grails I see now and then in human evolutionary genetics is to find “the gene that made us human.” The scramble is definitely on now that more and more whole genome sequences from ancient hominins are coming online. But I don’t think there will be such gene ever found. There isn’t “a gene,” but a broad set of genes which were gradually selected upon in the process of making us human.

In the lingo, it wasn’t just a hard sweep from a de novo mutation. It was as much, or even more, soft sweeps from standing variation.

Aryan marauders from the steppe came to India, yes they did!

Its seems every post on Indian genetics elicits dissents from loquacious commenters who are woolly on the details of the science, but convinced in their opinions (yes, they operate through uncertainty and obfuscation in their rhetoric, but you know where the axe is lodged). This post is an attempt to answer some questions so I don’t have to address this in the near future, as ancient DNA papers will finally start to come out soon, I hope (at least earlier than Winds of Winter).

In 2001’s The Eurasian Heartland: A continental perspective on Y-chromosome diversity Wells et al. wrote:

The current distribution of the M17 haplotype is likely to represent traces of an ancient population migration originating in southern Russia/Ukraine, where M17 is found at high frequency (>50%). It is possible that the domestication of the horse in this region around 3,000 B.C. may have driven the migration (27). The distribution and age of M17 in Europe (17) and Central/Southern Asia is consistent with the inferred movements of these people, who left a clear pattern of archaeological remains known as the Kurgan culture, and are thought to have spoken an early Indo-European language (27, 28, 29). The decrease in frequency eastward across Siberia to the Altai-Sayan mountains (represented by the Tuvinian population) and Mongolia, and southward into India, overlaps exactly with the inferred migrations of the Indo-Iranians during the period 3,000 to 1,000 B.C. (27). It is worth noting that the Indo-European-speaking Sourashtrans, a population from Tamil Nadu in southern India, have a much higher frequency of M17 than their Dravidian-speaking neighbors, the Yadhavas and Kallars (39% vs. 13% and 4%, respectively), adding to the evidence that M17 is a diagnostic Indo-Iranian marker. The exceptionally high frequencies of this marker in the Kyrgyz, Tajik/Khojant, and Ishkashim populations are likely to be due to drift, as these populations are less diverse, and are characterized by relatively small numbers of individuals living in isolated mountain valleys.

In a 2002 interview with the India site Rediff, the first author was more explicit:

Some people say Aryans are the original inhabitants of India. What is your view on this theory?

The Aryans came from outside India. We actually have genetic evidence for that. Very clear genetic evidence from a marker that arose on the southern steppes of Russia and the Ukraine around 5,000 to 10,000 years ago. And it subsequently spread to the east and south through Central Asia reaching India. It is on the higher frequency in the Indo-European speakers, the people who claim they are descendants of the Aryans, the Hindi speakers, the Bengalis, the other groups. Then it is at a lower frequency in the Dravidians. But there is clear evidence that there was a heavy migration from the steppes down towards India.

But some people claim that the Aryans were the original inhabitants of India. What do you have to say about this?

I don’t agree with them. The Aryans came later, after the Dravidians.

Over the past few years I’ve gotten to know the above first author Spencer Wells as a personal friend, and I think he would be OK with me relaying that to some extent he was under strong pressure to downplay these conclusions. Not only were, and are, these views not popular in India, but the idea of mass migration was in bad odor in much of the academy during this period. Additionally, there was later work which was less clear, and perhaps supported an Indian origin for R1a1a. Spencer himself told me that it was not impossible for R1a to have originated in India, but a branch eventually back-migrated to southern Asia.

But even researchers from the group at Stanford where he had done his postdoc did not support this model by the middle 2000s, Polarity and Temporality of High-Resolution Y-Chromosome Distributions in India Identify Both Indigenous and Exogenous Expansions and Reveal Minor Genetic Influence of Central Asian Pastoralists. In 2009 a paper out of an Indian group was even stronger in its conclusion for a South Asian origin of R1a1a, The Indian origin of paternal haplogroup R1a1* substantiates the autochthonous origin of Brahmins and the caste system.

By 2009 one might have admitted that perhaps Spencer was wrong. I was certainly open to that possibility. There was very persuasive evidence that the mtDNA lineages of South Asia had little to do with Europe or the Middle East.

Yet a closer look at the above papers reveals two major systematic problems.

First, ancient DNA has made it clear that there has been major population turnover during the Holocene, but this was not the null hypothesis in the 2000s. Looking at extant distributions of lineages can give one a distorted view of the past. Frankly, the 2009 Indian paper was egregious in this way because they included Turkic groups in their Central Asian data set. Even in 2009 there was a whole lot of evidence that Central Asian Turkic groups were likely very different from Indo-European Turanian populations which would have been the putative ancestors of Indo-Aryans. Honestly the authors either consciously loaded the die to reduce the evidence for gene flow from Central Asia, or they were ignorant (the nature of the samples is much clearer in the supplements than the  primary text for what it’s worth).

Second, Y chromosomal marker sets in the 2000s were constrained to fast mutating microsatellite regions or less than 100 variant SNPs on the Y. Because it is so repetitive the Y chromosome is hard to sequence, and it really took the technologies of the last ten years to get it done. Both the above papers estimate the coalescence of extant R1a1a lineages to be 10-15,000 years before the present. In particular, they suggest that European and South Asian lineages date back to this period, pushing back any possible connection between the groups, and making it possible that European R1a1a descended from a South Asian founder group which was expanding after the retreat of the ice sheets. The conclusions were not unreasonable based on the methods they had.  But now we have better methods.*

Whole genome sequencing of the Y, as well as ancient DNA, seems to falsify the above dates. Though microsatellites are good for very coarse grain phyolgenetic inferences, one has to be very careful about them when looking at more fine grain population relationships (they are still useful in forensics to cheaply differentiate between individuals, since they accumulate variation very quickly). They mutate fast, and their clock may be erratic.

Additionally, diversity estimates were based on a subset of SNP that were clearly not robust. R1a1a is not diverse anywhere, though basal lineages seem to be present in ancient DNA on the Pontic steppe in some cases.

To show how lacking in diversity R1a1a is, here are the results of a 2016 paper which performed whole genome sequencing on the Y. Instead of relying on the order of 10 to 100 SNPs, this paper discover over 65,000 Y variants worldwide. Notice how little difference there is between different South Asian groups below, indicative of a massive population expansion relatively recently in time which didn’t even have time to exhibit regional population variation. They note that “The most striking are expansions within R1a-Z93 [the South Asian clade], ~4.0–4.5 kya. This time predates by a few centuries the collapse of the Indus Valley Civilization, associated by some with the historical migration of Indo-European speakers from the western steppes into the Indian sub-continent.

Read More

Oxford Nanopore finally giving hope to biologist’s dreams

I don’t talk too much about genomic technology because it changes so fast. Being up-to-date on the latest machines and tools often requires regular deep-dives right now, though I believe at some point technological improvements will plateau as the data returned will be cheap and high quality enough that there won’t be much to gain on the margin.

Of course we’ve already come a long way. Fifteen years ago a “whole human genome” cost on the order of billions of dollars. Today a high quality whole human genome will run you on the order of $1,000. This is fundamentally a technology driven change, with big metal machines automatically generating reads and powerful computers to process them. One couldn’t imagine such a scenario 30 years ago because the technology wasn’t there.

I’ve stated before that I don’t think genomics fundamentally alters what we know and understand about evolution. At least so far. But it is a huge change in the domain of medicine. Cleary the human genomicists, especially Francis Collins, overhyped the yield of the technology in relation to healthcare in the 2000s. But with cheap and ubiquitous sequencing we may see the end of Mendelian diseases in our lifetime (through screening and possibly at some point CRISPR therapy).

This has been driven by technological innovation in the private sector around a few firms. The famous chart showing the massive decline in the cost of genomic sequencing over the past 15 years is due in large part to the successes of Illumina. But, Illumina has also had a quasi-monopoly on the field over the past five years (or more), and that shows with the leveling off of the decline in cost. Until the past year….

What gives? Many people believe that Illumina is moving again in part because a genuine challenger is emerging, or at least the flicker of a challenge, in the form of Oxford Nanopore. Oxford Nanopore has been around since 2005, but it really came into the public eye around 2010 or so. But like many tech companies it overpromised in the early years. I remember skeptically listening to a friend in the fall of 2011 talk about how quickly Nanopore was going to change the game…. I didn’t put too much stock into these sorts of presentations to hopeful researchers because I remember Pacific Biosciences making the same sort of pitch to amazed biologists in 2008. Pac Bio is still around, but has turned out to be a bit player, rather than a challenger to Illumina.

But I have to admit that Nanopore has really started to step up its game of late. Probably one of the major things it has accomplished is that it’s made us reimagine what sequencing technology should look like. Rather than refrigerators of various sizes, Oxford Nanopore allows us to imagine sequencing technology which exhibits a form factor more analogous to a USB thumb drive. The first time I saw a Nanopore machine in the flesh I knew intellectually what I was going to see…but because of my deep intuitions I still overlooked the two Nanopore machines laying on the workbench in front of me.

Despite their amazing form factor, these early Nanopore machines had limited application. They didn’t generate much data, and so were utilized by researchers who worked with smaller genomes. Scientists who worked with bacteria seem to have been using them a lot, for example. Additionally the machines were error prone and people were working out their kinks in real time in laboratories (one tech told me early on they were so small that he swore they were affected by ambient vibrations so he found ways to dampen that source of error).

A new preprint suggests we may be turning the corner though, Nanopore sequencing and assembly of a human genome with ultra-long reads:

Nanopore sequencing is a promising technique for genome sequencing due to its portability, ability to sequence long reads from single molecules, and to simultaneously assay DNA methylation. However until recently nanopore sequencing has been mainly applied to small genomes, due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/Ceph human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (~30x theoretical coverage) from 39 flowcells. De novo assembly yielded a highly complete and contiguous assembly (NG50 ~3Mb). We observed considerable variability in homopolymeric tract resolution between different basecallers. The data permitted sensitive detection of both large structural variants and epigenetic modifications. Further we developed a new approach exploiting the long-read capability of this system and found that adding an additional 5x-coverage of “ultra-long” reads (read N50 of 99.7kb) more than doubled the assembly contiguity. Modelling the repeat structure of the human genome predicts extraordinarily contiguous assemblies may be possible using nanopore reads alone. Portable de novo sequencing of human genomes may be important for rapid point-of-care diagnosis of rare genetic diseases and cancer, and monitoring of cancer progression. The complete dataset including raw signal is available as an Amazon Web Services Open Dataset at: https://github.com/nanopore-wgs-consortium/NA12878.

30x just means that you’re getting bases sampled typically 30 times, so that you have a very accurate and precise read on its state. 30x has become the default standard in medical genomics. If Nanopore can do 30x on human genomes at reasonable cost it won’t be a niche player much longer.

The read length is important because last I checked the human genome still had large holes in it. The typical Illumina machine produces average read lengths in the low hundreds of base pairs. If you have large repetitive regions of the human genome (and you do have these), you’re never going to span them with such short yardsticks. Additionally, these short reads have to be tiled together when you assemble a genome from raw results, and this is a computationally really intensive task. It’s good when you have a reference genome you can align to as a scaffold. But researchers who don’t work on humans or model organisms may not have a good reference genome, or in many cases a reference genome at all.

Pac Bio occupies a space where it provide really long reads for a high price point. Most of the time this isn’t necessary, but imagine you work on a disease which is caused by large repetitive regions. You are likely willing to pay the price that is asked. And because Pac Bio generates very long reads it makes de novo assembly much easier, as your algorithm has to tile together far fewer contiguous sequences, and long sequences are less likely to have lots of repetitive matches in the genome.

But Pac Bio machines are expensive and huge. In the abstract above it alludes to “Portable de novo sequencing of human genomes.” This is a huge deal. The dream, as whispered by some genomicists I have known, is that at a point in the future biologists would carry portable sequencers which would produce very long reads that so that they could de novo assemble sequences on the spot. A concrete example might be a health inspector checking on the sorts of microbes found on the counter of a restaurant, or a field ecologist who might be sample various fungi to discover cryptic species.

Obviously this is still a dream. The preprint above makes it clear that to do what they did required a lot of novel techniques and development of new tools. This isn’t beta technology, it’s early alpha. But because it’s 2017 the outlines of the dream are coming into public view.

Citation: Nanopore sequencing and assembly of a human genome with ultra-long reads
Miten Jain, Sergey Koren, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes, Sunir Malla, Hannah Marriott, Karen H Miga, Tom Nieto, Justin O’Grady, Hugh E Olsen, Brent S Pedersen, Arang Rhie, Hollian Richardson, Aaron Quinlan, Terrance P Snutch, Louise Tee, Benedict Paten, Adam M. Phillippy, Jared T Simpson, Nicholas James Loman, Matthew Loose
bioRxiv 128835; doi: https://doi.org/10.1101/128835