The architecture of skin color variation in Africa

Baby of hunter-gatherers in Southern Africa

Very interesting abstract at the ASHG meeting of a plenary presentation,Novel loci associated with skin pigmentation identified in African populations. This is clearly the work that one of the comments on this weblog alluded to last summer during SMBE. There I was talking about the likely introduction of the derived SLC24A5 variant to the Khoisan peoples and its positive selection in peoples in southern Africa.

Below is the abstract in full. Those who follow the literature on this see the usual suspects in relation to genes, but also new ones:

Despite the wide range of variation in skin pigmentation in Africans, little is known about its genetic basis. To investigate this question we performed a GWAS on pigmentation in 1,593 Africans from populations in Ethiopia, Tanzania, and Botswana. We identify significantly associated loci in or near SLC24A5MFSD12TMEM138…OCA2 and HERC2. Allele frequencies at these loci in global populations are strongly correlated with UV exposure. At SLC24A5 we find that a non-synonymous mutation associated with depigmentation in non-Africans was introduced into East Africa by gene flow, and subsequently rose to high frequency. At MFSD12, we identify novel variants that are strongly correlated with dark pigmentation in populations with Nilo-Saharan ancestry. Functional assays reveal that MFSD12 codes for a lysosomal protein that influences pigmentation in cultured melanocytes, zebrafish and mice. CRISPR knockouts of murine Mfsd12 display reduced pheomelanin pigmentation similar to the grizzled mouse mutant (gr/gr). Exome sequencing of gr/gr mice identified a 9 bp in-frame deletion in exon two of Mfsd12. Thus, using human GWAS data we were able to map a classic mouse pigmentation mutant. At TMEM138…we identify mutations in melanocyte-specific regulatory regions associated with expression of UV response genes. Variants associated with light pigmentation at this locus show evidence of a selective sweep in Eurasians. At OCA2 and HERC2 we identify novel variants associated with pigmentation and at OCA2, the oculocutaneous albinism II gene, we find evidence for balancing selection maintaining alleles associated with both light and dark skin pigmentation. We observe at all loci that variants associated with dark pigmentation in African populations are identical by descent in southern Asian and Australo-Melanesian populations and did not arise due to convergent evolution. Further, the alleles associated with skin pigmentation at all loci but SLC24A5 are ancient, predating the origin of modern humans. The ancestral alleles at the majority of predicted causal SNPs are associated with light skin, raising the possibility that the ancestors of modern humans could have had relatively light skin color, as is observed in the San population today. This study sheds new light on the evolutionary history of pigmentation in humans.

Much of this is not surprising. Looking at patterns of variation around pigmentation loci researchers suggested years ago that Melanesians and Africans exhibited evidence of similarity and functional constraint. That is, the dark skin alleles date back to Africa and did not deviate from their state due to selection pressures. In contrast, light skin alleles in places like eastern and western Eurasia are quite different.

Nyakim Gatwech

This abstract also confirms something I said in a comment on the same thread, that Nilotic peoples are the ones likely to have been subject to selection for dark skin in the last 10,000 years. You see above that variants on MFSD12 are correlated with dark complexion. In particular, in Nilo-Saharan groups. The model Nyakim Gatwech is of South Sudanese nationality and has a social media account famous for spotlighting her dark skin. In comparison to the Gatwech and the San Bushman child above are so different in color that I think it would be clear these two individuals come from very distinct populations.

The fascinating element of this abstract is the finding that most of the alleles which are correlated with lighter skin are very ancient and that they are the ancestral alleles more often than the derived! We’ll have to wait until the paper comes out. My assumption is that after the presentation Science will put it on their website. But until then here are some comments:

  • There is obviously a bias in the studies of pigmentation toward those which highlight European variability.
  • The theory of balancing selection makes sense to me because ancient DNA is showing OCA2 “blue eye” alleles which are not ancestral in places outside of Western Europe. And in East Asia there their own variants.
  • Lots of variance in pigmentation not accounted for in mixed populations (again, lots of the early genomic studies focused on populations which were highly diverged and had nearly fixed differences). Presumably, African research will pick a lot of this up.
  • This also should make us skeptical of the idea that Western Europeans were necessarily very dark skinned, as now we know that human pigmentation architecture is complex enough that sampling modern populations expand our understanding a great deal.
  • Finally, it’s long been assumed that at some stage early on humans were light skinned on most of their body because we had fur. When we lost our fur is when we would need to have developed dark skin. This abstract is not clear at how far long ago light and dark alleles coalesce to common ancestors.

Genetic variation and disease in Africa

Very readable review, Gene Discovery for Complex Traits: Lessons from Africa. It’s open access, so I recommend it. The summary:

The genetics of African populations reveals an otherwise “missing layer” of human variation that arose between 100,000 and 5 million years ago. Both the vast number of these ancient variants and the selective pressures they survived yield insights into genes responsible for complex traits in all populations.

The main issue I might have is I’m not sure that focusing on 5 million year time spans is particularly useful. Rather, looking at the last major bottleneck for modern humans before the “Out of Africa” event would be key, since that’s when a lot of the common variation would disappear, and very rare variants probably don’t have deep time depth in any case. With all that being said, the qualitative analysis is on point.

One of the major issues in the “SNP-chip” era has been that ascertainment of variation has been skewed toward Europeans. Though more recent techniques have tried to fix this…this review points out that if you by necessity constrain the SNPs of interest to those that vary outside of Africa (most of the world’s population), you are taking may alleles private to Africa off the table. This is relevant because the “Out of Africa” bottleneck ~50,000 years ago means that African populations harbor a lot more genetic variation than non-African populations do.

The move to high-quality whole genome sequencing obviates these concerns. As a matter of course African variation will be “picked up” since the marker set is not constrained ahead of time.

Importantly the authors focus on South Africa and the Xhosa population. This group has about ~20% Khoisan genetic ancestry, which is very diverse, and, very distinct, from that of the remaining ~80% of its ancestry. With its large African immigrant population and highly diverse native groups, some of them quite admixed, South Africa could actually provide some hard-to-substitute value in biomedical genetics.

Khoisan may not have diverged ~300,000 years ago

A few years ago I contributed to an op-ed which defended the utility of the race concept in biology in USA Today (which by the way prompted a quite patronizing email from a famous doyen of population genetics who wished to correct my ignorance; here’s a clue: “Out of Africa again & again”).

In my initial draft, I had stated that the Khoisan diverged from other human populations ~200,000 years ago. The fact-checker came back and said that this didn’t seem to be a supportable claim. The reason I gave the ~200,000 figure is that I’d button-holed people who looked at these genomes, and they were coming to the conclusion that the divergence between Khoisan and non-Khoisan was further back than we’d presupposed. And that was the number given to me.

Ultimately I compromised and allowed them to change the divergence value to 150,000 years before the present.

Today we’re in a different landscape. The above figure is from the Science paper, Southern African ancient genomes estimate modern human divergence to 350,000 to 260,000 years ago, which was earlier a biorxiv preprint (which I mentioned last spring). In concert with the North African find, the media is running with the idea that the origin of modern humans goes back very far indeed. This piece in ScienceNews is actually pretty good in my opinion at staying under control, though not all write-ups have been so measured.

So in a span of two years we’ve gone from me pushing and compromising on a value of ~150,000 years, to researchers suggesting that the Khoisan/non-Khoisan divergence is about two-fold older than that!

Well, I’m here to tell you that a prominent geneticist who is very conversant with these issues is simply incredulous about the likelihood of this particular value. I brought up this preprint to them over lunch and they just didn’t buy it. That is, they are skeptical that the amount of admixture would have skewed the earlier inferences to the magnitude that they seem to have in these results.

The authors in the paper used G-PhoCS and their own ingenious method to come to these inferences of split dates. The problem with these methods is that the inferences generated aren’t nearly as straightforward as an admixture estimate (which can be checked by something as simple as a PCA). I don’t want to get into the details, but I remember seeing models in the 2000s which inferred that East Asians and Europeans diverged ~25,000 years ago, or that there was no Neanderthal admixture in Europeans (to a high degree of confidence). Models can come out with a lot of values.

More importantly, look at the dates of divergence of non-Africans (Sardinians here) from their closest African relatives.

  • 115,000 years before the present (Dinka-Sardinian) for G-PhoCS
  • 76,000 years before the present for their TT-method

In light of the likelihood that the closest population to non-Africans may have been an East African population represented by Ethiopia Mota individual (along with modern Hadza), we can probably drop that estimate down a bit. But G-PhoCS in particular just gives too old an estimate. There are ways it makes sense (lots of old structure within Africa) of course. I’m just speaking in terms of possibilities.

The diversification of extant modern populations seems to have occurred around ~50,000-60,000 years before the present. This aligns with the archaeology, and the ancient genomes which we have on hand.

Of course the methods in this paper might be right. And the fossil from North Africa does add some plausibility to that. But really the whole field is somewhat unsettled now, and we should be cautious of reporting of definitive truths in the media.

Population structure in Neanderthals leads to genetic homogeneity

The above tweet is in response to a article which reports on the finding past month in PNAS, Early history of Neanderthals and Denisovans. It’s open access, you should read it. I don’t think I’ve reviewed it because I haven’t dug through the supplements. To be frank this is a paper where you pretty much have to read the supplements because they’re introducing a somewhat different model here than is the norm.

I talked to Alan Rogers at SMBE about this paper. Broadly, I think there might be something to it, and it’s because of what David says above. It is simply hard to imagine that Neanderthals could be extremely successful with such low genetic diversity as we see, and spread so thin. Now, the Quanta Magazine tries to emphasize that the effective population is not the true census population, but I wish it would have explained it more clearly. Basically, the size that is relevant for breeding is obviously not going to the same as a head count. And, because effective populations are highly sensitive to bottlenecks you can get really small numbers even when the extant population at any given time may be large.

The PNAS paper makes some novel inferences, and I’ll set that to the side until I read the supplements. But I don’t think it’s crazy that population structure within Neanderthals could be leading to lower total genetic diversity.

Massive genomic sample sizes = detecting evolution in real time

The recent PLOS BIOLOGY paper, Identifying genetic variants that affect viability in large cohorts, seems to have triggered a feeding frenzy in the media. For example, Big Think has put up Researchers Find Evidence That Human Evolution Is Still Actively Happening.

I wasn’t paying close attention because of course human evolution is still happening actively. From a genetic perspective, evolution is just change in allele frequencies. Populations aren’t infinite, so even if there wasn’t any selection stochastic forces would shift allele frequencies. But of course selection is probably happening. For adaptation by natural selection to occur you need heritable variation on a trait where there are fitness differences as a function of variation within the population. It seems implausible that these conditions don’t still apply. There’s plenty of fitness variation in the population, and it’s unlikely to be random as a function of heritable variation.

But the devil is in the details. And last year Field et al. used the modern genomic tools available to detect selection occurring over the past 2,000 years. It is not credible that it would have magically stopped a few centuries ago.

So why is this new paper such a big deal? (note that it’s in PLOS BIOLOGY, not PLOS GENETICS) Because the method they use is ingenious and simple. Basically, they’re looking at changes in allele frequencies as a function of age in huge populations. It’s a little more complicated than that, they used a logistic regression to control for some of the other variables. But they found some biologically plausible hits with their data set of 50,000-150,000. And, they replicated their hits from a European sample to a non-European one.

This does bring me back to a discussion I observed a while back. An evolutionary geneticist who works with Drosophila mentioned offhand that in his field there really wasn’t that much of a need for more data. They could spend all their time to doing analysis. A prominent human geneticist whose work focused on biomedicine piped up that that wasn’t true at all for their field. There are some differences in the scientific questions, but there are also differences in terms of what you can do with humans as a model organism.

In the paper they look forward to the day of increasing sample sizes an order of magnitude beyond where it is now. At some point in the near future, large fractions of entire nations will be sequenced at medical grade level (30x coverage).

Anyway, you should read Identifying genetic variants that affect viability in large cohorts. It’s pretty straightforward.

After agriculture, before bronze


The above plot shows genetic distance/variation between highland and lowland populations in Papa New Guinea (PNG). It is from a paper in Science that I have been anticipating for a few months (I talked to the first author at SMBE), A Neolithic expansion, but strong genetic structure, in the independent history of New Guinea.

What does “strong genetic structure” mean? Basically Fst is showing the proportion of genetic variation which is partitioned between groups. Intuitively it is easy to understand, in that if ~1% of the genetic variation is partitioned between groups in one case, and ~10% in another, then it is reasonable to suppose that the genetic distance between groups in the second case is larger than in the first case. On a continental scale Fst between populations is often on the order of ~0.10. That is the value for example when you pool the variation amongst Northern Europeans and Chinese, and assess how much of it can be apportioned in a manner which differentiates populations (so it’s about ~10% of the variation).

This is why ancient DNA results which reported that Mesolithic hunter-gatherers and Neolithic farmers in Central Europe who coexisted in rough proximity for thousands of years exhibited differences on the order of ~0.10 elicited surprise. These are values we are now expecting from continental-scale comparisons. Perhaps an appropriate analogy might be the coexistence of Pygmy groups and Bantu agriculturalists? Though there is some gene flow, the two populations exist in symbiosis and exhibit local ecological segregation.

In PNG continental scale Fst values are also seen among indigenous people. The differences between the peoples who live in the highlands and lowlands of PNG are equivalent to those between huge regions of Eurasia. This is not entirely surprising because there has been non-trivial gene flow into lowland populations from Austronesian groups, such as the Lapita culture. Many lowland groups even speak Austronesian languages today.

Using standard ADMIXTURE analysis the paper shows that many lowland groups have significant East Asian ancestry (red), while none of the highland groups do (some individuals with East Asian admixture seem to be due to very recent gene flow). But even within the highlands the genetic differences are striking. The  Fst values between Finns and Southern European groups such as Spaniards are very high in a European context (due to Finnish Siberian ancestry as well as drift through a bottleneck), but most comparisons within the highland groups in PNG still exceeds this.

The paper also argues that genetic differences between Papuans and the natives of Australia pre-date the rising sea levels at the beginning of the Holocene, when Sahul divided between its various constituents. This is not entirely surprising considering that the ecology of the highlands during the Pleistocene would have been considerably different from Australia to the south, resulting in sharp differences in the hunter-gatherer lifestyles. Additionally, there does not seem to have been a genetic cline. Papuans are symmetrically related to all Australian groups they had samples from.

Using coalescence-based genomic methods they inferred that separation between highlands and some lowland groups occurred ~10-20,000 years ago. That is, after the Last Glacial Maximum. For the highlands, the differences seem to date to within the last 10,000 years. The Holocene. Additionally, they see population increases in the highlands, correlating with the shift to agriculture (cultivation of taro).

None of the above is entirely surprising, though I would take the date inferences with a grain of salt. The key is to observe that large genetic differences, as well as cultural differences, accrued in the highlands of PNG during the Holocene. In the paper they have a social and cultural explanation for what’s going on:

  Fst values in PNG fall between those of hunter-gatherers and present-day populations of west Eurasia, suggesting that a transition to cultivation alone does not necessarily lead to genetic homogenization.

A key difference might be that PNG had no Bronze Age, which in west Eurasia was driven by an expansion of herders and led to massive population replacement, admixture, and cultural and linguistic change (7, 8), or Iron Age such as that linked to the expansion of Bantu-speaking
farmers in Africa (24). Such cultural events have resulted in rapid Y-chromosome lineage expansions due to increased male reproductive variance (25), but we consistently find no evidence for this in PNG (fig. S13). Thus, in PNG, wemay be seeing the genetic, linguistic, and cultural diversity that sedentary human societies can achieve in the absence of massive technology-driven expansions.

Peter Turchin in books like Ultrasociety has aruged that one of the theses in Steven Pinker’s The Better Angels of Our Nature is incorrect: that violence has not decreased monotonically, but peaked in less complex agricultural societies. PNG is clearly a case of this, as endemic warfare was a feature of highland societies when they encountered Europeans. Lawrence Keeley’s War Before Civilization: The Myth of the Peaceful Savage gives so much attention to highland PNG because it is a contemporary illustration of a Neolithic society which until recently had not developed state-level institutions.

What papers like these are showing is that cultural and anthropological dynamics strongly shape the nature of genetic variation among humans. Simple models which assume as a null hypothesis that gene flow occurs through diffusion processes across a landscape where only geographic obstacles are relevant simply do not capture enough of the dynamic. Human cultures strongly shape the nature of interactions, and therefore the genetic variation we see around us.

Inbreeding causing issues in Osama bin Laden’s family

I didn’t figure I would have to say much about 9/11 really that others could not say (aside from perhaps you should read Marc Sageman’s Understanding Terror Networks if you want an ethnography of the Salafi jihadist movement which lead to al-Qaeda). But The Daily Best has a profile of one of Osama bin Laden’s sons:

Moreover, by this time, bin Laden already had two wives. But Najwa, the first of them, encouraged him to pursue Khairia, believing that having someone with her training permanently on hand would help her son Saad and his brothers and sisters, some of whom also suffered from developmental disorders.

Osama bin Laden had two dozen some children (approximately). But it was strange to me to see mention of several children with developmental disorders. Inbreeding is a major burden for Arab Muslim societies. And sure enough, Osama bin Laden’s first wife was his first cousin. She gave birth to around 10 children. Her father was Osama bin Laden’s mother’s brother. With the possibility of several generations of cousin marriage their relatedness may have been closer than normal half-siblings.

Note: Osama bin Laden’s father was from Yemen and his mother from Syria. So he was most certainly not inbred.

Why do percentage estimates of “ancestry” vary so much?

When looking at the results in Ancestry DNA, 23andMe, and Family Tree DNA my “East Asian” percentage is:

– 19%
– 13%
– 6%

What’s going on here? In science we often make a distinction between precision and accuracy. Precision is how much your results vary when you re-run an experiment or measurement. Basically, can you reproduce your result? Accuracy refers to how close your measurement is to the true value. A measurement can be quite precise, but consistently off. Similarly, a measurement may be imprecise, but it bounces around the true value…so it is reasonably accurate if you get enough measurements just cancel out the errors (which are random).

The values above are precise. That is, if you got re-tested on a different chip, the results aren’t going to be much different. The tests are using as input variation on 100,000 to 1 million markers, so a small proportion will give different calls than in the earlier test. But that’s not going to change the end result in most instances, even though these methods often have a stochastic element.

But what about accuracy? I am not sure that old chestnuts about accuracy apply in this case, because the percentages that these services provide are summaries and distillations of the underlying variation. The model of precision and accuracy that I learned would be more applicable to the DNA SNP array which returns calls on the variants; that is, how close are the calls of the variant to the true value (last I checked these are arrays are around 99.5% accurate in terms of matching the true state).

What you see when these services pop out a percentage for a given ancestry is the outcome of a series of conscious choices that designers of these tests made keeping in mind what they wanted to get out of these tests. At a high level here’s what’s going on:

  1. You have a model of human population history and dynamics with various parameters
  2. You have data that that varies that you put into that model
  3. You have results which come back with values which are the best fit of that data to the model you specificed

Basically you are asking the computational framework a question, and it is returning its best answer to the question posed. To ask whether the answer is accurate or not is almost not even wrong. The frameworks vary because they are constructed by humans with difference preferences and goals.

Almost, but not totally wrong. You can for example simulate populations whose histories you know, and then test the models on the data you generated. Since you already know the “truth” about the simulated data’s population structure and history, you can see how well your framework can infer what you already know from the patterns of variation in the generated data.

Going back to my results, why do my East Asian percentages vary so much? The short answer is that one of the major variables in the model alluded to above is the nature of the reference population set and the labels you give them.

Looking at Bengalis, the ethnic group I’m from, it is clear that in comparison to other South Asian populations they are East Asian shifted. That is, it seems clear I do have some East Asian ancestry. But how much?

The “simple” answer is to model my ancestry is a mix of two populations, an Indian one and an East Asian one, and then see what the values are for my ancestry across the two components. But here is where semantics becomes important: what is Indian and East Asian? Remember, these are just labels we give to groups of people who share genetic affinities. The labels aren’t “real”, the reality is in the raw read of the sequence. But humans are not capable of really getting anything from millions of raw SNPs assigned to individuals. We have to summarize and re-digest the data.

The simplest explanation for what’s going on here is that the different companies have different populations put into the boxes which are “Indian/South Asian” and “East Asian.” If you are using fundamentally different measuring sticks, then there are going to be problems with doing apples to apples comparisons.

My personal experience is that 23andMe tends to give very high percentages of South Asian ancestry for all South Asians. Because “South Asian” is a very diverse category when tests come back that someone is 95-99% South Asian…it’s not really telling you much. In contrast, some of the other services may be using a small subset of South Asians, who they define as “more typical”, and so giving lower percentages to people from Pakistan and Bengal, who have admixture from neighboring regions to the west and east respectively.*

Something similar can occur with East Asian ancestry. If the “donor” ancestral groups are South Asian and East Asian for me, then the proportions of each is going to vary by how close the donor groups selected by the company is to the true ancestral group. If, for example, Family Tree DNA chose a more Northeastern Asian population than Ancestry DNA, then my East Asian population would vary between the two services because I know my East Asian ancestry is more Southeast Asian.

The moral of the story is that the values you obtain are conditional on the choices you make, and those choices emerge from the process of reducing and distilling the raw genetic variation into a manner which is human interpretable. If the companies decided to use the same model, the would come out with the same results.

* I helped develop an earlier version of MyOrigins, and so can attest to this firsthand.

Ancient Europeans: isolated, always on the edge of extinction

A few years ago I suggested to the paleoanthropologist Chris Stringer that the first modern humans who arrived in Europe did not contribute appreciable ancestry to modern populations in the continent (appreciable as in 1% or more of the genome).* It seems I may have been right according to results from a 2016 paper, The genetic history of Ice Age Europe. The very oldest European ancient genome samples “failed to contribute appreciably to the current European gene pool.”

Why did I make this claim? Two reasons:

1) 40,000 years is a long time, and there was already substantial evidence of major population turnovers across northern Eurasia by this point. You go far enough into the future and it’s not likely that a local population leaves any descendants. So just work that logic backward.

2) There was already evidence of low population sizes and high isolation levels between groups in Pleistocene and Mesolithic/Neolithic Europe. This would again argue in favor of a high likelihood of local extinctions give enough time.

This does not only apply to just modern humans, descendants of southern, likely African, populations. Neanderthals themselves show evidence of high homogeneity, and expansions through bottlenecks over the ~600,000 years of their flourishing.

The reason that these dynamics characterized modern humans and earlier hominins in northern Eurasia is what ecologists would term an abiotic factor: the Ice Age. Obviously humans could make a go of it on the margins of the tundra (the Neanderthals seem less adept at penetrating the very coldest of terrain in comparison to their modern human successors; they likely frequented the wooded fringes, see The Humans Who Went Extinct). We have the evidence of several million years of continuous habitation by our lineage. But many of the ancient genomes from these areas, whether they be Denisovan, Neanderthal, or Mesolithic European hunter-gatherer, show indications of being characterized by very low effective population sizes. Things only change with the arrival of farming and agro-pastoralism.

For two obvious reasons we happen to have many ancient European genomes. First, many of the researchers are located in Europe, and the continent has a well developed archaeological profession which can provide well preserved samples with provenance and dates. And second, Europe is cool enough that degradation rates are going to be lower than if the climate was warmer. But if Europe, as part of northern Eurasia, is subject to peculiar exceptional demographic dynamics we need to be cautious about generalizing in terms of the inferences we make about human population genetic history. Remember that ancient Middle Eastern farmers already show evidence of having notably larger effective population sizes than European hunter-gatherers.

Two new preprints confirm the long term population dynamics typical of European hunter-gatherers, Assessing the relationship of ancient and modern populations and Genomics of Mesolithic Scandinavia reveal colonization routes and high-latitude adaptation. The first preprint is rather methods heavy, and seems more of a pathfinder toward new ways to extract more analytic juice from ancient DNA results. Those who have worked with population genomic data are probably not surprised at the emphasis on collecting numbers of individuals as opposed to single genome quality. That is, for the questions population geneticists are interested in “two samples sequenced to 0.5x coverage provide better resolution than a single sample sequenced to 2x coverage.”

I encourage readers (and “peer reviewers”) to dig into the appendix of Assessing the relationship of ancient and modern populations. I won’t pretend I have (yet). Rather, I want to highlight an interesting empirical finding when the method was applied to extant ancient genomic samples: “we found that no ancient samples represent direct ancestors of modern Europeans.”

This is not surprising. The ‘hunter-gatherer’ resurgence of the Middle Neolithic notwithstanding, Northern Europe was subject to two major population replacements, while Southern Europe was subject to one, but of a substantial nature. Recall that the Bell Beaker paper found that “spread of the Beaker Complex to Britain was mediated by migration from the continent that replaced >90% of Britain’s Neolithic gene pool within a few hundred years.” This means that less than 10% of modern Britons’ ancestry are a combination of hunter-gatherers and Neolithic farmers.

And yet if you look at various forms of model-based admixture analyses it seems as if modern Europeans have substantial dollops of hunter-gatherer ancestry (and hunter-gatherer U5 mtDNA and Y chromosomal lineage I1 and I2, associated with Pleistocene Europeans, is found at ~10% frequency in modern Europe in the aggregate; though I suspect this is a floor). What gives? Let’s look at the second preprint, which is more focused on new empirical results from ancient Scandinavian genomes, Genomics of Mesolithic Scandinavia reveal colonization routes and high-latitude adaptation. From early on in the preprint:

Based on SF12’s high-coverage and high-quality genome, we estimate the number of single nucleotide polymorphisms (SNPs) hitherto unknown (that are not recorded in dbSNP (v142)) to be c. 10,600. This is almost twice the number of unique variants (c. 6,000) per Finnish individual (Supplementary Information 3) and close to the median per European individual in the 1000 Genomes Project (23) (c. 11,400, Supplementary Information 3). At least 17% of these SNPs that are not found in modern-day individuals, were in fact common among the Mesolithic Scandinavians (seen in the low coverage data conditional on the observation in SF12), suggesting that a substantial fraction of human variation has been lost in the past 9,000 years (Supplementary Information 3). In other words, the SHGs (as well as WHGs and EHGs) have no direct descendants, or a population that show direct continuity with the Mesolithic populations (Supplementary Information 6) (13–17). Thus, many genetic variants found in Mesolithic individuals have not been carried over to modern-day groups.

The gist of the paper in terms of archaeology and demographic history is that Scandinavian hunter-gatherers were a compound population. One component of their ancestry is what we term “Western hunter-gatherers” (WHG), who descended from the late  Pleistocene Villabruna cluster (see paper mentioned earlier). Samples from Belgium, Switzerland, and Spain all belong to this cluster. The second element are “Eastern hunter-gatherers” (EHG). These samples derive from the Karelia region, to the east of modern Finland, bound by the White Sea to the north. EHG populations exhibit affinities to both WHG as well as Siberian populations who contributed ancestry to Amerindians, the “Ancestral North Eurasians” (ANE). There is a question at this point whether EHG are the product of a pulse admixture between an ANE and WHG population, or whether there was a long existent ANE-WHG east-west cline which the EHG were situated upon. That is neither here nor there (the Tartu group has a paper addressing this leaning toward isolation-by-distance from what I recall).

Explicitly testing models to the genetic data the authors conclude that there was a migration of EHG populations with a specific archaeological culture around the north fringe of Scandinavia, down the Norwegian coast. Conversely, a WHG population presumably migrated up from the south and somewhat to the east (from the Norwegian perspective).

And yet the distinctiveness of the very high quality genome as inferred from unique SNPs they have suggests to them that very little of the ancestry of modern Scandinavians (and Finns to be sure) derives from these ancient populations. Very little does not mean all. There is a lot of functional analysis in the paper and supplements which I will not discuss in this post, and one aspect is that it seems some adaptive alleles for high latitudes might persist down to the present in Nordic populations as a gift from these ancient forebears. This is no surprise, not all regions of the genome are created equal (a more extreme case is the Denisovan derived high altitude adaptation haplotype in modern Tibetans).

Nevertheless, there was a great disruption. First, the arrival of farmers whose ultimate origins were Anatolia ~6,000 years ago to the southern third of Scandinavia introduced a new element which came in force (agriculture spread over the south in a few centuries). A bit over a thousand years later the Corded Ware people, who were likely Indo-European speakers, arrived. These Indo-European speakers brought with them a substantial proportion of ancestry related to the hunter-gatherers because they descended in major fraction from the EHG (and later accrued more European hunter-gatherer ancestry from both the early farmers and likely some residual hunter-gatherer populations who switched to agro-pastoralism**).

For several years I’ve had discussions with researchers whose daily bread & butter are the ancient DNA data sets of Europe. I’ve gotten some impressions implicitly, and also from things they’ve said directly. It strikes me that the Bantu expansion may not be a bad analogy in regards to the expansion of farming in Europe (and later agro-pastoralism). Though the expanding farmers initial mixed with hunter-gatherers on the frontier, once they got a head of steam they likely replaced small hunter-gatherer groups in totality, except in areas like Scandinavia and along the maritime fringe where ecological conditions were such hunter-gatherers were at advantage (War Before Civilization seems to describe a massive farmer vs. coastal forager war on the North Sea).

But this is not the end of the story for Norden. At SMBE I saw some ancient genome analysis from Finland on a poster. Combined with ancient genomic analysis from the Baltic, along with deeper analysis of modern Finnish mtDNA, it seems likely that the expansion of Finno-Samic languages occurred on the order of ~2,000 years ago. After the initial expansion of Corded Ware agro-pastoralists.

The Sami in particular seem to have followed the same path along the northern fringe of Scandinavia that the EHG blazed. Though they herd reindeer, they were also Europe’s last indigenous hunter-gatherers. Genetically they exhibit the same minority eastern affinities in their ancestry that the Finns do, though to a greater extent. But their mtDNA harbors some distinctive lineages, which might be evidence of absorption of ancient Scandinavian substate.

I’ll leave it to someone else to explain how and why the Finns and Sami came to occupy the areas where they currently dominate (note that historically Sami were present much further south in Norway and Sweden than they are today). But note that in Latvia and Lithuania the N1c Y chromosomal lineage is very common, despite no language shift, indicating that there was a great deal of reciprocal mixing on the Baltic.

Overall the story is of both population and cultural turnover. This should not surprise when one considers that northern Eurasia is on the frontier of the human range. And perhaps it should temper the inferences we make about other areas of the world.

* You may notice that this threshold is lower than the Neanderthal admixture proportions in the non-African genome. Why is this old admixture still detectable while modern human lineages go extinct? Because it seems to have occurred with non-African humans had a very small effective population, and was mixed thoroughly. Because of the even genomic distribution this ancestry has not been lost in any of the daughter populations.

** Haplogroup I1, which descends from European late Pleistocene populations, exhibits a star phylogeny of similar time depth as R1b and R1a.