Release the UK Biobank! (the prediction of height edition)

There’s so much science coming out of the UK Biobank it’s not even funny. It’s like getting the palantír or something.

Anyway, a preprint, submitted for your approval. A vision of things to come? Accurate Genomic Prediction Of Human Height:

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ~0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the “missing heritability” problem – i.e., the gap between prediction R-squared and SNP heritability. The ~20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.

A scatter-plot is worth a thousand derivations.

You know what better than 500,000 samples? One billion samples! A nerd can dream….

After agriculture, before bronze


The above plot shows genetic distance/variation between highland and lowland populations in Papa New Guinea (PNG). It is from a paper in Science that I have been anticipating for a few months (I talked to the first author at SMBE), A Neolithic expansion, but strong genetic structure, in the independent history of New Guinea.

What does “strong genetic structure” mean? Basically Fst is showing the proportion of genetic variation which is partitioned between groups. Intuitively it is easy to understand, in that if ~1% of the genetic variation is partitioned between groups in one case, and ~10% in another, then it is reasonable to suppose that the genetic distance between groups in the second case is larger than in the first case. On a continental scale Fst between populations is often on the order of ~0.10. That is the value for example when you pool the variation amongst Northern Europeans and Chinese, and assess how much of it can be apportioned in a manner which differentiates populations (so it’s about ~10% of the variation).

This is why ancient DNA results which reported that Mesolithic hunter-gatherers and Neolithic farmers in Central Europe who coexisted in rough proximity for thousands of years exhibited differences on the order of ~0.10 elicited surprise. These are values we are now expecting from continental-scale comparisons. Perhaps an appropriate analogy might be the coexistence of Pygmy groups and Bantu agriculturalists? Though there is some gene flow, the two populations exist in symbiosis and exhibit local ecological segregation.

In PNG continental scale Fst values are also seen among indigenous people. The differences between the peoples who live in the highlands and lowlands of PNG are equivalent to those between huge regions of Eurasia. This is not entirely surprising because there has been non-trivial gene flow into lowland populations from Austronesian groups, such as the Lapita culture. Many lowland groups even speak Austronesian languages today.

Using standard ADMIXTURE analysis the paper shows that many lowland groups have significant East Asian ancestry (red), while none of the highland groups do (some individuals with East Asian admixture seem to be due to very recent gene flow). But even within the highlands the genetic differences are striking. The  Fst values between Finns and Southern European groups such as Spaniards are very high in a European context (due to Finnish Siberian ancestry as well as drift through a bottleneck), but most comparisons within the highland groups in PNG still exceeds this.

The paper also argues that genetic differences between Papuans and the natives of Australia pre-date the rising sea levels at the beginning of the Holocene, when Sahul divided between its various constituents. This is not entirely surprising considering that the ecology of the highlands during the Pleistocene would have been considerably different from Australia to the south, resulting in sharp differences in the hunter-gatherer lifestyles. Additionally, there does not seem to have been a genetic cline. Papuans are symmetrically related to all Australian groups they had samples from.

Using coalescence-based genomic methods they inferred that separation between highlands and some lowland groups occurred ~10-20,000 years ago. That is, after the Last Glacial Maximum. For the highlands, the differences seem to date to within the last 10,000 years. The Holocene. Additionally, they see population increases in the highlands, correlating with the shift to agriculture (cultivation of taro).

None of the above is entirely surprising, though I would take the date inferences with a grain of salt. The key is to observe that large genetic differences, as well as cultural differences, accrued in the highlands of PNG during the Holocene. In the paper they have a social and cultural explanation for what’s going on:

  Fst values in PNG fall between those of hunter-gatherers and present-day populations of west Eurasia, suggesting that a transition to cultivation alone does not necessarily lead to genetic homogenization.

A key difference might be that PNG had no Bronze Age, which in west Eurasia was driven by an expansion of herders and led to massive population replacement, admixture, and cultural and linguistic change (7, 8), or Iron Age such as that linked to the expansion of Bantu-speaking
farmers in Africa (24). Such cultural events have resulted in rapid Y-chromosome lineage expansions due to increased male reproductive variance (25), but we consistently find no evidence for this in PNG (fig. S13). Thus, in PNG, wemay be seeing the genetic, linguistic, and cultural diversity that sedentary human societies can achieve in the absence of massive technology-driven expansions.

Peter Turchin in books like Ultrasociety has aruged that one of the theses in Steven Pinker’s The Better Angels of Our Nature is incorrect: that violence has not decreased monotonically, but peaked in less complex agricultural societies. PNG is clearly a case of this, as endemic warfare was a feature of highland societies when they encountered Europeans. Lawrence Keeley’s War Before Civilization: The Myth of the Peaceful Savage gives so much attention to highland PNG because it is a contemporary illustration of a Neolithic society which until recently had not developed state-level institutions.

What papers like these are showing is that cultural and anthropological dynamics strongly shape the nature of genetic variation among humans. Simple models which assume as a null hypothesis that gene flow occurs through diffusion processes across a landscape where only geographic obstacles are relevant simply do not capture enough of the dynamic. Human cultures strongly shape the nature of interactions, and therefore the genetic variation we see around us.

SLC24A5 is very important, but we don’t know why

The golden of pigmentation genetics started in 2005 with SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Prior to that pigmentation genetics was really to a great extent coat color genetics, done in mice and other organisms which have a lot of pelage variation.

Of course there was work on humans, mostly related to melanocortin 1. But more interesting were classical pedigree studies which indicated that the number of loci controlling variation in pigmentation was not that high. This, it was a mildly polygenic trait insofar as some large effect quantitative trait loci could be discerned in the inheritance patterns.

From The Genetics of Human Populations, written in the 1960s, but still useful today because of its comprehensive survey of the classical period:

Depending on what study samples you use variance on a locus of SLC24A5 explains less than 10% or more than 30% of the total variance. But it is probably the biggest effect locus on the whole in human populations when you pool them altogether (obviously it explains little variance in Africans or eastern non-Africans since it is homozygous ancestral by and large in both groups).

One aspect of the derived SNP in this locus is that it seems to be under strong selection. In a European 1000 Genomes sample there are 1003 SNPs of the derived variant, and 3 of the ancestral. Curiously this allele was absent in Western European Mesolithic European hunter-gatherers, though it was present in hunter-gatherers on the northern and eastern fringes of the continent. It was also present in Caucasian hunter-gatherers and farmers from the Middle East who migrated to Europe. It seems very likely that these sorts of high frequencies are due to selection in Europe.

The variant is also present in appreciably frequencies in many South Asian populations, and there seems to have been in situ selection there too, as well as the Near East. In Ethiopia it also seems to be under selection.

It could be something due to radiation…but the Near East and South Asia are quite high intensity in that regard. As are the highlands of Ethiopia. About seven years ago I suggested that rather that UV radiation as such the depigmentation that has occurred across the Holocene might be due to agriculture and changes in diet.

But a new result from southern Africa presented at the SMBE meeting this year suggests that this can not be a comprehensive answer. Meng Lin in Brenna Henn’s lab uses a broad panel of KhoeSan populations to find that the derived allele on SLC24A5 reaches ~40% frequency. Probably a high fraction of West Eurasian admixture in these groups is around ~10% being generous. Where did this allele come from? The results from Joe Pickrell a few years back are sufficient to explain: there was a movement of pastoralists with distant West Eurasian ancestry who brought cattle to southern Africa, and so resulted in the ethnogenesis of groups such as the Nama people (there is also Y chromosomal work by Henn on this).

Sad human with two derived alleles of SNP of interest

Lin reports that the haplotype around SLC24A5 is the same one as in Western Eurasia. Iain Mathieson (who is now at Penn if anyone is looking for something to do in grad school or a post-doc) has told me that the haplotype in the Motala Mesolithic hunter-gatherers and in the hunter-gatherers from the Caucasus are the same. It seems that this haplotype was widespread early in the Holocene. Curiously, the Motala hunter-gatherers also carry the East Asian haplotype around their derived EDAR variant.

I don’t know what to make of this. My intuition is that if a haplotype like this is so widespread nearly ~10,000 years ago recombination would have broken it apart into smaller pieces so that haplotype structure would be easier to discern. As it is that doesn’t seem to be the case.

And we also don’t know what’s going on withSLC24A5. Obviously it impacts skin color. It has been shown to do so in admixed populations. But it is hard to believe that that is the sole target of natural selection here.

The Bronze age demographic transformation of Britain

In Norman Davies’ the excellent The Isles: A History, he mentions offhand that unlike the Irish the British to a great extent have forgotten their own mythology. This is one reason that J. R. R. Tolkien created Middle Earth, they gave the Anglo-Saxons the same sort of mythos that the Irish and Norse had.

But to some extent I think we can update our assessments. Science is bringing myth to life. The legendary “Bell Beaker paper” is now available in preprint form, The Beaker Phenomenon And The Genomic Transformation Of Northwest Europe. The methods are not too abstruse if you have read earlier works on this vein (i.e., no Nick Patterson authored methodological supplement that I saw). And the results are straightforward.

And what are those results?

First, the Bell Beaker phenomenon was both cultural and demographic. Cultural in that it began in the Iberian peninsula, and was transmitted to Central Europe, without much gene flow from what they can see. Demographic in that its push west into what is today the Low Countries and France and the British Isles was accompanied by massive gene flow.

In their British samples they conclude that 90% of the ancestry of early Bronze Age populations derive from migrants from Central Europe with some steppe-like ancestry. In over words, in a few hundred years there was a 90% turnover of ancestry. The preponderance of the male European R1b lineage also dates to this period. It went from ~0% to ~75-90% in Britain over a few hundred years.

If most of the genetic-demographic character of modern Britain was established during the Bronze Age*, then there has been significant selection since the Bronze Age. The figure to the left shows ancient (Neolithic/Bronze age) frequencies of selected SNPs, with modern frequencies in the British in dashed read. The top-left SNP is for HERC2-OCA2, the region related to brown vs. blue eye color, and also associated with some more general depigmentation. The top-right SNP is in SLC45A2, the second largest effect skin color locus in Europeans. The bottom SNP is for a mutation on LCT, which allows for the digestion of milk sugar as adults.

The vast majority of the allele frequency change in Britons for digestion of milk sugar post-dates the demographic turnover. In other words, the modern allele frequency is a function of post-Bronze Age selection. This is not surprising, as it supports the result in Eight thousand years of natural selection.

1000 Genomes derived SLC45A2 SNP frequency

At least as interesting are the pigmentation loci. The fact that the derived frequency in HERC2-OCA2 is lower in both British and Central European Beaker people samples indicates that the lower proportion is not an artifact of sampling. Britons have gotten more blue-eyed over the last 4,000 years. Second, SLC45A2 is at shocking low proportions for modern European populations.

HGDP derived SLC45A2 SNP frequency

In the 1000 Genomes the 4% ancestral allele frequency is almost certainly a function of the Siberian (non-European) ancestry. In modern Iberians the ancestral frequency is 18% (and it is even higher in Sardinians last I checked), but in Tuscans it is ~2%. Though not diagnostic of Europeans in the way the derived SNP at SLC24A2 is, SLC452 derived variants are much more constrained to Europe. Individuals who are homozygote ancestral for SNPs atSLC45A2 rare in modern Northern Europeans (pretty much nonexistent actually). But even as late as the Bronze Age they would have been present at low but appreciable frequencies.

This particular result convinces me that the method in Field et al. which detected lots of recent (last 2,000 years) selection on pigmentation in British populations is not just a statistical artifact. Though these papers are solving much of European prehistory, they are also going to be essential windows into the trajectory of natural selection in human populations over the last 5,000 years.

* In the context of this paper the Anglo-Saxon migrations tackled by the PoBI paper are minor affairs because the two populations were already genetically rather close. Additionally, the PoBI paper found that the German migrations were significant demographic events, but most of the ancestry across Britain does date to the previous period.