Visualizing intra-European phylogenetic distances

Neighbor-joining tree of genetic distances between populations


In L. L. Cavalli-Sforza’s work, he used between-population genetic distances, as measured by FST values, to generate a series of visualizations, which then allowed him to infer historical processes. Basically, the way it works is that you look at genetic variation and see how much of it can be allocated to differences between groups. If none of it can be, then in a population genetic sense it doesn’t make much sense to speak of distinctive groups; they’re basically one breeding population. The higher the FST statistic, the more of the variation is partitioned between the groups.
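To make the “how much of the variation is between groups” idea concrete, here is a minimal sketch for a single biallelic locus. It assumes equally weighted populations and the textbook heterozygosity form FST = (HT - HS) / HT; the function name and the example frequencies are made up for illustration, and real estimators additionally correct for sample size and combine many loci.

```python
def fst_biallelic(freqs):
    """FST at one biallelic locus from per-population allele frequencies.

    Assumes equally weighted populations and uses FST = (HT - HS) / HT,
    where HT is the expected heterozygosity of the pooled population and
    HS is the mean expected heterozygosity within populations.
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # The locus is fixed everywhere: there is no variation to partition.
        return 0.0
    h_within = sum(2 * p * (1 - p) for p in freqs) / n
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies, for illustration only.
print(fst_biallelic([0.5, 0.5]))  # identical frequencies: nothing is between groups
print(fst_biallelic([0.4, 0.6]))  # mild differentiation
print(fst_biallelic([0.0, 1.0]))  # fixed differences: everything is between groups
```

Identical frequencies give 0, fixed differences give 1, and anything in between measures how much of the total heterozygosity disappears once you look within groups.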

Roughly, FST is used as a proxy for genetic distance as well as evolutionary divergence. The longer two populations have been separated, the more genetic differences they’ll accumulate, inflating the FST value. There are a lot of subtleties that I’m eliding here (see “Estimating and interpreting FST: the impact of rare variants” for a survey of the recent literature on the topic and pathways forward), but for a long time, FST was the go-to statistic for making phylogenetic inferences on a within-species scale.

Today we have other techniques: Structure, Treemix, fineStructure, and various local ancestry packages.

But FST is still useful to give one a Gestalt sense of population genetic differences. Cavalli-Sforza admits that European populations had very low pairwise FST, but because of the importance of Europe for sociocultural reasons a detailed analysis of the region was still provided in the text. Additionally, the authors had lots of European samples (non-European Caucasoids were thrown into one category for macro-group comparisons because there weren’t that many samples).

Using results from the 2015 paper “Massive migration from the steppe was a source for Indo-European languages in Europe”, I visualized pairwise genetic distances for European populations, ancient and modern (with Han Chinese as an outgroup), on a tree. The results illustrate two things:

  1. Ancient populations in Europe were very distinct from modern ones.
  2. Many modern groups are clustered close together.
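
For readers curious about how a table of pairwise distances becomes a tree, here is a minimal neighbor-joining sketch in pure Python. It is not the code used for the figure; the function name is hypothetical and the small distance matrix is invented (it is not FST data from the paper). With real, non-additive distances the algorithm can also produce negative branch lengths, which this sketch does not handle specially.

```python
def neighbor_joining(names, dist):
    """Return a Newick string for the neighbor-joining tree of a distance matrix."""
    # d maps each active node's Newick label to its distances to the other nodes.
    d = {a: {b: dist[i][j] for j, b in enumerate(names) if b != a}
         for i, a in enumerate(names)}
    while len(d) > 2:
        n = len(d)
        r = {a: sum(d[a].values()) for a in d}  # total distance from each node
        nodes = list(d)
        # Join the pair that minimizes Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        dab = d[a][b]
        len_a = 0.5 * dab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        # Distances from the new internal node to every remaining node.
        new_row = {k: 0.5 * (d[a][k] + d[b][k] - dab)
                   for k in d if k not in (a, b)}
        del d[a], d[b]
        for k in d:
            del d[k][a], d[k][b]
            d[k][new] = new_row[k]
        d[new] = new_row
    a, b = d
    # The tree is unrooted; the last edge is attached to one side arbitrarily.
    return f"({a},{b}:{d[a][b]:.4f});"


# Invented distances between five hypothetical populations A to E.
TOY_NAMES = ["A", "B", "C", "D", "E"]
TOY_DIST = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(TOY_NAMES, TOY_DIST))
```

The output is a Newick string, which most tree viewers can draw directly.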

The bulk of the population genetic structure in modern Europe seems to have been established in the period between 3000 BCE and 2000 BCE. This is not that much time for a lot of distinctiveness to develop, especially on the geographically open North European plain. I suspect with more and more Mesolithic and early to middle Neolithic DNA we’ll see that some of the modern population structure is a ghost of ancient substrate absorption.

Many of the ethno-national categories that are very significant in recent history, and impact the cultural memories of modern people and their genealogies, have very shallow roots. This does not mean they are not “real” (I don’t know what that’s supposed to mean at all), just that many of the identities which seem so salient to us today may be relatively recent in terms of their significance to large groups of humans….

2 thoughts on “Visualizing intra-European phylogenetic distances”

  1. Two surprises:

    1. More distance between Romance + Greek v. others than expected.

    2. I would have expected Sardinian to cluster with LBK without Yamnaya/Corded Ware between them.

  2. I think the placing of Early Neolithic together with Yamnaya may be an artefact of using a method that treats each population as a dimension, then calculates the tree based on absolute Euclidean distance. That places Yamnaya and Early Neolithic relatively close together because they have a similar “position” on the “French”, etc. dimensions, both being quite distant from modern European populations. That is, A:

    If the neighbour joining algorithm instead tries to fit the Fst distances provided, rather than calculating a Euclidean distance from them as dimensions, then you get a neighbour joining tree which has a more logical fit. B:

    For instance, B places Han closest to its nearest neighbour in the distance matrix (the population with the lowest Fst), the Russians, not the WHG. A places Han near WHG, even though that’s the most distant pair in the set, because both WHG and Han have a relatively high distance from present-day Europeans (who make up most “dimensions” in that way of calculating the tree). Sardinian clusters with early Europeans as you’d expect, etc.

    Another example uses a matrix of purely present-day West Eurasian Fst scores. Treating each column of the Fst matrix as a dimension on which to calculate Euclidean distance generates A:, while treating the matrix as distances generates B: Note how A embeds Sardinians and Basques on the South Central Asian tree (due to sharing relatively similar measures on European “dimensions”), while B places them close to their closest relatives by Fst distance, Spanish.

    (I’ve used distances which David at Eurogenes has calculated using recent datasets from the Reich lab. I think there may be some funny stuff going on at the margins with the table from the 2015 paper. Newer sets are also available in Lazaridis’s 2016 paper.)
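
The distinction drawn in the comment above, between treating each column of the Fst matrix as a dimension and treating the matrix entries as the distances themselves, can be sketched in a few lines. This is only an illustration under stated assumptions: the Fst values below are invented (they are not taken from the 2015 paper or from Eurogenes), the helper names are hypothetical, and the sketch compares nearest neighbours rather than building a full tree.

```python
import math

# Invented pairwise Fst values, for illustration only.
POPS = ["Han", "WHG", "Russian", "French", "Spanish"]
TOY_FST = [
    [0.00, 0.14, 0.10, 0.11, 0.11],    # Han
    [0.14, 0.00, 0.09, 0.10, 0.10],    # WHG
    [0.10, 0.09, 0.00, 0.01, 0.01],    # Russian
    [0.11, 0.10, 0.01, 0.00, 0.005],   # French
    [0.11, 0.10, 0.01, 0.005, 0.00],   # Spanish
]


def euclidean_from_profiles(matrix):
    """Treat each row as a coordinate vector; return pairwise Euclidean distances."""
    n = len(matrix)
    return [
        [math.sqrt(sum((x - y) ** 2 for x, y in zip(matrix[i], matrix[j])))
         for j in range(n)]
        for i in range(n)
    ]


def nearest(matrix, names, target):
    """Name of the population with the smallest distance to `target`."""
    i = names.index(target)
    others = [j for j in range(len(names)) if j != i]
    return names[min(others, key=lambda j: matrix[i][j])]


# Reading the matrix directly as distances.
print("direct Fst:", nearest(TOY_FST, POPS, "Han"))
# Reading each column as a dimension and taking Euclidean distance.
print("profile distance:", nearest(euclidean_from_profiles(TOY_FST), POPS, "Han"))
```

With these toy numbers, Han’s nearest neighbour is Russian when the matrix is read directly, but WHG when rows are compared as profiles, even though Han-WHG is the largest entry in the matrix: both are similarly far from the modern European “dimensions”, so their profiles look alike.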
