What is ancestry?

Joe Pickrell
Published in The Gencove Blog
Jan 17, 2018

This is a modified version of a blog post I initially wrote two years ago. With the increasing interest in genetic ancestry testing over these two years, the question “What is ancestry?” is worth keeping in mind.

Anyone who has used commercial genetic testing products like those offered by 23andMe, AncestryDNA, or Gencove is familiar with the idea of “genetic ancestry”. After you mail in a saliva kit, the company returns a report that tells you the percentage of your DNA that is most similar to different populations around the world.

At a superficial level, it seems like getting this estimate should be straightforward: look at someone’s genome, apply some fancy statistics, and out pop numbers like “21.5% British and Irish”, “40% Great Britain, 6% Ireland”, or “79% Northern and Central European”. (These are numbers from my own 23andMe, AncestryDNA, and Gencove reports, respectively. An astute reader might be wondering: wait, shouldn’t those numbers be the same? Hold that thought.)

Genetic ancestry results for one individual from three different companies: 23andMe, AncestryDNA, and Gencove

Once you state the problem of “ancestry inference” in more precise terms, however, you very quickly find yourself in the realm of sociology and psychology rather than statistics and genetics.

To understand why this is, it’s important to start at the beginning: what exactly is the goal of “ancestry inference” anyway?

What is “ancestry”?

A useful question for anyone working on algorithms for learning about ancestry from genetic data is: “How would you describe your ancestry?”. Try to answer the question yourself. Ask your friends. Bug some strangers on the Internet.

If the people you talk to are anything like the people I’ve talked to, the answers will generally break down into two broad categories:

  1. Many people use geographic labels to describe their ancestry, often based on current political borders. E.g. “French” or “Chinese”.
  2. Many people use ethnic labels to describe their ancestry. E.g. “Jewish” or “Caucasian”[1].

Let’s take it for granted that the “correct” definition of “ancestry” is something that aligns with these intuitive responses. This suggests that people expect a genetic “ancestry test” to predict the geographic and/or ethnic labels of their ancestors.

Unfortunately, if you sit down and try to write an algorithm to do this, you will immediately come across two daunting problems.

Problem #1: What time depth are we talking about?

Obviously we all have ancestors that lived at different times. You had maybe 8 ancestors living 100 years ago, but many thousands that lived 500 years ago. So whose geographic and/or ethnic labels should we try to guess — those of your ancestors living 100 years ago, or those living 500 years ago? (Or 1,000 years ago? Or…?).
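
To put rough numbers on this, here is a minimal sketch of how quickly the count grows. It assumes about 30 years per generation (my assumption for illustration, not a measured value) and simply doubles the number of ancestors each generation. The function name `pedigree_slots` is hypothetical, and the result is an upper bound: it counts slots in your family tree, not distinct people.

```python
def pedigree_slots(years_ago: float, years_per_generation: float = 30.0) -> int:
    """Upper bound on the number of ancestors living about `years_ago` years back.

    Every person has two parents, so the tree has 2**g slots after g generations.
    The 30-year generation time is an assumed round number.
    """
    generations = round(years_ago / years_per_generation)
    return 2 ** generations


for years in (100, 500, 1000):
    print(f"{years} years ago: up to {pedigree_slots(years):,} ancestors")
```

By 1,000 years ago the number of slots is far larger than the number of people alive at the time, so many slots must be filled by the same individuals. The count of distinct ancestors is smaller than this bound, but the point stands: the further back you look, the more people there are whose labels you would have to guess.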

A reasonable first guess is that when people talk about their ancestry they’re generally talking about recent ancestors, such that the “correct” answer to this question is something like 100 years ago. But this isn’t totally satisfying: in the United States there are many people whose ancestors immigrated to the country hundreds of years ago but who think of their ancestry as (for example) “British” or “Chinese” rather than “Michigander” or “Californian”.

So it’s not totally clear what time depth people generally think of when they think about their ancestry. Indeed, it seems plausible that the “correct” time depth to report in an ancestry test depends on a user’s…ancestry. This should be a hint that ancestry is a more complicated concept than it first appears.

Problem #2: Ancestry identifiers are influenced by social and political factors

This becomes even clearer when you notice the fundamental problem that some of the labels we think of as “ancestry” are strongly influenced (and indeed sometimes determined) by social and political factors. Obviously no genetic markers change when someone converts to Judaism, or when the territory where someone lives is annexed by a neighboring country. But these events often have dramatic influences on how the descendants of these individuals think of their ancestry, via cultural transmission of things like languages and traditions.

Indeed, construction of a shared ancestral identity was (and remains) a method for consolidation of political power over diverse cultures (see e.g. Franco in Spain). This is largely invisible to genetics, except after hundreds or thousands of years (if shared identities influence subsequent marriage and/or migration patterns).

A solution

To get around all of these problems, what you would ideally like to have is a detailed list of your ancestors at different time depths, each labeled with their geographic location and any ethnic self-identifiers. You could then say, for example, that 100 years ago 25% of your ancestors lived in Illinois and identified as Jewish, while 500 years ago 5% of your ancestors lived in present-day Andalucia and identified as Muslim [2].
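
As a rough sketch of what that ideal list could look like as data, the snippet below stores one record per ancestor and summarizes the labels at a chosen time depth. Everything here is hypothetical and for illustration only: the `Ancestor` class, the `label_shares` function, and the toy records (including the "Elsewhere" and "unrecorded" placeholders) are made up, and no company's algorithm works from data like this.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ancestor:
    years_ago: int   # time depth at which this ancestor was alive
    location: str    # where they lived
    identity: str    # how they described themselves


def label_shares(ancestors, years_ago):
    """Fraction of ancestors at one time depth carrying each (location, identity) label."""
    cohort = [a for a in ancestors if a.years_ago == years_ago]
    if not cohort:
        return {}
    counts = Counter((a.location, a.identity) for a in cohort)
    return {label: n / len(cohort) for label, n in counts.items()}


# Toy data: 8 ancestors alive 100 years ago, 2 of them in Illinois.
ancestors = (
    [Ancestor(100, "Illinois", "Jewish")] * 2
    + [Ancestor(100, "Elsewhere", "unrecorded")] * 6
)

shares = label_shares(ancestors, years_ago=100)
print(shares[("Illinois", "Jewish")])
```

Note that the summary is always tied to a specific time depth. Ask about 500 years ago instead of 100 and you are summarizing a different, much larger set of people, which is exactly why the two problems above matter.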

Unfortunately, obtaining much of this information from genetic data is currently impractical, so we’re going to have to compromise with some dramatic approximations [3]. Specifically, the approach taken by all of the commercial companies is to try to estimate the general geographic regions where your ancestors lived (and in a small number of select cases their ethnic identifiers) at some indeterminate time in the past, probably something like a few hundred years ago.

Does this all sound a bit vague? It should, because it is. There’s plenty of wiggle room in the definition of “general geographic regions” and “some indeterminate time in the past” to allow for very different interpretations [4].

But the key is this: if we replace the currently-impossible goal of perfectly understanding the geography and ethnicity of your ancestors with the realistic goal of getting a general understanding about some of them, we can now make some progress. This might seem a bit disappointing, in that we’ve abandoned the exactness and objectivity that seem promised by a “genetic test”. But there are two reasons to be optimistic:

  1. In many cases an approximate understanding can already be quite meaningful. Millions of people around the world have purchased these tests. Some have uncovered aspects of their family history that were kept secret (indeed I’m one of them). Some have discovered hospital mixups that led to puzzling mismatches between their cultural and genetic ancestries. Still others have confronted the genetic legacy of slavery in their own genomes. This type of information can be extremely powerful.
  2. The more people who participate, the better we get. As genetic datasets get larger and larger, new statistical ways of studying ancestry become possible. At Gencove we’ve been able to update our algorithms multiple times over recent months to provide more detailed analyses. The simple reason is that, as with most machine learning algorithms, more training data means better performance: with more data we can identify the rare variants and combinations of variants that are most predictive of your ancestors’ locations.

If you want to help work on the next generation of ancestry inference algorithms, get in touch!

References:

[1] Though the Caucasus is a geographic region, the word “Caucasian” is used in the United States as an ethnic identifier approximately synonymous with “white”.

[2] You might also be interested in whether you actually inherited any genetic material from each ancestor, but let’s avoid opening that can of worms for now and assume the properties of your genealogical ancestors are the same as those of your genetic ancestors.

[3] In the older version of this post, I wrote that this was in fact impossible. I’m now convinced I was wrong about this, and that this is an extremely challenging problem rather than an impossible one.

[4] Note the different ancestry proportions reported to me by 23andMe, AncestryDNA, and Gencove. Most people think of these differences as different algorithmic solutions to the same question, but it’s entirely possible that the algorithms used by the different companies are answering slightly different questions! For example, it may be possible that the 23andMe algorithm is looking at slightly more recent ancestry on average than the AncestryDNA algorithm (I actually think this is indeed the case, for what it’s worth). On this general topic it’s worth reading this great post by Debbie Kennett comparing ancestry composition results across companies.
