Simulate Admixed Populations with bnpsd

Alejandro Ochoa and John D. Storey

2019-05-15

\[ \newcommand{\Fst}{F_{\text{ST}}} \newcommand{\ft}[1][j]{f_{#1}^T} \]

Introduction

The bnpsd package simulates the genotypes of an admixed population. In the PSD model (Pritchard, Stephens, and Donnelly 2000), admixed individuals draw their alleles with individual-specific probabilities (admixture proportions) from \(K\) intermediate subpopulations. We impose the BN model (Balding and Nichols 1995) to the intermediate subpopulation allele frequency, which thus evolve independently with subpopulation-specific inbreeding coefficients (\(\Fst\) values) from a common ancestral population \(T\). The kinship coefficients and generalized \(\Fst\) of the admixed individuals were derived in recent work (Ochoa and Storey 2016a). A simulated admixed population generated using this package was used to benchmark kinship and \(\Fst\) estimators in the accompanying paper (Ochoa and Storey 2016b). Here we briefly summarize the notation and intuition behind the key parameters (see (Ochoa and Storey 2016a) for precise definitions).

There are conceptually two parts of the BN-PSD model, one that defines relatedness between individuals (under the admixture framework), while the other simulates allele frequencies and genotypes conditional on the specified parameters of relatedness/admixture.

The BN-PSD population structure

The population structure determines how individuals are related to each other. The key parameters are the coancestry coefficients of the intermediate subpopulations and the admixture proportions of each individual for each subpopulation (\(q_{ju}\)), which are treated as fixed variables.

The bnpsd functions that model population structure (not drawing allele frequencies or genotypes) admit arbitrary coancestry matrices for the intermediate subpopulations. However, the code that draws allele frequencies and genotypes requires these subpopulations to be independent, which means that the coancestry matrix must have zero values off-diagonal. In this setting—the only case we’ll considered in this vignette—the diagonal values of the coancestry matrix of the intermediate subpopulations are inbreeding coefficients (\(\ft[S_u]\) below), which correspond to subpopulation-specific \(\Fst\) values.

Each intermediate subpopulation \(S_u\) (\(u \in \{1, ..., K\}\)) evolved independently from a shared ancestral population \(T\) with an inbreeding coefficient denoted by \(\ft[S_u]\). Note that \(T\) is a superscript, and does not stand for exponentiation or matrix transposition. Each admixed individual \(j \in \{1, ..., n\}\) draws each allele from \(S_u\) with probability given by the admixture proportion \(q_{ju}\) (\(\sum_{u=1}^K q_{ju} = 1 \forall j\)). In this case the coancestry coefficients \(\theta^T_{jk}\) between individuals \(j,k\) (including \(j=k\) case) and the \(\Fst\) of the admixed individuals are given by: \[ \theta^T_{jk} = \sum_{u=1}^K q_{ju} q_{ku} \ft[S_u], \quad \quad \Fst = \sum_{j=1}^n \sum_{u=1}^K w_j q_{ju}^2 \ft[S_u], \] where \(0 < w_j < 1, \sum_{j=1}^n w_j = 1\) are user-defined weights for individuals (default \(w_j=\frac{1}{n} \forall j\)). Note \(\theta^T_{jk}\) equals the kinship coefficient for \(j \ne k\) and the inbreeding coefficient for \(j=k\).

The bias coefficient \(s\) is defined by \[ s = \frac{\bar{\theta}^T}{\Fst} \] where \(\bar{\theta}^T = \sum_{j=1}^n \sum_{k=1}^n w_j w_k \theta_{jk}^T\). This \(0 < s \le 1\) approximates the proportional bias of \(\Fst\) estimators that assume independent subpopulations, and one bnpsd function below fits its parameters to yield a desired \(s\).

Random allele frequencies and genotypes

This section details the distributions of the allele frequencies and genotypes of the various populations or individuals of the BN-PSD model.

Every biallelic locus \(i\) in the ancestral population \(T\) has an ancestral reference allele frequency denoted by \(p_i^T\). By default the bnpsd code draws \[ p_i^T \sim \text{Uniform}(a, b) \] with \(a=0.01, b=0.5\), but the code accepts \(p_i^T\) from arbitrary distributions (see below).

The distribution of the allele frequency at locus \(i\) in subpopulation \(S_u\), denoted by \(p_i^{S_u}\), is the BN distribution: \[ p_i^{S_u} | T \sim \text{Beta} \left( \nu_s p_i^T, \nu_s \left( 1-p_i^T \right) \right), \] where \(\nu_s = \frac{1}{\ft[S_u]}-1\). Allele frequencies for different loci and different subpopulations (\(S_u,S_v, u \ne v\)) are drawn independently.

Each admixed individual \(j\) at each locus \(i\) draws alleles from a mixture of Bernoulli distributions from each intermediate subpopulation, which is mathematically equivalent to assigning what we call individual-specific allele frequencies \(\pi_{ij}\) constructed as: \[ \pi_{ij} = \sum_{u=1}^K p_i^{S_u} q_{ju}. \] The unphased genotype \(x_{ij}\) (encoded to count the number of reference alleles) is drawn as: \[ x_{ij}|\pi_{ij} \sim \text{Binomial}(2, \pi_{ij}). \]

Simulation examples

Population structure: 1D geography

Let’s generate the same population structure used in the simulation of (Ochoa and Storey 2016b).

Draw random allele frequencies and genotypes

Now let’s draw all the random allele frequencies and genotypes from the population structure. The easiest way is to use draw_all_admix:

Lastly, let’s verify that the correlation structure of the genotypes matches the theoretical coancestry matrix we constructed earlier. For this we use the popkin function of the package with the same name.

Customizing the allele frequency and genotype pipeline

The random variables generated by draw_all_admix above can also be generated separately using the following functions (where \(p\) is the usual variable symbol for allele frequencies):

These functions are provided for greater flexibility (examples follow further below). However, the joint function draw_all_admix has the advantage of removing fixed loci (loci that were randomly set to 0 or 2 for all individuals, despite non-zero allele frequencies), which are re-drawn from scratch (starting from the ancestral allele frequencies).

Here is the step-by-step procedure for drawing AFs and genotypes in the default draw_all_admix (except for the re-drawing of fixed loci):

We provide functions for these separate steps to allow for more flexibility. For example, the ancestral allele frequencies could be drawn from a Symmetric Beta instead of using draw_p_anc:

You could also draw genotypes from the ancestral population or the intermediate populations:

Low-memory genotype simulation algorithm

If you desire to simulate a very large number of individuals (\(n\)) and loci (\(m\)), you might run out of memory while running draw_all_admix. Memory consumption is reduced by passing the low_mem = TRUE option to draw_all_admix, which draws the genotypes X directly from p_subpops and admix_proportions, without storing the entire intermediate matrix p_ind in memory at once. However, this algorithm can be much slower than the default one (low_mem = FALSE).

The same option exists for draw_genotypes_admix, which instead of accepting the p_ind matrix as input requires both p_subpops and admix_proportions as above:

Additional population structures

Here we show examples for functions that create admixture matrices for various simple population structures. The admixture scenarios implemented in bnpsd are generated by these functions:

The main example (above) was for admix_prop_1d_linear, the rest follow.

Circular 1D geography

This is a twist on the earlier 1D geography where subpopulations and individuals are placed on a circumference, so random walks wrap around and the appropriate density is the Von Misses distribution.

Let’s generate an analogous population structure to the original “linear” example.

Independent subpopulations

The independent subpopulations model, where individuals are actually unadmixed, is the most trivial form of the BN-PSD admixture model.

# define population structure
# we'll have k_subpops = 3, each subpopulation with these sizes:
n1 <- 100
n2 <- 50
n3 <- 20

# here's the labels (for simplicity, list all individuals of S1 first, then S2, then S3)
labs <- c(
    rep.int('S1', n1),
    rep.int('S2', n2),
    rep.int('S3', n3)
)
# data dimensions infered from labs:

# number of individuals "n_ind"
length(labs)
#> [1] 170

# number of subpopulations "k_subpops"
k_subpops <- length(unique(labs))
k_subpops
#> [1] 3

# desired admixture matrix
admix_proportions <- admix_prop_indep_subpops(labs)

# got a numeric matrix with a single 1 value per row
# (denoting the sole subpopulation from which each individual draws its ancestry)
head(admix_proportions, 2)
#>      S1 S2 S3
#> [1,]  1  0  0
#> [2,]  1  0  0

# construct the intermediate subpopulation FST vector
# the desired final FST
Fst <- 0.2
# subpopulation FST vector, unnormalized so far
inbr_subpops <- 1 : k_subpops
# normalized to have the desired FST
# NOTE fst is a function in the `popkin` package
inbr_subpops <- inbr_subpops / fst(inbr_subpops) * Fst
# verify FST for the intermediate subpopulations
fst(inbr_subpops)
#> [1] 0.2

# get coancestry of the admixed individuals
coancestry <- coanc_admix(admix_proportions, inbr_subpops)
# before getting FST for individuals, weigh then inversely proportional to subpop sizes
weights <- weights_subpops(labs) # function from `popkin` package
# verify FST for individuals (same as for intermediate subpops for this pop structure)
fst_admix(admix_proportions, inbr_subpops, weights)
#> [1] 0.2

References

Balding, D. J., and R. A. Nichols. 1995. “A Method for Quantifying Differentiation Between Populations at Multi-Allelic Loci and Its Implications for Investigating Identity and Paternity.” Genetica 96 (1-2): 3–12.

Ochoa, Alejandro, and John D. Storey. 2016a. “\(F_{\text{ST}}\) And Kinship for Arbitrary Population Structures I: Generalized Definitions.” bioRxiv doi:10.1101/083915. https://doi.org/10.1101/083915.

———. 2016b. “\(F_{\text{ST}}\) And Kinship for Arbitrary Population Structures II: Method of Moments Estimators.” bioRxiv doi:10.1101/083923. https://doi.org/10.1101/083923.

Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.” Genetics 155 (2): 945–59.