Cophylogenetic simulation

Wade Dismukes

2020-09-30

This vignette summarizes the workflow for simulating a host and symbiont tree using the sim_cophylo_bdp function.

Introduction

Figuring out parameters and what they mean

This model has a number of parameters that are input into the sim_cophylo_bdp function. Let us assume that we just want to simulate to a time of 2.0 units. We can then use calculate_expected_leaves_sptree and calculate_expected_leaves_locustree to figure out the average number of extant tips that will be on our host and symbiont trees respectively.


library(treeducken)
#> Loading required package: ape
set.seed(42)

exp_tips_host <- calculate_expected_leaves_sptree(1.0, 0.5, 2.0)
# on average we will get about 5 tips
# maybe we want something more like 8, so let's decrease the extinction rate
exp_tips_host <- calculate_expected_leaves_sptree(1.0, 0.3, 2.0)
# that looks about right, let's assume there are no host speciations that are
# not cospeciations
h_lambda <- 0.0
h_mu <- 0.3
c_lambda <- 1.0
# now we need to worry about the symbiont tree since only cospeciations are
# occurring
# we can use the same math as for the locus tree simulator
exp_tips_symb <- calculate_expected_leaves_locustree(t = 2.0,
                                                     dup_rate = 0.2,
                                                     loss_rate = 0.1, 8)
# a little less than 10 symbiont tips on average
s_lambda <- 0.2
s_mu <- 0.1
# let's assume that when symbionts speciate that they always inherit their
# ancestor's host repertoire
s_her <- 0.0

Now we have all the necessary parameters set with some idea of what they mean.

Simulating a set of host and symbiont pairs

We will now use the main function, sim_cophylo_bdp with the parameters we set in the previous section. Let’s simulate ten sets and then we can print out a summary for each set.

# simulate 10 paired trees with our set parameters.
host_symb_sets <- sim_cophylo_bdp(hbr = h_lambda,
                hdr = h_mu,
                sbr = s_lambda,
                cosp_rate = c_lambda,
                sdr = s_mu,
                host_exp_rate = s_her,
                time_to_sim = 2.0,
                numbsim = 10)
print(host_symb_sets, details = TRUE)
#> 10 cophylogenetic sets.
#> Cophylogenetic set 1 :
#>  Symbiont tree has  20  tips.
#>  Host tree has  18  tips.
#> Cophylogenetic set 2 :
#>  Symbiont tree has  16  tips.
#>  Host tree has  14  tips.
#> Cophylogenetic set 3 :
#>  Symbiont tree has  3  tips.
#>  Host tree has  3  tips.
#> Cophylogenetic set 4 :
#>  Symbiont tree has  15  tips.
#>  Host tree has  12  tips.
#> Cophylogenetic set 5 :
#>  Symbiont tree has  8  tips.
#>  Host tree has  6  tips.
#> Cophylogenetic set 6 :
#>  Symbiont tree has  2  tips.
#>  Host tree has  2  tips.
#> Cophylogenetic set 7 :
#>  Symbiont tree has  4  tips.
#>  Host tree has  3  tips.
#> Cophylogenetic set 8 :
#>  Symbiont tree has  14  tips.
#>  Host tree has  11  tips.
#> Cophylogenetic set 9 :
#>  Symbiont tree has  5  tips.
#>  Host tree has  4  tips.
#> Cophylogenetic set 10 :
#>  Symbiont tree has  4  tips.
#>  Host tree has  4  tips.

Notice that some of these are not particularly interesting with only 2 host or symbiont tips. This is expected behavior, and occurs since there is a nonzero probability that no events occur in the time we simulated.

Examining the result & plotting some trees

Let’s check out one set.

# # plot one or two of them
tree_set_of_interest <- host_symb_sets[[1]]
print(tree_set_of_interest)
#> 
#>  Host Tree:
#> 
#> 
#> Phylogenetic tree with 18 tips and 17 internal nodes.
#> 
#> Tip labels:
#>   H1, H2, X3, H4, H5, X6, ...
#> 
#> Rooted; includes branch lengths.
#> 
#> 
#>  Symbiont Tree:
#> 
#> 
#> Phylogenetic tree with 20 tips and 19 internal nodes.
#> 
#> Tip labels:
#>   X1, X2, S3, X4, X5, S6, ...
#> 
#> Rooted; includes branch lengths.
#> 
#>  Association Matrix: 
#> 
#> 
#>  There are  15  rows (i.e. extant symbionts).
#>  There are  14  cols (i.e. extant hosts).
plot(tree_set_of_interest, col = "red", lty = "dotted")

Notice that the output of sim_cophylo_bdp is a list of cophy objects. Each cophy object is composed of 4 elements: host_tree, symb_tree, association_mat, an event_history. The host_tree an symb_tree are both of the phylo object (from the ape package). The association_mat which is a matrix with number of rows equal to the number of extant tips in the symbiont tree, and number of columns equal to the number of extant tips in the host tree. The event_history is a data frame which shows the full history of the simulation. It contains which events occur on which symbiont and/or host and at which time. The following event code is used: “C”- cospeciation, “HG” - host gain (a host speciation), “HL” - host less (host extinction), “SG” - symbiont gain (symbiont speciation), “SL” - symbiont loss (symbiont extinction), “AG” - association gain, and “AL” - association loss. This bit is a mess at present so please let me know if you think of ways I could trim this down.

We can also perform the parafit test using the results.

host_tree <- tree_set_of_interest$host_tree
symb_tree <- tree_set_of_interest$symb_tree
a <- tree_set_of_interest$association_mat
d <- parafit_stat(host_tr = host_tree, symb_tr = symb_tree, assoc_mat = a)
parafit_test(host_tr = host_tree,
             symb_tr = symb_tree,
             assoc_mat = a,
             D = d,
             reps = 99)
#> [1] 0.01

We can also do a full summary of the results with the following function. This tabulates the number of each event in the event_history data frame, and then performs the Parafit test of Legendre et al. 2004.

# of course that bit is maybe a little too verbose
# we may be interested in a lot of trees
# we can instead use cophylo_summary_stat
host_symb_summary_df <- cophy_summary_stat(host_symb_sets)
#> Calculating for replicate  1 
#> Calculating for replicate  2 
#> Calculating for replicate  3 
#> Calculating for replicate  4 
#> Calculating for replicate  5 
#> Calculating for replicate  6
#> Warning in treeducken::parafit_stat(cophy_obj[[cophy_obj_indx]]$host_tree, : 'host_tr' must be a tree with more than 2 extant tips to calculate the parafit stat returning NA.
#> Calculating for replicate  7
#> Warning in treeducken::parafit_stat(cophy_obj[[cophy_obj_indx]]$host_tree, : 'host_tr' must be a tree with more than 2 extant tips to calculate the parafit stat returning NA.
#> Calculating for replicate  8 
#> Calculating for replicate  9 
#> Calculating for replicate  10
host_symb_summary_df
#>    Cospeciations Host_Speciations Host_Extinctions Symbiont_Speciations
#> 1             17               NA                4                    2
#> 2             13               NA               NA                    2
#> 3              2               NA               NA                   NA
#> 4             11               NA                3                    3
#> 5              5               NA               NA                    2
#> 6              1               NA               NA                   NA
#> 7              2               NA                1                    1
#> 8             10               NA               NA                    3
#> 9              3               NA               NA                    1
#> 10             3               NA                1                   NA
#>    Symbiont_Extinctions Parafit_Stat Parafit_P-value
#> 1                     5 1467.6208709            0.01
#> 2                     1  844.7867082            0.01
#> 3                    NA    0.2159458            0.06
#> 4                     4  431.3475091            0.01
#> 5                    NA   78.3393946            0.01
#> 6                    NA           NA              NA
#> 7                     1           NA              NA
#> 8                     1  650.1323319            0.01
#> 9                     1  121.7864797            0.03
#> 10                    1    3.7077994            0.04

Simulating locus trees on host and symbiont trees

To model gene tree-species tree discordance on these trees we use the three-tree model (Rasmussen and Kellis 2012). This model has three levels: species trees, locus trees, and gene trees. The species tree models speciation and extinction, the locus tree models gene birth, death and transfers, and the gene tree models incomplete lineage sorting. Transfers can occur randomly to any recipient throughout a tree or to species that are more closely related to the transfer donor species.

We can use either symbiont or host trees to simulate locus trees. First we will use the host tree:

host_tree_locus_trees <- sim_locustree_bdp(host_tree,
                                           gbr = 0.4,
                                           gdr = 0.2,
                                           lgtr = 0.0,
                                           num_loci = 10)
host_tree_locus_trees
#> 10 phylogenetic trees
plot(host_tree_locus_trees)

Now we will use the symbiont tree:

symb_tree_locus_trees <- sim_locustree_bdp(symb_tree,
                                           gbr = 0.2,
                                           gdr = 0.1,
                                           lgtr = 0.1,
                                           num_loci = 10)
str(symb_tree_locus_trees[[1]])
#> List of 6
#>  $ edge       : num [1:38, 1:2] 21 21 23 23 25 25 27 27 26 26 ...
#>  $ edge.length: num [1:38] 0.391 0.5276 0.9907 0.1069 0.0804 ...
#>  $ Nnode      : int 19
#>  $ tip.label  : chr [1:20] "S20_1" "S19_1" "X4_1" "S14_1" ...
#>  $ root.edge  : num 0.0557
#>  $ node.label : chr [1:19] "" "" "" "" ...
#>  - attr(*, "class")= chr "phylo"

Using these locus trees we can simulate under the multi-locus coalescent process (see Rasmussen and Kellis 2012). We can specify an effective population size and a generation time measured in generations per unit time. For now we will assume that the time of our trees is in units of millions of years. The default for generation time is then 1 time per year (1e-6). And we will randomly draw an effective population size from a Lognormal with mean 14 and standard deviation 0.4. Note that the only reason these values chosen is to mirror the validation test of Mallo et al. (2015) for the SimPhy program.

host_locus_tree <- host_tree_locus_trees[[3]]
# randomly choose an effective popsize 
popsize <- 1e6
host_loci_gene_trees <- sim_multilocus_coal(host_locus_tree,
                                            effective_pop_size = popsize,
                                            num_reps = 100)

And we can calculate some tree summary statistics on all of genes that were simulated on one of the locus trees (in this case the first one).

Load in other datasets

You can also combine a host tree, symbiont tree, and association matrix into a cophy object that can be input into functions in treeducken. The example below reads in the classic gopher and lice dataset from Hafner et al. 1994. The more common association table format with two columns with hosts in the first and symbionts (or parasites) in the second column is converted into an association matrix. These are all then converted into treeducken’s cophy class. You can then print out a summary of these and then calculate summary statistics on these. Currently, this only calculates the parafit statistic and conducts the permutation test for that statistic if the event_history element of cophy is not set. If that is set it will count the numbers of each event.

gopher_lice_map <- read.table(system.file("extdata",
                                          "gopher_lice_mapping.txt",
                                          package = "treeducken"),
                              stringsAsFactors = FALSE, header = TRUE)

gopher_lice_assoc_matrix <- convert_assoc_table_to_matrix(gopher_lice_map)
gopher_tree <- ape::read.nexus(system.file("extdata",
                                           "gophers_bd.tre",
                                           package = "treeducken"))
lice_tree <- ape::read.nexus(system.file("extdata",
                                         "lice_bd.tre",
                                         package = "treeducken"))
gopher_lice_cophylo <- convert_to_cophy(hostTree = gopher_tree,
                                          symbTree = lice_tree,
                                          assocMat = gopher_lice_assoc_matrix)
print(gopher_lice_cophylo)
#> 
#>  Host Tree:
#> 
#> 
#> Phylogenetic tree with 15 tips and 14 internal nodes.
#> 
#> Tip labels:
#>   Cratogeomys_castanops_L32685, Cratogeomys_merriami_L32688, Geomys_breviceps_L32683, Geomys_bursarius_L32693, Geomys_bursarius_L32694, Geomys_personatus_L32689, ...
#> 
#> Rooted; includes branch lengths.
#> 
#> 
#>  Symbiont Tree:
#> 
#> 
#> Phylogenetic tree with 17 tips and 16 internal nodes.
#> 
#> Tip labels:
#>   Geomydoecus_actuosi_L32665, Geomydoecus_chapini_L32667, Geomydoecus_cherriei_L32668, Geomydoecus_costaricensis_L32669, Geomydoecus_ewingi_L32671, Geomydoecus_expansus_L32670, ...
#> 
#> Rooted; includes branch lengths.
#> 
#>  Association Matrix: 
#> 
#> 
#>  There are  17  rows (i.e. extant symbionts).
#>  There are  15  cols (i.e. extant hosts).
cophy_summary_stat(gopher_lice_cophylo)
#>   Cospeciations Host_Speciations Host_Extinctions Symbiont_Speciations
#> 1             0                0                0                    0
#>   Symbiont_Extinctions Parafit_Stat Parafit_P-value
#> 1                    0     2.720485             0.1

plot(gopher_lice_cophylo,
     fsize = 0.5,
     show_tip_label = FALSE,
     gap = 1,
     col = "purple",
     lty = "dashed")