IFAA is a novel approach to make inference on the association of covariates with the absolute abundance (AA) of microbiome in an ecosystem. It can be also directly applied to relative abundance (RA) data to make inference on AA because the ratio of two RA is equal ratio of their AA. This algorithm can estimate and test the associations of interest while adjusting for potential confounders. High-dimensional covariates are handled with regularization. The estimates of this method have easy interpretation like a typical regression analysis. This algorithm can find optimal reference taxa/OTU/ASV and control FDR by permutation.

To model the association, the following equation is used: \[ \log(\mathcal{Y}_i^k)|\mathcal{Y}_i^k>0=\beta^{0k}+X_i^T\beta^k+W_i^T\gamma^k+Z_i^Tb_i+\epsilon_i^k,\hspace{0.2cm}k=1,...,K+1, \] where

\(\mathcal{Y}_i^k\) is the AA of taxa \(k\) in subject \(i\) in the entire ecosystem.

\(X_i\) is the covariate matrix.

\(W_i\) is the confounder matrix.

\(Z_i\) is the design matrix for random effects.

\(\beta^k\) is the regression coefficients that will be estimated and tested with the

`IFAA()`

function.

The challenge in microbiome analysis is that we can not oberve \(\mathcal{Y}_i^k\). What is observed is its small proportion: \(Y_i^k=C_i\mathcal{Y}^k_i\) where \(C_i\) is an unknown number between 0 and 1 that denote the observed proportion. The IFAA method can handle this challenge by identifying and employing reference taxa.

To install, type the following command in R console:

The package could be also installed from GitHub using the following code:

The `IFAA()`

function is the main function. The User Inputs are:

`MicrobData`

: Microbiome data matrix containing microbiome abundance with each row per sample and each column per taxon/OTU/ASV. It should contain an`"id"`

variable to correspond to the`"id"`

variable in the covariates data:`CovData`

. This argument can also take file directory path. For example,`MicrobData="C://...//microbiomeData.tsv"`

.`CovData`

: Covariates data matrix containing covariates and confounders with each row per sample and each column per variable. It should also contain an`"id"`

variable to correspond to the`"id"`

variable in the microbiome data:`MicrobData`

. This argument can also take file directory path. For example,`CovData="C://...//covariatesData.tsv"`

.`linkIDname`

: Variable name of the`"id"`

variable in both`MicrobData`

and`CovData`

. The two data sets will be merged by this`"id"`

variable.`testCov`

: Covariates that are of primary interest for testing and estimating the associations. It corresponds to \(X_i\) in the equation. Default is`NULL`

which means all covariates are`testCov`

.`ctrlCov`

: Potential confounders that will be adjusted in the model. It corresponds to \(W_i\) in the equation. Default is`NULL`

which means all covariates except those in`testCov`

are adjusted as confounders.`testMany`

: This takes logical value`TRUE`

or`FALSE`

. If`TRUE`

, the`testCov`

will contain all the variables in`CovData`

provided`testCov`

is set to be`NULL`

. The default value is`TRUE`

which does not do anything if`testCov`

is not`NULL`

.`ctrlMany`

: This takes logical value`TRUE`

or`FALSE`

. If`TRUE`

, all variables except`testCov`

are considered as control covariates provided`ctrlCov`

is set to be`NULL`

. The default value is`FALSE`

.`nRef`

: The number of randomly picked reference taxa used in phase 1. Default number is`40`

.`refTaxa`

: A vector of taxa names. These are reference taxa specified by the user to be used in phase 1. If the number of reference taxa is less than ‘nRef’, the algorithm will randomly pick extra reference taxa to make up ‘nRef’. The default is`NULL`

since the algorithm will pick reference taxa randomly.`fdrRate`

: The false discovery rate for identifying taxa/OTU/ASV associated with`testCov`

in phase 1. Default is`0.25`

.`sequentialRun`

: This takes a logical value`TRUE`

or`FALSE`

. Default is`FALSE`

. This argument could be useful for debug.`paraJobs`

: If`sequentialRun`

is`FALSE`

, this specifies the number of parallel jobs that will be registered to run the algorithm. If specified as`NULL`

, it will automatically detect the cores to decide the number of parallel jobs. Default is`NULL`

. It is safe to have 4 gb memory per job. It may be needed to reduce the number of jobs if memory is limited.`standardize`

: This takes a logical value`TRUE`

or`FALSE`

. If`TRUE`

, all design matrix X in phase 1 and phase 2 will be standardized in the analyses. Default is`FALSE`

.`nRefMaxForEsti`

: The maximum number of reference taxa used in phase 2. The default is`1`

.`bootB`

: Number of bootstrap samples for obtaining confidence interval of estimates in phase 2. The default is`500`

.`refReadsThresh`

: The threshold of non-zero sequencing reads for choosing the reference taxon in phase 2. The default is`0.2`

which means at least 20% non-zero sequencing reads.`SDThresh`

: The threshold of standard deviations of sequencing reads for choosing the reference taxon in phase 2. The default is`0.5`

which means the standard deviation of sequencing reads should be at least`0.5`

.`balanceCut`

: The threshold of non-zero sequencing reads in each group of a binary variable for choosing the reference taxon in phase 2. The default number is`0.2`

which means at least 20% sequencing reads are non-zero in each group.`seed`

: Random seed for reproducibility. Default is`1`

.

The output of `IFAA()`

function is a list. The estimation results can extracted as the following:

`analysisResults$sig_results`

: A list containing estimating results for all the variables in`testCov`

.`analysisResults$full_results`

: A list containing estimating results for taxon. NA denotes unestimable.

The covariates data including `testCov`

and `ctrlCov`

can be extracted in the output:

`covariatesData`

: A dataset containing covariates and confounders used in the analyses

The example datasets `dataM`

and `dataC`

are included in the package. They could be accessed by:

```
library(IFAA)
data(dataM)
dim(dataM)
#> [1] 20 60
dataM[1:5, 1:8]
#> id rawCount1 rawCount2 rawCount3 rawCount4 rawCount5 rawCount6 rawCount7
#> 1 1 0 0 0 0 0 3 0
#> 2 2 0 0 0 0 0 0 0
#> 3 3 0 0 0 0 0 214 0
#> 4 4 0 0 0 0 0 2 0
#> 5 5 0 0 0 0 0 40 0
data(dataC)
dim(dataC)
#> [1] 20 6
dataC[1:5, ]
#> id v4 v1 v5 v2 v3
#> 1 1 1 1.653901 4 1 NA
#> 2 2 2 0.362706 5 2 2
#> 3 3 1 1.496269 NA 5 2
#> 4 4 1 1.755541 5 3 3
#> 5 5 1 1.035714 5 7 NA
```

Both the microbiome data `dataM`

and the covariates data `dataC`

contain 20 samples (i.e., 20 rows).

`dataM`

contains 60 taxa with absolute abundances and these are gut microbiome.`dataC`

contains 5 covariates.

Next we analyze the data to test the association between microbiome and the two variables `"v1"`

and `"v2"`

while adjusting for the variable `"v3"`

.

```
results <- IFAA(MicrobData = dataM,
CovData = dataC,
linkIDname = "id",
testCov = c("v1", "v2"),
ctrlCov = c("v3"),
nRef = 3,
paraJobs = 2,
fdrRate = 0.25)
#> There are 41 taxa without any sequencing reads before
#> data merging, and excluded from the analysis
#> Data dimensions (after removing missing data if any):
#> 13 samples
#> 18 taxa/OTU/ASV
#> 2 testCov variables in the analysis
#> These are the testCov variables:
#> v1, v2
#> 1 ctrlCov variables in the analysis
#> These are the ctrlCov variables:
#> v3
#> 0 binary covariates in the analysis
#> 54.27 percent of microbiome sequencing reads are zero
#> Start Phase 1 analysis
#> 2 parallel jobs are registered for analyzing 3 reference taxa in Phase 1
#> 33 percent of phase 1 analysis has been done
#> 2 parallel jobs are registered for analyzing 2 reference taxa in Phase 1
#> 67 percent of phase 1 analysis has been done
#> Phase 1 analysis used 0.14 minutes
#> Start Phase 2 parameter estimation
#> Final Reference Taxa are: rawCount32rawCount33
#> Start estimation for the 1th final reference taxon: rawCount32
#> Estimation done for the 1th final reference taxon: rawCount32 and it took 0.007 minutes
#> Start estimation for the 2th final reference taxon: rawCount33
#> Estimation done for the 2th final reference taxon: rawCount33 and it took 0.006 minutes
#> Phase 2 parameter estimation done and took 0.013 minutes.
#> The entire analysis took 0.15 minutes
```

In this example, we are only interested in testing the association with `"v1"`

and `"v2"`

which is why `testCov=c("v1,"v2")`

. The variable `"v3"`

is adjusted as a potential confounder in the analyses. For the sake of speed in this hypothetical example, we set small numbers for `nRef=4`

, `nPermu=4`

and `bootB=5`

. These are just for illustration purpose here and are too small for a formal analysis to generate valid results.

The final analysis results are stored in the list `analysisResults$sig_results`

:

The results found the two taxa `"rawCount23"`

and `"rawCount32"`

associated with `"v2"`

. The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to \(\beta^k\) in the model equation.

The interpretation is that

Every unit increase in

`"v2"`

is associated with approximately 7.9% increase in the log of absolute abundance of`"rawCount23"`

and approximately 8.1% increase in the log of absolute abundance of`"rawCount33"`

in the entire gut ecosystem.There were no taxa associated with

`"v1"`

in the analysis.

All the analyzed covariates including `testCov`

and `ctrlCov`

are stored in the object: `covariatesData`

:

```
results$covariatesData
#> id v1 v2 v3
#> 2 2 0.36270596 2 2
#> 3 3 1.49626921 5 2
#> 4 4 1.75554095 3 3
#> 6 6 1.64525227 4 4
#> 8 8 -1.57781131 24 22
#> 9 9 2.22581203 55 5
#> 10 10 0.71642615 98 67
#> 12 12 2.12230160 98 3
#> 14 14 1.99387922 93 4
#> 16 16 0.05417617 83 34
#> 18 18 -0.43426021 73 67
#> 19 19 1.46579846 68 566
#> 20 20 1.89625949 63 34
```

The IFAA package also implement the Multivariate Zero-Inflated Logistic Normal (MZILN) regression model for analyzing microbiome data. The regression model can be expressed as follows: \[ \log\bigg(\frac{\mathcal{Y}_i^k}{\mathcal{Y}_i^{K+1}}\bigg)|\mathcal{Y}_i^k>0,\mathcal{Y}_i^{K+1}>0=\alpha^{0k}+\mathcal{X}_i^T\alpha^k+\epsilon_i^k,\hspace{0.2cm}k=1,...,K, \] where

\(\mathcal{Y}_i^k\) is the AA of taxa \(k\) in subject \(i\) in the entire ecosystem.

\(\mathcal{Y}_i^{K+1}\) is the reference taxon (specified by user).

\(\mathcal{X}_i\) is the covariate matrix for all covariates including confounders.

\(\alpha^k\) is the regression coefficients that will be estimated and tested.

The `MZILN()`

function is to implement the Multivariate Zero-Inflated Logistic Normal model. It estimates and tests the associations given a user-specified reference taxon/OTU/ASV, whereas the ‘IFAA()’ does not require any user-specified reference taxa. If the user-specified taxon is independent of the covariates, ‘MZILN()’ should generate similar results as ‘IFAA()’. The User Inputs for ‘MZILN()’ are:

`MicrobData`

: Microbiome data matrix containing microbiome abundance with each row per sample and each column per taxon/OTU/ASV. It should contain an`"id"`

variable to correspond to the`"id"`

variable in the covariates data:`CovData`

. This argument can also take file directory path. For example,`MicrobData="C://...//microbiomeData.tsv"`

.`CovData`

: Covariates data matrix containing covariates and confounders with each row per sample and each column per variable. It should also contain an`"id"`

variable to correspond to the`"id"`

variable in the microbiome data:`MicrobData`

. This argument can also take file directory path. For example, CovData=“C://…//covariatesData.tsv”.`linkIDname`

Variable name of the`"id"`

variable in both`MicrobData`

and`CovData`

. The two data sets will be merged by this`"id"`

variable.`allCov`

All covariates of interest (including confounders) for estimating and testing their associations with microbiome. Default is ‘NULL’ meaning that all covariates in covData are of interest.`targetTaxa`

The taxa that should be used as numerator. Default is NULL.`refTaxa`

Reference taxa specified by the user and will be used as the reference taxa.`adjust_method`

The adjusting method used for p value adjustment. Same as p.adjust function in R.`fdrRate`

The false discovery rate for identifying taxa/OTU/ASV associated with`allCov`

. Default is`0.25`

.`sequentialRun`

This takes a logical value`TRUE`

or`FALSE`

. Default is`TRUE`

since there is only 1 reference taxon.`paraJobs`

If`sequentialRun`

is`FALSE`

, this specifies the number of parallel jobs that will be registered to run the algorithm. If specified as`NULL`

, it will automatically detect the cores to decide the number of parallel jobs. Default is`NULL`

. It is safe to have 4gb memory per job. It may be needed to reduce the number of jobs if memory is limited.`standardize`

This takes a logical value`TRUE`

or`FALSE`

. If`TRUE`

, all design matrix X in phase 1 and phase 2 will be standardized in the analyses. Default is`FALSE`

.`bootB`

Number of bootstrap samples for obtaining confidence interval of estimates in phase 2. The default is`500`

.`bootLassoAlpha`

The significance level in phase 2. Default is`0.05`

.`taxkeepThresh`

The threshold of number of non-zero sequencing reads for each taxon to be included into the analysis.`seed`

Random seed for reproducibility. Default is`1`

.

The output of `MZILN()`

function is a list. The estimation results can extracted as the following:

`analysisResults$full_results`

: A list containing estimating results for all significant taxa.`analysisResults$targettaxa_result_list`

: A list containing estimating results for targetTaxa. Only available when targetTaxa is non-empty.

All covariates data can be extracted:

`covariatesData`

: A dataset containing covariates and confounders used in the analyses

We use the same example data The example dataset as that for illustrating the IFAA function. `dataM`

and `dataC`

are included in the package. They could be accessed by:

```
data(dataM)
dim(dataM)
#> [1] 20 60
dataM[1:5, 1:8]
#> id rawCount1 rawCount2 rawCount3 rawCount4 rawCount5 rawCount6 rawCount7
#> 1 1 0 0 0 0 0 3 0
#> 2 2 0 0 0 0 0 0 0
#> 3 3 0 0 0 0 0 214 0
#> 4 4 0 0 0 0 0 2 0
#> 5 5 0 0 0 0 0 40 0
data(dataC)
dim(dataC)
#> [1] 20 6
dataC[1:5, ]
#> id v4 v1 v5 v2 v3
#> 1 1 1 1.653901 4 1 NA
#> 2 2 2 0.362706 5 2 2
#> 3 3 1 1.496269 NA 5 2
#> 4 4 1 1.755541 5 3 3
#> 5 5 1 1.035714 5 7 NA
```

Both the microbiome data `dataM`

and the covariates data `dataC`

contain 20 samples (i.e., 20 rows).

`dataM`

contains 60 taxa with absolute abundances and these are gut microbiome.`dataC`

contains 5 covariates.

Next we analyze the data to test the association between microbiome and all the three variables `"v1"`

, `"v2"`

and `"v3"`

.

```
results <- MZILN(MicrobData = dataM,
CovData = dataC,
linkIDname = "id",
allCov=c("v1","v2","v3"),
targetTaxa = "rawCount6",
refTaxa=c("rawCount11"),
paraJobs=2)
#> There are 41 taxa without any sequencing reads before
#> data merging, and excluded from the analysis
#> Data dimensions (after removing missing data if any):
#> 13 samples
#> 18 taxa/OTU/ASV
#> 3 testCov variables in the analysis
#> These are the testCov variables:
#> v1, v2, v3
#> 0 ctrlCov variables in the analysis
#> 0 binary covariates in the analysis
#> 54.27 percent of microbiome sequencing reads are zero
#> Estimation done for the 1th reference taxon: rawCount11 and it took 0.01 minutes
#> The entire analysis took 0.01 minutes
#> $rawCount11
#> $rawCount11$v1
#> estimate SE est CI low CI up adj p-value
#> rawCount6 0.21309 1.51055 -2.747587 3.173767 1
#>
#> $rawCount11$v2
#> estimate SE est CI low CI up adj p-value
#> rawCount6 0.02032646 0.02581197 -0.03026501 0.07091792 1
#>
#> $rawCount11$v3
#> estimate SE est CI low CI up adj p-value
#> rawCount6 -0.003316226 0.005345113 -0.01379265 0.007160195 1
```

In this example, we are only interested in testing the associations with `"v1"`

, `"v2"`

and `"v3"`

which is why `allCov=c("v1,"v2","v3")`

. The reference taxa is set to be `"rawCount11"`

, and the taxa we are interested in is `"rawCount6"`

, so that `refTaxa=c("rawCount11")`

and `targetTaxa = "rawCount6"`

.

The final analysis results are stored in the list `results$analysisResults$targettaxa_result_list`

:

```
results$analysisResults$targettaxa_result_list
#> $rawCount11
#> $rawCount11$v1
#> estimate SE est CI low CI up adj p-value
#> rawCount6 0.21309 1.51055 -2.747587 3.173767 1
#>
#> $rawCount11$v2
#> estimate SE est CI low CI up adj p-value
#> rawCount6 0.02032646 0.02581197 -0.03026501 0.07091792 1
#>
#> $rawCount11$v3
#> estimate SE est CI low CI up adj p-value
#> rawCount6 -0.003316226 0.005345113 -0.01379265 0.007160195 1
```

The regression coefficients and their 95% confidence intervals are provided. These coefficients correspond to \(\alpha^k\) in the model equation, and can be interpreted as the associations between the covariates and log-ratio of the significant taxa over the reference taxon.

The interpretation is that

- Every unit increase in
`"v1"`

is associated with approximately 21.3% increase in the abundance ratio of`"rawCount6"`

over`"rawCount11"`

;every unit increase in`"v2"`

is associated with approximately 2.0% increase in the abundance ratio of`"rawCount6"`

over`"rawCount11"`

;every unit increase in`"v3"`

is associated with approximately 0.3% decrease in the abundance ratio of`"rawCount6"`

over`"rawCount11"`

.

All the analyzed covariates are stored in the object: `covariatesData`

:

```
results$covariatesData
#> id v1 v2 v3
#> 2 2 0.36270596 2 2
#> 3 3 1.49626921 5 2
#> 4 4 1.75554095 3 3
#> 6 6 1.64525227 4 4
#> 8 8 -1.57781131 24 22
#> 9 9 2.22581203 55 5
#> 10 10 0.71642615 98 67
#> 12 12 2.12230160 98 3
#> 14 14 1.99387922 93 4
#> 16 16 0.05417617 83 34
#> 18 18 -0.43426021 73 67
#> 19 19 1.46579846 68 566
#> 20 20 1.89625949 63 34
```