ActivePathways

Jonathan Barenboim, Marta Paczkowska, Helen Zhu & Jüri Reimand

2019-05-03

Introduction

ActivePathways is a statistical method and software tool for pathway enrichment analysis across multiple omics datasets. Different omics techniques and analyses reveal different facets of underlying biology and such integrative analysis across multivariate data helps find their commonalities in the context of pathways, biological processes and functionally related gene sets. ActivePathways uses a data fusion technique to merge multiple gene lists with p-values into an integrated gene list and conducts enrichment analysis using the ranked hypergeometric statistic.

Pathway Enrichment and the Ordered Hypergeometric Algorithm

From a matrix of p-values representing genes and their significance in different omics datasets, ActivePathways creates an integrated gene list (i.e., a query) by merging the p-values to determine a single p-value for each gene. The integrated gene list is then filtered and sorted by merged p-values. ActivePathways tests if a pathway is ‘enriched’, or over-represented in our gene list, by performing a hypergeometric test on the gene list and the genes annotated to the pathway. However, rather than testing the entire gene list, ActivePathways performs incremental enrichment analysis with an increasingly larger subset of the gene list starting at the top of the list and identifies the subset of gene list that yields the smallest p-value (known as the ranked hypergeometric statistic). This approach is useful when the genes in the gene list are of decreasing biological importance. The ranked test is more sensitive than a single test of a regular “flat” gene list as it finds smaller and more specific pathways of genes ranked at the top of the list, as well as larger and more general pathways distributed across the entire input list.

Tutorial

The following example uses a dataset that contains a small subset of driver genes in a cohort of adenocarcinomas of multiple cancer types. The driver scores are p-values which represent the confidence that a given gene is a driver. Driver scores are provided separately for different genetic elements: untranslated regions (5’ and 3’ UTRs), coding regions, and promoters.

Providing a GMT File

The GMT file, representing a set of molecular pathways, are a small subset of the pathways curated by the Reactome database Gene Matrix Transposed file. Broadly, GMT files can be downloaded from the Bader Lab repository, the MSigDB database hosted at the Broad Institute or the g:ProfileR web server. Click on ‘Data sources’, select database preferences and download the name.gmt file. Note that the gene IDs or symbols in the GMT file need to match the IDs or symbols used in the gene matrix.

Downloading GMT Files from gProfileR

Downloading GMT Files from gProfileR

Providing a P-value Matrix

Let’s start by reading the input data from the file. ActivePathways expects a matrix, so the table has to be cast to the correct class.

scores <- read.table(system.file('extdata', 'Adenocarcinoma_scores_subset.tsv', package='ActivePathways'), header=TRUE, sep='\t', row.names='Gene')
scores <- as.matrix(scores)
scores
##                      X3UTR        X5UTR          CDS     promCore
## A2M                     NA 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS                    NA 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT          0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1         0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5         0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8                   NA 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6         0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1          0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
##  [ reached getOption("max.print") -- omitted 2406 rows ]

Additionally, ActivePathways will not allow any missing values (NA) in the input. Rows containing missing values may be removed. Alternatively, as conducted in this example, missing values will be assigned a value of 1, indicating that the element is not a driver.

scores[is.na(scores)] <- 1

Basic Use

The ActivePathways function requires two inputs. The first is the data matrix of p-values. The second is the path to the GMT file.

library(ActivePathways)
gmt.file <- system.file('extdata', 'hsapiens_REAC_subset.gmt', package='ActivePathways')
ActivePathways(scores, gmt.file)
##          term.id                     term.name adjusted.p.val term.size
##  1: REAC:2424491               DAP12 signaling   4.491268e-05       358
##  2:  REAC:422475                 Axon guidance   2.028966e-02       555
##  3:  REAC:177929             Signaling by EGFR   6.245734e-04       366
##                                              overlap       evidence
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##  2:          PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
##  3:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##                              Genes_X3UTR Genes_X5UTR
##  1:                                   NA          NA
##  2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,...          NA
##  3:                                   NA          NA
##                                      Genes_CDS
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##  2:                                         NA
##  3:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##                                  Genes_promCore
##  1:                                          NA
##  2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
##  3:                                          NA
##  [ reached getOption("max.print") -- omitted 17 rows ]

Significance Threshold and Returning All Results

A pathway is considered to be significantly enriched if the associated p-value is less than the significance threshold (i.e. p.val <= significant). The significant parameter represents the maximum p-value for a result to be considered statistically significant after multiple testing correction. By default, only the significant pathways are returned. If all results are desired, the user can set the return.all parameter to TRUE. ActivePathways conducts multiple testing correction by default using the Holm method.

nrow(ActivePathways(scores, gmt.file, return.all=FALSE, significant=0.05))
## [1] 20
nrow(ActivePathways(scores, gmt.file, return.all=FALSE, significant=0.1))
## [1] 23
nrow(ActivePathways(scores, gmt.file, return.all=TRUE, significant=0.05))
## [1] 265

Note that while setting significant=1 gives the same result as return.all=TRUE, it is strongly recommended the latter is used because the results written to cytoscape.file.dir use the significant value to determine which pathways to include in the network. Thus ActivePathways(scores, gmt.file, return.all=TRUE, cytoscape.file.dir='/results/directory/') and ActivePathways(scores, gmt.file, significant=1, cytoscape.file.dir='/results/directory/) will return the same result table but will write different results files.

Examining GMT Objects

The GMT file contains the pathways to be analyzed for gene enrichment. Each line corresponds to a pathway and contains the pathway ID and name followed by all associated genes. While the location of the GMT file to ActivePathways can be directly specified, in some cases the user may desire to examine the GMT file interactively. The ActivePathways package includes an interface for working with GMT objects. The GMT file can be read into R as a GMT object with the read.GMT function. The GMT is structured as a list of terms, where each term is a list containing an id, a name, and a list of genes.

gmt <- read.GMT(gmt.file)
names(gmt[[1]])
## [1] "id"    "name"  "genes"
# Pretty-print the GMT
gmt[1:3]
## $`REAC:1630316`
## $`REAC:1630316`$id
## [1] "REAC:1630316"
## 
## $`REAC:1630316`$name
## [1] "Glycosaminoglycan metabolism"
## 
## $`REAC:1630316`$genes
##  [1] "ST3GAL1"    "CHP1"       "B4GALT5"    "ST3GAL4"    "SLC9A1"    
##  [6] "CHST11"     "B3GAT3"     "SLC26A1"    "ARSB"       "ABCC5"     
## [11] "LUM"        "PAPSS1"     "SDC4"       "NAGLU"      "AC022400.6"
## [16] "HYAL1"      "NDST3"      "SLC35B2"    "B3GAT1"     "CHST2"     
## [21] "DCN"        "CEMIP"      "IDS"        "IDUA"       "HS3ST3B1"  
## [26] "B3GNT7"     "CHST12"     "CHST14"     "BCAN"       "B4GALT3"   
## [31] "HS3ST1"     "B4GALT1"    "CSPG4"      "CHPF2"      "HEXB"      
##  [ reached getOption("max.print") -- omitted 92 entries ]
## 
## 
## $`REAC:5633007`
## $`REAC:5633007`$id
## [1] "REAC:5633007"
## 
## $`REAC:5633007`$name
## [1] "Regulation of TP53 Activity"
## 
## $`REAC:5633007`$genes
##  [1] "PPP1R13L" "RFC3"     "SGK1"     "HDAC1"    "SETD9"    "RPA3"    
##  [7] "CSNK2A2"  "EHMT1"    "PDPK1"    "CCNG1"    "TAF6"     "ING5"    
## [13] "TP53INP1" "BRCA1"    "TAF1"     "TAF9"     "EHMT2"    "RBBP8"   
## [19] "CHD4"     "USP2"     "BRPF1"    "BANP"     "PPP2R5C"  "SSRP1"   
## [25] "TAF4"     "TMEM55B"  "PPP1R13B" "TAF10"    "PRR5"     "ATM"     
## [31] "ZNF385A"  "MRE11"    "PIP4K2C"  "SMYD2"    "BARD1"   
##  [ reached getOption("max.print") -- omitted 125 entries ]
## 
## 
## $`REAC:5358346`
## $`REAC:5358346`$id
## [1] "REAC:5358346"
## 
## $`REAC:5358346`$name
## [1] "Hedgehog ligand biogenesis"
## 
## $`REAC:5358346`$genes
##  [1] "PSMB11"     "PSMD8"      "PSMB6"      "PSMB8"      "PSMD13"    
##  [6] "PSMA2"      "PSMB3"      "PSMB7"      "PSME2"      "PSMD1"     
## [11] "PSMA8"      "PSMA1"      "PSMD2"      "PSMD14"     "PSMC1"     
## [16] "IHH"        "SEL1L"      "SCUBE2"     "PSMB10"     "PSMD10"    
## [21] "PSMA5"      "PSMD3"      "PSMC2"      "PSMC6"      "PSME1"     
## [26] "PSME4"      "PSMC4"      "AC010132.3" "PSMD5"      "UBB"       
## [31] "GPC5"       "PSMA7"      "PSMB2"      "PSMD9"      "SEM1"      
##  [ reached getOption("max.print") -- omitted 31 entries ]
## 
## 
## attr(,"class")
## [1] "GMT"
# Look at the genes annotated to the first term
gmt[[1]]$genes
##  [1] "ST3GAL1"    "CHP1"       "B4GALT5"    "ST3GAL4"    "SLC9A1"    
##  [6] "CHST11"     "B3GAT3"     "SLC26A1"    "ARSB"       "ABCC5"     
## [11] "LUM"        "PAPSS1"     "SDC4"       "NAGLU"      "AC022400.6"
## [16] "HYAL1"      "NDST3"      "SLC35B2"    "B3GAT1"     "CHST2"     
## [21] "DCN"        "CEMIP"      "IDS"        "IDUA"       "HS3ST3B1"  
## [26] "B3GNT7"     "CHST12"     "CHST14"     "BCAN"       "B4GALT3"   
## [31] "HS3ST1"     "B4GALT1"    "CSPG4"      "CHPF2"      "HEXB"      
##  [ reached getOption("max.print") -- omitted 92 entries ]
# Get the full name of Reactome pathway 2424491
gmt$`REAC:2424491`$name
## [1] "DAP12 signaling"

GMT files can be filtered for size to remove large and small pathways which are overly generic or specific. Here, pathways that have less than 10 or more than 500 genes are removed. This step removes very small and specific gene sets that often represent a large fraction of all pathways, and large and unspecific gene sets that are subject to p-value inflation due to the properties of the hypergeometric distribution. Both very small and very large gene sets are often not helpful for data interpretation and can be safely discarded in most cases.

gmt <- Filter(function(term) length(term$genes) >= 10, gmt)
gmt <- Filter(function(term) length(term$genes) <= 500, gmt)

This GMT object can be written to a file.

write.GMT(gmt, 'hsapiens_REAC_subset_filtered.gmt', path = ".")

This filtering can also be done automatically using the geneset.filter parameter in the ActivePathways function by specifying a vector of length 2 containing the minimum and maximum pathway size desired by the user. For example, the above task can be completed by setting geneset.filter=c(10, 500). By default, ActivePathways removes any pathways with less than 5 or more than 1000 annotated genes.

The new GMT object can now be used for analysis with ActivePathways.

ActivePathways(scores, gmt)
##          term.id                     term.name adjusted.p.val term.size
##  1: REAC:2424491               DAP12 signaling   0.0002904356       358
##  2:  REAC:177929             Signaling by EGFR   0.0032110362       366
##  3: REAC:2559583           Cellular Senescence   0.0001568181       196
##                                              overlap evidence Genes_X3UTR
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...      CDS          NA
##  2:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...      CDS          NA
##  3:           TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,...      CDS          NA
##     Genes_X5UTR                                  Genes_CDS Genes_promCore
##  1:          NA        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...             NA
##  2:          NA        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...             NA
##  3:          NA      TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,...             NA
##  [ reached getOption("max.print") -- omitted 14 rows ]

Statistical Background Gene Set

To perform pathway enrichment analysis, ActivePathways uses a set of genes as a statistical background. By default, the background is every gene found at least once in the GMT, but it can also be specified by the user by supplying a vector of genes to the background parameter in ActivePathways. For example, if the analysis is conducted on a set of genes captured by a sequencing panel, then the user may wish to limit the background to the genes whose characteristics can be measured. Genes that have no plausible signal in an experiment or analysis of interest should not be part of a statistical background as their inclusion in an analysis would lead to statistical inflation.

The list of genes in the GMT can be extracted with the makeBackground function (automatically performed by ActivePathways). In this example, let’s try excluding the gene TP53. This gene can be removed from the background and the new background can be used for the analysis.

background <- makeBackground(gmt)
background <- background[background != 'TP53']
ActivePathways(scores, gmt.file, background=background)
##          term.id                     term.name adjusted.p.val term.size
##  1: REAC:2424491               DAP12 signaling    0.002258987       358
##  2:  REAC:422475                 Axon guidance    0.009416431       555
##  3:  REAC:177929             Signaling by EGFR    0.021767252       366
##                                              overlap evidence
##  1:               PIK3CA,KRAS,PTEN,BRAF,NRAS,B2M,...      CDS
##  2:          PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,...    X3UTR
##  3:             PIK3CA,KRAS,PTEN,BRAF,NRAS,CALM2,...      CDS
##                              Genes_X3UTR Genes_X5UTR
##  1:                                   NA          NA
##  2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,...          NA
##  3:                                   NA          NA
##                                   Genes_CDS Genes_promCore
##  1:   PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,...             NA
##  2:                                      NA             NA
##  3:   PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,...             NA
##  [ reached getOption("max.print") -- omitted 13 rows ]

Note that only genes found in the background are used for testing enrichment. Any genes in the input data that are not in the background will be automatically removed by ActivePathways.

Merging P-values

ActivePathways statistically merges evidence from each source (e.e., omics dataset or analysis) by merging the p-values for each gene into a single combined score. The two main methods to merge p-values are Brown’s method (the default, which assumes a correlation between different sources of evidence) and Fisher’s method (which assumes independence of different sources of evidence), but other methods are also available.

scores
##                      X3UTR        X5UTR          CDS     promCore
## A2M           1.0000000000 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS          1.0000000000 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT          0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1         0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5         0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8         1.0000000000 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6         0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1          0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
##  [ reached getOption("max.print") -- omitted 2406 rows ]
merge_p_values(scores, 'Fisher')
##          A2M         AAAS         ABAT        ABCC1        ABCC5 
## 8.580195e-01 9.310603e-01 2.463661e-01 5.587650e-01 8.697586e-01 
##        ABCC8        ABHD6         ABI1         ABL1   AC142381.1 
## 9.655215e-01 6.087795e-01 3.184240e-01 7.462578e-01 1.000000e+00 
##        ACAA1        ACADS        ACAT1          ACD        ACOT6 
## 5.552635e-01 4.613152e-01 4.281889e-01 9.692612e-01 7.933962e-02 
##        ACOT8        ACOX2        ACOX3        ACSM5         ACTB 
## 3.011535e-01 9.039274e-01 8.307694e-01 3.486612e-01 2.750060e-04 
##       ACTL6A        ACTN1        ACTN2       ACTR1B       ACVR2A 
## 3.361929e-01 3.449892e-01 8.649147e-01 5.511921e-01 9.876590e-12 
##       ACVR2B         ACY1         ACY3       ADAM10       ADAM17 
## 7.538053e-01 9.751877e-01 7.524688e-01 1.915695e-02 2.445514e-01 
##       ADAM22       ADARB1        ADAT2        ADAT3        ADCY1 
## 2.508503e-01 6.775969e-01 5.005089e-01 9.450412e-01 8.432588e-01 
##  [ reached getOption("max.print") -- omitted 2379 entries ]
merge_p_values(scores, 'Brown')
##          A2M         AAAS         ABAT        ABCC1        ABCC5 
## 7.784384e-01 8.627499e-01 2.672114e-01 5.148585e-01 7.908670e-01 
##        ABCC8        ABHD6         ABI1         ABL1   AC142381.1 
## 9.127001e-01 5.551149e-01 3.254985e-01 6.714814e-01 1.000000e+00 
##        ACAA1        ACADS        ACAT1          ACD        ACOT6 
## 5.120659e-01 4.379288e-01 4.119781e-01 9.189688e-01 1.165005e-01 
##        ACOT8        ACOX2        ACOX3        ACSM5         ACTB 
## 3.116936e-01 8.291729e-01 7.506855e-01 3.494921e-01 2.490608e-03 
##       ACTL6A        ACTN1        ACTN2       ACTR1B       ACVR2A 
## 3.396220e-01 3.465883e-01 7.856999e-01 5.088222e-01 3.755813e-08 
##       ACVR2B         ACY1         ACY3       ADAM10       ADAM17 
## 6.782366e-01 9.294653e-01 6.770367e-01 4.313449e-02 2.657156e-01 
##       ADAM22       ADARB1        ADAT2        ADAT3        ADCY1 
## 2.709006e-01 6.120292e-01 4.687120e-01 8.817002e-01 7.632312e-01 
##  [ reached getOption("max.print") -- omitted 2379 entries ]

The user can directly calculate merged p-values using the merge_p_values function. For example, the X3UTR, X5UTR, and promCore columns can be merged into a single non_coding column. This will consider the three non-coding regions as a single column, rather than giving them all equal weight as the CDS column.

scores2 <- cbind(scores[, 'CDS'], merge_p_values(scores[, c('X3UTR', 'X5UTR', 'promCore')], 'Brown'))
colnames(scores2) <- c('CDS', 'non_coding')

scores # with separate columns for 3' UTR, 5'UTR & Promoter
##                      X3UTR        X5UTR          CDS     promCore
## A2M           1.0000000000 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS          1.0000000000 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT          0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1         0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5         0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8         1.0000000000 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6         0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1          0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
##  [ reached getOption("max.print") -- omitted 2406 rows ]
scores2 # with a single 'noncoding' column for the merged p-value of 3' UTR, 5'UTR & Promoter
##                        CDS   non_coding
## A2M           9.051708e-01 6.310758e-01
## AAAS          7.047723e-01 8.042034e-01
## ABAT          7.600985e-01 1.692698e-01
## ABCC1         2.599319e-01 5.952884e-01
## ABCC5         4.788846e-01 8.002823e-01
## ABCC8         6.890250e-01 8.829674e-01
## ABHD6         7.587176e-01 4.219903e-01
## ABI1          4.815032e-01 2.669029e-01
## ABL1          6.685222e-01 5.729940e-01
## AC142381.1    9.999658e-01 1.000000e+00
## ACAA1         4.139803e-01 4.917295e-01
## ACADS         6.352493e-01 3.370438e-01
## ACAT1         6.031193e-02 7.938979e-01
## ACD           9.487000e-01 8.196343e-01
## ACOT6         5.149489e-01 7.837917e-02
## ACOT8         5.532113e-01 2.374103e-01
## ACOX2         6.549615e-01 7.752572e-01
##  [ reached getOption("max.print") -- omitted 2397 rows ]
ActivePathways(scores, gmt.file) # results on the dataset with separate columns for 3' UTR, 5'UTR & Promoter
##          term.id                     term.name adjusted.p.val term.size
##  1: REAC:2424491               DAP12 signaling   4.491268e-05       358
##  2:  REAC:422475                 Axon guidance   2.028966e-02       555
##  3:  REAC:177929             Signaling by EGFR   6.245734e-04       366
##                                              overlap       evidence
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##  2:          PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
##  3:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##                              Genes_X3UTR Genes_X5UTR
##  1:                                   NA          NA
##  2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,...          NA
##  3:                                   NA          NA
##                                      Genes_CDS
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##  2:                                         NA
##  3:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##                                  Genes_promCore
##  1:                                          NA
##  2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
##  3:                                          NA
##  [ reached getOption("max.print") -- omitted 17 rows ]
ActivePathways(scores2, gmt.file) # results on the dataset with a single 'noncoding' column for the merged p-value of 3' UTR, 5'UTR & Promoter, reducing the impact of noncoding mutations
##          term.id                     term.name adjusted.p.val term.size
##  1: REAC:2424491               DAP12 signaling   0.0003694968       358
##  2:  REAC:422475                 Axon guidance   0.0051873493       555
##  3:  REAC:177929             Signaling by EGFR   0.0018584534       366
##  4: REAC:2559583           Cellular Senescence   0.0005822291       196
##                                     overlap evidence
##  1:     TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,...      CDS
##  2: PIK3CA,KRAS,BRAF,NRAS,RPS6KA3,CALM2,... combined
##  3:     TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,...      CDS
##  4:  TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,...      CDS
##                                      Genes_CDS Genes_non_coding
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...               NA
##  2:                                         NA               NA
##  3:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...               NA
##  4:      TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,...               NA
##  [ reached getOption("max.print") -- omitted 26 rows ]

Gene Cutoff for Considering in Pathway Enrichment Analysis

In addition to removing any genes not found in the background, ActivePathways also removes any genes that have a p-value higher than the cutoff parameter after merging the scores. This threshold represents the maximum p-value for a gene to be considered of interest in our analysis. The cutoff is 0.1 by default.

P-value Adjustment for Pathway Enrichment Analysis

To avoid inflated p-values that arise from multiple testing, p-values are adjusted after running enrichment analysis on each pathway. ActivePathways uses the base R p.adjust function to adjust p-values. Thus, all statistical methods performed by p.adjust can be used in ActivePathways by altering the correction.method parameter. By default, the holm correction method is used. Setting correction.method='none' will instruct the software to avoid p-value adjustment.

The Results Table

Here are the results from a basic ActivePathways analysis.

res <- ActivePathways(scores, gmt.file)
res
##          term.id                     term.name adjusted.p.val term.size
##  1: REAC:2424491               DAP12 signaling   4.491268e-05       358
##  2:  REAC:422475                 Axon guidance   2.028966e-02       555
##  3:  REAC:177929             Signaling by EGFR   6.245734e-04       366
##                                              overlap       evidence
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##  2:          PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
##  3:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##                              Genes_X3UTR Genes_X5UTR
##  1:                                   NA          NA
##  2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,...          NA
##  3:                                   NA          NA
##                                      Genes_CDS
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##  2:                                         NA
##  3:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##                                  Genes_promCore
##  1:                                          NA
##  2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
##  3:                                          NA
##  [ reached getOption("max.print") -- omitted 17 rows ]

The term.id, term.name, and term.size columns give the identifier, name and number of genes in each pathway. The adjusted.p.val column gives the adjusted p-value for each pathway indicating the confidence that the pathway is enriched. If the return.all option is FALSE, all of these p-values will be smaller than the significance threshold, 0.05, and the table contains only the pathways that were found to be significantly enriched.

The overlap column gives the intersection between the genes annotated to the term and the subset of genes in the merged gene list that returned this p-value.

res$overlap[1:3]
## [[1]]
##  [1] "TP53"   "PIK3CA" "KRAS"   "PTEN"   "BRAF"   "NRAS"   "B2M"   
##  [8] "CALM2"  "CDKN1A" "CDKN1B"
## 
## [[2]]
##  [1] "PIK3CA"  "KRAS"    "BRAF"    "NRAS"    "CALM2"   "RPS6KA3" "ACTB"   
##  [8] "EFNA1"   "SCN2B"   "EPHA2"   "GAP43"   "COL4A1"  "RASAL2"  "CLTC"   
## [15] "IQGAP1"  "NF1"     "FGF9"    "ADAM10"  "PTPRC"   "ITGA10"  "PDGFB"  
## [22] "COL4A2"  "RGMB"    "RASA1"   "FGF6"    "CALM1"   "PSMB7"   "PPP2R5D"
## [29] "PPP2R1A"
## 
## [[3]]
## [1] "TP53"   "PIK3CA" "KRAS"   "PTEN"   "BRAF"   "NRAS"   "CALM2"  "CDKN1A"
## [9] "CDKN1B"

The overlaps are in the same order as the merged gene list, meaning that more significant genes from scores are ranked higher, allowing the user to return to the gene space and identify significant genes in each pathway.

The evidence column gives the names of columns (sources of evidence such as omics datasets) in scores in which a given pathway is found to be enriched. For example, the majority of the pathways in this example have ‘CDS’ as the only evidence, meaning that the pathway is enriched in this column (and in the combined data). Pathway REAC:422475 has as evidence list('X3UTR', 'promCore'), which means that the pathway is found to be enriched when considering the X3UTR column, the promCore column, and the combined data. If a pathway is found to be enriched only in the combined data and not in any individual column, ‘combined’ will be listed as the evidence. If a pathway is not found to be enriched in any individual column or in the combined data (only possible if return.all==TRUE), then the evidence will be ‘none’.

The rest of the columns are named as Genes_{column}, and represent the gene overlaps of each individual column if a pathway is found to be enriched using the p-values from that column. Otherwise, the column is NA.

Writing Results to File

The results are returned as a data.table, but since the overlap column is a list, it can be a challenge to write the table to a file. The usual write.table and write.csv functions will struggle with this column unless the user manually transforms the list into a string. Fortunately, data.table includes a function fwrite to handle such complex structures.

result.file <- paste(system.file('extdata', package='ActivePathways'), 'example_write.csv', sep='/')
data.table::fwrite(res, result.file)
Write results as csv file using data.table::fwrite - first 4 columns shown here
term.id term.name adjusted.p.val term.size
REAC:2424491 DAP12 signaling 0.0000449 358
REAC:422475 Axon guidance 0.0202897 555
REAC:177929 Signaling by EGFR 0.0006246 366

Visualizing the Results in Cytoscape

The Cytoscape program, combined with the EnrichmentMap app, provides a powerful tool to visualize pathway enrichment analysis as a network. ActivePathways can help speed up this process by automatically writing the files that can be uploaded to Cytoscape to create a network. To create these files, a file directory must be supplied to ActivePathways using the parameter cytoscape.file.dir. If the user plans on reanalyzing the same data with changes to the parameters, the parameter reanalyze can be set to TRUE, and subdirectories of the form Version1A, Version1B, etc. will be created in the cytoscape.file.dir to ensure that each version of the analysis is stored.

res <- ActivePathways(scores, gmt.file, cytoscape.file.dir = system.file('extdata', package='ActivePathways'))

Four files are written:

files <- paste(system.file('extdata', package='ActivePathways'), 
               c('pathways.txt', 
                 'subgroups.txt',
                 'pathways.gmt', 
                 'legend.pdf'), 
               sep='/')

cat(paste(readLines(files[1])[1:5], collapse='\n'))
## term.id  term.name   adjusted.p.val
## REAC:2424491 DAP12 signaling 4.49126833230489e-05
## REAC:422475  Axon guidance   0.0202896586319552
## REAC:177929  Signaling by EGFR   0.000624573372270557
## REAC:2559583 Cellular Senescence 6.59544666946083e-05
cat(paste(readLines(files[2])[1:5], collapse='\n'))
## term.id  CDS X3UTR   promCore    combined    instruct
## REAC:2424491 1   0   0   0   piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:422475  0   1   1   0   piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:177929  1   0   0   0   piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
## REAC:2559583 1   0   0   0   piechart: attributelist="CDS,X3UTR,promCore,combined" colorlist="#FF0000FF,#80FF00FF,#00FFFFFF,#8000FFFF" showlabels=FALSE
cat(paste(readLines(files[3])[1], collapse='\n'))
## REAC:2424491 DAP12 signaling IL17RD  PSMC1   PDGFRB  PSMD14  TNRC6C  CD80    DUSP10  SPTBN4  PIP5K1B NRG1    TNRC6A  FGF22   ADCY5   CHUK    PSME2   CUL3    NRG2    PSMB8   PSMA2   PSMB3   PSMD13  AC010132.3  PIK3CB  FGF18   PRKCE   FN1 SHC2    RBX1    PSMD5   FGF19   NF1 ARRB1   FGFR1   PRKACG  FGB PSME1   VAV1    PSMD3   ITGA2B  HBEGF   ERBB3   FGF3    KSR2    PPP2R5C VAV3    PPP2CB  IL5RA   PSMB9   CDKN1B  RASA2   PPP2R5E PIK3AP1 PSMF1   FOXO4   PSMC3   PSMA6   AL358075.4  FGF2    ICOS    IL2RB   BTC ADCY2   RAPGEF2 CDKN1A  DUSP6   ERBB4   PTK2    MET FGA TSC2    NRAS    SPTB    SPRED3  TNRC6B  FOXO3   GSK3A   FGF1    IL3RA   PIP5K1A IL2RA   GFRA4   PSMD1   TEK RICTOR  PSMD10  CASP9   CAMK2G  RASAL3  PPP2R1A PSMB6   FGF6    IL3 MAPKAP1 SYK CNKSR1  RASGRF2 GRIN2C  PSMB7   CD86    ADCY3   TLN1    PSMC4   HRAS    MLST8   VAV2    LAMTOR2 THEM4   DUSP7   RASGRF1 CALM3   RAP1B   PSMC2   KSR1    RAC1    ADCY1   JAK1    SPRED2  PDE1A   FGF10   PSME4   EGF PDE1B   GRIN2A  IRS2    VWF PIK3CD  FGFR4   PHB AKT1    IQGAP1  PTPRA   PSMB1   PRKAR1B KL  PRKCD   PSMD9   PSMB2   EGFR    MAP2K2  PRKAR2B KRAS    CAMK2D  SRC PIK3CA  NRTN    IL2 CAMK2B  KIT CSF2RA  CSK UBC SPTBN1  RASA4   CD19    DUSP2   LAMTOR3 ADCY4   PHLPP2  ARAF    PSMD2   PDGFB   PSMA8   FGFR3   NEFL    TRAT1   MIR26A2 FGF23   AGO3    PTPN11  PSMB10  GAB1    MAPK1   TREM2   PRKAR2A LCK ADCY6   SPRED1  MIR26A1 SPTAN1  PSMB11  PDPK1   ITPR3   KBTBD7  RET PSMD8   FGG GFRA1   AHCYL1  FOXO1   PIK3R1  RASGEF1A    JAK2    B2M FGF20   PIK3R2  FGF4    RASAL1  FGF7    PIP4K2C PDGFA   PRR5    TYROBP  EREG    PPP2R5B GRAP2   SOS1    DUSP8   PRKACA  RASA1   PSMA5   DUSP4   PRKCA   ADCY7   CSF2    DUSP16  PHLPP1  CAMK4   KLB GDNF    AKT2    CREB1   PPP2CA  FGF8    FGF9    ITPR2   BAD PAQR3   SYNGAP1 APBB1IP SEM1    RPS6KB2 AKT1S1  PPP5C   PLCG1   PSMA7   SHC1    ARTN    PRKCG   PSMA3   KITLG   GRK2    AKAP9   ANGPT1  FGF17   MTOR    PDE1C   GRIN2B  NR4A1   ITGB3   PSMC5   AGO1    KLRD1   BRAF    AKT3    PSMD4   PIP4K2A TRIB3   RASA3   MDM2    PSMA4   SPTBN2  DAB2IP  DUSP5   PSMB4   PSMA1   MAPK3   KLRK1   GSK3B   PDGFRA  CAMK2A  VCL GFRA2   ITPR1   KLRC2   LCP2    ERBB2   GRIN1   GRB2    LAT RAF1    SPTA1   RANBP9  PIP4K2B FRS3    INS-IGF2    RASGRP3 SHC3    IER3    INSR    BRAP    PEBP1   CD28    GRIN2D  FYN YWHAB   IL5 AGO4    UBB FGF16   IL2RG   FGF5    PSMC6   IRS1    PPP2R1B ARRB2   MARK3   BTK PLCG2   RAP1A   PEA15   PSMD6   ADCY8   JAK3    SPTBN5  ACTN2   PPP2R5A MAP3K11 DUSP1   DUSP9   RPS27A  CSF2RB  HLA-E   AL672043.1  RASAL2  CALM1   PIK3R3  FRS2    PRKAR1A ADCY9   FGFR2   DLG4    PPP2R5D CNKSR2  NCAM1   RASGRP4 PSMD12  PSMB5   TP53    WDR83   NRG4    PRKACB  MAP2K1  MOV10   PIP5K1C NRG3    CALM2   PTEN    UBA52   PSMD7   RNASE1  PSPN    GFRA3   PSMD11  INS RASGRP1 AGO2    HGF PSME3   

To learn how to use these files to create an enrichment map, take a look at the enrichmentMap-vignette included in this package.