# pathfindR - An R Package for Pathway Enrichment Analysis Utilizing Active Subnetworks

#### 2018-11-20

pathfindR is an R package for pathway enrichment analysis of gene-level omics data utilizing active subnetworks. The package also enables hierarchical clustering of the enriched pathways. The method is described in detail in Ulgen E, Ozisik O, Sezerman OU. 2018. pathfindR: An R Package for Pathway Enrichment Analysis Utilizing Active Subnetworks. bioRxiv. https://doi.org/10.1101/272450

Our motivation to develop this package was that direct pathway enrichment analysis of differential RNA/protein expression or DNA methylation results may not provide the researcher with the full picture. That is, pathway enrichment of only the list of significant genes may not be informative enough to explain the underlying disease mechanisms.

An active subnetwork is defined as a group of interconnected genes in a protein-protein interaction network (PIN) that contains most of the significant genes. These active subnetworks thus define distinct disease-associated sets of genes, whether a gene was discovered through differential expression analysis or through its interaction with a significant gene.

Therefore, we propose to leverage information from a PIN to identify distinct active subnetworks and then perform pathway enrichment analyses on these subnetworks. Briefly, this workflow first maps the significant genes onto a PIN and finds active subnetworks. Next, pathway enrichment analyses are performed using each gene set of the identified active subnetworks. Finally, these enrichment results are summarized and returned as a data frame. This workflow is implemented as the function run_pathfindR() and further described in the “Enrichment Workflow” section of this vignette.

This process usually yields a large number of enriched pathways with related biological functions. We therefore implemented a pairwise distance metric between terms based on kappa statistics (as proposed by Huang et al. [1]) and, using this distance metric, hierarchical clustering and fuzzy clustering [2] of the pathways. Details of clustering and partitioning of pathways are presented in the “Pathway Clustering” section of this vignette.
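
As a rough illustration of the kappa-based metric, the sketch below computes Cohen's kappa between two terms from their gene memberships. The helper `kappa_between_terms()`, the toy gene lists, the choice of gene universe (all genes appearing in any term) and the conversion to a distance as 1 - kappa are assumptions made for illustration; they are not taken from the package source.

```r
# Cohen's kappa between two gene sets over a shared gene universe.
# NOTE: `kappa_between_terms` is a hypothetical helper, not a pathfindR function.
kappa_between_terms <- function(genes_a, genes_b, universe) {
  in_a <- universe %in% genes_a
  in_b <- universe %in% genes_b
  n <- length(universe)
  # observed agreement: genes that are in both terms or in neither
  observed <- (sum(in_a & in_b) + sum(!in_a & !in_b)) / n
  # agreement expected by chance, from the marginal frequencies
  expected <- (sum(in_a) * sum(in_b) + sum(!in_a) * sum(!in_b)) / n^2
  (observed - expected) / (1 - expected)
}

# Toy terms (illustrative gene lists only)
universe <- c("NDUFA1", "NDUFB3", "UQCRQ", "COX6A1",
              "VDAC1", "STX6", "SNAP23", "RPA1")
term_1 <- c("NDUFA1", "NDUFB3", "UQCRQ", "COX6A1")
term_2 <- c("NDUFA1", "NDUFB3", "UQCRQ", "VDAC1")
term_3 <- c("STX6", "SNAP23")

kappa_12 <- kappa_between_terms(term_1, term_2, universe)  # overlapping terms
kappa_13 <- kappa_between_terms(term_1, term_3, universe)  # disjoint terms

# Assumption: a distance can be derived as 1 - kappa
dist_12 <- 1 - kappa_12
dist_13 <- 1 - kappa_13
```

Terms that share many genes receive a high kappa and therefore a small distance, so they tend to fall into the same cluster.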

# Enrichment Workflow

The overview of the enrichment workflow is presented in the figure below:

For this workflow, the wrapper function run_pathfindR() can be used. This function takes in a data frame consisting of Gene Symbol, log-fold-change (optional) and adjusted-p values. The first 6 rows of an example input dataset (of rheumatoid arthritis differential-expression) can be found below:

suppressPackageStartupMessages(library(pathfindR))
data("RA_input")
knitr::kable(head(RA_input))
ILMN_1755092 FAM110A -0.6939359 0.0000034
ILMN_1730628 RNASE2 1.3535040 0.0000101
ILMN_1729801 S100A8 1.5448338 0.0000347
ILMN_1714991 S100A9 1.0280904 0.0002263
ILMN_1762037 TEX261 -0.3235994 0.0002263
ILMN_1718610 ARHGAP17 -0.6919330 0.0002708

Executing the workflow is straightforward (but takes several minutes):

RA_output <- run_pathfindR(RA_input)

The user may want to change certain arguments of the function:

# to change the output directory
RA_output <- run_pathfindR(RA_input, output_dir = "new_directory")

# to change the PIN (default = Biogrid)
RA_output <- run_pathfindR(RA_input, pin_name = "IntAct")
# to use an external PIN of user's choice
RA_output <- run_pathfindR(RA_input, pin_name = "/path/to/myPIN.sif")

# available gene sets are KEGG, Reactome, BioCarta, GO-BP, GO-CC and GO-MF
# default is KEGG
# to change the gene sets used for enrichment analysis
RA_output <- run_pathfindR(RA_input, gene_sets = "BioCarta")

# to change the active subnetwork search algorithm (default = "GR", i.e. greedy algorithm)
# for simulated annealing:
RA_output <- run_pathfindR(RA_input, search_method = "SA")

# to change the number of iterations (default = 10)
RA_output <- run_pathfindR(RA_input, iterations = 5)

# to manually specify the number of processes used during the parallel loop by foreach
# defaults to the number of detected cores
RA_output <- run_pathfindR(RA_input, n_processes = 2)

# to report the non-DEG active subnetwork genes
RA_output <- run_pathfindR(RA_input, list_active_snw_genes = TRUE)

For a full list of arguments, see ?run_pathfindR.

The workflow consists of the following steps:

After input testing, the program attempts to convert any gene symbol that is not in the PIN to an alias symbol that is in the PIN. Next, active subnetwork search is performed via the selected algorithm. The available algorithms for active subnetwork search are:

• Greedy Algorithm (based on Ideker et al. [3]),
• Simulated Annealing Algorithm (based on Ideker et al. [4]) and
• Genetic Algorithm (based on Ozisik et al. [5]).

Next, pathway enrichment analyses are performed using the genes in each of the active subnetworks. For this, up-to-date information on human gene sets from KEGG, Reactome, BioCarta and Gene Ontology was retrieved and is available for use within the package. The user may specify custom gene sets, including gene sets for non-human organisms, as described in the section “pathfindR Analysis with Custom Gene Sets”.

During enrichment analyses, pathways with adjusted-p values larger than the enrichment_threshold (an argument of run_pathfindR(), defaults to 0.05) are discarded. The results of enrichment analyses over all active subnetworks are combined by keeping only the lowest adjusted-p value for each pathway.
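
The filtering and combining step described above can be sketched in base R as follows. The data frame, its column names and all values are made up for illustration and do not reflect the package's internal data structures.

```r
# Toy enrichment results from two active subnetworks (made-up values)
enrichment <- data.frame(
  subnetwork = c(1, 1, 2, 2),
  ID         = c("hsa00190", "hsa03040", "hsa00190", "hsa03010"),
  adj_p      = c(0.001, 0.20, 0.0004, 0.03),
  stringsAsFactors = FALSE
)
enrichment_threshold <- 0.05

# Discard pathways whose adjusted p value is larger than the threshold
kept <- enrichment[enrichment$adj_p <= enrichment_threshold, ]

# Combine over subnetworks: keep only the lowest adjusted p value per pathway
combined <- aggregate(adj_p ~ ID, data = kept, FUN = min)
```

In this toy example, hsa03040 is dropped by the threshold, and hsa00190, which is enriched in both subnetworks, is reported once with its lower adjusted p value.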

This process of active subnetwork search and enrichment analyses is repeated for a selected number of iterations (indicated by the iterations argument of run_pathfindR()), which is performed in parallel via the R package foreach.

The wrapper function returns a data frame that contains the lowest and the highest adjusted-p values for each enriched pathway, as well as the number of times each pathway is encountered over all iterations. The first two rows of the example output of the pathfindR-enrichment workflow (performed on the rheumatoid arthritis data, RA_output) are shown below:

data("RA_output")
knitr::kable(head(RA_output, 2))
ID Pathway Fold_Enrichment occurrence lowest_p highest_p Up_regulated Down_regulated
hsa00190 Oxidative phosphorylation 71.86252 10 3e-07 3e-07 NDUFB3, NDUFA1, COX7C, COX7A2, UQCRQ, COX6A1, ATP6V0E1, ATP6V1D ATP6V0E2
hsa05012 Parkinson’s disease 63.72714 10 4e-07 4e-07 NDUFA1, NDUFB3, UQCRQ, COX6A1, COX7A2, COX7C SLC25A5, VDAC1, UBE2G1
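
One way to think about the occurrence, lowest_p and highest_p columns is as a per-pathway summary over iterations. The sketch below is an assumption-laden illustration in base R; the `per_iteration` data frame and its values are hypothetical, and the package may compute the summary differently.

```r
# Toy per-iteration results (made-up values): one row per pathway per iteration
per_iteration <- data.frame(
  iteration = c(1, 1, 2, 3, 3),
  ID        = c("hsa00190", "hsa03010", "hsa00190", "hsa00190", "hsa03010"),
  adj_p     = c(3e-07, 6.3e-06, 3e-07, 3e-07, 1.1e-06),
  stringsAsFactors = FALSE
)

# Group the adjusted p values by pathway ID
by_pathway <- split(per_iteration$adj_p, per_iteration$ID)

# Summarize each pathway over all iterations
summary_df <- data.frame(
  ID         = names(by_pathway),
  occurrence = vapply(by_pathway, length, integer(1)),
  lowest_p   = vapply(by_pathway, min, numeric(1)),
  highest_p  = vapply(by_pathway, max, numeric(1)),
  row.names  = NULL,
  stringsAsFactors = FALSE
)

# Order by the lowest adjusted p value
summary_df <- summary_df[order(summary_df$lowest_p), ]
```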

The function also creates an HTML report, results.html, that is saved in a directory under the current working directory. This directory is named pathfindr_Results by default; the name can be changed via the argument output_dir. The report contains links to two other HTML files:

## 1. all_pathways.html

This document contains a table of the active subnetwork-oriented pathway enrichment results. Each enriched pathway name is linked to the visualization of that pathway, with the gene nodes colored according to their log-fold-change values. This table contains the same information as the returned data frame. Columns are:

• ID: KEGG ID of the enriched pathway
• Pathway: description of the pathway
• Fold_Enrichment: fold enrichment value for the pathway
• occurrence: the number of times the pathway was found to be enriched over all iterations
• lowest_p: the lowest adjusted-p value of the pathway over all iterations
• highest_p: the highest adjusted-p value of the pathway over all iterations
• Up_regulated: the up-regulated genes involved in the pathway
• Down_regulated: the down-regulated genes involved in the pathway

## 2. genes_table.html

This document contains a table of converted gene symbols. Columns are:

• Old Symbol: the original gene symbol
• Converted Symbol: the alias symbol that was found in the PIN
• Change: the provided change value
• p-value: the provided adjusted p value

The document contains a second table of genes for which no interactions were identified (after checking for alias symbols).

# Pathway Clustering

For this workflow, the wrapper function cluster_pathways() is used. This function first calculates the pairwise kappa statistics between the terms in the input data frame. By default, the function performs hierarchical clustering of the terms using this kappa matrix, automatically determines the optimal number of clusters by maximizing the average silhouette width and returns a data frame with cluster assignments:

data("RA_output")
RA_clustered <- cluster_pathways(RA_output)
#> The maximum average silhouette width was 20.35 for k = 8
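
The default behaviour described above (hierarchical clustering on a kappa matrix, with the number of clusters chosen by maximizing the average silhouette width) can be approximated with base R and the `cluster` package. This is only a sketch: the toy kappa matrix, the conversion to a distance as 1 - kappa and the candidate range for k are assumptions, not the internals of cluster_pathways().

```r
library(cluster)  # provides silhouette()

# Toy kappa matrix for five terms (made-up values; symmetric, 1 on the diagonal)
terms <- paste0("term_", 1:5)
kappa_mat <- matrix(
  c(1.0, 0.8, 0.7, 0.1, 0.0,
    0.8, 1.0, 0.6, 0.0, 0.1,
    0.7, 0.6, 1.0, 0.1, 0.0,
    0.1, 0.0, 0.1, 1.0, 0.9,
    0.0, 0.1, 0.0, 0.9, 1.0),
  nrow = 5, dimnames = list(terms, terms)
)

# Assumption: similarity is turned into a distance as 1 - kappa
term_dist <- as.dist(1 - kappa_mat)
tree <- hclust(term_dist, method = "average")

# Average silhouette width for each candidate number of clusters
candidate_k <- 2:4
avg_sil <- vapply(candidate_k, function(k) {
  sil <- silhouette(cutree(tree, k = k), term_dist)
  mean(sil[, "sil_width"])
}, numeric(1))

# Pick the k with the largest average silhouette width and cut the tree there
best_k <- candidate_k[which.max(avg_sil)]
clusters <- cutree(tree, k = best_k)
```

With these toy values, the first three terms and the last two terms form two well-separated groups, so the two-cluster solution has the largest average silhouette width.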

## First 2 rows of clustered terms data frame
knitr::kable(head(RA_clustered, 2))
ID Pathway Fold_Enrichment occurrence lowest_p highest_p Up_regulated Down_regulated Cluster Status
hsa00190 Oxidative phosphorylation 71.86252 10 3e-07 3e-07 NDUFB3, NDUFA1, COX7C, COX7A2, UQCRQ, COX6A1, ATP6V0E1, ATP6V1D ATP6V0E2 1 Representative
hsa05012 Parkinson’s disease 63.72714 10 4e-07 4e-07 NDUFA1, NDUFB3, UQCRQ, COX6A1, COX7A2, COX7C SLC25A5, VDAC1, UBE2G1 1 Member
## The 8 representative terms
knitr::kable(RA_clustered[RA_clustered$Status == "Representative", ])
ID Pathway Fold_Enrichment occurrence lowest_p highest_p Up_regulated Down_regulated Cluster Status
1 hsa00190 Oxidative phosphorylation 71.86252 10 0.0000003 0.0000003 NDUFB3, NDUFA1, COX7C, COX7A2, UQCRQ, COX6A1, ATP6V0E1, ATP6V1D ATP6V0E2 1 Representative
3 hsa03040 Spliceosome 49.39033 10 0.0000005 0.0000005 SF3B6, LSM3, BUD31 SNRPB, SF3B2, U2AF2, PUF60, HNRNPA1, PCBP1, SRSF5, SRSF8, SNU13, DDX23, EIF4A3 2 Representative
5 hsa03010 Ribosome 39.02933 10 0.0000011 0.0000063 RPS24, RPL26, RPL39, RPL31, MRPL33, MRPS18C RPLP2 3 Representative
9 hsa04064 NF-kappa B signaling pathway 67.75926 10 0.0000025 0.0000025 LY96 IKBKB, PRKCQ, CARD11, TICAM1, CSNK2A2, PARP1, UBE2I 4 Representative
10 hsa04714 Thermogenesis 39.27370 10 0.0000069 0.0000069 COX6A1, COX7A2, COX7C, NDUFA1, NDUFB3, UQCRQ CREB1, ADCY7, ACTB, ACTG1, SMARCA4, ARID1A, KDM1A, MTOR 5 Representative
15 hsa04130 SNARE interactions in vesicular transport 79.54348 10 0.0000664 0.0000664 STX10, STX6 BET1L, SNAP23, STX2 6 Representative
38 hsa03050 Proteasome 85.09302 10 0.0018696 0.0018696 PSMD7, PSMB10 7 Representative
48 hsa03420 Nucleotide excision repair 59.65761 10 0.0053843 0.0053843 GTF2H5, POLE4 POLD2, RPA1, XPC 8 Representative

# to display the heatmap of kappa statistics
RA_clustered <- cluster_pathways(RA_output, plot_hmap = TRUE, plot_clusters_graph = FALSE)
#> The maximum average silhouette width was 20.35 for k = 8

# to display the dendrogram and optimal clusters
RA_clustered <- cluster_pathways(RA_output, plot_dend = TRUE, plot_clusters_graph = FALSE)
#> The maximum average silhouette width was 20.35 for k = 8

# to change agglomeration method (default = "average")
RA_clustered <- cluster_pathways(RA_output, hclu_method = "centroid")
#> The maximum average silhouette width was 6.9 for k = 7

Alternatively, the fuzzy clustering method (as described by Huang et al. [6]) can be used:

RA_clustered <- cluster_pathways(RA_output, method = "fuzzy")

# Pathway Scores per Sample

The function calculate_pw_scores() can be used to calculate the pathway scores per sample. This allows the user to individually examine the scores and infer whether a pathway is activated or repressed in a given sample.

For a set of pathways $P = \{P_1, P_2, \ldots, P_k\}$, where each $P_i$ contains a set of genes, i.e. $P_i = \{g_1, g_2, \ldots\}$, the pathway score matrix $PS$ is defined as:

$$PS_{p,s} = \frac{1}{|P_p|} \sum_{g \in P_p} GS_{g,s}$$

for each pathway $p$ and for each sample $s$, where $|P_p|$ is the number of genes in pathway $p$.

$GS$ is the gene score per sample matrix and is defined as:

$$GS_{g,s} = \frac{EM_{g,s} - \bar{x}_g}{sd_g}$$

where $EM$ is the expression matrix (columns are samples, rows are genes), $\bar{x}_g$ is the mean expression value of the gene and $sd_g$ is the standard deviation of the expression values for the gene.

An example application is provided below:

## Pathway data frame
pws_table <- pathfindR::RA_clustered
# selecting "Representative" pathways for clear visualization
pws_table <- pws_table[pws_table$Status == "Representative", ]

## Expression matrix
exp_mat <- pathfindR::RA_exp_mat

## Vector of "Case" IDs
cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708",
           "GSM389711", "GSM389714", "GSM389716", "GSM389717",
           "GSM389719", "GSM389721", "GSM389722", "GSM389724",
           "GSM389726", "GSM389727", "GSM389730", "GSM389731",
           "GSM389733", "GSM389735")

## Calculate pathway scores and plot heatmap
score_matrix <- calculate_pw_scores(pws_table, exp_mat, cases)
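
To make the score definitions in this section concrete, the sketch below recomputes them by hand on a toy matrix. It reads the pathway score as the mean gene score over the genes of a pathway. The objects `toy_exp_mat` and `toy_pathways` are hypothetical, and the code is not the implementation used by calculate_pw_scores().

```r
# Toy expression matrix (made-up values): rows are genes, columns are samples
toy_exp_mat <- matrix(
  c(1, 2, 3,
    4, 6, 8,
    5, 5, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3"))
)

# Gene score matrix GS: each gene is centered by its mean and divided by
# its standard deviation across samples (scale() works on columns, hence t())
gene_scores <- t(scale(t(toy_exp_mat)))

# Hypothetical pathways, each a set of genes
toy_pathways <- list(P1 = c("g1", "g2"), P2 = c("g2", "g3"))

# Pathway score matrix PS: mean gene score per pathway (rows) and sample (columns)
pathway_scores <- t(vapply(
  toy_pathways,
  function(genes) colMeans(gene_scores[genes, , drop = FALSE]),
  numeric(ncol(toy_exp_mat))
))
```

Under this reading, a positive score suggests that the genes of a pathway are, on average, expressed above their mean in that sample, and a negative score suggests the opposite.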