RCPA: an open-source R package for data processing, differential analysis, consensus pathway analysis, and visualization

Phi Hung Bya, Zeynab Maghsoudi, Ha Nguyen, Bang Tran, Sorin Draghici, and Tin Nguyen

2024-07-03

Introduction

RCPA is a software package designed for for Consensus Pathway Analysis (RCPA) that im- plements a complete analysis pipeline including: i) download and process data from NCBI Gene Expression Omnibus, ii) perform differential analysis using techniques developed for both microarray and sequencing data, iii) perform pathway analysis using different methods for enrichment analysis and topology-based (TB) analysis, iv) perform meta-analysis and combine methods and datasets to find consensus results, and v) visualize analysis results and ex- plore significantly impacted pathways across multiple analyses. The package supports the analysis of more than 1,000 species, three differential analysis techniques, eight pathway methods, and two pathway databases (GO and KEGG).

This vignette serves as an introductory guide to RCPA’s essential usage and provides users with a solid understanding of its capabilities. Through practical case studies utilizing gene expression data sourced from NCBI GEO, we demonstrate the practical application of RCPA. By the end, users will have a clearer picture of the available functions, their relevance, and a starting point for seeking assistance as needed.

After installation, users can load the package as follows:

unloadNamespace("RCPA")
library(RCPA)
library(SummarizedExperiment)
library(ggplot2)
library(gridExtra)

Module 1: Data processing

Processing Affymetrix and Agilent microarray data

RCPA provides the downloadGEO function for downloading, processAffymetrix and processAgilent for processing high-throughput gene expression data from NCBI GEO repository supporting both Affymetrix and Agilent technologies. Users then can save the processed data in a SummarizedExperiment object. The SummarizedExperiment object is structured in a way that captures the various dimensions and metadata associated with the data, making it an ideal choice for storing and manipulating complex datasets. The assays attribute of the resulted object contains gene expression values, and colData attribute stores sample-level metadata, such as clinical information, and experimental conditions.

Download and process GSE5281 with affymetrix protocol:

In this example, we demonstrate the use of function downloadGEO for downloading an Affymetrix dataset from GEO database (accession ID: GSE5181). Users can run the following code to get the data:

# User-defined directory to save the downloaded data
downloadPath <- file.path(getwd(), "GSE5281")
# Create the directory if it does not exist
if(!dir.exists(downloadPath)) dir.create(downloadPath)
# download the data
downloadedFiles <- RCPA::downloadGEO(GEOID = "GSE5281", platform = "GPL570", protocol = "affymetrix", destDir = downloadPath)

After executing the code, by running head(downloadedFiles) command to see the output. The function downloadGEO returns a list of downloaded files, which is assigned to the downloadedFiles variable. The first element of the list is the metadata file, and the remaining elements are the CEL files.

[1] "metadata.csv"     "GSM119615.CEL.gz" "GSM119616.CEL.gz" "GSM119617.CEL.gz"
[5] "GSM119618.CEL.gz" "GSM119619.CEL.gz"

We can then process the downloaded files as follow:

# read the metadata file
affySampleInfo <- read.csv(file.path(downloadPath, "metadata.csv"))

# read the CEL files
affyExprs <- RCPA::processAffymetrix(dir = downloadPath, samples = affySampleInfo$geo_accession)

# create the SummarizedExperiment object
affyDataset <- SummarizedExperiment::SummarizedExperiment(assays = affyExprs, colData = affySampleInfo)

By running head(affyDataset) command, users can expect to get the below output:

class: SummarizedExperiment 
dim: 6 161 
metadata(0):
assays(1): ''
rownames(6): 1007_s_at 1053_at ... 1255_g_at 1294_at
rowData names(0):
colnames(161): GSM119615 GSM119616 ... GSM238955 GSM238963
colData names(70): title geo_accession ... sex.ch1 Sex.ch1

We can access to the assay data and the sample data stored in the SummarizedExperiment object using the functions from SummarizedExperiment namely assay and colData:

# Access to assay data
affyExprs <- SummarizedExperiment::assay(affyDataset)
# Access to sample information
affySampleInfo <- SummarizedExperiment::colData(affyDataset)

After executing the above code, by running head(affyExprs, c(5,6)) and head(affySampleInfo, c(3, 10)) commands, users can expect to get the below outputs:

> head(affyExprs, c(5,6))
          GSM119615 GSM119616 GSM119617 GSM119618 GSM119619 GSM119620
1007_s_at  3.043234  3.055157  3.144277  3.150378  3.084336  2.989966
1053_at    1.698974  1.645050  1.618537  1.589216  1.676278  1.581733
117_at     1.795751  1.770719  1.805597  1.995794  1.688068  1.961556
121_at     2.553174  2.668456  2.801450  2.784216  2.588638  2.692849
1255_g_at  1.626691  1.940055  1.663162  1.483875  2.354096  1.671216

> head(affySampleInfo, c(3, 10))
DataFrame with 3 rows and 10 columns
                 title geo_accession                status submission_date last_update_date
           <character>   <character>           <character>     <character>      <character>
GSM119615 EC control 1     GSM119615 Public on Jul 10 2006     Jul 10 2006      Jun 26 2019
GSM119616 EC control 2     GSM119616 Public on Jul 10 2006     Jul 10 2006      Jun 26 2019
GSM119617 EC control 3     GSM119617 Public on Jul 10 2006     Jul 10 2006      Jun 26 2019
                 type channel_count        source_name_ch1 organism_ch1  characteristics_ch1
          <character>   <character>            <character>  <character>          <character>
GSM119615         RNA             1 brain, Entorhinal Co.. Homo sapiens Sample Amount: 10 ug
GSM119616         RNA             1 brain, Entorhinal Co.. Homo sapiens Sample Amount: 10 ug
GSM119617         RNA             1 brain, Entorhinal Co.. Homo sapiens Sample Amount: 10 ug

Download and process GSE61196 with agilent technology:

In this example, we use the function downloadGEO to download an Agilent dataset from GEO database (accession ID: GSE61196):

# User-defined directory to save the downloaded data
downloadPath <- file.path(getwd(), "GSE61196")
# Create the directory if it does not exist
if(!dir.exists(downloadPath)) dir.create(downloadPath)
# download the data
downloadedFiles <- RCPA::downloadGEO(GEOID = "GSE61196", platform = "GPL4133", protocol = "agilent", destDir = downloadPath)

In the above code, the function downloadGEO requires users to specify which color channel should be used when downloading an Agilent dataset. When greenOnly = FALSE, it tells the function that both red and green channels required. Similarly, we run head(downloadedFiles) command, and get the below output:

> head(downloadedFiles)
[1] "metadata.csv"      "GSM1499379.TXT.gz" "GSM1499380.TXT.gz"
[4] "GSM1499381.TXT.gz" "GSM1499382.TXT.gz" "GSM1499383.TXT.gz"

Next, we process the downloaded files as follows:

# read the metadata file
agilSampleInfo <- read.csv(file.path(downloadPath, "metadata.csv"))

# read the TXT files
agilExprs <- processAgilent(dir = downloadPath, samples = agilSampleInfo$geo_accession, greenOnly = FALSE)

# create the SummarizedExperiment object
agilDataset <- SummarizedExperiment::SummarizedExperiment(assays = agilExprs, colData = agilSampleInfo)

We can then run the print(agilDataset) to display the SummarizedExperiment object:

class: SummarizedExperiment
dim: 6 21
metadata(0):
assays(1): ''
rownames(6): GE_BrightCorner DarkCorner ... DarkCorner DarkCorner
rowData names(0):
colnames(21): GSM1499379 GSM1499380 ... GSM1499398 GSM1499399
colData names(48): title geo_accession ... tissue.ch1 tissue.ch2

We can access to the assay data and the sample data stored in agilDataset variable as follows:

# Access to assay data
agilExprs <- SummarizedExperiment::assay(agilDataset)
# Access to sample information
agilSampleInfo <- SummarizedExperiment::colData(agilDataset)

Next, we run print(agilExprs[9:14, 1:6]) and head(agilSampleInfo, c(3,10)) commands, and get the below outputs:

> print(agilExprs[9:14, 1:6])
             GSM1499379 GSM1499380 GSM1499381 GSM1499382 GSM1499383 GSM1499384
DarkCorner     4.663415   5.054929   5.113313   5.295257   5.424192   4.885411
DarkCorner     4.964227   5.219474   4.962025   5.199360   5.141188   5.020854
DarkCorner     4.898062   5.091960   4.909536   4.909536   5.424192   5.000929
A_24_P66027    8.090737   8.141476   8.163278   8.089728   8.128815   8.096007
A_32_P77178    5.329005   5.219474   5.551633   5.803849   5.725941   5.369755
A_23_P212522  11.178282  11.469014  11.354861  11.288909  11.435052  11.290439

> head(agilSampleInfo, c(3, 10))
DataFrame with 3 rows and 10 columns
                            title geo_accession                status submission_date
                      <character>   <character>           <character>     <character>
GSM1499379 human CPE Braak-0 do..    GSM1499379 Public on Dec 31 2014     Sep 08 2014
GSM1499380 human CPE Braak-0 do..    GSM1499380 Public on Dec 31 2014     Sep 08 2014
GSM1499381 human CPE Braak-0 do..    GSM1499381 Public on Dec 31 2014     Sep 08 2014
           last_update_date        type channel_count        source_name_ch1 organism_ch1
                <character> <character>   <character>            <character>  <character>
GSM1499379      Dec 31 2014         RNA             2 healthy human choroi.. Homo sapiens
GSM1499380      Dec 31 2014         RNA             2 healthy human choroi.. Homo sapiens
GSM1499381      Dec 31 2014         RNA             2 healthy human choroi.. Homo sapiens
              characteristics_ch1
                      <character>
GSM1499379 tissue: choroid plex..
GSM1499380 tissue: choroid plex..
GSM1499381 tissue: choroid plex..

Download and process RNA-Seq read counts

One can also download read counts data from GEO and create the SummarizedExperiment object to be used in later stages. Here, we download read counts data for GSE153873 from GEO:

# Specify the GEO accession ID
GEOID <- "GSE153873"
# Create a download path
downloadPath <- getwd()
if(!dir.exists(downloadPath)) dir.create(downloadPath)
# Download the data
GEOquery::getGEOSuppFiles(GEOID, fetch_files = TRUE, baseDir = downloadPath)

Users can check the files in the download path using list.files(downloadPath) command and get the following result:

> list.files(downloadPath)
[1] "GSE153873" 

As we can see, when running the GEOquery::getGEOSuppFiles() function, it will automatically create a folder named GSE153873 in the user-defined download folder and save all the downloaded data in that folder. We can see the files name in this newly-created folder by running list.files(file.path(downloadPath, GEOID)) and get the following output:

> list.files(file.path(downloadPath, GEOID))
[1] "GSE153873_AD.vs.Old_diff.genes.xlsx" "GSE153873_summary_count.ercc.txt.gz"
[3] "GSE153873_summary_count.star.txt.gz"

Accordingly, the read count is saved under the name ``GSE153873_summary_count.star.txt.gz”. We can extract this file and obtain the assay data by running the following code snippet:

# Specify the path to the downloaded read counts file
countsFile <- file.path(downloadPath, GEOID, "GSE153873_summary_count.star.txt.gz")
# Read the downloaded file
countsData <- read.table(countsFile, header = TRUE, sep = "\t", fill = 0, row.names = 1, check.names = FALSE)

By running head(countsData, c(5,6)), we can see some pieces of the data obtained, which is a dataframe in which rows are genes and columns are samples as follow:

> head(countsData, c(5,6))
       20-1T-AD 13-11T-Old 15-13T-Old 16-14T-Old 3-17T-Young 5-18T-Young
SGIP1      1405       1405       1169       2408         859        1164
NECAP2      295        460        334        347         617         585
AZIN2       356        306        385        507         787         751
AGBL4       191        200        173        323          36          89
CLIC4       876       1443        639        792        4806        5968

In addition to assay data, we also need to download the metadata for the samples. The function to download is getGEO from the GEOquery package. We have to specify the ID of the dataset that we want.

# Download the GEO object to get metadata
GEOObject <- GEOquery::getGEO(GEOID, GSEMatrix = T, getGPL = T, destdir = downloadPath)
# Check the length of GEOObject
print(length(GEOObject))
# Extract the dataset from the GEOObject
samplesData <- GEOObject[[1]]
# Export sample data
metadata <- Biobase::pData(samplesData)

Some datasets on GEO may be derived from different microarray platforms. Therefore, the object GEOObject can be a list of different datasets. We can find out how many were used by checking the length of the GEOObject object. Usually there will only be one platform and the dataset we want to analyse will be the first object in the list (GEOObject[[1]]). Moreover, data submitted to GEO contain sample labels assigned by the experimenters, and some information about the processing protocol. All these data can be extracted by the Biobase::pData() function. By running head(metadata, c(5,6)), users can expect to see the following output:

> head(metadata, c(5,6))
                 title geo_accession                status submission_date last_update_date
GSM4656348  13-11T-Old    GSM4656348 Public on Jul 07 2020     Jul 06 2020      Oct 13 2020
GSM4656349  15-13T-Old    GSM4656349 Public on Jul 07 2020     Jul 06 2020      Oct 13 2020
GSM4656350    20-1T-AD    GSM4656350 Public on Jul 07 2020     Jul 06 2020      Oct 13 2020
GSM4656351  16-14T-Old    GSM4656351 Public on Jul 07 2020     Jul 06 2020      Oct 13 2020
GSM4656352 3-17T-Young    GSM4656352 Public on Jul 07 2020     Jul 06 2020      Oct 13 2020
           type
GSM4656348  SRA
GSM4656349  SRA
GSM4656350  SRA
GSM4656351  SRA
GSM4656352  SRA

As we can see, title column of metadata contains the value of the columns labels of countsData. To be consistent with other examples in which the sample IDs are in the standard GEO accession ID form (i.e., GSMxxxx), we will update the columns label of count data as follows:

# Get the sample IDs in the GEO accession ID form
sampleIDs <- rownames(metadata)
# Get the sample titles
sampleTitles <- metadata[, "title"]
# Reorder the columns in the assay data
countsData <- countsData[,sampleTitles]
# Assign sample IDs to columns' labels
colnames(countsData) <- sampleIDs

By running head(countsData, c(5,6)), the new countsData is as follows:

> head(countsData, c(5,6))
       GSM4656348 GSM4656349 GSM4656350 GSM4656351 GSM4656352 GSM4656353
SGIP1        1405       1169       1405       2408        859       1164
NECAP2        460        334        295        347        617        585
AZIN2         306        385        356        507        787        751
AGBL4         200        173        191        323         36         89
CLIC4        1443        639        876        792       4806       5968

Finally, we can create the SummarizedExperiment to stored the assay and sample data as follows:

# Create the SummarizedExperiment object
RNASeqDataset <- SummarizedExperiment::SummarizedExperiment(
  assays = as.matrix(countsData),
  colData = metadata
)

The summary of created RNASeq object after running head(RNASeqDataset) command can be explored, as:

class: SummarizedExperiment
dim: 6 30
metadata(0):
assays(1): ''
rownames(6): SGIP1 NECAP2 ... CLIC4 SLC45A1
rowData names(0):
colnames(30): GSM4656348 GSM4656349 ... GSM4656376 GSM4656377
colData names(38): title geo_accession ... disease state:ch1 tissue:ch1

Additionally, users can create their own SummarizedExperiment object from their customized data with the same procedure.

Module 2 : Differential analysis

To begin with differential analysis (DE analysis), we can call runDEAnalysis function. This function utilizes one of the limma, DESeq2, and edgeR methods to perform the analysis. It is generally recommended to use limma for differential expression analysis of microarray data, or log-scaled RNA-seq data with TPM, FPKM, or RPKM normalization values. For RNA-seq data with raw counts, DESeq2 or edgeR are recommended.

There are two important arguments in the function: design and contrast. The design matrix defines the statistical model to be fitted in the analysis. The contrast matrix defines the comparisons to be performed in the analysis.

DE analysis using an Affymetrix dataset

To start, first, we add additional sample information to each of the SummarizedExperiment objects obtained from Stage 1. This completely depends on the experimental design and needs to be specified based on research question. This information is used to define the design matrix for differential expression analysis. For GSE5281 dataset, the samples have two conditions: normal and Alzheimer’s, which are defined in the characteristics_ch1.8 column. Each sample is also associated with a brain region, which is defined in the characteristics_ch1.4 column.

# GSE5281
# Add a column specifying the condition of the sample, which can be either normal or alzheimer
affySampleInfo$condition <- ifelse(grepl("normal", affySampleInfo$characteristics_ch1.8), "normal", "alzheimer")
# Factorize the newly added column
affySampleInfo$condition <- factor(affySampleInfo$condition)
# Add the new column specify the region of the sample tissue
# and use make.names to remove special characters
affySampleInfo$region <- make.names(affySampleInfo$characteristics_ch1.4)
# Factorize the newly added column
affySampleInfo$region <- factor(affySampleInfo$region)
# Update the affyDataset object
SummarizedExperiment::colData(affyDataset) <- affySampleInfo

We can check whether the two new columns are added and saved in the affyDataset object by running the head(affyDataset) command line again:

class: SummarizedExperiment 
dim: 6 161 
metadata(0):
assays(1): ''
rownames(6): 1007_s_at 1053_at ... 1255_g_at 1294_at
rowData names(0):
colnames(161): GSM119615 GSM119616 ... GSM238955 GSM238963
colData names(72): title geo_accession ... condition region

As we can see, these two columns condition and region are added. Next, we will create the design and contrast matrices for DE analysis.

# GSE5281
# Create a design matrix
affyDesign <- model.matrix(~0 + condition + region + condition:region, data = SummarizedExperiment::colData(affyDataset))
# Avoid special characters in column names
colnames(affyDesign) <- make.names(colnames(affyDesign))
# Create a constrast matrix
affyContrast <- limma::makeContrasts(conditionalzheimer-conditionnormal, levels=affyDesign)

Here, the formula ~0 + condition + region + condition:region defines the design matrix to include the main effects of condition and region, as well as the interaction between condition and region. The term ~0 indicates that the intercept should not be included in the design matrix since the intercept is not of interest in the analysis. For more information on how to define the design matrix for different scenarios and interests, please refer to the limma user guide. Next, we make use of the limma::makeContrasts function to define the contrast matrix. The formula conditionalzheimer-conditionnormal defines the contrast matrix to compare the alzheimer condition versus the normal condition.

An example of design matrix and contrast matrix is as follow:

> print(affyDesign[1:5, 1:3])
          conditionalzheimer conditionnormal regionorgan.region..hippocampus.
GSM119615                  0               1                                0
GSM119616                  0               1                                0
GSM119617                  0               1                                0
GSM119618                  0               1                                0
GSM119619                  0               1                                0
                                            Contrasts
Levels                                       conditionalzheimer - conditionnormal
  conditionalzheimer                                                            1
  conditionnormal                                                              -1
  regionorgan.region..hippocampus.                                              0
  regionorgan.region..medial.temporal.gyrus.                                    0
  regionorgan.region..posterior.cingulate.                                      0
  regionorgan.region..posterior.singulate.                                      0

Given the design matrix and contrast matrix, we can perform the differential analysis using limma for Affymetrix dataset as follow:

# Run differential expression analysis
affyDEExperiment <- RCPA::runDEAnalysis(affyDataset, method = "limma", design = affyDesign, contrast = affyContrast, annotation = "GPL570")

We can check this variable by running affyDEExperiment in the console:

> affyDEExperiment
class: SummarizedExperiment 
dim: 21367 161 
metadata(4): DEAnalysis.method DEAnalysis.design DEAnalysis.contrast DEAnalysis.mapping
assays(1): counts
rownames(21367): 55101 92840 ... 9695 83887
rowData names(9): PROBEID ID ... sampleSize pFDR
colnames(161): GSM119615 GSM119616 ... GSM238955 GSM238963
colData names(72): title geo_accession ... condition region

The function runDEAnalysis() outputs a SummarizedExperiment object, and this new object extends the input SummarizedExperiment object to include the results of the differential analysis. In essence, the resulting object not only retains the expression data matrix and sample information but also incorporates the differential expression analysis results, which are stored under rowData attribute. As we can see in the R console, the new attribute rowData is added into the SummarizedExperiment object. Similar to the data stored under other attributes, this data can also be accessed using the function rowData() from the SummarizedExperiment package, as follows:

# Extract the differential analysis result
affyDEResults <- SummarizedExperiment::rowData(affyDEExperiment)

Users can display the result in R console by running head(affyDEResults, c(3,5)):

> head(affyDEResults, c(3,5))
DataFrame with 3 rows and 5 columns
           PROBEID          ID     p.value statistic     logFC
       <character> <character>   <numeric> <numeric> <numeric>
55101     45828_at       55101 1.53869e-23  -11.8894 -0.465108
92840    226597_at       92840 3.56762e-23  -11.7546 -0.661995
727957   227778_at      727957 3.55473e-22  -11.3858 -0.525213

DE analysis using an Agilent dataset

For GSE61196 dataset, the samples have two conditions: normal and Alzheimer’s, which are defined in the source_name_ch1 column.

# GSE61196
colData(agilDataset)$condition <- ifelse(grepl("healthy", colData(agilDataset)$source_name_ch1), "normal", "alzheimer")
colData(agilDataset)$condition <- factor(colData(agilDataset)$condition)

Next, we define the design matrix and contrast matrix for differential expression analysis. For this dataset, we define the design matrix to include the condition, and similarly, the contrast matrix to compare the alzheimer condition versus the normal condition.

# GSE61196
agilDesign <- model.matrix(~0 + condition, data = colData(agilDataset))
agilContrast <- limma::makeContrasts("conditionalzheimer-conditionnormal", levels=agilDesign)

As an example of expected output, here we show a part of constructed design and contrast matrices for GSE61196:

head(agilDesign)
           conditionalzheimer conditionnormal
GSM1499379                  0               1
GSM1499380                  0               1
GSM1499381                  0               1
GSM1499382                  0               1
GSM1499383                  0               1
GSM1499384                  0               1
head(agilContrast)
                    Contrasts
Levels               conditionalzheimer-conditionnormal
  conditionalzheimer                                  1
  conditionnormal                                    -1

Next, we define the mapping between the ID of the assay in the SummarizedExperiment object and Entrez ID. The annotation argument can accept a supported GEO platform ID or a data frame with two columns: FROM and TO. If a GEO platform ID is provided, the function will automatically download the available annotation from Bioconductor. Users can use the function RCPA::getSupportedPlatforms() to get the list of supported GEO platform IDs. For GSE61196, since the platform GPL4133 is not supported by Bioconductor, we need to manually download the annotation from GEO as follows.

# GSE61196
GPL4133Anno <- GEOquery::dataTable(GEOquery::getGEO("GPL4133"))@table
GPL4133GeneMapping <- data.frame(FROM = GPL4133Anno$SPOT_ID, TO = as.character(GPL4133Anno$GENE), stringsAsFactors = F)
GPL4133GeneMapping <- GPL4133GeneMapping[!is.na(GPL4133GeneMapping$TO), ]

The mapping output is a dataframe as:

head(GPL4133GeneMapping)

        FROM        TO
 A_24_P66027      9582
A_23_P212522     23200
A_24_P934473 100132006
  A_24_P9671      3301
A_24_P801451     10919
 A_32_P30710      9349

Finally, we perform differential expression analysis using the runDEAnalysis function. We still use limma for differential expression analysis since the data are from microarray experiments.

# GSE61196
agilDEExperiment <- RCPA::runDEAnalysis(agilDataset, method = "limma", design = agilDesign, contrast = agilContrast, annotation = GPL4133GeneMapping)

DE analysis using an RNA-Seq dataset

For GSE153873, the samples have two conditions: normal and Alzheimer’s, which are defined in the characteristics_ch1.1 column.

# GSE153873
colData(RNASeqDataset)$condition <- ifelse(grepl("Alzheimer", colData(RNASeqDataset)$characteristics_ch1.1), "alzheimer", "normal")

Next, we define the design matrix and contrast matrix for differential expression analysis. For GSE153873 dataset, we define the design matrix to include the condition, and similarly, the contrast matrix to compare the alzheimer condition versus the normal condition.

# GSE61196
agilDesign <- model.matrix(~0 + condition, data = colData(agilDataset))
agilContrast <- limma::makeContrasts("conditionalzheimer-conditionnormal", levels=agilDesign)

# GSE153873
RNASeqDesign <- model.matrix(~0 + condition, data = colData(RNASeqDataset))
RNASeqContrast <- limma::makeContrasts("conditionalzheimer-conditionnormal", levels=RNASeqDesign)

For GSE153873, the IDs are Gene SYMBOL, we can use the org.Hs.eg.db package to map Human Gene SYMBOL to Entrez ID as follows:

# GSE153873
if (!require("org.Hs.eg.db", quietly = TRUE)) {
    BiocManager::install("org.Hs.eg.db")
}

library(org.Hs.eg.db)
ENSEMBLMapping <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(RNASeqDataset), columns = c("SYMBOL", "ENTREZID"), keytype = "SYMBOL")
colnames(ENSEMBLMapping) <- c("FROM", "TO")

The first rows of generated mapping dataframe are as:

head(ENSEMBLMapping)

   FROM     TO
  SGIP1  84251
 NECAP2  55707
  AZIN2 113451
  AGBL4  84871
  CLIC4  25932
SLC45A1  50651

Finally, we perform differential expression analysis using the runDEAnalysis function. For GSE153873 dataset, we use DESeq2 for differential expression analysis since the data are counts data from RNA-Seq experiments.

# GSE153873
RNASeqDEExperiment <- RCPA::runDEAnalysis(RNASeqDataset, method = "DESeq2", design = RNASeqDesign, contrast = RNASeqContrast, annotation = ENSEMBLMapping)

User can also use edgeR for GSE153873 dataset by setting the method argument to edgeR:

# GSE153873
RNASeqDEExperiment <- RCPA::runDEAnalysis(RNASeqDataset, method = "edgeR", design = RNASeqDesign, contrast = RNASeqContrast, annotation = ENSEMBLMapping)

Quality Control of DE Analysis Results

It is important to examine the differential analysis results before further analysis. This will help to identify potential issues in the analysis as well as to determine the suitable downstream analysis methods and parameters. Two common approaches are MA plot and volcano plot.

We have implemented three functions, plotMA, plotVolcano, and plotVennDE. Each of the mentioned functions returns a ggplot object that can be further customized using the ggplot2 package.

The code below shows how to generate the MA plots for three datasets:

# MA plots
p1 <- RCPA::plotMA(rowData(affyDEExperiment), logFCThreshold = 0.5) + ggplot2::ggtitle("Affymetrix - GSE5281")
p2 <- RCPA::plotMA(rowData(agilDEExperiment), logFCThreshold = 0.5) + ggplot2::ggtitle("Agilent - GSE61196")
p3 <- RCPA::plotMA(rowData(RNASeqDEExperiment), logFCThreshold = 0.5) + ggplot2::ggtitle("RNASeq - GSE153873")

gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

A normal MA plot will have most of the points centered around the zero log fold change line, If all points are shifted up or down, it indicates a systematic bias in the data that needs to be corrected. The obtained plots are:

Similarly, we can have volcano plot for the three datasets, as well:

# Volcano plots
p1 <- RCPA::plotVolcanoDE(rowData(affyDEExperiment), logFCThreshold = 0.5) + ggplot2::ggtitle("Affymetrix - GSE5281")
p2 <- RCPA::plotVolcanoDE(rowData(agilDEExperiment), logFCThreshold = 0.5) + ggplot2::ggtitle("Agilent - GSE61196")
p3 <- RCPA::plotVolcanoDE(rowData(RNASeqDEExperiment), logFCThreshold = 0.5) + ggplot2::ggtitle("RNASeq - GSE153873")

gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

A normal volcano plot will have points that are relatively symmetrical around the y-axis of zero log fold change. The generated plots are:

It is also important to check the intersection of DE genes between different datasets. This will help to identify potential issues in the analysis design. To this aim, we can use the function plotVennDE to generate the Venn diagram of DE genes among datasets as follows:

# All DE genes
DEResults <- list(
"Affymetrix - GSE5281" = rowData(affyDEExperiment),
"Agilent - GSE61196" = rowData(agilDEExperiment),
"RNASeq - GSE153873" = rowData(RNASeqDEExperiment)
)
# Up-regulated genes
DEResultUps <- lapply(DEResults, function(df) df[!is.na(df$logFC) & df$logFC > 0,])
# Down-regulated genes
DEResultDowns <- lapply(DEResults, function(df) df[!is.na(df$logFC) & df$logFC < 0,])

p1 <- RCPA::plotVennDE(DEResults) + ggplot2::ggtitle("All DE Genes")
p2 <- RCPA::plotVennDE(DEResultUps) + ggplot2::ggtitle("Up-regulated DE Genes")
p3 <- RCPA::plotVennDE(DEResultDowns) + ggplot2::ggtitle("Down-regulated DE Genes")

gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

and the generated Venn diagrams are:

Module 3 : System-level Analysis

In this pipeline, we have implemented eight widely used and reliable approaches for pathway analysis, which assist users in identifying significantly enriched functional categories. We categorize these approaches into two subgroups: non-topology-based and topology-based pathway analysis tools.

Non-Topology-Based Pathway Analysis

The included geneset enrichment analysis methods in RCPA package are ORA, FGSEA, GSA, KS (Kolmogorov–Smirnov), and Wilcox tests, which do not account for gene interactions within a pathway.

To start with this analysis, beside the SummarizedExperiment object generated by differential analysis, users need to prepare genesets information beforehand. In this protocol, we have implemented getGeneSets function to retrieve gene sets from two databases, including KEGG and GO, for a given organism. The below code retrieves genesets definitions from KEGG for human organism:

genesets <- RCPA::getGeneSets(database = "KEGG", org = "hsa")

The returned object is a named list of length three, including, genesets definitions, names of genesets, and the size of genesets.

If users wish to use their own genesets in the analysis, they can prepare a geneset object with the same format as the returned object from getGeneSets function. We recommend using Entreze gene IDs as the gene identifiers in the genesets object, as we provided helper functions to retrieve the gene symbols and descriptions from the NCBI database, which will be used in the downstream analysis and visualization.

After preparing the genesets object, users can proceed to the enrichment analysis. The analysis can be initiated by calling the runGeneSetAnalysis function, which will execute the gene set analysis procedure based on the selected method.

To run the function with fgsea as method, for three datasets, the code is:

#The list of additional arguments passed to fgsea function
fgseaArgsList <- list(minSize = 10, maxSize = Inf)

#Geneset enrichment analysis
affyFgseaResult <- runGeneSetAnalysis(affyDEExperiment, genesets,
                                                method = "fgsea", FgseaArgs = fgseaArgsList)
agilFgseaResult <- runGeneSetAnalysis(agilDEExperiment, genesets,
                                                method = "fgsea", FgseaArgs = fgseaArgsList)
RNASeqFgseaResult <- runGeneSetAnalysis(RNASeqDEExperiment, genesets,
                                                  method = "fgsea", FgseaArgs = fgseaArgsList)

For ORA and affyDEExperiment object, we can use:

affyORAResult <- runGeneSetAnalysis(affyDEExperiment, genesets,
                                              method = "ora", ORAArgs = list(pThreshold = 0.05))

For GSA and affyDEExperiment object, we can use:

affyGSAResult <- runGeneSetAnalysis(affyDEExperiment, genesets, method = "gsa")

For KS test and affyDEExperiment object, we can use:

affyKSResult <- runGeneSetAnalysis(affyDEExperiment, genesets, method = "ks")

For Wilcox test and affyDEExperiment object, we can use:

affyKSResult <- runGeneSetAnalysis(affyDEExperiment, genesets, method = "wilcox")

The output of the runGeneSetAnalysis function is a dataframe containing the results of the enrichment analysis. We can see a part of the output after running the function using fgsea method for affyDEExperiment:

> head(RNASeqFgseaResult)
             ID  p.value  score normalizedScore sampleSize                                              name
1 path:hsa05014 1.69e-35 -0.584           -2.78         30                     Amyotrophic lateral sclerosis
2 path:hsa05016 9.76e-33 -0.597           -2.78         30                                Huntington disease
3 path:hsa05012 2.84e-29 -0.607           -2.78         30                                 Parkinson disease
4 path:hsa05020 9.29e-29 -0.598           -2.74         30                                     Prion disease
5 path:hsa00190 8.11e-28 -0.743           -3.07         30                         Oxidative phosphorylation
6 path:hsa05022 3.86e-25 -0.487           -2.37         30 Pathways of neurodegeneration - multiple diseases
      pFDR pathwaySize
1 5.73e-33         364
2 1.65e-30         306
3 3.21e-27         266
4 7.87e-27         272
5 5.50e-26         134
6 2.18e-23         476

Topology-Based Pathway Analysis

The methods for pathway enrichment analysis in RCPA are SPIA, CePa ORA, and CePa GSA, which consider the topology and interconnections between genes within a pathway. We have implemented the runPathwayAnalysis function to perform pathway enrichment analysis using one of the mentioned methods.

For each of these methods, we need to pass a network object which contains the pathways network information from the KEGG database. To elaborate, the getSPIAKEGGNetwork function generates the necessary network for SPIA, while the getCePaPathwayCatalogue function produces the pathway catalog object required by both CePa ORA and CePa GSA.

We start with the process of running SPIA. First, we call the getSPIAKEGGNetwork to get network for human organism:

spiaNetwork <- RCPA::getSPIAKEGGNetwork(org = "hsa", updateCache = FALSE)

The generated network object is a list of length three which are network definitions of pathways, names of patwhays, and the size of pathways.

Now, we can call runPathwayAnalysis function:

affySpiaResult <- runPathwayAnalysis(affyDEExperiment, spiaNetwork, method = "spia")
agilSpiaResult <- runPathwayAnalysis(agilDEExperiment, spiaNetwork, method = "spia")
RNASeqSpiaResult <- runPathwayAnalysis(RNASeqDEExperiment, spiaNetwork, method = "spia")

To run any of the CePa ORA, or CePa GSA, we need first get the pathway network using getCePaPathwayCatalogue function, as:

cepaNetwork <- RCPA::getCePaPathwayCatalogue(org = "hsa", updateCache = FALSE)

Similar to getSPIAKEGGNetwork, the generated network catalogue is a list of length three with the same structure.

Then, we can utilize the generated network to call runPathwayAnalysis function. To call the function using affyDEExperiment object and with CePa ORA as method:

affyCepaORAResult <- runPathwayAnalysis(affyDEExperiment, cepaNetwork, method = "cepaORA")

and similarly, to call function using affyDEExperiment object and with CePa GSA as method:

affyCepaGSAResult <- runPathwayAnalysis(affyDEExperiment, cepaNetwork, method = "cepaGSA")

The runPathwayAnalysis function returns a dataframe the same as the output of the runGeneSetAnalysis function. Below, the top rows of obtained results using SPIA and affyDEExperiment object is shown:

> head(RNASeqSpiaResult)
             ID      p.value     score normalizedScore sampleSize
1 path:hsa05016 1.836417e-28  33.48486      -1.3614702         30
2 path:hsa05014 8.015966e-27  68.44187      -1.0164137         30
3 path:hsa05020 1.771357e-24  43.61561      -1.0511215         30
4 path:hsa00190 2.024686e-24  46.89983       4.7167369         30
5 path:hsa05022 3.590430e-23 213.46961       0.7301322         30
6 path:hsa05010 5.217307e-23 130.11774      -0.5420581         30
                                               name         pFDR pathwaySize
1                                Huntington disease 5.215424e-26         306
2                     Amyotrophic lateral sclerosis 1.138267e-24         364
3                                     Prion disease 1.437527e-22         272
4                         Oxidative phosphorylation 1.437527e-22         134
5 Pathways of neurodegeneration - multiple diseases 2.039364e-21         476
6                                 Alzheimer disease 2.469525e-21         384

Module 4 : Integrative Analysis and Visualization

Gene-Level Meta Analysis

For gene-level meta-analysis, RCPA provides the function named runDEMetaAnalysis to perform meta-analysis using DE analysis results. Users can pick one of the fisher, stouffer, addCLT, geoMean, and minP methods to perform meta-analysis. Besides the p-values, the fold-changes also are combined using SMD method. In the following code, we call this function utilizing three obtained DE results from stage 2:

#Preparing the input list of DE results
DEResults <- list(
	"Affymetrix - GSE5281" = rowData(affyDEExperiment),
	"Agilent - GSE61196" = rowData(agilDEExperiment),
	"RNASeq - GSE153873" = rowData(RNASeqDEExperiment)
	)

#Calling the runDEMetaAnalysis with 'stouffer' as the selected method to combine PValues
metaDEResult <- RCPA::runDEMetaAnalysis(DEResults, method = "stouffer")

The output is a dataframe containing combined p-value, pFDR, log fold change, and log fold change standard error. In the below table, we show six top rows of the generated dataframe:

> head(metaDEResult)
     ID      p.value         pFDR      logFC    logFCSE
1 84964 1.541734e-19 2.585333e-15 -0.4693914 0.04721164
2 10106 6.840827e-16 4.219658e-12  0.3273123 0.03915734
3  7108 8.988938e-16 4.219658e-12 -0.3681265 0.04798479
4  4713 1.006538e-15 4.219658e-12 -0.5306827 0.06044290
5 10382 1.766204e-15 5.923497e-12 -0.6648832 0.07334485
6   396 2.202702e-15 6.156186e-12 -0.3427080 0.03925764

Pathway-Level Meta Analysis

Users can perform meta-analysis of pathway analysis results, using the runPathwayMetaAnalysis function. sers can pick one of the fisher, stouffer, addCLT, geoMean, and minP methods to perform meta-analysis. Besides the p-values, the scores also are combined using SMD method. In the following code, we call this function utilizing three obtained enrichment analysis results of the FGSEA method in stage 3 to conduct meta-analysis based on Stouffer’s method:

	#Preparing the input list of enrichment results
	PAResults <- list(
	"Affymetrix - GSE5281" = affyFgseaResult,
	"Agilent - GSE61196" = agilFgseaResult,
	"RNASeq - GSE153873" = RNASeqFgseaResult
	)

	#Calling the runPathwayMetaAnalysis with 'stouffer' as the selected method to combine PValues
	metaPAResult <- RCPA::runPathwayMetaAnalysis(PAResults, method = "stouffer")

The output of meta-analysis is a dataframe containing combined p-value, score, and normalized score. In the below table, we show six top rows of the obtained results after performing meta-analysis:

             ID                                              name      p.value         pFDR     score
1 path:hsa05012                                 Parkinson disease 7.123092e-46 2.393359e-43 -2.467830
2 path:hsa05016                                Huntington disease 3.763897e-44 6.323347e-42 -2.430244
3 path:hsa05014                     Amyotrophic lateral sclerosis 9.552510e-44 1.069881e-41 -2.393301
4 path:hsa00190                         Oxidative phosphorylation 6.473335e-40 5.437601e-38 -2.691299
5 path:hsa05020                                     Prion disease 1.291475e-37 8.678712e-36 -2.402497
6 path:hsa05022 Pathways of neurodegeneration - multiple diseases 9.036272e-34 5.060312e-32 -2.119561
  normalizedScore pathwaySize
1       -2.467830         266
2       -2.430244         306
3       -2.393301         364
4       -2.691299         134
5       -2.402497         272
6       -2.119561         476

Pathway-Level Consensus Analysis

To start with consensus analysis, users can call the runConsensusAnalysis function with either weightedZMean or RRA to conduct consensus analysis. In the following code, we call runConsensusAnalysis to perform consensus analysis based on weightedZMean utilizing two obtained enrichment analysis results for Affymetrix-GSE5281 dataset by FGSEA, and SPIA methods:

	PAResults <- list(
	"fgsea" = affyFgseaResult,
	"spia" = affySpiaResult
	)

	consensusPAResult <- RCPA::runConsensusAnalysis(PAResults, method = "weightedZMean")

The output of runConsensusAnalysis function, is a dataframe containing combined p-value, and pFDR. In the below table, we show six top rows of the generated dataframe:

> head(consensusPAResult)
             ID      p.value                                              name pathwaySize         pFDR
1 path:hsa05022 8.213205e-09 Pathways of neurodegeneration - multiple diseases         476 8.213205e-09
2 path:hsa05014 7.783738e-07                     Amyotrophic lateral sclerosis         364 7.783738e-07
3 path:hsa05016 1.933887e-06                                Huntington disease         306 1.933887e-06
4 path:hsa05020 3.727841e-06                                     Prion disease         272 3.727841e-06
5 path:hsa04723 7.422680e-06              Retrograde endocannabinoid signaling         148 7.422680e-06
6 path:hsa00190 7.736429e-06                         Oxidative phosphorylation         134 7.736429e-06

Stage 5 : Visualization

In this section, we demonstrate how to visualize the results of the pathway analysis and integrative analysis. RCPA includes several state-of-the-art plots, which can be used to visualize and interpret results more efficiently. It is notable that since consensus analysis results do not report any score, the results can be visualized using plotting functions in RCPA that employ the p-value or pFDR.

Venn Diagram

We can plot the Venn diagram for enrichment analysis results using plotVennPathway function. The below code generates three Venn diagrams of all significant pathways, significant up-regulated pathways, and significant down-regulated pathways utilizing results obtained by fgsea method. The definition of up/down regulation depends on the method which we are using to analyze data.

PAResults <- list(
  	"Affymetrix - GSE5281" = affyFgseaResult,
  	"Agilent - GSE61196" = agilFgseaResult,
  	"RNASeq - GSE153873" = RNASeqFgseaResult,
  	"Meta-analysis" = metaPAResult
	)

	PAREsultUps <- lapply(PAResults, function(df) df[df$normalizedScore > 0,])
	PAREsultDowns <- lapply(PAResults, function(df) df[df$normalizedScore < 0,])

	p1 <- RCPA::plotVennPathway(PAResults, pThreshold = 0.05) + ggplot2::ggtitle("All Significant Pathways")
	p2 <- RCPA::plotVennPathway(PAREsultUps, pThreshold = 0.05) + ggplot2::ggtitle("Significantly Up-regulated Pathways")
	p3 <- RCPA::plotVennPathway(PAREsultDowns, pThreshold = 0.05) + ggplot2::ggtitle("Significantly Down-regulated Pathways")

	gridExtra::grid.arrange(p1, p2, p3, ncol = 3)

The generated Venn diagram is as: