Single-sample pathway enrichment analysis is an effectively approach for identifying cancer subtypes and pathway biomarkers, which promotes the development of precision medicine. However, the existing approaches focused on gene expression data but neglected gene mutation information. Here, we proposed a novel single-sample mutation-based pathway analysis approach (ssMutPA) to infer individualized pathway activities by integrating somatic mutation data and the protein-protein interaction (PPI) network.For each sample, ssMutPA first uses local and global weighted strategies to evaluate the effects of network genes from mutations according to the network topology, and then calculates mutation-based pathway enrichment score (ssMutPES) to reflect the accumulated effect of mutations of each pathway.
MAF files contain many fields ranging from chromosome names to cosmic annotations. However, most of the analysis in ssMutPA uses the following fields.
Complete specification of MAF files can be found on NCI GDC documentation page.
The ssMutPA package is a Bioinformatics tool to
perform single-sample mutation-based pathway analysis. And
ssMutPA functions can be categorized into mainly
Analysis and Visualization modules. Each of these functions and a short
description is summarized as shown below:
1.Obtain the gene mutation status matrix.
2.Calculate single-sample mutation-based pathway enrichment score.
3.Identification of cancer subtypes based on unsupervised spectral
clustering algorithm.
4.Visualization of results:
4.1 Plotting a heatmap with subtype labels.
4.2 Plotting the density ridges plot.
4.3 Plotting the Dot plot.
4.4 Plotting a waterfall plot of a
particular pathway.
We downloaded patients’ somatic mutation data from the The Cancer
Genome Atlas Program(TCGA) or other open-source databases in MAF format
and converted MAF format data into a gene mutation status matrix, where
each element indicates whether a gene is mutated or not in a sample. In
this study, we only extract the non-silent somatic mutations
(Missense_Mutation,Frame_Shift_Del,Frame_Shift_Ins,In_Frame_Del,Nonsense_Mutation,In_Frame_Ins,Splice_Site,Nonstop_Mutation,Translation_Start_Site)
in protein-coding regions. The function get_mut_status
in the ssMutPA package can implement the above process.
Simulation data as an example, the command lines are as follows:
#load the mutation annotation file
mut_path <- system.file("extdata","mutation_data.Rdata",package = "ssMutPA")
load(mut_path)
#perform the function 'get_mut_status'
mut_status<-get_mut_status(mutation_data,nonsynonymous=TRUE,TCGA=TRUE,mut_rate=0)
#view the first five lines of mut_status matrix
mut_status[1:5,1:5]
#> TCGA-32-1979-01A TCGA-4W-AA9S-01A TCGA-06-5858-01A TCGA-19-4068-01A
#> GRHL3 1 0 0 0
#> BRDT 1 0 0 0
#> APH1A 1 0 0 0
#> OR6N2 1 0 0 0
#> APOB 1 0 0 0
#> TCGA-06-0644-01A
#> GRHL3 0
#> BRDT 0
#> APH1A 0
#> OR6N2 0
#> APOB 0
Firstly, we performed local weighted on the seed nodes. For a given
seed node (Gi), there are
two variables to characterize if any Gi may be
reinforced by the mutations of its neighbors: the number of mutation
genes in the neighbors of Gi, designated
as Xi, and
the number of neighbors of Gi, designated
as Mi. For
a seed Gi
in the sample, if it play more important function in the disease
progression, the number of mutation genes in its neighbors will be
significantly larger than the expectation E(Xi),
which can be calculated as:
E(Xi) = MiK/N
Thus, a rescaled form Xi − E(Xi)
is proposed to quantify the important strength of Gi in a
sample,which was defined as local weight Wi :
wi = Xi − E(Xi)
Wi = logα(wiI(wi) + α) where I is an indicator function,if wi greater than 0,its value is equal to 1,otherwise equal to 0;and α is a scalar base to guarantee the weight has a minimum value of 1, here we set α as 2.
Subsequently, we normalized the initial weights of all seeds in a
sample to ensure that they sum to 1 and used the global propagation
algorithm(Random Walk with Restart(RWR)) to predict probable influence
of nodes in the network by seed nodes(mutation genes).The RWR formula is
as follows:
###Random walk with restart pt + 1 = (1 − c)Apt + cp0
wherec is a certain
probability of continuing the random walk or restarting from the restart
set; A is the
column-normalized adjacency matrix of the PPI network; pt is a vector
containing visiting probabilities of all nodes in the network at time
point t; p0 is the
initial probability vector.
Finally, we constructed a gene list by ranking the genes according to
the normalized global weight and calculated the degree of enrichment of
genes contained in each pathway on the gene list based on the
Kolmogorov-Smirnov statistic. The function get_RWR_ES
in the ssMutPA package can accomplish all the processes
mentioned above.
#Method of obtaining data
data(mut_status)
net_path <- system.file("extdata","ppi_network.Rdata",package = "ssMutPA")
load(net_path)
pathway_path<-system.file("extdata","kegg_323_gmt.Rdata",package = "ssMutPA")
load(pathway_path)
samp_name<-c("TCGA-32-1979-01A","TCGA-32-2494-01A")
examp_data<-mut_status[,samp_name]
#perform the function 'get_RWR_ES'
Path_ES<-get_RWR_ES(examp_data,net_data=ppi_network,pathway_data=kegg_323_gmt,BC_Num=12436)
#> Warning in asMethod(object): sparse->dense coercion: allocating vector of size
#> 1.2 GiB
#view the first six lines of pathway enrichment profile
head(Path_ES)
#> TCGA-32-1979-01A
#> ABC transporters 0.00000
#> AGE-RAGE signaling pathway in diabetic complications 0.65015
#> AMPK signaling pathway 0.56116
#> Acute myeloid leukemia 0.69988
#> Adherens junction 0.71432
#> Adipocytokine signaling pathway 0.69112
#> TCGA-32-2494-01A
#> ABC transporters 0.00000
#> AGE-RAGE signaling pathway in diabetic complications 0.67965
#> AMPK signaling pathway 0.59142
#> Acute myeloid leukemia 0.72841
#> Adherens junction 0.66251
#> Adipocytokine signaling pathway 0.63442
The function get_samp_class can perform spectral clustering based on the ssMutPESs of characteristic pathways.
#Load sample mutation data
surv_path <- system.file("extdata","sur.Rdata",package = "ssMutPA")
load(surv_path)
data(Path_ES)
#Perform the function `get_samp_class`
res<-get_samp_class(Path_ES,sur,seed_num=5,cox_pval=0.05,min.nc = 2,max.nc =5)
#view the label of samples
res$sample_class[1:10]
#> TCGA-32-1979-01A TCGA-4W-AA9S-01A TCGA-06-5858-01A TCGA-19-4068-01A
#> 1 1 2 1
#> TCGA-06-0644-01A TCGA-12-3649-01A TCGA-06-0875-01A TCGA-12-0691-01A
#> 1 1 1 1
#> TCGA-19-5955-01A TCGA-12-5295-01A
#> 2 1
#Load the data
data(Path_ES,sample_class,Path_Name)
#perform the function `get_heatmap`.
get_heatmap(Path_ES,Path_name=Path_Name,samp_class=sample_class)
#Get the data of ROC curve
data(Path_ES,sample_class)
#perform the function `mountain_plot`
mountain_plot(data=Path_ES,sample_class=sample_class,Path_name=rownames(Path_ES)[c(12,20,74,103,113,123,138,151,188)])
#> Picking joint bandwidth of 0.0119
#load the data
mut_path <- system.file("extdata","maffile.txt",package = "ssMutPA")
maf<-maftools::read.maf(mut_path ,isTCGA = FALSE)
pathway_path <- system.file("extdata","kegg_323_gmt.Rdata",package = "ssMutPA")
load(pathway_path)
data(samp_class_onco,mut_onco,sur_onco)
#draw a waterfall plot
# \donttest{Oncoplot(maf,samp_class_onco,sur_onco,mut_onco,kegg_323_gmt,"IL-17 signaling pathway")}
#> -Reading
#> -Validating
#> --Removed 1121 duplicated variants
#> -Silent variants: 607
#> -Summarizing
#> --Possible FLAGS among top ten genes:
#> MUC16
#> TTN
#> -Processing clinical data
#> --Missing clinical data
#> -Finished in 0.717s elapsed (0.821s cpu)