Easy forest plots
This tutorial introduces one of the ways GAMBLR.viz facilitates exploration of mutation incidence differences between groups of samples. Forest plots are a popular approach to summarize the result of a set univariate tests or a multivariate model. prettyForestPlot
is a convenient approach to generate our take on forest plots and is part of a group of GAMBLR.viz functions. If you want to compare the frequency of mutations in a set of genes (or across all genes), all you need is a MAF and some basic sample metadata. The metadata need only contain a row for each sample identifiers in the MAF (Tumor_Sample_Barcode
) and another column containing the labels for how you want the samples stratified. This tutorial will demonstate the example of the inputs and showcase the main features of this function.
Prepare setup
We will first import the necessary packages:
Next, we will get some data to display. The metadata is expected to be a data frame with one required column: sample_id
and another column that will contain sample annotations according to the comparison group. In this example, we will use as example the data set and variant calls from the study that identified genetic subgroup of Burkitt lymphoma (BL).
metadata <- get_gambl_metadata() %>%
filter(study == "BL_Thomas")
Next, with the help of GAMBLR.open and GAMBLR.data, we will obtain the coding mutations that will be used in the plotting. Many of the get_
family of functions return a data frame with mutations annotated in the familiar MAF format.
maf <- get_coding_ssm(
these_samples_metadata = metadata,
tool_name = "publication",
projection = "hg38",
include_silent = FALSE
)
# What does it look like?
dim(maf)
[1] 34453 49
genomic_data Object
Genome Build: hg38
Showing first 10 rows:
Tumor_Sample_Barcode Hugo_Symbol Variant_Classification
1 Akata CPTP Missense_Mutation
2 Akata FNDC10 Missense_Mutation
3 Akata MORN1 Missense_Mutation
4 Akata MEGF6 Missense_Mutation
5 Akata GPR157 Missense_Mutation
6 Akata H6PD Missense_Mutation
For the purpose of this tutorial, we will focus on a small subset of genes known to be significantly mutated in BL.
#genes <- lymphoma_genes_bl_v_latest$Gene
genes <- GAMBLR.data::lymphoma_genes %>%
filter(BL_Tier == 1) %>%
pull(Gene)
head(genes)
[1] "ARID1A" "BACH2" "BCL6" "BCL7A" "BMP7" "CCND3"
Now we have our metadata and mutations we want to explore, so we are ready to start visualizing the data.
The default forest plot
The forest plot is ready to be called with the default parameters after just providing the metadata and data frame with mutations in standard maf format. Here is an example of the output with all default parameters:
comparison_column <- "EBV_status_inf" # character of column name comparison
comparison_vals <- c("EBV-positive","EBV-negative")
fp <- prettyForestPlot(
metadata = metadata,
maf = maf,
genes = genes,
comparison_column = comparison_column,
comparison_values = comparison_vals
)
The output of the function is a list containing the following objects: - fisher
: a data frame with detailed statistics of the Fisher’s test for each gene - mutmat
: a binary matrix used for the Fisher’s test - forest
: a ggplot2 object with the forest plot of the ORs from the Fisher’s test for each gene - bar
: a ggplot2 object wiht mutation frequencies for each Gene - arranged
: a display item where both the forest and bar plots are nicely arranged side-by-side
names(fp)
[1] "fisher" "forest" "bar" "arranged" "mutmat"
Report only significant differences
By default, all of the genes of interest are reported in the output. After the Fisher’s test is performed, the prettyForestPlot
also calculates FDR and we can use it to only report significant differences by providing a significance cutoff with the parameter max_q
:
max_q <- 0.1 # only those qith Q value <= 0.1 will be reported
fp <- prettyForestPlot(
metadata = metadata,
maf = maf,
genes = genes,
comparison_column = comparison_column,
comparison_values = comparison_vals,
max_q = max_q,
base_size = 9
)
We now can take a look at what genes are passing the significance cutoff:
fp$arranged
If we don’t provide a set of candidate genes, the Fisher’s exact test is applied to every gene mutated in at least one sample, which is incredibly ambitious. Because prettyForestPlot
automatically corrests for multiple testing, this can limit how many significant associations you end up with. Here, we raised the minimum to 10 mutations and excluded the candidate gene restriction. As you can see, this can lead to additional differences being detected.
fp <- prettyForestPlot(
metadata = metadata,
maf = maf,
comparison_column = comparison_column,
comparison_values = comparison_vals,
max_q = max_q,
min_mutations = nmut,
base_size = 9
)
fp$arranged
num_genes = nrow(fp$fisher)
A total of 18 remained significant after correcting for multiple tests.
Comparing categories with more than two groups
As the prettyForestPlot
construcst the 2x2 contingency tables to run Fisher’s test to find significant differences, it can only operate on comparing 2 groups between themselves - but what if you have more than that and want to see the difference between some of them? To handle this scenario, we can take advantage of the comparison_values
parameter, which will be used to subset the metadata to only requested groups and only perform testing and plotting on this subset. Let’s see it in action:
comparison_column <- "genetic_subgroup" # change the comparison column
comparison_values <- c("IC-BL", "Q53-BL")
fp <- prettyForestPlot(
metadata = metadata,
maf = maf,
genes = genes,
comparison_column = comparison_column,
comparison_values = comparison_values,
max_q = max_q,
base_size = 9
)
fp$arranged
This plot is exactly reproducing the Supplemmental Figure 12D from the Thomas et al study!
Separating genes with hotspots
We can additionally separate hotspots from the other mutations and compare those separately. First, we need to annotate the maf data, for which we will use the annotate_hotspots
from GAMBLR family. This function will add a new column to the maf named hot_spot
indicating whether or not the specific mutation is in the hotspot region.
# Annotate hotspots
maf <- annotate_hotspots(maf)
# What are the hotspots?
maf %>%
filter(hot_spot) %>%
select(Hugo_Symbol, hot_spot) %>%
table()
< table of extent 0 x 0 >
Note
The GAMBLR.data version of the
annotate_hotspots
only handles very specific genes and does not have functionality to annotate all hotspots.
Oh no! Looks like there is no hotspots in this maf data. This does not make sense, so what happened? Aha, the hotspot annotation in GAMBLR.data works only on the data in grch37 projection. But our maf is in hg38, so what should we do? One way is to lift the maf data to another projection using the UCSC’s liftOver, and GAMBLR family has exactly the function that serves this purpose:
Can we annotate the hotspots now?
maf_grch37 <- annotate_hotspots(maf_grch37)
# What are the hotspots?
maf_grch37 %>%
filter(hot_spot) %>%
select(Hugo_Symbol, hot_spot) %>%
table()
hot_spot
Hugo_Symbol TRUE
CREBBP 1
EZH2 1
FOXO1 58
MYD88 2
STAT6 4
Indeed, the hotspots are properly annotated once we have maf in correct projection. Now, we can simply toggle the separate_hotspots
parameter to perform separate comparisons within hotspots:
comparison_column <- "EBV_status_inf"
fp <- prettyForestPlot(
metadata = metadata,
maf = maf_grch37,
genes = genes,
comparison_column = comparison_column,
comparison_values = comparison_vals,
max_q = max_q,
separate_hotspots = TRUE
)
fp$arranged
Using binary matrix as input
Sometimes it might be useful to have different input format instead of maf - for example, you already have a one-hot encoded matrix of feature status (i.e. a binary matrix). We can run prettyForestPlot
using this as input in lieu of the MAF. This also means we can apply it to more than just simple somatic mutation profiles.
First, let’s construct the binary matrix. We will supplement our maf with the non-coding mutations to look at the aSHM regions in addition to coding mutations, and this will already give us the data in correct projection:
maf <- get_ssm_by_samples(
these_samples_metadata = metadata
)
maf$Variant_Classification %>% table
.
3'Flank 3'UTR 5'Flank
1457 513 2957
5'UTR Frame_Shift_Del Frame_Shift_Ins
1102 124 97
IGR In_Frame_Del In_Frame_Ins
457 40 14
Intron Missense_Mutation Nonsense_Mutation
44397 1859 286
Nonstop_Mutation RNA Silent
6 74 481
Splice_Region Splice_Site Translation_Start_Site
148 114 22
Now we convert this maf into binary matrix:
# Generate binary matrix
coding_matrix <- get_coding_ssm_status(
these_samples_metadata = metadata,
maf_data = maf,
gene_symbols = genes,
include_hotspots = TRUE,
review_hotspots = TRUE
)
Next, supplement this with the matrix of non-coding mutation across aSHM regions
# We'll use the aSHM regions defined in GAMBLR.data
# Let our helper function massage it into a consistent format
regions_bed = create_bed_data(somatic_hypermutation_locations_GRCh37_v0.2,
fix_names="concat",
concat_cols=c("gene","region"),
sep="-")
# Generate matrix of mutations per each site
ashm_matrix <- get_ashm_count_matrix(
regions_bed = regions_bed,
these_samples_metadata = metadata
)
# Binarize matrix using arbitrary 3 muts/region cutoff
ashm_matrix[ashm_matrix <= 3] = 0
ashm_matrix[ashm_matrix > 3] = 1
ashm_matrix <- ashm_matrix %>%
rownames_to_column("sample_id")
We can now combine both coding and non-coding features into single matrix:
Now we can provide the binary matrix to the prettyForestPlot
and regenerate the Supplemmental Figure 12C from the Thomas et al study!
comparison_column <- "genetic_subgroup"
comparison_values <- c("DGG-BL", "Q53-BL")
fp <- prettyForestPlot(
metadata = metadata,
mutmat = feature_matrix,
genes = genes,
comparison_column = comparison_column,
comparison_values = comparison_values,
max_q = max_q
)
fp$arranged