Genetic classification of tumours
In another tutorial, we saw how GAMBLR.open can be used to generate binary matrices representing the presence or absence of mutations and other genetic features (e.g. common CNVs, SVs). In this tutorial, we will review how that matrix generation is used to classify samples under different classification systems, and how the resulting labels assign tumours to genetic subgroups.
Tip
We do not need to assemble the matrix separately: the classify_ family of GAMBLR functions directly accepts the maf/seg/bedpe data and handles matrix construction for you.
GAMBLR contains a collection of trained models and functions that pre-format inputs for these models. It includes the originally published classifiers for Burkitt and follicular lymphomas, as well as reproductions of the DLBCL classifications of Chapuy et al, Lacy et al, and Runge et al.
Load packages and data
First, we will load the required packages. Only GAMBLR.open and tidyverse are needed; together they provide all the required functionality.
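The loading step could look like the following minimal sketch. It assumes both packages are already installed; installation is outside the scope of this example.

```r
# Load the two packages named above (assumes both are already installed)
library(GAMBLR.open)
library(tidyverse)
```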
Next, we will obtain the data. We need the metadata, SSM in maf format, CNV in seg format, and SV in bedpe format. For demonstration purposes, we will use the data bundled with GAMBLR to show the classification functionality. Here, we will subset the data to a small number of samples and illustrate the required formatting and the minimal required information.
Metadata
The metadata is a data frame containing the column sample_id, which lists the sample ids of the tumours to be classified, and the column pathology, which determines the sliding threshold for aSHM site annotation where necessary. The column seq_type is not required for classification, but is kept in this tutorial for the subsequent retrieval of mutation data.
# Load metadata
metadata <- get_gambl_metadata() %>%
    filter(seq_type == "genome") %>%
    filter(pathology %in% c("FL", "DLBCL", "BL")) %>%
    group_by(sample_id) %>%
    slice_head() %>%
    ungroup() %>%
    filter(!study %in% c("DLBCL_Arthur", "DLBCL_Thomas"))

metadata %>%
    count(pathology)
pathology | n |
---|---|
BL | 234 |
DLBCL | 334 |
FL | 219 |
# Demonstrate the required columns in metadata
metadata <- metadata %>%
    select(sample_id, pathology, seq_type)
head(metadata)
sample_id | pathology | seq_type |
---|---|---|
00-14595_tumorC | DLBCL | genome |
00-14595_tumorD | FL | genome |
00-15201_tumorA | DLBCL | genome |
00-15201_tumorB | DLBCL | genome |
00-26427_tumorA | DLBCL | genome |
00-26427_tumorC | DLBCL | genome |
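Before passing your own metadata to a classifier, it can help to confirm that the required columns are present. The sketch below uses a hypothetical helper, check_required_columns(), which is not part of GAMBLR, together with a toy data frame that mirrors the first rows shown above.

```r
# Hypothetical helper (not a GAMBLR function): stop early if columns are missing
check_required_columns <- function(df, required) {
  missing_cols <- setdiff(required, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy metadata mirroring the first rows of the table above
toy_metadata <- data.frame(
  sample_id = c("00-14595_tumorC", "00-14595_tumorD", "00-15201_tumorA"),
  pathology = c("DLBCL", "FL", "DLBCL"),
  seq_type  = c("genome", "genome", "genome"),
  stringsAsFactors = FALSE
)

# sample_id and pathology are the two columns described as required
check_required_columns(toy_metadata, c("sample_id", "pathology"))
```

The same helper can be pointed at the maf, seg, or bedpe data frames by swapping in the column names listed for each format.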
SSM (maf format)
The key input for genetic subgroup classification is SSM data in maf format. Only a small subset of columns is required, and here we demonstrate the expected format and the minimal required information for SSM.
maf <- get_ssm_by_samples(
    these_samples_metadata = metadata
) %>% as.data.frame()

# Only these columns are required
maf_columns <- c(
    "Hugo_Symbol", "NCBI_Build",
    "Chromosome", "Start_Position", "End_Position",
    "Variant_Classification", "HGVSp_Short",
    "Tumor_Sample_Barcode"
)
maf <- maf %>%
    select(all_of(maf_columns))
head(maf)
Hugo_Symbol | NCBI_Build | Chromosome | Start_Position | End_Position | Variant_Classification | HGVSp_Short | Tumor_Sample_Barcode |
---|---|---|---|---|---|---|---|
TNFRSF14 | GRCh37 | 1 | 2488139 | 2488139 | Nonsense_Mutation | p.W12* | 01-20260T |
SPEN | GRCh37 | 1 | 16245921 | 16245921 | Missense_Mutation | p.P515R | 01-20260T |
BRINP3 | GRCh37 | 1 | 190187637 | 190187637 | Intron | | 01-20260T |
BRINP3 | GRCh37 | 1 | 190189661 | 190189661 | Intron | | 01-20260T |
BRINP3 | GRCh37 | 1 | 190264257 | 190264257 | Intron | | 01-20260T |
BTG2 | GRCh37 | 1 | 203274977 | 203274977 | Intron | | 01-20260T |
CNV (seg format)
Several classifiers (Chapuy et al, Lacy et al, and Runge et al) require CNV information, and here we demonstrate the expected format and the minimal required information for CNV in standard seg format.
seg <- get_cn_segments(
    these_samples_metadata = metadata
) %>% as.data.frame()

# Only these columns are required
seg_columns <- c(
    "ID", "chrom", "start", "end", "LOH_flag", "log.ratio"
)
seg <- seg %>%
    select(all_of(seg_columns))
head(seg)
ID | chrom | start | end | LOH_flag | log.ratio |
---|---|---|---|---|---|
02-13135T | 1 | 10001 | 762600 | 0 | 0.0000000 |
02-13135T | 1 | 762601 | 121500000 | 0 | 0.0000000 |
02-13135T | 1 | 142600000 | 161506889 | 0 | 0.0000000 |
02-13135T | 1 | 161506890 | 161652716 | 0 | 0.0000000 |
02-13135T | 1 | 161652717 | 162110568 | 0 | 0.7279926 |
02-13135T | 1 | 162110569 | 162111399 | 0 | 0.0000000 |
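To get a feel for the log.ratio column, the sketch below converts it to an approximate absolute copy number. It assumes that log.ratio is a log2 ratio relative to a diploid baseline, which this tutorial does not state explicitly; the approx_cn column name is illustrative only.

```r
# Toy segments copied from rows of the table above
toy_seg <- data.frame(
  ID        = "02-13135T",
  chrom     = "1",
  start     = c(161506890, 161652717, 162110569),
  end       = c(161652716, 162110568, 162111399),
  LOH_flag  = 0,
  log.ratio = c(0, 0.7279926, 0),
  stringsAsFactors = FALSE
)

# Assumption: log.ratio = log2(copy number / 2), so invert that relationship
toy_seg$approx_cn <- 2 * 2^toy_seg$log.ratio

toy_seg[, c("start", "end", "log.ratio", "approx_cn")]
```

Under that assumption, a log.ratio of 0 corresponds to two copies, and the segment with a log.ratio of about 0.73 corresponds to roughly 3.3 copies.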
SV (bedpe format)
Several classifiers require SV data, and here we demonstrate the expected format and the minimal required information for SV in standard bedpe format.
bedpe <- get_manta_sv(
    these_samples_metadata = metadata
) %>% as.data.frame()

# Only these columns are required
bedpe_columns <- colnames(bedpe)[1:11]
bedpe <- bedpe %>%
    select(all_of(bedpe_columns))
head(bedpe)
CHROM_A | START_A | END_A | CHROM_B | START_B | END_B | manta_name | SCORE | STRAND_A | STRAND_B | tumour_sample_id |
---|---|---|---|---|---|---|---|---|---|---|
1 | 161658631 | 161658631 | 3 | 16509907 | 16509907 | MantaBND:21171:0:1:0:0:0 | 133 | + | + | FL2002T1 |
1 | 161663959 | 161663959 | 9 | 37363320 | 37363320 | MantaBND:206628:0:1:0:0:0 | 122 | + | + | 09-15842_tumorA |
1 | 161663959 | 161663959 | 9 | 37363320 | 37363320 | MantaBND:195941:0:1:0:0:0 | 151 | + | + | 09-15842_tumorB |
11 | 65267283 | 65267283 | 14 | 106110907 | 106110907 | MantaBND:152220:0:1:0:0:0:0 | 88 | + | - | 15-38154T |
11 | 65267422 | 65267422 | 14 | 106110905 | 106110905 | MantaBND:152220:0:1:0:0:0:0 | 135 | - | + | 15-38154T |
13 | 91976545 | 91976545 | 14 | 106211857 | 106211857 | MantaBND:18:59794:59817:0:1:0 | 90 | - | + | 15-31924T |
Note
The manta_name column can be replaced by the equivalent column from whichever tool generated the bedpe output, as long as that column reports an id that is unique to each SV event.
Classify DLBCL
Here, we will explore the reproduced DLBCL classifications of Chapuy et al, Lacy et al, and Runge et al. The classification algorithm is selected with the method argument of the classify_dlbcl() function.
Important
The DLBCL classifiers were not released with the publications, and the models provided here are the best attempt at reproducing the original models. The model described by Chapuy et al is reproduced with > 92% accuracy, but results from the Lacy and HMRN classifiers should be treated with caution, as the accuracy of the bundled model is only 80%.
Chapuy et al. (C0-C5 clusters)
The paper originally published in 2018 by Chapuy et al was the first attempt to use a systematic approach to classify DLBCL patients into genetic subgroups using genomic information.
Note
This model classifies tumours according to the original 2018 publication, and is not directly related to the 2024 DLBCLass model.
Here is how these subgroupings can be recapitulated with GAMBLR functionality:
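The call might look like the sketch below. It reuses the metadata, maf, seg, and bedpe objects prepared earlier. The argument names (these_samples_metadata, maf_data, seg_data, sv_data) and the value "chapuy" are assumptions based on the naming used by the GAMBLR data retrieval functions above, so check ?classify_dlbcl before relying on them.

```r
library(GAMBLR.open)

# Assumed argument names and method value; confirm with ?classify_dlbcl
predictions <- classify_dlbcl(
  these_samples_metadata = metadata,
  maf_data = maf,
  seg_data = seg,
  sv_data = bedpe,
  method = "chapuy" # assumed value selecting the Chapuy et al. model
)

# List the elements of the returned object
names(predictions)
```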
By default, the classify_ family of functions returns both the assembled matrix and the predictions. The model reports the confidence of each tumour's subgroup vote, as well as the final label.
# assembled matrix
predictions$matrix[1:5,1:5]
sample_id | ACTB | B2M | BCL10 | BCL11A | BCL2 |
---|---|---|---|---|---|
00-14595_tumorC | 0 | 0 | 0 | 0 | 2 |
00-15201_tumorA | 0 | 0 | 0 | 0 | 0 |
00-15201_tumorB | 0 | 0 | 0 | 0 | 0 |
00-26427_tumorA | 0 | 0 | 0 | 2 | 0 |
00-26427_tumorC | 0 | 0 | 0 | 0 | 0 |
# subgroup assignment
predictions$predictions %>%
    slice_head(n = 10)
sample_id | C0 | C1 | C2 | C3 | C4 | C5 | Chapuy_cluster |
---|---|---|---|---|---|---|---|
00-14595_tumorC | 0 | 0.1060348 | 0.0000000 | 0.5110284 | 0.2784030 | 0.1045338 | C3 |
00-15201_tumorA | 0 | 0.0931391 | 0.0386996 | 0.1889158 | 0.3859610 | 0.2932845 | C4 |
00-15201_tumorB | 0 | 0.2842088 | 0.0914615 | 0.1265154 | 0.0730396 | 0.4247747 | C5 |
00-26427_tumorA | 0 | 0.0797927 | 0.0670896 | 0.0457110 | 0.1034915 | 0.7039151 | C5 |
00-26427_tumorC | 0 | 0.1152963 | 0.0000000 | 0.0000000 | 0.3449964 | 0.5397073 | C5 |
01-14774_tumorA | 0 | 0.4723915 | 0.0275195 | 0.1548457 | 0.2401450 | 0.1050982 | C1 |
01-14774_tumorB | 0 | 0.5192447 | 0.0050716 | 0.0872477 | 0.3039748 | 0.0844612 | C1 |
01-16433_tumorA | 0 | 0.0000000 | 0.0000000 | 0.8286825 | 0.1713175 | 0.0000000 | C3 |
01-16433_tumorB | 0 | 0.2873270 | 0.2551277 | 0.2676392 | 0.1094943 | 0.0804118 | C1 |
01-23117_tumorA | 0 | 0.2100450 | 0.0000000 | 0.0463066 | 0.2689441 | 0.4747043 | C5 |
count(predictions$predictions, Chapuy_cluster)
Chapuy_cluster | n |
---|---|
C0 | 2 |
C1 | 38 |
C2 | 60 |
C3 | 124 |
C4 | 36 |
C5 | 74 |
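In the rows shown above, the final label coincides with the cluster that received the highest vote. The sketch below checks this on four tumours copied from the predictions table. The top_cluster and top_vote columns are illustrative and not part of the GAMBLR output. The C0 votes are zero in every row shown, so this sketch says nothing about how C0 labels arise.

```r
# Vote columns for four tumours, copied from the predictions table above
votes <- data.frame(
  sample_id = c("00-14595_tumorC", "00-15201_tumorA",
                "01-16433_tumorA", "01-16433_tumorB"),
  C0 = c(0, 0, 0, 0),
  C1 = c(0.1060348, 0.0931391, 0.0000000, 0.2873270),
  C2 = c(0.0000000, 0.0386996, 0.0000000, 0.2551277),
  C3 = c(0.5110284, 0.1889158, 0.8286825, 0.2676392),
  C4 = c(0.2784030, 0.3859610, 0.1713175, 0.1094943),
  C5 = c(0.1045338, 0.2932845, 0.0000000, 0.0804118),
  stringsAsFactors = FALSE
)

vote_cols <- paste0("C", 0:5)
vote_matrix <- as.matrix(votes[, vote_cols])

# Name of the column holding the largest vote in each row
votes$top_cluster <- vote_cols[max.col(vote_matrix, ties.method = "first")]
# Size of the winning vote, usable as a rough confidence measure
votes$top_vote <- apply(vote_matrix, 1, max)

votes[, c("sample_id", "top_cluster", "top_vote")]
```

Looking at top_vote alongside the label is useful for spotting close calls: 01-16433_tumorB is labelled C1 with a vote of about 0.29, only slightly ahead of C3 and C2.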
Lacy et al.
The paper originally published in 2020 by Lacy et al did not release the classification algorithm, but we recapitulated it by training a random forest model with 80% accuracy. Here is how these subgroupings can be recapitulated with GAMBLR functionality:
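A sketch of the call, reusing the metadata, maf, seg, and bedpe objects prepared earlier. The argument names and the value "lacy" are assumptions, so confirm them with ?classify_dlbcl.

```r
library(GAMBLR.open)

# Assumed argument names and method value; confirm with ?classify_dlbcl
predictions <- classify_dlbcl(
  these_samples_metadata = metadata,
  maf_data = maf,
  seg_data = seg,
  sv_data = bedpe,
  method = "lacy" # assumed value selecting the Lacy et al. model
)
```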
As with the Chapuy et al predictions, we can see both the constructed matrix and the confidence of each tumour's subgroup vote, as well as the final label.
# assembled matrix
predictions$matrix[1:5,1:5]
sample_id | BCL11A_amp | MALT1_amp | REL_amp | XPO1_amp | CD58_OR_del |
---|---|---|---|---|---|
00-14595_tumorC | 0 | 0 | 0 | 0 | 0 |
00-15201_tumorA | 0 | 0 | 0 | 0 | 0 |
00-15201_tumorB | 0 | 0 | 0 | 0 | 0 |
00-26427_tumorA | 0 | 0 | 0 | 0 | 0 |
00-26427_tumorC | 0 | 0 | 0 | 0 | 1 |
# subgroup assignment
head(predictions$predictions)
sample_id | BCL2 | MYD88 | NOTCH2 | Other | SOCS1/SGK1 | TET2/SGK1 | Lacy_cluster |
---|---|---|---|---|---|---|---|
00-14595_tumorC | 0.278 | 0.070 | 0.058 | 0.010 | 0.500 | 0.084 | SOCS1/SGK1 |
00-15201_tumorA | 0.220 | 0.102 | 0.148 | 0.070 | 0.440 | 0.020 | SOCS1/SGK1 |
00-15201_tumorB | 0.052 | 0.186 | 0.500 | 0.210 | 0.026 | 0.026 | NOTCH2 |
00-26427_tumorA | 0.020 | 0.310 | 0.194 | 0.470 | 0.000 | 0.006 | Other |
00-26427_tumorC | 0.020 | 0.754 | 0.032 | 0.162 | 0.016 | 0.016 | MYD88 |
01-14774_tumorA | 0.262 | 0.064 | 0.196 | 0.016 | 0.398 | 0.064 | SOCS1/SGK1 |
count(predictions$predictions, Lacy_cluster)
Lacy_cluster | n |
---|---|
BCL2 | 110 |
MYD88 | 57 |
NOTCH2 | 34 |
Other | 42 |
SOCS1/SGK1 | 81 |
TET2/SGK1 | 10 |
HMRN
The paper published by Runge et al modified the original Lacy et al subgroupings to align the genetic subgroups more closely with the LymphGen classification system. The difference between the Runge and Lacy methods is that tumours with a truncating NOTCH1 mutation are assigned to a separate subgroup regardless of the other genetic alterations present in the same tumour sample. Here is how these subgroupings can be recapitulated with GAMBLR functionality:
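A sketch of the call, reusing the metadata, maf, seg, and bedpe objects prepared earlier. The argument names and the value "hmrn" are assumptions, so confirm them with ?classify_dlbcl.

```r
library(GAMBLR.open)

# Assumed argument names and method value; confirm with ?classify_dlbcl
predictions <- classify_dlbcl(
  these_samples_metadata = metadata,
  maf_data = maf,
  seg_data = seg,
  sv_data = bedpe,
  method = "hmrn" # assumed value selecting the Runge et al. (HMRN) model
)
```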
Take a look at the resulting output:
# assembled matrix
predictions$matrix[1:5,1:5]
sample_id | BCL11A_amp | MALT1_amp | REL_amp | XPO1_amp | CD58_OR_del |
---|---|---|---|---|---|
00-14595_tumorC | 0 | 0 | 0 | 0 | 0 |
00-15201_tumorA | 0 | 0 | 0 | 0 | 0 |
00-15201_tumorB | 0 | 0 | 0 | 0 | 0 |
00-26427_tumorA | 0 | 0 | 0 | 0 | 0 |
00-26427_tumorC | 0 | 0 | 0 | 0 | 1 |
# subgroup assignment
head(predictions$predictions)
sample_id | BCL2 | MYD88 | NOTCH2 | Other | SOCS1/SGK1 | TET2/SGK1 | NOTCH1 | hmrn_cluster |
---|---|---|---|---|---|---|---|---|
00-14595_tumorC | 0.278 | 0.070 | 0.058 | 0.010 | 0.500 | 0.084 | 0 | SOCS1/SGK1 |
00-15201_tumorA | 0.220 | 0.102 | 0.148 | 0.070 | 0.440 | 0.020 | 0 | SOCS1/SGK1 |
00-15201_tumorB | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1 | NOTCH1 |
00-26427_tumorA | 0.020 | 0.310 | 0.194 | 0.470 | 0.000 | 0.006 | 0 | Other |
00-26427_tumorC | 0.020 | 0.754 | 0.032 | 0.162 | 0.016 | 0.016 | 0 | MYD88 |
01-14774_tumorA | 0.262 | 0.064 | 0.196 | 0.016 | 0.398 | 0.064 | 0 | SOCS1/SGK1 |
count(predictions$predictions, hmrn_cluster)
hmrn_cluster | n |
---|---|
BCL2 | 110 |
MYD88 | 56 |
NOTCH1 | 3 |
NOTCH2 | 32 |
Other | 41 |
SOCS1/SGK1 | 81 |
TET2/SGK1 | 11 |
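The NOTCH1 rule described above can be pictured as an override applied on top of the Lacy-style label. The sketch below illustrates that idea on three tumours from the tables above. The notch1_truncating flag and the hmrn_like column are hypothetical names, and this is not the code used inside classify_dlbcl().

```r
# Lacy-style labels for three tumours, copied from the Lacy table above
calls <- data.frame(
  sample_id    = c("00-14595_tumorC", "00-15201_tumorB", "00-26427_tumorC"),
  Lacy_cluster = c("SOCS1/SGK1", "NOTCH2", "MYD88"),
  # Hypothetical flag: TRUE when a truncating NOTCH1 mutation is present
  notch1_truncating = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# A truncating NOTCH1 mutation takes precedence over the voted label
calls$hmrn_like <- ifelse(calls$notch1_truncating, "NOTCH1", calls$Lacy_cluster)

calls
```

This matches the pattern in the tables: 00-15201_tumorB is labelled NOTCH2 by the Lacy method and NOTCH1 by the HMRN method, while the other two tumours keep their labels.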
Classify FL
Here, we will explore the classification of FL tumours into genetic subgroups with differential propensity for transformation, as described in the original publication. The developed model is accessed with the classify_fl() function.
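A sketch of the call, reusing the metadata and maf objects prepared earlier. The argument names are assumptions carried over from the DLBCL examples, and whether classify_fl() accepts further inputs is not covered here, so confirm with ?classify_fl.

```r
library(GAMBLR.open)

# Assumed argument names; confirm with ?classify_fl
predictions <- classify_fl(
  these_samples_metadata = metadata,
  maf_data = maf
)
```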
As with the DLBCL classifiers, we can take a look at the assembled matrix, as well as the predictions and the confidence of each tumour's vote:
# assembled matrix
predictions$matrix[1:10, 1:5]
sample_id | ACTB | ARID1A | ATP6AP1 | ATP6V1B2 | B2M |
---|---|---|---|---|---|
00-14595_tumorC | 0 | 1 | 0 | 0 | 0 |
00-14595_tumorD | 0 | 0 | 0 | 0 | 0 |
00-15201_tumorA | 0 | 0 | 0 | 0 | 0 |
00-15201_tumorB | 0 | 1 | 0 | 0 | 0 |
00-26427_tumorA | 0 | 0 | 0 | 0 | 0 |
00-26427_tumorC | 0 | 0 | 0 | 0 | 0 |
01-14774_tumorA | 0 | 0 | 0 | 0 | 1 |
01-14774_tumorB | 0 | 0 | 0 | 0 | 0 |
01-16433_tumorA | 0 | 0 | 0 | 0 | 0 |
01-16433_tumorB | 0 | 1 | 0 | 0 | 1 |
# subgroup assignment
head(predictions$predictions)
sample_id | cFL | dFL | is_cFL |
---|---|---|---|
00-14595_tumorC | 0.052 | 0.948 | dFL |
00-14595_tumorD | 0.038 | 0.962 | dFL |
00-15201_tumorA | 0.000 | 1.000 | dFL |
00-15201_tumorB | 0.078 | 0.922 | dFL |
00-26427_tumorA | 0.018 | 0.982 | dFL |
00-26427_tumorC | 0.014 | 0.986 | dFL |
count(predictions$predictions, is_cFL)
is_cFL | n |
---|---|
cFL | 41 |
dFL | 510 |
Classify BL
Here, we will explore the classification of BL tumours into genetic subgroups as originally described in Thomas et al. The model reported in that study is accessed with the classify_bl() function.
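A sketch of the call, reusing the metadata and maf objects prepared earlier. The argument names are assumptions carried over from the other classifiers, so confirm them with ?classify_bl.

```r
library(GAMBLR.open)

# Assumed argument names; confirm with ?classify_bl
predictions <- classify_bl(
  these_samples_metadata = metadata,
  maf_data = maf
)
```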
As with the other classifiers, we can take a look at the assembled matrix, as well as the predictions and the confidence of each tumour's vote:
# assembled matrix
predictions$matrix[1:10, 1:5]
sample_id | ARID1A | B2M | BCL10 | BCL11A | BCL2 |
---|---|---|---|---|---|
00-14595_tumorC | 1 | 0 | 0 | 0 | 1 |
00-15201_tumorA | 0 | 0 | 0 | 0 | 0 |
00-15201_tumorB | 1 | 0 | 0 | 0 | 0 |
00-26427_tumorA | 0 | 0 | 0 | 1 | 0 |
00-26427_tumorC | 0 | 0 | 0 | 0 | 0 |
01-14774_tumorA | 0 | 1 | 1 | 0 | 0 |
01-14774_tumorB | 0 | 0 | 0 | 0 | 0 |
01-16433_tumorA | 0 | 0 | 0 | 0 | 1 |
01-16433_tumorB | 1 | 1 | 0 | 0 | 1 |
01-23117_tumorA | 0 | 1 | 0 | 0 | 0 |
# subgroup assignment
head(predictions$predictions)
sample_id | DGG-BL | DLBCL | IC-BL | Q53-BL | BL_subgroup |
---|---|---|---|---|---|
00-14595_tumorC | 0.116 | 0.816 | 0.068 | 0.000 | DLBCL |
00-15201_tumorA | 0.052 | 0.930 | 0.002 | 0.016 | DLBCL |
00-15201_tumorB | 0.296 | 0.648 | 0.050 | 0.006 | DLBCL |
00-26427_tumorA | 0.036 | 0.964 | 0.000 | 0.000 | DLBCL |
00-26427_tumorC | 0.000 | 1.000 | 0.000 | 0.000 | DLBCL |
01-14774_tumorA | 0.072 | 0.894 | 0.018 | 0.016 | DLBCL |
count(predictions$predictions, BL_subgroup)
BL_subgroup | n |
---|---|
DGG-BL | 106 |
DLBCL | 345 |
IC-BL | 108 |
Q53-BL | 9 |
That’s it!
Happy GAMBLing!