Genetic classification of tumours • GAMBLR.open

In another tutorial, we saw how GAMBLR.open can be used to generate binary matrices representing the presence or absence of mutations and other genetic features (e.g. common CNVs, SVs). In this tutorial, we will review how these binary matrices can be used to classify samples according to different classification systems, so that tumours can be assigned labels placing them into genetic subgroups.

Tip

You do not need to assemble the matrix separately: the classify_ family of GAMBLR functions directly accepts the maf/seg/bedpe data and handles matrix construction for you.

GAMBLR contains a collection of trained models and functions to pre-format inputs for these models. It includes the originally published classifiers of Burkitt and follicular lymphomas, as well as reproductions of the DLBCL classifications by the groupings of Chapuy et al, Lacy et al, and Runge et al.

Load packages and data

First, we will load the required packages. Only GAMBLR.open and tidyverse are needed; together they load all the required functionality.

# Load packages
library(GAMBLR.open)
library(tidyverse)

Next, we will obtain the data. We need the metadata, SSM in maf format, CNV in seg format, and SV in bedpe format. For demonstration purposes, we will use the data bundled with GAMBLR to show the classification functionality. Here, we will subset the data to a small number of samples and illustrate the required formatting and the minimal required information.

Metadata

The metadata is a data frame that contains the column sample_id, listing the sample ids of the tumours to be classified, and the column pathology, which dictates the sliding threshold for aSHM site annotation when necessary. The column seq_type is not required for classification purposes, but is kept in this tutorial for the subsequent retrieval of mutation data.

# Load metadata
metadata <- get_gambl_metadata() %>%
    filter(seq_type == "genome") %>%
    filter(pathology %in% c("FL", "DLBCL", "BL")) %>%
    group_by(sample_id) %>%
    slice_head() %>%
    ungroup() %>%
    filter(!study %in% c("DLBCL_Arthur", "DLBCL_Thomas"))

metadata %>%
    count(pathology)

pathology	n
BL	234
DLBCL	334
FL	219

# Demonstrate the required columns in metadata
metadata <- metadata %>%
    select(sample_id, pathology, seq_type) 

head(metadata)

sample_id	pathology	seq_type
00-14595_tumorC	DLBCL	genome
00-14595_tumorD	FL	genome
00-15201_tumorA	DLBCL	genome
00-15201_tumorB	DLBCL	genome
00-26427_tumorA	DLBCL	genome
00-26427_tumorC	DLBCL	genome
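
Before passing metadata to a classifier, it can help to confirm that the columns described above are present and that each sample id appears only once. The following is a minimal sketch on toy data; check_metadata is a hypothetical helper written for this illustration and is not part of GAMBLR.open.

```r
# Toy metadata mirroring the columns shown above (values are illustrative)
toy_metadata <- data.frame(
    sample_id = c("S1", "S2", "S3"),
    pathology = c("DLBCL", "FL", "BL"),
    seq_type  = c("genome", "genome", "genome")
)

# Hypothetical helper (not part of GAMBLR.open): stop with an informative
# message if a required column is missing or a sample id is duplicated
check_metadata <- function(df, required = c("sample_id", "pathology")) {
    missing_cols <- setdiff(required, colnames(df))
    if (length(missing_cols) > 0) {
        stop("Missing columns: ", paste(missing_cols, collapse = ", "))
    }
    duplicated_ids <- unique(df$sample_id[duplicated(df$sample_id)])
    if (length(duplicated_ids) > 0) {
        stop("Duplicated sample_id: ", paste(duplicated_ids, collapse = ", "))
    }
    invisible(TRUE)
}

check_metadata(toy_metadata)
```

The same call could be made on the real metadata object before classification; it only reads the data frame and does not modify it.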

SSM (maf format)

The key input for genetic subgroup classification is SSM in maf format. Only a small subset of columns is required, and here we demonstrate the expected format and the minimal required information for SSM.

maf <- get_ssm_by_samples(
    these_samples_metadata = metadata
) %>% as.data.frame

# Only these columns are required
maf_columns <- c(
    "Hugo_Symbol", "NCBI_Build",
    "Chromosome", "Start_Position", "End_Position",
    "Variant_Classification", "HGVSp_Short",
    "Tumor_Sample_Barcode"
)

maf <- maf %>%
    select(all_of(maf_columns))

head(maf)

Hugo_Symbol	NCBI_Build	Chromosome	Start_Position	End_Position	Variant_Classification	HGVSp_Short	Tumor_Sample_Barcode
TNFRSF14	GRCh37	1	2488139	2488139	Nonsense_Mutation	p.W12*	01-20260T
SPEN	GRCh37	1	16245921	16245921	Missense_Mutation	p.P515R	01-20260T
BRINP3	GRCh37	1	190187637	190187637	Intron		01-20260T
BRINP3	GRCh37	1	190189661	190189661	Intron		01-20260T
BRINP3	GRCh37	1	190264257	190264257	Intron		01-20260T
BTG2	GRCh37	1	203274977	203274977	Intron		01-20260T
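
When the maf comes from another source, a quick sanity check is to confirm that the required columns exist and to see which samples in the metadata have no mutations at all. The sketch below uses toy data; toy_maf, toy_sample_ids and the extra t_depth column are made up for illustration, and only base R is used.

```r
# Toy maf with the required columns plus one extra column (t_depth)
toy_maf <- data.frame(
    Hugo_Symbol = c("TNFRSF14", "SPEN", "BTG2"),
    NCBI_Build = "GRCh37",
    Chromosome = "1",
    Start_Position = c(2488139, 16245921, 203274977),
    End_Position = c(2488139, 16245921, 203274977),
    Variant_Classification = c("Nonsense_Mutation", "Missense_Mutation", "Intron"),
    HGVSp_Short = c("p.W12*", "p.P515R", ""),
    Tumor_Sample_Barcode = c("S1", "S1", "S4"),
    t_depth = c(50, 60, 70)
)

# Sample ids that would normally come from metadata$sample_id
toy_sample_ids <- c("S1", "S2")

maf_columns <- c(
    "Hugo_Symbol", "NCBI_Build",
    "Chromosome", "Start_Position", "End_Position",
    "Variant_Classification", "HGVSp_Short",
    "Tumor_Sample_Barcode"
)

# Fail early if any required column is absent
stopifnot(all(maf_columns %in% colnames(toy_maf)))

# Keep only the samples of interest and the required columns
toy_maf_min <- toy_maf[
    toy_maf$Tumor_Sample_Barcode %in% toy_sample_ids,
    maf_columns
]

# Samples listed in the metadata that have no rows in the maf
samples_without_ssm <- setdiff(toy_sample_ids, toy_maf_min$Tumor_Sample_Barcode)
samples_without_ssm
```

Whether a sample with no mutations should still be classified is a judgment call for the analysis; this check only makes such samples visible.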

CNV (seg format)

Several classifiers (Chapuy et al, Lacy et al, and Runge et al) require CNV information, and here we demonstrate the expected format and the minimal required information for CNV in standard seg format.

seg <- get_cn_segments(
    these_samples_metadata = metadata
) %>% as.data.frame

# Only these columns are required
seg_columns <- c(
    "ID", "chrom", "start", "end", "LOH_flag", "log.ratio"
)

seg <- seg %>%
    select(all_of(seg_columns))

head(seg)

ID	chrom	start	end	log.ratio
02-13135T	1	10001	762600	0.0000000
02-13135T	1	762601	121500000	0.0000000
02-13135T	1	142600000	161506889	0.0000000
02-13135T	1	161506890	161652716	0.0000000
02-13135T	1	161652717	162110568	0.7279926
02-13135T	1	162110569	162111399	0.0000000
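
To make the log.ratio column easier to read, the sketch below converts it to an approximate absolute copy number. It assumes that log.ratio is a log2 ratio relative to a diploid baseline, which is a common convention for seg files but is not stated in this tutorial; toy_seg and log_ratio_to_cn are hypothetical names used only for this illustration.

```r
# Toy segments in the same layout as the seg data above
toy_seg <- data.frame(
    ID = "toy_sample",
    chrom = "1",
    start = c(10001, 762601, 1500001),
    end = c(762600, 1500000, 2000000),
    log.ratio = c(0, 1, -1)
)

# Hypothetical helper: assumes log.ratio = log2(copy number / ploidy)
log_ratio_to_cn <- function(log_ratio, ploidy = 2) {
    ploidy * 2^log_ratio
}

toy_seg$approx_cn <- log_ratio_to_cn(toy_seg$log.ratio)
toy_seg
```

Under this assumption, a log ratio of 0 corresponds to two copies, positive values to gains, and negative values to losses.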

SV (bedpe format)

Several classifiers require SV data, and here we demonstrate the expected format and the minimal required information for SV in standard bedpe format.

bedpe <- get_manta_sv(
    these_samples_metadata = metadata
) %>% as.data.frame

# Only these columns are required
bedpe_columns <- colnames(bedpe)[1:11]

bedpe <- bedpe %>%
    select(all_of(bedpe_columns))

head(bedpe)

CHROM_A	START_A	END_A	CHROM_B	START_B	END_B	manta_name	SCORE	STRAND_A	STRAND_B	tumour_sample_id
1	161658631	161658631	3	16509907	16509907	MantaBND:21171:0:1:0:0:0	133	+	+	FL2002T1
1	161663959	161663959	9	37363320	37363320	MantaBND:206628:0:1:0:0:0	122	+	+	09-15842_tumorA
1	161663959	161663959	9	37363320	37363320	MantaBND:195941:0:1:0:0:0	151	+	+	09-15842_tumorB
11	65267283	65267283	14	106110907	106110907	MantaBND:152220:0:1:0:0:0:0	88	+	-	15-38154T
11	65267422	65267422	14	106110905	106110905	MantaBND:152220:0:1:0:0:0:0	135	-	+	15-38154T
13	91976545	91976545	14	106211857	106211857	MantaBND:18:59794:59817:0:1:0	90	-	+	15-31924T

Note

The column manta_name can be replaced by the equivalent column from whichever tool generated the bedpe output, as long as it reports the unique id of the SV event.

Classify DLBCL

Here, we will explore the reproduced DLBCL classification by the groupings of Chapuy et al, Lacy et al, and Runge et al. The classification algorithm is selected with the method argument of the classify_dlbcl() function.

Important

The DLBCL classifiers were not released with the publications, and the models provided here are the best attempt at reproducing the original models. While the model described by Chapuy et al is reproduced with > 92% accuracy, the results of the Lacy and HMRN classifiers should be treated with caution, as the accuracy of the bundled model is only 80%.

Chapuy et al. (C0-C5 clusters)

The paper originally published in 2018 by Chapuy et al was the first attempt to use a systematic approach to classify DLBCL patients into genetic subgroups using genomic information.

Note

This model classifies tumours according to the original 2018 publication, and is not directly related to the 2024 DLBCLass model.

Here is how these subgroupings can be recapitulated with GAMBLR functionality:

predictions <- classify_dlbcl(
    these_samples_metadata = metadata %>%
        filter(pathology == "DLBCL"),
    maf_data = maf,
    seg_data = seg,
    sv_data = bedpe
)

By default, the classify_ family of functions returns both the assembled matrix and the predictions. The model reports the confidence of each tumour’s subgroup vote, as well as the final label.

# assembled matrix
predictions$matrix[1:5,1:5]

	BCL11A	BCL2
00-14595_tumorC	0	2
00-15201_tumorA	0	0
00-15201_tumorB	0	0
00-26427_tumorA	2	0
00-26427_tumorC	0	0

# subgroup assignment
predictions$predictions %>%
 slice_head(n=10)

sample_id	C1	C2	C3	C4	C5	Chapuy_cluster
00-14595_tumorC	0.1060348	0.0000000	0.5110284	0.2784030	0.1045338	C3
00-15201_tumorA	0.0931391	0.0386996	0.1889158	0.3859610	0.2932845	C4
00-15201_tumorB	0.2842088	0.0914615	0.1265154	0.0730396	0.4247747	C5
00-26427_tumorA	0.0797927	0.0670896	0.0457110	0.1034915	0.7039151	C5
00-26427_tumorC	0.1152963	0.0000000	0.0000000	0.3449964	0.5397073	C5
01-14774_tumorA	0.4723915	0.0275195	0.1548457	0.2401450	0.1050982	C1
01-14774_tumorB	0.5192447	0.0050716	0.0872477	0.3039748	0.0844612	C1
01-16433_tumorA	0.0000000	0.0000000	0.8286825	0.1713175	0.0000000	C3
01-16433_tumorB	0.2873270	0.2551277	0.2676392	0.1094943	0.0804118	C1
01-23117_tumorA	0.2100450	0.0000000	0.0463066	0.2689441	0.4747043	C5

count(predictions$predictions, Chapuy_cluster)

Chapuy_cluster	n
C0	2
C1	38
C2	60
C3	124
C4	36
C5	74
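
Because each tumour receives a vote for every cluster, the gap between the two highest votes gives a rough sense of how decisive a call is. The sketch below computes this on three rows copied from the table above. It assumes the reported label corresponds to the highest of the C1-C5 votes, which holds for the rows shown but does not explain how C0 is assigned. The 0.1 cut-off and the new column names are arbitrary choices made for this illustration.

```r
# Three rows copied from the predictions shown above
toy_predictions <- data.frame(
    sample_id = c("00-14595_tumorC", "01-16433_tumorA", "01-16433_tumorB"),
    C1 = c(0.1060348, 0.0000000, 0.2873270),
    C2 = c(0.0000000, 0.0000000, 0.2551277),
    C3 = c(0.5110284, 0.8286825, 0.2676392),
    C4 = c(0.2784030, 0.1713175, 0.1094943),
    C5 = c(0.1045338, 0.0000000, 0.0804118)
)

vote_columns <- c("C1", "C2", "C3", "C4", "C5")
votes <- as.matrix(toy_predictions[, vote_columns])

# Sort the votes of each tumour from highest to lowest (one row per tumour)
sorted_votes <- t(apply(votes, 1, sort, decreasing = TRUE))

# Cluster with the highest vote, its value, and the gap to the runner-up
toy_predictions$top_cluster <- vote_columns[max.col(votes, ties.method = "first")]
toy_predictions$top_vote <- unname(sorted_votes[, 1])
toy_predictions$margin <- unname(sorted_votes[, 1] - sorted_votes[, 2])

# Arbitrary illustrative threshold for flagging close calls
toy_predictions$ambiguous <- toy_predictions$margin < 0.1

toy_predictions[, c("sample_id", "top_cluster", "top_vote", "margin", "ambiguous")]
```

In this toy subset, 01-16433_tumorB is flagged because its C1 and C3 votes are nearly tied, whereas the other two tumours have a clear leading cluster.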

Lacy et al.

The paper originally published in 2020 by Lacy et al did not release the classification algorithm, but we recapitulated it by training a random forest model with 80% accuracy. Here is how these subgroupings can be recapitulated with GAMBLR functionality:

predictions <- classify_dlbcl(
    these_samples_metadata = metadata %>%
        filter(pathology == "DLBCL"),
    maf_data = maf,
    seg_data = seg,
    sv_data = bedpe,
    method = "lacy"
)

Similar to the Chapuy et al predictions, we can see both the constructed matrix and the confidence of each tumour’s subgroup vote, as well as the final label.

# assembled matrix
predictions$matrix[1:5,1:5]

	BCL11A_amp	MALT1_amp	REL_amp	XPO1_amp	CD58_OR_del
00-14595_tumorC	0	0	0	0	0
00-15201_tumorA	0	0	0	0	0
00-15201_tumorB	0	0	0	0	0
00-26427_tumorA	0	0	0	0	0
00-26427_tumorC	0	0	0	0	1

# subgroup assignment
head(predictions$predictions)

sample_id	BCL2	MYD88	NOTCH2	Other	SOCS1/SGK1	TET2/SGK1	Lacy_cluster
00-14595_tumorC	0.278	0.070	0.058	0.010	0.500	0.084	SOCS1/SGK1
00-15201_tumorA	0.220	0.102	0.148	0.070	0.440	0.020	SOCS1/SGK1
00-15201_tumorB	0.052	0.186	0.500	0.210	0.026	0.026	NOTCH2
00-26427_tumorA	0.020	0.310	0.194	0.470	0.000	0.006	Other
00-26427_tumorC	0.020	0.754	0.032	0.162	0.016	0.016	MYD88
01-14774_tumorA	0.262	0.064	0.196	0.016	0.398	0.064	SOCS1/SGK1

count(predictions$predictions, Lacy_cluster)

Lacy_cluster	n
BCL2	110
MYD88	57
NOTCH2	34
Other	42
SOCS1/SGK1	81
TET2/SGK1	10

HMRN

The paper published by Runge et al modified the original Lacy et al. subgroupings to align the genetic subgroups more closely with the LymphGen classification system. The difference between the Runge and Lacy methods is that tumours with a truncating NOTCH1 mutation are assigned to a separate subgroup regardless of other genetic alterations present in the same tumour sample. Here is how these subgroupings can be recapitulated with GAMBLR functionality:

predictions <- classify_dlbcl(
    these_samples_metadata = metadata %>%
        filter(pathology == "DLBCL"),
    maf_data = maf,
    seg_data = seg,
    sv_data = bedpe,
    method = "hmrn"
)

Take a look at the resulting output:

# assembled matrix
predictions$matrix[1:5,1:5]

	BCL11A_amp	MALT1_amp	REL_amp	XPO1_amp	CD58_OR_del
00-14595_tumorC	0	0	0	0	0
00-15201_tumorA	0	0	0	0	0
00-15201_tumorB	0	0	0	0	0
00-26427_tumorA	0	0	0	0	0
00-26427_tumorC	0	0	0	0	1

# subgroup assignment
head(predictions$predictions)

sample_id	BCL2	MYD88	NOTCH2	Other	SOCS1/SGK1	TET2/SGK1	NOTCH1	hmrn_cluster
00-14595_tumorC	0.278	0.070	0.058	0.010	0.500	0.084	0	SOCS1/SGK1
00-15201_tumorA	0.220	0.102	0.148	0.070	0.440	0.020	0	SOCS1/SGK1
00-15201_tumorB	0.000	0.000	0.000	0.000	0.000	0.000	1	NOTCH1
00-26427_tumorA	0.020	0.310	0.194	0.470	0.000	0.006	0	Other
00-26427_tumorC	0.020	0.754	0.032	0.162	0.016	0.016	0	MYD88
01-14774_tumorA	0.262	0.064	0.196	0.016	0.398	0.064	0	SOCS1/SGK1

count(predictions$predictions, hmrn_cluster)

hmrn_cluster	n
BCL2	110
MYD88	56
NOTCH1	3
NOTCH2	32
Other	41
SOCS1/SGK1	81
TET2/SGK1	11
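
Since the Lacy and HMRN labels differ only for a subset of tumours, a cross-tabulation makes the reassignments easy to spot. The sketch below uses four rows copied from the two tables above. In this tutorial the predictions object is overwritten by each call, so in practice the two results would need to be stored under separate names; toy_lacy and toy_hmrn are hypothetical stand-ins for those saved tables.

```r
# Labels copied from the Lacy and HMRN outputs shown above
sample_ids <- c(
    "00-14595_tumorC", "00-15201_tumorA",
    "00-15201_tumorB", "00-26427_tumorA"
)
toy_lacy <- data.frame(
    sample_id = sample_ids,
    Lacy_cluster = c("SOCS1/SGK1", "SOCS1/SGK1", "NOTCH2", "Other")
)
toy_hmrn <- data.frame(
    sample_id = sample_ids,
    hmrn_cluster = c("SOCS1/SGK1", "SOCS1/SGK1", "NOTCH1", "Other")
)

# Put both labels side by side for each tumour
comparison <- merge(toy_lacy, toy_hmrn, by = "sample_id")

# Tumours whose label changed between the two methods
reassigned <- comparison[comparison$Lacy_cluster != comparison$hmrn_cluster, ]
reassigned

# Cross-tabulation of the two label sets
table(Lacy = comparison$Lacy_cluster, HMRN = comparison$hmrn_cluster)
```

In this toy subset only 00-15201_tumorB changes label, moving from NOTCH2 to NOTCH1, in line with the rule described for the HMRN method.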

Classify FL

Here, we will explore the classification of FL tumours into genetic subgroups with differential propensity for transformation as originally described here. The developed model can be accessed with the classify_fl() function.

predictions <- classify_fl(
    these_samples_metadata = metadata %>%
        filter(pathology  %in% c("FL", "DLBCL")),
    maf_data = maf,
    output = "both"
)

Similar to the DLBCL classifier, we can take a look at the assembled matrix, as well as at the predictions and the confidence of each tumour’s vote:

# assembled matrix
predictions$matrix[1:10, 1:5]

	ARID1A	B2M
00-14595_tumorC	1	0
00-14595_tumorD	0	0
00-15201_tumorA	0	0
00-15201_tumorB	1	0
00-26427_tumorA	0	0
00-26427_tumorC	0	0
01-14774_tumorA	0	1
01-14774_tumorB	0	0
01-16433_tumorA	0	0
01-16433_tumorB	1	1

# subgroup assignment
head(predictions$predictions)

sample_id	cFL	dFL	is_cFL
00-14595_tumorC	0.052	0.948	dFL
00-14595_tumorD	0.038	0.962	dFL
00-15201_tumorA	0.000	1.000	dFL
00-15201_tumorB	0.078	0.922	dFL
00-26427_tumorA	0.018	0.982	dFL
00-26427_tumorC	0.014	0.986	dFL

count(predictions$predictions, is_cFL)

is_cFL	n
cFL	41
dFL	510

Classify BL

Here, we will explore the classification of BL tumours into genetic subgroups as originally described in Thomas et al. The developed model reported in this study can be accessed with the classify_bl() function.

predictions <- classify_bl(
    these_samples_metadata = metadata %>%
        filter(pathology  %in% c("BL", "DLBCL")),
    maf_data = maf
)

Similar to other classifiers, we can take a look at the assembled matrix, as well as at the predictions and the confidence of each tumour’s vote:

# assembled matrix
predictions$matrix[1:10, 1:5]

	ARID1A	B2M	BCL10	BCL11A	BCL2
00-14595_tumorC	1	0	0	0	1
00-15201_tumorA	0	0	0	0	0
00-15201_tumorB	1	0	0	0	0
00-26427_tumorA	0	0	0	1	0
00-26427_tumorC	0	0	0	0	0
01-14774_tumorA	0	1	1	0	0
01-14774_tumorB	0	0	0	0	0
01-16433_tumorA	0	0	0	0	1
01-16433_tumorB	1	1	0	0	1
01-23117_tumorA	0	1	0	0	0

# subgroup assignment
head(predictions$predictions)

sample_id	DGG-BL	DLBCL	IC-BL	Q53-BL	BL_subgroup
00-14595_tumorC	0.116	0.816	0.068	0.000	DLBCL
00-15201_tumorA	0.052	0.930	0.002	0.016	DLBCL
00-15201_tumorB	0.296	0.648	0.050	0.006	DLBCL
00-26427_tumorA	0.036	0.964	0.000	0.000	DLBCL
00-26427_tumorC	0.000	1.000	0.000	0.000	DLBCL
01-14774_tumorA	0.072	0.894	0.018	0.016	DLBCL

count(predictions$predictions, BL_subgroup)

BL_subgroup	n
DGG-BL	106
DLBCL	345
IC-BL	108
Q53-BL	9
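
Because both BL and DLBCL tumours were passed to the classifier, it can be informative to compare the predicted subgroup with the pathology recorded in the metadata. The sketch below does this on toy data; the sample ids and labels are made up for illustration and do not come from the bundled data.

```r
# Toy metadata and predictions (illustrative values only)
toy_metadata <- data.frame(
    sample_id = c("S1", "S2", "S3", "S4"),
    pathology = c("BL", "BL", "DLBCL", "DLBCL")
)
toy_bl_predictions <- data.frame(
    sample_id = c("S1", "S2", "S3", "S4"),
    BL_subgroup = c("DGG-BL", "IC-BL", "DLBCL", "DLBCL")
)

# Attach the recorded pathology to each prediction
annotated <- merge(toy_metadata, toy_bl_predictions, by = "sample_id")

# Rows: recorded pathology, columns: predicted subgroup
crosstab <- table(
    pathology = annotated$pathology,
    BL_subgroup = annotated$BL_subgroup
)
crosstab
```

With the real objects, the same merge could be applied to metadata and predictions$predictions, since both carry a sample_id column.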

That’s it!

Happy GAMBLing!

  /$$$$$$     /$$$$$$    /$$      /$$   /$$$$$$$    /$$        .:::::::
 /$$__  $$   /$$__  $$  | $$$    /$$$  | $$__  $$  | $$        .::    .::
| $$  \__/  | $$  \ $$  | $$$$  /$$$$  | $$  \ $$  | $$        .::    .::
| $$ /$$$$  | $$$$$$$$  | $$ $$/$$ $$  | $$$$$$$   | $$   <-   .: .::
| $$|_  $$  | $$__  $$  | $$  $$$| $$  | $$__  $$  | $$        .::  .::
| $$  \ $$  | $$  | $$  | $$\  $ | $$  | $$  \ $$  | $$        .::    .::
|  $$$$$$/  | $$  | $$  | $$ \/  | $$  | $$$$$$$/  | $$$$$$$$  .::      .::
 \______/   |__/  |__/  |__/     |__/  |_______/   |________/
 ~GENOMIC~~~~~~~~~~~~~OF~~~~~~~~~~~~~~~~~B-CELL~~~~~~~~~~~~~~~~~~IN~~~~~~
 ~~~~~~~~~~~~ANALYSIS~~~~~~MATURE~~~~~~~~~~~~~~~~~~~LYMPHOMAS~~~~~~~~~~R~