GAMBLR.data • GAMBLR.open

The GAMBLR.data package, which is installed and loaded as part of GAMBLR.open, comes with many bundled objects that serve various purposes. These can be divided into the following categories:

Somatic variants

sample_data A list of data frames containing the metadata, simple somatic mutations, copy number variants, and structural variants collected from the supplemental tables of large sequencing studies of B-cell lymphomas.

Important

The data stored in sample_data is not meant to be accessed directly. You should use one of the various get_ functions in GAMBLR.open to retrieve the data for your use case.
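
As a rough sketch of that workflow, the snippet below retrieves sample metadata with get_gambl_metadata(), one of the get_ functions in GAMBLR.open, instead of indexing into sample_data. The requireNamespace() guard and the meta variable are illustrative additions, and the call relies on the function's default arguments, so check the function reference for the options available in your installed version.

```r
# Retrieve sample metadata through an accessor instead of touching sample_data.
# Assumption: GAMBLR.open is installed; the guard lets the script run either way.
meta <- NULL
if (requireNamespace("GAMBLR.open", quietly = TRUE)) {
  meta <- GAMBLR.open::get_gambl_metadata()
  print(dim(meta))  # number of samples (rows) and metadata columns
}
```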

Curated gene lists

gene_blacklist A tibble with gene symbols (Hugo) that fall within blacklisted regions of the genome. The genes in this data object represent common sequencing artifacts and are discarded during the data analysis.

# Look at 15 randomly sampled genes in the blacklist
# (slice_sample() comes from dplyr, so attach it first)
library(dplyr)
slice_sample(gene_blacklist, n = 15)

Gene
KRTAP20-2
PABPC1
GYPB
MUC22
MUC17
IGKV1D-8
RPL10
MUC6
KRTAP10-8
KRTAP1-5
NFATC4
KRTAP4-5
OR2L9P
PCDHB13
HSP90AB1
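
One way the blacklist might be applied is to drop blacklisted symbols from a list of genes before further analysis. The sketch below is illustrative only: blacklist_toy is a stand-in with the same single Gene column, filled with a few genes from the sample above, and candidate_genes is a hypothetical input. Base R is used so that the snippet runs without extra packages.

```r
# Toy stand-in for gene_blacklist, using genes from the sample printed above.
# With GAMBLR.data loaded you would use the bundled gene_blacklist object instead.
blacklist_toy <- data.frame(
  Gene = c("MUC17", "MUC6", "PABPC1", "RPL10", "HSP90AB1"),
  stringsAsFactors = FALSE
)

# Hypothetical genes from an analysis; MUC6 and RPL10 are in the toy blacklist.
candidate_genes <- c("BCL2", "MUC6", "BCL6", "RPL10", "BTK")

# Keep only the genes that are not blacklisted.
kept_genes <- candidate_genes[!candidate_genes %in% blacklist_toy$Gene]
print(kept_genes)
```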

lymphoma_genes A data frame with a manually curated set of genes commonly mutated in lymphomas. For each pathology, there is a column that indicates whether (TRUE) or not (FALSE) each gene is in either the Tier 1 or Tier 2 list for that pathology. For more granularity, you can also use the corresponding PATHOLOGY_Tier column e.g. DLBCL_Tier.

# Look at the first 15 DLBCL genes in Tier 1
dplyr::filter(lymphoma_genes, DLBCL_Tier == 1) %>%
    slice_head(n = 15)

Gene	MCL_Tier	MCL	FL_Tier	FL	DLBCL_Tier	DLBCL	BL_Tier	BL	ensembl_gene_id	hgnc_symbol	LymphGen	Reddy	Chapuy
ACTB	NA	FALSE	1	TRUE	1	TRUE	NA	FALSE	ENSG00000075624	ACTB	TRUE	TRUE	TRUE
ACTG1	NA	FALSE	2	TRUE	1	TRUE	NA	FALSE	ENSG00000267807	ACTG1	TRUE	FALSE	FALSE
ARID1A	NA	FALSE	1	TRUE	1	TRUE	1	TRUE	ENSG00000117713	ARID1A	TRUE	TRUE	FALSE
ATM	1	TRUE	NA	FALSE	1	TRUE	NA	FALSE	ENSG00000149311	ATM	FALSE	TRUE	FALSE
B2M	2	TRUE	1	TRUE	1	TRUE	2	TRUE	ENSG00000166710	B2M	FALSE	TRUE	TRUE
BCL10	NA	FALSE	1	TRUE	1	TRUE	NA	FALSE	ENSG00000142867	BCL10	TRUE	TRUE	TRUE
BCL2	NA	FALSE	1	TRUE	1	TRUE	2	TRUE	ENSG00000171791	BCL2	TRUE	TRUE	TRUE
BCL6	NA	FALSE	1	TRUE	1	TRUE	1	TRUE	ENSG00000113916	BCL6	TRUE	TRUE	TRUE
BCL7A	NA	FALSE	1	TRUE	1	TRUE	1	TRUE	ENSG00000110987	BCL7A	FALSE	TRUE	FALSE
BIRC6	NA	FALSE	NA	FALSE	1	TRUE	NA	FALSE	ENSG00000115760	BIRC6	FALSE	TRUE	FALSE
BRAF	NA	FALSE	NA	FALSE	1	TRUE	NA	FALSE	ENSG00000157764	BRAF	FALSE	TRUE	TRUE
BTG1	NA	FALSE	2	TRUE	1	TRUE	2	TRUE	ENSG00000133639	BTG1	TRUE	TRUE	TRUE
BTG2	NA	FALSE	2	TRUE	1	TRUE	NA	FALSE	ENSG00000159388	BTG2	TRUE	TRUE	TRUE
BTK	NA	FALSE	1	TRUE	1	TRUE	NA	FALSE	ENSG00000010671	BTK	FALSE	TRUE	FALSE
CARD11	1	TRUE	1	TRUE	1	TRUE	2	TRUE	ENSG00000198286	CARD11	FALSE	TRUE	TRUE
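
The difference between the logical pathology column and the corresponding _Tier column can be sketched with a small example. The toy data frame below copies five rows and four columns from the output above; the object name lymphoma_genes_toy and the result variables are illustrative. Because the tier is NA for genes outside a pathology's list, the tier comparison is wrapped in which() so that NA comparisons are dropped.

```r
# Toy subset of lymphoma_genes: five rows and four columns copied from the output above.
lymphoma_genes_toy <- data.frame(
  Gene       = c("ACTB", "ACTG1", "ATM", "B2M", "BIRC6"),
  FL_Tier    = c(1, 2, NA, 1, NA),
  FL         = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  DLBCL_Tier = c(1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Logical column: genes in either Tier 1 or Tier 2 for FL.
fl_any_tier <- lymphoma_genes_toy$Gene[lymphoma_genes_toy$FL]

# Tier column: Tier 1 only. which() drops the NA comparisons that arise
# for genes with no FL tier.
fl_tier1 <- lymphoma_genes_toy$Gene[which(lymphoma_genes_toy$FL_Tier == 1)]

print(fl_any_tier)
print(fl_tier1)
```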

lymphoma_genes_comprehensive An older (and outdated) curated list of genes reported as significantly mutated in the large lymphoma studies. Both Ensembl ID and Hugo Symbol are available as gene identifiers. This data contains annotations for the studies by Chapuy, Reddy, Wright (LymphGen), and Lacy, as well as annotations for whether the gene is curated, reported as SMG in other_studies, or a target of aSHM. It is included mostly for backwards compatibility, and you should use lymphoma_genes instead.

Coordinate-based resources

chromosome_arms_grch37 A data frame with the chromosome arm coordinates with respect to the grch37 projection.
chromosome_arms_hg38 A data frame with the chromosome arm coordinates with respect to the hg38 projection.
grch37_gene_coordinates A data frame of all gene coordinates with respect to grch37. Contains both Ensembl ID and Hugo Symbol as identifiers.
grch37_lymphoma_genes_bed A data frame in the bed format for genes commonly associated with B-cell lymphomas. Coordinates are with respect to grch37.
grch37_oncogene A data frame with the coordinates of lymphoma oncogenes relative to grch37. Used in mapping of the breakpoint coordinates.
grch37_partners A data frame of translocation partners for oncogenes with coordinates relative to grch37.
hg38_gene_coordinates A data frame of all gene coordinates with respect to hg38. Contains both Ensembl ID and Hugo Symbol as identifiers.
hg38_lymphoma_genes_bed A data frame in the bed format for genes commonly associated with B-cell lymphomas. Coordinates are with respect to hg38.
hg38_oncogene A data frame with the coordinates of lymphoma oncogenes relative to hg38. Used in mapping of the breakpoint coordinates.
hg38_partners A data frame of translocation partners for oncogenes with coordinates relative to hg38.
grch37_all_gene_coordinates A data frame of protein-coding gene coordinates relative to grch37. Contains both Ensembl ID and Hugo Symbol as identifiers. Mainly here for backwards compatibility with earlier GAMBLR versions.
hotspot_regions_grch37 A data frame of mutation hotspot regions relative to grch37.
hotspot_regions_hg38 A data frame of mutation hotspot regions relative to hg38.
target_regions_grch37 A data frame with coordinates of the regions of the genome targeted by the whole exome sequencing panel Agilent V5 (no UTR) relative to grch37.
target_regions_hg38 A data frame with coordinates of the regions of the genome targeted by the whole exome sequencing panel Agilent V5 (no UTR) relative to hg38.

aSHM regions

grch37_ashm_regions Aberrant somatic hypermutation (aSHM) regions relative to grch37. This object always refers to the most recent version of the aSHM regions.

# Peek at the first 15 rows
grch37_ashm_regions %>%
    slice_head(n=15)

chr_name	hg19_start	hg19_end	gene	region	regulatory_comment
chr1	6661482	6662702	KLHL21	TSS	NA
chr1	23885584	23885835	ID3	TSS	NA
chr1	28832551	28836339	RCC1	TSS	NA
chr1	31229012	31232011	LAPTM5	TSS	NA
chr1	150550814	150552135	MCL1	intron	NA
chr2	88904839	88909096	EIF2Ak3	intron	NA
chr2	88925456	88927581	EIF2Ak3	TSS	NA
chr2	232572640	232574297	PTMA	TSS	NA
chr2	157669490	157671299	FCRL3	TSS	NA
chr1	203274698	203275778	BTG2	intron	active_promoter
chr1	206285239	206288105	RHEX	TSS	NA
chr1	226864857	226873452	ITPKB	intron	weak_enhancer
chr1	226920563	226927885	ITPKB	TSS	active_promoter
chr1	226921088	226927982	ITPKB	intron-1	enhancer
chr1	245023502	245029083	HNRNPU	TSS	NA

hg38_ashm_regions Aberrant somatic hypermutation (aSHM) regions relative to hg38. This object always refers to the most recent version of the aSHM regions.

# Peek at the first 15 rows
hg38_ashm_regions %>%
    slice_head(n=15)

chr_name	hg38_start	hg38_end	gene	region	regulatory_comment
chr1	6601422	6602642	KLHL21	TSS	NA
chr1	23559093	23559344	ID3	TSS	NA
chr1	28506039	28509827	RCC1	TSS	NA
chr1	30756165	30759164	LAPTM5	TSS	NA
chr1	150578338	150579659	MCL1	intron	NA
chr2	88605321	88609578	EIF2Ak3	intron	NA
chr2	88625938	88628063	EIF2Ak3	TSS	NA
chr2	231707930	231709587	PTMA	TSS	NA
chr2	156812978	156814787	FCRL3	TSS	NA
chr1	203305570	203306650	BTG2	intron	active_promoter
chr1	206053265	206054139	RHEX	TSS	NA
chr1	226677156	226685751	ITPKB	intron	weak_enhancer
chr1	226732862	226740184	ITPKB	TSS	active_promoter
chr1	226733387	226740281	ITPKB	intron-1	enhancer
chr1	244860200	244865781	HNRNPU	TSS	NA
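
A common use of these tables is to check whether a genomic position falls inside an aSHM region. The sketch below is a minimal illustration in base R, not a GAMBLR function: ashm_toy copies three rows from the hg38 output above, find_ashm_region() is a hypothetical helper, and both ends of each region are assumed to be inclusive. Note that the coordinate columns are named hg38_start and hg38_end here, whereas the grch37 object uses hg19_start and hg19_end.

```r
# Toy subset of hg38_ashm_regions: three rows copied from the output above.
ashm_toy <- data.frame(
  chr_name   = c("chr1", "chr1", "chr1"),
  hg38_start = c(6601422, 23559093, 150578338),
  hg38_end   = c(6602642, 23559344, 150579659),
  gene       = c("KLHL21", "ID3", "MCL1"),
  region     = c("TSS", "TSS", "intron"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows of `regions` that contain a position.
# Assumes both region ends are inclusive.
find_ashm_region <- function(chrom, pos, regions) {
  hit <- regions$chr_name == chrom &
    regions$hg38_start <= pos &
    regions$hg38_end >= pos
  regions[hit, c("gene", "region")]
}

inside  <- find_ashm_region("chr1", 23559200, ashm_toy)  # within the ID3 region
outside <- find_ashm_region("chr1", 1000000, ashm_toy)   # matches no toy region
print(inside)
print(nrow(outside))
```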

Other resources

colour_codes A data frame with colour codes (HEX) arranged into different categories and groups.
dhitsig_genes_with_weights A data frame with double hit signature genes (both as Ensembl IDs and Hugo symbols) and importance scores.
gambl_metadata A data frame with metadata for a collection of GAMBL samples. This represents a collection of whole genome, exome, targeted, RNA, and PromethION sequencing samples available as a data set known as GAMBL. This object is provided for reference only, as not all samples listed here are published and bundled with GAMBLR.data.
hgnc2pfam.df A dataset containing the mapping table between Hugo symbol, UniProt ID, and Pfam ACC. This dataset comes from the g3viz package and was obtained via this URL: https://github.com/morinlab/g3viz/tree/master/data
hotspots_annotations Hotspot coordinates used for feature annotation when assembling the data matrix for the cFL classifier.
mirage_metrics A data frame providing the data reported in the Supplemental Table of the MIRAGE study by Dreval et al., 2022.
mutation.table.df A data frame providing the linkage between Variant Classification, Mutation_Class, and Short_Name for the simple somatic mutations.
reddy_genes A data frame of the genes reported as significantly mutated by the study of Reddy et al., 2017.
wright_genes_with_weights Wright genes with weight values from the study by Scott et al., 2014.