GAMBLR terms and concepts

There are several concepts that often come up when using any of the GAMBLR packages. In particular, there are a few arguments that are required by most functions and that have a very specific meaning in the context of GAMBLR. This guide will help you understand the importance of these and how they are used across GAMBLR.

metadata

In its minimal form, this is a data frame with a set of 7 required columns: patient_id, Tumor_Sample_Barcode, sample_id, seq_type, sex, cohort, and pathology. In practice, many of these columns, such as patient details (e.g. sex, pathology) and convenience variables (e.g. cohort or study), can contain NA values, but the required columns must still be present in the metadata. This data frame gives the many GAMBLR functions that rely on metadata a consistent structure to work with, linking unique sample identifiers to the basic metadata fields those functions need. Notably, the columns Tumor_Sample_Barcode and sample_id are expected to be identical. In most cases sample_id is used, but the alias Tumor_Sample_Barcode must also be present, mostly for ease of linking MAF-type data with the metadata.

Here is an example from GAMBLR.data:

sample_id	Tumor_Sample_Barcode	patient_id	seq_type	pathology	cohort	sex
SP193005	SP193005	DO228305	genome	DLBCL	DLBCL_ICGC	M
Reddy_2144T	Reddy_2144T	Reddy_2144	capture	DLBCL	dlbcl_reddy	F
DLBCL11000T	DLBCL11000T	DLBCL11000	capture	DLBCL	dlbcl_schmitz	NA
Reddy_3520T	Reddy_3520T	Reddy_3520	capture	DLBCL	dlbcl_reddy	NA
DLBCL10780T	DLBCL10780T	DLBCL10780	capture	DLBCL	dlbcl_schmitz	NA
c_M_1616a	c_M_1616a	c_M_1616a	capture	PMBCL	NCI_DLBCL_Golub	NA
Reddy_3990T	Reddy_3990T	Reddy_3990	capture	DLBCL	dlbcl_reddy	M
Reddy_2464T	Reddy_2464T	Reddy_2464	capture	DLBCL	dlbcl_reddy	M
SP193925	SP193925	DO228303	genome	FL	FL_ICGC	F
99-27137T	99-27137T	99-27137	genome	DLBCL	DLBCL_Marra	F
Reddy_3948T	Reddy_3948T	Reddy_3948	capture	DLBCL	dlbcl_reddy	F
Reddy_2805T	Reddy_2805T	Reddy_2805	capture	DLBCL	dlbcl_reddy	M
DLBCL-RICOVER_102-Tumor	DLBCL-RICOVER_102-Tumor	DLBCL-RICOVER_102	capture	DLBCL	dlbcl_chapuy	NA
Reddy_2588T	Reddy_2588T	Reddy_2588	capture	DLBCL	dlbcl_reddy	F
16-17861T	16-17861T	16-17861	genome	DLBCL	DLBCL_GenomeCanada	F

If you are analyzing your own data in GAMBLR and don’t have detailed metadata, the bare minimum you could get away with is a data frame with dummy values for every required column. Here’s an example that assumes you are working entirely with exome (capture) data from a collection of Burkitt lymphoma (BL) patients.

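library(magrittr) # provides the %>% pipe used below

# replace these with your real sample identifiers
actual_ids = c("my_sample1","my_sample2","my_sample3","my_sample_1000")

# sample_id, Tumor_Sample_Barcode and patient_id all reuse the same IDs;
# unknown fields are set to NA but the columns are still present
metadata = data.frame(sample_id=actual_ids,
                      Tumor_Sample_Barcode=actual_ids,
                      patient_id = actual_ids,
                      seq_type = "capture",
                      pathology = "BL",
                      cohort = NA,
                      sex=NA)
metadata %>% kableExtra::kable()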

sample_id	Tumor_Sample_Barcode	patient_id	seq_type	pathology	cohort	sex
my_sample1	my_sample1	my_sample1	capture	BL	NA	NA
my_sample2	my_sample2	my_sample2	capture	BL	NA	NA
my_sample3	my_sample3	my_sample3	capture	BL	NA	NA
my_sample_1000	my_sample_1000	my_sample_1000	capture	BL	NA	NA
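
Before passing your own metadata to GAMBLR functions, it can help to confirm that it meets the requirements described above. The following is a minimal sketch in base R. Note that check_metadata is a hypothetical helper written for this guide, not a function provided by any GAMBLR package. It only checks that the 7 required columns exist and that sample_id matches Tumor_Sample_Barcode.

```r
# Columns this guide lists as required in GAMBLR metadata
required_cols <- c("patient_id", "Tumor_Sample_Barcode", "sample_id",
                   "seq_type", "sex", "cohort", "pathology")

# Hypothetical helper (not a GAMBLR function): stops with a message if a
# required column is missing or if the two sample ID columns disagree
check_metadata <- function(meta) {
  missing_cols <- setdiff(required_cols, colnames(meta))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!identical(as.character(meta$sample_id),
                 as.character(meta$Tumor_Sample_Barcode))) {
    stop("sample_id and Tumor_Sample_Barcode must be identical")
  }
  invisible(TRUE)
}

# Toy metadata in the same shape as the dummy example above
ids <- c("my_sample1", "my_sample2")
meta <- data.frame(sample_id = ids,
                   Tumor_Sample_Barcode = ids,
                   patient_id = ids,
                   seq_type = "capture",
                   pathology = "BL",
                   cohort = NA,
                   sex = NA)

check_metadata(meta)
```

NA values are deliberately not flagged here, since only the presence of each column is required.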

genome build

GAMBLR functions and the data in GAMBLR.data support two genomic coordinate systems. Because GAMBL itself, along with the data bundled in and available through this package, represents a large collection of samples sequenced both locally and externally, we opted to support only two genome build flavours: grch37 and hg38. The reasons for choosing this particular pair are somewhat nuanced. grch37 uses the same numeric coordinate system as hg19 (the UCSC counterpart of GRCh37), but its chromosome names lack the “chr” prefix. In contrast, the hg38 genome build is always chr-prefixed and is in the same coordinate system as the GRCh38 genome build. If you are analyzing your own data in GAMBLR, you will need to know which coordinate system and genome build your data use. Generally, you can make results that correspond to an unsupported genome build (e.g. hg19 or GRCh38) compatible by removing or adding the “chr” prefix to the chromosome names.

# example of grch37 coordinates
head(GAMBLR.data::chromosome_arms_grch37)

  chromosome     start       end arm
1          1     10000 121500000   p
2          1 142600000 249250621   q
3          2     10000  90500000   p
4          2  96800000 243199373   q
5          3     10000  87900000   p
6          3  98300000 198022430   q

# example of hg38 coordinates
head(GAMBLR.data::chromosome_arms_hg38)

  chromosome     start       end arm
1       chr1     10000 121700000   p
2       chr1 143200000 248956422   q
3       chr2         0  91800000   p
4       chr2  96000000 242193529   q
5       chr3         0  87800000   p
6       chr3  98600000 198295559   q
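
If your own results use hg19-style or GRCh38-style chromosome names, the prefix conversion described above can be sketched in base R as follows. The helper names add_chr_prefix and remove_chr_prefix are hypothetical and not part of GAMBLR, and the regions are toy values. The sketch only changes chromosome names; it does not move coordinates between builds.

```r
# Hypothetical helpers (not part of GAMBLR) for switching chromosome naming
add_chr_prefix <- function(chrom) {
  chrom <- as.character(chrom)
  # leave names that already carry the prefix untouched
  ifelse(startsWith(chrom, "chr"), chrom, paste0("chr", chrom))
}

remove_chr_prefix <- function(chrom) {
  sub("^chr", "", as.character(chrom))
}

# Toy hg19-style regions: chr-prefixed names, GRCh37 numeric coordinates
regions_hg19 <- data.frame(chromosome = c("chr1", "chr2"),
                           start = c(10000, 10000),
                           end = c(121500000, 90500000))

# Drop the prefix so the names match the grch37 flavour used by GAMBLR
regions_grch37 <- regions_hg19
regions_grch37$chromosome <- remove_chr_prefix(regions_hg19$chromosome)
regions_grch37
```

The reverse direction, add_chr_prefix, would apply to GRCh38-style names that need to match hg38.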

projection

You can think of projection as an alias for genome build that is used in a very specific (and common) context within GAMBLR packages. When you use a function that retrieves some sort of data on a coordinate system, you must tell that function which coordinate system you want. Within GAMBLR.data, data are available relative to grch37 or hg38. Behind the scenes, results have all been lifted over, or projected, to both builds using tools such as liftOver or CrossMap. Hence, we refer to the results relative to any given genome build as a projection.
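
To illustrate the idea, here is a minimal sketch of how a function might use a projection argument to choose between two pre-computed copies of the same data. The accessor get_arms and the list arms_by_projection are hypothetical and exist only for this example. They are not how GAMBLR stores or retrieves data. The toy values are the chromosome 1 rows from the chromosome arm tables shown earlier.

```r
# Toy stand-ins for the same data stored once per projection
arms_by_projection <- list(
  grch37 = data.frame(chromosome = c("1", "1"),
                      start = c(10000, 142600000),
                      end = c(121500000, 249250621),
                      arm = c("p", "q")),
  hg38 = data.frame(chromosome = c("chr1", "chr1"),
                    start = c(10000, 143200000),
                    end = c(121700000, 248956422),
                    arm = c("p", "q"))
)

# Hypothetical accessor (not a GAMBLR function): returns the copy of the
# data that matches the requested projection, defaulting to grch37
get_arms <- function(projection = c("grch37", "hg38")) {
  # match.arg() raises an error for values outside the supported set
  projection <- match.arg(projection)
  arms_by_projection[[projection]]
}

get_arms("grch37")
get_arms("hg38")
```

In this sketch, asking for an unsupported build such as "hg19" fails immediately rather than silently returning coordinates in the wrong system.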