GAMBLR terms and concepts
There are several concepts that often come up when using any of the GAMBLR packages. In particular, there are a few arguments that are required by most functions and that have a very specific meaning in the context of GAMBLR. This guide will help you understand the importance of these and how they are used across GAMBLR.
metadata
In its minimal form, this is a data frame with a set of 7 required columns: patient_id, Tumor_Sample_Barcode, sample_id, seq_type, sex, cohort, and pathology. In practice, many of these variables, such as patient details (e.g. sex, pathology) and convenience variables (e.g. cohort or study), can contain NA values, but the required columns must still be present in the metadata. The main purpose of this data frame is to link unique sample identifiers to the basic metadata fields that numerous GAMBLR functions rely on. Notably, the columns Tumor_Sample_Barcode and sample_id are expected to be identical. In most cases, sample_id is used, but the alias Tumor_Sample_Barcode must also be present, mostly for ease of linking MAF-type data with the metadata.
Here is an example from GAMBLR.data:
sample_id | Tumor_Sample_Barcode | patient_id | seq_type | pathology | cohort | sex |
---|---|---|---|---|---|---|
SP193005 | SP193005 | DO228305 | genome | DLBCL | DLBCL_ICGC | M |
Reddy_2144T | Reddy_2144T | Reddy_2144 | capture | DLBCL | dlbcl_reddy | F |
DLBCL11000T | DLBCL11000T | DLBCL11000 | capture | DLBCL | dlbcl_schmitz | NA |
Reddy_3520T | Reddy_3520T | Reddy_3520 | capture | DLBCL | dlbcl_reddy | NA |
DLBCL10780T | DLBCL10780T | DLBCL10780 | capture | DLBCL | dlbcl_schmitz | NA |
c_M_1616a | c_M_1616a | c_M_1616a | capture | PMBCL | NCI_DLBCL_Golub | NA |
Reddy_3990T | Reddy_3990T | Reddy_3990 | capture | DLBCL | dlbcl_reddy | M |
Reddy_2464T | Reddy_2464T | Reddy_2464 | capture | DLBCL | dlbcl_reddy | M |
SP193925 | SP193925 | DO228303 | genome | FL | FL_ICGC | F |
99-27137T | 99-27137T | 99-27137 | genome | DLBCL | DLBCL_Marra | F |
Reddy_3948T | Reddy_3948T | Reddy_3948 | capture | DLBCL | dlbcl_reddy | F |
Reddy_2805T | Reddy_2805T | Reddy_2805 | capture | DLBCL | dlbcl_reddy | M |
DLBCL-RICOVER_102-Tumor | DLBCL-RICOVER_102-Tumor | DLBCL-RICOVER_102 | capture | DLBCL | dlbcl_chapuy | NA |
Reddy_2588T | Reddy_2588T | Reddy_2588 | capture | DLBCL | dlbcl_reddy | F |
16-17861T | 16-17861T | 16-17861 | genome | DLBCL | DLBCL_GenomeCanada | F |
If you are analyzing your own data in GAMBLR and don’t have detailed metadata, the bare minimum you could get away with is a data frame with dummy values for every required column. Here’s an example that assumes you are working entirely with exome (capture) data from a collection of Burkitt lymphoma (BL) patients.
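library(magrittr) # provides the %>% pipe used on the last line

# the same identifiers are reused for sample_id, Tumor_Sample_Barcode and patient_id
actual_ids = c("my_sample1", "my_sample2", "my_sample3", "my_sample_1000")
metadata = data.frame(sample_id = actual_ids,
                      Tumor_Sample_Barcode = actual_ids,
                      patient_id = actual_ids,
                      seq_type = "capture", # exome data
                      pathology = "BL",
                      cohort = NA,          # unknown, but the column must exist
                      sex = NA)
metadata %>% kableExtra::kable()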
sample_id | Tumor_Sample_Barcode | patient_id | seq_type | pathology | cohort | sex |
---|---|---|---|---|---|---|
my_sample1 | my_sample1 | my_sample1 | capture | BL | NA | NA |
my_sample2 | my_sample2 | my_sample2 | capture | BL | NA | NA |
my_sample3 | my_sample3 | my_sample3 | capture | BL | NA | NA |
my_sample_1000 | my_sample_1000 | my_sample_1000 | capture | BL | NA | NA |
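The requirements described in this section can be checked before handing a data frame to other functions. The following is a minimal sketch in base R, not a GAMBLR function: the name check_metadata is hypothetical, and the uniqueness check is an assumption drawn from the phrase "unique sample identifiers" rather than a documented rule.

```r
# Hypothetical helper (not part of any GAMBLR package): verifies that a
# data frame meets the minimal metadata requirements described above.
required_cols <- c("sample_id", "Tumor_Sample_Barcode", "patient_id",
                   "seq_type", "pathology", "cohort", "sex")

check_metadata <- function(metadata) {
  # every required column must be present, even if it only holds NA
  missing_cols <- setdiff(required_cols, colnames(metadata))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Tumor_Sample_Barcode is an alias of sample_id, so the two must match
  if (!identical(as.character(metadata$sample_id),
                 as.character(metadata$Tumor_Sample_Barcode))) {
    stop("sample_id and Tumor_Sample_Barcode must be identical")
  }
  # assumption: each row describes one unique sample
  if (anyDuplicated(metadata$sample_id) > 0) {
    stop("sample_id values must be unique")
  }
  invisible(TRUE)
}

toy_ids <- c("my_sample1", "my_sample2")
toy_metadata <- data.frame(sample_id = toy_ids,
                           Tumor_Sample_Barcode = toy_ids,
                           patient_id = toy_ids,
                           seq_type = "capture",
                           pathology = "BL",
                           cohort = NA,
                           sex = NA)
check_metadata(toy_metadata) # passes silently
```

Dropping any required column, for example sex, makes the check stop with an error that names the missing column.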
genome build
GAMBLR functions and the data in GAMBLR.data all support two main genomic coordinate systems. Because GAMBL itself, and the data bundled with and available through this package, represent a large collection of samples sequenced both locally and externally, we opted to support only two genome build flavours. For somewhat nuanced reasons, coordinates using grch37 or hg38 are supported. grch37 uses the same numeric coordinate system as hg19 but, for the former, chromosome names lack the “chr” prefix. In contrast, the hg38 genome build is always chr-prefixed and is in the same coordinate system as the GRCh38 genome build. If you are analyzing your own data in GAMBLR, you will need to know which coordinate system and genome build is appropriate. Generally, you can make results that correspond to an unsupported genome build (e.g. hg19 or GRCh38) compatible with the matching supported build by removing or adding the “chr” prefix in the chromosome names.
# example of grch37 coordinates
head(GAMBLR.data::chromosome_arms_grch37)
chromosome start end arm
1 1 10000 121500000 p
2 1 142600000 249250621 q
3 2 10000 90500000 p
4 2 96800000 243199373 q
5 3 10000 87900000 p
6 3 98300000 198022430 q
# example of hg38 coordinates
head(GAMBLR.data::chromosome_arms_hg38)
chromosome start end arm
1 chr1 10000 121700000 p
2 chr1 143200000 248956422 q
3 chr2 0 91800000 p
4 chr2 96000000 242193529 q
5 chr3 0 87800000 p
6 chr3 98600000 198295559 q
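Switching between the prefixed and unprefixed naming styles only involves the chromosome column. The sketch below uses base R and two hypothetical helpers, add_chr_prefix and remove_chr_prefix, which are not GAMBLR functions. It changes names only: coordinates are left untouched, so it cannot convert between grch37 and hg38.

```r
# Hypothetical helpers (not GAMBLR functions) for switching naming styles.
add_chr_prefix <- function(chromosome) {
  chromosome <- as.character(chromosome)
  # leave names that already carry the prefix unchanged
  ifelse(startsWith(chromosome, "chr"), chromosome, paste0("chr", chromosome))
}

remove_chr_prefix <- function(chromosome) {
  sub("^chr", "", as.character(chromosome))
}

# two rows in grch37 style, taken from the chromosome arms example above
grch37_style <- data.frame(chromosome = c("1", "2"),
                           start = c(10000, 10000),
                           end = c(121500000, 90500000),
                           arm = c("p", "p"))

# same coordinates, hg19-style chromosome names
hg19_style <- transform(grch37_style,
                        chromosome = add_chr_prefix(chromosome))

# and back again
round_trip <- transform(hg19_style,
                        chromosome = remove_chr_prefix(chromosome))
```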
projection
You can think of projection as an alias for genome build that is used in a very specific (and common) context within GAMBLR packages. When you use a function that retrieves some sort of data on a coordinate system, you must tell that function which coordinate system you want. Within GAMBLR.data, data are available relative to grch37 or hg38. Behind the scenes, results have all been lifted over, or projected, to both builds using tools such as liftOver or CrossMap. Hence, we refer to the results relative to any given genome build as a projection.
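To illustrate the pattern, here is a minimal sketch of a function that takes a projection argument and returns the matching copy of some data. The function get_toy_arms and the toy_arms list are hypothetical and exist only for this example; they are not part of GAMBLR.data. The values are copied from the chromosome arm tables shown earlier.

```r
# Hypothetical example (not GAMBLR API): one copy of the data per projection.
toy_arms <- list(
  grch37 = data.frame(chromosome = "1", start = 10000,
                      end = 121500000, arm = "p"),
  hg38 = data.frame(chromosome = "chr1", start = 10000,
                    end = 121700000, arm = "p")
)

get_toy_arms <- function(projection = c("grch37", "hg38")) {
  # match.arg() rejects anything other than the two supported builds
  projection <- match.arg(projection)
  toy_arms[[projection]]
}

get_toy_arms(projection = "grch37") # unprefixed chromosome names
get_toy_arms(projection = "hg38")   # chr-prefixed chromosome names
```

Restricting the argument to the two supported values means a request such as projection = "hg19" fails immediately instead of silently returning data on the wrong coordinate system.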