GAMBLR.data glossary

There are several key concepts underlying the logic behind the GAMBLR.data package. The main terms are:

projection: A coordinate system that pairs a genome build with a chromosome-naming convention. The main projections supported throughout GAMBLR.data are grch37 and hg38. The grch37 projection uses the same coordinate system as genome build hg19 but never has the “chr” prefix on chromosome names. In contrast, the hg38 projection is always chr-prefixed and uses the coordinate system of the hg38 genome build. GAMBL itself, and the data bundled with and available through this package, represent a large collection of samples sequenced both locally and externally. Directly comparing and comprehensively analyzing such data is complicated by differences in chromosome prefixes, custom contigs and their lengths in the FASTA reference, coordinates, and other distinctions. These difficulties are handled internally by making the data always available in both projections, regardless of the genome build to which a sample was originally aligned.
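
The prefix convention described for the two projections can be sketched in base R. This is only an illustration: normalize_chromosomes is a hypothetical helper, not a GAMBLR.data function, and it handles the "chr" prefix only. It does not convert coordinates between genome builds and does not deal with custom contigs.

```r
# Hypothetical helper for illustration; not part of GAMBLR.data.
# Rewrites chromosome names to match the prefix convention of a projection:
# grch37 names never carry "chr", hg38 names always do.
normalize_chromosomes <- function(chromosomes, projection = c("grch37", "hg38")) {
  projection <- match.arg(projection)
  # Strip any existing prefix so the input may mix both conventions.
  bare <- sub("^chr", "", as.character(chromosomes))
  if (projection == "hg38") {
    paste0("chr", bare)
  } else {
    bare
  }
}

normalize_chromosomes(c("chr1", "2", "chrX"), "grch37")  # returns "1" "2" "X"
normalize_chromosomes(c("chr1", "2", "chrX"), "hg38")    # returns "chr1" "chr2" "chrX"
```
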
metadata: A data frame with a minimal set of required columns: patient_id, Tumor_Sample_Barcode, sample_id, seq_type, sex, cohort, and pathology. Columns such as sex and cohort can contain NA values but must still be present. The main purpose of this data frame is to provide a consistent structure that is always expected to be available and that links unique sample identifiers to their basic metadata values. The columns Tumor_Sample_Barcode and sample_id are expected to hold the same values; both are required so that the outputs of different upstream tools can be used directly.
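
The metadata requirements can be checked with a few lines of base R. The sketch below is an assumption-laden illustration: check_metadata is a hypothetical helper rather than a GAMBLR.data function, and every value in example_metadata is made up purely to show the expected shape.

```r
# Required columns, as listed in the metadata definition.
required_columns <- c(
  "patient_id", "Tumor_Sample_Barcode", "sample_id",
  "seq_type", "sex", "cohort", "pathology"
)

# Hypothetical validator for illustration; not part of GAMBLR.data.
check_metadata <- function(metadata) {
  missing_columns <- setdiff(required_columns, colnames(metadata))
  if (length(missing_columns) > 0) {
    stop("Missing required columns: ", paste(missing_columns, collapse = ", "))
  }
  # Tumor_Sample_Barcode and sample_id are expected to share the same values.
  if (!identical(as.character(metadata$Tumor_Sample_Barcode),
                 as.character(metadata$sample_id))) {
    stop("Tumor_Sample_Barcode and sample_id do not match")
  }
  invisible(TRUE)
}

# Made-up example rows; NA is allowed in columns such as sex and cohort.
example_metadata <- data.frame(
  patient_id = c("P01", "P02"),
  Tumor_Sample_Barcode = c("S01", "S02"),
  sample_id = c("S01", "S02"),
  seq_type = c("genome", "capture"),
  sex = c("F", NA),
  cohort = c(NA, "example_cohort"),
  pathology = c("DLBCL", "FL"),
  stringsAsFactors = FALSE
)

check_metadata(example_metadata)
```

Dropping a required column, even one that is entirely NA, makes check_metadata stop with an error naming the missing column.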