Get GAMBL metadata.
get_gambl_metadata.Rd
Return metadata for a selection of samples.
Usage
get_gambl_metadata(
dna_seq_type_priority = "genome",
capture_protocol_priority = "Exome",
dna_preservation_priority = "frozen",
exome_capture_space_priority = c("agilent-sureselect-human-all-exon-v7",
"agilent-sureselect-V5-plus-utr", "idt-xgen-v2-grch37", "exome-utr-grch37",
"exome-utr-grch38", "none"),
mrna_collapse_redundancy = TRUE,
also_normals = FALSE,
everything = FALSE,
verbose = FALSE,
invert = FALSE,
exclude = "promethION",
...
)
Arguments
- dna_seq_type_priority
The default is "genome" and the only other option is "capture". For duplicate biopsy_id/patient combinations with different seq_type available, prioritize this seq_type and drop the others.
- capture_protocol_priority
For duplicate biopsy_id/patient combinations with more than one capture protocol available, prioritize this protocol and drop the others (default: "Exome").
- dna_preservation_priority
Which to prioritize between FFPE and frozen samples from the same biopsy (default: "frozen")
- exome_capture_space_priority
A vector specifying how to prioritize exome capture space #TODO: implement and test once examples are available
- mrna_collapse_redundancy
Default: TRUE. Set to FALSE to obtain all rows for the mrna seq_type including those that would otherwise be collapsed.
- also_normals
Set to TRUE to force the return of rows where tissue_status is normal (default is to restrict to tumour)
- everything
Set to TRUE to include samples with bam_available == FALSE. Default: FALSE (only samples with bam_available = TRUE are retained).
- verbose
Set to TRUE for a chatty output (mostly for debugging)
- invert
Set to TRUE to force the function to return only the rows that are lost in all the prioritization steps (mostly for debugging)
- exclude
Specify one or more seq_type values to drop from the output. This prevents the metadata from containing anything other than the three standard seq_type values (genome, capture, mrna). The default setting excludes "promethION".
- ...
Additional arguments
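The prioritization arguments above share one idea: when a patient/biopsy combination has more than one candidate row, keep the preferred row and drop the rest, or return only the dropped rows when invert = TRUE. The base R sketch below illustrates that idea on invented toy data. It is not the GAMBLR implementation: prioritize_seq_type is a hypothetical helper, and the toy rows and their identifiers are assumptions made for illustration.

```r
# Toy metadata (invented): biopsy B1 has both genome and capture data,
# biopsy B2 has capture data only.
toy <- data.frame(
  patient_id = c("P1", "P1", "P2"),
  biopsy_id  = c("B1", "B1", "B2"),
  sample_id  = c("S1", "S2", "S3"),
  seq_type   = c("genome", "capture", "capture"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of GAMBLR): keep one row per
# patient_id/biopsy_id pair, preferring rows whose seq_type equals `priority`.
# With invert = TRUE, return the rows that would have been dropped instead.
prioritize_seq_type <- function(df, priority = "genome", invert = FALSE) {
  # Rank 1 for the preferred seq_type, 2 for everything else.
  rank <- ifelse(df$seq_type == priority, 1L, 2L)
  # Sort so the preferred row comes first within each patient/biopsy pair.
  sorted <- df[order(df$patient_id, df$biopsy_id, rank), , drop = FALSE]
  # Every row after the first in a pair is a lower-priority duplicate.
  is_dup <- duplicated(sorted[, c("patient_id", "biopsy_id")])
  out <- if (invert) sorted[is_dup, , drop = FALSE] else sorted[!is_dup, , drop = FALSE]
  rownames(out) <- NULL
  out
}

kept    <- prioritize_seq_type(toy, priority = "genome")
dropped <- prioritize_seq_type(toy, priority = "genome", invert = TRUE)
```

Here kept holds one row per biopsy and dropped holds the capture row for B1 that lost out to the genome row, which mirrors what the invert argument is described as returning.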
Value
A data frame with metadata for each biopsy in GAMBL, with the following columns:
- compression
Format of the original data used as input for our analysis pipelines (cram, bam or fastq)
- bam_available
Whether or not this file was available when last checked.
- patient_id
The anonymized unique identifier for this patient. For BC samples, this will be Res ID.
- sample_id
A unique identifier for the sample analyzed.
- seq_type
The assay type used to produce this data (one of "genome", "capture", "mrna", "promethION")
- capture_space
Unique ID for the capture space, where applicable
- genome_build
The name of the genome reference the data were aligned to.
- tissue_status
Whether the sample was a tumour or normal.
- cohort
Name for a group of samples that were added together (usually from a single study), often in the format pathology_cohort_descriptor.
- library_id
The unique identifier for the sequencing library.
- pathology
The diagnosis or pathology for the sample
- time_point
Timing of biopsy in increasing alphabetical order (A = diagnosis, B = first relapse etc)
- protocol
General protocol for library construction. e.g. "Ribodepletion", "PolyA", or "Genome"
- ffpe_or_frozen
Whether the nucleic acids were extracted from a frozen or FFPE sample
- read_length
The length of reads (required for RNA-seq libraries)
- strandedness
Whether the RNA-seq library construction was strand-specific and, if so, which strand. Required for RNA-seq; one of "positive", "negative", or "unstranded".
- seq_source_type
Required for RNAseq. Usually the same value as ffpe_or_frozen but sometimes immunotube or sorted cells
- data_path
Symbolic link to the bam or cram file (not usually relevant for GAMBLR)
- link_name
Standardized naming for symbolic link (not usually relevant for GAMBLR)
- fastq_data_path
Symbolic link to the fastq file (not usually relevant for GAMBLR)
- fastq_link_name
Standardized naming for symbolic link for FASTQ file, if used (not usually relevant for GAMBLR)
- unix_group
Whether a source is external and restricted by data access agreements (icgc_dart) or internal (gambl)
- COO_consensus
TODO
- DHITsig_consensus
TODO
- COO_PRPS_class
TODO
- DHITsig_PRPS_class
TODO
- DLBCL90_dlbcl_call
TODO
- DLBCL90_dhitsig_call
TODO
- res_id
Duplicate of sample_id for local samples; NA otherwise.
- DLBCL90_code_set
Code set used for the DLBCL90 call. One of DLBCL90, DLBCL90v2, or DLBCL90v3.
- DLBCL90_dlbcl_score
TODO
- DLBCL90_pmbl_score
TODO
- DLBCL90_pmbl_call
TODO
- DLBCL90_dhitsig_score
TODO
- myc_ba
Result from breakapart FISH for MYC locus
- myc_cn
Result from copy number FISH for MYC locus
- bcl2_ba
Result from breakapart FISH for BCL2 locus
- bcl2_cn
Result from copy number FISH for BCL2 locus
- bcl6_ba
Result from breakapart FISH for BCL6 locus
- bcl6_cn
Result from copy number FISH for BCL6 locus
- time_since_diagnosis_years
TODO
- relapse_timing
TODO
- dtbx
TODO. OR REMOVE?
- dtdx
TODO. OR REMOVE?
- lymphgen_no_cnv
TODO
- lymphgen_with_cnv
TODO
- lymphgen_cnv_noA53
TODO
- lymphgen_wright
The LymphGen call for this sample from Wright et al. (if applicable)
- fl_grade
TODO
- transformation
TODO
- relapse
TODO
- ighv_mutation_original
TODO
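Because the return value is an ordinary data frame, the columns listed above can be used directly for subsetting and tabulation. The sketch below uses a small invented data frame that borrows a few of the documented column names. The values are placeholders, not real GAMBL records.

```r
# Toy data frame using column names from the list above; all values invented.
toy_meta <- data.frame(
  sample_id      = c("S1", "S2", "S3", "S4"),
  seq_type       = c("genome", "capture", "mrna", "genome"),
  tissue_status  = c("tumour", "tumour", "tumour", "normal"),
  ffpe_or_frozen = c("frozen", "FFPE", "frozen", "frozen"),
  stringsAsFactors = FALSE
)

# Restrict to tumour genomes.
tumour_genomes <- subset(toy_meta, seq_type == "genome" & tissue_status == "tumour")

# Count samples per seq_type.
seq_type_counts <- table(toy_meta$seq_type)
```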
Details
This function returns metadata for GAMBL samples. It replaces the original version, which is still available under the new name og_get_gambl_metadata. Its purpose is to provide metadata for the non-redundant set of samples in GAMBL by resolving the redundancy that arises when a sample or biopsy has data from more than one seq_type (genome or capture), more than one capture protocol (exome or targeted capture), and so on.
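A few typical calls might look like the sketch below. It uses only arguments shown in the Usage section and assumes that the package providing get_gambl_metadata is already attached and that you have access to the GAMBL data. The exists() guard is an addition for illustration so the snippet does nothing when the function is unavailable.

```r
# Assumption: the package that provides get_gambl_metadata() is attached and
# the underlying GAMBL data are accessible. Otherwise this block is skipped.
if (exists("get_gambl_metadata", mode = "function")) {
  # Defaults: tumour samples only, genome preferred over capture.
  meta <- get_gambl_metadata()

  # Prefer capture where both seq_types exist, and also return normals.
  meta_capture <- get_gambl_metadata(
    dna_seq_type_priority = "capture",
    also_normals = TRUE
  )

  # Debugging aid: only the rows lost in the prioritization steps.
  meta_dropped <- get_gambl_metadata(invert = TRUE)
}
```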