Return metadata for a selection of samples.

Usage

get_gambl_metadata(
  dna_seq_type_priority = "genome",
  capture_protocol_priority = "Exome",
  dna_preservation_priority = "frozen",
  exome_capture_space_priority = c("agilent-sureselect-human-all-exon-v7",
    "agilent-sureselect-V5-plus-utr", "idt-xgen-v2-grch37", "exome-utr-grch37",
    "exome-utr-grch38", "none"),
  mrna_collapse_redundancy = TRUE,
  also_normals = FALSE,
  everything = FALSE,
  verbose = FALSE,
  invert = FALSE,
  exclude = "promethION",
  ...
)

Arguments

dna_seq_type_priority

The default is "genome" and the only other option is "capture". For duplicate biopsy_id/patient combinations with different seq_type available, prioritize this seq_type and drop the others.

capture_protocol_priority

For duplicate biopsy_id/patient combinations with different capture protocols available, prioritize this protocol and drop the others (default: "Exome").

dna_preservation_priority

Which preservation type to prioritize when both FFPE and frozen samples exist for the same biopsy (default: "frozen").

exome_capture_space_priority

A vector specifying the order in which to prioritize exome capture spaces. TODO: implement and test once examples are available.

mrna_collapse_redundancy

Default: TRUE. Set to FALSE to obtain all rows for the mrna seq_type including those that would otherwise be collapsed.

also_normals

Set to TRUE to force the return of rows where tissue_status is normal. Default: FALSE, which restricts the output to tumour samples.

everything

Set to TRUE to include samples with bam_available == FALSE. Default: FALSE, which retains only samples with bam_available == TRUE.

verbose

Set to TRUE for a chatty output (mostly for debugging)

invert

Set to TRUE to force the function to return only the rows that are lost in all the prioritization steps (mostly for debugging)

exclude

Specify one or more seq_type values to drop from the output. This prevents the metadata from containing anything other than the three standard seq_type values (genome, capture, mrna). The default setting excludes "promethION".

...

Additional arguments

Value

A data frame with metadata for each biopsy in GAMBL

compression

Format of the original data used as input for our analysis pipelines (cram, bam or fastq)

bam_available

Whether or not this file was available when last checked.

patient_id

The anonymized unique identifier for this patient. For BC samples, this will be Res ID.

sample_id

A unique identifier for the sample analyzed.

seq_type

The assay type used to produce this data (one of "genome", "capture", "mrna", "promethION")

capture_space

Unique ID for the capture space, where applicable

genome_build

The name of the genome reference the data were aligned to.

tissue_status

Whether the sample was a tumour or normal.

cohort

Name for a group of samples that were added together (usually from a single study), often in the format pathology_cohort_descriptor.

library_id

The unique identifier for the sequencing library.

pathology

The diagnosis or pathology for the sample

time_point

Timing of biopsy in increasing alphabetical order (A = diagnosis, B = first relapse, etc.)

protocol

General protocol for library construction, e.g. "Ribodepletion", "PolyA", or "Genome"

ffpe_or_frozen

Whether the nucleic acids were extracted from a frozen or FFPE sample

read_length

The length of reads (required for RNA-seq libraries)

strandedness

Whether the RNA-seq library construction was strand-specific and, if so, which strand. Required for RNAseq ("positive", "negative", or "unstranded")

seq_source_type

Required for RNAseq. Usually the same value as ffpe_or_frozen but sometimes immunotube or sorted cells

data_path

Symbolic link to the bam or cram file (not usually relevant for GAMBLR)

link_name

Standardized naming for symbolic link (not usually relevant for GAMBLR)

fastq_data_path

Symbolic link to the fastq file (not usually relevant for GAMBLR)

fastq_link_name

Standardized naming for symbolic link for FASTQ file, if used (not usually relevant for GAMBLR)

unix_group

Whether a source is external and restricted by data access agreements (icgc_dart) or internal (gambl)

COO_consensus

TODO

DHITsig_consensus

TODO

COO_PRPS_class

TODO

DHITsig_PRPS_class

TODO

DLBCL90_dlbcl_call

TODO

DLBCL90_dhitsig_call

TODO

res_id

Duplicate of sample_id for local samples and NA otherwise

DLBCL90_code_set

Code set used for DLBCL90 call. One of DLBCL90, DLBCL90v2, or DLBCL90v3

DLBCL90_dlbcl_score

TODO

DLBCL90_pmbl_score

TODO

DLBCL90_pmbl_call

TODO

DLBCL90_dhitsig_score

TODO

myc_ba

Result from breakapart FISH for MYC locus

myc_cn

Result from copy number FISH for MYC locus

bcl2_ba

Result from breakapart FISH for BCL2 locus

bcl2_cn

Result from copy number FISH for BCL2 locus

bcl6_ba

Result from breakapart FISH for BCL6 locus

bcl6_cn

Result from copy number FISH for BCL6 locus

time_since_diagnosis_years

TODO

relapse_timing

TODO

dtbx

TODO. OR REMOVE?

dtdx

TODO. OR REMOVE?

lymphgen_no_cnv

TODO

lymphgen_with_cnv

TODO

lymphgen_cnv_noA53

TODO

lymphgen_wright

The LymphGen call for this sample from Wright et al. (if applicable)

fl_grade

TODO

transformation

TODO

relapse

TODO

ighv_mutation_original

TODO

Details

This function returns metadata for GAMBL samples. It replaces the original version, which is still available under the new name og_get_gambl_metadata. Its purpose is to provide metadata for the non-redundant set of samples from GAMBL by resolving the redundancy that arises when a sample or biopsy has data from more than one seq_type (genome or capture), from different capture protocols (exome or targeted capture), and so on.
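
The idea of collapsing redundant rows by priority can be illustrated with a small base R sketch. This is not the function's actual implementation. The toy data frame, its biopsy_id column layout, and the helper prioritize_seq_type are all hypothetical and exist only to show what "prioritize this seq_type and drop the others" means for one patient/biopsy combination.

```r
# Toy metadata: hypothetical rows, not real GAMBL samples.
toy <- data.frame(
  sample_id  = c("S1-genome", "S1-capture", "S2-capture", "S3-genome"),
  patient_id = c("P1", "P1", "P2", "P3"),
  biopsy_id  = c("B1", "B1", "B2", "B3"),
  seq_type   = c("genome", "capture", "capture", "genome"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption for illustration only): keep one row per
# patient/biopsy combination, preferring rows whose seq_type equals `priority`.
prioritize_seq_type <- function(df, priority = "genome") {
  # Preferred seq_type gets rank 0 so it sorts first within each biopsy.
  pref_rank <- ifelse(df$seq_type == priority, 0L, 1L)
  ordered <- df[order(df$patient_id, df$biopsy_id, pref_rank), ]
  # duplicated() flags every row after the first in each patient/biopsy group.
  keep <- !duplicated(ordered[, c("patient_id", "biopsy_id")])
  ordered[keep, ]
}

kept_genome  <- prioritize_seq_type(toy, priority = "genome")
kept_capture <- prioritize_seq_type(toy, priority = "capture")

# Rows lost to prioritization, similar in spirit to what invert = TRUE returns.
dropped <- toy[!toy$sample_id %in% kept_genome$sample_id, ]
```

In this sketch only biopsy B1 has two seq_type values, so it is the only one affected by the choice of priority. Samples with a single seq_type are always kept.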

Examples
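
The calls below are a minimal usage sketch built only from the arguments documented above. They assume that the package providing get_gambl_metadata is already attached and that the GAMBL metadata it reads is accessible from the current session. The guard simply skips the calls when the function is not available.

```r
# Assumption: the package providing get_gambl_metadata() is attached and the
# underlying GAMBL metadata is reachable; otherwise the block does nothing.
has_fn <- exists("get_gambl_metadata", mode = "function")

if (has_fn) {
  # Defaults: tumour samples only, genome preferred over capture,
  # frozen preferred over FFPE, promethION rows excluded.
  meta <- get_gambl_metadata()

  # Prefer capture over genome when a biopsy has both.
  meta_capture <- get_gambl_metadata(dna_seq_type_priority = "capture")

  # Also return rows where tissue_status is normal.
  meta_with_normals <- get_gambl_metadata(also_normals = TRUE)

  # Debugging aid: return only the rows dropped by the prioritization steps.
  meta_dropped <- get_gambl_metadata(invert = TRUE, verbose = TRUE)

  # Quick check of which seq_type values remain after prioritization.
  print(table(meta$seq_type))
}
```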