Retrieve all coding SSMs from the GAMBL database in MAF-like format.

get_coding_ssm(
  limit_cohort,
  exclude_cohort,
  limit_pathology,
  limit_samples,
  these_samples_metadata,
  force_unmatched_samples,
  projection = "grch37",
  seq_type,
  basic_columns = TRUE,
  maf_cols = NULL,
  from_flatfile = TRUE,
  augmented = TRUE,
  min_read_support = 3,
  groups = c("gambl", "icgc_dart"),
  include_silent = TRUE,
  engine = "fread_maf"
)

Arguments

limit_cohort

Supply this to restrict mutations to one or more cohorts in a vector.

exclude_cohort

Supply this to exclude mutations from one or more cohorts in a vector.

limit_pathology

Supply this to restrict mutations to one pathology.

limit_samples

Supply this to restrict mutations to a vector of sample_id values (instead of subsetting using the provided metadata).

these_samples_metadata

Supply a metadata table to automatically subset the data to the samples in that table before returning.

force_unmatched_samples

Optional argument for forcing unmatched samples, using get_ssm_by_samples.

projection

Reference genome build for the coordinates in the MAF file. The default is grch37.

seq_type

The seq_type you want back, e.g. genome. The usage above lists no default value, so supply it explicitly.

basic_columns

Set to FALSE to override the default behavior of returning only the first 45 columns of MAF data.

maf_cols

If basic_columns is set to FALSE, use this to specify which MAF columns are returned. This parameter can be either a vector of column indexes (integer) or a vector of column names (character) matching columns in the MAF.

from_flatfile

Set to TRUE to obtain mutations from a local flatfile instead of the database. This can be more efficient and is currently the only option for users who do not have ICGC data access.

augmented

Default is TRUE. Set to FALSE to get the original MAF from each sample for multi-sample patients rather than the augmented MAF.

min_read_support

Only return variants with at least this many reads in t_alt_count (for cleaning up augmented MAFs).

groups

Unix groups for the samples to be included. Default is both gambl and icgc_dart samples.

include_silent

Logical parameter indicating whether to include silent mutations among the returned coding mutations. Default is TRUE.

engine

Specify one of readr or fread_maf (default) to change how the large files are loaded prior to subsetting. Either may perform better on your system; in the author's experience, fread_maf is faster and uses much less RAM.
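
The row-level filters described above (sample restriction, min_read_support, include_silent) can be pictured with a small base R sketch. This is an illustration on a toy table, not the package's implementation: the data, the `meta` table, and the assumption that silent mutations are exactly those with Variant_Classification equal to "Silent" are all hypothetical.

```r
# Toy MAF-like table (hypothetical values); real MAFs carry many more columns.
maf <- data.frame(
  Hugo_Symbol = c("MYC", "TP53", "EZH2", "MYC", "KMT2D"),
  Tumor_Sample_Barcode = c("S1", "S2", "S3", "S3", "S1"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Missense_Mutation", "Nonsense_Mutation",
                             "Silent"),
  t_alt_count = c(12L, 8L, 2L, 5L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical metadata table with a sample_id column,
# standing in for these_samples_metadata.
meta <- data.frame(sample_id = c("S1", "S3"), stringsAsFactors = FALSE)

min_read_support <- 3
include_silent <- FALSE

# Keep mutations from samples in the metadata with enough alt-allele reads.
keep <- maf$Tumor_Sample_Barcode %in% meta$sample_id &
  maf$t_alt_count >= min_read_support

# Optionally drop silent mutations (assumed here to be labelled "Silent").
if (!include_silent) {
  keep <- keep & maf$Variant_Classification != "Silent"
}

filtered <- maf[keep, ]
print(filtered)
```

In this toy table only the two MYC rows pass all three filters; the other rows fail on sample membership, read support, or variant class.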

Value

A data frame containing all the MAF data columns (one row per mutation).
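
Because maf_cols accepts either integer indexes or column names, the two forms of column selection may be easier to see on a toy data frame. The sketch below uses plain base R subsetting and hypothetical values; it only illustrates the idea of choosing columns by position versus by name and is not taken from the package source.

```r
# Toy MAF-like table with a handful of columns (hypothetical values).
maf <- data.frame(
  Hugo_Symbol = c("MYC", "TP53"),
  Chromosome = c("8", "17"),
  Start_Position = c(100L, 200L),
  Variant_Classification = c("Missense_Mutation", "Silent"),
  t_alt_count = c(12L, 4L),
  stringsAsFactors = FALSE
)

# Select columns by integer index ...
by_index <- maf[, c(1, 4), drop = FALSE]

# ... or by name; both forms pick the same columns here.
by_name <- maf[, c("Hugo_Symbol", "Variant_Classification"), drop = FALSE]

print(identical(by_index, by_name))
```

Selecting by name is generally the more robust choice, since it does not depend on the column order of the underlying file.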

Details

Retrieve coding SSM calls, with several optional filtering parameters. For details on how to use the filtering parameters, see the parameter descriptions and the examples in the vignettes. If this function is not what you are looking for, try one of these similar functions: get_coding_ssm_status, get_ssm_by_patients, get_ssm_by_sample, get_ssm_by_samples, get_ssm_by_region, get_ssm_by_regions.

Examples

#basic usage
maf_data = get_coding_ssm(seq_type = "genome", limit_cohort = c("BL_ICGC"))
#> reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/genome--projection/deblacklisted/augmented_maf/all_slms-3--grch37.CDS.maf
#> mutations from 1652 samples
#> after linking with metadata, we have mutations from 17 samples

maf_data = get_coding_ssm(seq_type = "genome", limit_samples = "HTMCP-01-06-00485-01A-01D")
#> reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/genome--projection/deblacklisted/augmented_maf/all_slms-3--grch37.CDS.maf
#> mutations from 1652 samples
#> after linking with metadata, we have mutations from 1 samples
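
Since the returned data frame has one row per mutation, a common next step is to tally mutations per sample. The sketch below does this on a toy stand-in for maf_data so it can run without GAMBL access; the sample and gene values are hypothetical, and Tumor_Sample_Barcode is assumed to be the sample identifier column, as in standard MAF files.

```r
# Toy stand-in for the data frame returned by get_coding_ssm (hypothetical values).
maf_data <- data.frame(
  Hugo_Symbol = c("MYC", "TP53", "EZH2", "MYC"),
  Tumor_Sample_Barcode = c("S1", "S2", "S3", "S3"),
  stringsAsFactors = FALSE
)

# One row per mutation, so counting rows per sample gives mutations per sample.
per_sample <- table(maf_data$Tumor_Sample_Barcode)
print(per_sample)
```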