get_coding_ssm.Rd
Retrieve all coding SSMs from the GAMBL database in MAF-like format.
get_coding_ssm(
limit_cohort,
exclude_cohort,
limit_pathology,
limit_samples,
these_samples_metadata,
force_unmatched_samples,
projection = "grch37",
seq_type,
basic_columns = TRUE,
maf_cols = NULL,
from_flatfile = TRUE,
augmented = TRUE,
min_read_support = 3,
groups = c("gambl", "icgc_dart"),
include_silent = TRUE,
engine = "fread_maf"
)
limit_cohort: Supply this to restrict mutations to one or more cohorts in a vector.
exclude_cohort: Supply this to exclude mutations from one or more cohorts in a vector.
limit_pathology: Supply this to restrict mutations to one pathology.
limit_samples: Supply this to restrict mutations to a vector of sample_id (instead of subsetting using the provided metadata).
these_samples_metadata: Supply a metadata table to auto-subset the data to samples in that table before returning.
force_unmatched_samples: Optional argument for forcing unmatched samples, using get_ssm_by_samples.
projection: Reference genome build for the coordinates in the MAF file. The default is grch37.
seq_type: The seq_type you want back. Default is genome.
basic_columns: Set to FALSE to override the default behavior of returning only the first 45 columns of MAF data.
maf_cols: If basic_columns is set to FALSE, the user can specify which columns are returned within the MAF. This parameter can either be a vector of indexes (integer) or a vector of characters (matching columns in MAF).
from_flatfile: Set to TRUE to obtain mutations from a local flatfile instead of the database. This can be more efficient and is currently the only option for users who do not have ICGC data access.
augmented: Default is TRUE. Set to FALSE if you want the original MAF from each sample for multi-sample patients instead of the augmented MAF.
min_read_support: Only returns variants with at least this many reads in t_alt_count (for cleaning up augmented MAFs).
groups: Unix groups for the samples to be included. Default is both gambl and icgc_dart samples.
include_silent: Logical parameter indicating whether to include silent mutations in the coding mutations. Default is TRUE.
engine: Specify one of readr or fread_maf (default) to change how the large files are loaded prior to subsetting. You may have better performance with one or the other, but in the author's experience fread_maf is faster and uses much less RAM.
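The two accepted forms of maf_cols (integer indexes or column names) can be illustrated on a toy data frame. This is only a sketch in base R: the data are made up, and select_toy_cols is a hypothetical helper, not a function in the package and not how get_coding_ssm is implemented.

```r
# Toy MAF-like data frame (made-up values; a few standard MAF column names).
toy_maf <- data.frame(
  Hugo_Symbol = c("MYC", "TP53"),
  Chromosome = c("8", "17"),
  Start_Position = c(100L, 200L),
  Tumor_Sample_Barcode = c("S1", "S2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: accepts either integer indexes or column names,
# mirroring the two forms described for maf_cols.
select_toy_cols <- function(maf, maf_cols) {
  stopifnot(is.numeric(maf_cols) || is.character(maf_cols))
  maf[, maf_cols, drop = FALSE]
}

# Both calls select the same two columns.
by_index <- select_toy_cols(toy_maf, c(1, 4))
by_name <- select_toy_cols(toy_maf, c("Hugo_Symbol", "Tumor_Sample_Barcode"))
```

Selecting by name is usually the more robust choice, since it does not depend on the column order of the underlying file.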
A data frame containing all the MAF data columns (one row per mutation).
Effectively retrieve coding SSM calls. Multiple filtering parameters are available for this function. For more information on how to use the filtering parameters, refer to the parameter descriptions as well as the examples in the vignettes. Is this function not what you are looking for? Try one of the following similar functions: get_coding_ssm_status, get_ssm_by_patients, get_ssm_by_sample, get_ssm_by_samples, get_ssm_by_region, get_ssm_by_regions.
#basic usage
maf_data = get_coding_ssm(seq_type = "genome", limit_cohort = c("BL_ICGC"))
#> reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/genome--projection/deblacklisted/augmented_maf/all_slms-3--grch37.CDS.maf
#> mutations from 1652 samples
#> after linking with metadata, we have mutations from 17 samples
maf_data = get_coding_ssm(seq_type = "genome", limit_samples = "HTMCP-01-06-00485-01A-01D")
#> reading from: /projects/nhl_meta_analysis_scratch/gambl/results_local/all_the_things/slms_3-1.0_vcf2maf-1.3/genome--projection/deblacklisted/augmented_maf/all_slms-3--grch37.CDS.maf
#> mutations from 1652 samples
#> after linking with metadata, we have mutations from 1 samples
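The examples above need access to the GAMBL flatfiles. To see what the limit_samples, min_read_support and include_silent filters do without that access, the following base R sketch applies equivalent filters to a toy data frame. The data are made up, filter_toy_maf is a hypothetical helper rather than part of the package, and treating a Variant_Classification of "Silent" as the silent class is an assumption based on the standard MAF vocabulary.

```r
# Toy MAF-like data frame (made-up values; a few standard MAF column names).
toy_maf <- data.frame(
  Hugo_Symbol = c("MYC", "TP53", "ID3", "MYC"),
  Tumor_Sample_Barcode = c("S1", "S1", "S2", "S3"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Nonsense_Mutation", "Missense_Mutation"),
  t_alt_count = c(12L, 2L, 5L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the documented filters; not the package code.
filter_toy_maf <- function(maf, limit_samples = NULL,
                           min_read_support = 3, include_silent = TRUE) {
  # Keep variants with at least min_read_support reads in t_alt_count.
  keep <- maf$t_alt_count >= min_read_support
  # Optionally restrict to the requested sample IDs.
  if (!is.null(limit_samples)) {
    keep <- keep & maf$Tumor_Sample_Barcode %in% limit_samples
  }
  # Optionally drop silent mutations (assumed to be labelled "Silent").
  if (!include_silent) {
    keep <- keep & maf$Variant_Classification != "Silent"
  }
  maf[keep, , drop = FALSE]
}

# Restrict to two samples with the default read-support threshold of 3.
filtered <- filter_toy_maf(toy_maf, limit_samples = c("S1", "S3"))
```

Here the TP53 row is dropped because its t_alt_count is below the threshold, and the ID3 row is dropped because sample S2 was not requested.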