Get Coding SSM Status. — get_coding_ssm_status

Tabulate mutation status (SSM) for a set of genes.

Usage

get_coding_ssm_status(
  gene_symbols,
  these_samples_metadata,
  augmented = TRUE,
  min_read_support = 3,
  maf_path = NULL,
  maf_data,
  include_hotspots = TRUE,
  keep_multihit_hotspot = FALSE,
  recurrence_min = 5,
  review_hotspots = TRUE,
  genes_of_interest = c("FOXO1", "MYD88", "CREBBP"),
  genome_build,
  include_silent = FALSE,
  include_silent_genes,
  suffix,
  this_seq_type
)

Arguments

gene_symbols: A vector of gene symbols for which the mutation status will be tabulated. If not provided, all lymphoma genes are tabulated by default.
these_samples_metadata: The metadata for samples of interest to be included in the returned matrix. Only the column "sample_id" is required. If not provided, the matrix is tabulated for all available samples by default.
augmented: Default is TRUE. Set to FALSE to use the original MAF from each sample for multi-sample patients instead of the augmented MAF.
min_read_support: Only return variants with at least this many reads in t_alt_count (for cleaning up augmented MAFs).
maf_path: If the status of coding SSM should be tabulated from a custom MAF file, provide the path to that file here. Default is NULL.
maf_data: Either a maf loaded from disk or from the database using a get_ssm function.
include_hotspots: Logical parameter indicating whether hotspots object should also be tabulated. Default is TRUE.
keep_multihit_hotspot: Logical parameter indicating whether to keep the gene annotation as mutated when the gene has both hotspot and non-hotspot mutations. Default is FALSE. If set to TRUE, the number of non-hotspot mutations is reported instead of only the presence of a mutation.
recurrence_min: Integer value indicating minimal recurrence level.
review_hotspots: Logical parameter indicating whether hotspots object should be reviewed to include functionally relevant mutations or rare lymphoma-related genes. Default is TRUE.
genes_of_interest: A vector of genes for hotspot review. Currently only FOXO1, MYD88, and CREBBP are supported.
genome_build: Reference genome build for the coordinates in the MAF file. The default is hg19 genome build.
include_silent: Logical parameter indicating whether to include silent mutations into coding mutations. Default is FALSE.
include_silent_genes: Optionally, provide a list of genes for which Silent variants should be considered. If provided, the Silent variants for these genes will be included regardless of the include_silent argument.
suffix: Optionally, provide a character string that will be appended to the end of each name.
this_seq_type: Deprecated. This is now determined from the metadata provided.
projection: Specify projection (grch37 or hg38) of mutations. Default is grch37.
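
As an illustration of the min_read_support argument described above, the following is a minimal sketch of that kind of filter on a toy MAF-like data frame. The data are hypothetical, and this is not the package's own implementation; it only assumes that t_alt_count holds the number of reads supporting the alternate allele.

```r
# Toy MAF-like rows (hypothetical values, not real data)
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("S1", "S1", "S2"),
  Hugo_Symbol = c("MYD88", "CREBBP", "FOXO1"),
  t_alt_count = c(12L, 2L, 3L),
  stringsAsFactors = FALSE
)

min_read_support <- 3

# Keep only variants supported by at least min_read_support alternate reads
supported <- toy_maf[toy_maf$t_alt_count >= min_read_support, , drop = FALSE]
supported$Hugo_Symbol
```

In this toy case the CREBBP variant, with only two supporting reads, is dropped before any tabulation would take place.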

Value

A data frame with tabulated mutation status.

Details

This function takes a data frame (in MAF-like format) and converts it to a binary one-hot encoded matrix of mutation status for either a set of user-specified genes (via gene_symbols) or, if no genes are provided, all lymphoma genes.

By default, a gene/sample_id combination is marked as mutated only if the MAF contains a protein-coding mutation in that gene for that sample. This can be changed to also count synonymous variants in some genes (via include_silent_genes) or in all genes (via include_silent).

The function also has other filtering and convenience parameters that control what is returned. For more information, refer to the parameter descriptions and examples.

If this function is not what you are looking for, try one of the following similar functions: get_coding_ssm, get_ssm_by_patients, get_ssm_by_sample, get_ssm_by_samples, get_ssm_by_region, get_ssm_by_regions.
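
To make the one-hot encoding concrete, here is a minimal base R sketch of the idea on toy data. Everything in it is an assumption for illustration: tabulate_status is a hypothetical helper, not part of the package, the sample and variant values are made up, and any Variant_Classification other than "Silent" is treated as coding, which is a simplification.

```r
# Toy MAF-like data frame (hypothetical values, for illustration only)
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("S1", "S1", "S2", "S3"),
  Hugo_Symbol = c("MYD88", "CREBBP", "CREBBP", "MYD88"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Nonsense_Mutation", "Silent"),
  stringsAsFactors = FALSE
)
toy_samples <- c("S1", "S2", "S3", "S4")
toy_genes <- c("MYD88", "CREBBP")

# Hypothetical helper: one row per sample, one 0/1 column per gene
tabulate_status <- function(maf, samples, genes,
                            include_silent = FALSE,
                            include_silent_genes = character(0)) {
  is_silent <- maf$Variant_Classification == "Silent"
  # Silent variants are kept only if requested globally or for that gene
  keep <- !is_silent | include_silent |
    (maf$Hugo_Symbol %in% include_silent_genes)
  maf <- maf[keep & maf$Hugo_Symbol %in% genes, , drop = FALSE]
  status <- data.frame(sample_id = samples, stringsAsFactors = FALSE)
  for (g in genes) {
    mutated <- unique(maf$Tumor_Sample_Barcode[maf$Hugo_Symbol == g])
    status[[g]] <- as.integer(samples %in% mutated)
  }
  status
}

# Default: only non-silent variants count
default_status <- tabulate_status(toy_maf, toy_samples, toy_genes)

# Silent variants counted for MYD88 only
silent_status <- tabulate_status(toy_maf, toy_samples, toy_genes,
                                 include_silent_genes = "MYD88")
```

Samples without any qualifying variant (S4 here) still get a row filled with zeros, which mirrors how the returned matrix is described as covering all samples in the metadata.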

Examples

# FL Tier 1 genes
genes = dplyr::filter(GAMBLR.data::lymphoma_genes,
             FL_Tier==1) %>% 
             dplyr::pull(Gene)

# Metadata for FL genomes and exomes
fl_meta = suppressMessages(get_gambl_metadata()) %>% 
    dplyr::filter(pathology=="FL",
                  cohort != "FL_Crouch",
                  seq_type != "mrna")

table(fl_meta$seq_type)
#> 
#> capture  genome 
#>     333     463 
# Here, we let the function load the data for us
 coding_tabulated_df = get_coding_ssm_status(
  gene_symbols=genes,
  include_hotspots=FALSE,
  genome_build = "hg38",
  these_samples_metadata = fl_meta
 )
#> Joining with `by = join_by(sample_id)`
 length(genes)
#> [1] 54
 dim(coding_tabulated_df)
#> [1] 796  55
 head(colnames(coding_tabulated_df))
#> [1] "sample_id" "TNFRSF14"  "ARID1A"    "RRAGC"     "BCL10"     "CTSS"     

# Alternatively, we can provide the MAF data directly

# Load the MAF data (let's use the other genome build this time)

maf_data = get_all_coding_ssm(these_samples_metadata = fl_meta,
                              projection = "grch37")

coding_tabulated2 = get_coding_ssm_status(gene_symbols=genes,
                                          these_samples_metadata = fl_meta,
                                          maf_data = maf_data,
                                          include_hotspots=FALSE)
#> Joining with `by = join_by(sample_id)`
 dim(coding_tabulated2)
#> [1] 796  55
 head(colnames(coding_tabulated2))
#> [1] "sample_id" "TNFRSF14"  "ARID1A"    "RRAGC"     "BCL10"     "CTSS"
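
A tabulated data frame of this shape can be summarised per gene. The sketch below uses a small hypothetical stand-in (toy_status) rather than the real output, and assumes the first column is sample_id and that any non-zero value means the gene is mutated in that sample.

```r
# Hypothetical stand-in for a tabulated status data frame
toy_status <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  TNFRSF14 = c(1, 0, 1, 1),
  ARID1A = c(0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Fraction of samples with a mutation in each gene, most frequent first
gene_freq <- sort(colMeans(toy_status[, -1] > 0), decreasing = TRUE)
gene_freq
```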