Get Coding SSM Status.
get_coding_ssm_status.Rd
Tabulate mutation status (SSM) for a set of genes.
Usage
get_coding_ssm_status(
  gene_symbols,
  these_samples_metadata,
  augmented = TRUE,
  min_read_support = 3,
  maf_path = NULL,
  maf_data,
  include_hotspots = TRUE,
  keep_multihit_hotspot = FALSE,
  recurrence_min = 5,
  review_hotspots = TRUE,
  genes_of_interest = c("FOXO1", "MYD88", "CREBBP"),
  genome_build,
  include_silent = FALSE,
  include_silent_genes,
  suffix,
  this_seq_type
)
Arguments
- gene_symbols
A vector of gene symbols for which the mutation status will be tabulated. If not provided, all lymphoma genes are tabulated by default.
- these_samples_metadata
The metadata for the samples of interest to be included in the returned matrix. Only the column "sample_id" is required. If not provided, the matrix is tabulated for all available samples by default.
- augmented
Default is TRUE. Set to FALSE if you want the original MAF from each sample for multi-sample patients rather than the augmented MAF.
- min_read_support
Only returns variants with at least this many reads in t_alt_count (for cleaning up augmented MAFs).
- maf_path
If the status of coding SSM should be tabulated from a custom MAF file, provide the path to that file in this argument. The default is NULL.
- maf_data
A MAF data frame, either loaded from disk or retrieved from the database using a get_ssm function.
- include_hotspots
Logical parameter indicating whether the hotspots object should also be tabulated. Default is TRUE.
- keep_multihit_hotspot
Logical parameter indicating whether to keep the gene annotated as mutated when the gene has both hotspot and non-hotspot mutations. Default is FALSE. If set to TRUE, the number of non-hotspot mutations is reported instead of just the presence of a mutation.
- recurrence_min
Integer value indicating minimal recurrence level.
- review_hotspots
Logical parameter indicating whether the hotspots object should be reviewed to include functionally relevant mutations or rare lymphoma-related genes. Default is TRUE.
- genes_of_interest
A vector of genes for hotspot review. Currently only FOXO1, MYD88, and CREBBP are supported.
- genome_build
Reference genome build for the coordinates in the MAF file. The default is hg19 genome build.
- include_silent
Logical parameter indicating whether to count silent mutations as coding mutations. Default is FALSE.
- include_silent_genes
Optionally, provide a list of genes for which Silent variants should be considered. If provided, the Silent variants for these genes will be included regardless of the include_silent argument.
- suffix
Optionally, provide a character string that will be appended to the end of each name.
- this_seq_type
Deprecated. This is now determined from the metadata provided.
- projection
Specify projection (grch37 or hg38) of mutations. Default is grch37.
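To make the interplay between min_read_support, include_silent, and include_silent_genes more concrete, here is a minimal base R sketch run on a made-up MAF-like data frame. The helper filter_toy_maf and the toy data are hypothetical and are not part of the package. The sketch only mirrors the behaviour described in the argument list above and treats every variant that is not "Silent" as coding, which is a simplification.

```r
# Toy MAF-like data; genes, classifications, and read counts are invented.
toy_maf_reads <- data.frame(
  Hugo_Symbol = c("MYD88", "MYD88", "BCL2", "BCL2", "EZH2"),
  Variant_Classification = c("Missense_Mutation", "Silent", "Silent",
                             "Missense_Mutation", "Silent"),
  t_alt_count = c(12L, 8L, 6L, 2L, 9L),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a package function) that mimics the documented
# semantics of three filtering arguments.
filter_toy_maf <- function(maf,
                           min_read_support = 3,
                           include_silent = FALSE,
                           include_silent_genes = character(0)) {
  # Keep variants supported by at least min_read_support alternate reads
  enough_reads <- maf$t_alt_count >= min_read_support
  is_silent <- maf$Variant_Classification == "Silent"
  # Silent variants survive if include_silent is TRUE or the gene is listed
  keep_silent <- include_silent | maf$Hugo_Symbol %in% include_silent_genes
  maf[enough_reads & (!is_silent | keep_silent), ]
}

kept <- filter_toy_maf(toy_maf_reads, include_silent_genes = "BCL2")
kept
```

In this toy run only the MYD88 missense variant and the BCL2 Silent variant remain: the BCL2 missense variant has too few supporting reads, and the Silent variants in MYD88 and EZH2 are dropped because neither gene is listed in include_silent_genes.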
Details
This function takes a data frame in MAF-like format and converts it to a binary, one-hot encoded matrix of mutation status for either a set of user-specified genes (via gene_symbols) or, if no genes are provided, all lymphoma genes. By default, a gene/sample_id combination is marked as mutated only if the MAF contains a protein-coding mutation in that gene for that sample. This can be relaxed to also count synonymous variants in some genes (via include_silent_genes) or in all genes (via include_silent). Additional filtering and convenience parameters give further control over the returned table; see the parameter descriptions and examples for details.

Is this function not what you are looking for? Try one of the following similar functions: get_coding_ssm, get_ssm_by_patients, get_ssm_by_sample, get_ssm_by_samples, get_ssm_by_region, get_ssm_by_regions.
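The conversion from a MAF-like table to a per-sample, per-gene status table can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: tabulate_status is a hypothetical helper, the toy data are invented, and any variant that is not "Silent" is treated as coding.

```r
# Toy MAF-like data frame; sample IDs and variants are made up.
toy_maf <- data.frame(
  Tumor_Sample_Barcode = c("S1", "S1", "S2", "S3"),
  Hugo_Symbol = c("CREBBP", "EZH2", "CREBBP", "EZH2"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Nonsense_Mutation", "Missense_Mutation"),
  stringsAsFactors = FALSE
)
toy_genes <- c("CREBBP", "EZH2", "KMT2D")
toy_samples <- c("S1", "S2", "S3", "S4")

# Hypothetical helper (not a package function): returns 1 when a sample has
# at least one qualifying variant in a gene and 0 otherwise.
tabulate_status <- function(maf, genes, samples, include_silent = FALSE) {
  if (!include_silent) {
    maf <- maf[maf$Variant_Classification != "Silent", ]
  }
  maf <- maf[maf$Hugo_Symbol %in% genes, ]
  # Using factors with fixed levels keeps samples and genes without variants
  counts <- table(
    factor(maf$Tumor_Sample_Barcode, levels = samples),
    factor(maf$Hugo_Symbol, levels = genes)
  )
  status <- matrix(as.integer(counts > 0), nrow = length(samples),
                   dimnames = list(NULL, genes))
  data.frame(sample_id = samples, status, stringsAsFactors = FALSE)
}

status_df <- tabulate_status(toy_maf, toy_genes, toy_samples)
status_df
```

The result has one row per sample and one column per gene plus sample_id, the same shape as the tables in the examples on this page. Sample S4 and gene KMT2D carry no variants in the toy data, so they appear as all zeros rather than being dropped.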
Examples
# FL Tier 1 genes
genes = dplyr::filter(GAMBLR.data::lymphoma_genes,
                      FL_Tier == 1) %>%
  dplyr::pull(Gene)
# Metadata for FL genomes and exomes
fl_meta = suppressMessages(get_gambl_metadata()) %>%
  dplyr::filter(pathology == "FL",
                cohort != "FL_Crouch",
                seq_type != "mrna")
table(fl_meta$seq_type)
#>
#> capture  genome
#>     333     463
# Here, we let the function load the data for us
coding_tabulated_df = get_coding_ssm_status(
  gene_symbols = genes,
  include_hotspots = FALSE,
  genome_build = "hg38",
  these_samples_metadata = fl_meta
)
#> Joining with `by = join_by(sample_id)`
length(genes)
#> [1] 54
dim(coding_tabulated_df)
#> [1] 796 55
head(colnames(coding_tabulated_df))
#> [1] "sample_id" "TNFRSF14" "ARID1A" "RRAGC" "BCL10" "CTSS"
# Alternatively, we can provide the MAF data directly
# Load the MAF data (let's use the other genome build this time)
maf_data = get_all_coding_ssm(these_samples_metadata = fl_meta,
                              projection = "grch37")
coding_tabulated2 = get_coding_ssm_status(gene_symbols = genes,
                                          these_samples_metadata = fl_meta,
                                          maf_data = maf_data,
                                          include_hotspots = FALSE)
#> Joining with `by = join_by(sample_id)`
dim(coding_tabulated2)
#> [1] 796 55
head(colnames(coding_tabulated2))
#> [1] "sample_id" "TNFRSF14" "ARID1A" "RRAGC" "BCL10" "CTSS"