Annotate MAF with triplet context — annotate_maf_triplet

Gives the triplet sequence of each mutated base together with its adjacent bases (-1 and +1).

Usage

annotate_maf_triplet(
  maf,
  all_SNVs = TRUE,
  ref,
  alt,
  genome_build,
  fastaPath,
  pyrimidine_collapse = FALSE
)

Arguments

maf: MAF file (required columns: Reference_Allele, Tumor_Seq_Allele2)
all_SNVs: Return the triplet sequences of all SNVs rather than only those matching specific ref and alt alleles (default is TRUE)
ref: Reference allele
alt: Alternative allele
genome_build: The genome build for the variants you are working with (default is to infer it from the MAF)
fastaPath: Path to a FASTA file on disk. On GSC, the function first tries to infer this from the gambl reference via the path specified in the config. Local files are also accepted.
pyrimidine_collapse: Estimate mutation_strand and

Value

A data frame with two or three extra columns. When pyrimidine_collapse = FALSE, the triplet sequence (seq) and the strand (mutation_strand) are added. When pyrimidine_collapse = TRUE, a third column showing the mutation (mutation) is added as well.

Details

Given a reference and an alternative allele, the function selects the rows of the data frame with these values for + strand genes, and the rows with the complementary alleles for - strand genes, then looks up the bases adjacent to each mutation position. Alternatively, it can process all SNVs in the MAF data frame and provide the triplet sequence for each of them (the reverse complement sequence for the - strand).
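
The triplet lookup and the strand handling described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy sequence `ref_seq` and the helpers `rev_comp()` and `triplet_context()` are hypothetical names introduced here for illustration only, whereas the real function reads the flanking bases from the FASTA file given by fastaPath.

```r
# Toy reference sequence standing in for a chromosome (hypothetical data).
ref_seq <- "ACGTTGCAAC"

# Reverse complement of a DNA string using base R only.
rev_comp <- function(x) {
  comp <- chartr("ACGT", "TGCA", x)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Return the base at `pos` (1-based) flanked by its -1 and +1 neighbours.
# For "-" strand rows the triplet is reverse complemented.
triplet_context <- function(seq, pos, strand = "+") {
  stopifnot(pos > 1, pos < nchar(seq))
  tri <- substr(seq, pos - 1, pos + 1)
  if (strand == "-") rev_comp(tri) else tri
}

triplet_context(ref_seq, 3, "+")  # "CGT"
triplet_context(ref_seq, 3, "-")  # "ACG"
```

The same position yields a different triplet depending on the strand, which is why the strand is reported alongside seq in the output.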

Examples

maf <- get_coding_ssm(projection = "grch37") %>% head(n = 500)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
# peek at the data
dplyr::select(maf, 1:12) %>% head()
#> genomic_data Object
#> Genome Build: grch37 
#> Showing first 10 rows:
#>     Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
#> 1    AL627309.1              0      .     GRCh37          1         138626
#> 2    AL627309.1              0      .     GRCh37          1         138972
#> 3 RP11-206L10.9              0      .     GRCh37          1         730845
#> 4        FAM87B              0      .     GRCh37          1         753589
#> 5        SAMD11              0      .     GRCh37          1         871158
#> 6        SAMD11              0      .     GRCh37          1         871192
#>   End_Position Strand Variant_Classification Variant_Type Reference_Allele
#> 1       138626      +                 Silent          SNP                T
#> 2       138973      +        Frame_Shift_Ins          INS                -
#> 3       730845      +          Splice_Region          SNP                G
#> 4       753589      +          Splice_Region          SNP                A
#> 5       871158      +                 Silent          SNP                C
#> 6       871192      +      Missense_Mutation          SNP                C
#>   Tumor_Seq_Allele1
#> 1                 T
#> 2                 -
#> 3                 G
#> 4                 A
#> 5                 C
#> 6                 C

maf_anno <- annotate_maf_triplet(maf)
dplyr::select(maf_anno, 1:12, seq) %>% head()
#> genomic_data Object
#> Genome Build: grch37 
#> Showing first 10 rows:
#>     Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
#> 1    AL627309.1              0      .     GRCh37          1         138626
#> 2    AL627309.1              0      .     GRCh37          1         138972
#> 3 RP11-206L10.9              0      .     GRCh37          1         730845
#> 4        FAM87B              0      .     GRCh37          1         753589
#> 5        SAMD11              0      .     GRCh37          1         871158
#> 6        SAMD11              0      .     GRCh37          1         871192
#>   End_Position Strand Variant_Classification Variant_Type Reference_Allele
#> 1       138626      +                 Silent          SNP                T
#> 2       138973      +        Frame_Shift_Ins          INS                -
#> 3       730845      +          Splice_Region          SNP                G
#> 4       753589      +          Splice_Region          SNP                A
#> 5       871158      +                 Silent          SNP                C
#> 6       871192      +      Missense_Mutation          SNP                C
#>   Tumor_Seq_Allele1 seq
#> 1                 T ATG
#> 2                 -  NA
#> 3                 G CGT
#> 4                 A AAA
#> 5                 C ACG
#> 6                 C CCG
# Each mutation is now associated with its sequence context in the
# reference genome in a column named seq
if (FALSE) { # \dontrun{
# restrict the annotation to C>T mutations (ref and alt passed by name)
annotate_maf_triplet(maf, all_SNVs = FALSE, ref = "C", alt = "T")
annotate_maf_triplet(maf, ref = "C", alt = "T", pyrimidine_collapse = TRUE)
} # }