Annotate MAF with triplet context
annotate_maf_triplet.Rd
Give the triplet sequence of the mutated base together with its adjacent bases (-1 and +1)
Usage
annotate_maf_triplet(
maf,
all_SNVs = TRUE,
ref,
alt,
genome_build,
fastaPath,
pyrimidine_collapse = FALSE
)
Arguments
- maf
MAF data frame (required columns: Reference_Allele, Tumor_Seq_Allele2)
- all_SNVs
Return the triplet sequences for all SNVs, without restricting to specific ref and alt alleles (default is TRUE)
- ref
Reference allele
- alt
Alternative allele
- genome_build
The genome build for the variants you are working with (default is to infer it from the MAF)
- fastaPath
Path to a FASTA file on disk. On GSC, the function first attempts to infer this from the gambl reference, using the path specified in the config. Local files are also accepted.
- pyrimidine_collapse
Estimate mutation_strand and
Value
A data frame with two or three extra columns. When pyrimidine_collapse = FALSE, the triplet sequence (seq) and the strand (mutation_strand) are added. When pyrimidine_collapse = TRUE, a third column showing the mutation (mutation) is added as well.
Details
Given reference and alternative alleles, the function selects the rows of the data frame that match these values for + strand genes, and the rows that match the complementary alleles for - strand genes, and then looks up the bases adjacent to each mutated position. Alternatively, it can process all SNVs in the MAF data frame and return a triplet sequence for each (the reverse complement sequence for the - strand).
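The strand-aware lookup described above can be illustrated with a minimal base R sketch. It is not the package's implementation: toy_ref, toy_maf, rev_comp and triplet_at are hypothetical names introduced here, and a short string stands in for the FASTA reference.

```r
# Toy reference sequence standing in for one chromosome (hypothetical data).
toy_ref <- "ACGTTGCA"

# Toy MAF-like data frame; positions are 1-based into toy_ref.
toy_maf <- data.frame(
  Start_Position    = c(2, 6, 4),
  Reference_Allele  = c("C", "G", "T"),
  Tumor_Seq_Allele2 = c("T", "A", "A"),
  stringsAsFactors  = FALSE
)

# Complement and reverse complement of a DNA string.
comp <- function(x) chartr("ACGT", "TGCA", x)
rev_comp <- function(x) paste(rev(strsplit(comp(x), "")[[1]]), collapse = "")

# Triplet centred on `pos` (bases pos - 1, pos, pos + 1).
# Returns NA when the strand is unknown or the window runs off the sequence.
triplet_at <- function(ref_seq, pos, strand) {
  if (is.na(strand) || pos < 2 || pos > nchar(ref_seq) - 1) {
    return(NA_character_)
  }
  tri <- substr(ref_seq, pos - 1, pos + 1)
  if (strand == "-") rev_comp(tri) else tri
}

# Select C>T rows as "+" and their complement (G>A) rows as "-".
ref <- "C"
alt <- "T"
plus  <- toy_maf$Reference_Allele == ref &
  toy_maf$Tumor_Seq_Allele2 == alt
minus <- toy_maf$Reference_Allele == comp(ref) &
  toy_maf$Tumor_Seq_Allele2 == comp(alt)
toy_maf$mutation_strand <- ifelse(plus, "+", ifelse(minus, "-", NA_character_))

# Look up the triplet for every row, reverse-complementing "-" rows.
toy_maf$seq <- unname(mapply(
  triplet_at,
  pos      = toy_maf$Start_Position,
  strand   = toy_maf$mutation_strand,
  MoreArgs = list(ref_seq = toy_ref)
))

toy_maf
```

In this toy example the third row (T>A) matches neither C>T nor its complement, so it receives NA for both mutation_strand and seq.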
Examples
maf <- get_coding_ssm(projection = "grch37") %>% head(n = 500)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
# peek at the data
dplyr::select(maf, 1:12) %>% head()
#> genomic_data Object
#> Genome Build: grch37
#> Showing first 10 rows:
#> Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
#> 1 AL627309.1 0 . GRCh37 1 138626
#> 2 AL627309.1 0 . GRCh37 1 138972
#> 3 RP11-206L10.9 0 . GRCh37 1 730845
#> 4 FAM87B 0 . GRCh37 1 753589
#> 5 SAMD11 0 . GRCh37 1 871158
#> 6 SAMD11 0 . GRCh37 1 871192
#> End_Position Strand Variant_Classification Variant_Type Reference_Allele
#> 1 138626 + Silent SNP T
#> 2 138973 + Frame_Shift_Ins INS -
#> 3 730845 + Splice_Region SNP G
#> 4 753589 + Splice_Region SNP A
#> 5 871158 + Silent SNP C
#> 6 871192 + Missense_Mutation SNP C
#> Tumor_Seq_Allele1
#> 1 T
#> 2 -
#> 3 G
#> 4 A
#> 5 C
#> 6 C
maf_anno <- annotate_maf_triplet(maf)
dplyr::select(maf_anno, 1:12, seq) %>% head()
#> genomic_data Object
#> Genome Build: grch37
#> Showing first 10 rows:
#> Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
#> 1 AL627309.1 0 . GRCh37 1 138626
#> 2 AL627309.1 0 . GRCh37 1 138972
#> 3 RP11-206L10.9 0 . GRCh37 1 730845
#> 4 FAM87B 0 . GRCh37 1 753589
#> 5 SAMD11 0 . GRCh37 1 871158
#> 6 SAMD11 0 . GRCh37 1 871192
#> End_Position Strand Variant_Classification Variant_Type Reference_Allele
#> 1 138626 + Silent SNP T
#> 2 138973 + Frame_Shift_Ins INS -
#> 3 730845 + Splice_Region SNP G
#> 4 753589 + Splice_Region SNP A
#> 5 871158 + Silent SNP C
#> 6 871192 + Missense_Mutation SNP C
#> Tumor_Seq_Allele1 seq
#> 1 T ATG
#> 2 - NA
#> 3 G CGT
#> 4 A AAA
#> 5 C ACG
#> 6 C CCG
# Each mutation is now associated with its sequence context in the
# reference genome in a column named seq
if (FALSE) { # \dontrun{
annotate_maf_triplet(maf, all_SNVs = FALSE, ref = "C", alt = "T")
annotate_maf_triplet(maf, ref = "C", alt = "T", pyrimidine_collapse = TRUE)
} # }
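For readers unfamiliar with the idea behind pyrimidine_collapse = TRUE, the sketch below shows the convention commonly used in mutational signature analysis: every SNV is expressed relative to a pyrimidine (C or T) reference base, so purine-reference mutations are flipped to the opposite strand. This is an assumption about the general technique, not a description of what annotate_maf_triplet does internally. The helper collapse_to_pyrimidine and the "C>T" label format are hypothetical.

```r
# Hypothetical helper, not part of the package: express one SNV and its
# triplet relative to a pyrimidine (C or T) reference base.
collapse_to_pyrimidine <- function(ref, alt, triplet) {
  comp <- function(x) chartr("ACGT", "TGCA", x)
  rev_comp <- function(x) {
    paste(rev(strsplit(comp(x), "")[[1]]), collapse = "")
  }
  if (ref %in% c("A", "G")) {
    # Purine reference: complement the alleles, reverse-complement the triplet.
    list(
      mutation        = paste0(comp(ref), ">", comp(alt)),
      seq             = rev_comp(triplet),
      mutation_strand = "-"
    )
  } else {
    # Pyrimidine reference: keep everything as it is.
    list(
      mutation        = paste0(ref, ">", alt),
      seq             = triplet,
      mutation_strand = "+"
    )
  }
}

flipped <- collapse_to_pyrimidine("G", "A", "TGC")  # purine reference
kept    <- collapse_to_pyrimidine("C", "T", "ACG")  # pyrimidine reference
str(flipped)
str(kept)
```

Under this convention the twelve possible substitutions reduce to six classes (C>A, C>G, C>T, T>A, T>C, T>G), which is why G>A in a TGC context and C>T in a GCA context end up in the same category.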