Mutation counts across sliding windows for multiple regions.
calc_mutation_frequency_bin_regions.Rd
Obtain mutation counts across sliding windows for multiple regions, as either a long (tidy) data frame or a wide matrix.
Usage
calc_mutation_frequency_bin_regions(
  regions_list = NULL,
  regions_bed = NULL,
  these_samples_metadata = NULL,
  these_sample_ids = NULL,
  this_seq_type = "genome",
  maf_data = NULL,
  projection = "grch37",
  region_padding = 1000,
  drop_unmutated = FALSE,
  skip_regions = NULL,
  only_regions = NULL,
  slide_by = 100,
  window_size = 500,
  return_format = "wide",
  ...
)
Arguments
- regions_list
Named vector of regions in the format c(name1 = "chr:start-end", name2 = "chr:start-end"). If neither regions_list nor regions_bed is specified, the function will use GAMBLR aSHM region information.
- regions_bed
Data frame of regions with four columns (chrom, start, end, name).
- these_samples_metadata
Metadata with at least sample_id column. If not providing a maf data frame, seq_type is also required.
- these_sample_ids
Vector of sample IDs. Metadata will be subset to sample IDs present in this vector.
- this_seq_type
Optional vector of seq_types to include in heatmap. Default "genome". Uses default seq_type priority for samples with >1 seq_type.
- maf_data
Optional maf data frame. Will be subset to rows where Tumor_Sample_Barcode matches provided sample IDs or metadata table. If not provided, maf data will be obtained with get_ssm_by_regions().
- projection
Genome build the function will operate in. Ensure this matches your provided regions and maf data for correct chr prefix handling. Default "grch37".
- region_padding
Amount to pad the start and end coordinates by. Default 1000.
- drop_unmutated
Whether to drop bins with 0 mutations. If returning a matrix format, only bins with no mutations in any sample are dropped.
- skip_regions
Optional character vector of genes to exclude from the default aSHM regions.
- only_regions
Optional character vector of genes to include from the default aSHM regions.
- slide_by
Slide size for sliding window. Default 100.
- window_size
Size of sliding window. Default 500.
- return_format
Return format of mutations. Accepted inputs are "long" and "wide". Long returns a data frame with one sample ID/window combination per row. Wide returns a matrix with one sample ID per row and one window per column. Using the "wide" format will retain all samples and windows regardless of the drop_unmutated or min_count_per_bin parameters. Default "wide".
- ...
Any additional parameters.
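To make the interaction of region_padding, window_size, and slide_by concrete, here is a minimal base R sketch that enumerates sliding windows for a single region. The helper make_windows is hypothetical and not part of GAMBLR, and the boundary handling (padding both ends, keeping only windows that fit entirely inside the padded region) is an assumption; the package may tile windows differently.

```r
# Hypothetical helper, not part of GAMBLR: enumerate sliding windows for one region.
# Assumes the padded region is at least window_size wide.
make_windows <- function(start, end, region_padding = 1000,
                         window_size = 500, slide_by = 100) {
  # Pad both ends of the region (assumed behaviour).
  padded_start <- start - region_padding
  padded_end <- end + region_padding
  # Window starts advance by slide_by until a full window no longer fits.
  window_start <- seq(padded_start, padded_end - window_size, by = slide_by)
  data.frame(
    window_start = window_start,
    window_end = window_start + window_size
  )
}

# Coordinates of the KLHL21-TSS region from the bundled grch37 aSHM regions.
windows <- make_windows(start = 6661482, end = 6662702)
head(windows, 3)
```

Because slide_by is smaller than window_size by default, consecutive windows overlap, so a single mutation can contribute to the count of several windows.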
Value
A table of mutation counts for sliding windows across one or more regions. May be long or wide.
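The relationship between the two return formats can be illustrated with a toy example in base R. The column names sample_id, bin, and mutation_count and the sample IDs are assumptions made for this sketch; they are not taken from the function's actual output.

```r
# Toy long-format table: one row per sample/window combination (hypothetical columns).
long_counts <- data.frame(
  sample_id = c("S1", "S1", "S2"),
  bin = c("1_100", "1_200", "1_200"),
  mutation_count = c(2L, 1L, 3L)
)

# Cross-tabulate into a wide layout: one row per sample, one column per window.
# Sample/window combinations missing from the long table become 0.
wide_counts <- xtabs(mutation_count ~ sample_id + bin, data = long_counts)
wide_counts
```

The wide layout is the more convenient input for heatmaps and clustering, while the long layout is easier to filter and summarise per window.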
Details
This function takes a metadata table via the these_samples_metadata parameter and internally calls calc_mutation_frequency_bin_region (which in turn calls get_ssm_by_regions) to retrieve mutation counts for sliding windows across one or more regions. You may optionally provide any combination of a maf data frame, existing metadata, and a regions data frame or named vector.
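Regions can be supplied either as a four-column data frame (regions_bed) or as a named vector (regions_list). As a rough sketch of how the two representations correspond, the base R snippet below converts one into the other. It does not call GAMBLR; the two rows are copied from the bundled grch37 aSHM regions.

```r
# A minimal regions data frame with the four expected columns.
# Integer coordinates avoid scientific notation when pasted into strings.
regions_bed <- data.frame(
  chrom = c("1", "1"),
  start = c(6661482L, 23885584L),
  end = c(6662702L, 23885835L),
  name = c("KLHL21-TSS", "ID3-TSS")
)

# Equivalent named vector in "chr:start-end" format.
regions_list <- setNames(
  paste0(regions_bed$chrom, ":", regions_bed$start, "-", regions_bed$end),
  regions_bed$name
)
regions_list
```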
Examples
# Load metadata
my_meta = get_gambl_metadata()
#> Using the bundled metadata in GAMBLR.data...
dlbcl_bl_meta = dplyr::filter(my_meta, pathology %in% c("DLBCL", "BL"))
# Get aSHM regions
some_regions = create_bed_data(grch37_ashm_regions,
                               fix_names = "concat",
                               concat_cols = c("gene", "region"),
                               sep = "-")
print(some_regions)
#> BED Data Object
#> Genome Build: grch37
#> Showing first 10 rows:
#> chrom start end name region regulatory_comment
#> 1 1 6661482 6662702 KLHL21-TSS TSS <NA>
#> 2 1 23885584 23885835 ID3-TSS TSS <NA>
#> 3 1 28832551 28836339 RCC1-TSS TSS <NA>
#> 4 1 31229012 31232011 LAPTM5-TSS TSS <NA>
#> 5 1 150550814 150552135 MCL1-intron intron <NA>
#> 6 2 88904839 88909096 EIF2Ak3-intron intron <NA>
#> 7 2 88925456 88927581 EIF2Ak3-TSS TSS <NA>
#> 8 2 232572640 232574297 PTMA-TSS TSS <NA>
#> 9 2 157669490 157671299 FCRL3-TSS TSS <NA>
#> 10 1 203274698 203275778 BTG2-intron intron active_promoter
mut_count_matrix <- calc_mutation_frequency_bin_regions(
  these_samples_metadata = dlbcl_bl_meta,
  regions_bed = some_regions
)
#> id_ease: WARNING! 1783 samples in the provided metadata were removed because their seq types are not the same as in the `set_type` argument. Use `verbose = TRUE` to see their IDs.
dim(mut_count_matrix)
#> [1] 763 14697
tail(mut_count_matrix[,c(1:10)])
#> # A tibble: 6 × 10
#> sample_id `1_6661382` `1_6661482` `1_6661582` `1_6661682` `1_6661782`
#> <chr> <int> <int> <int> <int> <int>
#> 1 SP59456 0 0 0 0 0
#> 2 SP59460 0 0 0 0 0
#> 3 SU-DHL-10 0 0 0 0 0
#> 4 SU-DHL-4 0 0 0 0 0
#> 5 Seraphina 0 0 0 0 0
#> 6 Thomas 0 0 0 0 0
#> # ℹ 4 more variables: `1_6661882` <int>, `1_6661982` <int>, `1_6662082` <int>,
#> # `1_6662182` <int>
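The window columns in the output above are named with a chromosome and a coordinate separated by an underscore (for example `1_6661382`). The base R sketch below splits such names back into their parts for downstream annotation. It assumes every window column follows this pattern, and it makes no claim about whether the coordinate is the start or some other anchor of the window.

```r
# Window column names copied from the example output above.
bin_names <- c("1_6661382", "1_6661482", "1_6661582")

# Split each name at its last underscore into chromosome and coordinate.
bin_coords <- data.frame(
  bin = bin_names,
  chrom = sub("_[0-9]+$", "", bin_names),
  position = as.integer(sub("^.*_", "", bin_names))
)
bin_coords
```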