Get Manta SVs — get_manta_sv • GAMBLR.results

Retrieve Manta SVs for one or many samples

Usage

get_manta_sv(
  these_samples_metadata = NULL,
  projection = "grch37",
  region,
  min_vaf = 0.1,
  min_score = 40,
  pass_filters = TRUE,
  verbose = TRUE,
  from_cache = TRUE,
  write_to_file = FALSE,
  chromosome,
  qstart,
  qend,
  these_sample_ids = NULL,
  pairing_status
)

Arguments

these_samples_metadata: A metadata data frame to limit the result to sample_ids within it
projection: The projection genome build. Default is grch37.
region: A single region, in the format "chrom:start-end". Only SVs anchored within this region are returned.
min_vaf: The minimum tumour VAF for a SV to be returned. Default is 0.1.
min_score: The lowest Manta somatic score for a SV to be returned. Default is 40.
pass_filters: If TRUE (default), only return SVs annotated with PASS in the FILTER column. Set to FALSE to keep all variants, regardless of whether they pass the filters.
verbose: Set to FALSE to minimize the output to console. Default is TRUE. This parameter also controls the verbosity of any helper functions called internally.
from_cache: Boolean. If TRUE (default), use cached results. If write_to_file = TRUE, this parameter is automatically set to FALSE.
write_to_file: Boolean. If TRUE, write the result to a bedpe file. Default is FALSE. Setting this to TRUE forces from_cache = FALSE.
chromosome: DEPRECATED. Use region instead.
qstart: DEPRECATED. Use region instead.
qend: DEPRECATED. Use region instead.
these_sample_ids: DEPRECATED. Use these_samples_metadata instead.
pairing_status: DEPRECATED. Subset your metadata and supply these_samples_metadata instead.
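
Because chromosome, qstart and qend are deprecated, older calls need those three values collapsed into a single region string. The following is a minimal base R sketch of that conversion. The helper name make_region is hypothetical and is not part of GAMBLR.results.

```r
# Hypothetical helper (not part of GAMBLR.results): build a region string
# in the "chrom:start-end" format expected by the region argument.
make_region <- function(chromosome, qstart, qend) {
  stopifnot(length(chromosome) == 1, qstart <= qend)
  # scientific = FALSE avoids strings such as "1.28e+08"
  paste0(
    chromosome, ":",
    format(qstart, scientific = FALSE, trim = TRUE), "-",
    format(qend, scientific = FALSE, trim = TRUE)
  )
}

# Same grch37 MYC coordinates as used in the Examples section
myc_region_grch37 <- make_region("8", 128723128, 128774067)
```

The resulting string can then be passed as region in place of the three deprecated arguments.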

Value

A data frame in a bedpe-like format with additional columns that allow filtering of high-confidence SVs.

Details

Retrieve Manta SVs with additional VCF information to allow for filtering of high-confidence variants. To get SV calls for multiple samples, supply a metadata table via these_samples_metadata that has been subset to only those samples. The results will be restricted to the sample_ids within that data frame. When from_cache = FALSE, this function relies on internal helper functions (get_manta_sv_by_samples).

This function can also restrict the returned breakpoints to a genomic region specified via region (in chr:start-end format). Several filtering parameters are available: use min_vaf to set the minimum tumour VAF for a SV to be returned and min_score to set the lowest Manta somatic score for a SV to be returned. In addition, the user can choose to return all variants, including those that do not pass the filter criteria. To do so, set pass_filters = FALSE (defaults to TRUE).
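
To make the effect of min_vaf, min_score and pass_filters concrete, here is a small sketch that applies equivalent filters to a toy data frame. It illustrates the documented behaviour and is not the package's implementation. The data are made up, filter_toy_sv is a hypothetical helper, and treating the thresholds as inclusive lower bounds is an assumption.

```r
# Toy data only. Column names follow the documented output (SCORE, VAF_tumour)
# plus the VCF FILTER column mentioned under pass_filters.
toy_sv <- data.frame(
  manta_name = c("sv1", "sv2", "sv3", "sv4"),
  SCORE      = c(46, 35, 90, 120),
  VAF_tumour = c(0.118, 0.400, 0.050, 0.300),
  FILTER     = c("PASS", "PASS", "PASS", "MaxDepth"),
  stringsAsFactors = FALSE
)

# Hypothetical helper; assumes thresholds are inclusive lower bounds.
filter_toy_sv <- function(sv, min_vaf = 0.1, min_score = 40,
                          pass_filters = TRUE) {
  keep <- sv$VAF_tumour >= min_vaf & sv$SCORE >= min_score
  if (pass_filters) {
    # Additionally require PASS in the FILTER column
    keep <- keep & sv$FILTER == "PASS"
  }
  sv[keep, , drop = FALSE]
}

default_kept <- filter_toy_sv(toy_sv)                        # keeps sv1
all_kept     <- filter_toy_sv(toy_sv, pass_filters = FALSE)  # keeps sv1, sv4
```

With the defaults, sv2 is dropped for its low score, sv3 for its low VAF and sv4 for its FILTER value. Setting pass_filters = FALSE brings sv4 back while the VAF and score thresholds still apply.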

Advanced settings (probably not for you)

It is advised to leave from_cache at its default of TRUE so that Manta results are pulled from a pre-generated merge (i.e. the cached result). If set to FALSE in combination with write_to_file = TRUE, the function will (re)generate the merged Manta calls, provided the user has the required file permissions. Note that if write_to_file is set to TRUE, the function sets from_cache = FALSE to avoid nonsensical parameter combinations.

Is this function not what you are looking for? You may want get_combined_sv. After running this function or get_combined_sv, you most likely want to annotate the result using GAMBLR.utils::annotate_sv.

Examples

# lazily get every SV in the table with default quality filters
all_sv <- get_manta_sv()
#> [1] "no metadata provided, fetching all samples..."
#> [1] "dropping capture samples because manta results\n      are only available for genome seq_type"
#> 
#> The cached results were last updated: 2025-02-24 16:06:43.114603
#> 
#> Reading cached results...
#> [1] "No Manta SVs found for 327 samples and 13 cohorts"
#>  [1] "DLBCL_LSARP_Trios"   "tFL_LSARP_Trios"     "pFL_LSARP_Trios"    
#>  [4] "FL_FOLL_BR"          "DLBCL_TFRI_DarkZone" "DLBCL_Pasqualucci"  
#>  [7] "DLBCL_montreal"      "DLBCL_Jain"          "DLBCL_cell_lines"   
#> [10] "MCL_CellLines"       "cHL_Maura"           "MM_mmsanger"        
#> [13] "SMZL_Strefford"     
#> 
#> The following VCF filters are applied;
#>   Minimum VAF: 0.1
#>   Minimum Score: 40
#>   Only keep variants passing the quality filter: TRUE
#> 
#> Returning 789098 variants from 1664 sample(s)
#> 
#> Done!
dplyr::select(all_sv,1:14) %>% head()
#> genomic_data Object
#> Genome Build: grch37 
#> Showing first 10 rows:
#>   CHROM_A START_A END_A CHROM_B   START_B     END_B
#> 1       1   10286 10286       8 146301391 146301391
#> 2       1   10309 10837      12     95038     95505
#> 3       1   10347 10630      15 102520227 102520676
#> 4       1   10438 10438       8 146301391 146301391
#> 5       1   10438 10438       8 146301391 146301391
#> 6       1   10457 10839      12     94873     95291
#>                     manta_name SCORE STRAND_A STRAND_B tumour_sample_id
#> 1   MantaBND:5:1923:1927:0:0:0    46        +        +        09-41114T
#> 2   MantaBND:1:6049:6050:1:0:0    52        +        +     4687-03-01BD
#> 3  MantaBND:11:3940:4135:0:0:0    58        -        -        12-34927T
#> 4   MantaBND:2:7221:7224:0:1:0    84        +        +      102-01-01TD
#> 5   MantaBND:2:1723:1728:0:0:0    81        +        +    102-0202-1DVT
#> 6 MantaBND:3:26317:26320:0:0:0    56        +        +     4690-03-01BD
#>   normal_sample_id VAF_tumour  DP
#> 1        14-11247N      0.118 110
#> 2   14-11247Normal      0.250  52
#> 3        14-11247N      0.135 104
#> 4   14-11247Normal      0.520  25
#> 5   14-11247Normal      0.630  27
#> 6   14-11247Normal      0.333  18

# get all SVs for just one cohort
cohort_meta = suppressMessages(get_gambl_metadata()) %>% 
              dplyr::filter(cohort == "DLBCL_cell_lines")

some_sv <- get_manta_sv(these_samples_metadata = cohort_meta, verbose=FALSE)
dplyr::select(some_sv,1:14) %>% head()
#> genomic_data Object
#> Genome Build: grch37 
#> Showing first 10 rows:
#>   CHROM_A START_A   END_A CHROM_B START_B   END_B               manta_name
#> 1       1  963851  963870       1  964461  964461 MantaDEL:14848:0:0:0:0:0
#> 2       1 1142719 1142719       1 1143140 1143140 MantaDEL:14306:0:0:0:0:0
#> 3       1 1142719 1142719       1 1143140 1143140 MantaDEL:14173:0:0:0:0:0
#> 4       1 1142719 1142719       1 1143140 1143140 MantaDEL:11910:0:0:0:0:0
#> 5       1 1161716 1161716       1 1161780 1161780 MantaDEL:15361:0:0:0:0:0
#> 6       1 1161716 1161716       1 1161780 1161780 MantaDEL:11880:0:0:0:0:0
#>   SCORE STRAND_A STRAND_B tumour_sample_id normal_sample_id VAF_tumour  DP
#> 1   144        +        -           Toledo        14-11247N      0.923  26
#> 2    94        +        -            HBL-1        14-11247N      0.300 100
#> 3    81        +        -         SU-DHL-4        14-11247N      0.256  78
#> 4    55        +        -         SU-DHL-9        14-11247N      0.183  60
#> 5    48        +        -               HT        14-11247N      0.273  44
#> 6    58        +        -            MD903        14-11247N      0.471  34
nrow(some_sv)
#> [1] 21216

# get the SVs in a region around MYC
# WARNING: This is not the best way to find MYC SVs.
# Use annotate_sv on the full SV set instead.
myc_region_hg38 = "chr8:127710883-127761821"
myc_region_grch37 = "8:128723128-128774067"

hg38_myc_locus_sv <- get_manta_sv(region = myc_region_hg38,
                                projection = "hg38",
                                verbose = FALSE)
dplyr::select(hg38_myc_locus_sv,1:14) %>% head()
#> genomic_data Object
#> Genome Build: hg38 
#> Showing first 10 rows:
#>   CHROM_A  START_A    END_A CHROM_B   START_B     END_B
#> 1    chr2  9700440  9700440    chr8 127726024 127726024
#> 2    chr2 28983233 28983240    chr8 127711264 127711271
#> 3    chr2 88858802 88858802    chr8 127744262 127744262
#> 4    chr2 88860304 88860306    chr8 127751936 127751938
#> 5    chr2 88860417 88860417    chr8 127751955 127751955
#> 6    chr2 88861500 88861500    chr8 127748752 127748752
#>                     manta_name SCORE STRAND_A STRAND_B
#> 1     MantaBND:80035:1:8:0:0:0   103        +        +
#> 2 MantaBND:3:52907:52908:0:3:0    43        -        -
#> 3    MantaBND:279432:0:1:0:0:0   148        +        +
#> 4  MantaBND:194837:0:1:0:0:0:0   102        +        +
#> 5  MantaBND:194837:0:1:0:0:0:0    73        -        -
#> 6   MantaBND:1102030:0:1:0:0:0    89        +        +
#>            tumour_sample_id          normal_sample_id VAF_tumour  DP
#> 1 BLGSP-71-06-00252-01A-01D BLGSP-71-06-00252-10A-01D      0.194 252
#> 2           02-14764_tumorB           02-14764_normal      0.109  55
#> 3                   SP59344                   SP59342      0.386  88
#> 4 BLGSP-71-27-00414-01A-01E BLGSP-71-27-00414-10A-01D      0.171 280
#> 5 BLGSP-71-27-00414-01A-01E BLGSP-71-27-00414-10A-01D      0.117 230
#> 6 BLGSP-71-30-00647-01A-01E BLGSP-71-06-00286-99A-01D      0.283  46
nrow(hg38_myc_locus_sv)
#> [1] 458

incorrect_myc_locus_sv <- get_manta_sv(region = myc_region_grch37,
                                projection = "hg38",
                                verbose = FALSE)
dplyr::select(incorrect_myc_locus_sv,1:14) %>% head()
#> genomic_data Object
#> Genome Build: hg38 
#> Showing first 10 rows:
#>   CHROM_A   START_A     END_A CHROM_B   START_B     END_B
#> 1    chr4  77227094  77227100    chr8 128767241 128767247
#> 2    chr8   1287381   1287381    chr8   1287384   1287384
#> 3    chr8 128726344 128727379   chr11  93629113  93629647
#> 4    chr8 128726820 128726820    chr8 128726825 128726825
#> 5    chr8 128726820 128726820    chr8 128726825 128726825
#> 6    chr8 128738979 128738983    chr8 128752584 128752588
#>                   manta_name SCORE STRAND_A STRAND_B tumour_sample_id
#> 1  MantaBND:658884:1:2:0:0:0    42        -        -  14-33798_tumorB
#> 2 MantaINS:1063533:0:0:0:4:0    51        +        -  97-28459_tumorB
#> 3   MantaBND:28037:1:9:0:0:0    66        -        +        01-20774T
#> 4  MantaINS:242009:0:0:0:3:0    76        +        -         PD26403a
#> 5  MantaINS:226876:7:7:1:3:0    84        +        -         PD26403c
#> 6 MantaDEL:1407936:0:1:0:0:0   118        +        -  04-14093_tumorA
#>   normal_sample_id VAF_tumour  DP
#> 1  14-33798_normal      0.136  44
#> 2          FL3006N      0.308  26
#> 3        14-11247N      0.280  25
#> 4         PD26403b      0.400 105
#> 5         PD26403b      0.407 113
#> 6  04-14093_normal      0.442  43
nrow(incorrect_myc_locus_sv)
#> [1] 28

# Despite potentially being incomplete, we can nonetheless
# annotate these directly for more details
annotated_myc_hg38 = suppressMessages(
         GAMBLR.utils::annotate_sv(hg38_myc_locus_sv, genome_build = "hg38")
)
head(annotated_myc_hg38)
#>   chrom1    start1      end1 chrom2    start2      end2 name score strand1
#> 1      2  28983233  28983240      8 127711264 127711271    .    43       -
#> 2      4   1746419   1746421      8 127723483 127723485    .    77       -
#> 3      8 127226860 127226862      8 127759782 127759784    .    56       +
#> 4      8 127226860 127226860      8 127759821 127759821    .    51       -
#> 5      8 127301019 127301020      8 127742838 127742839    .    71       -
#> 6      8 127301020 127301022      8 127742838 127742840    .    65       +
#>   strand2 tumour_sample_id  gene partner   fusion
#> 1       -  02-14764_tumorB   ALK    <NA>   NA-ALK
#> 2       -        09-41114T WHSC1    <NA> NA-WHSC1
#> 3       +          SP13307   MYC    <NA>   NA-MYC
#> 4       -          SP13307   MYC    <NA>   NA-MYC
#> 5       -      365-16-01TD   MYC    <NA>   NA-MYC
#> 6       +      365-16-01TD   MYC    <NA>   NA-MYC
table(annotated_myc_hg38$partner)
#> 
#>  BCL6 CCNL1   DMD   IGH   IGK   IGL  LRMP  PAX5 RFTN1 
#>     3     1     2   293     5     6     1     5     1 
# The usual MYC partners are seen here

annotated_myc_incorrect = suppressMessages(
         GAMBLR.utils::annotate_sv(incorrect_myc_locus_sv, genome_build = "hg38")
)
head(annotated_myc_incorrect)
#>   chrom1    start1      end1 chrom2    start2      end2 name score strand1
#> 1      8 128726344 128727379     11  93629113  93629647    .    66       -
#> 2      8 128726820 128726820      8 128726825 128726825    .    76       +
#> 3      8 128726820 128726820      8 128726825 128726825    .    84       +
#> 4      8 128738979 128738983      8 128752584 128752588    .   118       +
#> 5      8 128738979 128738983      8 128752584 128752588    .   127       +
#> 6      8 128738981 128738981      8 128752584 128752584    .   126       +
#>   strand2 tumour_sample_id gene partner fusion
#> 1       +        01-20774T  MYC    <NA> NA-MYC
#> 2       -         PD26403a  MYC    <NA> NA-MYC
#> 3       -         PD26403c  MYC    <NA> NA-MYC
#> 4       -  04-14093_tumorA  MYC    <NA> NA-MYC
#> 5       -  04-14093_tumorB  MYC    <NA> NA-MYC
#> 6       -        05-24065T  MYC    <NA> NA-MYC
table(annotated_myc_incorrect$partner)
#> < table of extent 0 >
# The effect of specifying coordinates from the wrong genome build is evident
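
A cheap guard against the mix-up shown above is to check the chromosome prefix of a region string before querying. The sketch below assumes the naming seen in the outputs on this page ("chr"-prefixed chromosomes for hg38, unprefixed for grch37). The helper region_prefix_matches is hypothetical and not part of GAMBLR.results. The prefix is only a hint: it cannot confirm that the coordinates themselves belong to the stated build.

```r
# Hypothetical helper (not part of GAMBLR.results). Assumes "chr"-prefixed
# chromosome names for hg38 and unprefixed names for grch37.
region_prefix_matches <- function(region, projection) {
  has_chr <- startsWith(region, "chr")
  switch(projection,
    hg38   = has_chr,
    grch37 = !has_chr,
    stop("unsupported projection: ", projection)
  )
}

ok_hg38  <- region_prefix_matches("chr8:127710883-127761821", "hg38")  # TRUE
bad_hg38 <- region_prefix_matches("8:128723128-128774067", "hg38")     # FALSE
```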