
Retrieve Manta SVs for one or many samples

Usage

get_manta_sv(
  these_samples_metadata = NULL,
  projection = "grch37",
  region,
  min_vaf = 0.1,
  min_score = 40,
  pass_filters = TRUE,
  verbose = TRUE,
  from_cache = TRUE,
  write_to_file = FALSE,
  chromosome,
  qstart,
  qend,
  these_sample_ids = NULL,
  pairing_status
)

Arguments

these_samples_metadata

A metadata data frame to limit the result to sample_ids within it

projection

The projection genome build. Default is grch37.

region

A single region, in the format "chrom:start-end". Only SVs anchored within this region are returned.

min_vaf

The minimum tumour VAF for a SV to be returned. Default is 0.1.

min_score

The lowest Manta somatic score for a SV to be returned. Default is 40.

pass_filters

If TRUE (default), only return SVs that are annotated with PASS in the FILTER column. Set to FALSE to keep all variants, regardless of whether they pass the filters.

verbose

Set to FALSE to minimize console output. Default is TRUE. This parameter also controls the verbosity of any helper functions called internally.

from_cache

Boolean. Whether to use cached results. Default is TRUE. If write_to_file = TRUE, this parameter is automatically set to FALSE.

write_to_file

Boolean. If TRUE, the result is written to a bedpe file. Default is FALSE. Setting this to TRUE forces from_cache = FALSE.

chromosome

DEPRECATED. Use region instead.

qstart

DEPRECATED. Use region instead.

qend

DEPRECATED. Use region instead.

these_sample_ids

DEPRECATED. Use these_samples_metadata instead.

pairing_status

DEPRECATED. Subset your metadata and supply these_samples_metadata instead.
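Code that still passes the deprecated chromosome, qstart and qend arguments can be migrated by pasting them into a single region string. The following is a minimal sketch in base R; make_region is a hypothetical helper written for illustration and is not part of GAMBLR.

```r
# Hypothetical helper (not part of GAMBLR): build a "chrom:start-end"
# region string from the deprecated chromosome/qstart/qend values.
make_region <- function(chromosome, qstart, qend) {
  stopifnot(length(chromosome) == 1, qstart <= qend)
  # scientific = FALSE avoids strings such as "1e+08" for large coordinates
  paste0(
    chromosome, ":",
    format(qstart, scientific = FALSE, trim = TRUE), "-",
    format(qend, scientific = FALSE, trim = TRUE)
  )
}

myc_region <- make_region("8", 128723128, 128774067)
myc_region
# The string can then be supplied as: get_manta_sv(region = myc_region)
```

Whether the chromosome should carry a "chr" prefix depends on the projection, so the helper leaves the chromosome name untouched.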

Value

A data frame in a bedpe-like format with additional columns that allow filtering of high-confidence SVs.

Details

Retrieve Manta SVs with additional VCF information to allow for filtering of high-confidence variants. To get SV calls for multiple samples, supply a metadata table via these_samples_metadata that has been subset to only those samples. The results will be restricted to the sample_ids within that data frame. If from_cache = FALSE, this function relies on the internal function get_manta_sv_by_samples.

This function can also restrict the returned breakpoints to a genomic region specified via region (in chr:start-end format). Two filtering parameters are available: min_vaf sets the minimum tumour VAF for a SV to be returned, and min_score sets the lowest Manta somatic score for a SV to be returned. In addition, the user can choose to return all variants, including those that do not pass the filter criteria, by setting pass_filters = FALSE (defaults to TRUE).
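To make the effect of min_vaf, min_score and pass_filters concrete, the sketch below applies the same kind of thresholds to a toy table in base R. It is an illustration under assumptions, not the package's implementation: filter_sv is a hypothetical function, the rows are made up, the thresholds are assumed to be inclusive, and the table is assumed to carry SCORE, VAF_tumour and FILTER columns.

```r
# Toy bedpe-like table; all values are invented for illustration.
toy_sv <- data.frame(
  manta_name = c("sv1", "sv2", "sv3", "sv4"),
  SCORE      = c(46, 35, 84, 120),
  VAF_tumour = c(0.118, 0.250, 0.050, 0.520),
  FILTER     = c("PASS", "PASS", "PASS", "MaxDepth"),
  stringsAsFactors = FALSE
)

# Hypothetical re-implementation of the documented filters.
# Assumption: thresholds are inclusive (>=).
filter_sv <- function(sv, min_vaf = 0.1, min_score = 40, pass_filters = TRUE) {
  keep <- sv$VAF_tumour >= min_vaf & sv$SCORE >= min_score
  if (pass_filters) {
    keep <- keep & sv$FILTER == "PASS"
  }
  sv[keep, , drop = FALSE]
}

strict  <- filter_sv(toy_sv)                        # defaults: keeps sv1 only
lenient <- filter_sv(toy_sv, pass_filters = FALSE)  # also keeps sv4
```

With the defaults, sv2 is dropped for its score, sv3 for its VAF and sv4 for its FILTER value. Setting pass_filters = FALSE brings back only sv4, because the VAF and score thresholds still apply.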

Advanced settings (probably not for you)

It is advised to leave from_cache at its default of TRUE, which ensures that Manta results are pulled from a pre-generated merge (i.e. the cached result). If set to FALSE in combination with write_to_file = TRUE, the function will (re)generate new merged Manta calls, provided the user has the required file permissions. Note that if write_to_file is set to TRUE, the function sets from_cache = FALSE to avoid nonsensical parameter combinations.

Is this function not what you are looking for? You may want: get_combined_sv

After running this or get_combined_sv, you most likely want to annotate the result using GAMBLR.utils::annotate_sv
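The interaction between from_cache and write_to_file can be summarised as a small decision rule. The sketch below restates the documented behaviour in base R; resolve_cache_settings is a hypothetical name, and the actual function may implement this differently.

```r
# Hypothetical helper restating the documented rule:
# write_to_file = TRUE forces from_cache = FALSE.
resolve_cache_settings <- function(from_cache = TRUE, write_to_file = FALSE) {
  if (write_to_file && from_cache) {
    from_cache <- FALSE
  }
  list(from_cache = from_cache, write_to_file = write_to_file)
}

default_settings <- resolve_cache_settings()
regen_settings   <- resolve_cache_settings(write_to_file = TRUE)
```

Under this rule the defaults read from the cache, while any call with write_to_file = TRUE bypasses it, whatever value was passed for from_cache.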

Examples

# lazily get every SV in the table with default quality filters
all_sv <- get_manta_sv()
#> [1] "no metadata provided, fetching all samples..."
#> [1] "dropping capture samples because manta results\n      are only available for genome seq_type"
#> 
#> The cached results were last updated: 2025-02-24 16:06:43.114603
#> 
#> Reading cached results...
#> [1] "No Manta SVs found for 327 samples and 13 cohorts"
#>  [1] "DLBCL_LSARP_Trios"   "tFL_LSARP_Trios"     "pFL_LSARP_Trios"    
#>  [4] "FL_FOLL_BR"          "DLBCL_TFRI_DarkZone" "DLBCL_Pasqualucci"  
#>  [7] "DLBCL_montreal"      "DLBCL_Jain"          "DLBCL_cell_lines"   
#> [10] "MCL_CellLines"       "cHL_Maura"           "MM_mmsanger"        
#> [13] "SMZL_Strefford"     
#> 
#> The following VCF filters are applied;
#>   Minimum VAF: 0.1
#>   Minimum Score: 40
#>   Only keep variants passing the quality filter: TRUE
#> 
#> Returning 789098 variants from 1664 sample(s)
#> 
#> Done!
dplyr::select(all_sv,1:14) %>% head()
#> genomic_data Object
#> Genome Build: grch37 
#> Showing first 10 rows:
#>   CHROM_A START_A END_A CHROM_B   START_B     END_B
#> 1       1   10286 10286       8 146301391 146301391
#> 2       1   10309 10837      12     95038     95505
#> 3       1   10347 10630      15 102520227 102520676
#> 4       1   10438 10438       8 146301391 146301391
#> 5       1   10438 10438       8 146301391 146301391
#> 6       1   10457 10839      12     94873     95291
#>                     manta_name SCORE STRAND_A STRAND_B tumour_sample_id
#> 1   MantaBND:5:1923:1927:0:0:0    46        +        +        09-41114T
#> 2   MantaBND:1:6049:6050:1:0:0    52        +        +     4687-03-01BD
#> 3  MantaBND:11:3940:4135:0:0:0    58        -        -        12-34927T
#> 4   MantaBND:2:7221:7224:0:1:0    84        +        +      102-01-01TD
#> 5   MantaBND:2:1723:1728:0:0:0    81        +        +    102-0202-1DVT
#> 6 MantaBND:3:26317:26320:0:0:0    56        +        +     4690-03-01BD
#>   normal_sample_id VAF_tumour  DP
#> 1        14-11247N      0.118 110
#> 2   14-11247Normal      0.250  52
#> 3        14-11247N      0.135 104
#> 4   14-11247Normal      0.520  25
#> 5   14-11247Normal      0.630  27
#> 6   14-11247Normal      0.333  18

# get all SVs for just one cohort
cohort_meta = suppressMessages(get_gambl_metadata()) %>% 
              dplyr::filter(cohort == "DLBCL_cell_lines")

some_sv <- get_manta_sv(these_samples_metadata = cohort_meta, verbose=FALSE)
dplyr::select(some_sv,1:14) %>% head()
#> genomic_data Object
#> Genome Build: grch37 
#> Showing first 10 rows:
#>   CHROM_A START_A   END_A CHROM_B START_B   END_B               manta_name
#> 1       1  963851  963870       1  964461  964461 MantaDEL:14848:0:0:0:0:0
#> 2       1 1142719 1142719       1 1143140 1143140 MantaDEL:14306:0:0:0:0:0
#> 3       1 1142719 1142719       1 1143140 1143140 MantaDEL:14173:0:0:0:0:0
#> 4       1 1142719 1142719       1 1143140 1143140 MantaDEL:11910:0:0:0:0:0
#> 5       1 1161716 1161716       1 1161780 1161780 MantaDEL:15361:0:0:0:0:0
#> 6       1 1161716 1161716       1 1161780 1161780 MantaDEL:11880:0:0:0:0:0
#>   SCORE STRAND_A STRAND_B tumour_sample_id normal_sample_id VAF_tumour  DP
#> 1   144        +        -           Toledo        14-11247N      0.923  26
#> 2    94        +        -            HBL-1        14-11247N      0.300 100
#> 3    81        +        -         SU-DHL-4        14-11247N      0.256  78
#> 4    55        +        -         SU-DHL-9        14-11247N      0.183  60
#> 5    48        +        -               HT        14-11247N      0.273  44
#> 6    58        +        -            MD903        14-11247N      0.471  34
nrow(some_sv)
#> [1] 21216

# get the SVs in a region around MYC
# WARNING: This is not the best way to find MYC SVs.
# Use annotate_sv on the full SV set instead.
myc_region_hg38 = "chr8:127710883-127761821"
myc_region_grch37 = "8:128723128-128774067"

hg38_myc_locus_sv <- get_manta_sv(region = myc_region_hg38,
                                projection = "hg38",
                                verbose = FALSE)
dplyr::select(hg38_myc_locus_sv,1:14) %>% head()
#> genomic_data Object
#> Genome Build: hg38 
#> Showing first 10 rows:
#>   CHROM_A  START_A    END_A CHROM_B   START_B     END_B
#> 1    chr2  9700440  9700440    chr8 127726024 127726024
#> 2    chr2 28983233 28983240    chr8 127711264 127711271
#> 3    chr2 88858802 88858802    chr8 127744262 127744262
#> 4    chr2 88860304 88860306    chr8 127751936 127751938
#> 5    chr2 88860417 88860417    chr8 127751955 127751955
#> 6    chr2 88861500 88861500    chr8 127748752 127748752
#>                     manta_name SCORE STRAND_A STRAND_B
#> 1     MantaBND:80035:1:8:0:0:0   103        +        +
#> 2 MantaBND:3:52907:52908:0:3:0    43        -        -
#> 3    MantaBND:279432:0:1:0:0:0   148        +        +
#> 4  MantaBND:194837:0:1:0:0:0:0   102        +        +
#> 5  MantaBND:194837:0:1:0:0:0:0    73        -        -
#> 6   MantaBND:1102030:0:1:0:0:0    89        +        +
#>            tumour_sample_id          normal_sample_id VAF_tumour  DP
#> 1 BLGSP-71-06-00252-01A-01D BLGSP-71-06-00252-10A-01D      0.194 252
#> 2           02-14764_tumorB           02-14764_normal      0.109  55
#> 3                   SP59344                   SP59342      0.386  88
#> 4 BLGSP-71-27-00414-01A-01E BLGSP-71-27-00414-10A-01D      0.171 280
#> 5 BLGSP-71-27-00414-01A-01E BLGSP-71-27-00414-10A-01D      0.117 230
#> 6 BLGSP-71-30-00647-01A-01E BLGSP-71-06-00286-99A-01D      0.283  46
nrow(hg38_myc_locus_sv)
#> [1] 458

incorrect_myc_locus_sv <- get_manta_sv(region = myc_region_grch37,
                                projection = "hg38",
                                verbose = FALSE)
dplyr::select(incorrect_myc_locus_sv,1:14) %>% head()
#> genomic_data Object
#> Genome Build: hg38 
#> Showing first 10 rows:
#>   CHROM_A   START_A     END_A CHROM_B   START_B     END_B
#> 1    chr4  77227094  77227100    chr8 128767241 128767247
#> 2    chr8   1287381   1287381    chr8   1287384   1287384
#> 3    chr8 128726344 128727379   chr11  93629113  93629647
#> 4    chr8 128726820 128726820    chr8 128726825 128726825
#> 5    chr8 128726820 128726820    chr8 128726825 128726825
#> 6    chr8 128738979 128738983    chr8 128752584 128752588
#>                   manta_name SCORE STRAND_A STRAND_B tumour_sample_id
#> 1  MantaBND:658884:1:2:0:0:0    42        -        -  14-33798_tumorB
#> 2 MantaINS:1063533:0:0:0:4:0    51        +        -  97-28459_tumorB
#> 3   MantaBND:28037:1:9:0:0:0    66        -        +        01-20774T
#> 4  MantaINS:242009:0:0:0:3:0    76        +        -         PD26403a
#> 5  MantaINS:226876:7:7:1:3:0    84        +        -         PD26403c
#> 6 MantaDEL:1407936:0:1:0:0:0   118        +        -  04-14093_tumorA
#>   normal_sample_id VAF_tumour  DP
#> 1  14-33798_normal      0.136  44
#> 2          FL3006N      0.308  26
#> 3        14-11247N      0.280  25
#> 4         PD26403b      0.400 105
#> 5         PD26403b      0.407 113
#> 6  04-14093_normal      0.442  43
nrow(incorrect_myc_locus_sv)
#> [1] 28

# Despite potentially being incomplete, we can nonetheless
# annotate these directly for more details
annotated_myc_hg38 = suppressMessages(
         GAMBLR.utils::annotate_sv(hg38_myc_locus_sv, genome_build = "hg38")
)
head(annotated_myc_hg38)
#>   chrom1    start1      end1 chrom2    start2      end2 name score strand1
#> 1      2  28983233  28983240      8 127711264 127711271    .    43       -
#> 2      4   1746419   1746421      8 127723483 127723485    .    77       -
#> 3      8 127226860 127226862      8 127759782 127759784    .    56       +
#> 4      8 127226860 127226860      8 127759821 127759821    .    51       -
#> 5      8 127301019 127301020      8 127742838 127742839    .    71       -
#> 6      8 127301020 127301022      8 127742838 127742840    .    65       +
#>   strand2 tumour_sample_id  gene partner   fusion
#> 1       -  02-14764_tumorB   ALK    <NA>   NA-ALK
#> 2       -        09-41114T WHSC1    <NA> NA-WHSC1
#> 3       +          SP13307   MYC    <NA>   NA-MYC
#> 4       -          SP13307   MYC    <NA>   NA-MYC
#> 5       -      365-16-01TD   MYC    <NA>   NA-MYC
#> 6       +      365-16-01TD   MYC    <NA>   NA-MYC
table(annotated_myc_hg38$partner)
#> 
#>  BCL6 CCNL1   DMD   IGH   IGK   IGL  LRMP  PAX5 RFTN1 
#>     3     1     2   293     5     6     1     5     1 
# The usual MYC partners are seen here

annotated_myc_incorrect = suppressMessages(
         GAMBLR.utils::annotate_sv(incorrect_myc_locus_sv, genome_build = "hg38")
)
head(annotated_myc_incorrect)
#>   chrom1    start1      end1 chrom2    start2      end2 name score strand1
#> 1      8 128726344 128727379     11  93629113  93629647    .    66       -
#> 2      8 128726820 128726820      8 128726825 128726825    .    76       +
#> 3      8 128726820 128726820      8 128726825 128726825    .    84       +
#> 4      8 128738979 128738983      8 128752584 128752588    .   118       +
#> 5      8 128738979 128738983      8 128752584 128752588    .   127       +
#> 6      8 128738981 128738981      8 128752584 128752584    .   126       +
#>   strand2 tumour_sample_id gene partner fusion
#> 1       +        01-20774T  MYC    <NA> NA-MYC
#> 2       -         PD26403a  MYC    <NA> NA-MYC
#> 3       -         PD26403c  MYC    <NA> NA-MYC
#> 4       -  04-14093_tumorA  MYC    <NA> NA-MYC
#> 5       -  04-14093_tumorB  MYC    <NA> NA-MYC
#> 6       -        05-24065T  MYC    <NA> NA-MYC
table(annotated_myc_incorrect$partner)
#> < table of extent 0 >
# The effect of specifying coordinates from the wrong genome build is evident
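A cheap guard against this kind of mix-up is to compare the chromosome prefix of the region string with the requested projection before querying. The sketch below assumes, based on the example output above, that hg38 regions are written with a "chr" prefix and grch37 regions without one. region_matches_projection is a hypothetical helper, not part of GAMBLR, and it cannot detect wrong-build coordinates when the prefix happens to match.

```r
# Hypothetical guard (not part of GAMBLR).
# Assumption: hg38 regions use a "chr" prefix, grch37 regions do not.
region_matches_projection <- function(region, projection) {
  has_chr <- startsWith(region, "chr")
  if (projection == "hg38") has_chr else !has_chr
}

region_matches_projection("chr8:127710883-127761821", "hg38")
region_matches_projection("8:128723128-128774067", "hg38")
```

The first call corresponds to the correct hg38 query and the second to the mismatched one from the examples.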