Frequently Asked Questions

This section will cover most of the questions you may have about GAMBLR.data. If there is something that is not covered, please feel free to reach out to us via GitHub by reporting an issue and we will be happy to add it to this page.

What exactly is meant by “lymphoma genes”?

The term lymphoma genes refers to the curated list of genes significantly mutated in mature B-cell lymphomas. The list is available within this package by calling lymphoma_genes. Several variants of the curated list are available, including lists split by pathology and a more comprehensive list, lymphoma_genes_comprehensive, which provides additional information such as which studies reported each gene as significantly mutated and whether it is a target of aSHM.

Someone said “aSHM regions”. What are these?

GAMBLR.data comes with a list of regions affected by aberrant somatic hypermutation (aSHM). The regions were determined by a variety of approaches, but mostly by manually reviewing patterns of thousands of variants across multiple lymphomas. They can be obtained by calling {projection}_ashm_regions. The aSHM regions are version-controlled so that previous iterations of the list remain available, but {projection}_ashm_regions always refers to the latest and most complete version. Previous iterations can be accessed as somatic_hypermutation_locations_GRCh3{projection_version}_v{version}. For more details, please refer to the bundled data resources.
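The two naming patterns above can be sketched as plain string templates. The snippet below is only an illustration in Python and is not part of GAMBLR.data: the helper names are hypothetical, the mapping of hg38 to the GRCh38 label is an assumption, and the version string "0.2" is a placeholder rather than a claim about which versions exist.

```python
# Hypothetical helpers for illustration only; they are not part of GAMBLR.data.

# Assumption: grch37 maps to the GRCh37 label and hg38 to the GRCh38 label.
PROJECTION_DIGIT = {"grch37": "7", "hg38": "8"}


def latest_ashm_name(projection: str) -> str:
    """Name of the object that refers to the latest aSHM regions."""
    if projection not in PROJECTION_DIGIT:
        raise ValueError(f"unknown projection: {projection!r}")
    return f"{projection}_ashm_regions"


def versioned_ashm_name(projection: str, version: str) -> str:
    """Name of one specific iteration of the aSHM regions."""
    if projection not in PROJECTION_DIGIT:
        raise ValueError(f"unknown projection: {projection!r}")
    digit = PROJECTION_DIGIT[projection]
    return f"somatic_hypermutation_locations_GRCh3{digit}_v{version}"


print(latest_ashm_name("grch37"))          # grch37_ashm_regions
print(versioned_ashm_name("hg38", "0.2"))  # "0.2" is a placeholder version
```

In R, the resulting name is simply typed as the object name once the package is loaded.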

What kind of somatic variants are bundled?

GAMBLR.data provides simple somatic mutations, copy number variants, and structural variants. Data taken directly from the Supplemental Tables of supported studies is provided in the projection reported in the original publication. For the remaining samples, variants are available in both grch37 and hg38.

What are the studies bundled with the package?

The data includes variants from the following studies:

                  
                    BL BLL COMFL DLBCL  FL HGBL MZL NS1
  BL_Thomas        234   0     0     0   0    0   0   0
  DLBCL_Arthur       0   0     1   152   0    0   0   0
  DLBCL_cell_lines   0   0     0     5   0    0   0   0
  dlbcl_chapuy       0   0     0   233   0    0   0   0
  DLBCL_Hilton       0   1     9   137   6    5   1   1
  dlbcl_reddy        0   0     0   999   0    0   0   0
  dlbcl_schmitz      0   0     0   470   0    0   0   0
  DLBCL_Thomas       0   0     0    43   0    0   0   0
  FL_Dreval          0   0    21   209 213    0   0   0

Please see the use cases for more information on each of these studies.

Which variant caller is used to call the variants?

The simple somatic mutations are called using the SLMS-3 variant caller. Briefly, this is a voting approach that combines strelka2, mutect2, LoFreq, and SAGE. For more details about SLMS-3, please refer to the source code and the manuscript by Thomas et al. Copy number variants for WGS samples are called using battenberg or sequenza when the samples are matched, and controlfreec when they are unmatched. For samples subjected to whole exome sequencing, copy number variants are called with battenberg for matched samples and with a combination of cnvkit/pureCN for samples without a matched normal. The structural variants are called using manta.
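To make the idea of a voting approach and the choice of copy number caller more concrete, here is a minimal sketch in Python. It is not the SLMS-3 implementation: the function names are hypothetical, the threshold of three agreeing callers is an assumption (please check the SLMS-3 source code for the actual rule), and the variants are made-up placeholders.

```python
from collections import Counter

# The four callers named above.
SLMS3_CALLERS = ("strelka2", "mutect2", "lofreq", "sage")


def vote_on_variants(calls_by_caller, min_votes=3):
    """Keep variants reported by at least `min_votes` callers.

    Assumption: min_votes=3 is illustrative, not confirmed by this page.
    """
    votes = Counter()
    for caller in SLMS3_CALLERS:
        # set() ensures each caller contributes at most one vote per variant.
        votes.update(set(calls_by_caller.get(caller, ())))
    return {variant for variant, n in votes.items() if n >= min_votes}


# Copy number callers by (sequencing type, matched normal available),
# following the description above.
CNV_CALLERS = {
    ("genome", True): ("battenberg", "sequenza"),
    ("genome", False): ("controlfreec",),
    ("exome", True): ("battenberg",),
    ("exome", False): ("cnvkit", "purecn"),
}


def cnv_callers_for(seq_type, has_matched_normal):
    """Look up which copy number caller(s) apply to a sample."""
    return CNV_CALLERS[(seq_type, has_matched_normal)]


# Made-up variants as (chromosome, position, ref, alt) tuples.
v1 = ("chr1", 100, "A", "T")
v2 = ("chr2", 200, "G", "C")
example_calls = {
    "strelka2": [v1, v2],
    "mutect2": [v1],
    "lofreq": [v1, v2],
    "sage": [],
}
kept = vote_on_variants(example_calls)  # v1 has 3 votes, v2 has only 2
print(kept)
print(cnv_callers_for("exome", False))
```

Keeping the caller choice in a lookup table makes the four combinations of sequencing type and matched status easy to read at a glance.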

Are the variant calls reported across the whole genome?

The data for BL samples from the manuscript by Thomas et al. and FL/DLBCL samples from the manuscript by Dreval et al. are pulled directly from the Supplemental Tables and represent only coding variants. The rest of the simple somatic variants are restricted to coding variants in lymphoma genes.
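The restriction to coding variants in lymphoma genes can be pictured as a simple filter. The Python sketch below is an illustration, not the package's code: it assumes MAF-style column names (Hugo_Symbol, Variant_Classification), an assumed set of classifications treated as coding, and a toy gene list standing in for lymphoma_genes.

```python
# Assumption: these MAF-style classifications are treated as "coding" here;
# the exact set used for the bundled data may differ.
CODING_CLASSES = frozenset({
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Splice_Site", "Translation_Start_Site", "Silent",
})


def restrict_to_coding_lymphoma(variants, lymphoma_genes):
    """Keep only coding variants that fall in the given genes."""
    genes = set(lymphoma_genes)
    return [
        v for v in variants
        if v["Hugo_Symbol"] in genes
        and v["Variant_Classification"] in CODING_CLASSES
    ]


# Toy stand-in for the curated gene list, plus made-up variant records.
toy_genes = ["MYC", "EZH2"]
toy_variants = [
    {"Hugo_Symbol": "MYC", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "MYC", "Variant_Classification": "Intron"},
    {"Hugo_Symbol": "TTN", "Variant_Classification": "Missense_Mutation"},
]

filtered = restrict_to_coding_lymphoma(toy_variants, toy_genes)
print(filtered)  # only the MYC missense record remains
```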
