DLBCLone Models • GAMBLR.predict

Preamble

For future use and to facilitate reproducibility, you can save the outputs of DLBCLone_optimize_params (i.e. a DLBCLone model) and restore them in a subsequent session.

DLBCLone_save_optimized

To store your model, specify the directory and a name prefix that will be incorporated into the file names. A saved model has two separate components: a .rds file and a .uwot file.

DLBCLone_save_optimized( # saving DLBCLone_optimize_params
  DLBCLone_model = best_opt_model,
  base_path="../data/saved_tutorial_model",
  name_prefix="best_opt_model"
)
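After saving, it can be useful to confirm that both components were written. The following is a minimal base R sketch, not part of GAMBLR.predict: `has_model_files` is a hypothetical helper, and the placeholder file names in the demonstration are made up. Only the .rds and .uwot extensions, and the fact that the prefix appears somewhere in the file names, are taken from the description above.

```r
# Hypothetical helper (not part of GAMBLR.predict): report whether a directory
# contains a .rds file and a .uwot file whose names include the given prefix.
has_model_files <- function(dir, name_prefix) {
  files <- list.files(dir)
  # Keep only files whose names contain the prefix (literal match, not a regex)
  files <- files[grepl(name_prefix, files, fixed = TRUE)]
  c(
    rds  = any(grepl("\\.rds$", files)),
    uwot = any(grepl("\\.uwot$", files))
  )
}

# Demonstration with empty placeholder files in a temporary directory.
# These file names are invented for illustration only.
demo_dir <- file.path(tempdir(), "saved_model_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("demo_prefix.rds", "demo_prefix.uwot"))))

model_file_check <- has_model_files(demo_dir, "demo_prefix")
print(model_file_check)
```

For a real model, the same helper could be pointed at the directory and prefix passed to DLBCLone_save_optimized.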

DLBCLone_load_optimized

To load a stored model, provide the directory and name prefix that were used when saving it.

loaded_model <- DLBCLone_load_optimized( # loading DLBCLone_optimize_params
   path="../data/saved_tutorial_model",
   name_prefix="best_opt_model"
)

# Name prefix of the model to load
default_model <- "best_opt_w"
# Load a stored optimized model from the extdata/models directory
# that ships with the GAMBLR.predict package
all_features_optimized <- DLBCLone_load_optimized(
  dirname(
    system.file(
      "extdata/models",
      "best_opt_model.rds",
      package = "GAMBLR.predict"
    )
  ),
  default_model
)

Key components of a DLBCLone model

DLBCLone models are simply named lists that keep track of all the necessary objects required for downstream functions such as DLBCLone_predict. Key components you should be aware of:

features: the feature matrix for the training data, which is used to create the UMAP model and to determine the coordinates of training samples in UMAP space. It contains all the original mutation features plus (where applicable) any meta-features. The latter are recognizable by their names, which all end in _feats.
model: the actual UMAP model, generated by the uwot package.
predictions: the original predicted labels for every training sample, as determined during DLBCLone_optimize_params.

Let’s take a peek at what’s in the features for this model. As you will see, this model used 5 meta-features, one for each of ST2, N1, EZB, MCD and BN2.

colnames(all_features_optimized$features)

 [1] "ACTB"         "ACTG1"        "BCL10"        "BCL2"         "BCL2L1"
 [6] "BCL2_SV"      "BCL6"         "BCL6_SV"      "BIRC3"        "BRAF"
[11] "BTG1"         "BTG2"         "BTK"          "CD19"         "CD70"
[16] "CD79B"        "CD83"         "CDKN2A"       "CREBBP"       "DDX3X"
[21] "DTX1"         "DUSP2"        "EDRF1"        "EIF4A2"       "EP300"
[26] "ETS1"         "ETV6"         "EZH2"         "FAS"          "FCGR2B"
[31] "FOXC1"        "FOXO1"        "GNA13"        "GRHPR"        "HLA-A"
[36] "HLA-B"        "HNRNPD"       "IRF4"         "IRF8"         "ITPKB"
[41] "JUNB"         "KLF2"         "KLHL14"       "KLHL6"        "KMT2D"
[46] "MEF2B"        "MPEG1"        "MYD88"        "MYD88HOTSPOT" "NFKBIA"
[51] "NFKBIE"       "NFKBIZ"       "NOL9"         "NOTCH1"       "NOTCH2"
[56] "OSBPL10"      "PIM1"         "PIM2"         "PRDM1"        "PRDM15"
[61] "PRKDC"        "PRRC2C"       "PTPN1"        "RFTN1"        "S1PR2"
[66] "SETD1B"       "SGK1"         "SOCS1"        "SPEN"         "STAT3"
[71] "STAT6"        "TBCC"         "TBL1XR1"      "TET2"         "TMEM30A"
[76] "TMSB4X"       "TNFAIP3"      "TNFRSF14"     "TOX"          "TP53"
[81] "TP73"         "UBE2A"        "WEE1"         "XBP1"         "ZFP36L1"
[86] "st2_feats"    "n1_feats"     "ezb_feats"    "mcd_feats"    "bn2_feats"
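Because meta-features share the _feats suffix, they can also be separated from the gene-level features programmatically. The sketch below uses a small toy matrix with made-up values and only base R; `toy_features` and the sample names are hypothetical stand-ins for a real feature matrix.

```r
# Toy feature matrix (made-up values) that follows the same naming convention:
# gene-level columns plus meta-feature columns ending in "_feats".
toy_features <- matrix(
  c(0, 2, 0,
    2, 0, 1,
    0, 0, 2,
    1, 0, 0),
  nrow = 3,
  dimnames = list(
    c("sample_A", "sample_B", "sample_C"),
    c("BCL2", "EZH2", "ezb_feats", "mcd_feats")
  )
)

# Flag columns whose names end in "_feats"
is_meta <- grepl("_feats$", colnames(toy_features))

meta_features <- colnames(toy_features)[is_meta]
gene_features <- colnames(toy_features)[!is_meta]

print(meta_features)
print(gene_features)
```

Applied to `all_features_optimized$features`, the same pattern would select the five meta-feature columns at the end of the listing above.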

head(rownames(all_features_optimized$features))

[1] "00-14595_tumorC"   "00-15201_tumorA"   "00-15201_tumorB"
[4] "00-17960_CLC01670" "FL1015T2"          "00-22011_tumorB"

all_features_optimized$features[c(1:10),c(1:10)]

                  ACTB ACTG1 BCL10 BCL2 BCL2L1 BCL2_SV BCL6 BCL6_SV BIRC3 BRAF
00-14595_tumorC      0     0     0    2      0       2    2       2     0    0
00-15201_tumorA      0     0     0    0      0       0    2       0     0    0
00-15201_tumorB      0     0     0    0      0       0    0       0     0    0
00-17960_CLC01670    2     0     0    1      0       2    0       2     0    0
FL1015T2             0     0     0    0      0       2    0       2     0    0
00-22011_tumorB      0     0     0    0      0       0    0       0     0    0
00-23442_tumorB      0     0     0    0      0       2    0       0     0    0
00-26427_tumorA      0     0     0    0      0       2    0       0     0    0
00-26427_tumorC      0     0     0    0      0       2    0       0     0    0
FL1002T2             0     0     0    2      0       2    0       0     0    0

DLBCLone_predict is a versatile function: you can use it to cross-reference your training data, predict one test sample at a time, or predict all of your test samples at once. Here we will just be re-analyzing training samples. There are only two required arguments:

mutation_status: a data frame with sample IDs as row names and feature names as columns. This is the feature matrix you want to predict classes for. You can subset it to contain only the rows for the samples you want analyzed, but this isn’t necessary.
optimized_model: the DLBCLone model you want to use, for example the output of DLBCLone_optimize_params or a model you’ve re-loaded from disk via DLBCLone_load_optimized.
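As an illustration of the expected input shape, the sketch below builds a toy data frame with sample IDs as row names and subsets it to the samples of interest. All values, sample names, and the `my_mutation_status` placeholder are hypothetical; the commented-out call uses only the two argument names described above and is not run here, since it requires GAMBLR.predict and a full feature matrix.

```r
# Toy mutation_status data frame (made-up values):
# sample IDs as row names, feature names as columns.
toy_status <- data.frame(
  BCL2         = c(0, 2, 0),
  EZH2         = c(2, 0, 0),
  MYD88HOTSPOT = c(0, 0, 2),
  row.names    = c("sample_A", "sample_B", "sample_C")
)

# Optionally restrict the rows to the samples you want analyzed.
# drop = FALSE keeps the result a data frame even if one row remains.
samples_of_interest <- c("sample_A", "sample_C")
toy_subset <- toy_status[rownames(toy_status) %in% samples_of_interest, , drop = FALSE]
print(toy_subset)

# With GAMBLR.predict loaded, a real model, and a real feature matrix
# (here the hypothetical placeholder my_mutation_status), the call would
# take this shape:
# predicted <- DLBCLone_predict(
#   mutation_status = my_mutation_status,
#   optimized_model = all_features_optimized
# )
```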

Activating a model

If you plan on classifying more than one sample in a session, an activated model will run faster. Activation pre-computes the coordinates of all training samples using the model and stores them for re-use.

# Reactivate the stored model, setting the UMAP projection
all_features_optimized <- DLBCLone_activate(
  all_features_optimized,
  force = TRUE
)