Predict DLBCL genetic subgroup for one or more samples using a pre-trained DLBCLone model
Projects new samples into a frozen UMAP space from a pre-trained
DLBCLone model and assigns subtype labels with a weighted k-NN
classifier. The function aligns features between the input matrix and
the training model, optionally constructs/weights meta-features
(*_feats) based on the model's core_features, and prevents
label leakage by excluding the test sample from its own neighbor set.
Usage
DLBCLone_predict(
  mutation_status,
  optimized_model,
  fill_missing = FALSE,
  drop_extra = FALSE,
  check_frequencies = FALSE,
  dry_run = FALSE,
  seed = 12345,
  verbose = FALSE
)
Arguments
- mutation_status
Data frame or matrix with one row per sample and one column per feature (typically binary or count mutation indicators). Row names must be unique sample IDs. Columns should correspond to the features the model was trained on. Extra or missing features are handled by drop_extra and fill_missing.
- optimized_model
A pre-trained DLBCLone model object as returned by DLBCLone_load_optimized() (or saved via DLBCLone_save_optimized()). Must contain fields such as $model (uwot UMAP), $features (training feature matrix), $df (training metadata including sample_id and lymphgen), $best_params (e.g., na_option, threshold, use_weights), and $k_DLBCLone_w. Optional fields include $core_features, $core_feature_multiplier, $purity_DLBCLone_w, $score_thresh_DLBCLone_w, and $truth_classes.
- fill_missing
Logical. If TRUE, any features that exist in the model but are absent in mutation_status are added and set to zero for all samples (with a warning). If FALSE, missing model features cause an error. Defaults to FALSE.
- drop_extra
Logical. If TRUE, any features present in mutation_status but not seen during model training are dropped (with a message). If FALSE, extra features cause an error. Defaults to FALSE.
- check_frequencies
Logical. If TRUE, the function does not run prediction. Instead, it computes and returns a per-feature frequency comparison between test and training cohorts (binary presence/absence), along with a ggplot highlighting large deviations. Defaults to FALSE.
- dry_run
Logical. If TRUE, return the preprocessed/weighted test feature matrix (after column alignment and any core-feature weighting/meta-feature construction) without running projection or classification. Defaults to FALSE.
- seed
Integer seed passed to embedding/projection steps for reproducibility. Defaults to 12345.
- annotate_accuracy
Logical. Currently unused (reserved for future output annotations). Defaults to FALSE.
Value
A list with elements (fields may vary depending on options/model):
- prediction: data frame of per-sample predictions joined to test UMAP coordinates (V1, V2) and any vote metrics.
- projection: data frame of projected test coordinates.
- umap_input: the model's training feature matrix (optimized_model$features).
- model: the frozen UMAP model used for projection (optimized_model$model).
- features_df: the processed test feature matrix used for projection/classification (rows = samples).
- df: alias of prediction (kept for compatibility).
- metadata: the model's training metadata (optimized_model$df).
- type: string identifier "DLBCLone_predict".
- unprocessed_votes: predictions before post-processing.
- (optional) core_features, core_feature_multiplier, mutation_status: included when core-feature weighting was applied.
Details
Feature alignment. Columns in mutation_status must match the
model's training features. Set drop_extra=TRUE to drop unexpected
columns, and fill_missing=TRUE to add missing model columns (filled
with zero). If, after adjustment, columns do not exactly match (including
order), the function stops with an error.
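The alignment rules above can be sketched in base R. This is only an illustration under stated assumptions: align_features() is a hypothetical helper, not part of the package, and it reorders columns to the training order, which DLBCLone_predict() itself may not do.

```r
# Hypothetical helper illustrating the alignment rules; not package code.
align_features <- function(x, model_features,
                           fill_missing = FALSE, drop_extra = FALSE) {
  x <- as.matrix(x)
  missing <- setdiff(model_features, colnames(x))
  extra   <- setdiff(colnames(x), model_features)

  if (length(extra) > 0) {
    if (!drop_extra) stop("Extra features: ", paste(extra, collapse = ", "))
    message("Dropping ", length(extra), " extra feature(s)")
    x <- x[, setdiff(colnames(x), extra), drop = FALSE]
  }
  if (length(missing) > 0) {
    if (!fill_missing) stop("Missing features: ", paste(missing, collapse = ", "))
    warning("Adding ", length(missing), " missing feature(s) as zeros")
    zeros <- matrix(0, nrow = nrow(x), ncol = length(missing),
                    dimnames = list(rownames(x), missing))
    x <- cbind(x, zeros)
  }
  # Assumption: reorder to the training column order, then verify the match
  x <- x[, model_features, drop = FALSE]
  stopifnot(identical(colnames(x), model_features))
  x
}

# Toy input: one unexpected column (NOVEL) and one missing column (CD79B)
toy <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 2,
              dimnames = list(c("S1", "S2"), c("MYD88", "EZH2", "NOVEL")))
aligned <- suppressWarnings(
  align_features(toy, c("EZH2", "MYD88", "CD79B"),
                 fill_missing = TRUE, drop_extra = TRUE)
)
```

With a real model, the expected column names would presumably come from colnames(optimized_model$features), since $features holds the training feature matrix.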
Core features & meta-features. If optimized_model$core_features
is:
- a list of feature groups, the function creates meta-features (columns suffixed with "_feats") as the (weighted) mean of each group's members. When core_feature_multiplier is a list, group-specific numeric weights are applied.
- a character vector, core features may be multiplicatively up-weighted by core_feature_multiplier to match training scale if they do not already appear weighted in the input.
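For the list case, a minimal sketch of meta-feature construction might look like the following. add_meta_features() is a hypothetical helper, the group names and genes are toy values, and the reading that the group weight scales the group mean is an assumption; the package's actual weighting may differ.

```r
# Hypothetical helper: appends one "<group>_feats" column per feature group.
# Assumption: the group-specific weight multiplies the mean of the members.
add_meta_features <- function(x, core_features, multiplier = NULL) {
  for (grp in names(core_features)) {
    members <- intersect(core_features[[grp]], colnames(x))
    if (length(members) == 0) next          # no member present in the input
    w <- if (is.null(multiplier)) 1 else multiplier[[grp]]
    meta <- w * rowMeans(x[, members, drop = FALSE])
    x <- cbind(x, meta)
    colnames(x)[ncol(x)] <- paste0(grp, "_feats")
  }
  x
}

x <- matrix(c(1, 0, 1, 1, 0, 0), nrow = 2,
            dimnames = list(c("S1", "S2"), c("MYD88", "CD79B", "EZH2")))
core <- list(MCD = c("MYD88", "CD79B"), EZB = "EZH2")  # toy groups
mult <- list(MCD = 2, EZB = 1.5)                       # toy weights
out <- add_meta_features(x, core, mult)
```

Calling DLBCLone_predict() with dry_run = TRUE returns the real processed matrix, which is the reliable way to see what the model actually receives.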
Projection & classification. Both training and test samples are
projected into the same frozen UMAP space (no retraining) via
make_and_annotate_umap(..., umap_out=optimized_model, ret_model=FALSE).
For each test sample, neighbors are drawn from the projected training set
with that test sample removed to prevent self-label leakage. Labels are
assigned using weighted_knn_predict_with_conf() with
k=optimized_model$k_DLBCLone_w, confidence threshold
best_params$threshold, optional distance weighting
best_params$use_weights, and max_neighbors. The
other_class is currently fixed as "Other".
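To illustrate the idea of a distance-weighted vote with a confidence threshold, here is a small base R sketch. knn_vote() is a hypothetical stand-in, not weighted_knn_predict_with_conf(); it assumes inverse-distance weights and treats the threshold as a minimum share of the total vote, and it omits the self-exclusion and max_neighbors handling described above.

```r
# Hypothetical helper: weighted k-NN vote for a single query point.
# Returns other_class when the winning vote share is below the threshold.
knn_vote <- function(query, train_xy, train_labels, k = 5,
                     threshold = 0.5, use_weights = TRUE,
                     other_class = "Other") {
  d  <- sqrt(rowSums(sweep(train_xy, 2, query)^2))  # Euclidean distances
  nn <- order(d)[seq_len(min(k, length(d)))]        # indices of k nearest
  w  <- if (use_weights) 1 / (d[nn] + 1e-8) else rep(1, length(nn))
  votes <- tapply(w, as.character(train_labels[nn]), sum)
  share <- votes / sum(votes)
  top   <- names(share)[which.max(share)]
  list(label = if (max(share) >= threshold) top else other_class,
       confidence = max(share))
}

# Toy 2-D embedding with two well-separated groups
train_xy <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(5, 5), c(5.1, 5))
labels   <- c("EZB", "EZB", "EZB", "MCD", "MCD")

near  <- knn_vote(c(0.05, 0.05), train_xy, labels, k = 3)
mixed <- knn_vote(c(2.5, 2.5), train_xy, labels, k = 5,
                  threshold = 0.9, use_weights = FALSE)
```

In this toy case the first query sits inside the EZB cluster and receives that label, while the second falls between clusters, gets a 3-to-2 unweighted vote, and is therefore assigned "Other" under the strict threshold.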
Post-processing (optional). If the model contains
$purity_DLBCLone_w and/or $score_thresh_DLBCLone_w, the raw
vote outputs are passed to process_votes() to derive
DLBCLone_w and DLBCLone_wo decisions using score and ratio
criteria. When available, $truth_classes are used to order/interpret
group labels.
Early-return modes.
- check_frequencies=TRUE: returns a list with a ggplot object and a frequency deviation data frame; no projection/prediction.
- dry_run=TRUE: returns the processed test feature matrix (after alignment/weighting); no projection/prediction.
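The kind of comparison that check_frequencies performs can be approximated as follows. compare_frequencies() is a hypothetical helper with assumed column names; the data frame returned by the package may be structured differently, and no plot is produced here.

```r
# Hypothetical helper: per-feature frequency in test vs. training cohorts,
# using binary presence/absence (any non-zero value counts as present).
compare_frequencies <- function(test, train) {
  shared <- intersect(colnames(test), colnames(train))
  out <- data.frame(
    feature    = shared,
    test_freq  = colMeans(test[, shared, drop = FALSE] > 0),
    train_freq = colMeans(train[, shared, drop = FALSE] > 0),
    row.names  = NULL
  )
  out$diff <- out$test_freq - out$train_freq
  out[order(-abs(out$diff)), ]   # largest deviations first
}

train <- matrix(c(1, 1, 0, 0,  0, 0, 0, 1), nrow = 4,
                dimnames = list(NULL, c("EZH2", "MYD88")))
test  <- matrix(c(1, 1,  1, 0), nrow = 2,
                dimnames = list(NULL, c("EZH2", "MYD88")))
freq <- compare_frequencies(test, train)
```

Large deviations for a feature can indicate a panel or variant-calling difference between cohorts, which is worth resolving before trusting the predictions.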
Errors and messages
The function stops if feature columns cannot be reconciled (see
drop_extra, fill_missing), or if, after alignment, the column
order/identity does not exactly match the training features. Messages are
emitted when dropping/adding features or when core-feature weighting is
inferred/applied or skipped.
Examples
if (FALSE) { # \dontrun{
  # Typical usage: load a saved model
  model <- DLBCLone_load_optimized(path = "models", name_prefix = "DLBCLone_LySeqST")

  # Inspect distribution shifts without running classification
  freq_chk <- DLBCLone_predict(
    mutation_status = my_panelX_matrix,
    optimized_model = model,
    check_frequencies = TRUE
  )
  print(freq_chk$plot)
  head(freq_chk$data)

  # Run prediction, reconciling feature columns with the model
  preds <- DLBCLone_predict(
    mutation_status = feat_status_LySeqST, # rows = samples, cols = features
    optimized_model = model,
    drop_extra = TRUE,                     # drop unexpected columns
    fill_missing = TRUE                    # add any missing model columns as zeros
  )
  head(preds$prediction)

  # Dry run to see the meta-features after alignment/weighting
  processed <- DLBCLone_predict(
    mutation_status = my_panelX_matrix,
    optimized_model = model,
    dry_run = TRUE
  )
  colnames(processed)
} # }