Predict DLBCL genetic subgroup for one or more samples using a pre-trained DLBCLone model

Projects new samples into a frozen UMAP space from a pre-trained DLBCLone model and assigns subtype labels with a weighted k-NN classifier. The function aligns features between the input matrix and the training model, optionally constructs/weights meta-features (*_feats) based on the model's core_features, and prevents label leakage by excluding the test sample from its own neighbor set.

Usage

DLBCLone_predict(
  mutation_status,
  optimized_model,
  fill_missing = FALSE,
  drop_extra = FALSE,
  check_frequencies = FALSE,
  dry_run = FALSE,
  seed = 12345,
  verbose = FALSE
)

Arguments

mutation_status: Data frame or matrix with one row per sample and one column per feature (typically binary or count mutation indicators). Row names must be unique sample IDs. Columns should correspond to the features the model was trained on. Extra/missing features are handled by drop_extra and fill_missing.
optimized_model: A pre-trained DLBCLone model object as returned by DLBCLone_load_optimized() (or saved via DLBCLone_save_optimized()). Must contain the fields $model (uwot UMAP), $features (training feature matrix), $df (training metadata, including sample_id and lymphgen), $best_params (e.g., na_option, threshold, use_weights), and $k_DLBCLone_w. Optional fields include $core_features, $core_feature_multiplier, $purity_DLBCLone_w, $score_thresh_DLBCLone_w, and $truth_classes.
fill_missing: Logical. If TRUE, any features that exist in the model but are absent in mutation_status are added and set to zero for all samples (with a warning). If FALSE, missing model features cause an error. Defaults to FALSE.
drop_extra: Logical. If TRUE, any features present in mutation_status but not seen during model training are dropped (with a message). If FALSE, extra features cause an error. Defaults to FALSE.
check_frequencies: Logical. If TRUE, the function does not run prediction. Instead, it computes and returns a per-feature frequency comparison between test and training cohorts (binary presence/absence), along with a ggplot highlighting large deviations. Defaults to FALSE.
dry_run: Logical. If TRUE, return the preprocessed/weighted test feature matrix (after column alignment and any core-feature weighting/meta-feature construction) without running projection or classification. Defaults to FALSE.
seed: Integer seed passed to embedding/projection steps for reproducibility. Defaults to 12345.
annotate_accuracy: Logical. Currently unused (reserved for future output annotations). Defaults to FALSE.

Value

A list with elements (fields may vary depending on options/model):

prediction: data frame of per-sample predictions joined to test UMAP coordinates (V1, V2) and any vote metrics.
projection: data frame of projected test coordinates.
umap_input: the model's training feature matrix (optimized_model$features).
model: the frozen UMAP model used for projection (optimized_model$model).
features_df: the processed test feature matrix used for projection/classification (rows = samples).
df: alias of prediction (kept for compatibility).
metadata: the model's training metadata (optimized_model$df).
type: string identifier "DLBCLone_predict".
unprocessed_votes: predictions before post-processing.
(optional) core_features, core_feature_multiplier, mutation_status: included when core-feature weighting was applied.

Details

Feature alignment. Columns in mutation_status must match the model's training features. Set drop_extra=TRUE to drop unexpected columns, and fill_missing=TRUE to add missing model columns (filled with zero). If, after adjustment, columns do not exactly match (including order), the function stops with an error.
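The alignment rules above can be illustrated with a minimal base-R sketch. The helper align_features() is hypothetical and is not part of DLBCLone; in particular, it reorders columns to the training order before the final check, which is an assumption about behavior the text does not spell out.

```r
# Hypothetical helper illustrating the alignment rules; not the package's code.
align_features <- function(x, model_features,
                           fill_missing = FALSE, drop_extra = FALSE) {
  x <- as.matrix(x)
  extra   <- setdiff(colnames(x), model_features)
  missing <- setdiff(model_features, colnames(x))

  if (length(extra) > 0) {
    if (!drop_extra) stop("Unexpected features: ", paste(extra, collapse = ", "))
    message("Dropping ", length(extra), " feature(s) not in the model")
    x <- x[, setdiff(colnames(x), extra), drop = FALSE]
  }
  if (length(missing) > 0) {
    if (!fill_missing) stop("Missing model features: ", paste(missing, collapse = ", "))
    warning("Adding ", length(missing), " missing feature(s) as zeros")
    zeros <- matrix(0, nrow = nrow(x), ncol = length(missing),
                    dimnames = list(rownames(x), missing))
    x <- cbind(x, zeros)
  }
  # Assumption: reorder to the training column order, then require an exact match.
  x <- x[, model_features, drop = FALSE]
  stopifnot(identical(colnames(x), model_features))
  x
}

# Toy input: two samples, one feature unknown to the model, one model feature absent.
toy <- matrix(c(1, 0, 0, 1, 1, 1), nrow = 2,
              dimnames = list(c("S1", "S2"), c("MYD88", "EZH2", "NOVEL")))
aligned <- suppressWarnings(
  align_features(toy, c("EZH2", "MYD88", "CD79B"),
                 fill_missing = TRUE, drop_extra = TRUE)
)
```

With both flags left at FALSE, the same toy input stops with an error, mirroring the defaults of DLBCLone_predict().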

Core features & meta-features. If optimized_model$core_features is:

a list of feature groups, the function creates meta-features (columns suffixed with "_feats") as the (weighted) mean of each group's members. When core_feature_multiplier is a list, group-specific numeric weights are applied.
a character vector, core features may be multiplicatively up-weighted by core_feature_multiplier to match training scale if they do not already appear weighted in the input.
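As a rough illustration of the list case, the sketch below builds "_feats" columns from feature groups. The helper make_meta_features(), the group names, and the weights are all hypothetical, and reading "(weighted) mean" as the group mean multiplied by a per-group weight is an assumption rather than the package's confirmed formula.

```r
# Hypothetical sketch of "_feats" construction; not the package's implementation.
make_meta_features <- function(x, core_features, multipliers = NULL) {
  x <- as.matrix(x)
  for (group in names(core_features)) {
    # Only use group members that are actually present in the input.
    members <- intersect(core_features[[group]], colnames(x))
    if (length(members) == 0) next
    # Assumption: a missing group weight defaults to 1 (plain mean).
    w <- if (is.null(multipliers[[group]])) 1 else multipliers[[group]]
    meta <- w * rowMeans(x[, members, drop = FALSE])
    x <- cbind(x, meta)
    colnames(x)[ncol(x)] <- paste0(group, "_feats")
  }
  x
}

# Toy data: two samples, three features, two invented groups.
x <- matrix(c(1, 0, 1, 1, 0, 0), nrow = 2,
            dimnames = list(c("S1", "S2"), c("MYD88", "CD79B", "EZH2")))
core <- list(MCD = c("MYD88", "CD79B"), EZB = "EZH2")
out <- make_meta_features(x, core, multipliers = list(MCD = 2))
```

Here S1 carries both MCD members and S2 only one, so the MCD_feats column separates them even though each sample has a single row.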

Projection & classification. Both training and test samples are projected into the same frozen UMAP space (no retraining) via make_and_annotate_umap(..., umap_out=optimized_model, ret_model=FALSE). For each test sample, neighbors are drawn from the projected training set with that test sample removed, which prevents self-label leakage. Labels are assigned by weighted_knn_predict_with_conf() using k=optimized_model$k_DLBCLone_w, the confidence threshold best_params$threshold, optional distance weighting (best_params$use_weights), and max_neighbors. The other_class label is currently fixed to "Other".
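The voting idea can be sketched in base R as follows. knn_vote() is a hypothetical stand-in, not weighted_knn_predict_with_conf(): it assumes inverse-distance weights, defines confidence as the winning class's share of the weighted votes, and ignores max_neighbors. All of these are assumptions made for illustration.

```r
# Hypothetical stand-in for the voting step; not the package's classifier.
knn_vote <- function(train_xy, train_labels, test_xy, test_id = NULL,
                     k = 5, threshold = 0.5, use_weights = TRUE,
                     other_class = "Other") {
  stopifnot(!is.null(rownames(train_xy)))
  # Remove the test sample from the reference set to avoid self-label leakage.
  keep <- !(rownames(train_xy) %in% test_id)
  ref  <- train_xy[keep, , drop = FALSE]
  labs <- train_labels[keep]

  # Euclidean distance from the test point to every remaining training point.
  d  <- sqrt(rowSums(sweep(ref, 2, test_xy)^2))
  nn <- order(d)[seq_len(min(k, length(d)))]
  # Assumption: inverse-distance weights; unit weights when use_weights is FALSE.
  w  <- if (use_weights) 1 / (d[nn] + 1e-8) else rep(1, length(nn))

  votes <- tapply(w, labs[nn], sum)
  conf  <- max(votes) / sum(votes)
  label <- if (conf >= threshold) names(votes)[which.max(votes)] else other_class
  list(label = label, confidence = conf)
}

# Toy embedding: three MCD points near the origin, two EZB points far away.
train_xy <- rbind(A1 = c(0, 0), A2 = c(0.1, 0), A3 = c(0, 0.1),
                  B1 = c(5, 5), B2 = c(5.1, 5))
train_labels <- c("MCD", "MCD", "MCD", "EZB", "EZB")
res <- knn_vote(train_xy, train_labels, test_xy = c(0, 0), test_id = "A1", k = 3)
strict <- knn_vote(train_xy, train_labels, test_xy = c(0, 0), test_id = "A1",
                   k = 3, threshold = 0.999)
```

In this toy case A1 is excluded from its own neighbor set, so the MCD call rests only on A2 and A3; raising the threshold to 0.999 sends the same sample to "Other".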

Post-processing (optional). If the model contains $purity_DLBCLone_w and/or $score_thresh_DLBCLone_w, the raw vote outputs are passed to process_votes() to derive DLBCLone_w and DLBCLone_wo decisions using score and ratio criteria. When available, $truth_classes are used to order/interpret group labels.

Early-return modes.

check_frequencies=TRUE: returns a list with a ggplot object and a frequency deviation data frame; no projection/prediction.
dry_run=TRUE: returns the processed test feature matrix (after alignment/weighting); no projection/prediction.
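To show the kind of comparison check_frequencies=TRUE refers to, here is a small base-R sketch that binarizes both cohorts and ranks features by absolute frequency difference. compare_frequencies() and its column names are hypothetical; the actual return structure and plot of DLBCLone_predict() may differ.

```r
# Hypothetical sketch of a train-vs-test frequency comparison (base R only).
compare_frequencies <- function(test_x, train_x) {
  shared <- intersect(colnames(test_x), colnames(train_x))
  # Binary presence/absence: any non-zero value counts as mutated.
  test_freq  <- colMeans(test_x[, shared, drop = FALSE] > 0)
  train_freq <- colMeans(train_x[, shared, drop = FALSE] > 0)
  out <- data.frame(feature = shared,
                    test_freq = test_freq,
                    train_freq = train_freq,
                    difference = test_freq - train_freq,
                    row.names = NULL)
  # Largest deviations first.
  out[order(-abs(out$difference)), ]
}

# Toy cohorts: MYD88 is three times as frequent in the test set as in training.
train <- matrix(c(1, 0, 0, 0, 1, 1, 0, 0), nrow = 4,
                dimnames = list(NULL, c("MYD88", "EZH2")))
test  <- matrix(c(1, 1, 1, 0, 0, 1, 0, 1), nrow = 4,
                dimnames = list(NULL, c("MYD88", "EZH2")))
freq <- compare_frequencies(test, train)
```

Features at the top of such a table are the ones worth checking for panel coverage or variant-calling differences before trusting predictions.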

Errors and messages

The function stops if feature columns cannot be reconciled (see drop_extra, fill_missing), or if, after alignment, the column order/identity does not exactly match the training features. Messages are emitted when dropping/adding features or when core-feature weighting is inferred/applied or skipped.

Examples

if (FALSE) { # \dontrun{
# Load a pre-trained model
model <- DLBCLone_load_optimized(path = "models", name_prefix = "DLBCLone_LySeqST")

# Inspect distribution shifts without running classification
freq_chk <- DLBCLone_predict(
  mutation_status = my_panelX_matrix,
  optimized_model = model,
  check_frequencies = TRUE
)
print(freq_chk$plot)
head(freq_chk$data)

# Typical usage: align features, then project and classify
preds <- DLBCLone_predict(
  mutation_status = feat_status_LySeqST,  # rows = samples, cols = features
  optimized_model = model,
  drop_extra = TRUE,    # drop unexpected columns
  fill_missing = TRUE   # add any missing model columns as zeros
)
head(preds$prediction)

# Dry run to see the meta-features after alignment/weighting
processed <- DLBCLone_predict(
  mutation_status = my_panelX_matrix,
  optimized_model = model,
  dry_run = TRUE
)
colnames(processed)
} # }