Run DLBCLone KNN Classification — DLBCLone

Runs a weighted k-nearest-neighbor (KNN) classifier on a feature (mutation) matrix, with optional upweighting of user-specified "core" features, optional exclusion of "hidden" features, and optional optimization for an explicit outgroup (e.g. "Other"). WARNING: This function is not one of the core DLBCLone functions. In most cases you should use DLBCLone_predict instead.

Usage

DLBCLone_KNN(
  features_df,
  metadata,
  core_features = NULL,
  core_feature_multiplier = 1.5,
  hidden_features = NULL,
  min_k = 5,
  max_k = 60,
  truth_column = "lymphgen",
  truth_classes = c("EZB", "BN2", "ST2", "MCD", "N1", "Other"),
  other_class = "Other",
  optimize_for_other = TRUE,
  predict_unlabeled = FALSE,
  plot_samples = NULL,
  DLBCLone_KNN_out = NULL,
  seed = 12345,
  epsilon = 0.001,
  weighted_votes = TRUE,
  skip_umap = FALSE
)

Arguments

features_df: Numeric matrix/data.frame (rows = samples, cols = features). Row names must be sample IDs.
metadata: Data frame with at least sample_id and the ground-truth label column given in truth_column.
core_features: Character vector of feature names to upweight (optional).
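core_feature_multiplier: Numeric multiplier applied to the core_features columns (default: 1.5).
hidden_features: Character vector of feature names to exclude (optional).
min_k, max_k: Integers giving the smallest and largest K (number of neighbors) to explore when optimizing (defaults: 5 and 60).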
truth_column: Name of metadata column with ground-truth class labels.
truth_classes: Character vector of all classes to consider (including other_class if you intend to optimize for it).
other_class: Name of the explicit outgroup class (default: "Other").
optimize_for_other: Logical; if TRUE, computes a separate "other" score (ratio) and searches for a purity threshold; if FALSE, treats all classes symmetrically.
predict_unlabeled: If TRUE, re-runs KNN to classify samples that were present in features_df but not in metadata.
plot_samples: Optional vector of sample_ids to keep in example plots.
DLBCLone_KNN_out: Optional result from a previous DLBCLone_KNN call; if supplied, its learned parameters are reused and optimization is skipped.
seed: Random seed.
epsilon: Small value added to distances before weighting.
weighted_votes: If FALSE, neighbors are unweighted (equal votes).
skip_umap: If TRUE, skip layout optimization plots at the end.
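
To make the roles of core_features, hidden_features, epsilon and weighted_votes more concrete, here is a minimal base-R sketch. It is not the package's implementation: the toy matrix, the helper name weighted_vote, and the inverse-distance weight 1 / (distance + epsilon) are assumptions chosen for illustration, and the real function may combine these steps differently.

```r
# Toy binary mutation matrix: rows = samples, cols = features.
# Sample IDs and feature names are made up for this sketch.
features_df <- data.frame(
  MYD88 = c(1, 0, 1, 0),
  EZH2  = c(0, 1, 0, 1),
  TP53  = c(1, 1, 0, 0),
  row.names = c("S1", "S2", "S3", "S4")
)

core_features <- c("MYD88")
core_feature_multiplier <- 1.5
hidden_features <- c("TP53")

# Drop hidden features, then scale the remaining core columns.
kept <- setdiff(colnames(features_df), hidden_features)
prepared <- features_df[, kept, drop = FALSE]
core_present <- intersect(core_features, colnames(prepared))
prepared[core_present] <- prepared[core_present] * core_feature_multiplier

# Hypothetical helper: turn neighbor distances into per-class vote shares.
# ASSUMPTION: weight = 1 / (distance + epsilon); with weighted_votes = FALSE
# every neighbor counts equally.
weighted_vote <- function(distances, labels, epsilon = 0.001,
                          weighted_votes = TRUE) {
  weights <- if (weighted_votes) {
    1 / (distances + epsilon)
  } else {
    rep(1, length(distances))
  }
  votes <- vapply(split(weights, labels), sum, numeric(1))
  votes / sum(votes)
}

# Three neighbors: one identical sample (distance 0) and two farther ones.
neighbor_dist   <- c(0, 0.5, 1)
neighbor_labels <- c("MCD", "EZB", "EZB")

weighted_scores   <- weighted_vote(neighbor_dist, neighbor_labels)
unweighted_scores <- weighted_vote(neighbor_dist, neighbor_labels,
                                   weighted_votes = FALSE)

weighted_winner   <- names(which.max(weighted_scores))
unweighted_winner <- names(which.max(unweighted_scores))
```

In this toy case the single very close neighbor dominates the weighted vote, while the unweighted vote follows the simple majority. Under the assumed weighting, epsilon also keeps the weight finite when a neighbor sits at distance 0.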

Value

A list with fields including:

predictions: Per-sample vote/score summary and predicted labels
DLBCLone_k_best_k: Best K found
DLBCLone_k_purity_threshold: Best purity threshold (if applicable)
DLBCLone_k_accuracy: Best accuracy metric achieved
truth_classes, truth_column: Echoed arguments
unlabeled_predictions: Predictions for unlabeled samples (if requested)
df: Annotated layout for plotting (when built in this run)
plot_truth, plot_predicted: ggplots when built in this run
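
As a rough illustration of what a "best K" search over a range such as min_k to max_k can look like, the sketch below scores each K by leave-one-out accuracy on simulated data. Everything here is an assumption for illustration only: the simulated features, the Euclidean distance, the unweighted majority vote, and the names loo_accuracy and k_grid are not taken from DLBCLone, whose optimization (including the purity threshold) may work differently.

```r
set.seed(12345)

# Simulated binary features for two made-up classes, 20 samples each.
class_a <- matrix(rbinom(20 * 4, 1, 0.8), ncol = 4)
class_b <- matrix(rbinom(20 * 4, 1, 0.2), ncol = 4)
features <- rbind(class_a, class_b)
truth <- rep(c("A", "B"), each = 20)

# Pairwise Euclidean distances between all samples.
d <- as.matrix(dist(features))

# Hypothetical helper: leave-one-out accuracy of a plain majority-vote KNN.
loo_accuracy <- function(k) {
  predicted <- vapply(seq_len(nrow(d)), function(i) {
    # Rank all other samples by distance to sample i and keep the k nearest.
    neighbors <- order(d[i, -i])[seq_len(k)]
    votes <- table(truth[-i][neighbors])
    names(votes)[which.max(votes)]
  }, character(1))
  mean(predicted == truth)
}

# Explore a small K range (the function's defaults are 5 to 60).
k_grid <- 5:15
accuracy <- vapply(k_grid, loo_accuracy, numeric(1))
best_k <- k_grid[which.max(accuracy)]
best_accuracy <- max(accuracy)
```

Here best_k and best_accuracy play a role loosely analogous to DLBCLone_k_best_k and DLBCLone_k_accuracy in the returned list.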

Details

This version removes hard-coded LymphGen class names and instead derives the in-group classes and the outgroup column name from the truth_classes and other_class arguments. The defaults preserve backward compatibility with LymphGen-style usage.
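
One simple way to derive the in-group classes from these two arguments is a set difference. This is a sketch of the idea rather than the package's internal code; the variable names and the custom class labels below are hypothetical.

```r
# Default, LymphGen-like arguments.
truth_classes <- c("EZB", "BN2", "ST2", "MCD", "N1", "Other")
other_class <- "Other"

# In-group classes are all truth classes except the outgroup.
in_group_classes <- setdiff(truth_classes, other_class)

# The same logic applied to a made-up, non-LymphGen label set.
custom_classes <- c("GCB", "ABC", "Unclassified")
custom_other <- "Unclassified"
custom_in_group <- setdiff(custom_classes, custom_other)
```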