Weighted KNN on a feature (mutation) matrix with optional upweighting of user-specified "core" features, optional exclusion of "hidden" features, and optional optimization for an explicit outgroup (e.g. "Other"). WARNING: This function is not one of the core DLBCLone functions. In most cases you should use DLBCLone_predict instead.

Usage

DLBCLone_KNN(
  features_df,
  metadata,
  core_features = NULL,
  core_feature_multiplier = 1.5,
  hidden_features = NULL,
  min_k = 5,
  max_k = 60,
  truth_column = "lymphgen",
  truth_classes = c("EZB", "BN2", "ST2", "MCD", "N1", "Other"),
  other_class = "Other",
  optimize_for_other = TRUE,
  predict_unlabeled = FALSE,
  plot_samples = NULL,
  DLBCLone_KNN_out = NULL,
  seed = 12345,
  epsilon = 0.001,
  weighted_votes = TRUE,
  skip_umap = FALSE
)
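The input requirements described under Arguments can be illustrated with a small sketch in base R. The gene names, sample IDs, and labels below are made up for illustration and are not taken from the package. The block only builds a features_df with sample IDs as row names and a metadata data frame with sample_id and the default truth column, then checks that they line up. A real run needs far more labelled samples than the default min_k.

```r
# Toy inputs (hypothetical sample IDs and feature names, for illustration only)
set.seed(1)
sample_ids <- paste0("S", 1:8)

# rows = samples, cols = features; row names carry the sample IDs
features_df <- matrix(
  rbinom(8 * 4, size = 1, prob = 0.4),
  nrow = 8,
  dimnames = list(sample_ids, c("EZH2", "MYD88", "NOTCH2", "SGK1"))
)

# metadata needs sample_id plus the column named by truth_column
metadata <- data.frame(
  sample_id = sample_ids[1:6],
  lymphgen = c("EZB", "MCD", "BN2", "ST2", "Other", "EZB"),
  stringsAsFactors = FALSE
)

# every labelled sample should have a row in the feature matrix
stopifnot(all(metadata$sample_id %in% rownames(features_df)))

# samples with features but no metadata row; these are the ones that
# predict_unlabeled = TRUE is documented to classify
unlabeled <- setdiff(rownames(features_df), metadata$sample_id)
print(unlabeled)
```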

Arguments

features_df

Numeric matrix/data.frame (rows = samples, cols = features). Row names must be sample IDs.

metadata

Data frame with at least sample_id and the ground-truth label column given in truth_column.

core_features

Character vector of feature names to upweight (optional).

core_feature_multiplier

Numeric multiplier for core_features.

hidden_features

Character vector of feature names to drop (optional).

min_k, max_k

Integer K range to explore when optimizing.

truth_column

Name of metadata column with ground-truth class labels.

truth_classes

Character vector of all classes to consider (including other_class if you intend to optimize for it).

other_class

Name of the explicit outgroup class (default: "Other").

optimize_for_other

Logical; if TRUE, computes a separate "other" score (ratio) and searches for the best purity threshold; if FALSE, treats all classes symmetrically.

predict_unlabeled

If TRUE, re-runs KNN to classify samples that were present in features_df but not in metadata.

plot_samples

Optional vector of sample_ids to keep in example plots.

DLBCLone_KNN_out

Optional prior result; if supplied, its learned parameters are reused and optimization is skipped.

seed

Random seed.

epsilon

Small value added to distances before weighting.

weighted_votes

If FALSE, neighbors are unweighted (equal votes).

skip_umap

If TRUE, skip layout optimization plots at the end.
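To make the roles of core_features, core_feature_multiplier, hidden_features, epsilon, and weighted_votes more concrete, here is a minimal base R sketch of distance-weighted KNN voting. It is not the package's implementation: the helpers prepare() and knn_vote() are hypothetical, and Euclidean distance and the inverse-distance weight 1 / (d + epsilon) are assumptions chosen for illustration.

```r
# Toy training matrix (rows = samples, cols = features) and class labels
train <- matrix(
  c(1, 0, 0, 0,
    1, 1, 1, 1,
    0, 0, 1, 0,
    0, 1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A1", "A2", "B1", "B2"), c("f1", "f2", "f3", "f4"))
)
labels <- c(A1 = "A", A2 = "A", B1 = "B", B2 = "B")
query <- c(f1 = 1, f2 = 0, f3 = 0, f4 = 0)

# Hypothetical helper: drop hidden features, then scale core features
prepare <- function(x, core_features = NULL, core_feature_multiplier = 1.5,
                    hidden_features = NULL) {
  x <- x[, setdiff(colnames(x), hidden_features), drop = FALSE]
  core <- intersect(colnames(x), core_features)
  if (length(core) > 0) {
    x[, core] <- x[, core] * core_feature_multiplier
  }
  x
}

# Hypothetical helper: vote among the k nearest neighbours.
# Assumes Euclidean distance and weights of 1 / (distance + epsilon).
knn_vote <- function(train, labels, query, k, epsilon = 0.001,
                     weighted_votes = TRUE) {
  d <- sqrt(rowSums(sweep(train, 2, query[colnames(train)])^2))
  nn <- order(d)[seq_len(k)]
  w <- if (weighted_votes) 1 / (d[nn] + epsilon) else rep(1, k)
  votes <- tapply(w, labels[rownames(train)[nn]], sum)
  names(votes)[which.max(votes)]
}

# Raw features: the identical neighbour A1 dominates only when votes are weighted
pred_weighted <- knn_vote(train, labels, query, k = 3)
pred_unweighted <- knn_vote(train, labels, query, k = 3, weighted_votes = FALSE)

# Upweight f1 and hide f4, applying the same transformation to the query
scaled <- prepare(rbind(train, Q = query),
                  core_features = "f1", hidden_features = "f4")
pred_scaled <- knn_vote(scaled[rownames(train), ], labels, scaled["Q", ],
                        k = 3, weighted_votes = FALSE)
```

In this toy case the query is identical to A1, so its distance is zero and epsilon keeps the weight finite. With equal votes the two B neighbours outvote A1, while upweighting f1 pulls A2 into the neighbourhood and flips the unweighted result.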

Value

A list with fields including:

predictions

Per-sample vote/score summary and predicted labels

DLBCLone_k_best_k

Best K found

DLBCLone_k_purity_threshold

Best purity threshold (if applicable)

DLBCLone_k_accuracy

Best accuracy metric achieved

truth_classes, truth_column

Echoed arguments

unlabeled_predictions

Predictions for unlabeled samples (if requested)

df

Annotated layout for plotting (when built in this run)

plot_truth, plot_predicted

ggplots when built in this run

Details

This version removes hard-coded LymphGen class names and instead derives the in-group classes and the outgroup column name from the arguments truth_classes and other_class. It keeps backward compatibility for the default LymphGen-like usage.
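As a rough illustration of what deriving the in-group classes from the arguments could look like, the sketch below uses base R only. The custom class names are hypothetical, and the package's internal logic may differ.

```r
# Default LymphGen-like usage: everything except other_class is in-group
truth_classes <- c("EZB", "BN2", "ST2", "MCD", "N1", "Other")
other_class <- "Other"
in_group <- setdiff(truth_classes, other_class)

# The same derivation with a custom (hypothetical) label set
custom_classes <- c("GroupA", "GroupB", "Unclassified")
custom_in_group <- setdiff(custom_classes, "Unclassified")

print(in_group)
print(custom_in_group)
```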