Skip to contents

To demonstrate that our approach also works in the original high-dimensional feature space as well as UMAP space, we provide two additional functions: DLBCLone_KNN and DLBCLone_KNN_predict. Unlike the earlier functions, these do not rely on UMAP embeddings. Instead, they call UMAP only to compute pairwise distances in the high-dimensional space, a preliminary step in the UMAP algorithm before applying KNN directly. It’s important not to confuse these with the previously introduced DLBCLone functions, which embed the data in a lower-dimensional space and run KNN on those embeddings. With the earlier functions, users never need to handle the KNN step themselves, since it is built into the workflow. These are not meant for production usage!

DLBCLone_KNN

Weighted KNN on a feature (mutation) matrix with optional upweighting of user-specified “core” features, optional exclusion of “hidden” features, and optional optimization of an explicit outgroup (e.g. “Other”).

features_df matrix (rows = samples, cols = features) rownames must be sample IDs

metadata data frame with at least sample_id and the ground-truth label column given in truth_column

core_features character vector of feature names to upweight (optional)

core_feature_multiplier numeric multiplier for core_features (default: 1.5)

hidden_features vector of feature names to drop

min_k integer K range to explore when optimizing (default: 5)

max_k integer K range to explore when optimizing (default: 60)

truth_column name of metadata column with ground-truth class labels

truth_classes vector of all classes to consider (including other_class if you intend to optimize for it)

other_class name of the explicit outgroup class (default: “Other”)

optimize_for_other if TRUE: computes a separate “other” score (ratio) and searches a purity threshold; if FALSE, treats all classes symmetrically

predict_unlabeled if TRUE: re-runs KNN to classify samples that were present in features_df but not in metadata

plot_samples optional vector of sample_ids to keep in example plots

DLBCLone_KNN_out optional prior result; if supplied, its learned parameters are reused (skip optimization)

epsilon small value added to distances before weighting

weighted_votes If FALSE: neighbors are unweighted (equal votes)

skip_umap If TRUE: skip layout optimization plots at the end

dlbcl_knn <- DLBCLone_KNN(
  features_df = best_opt_model_matrix,
  metadata = dlbcl_meta_clean,
  min_k = 5,
  max_k = 21,
  optimize_for_other = TRUE
)

knitr::kable(head(dlbcl_knn$predictions))
sample_id EZB BN2 ST2 MCD N1 Other_score top_class top_class_count by_score top_group_score score_ratio lymphgen min_score DLBCLone_k DLBCLone_ko valid_classes
00-14595_tumorC 25.7028 0 0 0 0 0 EZB 25.7028 EZB 25.7028 Inf EZB 16.5 EZB EZB EZB
00-15201_tumorA 2.3604 0 2.0548 4.2871 0 14.8873 MCD 4.2871 MCD 4.2871 0.288 Other 16.5 MCD MCD Other
00-15201_tumorB 0 6.0352 0 3.5092 0 13.1948 BN2 6.0352 BN2 6.0352 0.4574 N1 16.5 BN2 BN2 Other
00-17960_CLC01670 29.3236 3.1066 0 0 0 0 EZB 29.3236 EZB 29.3236 Inf EZB 16.5 EZB EZB EZB
00-22011_tumorB 4.3219 0 0 4.1948 0 16.0209 EZB 4.3219 EZB 4.3219 0.2698 Other 16.5 EZB EZB Other
00-23442_tumorA 27.5503 0 0 0 0 8.411 EZB 27.5503 EZB 27.5503 3.2755 EZB 16.5 EZB EZB EZB
dlbcl_knn$plot_truth  

dlbcl_knn$plot_predicted

DLBCLone_KNN_predict

Applies a previously optimized DLBCLone_KNN model to predict class labels for new (test) samples. This function combines the training and test feature matrices, ensures feature compatibility, and uses the parameters from a DLBCLone KNN optimization run to classify the test samples. Optionally, runs in iterative mode for more stable results when predicting multiple samples. DLBCLone_KNN_predict is only reproducible when you run a sample through individually because the other samples affect the nearest-neighbor distances. It is OK for when users want to “try out” a model but it doesn’t scale well, predict_single_sample_DLBCLone is the way to go.

train_df matrix of features for training samples (rows = samples, columns = features)

test_df matrix of features for test samples to be classified

metadata data frame with metadata for all samples, including at least a sample_id column

core_features optional character vector of feature names to upweight in the KNN calculation

core_feature_multiplier multiplier to apply to core features (default: 1.5)

hidden_features optional character vector of feature names to exclude from the analysis

DLBCLone_KNN_out output from a previous call to DLBCLone_KNN containing optimized parameters

mode if “iterative”, runs KNN prediction for each test sample individually (recommended for stability)

Batch Mode

predictions_b <- DLBCLone_KNN_predict(
  train_df = best_opt_model_matrix,
  test_df = valid_df,
  metadata = dlbcl_meta_clean,
  DLBCLone_KNN_out = dlbcl_knn,
  mode = "batch"
)