Optimize parameters for classifying samples using UMAP and k-nearest neighbor

Usage

DLBCLone_optimize_params(
  combined_mutation_status_df,
  metadata_df,
  umap_out,
  truth_classes = c("EZB", "MCD", "ST2", "N1", "BN2", "Other"),
  other_class = "Other",
  truth_column = "lymphgen",
  optimize_for_other = FALSE,
  eval_group = NULL,
  min_k = 3,
  max_k = 23,
  verbose = FALSE,
  seed = 12345,
  maximize = "balanced_accuracy",
  cap_classification_rate = 1,
  exclude_other_for_accuracy = FALSE,
  weights_opt = c(TRUE)
)

Arguments

combined_mutation_status_df

Data frame with one row per sample and one column per mutation

metadata_df

Data frame of metadata with one row per sample and three required columns: sample_id, dataset and lymphgen

umap_out

The output of a previous run of make_and_annotate_umap. If provided, the function will use this model to project the data instead of re-running UMAP.

truth_classes

Vector of classes to use for training and testing. Default: c("EZB","MCD","ST2","N1","BN2","Other")

optimize_for_other

Set to TRUE to optimize the threshold for classifying samples as "Other" based on the relative proportion of samples near the sample in UMAP space with the "Other" label. Rather than treating Other as just another class, this will optimize the threshold for a separate score that considers how many Other and non-Other samples are in the neighborhood of the sample in question. This parameter will NOT change the value in predicted_label. Instead, the predicted_label_optimized column will contain the optimized label. Default: FALSE
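
The idea of a separate "Other" score can be illustrated with a minimal sketch in base R. This is not the package's implementation: the neighbor labels, the score definition (fraction of neighbors labelled "Other") and the threshold of 0.5 are all assumptions made for illustration. Only the column name predicted_label_optimized comes from the description above.

```r
# Neighbor labels for three hypothetical samples (made-up values).
neighbors <- list(
  s1 = c("EZB", "EZB", "Other", "EZB", "EZB"),
  s2 = c("Other", "Other", "MCD", "Other", "Other"),
  s3 = c("MCD", "Other", "Other", "MCD", "MCD")
)

# Assumed score: fraction of the neighborhood labelled "Other".
other_score <- vapply(neighbors, function(x) mean(x == "Other"), numeric(1))

# Majority vote among the non-Other neighbors only.
# (This toy data always has at least one non-Other neighbor.)
majority_non_other <- vapply(neighbors, function(x) {
  votes <- table(x[x != "Other"])
  names(votes)[which.max(votes)]
}, character(1))

# Hypothetical threshold; the real function would search for this value.
threshold <- 0.5
predicted_label_optimized <- ifelse(
  other_score > threshold, "Other", majority_non_other
)
```

In this sketch a sample is called "Other" only when the "Other" fraction exceeds the threshold; otherwise it keeps the majority label of its non-Other neighbors.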

eval_group

If desired, use this to specify which rows will be evaluated and held out from training rather than using all samples. NOTE: this parameter may be deprecated in a future release.

min_k

Starting k for knn (Default: 3)

max_k

Ending k for knn (Default: 23)
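
To show what scanning k between min_k and max_k can look like, here is a minimal sketch in base R. It is not the package's code: the simulated 2-D coordinates stand in for a UMAP embedding, the leave-one-out majority vote is a generic k-nearest-neighbor scheme, and stepping k by 1 is an assumption, since the step size used by the function is not stated here.

```r
set.seed(12345)

# Hypothetical 2-D embedding: two clusters of 20 points each.
coords <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2)
)
labels <- rep(c("EZB", "MCD"), each = 20)

# Pairwise distances; Inf on the diagonal so a point is never its own neighbor.
d <- as.matrix(dist(coords))
diag(d) <- Inf

# Leave-one-out accuracy of a majority-vote k-nearest-neighbor classifier.
loo_accuracy <- function(k) {
  pred <- vapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    votes <- table(labels[nn])
    names(votes)[which.max(votes)]
  }, character(1))
  mean(pred == labels)
}

min_k <- 3
max_k <- 9
k_grid <- seq(min_k, max_k)  # assumed step of 1
acc <- vapply(k_grid, loo_accuracy, numeric(1))
best_k <- k_grid[which.max(acc)]
```

A wider range of k costs more evaluations, so min_k and max_k bound the search rather than fix a single neighborhood size.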

verbose

Whether to print verbose output to the console (Default: FALSE)

seed

Random seed to use for reproducibility (default: 12345)

maximize

Metric to optimize. One of "sensitivity" (mean sensitivity across all classes), "accuracy" (overall accuracy across all samples) or "balanced_accuracy" (mean of the per-class balanced accuracy values). Default: "balanced_accuracy"
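
The three metrics can be compared on a toy example in base R. This sketch assumes the common one-vs-rest definition of per-class balanced accuracy (the mean of sensitivity and specificity); whether the function computes it in exactly this way is an assumption. The truth and predicted labels are made up.

```r
# Toy truth and predicted labels (made-up values for illustration only).
classes <- c("EZB", "MCD", "Other")
truth <- factor(
  c("EZB", "EZB", "MCD", "MCD", "MCD", "Other", "Other", "Other"),
  levels = classes
)
pred <- factor(
  c("EZB", "MCD", "MCD", "MCD", "Other", "Other", "Other", "EZB"),
  levels = classes
)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm <- table(truth, pred)

# One-vs-rest counts for each class.
tp <- diag(cm)
fn <- rowSums(cm) - tp
fp <- colSums(cm) - tp
tn <- sum(cm) - tp - fn - fp

sensitivity_per_class <- tp / (tp + fn)
specificity_per_class <- tn / (tn + fp)
balanced_accuracy_per_class <-
  (sensitivity_per_class + specificity_per_class) / 2

metrics <- c(
  sensitivity = mean(sensitivity_per_class),
  accuracy = sum(tp) / sum(cm),
  balanced_accuracy = mean(balanced_accuracy_per_class)
)
```

Overall accuracy is dominated by the largest classes, whereas the two averaged metrics weight every class equally, which matters when "Other" is much more frequent than the remaining classes.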

Value

List of data frames with the results of the parameter optimization including the best model, the associated knn parameters and the annotated UMAP output as a data frame. The list also includes the predictions for the "Other" class if it was included in the training and testing.

Examples


if (FALSE) { # \dontrun{

lymphgen_lyseq <- GAMBLR.predict::DLBCLone_optimize_params(
  dlbcl_status_combined_lyseq,
  dlbcl_meta_lyseq_train,
  min_k = 5,
  max_k = 23,
  truth_classes = c("MCD", "EZB", "ST2", "BN2", "Other")
)
} # }