Optimize parameters for classifying samples using UMAP and k-nearest neighbors

Usage

DLBCLone_optimize_params(
  combined_mutation_status_df,
  metadata_df,
  umap_out,
  truth_classes = c("EZB", "MCD", "ST2", "N1", "BN2", "Other"),
  other_class = "Other",
  truth_column = "lymphgen",
  optimize_for_other = FALSE,
  eval_group = NULL,
  min_k = 3,
  max_k = 23,
  verbose = FALSE,
  seed = 12345,
  maximize = "balanced_accuracy",
  cap_classification_rate = 1,
  exclude_other_for_accuracy = FALSE,
  weights_opt = c(TRUE)
)

Arguments

combined_mutation_status_df: Data frame with one row per sample and one column per mutation
metadata_df: Data frame of metadata with one row per sample and three required columns: sample_id, dataset and lymphgen
umap_out: The output of a previous run of make_and_annotate_umap. If provided, the function will use this model to project the data instead of re-running UMAP.
truth_classes: Vector of classes to use for training and testing. Default: c("EZB","MCD","ST2","N1","BN2","Other")
optimize_for_other: Set to TRUE to optimize the threshold for classifying samples as "Other" based on the relative proportion of samples near the sample in UMAP space with the "Other" label. Rather than treating Other as just another class, this will optimize the threshold for a separate score that considers how many Other and non-Other samples are in the neighborhood of the sample in question. This parameter will NOT change the value in predicted_label. Instead, the predicted_label_optimized column will contain the optimized label. Default: FALSE
eval_group: If desired, use this to specify which rows will be evaluated and held out from training rather than using all samples. NOTE: this parameter will probably be deprecated!
min_k: Starting k for knn (Default: 3)
max_k: Ending k for knn (Default: 23)
verbose: Whether to print verbose output to the console (Default: FALSE)
seed: Random seed to use for reproducibility (Default: 12345)
maximize: Metric to use for optimization. Either "sensitivity" (average sensitivity across all classes), "accuracy" (actual accuracy across all samples) or "balanced_accuracy" (the mean of the balanced accuracy values across all classes). Default: "balanced_accuracy"
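
The three choices for maximize, and the sweep from min_k to max_k, can be illustrated with a minimal base R sketch. Everything here is an assumption made for illustration: class_metrics and loo_knn are hypothetical helpers that are not part of GAMBLR.predict, the toy coordinates stand in for a UMAP embedding, and trying every integer k with unweighted leave-one-out voting is a simplification that may differ from what DLBCLone_optimize_params does internally.

```r
# Hypothetical helper: summarise the three metrics named by `maximize`.
class_metrics <- function(truth, predicted, classes) {
  per_class <- sapply(classes, function(cl) {
    tp <- sum(truth == cl & predicted == cl)
    fn <- sum(truth == cl & predicted != cl)
    tn <- sum(truth != cl & predicted != cl)
    fp <- sum(truth != cl & predicted == cl)
    sens <- tp / (tp + fn)
    spec <- tn / (tn + fp)
    c(sensitivity = sens, balanced_accuracy = (sens + spec) / 2)
  })
  c(
    sensitivity = mean(per_class["sensitivity", ]),            # mean over classes
    accuracy = mean(truth == predicted),                       # over all samples
    balanced_accuracy = mean(per_class["balanced_accuracy", ]) # mean over classes
  )
}

# Hypothetical helper: leave-one-out k-nearest-neighbor vote in 2D space.
# Ties are resolved in favour of the alphabetically first label.
loo_knn <- function(coords, labels, k) {
  d <- as.matrix(dist(coords))
  diag(d) <- Inf # a sample never votes for itself
  vapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    votes <- table(labels[nn])
    names(votes)[which.max(votes)]
  }, character(1))
}

# Toy embedding: two clusters standing in for UMAP coordinates.
set.seed(12345)
coords <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 4), ncol = 2)
)
labels <- rep(c("EZB", "MCD"), each = 20)

# Sweep k and keep the value that maximizes the chosen metric.
min_k <- 3
max_k <- 9
maximize <- "balanced_accuracy"
k_values <- min_k:max_k
k_scores <- sapply(k_values, function(k) {
  class_metrics(labels, loo_knn(coords, labels, k), c("EZB", "MCD"))[[maximize]]
})
best_k <- k_values[which.max(k_scores)]
```

Switching maximize to "sensitivity" or "accuracy" in this sketch changes only which summary value drives the choice of best_k; with imbalanced classes the three metrics can favour different values of k.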

Value

List of data frames with the results of the parameter optimization including the best model, the associated knn parameters and the annotated UMAP output as a data frame. The list also includes the predictions for the "Other" class if it was included in the training and testing.
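
The separate "Other" score described under optimize_for_other can be pictured with the following sketch. It is only an illustration under stated assumptions: the score here is simply the fraction of a sample's neighbors labelled "Other", other_score is a hypothetical helper, and the neighborhoods, threshold grid and use of plain accuracy are invented for the example. The score and threshold search used by DLBCLone_optimize_params may differ.

```r
# Hypothetical score: fraction of a sample's neighbors carrying the Other label.
other_score <- function(neighbor_labels, other_class = "Other") {
  mean(neighbor_labels == other_class)
}

# Toy data: labels of the k = 5 nearest neighbors for four samples.
neighborhoods <- list(
  c("Other", "Other", "Other", "EZB", "Other"),
  c("EZB", "EZB", "Other", "EZB", "EZB"),
  c("MCD", "Other", "Other", "MCD", "Other"),
  c("MCD", "MCD", "MCD", "MCD", "Other")
)
results <- data.frame(
  truth = c("Other", "EZB", "Other", "MCD"),
  predicted_label = c("EZB", "EZB", "MCD", "MCD"), # vote among non-Other classes
  stringsAsFactors = FALSE
)
results$other_score <- vapply(neighborhoods, other_score, numeric(1))

# Try a grid of thresholds and keep the one giving the best accuracy
# for the Other versus non-Other call.
thresholds <- seq(0.05, 0.95, by = 0.1)
threshold_accuracy <- vapply(thresholds, function(th) {
  mean((results$other_score >= th) == (results$truth == "Other"))
}, numeric(1))
best_threshold <- thresholds[which.max(threshold_accuracy)]

# predicted_label is left untouched; the thresholded call goes in a new column.
results$predicted_label_optimized <- ifelse(
  results$other_score >= best_threshold, "Other", results$predicted_label
)
```

Keeping the thresholded call in its own column mirrors the documented behaviour that predicted_label is not changed when optimize_for_other is TRUE.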

Examples


if (FALSE) { # \dontrun{

lymphgen_lyseq <- GAMBLR.predict::DLBCLone_optimize_params(
  dlbcl_status_combined_lyseq,
  dlbcl_meta_lyseq_train,
  min_k = 5,
  max_k = 23,
  truth_classes = c("MCD", "EZB", "ST2", "BN2", "Other")
)
} # }