Optimize parameters for classifying samples using UMAP and k-nearest neighbor
Usage
DLBCLone_optimize_params(
  combined_mutation_status_df,
  metadata_df,
  umap_out,
  truth_classes = c("EZB", "MCD", "ST2", "N1", "BN2", "Other"),
  other_class = "Other",
  truth_column = "lymphgen",
  optimize_for_other = FALSE,
  eval_group = NULL,
  min_k = 3,
  max_k = 23,
  verbose = FALSE,
  seed = 12345,
  maximize = "balanced_accuracy",
  cap_classification_rate = 1,
  exclude_other_for_accuracy = FALSE,
  weights_opt = c(TRUE)
)
Arguments
- combined_mutation_status_df
Data frame with one row per sample and one column per mutation
- metadata_df
Data frame of metadata with one row per sample and three required columns: sample_id, dataset and lymphgen
- umap_out
The output of a previous run of make_and_annotate_umap. If provided, the function will use this model to project the data instead of re-running UMAP.
- truth_classes
Vector of classes to use for training and testing. Default: c("EZB","MCD","ST2","N1","BN2","Other")
- optimize_for_other
Set to TRUE to optimize the threshold for classifying samples as "Other" based on the relative proportion of samples near the sample in UMAP space with the "Other" label. Rather than treating Other as just another class, this will optimize the threshold for a separate score that considers how many Other and non-Other samples are in the neighborhood of the sample in question. This parameter will NOT change the value in predicted_label. Instead, the predicted_label_optimized column will contain the optimized label. Default: FALSE
- eval_group
Optionally specifies which rows are evaluated and held out from training, instead of using all samples. NOTE: this parameter is likely to be deprecated.
- min_k
Starting k for knn (Default: 3)
- max_k
Ending k for knn (Default: 23)
- verbose
Whether to print verbose outputs to console
- seed
Random seed to use for reproducibility (default: 12345)
- maximize
Metric to use for optimization. One of "sensitivity" (average sensitivity across all classes), "accuracy" (overall accuracy across all samples) or "balanced_accuracy" (the mean of the balanced accuracy values across all classes). Default: "balanced_accuracy"
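The three options for maximize can be illustrated with a small base R sketch. It uses the standard one-vs-rest definitions of sensitivity and balanced accuracy, which is an assumption: the exact computation inside DLBCLone_optimize_params is not shown on this page. The truth and pred vectors are made-up labels for illustration only.

```r
# Hypothetical truth and predicted labels for six samples (illustrative only)
class_levels <- c("EZB", "MCD", "Other")
truth <- factor(c("EZB", "EZB", "MCD", "MCD", "Other", "Other"),
                levels = class_levels)
pred  <- factor(c("EZB", "MCD", "MCD", "MCD", "Other", "EZB"),
                levels = class_levels)

# Confusion matrix: rows are true classes, columns are predicted classes
cm <- table(truth = truth, pred = pred)

# Per-class one-vs-rest counts
tp <- diag(cm)
fn <- rowSums(cm) - tp
fp <- colSums(cm) - tp
tn <- sum(cm) - tp - fn - fp

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# Assumed definitions matching the descriptions of the maximize options
metrics <- c(
  sensitivity       = mean(sensitivity),                     # mean per-class sensitivity
  accuracy          = sum(tp) / sum(cm),                     # overall accuracy
  balanced_accuracy = mean((sensitivity + specificity) / 2)  # mean per-class balanced accuracy
)
print(metrics)
```

Because balanced accuracy averages sensitivity and specificity within each class before averaging across classes, it is less dominated by large classes than overall accuracy, which may be why it is the default.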
Value
List of data frames with the results of the parameter optimization including the best model, the associated knn parameters and the annotated UMAP output as a data frame. The list also includes the predictions for the "Other" class if it was included in the training and testing.
Examples
if (FALSE) { # \dontrun{
lymphgen_lyseq <- GAMBLR.predict::DLBCLone_optimize_params(
  dlbcl_status_combined_lyseq,
  dlbcl_meta_lyseq_train,
  min_k = 5,
  max_k = 23,
  truth_classes = c("MCD", "EZB", "ST2", "BN2", "Other")
)
} # }
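To give a feel for what scanning k between min_k and max_k involves, here is a minimal, self-contained sketch. It is not the package's implementation: it uses class::knn.cv (leave-one-out k-nearest neighbor from the class package that ships with R) on simulated 2D coordinates standing in for a UMAP embedding, and it scores each k by plain accuracy. The simulated data, the choice to try every integer k, and the scoring are all assumptions made for illustration.

```r
library(class)  # recommended package shipped with R; provides knn.cv()

set.seed(12345)

# Simulated 2D embedding standing in for UMAP coordinates (not real data)
n <- 60
labels <- factor(rep(c("EZB", "MCD", "ST2"), each = n / 3))
centers <- rbind(EZB = c(0, 0), MCD = c(4, 0), ST2 = c(2, 4))
coords <- centers[as.character(labels), ] + matrix(rnorm(n * 2), ncol = 2)

min_k <- 3
max_k <- 23
k_values <- min_k:max_k  # assumption: every integer k in the range is tried

# Leave-one-out accuracy for each candidate k
loo_accuracy <- sapply(k_values, function(k) {
  pred <- knn.cv(train = coords, cl = labels, k = k)
  mean(pred == labels)
})

# Keep the k with the highest accuracy (first one wins on ties)
best_k <- k_values[which.max(loo_accuracy)]
print(data.frame(k = k_values, accuracy = loo_accuracy))
print(best_k)
```

Swapping the accuracy line for another metric, such as a mean per-class balanced accuracy, would mimic choosing a different value of maximize.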