GAMBLR developer guidelines

To ensure consistency throughout all of the packages in the GAMBLR-verse, please follow these developer guidelines. If you have any questions, please feel free to contact us or submit an issue in the corresponding repository.

Installation for developers

The easiest way to obtain and contribute to GAMBLR is to clone the repository. The example below uses the GAMBLR repo, but any other repo can be cloned in the same way: substitute GAMBLR with GAMBLR.{package}, where {package} is one of data, helpers, utils, viz, or results.

cd
git clone git@github.com:morinlab/GAMBLR.git

In your R editor of choice (which is hopefully VS Code), set your working directory to the directory you just cloned the repo into.

How do I set up VS Code as an R editor?

Please refer to the tutorial.

setwd("~/GAMBLR")

Install the package in R by running the following command (requires the devtools package):

devtools::install()

After applying your modifications to the code, use the following command to quickly test your changes without directly installing the package (requires the devtools dependency):

devtools::load_all()

The master branch of all GAMBLR-verse repos is protected. We welcome contributions (pull requests, bug reports, feature requests, PR reviews) from users of all levels. All commits must be submitted via a pull request from a branch. Please refer to the GitHub documentation for details on how to open a pull request, or ask for help in the #git-help Slack channel.

Creating or updating functions

Please always ensure that a new function goes into the corresponding child package according to its intended use. If you are not sure which package the new function belongs to, please check the Frequently Asked Questions.

When designing new functions or updating existing ones, please refer to the guidelines and best practices of R package development detailed here. Always provide the required documentation for any new function. See this section for more details on best practices for documenting R functions. Unsure what information goes where in a function's documentation? For more information, see this.

Name

New functions should follow the GAMBLR-verse convention implemented and expected throughout the packages, to ensure consistency and meet user expectations. The main naming conventions are:

Use get_ when the function is supposed to retrieve some sort of data
Use pretty when the function is supposed to generate a plot or a multi-panel figure
Use collate_ when the function generates sample-level summary values from another result, which will be added as one or more new columns to existing metadata
Use annotate_ when the function is supposed to annotate the data input with some additional information
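
As a rough illustration of the conventions above, the following base R sketch reports which prefix a function name starts with. The helper match_prefix and all candidate names are hypothetical and exist only for this example; they are not part of any GAMBLR package.

```r
# Prefixes taken from the naming conventions listed above.
# Note that "pretty" is listed without a trailing underscore.
gamblr_prefixes = c("get_", "pretty", "collate_", "annotate_")

# Hypothetical helper: return the convention prefix that a function name
# starts with, or NA if the name follows none of the conventions.
match_prefix = function(function_name) {
  hits = gamblr_prefixes[startsWith(function_name, gamblr_prefixes)]
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Hypothetical function names, used only for illustration
candidate_names = c("get_example_data", "prettyExamplePlot",
                    "collate_example_counts", "make_figure")
vapply(candidate_names, match_prefix, character(1))
```

A name that returns NA, such as make_figure here, is a hint to reconsider the name before opening a pull request.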

Title

The title displayed in the documentation is taken from the first sentence. It should be written in sentence case, not end in a full stop, and be followed by a blank line. For example:

This is a good title
This Is Not A Proper Title.

The title is shown in various function indexes (e.g. help(package = "some_package")) and is what the user will usually see when browsing multiple functions.

Description

The description displayed in the documentation is taken from the next paragraph. It’s shown at the top of documentation and should briefly describe the most important features of the function.

Details

Additional details are anything after the description. Details are optional, but can be any length, so they are useful if you want to dig deep into some important aspect of the function. Note that, even though the details come right after the description in the introduction, they appear much later in the rendered documentation. If you want to add code to the documentation, this is also the section to do so. For example, if the new function relies on some bash code in order to be used, you can detail such code here by simply adding a code block as you would in a regular markdown file.

Arguments

Detailed argument descriptions should be included for all functions. Remember to state the required data types, default values, if the argument is required or optional, etc.

There are several key concepts underlying the logic behind the naming of the arguments in the GAMBLR-verse. The main terms are:

projection: This is a coordinate system defining the relationship with genome build and chromosome prefixing. The main projections that are expected to be supported throughout the GAMBLR-verse are grch37 and hg38. The grch37 projection uses the same coordinate system as genome build hg19, but never has the “chr” prefix on chromosome names. In contrast, the hg38 projection is always chr-prefixed and uses the same coordinate system as the hg38 genome build. The terms projection and genome_build are essentially synonymous, but projection is meant to reflect the intent of the user wishing to retrieve data relative to a specific genome build (for example, with one of the get_ functions), whereas genome_build is used to tell other functions what projection they should anticipate for their input (for example, the annotate_ functions). Any function that accepts bed_data, seg_data or maf_data should theoretically be able to infer genome_build directly using get_genome_build, and developers should strive to use this feature to avoid potential user error.
metadata: This is a data frame with a set of minimal columns defining the biological or clinical characteristics of a sample or a cohort. The argument that defines the metadata should always be called these_samples_metadata. Typically the output of get_gambl_metadata() is provided to this argument, and you can expect that the following required columns will be present: patient_id, Tumor_Sample_Barcode, sample_id, seq_type, sex, cohort, and pathology. The main purpose of this data frame is to provide a structure for the metadata that is always expected to be available and that links unique sample identifiers to the associated basic metadata values. The columns Tumor_Sample_Barcode and sample_id are expected to share the same values, but both are required so that the outputs of different upstream tools can be operated on directly. When handling the metadata, you should always refer to the column sample_id for the unique sample identifiers. When implementing a function that utilizes the metadata, you must consider the scenario in which the same sample_id exists in more than one row, but there should never be rows that share the same combination of sample_id and seq_type. Thus, any internal data operation that extends across more than one seq_type must maintain the linkage between the data and its original seq_type. In some cases, this is accomplished simply by adding a column with a self-explanatory name to the output. For example, the maf_data returned by get_all_coding_ssm has a column maf_seq_type.
maf: This is a data frame in the standard maf format defining the simple somatic mutations. The argument that defines the maf should always be called maf_data. Typically the output of one of the get_ssm* functions is provided to this argument, and you can expect that the standard maf columns will be present, including Tumor_Sample_Barcode, Chromosome, Start_Position, End_Position, Variant_Classification, Hugo_Symbol, etc. When handling the maf, you should always refer to the column Tumor_Sample_Barcode for the unique sample identifiers.
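
The expectations for these_samples_metadata described above can be expressed as a small defensive check. This is a minimal base R sketch, not an existing GAMBLR function: the name check_metadata, the toy data frame, and its values (including the seq_type, cohort and pathology entries) are assumptions made for illustration.

```r
# Required columns as listed in the metadata description above
required_columns = c("patient_id", "Tumor_Sample_Barcode", "sample_id",
                     "seq_type", "sex", "cohort", "pathology")

# Hypothetical helper: stop with an informative message if the metadata
# lacks a required column or repeats a sample_id/seq_type combination.
check_metadata = function(these_samples_metadata) {
  missing_columns = setdiff(required_columns, colnames(these_samples_metadata))
  if (length(missing_columns) > 0) {
    stop("Missing required metadata columns: ",
         paste(missing_columns, collapse = ", "))
  }
  keys = these_samples_metadata[, c("sample_id", "seq_type")]
  if (anyDuplicated(keys) > 0) {
    stop("Each combination of sample_id and seq_type must be unique")
  }
  invisible(TRUE)
}

# Toy metadata (all values are made up): sample S1 appears in two rows,
# which is allowed because the rows differ in seq_type.
toy_metadata = data.frame(
  patient_id = c("P1", "P1", "P2"),
  Tumor_Sample_Barcode = c("S1", "S1", "S2"),
  sample_id = c("S1", "S1", "S2"),
  seq_type = c("genome", "capture", "genome"),
  sex = c("F", "F", "M"),
  cohort = "toy_cohort",
  pathology = "toy_pathology"
)
check_metadata(toy_metadata)
```

Running such a check at the top of a function makes a violated assumption fail early with a clear message, rather than surfacing later as a confusing join or plotting error.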

Return

Specify the returned object: is it a data frame, a list, a vector of characters, etc.?

Import (specifying dependencies)

Always import every package outside of base R from which your function calls any functions, as well as any R package that must be loaded in order for your function to execute.

Note

Remember not to import tidyverse. Instead, import the individual tidyverse packages that the function depends on (dplyr, readr, ggplot2, etc.).

Important

Overall, GAMBLR already depends on many different packages that have been added as dependencies. This results in very long installation times, frequent installation errors, version conflicts between sub-dependencies, function name clashes, and other issues. For these and other reasons, the addition of new dependencies is highly discouraged and strongly advised against! DO NOT add any new dependency that is not already a dependency of one of the GAMBLR packages.

Export

Should this function be exported to NAMESPACE (i.e., made directly accessible to anyone who loads GAMBLR), or is the function considered to be an internal/helper function? In order to have the function populate NAMESPACE, the developer has to run devtools::document() while in the working directory of the package. All functions that have the @export line in their documentation will be documented and added to NAMESPACE.

Examples

Please provide fully reproducible examples for the function. Ideally, the examples should demonstrate basic usage, as well as more advanced usage with different parameter combinations. Note that examples cannot extend over 100 characters per line, since this will cause the lines to be truncated in the rendered PDF manual. It is advised to write your example in such a way that loading external packages is avoided as much as possible. Instead, prioritize base R. In some cases, it is undesirable to have a function run its examples. This applies to functions that write files, helper functions, and functions that rely on GSC server access. To prevent such examples from running, simply wrap the example in:

\dontrun{
do_not_run = some_function()
}
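
The 100-character limit mentioned above can be checked before committing. The following base R sketch reports every line of a file that exceeds the limit; the helper find_long_lines and the toy file are hypothetical and only illustrate the idea. For simplicity it checks all lines of the file, not only the example lines.

```r
# Hypothetical helper: report the number and width of every line in a
# file that is longer than `limit` characters.
find_long_lines = function(path, limit = 100) {
  lines = readLines(path, warn = FALSE)
  too_long = which(nchar(lines, type = "chars") > limit)
  data.frame(line_number = too_long,
             width = nchar(lines[too_long], type = "chars"))
}

# Toy file with one over-long example line (123 characters wide)
toy_file = tempfile(fileext = ".R")
writeLines(c("#' @examples",
             paste0("#' ", strrep("x", 120)),
             "#' short_line = 1"),
           toy_file)
find_long_lines(toy_file)
```

For a real package, the same function could be applied to each file returned by list.files("R", full.names = TRUE).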

Testing New Functions

So you have added a new function (carefully following the steps in the previous section!) and you are obviously extremely proud and eager to test it out (and let others test it). There are two different approaches to do so.

Option 1

Your first option, and likely the preferred route to take, is to make sure that the working directory in RStudio is set to the GAMBLR folder with your updated code and then run devtools::load_all() to load all the functions available in the R/ folder of the same repo. This should make all such functions available to call.

Option 2

As an alternative, you can also run devtools::install() from the updated GAMBLR directory. As the name implies, this will install the complete package along with its dependencies, remotes, etc. Note that if you go with this second option, you should restart your R session with .rs.restartR() after installing the package and then load GAMBLR with library(GAMBLR). Now you have installed the updated branch of GAMBLR and are free to call any functions available in the R/ folder.

Note

The behaviour of load_all() and install() is different. Make sure you understand what you are doing, or please ask for help.

Function Documentation Template

For your convenience, here is a documentation template for GAMBLR functions.

#' @title
#'
#' @description
#'
#' @details
#'
#' @param a_parameter
#' @param another_parameter
#'
#' @return
#'
#' @import
#' @export
#'
#' @examples
#' #this is an example
#' ###For your reference, this line is exactly 100 characters. Do not exceed 100 characters per line
#'
function_name = function(a_parameter,
                         another_parameter){
                         }
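
For orientation, here is one way the template might look once filled in. It is only a sketch: the function collate_toy_counts, its behaviour, and the toy data are hypothetical and not part of GAMBLR. It uses base R only, so no @import line is needed, and for brevity it assumes that maf_data covers a single seq_type.

```r
#' @title Collate toy mutation counts
#'
#' @description Count the rows of a maf per sample and add the counts to the metadata.
#'
#' @details Samples with no rows in the maf receive a count of 0.
#'
#' @param these_samples_metadata Required. Data frame of metadata with a sample_id column.
#' @param maf_data Required. Data frame in maf format with a Tumor_Sample_Barcode column.
#'
#' @return A data frame: the input metadata with one new integer column, mutation_count.
#'
#' @export
#'
#' @examples
#' toy_meta = data.frame(sample_id = c("S1", "S2", "S3"))
#' toy_maf = data.frame(Tumor_Sample_Barcode = c("S1", "S1", "S2"))
#' collate_toy_counts(these_samples_metadata = toy_meta, maf_data = toy_maf)
#'
collate_toy_counts = function(these_samples_metadata,
                              maf_data){
  # Number of maf rows per sample identifier
  counts = table(maf_data$Tumor_Sample_Barcode)
  # Look up each metadata sample in the counts; unmatched samples give NA
  idx = match(these_samples_metadata$sample_id, names(counts))
  mutation_count = as.integer(counts)[idx]
  mutation_count[is.na(mutation_count)] = 0L
  these_samples_metadata$mutation_count = mutation_count
  return(these_samples_metadata)
}

# Same toy data as in the roxygen example above
toy_meta = data.frame(sample_id = c("S1", "S2", "S3"))
toy_maf = data.frame(Tumor_Sample_Barcode = c("S1", "S1", "S2"))
collate_toy_counts(these_samples_metadata = toy_meta, maf_data = toy_maf)
```

The name follows the collate_ convention because the function adds a sample-level summary column to existing metadata, and the arguments use the standard names these_samples_metadata and maf_data.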

Updating GAMBLR website

Most of the GAMBLR-verse repos have their own dedicated website with documentation, tutorials, and other information. Regardless of the repo, they all follow a consistent and straightforward approach to updating the website. Please follow these steps:

Checkout the master branch of the corresponding repo

git checkout master

Pull the most recent version to your local version of the repo

git pull --ff-only

Each repo has a dedicated website branch that contains all of the required files to run the website. Checkout this branch and merge the master version of the repo:

git checkout website
git merge master

All of the necessary files related to the website are collected in the docs folder. After updating the branch to the most recent master, change your working directory:

cd docs

In the docs folder, there is an index.qmd file that renders the landing page for each website. There are also other folders, like resources and tutorials, containing the pages from the drop-down menus of the website.
- If you want to edit an existing page, you can edit any existing .qmd file.
- If you want to add a new page, you can use any of the existing .qmd files as a template for your new page. The pages in the subfolders (concepts, resources, tutorials etc) are rendered by default and you don’t need to do anything in order for them to show up on the website. If you are creating a new page at the top level (in the docs folder), you also need to update the _quarto.yml file in order for your new page to show up. Please add your page to the sidebar key under contents.
After applying your modifications, the website can be rebuilt by rendering all project files at once.

Note

Since we have not changed our working directory again, you should still be in the website’s top-level docs folder. If you switched somewhere else, please make sure you are still on the website branch and that docs is your working directory.

quarto render

This will render the updated version of the website and we can now add-commit-push our changes as we normally do:

git add --all
git commit -m "website update" # or add your other informative message
git push

The push to the website branch will automatically trigger GitHub Actions, which normally complete in about 60-90 seconds. Once the actions are completed, you can refresh the website and should be able to see your changes.

Note

For multiple reasons, we do not merge the website changes into the master branch. Once your changes are pushed, there is no need to create a PR.