GAMBLR developer guidelines
To ensure consistency throughout all of the packages in the GAMBLR-verse, please follow these developer guidelines. If you have any questions, please feel free to contact us or submit an issue in the corresponding repository.
Installation for developers
The easiest way to obtain and contribute to GAMBLR is to clone the repository. Here is an example using the GAMBLR repo, but any other repo can be cloned in the same way (just substitute GAMBLR with GAMBLR.{package}, where {package} is one of data, helpers, utils, viz, or results).
cd
git clone git@github.com:morinlab/GAMBLR.git
In your R editor of choice (which is hopefully VS Code), set your working directory to the directory you just cloned the repo into.
Please refer to the tutorial.
setwd("~/GAMBLR")
Install the package in R by running the following command (requires the devtools package):
devtools::install()
After applying your modifications to the code, use the following command to quickly test your changes without directly installing the package (requires the devtools package):
devtools::load_all()
The master branch of all GAMBLR-verse repos is protected. We welcome contributions (pull requests, bug reports, feature requests, PR reviews) from users of all levels. All commits must be submitted via a pull request from a branch. Please refer to the GitHub documentation for details on how to open a pull request, or ask for help in the #git-help Slack channel.
Creating or updating functions
Please always ensure that the new function goes into the corresponding child package according to its intended use. If you are not sure which package the new function belongs to, please check the Frequently Asked Questions.
When designing new functions or updating existing ones, please refer to the guidelines and best practices of R package development detailed here. Always provide the required documentation for any new function. See this section for more details on best practices for documenting R functions. Unsure what information goes where in a function's documentation? For more information, see this.
Name
New functions should follow the GAMBLR-verse convention implemented and expected throughout the packages, to ensure consistency and meet user expectations. The main naming conventions are:
- Use get_ when the function is supposed to retrieve some sort of data
- Use pretty when the function is supposed to generate a plot or a multi-panel figure
- Use collate_ when the function generates sample-level summary values from another result, which will be added as one or more new columns to existing metadata
- Use annotate_ when the function is supposed to annotate the data input with some additional information
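The prefixes above can be checked mechanically. The following is a minimal sketch in base R; check_function_prefix() is a hypothetical helper written only for illustration and is not part of any GAMBLR package.

```r
# Hypothetical helper (not part of GAMBLR): report which naming convention,
# if any, a proposed function name follows.
check_function_prefix = function(function_name) {
  # Prefixes and their meaning, taken from the naming conventions listed above
  conventions = c(
    "get_"      = "retrieves data",
    "pretty"    = "generates a plot or multi-panel figure",
    "collate_"  = "adds sample-level summary columns to metadata",
    "annotate_" = "annotates the input data"
  )
  # startsWith() is vectorised over the prefixes, giving one TRUE/FALSE each
  matched = names(conventions)[startsWith(function_name, names(conventions))]
  if (length(matched) == 0) {
    return(NA_character_)
  }
  unname(conventions[matched[1]])
}

check_function_prefix("get_example_data") # a name following the get_ convention
check_function_prefix("my_function")      # no recognised prefix, returns NA
```

A function name that returns NA here is not necessarily wrong, but it is worth asking whether one of the established prefixes describes it better.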
Title
The title displayed in the documentation is taken from the first sentence. It should be written in sentence case, not end in a full stop, and be followed by a blank line. For example:
- This is a good title
- This Is Not A Proper Title.
The title is shown in various function indexes (e.g. help(package = "some_package")) and is what the user will usually see when browsing multiple functions.
Description
The description displayed in the documentation is taken from the next paragraph. It’s shown at the top of documentation and should briefly describe the most important features of the function.
Details
Additional details are anything after the description. Details are optional, but can be any length, so they are useful if you want to dig deep into some important aspect of the function. Note that, even though the details come right after the description in the introduction, they appear much later in the rendered documentation. If you want to add code to the details, this is also the section to do so. For example, suppose the new function relies on some bash code in order to utilize the GAMBLR code. You can detail such code here by simply adding a code block as you would in a regular markdown file.
Arguments
Detailed argument descriptions should be included for all functions. Remember to state the required data types, default values, if the argument is required or optional, etc.
There are several key concepts underlying the logic behind the naming of the arguments in the GAMBLR-verse. The main terms are:
projection: This is a coordinate system defining the relationship with genome build and chromosome prefixing. The main projections that are expected to be supported throughout the GAMBLR-verse are grch37 and hg38. The grch37 projection uses the same coordinate system as genome build hg19, but never has the "chr" prefix on chromosome names. In contrast, the hg38 projection is always chr-prefixed and is in the same coordinate system as the hg38 genome build. The terms projection and genome_build are essentially synonymous, but projection is meant to reflect the intent of the user wishing to retrieve data relative to a specific genome build (for example, with one of the get_ functions), whereas genome_build is used to tell other functions what projection they should anticipate for their input (for example, annotate_ functions). Any function that accepts bed_data, seg_data or maf_data should theoretically be able to infer genome_build directly using get_genome_build, and developers should strive to use this feature to avoid potential user error.

metadata: This is a data frame with a set of minimal columns defining the biological or clinical characteristics of a sample or a cohort. The argument that defines the metadata should always be called these_samples_metadata. Typically the output of get_gambl_metadata() is provided to this argument, but in any case you can expect the following required columns to be present: patient_id, Tumor_Sample_Barcode, sample_id, seq_type, sex, cohort, and pathology. The main purpose of this data frame is to provide a structure for the metadata that is always expected to be available and that links unique sample identifiers to the associated basic metadata values. The columns Tumor_Sample_Barcode and sample_id are expected to share the same values, but both are required to be present for direct operation on the outputs of different upstream tools. When handling the metadata, you should always refer to the column sample_id for the unique sample identifiers. When implementing a function that utilizes the metadata, you must consider the scenario in which the same sample_id exists in more than one row, but there should never be rows that share the same combination of sample_id and seq_type. Thus, any internal data operation that extends across more than one seq_type must maintain the linkage between the data and its original seq_type. In some cases, this is accomplished simply by adding a column to the output with a self-explanatory name. For example, the maf_data returned by get_all_coding_ssm has a column maf_seq_type.

maf: This is a data frame in the standard maf format defining the simple somatic mutations. The argument that defines the maf should always be called maf_data. Typically the output of get_ssm* is provided to this argument, but in any case you can expect the standard maf columns to be present, including Tumor_Sample_Barcode, Chromosome, Start_Position, End_Position, Variant_Classification, Hugo_Symbol, etc. When handling the maf, you should always refer to the column Tumor_Sample_Barcode for the unique sample identifiers.
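The metadata rules above (required columns present, no repeated combination of sample_id and seq_type) can be sketched as a small base R check. check_metadata() and the toy values below are hypothetical and invented for illustration; they are not part of any GAMBLR package.

```r
# Hypothetical helper (not part of GAMBLR): basic sanity checks on a metadata
# data frame before it is used inside a function.
check_metadata = function(these_samples_metadata) {
  required_columns = c(
    "patient_id", "Tumor_Sample_Barcode", "sample_id",
    "seq_type", "sex", "cohort", "pathology"
  )
  missing_columns = setdiff(required_columns, colnames(these_samples_metadata))
  if (length(missing_columns) > 0) {
    stop("Missing required column(s): ", paste(missing_columns, collapse = ", "))
  }
  # The same sample_id may appear in several rows, but each combination of
  # sample_id and seq_type must be unique.
  id_columns = these_samples_metadata[, c("sample_id", "seq_type")]
  if (any(duplicated(id_columns))) {
    stop("Duplicated sample_id/seq_type combination(s) found")
  }
  invisible(TRUE)
}

# Toy metadata invented for illustration: one sample with two seq_type values
toy_metadata = data.frame(
  patient_id = c("P1", "P1"),
  Tumor_Sample_Barcode = c("S1", "S1"),
  sample_id = c("S1", "S1"),
  seq_type = c("genome", "capture"),
  sex = c("F", "F"),
  cohort = c("example", "example"),
  pathology = c("DLBCL", "DLBCL")
)
check_metadata(toy_metadata) # passes silently: the two rows differ in seq_type
```

Failing early with an explicit message is one possible design; a real function might instead filter or warn, depending on how it is meant to be used.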
Return
Specify the returned object: is it a data frame, a list, a character vector, etc.?
Import (specifying dependencies)
Always import every package outside of base R from which your function calls functions, as well as any R packages that must be loaded in order for your function to execute.
Remember not to import tidyverse. Rather, import the individual packages from tidyverse that the function depends on (dplyr, readr, ggplot2, etc.).
Overall, GAMBLR already depends on many different packages that have already been added as dependencies. This results in very long installation times, frequent installation errors, version conflicts between sub-dependencies, function name clashes, and other issues. For these and other reasons, adding new dependencies is highly discouraged and strongly advised against! DO NOT add any new dependency that is not already a dependency of one of the GAMBLR packages.
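One way to see whether a package is already declared is to read the DESCRIPTION file of the package you are working on. The sketch below uses base R's read.dcf(); declared_dependencies() is a hypothetical helper, and the toy DESCRIPTION is invented for illustration rather than copied from any GAMBLR repo.

```r
# Hypothetical helper (not part of GAMBLR): list the packages a DESCRIPTION
# file already declares under Depends and Imports.
declared_dependencies = function(description_file = "DESCRIPTION") {
  fields = read.dcf(description_file, fields = c("Depends", "Imports"))
  entries = unlist(strsplit(fields[!is.na(fields)], ","))
  # Drop version requirements such as "(>= 1.0.0)" and surrounding whitespace
  entries = trimws(gsub("\\([^)]*\\)", "", entries))
  # Remove empty entries and the R version requirement itself
  setdiff(entries[nzchar(entries)], "R")
}

# Toy DESCRIPTION written to a temporary file, for illustration only
toy_description = tempfile()
writeLines(c(
  "Package: toypackage",
  "Depends: R (>= 3.5.0)",
  "Imports: dplyr, ggplot2 (>= 3.0.0), readr"
), toy_description)

declared = declared_dependencies(toy_description)
"dplyr" %in% declared     # already declared, fine to import
"tidyverse" %in% declared # not declared, so it should not be imported
```

In a cloned repo, calling declared_dependencies() with the working directory set to the package root would read that package's own DESCRIPTION file.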
Export
Should this function be exported to NAMESPACE (i.e. made directly accessible to anyone who loads GAMBLR), or is the function considered to be an internal/helper function? In order to have the function populate NAMESPACE, the developer has to run devtools::document() while in the working directory of the package. All functions that have the @export line in their documentation will be documented and added to NAMESPACE.
Examples
Please provide fully reproducible examples for the function. Ideally, the examples should demonstrate basic usage, as well as more advanced usage with different parameter combinations. Note that examples cannot exceed 100 characters per line, since longer lines will be truncated in the rendered PDF manual. Write your examples so that loading external packages is avoided as much as possible; prioritize base R instead. In some cases, it is undesirable to have a function run its examples. This applies to functions that write files, helper functions, and functions that rely on GSC server access. To prevent such examples from running, simply wrap the example in:
\dontrun{
do_not_run = some_function()
}
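The 100-character limit mentioned above is easy to check before committing. The following is a minimal base R sketch; find_long_lines() is a hypothetical helper (not part of GAMBLR), and counting the whole line including the roxygen prefix is an assumption on the strict side.

```r
# Hypothetical helper (not part of GAMBLR): return the line numbers in a file
# that exceed the maximum allowed width.
find_long_lines = function(file, max_width = 100) {
  lines = readLines(file, warn = FALSE)
  which(nchar(lines) > max_width)
}

# Toy file for illustration: the second line is 120 characters long
toy_file = tempfile(fileext = ".R")
writeLines(c("#' short_example()", strrep("x", 120)), toy_file)
find_long_lines(toy_file)
```

Running the same function on a file in the R/ folder of a cloned repo would list the lines that need to be wrapped.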
Testing New Functions
So you have added a new function (carefully following the steps in the previous section!) and you are obviously extremely proud and eager to test it out (and let others test it). There are two different approaches to do so.
Option 1
Your first option, and likely the preferred route to take, is to make sure that the working directory in RStudio is set to the GAMBLR folder with your updated code and then run devtools::load_all() to load all the functions available in the R/ folder of the same repo. This should make all such functions available to call.
Option 2
As an alternative, you can also run devtools::install() from the updated GAMBLR directory. As the name implies, this will install the complete package along with its dependencies, remotes, etc. Note that if you go with the second option, make sure to restart your R session with .rs.restartR() after installing the package and then load GAMBLR with library(GAMBLR). Now you have installed the updated branch of GAMBLR and are free to call any functions available in the R/ folder.
The behaviour of load_all() and install() is different: make sure you understand what you are doing or please ask for help.
Function Documentation Template
For your convenience, here is a documentation template for GAMBLR functions.
#' @title
#'
#' @description
#'
#' @details
#'
#' @param a_parameter
#' @param another_parameter
#'
#' @return
#'
#' @import
#' @export
#'
#' @examples
#' #this is an example
#' ###For your reference, this line is exactly 100 characters. Do not exceed 100 characters per line
#'
function_name = function(a_parameter,
                         another_parameter){
}
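As an illustration, the template above could be filled in as follows. This is only a sketch: collate_mutation_counts() is a hypothetical function invented for this example and is not part of any GAMBLR package. It uses base R only, so the @import line is left out, and for simplicity it ignores seq_type.

```r
#' @title Collate mutation counts
#'
#' @description Count the mutations per sample and add the result to the metadata.
#'
#' @details This is a hypothetical function written only to illustrate the
#' template. It is not part of any GAMBLR package.
#'
#' @param these_samples_metadata Required. A data frame with metadata, containing
#' at least the column sample_id.
#' @param maf_data Required. A data frame in maf format, containing at least the
#' column Tumor_Sample_Barcode.
#'
#' @return A data frame: the input metadata with one new integer column,
#' mutation_count.
#'
#' @export
#'
#' @examples
#' toy_metadata = data.frame(sample_id = c("S1", "S2"))
#' toy_maf = data.frame(Tumor_Sample_Barcode = c("S1", "S1", "S2"))
#' collate_mutation_counts(toy_metadata, toy_maf)
#'
collate_mutation_counts = function(these_samples_metadata,
                                   maf_data){
  # Number of maf rows per sample identifier
  counts = table(maf_data$Tumor_Sample_Barcode)
  # Look up each metadata sample in the counts; samples without mutations get 0
  idx = match(these_samples_metadata$sample_id, names(counts))
  mutation_count = as.integer(counts)[idx]
  mutation_count[is.na(mutation_count)] = 0L
  these_samples_metadata$mutation_count = mutation_count
  these_samples_metadata
}
```

Note how the title is in sentence case without a full stop, each argument states whether it is required and what it must contain, and the example relies on base R only.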
Updating GAMBLR website
Most of the GAMBLR-verse repos have their own dedicated website with documentation, tutorials, and other information. Regardless of the repo, they all follow a consistent and straightforward approach to updating the website. Please follow these steps:
- Check out the master branch of the corresponding repo:
git checkout master
- Pull the most recent version to your local version of the repo
git pull --ff-only
- Each repo has a dedicated website branch that contains all of the required files to run the website. Check out this branch and merge the master version of the repo:
git checkout website
git merge master
- All of the necessary files related to the website are collected in the docs folder. After updating the branch to the most recent master, we now change our working directory:
cd docs
In the docs folder, there is an index.qmd file that renders the landing page for each website. There are also other folders, like resources and tutorials, containing the pages from the drop-down menus of the website.
- If you want to edit an existing page, you can edit any existing .qmd file.
- If you want to add a new page, you can use any of the existing .qmd files as a template for your new page. The pages in the subfolders (concepts, resources, tutorials, etc.) are rendered by default and you don’t need to do anything in order for them to show up on the website. If you are creating a new page at the top level (in the docs folder), you also need to update the _quarto.yml file in order for your new page to show up. Please add your page to the sidebar key under contents.
After applying your modifications, the website can be rebuilt by rendering all project files at once.
Since we did not change our working directory, you should still be in the website’s top-level docs folder. If you switched somewhere else, please make sure you are still on the website branch and docs is your working directory.
quarto render
- This will render the updated version of the website and we can now add-commit-push our changes as we normally do:
git add --all
git commit -m "website update" # or add your other informative message
git push
- The push to the website branch will automatically trigger GitHub Actions, which normally complete in about 60-90 seconds. Once the actions are completed, you can refresh the website and should be able to see your changes.
For multiple reasons, we do not merge the changes for the website into the master branch. Once your changes are pushed, there is no need to create a PR.