# Running the Pipeline

# Overview

Note

Tempo does not support running samples from mixed sequencing platforms together. By default, the pipeline assumes the inputs are from exome sequencing.

This page provides instructions on how to run the pipeline through the dsl2.nf script. The basic command below shows how to run Tempo, with an explanation of flags, input arguments, and files. This page also describes how best to run the pipeline on Juno as well as on AWS.

nextflow run dsl2.nf \
    --mapping/--bamMapping <input mapping tsv file> \
    --pairing <input pairing tsv file, can be optional> \
    -profile juno \
    --workflows="SNV,qc" \
    --aggregate <true, false, or [a tsv file]>

Note: The number of dashes matters. Nextflow's built-in options take a single dash (e.g. -profile), while pipeline parameters take two (e.g. --mapping).

Required arguments:

  • --mapping/--bamMapping <tsv> is required except when running in --aggregate [tsv] mode. When --mapping [tsv] is provided, FASTQ file paths are expected in the TSV file, and the pipeline will start from FASTQ files and go through all steps to generate BAM files. When --bamMapping [tsv] is provided, BAM file paths are expected in the TSV file. See The Mapping File and Execution Mode for details.
  • --pairing <tsv> is required when somatic and/or germline sub-workflows are enabled. It is not needed when you are running only the BAM generation part, even if qc (or qc together with --aggregate) is enabled. See The Mapping File and Execution Mode for details.
  • -profile loads the preset configuration required to run the pipeline in the supported environment. Accepted values are juno and awsbatch, for execution on the Juno cluster or on AWS Batch, respectively. -profile test_singularity is for testing on Juno.
  • --assayType ensures appropriate resources are allocated for the indicated assay type. Only exome or genome is supported; the default is exome. Note: Please also make sure this value matches the TARGET field in your mapping.tsv file. Available TARGET field values for exome are idt or agilent (they can be mixed); for genome it is wgs.
  • --workflows indicates which sub-workflows should be executed for this run. Possible options are snv, sv, mutsig, germSNV, germSV, lohhla, facets, qc, and msisensor. Multiple arguments can be provided in quotation marks (e.g. --workflows="snv,qc").

Section arguments:

  • --workflows can be run independently as needed; however, when the output of one sub-workflow is required as a dependency of a workflow requested via the --workflows argument, the necessary dependent workflows will be enabled automatically. Note that while sub-workflows can be run independently, specific processes must be run as part of a sub-workflow. See the sub-workflows section for more details.
  • --aggregate <true, false, or [a tsv file]> can be a boolean or a path to a TSV file. The default value is false. A cohort_level/[cohort]/ directory is generated under --outDir [path]. When the boolean value true is given (equivalent to giving just --aggregate), Tempo will aggregate all the samples in the mapping and pairing files as one cohort named "default cohort". When --aggregate <tsv> is given, the pipeline will aggregate samples and tumor/normal pairs based on the value given in the COHORT column; each sample and tumor/normal pair can be assigned to different cohorts in different rows. A sketch of both invocation styles follows this list.
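
As a minimal sketch of both styles (the file names here are hypothetical):

    # Aggregate all samples in the run into the single "default cohort":
    nextflow run dsl2.nf -profile juno \
        --mapping mapping.tsv --pairing pairing.tsv \
        --workflows="snv,qc" --aggregate

    # Aggregate according to the COHORT column of a TSV file:
    nextflow run dsl2.nf -profile juno \
        --mapping mapping.tsv --pairing pairing.tsv \
        --workflows="snv,qc" --aggregate my_cohorts.tsv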

Optional arguments:

  • --outDir is the directory where the output will end up. This directory does not need to exist. If not set, it defaults to the run directory (i.e. the directory from which the nextflow run command is executed).
  • -work-dir/-w is the directory where the temporary output will be cached. By default, this is set to the run directory. Please see NXF_WORK in Nextflow environment variables (opens new window).
  • --publishAll is a boolean, resulting in retention of intermediate output files (default: true).
  • --splitLanes indicates that the provided FASTQ files will be scanned for all unique sequencing lanes and demultiplexed accordingly. This is recommended for some steps of the alignment pipeline. See more under The Mapping File (default: true).
  • -with-timeline and -with-report are enabled by default and result in the generation of a timeline and a resource usage report for the pipeline run. These are booleans but can also be given file names for the respective outputs.
  • --genome is the version of reference files for your analysis. Currently only GRCh37 is supported. We will add support for GRCh38 later. (default: GRCh37)
  • --cosmic is the version of reference for mutational signature analysis. Two options are v2 (COSMIC v2 30 signatures (opens new window)), which is the default, and v3 (COSMIC v3 60 signatures (opens new window)).
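
For illustration, here are several of these optional arguments combined in one call (all paths are hypothetical):

    nextflow run dsl2.nf -profile juno \
        --mapping mapping.tsv --pairing pairing.tsv \
        --workflows="snv,qc" \
        --outDir /path/to/results \
        -w /scratch/tempo_work \
        --genome GRCh37 --cosmic v3 \
        -with-report report.html -with-timeline timeline.html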

Using test inputs provided in the GitHub repository, here is a concrete example:

nextflow run dsl2.nf \
    -profile juno \
    --mapping test_inputs/local/full_test_mapping.tsv \
    --pairing test_inputs/local/full_test_pairing.tsv \
    --workflows="SNV,qc,lohhla" \
    --aggregate true

# Input Files

Note

The header lines are mandatory in the following files, but the order of their columns is not. The necessary header fields must be included, and additional columns are allowed.

Be aware

Tempo checks for the following aspects:

  1. Duplicated rows
  2. Valid headers
  3. Concordance of the given --assayType and the TARGET field in the mapping file (for exome: idt or agilent; for genome: wgs)
  4. Concordance of the TARGET field in the mapping file for tumor/normal pairs defined in the pairing file
  5. File paths in the mapping file are all valid
  6. File extensions in the mapping file
  7. BAI files exist in the same directory as the BAM files in the mapping file (the basename can be either *.bai or *.bam.bai)
  8. Required arguments are provided and valid
  9. Reference files all exist
  10. Samples are present in both the mapping and pairing files
  11. Duplicated files in the mapping file are only detected after alignment is done, so we suggest you do your own validation of your input for this aspect to ensure smooth execution; a sketch of such a check follows this list
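
As a hypothetical pre-flight check (not part of Tempo; it assumes a tab-separated mapping file with FASTQ paths in columns 3 and 4, which may differ since column order is not fixed):

    # Flag any FASTQ path in the mapping file that does not exist on disk:
    awk -F'\t' 'NR>1 {print $3; print $4}' mapping.tsv | while read f; do
        [ -e "$f" ] || echo "missing: $f"
    done

    # Print any fully duplicated rows:
    sort mapping.tsv | uniq -d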

# The Mapping Files

# FASTQ Mapping File (--mapping <tsv>)

For processing paired-end FASTQ inputs, users must provide a mapping file using --mapping <tsv>, as described below.

You do not need --pairing <tsv> when you are running BAM generation alone.

You must give --pairing <tsv> when running any other sub-workflow.

Be aware

Tempo can deal with any number of sequencing lanes per sample, in any combination of lanes split or combined across multiple FASTQ pairs. Different FASTQ pairs for the same sample can be provided on separate lines of the mapping file, using the same SAMPLE ID in the SAMPLE field and repeating the TARGET field. By default, Tempo will look for all distinct sequencing lanes in the provided FASTQ files by scanning each FASTQ read name. The pipeline uses this, together with the instrument, run, and flowcell IDs from the sequence identifiers in the input FASTQs, to generate the different read group IDs for each sample. This information is used by the base quality score recalibration steps of the GATK suite of tools. If a FASTQ file name explicitly specifies the lane in the format _L(\d){3}_ ("_L" + three digits + "_"), the pipeline assumes that file contains only one lane; it will skip scanning and splitting that FASTQ file and assign one read group ID to all of its reads, based on the name of the first read in the file. Please refer to this GATK Forum Page (opens new window) for more details.
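
As a minimal sketch of where those fields live (assuming Illumina-style read names; the file name and read-group format shown are illustrative, not Tempo's exact output):

    # The first four colon-separated fields of an Illumina read name carry
    # the instrument, run, flowcell, and lane:
    zcat normal1_L001_R01.fastq.gz | head -1
    # @A00227:93:H317LDMXX:1:1101:1234:5678 1:N:0:ACGTACGT
    #   instrument:run:flowcell:lane:...

    # Joining those fields yields a per-lane read group identifier:
    zcat normal1_L001_R01.fastq.gz | head -1 | awk -F: '{print substr($1,2)"."$2"."$3"."$4}'
    # A00227.93.H317LDMXX.1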

This file maps one or more input FASTQ pairs to SAMPLE IDs. Additionally, it tells the pipeline which bait set was used (wgs in the case of whole-genome sequencing).

Example:

SAMPLE TARGET FASTQ_PE1 FASTQ_PE2
normal_sample_1 agilent normal1_L001_R01.fastq.gz normal1_L001_R02.fastq.gz
normal_sample_1 agilent normal1_L002_R01.fastq.gz normal1_L002_R02.fastq.gz
tumor_sample_1 agilent tumor1_L001_R01.fastq.gz tumor1_L001_R02.fastq.gz
tumor_sample_1 agilent ... ...
tumor_sample_1 agilent tumor1_L00N_R01.fastq.gz tumor1_L00N_R02.fastq.gz

Accepted values for the TARGET column are agilent, idt, or wgs. Please note that idt and agilent can be mixed and are valid only when --assayType is exome, whereas wgs cannot be mixed with any other value and is valid only when --assayType is genome.

Read further details on these parameters here.

# BAM Mapping File (--bamMapping <tsv>)

If the user is using pre-processed BAMs, the input TSV file has a format similar to the FASTQ mapping TSV file, with the slight differences shown below.

You must give --pairing <tsv> and specify at least one sub-workflow when beginning with a BAM mapping file.

Example:

SAMPLE TARGET BAM BAI
normal_sample_1 agilent normal1.bam normal1.bai
normal_sample_2 agilent normal2.bam normal2.bai
tumor_sample_1 agilent ... ...
tumor_sample_2 agilent tumor2.bam tumor2.bai

The --pairing <tsv> file is exactly the same as when using the FASTQ mapping TSV file, and is described below.

Note

The pipeline expects BAM file indices in the same subdirectories as the BAM files. If the index files *.bai or *.bam.bai do not exist, dsl2.nf will throw an error. The BAI column in the BAM mapping TSV file is not actually used.

Unlike the FASTQ mapping TSV, in this file each SAMPLE ID may appear only once, meaning the pipeline will not combine different BAMs for the same sample. If one sample has multiple BAMs, merge them yourself before building the mapping file.
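
A minimal sketch of such a merge using samtools (file names are hypothetical):

    # Merge per-run BAMs for one sample into a single input BAM:
    samtools merge tumor1.merged.bam tumor1.runA.bam tumor1.runB.bam
    samtools index tumor1.merged.bam   # writes tumor1.merged.bam.bai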

# The Pairing File (--pairing <tsv>)

The pipeline needs to know which tumor and normal samples are to be analyzed as matched pairs. This file provides that pairing by referring to the sample names as provided in the SAMPLE column in the mapping file.

You do not need --pairing <tsv> when you are running only the alignment sub-workflow.

Example:

NORMAL_ID TUMOR_ID
normal_sample_1 tumor_sample_1
normal_sample_2 tumor_sample_2
... ...
normal_sample_n tumor_sample_n

# Aggregate File (--aggregate true/false/<tsv>)

  • When the boolean value true is given (equivalent to giving just --aggregate), Tempo will aggregate all the samples in the mapping and pairing files as one cohort named "default cohort".
  • When --aggregate <tsv> is given, the pipeline will aggregate samples and tumor/normal pairs based on the value given in the COHORT column. Each sample and tumor/normal pair can be assigned to different cohorts in different rows.
  • When running in aggregation-only mode, the PATH column must be provided to point the pipeline at the Tempo result directory for each sample and tumor/normal pair (only up to a Tempo-produced output folder).

Example:

NORMAL_ID TUMOR_ID COHORT PATH (only applicable when running aggregate-only mode)
normal_sample_1 tumor_sample_1 cohort1 /home/tempo/v1/result
normal_sample_2 tumor_sample_2 cohort1 /home/tempo/v1/result
normal_sample_3 tumor_sample_3 ... ...
normal_sample_n tumor_sample_n cohort2 /home/tempo/v2/result


# Execution Mode

Tempo supports a variety of execution modes. Specific details can be found in the sub-workflows section. Additionally, there are special cases under which Tempo can run.

# --mapping <tsv> only

When no additional sub-workflow arguments are given, only alignment steps will be performed.
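
For example (the mapping file name is hypothetical):

    nextflow run dsl2.nf -profile juno --mapping mapping.tsv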

# --mapping/--bamMapping <tsv> and --pairing <tsv>

This section describes two modes.

When --mapping <tsv> is given, the pipeline will use FASTQ input and start from alignment steps.

When --bamMapping <tsv> is given, the pipeline will use BAM input and skip alignment steps.

When no additional sub-workflow arguments are given, the pipeline will throw an error indicating that --pairing <tsv> is not used.
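
A sketch of the BAM-input variant with sub-workflows specified (file names are hypothetical):

    nextflow run dsl2.nf -profile juno \
        --bamMapping bam_mapping.tsv --pairing pairing.tsv \
        --workflows="snv,qc"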

# --aggregate <tsv> only

This mode can only be run when the Tempo-produced output structure path is provided in the PATH column of --aggregate <tsv>. It relies explicitly on the output structure auto-generated by Tempo (only up to the parent folder of a Tempo-generated output directory) to identify how and which files need to be aggregated together as a cohort-level result under the folder cohort_level/[cohort]. Please refer to Outputs for more detail.

When using this mode, no sub-workflow arguments need to be given.
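
For example (the TSV name is hypothetical; the file must contain the PATH column described above):

    nextflow run dsl2.nf -profile juno --aggregate aggregate.tsv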

# Running the Pipeline on Juno

Note

First follow the instructions to set up your environment on Juno.

# Submitting the Pipeline to LSF

We recommend submitting your nextflow run dsl2.nf <...> command to the cluster via bsub, which will launch a leader job from which individual processes are submitted as jobs to the cluster.

bsub -W <hh:mm> -n 2 -R "rusage[mem=<requested memory>]" \
    -o <LSF output file name>.out -e <LSF error file name>.err \
    nextflow run dsl2.nf -profile juno <...> 

We recommend that users check the documentation for LSF (opens new window) to clarify each of the arguments above. In brief:

  • -W <hh:mm> sets the time allotted for nextflow run dsl2.nf to run to completion.
  • -n 2 requests two job slots. This should be sufficient for nextflow run dsl2.nf.
  • -o <LSF output file name>.out is the name of the STDOUT file, which is quite informative for Nextflow. We strongly encourage users to set this.
  • -e <LSF error file name>.err is the name of the STDERR file. Please set this.
  • -R "rusage[mem=<requested memory>]" is the requested memory for nextflow run dsl2.nf, which is not memory-intensive.

Here is a concrete example of a bsub command to process 25 WES tumor/normal pairs, running the SNV, QC, and LOHHLA modules:

    bsub -W 80:00 -n 2 -R "rusage[mem=8]" \
    -o nf_output.out \
    -e nf_output.err \
    nextflow run <path-to-repository>/dsl2.nf \
    --mapping test_inputs/local/WES_25TN.tsv \
    --pairing test_inputs/local/WES_25TN_pairing.tsv \
    --outDir results \
    -profile juno \
    --workflows="SNV,qc,lohhla" \
    --aggregate true

Be aware

Whereas a few exome samples finish within a few hours, larger batches and genomes will take days. Allow for this by setting -W to a generous number of hours. The pipeline will die if the leader job does, but it can be resumed subsequently.

# Running From a screen Session

Another option is to use a screen session for running the pipeline interactively, for example naming and entering a screen session as follows:

screen -RD new_screen_name

It is normally not a good idea to run things on the log-in nodes of the cluster. Instead, we recommend scheduling an interactive session via e.g. bsub -Is -n 2 -R "rusage[mem=8]" csh and running screen within that session.

Users are welcome to use nohup or tmux as well.
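
A sketch of that workflow, run interactively one line at a time (the session name is illustrative):

    # Request an interactive session on a compute node:
    bsub -Is -n 2 -R "rusage[mem=8]" csh
    # Start a named screen session on that node and launch the pipeline inside it:
    screen -RD tempo_run
    nextflow run dsl2.nf -profile juno --mapping mapping.tsv --pairing pairing.tsv
    # Detach with Ctrl-a d; reattach later with: screen -RD tempo_run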

# Running the Pipeline on AWS

Note

These instructions will assume the user is moderately knowledgeable of AWS. Please refer to AWS Setup and the AWS Glossary we have curated.

# Modifying or Resuming Pipeline Run

Nextflow supports modify and resume (opens new window).

To resume an interrupted Nextflow pipeline run, add -resume (note the single dash) to your command-line call to access the cache history of Nextflow and continue a job from where it left off. This will trigger a check of which jobs already completed before starting unfinished jobs in the pipeline.

This function also allows you to make changes to values in the dsl2.nf script and continue from where you left off. Nextflow will use the cached information from the unchanged sections while running only the modified processes. If you want to re-run processes that already completed successfully, you have to manually delete the subdirectories of work in which those processes ran.
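
For example (the work subdirectory hash is hypothetical):

    # Resume after editing the pipeline:
    nextflow run dsl2.nf -profile juno --mapping mapping.tsv --pairing pairing.tsv -resume

    # Force re-execution of an already-completed process by removing its
    # cached work subdirectory (the hash-named path shown in the Nextflow log):
    rm -rf work/a1/b2c3d4e5f6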

Note

  • If you use -resume on the first run of a pipeline, Nextflow will recognize it as superfluous and continue.
  • To peacefully interrupt an ongoing Nextflow pipeline run, do control+C once and wait for Nextflow to kill submitted jobs. Otherwise orphan jobs might be left on the cluster.

To resume the pipeline from a specific run, please read the pages on using resume (opens new window) as well as troubleshooting resumed runs (opens new window) for more complicated use cases.

To resume from a specific earlier run, first list the previous runs with nextflow log:

> nextflow log

TIMESTAMP            DURATION  RUN NAME          STATUS  REVISION ID  SESSION ID                            COMMAND                                    
2019-05-06 12:07:32  1.2s      focused_carson    ERR     a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run hello                         
2019-05-06 12:08:33  21.1s     mighty_boyd       OK      a9012339ce   7363b3f0-09ac-495b-a947-28cf430d0b85  nextflow run rnaseq-nf -with-docker        
2019-05-06 12:31:15  1.2s      insane_celsius    ERR     b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf                     
2019-05-06 12:31:24  17s       stupefied_euclid  OK      b9aefc67b4   4dc656d2-c410-44c8-bc32-7dd0ea87bebf  nextflow run rnaseq-nf -resume -with-docker

Users can then restart the pipeline from a specific run, using either the RUN NAME or the SESSION ID. For instance:

> nextflow run rnaseq-nf -resume mighty_boyd

or equivalently

> nextflow run rnaseq-nf -resume 4dc656d2-c410-44c8-bc32-7dd0ea87bebf

Sometimes the resume feature may not work entirely as expected, as described in the troubleshooting tips on the Nextflow blog (opens new window).

# After A Successful Run

Nextflow generates many intermediate output files. All the relevant output data should be in the directory given to the --outDir argument. Once you have verified that the data are satisfactory, everything outside this directory can be removed. In particular, the work directory contains all intermediate output files, which take up a great deal of disk space and should be removed. The nextflow clean -force command removes these cached files; see nextflow clean -help for further options.
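
For example:

    # After verifying the outputs under --outDir:
    nextflow clean -force    # removes the cached intermediate/work files
    nextflow clean -help     # lists further options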

Be aware

Once these files are removed, modifications to or resumption of a pipeline run cannot be done.