rna-seek run

1. About

The rna-seek executable is composed of several inter-related sub commands. Please see rna-seek -h for all available options.

This part of the documentation describes options and concepts for the rna-seek run sub command in more detail. With minimal configuration, the run sub command enables you to start running the data processing and quality-control pipeline.

Setting up the RNA-seek pipeline is fast and easy! In its most basic form, rna-seek run only has three required inputs.

2. Synopsis

$ rna-seek run [--help] \
            [--batch-id BATCH_ID] \
            [--groups GROUPS] [--contrasts CONTRASTS] \
            [--covariates COVARIATE [COVARIATE ...]] \
            [--call-gene-fusions] [--prokaryote] \
            [--small-rna] [--star-2-pass-basic] \
            [--dry-run] [--mode {slurm, local}] \
            [--shared-resources SHARED_RESOURCES] \
            [--singularity-cache SINGULARITY_CACHE] \
            [--sif-cache SIF_CACHE] \
            [--tmp-dir TMP_DIR] \
            [--threads THREADS] \
            --input INPUT [INPUT ...] \
            --output OUTPUT \
            --genome {hg38_48, mm10_M25, mm39_M37, custom.json}

The synopsis for each command shows its parameters and their usage. Optional parameters are shown in square brackets.

A user must provide a list of FastQ files (globbing is supported) to analyze via the --input argument, an output directory to store results via the --output argument, and a reference genome for alignment and annotation via the --genome argument. If you are running the pipeline outside of Biowulf, you will additionally need to provide the following options: --shared-resources and --tmp-dir. More information about each of these options can be found below.

To run differential expression analyses, a groups and contrasts file must be provided. Please see the downstream analysis section below for more information about how to create these files.

You can always use the -h option for information on a specific sub command.

2.1 Required Arguments

Each of the following arguments is required. Failure to provide a required argument will result in a non-zero exit code.

--input INPUT [INPUT ...]

Input FastQ file(s) to process.
type: file

One or more FastQ files can be provided. From the command line, each FastQ file should be separated by a space. Globbing is supported, which makes selecting FastQ files easier. Input FastQ files should be gzipped. The pipeline supports single-end and paired-end RNA-seq data; however, it will not process a mixture of single-end and paired-end samples together. If you have a mixture of single-end and paired-end samples, please process them as two separate instances of the RNA-seek pipeline (with two separate output directories).

Example: --input .tests/*.R?.fastq.gz
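Because mixed layouts must be split into separate runs, it can help to check a set of input files before launching the pipeline. The sketch below is a hypothetical helper (classify_layout is not part of rna-seek). It assumes the <sample>.R1.fastq.gz / <sample>.R2.fastq.gz naming used by the test data; other naming schemes would need a different pattern.

```shell
# Hypothetical helper, not part of rna-seek.
# Assumes mates are named <sample>.R1.fastq.gz and <sample>.R2.fastq.gz.
classify_layout() (
    paired=0
    single=0
    for f in "$@"; do
        case "$f" in
            *.R1.fastq.gz) ;;      # only R1 files identify a sample
            *) continue ;;
        esac
        mate="${f%.R1.fastq.gz}.R2.fastq.gz"
        # A sample counts as paired-end if its R2 mate is also in the list
        if printf '%s\n' "$@" | grep -Fxq -- "$mate"; then
            paired=$((paired + 1))
        else
            single=$((single + 1))
        fi
    done
    if [ "$paired" -gt 0 ] && [ "$single" -gt 0 ]; then
        echo "mixed"
    elif [ "$paired" -gt 0 ]; then
        echo "paired-end"
    elif [ "$single" -gt 0 ]; then
        echo "single-end"
    else
        echo "unknown"
    fi
)
```

For example, classify_layout .tests/*.R?.fastq.gz prints mixed when some samples have an R2 mate and others do not, which signals that the inputs should be split into two runs.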


--output OUTPUT

Path to an output directory.
type: path

This location is where the pipeline will create all of its output files, also known as the pipeline's working directory. If the provided output directory does not exist, it will be initialized automatically.

Example: --output /data/$USER/RNA_hg38


--genome {hg38_48, mm10_M25, mm39_M37, custom.json}

Reference genome.
type: string or file

This option defines the reference genome for your set of samples. On Biowulf, RNA-seek comes bundled with pre-built reference files for human and mouse samples; however, the pipeline also accepts a custom reference genome built with the build sub command. Building a new reference genome is easy! You can create a custom reference genome with a single command. This is extremely useful when working with non-model organisms. New users can reference the documentation's getting started section to see how a reference genome is built.

Pre-built Option
Here is a list of available pre-built genomes on Biowulf: hg38_48, mm10_M25, mm39_M37. Please see the resources page for more information about each pre-built option. Note that hg38_30 and mm10_M21 can still be selected; they have only been hidden from the command-line interface to encourage users to select newer reference genomes.

Custom Option
A user can also supply a custom reference genome built with the build sub command. Please supply the custom reference JSON file that was generated by the build sub command. The name of this JSON file depends on the values provided to the rna-seek build arguments --ref-name REF_NAME and --gtf-ver GTF_VER: the file is named {REF_NAME}_{GTF_VER}.json.

Example: --genome hg38_48 OR --genome /data/${USER}/hg38_36/hg38_36.json
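As a small illustration of the naming rule above, the path to a custom reference JSON can be assembled from the values given to rna-seek build. The values below are placeholders, and the build output directory (BUILD_DIR) is an assumption made for this sketch; substitute whatever was used in your own build.

```shell
# Placeholder values; use whatever was passed to `rna-seek build`
REF_NAME="hg38"
GTF_VER="36"
BUILD_DIR="/data/refs/hg38_36"   # assumed location of the build output

# The JSON file is named {REF_NAME}_{GTF_VER}.json
GENOME_JSON="${BUILD_DIR}/${REF_NAME}_${GTF_VER}.json"
echo "--genome ${GENOME_JSON}"
```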

2.2 Analysis Options

--call-gene-fusions

Call gene fusions.
type: boolean

A gene fusion is a genetic alteration where two different genes become joined together, forming a new hybrid gene. If this option is provided, an extra set of steps will run to detect gene fusions with arriba. These steps only run when reference files for arriba are available. Currently, these reference files exist for hg38, hg19, and mm10. Please note that if these reference files are missing, gene fusions will not be called.

Example: --call-gene-fusions


--prokaryote

Run with prokaryotic genome alignment options.
type: boolean

Prokaryotic genomes, such as those of bacteria, do not contain introns. If provided, this option uses a set of alignment options optimized for prokaryotic genomes: it forces STAR to avoid spliced alignments and runs STAR in 2-pass basic mode. By default, the pipeline is set up for alignment against eukaryotic genomes, so this option should be provided if you are working with a prokaryotic genome. This option should not be combined with the small RNA option.

Example: --prokaryote


--small-rna

Run STAR using ENCODE's recommendations for small RNA.
type: boolean

This option should only be used with small RNA libraries. These are rRNA-depleted libraries that have been size selected to contain fragments shorter than 200 bp. Size selection enriches for small RNA species such as miRNAs, siRNAs, or piRNAs. This option should not be combined with the --star-2-pass-basic option. If the two options are combined, STAR will run in 2-pass basic mode, which means it will not run with ENCODE's recommendations for small RNA alignment. As such, please take care not to combine the two options.

Please note: This option is only supported with single-end data.

Example: --small-rna


--star-2-pass-basic

Run STAR in per sample 2-pass mapping mode.
type: boolean

It is recommended to use this option when processing a set of unrelated samples or when processing samples in a clinical setting. It is not advised to use this option for a study with multiple related samples.

By default, the pipeline utilizes a multi-sample 2-pass mapping approach where the set of splice junctions detected across all samples is provided to the second pass of STAR. This option overrides the default behavior so each sample is processed in a per-sample 2-pass basic mode. This option should not be combined with the small RNA option. If the two options are combined, STAR will run in 2-pass basic mode.

Example: --star-2-pass-basic

2.3 Downstream analysis options

Each of the following options can optionally be provided to run differential expression and pathway (coming soon) analyses. Please note that if these options are not provided, the pipeline will still run all pre-processing and QC steps.

--batch-id BATCH_ID

Batch ID for downstream analysis.
type: string
default: none

This option can be provided to ensure that downstream analysis output files are not overwritten between different runs of the pipeline. This can occur after updating the groups file with additional covariates or dropping samples. By default, project-level files in the differential expression and pathway analysis folders could be overwritten between pipeline runs if this option is not provided. The output directory name for a given contrast will resolve to {group1}_vs_{group2} within the differential_gene_expression folder. As such, if the groups file is updated to remove samples or add covariates without updating the group names, a new run could overwrite the previous analysis output files. Any identifier provided to this option will be used to create a sub directory in the differential expression folder. This ensures project-level files (which are unique) will not be overwritten. For that reason, it is always a good idea to provide this option with a unique batch ID for each run. The batch ID should be composed of alphanumeric characters, hyphens, or underscores, and it should not contain spaces or tab characters. Valid characters are: a-z, A-Z, 0-9, -, _.

Example: --batch-id 2025_04_08
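The character rules above can be checked before a run. The following is a minimal sketch of such a check; is_valid_batch_id is a hypothetical helper and not something rna-seek provides, and the pipeline may apply its own validation.

```shell
# Hypothetical helper, not part of rna-seek.
# Succeeds only for non-empty IDs made of letters, digits, hyphens, or underscores.
is_valid_batch_id() {
    case "$1" in
        '' | *[!A-Za-z0-9_-]*) return 1 ;;   # empty, or contains a disallowed character
        *) return 0 ;;
    esac
}

batch_id="2025_04_08"
if is_valid_batch_id "$batch_id"; then
    echo "ok: --batch-id $batch_id"
else
    echo "invalid batch ID: $batch_id" >&2
fi
```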


--groups GROUPS

Groups file containing sample metadata.
type: TSV file
default: none

Path to a sample sheet in TSV format with important information for differential expression. This file is used to map the sample names to groups and other metadata such as covariates (optional). At a minimum, this file contains two columns containing the basename of the sample along with its group label. Additional columns can be optionally added to define covariates. To run differential expression analyses, a groups and contrasts file must be provided together. Please see the example below for more information about how to create this file.

Here is a more detailed description of the sample sheet:
• Sample: This is the basename and prefix of a sample's R1 FastQ file, e.g. /path/WT_S4.R1.fastq.gz becomes WT_S4 in this file. Please note that a sample should only be listed once in the groups file, and that there should be a 1:1 relationship between samples and groups. This is a required column.
• Group: This is the group label for a given sample. A group can represent an experimental condition, a treatment, a timepoint, a disease state, etc. Group names must start with a letter and should then only be composed of alphanumeric characters (a-z, A-Z, 0-9), periods (.), or underscores (_). Hyphens (-) are not allowed in group names! This is a required column.
• Additional Columns (optional): Any additional columns to define covariates. The column names here must be unique and follow the same naming rules as group names. These additional column names can then be provided to the --covariates option. If an additional column is provided but is not supplied to the --covariates option, it will only be used for exploratory plots. Please see that option for more information.

Contents of example groups file:

Sample    Group   Batch
WT_S1 LPS_WT  B1
WT_S2 LPS_WT  B2
WT_S3 LPS_WT  B1
WT_S4 LPS_WT  B2
WT_S5 Veh_WT  B1
WT_S6 Veh_WT  B2
WT_S7 Veh_WT  B1
WT_S8 Veh_WT  B2
KO_S9 LPS_KO  B1
KO_S10    LPS_KO  B2
KO_S11    LPS_KO  B1
KO_S12    LPS_KO  B2
KO_S13    Veh_KO  B1
KO_S14    Veh_KO  B2
KO_S15    Veh_KO  B1
KO_S16    Veh_KO  B2

where:
• Sample is the base name of each sample's input R1 FastQ file without the .R1.fastq.gz file extension.
• Group represents each sample's group name(s).
• Any additional columns are optional covariates. Only the Sample and Group columns are required.

Example: --groups .tests/groups.tsv
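The rules for the groups file (a header line, one row per sample, and group names that start with a letter and contain only letters, digits, periods, or underscores) can be checked with a short awk script. This is a sketch under the assumption that the file is tab-delimited with Sample and Group as the first two columns; check_groups is a hypothetical helper, not an rna-seek command.

```shell
# Hypothetical helper, not part of rna-seek.
# Usage: check_groups groups.tsv   (exit status 0 means no problems were found)
check_groups() {
    awk -F '\t' '
        NR == 1 { next }              # skip the header line
        NF == 0 { next }              # ignore blank lines
        NF < 2 {
            printf "line %d: expected at least 2 tab-separated columns\n", NR
            bad = 1
            next
        }
        seen[$1]++ {
            printf "line %d: sample %s is listed more than once\n", NR, $1
            bad = 1
        }
        $2 !~ /^[A-Za-z][A-Za-z0-9._]*$/ {
            printf "line %d: invalid group name %s\n", NR, $2
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running check_groups .tests/groups.tsv prints one message per problem and returns a non-zero exit status if any were found.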


--contrasts CONTRASTS

Contrasts file containing comparisons to make.
type: TSV file
default: none

This tab-delimited file is used to set up comparisons between different groups of samples. Please see the --groups option above for more information about how to define groups for a set of samples. In its most basic form, this file consists of two columns containing the names of the two groups to compare. The names defined in this file must also exist in the groups file. The second column represents the baseline group. Complex comparisons can also be made: the last line in the example contrasts file below contains a difference of differences (diff-diff) comparison. For diff-diff comparisons, please wrap each diff in parentheses. To perform differential expression analysis, a groups file and a contrasts file must both be provided. Please note that unlike the groups file, the contrasts file does not contain a header line.

Contents of example contrasts file:

LPS_KO    Veh_KO
LPS_WT    Veh_WT
LPS_KO    LPS_WT
Veh_KO    Veh_WT
(LPS_KO-Veh_KO)   (LPS_WT-Veh_WT)

where:
• Any groups listed in this file must exist in the --groups file.
• The second column represents the baseline of the comparison, i.e. Veh_KO on the first line in the example above.

Example: --contrasts .tests/contrasts.tsv
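Since every group named in the contrasts file must also exist in the groups file, a quick cross-check can catch typos before a run. The sketch below assumes both files are tab-delimited, that the groups file has a header and stores the group label in its second column, and that diff-diff terms are written as (A-B). check_contrasts is a hypothetical helper, not an rna-seek command.

```shell
# Hypothetical helper, not part of rna-seek.
# Usage: check_contrasts groups.tsv contrasts.tsv
check_contrasts() {
    awk -F '\t' '
        # First file: collect group labels, skipping the header line
        NR == FNR { if (FNR > 1 && NF >= 2) groups[$2] = 1; next }
        NF == 0 { next }              # ignore blank lines
        NF < 2 {
            printf "line %d: expected 2 tab-separated columns\n", FNR
            bad = 1
            next
        }
        {
            for (i = 1; i <= 2; i++) {
                term = $i
                gsub(/[()]/, "", term)            # strip diff-diff parentheses
                n = split(term, parts, "-")       # hyphens only separate groups
                for (j = 1; j <= n; j++)
                    if (!(parts[j] in groups)) {
                        printf "line %d: unknown group %s\n", FNR, parts[j]
                        bad = 1
                    }
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1" "$2"
}
```

Splitting on hyphens is safe here only because hyphens are not allowed in group names.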


--covariates COVARIATE [COVARIATE ...]

Covariates for differential analyses.
type: string
default: none

This option allows the user to define and adjust for any covariates that may be present. Covariates are known biological (e.g. sex, disease stage, high-/low-BMI) or technical (e.g. batch, high-/low-RIN) sources of variation that can confound the results of differential expression analysis. If you have more than one covariate, please provide a space-separated list of covariates (e.g. --covariates Batch Site) to this option. Covariates must match the header names of any Additional Columns in the groups file. Covariate values should be encoded as categorical variables. Numerical covariates are currently not supported.

Example: --covariates Batch
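Because each covariate must match a header name among the additional columns of the groups file, the names can be compared against the header before running. This sketch assumes a tab-delimited groups file whose first two columns are Sample and Group; check_covariates is a hypothetical helper, not part of rna-seek.

```shell
# Hypothetical helper, not part of rna-seek.
# Usage: check_covariates groups.tsv Batch Site
check_covariates() (
    groups_file=$1
    shift
    # Header columns after Sample and Group are the candidate covariates
    extra_cols=$(head -n 1 "$groups_file" | tr '\t' '\n' | tail -n +3)
    status=0
    for cov in "$@"; do
        if ! printf '%s\n' "$extra_cols" | grep -Fxq -- "$cov"; then
            echo "covariate not found in groups file header: $cov" >&2
            status=1
        fi
    done
    exit "$status"
)
```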

2.4 Orchestration Options

Each of the following arguments is optional and does not need to be provided.

--dry-run

Dry run the pipeline.
type: boolean

Displays what steps in the pipeline remain or will be run. Does not execute anything!

Example: --dry-run


--mode {slurm,local}

Execution method.
type: string
default: slurm

Defines the mode or method of execution. Valid mode options include: slurm or local.

local
Local executions run serially on the compute instance. This is useful for testing, debugging, or when a user does not have access to a high-performance computing environment. Local mode must be requested explicitly with --mode local; if --mode is not provided, the pipeline defaults to slurm.

slurm
The slurm execution method will submit jobs to a cluster using a slurm + singularity backend. This method will automatically submit the master job to the cluster. We recommend running RNA-seek in this mode, as execution will be significantly faster in a distributed environment.

Example: --mode slurm


--shared-resources SHARED_RESOURCES

Local path to shared resources.
type: path

The pipeline uses a set of shared reference files that can be re-used across reference genomes. These currently include reference files for kraken and FQScreen. These reference files can be downloaded with the build sub command's --shared-resources option, and they only need to be downloaded once. We recommend storing these files in a shared location on the filesystem that other people can access. If you are running the pipeline on Biowulf, you do NOT need to download these reference files! They already exist on the filesystem in a location that anyone can access. If you are running the pipeline on another cluster or target system, you will need to download the shared resources with the build sub command, and you will need to provide this option every time you run the pipeline. Please provide the same path that was provided to the build sub command's --shared-resources option. Again, if you are running the pipeline on Biowulf, you do NOT need to provide this option. For more information about how to download shared resources, please reference the build sub command's --shared-resources option.

Example: --shared-resources /data/shared/rna-seek


--singularity-cache SINGULARITY_CACHE

Overrides the $SINGULARITY_CACHEDIR environment variable.
type: path
default: --output OUTPUT/.singularity

Singularity caches image layers pulled from remote registries. This speeds up pulling an image from DockerHub when its layers already exist in the Singularity cache directory. By default, the cache is set to a .singularity directory within the path provided to the --output argument. Please note that this cache cannot be shared across users: Singularity strictly enforces that you own the cache directory and will return a non-zero exit code if you do not. See the --sif-cache option to create a shareable resource.

Example: --singularity-cache /data/$USER/.singularity


--sif-cache SIF_CACHE

Path where a local cache of SIFs is stored.
type: path

Uses a local cache of SIFs on the filesystem. This SIF cache can be shared across users if permissions are set correctly. If a SIF does not exist in the SIF cache, the image will be pulled from DockerHub and a warning message will be displayed. The rna-seek cache sub command can be used to create a local SIF cache. Please see rna-seek cache for more information. This command is extremely useful for avoiding DockerHub pull rate limits. It also removes potential errors caused by network issues or DockerHub being temporarily unavailable. We recommend running RNA-seek with this option whenever possible.

Example: --sif-cache /data/$USER/SIFs


--tmp-dir TMP_DIR

Path on the file system for writing temporary files.
type: path
default: /lscratch/$SLURM_JOBID

This is a path on the file system for writing temporary output files. By default, the temporary directory is set to '/lscratch/$SLURM_JOBID' for backwards compatibility with the NIH's Biowulf cluster; however, if you are running the pipeline on another cluster, this option will need to be specified. Ideally, this path should point to a dedicated location on the filesystem for writing temporary files. On many systems, this location is set to somewhere in /scratch. If you need to inject a variable into this string that should NOT be expanded, please wrap this option's value in single quotes. Again, if you are running the pipeline on Biowulf, you do NOT need to provide this option.

Example: --tmp-dir /cluster_scratch/$USER/
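The difference between quoting styles is standard shell behavior and can be seen in a short sketch. The SLURM_JOBID value below is a stand-in set only for illustration; on a real cluster the scheduler sets it inside each job.

```shell
SLURM_JOBID=12345   # stand-in value, for illustration only

# Double quotes (or no quotes): the shell expands the variable immediately
expanded="/lscratch/$SLURM_JOBID"
# Single quotes: the text is passed through unchanged
literal='/lscratch/$SLURM_JOBID'

printf 'double-quoted: %s\n' "$expanded"
printf 'single-quoted: %s\n' "$literal"
```

Passing the single-quoted form to --tmp-dir hands the pipeline the literal string, so the variable is not replaced by the shell that launches rna-seek.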


--threads THREADS

Max number of threads for each process.
type: int
default: 2

Max number of threads for each process. This option is more applicable when running the pipeline with --mode local. We recommend setting this value to the maximum number of CPUs available on the host machine.

Example: --threads 12
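One way to find the number of CPUs on the host is sketched below. It relies on nproc (GNU coreutils) with getconf as a fallback; neither is specific to rna-seek, and the reported count may be lower than the physical total when the process is restricted to a subset of CPUs.

```shell
# nproc is part of GNU coreutils; getconf is a fallback on other systems
threads=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "--mode local --threads ${threads}"
```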

2.5 Misc Options

Each of the following arguments is optional and does not need to be provided.

-h, --help

Display Help.
type: boolean

Shows the command's synopsis, help message, and an example command.

Example: --help

3. Example

3.1 Biowulf

On Biowulf, getting started with the pipeline is fast and easy! The pipeline comes bundled with pre-built human and mouse reference genomes. In the example below, we will use the pre-built human reference genome.

# Step 0.) Grab an interactive node (do not run on head node)
srun -N 1 -n 1 --time=12:00:00 -p interactive --mem=8gb  --cpus-per-task=4 --pty bash
module purge
module load singularity snakemake

# Step 1.) Dry run pipeline with provided test data
./rna-seek run --input .tests/*.R?.fastq.gz \
               --output /data/$USER/RNA_hg38 \
               --genome hg38_48 \
               --mode slurm \
               --star-2-pass-basic \
               --sif-cache /data/OpenOmics/SIFs/ \
               --dry-run

# Step 2.) Run RNA-seek pipeline
# The slurm mode will submit jobs to the cluster.
# Running rna-seek in this mode is recommended.
./rna-seek run --input .tests/*.R?.fastq.gz \
               --output /data/$USER/RNA_hg38 \
               --genome hg38_48 \
               --mode slurm \
               --sif-cache /data/OpenOmics/SIFs/ \
               --star-2-pass-basic

3.2 Generic SLURM Cluster

Running the pipeline outside of Biowulf is easy; however, there are a few extra steps you must first take. Before getting started, you will need to build reference files for the pipeline. Please note when running the build sub command for the first time, you will also need to provide the --shared-resources option. This option will download our kraken2 database and bowtie2 indices for FastQ Screen. The path provided to this option should be provided to the --shared-resources option of the run sub command. Next, you will also need to provide a path to write temporary output files via the --tmp-dir option. We also recommend providing a path to a SIF cache. You can cache software containers locally with the cache sub command.

# Step 0.) Grab an interactive node (do not run on head node)
srun -N 1 -n 1 --time=2:00:00 -p interactive --mem=8gb  --cpus-per-task=4 --pty bash
# Add snakemake and singularity to $PATH,
# This step may vary across clusters, you
# can reach out to a sys admin if snakemake
# and singularity are not installed.
module purge
module load singularity snakemake

# Step 1.) Dry run pipeline with provided test data
./rna-seek run --input .tests/*.R?.fastq.gz \
               --output /data/$USER/RNA_hg38 \
               --genome /data/$USER/hg38_36/hg38_36.json \
               --mode slurm \
               --sif-cache /data/$USER/cache \
               --star-2-pass-basic \
               --shared-resources /data/shared/rna-seek \
               --tmp-dir /cluster_scratch/$USER/ \
               --dry-run

# Step 2.) Run RNA-seek pipeline
# The slurm mode will submit jobs to the cluster.
# Running rna-seek in this mode is recommended.
./rna-seek run --input .tests/*.R?.fastq.gz \
               --output /data/$USER/RNA_hg38 \
               --genome /data/$USER/hg38_36/hg38_36.json \
               --mode slurm \
               --sif-cache /data/$USER/cache \
               --star-2-pass-basic \
               --shared-resources /data/shared/rna-seek \
               --tmp-dir /cluster_scratch/$USER/

Last update: 2025-11-20