Resources
1. Reference genomes
On Biowulf, RNA-seek comes bundled with the following pre-built GENCODE [1] reference genomes:
Genome | Species | Annotation Version | Notes |
---|---|---|---|
hg38_30 | Homo sapiens (human) | Gencode Release v30 | GRCh38, Annotation Release date: 11/2018 |
mm10_M21 | Mus musculus (mouse) | Gencode Release M21 | GRCm38, Annotation Release date: 04/2019 |
However, building new reference genomes is easy!
If you do not have access to Biowulf, or you need a reference genome and/or annotation that is not currently available, you can build one with RNA-seek's build sub-command. Given a genomic FASTA file (ref.fa) and a GTF file (genes.gtf), rna-seek build will create all of the reference files required to run the RNA-seek pipeline. Once the build pipeline completes, you can supply the newly generated reference.json to the --genome option of rna-seek run. For more information, please see the help pages for the run and build sub-commands.
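The build-then-run workflow described above might look like the following sketch. Only the sub-command names and the --genome option come from this page; the other flag names (--ref-fa, --ref-gtf, --ref-name, --gtf-ver, --input, --output), the sample file names, the output directories, and the location of reference.json are assumptions made for illustration, so please confirm them against rna-seek build --help and rna-seek run --help. By default the script only prints the two commands.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical inputs and output locations; replace with your own.
ref_fa="ref.fa"                           # genomic FASTA file
ref_gtf="genes.gtf"                       # annotation GTF file
ref_name="my_genome"                      # hypothetical label for the new reference
gtf_ver="30"                              # hypothetical annotation version
build_out="$PWD/my_genome_build"          # hypothetical build output directory
genome_json="$build_out/reference.json"   # assumed location of the generated JSON
run_out="$PWD/rna_seek_out"               # hypothetical pipeline output directory

# Flag names other than --genome are assumptions; check the help pages.
build_cmd=(rna-seek build --ref-fa "$ref_fa" --ref-gtf "$ref_gtf"
           --ref-name "$ref_name" --gtf-ver "$gtf_ver" --output "$build_out")
run_cmd=(rna-seek run --input sample_R1.fastq.gz sample_R2.fastq.gz
         --output "$run_out" --genome "$genome_json")

if [ "${EXECUTE:-0}" != "1" ]; then
  # Dry run (default): show both commands for review.
  echo "${build_cmd[*]}"
  echo "${run_cmd[*]}"
elif [ -f "$genome_json" ]; then
  # The reference already exists, so start the pipeline.
  "${run_cmd[@]}"
else
  # No reference yet: build it first, then re-run this script once it completes.
  "${build_cmd[@]}"
fi
```

Setting EXECUTE=1 runs the build step when the JSON file is missing and the run step once it exists, which keeps the pipeline from starting before the reference files are in place.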
2. Tools and versions
Raw data > Adapter Trimming > Alignment > Quantification (genes and isoforms, gene-fusions)
Tool | Version | Docker | Notes |
---|---|---|---|
FastQC [2] | 0.11.9 | nciccbr/ccbr_fastqc_0.11.9 | Quality-control step to assess sequencing quality, run before and after adapter trimming |
Cutadapt [3] | 1.18 | nciccbr/ccbr_cutadapt_1.18 | Data processing step to remove adapter sequences and perform quality trimming |
Kraken [4] | 2.1.1 | nciccbr/ccbr_kraken_v2.1.1 | Quality-control step to assess microbial taxonomic composition |
KronaTools [5] | 2.7.1 | nciccbr/ccbr_kraken_v2.1.1 | Quality-control step to visualize Kraken output |
FastQ Screen [6] | 0.13.0 | nciccbr/ccbr_fastq_screen_0.13.0 | Quality-control step to assess contamination; additional dependencies: bowtie2/2.3.4, perl/5.24.3 |
STAR [7] | 2.7.6a | nciccbr/ccbr_arriba_2.0.0 | Data processing step to align reads against the reference genome (using its two-pass mode) |
bbtools [8] | 38.87 | nciccbr/ccbr_bbtools_38.87 | Quality-control step to calculate insert_size of assembled read pairs with bbmerge |
QualiMap [9] | 2.2.1 | nciccbr/ccbr_qualimap | Quality-control step to assess various alignment metrics |
Picard [10] | 2.18.20 | nciccbr/ccbr_picard | Quality-control step to run MarkDuplicates, CollectRnaSeqMetrics, and AddOrReplaceReadGroups |
Preseq [11] | 2.0.3 | nciccbr/ccbr_preseq | Quality-control step to estimate library complexity |
SAMtools [12] | 1.7 | nciccbr/ccbr_arriba_2.0.0 | Quality-control step to run flagstat to calculate alignment statistics |
bam2strandedbw | custom | nciccbr/ccbr_bam2strandedbw | Summarization step to convert a STAR-aligned paired-end BAM file into forward- and reverse-strand bigWigs suitable for a genomic track viewer like IGV |
RSeQC [13] | 4.0.0 | nciccbr/ccbr_rseqc_4.0.0 | Quality-control step to infer strandedness and read distributions over specific genomic features |
RSEM [14] | 1.3.3 | nciccbr/ccbr_rsem_1.3.3 | Data processing step to quantify gene and isoform counts |
Arriba [15] | 2.0.0 | nciccbr/ccbr_arriba_2.0.0 | Data processing step to quantify gene fusions |
RNA Report | custom | nciccbr/ccbr_rna | Summarization step to identify outliers and assess technical sources of variation |
MultiQC [16] | 1.12 | skchronicles/multiqc | Reporting step to aggregate sample statistics and quality-control information across all samples |
3. Acknowledgements
3.1 Biowulf
If you used NIH's Biowulf cluster to run RNA-seek, please do not forget to provide an acknowledgement!
The continued growth and support of NIH's Biowulf cluster is dependent upon its demonstrable value to the NIH Intramural Research Program. If you publish research that involved significant use of Biowulf, please cite the cluster.
Suggested citation text:
This work utilized the computational resources of the NIH HPC Biowulf cluster. (http://hpc.nih.gov)
4. References
1. Harrow, J., et al. (2012). "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Res 22(9): 1760-1774.
2. Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
3. Martin, M. (2011). "Cutadapt removes adapter sequences from high-throughput sequencing reads." EMBnet 17(1): 10-12.
4. Wood, D. E. and S. L. Salzberg (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments." Genome Biol 15(3): R46.
5. Ondov, B. D., et al. (2011). "Interactive metagenomic visualization in a Web browser." BMC Bioinformatics 12(1): 385.
6. Wingett, S. and S. Andrews (2018). "FastQ Screen: A tool for multi-genome mapping and quality control." F1000Research 7(2): 1338.
7. Dobin, A., et al. (2013). "STAR: ultrafast universal RNA-seq aligner." Bioinformatics 29(1): 15-21.
8. Bushnell, B., J. Rood, and E. Singer (2017). "BBMerge - Accurate paired shotgun read merging via overlap." PLoS One 12(10): e0185056.
9. Okonechnikov, K., et al. (2015). "Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data." Bioinformatics 32(2): 292-294.
10. The Picard toolkit. https://broadinstitute.github.io/picard/.
11. Daley, T. and A. D. Smith (2013). "Predicting the molecular complexity of sequencing libraries." Nat Methods 10(4): 325-327.
12. Li, H., et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
13. Wang, L., et al. (2012). "RSeQC: quality control of RNA-seq experiments." Bioinformatics 28(16): 2184-2185.
14. Li, B. and C. N. Dewey (2011). "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome." BMC Bioinformatics 12: 323.
15. Uhrig, S., et al. (2021). "Accurate and efficient detection of gene fusions from RNA sequencing data." Genome Res 31(3): 448-460.
16. Ewels, P., et al. (2016). "MultiQC: summarize analysis results for multiple tools and samples in a single report." Bioinformatics 32(19): 3047-3048.