Day 5: Tracking Threats: Genomic Detection of AMR, Virulence, and Plasmid Mobility¶
Date: September 5, 2025 Duration: 09:00-13:00 CAT Focus: Genome quality and functional gene annotation fundamentals, AMR and virulence factors and plasmid detection
Overview¶
Day 5 introduces Nextflow, a powerful workflow management system for creating reproducible and scalable bioinformatics pipelines. We'll explore the fundamentals of Nextflow, the nf-core community standards, and begin developing a pipeline for genomic analysis including QC, assembly, quality assessment, and annotation.
Learning Objectives¶
By the end of Day 5, you will be able to:
- Understand the principles of reproducible computational workflows
- Write basic Nextflow scripts with processes and channels
- Utilize nf-core tools and community pipelines
- Design workflow architecture for genomic analysis
- Implement data flow using Nextflow channels
- Begin developing a pipeline for QC, assembly, and annotation
Schedule¶
Time (CAT) | Topic | Links | Trainer |
---|---|---|---|
09:00 | Reproducible workflows with Nextflow and nf-core | Mamana Mbiyavanga | |
10:30 | Developing a Nextflow pipeline for QC, de novo assembly, quality assessment and annotation | Mamana Mbiyavanga | |
11:30 | Break | ||
12:00 | Developing a Nextflow pipeline for QC, de novo assembly, quality assessment and annotation | Mamana Mbiyavanga |
Key Topics¶
1. Introduction to Workflow Management¶
- Challenges in bioinformatics reproducibility
- Benefits of workflow management systems
- Nextflow vs other workflow systems (Snakemake, CWL, WDL)
- Container technologies (Docker, Singularity)
2. Nextflow Fundamentals¶
- Nextflow architecture and concepts
- Processes, channels, and operators
- Configuration files and profiles
- Resource management and executors
- Error handling and resume capabilities
3. nf-core Community and Standards¶
- nf-core pipeline structure
- Community guidelines and best practices
- Using nf-core tools
- Available nf-core pipelines for genomics
- Contributing to nf-core
4. Building a Genomic Analysis Pipeline¶
- Pipeline design and planning
- Implementing QC processes (FastQC, MultiQC)
- Assembly process integration (SPAdes)
- Quality assessment steps (QUAST)
- Annotation process (Prokka)
5. Nextflow Scripting¶
- Writing process definitions
- Channel operations and data flow
- Parameter handling
- Conditional execution
- Module organization
Tools and Software¶
Workflow Management¶
- Nextflow - Workflow orchestration system
- nf-core tools - Pipeline development framework
- Tower - Workflow monitoring platform
Containerization¶
- Docker - Container platform
- Singularity - HPC-friendly containers
- Conda - Package management
Pipeline Components¶
- FastQC - Read quality control
- MultiQC - Aggregate reporting
- SPAdes - Genome assembly
- QUAST - Assembly assessment
- Prokka - Genome annotation
Hands-on Exercises¶
Exercise 1: First Nextflow Script (30 minutes)¶
Create and run a simple Nextflow pipeline.
#!/usr/bin/env nextflow
// Define parameters
params.input = "data/*.fastq"
params.outdir = "results"
// Create a channel from input files
Channel
.fromPath(params.input)
.set { fastq_ch }
// Define a process
process countReads {
input:
path fastq from fastq_ch
output:
path "*.count" into counts_ch
script:
"""
echo "Processing ${fastq}"
wc -l ${fastq} > ${fastq.baseName}.count
"""
}
// View the results
counts_ch.view()
Exercise 2: Building a QC Pipeline (60 minutes)¶
Implement quality control with FastQC and MultiQC.
process fastqc {
container 'biocontainers/fastqc:v0.11.9'
input:
tuple val(sample_id), path(reads)
output:
path "*_fastqc.{zip,html}" into fastqc_results
script:
"""
fastqc -t ${task.cpus} ${reads}
"""
}
process multiqc {
publishDir params.outdir, mode: 'copy'
container 'ewels/multiqc:latest'
input:
path '*' from fastqc_results.collect()
output:
path 'multiqc_report.html'
script:
"""
multiqc .
"""
}
Exercise 3: Integrating Assembly (90 minutes)¶
Add genome assembly to the pipeline.
process spades_assembly {
container 'staphb/spades:latest'
cpus 4
memory '8 GB'
input:
tuple val(sample_id), path(reads1), path(reads2)
output:
tuple val(sample_id), path("${sample_id}_contigs.fasta")
script:
"""
spades.py \
-1 ${reads1} \
-2 ${reads2} \
-o spades_output \
-t ${task.cpus} \
--careful
cp spades_output/contigs.fasta ${sample_id}_contigs.fasta
"""
}
Key Concepts¶
Workflow Principles¶
- Reproducibility: Same input → same output
- Portability: Run anywhere (laptop, HPC, cloud)
- Scalability: Handle any data volume
- Resumability: Restart from failure points
Nextflow Components¶
Component | Description | Example |
---|---|---|
Process | Computational step | process fastqc { ... } |
Channel | Data flow connection | Channel.fromPath() |
Operator | Channel transformation | .map() , .filter() |
Directive | Process configuration | cpus 4 |
Best Practices¶
- Use containers: Ensure environment reproducibility
- Parameterize everything: Make pipelines flexible
- Version control: Track pipeline changes
- Document thoroughly: Help users and future self
- Test incrementally: Build and test step by step
Assessment Activities¶
Individual Tasks¶
- Create a basic Nextflow script with at least 2 processes
- Successfully run a pipeline with test data
- Modify pipeline parameters and observe changes
- Debug a pipeline with intentional errors
- Document pipeline usage
Group Discussion¶
- Compare Nextflow with traditional shell scripting
- Discuss reproducibility challenges and solutions
- Share pipeline design strategies
- Explore nf-core pipeline catalog
Common Challenges¶
Installation Issues¶
# Install Nextflow
curl -s https://get.nextflow.io | bash
./nextflow run hello
# Set up environment
export PATH=$PATH:$PWD
export NXF_VER=23.10.0
Channel Operations¶
// Common channel patterns
Channel
.fromFilePairs(params.reads)
.ifEmpty { error "No read files found!" }
.set { read_pairs_ch }
// Combining channels
fastqc_ch
.join(assembly_ch)
.map { sample, qc, assembly ->
[sample, qc, assembly]
}
Resource Management¶
process memory_intensive {
memory { 2.GB * task.attempt }
maxRetries 3
errorStrategy 'retry'
script:
"""
# Your command here
"""
}
Resources¶
Documentation¶
Tutorials¶
Community¶
Looking Ahead¶
Day 6 Preview: Nextflow Pipeline Development - Continue building the genomic analysis pipeline - Advanced Nextflow features and optimization - Pipeline testing and validation - Deployment strategies
Key Learning Outcome: Understanding workflow management principles and gaining hands-on experience with Nextflow enables creation of reproducible, scalable bioinformatics pipelines essential for modern genomic analysis.