Day 5: Tracking Threats: Genomic Detection of AMR, Virulence, and Plasmid Mobility

Date: September 5, 2025
Duration: 09:00-13:00 CAT
Focus: Genome quality and functional gene annotation fundamentals; AMR, virulence factor, and plasmid detection

Overview

Day 5 introduces Nextflow, a workflow management system for building reproducible, scalable bioinformatics pipelines. We cover Nextflow fundamentals and the nf-core community standards, then begin developing a genomic analysis pipeline that covers read QC, de novo assembly, assembly quality assessment, and annotation.

Learning Objectives

By the end of Day 5, you will be able to:

Understand the principles of reproducible computational workflows
Write basic Nextflow scripts with processes and channels
Utilize nf-core tools and community pipelines
Design workflow architecture for genomic analysis
Implement data flow using Nextflow channels
Begin developing a pipeline for QC, assembly, and annotation

Schedule

Time (CAT)	Topic	Trainer
09:00	Reproducible workflows with Nextflow and nf-core	Mamana Mbiyavanga
10:30	Developing a Nextflow pipeline for QC, de novo assembly, quality assessment and annotation	Mamana Mbiyavanga
11:30	Break
12:00	Developing a Nextflow pipeline for QC, de novo assembly, quality assessment and annotation	Mamana Mbiyavanga

Key Topics

1. Introduction to Workflow Management

Challenges in bioinformatics reproducibility
Benefits of workflow management systems
Nextflow vs other workflow systems (Snakemake, CWL, WDL)
Container technologies (Docker, Singularity)

2. Nextflow Fundamentals

Nextflow architecture and concepts
Processes, channels, and operators
Configuration files and profiles
Resource management and executors
Error handling and resume capabilities
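
Configuration, profiles, executors, and retry behaviour usually live in a nextflow.config file next to the pipeline script. The fragment below is a minimal sketch rather than a recommended setup: the input pattern, resource values, and queue name are assumptions chosen for illustration.

```groovy
// nextflow.config (illustrative sketch; all values are assumptions)
params {
    reads  = "data/*_{1,2}.fastq.gz"   // hypothetical paired-end input pattern
    outdir = "results"
}

// Defaults applied to every process unless overridden
process {
    cpus          = 2
    memory        = 4.GB
    errorStrategy = 'retry'   // resubmit a task that fails
    maxRetries    = 2
}

// Profiles are selected at launch time with -profile <name>
profiles {
    docker {
        docker.enabled = true
    }
    singularity {
        singularity.enabled    = true
        singularity.autoMounts = true
    }
    slurm {
        process.executor = 'slurm'
        process.queue    = 'normal'   // hypothetical queue name
    }
}
```

With a file like this in place, a run such as `nextflow run main.nf -profile docker -resume` would select the docker profile and reuse cached results from tasks that already completed successfully.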

3. nf-core Community and Standards

nf-core pipeline structure
Community guidelines and best practices
Using nf-core tools
Available nf-core pipelines for genomics
Contributing to nf-core

4. Building a Genomic Analysis Pipeline

Pipeline design and planning
Implementing QC processes (FastQC, MultiQC)
Assembly process integration (SPAdes)
Quality assessment steps (QUAST)
Annotation process (Prokka)
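
The quality assessment and annotation steps can follow the same shape as the other processes in the pipeline. The sketch below assumes each assembly arrives as a [sample_id, contigs] tuple; the process names, container images, and output directory names are assumptions made for illustration, not part of the course material.

```groovy
// Assembly quality assessment with QUAST (sketch)
process quast {
    container 'staphb/quast:latest'      // assumed image; pin a version in practice
    publishDir "${params.outdir}/quast", mode: 'copy'

    input:
    tuple val(sample_id), path(contigs)

    output:
    path "${sample_id}_quast"

    script:
    """
    quast.py ${contigs} -o ${sample_id}_quast -t ${task.cpus}
    """
}

// Genome annotation with Prokka (sketch)
process prokka {
    container 'staphb/prokka:latest'     // assumed image; pin a version in practice
    publishDir "${params.outdir}/prokka", mode: 'copy'

    input:
    tuple val(sample_id), path(contigs)

    output:
    tuple val(sample_id), path("${sample_id}_prokka")

    script:
    """
    prokka --outdir ${sample_id}_prokka --prefix ${sample_id} --cpus ${task.cpus} ${contigs}
    """
}
```

Both processes take the same input shape, so a single assembly channel can feed them in parallel from the workflow block.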

5. Nextflow Scripting

Writing process definitions
Channel operations and data flow
Parameter handling
Conditional execution
Module organization
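
Parameter handling and conditional execution can be illustrated together in a short DSL2 script. This is a minimal sketch: the `skip_count` flag and the `countLines` process are hypothetical names introduced only for the example.

```groovy
#!/usr/bin/env nextflow

// Parameter defaults; each can be overridden at launch, e.g. --skip_count true
params.input      = "data/*.fastq"
params.skip_count = false          // hypothetical flag for this example

process countLines {
    input:
    path fastq

    output:
    path "${fastq.baseName}.lines"

    script:
    """
    wc -l < ${fastq} > ${fastq.baseName}.lines
    """
}

workflow {
    // Fail early if the glob matches nothing
    fastq_ch = Channel.fromPath(params.input, checkIfExists: true)

    // Conditional execution: the process is only invoked when the flag is false
    if (!params.skip_count) {
        countLines(fastq_ch).view()
    }
}
```

For module organization, the process definition could be moved to a separate file (for example a hypothetical modules/count_lines.nf) and brought back into the main script with `include { countLines } from './modules/count_lines'`.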

Tools and Software

Workflow Management

Nextflow - Workflow orchestration system
nf-core tools - Pipeline development framework
Tower (now Seqera Platform) - Workflow monitoring platform

Containerization

Docker - Container platform
Singularity/Apptainer - HPC-friendly containers
Conda - Package and environment management (a non-container alternative)

Pipeline Components

FastQC - Read quality control
MultiQC - Aggregate reporting
SPAdes - Genome assembly
QUAST - Assembly assessment
Prokka - Genome annotation

Hands-on Exercises

Exercise 1: First Nextflow Script (30 minutes)

Create and run a simple Nextflow pipeline.

#!/usr/bin/env nextflow

// Define parameters
params.input = "data/*.fastq"
params.outdir = "results"

// Define a process (DSL2: inputs and outputs are declared without from/into)
process countReads {
    input:
    path fastq

    output:
    path "*.count"

    script:
    """
    echo "Processing ${fastq}"
    # A FASTQ record spans four lines, so reads = lines / 4
    echo \$(( \$(wc -l < ${fastq}) / 4 )) > ${fastq.baseName}.count
    """
}

workflow {
    // Create a channel from input files
    fastq_ch = Channel.fromPath(params.input)

    // Run the process and view the results
    countReads(fastq_ch).view()
}

Exercise 2: Building a QC Pipeline (60 minutes)

Implement quality control with FastQC and MultiQC.

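process fastqc {
    container 'biocontainers/fastqc:v0.11.9'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*_fastqc.{zip,html}"

    script:
    """
    fastqc -t ${task.cpus} ${reads}
    """
}

process multiqc {
    publishDir params.outdir, mode: 'copy'
    container 'ewels/multiqc:latest'

    input:
    // A named input keeps the original file names, which MultiQC uses to detect FastQC reports
    path qc_reports

    output:
    path 'multiqc_report.html'

    script:
    """
    multiqc .
    """
}

workflow {
    // params.reads is a glob for paired files, e.g. --reads 'data/*_{1,2}.fastq.gz'
    reads_ch = Channel.fromFilePairs(params.reads)

    fastqc(reads_ch)
    // collect() gathers all FastQC outputs so MultiQC runs once
    multiqc(fastqc.out.collect())
}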

Exercise 3: Integrating Assembly (90 minutes)

Add genome assembly to the pipeline.

process spades_assembly {
    container 'staphb/spades:latest'
    cpus 4
    memory '8 GB'

    input:
    tuple val(sample_id), path(reads1), path(reads2)

    output:
    tuple val(sample_id), path("${sample_id}_contigs.fasta")

    script:
    """
    spades.py \
        -1 ${reads1} \
        -2 ${reads2} \
        -o spades_output \
        -t ${task.cpus} \
        --careful

    cp spades_output/contigs.fasta ${sample_id}_contigs.fasta
    """
}
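
Channel.fromFilePairs emits items shaped [sample_id, [read1, read2]], whereas spades_assembly above declares three separate tuple elements. One way to connect the two is sketched below, assuming the block is appended to a script that already defines spades_assembly; the read pattern is only an example.

```groovy
params.reads = "data/*_{1,2}.fastq.gz"   // example pattern; adjust to your file names

workflow {
    read_pairs_ch = Channel
        .fromFilePairs(params.reads, checkIfExists: true)
        // Flatten [sample_id, [read1, read2]] into [sample_id, read1, read2]
        .map { sample_id, reads -> [sample_id, reads[0], reads[1]] }

    spades_assembly(read_pairs_ch)

    // Each emitted item is [sample_id, <sample_id>_contigs.fasta]
    spades_assembly.out.view()
}
```

An alternative design is to change the process input to `tuple val(sample_id), path(reads)` and refer to `${reads[0]}` and `${reads[1]}` in the script, which removes the need for the map step.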

Key Concepts

Workflow Principles

Reproducibility: Same input → same output
Portability: Run anywhere (laptop, HPC, cloud)
Scalability: Scale from a few samples to thousands without changing the pipeline code
Resumability: Restart from failure points

Nextflow Components

Component	Description	Example
Process	Computational step	`process fastqc { ... }`
Channel	Data flow connection	`Channel.fromPath()`
Operator	Channel transformation	`.map()`, `.filter()`
Directive	Process configuration	`cpus 4`

Best Practices

Use containers: Ensure environment reproducibility
Parameterize everything: Make pipelines flexible
Version control: Track pipeline changes
Document thoroughly: Help users and future self
Test incrementally: Build and test step by step

Assessment Activities

Individual Tasks

Create a basic Nextflow script with at least 2 processes
Successfully run a pipeline with test data
Modify pipeline parameters and observe changes
Debug a pipeline with intentional errors
Document pipeline usage

Group Discussion

Compare Nextflow with traditional shell scripting
Discuss reproducibility challenges and solutions
Share pipeline design strategies
Explore nf-core pipeline catalog

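Common Challenges

Installation Issues

# Install Nextflow (requires Java 11 or later)
curl -s https://get.nextflow.io | bash
./nextflow run hello

# Set up environment
export PATH=$PATH:$PWD        # make the nextflow launcher in this directory available on PATH
export NXF_VER=23.10.0        # pin the Nextflow version used by subsequent runs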

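Channel Operations

// Common channel patterns
Channel
    .fromFilePairs(params.reads)              // emits [sample_id, [read1, read2]]
    .ifEmpty { error "No read files found!" } // stop early if the glob matches nothing
    .set { read_pairs_ch }

// Combining channels: join matches items on their first element (the sample ID)
fastqc_ch
    .join(assembly_ch)
    .map { sample, qc, assembly ->
        [sample, qc, assembly]
    }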
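
To see what join, filter, and map do to the items in a channel, it can help to run them on a few literal values first. The script below is a self-contained sketch; the sample IDs, QC labels, and file names are made-up values, not real data.

```groovy
#!/usr/bin/env nextflow

workflow {
    // Two channels keyed by sample ID (values are invented for the example)
    qc_ch       = Channel.of( ['s1', 'pass'], ['s2', 'fail'] )
    assembly_ch = Channel.of( ['s2', 's2_contigs.fasta'], ['s1', 's1_contigs.fasta'] )

    qc_ch
        .join(assembly_ch)                                   // -> [sample, qc, assembly]
        .filter { sample, qc, assembly -> qc == 'pass' }     // keep samples that passed QC
        .map { sample, qc, assembly -> [sample, assembly] }  // drop the QC label
        .view()
}
```

Because join pairs items by key rather than by position, the differing order of the two input channels does not matter; only the s1 tuple survives the filter.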

Resource Management

process memory_intensive {
    memory { 2.GB * task.attempt }
    maxRetries 3
    errorStrategy 'retry'

    script:
    """
    # Your command here
    """
}
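
The same retry-with-more-memory pattern can also be set from nextflow.config, which keeps resource tuning out of the process code. This is a sketch under assumptions: the process name refers to the assembly exercise, and the exit codes listed are ones commonly associated with a task being killed for exceeding memory or time limits, which varies between schedulers.

```groovy
// nextflow.config fragment (illustrative; values are assumptions)
process {
    withName: 'spades_assembly' {
        cpus   = 4
        // Request more memory on each retry: 8 GB, then 16 GB, then 24 GB
        memory = { 8.GB * task.attempt }
        // Retry only on exit codes that suggest a resource limit was hit;
        // otherwise let running tasks finish and then stop
        errorStrategy = { task.exitStatus in [137, 140] ? 'retry' : 'finish' }
        maxRetries    = 2
    }
}
```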

Resources

Documentation

Tutorials

Community

Looking Ahead

Day 6 Preview: Nextflow Pipeline Development

Continue building the genomic analysis pipeline
Advanced Nextflow features and optimization
Pipeline testing and validation
Deployment strategies

Key Learning Outcome: Understanding workflow management principles and gaining hands-on experience with Nextflow enables creation of reproducible, scalable bioinformatics pipelines essential for modern genomic analysis.