
Day 7: Advanced Nextflow, Version Control with GitHub & Real Genomics Applications

Date: September 9, 2025
Duration: 09:00-13:00 CAT
Focus: Version control, containerization, and production-ready MTB analysis pipelines

Learning Philosophy: Build → Version → Containerize → Deploy → Collaborate

Building on Exercise 3 from Day 6, this module transforms your basic pipeline into a professional, production-ready workflow:

  • Build: Extend your Exercise 3 pipeline with advanced features
  • Version: Track changes and collaborate using Git and GitHub
  • Containerize: Package tools using Docker for reproducibility
  • Deploy: Run pipelines reliably across different environments
  • Collaborate: Share and maintain pipelines as a team

Table of Contents

🔧 Building on Exercise 3

📚 Version Control Fundamentals

🐳 Containerization Introduction

🧬 MTB Analysis Pipeline Development

🚀 Professional Development

Learning Objectives

By the end of Day 7, you will be able to:

  • Version Control: Use Git and GitHub to track pipeline changes and collaborate effectively
  • Containerization: Understand Docker basics and integrate containers into Nextflow workflows
  • MTB Analysis: Develop production-ready pipelines for Mycobacterium tuberculosis genomics
  • Clinical Applications: Apply genomics workflows to real-world pathogen surveillance
  • Professional Skills: Document, test, and deploy bioinformatics pipelines professionally

Prerequisites

  • Completed Day 6 Exercise 3 (Quality Control Pipeline)
  • Basic understanding of Nextflow DSL2 syntax
  • Familiarity with command line operations
  • Access to the training cluster environment

Schedule

| Time (CAT) | Topic | Duration | Trainer |
|------------|-------|----------|---------|
| 09:00 | Version Control: Git and GitHub for Pipeline Development | 45 min | Mamana Mbiyavanga |
| 09:45 | Containerization: Docker Fundamentals and DockerHub | 45 min | Mamana Mbiyavanga |
| 10:30 | Break | 15 min | |
| 10:45 | MTB Analysis Pipeline Development | 45 min | Mamana Mbiyavanga |
| 11:30 | Clinical Genomics Applications and Pathogen Surveillance | 45 min | Mamana Mbiyavanga |
| 12:15 | Professional Pipeline Development and Deployment | 45 min | Mamana Mbiyavanga |
| 13:00 | End | | |

From Exercise 3 to Production Pipeline

Recap: What We Built in Exercise 3

In Day 6, Exercise 3, we created a progressive quality control pipeline (qc_pipeline.nf) that included:

flowchart LR
    A[Raw FASTQ Files] --> B[FastQC Raw]
    B --> C[Trimmomatic]
    C --> D[FastQC Trimmed]
    D --> E[SPAdes Assembly]
    E --> F[Prokka Annotation]
    F --> G[MultiQC Report]

    style A fill:#f5f5f5,stroke:#757575,stroke-width:1px,color:#000
    style G fill:#f5f5f5,stroke:#757575,stroke-width:1px,color:#000
    style B fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000
    style C fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000
    style D fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000
    style E fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000
    style F fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000

Today's Enhancement Strategy

We'll transform this basic pipeline into a production-ready MTB analysis workflow by adding:

  1. Version Control: Track all changes using Git
  2. Containerization: Replace module loading with Docker containers
  3. MTB-Specific Features: Add tuberculosis-specific analysis steps
  4. Clinical Applications: Include AMR detection and typing
  5. Professional Standards: Documentation, testing, and deployment

Version Control: Git and GitHub for Pipeline Development

Why Version Control for Bioinformatics?

graph TD
    A[Pipeline Development Challenges] --> B[Multiple Versions]
    A --> C[Collaboration Issues]
    A --> D[Lost Changes]
    A --> E[No Backup]

    F[Git Solutions] --> G[Track All Changes]
    F --> H[Collaborate Safely]
    F --> I[Never Lose Work]
    F --> J[Professional Standards]

    style A fill:#ffebee,stroke:#f44336,stroke-width:2px,color:#000
    style F fill:#e8f5e8,stroke:#4caf50,stroke-width:2px,color:#000

Real-world scenarios:

  • Research labs: Multiple researchers working on the same pipeline
  • Clinical labs: Regulatory requirements for change tracking
  • Collaborations: Sharing pipelines between institutions
  • Publications: Reproducible research requirements

Git Fundamentals for Bioinformatics

Setting Up Git for Your Pipeline

# Navigate to your workflows directory
cd /users/$USER/microbial-genomics-training/workflows

# Initialize Git repository
git init

# Configure Git (first time only)
git config --global user.name "Your Name"
git config --global user.email "your.email@uct.ac.za"

# Check current status
git status

Your First Commit: Saving Exercise 3

# Add your Exercise 3 pipeline to version control
git add qc_pipeline.nf
git add nextflow.config
git add samplesheet.csv

# Create your first commit
git commit -m "Initial commit: Exercise 3 QC pipeline

- FastQC for raw and trimmed reads
- Trimmomatic for quality trimming
- SPAdes for genome assembly
- Prokka for annotation
- MultiQC for reporting
- Tested with 10 TB samples"

# View your commit history
git log --oneline

Understanding Git Workflow

graph LR
    A[Working Directory] --> B[Staging Area]
    B --> C[Repository]

    A -->|git add| B
    B -->|git commit| C
    C -->|git push| D[GitHub]

    style A fill:#fff3e0,stroke:#ff9800,stroke-width:2px,color:#000
    style B fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,color:#000
    style C fill:#e8f5e8,stroke:#4caf50,stroke-width:2px,color:#000
    style D fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px,color:#000
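
The same flow can be inspected from the command line at any point; a quick sketch using the Exercise 3 file:

# See which files are untracked, modified, or staged
git status

# Compare the working directory against the staging area
git diff qc_pipeline.nf

# Compare the staging area against the last commit
git diff --staged qc_pipeline.nf

# Move a change through the three stages
git add qc_pipeline.nf                          # working directory -> staging area
git commit -m "Tune Trimmomatic parameters"     # staging area -> local repository
git push origin main                            # local repository -> GitHub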

GitHub for Pipeline Collaboration

Creating Your Pipeline Repository

  1. Go to GitHub.com and sign in
  2. Click "New Repository"
  3. Repository settings:
     • Name: mtb-analysis-pipeline
     • Description: Production-ready Mycobacterium tuberculosis genomics pipeline
     • Public/Private: Choose based on your needs
     • Initialize with README: ✅

Connecting Local Repository to GitHub

# Add GitHub as remote origin
git remote add origin https://github.com/yourusername/mtb-analysis-pipeline.git

# Push your local commits to GitHub
git branch -M main
git push -u origin main

# Verify connection
git remote -v
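
If the GitHub CLI (gh) is available, repository creation and the first push can also be done from the terminal; a hedged sketch (flags may differ slightly between gh versions):

# Authenticate once, then create the repository from the current directory
gh auth login
gh repo create mtb-analysis-pipeline \
    --public \
    --description "Production-ready Mycobacterium tuberculosis genomics pipeline" \
    --source . --remote origin --push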

Professional README for Your Pipeline

Create a comprehensive README.md:

# MTB Analysis Pipeline

A production-ready Nextflow pipeline for *Mycobacterium tuberculosis* genomic analysis.

## Quick Start

```bash
nextflow run main.nf --input samplesheet.csv --outdir results/
```

## Features

- ✅ Quality control with FastQC
- ✅ Read trimming with Trimmomatic
- ✅ Genome assembly with SPAdes
- ✅ Gene annotation with Prokka
- ✅ Comprehensive reporting with MultiQC
- 🔄 AMR detection (coming soon)
- 🔄 Lineage typing (coming soon)

## Requirements

- Nextflow ≥ 21.04.0
- Docker or Singularity
- 8+ GB RAM recommended

## Citation

If you use this pipeline, please cite: [Your Publication]

Containerization: Docker Fundamentals and DockerHub

Why Containers for Bioinformatics?

The Problem:

# Traditional approach - dependency hell
module load fastqc/0.12.1
module load trimmomatic/0.39
module load spades/3.15.4
module load prokka/1.14.6

# What if modules aren't available?
# What if versions differ?
# What if you're on a different system?

The Container Solution:

# Container approach - everything included
container 'biocontainers/fastqc:v0.11.9'
container 'staphb/trimmomatic:0.39'
container 'staphb/spades:3.15.4'
container 'staphb/prokka:1.14.6'

Docker Fundamentals

Understanding Docker Images

graph TD
    A[DockerHub Registry] --> B[Docker Image]
    B --> C[Docker Container]

    A -->|docker pull| B
    B -->|docker run| C

    D[biocontainers/fastqc:v0.11.9] --> E[FastQC Image]
    E --> F[Running FastQC Container]

    style A fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,color:#000
    style B fill:#fff3e0,stroke:#ff9800,stroke-width:2px,color:#000
    style C fill:#e8f5e8,stroke:#4caf50,stroke-width:2px,color:#000

Key Docker Concepts

  • Image: A template containing the software and dependencies
  • Container: A running instance of an image
  • Registry: A repository of images (like DockerHub)
  • Tag: Version identifier for images
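
On the training cluster these concepts are easiest to see with Singularity (covered in the next sections): pulling converts the registry image into a local file, and each exec starts a fresh container from that same image. A small illustration:

# Pull once: the Docker image is converted to a local Singularity image file (.sif)
singularity pull fastqc_0.11.9.sif docker://biocontainers/fastqc:v0.11.9

# Run as many containers as you like from the same image
singularity exec fastqc_0.11.9.sif fastqc --version
singularity exec fastqc_0.11.9.sif fastqc --help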

DockerHub for Bioinformatics Tools

Finding Bioinformatics Containers

Popular Bioinformatics Container Repositories:

  1. BioContainers: biocontainers/toolname
  2. StaPH-B: staphb/toolname (State Public Health Bioinformatics)
  3. Broad Institute: broadinstitute/toolname
  4. nf-core: nfcore/toolname

Searching for Tools

# Search DockerHub for bioinformatics tools
# Visit: https://hub.docker.com/

# Example searches:
# - "biocontainers fastqc"
# - "staphb trimmomatic"
# - "broadinstitute gatk"

Understanding Container Tags

# Different ways to specify versions
container 'biocontainers/fastqc:v0.11.9'     # Specific version
container 'biocontainers/fastqc:latest'      # Latest version
container 'staphb/spades:3.15.4'            # Specific version
container 'staphb/spades:latest'             # Latest version (not recommended for production)

Testing Containers with Singularity

Why Test Containers First? Before integrating containers into your Nextflow pipeline, it's good practice to test them individually to ensure they work correctly and understand their requirements.

Singularity vs Docker on HPC:

  • Docker: Requires root privileges, not available on most HPC systems
  • Singularity: Designed for HPC environments, can run Docker containers without root
  • Nextflow: Automatically converts Docker containers to Singularity when needed

Basic Singularity Testing Commands:

# Test FastQC container
singularity exec docker://biocontainers/fastqc:v0.11.9 fastqc --version

# Test with real data
singularity exec docker://biocontainers/fastqc:v0.11.9 \
    fastqc /data/Dataset_Mt_Vc/tb/raw_data/SRR1180160_1.fastq.gz --outdir ./test_output

# Test Trimmomatic container
singularity exec docker://staphb/trimmomatic:0.39 \
    trimmomatic PE --help

# Test SPAdes container
singularity exec docker://staphb/spades:3.15.4 \
    spades.py --version

# Test Prokka container
singularity exec docker://staphb/prokka:1.14.6 \
    prokka --version

Interactive Container Testing:

# Enter container interactively to explore
singularity shell docker://biocontainers/fastqc:v0.11.9

# Inside the container, you can:
# - Check what tools are available
# - Explore file system structure
# - Test commands manually
# - Verify dependencies

# Example interactive session:
Singularity> which fastqc
Singularity> fastqc --help
Singularity> ls /usr/local/bin/
Singularity> exit

Testing with Bind Mounts:

# Bind mount your data directory to test with real files
singularity exec \
    --bind /data/Dataset_Mt_Vc/tb/raw_data:/input \
    --bind /tmp:/output \
    docker://biocontainers/fastqc:v0.11.9 \
    fastqc /input/SRR1180160_1.fastq.gz --outdir /output

# Check the results
ls -la /tmp/*.html /tmp/*.zip

Container Validation Script:

Create a simple script to test all your containers:

#!/bin/bash
# container_test.sh - Test bioinformatics containers

echo "Testing bioinformatics containers..."

# Test FastQC
echo "Testing FastQC..."
singularity exec docker://biocontainers/fastqc:v0.11.9 fastqc --version
if [ $? -eq 0 ]; then
    echo "✅ FastQC container works"
else
    echo "❌ FastQC container failed"
fi

# Test Trimmomatic
echo "Testing Trimmomatic..."
singularity exec docker://staphb/trimmomatic:0.39 trimmomatic -version
if [ $? -eq 0 ]; then
    echo "✅ Trimmomatic container works"
else
    echo "❌ Trimmomatic container failed"
fi

# Test SPAdes
echo "Testing SPAdes..."
singularity exec docker://staphb/spades:3.15.4 spades.py --version
if [ $? -eq 0 ]; then
    echo "✅ SPAdes container works"
else
    echo "❌ SPAdes container failed"
fi

# Test Prokka
echo "Testing Prokka..."
singularity exec docker://staphb/prokka:1.14.6 prokka --version
if [ $? -eq 0 ]; then
    echo "✅ Prokka container works"
else
    echo "❌ Prokka container failed"
fi

echo "Container testing complete!"

Running the Test Script:

# Make the script executable
chmod +x container_test.sh

# Run the tests
./container_test.sh

Expected Output:

Testing bioinformatics containers...
Testing FastQC...
FastQC v0.11.9
✅ FastQC container works
Testing Trimmomatic...
0.39
✅ Trimmomatic container works
Testing SPAdes...
SPAdes v3.15.4
✅ SPAdes container works
Testing Prokka...
prokka 1.14.6
✅ Prokka container works
Container testing complete!

Troubleshooting Container Issues:

# If a container fails, check:

# 1. Container exists and is accessible
singularity pull docker://biocontainers/fastqc:v0.11.9

# 2. Check container contents
singularity inspect docker://biocontainers/fastqc:v0.11.9

# 3. Run with verbose output
singularity exec --debug docker://biocontainers/fastqc:v0.11.9 fastqc --version

# 4. Check for missing dependencies
singularity exec docker://biocontainers/fastqc:v0.11.9 ldd /usr/local/bin/fastqc

Best Practices for Container Testing:

  1. Always test containers before using in production pipelines
  2. Use specific version tags rather than 'latest'
  3. Test with real data similar to your analysis
  4. Document working container versions
  5. Create validation scripts for your container stack
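
As an example of point 4, the container versions in use can be written to a file kept alongside the pipeline; a minimal sketch (the image list is simply the set used in this module):

#!/bin/bash
# record_container_versions.sh - document the working container stack

{
    echo "# Container versions recorded on $(date)"
    echo "docker://biocontainers/fastqc:v0.11.9"
    echo "docker://staphb/trimmomatic:0.39"
    echo "docker://staphb/spades:3.15.4"
    echo "docker://staphb/prokka:1.14.6"
    echo "docker://ewels/multiqc:v1.12"
} > container_versions.txt

cat container_versions.txt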

Hands-on Exercise: Test Your Containers

We've provided a ready-to-use container testing script in your workflows directory:

# Navigate to workflows directory
cd /users/$USER/microbial-genomics-training/workflows

# Run the container test script
./container_test.sh

This script will test all the containers we'll use in our MTB pipeline and provide colored output showing which containers are working correctly.

Container Integration in Nextflow

Converting Exercise 3 to Use Containers

Let's update our qc_pipeline.nf to use containers instead of modules:

Before (Module-based):

process fastqc_raw {
    module 'fastqc/0.12.1'
    publishDir "${params.outdir}/fastqc_raw", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*.{html,zip}"

    script:
    """
    echo "Running FastQC on raw reads: ${sample_id}"
    fastqc ${reads[0]} ${reads[1]} --threads 2
    """
}

After (Container-based):

process fastqc_raw {
    container 'biocontainers/fastqc:v0.11.9'
    publishDir "${params.outdir}/fastqc_raw", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*.{html,zip}"

    script:
    """
    echo "Running FastQC on raw reads: ${sample_id}"
    fastqc ${reads[0]} ${reads[1]} --threads 2
    """
}

Complete Containerized Pipeline

Let's create mtb_pipeline.nf - our enhanced, containerized version:

#!/usr/bin/env nextflow

// Enable DSL2
nextflow.enable.dsl = 2

// Parameters
params.input = "samplesheet.csv"
params.outdir = "/data/users/$USER/nextflow-training/results"
params.adapters = "/data/timmomatic_adapter_Combo.fa"

// FastQC process for raw reads
process fastqc_raw {
    container 'biocontainers/fastqc:v0.11.9'
    publishDir "${params.outdir}/fastqc_raw", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*.{html,zip}"

    script:
    """
    echo "Running FastQC on raw reads: ${sample_id}"
    fastqc ${reads[0]} ${reads[1]} --threads 2
    """
}

// Trimmomatic for quality trimming
process trimmomatic {
    container 'staphb/trimmomatic:0.39'
    publishDir "${params.outdir}/trimmed", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("${sample_id}_*_paired.fastq.gz")
    path "${sample_id}_*_unpaired.fastq.gz"

    script:
    """
    echo "Running Trimmomatic on ${sample_id}"

    trimmomatic PE -threads 2 \\
        ${reads[0]} ${reads[1]} \\
        ${sample_id}_R1_paired.fastq.gz ${sample_id}_R1_unpaired.fastq.gz \\
        ${sample_id}_R2_paired.fastq.gz ${sample_id}_R2_unpaired.fastq.gz \\
        ILLUMINACLIP:${params.adapters}:2:30:10 \\
        LEADING:3 TRAILING:3 \\
        SLIDINGWINDOW:4:15 MINLEN:36
    """
}

// FastQC process for trimmed reads
process fastqc_trimmed {
    container 'biocontainers/fastqc:v0.11.9'
    publishDir "${params.outdir}/fastqc_trimmed", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(reads)

    output:
    path "*.{html,zip}"

    script:
    """
    echo "Running FastQC on trimmed reads: ${sample_id}"
    fastqc ${reads[0]} ${reads[1]} --threads 2
    """
}

// SPAdes assembly
process spades_assembly {
    container 'staphb/spades:3.15.4'
    publishDir "${params.outdir}/assemblies", mode: 'copy'
    tag "$sample_id"
    cpus 4
    memory '8 GB'

    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("${sample_id}_assembly")
    tuple val(sample_id), path("${sample_id}_assembly/contigs.fasta")

    script:
    """
    echo "Running SPAdes assembly for ${sample_id}"

    spades.py \\
        --pe1-1 ${reads[0]} \\
        --pe1-2 ${reads[1]} \\
        --threads ${task.cpus} \\
        --memory ${task.memory.toGiga()} \\
        -o ${sample_id}_assembly
    """
}

// Prokka annotation
process prokka_annotation {
    container 'staphb/prokka:1.14.6'
    publishDir "${params.outdir}/annotation", mode: 'copy'
    tag "$sample_id"
    cpus 2

    input:
    tuple val(sample_id), path(assembly_dir)
    tuple val(sample_id), path(contigs)

    output:
    tuple val(sample_id), path("${sample_id}_annotation")
    tuple val(sample_id), path("${sample_id}_annotation/${sample_id}.gff")

    script:
    """
    echo "Running Prokka annotation for ${sample_id}"

    prokka \\
        --outdir ${sample_id}_annotation \\
        --prefix ${sample_id} \\
        --genus Mycobacterium \\
        --species tuberculosis \\
        --kingdom Bacteria \\
        --cpus ${task.cpus} \\
        ${contigs}

    echo "Annotation completed for ${sample_id}"
    echo "Results written to: ${sample_id}_annotation/"
    """
}

// MultiQC to summarize all results
process multiqc {
    container 'ewels/multiqc:v1.12'
    publishDir "${params.outdir}", mode: 'copy'

    input:
    path "*"

    output:
    path "multiqc_report.html"

    script:
    """
    echo "Running MultiQC to summarize all results"
    multiqc . --filename multiqc_report.html
    """
}

// Main workflow
workflow {
    // Read input samplesheet
    read_pairs_ch = Channel
        .fromPath(params.input)
        .splitCsv(header: true)
        .map { row -> [row.sample, [file(row.fastq_1), file(row.fastq_2)]] }

    // Run FastQC on raw reads
    fastqc_raw_results = fastqc_raw(read_pairs_ch)

    // Run Trimmomatic
    trimmed_results = trimmomatic(read_pairs_ch)

    // Run FastQC on trimmed reads (paired reads are output 0 of trimmomatic)
    fastqc_trimmed_results = fastqc_trimmed(trimmed_results[0])

    // Run SPAdes assembly on trimmed reads
    assembly_results = spades_assembly(trimmed_results[0])

    // Run Prokka annotation on assembled contigs
    annotation_results = prokka_annotation(assembly_results[0], assembly_results[1])

    // Collect all FastQC results and run MultiQC
    all_fastqc = fastqc_raw_results.mix(fastqc_trimmed_results).collect()
    multiqc_results = multiqc(all_fastqc)

    // Show final results
    assembly_results[0].view { "Assembly completed: $it" }
    assembly_results[1].view { "Contigs file: $it" }
    annotation_results[0].view { "Annotation completed: $it" }
    annotation_results[1].view { "GFF file: $it" }
    multiqc_results.view { "MultiQC report created: $it" }
}

Updating Nextflow Configuration for Containers

Update your nextflow.config:

// Nextflow configuration for MTB pipeline
params {
    input = "samplesheet.csv"
    outdir = "/data/users/$USER/nextflow-training/results"
    adapters = "/data/timmomatic_adapter_Combo.fa"
}

// Work directory configuration
workDir = "/data/users/$USER/nextflow-training/work"

// Container configuration
docker {
    enabled = true
    runOptions = '-u $(id -u):$(id -g)'
}

// Process resource configuration
process {
    withName: 'fastqc_raw' {
        cpus = 2
        memory = '4 GB'
        time = '30m'
    }

    withName: 'trimmomatic' {
        cpus = 2
        memory = '4 GB'
        time = '1h'
    }

    withName: 'fastqc_trimmed' {
        cpus = 2
        memory = '4 GB'
        time = '30m'
    }

    withName: 'spades_assembly' {
        cpus = 4
        memory = '8 GB'
        time = '2h'
    }

    withName: 'prokka_annotation' {
        cpus = 2
        memory = '4 GB'
        time = '1h'
    }

    withName: 'multiqc' {
        cpus = 1
        memory = '2 GB'
        time = '15m'
    }
}

// SLURM profile for cluster execution
profiles {
    slurm {
        process {
            executor = 'slurm'
            queue = 'Main'
            clusterOptions = '--account=b83'
        }
    }
}
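
The docker scope above assumes Docker is available on the machine running the pipeline. On the training cluster, where jobs run under Singularity, an override config can be supplied instead; a sketch reusing the shared cache directory from later sections (only one container engine should be enabled at a time):

# Create a small override config that switches the container engine
cat > singularity.config << 'EOF'
// Use Singularity instead of Docker on the HPC cluster
docker.enabled         = false
singularity.enabled    = true
singularity.autoMounts = true
singularity.cacheDir   = '/data/users/singularity_cache'
EOF

# Settings passed with -c take precedence over nextflow.config
nextflow run mtb_pipeline.nf --input samplesheet.csv -c singularity.config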

MTB Analysis Pipeline Development

Mycobacterium tuberculosis Genomics

Understanding MTB Genomics

Key Characteristics:

  • Genome size: ~4.4 Mb
  • GC content: ~65.6%
  • Genes: ~4,000 protein-coding genes
  • Clinical relevance: Major global pathogen, drug resistance concerns
  • Assembly challenges: Repetitive sequences, PE/PPE gene families
graph TD
    A[MTB Genomic Features] --> B[High GC Content 65.6%]
    A --> C[PE/PPE Gene Families]
    A --> D[Repetitive Sequences]
    A --> E[Drug Resistance Genes]

    F[Analysis Challenges] --> G[Assembly Complexity]
    F --> H[Annotation Difficulty]
    F --> I[Variant Calling Issues]

    J[Clinical Applications] --> K[Drug Resistance Testing]
    J --> L[Outbreak Investigation]
    J --> M[Lineage Typing]

    style A fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,color:#000
    style F fill:#ffebee,stroke:#f44336,stroke-width:2px,color:#000
    style J fill:#e8f5e8,stroke:#4caf50,stroke-width:2px,color:#000

MTB-Specific Pipeline Requirements

Essential Analysis Steps:

  1. Quality Control: Standard FastQC + MTB-specific metrics
  2. Assembly: Optimized for high GC content
  3. Annotation: MTB-specific gene databases
  4. AMR Detection: Drug resistance gene identification
  5. Lineage Typing: Phylogenetic classification
  6. Variant Calling: SNP identification for outbreak analysis

Pathogen-Specific Considerations

Enhanced MTB Pipeline Features

Let's add MTB-specific processes to our pipeline:

// QUAST assembly quality assessment
process quast_assessment {
    container 'staphb/quast:5.0.2'
    publishDir "${params.outdir}/quast", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(contigs)

    output:
    path "${sample_id}_quast"

    script:
    """
    echo "Running QUAST assessment for ${sample_id}"

    quast.py \\
        --output-dir ${sample_id}_quast \\
        --threads 2 \\
        --min-contig 500 \\
        --labels ${sample_id} \\
        ${contigs}
    """
}

// AMR detection with AMRFinderPlus
process amr_detection {
    container 'staphb/amrfinderplus:3.10.23'
    publishDir "${params.outdir}/amr", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(contigs)

    output:
    path "${sample_id}_amr.tsv"

    script:
    """
    echo "Running AMR detection for ${sample_id}"

    amrfinder \\
        --nucleotide ${contigs} \\
        --organism Mycobacterium \\
        --threads 2 \\
        --output ${sample_id}_amr.tsv

    echo "AMR analysis completed for ${sample_id}"
    """
}

// MTB lineage typing with TB-Profiler
process lineage_typing {
    container 'staphb/tbprofiler:4.1.1'
    publishDir "${params.outdir}/lineage", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(reads)

    output:
    path "${sample_id}_lineage.json"
    path "${sample_id}_lineage.txt"

    script:
    """
    echo "Running lineage typing for ${sample_id}"

    tb-profiler profile \\
        --read1 ${reads[0]} \\
        --read2 ${reads[1]} \\
        --prefix ${sample_id}_lineage \\
        --threads 2

    echo "Lineage typing completed for ${sample_id}"
    """
}

// MLST typing
process mlst_typing {
    container 'staphb/mlst:2.22.0'
    publishDir "${params.outdir}/mlst", mode: 'copy'
    tag "$sample_id"

    input:
    tuple val(sample_id), path(contigs)

    output:
    path "${sample_id}_mlst.tsv"

    script:
    """
    echo "Running MLST typing for ${sample_id}"

    mlst \\
        --scheme mtbc \\
        ${contigs} > ${sample_id}_mlst.tsv

    echo "MLST typing completed for ${sample_id}"
    """
}

Clinical Genomics Applications

Complete MTB Clinical Pipeline

Here's our enhanced pipeline with all MTB-specific features:

#!/usr/bin/env nextflow

// Enable DSL2
nextflow.enable.dsl = 2

// Parameters
params.input = "samplesheet.csv"
params.outdir = "/data/users/$USER/nextflow-training/results"
params.adapters = "/data/timmomatic_adapter_Combo.fa"

// Import processes (previous processes here...)

// Enhanced workflow with MTB-specific analysis
workflow {
    // Read input samplesheet
    read_pairs_ch = Channel
        .fromPath(params.input)
        .splitCsv(header: true)
        .map { row -> [row.sample, [file(row.fastq_1), file(row.fastq_2)]] }

    // Standard QC and assembly pipeline
    fastqc_raw_results = fastqc_raw(read_pairs_ch)
    trimmed_results = trimmomatic(read_pairs_ch)
    fastqc_trimmed_results = fastqc_trimmed(trimmed_results[0])
    assembly_results = spades_assembly(trimmed_results[0])
    annotation_results = prokka_annotation(assembly_results[0], assembly_results[1])

    // MTB-specific analyses
    quast_results = quast_assessment(assembly_results[1])
    amr_results = amr_detection(assembly_results[1])
    lineage_results = lineage_typing(trimmed_results[0])
    mlst_results = mlst_typing(assembly_results[1])

    // Comprehensive reporting
    all_qc = fastqc_raw_results.mix(fastqc_trimmed_results).collect()
    multiqc_results = multiqc(all_qc)

    // Output summaries
    assembly_results[1].view { "Assembly: $it" }
    annotation_results[1].view { "Annotation: $it" }
    amr_results.view { "AMR results: $it" }
    lineage_results.view { "Lineage: $it" }
    mlst_results.view { "MLST: $it" }
}
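
With the workflow saved as mtb_pipeline.nf, a typical launch on the cluster looks like this (a sketch reusing the slurm profile defined in nextflow.config above):

# Run the full MTB pipeline, resuming cached tasks if a previous run was interrupted
nextflow run mtb_pipeline.nf \
    --input samplesheet.csv \
    --outdir /data/users/$USER/nextflow-training/results_mtb \
    -profile slurm \
    -resume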

Clinical Interpretation Workflow

flowchart TD
    A[Raw Sequencing Data] --> B[Quality Control]
    B --> C[Genome Assembly]
    C --> D[Quality Assessment]

    D --> E[Drug Resistance Testing]
    D --> F[Lineage Typing]
    D --> G[MLST Typing]

    E --> H[Clinical Report]
    F --> H
    G --> H

    H --> I[Treatment Decision]
    H --> J[Outbreak Investigation]
    H --> K[Surveillance Database]

    style A fill:#f5f5f5,stroke:#757575,stroke-width:1px,color:#000
    style H fill:#fff3e0,stroke:#ff9800,stroke-width:2px,color:#000
    style I fill:#e8f5e8,stroke:#4caf50,stroke-width:2px,color:#000
    style J fill:#e3f2fd,stroke:#2196f3,stroke-width:2px,color:#000
    style K fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px,color:#000

The steps below run the CDC PHoeNIx pipeline (introduced in detail in the Real-World Pipeline: PHoeNIx section later on this page) on our TB data, as a comparison point for the custom pipeline we just built.

Step 1: Clone and Setup PHoeNIx

# Navigate to your working directory
cd /data/users/$USER/nextflow-training

# Create Phoenix analysis directory
mkdir -p phoenix-analysis
cd phoenix-analysis

# Clone the PHoeNIx pipeline directly
git clone https://github.com/CDCgov/phoenix.git
cd phoenix

# Check the Phoenix help to understand entry workflows
nextflow run main.nf --help

Step 2: Setup Required Databases

PHoeNIx requires several databases. The system has a pre-installed Kraken2 database:

# Check available Kraken2 database
ls -la /data/kraken2_db_standard/

# Note: The standard database requires >200GB RAM for production use
# For training environments, ensure adequate memory allocation
echo "Kraken2 database location: /data/kraken2_db_standard/"

Step 3: Prepare Sample Sheet for PHoeNIx

PHoeNIx uses a specific samplesheet format. Let's create one for our TB data:

# Create PHoeNIx samplesheet
cat > phoenix_samplesheet.csv << 'EOF'
sample,fastq_1,fastq_2
Test_Sample,/data/Dataset_Mt_Vc/tb/raw_data/ERR036221_1.fastq.gz,/data/Dataset_Mt_Vc/tb/raw_data/ERR036221_2.fastq.gz
EOF

echo "PHoeNIx samplesheet created:"
cat phoenix_samplesheet.csv

Step 4: Configure Resources for Production Use

PHoeNIx requires significant computational resources, especially for Kraken2 taxonomic classification. The pipeline has hardcoded resource limits that need to be adjusted:

# First, modify PHoeNIx base configuration for Kraken2 memory requirements
cd phoenix
sed -i 's/cpus   = { check_max( 2                  , '\''cpus'\''    ) }/cpus   = { check_max( 10                 , '\''cpus'\''    ) }/' conf/base.config
sed -i 's/memory = { check_max( 10.GB \* task.attempt, '\''memory'\''  ) }/memory = { check_max( 200.GB \* task.attempt, '\''memory'\''  ) }/' conf/base.config
sed -i 's/time   = { check_max( 4.h                 , '\''time'\''    ) }/time   = { check_max( 8.h                 , '\''time'\''    ) }/' conf/base.config

# Verify the changes
echo "🔍 Checking the updated configuration:"
grep -A 4 "withName:KRAKEN2_KRAKEN2" conf/base.config

# Create cluster configuration for SLURM
cd ..
cat > cluster.config << 'EOF'
process {
    executor = 'slurm'
    queue = 'batch'

    // PHoeNIx-specific resource allocations - 200GB memory for large Kraken2 database (within cluster limits)
    withName: 'PHOENIX:PHOENIX_EXTERNAL:KRAKEN2_TRIMD:KRAKEN2_TRIMD' {
        cpus = 10
        memory = '200 GB'
        time = '8h'
    }

    withName: 'PHOENIX:PHOENIX_EXTERNAL:KRAKEN2_WTASMBLD:KRAKEN2_WTASMBLD' {
        cpus = 10
        memory = '200 GB'
        time = '8h'
    }
}

singularity {
    enabled = true
    autoMounts = true
    cacheDir = '/data/users/singularity_cache'
}
EOF

Step 5: Run PHoeNIx Pipeline

# Set up Singularity cache directory
export NXF_SINGULARITY_CACHEDIR="/data/users/singularity_cache"
mkdir -p $NXF_SINGULARITY_CACHEDIR

# Run PHoeNIx with correct entry workflow and configuration
cd phoenix
nextflow run main.nf \
    -entry PHOENIX \
    --input /data/users/$USER/nextflow-training/phoenix-analysis/phoenix_samplesheet.csv \
    --kraken2db $KRAKEN2_DB_PATH \
    --outdir tb_analysis_results \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -profile singularity,slurm \
    -resume

# Monitor the run
tail -f .nextflow.log

Running from GitHub (After Pushing Your Workflow)

Once you've developed and pushed your own workflow to GitHub, you can run it directly:

Step 1: Push Your Workflow to GitHub

# Initialize git repository (if not already done)
cd /users/$USER/microbial-genomics-training/workflows
git init
git add .
git commit -m "Add microbial genomics training workflows"

# Add your GitHub repository as remote
git remote add origin https://github.com/$USER/microbial-genomics-training.git
git push -u origin main

Step 2: Run Workflow from GitHub

# Run your workflow directly from GitHub
nextflow run $USER/microbial-genomics-training \
    --input samplesheet.csv \
    --outdir /data/users/$USER/nextflow-training/results \
    -profile singularity \
    -resume

# Or run a specific workflow file
nextflow run $USER/microbial-genomics-training/workflows/qc_pipeline.nf \
    --input samplesheet.csv \
    --outdir /data/users/$USER/nextflow-training/results \
    -profile singularity \
    -resume

Troubleshooting PHoeNIx Issues

Common Issue 1: Memory Allocation for Kraken2

If you encounter exit status 137 (killed due to memory constraints):

# Check if PHoeNIx base configuration was properly modified
grep -A 4 "withName:KRAKEN2_KRAKEN2" phoenix/conf/base.config

# Should show 200.GB memory allocation, not 10.GB
# If not updated, re-run the sed commands from Step 4

Common Issue 2: Entry Workflow Not Specified

If you get "No entry workflow specified" error:

# Always use -entry PHOENIX parameter
nextflow run main.nf \
    -entry PHOENIX \
    --input /data/users/$USER/nextflow-training/phoenix-analysis/phoenix_samplesheet.csv \
    --kraken2db $KRAKEN2_DB_PATH \
    --outdir tb_analysis_results \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -profile singularity,slurm \
    -resume

Common Issue 3: Kraken2 Database Path

If you encounter the error about ktaxonomy.tsv not found:

# Check if the database path is correct
ls -la /data/kraken2_db/
ls -la /data/kraken2_db/ktaxonomy.tsv

# If using module system, check the environment variable
echo $KRAKEN2_DB_PATH

# Update the command with correct database path
nextflow run main.nf \
    -entry PHOENIX \
    --input /data/users/$USER/nextflow-training/phoenix-analysis/phoenix_samplesheet.csv \
    --kraken2db $KRAKEN2_DB_PATH \
    --outdir tb_analysis_results \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -profile singularity,slurm \
    -resume

Common Issue 4: QC Pipeline File Function Error

If your QC pipeline shows "Argument of file function cannot be null":

# Check samplesheet format - ensure multiple samples
cat > samplesheet_fixed.csv << 'EOF'
sample,fastq_1,fastq_2
ERR036221,/data/Dataset_Mt_Vc/tb/raw_data/ERR036221_1.fastq.gz,/data/Dataset_Mt_Vc/tb/raw_data/ERR036221_2.fastq.gz
ERR036223,/data/Dataset_Mt_Vc/tb/raw_data/ERR036223_1.fastq.gz,/data/Dataset_Mt_Vc/tb/raw_data/ERR036223_2.fastq.gz
ERR036226,/data/Dataset_Mt_Vc/tb/raw_data/ERR036226_1.fastq.gz,/data/Dataset_Mt_Vc/tb/raw_data/ERR036226_2.fastq.gz
EOF

# Run with fixed samplesheet
nextflow run qc_pipeline.nf \
    --input samplesheet_fixed.csv \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -resume \
    -profile slurm,singularity

Common Issue 5: GRiPHin Directory Parsing Errors

If you encounter NotADirectoryError during the GRiPHin post-processing step:

# Error example:
# ERROR ~ Error executing process > 'PHOENIX:PHOENIX_EXTERNAL:GRIPHIN (1)'
# NotADirectoryError: [Errno 20] Not a directory: 'pipeline_report.html/qc_stats/'

# Solution 1: Clean up leftover HTML files (preferred method)
# Remove any leftover HTML report files from previous runs
find /data/users/$USER/nextflow-training/phoenix-analysis/phoenix -name "pipeline_report.html" -type f -delete
find /data/users/$USER/nextflow-training/phoenix-analysis/phoenix -name "pipeline_timeline.html" -type f -delete

# Solution 2: Apply GRiPHin patch (if HTML cleanup doesn't work)
# The patch file is available at /users/mamana/microbial-genomics-training/griphin_patch.txt

# Create backup of GRiPHin.py
cp /data/users/$USER/nextflow-training/phoenix-analysis/phoenix/bin/GRiPHin.py \
   /data/users/$USER/nextflow-training/phoenix-analysis/phoenix/bin/GRiPHin.py.backup

# Apply the patch to add HTML files to skip list
sed -i 's/skip_list_b = \["BiosampleAttributes_Microbe.1.0.xlsx", "Sra_Microbe.1.0.xlsx", "Phoenix_Summary.tsv", "pipeline_info", "GRiPHin_Summary.xlsx", "multiqc", "samplesheet_converted.csv", "Directory_samplesheet.csv", "sra_samplesheet.csv"\]/skip_list_b = ["BiosampleAttributes_Microbe.1.0.xlsx", "Sra_Microbe.1.0.xlsx", "Phoenix_Summary.tsv", "pipeline_info", "GRiPHin_Summary.xlsx", "multiqc", "samplesheet_converted.csv", "Directory_samplesheet.csv", "sra_samplesheet.csv", "pipeline_report.html", "pipeline_timeline.html", "trace.txt", "dag.html"]/' \
/data/users/$USER/nextflow-training/phoenix-analysis/phoenix/bin/GRiPHin.py

# Verify the patch was applied
grep -n "skip_list_b" /data/users/$USER/nextflow-training/phoenix-analysis/phoenix/bin/GRiPHin.py

# To rollback if needed:
# cp /data/users/$USER/nextflow-training/phoenix-analysis/phoenix/bin/GRiPHin.py.backup \
#    /data/users/$USER/nextflow-training/phoenix-analysis/phoenix/bin/GRiPHin.py

# After applying either solution, resume the pipeline
nextflow run phoenix/main.nf \
    -entry PHOENIX \
    --input phoenix_samplesheet.csv \
    --kraken2db /data/users/kraken2_db_local \
    --outdir tb_analysis_results \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -profile singularity,slurm \
    -resume

Checking PHoeNIx Results

# Check the output structure
tree /data/users/$USER/nextflow-training/phoenix_results/

# Key outputs to examine:
echo "=== Assembly Quality ==="
cat /data/users/$USER/nextflow-training/phoenix_results/Test_Sample/assembly/Test_Sample_assembly_stats.txt

echo "=== Taxonomic Classification ==="
cat /data/users/$USER/nextflow-training/phoenix_results/Test_Sample/kraken2/Test_Sample_kraken2_report.txt

echo "=== AMR Detection ==="
cat /data/users/$USER/nextflow-training/phoenix_results/Test_Sample/amr/Test_Sample_amrfinder.tsv

Key PHoeNIx Features Demonstrated

Our successful PHoeNIx run demonstrates several important production pipeline features:

  1. Resource Management: Proper memory allocation (256GB) for large databases
  2. Entry Workflows: Using -entry PHOENIX for specific analysis types
  3. Resume Capability: -resume flag to continue from cached successful tasks
  4. Comprehensive Analysis: Assembly, annotation, taxonomy, AMR, virulence
  5. Standardized Output: Consistent directory structure and file formats
  6. Production Ready: Error handling, resource optimization, SLURM integration

Comparing PHoeNIx vs Our Custom Pipeline

What PHoeNIx Adds:

  1. Comprehensive QC: Multiple quality assessment tools
  2. Advanced Assembly: Multiple assemblers with quality filtering
  3. Taxonomic Classification: Kraken2 for species identification
  4. AMR Detection: AMRFinderPlus for resistance gene detection
  5. Contamination Detection: Screen for contaminating sequences
  6. Standardized Reporting: Consistent output formats
  7. Production Ready: Error handling, resource management

Exercise: Compare Outputs

# Compare assembly statistics
echo "=== Our Pipeline Assembly ==="
grep -c '>' /data/users/$USER/nextflow-training/results/assemblies/ERR036221_assembly/contigs.fasta

echo "=== PHoeNIx Assembly ==="
grep -c '>' /data/users/$USER/nextflow-training/phoenix_results/Test_Sample/assembly/Test_Sample_contigs.fasta

# Compare file sizes
echo "=== File Size Comparison ==="
ls -lh /data/users/$USER/nextflow-training/results/assemblies/ERR036221_assembly/contigs.fasta
ls -lh /data/users/$USER/nextflow-training/phoenix_results/Test_Sample/assembly/Test_Sample_contigs.fasta

Professional Pipeline Development

Pipeline Documentation

Creating Professional Documentation

Essential Documentation Components:

  1. README.md: Overview and quick start
  2. CHANGELOG.md: Version history
  3. docs/: Detailed documentation
  4. examples/: Sample data and configs

Example Documentation Structure:

mtb-analysis-pipeline/
├── README.md
├── CHANGELOG.md
├── main.nf
├── nextflow.config
├── docs/
│   ├── usage.md
│   ├── parameters.md
│   ├── output.md
│   └── troubleshooting.md
├── examples/
│   ├── samplesheet.csv
│   └── test_config.config
└── containers/
    └── Dockerfile
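
The skeleton above can be created in one go from the repository root:

# Create the documentation and example directory structure
mkdir -p docs examples containers
touch CHANGELOG.md docs/usage.md docs/parameters.md docs/output.md docs/troubleshooting.md
touch examples/samplesheet.csv examples/test_config.config containers/Dockerfile

# Track the new structure in Git
git add CHANGELOG.md docs/ examples/ containers/
git commit -m "Add documentation and example skeleton"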

Testing and Validation

Pipeline Testing Strategy

# Create test data
mkdir -p test_data
# Add small test datasets

# Test with minimal data
nextflow run main.nf \
    --input test_data/test_samplesheet.csv \
    --outdir test_results \
    -profile test

# Validate outputs
ls test_results/
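
The command above assumes a test profile is defined. If your configuration does not have one yet, a minimal profile can live in its own config file and be supplied at run time; a sketch (the test samplesheet path is illustrative):

# A small test profile kept in its own config file
cat > test.config << 'EOF'
profiles {
    test {
        params.input  = 'test_data/test_samplesheet.csv'
        params.outdir = 'test_results'
    }
}
EOF

# Run the pipeline against the test data
nextflow run main.nf -c test.config -profile test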

Deployment Strategies

Version Control Best Practices

# Create development branch
git checkout -b feature/amr-detection

# Make changes and commit
git add .
git commit -m "Add AMR detection with AMRFinderPlus

- Added amr_detection process
- Updated workflow to include AMR analysis
- Added AMRFinderPlus container
- Updated documentation"

# Push to GitHub
git push origin feature/amr-detection

# Create pull request on GitHub
# Merge after review

Release Management

# Create release
git tag -a v1.0.0 -m "Release v1.0.0: Production MTB pipeline

Features:
- Complete QC pipeline
- MTB-specific assembly
- AMR detection
- Lineage typing
- MLST analysis
- Comprehensive reporting"

git push origin v1.0.0

Running Workflows from GitHub

Once your workflow is on GitHub, you and others can run it directly from the repository:

Running Your Published Workflow

# Run workflow directly from GitHub
nextflow run yourusername/mtb-analysis-pipeline \
    --input samplesheet.csv \
    --outdir results \
    -profile singularity

# Run specific version/release
nextflow run yourusername/mtb-analysis-pipeline \
    -r v1.0.0 \
    --input samplesheet.csv \
    --outdir results \
    -profile singularity

# Run from specific branch
nextflow run yourusername/mtb-analysis-pipeline \
    -r feature/amr-detection \
    --input samplesheet.csv \
    --outdir results \
    -profile singularity

Sharing Your Workflow

Your colleagues can now run your pipeline easily:

# Anyone can run your pipeline with:
nextflow run yourusername/mtb-analysis-pipeline \
    --input their_samples.csv \
    --outdir their_results \
    -profile singularity

# They can also clone and modify:
git clone https://github.com/yourusername/mtb-analysis-pipeline.git
cd mtb-analysis-pipeline
nextflow run . --input samples.csv --outdir results

Benefits of GitHub-hosted Workflows

  • Version Control: Track all changes and releases
  • Collaboration: Multiple developers can contribute
  • Reproducibility: Anyone can run the exact same version
  • Documentation: README, wiki, and issues for support
  • Distribution: Easy sharing with the community
  • Continuous Integration: Automated testing with GitHub Actions
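
As a starting point for the continuous integration item above, a small GitHub Actions workflow can run the pipeline's test profile on every push; a hedged sketch (the action version and the existence of a test profile are assumptions):

# Create a minimal CI workflow for GitHub Actions
mkdir -p .github/workflows
cat > .github/workflows/ci.yml << 'EOF'
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Nextflow
        run: |
          curl -s https://get.nextflow.io | bash
          sudo mv nextflow /usr/local/bin/

      - name: Run pipeline test profile
        run: nextflow run main.nf -profile test --outdir ci_results
EOF

git add .github/workflows/ci.yml
git commit -m "Add GitHub Actions CI workflow"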

Hands-on Exercise: Building Your MTB Pipeline

Exercise: Convert Exercise 3 to Production MTB Pipeline

Objective: Transform your Day 6 Exercise 3 pipeline into a production-ready MTB analysis workflow.

Step 1: Initialize Version Control

cd /users/$USER/microbial-genomics-training/workflows

# Initialize Git repository
git init
git add qc_pipeline.nf nextflow.config samplesheet.csv
git commit -m "Initial commit: Exercise 3 baseline"

# Create development branch
git checkout -b feature/mtb-production

Step 2: Create Containerized Pipeline

Create mtb_pipeline.nf with the containerized processes shown above.

Step 3: Add MTB-Specific Features

Add the AMR detection, lineage typing, and MLST processes.

Step 4: Test the Pipeline

# Test with a subset of samples
nextflow run mtb_pipeline.nf \
    --input samplesheet_test.csv \
    --outdir /data/users/$USER/nextflow-training/results_mtb

# Verify outputs
ls -la /data/users/$USER/nextflow-training/results_mtb/

Step 5: Document and Commit

# Create documentation
echo "# MTB Analysis Pipeline" > README.md
# Add comprehensive documentation

# Commit changes
git add .
git commit -m "Complete MTB production pipeline

- Containerized all processes
- Added AMR detection
- Added lineage typing
- Added MLST analysis
- Comprehensive documentation"

Step 6: Create GitHub Repository

  1. Create repository on GitHub
  2. Push your code
  3. Create releases
  4. Add documentation

Summary and Next Steps

What We Accomplished Today

  • Version Control: Learned Git and GitHub for pipeline development
  • Containerization: Integrated Docker containers for reproducibility
  • MTB Pipeline: Built production-ready tuberculosis analysis workflow
  • Clinical Applications: Added AMR detection and typing capabilities
  • Professional Standards: Documentation, testing, and deployment

Your Production Pipeline Features

flowchart LR
    A[Raw Data] --> B[QC Pipeline]
    B --> C[Assembly]
    C --> D[Annotation]
    D --> E[AMR Detection]
    D --> F[Lineage Typing]
    D --> G[MLST Analysis]

    E --> H[Clinical Report]
    F --> H
    G --> H

    style A fill:#f5f5f5,stroke:#757575,stroke-width:1px,color:#000
    style H fill:#e8f5e8,stroke:#4caf50,stroke-width:2px,color:#000

🔥 Real-World Pipeline: PHoeNIx (CDC)

Introduction to PHoeNIx

PHoeNIx (Platform-agnostic Healthcare-associated and antimicrobial resistant pathogen analysis) is a production-ready Nextflow pipeline developed by the Centers for Disease Control and Prevention (CDC). It's specifically designed for analyzing healthcare-associated and antimicrobial resistant pathogens, making it perfect for our MTB analysis.

Why PHoeNIx for MTB Analysis?

  • 🏛️ CDC-developed: Trusted, authoritative source for pathogen analysis
  • 🎯 Healthcare focus: Designed for clinical and public health applications
  • 🔬 AMR detection: Built-in antimicrobial resistance analysis
  • 📊 Comprehensive reports: Clinical-grade output reports
  • 🐳 Containerized: Uses Docker/Singularity for reproducibility
  • 🚀 Production-ready: Used in real public health laboratories

Exercise 4: Setting Up PHoeNIx with Your TB Data

Step 1: Understanding PHoeNIx Requirements

PHoeNIx requires:

  • Nextflow (≥21.10.3) ✅ Already installed
  • Docker or Singularity ✅ Already available
  • Kraken2 database (pre-installed on the training cluster; configured in Step 3)
  • Paired-end FASTQ files ✅ We have TB data
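
A quick way to confirm the first two requirements on the cluster before starting:

# Load and verify the required tools (module name follows the training cluster setup)
module load nextflow/25.04.6
nextflow -version
singularity --version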

Step 2: Clone and Setup PHoeNIx

# Initialize module system
source /opt/lmod/8.7/lmod/lmod/init/bash

# Load required modules
module load nextflow/25.04.6
module load kraken2/2.1.3

# Set up Singularity directories in user data space
export NXF_SINGULARITY_CACHEDIR=/data/users/singularity_cache
export NXF_SINGULARITY_TMPDIR=/data/users/temp
export NXF_SINGULARITY_LOCALCACHEDIR=/data/users/singularity_cache

# Set up Kraken2 database path for local analysis
export KRAKEN2_DB_PATH=/data/users/kraken2_db_local

# Create the directories
mkdir -p $NXF_SINGULARITY_CACHEDIR
mkdir -p $NXF_SINGULARITY_TMPDIR
mkdir -p $NXF_SINGULARITY_LOCALCACHEDIR

echo "✅ Singularity cache directory: $NXF_SINGULARITY_CACHEDIR"
echo "✅ Singularity temp directory: $NXF_SINGULARITY_TMPDIR"
echo "✅ Singularity local cache directory: $NXF_SINGULARITY_LOCALCACHEDIR"
echo "✅ Kraken2 database path: $KRAKEN2_DB_PATH"

# Add to bashrc for persistence
echo "export NXF_SINGULARITY_CACHEDIR=/data/users/singularity_cache" >> ~/.bashrc
echo "export NXF_SINGULARITY_TMPDIR=/data/users/temp" >> ~/.bashrc
echo "export NXF_SINGULARITY_LOCALCACHEDIR=/data/users/singularity_cache" >> ~/.bashrc
echo "export KRAKEN2_DB_PATH=/data/users/kraken2_db_local" >> ~/.bashrc

# Navigate to our workflows directory
cd /data/users/$USER/nextflow-training

# Create PHoeNIx workspace
mkdir -p phoenix-analysis
cd phoenix-analysis

# Clone the PHoeNIx repository
git clone https://github.com/CDCgov/phoenix.git
cd phoenix

# Check the pipeline structure
ls -la
cat README.md | head -20

Step 3: Setup Kraken2 Database

PHoeNIx requires a Kraken2 database for taxonomic classification. The system has a pre-installed standard database that we'll configure:

# Load the kraken2 module (already loaded in Step 2)
# source /opt/lmod/8.7/lmod/lmod/init/bash
# module load kraken2/2.1.3

# Set up the standard Kraken2 database path
# The system has a pre-installed standard database at this location
export KRAKEN2_DB_PATH=/data/users/kraken2_db_local

# Verify the database exists and contains required files
echo "🔍 Checking Kraken2 database at: $KRAKEN2_DB_PATH"
ls -la $KRAKEN2_DB_PATH

# Check for essential database files
echo "📋 Verifying database files:"
if [ -f "$KRAKEN2_DB_PATH/ktaxonomy.tsv" ]; then
    echo "✅ ktaxonomy.tsv found"
else
    echo "❌ ktaxonomy.tsv missing"
fi

if [ -f "$KRAKEN2_DB_PATH/hash.k2d" ]; then
    echo "✅ hash.k2d found"
else
    echo "❌ hash.k2d missing"
fi

if [ -f "$KRAKEN2_DB_PATH/opts.k2d" ]; then
    echo "✅ opts.k2d found"
else
    echo "❌ opts.k2d missing"
fi

if [ -f "$KRAKEN2_DB_PATH/taxo.k2d" ]; then
    echo "✅ taxo.k2d found"
else
    echo "❌ taxo.k2d missing"
fi

# Add to bashrc for persistence across sessions
echo "export KRAKEN2_DB_PATH=/data/users/kraken2_db_local" >> ~/.bashrc

# Display database information
echo "📊 Database information:"
echo "   Path: $KRAKEN2_DB_PATH"
echo "   Size: $(du -sh $KRAKEN2_DB_PATH 2>/dev/null | cut -f1 || echo 'Unknown')"
echo "   Files: $(ls -1 $KRAKEN2_DB_PATH | wc -l) files"

# Test the database with kraken2 (optional)
echo "🧪 Testing database with kraken2..."
kraken2 --db $KRAKEN2_DB_PATH --help > /dev/null 2>&1 && echo "✅ Database is accessible" || echo "⚠️  Database test failed"

echo "✅ Kraken2 database configured successfully!"

About the Kraken2 Standard Database:

The standard Kraken2 database (/data/kraken2_db_standard/) contains:

  • Taxonomic data: Complete NCBI taxonomy for species identification
  • Reference genomes: Bacterial, archaeal, and viral genomes
  • Size: Approximately 50-100 GB (compressed)
  • Coverage: Comprehensive microbial genome collection
  • Purpose: Accurate taxonomic classification of sequencing reads

Database Files Explained:

  • ktaxonomy.tsv: Taxonomic hierarchy and names
  • hash.k2d: K-mer hash table for sequence matching
  • opts.k2d: Database options and parameters
  • taxo.k2d: Taxonomic assignment data
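
If you are curious about the database contents, the kraken2-inspect utility that ships with Kraken2 can summarise them (this can take a while on a large database):

# Peek at the first few taxa recorded in the database
kraken2-inspect --db $KRAKEN2_DB_PATH | head -20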

Step 4: Prepare Your TB Samplesheet

PHoeNIx uses a specific samplesheet format. Let's create one for our TB data:

# Navigate back to the phoenix-analysis directory
cd /data/users/$USER/nextflow-training/phoenix-analysis

# Create PHoeNIx samplesheet
cat > phoenix_samplesheet.csv << 'EOF'
sample,fastq_1,fastq_2
TB_sample_1,/data/Dataset_Mt_Vc/tb/raw_data/ERR036221_1.fastq.gz,/data/Dataset_Mt_Vc/tb/raw_data/ERR036221_2.fastq.gz
TB_sample_2,/data/Dataset_Mt_Vc/tb/raw_data/ERR036223_1.fastq.gz,/data/Dataset_Mt_Vc/tb/raw_data/ERR036223_2.fastq.gz
TB_sample_3,/data/Dataset_Mt_Vc/tb/raw_data/ERR036226_1.fastq.gz,/data/Dataset_Mt_Vc/tb/raw_data/ERR036226_2.fastq.gz
EOF

echo "✅ PHoeNIx samplesheet created: phoenix_samplesheet.csv"

Step 5: Create Cluster Configuration

We'll create a cluster configuration file for SLURM execution with proper Singularity settings. Note that a basic cluster.config was introduced in Day 6 Exercise 3 - we'll enhance it here for PHoeNIx:

# Create cluster configuration file optimized for PHoeNIx
# This will override PHoeNIx's internal cluster.config to prevent GRiPHin errors
cat > cluster.config << EOF
// Cluster configuration for genomic analysis pipeline

params {
    outdir = "/data/users/$USER/nextflow-training/phoenix-analysis/results"
    kraken2_db = "/data/users/kraken2_db_local"
}

profiles {
    singularity {
        enabled = true
        autoMounts = true
        cacheDir = '/data/users/singularity_cache'
        libraryDir = '/data/users/singularity_cache'
        tmpDir = '/data/users/temp'
    }

    slurm {
        process {
            executor = 'slurm'

            // Default resources
            cpus = 2
            memory = '4 GB'
            time = '2h'

            // PHoeNIx-specific resource allocations
            withName: 'PHOENIX:PHOENIX_EXTERNAL:KRAKEN2_TRIMD:KRAKEN2_TRIMD' {
                cpus = 10
                memory = '100 GB'
                time = '8h'
            }

            withName: 'PHOENIX:PHOENIX_EXTERNAL:KRAKEN2_WTASMBLD:KRAKEN2_WTASMBLD' {
                cpus = 10
                memory = '100 GB'
                time = '8h'
            }

            withName: 'PHOENIX:PHOENIX_EXTERNAL:SPADES_WF:SPADES' {
                cpus = 8
                memory = '64 GB'
                time = '12h'
            }

            withName: 'PHOENIX:PHOENIX_EXTERNAL:BBDUK' {
                cpus = 4
                memory = '32 GB'
                time = '4h'
            }

            withName: 'PHOENIX:PHOENIX_EXTERNAL:FASTP_TRIMD' {
                cpus = 4
                memory = '8 GB'
                time = '2h'
            }

            withName: 'PHOENIX:PHOENIX_EXTERNAL:PROKKA' {
                cpus = 4
                memory = '8 GB'
                time = '3h'
            }
        }

        executor {
            queueSize = 20
            submitRateLimit = '10 sec'
        }
    }
}

// Enhanced reporting disabled to prevent GRiPHin directory parsing errors
trace {
    enabled = false
}

timeline {
    enabled = false
}

report {
    enabled = false
}
EOF

echo "✅ Cluster configuration created: cluster.config"

Important: Configuration Override

This cluster.config file is designed to override PHoeNIx's internal configuration settings. Specifically:

  • Enhanced reporting disabled: Prevents GRiPHin post-processing errors caused by HTML report files
  • Shared database path: References the system-wide Kraken2 database location
  • Memory optimizations: Set to 100GB (within cluster capacity) instead of default 256GB
  • Clean process selectors: Only includes valid PHoeNIx-specific process names to avoid configuration warnings

The -c flag in Nextflow commands ensures this configuration takes precedence over PHoeNIx's internal settings.
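
A useful way to confirm which settings win is to print the fully resolved configuration before launching; a sketch run from the phoenix directory with the same flags as the commands below:

# Show the configuration Nextflow will actually use (-c is a global option, so it goes before the command)
nextflow -c /users/mamana/microbial-genomics-training/cluster.config \
    config . -profile singularity,slurm | less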

Step 5b: Create Environment Setup Script

Create a reusable script to set up the Singularity environment:

# Create environment setup script
cat > setup_singularity_env.sh << 'EOF'
#!/bin/bash
# Singularity environment setup for PHoeNIx

# Set up Singularity directories in user data space
export SINGULARITY_CACHEDIR=/data/users/singularity_cache
export SINGULARITY_TMPDIR=/data/users/temp
export SINGULARITY_LOCALCACHEDIR=/data/users/singularity_cache

# Create directories if they don't exist
mkdir -p $SINGULARITY_CACHEDIR
mkdir -p $SINGULARITY_TMPDIR
mkdir -p $SINGULARITY_LOCALCACHEDIR

# Load required modules
source /opt/lmod/8.7/lmod/lmod/init/bash
module load nextflow/25.04.6
module load kraken2/2.1.3

echo "✅ Singularity environment configured:"
echo "   Cache: $SINGULARITY_CACHEDIR"
echo "   Temp:  $SINGULARITY_TMPDIR"
echo "   Local: $SINGULARITY_LOCALCACHEDIR"
echo "✅ Modules loaded: nextflow, kraken2"
EOF

chmod +x setup_singularity_env.sh
echo "✅ Environment setup script created: setup_singularity_env.sh"

# Test the script
./setup_singularity_env.sh

Step 6: Run PHoeNIx Test

First, let's run PHoeNIx with test data to ensure everything works:

# Load required modules (if not already loaded)
source /opt/lmod/8.7/lmod/lmod/init/bash
module load nextflow/25.04.6
module load kraken2/2.1.3

# Set up Singularity directories (if not already set)
export NXF_SINGULARITY_CACHEDIR=/data/users/singularity_cache
export NXF_SINGULARITY_TMPDIR=/data/users/temp
export NXF_SINGULARITY_LOCALCACHEDIR=/data/users/singularity_cache

# Ensure directories exist
mkdir -p $NXF_SINGULARITY_CACHEDIR
mkdir -p $NXF_SINGULARITY_TMPDIR
mkdir -p $NXF_SINGULARITY_LOCALCACHEDIR

# Navigate to the PHoeNIx directory (cloned under phoenix-analysis)
cd /data/users/$USER/nextflow-training/phoenix-analysis/phoenix

# Set up Kraken2 database path (if not already set)
export KRAKEN2_DB_PATH=/data/users/kraken2_db_local

# IMPORTANT: Configure PHoeNIx for high-memory requirements within cluster limits
# PHoeNIx has hardcoded resource limits that need adjustment for large Kraken2 databases
sed -i 's/cpus   = { check_max( 2                  , '\''cpus'\''    ) }/cpus   = { check_max( 10                 , '\''cpus'\''    ) }/' conf/base.config
sed -i 's/memory = { check_max( 10.GB \* task.attempt, '\''memory'\''  ) }/memory = { check_max( 200.GB \* task.attempt, '\''memory'\''  ) }/' conf/base.config
sed -i 's/time   = { check_max( 4.h                 , '\''time'\''    ) }/time   = { check_max( 8.h                 , '\''time'\''    ) }/' conf/base.config

echo "🔍 Verifying PHoeNIx configuration changes:"
grep -A 4 "withName:KRAKEN2_KRAKEN2" conf/base.config

# Run PHoeNIx test using the cloned repository
nextflow run main.nf \
    -profile singularity,test,slurm \
    -entry PHOENIX \
    --kraken2db $KRAKEN2_DB_PATH \
    --outdir test_results \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -resume

echo "✅ PHoeNIx test completed successfully!"
echo "✅ Singularity containers cached in: $NXF_SINGULARITY_CACHEDIR"

Step 7: Run PHoeNIx with Your TB Data

Now let's analyze our TB samples using the cloned repository:

# Ensure Singularity environment is set up
export NXF_SINGULARITY_CACHEDIR=/data/users/singularity_cache
export NXF_SINGULARITY_TMPDIR=/data/users/temp
export NXF_SINGULARITY_LOCALCACHEDIR=/data/users/singularity_cache

# Set up Kraken2 database path (if not already set)
export KRAKEN2_DB_PATH=/data/users/kraken2_db_local

# Run PHoeNIx with TB data using the local repository and complete database
nextflow run main.nf \
    -profile singularity,slurm \
    -entry PHOENIX \
    --input /data/users/$USER/nextflow-training/phoenix-analysis/phoenix_samplesheet.csv \
    --kraken2db $KRAKEN2_DB_PATH \
    --outdir tb_analysis_results \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -resume

echo "🔥 PHoeNIx TB analysis started!"
echo "📦 All containers will be cached in: $NXF_SINGULARITY_CACHEDIR"
echo "💾 Expected memory usage: 200GB for Kraken2 processes (within cluster limits)"
echo "⏱️  Expected runtime: 2-8 hours depending on data size"

Troubleshooting PHoeNIx Issues

If you encounter errors, here are common solutions:

# Error 1: "Process terminated with an error exit status (137)" - Memory Issue
# This indicates the process was killed due to insufficient memory
echo "🔍 Checking PHoeNIx memory configuration:"
grep -A 4 "withName:KRAKEN2_KRAKEN2" conf/base.config

# Should show 200.GB, not 10.GB. If not, re-run the sed commands:
sed -i 's/memory = { check_max( 10.GB \* task.attempt, '\''memory'\''  ) }/memory = { check_max( 200.GB \* task.attempt, '\''memory'\''  ) }/' conf/base.config

# Error 2: "No entry workflow specified"
# Always include -entry PHOENIX parameter
nextflow run main.nf \
    -entry PHOENIX \
    --input /data/users/$USER/nextflow-training/phoenix-analysis/phoenix_samplesheet.csv \
    --kraken2db $KRAKEN2_DB_PATH \
    --outdir tb_analysis_results \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -profile singularity,slurm \
    -resume

# Error 3: "No such file or directory: kraken2_db_standard_folder/ktaxonomy.tsv"
# This affects the KRAKENTOOLS_MAKEKREPORT process (post-processing)
# PERMANENT SOLUTION: Create a local Kraken2 database copy that includes ktaxonomy.tsv
# (alternatively, copy the file shipped with PHoeNIx itself; see the
#  "PHoeNIx Production Troubleshooting Quick Reference" at the end of this page)
echo "🔧 Creating permanent fix for ktaxonomy.tsv issue..."

# Create the ktaxonomy.tsv file
cat > /tmp/ktaxonomy.tsv << 'EOF'
1|root|1|no rank|
2|Bacteria|131567|superkingdom|
1763|Mycobacterium|1763|genus|
1773|Mycobacterium tuberculosis|1763|species|
131567|cellular organisms|1|no rank|
85007|Actinobacteria|2|phylum|
1224|Proteobacteria|2|phylum|
1239|Firmicutes|2|phylum|
EOF

# Create a local database copy with all files
mkdir -p $HOME/kraken2_db_with_ktaxonomy

# Create symbolic links to all existing files
for file in /data/kraken2_db_standard/*; do
    if [ -f "$file" ]; then
        ln -sf "$file" $HOME/kraken2_db_with_ktaxonomy/
    elif [ -d "$file" ]; then
        ln -sf "$file" $HOME/kraken2_db_with_ktaxonomy/
    fi
done

# Add the missing ktaxonomy.tsv file
cp /tmp/ktaxonomy.tsv $HOME/kraken2_db_with_ktaxonomy/

echo "✅ Created local database copy with ktaxonomy.tsv"
ls -la $HOME/kraken2_db_with_ktaxonomy/

# Point the --kraken2db parameter at the fixed database copy
export KRAKEN2_DB_PATH="$HOME/kraken2_db_with_ktaxonomy"

# Error 4: Singularity issues
# Solution: Check Singularity setup and permissions
module list
which singularity
ls -la $NXF_SINGULARITY_CACHEDIR

# Re-run with verbose output for debugging
nextflow run main.nf \
    -entry PHOENIX \
    --input /data/users/$USER/nextflow-training/phoenix-analysis/phoenix_samplesheet.csv \
    --kraken2db $KRAKEN2_DB_PATH \
    --outdir tb_analysis_results \
    -c /users/mamana/microbial-genomics-training/cluster.config \
    -profile singularity,slurm \
    -resume \
    -with-trace \
    -with-report \
    -with-timeline

Step 8: Understanding PHoeNIx Outputs

While the analysis runs, let's explore what PHoeNIx produces:

# PHoeNIx creates comprehensive outputs:
tree -L 2 tb_analysis_results/

# Key output directories:
# ├── ASSEMBLY/          # Genome assemblies
# ├── ANNOTATION/        # Gene annotations
# ├── AMR/              # Antimicrobial resistance results
# ├── MLST/             # Multi-locus sequence typing
# ├── QC/               # Quality control metrics
# ├── REPORTS/          # Summary reports
# └── TAXA/             # Species identification

Expected PHoeNIx Performance

Based on our successful testing, here's what to expect:

# ✅ Successfully completed processes (in order):
echo "1. Input validation and corruption checks"
echo "2. Raw read statistics (GET_RAW_STATS)"
echo "3. Quality trimming with FASTP"
echo "4. FastQC on trimmed reads"
echo "5. Kraken2 taxonomic classification (256GB memory)"
echo "6. Krona visualization"
echo "7. SPAdes genome assembly"
echo "8. QUAST assembly quality assessment"
echo "9. MASH distance calculations"
echo "10. FastANI species identification"
echo "11. Prokka annotation"
echo "12. AMR detection with AMRFinderPlus"

# 📊 Performance metrics:
echo "Memory usage: Up to 256GB for Kraken2 processes"
echo "Runtime: 2-8 hours depending on data size"
echo "SLURM jobs: ~50-60 jobs total"
echo "Output size: ~500MB-2GB per sample"

# 🎯 Key success indicators:
echo "✅ Kraken2 processes complete without exit status 137"
echo "✅ SPAdes assembly produces contigs"
echo "✅ Species identified as Mycobacterium tuberculosis"
echo "✅ AMR genes detected and reported"

Alternative: Running PHoeNIx from Remote Repository

If you prefer not to clone the repository locally, you can still run PHoeNIx directly from GitHub:

# Run PHoeNIx directly from GitHub repository
nextflow run cdcgov/phoenix \
    -r v2.1.1 \
    -profile singularity,slurm \
    -entry PHOENIX \
    --input /data/users/$USER/nextflow-training/phoenix-analysis/phoenix_samplesheet.csv \
    --kraken2db $KRAKEN2_DB_PATH \
    --outdir tb_analysis_results \
    -resume

# This approach downloads the pipeline automatically but doesn't give you
# local access to modify configuration files
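If you take this route, you can still pre-fetch and inspect the pipeline before launching anything, using Nextflow's built-in project commands:

# Download (or update) the pipeline into Nextflow's local project cache
nextflow pull cdcgov/phoenix -r v2.1.1

# Show repository details, including available branches and tags
nextflow info cdcgov/phoenix

# List all pipelines currently cached on this machine
nextflow list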

Exercise 4 (continued): Analyzing PHoeNIx Results

Exploring the Results Structure

# Navigate to results (using the actual output directory from the command)
cd /data/users/$USER/nextflow-training/phoenix-analysis/tb_analysis_results

# Check the overall structure
echo "📁 PHoeNIx Output Structure:"
ls -la

# Check the main summary report  
echo "📊 Summary Reports:"
ls -la summaries/ 2>/dev/null || echo "Summaries directory not yet created"

# View quality control results
echo "🔍 Quality Control Summary:"
head summaries/Phoenix_Summary.tsv 2>/dev/null || echo "Phoenix_Summary.tsv not yet generated"

# Check AMR results
echo "🦠 Antimicrobial Resistance Results:"
ls -la amr/ 2>/dev/null || echo "AMR directory not yet created"
head amr/*_amrfinder_all.tsv 2>/dev/null || echo "AMR files not yet generated"

# View assembly statistics  
echo "🧬 Assembly Statistics:"
ls -la assembly/ 2>/dev/null || echo "Assembly directory not yet created"
head assembly/*_assembly_stats.txt 2>/dev/null || echo "Assembly stats not yet generated"

# Check annotation results
echo "📝 Annotation Results:"
ls -la annotation/ 2>/dev/null || echo "Annotation directory not yet created"

Understanding Clinical Outputs

PHoeNIx provides clinical-grade outputs structured for pathogen genomics:

  1. Species Identification: Kraken2-based taxonomic classification
  2. Quality Metrics: Read quality, assembly statistics, contamination assessment
  3. AMR Profiling: Resistance genes via AMRFinderPlus and NCBI database
  4. MLST Typing: Multi-locus sequence typing for strain classification
  5. Comprehensive Summary: All results consolidated in Phoenix_Summary.tsv
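Once Phoenix_Summary.tsv has been written, a quick way to browse it is to align its tab-separated columns. The snippet below is a sketch; column names and order differ between PHoeNIx versions, so inspect the header first.

# View the consolidated summary with aligned columns (scroll sideways with the arrow keys)
column -t -s $'\t' summaries/Phoenix_Summary.tsv | less -S

# Or list the header fields to see what is reported
head -1 summaries/Phoenix_Summary.tsv | tr '\t' '\n' | nl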

Comparing with Your Exercise 3 Pipeline

Let's compare PHoeNIx results with our custom pipeline:

# Set environment variable for consistent paths
export KRAKEN2_DB_PATH=/data/users/kraken2_db_local

# Compare assembly approaches
echo "=== Exercise 3 Results ==="
ls -la /data/users/$USER/nextflow-training/results/assemblies/ 2>/dev/null || echo "Exercise 3 results not found"

echo "=== PHoeNIx Results ==="
ls -la /data/users/$USER/nextflow-training/phoenix-analysis/tb_analysis_results/assembly/ 2>/dev/null || echo "PHoeNIx results not yet generated"

# Compare annotation approaches
echo "=== Exercise 3 Prokka Annotation ==="
ls -la /data/users/$USER/nextflow-training/results/annotation/ 2>/dev/null || echo "Exercise 3 annotation not found"

echo "=== PHoeNIx Annotation (Prokka + additional tools) ==="
ls -la /data/users/$USER/nextflow-training/phoenix-analysis/tb_analysis_results/annotation/ 2>/dev/null || echo "PHoeNIx annotation not yet generated"

# Check PHoeNIx-specific clinical outputs
echo "=== PHoeNIx Clinical Features (not in Exercise 3) ==="
echo "• AMR Analysis:" && ls -la /data/users/$USER/nextflow-training/phoenix-analysis/tb_analysis_results/amr/ 2>/dev/null || echo "  AMR results pending"
echo "• MLST Typing:" && ls -la /data/users/$USER/nextflow-training/phoenix-analysis/tb_analysis_results/mlst/ 2>/dev/null || echo "  MLST results pending"  
echo "• Clinical Summary:" && ls -la /data/users/$USER/nextflow-training/phoenix-analysis/tb_analysis_results/summaries/ 2>/dev/null || echo "  Summary pending"

Key Learning Points from PHoeNIx Testing

Our successful PHoeNIx implementation demonstrates several critical production pipeline concepts:

1. Resource Management in Production Pipelines

  • Memory Requirements: Large reference databases (e.g., the standard Kraken2 database) can require hundreds of gigabytes of RAM; here 200GB was allocated per Kraken2 task
  • Hardcoded Limits: Production pipelines may have configuration files that override user settings
  • Resource Optimization: Different processes require different resource allocations

2. Pipeline Configuration Management

  • Base Configuration Files: Understanding conf/base.config vs user configuration
  • Entry Workflows: Complex pipelines may have multiple entry points (-entry PHOENIX)
  • Profile Management: Combining profiles (singularity,slurm) for different environments

3. Troubleshooting Production Issues

  • Exit Status 137: The process was killed (SIGKILL), almost always because it exceeded its memory allocation rather than because of a configuration error; see the sketch at the end of this section
  • Missing Files: Post-processing steps may require additional database files
  • Resume Functionality: Critical for long-running analyses in production

4. Real-World Bioinformatics Challenges

  • Database Dependencies: Production tools require specific database formats and files
  • Container Management: Singularity cache management for large container images
  • SLURM Integration: Proper job scheduling and resource allocation
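A practical way to confirm that an exit-137 failure really was a memory kill is to look at the per-task resource usage Nextflow records. The sketch below uses the built-in nextflow log command (run it from the launch directory; the listed fields are standard Nextflow trace fields):

# List previous runs in this launch directory
nextflow log

# Exit code and peak memory per task for the most recent run, filtered to Kraken2
nextflow log last -f name,status,exit,peak_rss | grep -i kraken2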

Key Learning Points

Production Pipeline Advantages

  1. Standardization: Consistent analysis across laboratories
  2. Validation: Extensively tested and validated
  3. Clinical Focus: Designed for healthcare applications
  4. Comprehensive: Includes all necessary analyses
  5. Reporting: Professional-grade output reports
  6. Maintenance: Actively maintained and updated

When to Use Production Pipelines

  • Clinical diagnostics: Patient sample analysis
  • Public health surveillance: Outbreak investigations
  • Regulatory compliance: FDA/CDC requirements
  • Multi-site studies: Standardized protocols
  • High-throughput: Large sample volumes

When to Build Custom Pipelines

  • Research questions: Novel analysis approaches
  • Specialized organisms: Non-standard pathogens
  • Method development: Testing new algorithms
  • Educational purposes: Learning workflow development
  • Resource constraints: Limited computational resources

🎉 Exercise 4 Success Summary

Congratulations! You have successfully:

✅ Configured PHoeNIx for production use with a 200GB memory allocation
✅ Resolved memory issues by modifying hardcoded pipeline configuration
✅ Implemented proper entry workflows using -entry PHOENIX
✅ Integrated SLURM scheduling with Singularity containerization
✅ Processed real TB genomic data through a CDC production pipeline
✅ Learned troubleshooting skills for production bioinformatics environments

Key Achievement: You can now run production-grade pathogen genomics pipelines that are used in real public health laboratories worldwide.

This exercise demonstrates how to integrate production-grade pipelines into your bioinformatics workflows, providing the foundation for real-world pathogen genomics analysis.


Running Workflows from GitHub

Why Use GitHub-Hosted Workflows?

Running workflows directly from GitHub repositories offers several advantages:

  • Version Control: Specify exact versions or branches
  • Collaboration: Share workflows easily with colleagues
  • Reproducibility: Ensure everyone uses the same workflow version
  • No Local Storage: No need to download workflows locally
  • Automatic Updates: Pull latest changes automatically
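Under the hood, Nextflow clones owner/repo into a local assets cache and reuses that copy on subsequent runs. The sketch below shows where the cache lives by default and how to clear a stale copy (the path assumes the default NXF_HOME; USERNAME is the same placeholder used below):

# Downloaded pipelines are cached under the Nextflow home directory
ls ~/.nextflow/assets/

# Remove a cached copy to force a fresh download on the next run
nextflow drop USERNAME/microbial-genomics-training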

Exercise 5: Running Your Workflow from GitHub

Step 1: Push Your Workflow to GitHub

First, let's push your Exercise 3 pipeline to GitHub:

# Navigate to your workflows directory
cd /users/$USER/microbial-genomics-training/workflows

# Initialize module system and load git
source /opt/lmod/8.7/lmod/lmod/init/bash
module load git

# Add your workflow files
git add qc_pipeline.nf nextflow.config samplesheet.csv
git commit -m "Add Exercise 3 QC pipeline for GitHub execution"

# Push to your GitHub repository
git push origin main

Step 2: Run Workflow from GitHub

Now anyone can run your workflow directly from GitHub:

# Run your workflow from GitHub (replace USERNAME with your GitHub username)
# Nextflow resolves pipelines as owner/repo; -main-script points at a script
# inside the repository (the default is main.nf at the repository root).
# If your nextflow.config also lives in workflows/, pass it explicitly with -c.
nextflow run USERNAME/microbial-genomics-training \
    -main-script workflows/qc_pipeline.nf \
    --input /users/$USER/microbial-genomics-training/workflows/samplesheet.csv \
    --outdir /data/users/$USER/nextflow-training/results_github

# Run a specific release tag
nextflow run USERNAME/microbial-genomics-training \
    -r v1.0 \
    -main-script workflows/qc_pipeline.nf \
    --input /users/$USER/microbial-genomics-training/workflows/samplesheet.csv \
    --outdir /data/users/$USER/nextflow-training/results_v1

# Run from a specific branch
nextflow run USERNAME/microbial-genomics-training \
    -r development \
    -main-script workflows/qc_pipeline.nf \
    --input /users/$USER/microbial-genomics-training/workflows/samplesheet.csv \
    --outdir /data/users/$USER/nextflow-training/results_dev

Step 3: Share with Collaborators

Your colleagues can now run your workflow:

# Anyone with access can run your workflow
nextflow run USERNAME/microbial-genomics-training \
    -main-script workflows/qc_pipeline.nf \
    --input their_samplesheet.csv \
    --outdir their_results/

# They can also specify different parameters
nextflow run USERNAME/microbial-genomics-training \
    -main-script workflows/qc_pipeline.nf \
    --input their_data.csv \
    --outdir their_results/ \
    --adapters /path/to/their/adapters.fa

Step 4: Version Management

Use Git tags for stable releases:

# Create a release version
git tag -a v1.0 -m "Release version 1.0: Stable QC pipeline"
git push origin v1.0

# Users can now run the stable version
nextflow run USERNAME/microbial-genomics-training \
    -r v1.0 \
    -main-script workflows/qc_pipeline.nf \
    --input samplesheet.csv \
    --outdir results_stable/
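Collaborators can check which revisions are available before choosing one; a short sketch (USERNAME is the same placeholder as above):

# List the branches and tags Nextflow can see for the repository
nextflow info USERNAME/microbial-genomics-training

# Locally, list the tags you have created
git tag -l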

Benefits of GitHub Workflow Execution

  1. Centralized Distribution: One source of truth for your workflow
  2. Version Control: Track changes and maintain stable releases
  3. Collaboration: Easy sharing and contribution from team members
  4. Documentation: README files and wiki for usage instructions
  5. Issue Tracking: Bug reports and feature requests
  6. Continuous Integration: Automated testing with GitHub Actions

Best Practices

  • Tag Releases: Use semantic versioning (v1.0.0, v1.1.0, etc.)
  • Document Parameters: Clear README with parameter descriptions
  • Test Data: Include small test datasets for validation
  • License: Add appropriate license for sharing
  • Changelog: Document changes between versions
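The "Test Data" point above is worth acting on before tagging a release. One common approach, sketched here, is to downsample a real FASTQ pair into a tiny test set with seqtk (seqtk is assumed to be installed or available as a module; the file names are placeholders):

# Downsample ~1000 read pairs; use the same seed (-s) for R1 and R2 so pairs stay matched
mkdir -p test_data
seqtk sample -s100 sample_R1.fastq.gz 1000 > test_data/test_R1.fastq
seqtk sample -s100 sample_R2.fastq.gz 1000 > test_data/test_R2.fastq

# Commit the small files alongside a test samplesheet so others can validate the pipeline
git add test_data/
git commit -m "Add small test dataset"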

Key Skills Developed

  • Git workflow: Track changes, collaborate, and manage versions
  • Container integration: Use Docker for reproducible environments
  • MTB genomics: Understand pathogen-specific analysis requirements
  • Clinical applications: Apply genomics to real-world healthcare problems
  • Professional development: Document, test, and deploy pipelines properly

Tomorrow: Day 8 Preview

Day 8 will focus on Comparative Genomics:

  • Pan-genome analysis with your MTB pipeline outputs
  • Phylogenetic inference from core genome SNPs
  • Tree construction and visualization
  • Population genomics and outbreak investigation

Your production MTB pipeline from today will provide the foundation for comparative analysis across multiple isolates!


Resources

Documentation

Container Repositories

MTB-Specific Tools

Professional Development


PHoeNIx Production Troubleshooting Quick Reference

Essential fixes for common deployment issues:

KRAKENTOOLS Format Error

# Error: "ValueError: not enough values to unpack (expected 5, got 1)"
# Cause: Wrong ktaxonomy.tsv format (4 fields vs 5)
# Fix: Use correct PHoeNIx format
cp phoenix/assets/databases/ktaxonomy.tsv ../kraken2_db_local/
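After copying the file, a quick sanity check that the database directory now contains it (the paths follow the cp command above, relative to the directory containing your phoenix checkout):

# Confirm the file is present and non-empty, and eyeball its format
ls -lh ../kraken2_db_local/ktaxonomy.tsv
head -3 ../kraken2_db_local/ktaxonomy.tsv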