Day 9: Bring your own data¶

Date: September 11, 2025
Duration: 09:00-13:00 CAT
Focus: Mobile genetic elements in AMR, independent analysis of participant datasets

Overview¶

Day 9 begins with understanding the role of mobile genetic elements in AMR spread, then transitions to hands-on application of all techniques learned throughout the course. Participants will analyze their own datasets with guidance from trainers, troubleshoot real-world challenges, and develop customized analysis approaches for their specific research questions.

Learning Objectives¶

By the end of Day 9, you will be able to:

Understand the role of plasmids, integrons, and transposons in AMR dissemination
Apply learned bioinformatics techniques to your own research data
Troubleshoot common issues encountered in real-world analyses
Adapt standard protocols to meet specific research requirements
Develop analysis strategies for novel research questions
Integrate multiple analysis approaches into comprehensive workflows
Plan sustainable bioinformatics practices for ongoing research

Schedule¶

Time (CAT)	Topic	Trainer
09:00	Role of plasmids, integrons, and transposons in AMR spread	Ephifania Geza
10:30	Participants to analyse their own data	All trainers
11:30	Break
12:00	Participants to analyse their own data	All trainers

Session Structure¶

Mobile Genetic Elements Session (09:00-10:30)¶

Understanding plasmids and their role in horizontal gene transfer
Integrons and gene cassette systems
Transposons and insertion sequences
Tools for mobile element detection and analysis
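
As a first look before running a dedicated plasmid tool (for example MOB-suite or PlasmidFinder), small circular contigs in an assembly can be listed as plasmid candidates. The sketch below is a heuristic only. It assumes Unicycler-style FASTA headers that carry `length=`, `depth=` and `circular=true` fields; the function name `circular_contigs` and the 500000 bp default cutoff are illustrative choices, not course-defined values.

```shell
# circular_contigs: list circular contigs at or below a length cutoff.
# Assumes Unicycler-style headers, e.g.
#   >2 length=98765 depth=2.10x circular=true
# Usage: circular_contigs ASSEMBLY.fasta [MAX_LENGTH]
# The default cutoff (500000 bp) is an arbitrary illustrative value.
circular_contigs() {
    awk -v max="${2:-500000}" '
        /^>/ && /circular=true/ {
            len = 0; depth = "NA"
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^length=/) { len = substr($i, 8) + 0 }   # text after "length="
                if ($i ~ /^depth=/)  { depth = substr($i, 7) }     # text after "depth="
            }
            if (len > 0 && len <= max + 0) {
                # contig name, length in bp, relative depth
                printf "%s\t%d\t%s\n", substr($1, 2), len, depth
            }
        }
    ' "$1"
}
```

A circular contig with a depth well above 1x may indicate a multi-copy plasmid, but this needs confirming with replicon typing and annotation.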

Opening Data Analysis Session (10:30-10:45)¶

Brief recap of key course concepts
Overview of available support and resources
Formation of analysis groups based on data types
Technical setup verification
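
For the technical setup verification, one option is to check that the tools you plan to use are on your `PATH` before the analysis time starts. This is a minimal sketch; the function name `check_tools` is hypothetical, and the tool list in the example is taken from the workflows on this page.

```shell
# check_tools: report which of the named commands are available on PATH.
# Returns 0 if every tool is found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'OK\t%s\n' "$tool"
        else
            printf 'MISSING\t%s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (adjust the list to your own analysis plan):
# check_tools fastqc multiqc kraken2 trimmomatic spades.py quast.py prokka
```

If a tool is reported as missing, activating the relevant software environment or container usually resolves it; ask a trainer if it does not.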

Individual Analysis Time (10:45-11:30)¶

Independent work on participant datasets
One-on-one consultation with trainers
Peer collaboration and knowledge sharing
Documentation of analysis approaches

Brief presentations of initial findings
Discussion of challenges encountered
Sharing of successful analysis strategies

Advanced Analysis (12:00-13:00)¶

Continued independent work
Focus on complex analyses or troubleshooting
Preparation for Day 10 presentations
Final consultations with trainers

Data Types and Analysis Approaches¶

Genomic Data¶

Common analyses for participants with genomic datasets:

Quality control and preprocessing
  FastQC assessment and interpretation
  Adapter trimming and quality filtering
  Species identification and contamination detection
Genome assembly and annotation
  De novo assembly optimization
  Assembly quality assessment
  Functional annotation and gene prediction
Comparative genomics
  MLST and serotyping
  Antimicrobial resistance gene detection
  Phylogenetic analysis and clustering

Metagenomic Data¶

For participants with microbiome or metagenomic samples:

Community profiling
  Taxonomic classification
  Abundance estimation and normalization
  Diversity analysis (alpha and beta)
Functional analysis
  Pathway reconstruction
  Antimicrobial resistance profiling
  Metabolic potential assessment
Clinical applications
  Pathogen detection in complex samples
  Co-infection analysis
  Treatment response monitoring

Outbreak Investigation Data¶

For epidemiological and outbreak datasets:

Transmission analysis
  SNP-based clustering
  Phylogenetic reconstruction
  Temporal and geographic analysis
Resistance surveillance
  Multi-drug resistance patterns
  Resistance gene distribution
  Treatment outcome correlations
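
SNP-based clustering starts from pairwise SNP distances between isolates. The sketch below shows the idea on a small aligned multi-FASTA file. It assumes all sequences are already aligned to the same length, counts only positions where both sequences carry A, C, G or T, and is far too slow for large alignments, where a dedicated tool such as snp-dists is the better choice. The function name `snp_distances` is hypothetical.

```shell
# snp_distances: print pairwise SNP distances for an aligned multi-FASTA file.
# Positions where either sequence has a gap, N or other ambiguity code are ignored.
# Output: id1 <TAB> id2 <TAB> distance, one line per pair.
snp_distances() {
    awk '
        /^>/ { n++; id[n] = substr($1, 2); next }                 # new sequence record
        n > 0 { gsub(/[ \t\r]/, ""); seq[n] = seq[n] toupper($0) } # join wrapped lines
        END {
            for (i = 1; i < n; i++) {
                for (j = i + 1; j <= n; j++) {
                    d = 0
                    len = length(seq[i])
                    for (k = 1; k <= len; k++) {
                        a = substr(seq[i], k, 1)
                        b = substr(seq[j], k, 1)
                        if (a != b && a ~ /^[ACGT]$/ && b ~ /^[ACGT]$/) d++
                    }
                    printf "%s\t%s\t%d\n", id[i], id[j], d
                }
            }
        }
    ' "$1"
}
```

Which distance counts as "the same cluster" depends on the organism and the outbreak context, so any threshold should be justified from the literature or discussed with a trainer.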

Technical Support Available¶

Computational Resources¶

Access to high-performance computing cluster
Pre-installed bioinformatics software environments
Container images for reproducible analysis
Shared storage for large datasets

Analysis Pipelines¶

Nextflow workflows developed during the course
Customizable analysis templates
Pre-configured environment profiles
Automated reporting tools

Expert Guidance¶

Trainer specializations available:

Trainer	Expertise Areas
Ephifania Geza	Genomic surveillance, AMR analysis, metagenomics, clinical applications
Arash Iranzadeh	Phylogenomics, comparative genomics, outbreak investigation
Sindiswa Lukhele	Sequencing technologies, quality control, species identification
Mamana Mbiyavanga	Workflow development, HPC systems, pipeline optimization

Common Analysis Workflows¶

Genomic Surveillance Workflow¶

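# 1. Initial data assessment
# FastQC writes next to the input files by default, so send reports to the
# directory that MultiQC reads
mkdir -p fastqc_results
fastqc -o fastqc_results/ raw_data/*.fastq.gz
multiqc fastqc_results/

# 2. Species identification
# Without --report/--output, Kraken2 prints every read classification to the terminal
kraken2 --db minikraken2_v2 --paired sample_R1.fastq sample_R2.fastq \
    --report sample.kreport --output sample.kraken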

# 3. Quality trimming
trimmomatic PE sample_R1.fastq sample_R2.fastq \
    sample_R1_trimmed.fastq sample_R1_unpaired.fastq \
    sample_R2_trimmed.fastq sample_R2_unpaired.fastq \
    SLIDINGWINDOW:4:20 MINLEN:50

# 4. Assembly
spades.py -1 sample_R1_trimmed.fastq -2 sample_R2_trimmed.fastq -o assembly/

# 5. Assembly quality assessment
quast.py assembly/scaffolds.fasta -o quast_results/

# 6. Annotation
prokka assembly/scaffolds.fasta --outdir annotation/ --prefix sample
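
As a quick cross-check on the QUAST report from step 5, contig count, total length and N50 can be computed directly from the FASTA file. This is a minimal sketch using only `awk` and `sort`; the function name `assembly_stats` is hypothetical, and QUAST remains the reference for assembly metrics.

```shell
# assembly_stats: print contig count, total length and N50 for a FASTA file.
# N50 is the length of the contig at which the running total of contig lengths,
# sorted from longest to shortest, first reaches half of the assembly size.
assembly_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous contig length
        { gsub(/[ \t\r]/, ""); len += length($0) }        # sequence line: accumulate length
        END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
        }
    '
}

# Example:
# assembly_stats assembly/scaffolds.fasta
```

A total length far from the expected genome size, or a very large number of short contigs, is a reason to revisit trimming and contamination checks before annotation.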

Metagenomic Analysis Workflow¶

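# 1. Host DNA removal (if applicable)
# --output-prefix makes the output names match the inputs used in steps 2 and 3
# (newer KneadData releases take paired reads as --input1/--input2)
kneaddata --input sample_R1.fastq --input sample_R2.fastq \
    --reference-db human_genome --output cleaned_data/ --output-prefix sample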

# 2. Taxonomic profiling
metaphlan cleaned_data/sample_paired_1.fastq,cleaned_data/sample_paired_2.fastq \
    --bowtie2out sample.bowtie2.bz2 --nproc 4 --input_type fastq \
    --output_file sample_profile.txt

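# 3. Functional profiling
# HUMAnN takes a single input file, so concatenate the cleaned read pairs first
cat cleaned_data/sample_paired_1.fastq cleaned_data/sample_paired_2.fastq \
    > cleaned_data/sample_paired.fastq
humann --input cleaned_data/sample_paired.fastq \
    --output functional_analysis/ --nucleotide-database chocophlan \
    --protein-database uniref90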

# 4. Diversity analysis in R
Rscript diversity_analysis.R sample_profile.txt metadata.csv
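
The contents of `diversity_analysis.R` are not shown here. To illustrate what an alpha diversity value is, the sketch below computes the Shannon index, H = -sum(p_i * ln p_i), from a two-column tab-separated table of taxon and abundance read from standard input. The function name `shannon_index` and the input layout are assumptions for illustration, not the course script.

```shell
# shannon_index: compute the Shannon diversity index from "taxon<TAB>abundance" lines.
# Abundances are rescaled to proportions, so counts and percentages both work.
# Comment lines, header lines and zero abundances are skipped.
shannon_index() {
    awk -F '\t' '
        /^#/ { next }
        ($2 + 0) > 0 { abund[NR] = $2 + 0; total += $2 + 0 }
        END {
            if (total == 0) { print "NA"; exit }
            for (i in abund) { p = abund[i] / total; h -= p * log(p) }
            printf "%.4f\n", h
        }
    '
}

# Example, assuming species rows contain "s__" and the relative abundance is in
# column 3 of the MetaPhlAn profile (the column position depends on the version):
# grep 's__' sample_profile.txt | grep -v 't__' | cut -f1,3 | shannon_index
```

Higher values indicate more taxa with more even abundances. Values are only comparable between samples processed with the same profiling settings.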

Troubleshooting Guide¶

Common Issues and Solutions¶

Low-Quality Data¶

# Check read quality distribution
fastqc *.fastq.gz

# Aggressive quality trimming if needed
trimmomatic PE input_R1.fastq input_R2.fastq \
    output_R1.fastq unpaired_R1.fastq \
    output_R2.fastq unpaired_R2.fastq \
    SLIDINGWINDOW:4:25 LEADING:20 TRAILING:20 MINLEN:75

# Consider different assembly strategies
# For poor quality data, try more conservative parameters
spades.py --careful --cov-cutoff auto -1 R1.fastq -2 R2.fastq -o assembly/

Contamination Issues¶

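# Check for contamination
# --output saves the per-read classifications needed by the extraction step
kraken2 --db standard --paired sample_R1.fastq sample_R2.fastq \
    --report contamination_report.txt --output sample.kraken

# Remove contaminant sequences (extract_kraken_reads.py is part of KrakenTools)
# --fastq-output keeps FASTQ format; the default output is FASTA
extract_kraken_reads.py -k sample.kraken -s1 sample_R1.fastq \
    -s2 sample_R2.fastq -o clean_R1.fastq -o2 clean_R2.fastq \
    --exclude --taxid 9606 --fastq-output  # Exclude human reads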
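
To read the contamination report quickly, the species-level rows above a chosen percentage can be pulled out with `awk`. This sketch assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, direct reads, rank code, taxid, indented name); the function name `species_above` and the default 1% threshold are illustrative.

```shell
# species_above: list species-level taxa at or above a percentage threshold.
# Usage: species_above REPORT [MIN_PERCENT]   (default threshold: 1)
# Output: percentage <TAB> NCBI taxid <TAB> species name
species_above() {
    awk -F '\t' -v min="${2:-1}" '
        $4 == "S" && ($1 + 0) >= (min + 0) {
            name = $6
            sub(/^ +/, "", name)   # strip the indentation that encodes tree depth
            printf "%.2f\t%s\t%s\n", $1, $5, name
        }
    ' "$1"
}

# Example:
# species_above contamination_report.txt 1
```

More than one species with a substantial share of reads suggests contamination or a mixed sample, although closely related species can also split reads between them.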

Assembly Problems¶

# If SPAdes fails, try different assemblers
# Unicycler for hybrid assembly (needs long reads in addition to the short reads)
unicycler -1 short_R1.fastq -2 short_R2.fastq -l long_reads.fastq -o assembly/

# Or SKESA for quick assembly
skesa --reads short_R1.fastq,short_R2.fastq --cores 8 > assembly.fasta

Memory/Resource Issues¶

# Monitor resource usage
htop

# Cap SPAdes memory (in GB) to fit the available RAM
# SPAdes stops with an error if it needs more than this limit
spades.py --memory 16 -1 R1.fastq -2 R2.fastq -o assembly/

# Use subsampling for initial testing
seqtk sample -s100 input_R1.fastq 100000 > subset_R1.fastq
seqtk sample -s100 input_R2.fastq 100000 > subset_R2.fastq
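
Using the same seed (`-s100`) for both files keeps the read pairs in sync. After subsampling or trimming, it is worth confirming that R1 and R2 still hold the same number of reads. The sketch below assumes standard four-line FASTQ records; the function names `count_reads` and `pairs_match` are hypothetical.

```shell
# count_reads: number of records in a FASTQ file (plain or gzip-compressed),
# assuming the usual four lines per record.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# pairs_match: print both read counts and succeed only if they are equal.
pairs_match() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    printf '%s: %s reads\n%s: %s reads\n' "$1" "$r1" "$2" "$r2"
    [ "$r1" = "$r2" ]
}

# Example:
# pairs_match subset_R1.fastq subset_R2.fastq || echo "Read pairs are out of sync"
```

A count that is not a whole number points to a truncated or malformed file.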

Analysis Documentation¶

Laboratory Notebook Template¶

# Analysis Log: [Your Dataset Name]
**Date**: [Current Date]
**Analyst**: [Your Name]
**Data Source**: [Description of samples]

## Objectives
- Primary research question
- Specific analyses planned
- Expected outcomes

## Data Description
- Sample type and collection method
- Sequencing platform and parameters
- Data quality metrics

## Analysis Steps
### Step 1: Quality Control
- Command used: `fastqc *.fastq.gz`
- Results: [Summary of quality metrics]
- Decision: [Any quality filtering applied]

### Step 2: [Next Analysis]
- Command: [Exact command used]
- Parameters chosen: [Rationale for parameter selection]
- Results: [Key findings]

## Challenges Encountered
- Issue: [Description of problem]
- Solution attempted: [What was tried]
- Outcome: [Whether resolved]

## Key Findings
- [Major results from analysis]
- [Statistical summaries]
- [Biological interpretations]

## Next Steps
- Additional analyses needed
- Questions raised
- Follow-up experiments

Resource Management¶

Data Organization¶

# Recommended directory structure
project_name/
├── raw_data/          # Original sequencing files
├── quality_control/   # QC reports and cleaned data
├── analysis/          # Main analysis outputs
├── scripts/           # Custom scripts and commands
├── results/           # Final results and figures
└── documentation/     # Analysis logs and notes
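
The layout above can be created in one step. This is a minimal sketch; the function name `init_project` is hypothetical and the directory names are the ones listed in the structure above.

```shell
# init_project: create the recommended directory layout under a project folder.
# Usage: init_project PROJECT_NAME
init_project() {
    project=${1:?usage: init_project PROJECT_NAME}
    for dir in raw_data quality_control analysis scripts results documentation; do
        mkdir -p "$project/$dir"
    done
    # Optional, once the raw files are copied in: make them read-only
    # chmod -R a-w "$project/raw_data"
}

# Example:
# init_project my_amr_project
```

Because `mkdir -p` is used, running the function again on an existing project leaves existing files untouched.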

Backup Strategies¶

# Regular backup of important results
rsync -av results/ backup_drive/project_results/
tar -czf analysis_$(date +%Y%m%d).tar.gz analysis/

# Version control for scripts
git init
git add scripts/
git commit -m "Initial analysis scripts"
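
A backup is only useful if the copied files are intact. One way to check this is to store checksums alongside the results and verify them on the backup copy. The sketch below assumes GNU `sha256sum` is available; the function names and the manifest file name `MANIFEST.sha256` are illustrative.

```shell
# write_manifest: record SHA-256 checksums for every file under a directory.
# The manifest is written inside that directory and excludes itself.
write_manifest() {
    ( cd "$1" && find . -type f ! -name MANIFEST.sha256 -exec sha256sum {} + > MANIFEST.sha256 )
}

# verify_manifest: succeed only if every listed file still matches its checksum.
verify_manifest() {
    ( cd "$1" && sha256sum -c MANIFEST.sha256 )
}

# Example:
# write_manifest results/
# rsync -av results/ backup_drive/project_results/
# verify_manifest backup_drive/project_results/
```

Both functions run in a subshell, so the working directory of your session is not changed.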

Collaboration Guidelines¶

Peer Support¶

Form analysis groups based on similar data types
Share successful parameter combinations
Collaborate on troubleshooting challenging datasets
Review each other's analysis approaches

Trainer Consultation¶

Prepare specific questions about your data
Document issues with exact error messages
Have your analysis objectives clearly defined
Be ready to explain your research context

Assessment and Preparation for Day 10¶

Presentation Preparation¶

Participants should prepare a 5-minute presentation covering:

Research Question: What you aimed to investigate
Data Overview: Type and source of your dataset
Methods Applied: Which course techniques you used
Key Results: Main findings from your analysis
Challenges: Obstacles encountered and solutions found
Future Directions: Next steps for your research

Technical Documentation¶

Save all commands used in a script file
Document parameter choices and rationale
Prepare summary statistics and key figures
Note any analysis limitations or assumptions
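
One lightweight way to save every command as you work is a small wrapper that writes the command to a log before running it. This is a sketch, not a course-provided tool; the function name `run_logged`, the `ANALYSIS_LOG` variable and the default file name `commands.log` are all illustrative.

```shell
# run_logged: append a timestamped copy of a command to a log file, then run it.
# The log path defaults to commands.log and can be changed by setting ANALYSIS_LOG.
# The exit status is that of the command itself.
run_logged() {
    printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "${ANALYSIS_LOG:-commands.log}"
    "$@"
}

# Example:
# ANALYSIS_LOG=documentation/commands.log
# run_logged quast.py assembly/scaffolds.fasta -o quast_results/
```

The log records arguments joined by spaces, so quoting is not preserved, and shell redirections or pipes are not captured. For those steps, copy the full command line into the log or script file by hand.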

Success Metrics¶

By the end of Day 9, participants should have:

Successfully processed their own dataset
Applied at least 3 different analysis techniques from the course
Documented their analysis workflow
Identified key findings relevant to their research
Prepared materials for Day 10 presentation
Established ongoing analysis plan

Resources for Continued Learning¶

Online Communities¶

Software Documentation¶

Tool-specific manuals and tutorials
GitHub repositories for pipeline development
Container registries for reproducible environments

Professional Development¶

Local bioinformatics user groups
International conferences and workshops
Online course platforms for advanced topics

Looking Ahead¶

Day 10 Preview: Wrap-up session including:

Participant presentations of analysis results
Discussion of lessons learned and best practices
Information about ongoing support resources
Course completion and next steps planning

Key Learning Outcome: Independent application of bioinformatics skills to real research data builds confidence and reveals the practical challenges and rewards of computational biology in actual research contexts.