HPC and ILIFU Training Materials¶
Getting Started¶
This document provides an overview of HPC concepts and ILIFU infrastructure.
For hands-on practice:
- SLURM Jobs: See High Performance Computing with SLURM: Practical Tutorial for step-by-step SLURM exercises
- Unix Commands: See Unix Commands for Pathogen Genomics - Practical Tutorial for genomics command-line basics
Quick Setup (if needed for examples below):
mkdir -p ~/hpc_practice && cd ~/hpc_practice
cp -r /cbio/training/courses/2025/micmet-genomics/sample-data/* .
Table of Contents¶
- Introduction to High Performance Computing (HPC)
- ILIFU Infrastructure Overview
- Getting Started with ILIFU
- SLURM Job Scheduling
- Resource Allocation and Management
- Best Practices
- Practical Examples
- Troubleshooting
Introduction to HPC¶
What is High Performance Computing?¶
High Performance Computing (HPC) is the use of powerful computers with multiple processors working in parallel to solve complex computational problems that require significant processing power, memory, or time.
Key Characteristics:
- Parallel processing: Multiple CPUs/cores work simultaneously on the same problem
- Cluster architecture: Hundreds or thousands of interconnected compute nodes
- High memory capacity: Large RAM for data-intensive computations
- Fast storage systems: High-speed file systems for handling large datasets
- Job scheduling: Queue management systems to optimize resource allocation
- Specialized hardware: GPUs, high-speed interconnects (InfiniBand), and custom processors
Why Use HPC?¶
- Speed: Complete computations faster than desktop computers
- Scale: Handle larger datasets and more complex problems
- Efficiency: Optimize resource utilization
- Cost-effective: Share expensive hardware among researchers
Traditional Computing vs HPC¶
Desktop Computer HPC Cluster
┌──────────────┐ ┌─────┬─────┬─────┐
│ 1 CPU │ vs │Node1│Node2│Node3│
│ 8GB RAM │ │ 32 │ 64 │128 │
│ 1TB Disk │ │cores│cores│cores│
└──────────────┘ └─────┴─────┴─────┘
HPC = Many computers working together
| Traditional Computing | HPC Cluster |
|---|---|
| Single processor | Hundreds of processors |
| Limited memory (8-32GB) | Massive shared memory (TB) |
| Local storage (TB) | Distributed storage (PB) |
| Individual use | Shared resources |
| Desktop/Laptop | Specialized data centers |
Why Do We Need HPC?¶
Real-world Problems That Need HPC¶
🔬 Astronomy: Processing telescope data¶
- Data volume: 1-2 TB per night from modern telescopes
- Local machine: 3-4 weeks to process one night's data
- HPC cluster: 2-3 hours with parallel processing
- Example: MeerKAT telescope generates 2.5 TB/hour during observations
🧬 Bioinformatics: Genome assembly¶
- Data size: 100-300 GB of raw sequencing reads
- Local machine (8GB RAM): Often fails due to memory limits
- Local machine (32GB RAM): 2-3 weeks for bacterial genome
- HPC cluster: 4-6 hours with 256GB RAM
- Example: Human genome assembly needs ~1TB RAM, impossible on most desktops
🌡️ Climate Modeling: Weather simulations¶
- Computation: Millions of grid points × thousands of time steps
- Local machine: 6-8 months for regional model (if it runs at all)
- HPC cluster: 12-24 hours on 100+ cores
- Example: 10km resolution global model needs 10,000+ CPU hours
🧮 Machine Learning: Training deep neural networks¶
- Model size: GPT-3 has 175 billion parameters
- Local machine (single GPU): 355 years to train
- HPC cluster (1000 GPUs): 34 days
- Example: Training ResNet-50 on ImageNet: 2 weeks (laptop) → 1 hour (8 GPUs)
🦠 Pathogen Genomics: Outbreak analysis¶
- Dataset: 1000 M. tuberculosis genomes for outbreak investigation
- Local machine tasks and times:
- Quality control: 50 hours (3 min/sample)
- Read alignment: 167 hours (10 min/sample)
- Variant calling: 83 hours (5 min/sample)
- Phylogenetic tree: 48-72 hours
- Total: ~15 days of continuous processing
- HPC cluster:
- All samples in parallel: 4-6 hours total
- Tree construction on high-memory node: 2-3 hours
- Real example: COVID-19 surveillance processing 10,000 genomes weekly - impossible without HPC
Additional Pathogen Genomics Use Cases¶
🧬 Bacterial Genome Assembly (Illumina + Nanopore)¶
- Dataset: Hybrid assembly of 50 bacterial isolates
- Computational requirements:
- RAM: 16-32GB per genome
- CPU: 8-16 cores optimal per assembly
- Local machine (16GB RAM, 4 cores):
- One genome at a time only
- Per genome: 3-4 hours
- Total time: 150-200 hours (6-8 days)
- Risk of crashes with large genomes
- HPC cluster (256GB RAM, 32 cores/node):
- Process 8 genomes simultaneously per node
- Use 7 nodes for all 50 genomes
- Total time: 3-4 hours
- Speedup: 50x faster
💊 AMR Gene Detection Across Multiple Species¶
- Dataset: 5000 bacterial genomes from hospital surveillance
- Tools: AMRFinder, CARD-RGI, ResFinder
- Computational requirements:
- Database size: 2-5GB per tool
- RAM: 4-8GB per genome
- CPU time: 5-10 minutes per genome per tool
- Local machine (8 cores):
- Sequential processing: 5000 × 3 tools × 7.5 min = 1875 hours (78 days)
- Database loading overhead adds 20% more time
- HPC cluster (100 nodes, 32 cores each):
- Parallel processing across nodes
- Shared database in memory
- Total time: 6-8 hours
- Speedup: 230x faster
🌍 Phylogeographic Analysis of Cholera Outbreak¶
- Dataset: 2000 V. cholerae genomes from Haiti outbreak
- Computational requirements:
- Alignment: 100GB RAM for reference-based
- SNP calling: 4GB per genome
- Tree building (RAxML-NG): 64-128GB RAM
- BEAST analysis: 32GB RAM, 1000+ hours CPU time
- Local machine attempts:
- Alignment: Often fails (out of memory)
- If successful: 48 hours
- SNP calling: 133 hours (4 min/genome)
- RAxML tree: Fails on most laptops (needs >64GB RAM)
- BEAST: 6-8 weeks for proper MCMC convergence
- HPC cluster:
- Alignment: 2 hours on high-memory node
- SNP calling: 2 hours (parallel)
- RAxML: 4-6 hours on 64 cores
- BEAST: 48 hours on 32 cores
- Total: 2-3 days vs 2-3 months
🔬 Real-time Nanopore Sequencing Analysis¶
- Scenario: Meningitis outbreak, need results in <24 hours
- Data flow: 20 samples, 5GB data/sample, arriving over 12 hours
- Pipeline: Basecalling → QC → Assembly → Typing → AMR
- Local machine challenges:
- Can't keep up with data generation
- Basecalling alone: 2 hours/sample (40 hours total)
- Sequential processing: Miss the 24-hour deadline
- HPC solution:
- Real-time processing as data arrives
- GPU nodes for basecalling: 10 min/sample
- Parallel assembly and analysis
- Results available within 2-3 hours of sequencing
- Clinical impact: Treatment decisions in same day
Computational Requirements Comparison Table¶
| Task | Local Machine | HPC Cluster | Speedup |
|---|---|---|---|
| QC of 100 M. tuberculosis genomes | 8GB RAM, 5 hours | 256GB RAM, 10 min | 30x |
| 1000-genome alignment | 16GB RAM, 7 days | 32GB/node × 50 nodes, 3 hours | 56x |
| Phylogenetic tree (5000 taxa) | Often fails (>64GB needed) | 512GB RAM, 6 hours | ∞ |
| Pan-genome analysis (500 genomes) | 32GB RAM, 2 weeks | 256GB RAM, 8 hours | 42x |
| GWAS (10,000 samples) | Impossible (<1TB RAM) | 1TB RAM node, 24 hours | ∞ |
| Metagenomic assembly | 64GB RAM, 3 days | 512GB RAM, 4 hours | 18x |
Why These Tasks Fail on Local Machines¶
- Memory Walls:
  - De novo assembly: Needs 100-1000x coverage data in RAM
  - Tree building: O(n²) memory for distance matrices
  - Pan-genome: Stores all genomes simultaneously
- Time Constraints:
  - Outbreak response: Need results in hours, not weeks
  - Grant deadlines: Can't wait months for analysis
  - Iterative analysis: Need to test multiple parameters
- Data Volume:
  - Modern sequencers: 100-500GB per run
  - Surveillance programs: 100s of genomes weekly
  - Data often can't even be stored on a laptop (typical: 256GB-1TB SSD)
ILIFU Infrastructure¶
What is ILIFU?¶
- The name comes from "ilifu", the isiXhosa word for "cloud"; the facility is run by a partnership that includes the Inter-University Institute for Data Intensive Astronomy (IDIA)
- South African national research data facility
- Supports astronomy, bioinformatics, and other data-intensive sciences
- Located at University of Cape Town and University of the Western Cape
ILIFU Services¶
- Compute Cluster: High-performance computing resources
- Storage: Large-scale data storage solutions
- Cloud Services: Virtualized computing environments
- Data Transfer: High-speed data movement capabilities
- Support: Technical assistance and training
ILIFU Cluster Architecture¶
ILIFU is a cloud computing infrastructure designed for data-intensive research in astronomy, bioinformatics, and other computational sciences. The facility operates on an OpenStack platform with containerized workloads using Singularity and job scheduling through SLURM.
graph TB
subgraph "ILIFU Infrastructure"
subgraph "Cloud Platform"
OS[OpenStack Cloud<br/>Infrastructure-as-a-Service]
end
subgraph "Compute Resources"
CN[Compute Nodes<br/>Max: 96 CPUs per job<br/>Max: 1500 GB RAM per job]
end
subgraph "Container Platform"
SP[Singularity Containers<br/>HPC-optimized<br/>Rootless execution]
end
subgraph "Job Scheduler"
SL[SLURM Workload Manager<br/>Max runtime: 336 hours]
end
end
subgraph "Research Domains"
AST[Astronomy<br/>MeerKAT, SKA]
BIO[Bioinformatics<br/>Genomics, Metagenomics]
DS[Data Science<br/>ML/AI Research]
end
OS --> CN
CN --> SP
SP --> SL
AST --> SL
BIO --> SL
DS --> SL
style OS fill:#e1f5fe
style CN fill:#e8f5e9
style SP fill:#fff3e0
style SL fill:#f3e5f5
Figure: ILIFU cloud infrastructure architecture supporting multiple research domains
Resource Specifications¶
Based on ILIFU's cloud infrastructure configuration:
Maximum Job Resources¶
- CPUs: Up to 96 cores per job
- Memory: Up to 1500 GB (1.5 TB) RAM per job
- Runtime: Maximum 336 hours (14 days) per job
- Storage: Distributed file systems for large-scale data
Key Features¶
- OpenStack Platform: Provides flexible cloud computing resources
- Singularity Containers: Enables reproducible, portable workflows
- SLURM Scheduler: Manages resource allocation and job queuing
- Multi-domain Support: Serves astronomy, bioinformatics, and data science communities
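The container workflow mentioned above typically looks like the minimal sketch below. The image name is only an illustration (a public Ubuntu image from Docker Hub); on ILIFU you would normally use a site-provided or project-specific container, so check the local documentation for available images.

# Pull a public image; Singularity converts it to a local .sif file
singularity pull docker://ubuntu:22.04
# Run a command inside the container (no root privileges required)
singularity exec ubuntu_22.04.sif cat /etc/os-release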
Access Methods¶
- SSH access to login nodes
- Jupyter notebooks for interactive computing
- Web-based interfaces for specific services
- API access for programmatic interaction
Infrastructure Components¶
| Component | Description | Purpose |
|---|---|---|
| OpenStack | Cloud computing platform | Infrastructure management and virtualization |
| SLURM | Workload manager | Job scheduling and resource allocation |
| Singularity | Container platform | Application deployment and portability |
| CephFS | Distributed storage | High-performance shared file system |
| Login Nodes | Access points | User entry and job submission |
| Compute Nodes | Processing units | Actual computation execution |
Note: ILIFU operates as a cloud infrastructure rather than a traditional fixed HPC cluster, allowing dynamic resource allocation based on user requirements. Specific hardware configurations may vary as resources are allocated on demand through the OpenStack platform.
Getting Started with ILIFU¶
Account Setup¶
- Request Access: Apply through your institution
- SSH Keys: Generate and register SSH key pairs
- VPN: Configure institutional VPN if required
- Initial Login: Connect to login nodes
Basic Commands¶
# Login to ILIFU
ssh username@training.ilifu.ac.za
# Check your home directory
ls -la ~
# Check available modules
module avail
# Load a module
module load python/3.12.3 # Or use system python3
💡 Next Steps: After logging in, follow the hands-on exercises in High Performance Computing with SLURM: Practical Tutorial
File System Layout¶
/home/username/ # Your home directory (limited space)
/scratch/username/ # Temporary fast storage
/data/project/ # Shared project data
/software/ # Installed software
Data Management¶
- Home Directory: Small, backed up, permanent
- Scratch Space: Large, fast, temporary (auto-cleaned)
- Project Directories: Shared, persistent, for collaboration
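Putting these storage areas together, a common pattern is to stage input data into scratch at the start of a job and copy only the results back at the end. The sketch below assumes the file layout shown above; the input and result file names are hypothetical, so adjust them to your own project.

#!/bin/bash
#SBATCH --job-name=scratch_demo
#SBATCH --partition=Main
#SBATCH --mem=4GB
#SBATCH --time=00:30:00

# Stage input into fast, temporary scratch space (hypothetical file names)
WORKDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cp ~/hpc_practice/input.csv "$WORKDIR"/
cd "$WORKDIR"

# ... run the analysis here, writing results.txt ...

# Copy results back to permanent storage, then clean up scratch
cp results.txt ~/hpc_practice/
rm -rf "$WORKDIR"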
SLURM Basics¶
📚 For detailed SLURM tutorials and exercises, see: High Performance Computing with SLURM: Practical Tutorial
What is SLURM?¶
graph TB
subgraph "Users"
U1[User 1]
U2[User 2]
U3[User 3]
end
subgraph "Login Nodes"
LN[Login Node<br/>- SSH Access<br/>- Job Submission<br/>- File Editing]
end
subgraph "SLURM Controller"
SC[SLURM Scheduler<br/>- Resource Allocation<br/>- Job Queuing<br/>- Priority Management]
DB[(Accounting<br/>Database)]
end
subgraph "Compute Nodes"
CN1[Compute Node 1<br/>CPUs: 32<br/>RAM: 128GB]
CN2[Compute Node 2<br/>CPUs: 32<br/>RAM: 128GB]
CN3[Compute Node 3<br/>CPUs: 32<br/>RAM: 128GB]
CNN[... More Nodes]
end
subgraph "Storage"
FS[Shared Filesystem<br/>/home<br/>/scratch<br/>/data]
end
U1 --> LN
U2 --> LN
U3 --> LN
LN -->|sbatch/srun| SC
SC --> DB
SC -->|Allocates| CN1
SC -->|Allocates| CN2
SC -->|Allocates| CN3
SC -->|Allocates| CNN
CN1 --> FS
CN2 --> FS
CN3 --> FS
CNN --> FS
LN --> FS
style U1 fill:#e1f5fe
style U2 fill:#e1f5fe
style U3 fill:#e1f5fe
style LN fill:#fff3e0
style SC fill:#f3e5f5
style DB fill:#f3e5f5
style CN1 fill:#e8f5e9
style CN2 fill:#e8f5e9
style CN3 fill:#e8f5e9
style CNN fill:#e8f5e9
style FS fill:#fce4ec
Figure: HPC cluster architecture showing the relationship between users, login nodes, SLURM scheduler, compute nodes, and shared storage
About SLURM¶
Simple Linux Utility for Resource Management (SLURM) is a job scheduling and cluster management tool that:
- Job scheduler: Allocates compute resources efficiently among users
- Resource manager: Controls access to CPUs, memory, and other resources
- Workload manager: Manages job queues and priorities based on fairness policies
- Framework components: Login nodes for access, compute nodes for execution, scheduler for coordination, and accounting database for tracking
Key SLURM Concepts¶
- Job: A computational task submitted to the cluster
- Partition: Group of nodes with similar characteristics
- Queue: Collection of jobs waiting for resources
- Node: Individual compute server
- Core/CPU: Processing unit within a node
Basic SLURM Commands¶
# Submit a job
sbatch job_script.sh
# Check job status
squeue -u username
# Cancel a job
scancel job_id
# Check node information
sinfo
# Check your job history
sacct -u username
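Besides batch submission with sbatch, SLURM's srun can start an interactive session on a compute node, which is handy for testing commands before putting them in a script. Whether interactive sessions are allowed, and on which partition, depends on the site configuration, so treat the values below as placeholders.

# Request a short interactive session on a compute node
srun --partition=Main --cpus-per-task=2 --mem=4GB --time=00:30:00 --pty bash
# You are now on a compute node; run commands as normal, then type 'exit'
# to release the allocation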
Job Script Template¶
To create a job script, open the nano text editor:
nano job_script.sh
Then copy and paste the following template:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=Main
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8GB
#SBATCH --time=01:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
# Load modules
module load python/3.12.3 # Or use system python3
# Run your program
python my_script.py
To save and exit nano:
- Press Ctrl+X to exit
- Press Y to confirm save
- Press Enter to accept the filename
SLURM Directives Explained¶
- `--job-name`: Human-readable job name
- `--partition`: Which partition to use
- `--nodes`: Number of nodes required
- `--ntasks-per-node`: Tasks per node
- `--cpus-per-task`: CPUs per task
- `--mem`: Memory requirement
- `--time`: Maximum runtime
- `--output` / `--error`: Log file locations
Resource Management¶
Understanding Resources¶
- CPU Cores: Processing units
- Memory (RAM): Working memory
- GPU: Graphics processing units
- Storage: Disk space
- Network: Data transfer bandwidth
Resource Allocation Strategies¶
# CPU-intensive job
#SBATCH --cpus-per-task=16
#SBATCH --mem=32GB
# Memory-intensive job
#SBATCH --cpus-per-task=4
#SBATCH --mem=64GB
# GPU job
#SBATCH --gres=gpu:1
#SBATCH --partition=GPU
# Parallel job
#SBATCH --nodes=2
#SBATCH --ntasks=32
Monitoring Resource Usage¶
# Check job efficiency
seff job_id
# Real-time job monitoring
sstat job_id
# Detailed job information
scontrol show job job_id
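For jobs that have already finished, sacct can report how much of the requested resources were actually used; the format fields below are standard sacct columns. Comparing MaxRSS (peak memory used) with your --mem request is a quick way to right-size future submissions.

# Summarise usage for a completed job
sacct -j job_id --format=JobID,JobName,Elapsed,AllocCPUS,MaxRSS,State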
Best Practices¶
Job Submission¶
- Test small first: Start with short test runs
- Use checkpoints: Save progress regularly
- Estimate resources: Don't over-request
- Use appropriate partitions: Match job to partition
- Clean up: Remove temporary files
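One way to follow the "test small first" advice without editing your script is to override the #SBATCH directives on the command line, since options passed to sbatch take precedence over those inside the script. For example, using the job_script.sh template from earlier:

# Trial run with small limits before the full-scale submission
sbatch --time=00:10:00 --mem=2GB --cpus-per-task=1 job_script.sh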
Code Optimization¶
# Use parallel processing: request the cores in your job script
#SBATCH --cpus-per-task=8

# In Python, spread the work across those cores:
from multiprocessing import Pool

def my_function(item):      # replace with your per-item analysis
    return item * item

if __name__ == "__main__":
    data = range(100)       # replace with your real inputs
    with Pool(8) as pool:   # match --cpus-per-task
        results = pool.map(my_function, data)
Data Management¶
- Use scratch space for temporary files
- Compress data when possible
- Clean up regularly
- Use appropriate file formats
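As a small illustration of the compression advice above, standard Unix tools go a long way; the file names here are hypothetical.

# Compress raw reads (FASTQ compresses well, and many tools read .gz directly)
gzip sample_01.fastq
# Bundle and compress a finished results directory before archiving
tar -czf results_archive.tar.gz results/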
Common Mistakes to Avoid¶
- Requesting too many resources
- Running jobs on login nodes
- Not using version control
- Ignoring error messages
- Not testing scripts locally first
Practical Examples¶
📝 Complete Step-by-Step Tutorials: For detailed, hands-on SLURM exercises with explanations, see High Performance Computing with SLURM: Practical Tutorial
Example 1: Python Data Analysis¶
To create this script:
# Open nano editor
nano data_analysis.sh
# Copy and paste the script below, then:
# Press Ctrl+X to exit
# Press Y to save
# Press Enter to confirm filename
Script content:
#!/bin/bash
#SBATCH --job-name=data_analysis
#SBATCH --partition=Main
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=02:00:00
#SBATCH --output=analysis_%j.log
module load python/3.12.3 # Or use system python3
# Install with: pip install pandas numpy matplotlib
python data_analysis.py input.csv
Submit with: sbatch data_analysis.sh
Example 2: R Statistical Analysis¶
Create the script with nano:
nano r_stats.sh
Script content:
#!/bin/bash
#SBATCH --job-name=r_stats
#SBATCH --partition=Main
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB
#SBATCH --time=01:30:00
module load R/4.4.1 # Check available version
Rscript statistical_analysis.R
Submit with: sbatch r_stats.sh
Example 3: GPU Machine Learning¶
Create the script:
nano ml_training.sh
Script content:
#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32GB
#SBATCH --time=04:00:00
# module load cuda # Check if GPU/CUDA is available
module load python/3.12.3 # Or use system python3
python train_model.py
Submit with: sbatch ml_training.sh
Example 4: Array Jobs¶
Create the array job script:
nano array_job.sh
Script content:
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --partition=Main
#SBATCH --array=1-100
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=00:30:00
# Process different files based on array index
input_file="data_${SLURM_ARRAY_TASK_ID}.txt"
output_file="result_${SLURM_ARRAY_TASK_ID}.txt"
python process_data.py $input_file $output_file
Submit with: sbatch array_job.sh
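Array indices do not have to appear in the file names themselves. A common variant of the pattern above reads the n-th line of a sample list instead; samples.txt here is a hypothetical file with one input file name per line.

#!/bin/bash
#SBATCH --job-name=array_from_list
#SBATCH --partition=Main
#SBATCH --array=1-50
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=00:30:00

# Pick the input file matching this task's array index from samples.txt
input_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
python process_data.py "$input_file" "${input_file%.txt}_result.txt"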
Troubleshooting¶
Common Issues and Solutions¶
Job Won't Start¶
# Check partition limits
scontrol show partition
# Check job details
scontrol show job job_id
# Check node availability
sinfo -N
Out of Memory Errors¶
# Check memory usage
sstat -j job_id --format=AveCPU,AvePages,AveRSS,AveVMSize
# Increase memory request
#SBATCH --mem=32GB
Job Timeouts¶
# Check time limits
scontrol show partition
# Increase time limit
#SBATCH --time=04:00:00
# Use checkpointing for long jobs
Module Issues¶
# List available modules
module avail
# Check module conflicts
module list
# Purge and reload
module purge
module load python/3.12.3 # Or use system python3
Getting Help¶
- Documentation: Check ILIFU docs
- Help Desk: Submit support tickets
- Community: Ask on forums or Slack
- Training: Attend workshops
- Practical Tutorials: Work through High Performance Computing with SLURM: Practical Tutorial
Quick Reference¶
Essential SLURM Commands¶
| Command | Purpose | Example Output |
|---|---|---|
| `sbatch script.sh` | Submit job | Submitted batch job 10 |
| `squeue -u $USER` | Check your jobs | Shows running/pending jobs |
| `scancel job_id` | Cancel job | Terminates specified job |
| `sinfo` | Node information | Shows partition and node status |
| `sacct -j job_id` | Job accounting | Shows job completion details |
| `seff job_id` | Job efficiency | Shows resource utilization |
Example Command Outputs¶
Checking Partition Information¶
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
training* up 14-00:00:0 7 idle compute-1-sep2025,compute-2-sep2025,compute-3-sep2025,compute-4-sep2025,compute-5-sep2025,compute-6-sep2025,compute-7-sep2025
Job Submission and Status¶
$ sbatch hello.sh
Submitted batch job 10
$ squeue -u mamana
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10 training hello mamana R 0:01 1 compute-1-sep2025
Job Efficiency Report¶
$ seff 10
Job ID: 10
Cluster: training
User/Group: mamana/training
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:01 core-walltime
Job Wall-clock time: 00:00:01
Memory Utilized: 4.80 MB
Memory Efficiency: 0.48% of 1.00 GB
Common SBATCH Directives¶
| Directive | Purpose | Example |
|---|---|---|
| `--job-name` | Job name | `my_analysis` |
| `--partition` | Partition | `Main`, `GPU` |
| `--cpus-per-task` | CPU cores | `4` |
| `--mem` | Memory | `16GB` |
| `--time` | Runtime limit | `02:00:00` |
| `--gres` | GPU resources | `gpu:1` |
File Transfer¶
# Upload data
scp local_file.txt username@training.ilifu.ac.za:~/
# Download results
scp username@training.ilifu.ac.za:~/results.txt ./
# Sync directories
rsync -av local_dir/ username@training.ilifu.ac.za:~/remote_dir/
Additional Resources¶
- ILIFU Documentation: https://docs.ilifu.ac.za
- SLURM Documentation: https://slurm.schedmd.com/documentation.html
- HPC Best Practices: Various online resources
- Training Materials: Regular workshops and tutorials