HPC and ILIFU Training Materials¶
Getting Started¶
This document provides an overview of HPC concepts and ILIFU infrastructure.
For hands-on practice:
- SLURM Jobs: See High Performance Computing with SLURM: Practical Tutorial for step-by-step SLURM exercises
- Unix Commands: See Unix Commands for Pathogen Genomics - Practical Tutorial for genomics command-line basics
Quick Setup (if needed for examples below):
mkdir -p ~/hpc_practice && cd ~/hpc_practice
cp -r /cbio/training/courses/2025/micmet-genomics/sample-data/* .
Table of Contents¶
- Introduction to High Performance Computing (HPC)
- ILIFU Infrastructure Overview
- Getting Started with ILIFU
- SLURM Job Scheduling
- Resource Allocation and Management
- Best Practices
- Practical Examples
- Troubleshooting
Introduction to HPC¶
What is High Performance Computing?¶
High Performance Computing (HPC) is the use of powerful computers with multiple processors working in parallel to solve complex computational problems that require significant processing power, memory, or time.
Key Characteristics:
- Parallel processing: Multiple CPUs/cores work simultaneously on the same problem
- Cluster architecture: Hundreds or thousands of interconnected compute nodes
- High memory capacity: Large RAM for data-intensive computations
- Fast storage systems: High-speed file systems for handling large datasets
- Job scheduling: Queue management systems to optimize resource allocation
- Specialized hardware: GPUs, high-speed interconnects (InfiniBand), and custom processors
Why Use HPC?¶
- Speed: Complete computations faster than desktop computers
- Scale: Handle larger datasets and more complex problems
- Efficiency: Optimize resource utilization
- Cost-effective: Share expensive hardware among researchers
Traditional Computing vs HPC¶
Desktop Computer HPC Cluster
┌──────────────┐ ┌─────┬─────┬─────┐
│ 1 CPU │ vs │Node1│Node2│Node3│
│ 8GB RAM │ │ 32 │ 64 │128 │
│ 1TB Disk │ │cores│cores│cores│
└──────────────┘ └─────┴─────┴─────┘
HPC = Many computers working together
| Traditional Computing | HPC Cluster |
|---|---|
| Single processor | Hundreds of processors |
| Limited memory (8-32GB) | Massive shared memory (TB) |
| Local storage (TB) | Distributed storage (PB) |
| Individual use | Shared resources |
| Desktop/Laptop | Specialized data centers |
Why Do We Need HPC?¶
Real-world Problems That Need HPC¶
🔬 Astronomy: Processing telescope data¶
- Data volume: 1-2 TB per night from modern telescopes
- Local machine: 3-4 weeks to process one night's data
- HPC cluster: 2-3 hours with parallel processing
- Example: MeerKAT telescope generates 2.5 TB/hour during observations
🧬 Bioinformatics: Genome assembly¶
- Data size: 100-300 GB of raw sequencing reads
- Local machine (8GB RAM): Often fails due to memory limits
- Local machine (32GB RAM): 2-3 weeks for bacterial genome
- HPC cluster: 4-6 hours with 256GB RAM
- Example: Human genome assembly needs ~1TB RAM, impossible on most desktops
🌡️ Climate Modeling: Weather simulations¶
- Computation: Millions of grid points × thousands of time steps
- Local machine: 6-8 months for regional model (if it runs at all)
- HPC cluster: 12-24 hours on 100+ cores
- Example: 10km resolution global model needs 10,000+ CPU hours
🧮 Machine Learning: Training deep neural networks¶
- Model size: GPT-3 has 175 billion parameters
- Local machine (single GPU): 355 years to train
- HPC cluster (1000 GPUs): 34 days
- Example: Training ResNet-50 on ImageNet: 2 weeks (laptop) → 1 hour (8 GPUs)
🦠 Pathogen Genomics: Outbreak analysis¶
- Dataset: 1000 M. tuberculosis genomes for outbreak investigation
- Local machine tasks and times:
- Quality control: 50 hours (3 min/sample)
- Read alignment: 167 hours (10 min/sample)
- Variant calling: 83 hours (5 min/sample)
- Phylogenetic tree: 48-72 hours
- Total: ~15 days of continuous processing
- HPC cluster:
- All samples in parallel: 4-6 hours total
- Tree construction on high-memory node: 2-3 hours
- Real example: COVID-19 surveillance processing 10,000 genomes weekly - impossible without HPC
Additional Pathogen Genomics Use Cases¶
🧬 Bacterial Genome Assembly (Illumina + Nanopore)¶
- Dataset: Hybrid assembly of 50 bacterial isolates
- Computational requirements:
- RAM: 16-32GB per genome
- CPU: 8-16 cores optimal per assembly
- Local machine (16GB RAM, 4 cores):
- One genome at a time only
- Per genome: 3-4 hours
- Total time: 150-200 hours (6-8 days)
- Risk of crashes with large genomes
- HPC cluster (256GB RAM, 32 cores/node):
- Process 8 genomes simultaneously per node
- Use 7 nodes for all 50 genomes
- Total time: 3-4 hours
- Speedup: 50x faster
💊 AMR Gene Detection Across Multiple Species¶
- Dataset: 5000 bacterial genomes from hospital surveillance
- Tools: AMRFinder, CARD-RGI, ResFinder
- Computational requirements:
- Database size: 2-5GB per tool
- RAM: 4-8GB per genome
- CPU time: 5-10 minutes per genome per tool
- Local machine (8 cores):
- Sequential processing: 5000 × 3 tools × 7.5 min = 1875 hours (78 days)
- Database loading overhead adds 20% more time
- HPC cluster (100 nodes, 32 cores each):
- Parallel processing across nodes
- Shared database in memory
- Total time: 6-8 hours
- Speedup: 230x faster
🌍 Phylogeographic Analysis of Cholera Outbreak¶
- Dataset: 2000 V. cholerae genomes from Haiti outbreak
- Computational requirements:
- Alignment: 100GB RAM for reference-based
- SNP calling: 4GB per genome
- Tree building (RAxML-NG): 64-128GB RAM
- BEAST analysis: 32GB RAM, 1000+ hours CPU time
- Local machine attempts:
- Alignment: Often fails (out of memory)
- If successful: 48 hours
- SNP calling: 133 hours (4 min/genome)
- RAxML tree: Fails on most laptops (needs >64GB RAM)
- BEAST: 6-8 weeks for proper MCMC convergence
- HPC cluster:
- Alignment: 2 hours on high-memory node
- SNP calling: 2 hours (parallel)
- RAxML: 4-6 hours on 64 cores
- BEAST: 48 hours on 32 cores
- Total: 2-3 days vs 2-3 months
🔬 Real-time Nanopore Sequencing Analysis¶
- Scenario: Meningitis outbreak, need results in <24 hours
- Data flow: 20 samples, 5GB data/sample, arriving over 12 hours
- Pipeline: Basecalling → QC → Assembly → Typing → AMR
- Local machine challenges:
- Can't keep up with data generation
- Basecalling alone: 2 hours/sample (40 hours total)
- Sequential processing: Miss the 24-hour deadline
- HPC solution:
- Real-time processing as data arrives
- GPU nodes for basecalling: 10 min/sample
- Parallel assembly and analysis
- Results available within 2-3 hours of sequencing
- Clinical impact: Treatment decisions in same day
Computational Requirements Comparison Table¶
| Task | Local Machine | HPC Cluster | Speedup |
|---|---|---|---|
| QC of 100 M. tuberculosis genomes | 8GB RAM, 5 hours | 256GB RAM, 10 min | 30x |
| 1000-genome alignment | 16GB RAM, 7 days | 32GB/node × 50 nodes, 3 hours | 56x |
| Phylogenetic tree (5000 taxa) | Often fails (>64GB needed) | 512GB RAM, 6 hours | ∞ |
| Pan-genome analysis (500 genomes) | 32GB RAM, 2 weeks | 256GB RAM, 8 hours | 42x |
| GWAS (10,000 samples) | Impossible (<1TB RAM) | 1TB RAM node, 24 hours | ∞ |
| Metagenomic assembly | 64GB RAM, 3 days | 512GB RAM, 4 hours | 18x |
Why These Tasks Fail on Local Machines¶
- Memory Walls:
  - De novo assembly: Needs 100-1000x coverage data in RAM
  - Tree building: O(n²) memory for distance matrices
  - Pan-genome: Stores all genomes simultaneously
- Time Constraints:
  - Outbreak response: Need results in hours, not weeks
  - Grant deadlines: Can't wait months for analysis
  - Iterative analysis: Need to test multiple parameters
- Data Volume:
  - Modern sequencers: 100-500GB per run
  - Surveillance programs: 100s of genomes weekly
  - Data often can't even be stored on a laptop (typical: 256GB-1TB SSD)
ILIFU Infrastructure¶
What is ILIFU?¶
- The name comes from "ilifu", the isiXhosa word for "cloud"; the facility is run by a partnership that includes the Inter-University Institute for Data Intensive Astronomy (IDIA)
- South African national research data facility
- Supports astronomy, bioinformatics, and other data-intensive sciences
- Located at University of Cape Town and University of the Western Cape
ILIFU Services¶
- Compute Cluster: High-performance computing resources
- Storage: Large-scale data storage solutions
- Cloud Services: Virtualized computing environments
- Data Transfer: High-speed data movement capabilities
- Support: Technical assistance and training
ILIFU Cluster Architecture¶
ILIFU is a cloud computing infrastructure designed for data-intensive research in astronomy, bioinformatics, and other computational sciences. The facility operates on an OpenStack platform with containerized workloads using Singularity and job scheduling through SLURM.
graph TB
subgraph "ILIFU Infrastructure"
subgraph "Cloud Platform"
OS[OpenStack Cloud<br/>Infrastructure-as-a-Service]
end
subgraph "Compute Resources"
CN[Compute Nodes<br/>Max: 96 CPUs per job<br/>Max: 1500 GB RAM per job]
end
subgraph "Container Platform"
SP[Singularity Containers<br/>HPC-optimized<br/>Rootless execution]
end
subgraph "Job Scheduler"
SL[SLURM Workload Manager<br/>Max runtime: 336 hours]
end
end
subgraph "Research Domains"
AST[Astronomy<br/>MeerKAT, SKA]
BIO[Bioinformatics<br/>Genomics, Metagenomics]
DS[Data Science<br/>ML/AI Research]
end
OS --> CN
CN --> SP
SP --> SL
AST --> SL
BIO --> SL
DS --> SL
style OS fill:#e1f5fe
style CN fill:#e8f5e9
style SP fill:#fff3e0
style SL fill:#f3e5f5
Figure: ILIFU cloud infrastructure architecture supporting multiple research domains
Resource Specifications¶
Based on ILIFU's cloud infrastructure configuration:
Maximum Job Resources¶
- CPUs: Up to 96 cores per job
- Memory: Up to 1500 GB (1.5 TB) RAM per job
- Runtime: Maximum 336 hours (14 days) per job
- Storage: Distributed file systems for large-scale data
Key Features¶
- OpenStack Platform: Provides flexible cloud computing resources
- Singularity Containers: Enables reproducible, portable workflows
- SLURM Scheduler: Manages resource allocation and job queuing
- Multi-domain Support: Serves astronomy, bioinformatics, and data science communities
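The container workflow mentioned above typically looks like the minimal sketch below. The image name is only an illustration (a public Ubuntu image from Docker Hub); on ILIFU you would normally use a site-provided or project-specific container, so check the local documentation for available images.

# Pull a public image; Singularity converts it to a local .sif file
singularity pull docker://ubuntu:22.04
# Run a command inside the container (no root privileges required)
singularity exec ubuntu_22.04.sif cat /etc/os-release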
Access Methods¶
- SSH access to login nodes
- Jupyter notebooks for interactive computing
- Web-based interfaces for specific services
- API access for programmatic interaction
Infrastructure Components¶
| Component | Description | Purpose |
|---|---|---|
| OpenStack | Cloud computing platform | Infrastructure management and virtualization |
| SLURM | Workload manager | Job scheduling and resource allocation |
| Singularity | Container platform | Application deployment and portability |
| CephFS | Distributed storage | High-performance shared file system |
| Login Nodes | Access points | User entry and job submission |
| Compute Nodes | Processing units | Actual computation execution |
Note: ILIFU operates as a cloud infrastructure rather than a traditional fixed HPC cluster, allowing dynamic resource allocation based on user requirements. Specific hardware configurations may vary as resources are allocated on demand through the OpenStack platform.
Getting Started with ILIFU¶
Account Setup¶
- Request Access: Apply through your institution
- SSH Keys: Generate and register SSH key pairs
- VPN: Configure institutional VPN if required
- Initial Login: Connect to login nodes
Basic Commands¶
# Login to ILIFU
ssh username@training.ilifu.ac.za
# Check your home directory
ls -la ~
# Check available modules
module avail
# Load a module
module load python/3.12.3 # Or use system python3
💡 Next Steps: After logging in, follow the hands-on exercises in High Performance Computing with SLURM: Practical Tutorial
File System Layout¶
/home/username/ # Your home directory (limited space)
/scratch/username/ # Temporary fast storage
/data/project/ # Shared project data
/software/ # Installed software
Data Management¶
- Home Directory: Small, backed up, permanent
- Scratch Space: Large, fast, temporary (auto-cleaned)
- Project Directories: Shared, persistent, for collaboration
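Putting these storage areas together, a common pattern is to stage input data into scratch at the start of a job and copy only the results back at the end. The sketch below assumes the file layout shown above; the input and result file names are hypothetical, so adjust them to your own project.

#!/bin/bash
#SBATCH --job-name=scratch_demo
#SBATCH --partition=Main
#SBATCH --mem=4GB
#SBATCH --time=00:30:00

# Stage input into fast, temporary scratch space (hypothetical file names)
WORKDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cp ~/hpc_practice/input.csv "$WORKDIR"/
cd "$WORKDIR"

# ... run the analysis here, writing results.txt ...

# Copy results back to permanent storage, then clean up scratch
cp results.txt ~/hpc_practice/
rm -rf "$WORKDIR"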
SLURM Basics¶
📚 For detailed SLURM tutorials and exercises, see: High Performance Computing with SLURM: Practical Tutorial
What is SLURM?¶
graph TB
subgraph "Users"
U1[User 1]
U2[User 2]
U3[User 3]
end
subgraph "Login Nodes"
LN[Login Node<br/>- SSH Access<br/>- Job Submission<br/>- File Editing]
end
subgraph "SLURM Controller"
SC[SLURM Scheduler<br/>- Resource Allocation<br/>- Job Queuing<br/>- Priority Management]
DB[(Accounting<br/>Database)]
end
subgraph "Compute Nodes"
CN1[Compute Node 1<br/>CPUs: 32<br/>RAM: 128GB]
CN2[Compute Node 2<br/>CPUs: 32<br/>RAM: 128GB]
CN3[Compute Node 3<br/>CPUs: 32<br/>RAM: 128GB]
CNN[... More Nodes]
end
subgraph "Storage"
FS[Shared Filesystem<br/>/home<br/>/scratch<br/>/data]
end
U1 --> LN
U2 --> LN
U3 --> LN
LN -->|sbatch/srun| SC
SC --> DB
SC -->|Allocates| CN1
SC -->|Allocates| CN2
SC -->|Allocates| CN3
SC -->|Allocates| CNN
CN1 --> FS
CN2 --> FS
CN3 --> FS
CNN --> FS
LN --> FS
style U1 fill:#e1f5fe
style U2 fill:#e1f5fe
style U3 fill:#e1f5fe
style LN fill:#fff3e0
style SC fill:#f3e5f5
style DB fill:#f3e5f5
style CN1 fill:#e8f5e9
style CN2 fill:#e8f5e9
style CN3 fill:#e8f5e9
style CNN fill:#e8f5e9
style FS fill:#fce4ec
Figure: HPC cluster architecture showing the relationship between users, login nodes, SLURM scheduler, compute nodes, and shared storage
About SLURM¶
Simple Linux Utility for Resource Management (SLURM) is a job scheduling and cluster management tool that:
- Job scheduler: Allocates compute resources efficiently among users
- Resource manager: Controls access to CPUs, memory, and other resources
- Workload manager: Manages job queues and priorities based on fairness policies
- Framework components: Login nodes for access, compute nodes for execution, scheduler for coordination, and accounting database for tracking
Key SLURM Concepts¶
- Job: A computational task submitted to the cluster
- Partition: Group of nodes with similar characteristics
- Queue: Collection of jobs waiting for resources
- Node: Individual compute server
- Core/CPU: Processing unit within a node
Basic SLURM Commands¶
# Submit a job
sbatch job_script.sh
# Check job status
squeue -u username
# Cancel a job
scancel job_id
# Check node information
sinfo
# Check your job history
sacct -u username
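Besides batch submission with sbatch, SLURM's srun can start an interactive session on a compute node, which is handy for testing commands before putting them in a script. Whether interactive sessions are allowed, and on which partition, depends on the site configuration, so treat the values below as placeholders.

# Request a short interactive session on a compute node
srun --partition=Main --cpus-per-task=2 --mem=4GB --time=00:30:00 --pty bash
# You are now on a compute node; run commands as normal, then type 'exit'
# to release the allocation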
Job Script Template¶
To create a job script, open the nano text editor:
nano job_script.sh
Then copy and paste the following template:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --partition=Main
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8GB
#SBATCH --time=01:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
# Load modules
module load python/3.12.3 # Or use system python3
# Run your program
python my_script.py
To save and exit nano:
- Press Ctrl+X to exit
- Press Y to confirm save
- Press Enter to accept the filename
SLURM Directives Explained¶
- `--job-name`: Human-readable job name
- `--partition`: Which partition to use
- `--nodes`: Number of nodes required
- `--ntasks-per-node`: Tasks per node
- `--cpus-per-task`: CPUs per task
- `--mem`: Memory requirement
- `--time`: Maximum runtime
- `--output` / `--error`: Log file locations
Resource Management¶
Understanding Resources¶
- CPU Cores: Processing units
- Memory (RAM): Working memory
- GPU: Graphics processing units
- Storage: Disk space
- Network: Data transfer bandwidth
Resource Allocation Strategies¶
# CPU-intensive job
#SBATCH --cpus-per-task=16
#SBATCH --mem=32GB
# Memory-intensive job
#SBATCH --cpus-per-task=4
#SBATCH --mem=64GB
# GPU job
#SBATCH --gres=gpu:1
#SBATCH --partition=GPU
# Parallel job
#SBATCH --nodes=2
#SBATCH --ntasks=32
Monitoring Resource Usage¶
# Check job efficiency
seff job_id
# Real-time job monitoring
sstat job_id
# Detailed job information
scontrol show job job_id
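For jobs that have already finished, sacct can report how much of the requested resources were actually used; the format fields below are standard sacct columns. Comparing MaxRSS (peak memory used) with your --mem request is a quick way to right-size future submissions.

# Summarise usage for a completed job
sacct -j job_id --format=JobID,JobName,Elapsed,AllocCPUS,MaxRSS,State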
Best Practices¶
Job Submission¶
- Test small first: Start with short test runs
- Use checkpoints: Save progress regularly
- Estimate resources: Don't over-request
- Use appropriate partitions: Match job to partition
- Clean up: Remove temporary files
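One way to follow the "test small first" advice without editing your script is to override the #SBATCH directives on the command line, since options passed to sbatch take precedence over those inside the script. For example, using the job_script.sh template from earlier:

# Trial run with small limits before the full-scale submission
sbatch --time=00:10:00 --mem=2GB --cpus-per-task=1 job_script.sh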
Code Optimization¶
# Use parallel processing: request the cores in your job script
#SBATCH --cpus-per-task=8

# In Python, spread the work across those cores:
from multiprocessing import Pool

def my_function(item):      # replace with your per-item analysis
    return item * item

if __name__ == "__main__":
    data = range(100)       # replace with your real inputs
    with Pool(8) as pool:   # match --cpus-per-task
        results = pool.map(my_function, data)
Data Management¶
- Use scratch space for temporary files
- Compress data when possible
- Clean up regularly
- Use appropriate file formats
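As a small illustration of the compression advice above, standard Unix tools go a long way; the file names here are hypothetical.

# Compress raw reads (FASTQ compresses well, and many tools read .gz directly)
gzip sample_01.fastq
# Bundle and compress a finished results directory before archiving
tar -czf results_archive.tar.gz results/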
Common Mistakes to Avoid¶
- Requesting too many resources
- Running jobs on login nodes
- Not using version control
- Ignoring error messages
- Not testing scripts locally first
Practical Examples¶
📝 Complete Step-by-Step Tutorials: For detailed, hands-on SLURM exercises with explanations, see High Performance Computing with SLURM: Practical Tutorial
Example 1: Python Data Analysis¶
To create this script:
# Open nano editor
nano data_analysis.sh
# Copy and paste the script below, then:
# Press Ctrl+X to exit
# Press Y to save
# Press Enter to confirm filename
Script content:
#!/bin/bash
#SBATCH --job-name=data_analysis
#SBATCH --partition=Main
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=02:00:00
#SBATCH --output=analysis_%j.log
module load python/3.12.3 # Or use system python3
# Install with: pip install pandas numpy matplotlib
python data_analysis.py input.csv
Submit with: sbatch data_analysis.sh
Example 2: R Statistical Analysis¶
Create the script with nano:
nano r_stats.sh
Script content:
#!/bin/bash
#SBATCH --job-name=r_stats
#SBATCH --partition=Main
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB
#SBATCH --time=01:30:00
module load R/4.4.1 # Check available version
Rscript statistical_analysis.R
Submit with: sbatch r_stats.sh
Example 3: GPU Machine Learning¶
Create the script:
nano ml_training.sh
Script content:
#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32GB
#SBATCH --time=04:00:00
# module load cuda # Check if GPU/CUDA is available
module load python/3.12.3 # Or use system python3
python train_model.py
Submit with: sbatch ml_training.sh
Example 4: Array Jobs¶
Create the array job script:
nano array_job.sh
Script content:
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --partition=Main
#SBATCH --array=1-100
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=00:30:00
# Process different files based on array index
input_file="data_${SLURM_ARRAY_TASK_ID}.txt"
output_file="result_${SLURM_ARRAY_TASK_ID}.txt"
python process_data.py $input_file $output_file
Submit with: sbatch array_job.sh
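Array indices do not have to appear in the file names themselves. A common variant of the pattern above reads the n-th line of a sample list instead; samples.txt here is a hypothetical file with one input file name per line.

#!/bin/bash
#SBATCH --job-name=array_from_list
#SBATCH --partition=Main
#SBATCH --array=1-50
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=00:30:00

# Pick the input file matching this task's array index from samples.txt
input_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
python process_data.py "$input_file" "${input_file%.txt}_result.txt"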
Troubleshooting¶
Common Issues and Solutions¶
Job Won't Start¶
# Check partition limits
scontrol show partition
# Check job details
scontrol show job job_id
# Check node availability
sinfo -N
Out of Memory Errors¶
# Check memory usage
sstat -j job_id --format=AveCPU,AvePages,AveRSS,AveVMSize
# Increase memory request
#SBATCH --mem=32GB
Job Timeouts¶
# Check time limits
scontrol show partition
# Increase time limit
#SBATCH --time=04:00:00
# Use checkpointing for long jobs
Module Issues¶
# List available modules
module avail
# Check module conflicts
module list
# Purge and reload
module purge
module load python/3.12.3 # Or use system python3
Getting Help¶
- Documentation: Check ILIFU docs
- Help Desk: Submit support tickets
- Community: Ask on forums or Slack
- Training: Attend workshops
- Practical Tutorials: Work through High Performance Computing with SLURM: Practical Tutorial
Quick Reference¶
Essential SLURM Commands¶
| Command | Purpose | Example Output |
|---|---|---|
| `sbatch script.sh` | Submit job | Submitted batch job 10 |
| `squeue -u $USER` | Check your jobs | Shows running/pending jobs |
| `scancel job_id` | Cancel job | Terminates specified job |
| `sinfo` | Node information | Shows partition and node status |
| `sacct -j job_id` | Job accounting | Shows job completion details |
| `seff job_id` | Job efficiency | Shows resource utilization |
Example Command Outputs¶
Checking Partition Information¶
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
training* up 14-00:00:0 7 idle compute-1-sep2025,compute-2-sep2025,compute-3-sep2025,compute-4-sep2025,compute-5-sep2025,compute-6-sep2025,compute-7-sep2025
Job Submission and Status¶
$ sbatch hello.sh
Submitted batch job 10
$ squeue -u mamana
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10 training hello mamana R 0:01 1 compute-1-sep2025
Job Efficiency Report¶
$ seff 10
Job ID: 10
Cluster: training
User/Group: mamana/training
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:01 core-walltime
Job Wall-clock time: 00:00:01
Memory Utilized: 4.80 MB
Memory Efficiency: 0.48% of 1.00 GB
Common SBATCH Directives¶
| Directive | Purpose | Example |
|---|---|---|
| `--job-name` | Job name | `my_analysis` |
| `--partition` | Partition | `Main`, `GPU` |
| `--cpus-per-task` | CPU cores | `4` |
| `--mem` | Memory | `16GB` |
| `--time` | Runtime limit | `02:00:00` |
| `--gres` | GPU resources | `gpu:1` |
File Transfer¶
# Upload data
scp local_file.txt username@training.ilifu.ac.za:~/
# Download results
scp username@training.ilifu.ac.za:~/results.txt ./
# Sync directories
rsync -av local_dir/ username@training.ilifu.ac.za:~/remote_dir/
Additional Resources¶
- ILIFU Documentation: https://docs.ilifu.ac.za
- SLURM Documentation: https://slurm.schedmd.com/documentation.html
- HPC Best Practices: Various online resources
- Training Materials: Regular workshops and tutorials