Unix Commands for Pathogen Genomics - Practical Tutorial¶

Adapted from Microbial-Genomics practice scripts

Tutorial Overview¶

What You'll Learn¶

This hands-on tutorial will teach you essential Unix commands for pathogen genomics analysis. By the end, you'll be able to: - Navigate and organize genomics project directories - Process FASTQ and FASTA files efficiently - Extract meaningful information from sequencing data - Build simple analysis pipelines - Prepare data for HPC analysis

Prerequisites¶

Basic terminal/command line access
No prior Unix experience required
Access to the training server

Time Required¶

Complete tutorial: 2-3 hours
Quick essentials: 45 minutes

Learning Path¶

Setup → 2. Basic Navigation → 3. File Operations → 4. Data Processing → 5. Pipeline Building

Setup Instructions¶

Before starting the Unix command exercises, you need to prepare your workspace with sample genomics data. Here's how:

Step 1: Create Your Practice Directory¶

# Create a new directory for your practice exercises
# mkdir = "make directory"
# -p = create parent directories if needed (won't error if directory exists)
# ~ = shortcut for your home directory (/home/username)
mkdir -p ~/hpc_practice

# Navigate into your new directory
# cd = "change directory"
cd ~/hpc_practice

Step 2: Copy Sample Genomics Data¶

# Copy all sample files from the shared course folder to your current location
# cp = "copy" command
# -r = "recursive" - copy directories and all their contents
# * = wildcard that matches all files
# . = current directory (destination)
cp -r /cbio/training/courses/2025/micmet-genomics/sample-data/* .

Step 3: Verify Your Files¶

# List all files with details to confirm the copy was successful
# ls = "list" command
# -l = long format (shows permissions, size, dates)
# -a = show all files (including hidden files starting with .)
ls -la

Understanding Your Sample Files¶

The following files are now in your directory for practice:

File	Description	Used For
`sample.fastq.gz`	Compressed DNA sequencing reads	Learning `zcat`, `gunzip`, file compression
`sample1.fastq`	Uncompressed sequencing data (3 reads)	Practicing `grep`, `wc`, sequence counting
`sample2.fastq`	Another FASTQ file (2 reads)	Array processing, file comparisons
`reference.fasta`	Reference genome sequences	Learning `grep` with FASTA headers, sequence extraction
`data.txt`	Tab-delimited sample metadata	Practicing `awk`, `sed`, `cut`, `sort` commands

File Formats Explained: - FASTQ: Contains sequences + quality scores (4 lines per read) - FASTA: Contains sequences only (2 lines per sequence: header + sequence) - GZ: Gzip compressed file (saves space, common in genomics)

Quick Command Reference with Detailed Explanations¶

Essential Commands for Genomics Analysis¶

Before diving into detailed modules, here's a quick reference of the most commonly used commands in pathogen genomics, with detailed explanations of what each component does:

Navigate and Organize¶

# Create nested directories for a genomics project
mkdir -p project/{data,results,scripts}
# Explanation:
# mkdir = make directory command
# -p = create parent directories as needed (won't error if they exist)
# project/ = main project folder
# {data,results,scripts} = brace expansion creates 3 subdirectories at once
#   - data/ for raw sequencing files
#   - results/ for analysis outputs
#   - scripts/ for your analysis code

# Navigate to your project directory
cd project
# cd = change directory
# project = destination directory (relative path from current location)

# Show current working directory
pwd
# pwd = print working directory
# Returns absolute path like: /home/username/hpc_practice/project

Inspect FASTQ Files¶

# View compressed FASTQ file content
zcat sample.fastq.gz | head -20
# Explanation:
# zcat = view compressed file without extracting (z = gzip, cat = concatenate)
# sample.fastq.gz = compressed FASTQ file (common in genomics to save space)
# | = pipe operator, sends output to next command
# head -20 = show first 20 lines (5 complete reads since FASTQ uses 4 lines/read)

# Count number of reads in FASTQ file
zcat sample.fastq.gz | wc -l | awk '{print $1/4}'
# Explanation:
# zcat sample.fastq.gz = decompress and output file content
# wc -l = word count with -l flag counts lines
# | = pipe the line count to awk
# awk '{print $1/4}' = divide line count by 4 (FASTQ has 4 lines per read)
#   - $1 = first field (the line count)
#   - /4 = division to get read count
# Example: 400 lines / 4 = 100 reads

Search and Filter¶

# Find all FASTA headers in reference genome
grep "^>" reference.fasta
# Explanation:
# grep = global regular expression print (searches for patterns)
# "^>" = pattern to search for
#   - ^ = start of line anchor (line must begin with >)
#   - > = literal ">" character (FASTA headers start with >)
# reference.fasta = file to search in
# Output: Shows all sequence headers like ">chr1", ">gene_ABC123"

# Count high-quality variants
grep -c "PASS" variants.vcf
# Explanation:
# grep = search command
# -c = count matching lines instead of showing them
# "PASS" = quality filter status in VCF files
# variants.vcf = variant call format file
# Returns: Number like "1234" (count of variants passing quality filters)

Process Text Data¶

# Extract specific columns from data
awk '{print $1, $2}' data.txt
# Explanation:
# awk = powerful text processing tool
# '{print $1, $2}' = awk program
#   - {} = action block
#   - print = output command
#   - $1 = first column/field
#   - $2 = second column/field
#   - , = adds space between fields in output
# data.txt = input file
# Example input:  "Sample1 100 resistant"
# Example output: "Sample1 100"

# Replace text in files
sed 's/old/new/g' file.txt
# Explanation:
# sed = stream editor for text transformation
# 's/old/new/g' = substitution command
#   - s = substitute command
#   - /old/ = pattern to find
#   - /new/ = replacement text
#   - g = global flag (replace all occurrences, not just first)
# file.txt = input file
# Example: Changes "old_sample_name" to "new_sample_name" throughout file

Combined Pipeline Examples¶

Example 1: Quick FASTQ Quality Check¶

# Count reads and check quality score distribution
zcat sample.fastq.gz | \
  awk 'NR%4==0' | \
  cut -c1-10 | \
  sort | \
  uniq -c | \
  sort -rn

# Line-by-line explanation:
# zcat sample.fastq.gz = decompress FASTQ
# awk 'NR%4==0' = get every 4th line (quality scores)
#   - NR = line number
#   - %4==0 = divisible by 4 (4th, 8th, 12th lines...)
# cut -c1-10 = first 10 characters of quality string
# sort = alphabetically sort quality patterns
# uniq -c = count unique patterns
# sort -rn = sort by count, highest first
#   - -r = reverse order
#   - -n = numerical sort

Example 2: Extract High-Quality Reads¶

# Get read IDs with average quality > 30
zcat sample.fastq.gz | \
  paste - - - - | \
  awk '{if(length($4) > 0) print $1, length($4)}' | \
  grep "^@"

# Explanation:
# paste - - - - = combine every 4 lines into 1 tab-delimited line
# awk = process the combined lines
# $1 = read ID, $4 = quality string
# grep "^@" = filter for valid read IDs

Pro Tips for These Commands¶

Always preview before processing:

zcat file.gz | head -20  # Check format first

Count before and after filtering:

# Before
grep -c "^@" input.fastq
# After filtering
grep -c "^@" filtered.fastq

Use quotes for patterns with special characters:

grep "^>" file.fasta     # Correct
grep ^> file.fasta        # May fail - shell interprets >

Combine commands efficiently:

# Instead of creating intermediate files:
zcat file.gz > temp.txt
grep "pattern" temp.txt > result.txt

# Use pipes:
zcat file.gz | grep "pattern" > result.txt

Module 1: Directory Organization for Genomics Projects¶

Learning Objectives¶

✓ Create organized project directories ✓ Navigate between directories efficiently ✓ Understand genomics project structure

Tutorial 1.1: Creating Your First Project Structure¶

Step 1: Start with a Simple Structure¶

# Create your main project directory
mkdir my_first_project

# Enter the directory
cd my_first_project

# Check where you are
pwd
# Output: /home/username/hpc_practice/my_first_project

Step 2: Add Subdirectories¶

# Create data directories
mkdir data
mkdir results
mkdir scripts

# List what you created
ls
# Output: data  results  scripts

Step 3: Create a Complex Structure¶

# Use -p to create nested directories
mkdir -p data/{raw_reads,reference_genomes,metadata}
mkdir -p results/{qc,alignment,variants,phylogeny}
mkdir -p scripts logs tmp

# View the structure
ls -la
# The -la flags show: l=long format, a=all files

Try It Yourself:

# Exercise: Create this structure
# project/
#   ├── input/
#   │   ├── sequences/
#   │   └── references/
#   └── output/
#       ├── aligned/
#       └── reports/

# Solution:
mkdir -p project/{input/{sequences,references},output/{aligned,reports}}

Real-world application:

# Set up M. tuberculosis outbreak analysis
mkdir -p mtb_outbreak_2025/{data,results,scripts,logs}
cd mtb_outbreak_2025
mkdir -p data/{fastq,references,clinical_metadata}
mkdir -p results/{fastqc,trimming,bwa_alignment,vcf_files,phylogenetic_tree}

Module 2: File Management for Sequencing Data¶

Learning Objectives¶

✓ Create and edit text files ✓ Copy and rename sequencing files ✓ Organize data systematically

Tutorial 2.1: Working with Sample Lists¶

Step 1: Create a Sample List¶

# First, ensure you're in the right place
pwd
cd ~/hpc_practice

# Create an empty file
touch sample_list.txt

# Check it was created
ls -la sample_list.txt
# Output: -rw-r--r-- 1 user group 0 Sep 2 10:00 sample_list.txt

Step 2: Add Content to the File¶

# Method 1: Using echo (for single lines)
echo "Sample_001" > sample_list.txt
echo "Sample_002" >> sample_list.txt  # >> appends, > overwrites!

# Method 2: Using nano editor (recommended for multiple lines)
nano sample_list.txt
# Type or paste the following content:
# MTB_sample_001
# MTB_sample_002
# MTB_sample_003
# MTB_sample_004
# Then save with: Ctrl+X, Y, Enter

# View what you created
cat sample_list.txt

Step 3: Count and Verify¶

# Count lines in file
wc -l sample_list.txt
# Output: 4 sample_list.txt

# Count words
wc -w sample_list.txt
# Output: 4 sample_list.txt

# Get full statistics
wc sample_list.txt
# Output: 4  4  58 sample_list.txt
#        (lines, words, characters)

Tutorial 2.2: Organizing Sequencing Files¶

Step 1: Copy Files Safely¶

# Copy sample files to practice with
cp sample*.fastq .

# List files before renaming
ls sample*.fastq
# Output: sample1.fastq  sample2.fastq

# Create a backup first (always!)
mkdir backups
cp sample*.fastq backups/

Step 2: Rename Files Systematically¶

# Rename a single file
mv sample1.fastq patient001_reads.fastq

# Batch rename using a loop
for file in sample*.fastq; do
    # Extract the number from filename
    num=$(echo $file | grep -o '[0-9]\+')
    # Create new name
    newname="patient_${num}_sequences.fastq"
    echo "Renaming $file to $newname"
    mv "$file" "$newname"
done

# Verify the renaming
ls *.fastq

Practice Exercise:¶

# Exercise: Create copies with dates
# Copy sample.fastq.gz to sample_20250902.fastq.gz

# Solution:
date_stamp=$(date +%Y%m%d)
cp sample.fastq.gz sample_${date_stamp}.fastq.gz

Module 3: Viewing and Inspecting Genomics Files¶

Learning Objectives¶

✓ View compressed and uncompressed files ✓ Count sequences in FASTQ/FASTA files ✓ Extract specific parts of files

Tutorial 3.1: Working with FASTQ Files¶

Step 1: View Compressed Files¶

# View first 4 lines (1 complete read) of compressed file
zcat sample.fastq.gz | head -4
# Output:
# @SEQ_001
# ACGTACGTACGTACGTACGTACGTACGTACGTACGT
# +
# IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

# View first 2 reads (8 lines)
zcat sample.fastq.gz | head -8

Step 2: Count Sequences¶

# Count total lines
zcat sample.fastq.gz | wc -l
# Output: 12  (for 3 reads)

# Count number of reads (FASTQ has 4 lines per read)
zcat sample.fastq.gz | wc -l | awk '{print $1/4}'
# Output: 3

# Alternative: Count sequence headers
zcat sample.fastq.gz | grep -c "^@"
# Output: 3

Step 3: View with Line Numbers¶

# Add line numbers to output
zcat sample.fastq.gz | head -8 | cat -n
# Output:
#      1    @SEQ_001
#      2    ACGTACGTACGTACGTACGTACGTACGTACGTACGT
#      3    +
#      4    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
#      5    @SEQ_002
#      6    TGCATGCATGCATGCATGCATGCATGCATGCATGCA
#      7    +
#      8    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Tutorial 3.2: Working with FASTA Files¶

Step 1: View FASTA Headers¶

# View first line (header) of FASTA
head -1 reference.fasta
# Output: >Sequence_1 gene=ABC123

# View all headers in file
grep "^>" reference.fasta
# Output:
# >Sequence_1 gene=ABC123
# >Sequence_2 gene=DEF456

# Count sequences
grep -c "^>" reference.fasta
# Output: 2

Step 2: Extract Sequences¶

# View first 2 lines (header + sequence start)
head -2 reference.fasta

# Skip header, view sequence only
tail -n +2 reference.fasta | head -1
# Output: ACGTACGTACGTACGTACGTACGTACGTACGTACGT

# Get sequence length
tail -n +2 reference.fasta | head -1 | wc -c
# Output: 37 (includes newline)

Practice Exercise:¶

# Exercise: Count total bases in all sequences
# Hint: Remove headers first

# Solution:
grep -v "^>" reference.fasta | tr -d '\n' | wc -c

Module 4: Searching and Filtering Genomics Data¶

Learning Objectives¶

✓ Search for patterns in files ✓ Filter genomics data ✓ Use regular expressions

Tutorial 4.1: Basic Pattern Searching¶

Step 1: Simple Searches¶

# Search for a word in a file
grep "resistant" data.txt
# Output: Sample1 100 resistant
#         Sample3 150 resistant

# Case-insensitive search (-i flag)
grep -i "SAMPLE" data.txt
# Finds: Sample1, Sample2, etc.

# Count matches (-c flag)
grep -c "resistant" data.txt
# Output: 2

# Show line numbers (-n flag)
grep -n "resistant" data.txt
# Output: 1:Sample1 100 resistant
#         3:Sample3 150 resistant

Step 2: Search in Genomics Files¶

# Find sequence headers in FASTA
grep "^>" reference.fasta
# ^ means "start of line"

# Find adapter sequences in FASTQ
grep "AGATCGGAAGAG" sample1.fastq

# Search compressed files
zcat sample.fastq.gz | grep "ACGT"

Tutorial 4.2: Advanced Pattern Matching¶

Using Regular Expressions¶

# Find lines with numbers
grep '[0-9]' data.txt
# [0-9] matches any digit

# Find lines ending with specific pattern
grep 'resistant$' data.txt
# $ means "end of line"

# Extract only the matching part (-o flag)
echo "Sample123" | grep -o '[0-9]\+'
# Output: 123
# \+ means "one or more"

# Multiple patterns (OR)
grep -E 'resistant|sensitive' data.txt
# -E enables extended regex

Practice Exercise:¶

# Exercise: Find all samples with values > 150
# Hint: Use awk instead of grep for numeric comparisons

# Solution:
awk '$2 > 150' data.txt
# Output: Sample2 200 sensitive
#         Sample4 300 sensitive

Module 5: Text Processing for Genomics¶

Learning Objectives¶

✓ Extract specific columns from files ✓ Perform calculations on data ✓ Transform text efficiently

Tutorial 5.1: Using awk for Data Processing¶

Step 1: Extract Columns¶

# Print specific columns (1st and 2nd)
awk '{print $1, $2}' data.txt
# Output: Sample1 100
#         Sample2 200
#         Sample3 150
#         Sample4 300

# Print with custom separator
awk '{print $1 "," $2}' data.txt
# Output: Sample1,100

# Add text to output
awk '{print "Sample:" $1 " Value:" $2}' data.txt

Step 2: Perform Calculations¶

# Sum values in column 2
awk '{sum+=$2} END {print "Total:", sum}' data.txt
# Output: Total: 750

# Calculate average
awk '{sum+=$2; count++} END {print "Average:", sum/count}' data.txt
# Output: Average: 187.5

# Filter based on value
awk '$2 > 150 {print $0}' data.txt
# Output: Sample2 200 sensitive
#         Sample4 300 sensitive

Tutorial 5.2: Using sed for Text Manipulation¶

Step 1: Basic Substitutions¶

# Replace text (s = substitute)
echo "Sample1" | sed 's/1/A/'
# Output: SampleA

# Global replacement (g = global)
echo "Sample111" | sed 's/1/A/g'
# Output: SampleAAA

# Replace in file and save
sed 's/Sample/Patient/g' data.txt > patients.txt

# Edit file in place (careful!)
sed -i.bak 's/Sample/Patient/g' data.txt
# Creates data.txt.bak as backup

Step 2: Advanced Manipulations¶

# Delete lines containing pattern
sed '/sensitive/d' data.txt
# Removes lines with "sensitive"

# Add text to beginning of lines
sed 's/^/PREFIX_/' data.txt
# Adds PREFIX_ to each line start

# Convert spaces to tabs
sed 's/ /\t/g' data.txt > data.tsv

6. File Permissions and Management¶

Managing Permissions¶

# Make scripts executable
chmod +x scripts/analysis_pipeline.sh

# Protect raw data from accidental modification
chmod 444 data/raw_reads/*.fastq.gz

# Set directory permissions
chmod 755 results/

# Check permissions
ls -la data/raw_reads/

7. Sorting and Unique Operations¶

Processing Sample Lists¶

# Sort sample names
sort data/sample_list.txt

# Sort numerically by coverage
sort -k2 -n coverage_stats.txt

# Get unique mutations
cut -f1,2 variants.txt | sort | uniq

# Count occurrences of each mutation
cut -f3 mutations.txt | sort | uniq -c | sort -rn

8. Pipelines and Redirection¶

Creating Simple Analysis Pipelines¶

# Count reads per sample and save to file
for file in data/raw_reads/*.fastq.gz; do
    sample=$(basename $file .fastq.gz)
    count=$(zcat $file | wc -l | awk '{print $1/4}')
    echo -e "$sample\t$count"
done > results/qc/read_counts.txt

# Extract and count specific genes
grep "^>" reference.fasta | cut -d' ' -f1 | sed 's/>//' | sort | uniq -c > gene_counts.txt

# Process multiple VCF files
for vcf in results/variants/*.vcf; do
    sample=$(basename $vcf .vcf)
    pass_count=$(grep -c "PASS" $vcf)
    total_count=$(grep -v "^#" $vcf | wc -l)
    echo -e "$sample\t$total_count\t$pass_count"
done > results/variant_summary.txt

9. Practical SLURM Integration¶

Preparing Files for HPC Analysis¶

To create a SLURM job script:

# Open nano to create the script
nano prep_pathogen_data.sh

Copy and paste the following content:

#!/bin/bash
#SBATCH --job-name=prep_pathogen_data
#SBATCH --time=00:30:00
#SBATCH --mem=4GB

# Create directory structure
mkdir -p ${SLURM_JOB_ID}_analysis/{data,results,logs}

# Copy and organize files
cp /shared/data/*.fastq.gz ${SLURM_JOB_ID}_analysis/data/

# Generate file list for processing
ls ${SLURM_JOB_ID}_analysis/data/*.fastq.gz > file_list.txt

# Count and verify files
echo "Total files to process: $(wc -l < file_list.txt)"

# Create metadata file
for file in ${SLURM_JOB_ID}_analysis/data/*.fastq.gz; do
    size=$(du -h $file | cut -f1)
    reads=$(zcat $file | wc -l | awk '{print $1/4}')
    echo "$(basename $file)\t$size\t$reads"
done > ${SLURM_JOB_ID}_analysis/data/file_metadata.tsv

echo "Preparation complete. Ready for analysis."

Save the file with: Ctrl+X, then Y, then Enter

Submit the job with: sbatch prep_pathogen_data.sh

Hands-On Exercise: Complete Pathogen Analysis Workflow¶

Exercise Overview¶

Build a complete analysis pipeline step-by-step, applying all the skills you've learned.

Part 1: Setup and Data Preparation¶

# Step 1: Create project structure
mkdir -p pathogen_practice/{data,results,scripts}
cd pathogen_practice
pwd  # Verify you're in the right place

# Step 2: Create sample metadata
nano data/samples.txt
# Paste the following content:
# Mtb_patient_001_resistant
# Mtb_patient_002_susceptible  
# Mtb_patient_003_resistant
# Salmonella_outbreak_001
# Salmonella_outbreak_002
# Save with: Ctrl+X, Y, Enter

# Verify the file was created
cat data/samples.txt

Part 2: Data Analysis Tasks¶

Task 1: Find Resistant Samples¶

# Use grep to find resistant samples
grep "resistant" data/samples.txt

# Save results to file
grep "resistant" data/samples.txt > results/resistant_samples.txt

# Count how many
grep -c "resistant" data/samples.txt

Task 2: Count by Pathogen Type¶

# Count MTB samples
grep -c "Mtb" data/samples.txt

# Count Salmonella samples
grep -c "Salmonella" data/samples.txt

# Save counts
echo "MTB samples: $(grep -c 'Mtb' data/samples.txt)" > results/pathogen_counts.txt
echo "Salmonella samples: $(grep -c 'Salmonella' data/samples.txt)" >> results/pathogen_counts.txt

Task 3: Generate Summary Report¶

# Create a comprehensive summary using nano
nano results/summary_report.txt

# Type the following content (replace the values with actual counts):
# === Pathogen Analysis Summary ===
# Date: [current date]
# Total samples: 5
# Resistant samples: 2
# Susceptible samples: 1
# MTB samples: 3
# Salmonella samples: 2
# =================================
# Save with: Ctrl+X, Y, Enter

# Or use echo commands to generate it automatically:
echo "=== Pathogen Analysis Summary ===" > results/summary_report.txt
echo "Date: $(date)" >> results/summary_report.txt
echo "Total samples: $(wc -l < data/samples.txt)" >> results/summary_report.txt
echo "Resistant samples: $(grep -c "resistant" data/samples.txt)" >> results/summary_report.txt
echo "Susceptible samples: $(grep -c "susceptible" data/samples.txt)" >> results/summary_report.txt
echo "MTB samples: $(grep -c "Mtb" data/samples.txt)" >> results/summary_report.txt
echo "Salmonella samples: $(grep -c "Salmonella" data/samples.txt)" >> results/summary_report.txt
echo "=================================" >> results/summary_report.txt

# Display the report
cat results/summary_report.txt

Part 3: Challenge Exercises¶

Challenge 1: Extract Sample IDs¶

# Extract just the patient IDs (hint: use cut or awk)
# Try it yourself first!

# Solution:
cut -d'_' -f2,3 data/samples.txt
# Or using awk:
awk -F'_' '{print $2"_"$3}' data/samples.txt

Challenge 2: Sort and Count Unique Pathogens¶

# Extract pathogen names and count occurrences
# Try it yourself first!

# Solution:
cut -d'_' -f1 data/samples.txt | sort | uniq -c

Challenge 3: Create a Pipeline¶

# Find all resistant MTB samples in one command
# Try it yourself first!

# Solution:
grep "Mtb" data/samples.txt | grep "resistant"
# Or more elegantly:
grep "Mtb.*resistant" data/samples.txt

Tips for Pathogen Genomics Unix Usage¶

Always work with copies of raw sequencing data
Use meaningful file names with sample IDs and dates
Document your commands in scripts for reproducibility
Check file integrity after transfers (md5sum)
Compress large files to save space (gzip/bgzip)
Use screen or tmux for long-running processes
Regular backups of analysis results
Version control for scripts (git)

Common File Formats in Pathogen Genomics¶

Extension	Format	View Command	Description
`.fastq.gz`	Compressed FASTQ	`zcat file.fastq.gz \\| head`	Raw sequencing reads
`.fasta`	FASTA	`cat file.fasta`	Reference genomes
`.sam/.bam`	SAM/BAM	`samtools view file.bam \\| head`	Alignments
`.vcf`	VCF	`cat file.vcf`	Variant calls
`.gff/.gtf`	GFF/GTF	`cat file.gff`	Gene annotations
`.newick/.tree`	Newick	`cat file.tree`	Phylogenetic trees

Troubleshooting Guide¶

Common Issues and Solutions¶

Issue 1: "Permission denied"¶

# Problem: Can't access or modify a file
# Solution: Check permissions
ls -la filename

# Fix: Change permissions if you own the file
chmod u+rw filename

Issue 2: "No such file or directory"¶

# Problem: File path is wrong
# Solution: Check your current location
pwd

# List files to verify
ls -la

# Use absolute paths to be sure
/full/path/to/file

Issue 3: "Command not found"¶

# Problem: Tool not installed or not in PATH
# Solution: Check if command exists
which command_name

# Load module if available
module avail
module load tool_name

Issue 4: File is empty or corrupted¶

# Check file size
ls -lh filename

# Check file type
file filename

# For compressed files, test integrity
gzip -t file.gz

Issue 5: Out of disk space¶

# Check available space
df -h

# Find large files
du -sh * | sort -h

# Clean up temporary files
rm -rf tmp/*

Best Practices to Avoid Issues¶

Always backup before modifying

cp important_file important_file.backup

Use tab completion to avoid typos

cat sam[TAB]  # Completes to sample.fastq

Preview commands with echo first

echo mv *.fastq backup/  # See what would happen
mv *.fastq backup/       # Then run for real

Check file contents before processing

head -5 file.txt  # Preview first 5 lines
wc -l file.txt    # Count total lines

Quick Reference Card¶

Essential Commands Summary¶

Task	Command	Example
List files	`ls -la`	`ls -la *.fastq`
Change directory	`cd`	`cd ~/hpc_practice`
Create directory	`mkdir -p`	`mkdir -p data/reads`
Copy files	`cp -r`	`cp sample.fastq backup/`
Move/rename	`mv`	`mv old.txt new.txt`
View compressed	`zcat`	`zcat file.gz \\| head`
Count lines	`wc -l`	`wc -l sample.txt`
Search text	`grep`	`grep "pattern" file`
Extract columns	`awk`	`awk '{print $1}' file`
Replace text	`sed`	`sed 's/old/new/g' file`
Sort data	`sort`	`sort -n numbers.txt`
Get unique	`uniq`	`sort file \\| uniq`

Next Steps¶

After mastering these Unix commands, you're ready to: 1. Submit SLURM jobs - See High Performance Computing with SLURM: Practical Tutorial 2. Learn HPC concepts - See HPC and ILIFU Training Materials 3. Build analysis pipelines with Nextflow 4. Perform quality control with FastQC 5. Align reads with BWA 6. Call variants with SAMtools/BCFtools

Remember: Unix commands are the foundation of all bioinformatics pipelines!