Objectives¶
- To perform De novo genome assembly and reference-based genome assembly
- Assess assembly quality using quality metrics
Target Organisms¶
Mycobacterium tuberculosis¶
- Genome size: ~4.4 Mb
- Characteristics: High GC content ~65.6%, complex secondary structures
- Genes ~4,000 protein-coding genes
- Clinical relevance: Major global pathogen, drug resistance concerns
- Assembly challenges: Repetitive sequences, PE/PPE gene families
Vibrio cholerae¶
- Characteristics: Dual chromosome structure
- Genome size: ~4.0 Mb (two chromosomes: ~3.0 Mb + ~1.1 Mb)
- GC content: ~47.7%
- Genes: ~3,800 protein-coding genes
- Clinical relevance: Cholera pandemics, epidemic strain tracking
- Assembly challenges: Chromosome separation, mobile genetic elements
Genome assembly¶
Genome assembly involves reconstructing a genome from a set of fragmented DNA sequences (reads) obtained from sequencing technologies.
It aims to piece together the reads to create a continuous sequence that represents the genome of an organism.
Key Concepts¶
- Reads: Fragmented sequences of DNA obtained from sequencing technologies.
- Contigs: Continuous sequences formed by overlapping reads.
- Scaffolds: Ordered and oriented sets of contigs, sometimes with gaps, which represent larger regions of the genome.
- Coverage: The average number of times each base in the genome is sequenced, which affects the accuracy of the assembly.
- Assembly can be:
- De novo assembly - the construction of the genome from scratch without a reference.
- Reference-guided assembly - uses a known reference genome to guide the assembly of the new genome.
Steps in Genome Assembly¶
- Preprocessing (quality control, adapter sequences and low-quality bases filtering with FASTQC, TRIMMOMATIC/FASTP/BBMap).
- Assembly
- Scaffolding (involves use of long-range information from paired-end or mate-pair reads to order and orient contigs into scaffolds (e.g., SSPACE).
- Gap Filling (use additional reads or assembly techniques to close gaps within scaffolds (e.g., GapCloser).
- Polishing (correcting sequencing errors and improving the accuracy of the assembly, e.g., Pilon).
- Evaluation (Assembly Metrics including N50 (length of the contig such that 50% of the total assembly length is in contigs of this length or longer); genome completeness; and accuracy, e.g., QUAST).
Genome Assembly Algorithms¶
Overlap Layout Consensus (OLC)¶
- Suitable for long-read sequencing (e.g., PacBio, Oxford Nanopore)
- Determines overlap between reads
- Arranges reads based on ovelaps
- Resolves conflict to build the final sequences
- Handles long reads and complex regions well
- Computationally intensive
De Bruijn Graph (DBG)¶
- Best for short-read sequencing (e.g., illumina).
- Uses small overlapping sequences (k-mers) to build a graph, nodes: k-mers, and edges: overlaps.
- Fast and memory efficient for large datasets
- Struggles with repetitive regions
Hybrid Methods¶
- Combine DBG and OLC strengths to improve assembly quality, especially for error-prone long reads.
Assembler Classification by Read Type¶
Short-Read Assemblers¶
- Short-read assemblers process high volumes of short sequences using DBG algorithms
- Ideal for technologies like Illumina.
- SPAdes (widely used for small genomes)
- Velvet (older though useful)
- SOAPdenovo (short-read assembler for larger genomes),
- ABySS (large genomes).
Long-Read Assemblers¶
- Long-read assemblers handle fewer, longer sequences using OLC algorithms
- Can span entire genes and repetitive regions
- Optimized for platforms like PacBio and ONT.
- Higher error rates (can be corrected by short reads or polishing)
Examples of Long-Read Assemblers for Bacterial Genomes¶
-
FLYE (Recommended for most bacterial genomes)
-
Excellent for PacBio and Oxford Nanopore
- Good repeat resolution
-
Usage: flye --nano-raw reads.fastq --out-dir output --genome-size 4.5m
-
Canu (High accuracy, slower)
-
Gold standard for accuracy
- Requires significant computational resources
-
Usage: canu -p prefix -d output genomeSize=4.5m -nanopore reads.fastq
-
Unicycler (Hybrid approach)
-
Combines short and long reads
- Excellent for complete genomes
-
Usage: unicycler -1 short_R1.fq -2 short_R2.fq -l long_reads.fq -o output
-
Raven (Fast, lightweight)
-
Quick assemblies for preliminary analysis
- Good for large datasets
-
Usage: raven reads.fastq > assembly.fasta
-
NextDenovo (High accuracy for Nanopore)
-
Specialized for Oxford Nanopore data
- Good error correction
- Usage: nextDenovo config.txt
Polishing Tools:
- Medaka (Nanopore): medaka_consensus -i reads.fastq -d assembly.fasta -o polished
- Pilon (with short reads): pilon --genome assembly.fasta --frags mapped_reads.bam
- Racon: racon reads.fastq mappings.paf assembly.fasta
Hybrid Assemblers¶
- Combine short and long reads, integrating DBG and OLC methods for improved accuracy and contiguity.
- Balances accuracy (short reads) and completeness (long reads, resolving repeats and structural variants)
- Requires careful data integration
- Unicycler, MaSuRCA
Reference-guided Assembly¶
- Works well given a closely related reference genome
- Aligns reads to a reference genome and fills gaps
- Faster and less computationally demanding
- May miss novel sequences or structural variations
NOTE¶
Assembly Strategy | Subtypes | Read Type Compatibility | Examples | Notes |
---|---|---|---|---|
De novo assembly | DBG, OLC, Hybrid | Short, long, hybrid | Velvet, SPAdes, Canu, Flye, MaSuRCA, UniCycler | No reference genome used |
Reference-guided assembly | Mapping-based | Short, long | BWA, Bowtie2, Novoalign, Minimap2 | Aligns reads to a known reference genome |
Assembler Selection Factors¶
Choosing an assembler depends on" - Sequencing technology, - Project goals, and - Available computational resources.
Best Practices for Genome Assembly¶
- Start with high-quality, high-coverage sequencing data to improve the accuracy of the assembly.
- Try multiple assemblers and compare results, as different tools may perform better for different data types.
- Use iterative rounds of assembly, scaffolding, and polishing to gradually improve the assembly.
- Validate the final assembly using independent data, such as long-read sequencing or optical mapping.
- Keep detailed records of all parameters and steps used in the assembly process for reproducibility.