Pacific Biosciences of California Inc (PACB): Large Genome Assembly with PacBio Long R...

Pacific Biosciences of California Inc (PACB)

Paulieme

Monday, 12/01/2014 10:45:30 PM

Large Genome Assembly with PacBio Long Reads
lhon edited this page Nov 13, 2014 · 53 revisions
PacBio long reads can be used in a number of ways to generate and improve de novo assemblies for large genomes. You can take several different approaches:

1. PacBio-only de novo assembly. Using just PacBio reads from a long insert library, the reads are often preprocessed before being assembled using an Overlap-Layout-Consensus algorithm. The best known implementation of this is HGAP.

2. Hybrid de novo assembly. Using a combination of PacBio and short read data, the reads are used together during assembly to generate a hybrid assembly.

3. Gap filling. Starting with an existing mate-pair based assembly, the internal gaps (consisting of Ns) inside the scaffolds are filled using PacBio sequences.

4. Scaffolding. Using an existing assembly (such as an assembly based on short read data), PacBio reads are used to join contigs.
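
As a rough illustration of what "internal gaps (consisting of Ns)" means in practice, the sketch below locates runs of N in a scaffold sequence. It is not how PBJelly or any other tool listed here works internally; the function name `find_gaps` and the `min_len` parameter are hypothetical and exist only for this example.

```python
import re

def find_gaps(scaffold, min_len=1):
    """Return half-open (start, end) coordinates of each run of N/n in a scaffold.

    min_len is a hypothetical filter for ignoring very short N runs.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]

# Toy scaffold with two gaps: 5 Ns and 2 Ns
scaffold = "ACGTNNNNNACGTACGTNNACG"
print(find_gaps(scaffold))             # [(4, 9), (17, 19)]
print(find_gaps(scaffold, min_len=3))  # [(4, 9)]
```

Gap filling then amounts to finding long reads that align to the sequence on both sides of each such interval.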

Figure 1. Illustration of PacBio assembly approaches

Below we discuss what software is available, choosing software, and additional considerations.

Software Options
Name: Description

PacBio-only
HGAP: A workflow that first preassembles reads, then assembles the preassembled reads using Celera® Assembler, and finally polishes the result using Quiver.
•Supports up to 100 Mb from SMRT Portal, which is part of SMRT Analysis.
•Larger genomes are possible from the command line using either smrtpipe.py or the Makefile-based smrtmake.
HBAR-DTK: An experimental toolkit for running HGAP-style assemblies.
Falcon: An experimental diploid assembler, tested on ~100 Mb genomes. See the 2014 AGBT presentation by Jason Chin.
PBcR self-correction: A mode within PBcR (aka pacBioToCA) that performs self-correction in the same style as HGAP. Celera® Assembler 8.2 uses the MHAP algorithm for faster overlap calculation during the self-correction phase.
Celera® Assembler: Celera® Assembler 8.1 offers a way to assemble subreads directly.
Sprai: A preassembly-based assembler that aims to generate longer contigs.

Hybrid
pacBioToCA: An error correction module in Celera® Assembler originally designed to align short reads to PacBio reads and generate consensus sequences. These error-corrected reads can then be assembled by Celera® Assembler.
ECTools: A set of tools that uses contigs instead of short reads for correction.
SPAdes: A short read assembler that added PacBio hybrid assembly support as of version 3.0.
Cerulean: Starts with an assembly graph from ABySS and extends contigs by resolving bubbles in the graph using PacBio long reads. Has been run successfully on genomes <100 Mb.
DBG2OLC: Uses Illumina contigs as anchors to build an overlap graph with PacBio reads, allowing very fast performance.

Gap Filling
PBJelly 2: Upgrades genomes by using PacBio reads to fill in gaps in scaffolds. Has been shown to work with genomes >1 Gb. Part of the PBSuite of applications, which also includes PB Honey. See also PAG 2014: Kim Worley, "Improving Genomes using Long Reads and PB Jelly 2".

Scaffolding
AHA: AHA ("A Hybrid Assembler") is designed to join existing contigs using PacBio reads. Limited to genomes smaller than 200 Mb; part of SMRT Analysis.
PBJelly 2: The new version of PBJelly has support for joining scaffolds.

Considerations
Coverage and Choosing Software
The choice of algorithm depends on how much PacBio sequencing can be obtained and what types of short read data are available. We recommend PacBio-only de novo assembly when it is possible to get at least 50X PacBio coverage. HGAP works at the minimum recommended coverage, but performs better with higher coverage, because a greater number of the longest reads becomes available for assembly. For larger genomes, PBcR in Celera® Assembler 8.2 beta uses MHAP, which offers faster assembly times.

For a hybrid assembly involving both PacBio and short read sequencing, PBcR and ECTools can work well with around 20X PacBio coverage. If a high quality set of scaffolds exists, then PBJelly 2 can be used. We recommend at least 5X PacBio coverage to fill gaps; higher coverage enables better consensus sequences in gap-filled regions and increases the number of addressable gaps, since random sampling at lower coverage can leave some regions uncovered.
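
The coverage guidance above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, "around 20X" is treated as a hard threshold of 20 for simplicity, and the uncovered-fraction estimate assumes idealized uniform random sampling (the Poisson / Lander-Waterman approximation), which real sequencing data only approximates.

```python
import math

def coverage(total_bases, genome_size):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size

def suggest_approaches(pacbio_cov, have_short_reads=False, have_scaffolds=False):
    """Map PacBio coverage and available data to the approaches named in the text.

    Thresholds (50X, 20X, 5X) come from the text; treating them as hard
    cutoffs is an assumption made for this example.
    """
    suggestions = []
    if pacbio_cov >= 50:
        suggestions.append("PacBio-only de novo assembly")
    if have_short_reads and pacbio_cov >= 20:
        suggestions.append("hybrid assembly")
    if have_scaffolds and pacbio_cov >= 5:
        suggestions.append("gap filling")
    return suggestions

def expected_uncovered_fraction(cov):
    """Poisson estimate of the fraction of bases with zero coverage."""
    return math.exp(-cov)

# Hypothetical project: 6 Gb of PacBio data for a 120 Mb genome
cov = coverage(6_000_000_000, 120_000_000)
print(cov, suggest_approaches(cov, have_short_reads=True))
print(expected_uncovered_fraction(5))  # under 1% of bases uncovered at 5X
```

Under this idealized model, even 5X coverage leaves a small fraction of bases unsampled, which is consistent with the note that higher coverage increases the number of addressable gaps.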

Figure 2. PacBio algorithm suggestions from a PAG 2014 presentation by Mike Schatz

Repetitive Content
One of the biggest challenges with de novo assembly is repeat content. In general, the solution is to work with insert sizes that can span repeats and identify unique anchoring sequence on each side. PacBio long reads are uniquely useful in sequencing long inserts, given that they can read from one end of the insert to the other.
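
A minimal sketch of the "span the repeat and anchor on each side" idea, using reference-style coordinates. The function name and the `min_anchor` value of 500 bases are hypothetical choices for illustration, not parameters taken from any of the assemblers discussed here.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=500):
    """Return True if a read covers the whole repeat and extends at least
    min_anchor bases into flanking sequence on both sides.

    Coordinates are half-open [start, end) positions on the same sequence.
    min_anchor is a hypothetical threshold for "enough unique sequence".
    """
    left_anchor = repeat_start - read_start
    right_anchor = read_end - repeat_end
    return left_anchor >= min_anchor and right_anchor >= min_anchor

# A 6 kb repeat at positions 5,000-11,000
print(spans_repeat(0, 20_000, 5_000, 11_000))      # True: 20 kb read spans it
print(spans_repeat(0, 8_000, 5_000, 11_000))       # False: read ends inside the repeat
print(spans_repeat(4_800, 20_000, 5_000, 11_000))  # False: only 200 bases of left anchor
```

Reads that fail this check cannot place the repeat unambiguously on their own, which is why longer inserts help.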

Ploidy
Most existing assemblers were designed for haploid genomes. When a diploid genome has little structural variation between the chromosome copies, a haploid approach can work well, with the occasional structural heterozygosity appearing as separate contigs. In diploid genomes with larger structural variation, or in polyploid genomes, assemblies produced by haploid assemblers become increasingly fragmented. For these genomes, consider Falcon, although it is still considered experimental code. Note also that Celera® Assembler can be configured to favor merging haplotypes.

If possible, select strains to minimize heterozygosity, which helps facilitate assembly. This includes using inbred lines, doubled haploid strains, and other effectively haploid genomes. For example, the human sample that was sequenced is a hydatidiform mole, which is effectively a doubled haploid genome.

Coverage Bias with Short-Read Data
Short read data has coverage bias in regions with extreme GC composition because short read technologies require amplification. Even if PCR-free sample preparation methods are used, ultimately there is bridge amplification during sequencing.

In addition, with error correction approaches such as PBcR, short reads made of simple repeats are difficult to use, because the k-mers used to seed overlaps occur at high frequency and are therefore often filtered out (see PAG 2014, Mike Schatz slide 12).
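
The effect of high-frequency k-mer filtering can be shown with a toy example. This is not PBcR's actual seeding code; the helper names, the k-mer size of 4, and the `max_count` cutoff of 5 are hypothetical values chosen so the effect is visible on tiny inputs.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def usable_seeds(read, kmer_counts, k, max_count):
    """k-mers of a read that survive a high-frequency filter."""
    return [km for km in kmers(read, k) if kmer_counts[km] <= max_count]

# Two simple-repeat reads and one read of non-repetitive sequence
reads = ["ACACACACAC", "ACACACACAC", "GATTACAGGC"]
k = 4
counts = Counter(km for r in reads for km in kmers(r, k))

for r in reads:
    print(r, len(usable_seeds(r, counts, k, max_count=5)))
# The AC-repeat reads keep 0 seeds; the non-repetitive read keeps all 7.
```

With no seeds left, the simple-repeat reads cannot initiate any overlap, so they contribute nothing to correction in those regions.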

Computational Requirements
De novo assembly algorithms using PacBio reads generally use an overlap-layout-consensus approach to arrange long reads (for example Celera® Assembler, which HGAP and pacBioToCA both use). Because the overlap phase requires an all-by-all alignment, computation time scales quadratically with genome size at a given coverage. Genomes approaching one gigabase or larger therefore require significant computational resources. For example, the initial overlap step in preassembly for the 54X human assembly required 405,000 CPU hours. Compute times are also described in the pacBioToCA-based Drosophila assembly. There are efforts to reduce the computational burden, such as Dazzler (blog) and MHAP (blog post, webinar).
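
The quadratic scaling can be made concrete with a back-of-the-envelope count of read pairs in a naive all-by-all comparison. The 50X coverage and the 10 kb mean read length below are hypothetical round numbers, and real overlappers avoid comparing every pair, so this is only a sketch of the trend, not a runtime prediction.

```python
def read_count(genome_size, cov, mean_read_len):
    """Approximate number of reads needed for a given fold coverage."""
    return genome_size * cov // mean_read_len

def all_pairs(n):
    """Number of unordered read pairs in a naive all-by-all comparison."""
    return n * (n - 1) // 2

small = read_count(100_000_000, 50, 10_000)    # 100 Mb genome
large = read_count(1_000_000_000, 50, 10_000)  # 1 Gb genome

print(small, large)
print(all_pairs(large) / all_pairs(small))  # close to 100: 10x the genome, ~100x the pairs
```

This is the motivation for approaches such as MHAP and Dazzler, which aim to find candidate overlaps without examining every pair.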

Hybrid assembly using PBcR also adds a layer of computational complexity, since aligning 100X of short reads to PacBio reads is a computationally intensive task. One way to reduce computation time is to align short read contigs, rather than individual short reads, to the PacBio reads, as ECTools does; this effectively compresses the short read data. This type of approach also has the advantage of increasing the mappability of the short read data, since assembled contigs are longer than the individual reads.

Draft Genome Quality
Gap filling of mate pair-based scaffolded assemblies is particularly sensitive to the quality of the starting assembly. When aligning PacBio reads across gaps in the scaffolds, misassemblies in the scaffolds can result in improper alignments and in gaps that are filled incorrectly or cannot be filled at all.

Large insert libraries
Even though this is a discussion of assembly algorithms, the key to a successful assembly is obtaining the longest reads possible through careful sample preparation. We recommend the largest insert libraries possible (e.g. 20 kb) using BluePippin™ size selection (see 20 kb Template Preparation Using BluePippin Size-Selection) and sequencing with the P6-C4 chemistry.

Datasets and Example Projects
•Human dataset, see also blog post
•2014 PAG presentation by Allen Van Deynze discussing the spinach assembly
•Arabidopsis dataset
•Drosophila dataset, see also this blog post
•Saccharomyces cerevisiae dataset and assembly
•Neurospora crassa (fungus) genome, epigenome, and transcriptome, see also this poster at AGBT 2014
•Other datasets
Additional Links
•http://www.homolog.us/blogs/blog/2014/02/21/opinionated-history-genome-assembly-algorithms/
•PAG 2014: Michael Schatz, “De novo assembly of complex genomes using single molecule sequencing”
•2014 AGBT presentation by Richard McCombie discussing the assembly of rice and yeast, including a coverage titration of the Arabidopsis dataset and assembly performance of ECTools versus HGAP

https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads