Pacific Biosciences of California Inc (PACB): Pacbio: Why We Stopped Using PacBioToCA...



Paulieme


Thursday, 07/25/2013 10:23:51 PM


Pacbio: Why We Stopped Using PacBioToCA and Lived Happily Thereafter
This article came out yesterday (7/24/13). A new edited version with the updated PacBio program:

When we started working on PacBio data a year ago, everyone recommended PacBioToCA. Pause for a moment to imagine what the summer of 2012 was like. Everyone was talking about Illumina, 454, de Bruijn graphs, the Velvet assembler and so on, and these ‘weird’ reads showed up from nowhere. By analogy, everyone is talking about pizza, and BioMickWatson shows five other foods that are like genome assembly, namely Eton mess, spaghetti Bolognese, Marmite, ‘macaroni’ cheese and anchovite. The initial impulse is to turn all of those into pizza toppings to make them attractive.

That is what PacBioToCA does. It turns PacBios into Illuminas and then lets you forget about them. In detail, PacBioToCA painstakingly aligns all Illumina reads onto the PacBio reads and then locally assembles the Illumina reads. From that point onward, you are back in the Illumina world. However, the alignment was incredibly time-consuming. LSC is the same story: it aligns all Illumina reads onto the PacBio reads using Novoalign, an incredibly slow, PacBio-unaware aligner. We realized that it made more sense to assemble the Illumina reads first and then align the resulting contigs onto the PacBio reads.

Over time we learned that any PacBio pipeline not using BLASR is not doing the analysis right. Mark Chaisson spent a lot of time turning BLASR into an incredibly powerful tool. It includes the read-filtering program. It even has a PacBio read simulator, which, according to Mark, matches experimental data better than the published simulator PBSIM.

The main advantage of BLASR is that it knows indels are the primary mode of error in PacBio reads. That makes it very PacBio-aware, which other aligners are not.
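To see why indel-awareness matters, here is a toy sketch. It is not BLASR's algorithm or scoring scheme; the function name `align_cost` and the penalty values are made up for illustration. It computes a plain global alignment cost by dynamic programming and compares two penalty settings on a read that carries two inserted bases.

```python
def align_cost(read, ref, mismatch=1, indel=1):
    """Global alignment cost by dynamic programming (lower is better).

    Toy illustration only: the penalties are hypothetical and this is
    not how BLASR scores or seeds its alignments.
    """
    # prev[j] = cost of aligning an empty read prefix to ref[:j]
    prev = [j * indel for j in range(len(ref) + 1)]
    for i in range(1, len(read) + 1):
        cur = [i * indel]  # read[:i] aligned against an empty reference
        for j in range(1, len(ref) + 1):
            diagonal = prev[j - 1] + (0 if read[i - 1] == ref[j - 1] else mismatch)
            cur.append(min(diagonal, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


ref = "ACGTACGT"
read = "ACGGTACCGT"  # ref with two extra bases inserted

# Indel-tolerant scoring: cheap gaps, expensive substitutions.
print(align_cost(read, ref, mismatch=3, indel=1))  # 2
# Substitution-oriented scoring: cheap mismatches, expensive gaps.
print(align_cost(read, ref, mismatch=1, indel=3))  # 6
```

Under the substitution-oriented penalties the same correct alignment looks three times worse, so an aligner tuned that way is more likely to reject or truncate a read whose errors are mostly insertions and deletions.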

Edit.

PacBioToCA also got upgraded, which we have not kept pace with. Here are the latest updates from Michael Schatz and others -

Also, Jason Chin mentioned a better approach -

The linked presentations are available here and most possibly Jason is suggesting the following pipeline -

If you are using Mike Schatz’s method, the following twitter discussion may be of help to you.

July 24th, 2013 | Category: Pacbio
5 comments to Pacbio: Why We Stopped Using PacBioToCA and Lived Happily Thereafter
.. Mark Chaisson
July 24, 2013 at 9:13 pm
BLASR is very conservative about ending alignments early, as opposed to pushing to the very end of a read. The reason for this is that the end of a read is sometimes hard to nail down. Because of this, BLASR is not very good at mapping Illumina reads since an early termination of an alignment cuts off lots of the read.

If one really insists on correcting PacBio reads, it always seemed better to do a *very* conservative Illumina assembly, and do some weighted mapping of the resulting contigs to the PacBio reads. No repeats should be resolved, and graph-based error correction would be kept to a minimum.
.. samanta
July 25, 2013 at 1:30 am
Mark, What do you mean by ‘weighted’ mapping? How does one set up the weights?
.. Mark
July 25, 2013 at 8:49 am
Good question – it’s a somewhat open question which is why I didn’t take the time to fill in what that meant. I’ll first define what I mean by a conservative assembly, which then relates to the weighting.

A typical de Bruijn assembly has the following steps:
1. Count k-mer frequency. The multiplicities are sampled from a mixture model where correct k-mers follow a Gaussian centered about the coverage, and incorrect k-mers follow an exponential distribution.
2. Pick a threshold that includes as little of the exponential distribution and as much of the Gaussian distribution as possible, and build a de Bruijn graph using k-mers with multiplicity at least this cutoff.
3. Perform graph-based error correction – removing “bubbles” and “tips” – errors in the middle of reads and errors at the ends of reads.
4. Use paired-read information to resolve repeats.
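Steps 1 and 2 above can be sketched in a few lines of Python. This is a minimal illustration, not what any particular assembler does: the function names are made up, and the cutoff heuristic (take the first valley in the multiplicity histogram) is a simplifying assumption standing in for a proper fit of the exponential-plus-Gaussian mixture. Steps 3 and 4 are deliberately left out.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Step 1: count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def pick_cutoff(counts):
    """Step 2a (toy heuristic, an assumption): return the multiplicity at
    the first valley of the histogram, i.e. where the error tail has died
    down and the coverage peak starts to rise."""
    hist = Counter(counts.values())  # multiplicity -> number of distinct k-mers
    for m in range(1, max(hist)):
        if hist[m + 1] > hist[m]:
            return m
    return 1  # no valley found: keep every k-mer


def build_de_bruijn(counts, cutoff):
    """Step 2b: nodes are (k-1)-mers; each retained k-mer is an edge
    from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer, n in counts.items():
        if n >= cutoff:
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Five error-free copies of a tiny "genome" plus one read with an error.
reads = ["ACGTTGCA"] * 5 + ["ACGTAGCA"]
counts = count_kmers(reads, k=4)
cutoff = pick_cutoff(counts)
graph = build_de_bruijn(counts, cutoff)
print(cutoff, graph)
```

Lowering the cutoff keeps more of the error k-mers but also loses fewer true ones, which is the trade-off the rest of this comment is about.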

The sequence information from steps 2-4 may be used to map to pacbio reads to error correct them. Because data is being reduced, at any of these steps it is possible to remove true-positive sequences that will then leave gaps in the corrected pacbio reads, and possibly create gaps in the assembly, or to mis-assemble a contig and again create a gap in the pacbio assembly. While the number of gaps may be small, the overall effect may be large on the *mighty* N50.

A conservative assembly would remove as little data as possible, and not perform any repeat resolution with mate-pairs. This would result in a more complicated graph that contains more short edges, as well as spurious edges caused by sequencing error. These would be mapped back to the pacbio reads, and one would weight by: 1. length of the match, 2. confidence in the assembled contig, such as average coverage, and 3. number of alternate contigs that map to the same position that support a different consensus sequence.
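Since how to set the weights is an open question, the following is only one hypothetical way to combine the three factors, not a tested recipe. The record type `ContigHit`, its field names, and the formula (length times log-coverage, divided by one plus the number of conflicting contigs) are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ContigHit:
    """One contig mapped to a position on a PacBio read (hypothetical record)."""
    contig_id: str
    match_length: int      # 1. length of the match
    mean_coverage: float   # 2. confidence proxy: average Illumina coverage of the contig
    conflicting_hits: int  # 3. other contigs at this position supporting a different consensus


def hit_weight(hit):
    """Hypothetical weight: longer, better-covered, less-contested hits score higher."""
    return hit.match_length * math.log1p(hit.mean_coverage) / (1 + hit.conflicting_hits)


def best_hit(hits):
    """Pick the hit whose sequence should drive the correction at this position."""
    return max(hits, key=hit_weight)


hits = [
    ContigHit("short_edge", match_length=60, mean_coverage=35.0, conflicting_hits=0),
    ContigHit("long_contested", match_length=400, mean_coverage=30.0, conflicting_hits=3),
    ContigHit("long_unique", match_length=350, mean_coverage=28.0, conflicting_hits=0),
]
print(best_hit(hits).contig_id)
```

The logarithm is there so that very high coverage, which often signals a collapsed repeat rather than extra confidence, does not dominate the score; that choice is also an assumption.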
.. samanta
July 25, 2013 at 12:22 pm
Thanks Mark. It is roughly the same prescription I have been following, as guided by Jason Chin. (for twitter discussions, click on this link)
.. http://www.homolog.us/blogs/blog/2013/07/24/pacbio-why-we-stopped-using-pacbiotoca-and-lived-happily-thereafter/