Pacific Biosciences of California Inc (PACB): (From Cornell University Library) DBG2OL...


Paulieme


Tuesday, 10/14/2014 3:40:12 PM


(From Cornell University Library) DBG2OLC: Efficient Assembly of Large Genomes Using the Compressed Overlap Graph
Authors: Chengxi Ye, Chris Hill, Sergey Koren, Jue Ruan, Zhanshan (Sam) Ma, James A. Yorke, Aleksey Zimin
(Submitted on 10 Oct 2014)
Abstract: The genome assembly computational problem is preventing the transition from the prevalent second generation to third generation sequencing technology. The problem emerged because the erroneous long reads made the assembly pipeline prohibitively expensive in terms of computational time and memory space consumption.
In this paper, we propose and demonstrate a novel algorithm that allows efficient assembly of long erroneous reads of mammalian size genomes on a desktop PC. Our algorithm converts the de novo genome assembly problem from the de Bruijn graph to the overlap layout consensus framework. We only need to focus on the overlaps composed of reads that are non-contained within any contigs built with a de Bruijn graph algorithm, rather than on all the overlaps in the genome data sets. For each read spanning through several contigs, we compress the regions that lie inside each de Bruijn graph contig, which greatly lowers the length of the reads and therefore the complexity of the assembly problem. The new algorithm transforms previously prohibitive tasks such as pair-wise alignment into jobs that can be completed within a small amount of time. A compressed overlap graph that preserves all necessary information is constructed with the compressed reads to enable the final-stage assembly.
We implement the new algorithm in a proof-of-concept software package DBG2OLC. Experiments with the sequencing data from the third generation technologies show that our method is able to assemble large genomes much more efficiently than existing methods. On a large PacBio human genome dataset we calculated the pair-wise alignment of 54x erroneous long reads of human genome in 6 hours on a desktop computer compared to the 405,000 CPU hours using a cluster, previously reported by Pacific Biosciences. The final assembly results were of comparably high quality.
Subjects: Genomics (q-bio.GN)
Cite as: arXiv:1410.2801 [q-bio.GN]
(or arXiv:1410.2801v1 [q-bio.GN] for this version)

Submission history
From: Chengxi Ye [view email]
[v1] Fri, 10 Oct 2014 14:58:15 GMT (492kb)
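
For anyone wondering what "compressing" a read means in the abstract above, here is a toy Python sketch of the general idea. It is not the DBG2OLC code. It assumes each long read has already been mapped to an ordered list of de Bruijn graph contig IDs (the abstract does not say how that mapping is done), it ignores strand orientation and sequencing errors, and every function name is made up for illustration.

```python
from itertools import groupby


def compress_read(contig_hits):
    """Collapse each run of the same contig ID into a single entry."""
    return tuple(contig_id for contig_id, _ in groupby(contig_hits))


def is_contained(inner, outer):
    """True if `inner` occurs as a contiguous run inside `outer`."""
    n, m = len(inner), len(outer)
    return any(outer[i:i + n] == inner for i in range(m - n + 1))


def remove_contained(reads):
    """Drop duplicate reads and reads lying entirely inside another read."""
    unique = list(dict.fromkeys(reads))  # de-duplicate, keep input order
    return [r for r in unique
            if not any(r != other and is_contained(r, other) for other in unique)]


def suffix_prefix_overlap(left, right):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_overlap=2):
    """Map each read to the reads it overlaps on its right-hand side."""
    graph = {r: [] for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                k = suffix_prefix_overlap(a, b)
                if k >= min_overlap:
                    graph[a].append((b, k))
    return graph


if __name__ == "__main__":
    # Hypothetical input: contig IDs hit along each read, in read order.
    raw_reads = [
        [1, 1, 1, 2, 2, 3],
        [2, 3, 3, 4, 4, 5],
        [3, 3, 4],
        [4, 5, 5, 6],
    ]
    compressed = [compress_read(r) for r in raw_reads]
    kept = remove_contained(compressed)
    for read, edges in build_overlap_graph(kept).items():
        print(read, "->", edges)
```

In this toy version a read becomes a handful of contig IDs instead of thousands of bases, so comparing two reads means comparing short tuples rather than aligning long noisy sequences. That is one way to picture why the pair-wise alignment step gets so much cheaper.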

http://arxiv.org/abs/1410.2801 (another link)

http://www.homolog.us/blogs/blog/2014/10/13/very-efficient-hybrid-assembler-for-pacbio-data/