Pacific Biosciences of California Inc (PACB): Publication date Jan 23, 2014 -- Hierarc...

Pacific Biosciences of California Inc (PACB)

Paulieme

Re: None

Friday, 01/31/2014 5:18:02 PM

Publication date Jan 23, 2014 -- Hierarchical genome assembly method using single long insert library
The present invention is generally directed to a hierarchical genome assembly process for producing high-quality de novo genome assemblies. The method utilizes a single, long-insert, shotgun DNA library in conjunction with Single Molecule, Real-Time (SMRT®) DNA sequencing, and obviates the need for the additional sample preparation and sequencing data sets required by previously described hybrid assembly strategies. Efficient de novo assembly from genomic DNA to a finished genome sequence is demonstrated for several microorganisms using as few as three SMRT® Cells, and for bacterial artificial chromosomes (BACs) using sequencing data from just one SMRT® Cell. Part of this new assembly workflow is a new consensus algorithm that takes advantage of SMRT® sequencing primary quality values to produce a highly accurate de novo genome sequence, exceeding 99.999% (QV 50) accuracy. The methods are typically performed on a computer and comprise an algorithm that constructs sequence alignment graphs from pairwise alignments of sequence reads to a common reference.
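A quick sanity check on the "99.999% (QV 50)" figure, assuming QV is the usual Phred-scaled quality value (QV = -10 * log10 of the error probability). The function names below are my own, not from the patent; this is only a sketch of the arithmetic.

```python
import math

def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to accuracy (1 - error probability)."""
    return 1.0 - 10.0 ** (-qv / 10.0)

def accuracy_to_qv(accuracy: float) -> float:
    """Convert an accuracy in [0, 1) to a Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - accuracy)

if __name__ == "__main__":
    # QV 50 corresponds to one expected error per 100,000 bases.
    print(qv_to_accuracy(50))
    # The 99.99% standard mentioned in the text corresponds to about QV 40.
    print(accuracy_to_qv(0.9999))
```

Under that assumption, QV 50 is one error in 100,000 bases, ten times better than the 99.99% (about QV 40) standard.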
Advances in biomolecule sequence determination, in particular with respect to nucleic acid and protein samples, have revolutionized the fields of cellular and molecular biology. Facilitated by the development of automated sequencing systems, it is now possible to sequence an entire genome, for example, of a micro-organism. However, the quality of the sequence information must be carefully monitored, and may be compromised by many factors related to the biomolecule itself or the sequencing system used, including the composition of the biomolecule (e.g., base composition of a nucleic acid molecule), experimental and systematic noise, variations in observed signal strength, and differences in reaction efficiencies. As such, processes must be implemented to analyze and improve the quality of the data from such sequencing technologies.

The standard of sequencing accuracy was set to 99.99% by the National Human Genome Research Institute (NHGRI) in 1998. While a single base-call for each position in a template may not achieve such accuracy, with increased coverage, multiple overlapping sequencing reads of a template sequence, each with lower raw read accuracy, can be used to determine a consensus sequence with acceptably high accuracy. Consensus calling algorithms attempt to distinguish sequencing errors from variants (e.g., SNPs) using multiple “queries” for a given position. A variety of such algorithms have been developed to address changes in sequencing coverage, error profiles, and information accompanying base-calls as new sequencing systems are developed.

Most third party genome assemblers, e.g., Celera® Assembler®, assume that the overlap between the reads can be detected with high identity. For example, an overlap might be called when the identity in the alignment between two reads is above 94%. While it is not necessary to assemble the sequence of an entire genome using such stringent requirements (e.g., the ALLORA assembler from Pacific Biosciences, Menlo Park, Calif., can use reads that have only 70% identity to each other), it remains preferable to construct inputs whose overlap can be detected with high identity before passing them to a third party assembler. Moreover, when there are repeats in a genome, it is also favorable to generate input that can clearly distinguish the different repeats. Finally, it is also preferable that some artifacts of the sequencing reactions, e.g., chimeric reads and high quality region identification errors, can be filtered out before the assembly step.
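To make the 94% versus 70% identity thresholds concrete, here is a minimal sketch of an overlap filter. It assumes identity means matching columns divided by total alignment columns (gaps count as mismatches); real assemblers may define it differently, and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both rows carry the same base.

    Both inputs are rows of one pairwise alignment, padded with '-' for
    gaps, so they must have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)

def overlap_accepted(aligned_a: str, aligned_b: str,
                     min_identity: float = 94.0) -> bool:
    """True if the aligned overlap meets the identity threshold."""
    return percent_identity(aligned_a, aligned_b) >= min_identity

if __name__ == "__main__":
    a = "ACGTACGTAC"
    b = "ACGTAC-TAC"  # one deletion relative to a
    print(percent_identity(a, b))
    print(overlap_accepted(a, b, min_identity=94.0))
    print(overlap_accepted(a, b, min_identity=70.0))
```

In this toy pair, 9 of 10 columns match, so the overlap would pass a 70% threshold but fail a 94% one, which is the gap the pre-assembly step is meant to close.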

Sequencing technologies that combine reads from libraries of different lengths of DNA have been developed to generate reads that can satisfy the more stringent input requirements for third party assemblers. However, most of these methods require preparation and separate sequencing of multiple DNA libraries.

The hierarchical genome assembly process starts with using the longer reads to put other reads together, in a similar manner to a sequence assembly process. The method utilizes a special feature of SMRT® sequencing: the read length is not constant but follows an exponential distribution. It is understood that, for a typical sequencing run with a long-insert library, the probability P(l) of obtaining a read with read length l is proportional to exp(-l/L), where L is the average read length. In other words, SMRT® sequencing produces not only shorter fragments but also a number of longer ones. An alignment algorithm (e.g., as implemented in a program such as BLASR, from Pacific Biosciences, Menlo Park, Calif.) can be used to align all the reads to a longer read, thereby creating a mini-assembly for each long read.
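A rough sketch of what the exponential model P(l) ∝ exp(-l/L) implies for the long-read tail. The closed forms follow from the exponential distribution itself; the mean length of 3,000 and cutoff of 6,000 in the demo are made-up numbers for illustration, not values from the patent.

```python
import math
import random

def fraction_of_reads_longer_than(cutoff: float, mean_length: float) -> float:
    """P(l > cutoff) for an exponential distribution with the given mean."""
    return math.exp(-cutoff / mean_length)

def fraction_of_bases_in_reads_longer_than(cutoff: float,
                                           mean_length: float) -> float:
    """Share of all sequenced bases that sit in reads longer than cutoff."""
    x = cutoff / mean_length
    return (1.0 + x) * math.exp(-x)

def simulate_read_lengths(n_reads: int, mean_length: float, seed: int = 0):
    """Draw read lengths from the exponential model (illustration only)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_length) for _ in range(n_reads)]

if __name__ == "__main__":
    mean_length, cutoff = 3000.0, 6000.0  # hypothetical values
    print(fraction_of_reads_longer_than(cutoff, mean_length))
    print(fraction_of_bases_in_reads_longer_than(cutoff, mean_length))
    lengths = simulate_read_lengths(20000, mean_length)
    print(sum(1 for l in lengths if l > cutoff) / len(lengths))
```

With a cutoff at twice the mean, only roughly 14% of reads qualify, but they carry roughly 40% of the bases, which is why a minority of long reads can still supply meaningful seed coverage.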

In order to utilize all continuous long reads (CLRs) from raw sequencing data, for example as generated by the PacBio® RS®, the longer raw reads, those above a pre-specified length cutoff, l_cutoff, are extracted to provide the “seeds” for constructing pre-assemblies. These seed reads serve as scaffolds for recruiting other reads. It is desirable to achieve about 15-20× genome coverage of such seed sequences so that a sufficient amount of coverage of pre-assembled reads will be generated for the subsequent assembly. The pre-assembled reads are constructed by aligning all reads to each of the seed reads. Each read is mapped to multiple targeted seed reads using the program BLASR (Chaisson and Tesler 2012). The number of read hits mapping to the seed sequences is controlled by the “-bestn” parameter when calling the program BLASR for mapping. This number should be smaller than the total coverage of the seed sequences on the genome. If the “-bestn” number is too high, it is likely that reads from similar repeats will be mapped to each other, which could result in consensus errors. Conversely, if the chosen “-bestn” number is too low, the quality of the pre-assembly consensus may be decreased. The optimal choice might also depend on DNA fragment library construction, which can affect the subread length distribution. A preferred value of “-bestn” is 12, i.e., each read is mapped to at most 12 seed reads. Further study will allow a reasonable choice for optimized results.
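One way the seed-selection step could be sketched: pick the largest length cutoff at which the reads at or above it still add up to the target seed coverage. This is my own illustration under that assumption, not the patent's implementation; the function names and the toy numbers are hypothetical.

```python
def choose_seed_cutoff(read_lengths, genome_size, target_coverage=15.0):
    """Return the largest cutoff whose seeds reach the target coverage.

    Walks the reads from longest to shortest and stops once their summed
    length reaches target_coverage * genome_size. Returns None if even
    all reads together fall short.
    """
    needed = target_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed:
            return length
    return None

def select_seeds(read_lengths, cutoff):
    """Keep the reads at or above the cutoff as seed reads."""
    return [length for length in read_lengths if length >= cutoff]

if __name__ == "__main__":
    # Toy numbers: a 10-base "genome" and a 2x seed coverage target.
    reads = [10, 8, 6, 4, 2, 2]
    cutoff = choose_seed_cutoff(reads, genome_size=10, target_coverage=2.0)
    print(cutoff, select_seeds(reads, cutoff))
```

The seed coverage this yields is also the number that the text says “-bestn” should stay below, so the two choices are best made together.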

(For the full story, use the link) http://www.google.com/patents/US20140025312?utm_content=buffer8c1e8&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer