Friday, January 31, 2014 5:18:02 PM
The present invention is generally directed to a hierarchical genome assembly process for producing high-quality de novo genome assemblies. The method utilizes a single, long-insert, shotgun DNA library in conjunction with Single Molecule, Real-Time (SMRT®) DNA sequencing, and obviates the need for the additional sample preparation and sequencing data sets required by previously described hybrid assembly strategies. Efficient de novo assembly from genomic DNA to a finished genome sequence is demonstrated for several microorganisms using as few as three SMRT® Cells, and for bacterial artificial chromosomes (BACs) using sequencing data from just one SMRT® Cell. Part of this new assembly workflow is a new consensus algorithm that takes advantage of SMRT® sequencing primary quality values to produce a highly accurate de novo genome sequence, exceeding 99.999% (QV 50) accuracy. The methods are typically performed on a computer and comprise an algorithm that constructs sequence alignment graphs from pairwise alignment of sequence reads to a common reference.
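For readers unfamiliar with the "QV 50" notation, here is a minimal sketch of the standard Phred-style conversion between per-base accuracy and quality value. The function names are my own, not anything from the patent, and it assumes QV is defined as -10 * log10(error rate).

```python
import math

def accuracy_to_qv(accuracy: float) -> float:
    """Phred-scaled quality value for a per-base accuracy in [0, 1)."""
    error = 1.0 - accuracy
    return -10.0 * math.log10(error)

def qv_to_accuracy(qv: float) -> float:
    """Per-base accuracy implied by a Phred-scaled quality value."""
    return 1.0 - 10.0 ** (-qv / 10.0)

if __name__ == "__main__":
    # Under this definition, 99.999% accuracy is about QV 50.
    print(round(accuracy_to_qv(0.99999), 3))
    print(qv_to_accuracy(50))
```

Each step of 10 in QV is one more "nine" of accuracy, so QV 40 corresponds to 99.99% and QV 50 to 99.999%.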
Advances in biomolecule sequence determination, in particular with respect to nucleic acid and protein samples, have revolutionized the fields of cellular and molecular biology. Facilitated by the development of automated sequencing systems, it is now possible to sequence an entire genome, for example, of a micro-organism. However, the quality of the sequence information must be carefully monitored, and may be compromised by many factors related to the biomolecule itself or the sequencing system used, including the composition of the biomolecule (e.g., base composition of a nucleic acid molecule), experimental and systematic noise, variations in observed signal strength, and differences in reaction efficiencies. As such, processes must be implemented to analyze and improve the quality of the data from such sequencing technologies.
The standard of sequencing accuracy was set to 99.99% by the National Human Genome Research Institute (NHGRI) in 1998. While a single base-call for each position in a template may not achieve such accuracy, with increases in coverage, multiple overlapping sequencing reads for a template sequence having lower raw read accuracy can be used to determine a consensus sequence with acceptably high accuracy. Consensus calling algorithms attempt to distinguish sequencing error from variants (e.g., SNPs) using multiple “queries” for a given position. A variety of such algorithms have been developed to address changes in sequencing coverage, error profiles, and information accompanying base-calls as new sequencing systems are developed. Most third party genome assemblers, e.g., Celera®Assembler®, assume that the overlap between the reads can be detected with high identity. For example, an overlap might be called when the identity in the alignment between two reads is above 94%. While it is not necessary to assemble the sequence of an entire genome using such stringent requirements (e.g., the ALLORA assembler from Pacific Biosciences, Menlo Park, Calif., can use reads that only have 70% identity between each other), it remains preferable to construct inputs whose overlap can be detected with high identity before passing them to a third party assembler. Moreover, when there are repeats in a genome, it is also favorable to generate input that can clearly distinguish the different repeats. Finally, it is also preferable that some artifacts due to sequencing reactions, e.g., chimeric reads and high quality region identification errors, can be filtered out before the assembly step.
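To make the 94% versus 70% identity thresholds concrete, here is a small sketch that scores a gapped pairwise alignment. It assumes identity means matching columns divided by total alignment columns (other definitions exist), and the function names and the toy alignment are hypothetical, not taken from any assembler.

```python
def alignment_identity(aligned_a: str, aligned_b: str) -> float:
    """Fraction of alignment columns where both reads carry the same base.

    Inputs are two rows of a gapped alignment ('-' marks a gap), so they
    must have equal length. Gap columns count as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return matches / columns

def overlap_passes(aligned_a: str, aligned_b: str, min_identity: float = 0.94) -> bool:
    """True when the alignment identity is above the chosen threshold."""
    return alignment_identity(aligned_a, aligned_b) > min_identity

if __name__ == "__main__":
    a = "ACGT-ACGTA"
    b = "ACGTTACGAA"
    print(alignment_identity(a, b))                 # 8 matching columns out of 10
    print(overlap_passes(a, b))                     # strict, 94% style threshold
    print(overlap_passes(a, b, min_identity=0.70))  # lenient, 70% style threshold
```

An overlap like this toy one would be rejected under the stricter threshold and accepted under the lenient one, which is why raw long reads are usually cleaned up before being handed to a stringent assembler.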
Sequencing technologies that combine reads from libraries of different lengths of DNA have been developed to generate reads that can satisfy the more stringent input requirements for third party assemblers. However, most of these methods require preparation and separate sequencing of multiple DNA libraries.
The hierarchical genome assembly process starts by using the longer reads to put other reads together, in a similar manner to a sequence assembly process. The method exploits a particular feature of SMRT® sequencing: the read length distribution is not constant but exponential. It is understood that, for a typical sequencing run with a long-insert library, the probability P(l) of obtaining a read with read length l is proportional to exp(-l/L), where L is the average read length. In other words, SMRT® sequencing produces not only shorter fragments but also a number of longer ones. An alignment algorithm (e.g., as implemented in a program such as BLASR, from Pacific Biosciences, Menlo Park, Calif.) can be used to align all the reads to a longer read, thereby creating a mini-assembly for each long read.
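As a rough illustration of what P(l) proportional to exp(-l/L) implies, the sketch below computes how many reads, and how many sequenced bases, sit above a given length cutoff. It assumes an ideal exponential distribution with mean L, which real runs only approximate; the function names are my own.

```python
import math

def fraction_of_reads_longer_than(cutoff: float, mean_length: float) -> float:
    """Share of reads with length > cutoff, for an exponential distribution."""
    return math.exp(-cutoff / mean_length)

def fraction_of_bases_in_reads_longer_than(cutoff: float, mean_length: float) -> float:
    """Share of all sequenced bases carried by reads longer than the cutoff.

    Follows from integrating l * exp(-l/L) from the cutoff to infinity and
    dividing by the same integral from zero: (1 + c/L) * exp(-c/L).
    """
    x = cutoff / mean_length
    return (1.0 + x) * math.exp(-x)

if __name__ == "__main__":
    mean_length = 3000.0  # hypothetical average read length, in bases
    for cutoff in (3000.0, 6000.0, 9000.0):
        reads = fraction_of_reads_longer_than(cutoff, mean_length)
        bases = fraction_of_bases_in_reads_longer_than(cutoff, mean_length)
        print(f"cutoff={cutoff:.0f}: {reads:.3f} of reads, {bases:.3f} of bases")
```

Under this idealized model, reads longer than the average are only about 37% of reads but carry about 74% of the bases, which is why a modest number of long reads can serve as the backbone for the shorter ones.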
In order to utilize all continuous long reads (CLRs) from raw sequencing data, for example as generated by the PacBio® RS®, the longest raw reads, those above a pre-specified length cutoff, Icutoff, are extracted to provide the “seeds” for constructing pre-assemblies. These seed reads serve as scaffolds onto which other reads are recruited. It is desirable to achieve about 15-20× genome coverage of such seed sequences so that sufficient coverage of pre-assembled reads will be generated for the subsequent assembly. The pre-assembled reads are constructed by aligning all reads to each of the seed reads. Each read is mapped to multiple targeted seed reads using the program BLASR (Chaisson and Tesler 2012). The number of read hits mapping to the seed sequences is controlled by the “-bestn” parameter when calling the program BLASR for mapping. This number should be smaller than the total coverage of the seed sequences on the genome. If the “-bestn” number is too high, it is likely that reads from similar repeats will be mapped to each other, which could result in consensus errors. Conversely, if the chosen “-bestn” number is too low, the quality of the pre-assembly consensus may be decreased. The optimal choice might also depend on DNA fragment library construction, which can affect the subread length distribution. A preferred value of “-bestn” is 12. Further study will allow a reasonable choice for optimized results.
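The seed-selection step described above can be sketched as follows. This is only an illustration of the idea, not the patented implementation: it assumes the genome size is roughly known in advance, and the function names (choose_seed_cutoff, bestn_is_below_seed_coverage) are hypothetical.

```python
from typing import Optional, Sequence

def choose_seed_cutoff(read_lengths: Sequence[int],
                       genome_size: int,
                       target_coverage: float = 15.0) -> Optional[int]:
    """Pick a length cutoff so that reads at or above it reach the target coverage.

    Walks the reads from longest to shortest, adding up bases until the
    total reaches target_coverage * genome_size. Returns the length of the
    last read needed, or None if the data set is too small.
    """
    needed_bases = target_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed_bases:
            return length
    return None

def bestn_is_below_seed_coverage(bestn: int, seed_coverage: float) -> bool:
    """Check the rule of thumb that -bestn stays below the seed coverage."""
    return bestn < seed_coverage

if __name__ == "__main__":
    # Toy numbers only: five reads on a 10-base "genome", 2x seed coverage.
    print(choose_seed_cutoff([10, 8, 6, 4, 2], genome_size=10, target_coverage=2.0))
    # The preferred -bestn of 12 sits below the 15-20x seed coverage range.
    print(bestn_is_below_seed_coverage(12, 15.0))
```

Reads at or above the returned cutoff would be the seeds; everything else gets mapped onto them.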
(for full story, use link) http://www.google.com/patents/US20140025312?utm_content=buffer8c1e8&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer