Monday, January 13, 2014Data Release: Preliminary de novo Haploid and Diploid Assemblies of Drosophila melanogaster
Model organisms such as yeast, Arabidopsis and Drosophila have been essential to progress in genetic and biomedical research for more than 100 years. Model organisms are the best, fastest, most effective way to advance science especially when human experimentation may not be feasible. Numerous biological principles have been elucidated using model organisms, including Nobel-prize winning discoveries by Thomas Hunt Morgan that genes are carried on chromosomes; by Hermann Muller for the discovery that X-ray irradiation causes mutations; and by Edward B. Lewis, Christiane Nüsslein-Volhard, and Eric Wieschaus for their discoveries revealing the genetic control of early embryonic development - all using the fruit fly Drosophila melanogaster. D. melanogaster also played a crucial role in the development of genomics by being the first multicellular organism to be sequenced and assembled by a whole genome shotgun method, followed by directed clone-based finishing. The resulting D. melanogaster reference genome sequence provides an invaluable test bed for developing new genome sequencing and assembly technologies.
In collaboration with Dr. Casey Bergman at the University of Manchester and Drs. Susan Celniker and Roger Hoskins of the Berkeley Drosophila Genome Project (BDGP) at Lawrence Berkeley National Laboratory, we have sequenced adult males from a subline of the ISO1 (y; cn, bw, sp) strain of D. melanogaster. This is the same stock used in the official BDGP reference assemblies since the first genome sequence release in 2000. The DNA was size-selected for >15 kb elution using the BluePippinTM system (Sage Science), and in total, ~15 Gb of sequence was generated from a 20 kb library using P5-C3 sequencing chemistry on the PacBio® RS II.
Total number of bases: 15,208,567,933 bp
Total number of reads: 1,514,730
Average read length: 10,040 bp
Half of sequenced bases in reads greater than: 14,214 bp
PacBio RS II instrument time for sequencing: 6 days
Number of SMRT® Cells: 42
Some preliminary analyses and step-by-step instructions for downloading, mapping, and visualizing these raw data are described on the Bergman lab blog. This analysis shows that the depth of coverage for this dataset is >90x for reads mapping to autosomes in the D. melanogaster Release 5 reference genome sequence. Dr. Bergman also shows that individual PacBio long reads can uniquely localize repetitive transposable elements up to ~10 kb in size, and can be used to fill at least one of the rare but persistent gaps remaining in the euchromatic portion of the reference genome.
Most assembly algorithms collapse contigs into a single copy of the genome (haploid); however this does not reflect the true underlying state in diploid genomes such as flies or human, which have two copies of each chromosome - inheriting one copy from the mother and another copy from the father. In the case of this inbred subline, with limited allelic variation, both assembly strategies are feasible. We attempted both a traditional haploid assembly as well as a first-ever diploid assembly of the D. melanogaster genome, with preliminary results summarized below. The maximum contig length in both haploid (25 Mb) and diploid (21 Mb) assemblies produced contigs that span almost the entire length of chromosome arm 3L:
The preliminary haploid assembly using the PacBio Corrected Reads (PBcR) pipeline in Celera® Assembler 8.1 was carried out in collaboration with Drs. Sergey Koren and Adam Phillippy at the University of Maryland. They were able to assemble entire chromosome arms de novo into fewer pieces than the current version of the reference genome (Release 5) – an effort that has spanned two decades, cost millions of dollars, and involved laborious BAC design and sequencing as well as manual finishing and optical mapping. This level of completeness in a de novo assembly is unprecedented in a metazoan genome, and required a total of 6 weeks from initial fly collection and sorting to final analysis and assembly. The actual sequencing time was 6 days. The contig count for each chromosome is tabulated below.
The larger number of contigs for chromosome X can be explained because only adult males were sequenced here (with a 50:50 ratio of chrX to chrY), whereas the Release 5 genome is from mixed-sex embryos. Chromosome X and Y assemblies can further be improved using a higher coverage of reads, and this analysis is underway. Additional analysis and data associated with the haploid assembly are provided here by Dr. Sergey Koren and Dr. Adam Phillippy, including 70x of high-quality, pre-assembled reads using the Celera Assembler PBcR pipeline. Dr. Koren will be presenting the full results at the International Plant and Animal Genome XXII meeting on Tuesday, January 14th.
We also attempted a diploid assembly using an early version of FALCON (Fast Alignment and CONsensus), a new assembly algorithm that is currently being developed at PacBio to explore de novo assembly of diploid genomes. The FALCON assembler extends the Hierarchical Genome Assembly Process (HGAP), with faster implementations of the aligner and consensus algorithms for the pre-assembly and overlap steps of the process. A string-graph is used for the layout stage, which preserves structural and phasing information in polymorphic and heterozygous diploid genomes. The algorithm outputs “primary contigs” and their “associated contigs”, capturing alternative local variants associated with a primary contig. Details were presented at the Genome Informatics Conference in Cold Spring Harbor in November 2013 with slides available here. We assessed the results of our preliminary FALCON assembly by aligning the assembled contigs (light green) to the euchromatic arms (dark green) of the D. melanogaster reference genome using nucmer from the Mummer3 package. Alignments to 2L, 2R, 3L, 3R, 4, X and Y are shown below with sequence identity of each alignment block color-coded in blue indicating >99.96% identity and yellow indicating >99.9% identity. The blue arches represent “unused edges” in the FALCON string graph, and can be interpreted as potential connection points if a more aggressive contig joining threshold were desired.
By sequencing sex-selected male flies, we were able to increase the coverage of reads from the Y-chromosome, which remains one of the most challenging regions left to assemble in the reference genome. Only ~1% of chromosome Y is represented in the reference (Release 5). More recently the BDGP has assembled 7.5% of the Y and we anticipate more than 50% of the Y-chromosome can be assembled with the new data. Ken Wan and Sue Celniker identified contigs containing Y linked genes. The self-self dot-plot below shows that FALCON was able to de novo assemble a 650 kb region of the heterochromatic region of chromosome Y containing a very complex nested repeat structure. In addition, the gene Pp1-Y2 (a testis-specific phosphatase) which spans several gaps of unknown size in the Release 5 sequence is now entirely contained in this single contig.
A development version of FALCON is available on github to developers and bioinformaticians interested in this new assembly process.
We are also releasing an updated dataset of the Arabidopsis thaliana genome using the latest P5-C3 chemistry for comparison to our initial data release using P4-C2 chemistry. In total, we have now released four datasets from three important model organisms: S. cerevisiae (yeast), A. thaliana (flowering plant), and D. melanogaster (fruit fly). This data is freely available and we invite the research community to explore the collection:
http://blog.pacificbiosciences.com/
Recent PACB News
- Ambry Genetics and PacBio Announce Collaboration to Sequence Up to 7,000 Human Genomes Aimed at Providing Answers for Families Battling Rare Diseases • PR Newswire (US) • 05/15/2024 01:45:00 PM
- Form S-3ASR - Automatic shelf registration statement of securities of well-known seasoned issuers • Edgar (US Regulatory) • 05/09/2024 08:33:12 PM
- Form 10-Q - Quarterly report [Sections 13 or 15(d)] • Edgar (US Regulatory) • 05/09/2024 08:21:46 PM
- Form 8-K - Current report • Edgar (US Regulatory) • 05/09/2024 08:12:15 PM
- PacBio Announces First Quarter 2024 Financial Results • PR Newswire (US) • 05/09/2024 08:05:00 PM
- PacBio Announces Preliminary First Quarter 2024 Revenue and Updates 2024 Revenue Guidance • PR Newswire (US) • 04/16/2024 12:05:00 PM
- Estonia National Biobank Selects PacBio to Sequence 10,000 Whole Genomes • PR Newswire (US) • 03/27/2024 12:00:00 PM
- PacBio Grants Equity Incentive Award to New Employee • PR Newswire (US) • 03/22/2024 08:30:00 PM
- PacBio Announces PureTarget™ Repeat Expansion Panel, Expanding its Portfolio of End-to-End Clinical Research Solutions • PR Newswire (US) • 03/12/2024 01:05:00 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 03/06/2024 10:36:07 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 03/06/2024 10:30:18 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 03/06/2024 10:26:40 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 03/06/2024 10:22:45 PM
- Form 144 - Report of proposed sale of securities • Edgar (US Regulatory) • 03/04/2024 11:32:39 PM
- Form 144 - Report of proposed sale of securities • Edgar (US Regulatory) • 03/04/2024 11:22:32 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 02/26/2024 09:55:28 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 02/26/2024 09:36:09 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 02/26/2024 09:25:48 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 02/26/2024 09:19:42 PM
- PacBio to Present at Upcoming Investor Conferences • PR Newswire (US) • 02/26/2024 09:05:00 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 02/21/2024 11:25:13 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 02/21/2024 11:20:57 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 02/21/2024 11:17:14 PM
- Form 4 - Statement of changes in beneficial ownership of securities • Edgar (US Regulatory) • 02/21/2024 11:07:18 PM
- Form 144 - Report of proposed sale of securities • Edgar (US Regulatory) • 02/20/2024 09:17:12 PM
"Defo's Morning Briefing" Set to Debut for "GreenliteTV" • GRNL • May 21, 2024 2:28 PM
North Bay Resources Announces 50/50 JV at Fran Gold Project, British Columbia; Initiates NI 43-101 Resources Estimate and Bulk Sample • NBRI • May 21, 2024 9:07 AM
Greenlite Ventures Inks Deal to Acquire No Limit Technology • GRNL • May 17, 2024 3:00 PM
Music Licensing, Inc. (OTC: SONG) Subsidiary Pro Music Rights Secures Final Judgment of $114,081.30 USD, Demonstrating Strength of Licensing Agreements • SONGD • May 17, 2024 11:00 AM
VPR Brands (VPRB) Reports First Quarter 2024 Financial Results • VPRB • May 17, 2024 8:04 AM
ILUS Provides a First Quarter Filing Update • ILUS • May 16, 2024 11:26 AM