PubMed.gov (US National Library of Medicine National Institutes of Health) 2014 Apr 3. De Novo Assembly of the Quorum-Sensing Pandoraea sp. Strain RB-44 Complete Genome Sequence Using PacBio Single-Molecule Real-Time Sequencing Technology.
Abstract
We report the first complete genome sequence of Pandoraea sp. strain RB-44, which was found to possess quorum-sensing properties. To the best of our knowledge, this is the first documentation of both a complete genome sequence and quorum-sensing properties of a Pandoraea species.
http://www.ncbi.nlm.nih.gov/pubmed/24699956?dopt=Abstract&utm_source=dlvr.it&utm_medium=twitter
Lots of useful assembly info relating to @PacBio being discussed today. Follow hashtag #Livpacbio -- 7:43 AM - 4 Apr 2014 (https://twitter.com/search?q=%23Livpacbio&src=hash)
Ian Goodhead, 9 hrs ago: "Full length (1500bp) 16S tested. Improvement over Illumina V4 studies on MiSeq." Exciting stuff for community profiling #livpacbio -- 3:34 AM - 4 Apr 2014
New Products: PacBio's SMRT Analysis 2.2.0
April 01, 2014
http://files.pacb.com/Training/SMRTAnalysisv22Overview/story.html
Wednesday, April 2, 2014: FDA-Supported Pathogen Database to Expand with SMRT Sequencing
http://blog.pacificbiosciences.com/2014/04/fda-supported-pathogen-database-to.html?utm_content=buffer5f406&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
The Wall Street Journal March 31, 2014, 7:35 a.m. ET
Pacific Biosciences Releases Software Upgrade to Support Full-Length Transcript Sequencing and HLA Haplotype Phasing
MENLO PARK, Calif., March 31, 2014 (GLOBE NEWSWIRE) -- Pacific Biosciences of California, Inc. (Nasdaq:PACB), provider of the PacBio(R) RS II, today announced the release of a software upgrade for its Single Molecule, Real-Time (SMRT(R)) DNA Sequencing platform. SMRT Analysis 2.2 provides enhanced functionality to support two additional applications that uniquely benefit from the company's long-read sequencing technology: Iso-Seq(TM) full-length transcript/isoform sequencing, and human leukocyte antigen (HLA) haplotype phasing.
The study of mRNA transcript isoforms has been challenging due to the short read lengths of other sequencing technologies. Long PacBio reads enable full-length transcript sequencing, as well as the identification of alternatively spliced forms of a gene. As a result, new genes and isoforms are accessible for study.
For example, Steve Quake and Thomas Südhof, Professors at Stanford University and Investigators with the Howard Hughes Medical Institute, together with colleagues, used SMRT Sequencing to characterize the genes encoding neurexins, which are involved in the formation of connections between cells in the human brain(i). Because of the high number of different splice isoforms, these genes have been extremely difficult to study and, despite extensive efforts, the full extent of neurexin alternative splicing remained unclear. Using PacBio long-read sequencing, the researchers identified hundreds of different isoforms in the neurexin gene family, highlighting the staggering complexity of these gene products and lending further support to the notion that neurexins function as recognition molecules that contribute to the specification of cell connections in the brain.
The Iso-Seq application can also be used for transcriptome-wide studies, improving the ability to annotate genes in reference genomes. Long sequence reads spanning full-length gene transcripts will eliminate the need for an RNA-seq assembly step, providing more complete gene models and more comprehensive annotation of transcribed genes.
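The difference is easy to see in miniature. In the sketch below (the exon chains and counts are invented for illustration, not from any real dataset), each full-length read is one complete isoform observation, so counting distinct isoforms reduces to counting distinct exon chains, with no assembly step:

```python
from collections import Counter

# Hypothetical full-length reads, each reduced to the ordered exon chain
# it covers. With long reads, one read == one complete isoform observation.
full_length_reads = [
    ("e1", "e2", "e3", "e4"),
    ("e1", "e3", "e4"),        # exon 2 skipped
    ("e1", "e2", "e3", "e4"),
    ("e1", "e2", "e4"),        # exon 3 skipped
]

isoform_counts = Counter(full_length_reads)
print(len(isoform_counts))   # 3 distinct isoforms observed directly
```

With short reads, by contrast, a read seeing only the junction e1-e2 cannot tell which downstream exons follow, which is why an inference/assembly step is otherwise required.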
Michael Snyder's lab at Stanford University demonstrated the utility of PacBio long-read sequencing for assessing transcribed regions across the human genome in a paper(ii) last October. Dr. Snyder commented: "Full length transcriptome sequencing allows the analysis of complete transcriptomes including the deciphering of complex transcripts and the discovering of new ones. PacBio sequencing works remarkably well for this."
The second new application is HLA haplotype phasing. The HLA loci are a group of genes critical to immune system function. In humans, the HLA genes are extraordinarily polymorphic. Several thousand alleles have been described and the number of new alleles continues to increase. HLA allele-specific genotyping is critical for autoimmune disease-association studies, drug hypersensitivity research and other applications. Accurate phasing of HLA polymorphisms has previously required several experiments at great expense. The long reads provided by PacBio sequencing are ideally suited for accurate allele-level genotyping with unambiguous allele phasing.
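A toy sketch shows why spanning reads make phasing straightforward (the bases and sites below are invented, not real HLA alleles): if a single read covers every heterozygous site in the gene, each read carries one complete haplotype, and the two alleles separate by simple grouping rather than statistical phasing:

```python
from collections import Counter

# Each hypothetical long read reduced to its bases at three heterozygous
# positions it spans. A short read covering only one site could not link them.
reads = ["ACG", "GTA", "ACG", "GTA", "ACG"]

haplotypes = Counter(reads)
# Two groups emerge, one per allele; the consensus of each group is the
# unambiguously phased allele sequence.
print(haplotypes.most_common())   # [('ACG', 3), ('GTA', 2)]
```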
PacBio's SMRT Analysis 2.2 generates consensus sequences that can be input into third-party software for HLA analysis. This data has successfully been used with the Conexio Genomics (Perth, Australia) Assign MPS sequence analysis software.
"PacBio's analysis pipeline independently generates the consensus sequence of each allele in a heterozygous sample, including non-coding regions," said David Sayer, Chief Executive Officer of Conexio Genomics. "When analyzed in our sequence analysis software, this data results in a completely phased, immutable HLA genotype. The analysis is simple and rapid."
"The SMRT Analysis 2.2 upgrade streamlines two important applications that are uniquely enabled by our robust long-read sequencing technology, " said Michael Hunkapiller, President and CEO of Pacific Biosciences. "We are excited about the trajectory that has unfolded with each increase in the performance of the PacBio RS II system, and look forward to seeing what novel insights the research community will uncover with these new applications."
The new SMRT Analysis software upgrade is available for download from Pacific Biosciences' DevNet website. To access the software, data, and documentation, visit www.pacbiodevnet.com.
For more information on the new SMRT Analysis software and the PacBio RS II, please visit www.pacificbiosciences.com.
About the PacBio RS II and SMRT Sequencing
Pacific Biosciences' Single Molecule, Real-Time (SMRT) Sequencing technology achieves the industry's longest read lengths, highest consensus accuracy(iii,iv) and the least degree of bias.(v) These characteristics, combined with the ability to detect many types of DNA base modifications (e.g., methylation) as part of the sequencing process, make the PacBio RS II an essential tool for many scientists for studying genetic and genomic variation. The PacBio platform is being used as the sequencing solution to address a growing number of complex medical, agricultural and industrial problems.
About Pacific Biosciences
Pacific Biosciences of California, Inc. (Nasdaq:PACB) offers the PacBio RS II DNA Sequencing System to help scientists solve genetically complex problems. Based on its novel Single Molecule, Real-Time (SMRT) technology, the company's products enable: targeted sequencing to more comprehensively characterize genetic variations; de novo genome assembly to more fully identify, annotate and decipher genomic structures; and DNA base modification identification to help characterize epigenetic regulation and DNA damage. By providing access to information that was previously inaccessible, Pacific Biosciences enables scientists to increase their understanding of biological systems.
http://online.wsj.com/article/PR-CO-20140331-904887.html
Institute for Genome Sciences Awarded FDA Contract to
Expand Genome Sequence Database for Pathogen Identification
Baltimore, Md. — April 1, 2014. Researchers at the Institute for Genome Sciences at the University
of Maryland School of Medicine have been awarded a research program contract from the U.S.
Food and Drug Administration (FDA) to sequence, assemble, and annotate a population of bacterial
pathogens using two high-throughput sequencing (HTS) technologies in support of the expansion of
a vetted public reference database.
The continued development of HTS technologies for accurate identification of microorganisms for
diagnostic use will have significant impact on human healthcare, biothreat response, food safety,
and other areas. Developing a comprehensive, curated database of microbial genome sequences and
associated metadata will serve as a valuable reference to evaluate and assess HTS-based diagnostic
devices. Leading the sequencing and analysis phases of the project, the Genomics Resource Center
(GRC) at the Institute is a cutting-edge genomic sequencing and analysis center with a long history
of high-quality microbial genomics research that has sequenced and analyzed more than 5,000
microbial genome sequences in just the past five years.
The genome sequencing will use two HTS platforms, Illumina and Pacific Biosciences, and
multiple genome assembler software packages and assembly QA/QC pipelines to assemble and
validate the resulting draft genome sequences. By using two complementary sequencing platforms,
GRC researchers will be able to cross-validate consensus sequences to generate the highest possible
genome sequence accuracy. The comprehensive, curated database to which these annotated genome
sequences will be added will enable high confidence confirmation of in vitro microbial pathogen
identification. This database will be accessible through the collection of the National Center for
Biotechnology Information (NCBI)’s public domain databases. The combination of genomic data
and metadata will help to advance the goal of developing HTS-based in vitro diagnostics and the
assessment of their potential.
The GRC was formed to serve the global genomics and bioinformatics communities, and its
reputation is built on both its deep history in sequencing, genomics and analysis, and its end-to-end
service level from initial project consultation through publication. The GRC is led by Luke Tallon,
scientific director and founding leader of the GRC, and Lisa Sadzewicz, administrator director of
the facility. “We are excited to contribute our genome sequencing and analysis expertise to this
important project with the FDA,” says Tallon.
“This database will be an important reference for the scientific and medical diagnostic
communities,” says Claire Fraser, PhD, Director of the Institute for Genome Sciences. “We have
worked with federal agencies and global scientific partners to sequence and analyze an extensive
population of bacterial pathogens since our Institute launched in 2007 and are pleased to develop
this reference database with the FDA.”
“The Institute for Genome Sciences is truly unique to an academic medical university because it
houses cutting-edge sequencing technologies overseen by internationally renowned experts in the
field who are deeply engaged in the research enterprise,” says E. Albert Reece, MD, PhD, MBA,
vice president for medical affairs at the University of Maryland, and John Z. and Akiko K. Bowers
distinguished professor and dean of the University of Maryland School of Medicine. “This award
recognizes the strength of the University of Maryland School of Medicine’s genomics program,
which will make significant contributions to better identifying and, ultimately, treating infectious
diseases.”
About the University of Maryland School of Medicine
Established in 1807, the University of Maryland School of Medicine is the first public medical
school in the United States, the first to institute a residency-training program. The School of
Medicine was the founding school of the University of Maryland and today is an integral part of the
11-campus University System of Maryland. On the University of Maryland’s Baltimore campus, the
School of Medicine serves as the anchor for a large academic health center which aims to provide
the best medical education, conduct the most innovative biomedical research and provide the best
patient care and community service to Maryland and beyond. www.medschool.umaryland.edu
About the Institute for Genome Sciences
The Institute for Genome Sciences (IGS) is an international research center within the University of
Maryland School of Medicine. Comprised of an interdisciplinary, multidepartment team of
investigators, the Institute uses the powerful tools of genomics and bioinformatics to understand
genome function in health and disease, to study molecular and cellular networks in a variety of
model systems, and to generate data and bioinformatics resources of value to the international
scientific community. www.igs.umaryland.edu
#### http://www.igs.umaryland.edu/labs/grc/files/2014/04/FDA-Announcement-3-26-14_final.pdf
PacBio Blog///Thursday, March 27, 2014///-As Genome Editing Gains Traction, SMRT Sequencing Provides Accurate View of Results
A new paper published in Cell Reports describes how Single Molecule, Real-Time (SMRT®) Sequencing can be used to greatly improve outcome reporting for a variety of popular genome-editing approaches.
“Quantifying genome-editing outcomes at endogenous loci with SMRT sequencing” comes from lead authors Ayal Hendel and Eric Kildebeck from the Porteus lab at Stanford University, along with other collaborators at Stanford and the Georgia Institute of Technology. The goal for this study was to contribute to the tremendous innovations occurring in the genome editing field — from CRISPR to TALENs and more — by finding a better tool to measure results of the editing procedures.
“A variety of reporter assays for tracking genome editing outcomes have been developed, but previously none have allowed for the frequency of different genome editing outcomes to be measured simultaneously at any endogenous locus of the investigator’s choosing,” the authors report. They turned to SMRT Sequencing and developed “a method for tracking genome editing outcomes at any site of interest.”
The challenge with measuring the results of genome editing, according to the paper, is that when a new set of reagents is developed, “the activity levels of nucleases and the frequency of the desired gene editing event must be determined and often need to be optimized for the specific cell type and system used by the researcher.” Current reporting tools include gel-based assays, fluorescent reporters, clone analysis, and more. “While each of these assays can provide a piece of the puzzle, they are often limited by the inability to measure the desired gene editing outcome directly, the need for reporter cell lines to optimize gene editing conditions, and limitations in detection sensitivity for difficult applications,” the authors note.
With its ultra-long reads, SMRT Sequencing performed well in tests that directly measured the results of genome-editing experiments. The scientists used a particularly active pair of TALENs (Transcription Activator-Like Effector Nucleases) to generate site-specific double-stranded breaks and introduce several point mutations. Then, they used SMRT Sequencing on the region of interest to measure non-homologous end-joining (NHEJ) and homology-directed repair (HDR) events. The method was found to be highly reproducible and showed excellent concordance with orthogonal validation methods.
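The outcome-counting logic can be sketched in a few lines (all sequences below are invented toy data, not the paper's reagents): reads from the edited locus are binned as unedited, HDR (designed point edit present), or NHEJ (length change from an indel), and frequencies fall out directly:

```python
from collections import Counter

REF     = "ACGTACGTACGTACGT"   # hypothetical reference amplicon
HDR_SEQ = "ACGTACGAACGTACGT"   # reference carrying the designed point edit

def classify(read):
    """Bin one read at the edited locus by its editing outcome."""
    if read == REF:
        return "unedited"
    if read == HDR_SEQ:
        return "HDR"
    if len(read) != len(REF):
        return "NHEJ"          # insertion/deletion from end joining
    return "other"

# Toy read set: two unedited, two HDR, one read with a 2 bp deletion
reads = [REF, HDR_SEQ, "ACGTACACGTACGT", REF, HDR_SEQ]

counts = Counter(classify(r) for r in reads)
freqs = {k: v / len(reads) for k, v in counts.items()}
print(freqs)   # {'unedited': 0.4, 'HDR': 0.4, 'NHEJ': 0.2}
```

A real pipeline would align each read to the reference rather than compare strings exactly, but the bookkeeping is the same.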
The scientists then demonstrated the broad applicability of the SMRT Sequencing-based approach by applying it to more difficult experimental platforms such as human primary cells. They also measured the activities of different classes of nucleases at multiple genomic sites, optimized for different parameters of gene editing efficiencies, and demonstrated the detection of rare mutations including large insertions and deletions hundreds of base pairs in length. Indeed, the long-read sequencing proved to be particularly useful for measuring effectiveness of long donor DNA templates, which increase the efficiency of gene editing.
“SMRT DNA sequencing provides a rapid, quantitative, and sensitive strategy for tracking genome editing outcomes at endogenous loci,” the scientists conclude. “With the flexibility to evaluate new engineered nucleases and targeting constructs directly at desired loci without the development of reporter systems, SMRT DNA sequencing can help researchers minimize the time from conception to realization of their genome editing goal and drive this field even faster.”
For more information on the rapidly growing world of genome editing, check out this New York Times article, this journal review, or this commentary from the New England Journal of Medicine.
http://blog.pacificbiosciences.com/2014/03/as-genome-editing-gains-traction-smrt.html?utm_content=buffer1d398&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Posted March 25, 2014 -- New Results
Preparation of next-generation DNA sequencing libraries from ultra-low amounts of input DNA: Application to single-molecule, real-time (SMRT) sequencing on the Pacific Biosciences RS II.
Abstract
We have developed and validated an amplification-free method for generating DNA sequencing libraries from very low amounts of input DNA (500 picograms - 20 nanograms) for single-molecule sequencing on the Pacific Biosciences (PacBio) RS II sequencer. The common challenge of high input requirements for single-molecule sequencing is overcome by using a carrier DNA in conjunction with optimized sequencing preparation conditions and re-use of the MagBead-bound complex. Here we describe how this method can be used to produce sequencing yields comparable to those generated from standard input amounts, but by using 1000-fold less starting material.
Introduction
In just the last few years, the development of second-generation sequencing (SGS) and
third-generation sequencing (TGS) platforms, and the applications they enable, has driven the
development of genomics, fundamentally altered our approach to life and medical sciences, and
made possible the promise of personalized healthcare. (1, 3, 4) Library preparation for SGS and
TGS can be labor-intensive and often requires starting material in the microgram range. (5, 6)
These protocols require such large amounts of starting material due to the high rates of template
loss during the wash steps that follow enzymatic reactions, with only a small fraction of the
original starting material being represented in the final sequencer-ready product. This
requirement for relatively large amounts of starting material can be a significant impediment to
the sequencing of samples with limited amounts of DNA such as needle biopsy material, forensic
or ChIP-seq samples, microorganisms refractory to growth in synthetic media, or when searching
for rare sequence variants in unamplified nucleic acid samples. This is especially true when
preparing unamplified libraries for single-molecule sequencing using the PacBio RS II
sequencer. Unlike all SGS technologies, which rely on PCR and/or clonal amplification of DNA
to generate thousands of copies of each template molecule for sequencing, PacBio library
preparation does not require amplification of the DNA template during library preparation.(7)
A previously described method utilized the PacBio RS sequencer for direct sequencing
from as little as one nanogram of input DNA.(2) The method employs the use of random hexamer
primers to anneal to the template DNA to provide the binding sites for the PacBio polymerase,
thereby bypassing library preparation altogether. The method was applied to sequencing of
ssDNA and dsDNA small genomes, with sequencing yields of mapped reads from less than a
hundred to a few thousand per SMRT Cell. With the decrease in time and cost associated with
library preparation, this method may be well-suited for rapid identification of infectious disease
agents. However, because the sequencing yield produced is only a few thousand reads per
SMRT Cell, application to larger or more complex genomes may be limited. Here we describe a
simple, amplification-free method capable of producing standard sequencing yields by utilizing
closed-circular plasmid DNA as a ‘carrier’ to minimize sample loss during library preparation.
[Downloaded from http://biorxiv.org/ on March 26, 2014]
We hypothesized that closed-circular plasmid DNA will not receive the SMRTbell adaptor
molecules that provide a priming site for the sequencing reaction, thereby mitigating loss of
target DNA without contributing significantly to the sequencing output (Picture 1). The plasmid
carriers are inexpensive and can be prepared in bulk. In addition to employing the use of a
plasmid carrier during library construction, we have optimized the conditions of the final
preparation steps of the libraries for sequencing, including sequencing primer annealing,
polymerase binding, and MagBead binding. To maximize potential sequencing yield, we also reuse
the MagBead-bound complex in subsequent sequencing runs. With the use of a circular
plasmid carrier, optimized library preparation conditions, and re-use of the MagBead-bound
complex, we demonstrate this method is capable of producing comparable, unbiased, per-SMRT Cell sequencing yields from 1000-fold less starting material compared to the standard PacBio library preparation protocols.
Picture 1. Principle of the low-input library preparation method. SMRTbell adaptors will ligate
to linear DNA inserts of interest, but not to closed-circular plasmid DNA that is added as a
carrier to the sample.
Materials and Methods
The 2kb Low-Input Template Preparation and Sequencing protocol can be found on the
Pacific Biosciences website in the Shared Protocols section of the SMRT Community Sample
Network (SampleNet). λ genomic DNA (catalog no. 25250-010) was from Invitrogen™. HB101
E. coli genomic DNA was purified using the GenElute™ Bacterial Genomic DNA Kit (catalog
no. NA2110) from Sigma-Aldrich®. pUC18 plasmid (catalog no. 3218) was from Clontech.
pBR322 plasmid (catalog no. N3033S) was from New England Biolabs®. Exonuclease III and
Exonuclease VII kits (catalog no. EX4405K and EN510100, respectively) were from Epicentre®.
All cleanup steps were performed using Agencourt AMPure XP beads (catalog no. A63881). A
Covaris M220 Focused-ultrasonicator™ was used to shear DNA into 2kb fragments. Covaris g-
Tubes™ were used with an Eppendorf® 5424 centrifuge to shear DNA into 20kb fragments.
Pacific Biosciences’ DNA Template Prep Kits 2.0 (250bp - <3kb and 3kb – 10kb), (catalog nos.
001-540-726 and 001-540-835, respectively) were used to prepare the 2kb and 20kb fragment
libraries, the DNA/Polymerase Binding Kit XL 1.0 and DNA/Polymerase Binding Kit P4
(catalog nos.100-150-800 and 100-236-500, respectively) were used during the annealing and
binding reactions. A Blue Pippin™ from Sage Sciences was used to select large fragments in the
20kb fragment libraries. A GeneAmp® PCR System 9700 thermal cycler from Applied
Biosystems was used for the annealing and binding reactions. Non-standard adjustments were
made to the Annealing and Binding Calculator (versions 1.3.3, 2.0.1.0, and 2.0.1.2) provided by
Pacific Biosciences to calculate the sequencing primer annealing, polymerase binding, and
MagBead binding concentrations such that all library material available was loaded onto the
sample plate for sequencing. The PacBio DNA Sequencing Kit 2.0 (catalog no. 001-554-002),
SMRT Cell 8Pac v3 (catalog no. 100-171-800), MagBead Kit (catalog no.100-133-600), and
MagBead Station were used for all sequencing. All 2kb sequencing was performed using C2
Chemistry and the XL polymerase on an RS I, with 2 x 55 minute movies, unless otherwise
noted. Stage start was not enabled to maximize CCS yield. All 20kb sequencing was performed
using the P4 polymerase on an RS II, with 1 x 120 minute movies. Stage start was selected to
maximize insert read length. Sequence analysis was performed with SMRT portal, SMRT pipe,
and SMRT View, versions 1.4, 2.0, and 2.1, all from Pacific Biosciences.
See Supplemental Material for Bulk Plasmid Carrier Preparation, Low-Input Shearing,
Library Preparation, Binding Calculator Adjustments and Sequencing Details
Results and discussion
Meeting PacBio sample input requirements can sometimes be a challenge and the sole
limiting factor for several potential sequencing applications. If standard input amounts are used
for library construction and a 20% recovery is assumed, the resulting final libraries contain tens
of billions of SMRTbell molecules. This enables the possibility of sequencing hundreds of
SMRT Cells when MagBead loading is employed, since only tens of millions of SMRTbell
molecules are actually added to the SMRT Cell during sequencing. Considered in terms of mass,
less than one nanogram is actually needed for sequencing one SMRT Cell (Supplemental Table
1). For a 2kb library, the number of SMRTbell molecules needed to produce standard
sequencing yield is present in approximately 180 picograms. Theoretically, it should be
possible to begin library construction with 5 nanograms and assume that 1 nanogram of final
library resulting from a 20% recovery could still produce the standard expected sequencing yield.
However, we have found that once the input amount used for library construction goes below 50
nanograms, the sequencing yield is significantly lower than the expected yield produced when
using standard input amounts (Fig.1). This may be due to decreased efficiencies in the
enzymatic reactions throughout library construction as well as decreased AMPure bead binding
efficiency at lower molar concentrations. The latter could also explain the lower recovery
percentages seen when shearing small amounts of DNA (see Low Input Shearing section).
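The molecule-count arithmetic above can be checked directly; a minimal sketch, using the standard ~650 g/mol-per-base-pair average molar mass for double-stranded DNA:

```python
AVOGADRO = 6.022e23   # molecules per mole
BP_MW = 650.0         # average g/mol per base pair of dsDNA (approximation)

def molecules_from_mass(mass_g, insert_bp):
    """Number of dsDNA molecules of a given insert size in a given mass."""
    molar_mass = insert_bp * BP_MW   # g/mol for the whole molecule
    return mass_g / molar_mass * AVOGADRO

# 180 pg of a 2 kb library -> roughly tens of millions of SMRTbell molecules,
# on the order of what is loaded onto a single SMRT Cell
per_cell = molecules_from_mass(180e-12, 2000)

# 500 ng standard input at 20% library recovery -> tens of billions
final_library = 0.20 * molecules_from_mass(500e-9, 2000)

print(f"{per_cell:.2e}")       # 8.34e+07
print(f"{final_library:.2e}")  # 4.63e+10
```

These two numbers reproduce the "tens of millions" loaded per cell and the "tens of billions" in a standard final library quoted in the text.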
Figure 1. Mapped yield summary from decreasing amounts of 2kb λ DNA libraries without plasmid carrier.
Template loss during library preparation should be largely random and not sequence-specific.
To determine if this is indeed the case, both 500bp and 2kb libraries were prepared by
diluting decreasing amounts of sheared plasmid DNA (pBR322) into an amount of sheared
lambda DNA that brought the total amounts up to 250 nanograms and 500 nanograms,
respectively, which are the standard input requirements for 500bp and 2kb libraries. Libraries
were prepared and subsequently sequenced. As expected, the number of reads sequenced from
either the sheared plasmid or the sheared lambda was in direct proportion to that of the input
amounts, by mass, during the dilution (Supplemental Fig. 1a and b). To further validate this
finding, a serial dilution of one known 2kb amplicon into a second known 2kb amplicon was
performed, beginning with a 10% spike-in by molarity and then serial diluting six times down to
0.15625%. Libraries were then prepared from each serial dilution using 500 nanograms as input
for each and were subsequently sequenced. Again, a very strong correlation between expected
[Figure 1 chart: mapped sequencing yield (Mb) per decreasing 2kb λ input, without carrier; x-axis: input amount (ng).]
and observed yield showed that there is no sequence-specific template loss during library
preparation nor preferential sequencing bias (Supplemental Fig. 1c).
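The spike-in series is plain two-fold dilution arithmetic, which the sketch below reproduces:

```python
# 10% molar spike-in, serially diluted two-fold six times
fractions = [10.0 / 2 ** i for i in range(7)]   # percent spike-in at each step

print(fractions[-1])   # 0.15625, matching the lowest spike-in tested

# With no sequence-specific template loss, the expected read count from the
# spike-in is just fraction * total reads (equal-length amplicons, so
# molar ratio tracks mass ratio). E.g., per 50,000 sequenced reads:
expected_reads = [f / 100 * 50_000 for f in fractions]
```

The 50,000-read total here is an arbitrary illustrative number, not a figure from the study.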
Because the end-repair and adaptor ligation reactions should be agnostic to carrier or
target DNA, using linear DNA as a carrier provides a benefit of enabling quality control checks
throughout library construction. However, sequencing yield of the low-input target will be
limited by the molar ratio of target molecules to carrier molecules. As shown in Figure 1, a
minimum of 50 nanograms is required to obtain standard expected sequencing yield, so for very
low input target amounts (less than one nanogram), only approximately two percent of the
sequencing yield could be expected to stem from the target compared to the carrier. For this
reason, we decided to test the effectiveness of using a closed-circular carrier DNA molecule as a
means to reduce the amount of carrier DNA sequenced, since, unlike a linear carrier, the
closed circle should not receive adaptors.
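The two-percent figure is simply the molar ratio of target to total library, assuming target and linear carrier fragments are the same length so mass ratios equal molar ratios:

```python
def target_read_fraction(target_ng, total_ng):
    """Expected fraction of reads from the target when a linear carrier
    also receives adaptors (same fragment length assumed)."""
    return target_ng / total_ng

# <1 ng of target topped up to the 50 ng minimum with linear carrier:
print(target_read_fraction(1.0, 50.0))   # 0.02 -> only ~2% of reads are target
```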
As described in the Library Preparation section, decreasing amounts of 2kb sheared
lambda target DNA were used to prepare libraries, with 500 nanograms of the carrier pUC18
plasmid (2,686bp) spiked into each library after ligase inactivation. Input amounts for the target
were chosen based on the expected recovery percentage as it relates to the resulting theoretical
ratio of SMRTbell molecules to ZMWs (Supplemental Table 2). Sequencing was performed
using C2 Chemistry and the XL polymerase on an RS I, with 2 x 55 minute movies. As shown
in Figure 2, use of a circular carrier enables the sequencing of as little as 500 picograms of target
DNA without an appreciable loss of mapped read throughput, generating as much as 160 Mb of
mapped target DNA sequence in a single SMRT Cell run. Less than one percent of the overall
mapped yield originated from the circular carrier, suggesting the exonuclease treatment is very
efficient at digesting nicked plasmids to prevent undesired binding of the polymerase. As
predicted in Supplemental Table 2, the mapped sequencing yields of the target DNA remained
comparable to that of the control until the ratio of SMRTbells to ZMWs became limiting below
0.5 nanograms. As expected, there was very little coverage bias across the target genome
(Supplemental Fig. 2), further supporting our assumption that template loss during library
preparation and loading of the SMRTbells for sequencing is independent of sequence
composition.
Figure 2. Mapped yield summary from decreasing amounts of 2kb λ DNA libraries in a plasmid carrier. 500ng of plasmid carrier (pUC18) was used for each target input amount.
Much of the original PacBio MagBead-bound sample that is prepared for sequencing is
left over in the bottom of the sample plate well following the initial run. In an attempt to use as
much of the low-input target as possible for sequencing, PacBio Bead Binding Buffer was added
in sufficient volume to accommodate the required dead volume that was lost from the first run,
and samples were sequenced a second time.
[Figure 2 chart: 2kb λ Low-Input Experiment. Mapped λ yield (Mb) by target input: Control (500ng) 169.68; 50ng 127.99; 5ng 134.95; 1ng 206.08; 0.5ng 160.03; 0.05ng 23.64; 0.005ng 2.57. Mapped carrier yield (Mb), as listed: 0.00, 0.09, 0.40, 3.75, 0.06, 0.04, 0.06.]
Figure 3. Mapped yield summary from re-use of decreasing amounts of 2kb λ DNA libraries with a plasmid carrier.
While there was a noticeable decline in yield, the results were promising enough to
warrant a third attempt at sequencing by again adding PacBio Bead Binding Buffer to the sample
well and re-running (Fig. 3). Although the mapped yield from the two lowest target input
amounts (5 and 50 picograms, respectively) was not comparable to those from the standard 500
nanogram input, re-sequencing the samples two more times produced approximately 5-50
[Figure 3 chart: 2kb λ Low-Input Re-Use Results (total re-use mapped yield in red). Series: First Use, Second Use, Third Use; x-axis: target input amount (picogram) / carrier input amount (nanogram); y-axis: mapped yield (Mb). Values as extracted: 392.34, 0.13, 49.80, 0.13, 5.82, 0.17 Mb.]
million additional mapped bases, which could provide enough data depending on the goals of the
sequencing study.
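The re-use arithmetic above is easy to make concrete. Below is a minimal sketch (illustrative only; the per-use yield figures are hypothetical, not taken from the figures) of tallying cumulative mapped yield across successive uses of one low-input library:

```python
# Illustrative sketch, not the authors' code: cumulative mapped yield from
# re-running the same MagBead-bound library on additional SMRT Cells.
# Per-use yields below are hypothetical, chosen only to mirror the idea that
# second and third uses add a diminishing but non-trivial amount of data.

def total_mapped_yield(yields_mb):
    """Return cumulative mapped yield (Mb) over successive uses of one library."""
    return sum(yields_mb)

# Hypothetical mapped yields (Mb) for first, second, and third use:
per_use = [34.0, 12.5, 4.1]
print(total_mapped_yield(per_use))  # cumulative yield after three uses
```

Whether the extra few megabases from a re-run matter depends, as the text notes, on the goals of the sequencing study.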
Application of the 2kb low-input method to actual study samples produced sequencing yields comparable to those from the λ proof-of-principle experiments. Eight amplicons with very little starting material were sequenced using the 2kb low-input protocol (Fig. 4).
Figure 4. Sequencing yield summary from “real world” samples using the 2kb low-input
protocol.
Input amounts varied from approximately 500 picograms to 5 nanograms. While mapped yields were lower for some samples compared to those from the λ experiments, the depth of coverage and number of circular consensus sequences produced for each sample was sufficient for determining haplotype phasing of highly homologous isoforms and detection of very rare quasispecies in a diversely mixed population of sequences.
Figure 5. Sequencing yield summary from 20kb low-input experiments using decreasing amounts of E. coli target in plasmid carrier.
To test the low-input protocol for long-insert libraries, we prepared 20kb libraries using
E. coli as the target and pBR322 as the plasmid carrier as described in the Materials and Methods
section and Supplemental Material. Samples were re-sequenced using the same re-use strategy
employed in the 2kb low-input experiments. Figure 5 shows the post-filter and mapped yield for
the E. coli target and plasmid carrier. The yields show a linear relationship with decreasing input amounts. Mapped yield from the plasmid carrier is much lower than in the 2kb experiments, presumably because most of the plasmid was removed by the BluePippin during size selection.
Supplemental Table 4 summarizes the results from aggregating the data for each input amount
and running PacBio’s Hierarchical Genome Assembly Process, version 2 (HGAP2). Coverage
plots of the contigs from each input amount were fairly uniform and contained no drop-outs, with
100% of all bases being called (Supplemental Fig. 3). Consensus concordance was greater than 99.95% for all input amounts, demonstrating the utility of the low-input protocol for small-genome assembly even from 5 nanograms of starting material.
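As a rough illustration of the concordance metric quoted above, here is a minimal sketch (not PacBio's implementation; the base counts are made up) of computing consensus concordance as the percentage of consensus bases that agree with a validation reference:

```python
# Hedged sketch of the consensus concordance metric: the fraction of consensus
# bases that match a trusted reference, expressed as a percentage.
# The counts below are toy values, not data from this study.

def consensus_concordance(matched_bases, error_bases):
    """Percent of consensus bases agreeing with the reference (matches vs. mismatch/indel errors)."""
    total = matched_bases + error_bases
    return 100.0 * matched_bases / total

# e.g. 4,999,000 matching bases and 1,000 erroneous bases over a ~5 Mb genome:
print(f"{consensus_concordance(4_999_000, 1_000):.2f}%")  # 99.98%
```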
The methods described here demonstrate that it is possible to produce per-SMRT Cell
sequencing yields comparable to both standard 2kb and 20kb libraries, using 1000-fold less
starting material. The added cost and time from preparing the circular plasmid carrier is only a
small fraction of the overall cost of library preparation, and can be performed in bulk to save
time. While shearing and cleaning small amounts of starting material presently remains a
challenge, there is still much room for optimization of current processes and development of new
technologies to increase recovery rates. Projects for which the starting material is pure and high
quality, but the amount available is the limiting factor, may now be enabled by the use of these
methods.
Competing Interests Statement
Leidos Biomedical Research, Inc. employees declare no competing interests.
PacBio authors are full-time employees of Pacific Biosciences, a company commercializing single-molecule, real-time sequencing technologies.
Acknowledgments
We would like to thank PacBio for their input and guidance during optimization of the
sequencing conditions.
References
1. Chan IS, Ginsburg GS. 2011. Personalized medicine: progress and promise. Annu Rev Genomics Hum Genet 12:217-244. doi:10.1146/annurev-genom-082410-101446.
2. Coupland P, Chandra T, Quail M, Reik W, Swerdlow H. 2012. Direct sequencing of small genomes on the Pacific Biosciences RS without library preparation. BioTechniques 53:365-372.
3. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. 2013. The next-generation sequencing revolution and its impact on genomics. Cell 155:27-38.
4. Mardis ER. 2008. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9:387-402. doi:10.1146/annurev.genom.9.081307.164359.
5. Metzker ML. 2010. Sequencing technologies - the next generation. Nat Rev Genet 11:31-46. doi:10.1038/nrg2626.
6. Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. 2012. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13:341.
7. Travers KJ, Chin C-S, [...], Turner SW. 2010. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res 38(15):e159.
http://biorxiv.org/content/early/2014/03/25/003566
Wednesday, March 26, 2014 - PacBio Blog: Importance of Finished Microbial Genome Highlighted for Ethanol-Generating Clostridium
A paper in BioMed Central’s Biotechnology for Biofuels journal demonstrates how finished microbial genomes using Single Molecule, Real-Time (SMRT®) Sequencing are having an impact on the biotechnology industry.
The publication, “Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia,” comes from scientists at Oak Ridge National Laboratory, the University of Tennessee, and New Zealand-based biofuels company LanzaTech. Lead authors Steven Brown and Shilpa Nagaraju and their colleagues used PacBio® sequencing to generate a finished genome sequence for a complex class III microbe that previously could not be assembled to closure.
Clostridium autoethanogenum (strain JA1-1; DSM10061) is an acetogen that can ferment waste gases such as carbon monoxide into biofuels and commodity chemicals, so it is of considerable interest to the biotech industry. Its genome has one chromosome of about 4.3 Mb and very low GC content of just 31%. The strain is categorized as a class III microbe, indicating that it is difficult to assemble due to its high repeat content, prophage, and multiple copies of the rRNA gene operons. Before this study, a draft genome assembly had been published with 100 contigs.
In this project, the authors used various sequencing technologies in an attempt to improve on that draft assembly. They were unsuccessful using short-read sequencing technologies. “Assemblies based upon shorter read DNA technologies were confounded by the large number repeats and their size, which in the case of the rRNA gene operons were ~5 kb,” the scientists report.
But it was a different story when they tried SMRT Sequencing on the PacBio RS II. “Remarkably, one PacBio library preparation and two single molecule real-time sequencing (SMRT) cells produced sufficient sequence such that it could be assembled into one contiguous DNA fragment that represented the DSM 10061 genome,” the authors write. “This is one of the first de novo sequenced genomes we are aware of that has been closed without manual finishing or additional data, despite the complexity of the C. autoethanogenum genome.”
In comparing the PacBio assembly to earlier efforts with short-read technologies, the scientists found many fully sequenced genes that were missed entirely or only partially covered with draft assemblies. Some of these genes proved important for understanding the detailed metabolism of the organism and enabled a more comprehensive comparison to a closely related Clostridium strain. The team also used several analysis tools to test the accuracy of the final assembly. One of these checked for collapsed repeats, finding none in the PacBio assembly but several in assemblies using short-read sequence data. Another assessment found that while the previously published 100-contig draft genome predicted a single copy of the 16S rRNA gene, the PacBio assembly predicted nine copies — the same number of rRNA clusters in a closely related Clostridium strain. The authors note that these very large repetitive regions likely contributed to the inability of short-read technologies to fully sequence the organism.
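One simple way a collapsed-repeat check can work (a hedged sketch; this is not the analysis tool the authors used) is to flag assembly windows whose read depth is far above the assembly-wide median, since two repeat copies collapsed into one assembled locus attract roughly double the expected coverage:

```python
# Illustrative sketch of coverage-based collapsed-repeat detection.
# A repeat collapsed in the assembly shows up as a window whose read depth
# greatly exceeds the median depth across the assembly.
from statistics import median

def flag_collapsed_windows(depths, fold=2.0):
    """Return indices of coverage windows whose depth exceeds fold x median depth."""
    m = median(depths)
    return [i for i, d in enumerate(depths) if d > fold * m]

# Toy per-window depths: window 3 looks like a collapsed two-copy repeat.
print(flag_collapsed_windows([48, 52, 50, 110, 49, 51]))  # [3]
```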
Another major finding was the presence of a CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) system in this microbial strain. CRISPRs are prokaryotic DNA loci that carry the memory of past bacterial infections of phages and plasmids to provide immunity against mobile genetic elements. Closely related strains do not have the CRISPR system, and other related strains used in industrial fermentation that lack CRISPR systems have proven susceptible to bacteriophage infections. In this paper, scientists reason that the presence of a CRISPR system may make this strain particularly successful for industrial-scale fermentation of biotech products.
The authors add, “The relatively low cost to generate the PacBio data (approximately US$1,500) and the outcome of this study support the assertion this technology will be valuable in future studies where a complete genome sequence is important and for complex genomes that contain large repeat elements.”
http://blog.pacificbiosciences.com/2014/03/importance-of-finished-microbial-genome.html?utm_content=bufferc7cca&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Seminar
PacBio Seminar
Organised by CAT-AgroFood, Plant Research International
Date Wed 26 March 2014
Time 09:00 to 16:00
Venue Hof van Wageningen
Programme 9:00 Coffee
9:30 Gabino F. Sanchez Perez (PRI Biosciences, Wageningen UR) – Welcome and opening remarks
9:45 Christoph Koenig (Pacific Biosciences) - "Fundamentals and Applications of Single Molecule, Real-Time SMRT® Sequencing"
10:15 Luigi Faino (Phytopathology, Wageningen UR) - "Third generation sequencing: a magnifying glass to study genomic rearrangements in the fungal plant pathogen Verticillium"
10:45 Coffee break
11:30 Alexander Wittenberg (Keygene) – “Towards finished plant and (plant) pathogen genomes”
12:00 Yahya Anvar (Leiden Genome Technology Center / Leiden University Medical Center) - "Deciphering the complete sequence and global methylation state of genomes”
12:30 Lunch
14:00 Matthew Hestand (KU Leuven) - "Long Amplicon Sequencing of Repetitive Regions and Genomic Rearrangements"
14:30 Thomas Hackl (Julius-Maximilians-Universität Würzburg) – “Proovread: high accuracy PacBio hybrid correction for large genomes and transcriptomes”
15:00 Elio Schijlen (PRI Biosciences, Wageningen UR)
15:30 Reception
Pacific Biosciences is co-organiser of this seminar.
http://www.wageningenur.nl/en/activity/PacBio-Seminar.htm
March 24th, 2014 PacBio – Jason Chin’s AGBT Presentation on Diploid Assembly
(videos) http://www.homolog.us/blogs/blog/2014/03/24/pacbio-jason-chins-agbt-presentation-on-diploid-assembly/
PacBio is revolutionizing DNA sequencing, and is seeking a talented individual for an exceptional career opportunity in our Technical Support group. Our ideal candidate is a well-rounded top performer who can be a key contributor in a high-energy growth environment. We will give special consideration to candidates with Linux or sysadmin experience.
http://ch.tbe.taleo.net/CH02/ats/careers/requisition.jsp?org=PACIFICBIOSCIENCES&cws=1&rid=1289
Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia
Published: 21 March 2014
Abstract (provisional)
Background
Clostridium autoethanogenum strain JA1-1 (DSM 10061) is an acetogen capable of fermenting CO, CO2 and H2 (e.g. from syngas or waste gases) into biofuel ethanol and commodity chemicals such as 2,3-butanediol. A draft genome sequence consisting of 100 contigs has been published.
Results
A closed, high-quality genome sequence for C. autoethanogenum DSM10061 was generated using only the latest single-molecule DNA sequencing technology and without the need for manual finishing. It is assigned to the most complex genome classification based upon genome features such as repeats, prophage, and nine copies of the rRNA gene operons. It has a low G+C content of 31.1%. Illumina, 454, and Illumina/454 hybrid assemblies were generated and then compared to the draft and PacBio assemblies using summary statistics, the CGAL, QUAST and REAPR bioinformatics tools, and comparative genomic approaches. Assemblies based upon shorter read DNA technologies were confounded by the large number of repeats and their size, which in the case of the rRNA gene operons were ~5 kb. CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) systems among biotechnologically relevant Clostridia were classified and related to plasmid content and prophages. Potential associations between plasmid content and CRISPR systems may have implications for historical industrial-scale Acetone-Butanol-Ethanol (ABE) fermentation failures and future large-scale bacterial fermentations. While C. autoethanogenum contains an active CRISPR system, no such system is present in the closely related Clostridium ljungdahlii DSM 13528. A common prophage inserted into the Arg-tRNA shared between the strains suggests a common ancestor. However, C. ljungdahlii contains several additional putative prophages and it has more than double the amount of prophage DNA compared to C. autoethanogenum. Other differences include important metabolic genes for central metabolism (such as an additional hydrogenase and the absence of a phosphoenolpyruvate synthase) and substrate utilization pathways (mannose and aromatics utilization) that might explain phenotypic differences between C. autoethanogenum and C. ljungdahlii.
Conclusions
Single molecule sequencing will be increasingly used to produce finished microbial genomes. The complete genome will facilitate comparative genomics and functional genomics and support future comparisons between Clostridia and studies that examine the evolution of plasmids, bacteriophage and CRISPR systems.
http://www.biotechnologyforbiofuels.com/content/7/1/40/abstract
In Brief This Week: Bio-Rad; PacBio; Qiagen; SQI Diagnostics; Integrated DNA Technologies
March 21, 2014
http://www.genomeweb.com/brief-week-bio-rad-pacbio-qiagen-sqi-diagnostics-integrated-dna-technologies?utm_source=twitterfeed&utm_medium=twitter&utm_campaign=Feed%3A+genomeweb%2Fgenomeweb-daily-news+%28GenomeWeb+Daily+News%29
Wednesday, March 19, 2014 - PacBio Blog: Assessment of Highly Complex Alternative Splicing of Neurexins Performed with SMRT Sequencing
A new paper in the Proceedings of the National Academy of Sciences from the laboratories of Stephen R. Quake and Thomas C. Südhof (both at Stanford University) describes the direct, full-length transcript sequencing of RNA molecules that are essential to synapse formation in the mammalian brain. The team used Single Molecule, Real-Time (SMRT®) Sequencing to analyze full-length mRNAs from different members of the neurexin gene family and used that information to examine alternative splicing events.
In the publication entitled “Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing,” the scientists highlight the importance of understanding alternative splicing in neurexins. “Indirect evidence has indicated that extensive alternative splicing of neurexin mRNAs may produce hundreds if not thousands of neurexin isoforms, but no direct evidence for such diversity has been available,” they write. The alternative splice isoforms are differentially regulated in different brain regions, exhibit a diurnal cycle, and are modulated by development, neurotrophins, and neuronal activity.
Prior to SMRT Sequencing, the authors write that “despite extensive studies, the full extent of neurexin alternative splicing remains unclear.” Neurexin isoforms have been previously analyzed with full-length cDNA sequencing and PCR analysis, but “only a small fraction of these isoforms were actually identified in sequenced full-length cDNAs,” the authors note. “The relatively large size of α-neurexin transcripts (~4–5 kb) has made it difficult to obtain information about their full-length sequence, and hence about the use of alternative splice sites within single transcripts.”
For this study, researchers used the PacBio® platform to sequence transcripts generated by three neurexin genes in adult mice. “Read lengths of up to 30 kb enabled us to identify all of the splice combinations within a single transcript,” they report. With sequencing reads representing more than 25,000 full-length mRNAs, the team made several important discoveries. These include: a novel alternatively spliced exon; even higher isoform diversity than was anticipated; and the finding that splicing events seem to occur independently of one another. The team was able to map out the full transcript landscape for a neurexin gene, showing alternative splicing at all six canonical sites as well as at several noncanonical sites.
Being able to directly assess alternative splicing not only provided evidence for suspected isoform diversity, but also revealed “that neurexins are likely even more polymorphic than previously thought,” the team reports. Based on their observations from SMRT Sequencing, they calculated how many neurexin variants were possible in total. “We observed in this manner a minimal diversity of 1,159 isoforms for Nrxn1α, 1,120 isoforms for Nrxn3α, and a total of 152 isoforms for all three β-neurexins,” they write. “Thus, earlier estimates of 2,000–3,000 neurexin variants created by alternative splicing may have been an underestimate, because our present study arrived at the same numbers by analyzing only one brain region and one developmental stage.”
http://blog.pacificbiosciences.com/2014/03/assessment-of-highly-complex.html?utm_content=bufferb0093&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
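The isoform arithmetic follows directly from the independence observation: if each splice site is chosen independently, the number of possible transcripts is the product of the number of alternatives at each site. A small sketch (the per-site counts below are hypothetical, not taken from the paper):

```python
# Sketch of the combinatorics behind isoform-diversity estimates.
# Assumes independent splice-site choice; per-site counts are illustrative.
from math import prod

def possible_isoforms(alternatives_per_site):
    """Number of distinct transcripts if each splice site is chosen independently."""
    return prod(alternatives_per_site)

# e.g. six canonical sites with a hypothetical 2-4 alternatives each:
sites = [2, 3, 2, 4, 2, 3]
print(possible_isoforms(sites))  # 2*3*2*4*2*3 = 288
```

This multiplicative growth is why even a handful of alternatively spliced sites can generate thousands of isoforms.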
PacBio Workshop on Genome Sequence Assembly and Analysis
Lausanne, 20 March 2014
Overview
Recent improvements in the Pacific Biosciences RSII technology as well as PacBio data analysis methods have greatly increased the utility of this sequencing platform for both small and large genome sequencing.
To highlight these improvements and applications, and to give potential users an opportunity to discuss with current users how this technology can be incorporated into their sequencing projects, a PacBio Workshop on Genome Sequence Assembly and Analysis will be held in Lausanne.
Application
Registration is not necessary. Attendance is open to all.
Additional information
Location
UNIL - Genopode Building - Auditorium B
For more information, please contact training@isb-sib.ch
Programme
Thursday March 20
14:00 Welcome
Keith Harshman, Genomic Technologies Facility, Center for Integrative Genomics, University of Lausanne
14:05-14:35 De novo assembly of Petunia using PacBio data combined with Illumina data
Rémy Bruggmann, Department of Biology, SIB Swiss Institute of Bioinformatics & University of Bern
14:35-15:05 From phenotypes to genotypes with the human pathogen Candida glabrata
Dominique Sanglard, Institute of Microbiology, CHUV
15:05-15:35 Bacterial genome assembly using PacBio data
Daniel Wüthrich, Department of Biology, SIB Swiss Institute of Bioinformatics & University of Bern
15:35 – 16:05 Coffee Break
16:05-16:35 De novo assembly of a large plant genome with the help of PacBio reads
Emanuel Schmid, Vital-IT, SIB Swiss Institute of Bioinformatics
16:35-17:05 Comparative DNA methylation studies in bacterial genomes
Laurent Falquet, Department of Biology, SIB Swiss Institute of Bioinformatics & University of Fribourg
17:05-17:35 Benefits of SMRT sequencing for analysis of plant and animal genomes
Gerrit Kuhn, Pacific Biosciences
17:35-18:00 Conclusions and General Discussion
http://edu.isb-sib.ch/course/view.php?id=174
A Near Perfect de novo Assembly of a Eukaryotic Genome Using Sequence Reads of Greater Than 10 Kilobases Generated by the PacBio RS II ---- http://aa314.gondor.co/webinar/a-near-perfect-de-novo-assembly-of-a-eukaryotic-genome-using-sequence-reads-of-greater-than-10-kilobases-generated-by-the-pacific-biosciences-rs-ii/?utm_content=buffer37ab6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
PacBio, the Post-Hype Sleeper of Genomics
Luke Timmerman, 3/3/14 -- Hype and biotech go hand in hand, but genomics takes exaggeration up a few extra notches. When genomics companies fail, they tend to crash especially hard. Yet every now and then, a company that’s monumentally hyped falls flat and then figures out a way to become a solid, if not spectacular, player.
That’s the storyline that’s slowly taken shape at Pacific Biosciences.
Menlo Park, CA-based PacBio (NASDAQ: PACB), readers may recall, had a few minutes of fame on Wall Street. Backed by big-name venture capitalists like Kleiner Perkins Caufield & Byers, it debuted with an $800 million market valuation on its IPO day in the fall of 2010.
Of course, PacBio seduced investors with a promise of technology revolution. Forecasts were that it would sequence whole human genomes for $100, in about 15 minutes, by 2013. The “third-generation” of sequencing had arrived. Medicine, we were told, would be transformed.
That was 2010. None of those predictions came true. Not even close. Few scientists bought the $700,000, one-ton instrument. The few who bought the bulky machine let it gather dust. Competitors sprinted ahead. Layoffs were made. New management was summoned. Two years after the IPO, PacBio had a market valuation of less than $70 million and a technology value of $0. Investors appeared to have flushed $600 million of cash down the toilet.
Mike Hunkapiller, CEO of PacBio
When I visited newly installed CEO Mike Hunkapiller in his office in the fall of 2012, I knew he was a historically important figure in the development of genomics. I also thought he was an understated, down-to-earth guy with the kind of experience necessary to execute a turnaround. But cash was running low, and so was confidence in the community of genomics researchers. I thought it was time to get the company obituary ready.
Turns out, PacBio didn’t die. It may never threaten the dominance of Illumina (NASDAQ: ILMN) in genomics, and may never become profitable. But in Hunkapiller’s do-what-you-say-you’re-going-to-do-and-grind-it-out way, PacBio has improved. Its instrument now has a niche. Illumina is miles ahead on the factors that count most for customers—speed, cost, and sequencing throughput (bandwidth). But PacBio is making a name for its high-accuracy genomes, its ability to detect structural genetic variations (like RNA transcripts) that other tools can’t, and for creating high-quality genomes of small organisms like bacteria, viruses, and worms. Last fall, PacBio’s stock surged when it struck a deal with Roche to develop technology for the lucrative market to come in genomic diagnostics, where some of PacBio’s technical advantages might be more highly valued.
“It’s going to be very, very hard for anybody to take on Illumina,” says Keith Robison, a computational biologist with Cambridge, MA-based Warp Drive Bio, who uses the various sequencing instruments and writes the Omics! Omics! blog. “But PacBio is a pioneer in finding applications that don’t work well on Illumina.” If you are a scientist working in one of those areas, this is a big deal. Robison adds: “In my world, the microbial world, we want high-quality genomes, and PacBio is almost the only game in town.”
Microbial genomes aren’t what investors had in mind in 2010, but applications like that have given PacBio a pulse. By improving some unglamorous sample prep procedures, the chemistry, and engineering in some new optical enhancements for reading the DNA that passes through, PacBio has found a way to deliver on some of its promise with “long reads” of DNA. By reading stretches of DNA that are an average of 8,000 to 9,000 bases long, PacBio can see subtle variations that are overlooked by Illumina and other technologies that read shorter stretches of DNA before assembling them into a genome.
During its frothier moments, PacBio touted its advantage in getting long reads of DNA, but couldn’t always deliver. Earlier versions might get a few strands that were thousands of bases long, mixed in with DNA stretches as short as 500 bases. Recent upgrades have made it possible for PacBio to get true long reads—in one reported case, as long as 54,000 bases—that have wowed researchers, Robison said. More importantly, PacBio has found a way to consistently get the 8,000 to 9,000 base stretches that researchers crave when they think about assembling really high quality, accurate genomes, Robison says.
One effort ongoing at PacBio, with collaborators at Washington University in St. Louis and the University of Washington, has been focused on creating a “platinum genome” that will serve as a new, more accurate reference that all others use when they sequence a new genome and look for what’s new. It may not mean much in a business sense, but if nothing else, it helps with mindshare when customers do a run on a competing machine and have to compare the data with what came from a PacBio instrument.
Investors, by this point, are probably wondering: Who cares? Where’s the progress?
There are some modest developments to report on this front. PacBio had an order backlog of five instruments when I profiled the company in the fall of 2012, and now it has 13. Total revenue grew by a modest 8 percent last year, but the company grew its installed base of instruments by 25 percent last year. Utilization of its instruments grew by 50 percent—meaning they are no longer gathering dust. More papers are getting published by scientists using the PacBio machine. Genomics commentators at the recent Advances in Genome Biology & Technology (AGBT) conference in Florida have nice things to say about PacBio, even while shaking their heads at the vaporware offered by others such as Oxford Nanopore.
At this point in the game, if PacBio were a baseball player, it would be a “post-hype sleeper.” It’s like the hot prospect who disappointed everybody his first couple years, tarnishing his reputation before finally starting to fulfill some of his potential. The baseball analogy would be with Alex Gordon, the heralded Kansas City Royals outfielder. In basketball, it would be Washington Wizards guard John Wall.
Whether PacBio can ever reach its Hall of Fame potential is still very much in doubt. History indicates that a big instrument maker needs to generate $100 million to $125 million in annual revenue to turn profitable, Hunkapiller says. PacBio has a long, long way to go—it pulled in $28.2 million in revenue last year. But PacBio delivered a four-fold improvement in its instrument’s throughput in 2012, did it again in 2013, and has a goal to do it again in 2014, Hunkapiller says. That kind of improvement adds up.
Tim Hunkapiller, a veteran genomics consultant (and brother of the PacBio CEO), said the company still has its limitations. It isn’t in the same league as Illumina on speed, cost, and throughput. It probably costs $30,000 to do a human genome on the PacBio instrument, while Illumina recently announced it has finally achieved the $1,000 genome. Robison noted that even when PacBio has a clear technical advantage, in something like microbial genomes, it has a hard time convincing many customers who have become accustomed to doing everything on an Illumina sequencer.
(A comment on this from a PacBio expert, Lex Nederbragt (@lexnederbragt): "Sorry, but you can't seriously compare the $30,000 PacBio human genome with the Illumina $1,000 human genome!")
While scientists and engineers tend to fixate on the technical side of the machines, there’s a human element here. What’s the old cliché? Something about once bitten, twice shy.
“Clearly PacBio had overpromised a lot in the earlygoing,” says Tim Hunkapiller. “It got a lot of investments early by seriously, seriously overpromising. Some of it was hubris. I don’t think it was dishonest. I remember when they said, ‘By 2014, we’ll have the whole human genome in 15 minutes.’ At the time, you had to say, ‘No, you’re not, of course you’re not.’ They lost their way.”
But just because a company lost its way once doesn’t mean it’s toast. If PacBio can hang around a few more years and keep steadily improving in its own low-key way, that would be worth celebrating. Not only would it keep Illumina on its toes as a near-monopoly market leader, but it would also be able to catalyze a lot of science that can’t currently be done on the rival machines. That might even atone for some of the sins from a few years ago.
http://www.xconomy.com/national/2014/03/03/pacbio-the-post-hype-sleeper-of-genomics/?single_page=true
Monday, March 3, 2014 - In Acinetobacter Study, Long Reads Aid in Defining Genomic Structure
A paper recently published in mBio, an open access journal from the American Society for Microbiology, reports on the sequencing and phylogenetic analysis of several drug-resistant strains of a pathogen found in hospitals. According to the authors, incorporating PacBio® sequence was critical for generating extremely long contigs for assembly and for accurately identifying chromosomal position and structure of genomic features associated with drug resistance.
The publication, entitled “New Insights into Dissemination and Variation of the Health Care-Associated Pathogen Acinetobacter baumannii from Genomic Analysis,” comes from scientists at the J. Craig Venter Institute, as well as Case Western Reserve University and its affiliated hospitals. Lead author Meredith Wright, senior author Mark Adams, and their collaborators describe a study of 49 drug-resistant isolates of the A. baumannii nosocomial pathogen gathered during a single year from three branches of a hospital system in Cleveland, Ohio. Their goal was to use sequence information to determine the transmission paths of the organism during that year.
A. baumannii has only become a serious pathogen in the last few decades, and “is now a leading cause of ventilator-associated pneumonia and surgical and urinary tract infections, among other illnesses,” the authors write. That surge in infections is due largely to A. baumannii’s ability to rapidly develop resistance to drugs. With this study, scientists hoped to learn more about that ability, as well as how to prevent disease transmission.
For a deeper understanding of this, two distinct strains were sequenced using Single Molecule, Real-Time (SMRT®) Sequencing and assembled with HGAP. The PacBio assemblies yielded contig N50 values larger than 1 Mb, compared to about 100 kb for Illumina® assemblies. The authors note that the quality of PacBio-only assemblies was comparable to that of error-corrected hybrid assemblies using Illumina data.
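For readers unfamiliar with the statistic, contig N50 is the length of the shortest contig such that contigs at least that long cover half the total assembly. A minimal sketch with made-up contig lengths:

```python
# Standard contig N50 computation (illustrative; contig lengths are made up).

def n50(contig_lengths):
    """Smallest length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# One long contig dominating the assembly pushes N50 past 1 Mb:
print(n50([1_200_000, 400_000, 150_000, 90_000, 60_000]))  # 1200000
```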
The long reads were particularly important for evaluating the pathogen’s drug-resistance properties, which have been linked to chromosomal and plasmid-borne resistance genes, as well as resistance islands. Because they were able to resolve these elements, the authors write, SMRT Sequencing “aided in defining the genetic structure and chromosomal position of resistance islands” in the two strains.
With this and the rest of the study information, the scientists determined that transmission was not as simple as patient-to-patient. “There was limited spatial or temporal clustering of strain types and gene contents within different hospital components, indicating that an endemic and interacting A. baumannii population exists either within the UH hospital system or in patients colonized with the bacteria,” the authors write. “The movement of patients and staff between the affiliated hospital locations may contribute to strain mixing and diversification.”
Building on that, the scientists note that all of these divergent strains were “indistinguishable by conventional sequence typing methods” and that genomic analysis will be essential for accurately identifying individual strains and determining drug resistance profiles.
http://blog.pacificbiosciences.com/2014/03/in-acinetobacter-study-long-reads-aid.html?utm_content=buffer58e11&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Genomics Data
Available online 1 March 2014
Returning to More Finished Genomes
Jonas Korlach Pacific Biosciences, 1380 Willow Road, Menlo Park, CA 94025
License: http://creativecommons.org/licenses/by-nc-nd/3.0/
http://dx.doi.org/10.1016/j.gdata.2014.02.003
Genomic data have become commonplace in most branches of the biological sciences and have fundamentally altered the way research is conducted. However, the predominance of short-read sequence data from second-generation sequencing technologies has commonly resulted in fragmented and partial genomic data. In this opinion, I will highlight how long, unbiased reads from single molecule, real-time (SMRT) sequencing now allow for a return to more contiguous and comprehensive views of genomes.
The generation of genomic data has revolutionized our ability to decipher the genetic blueprints of organisms, and thereby our understanding of the resulting biological phenomena and our means to manipulate them biotechnologically and medically. During the era of Sanger sequencing, a strong emphasis was placed on the generation of comprehensive, finished genome information from de novo assemblies, despite the fact that this was laborious and expensive. While the advent of second-generation sequencing technologies provided significantly greater data throughput, their shorter read lengths and more pronounced sequence-context bias led to a shift towards resequencing applications, often limited to certain regions of those earlier reference genomes and focusing on single-base differences. The difficulty of producing finished genomes from short-read sequence data, even for smaller microbial genomes, has resulted in a greater number of incomplete, highly fragmented, and often unannotated draft genomes [1].
The development of single molecule, real-time (SMRT) DNA sequencing has now made it possible to return to genomic data in the form of high-quality, finished genomes [2] and [3]. This is because SMRT sequencing has excellent performance characteristics in all four areas that are relevant in the evaluation of sequencing technologies:
-Accuracy: for high-quality genomic data, the absence of systematic sequencing errors is imperative. Sequence errors in SMRT sequencing are distributed randomly and are read-length independent, resulting in consensus accuracies of > QV50 across genomes (less than one error in 100,000 bases), often exceeding what can be obtained with second-generation technologies [2], [3] and [4].
-Uniformity: a prerequisite for comprehensive genomic data is the ability to sequence all the DNA that constitutes an organism's genome, irrespective of GC content or sequence complexity. SMRT sequencing has been demonstrated to exhibit the least degree of bias in sequencing data across different technologies [5], producing high-quality sequence even for extreme DNA sequence contexts [5], [6], [7] and [8].
-Contiguity: the quality of genome assemblies is strongly dependent on the read lengths of the underlying sequence data [9]. The long, multi-kilobase reads in SMRT sequencing facilitate the direct resolution of repeats and other forms of structural variation to yield the correct genome organization [2], [3], [6], [10], [11] and [12].
-Originality: because other sequencing technologies require DNA amplification, the vast majority of sequence data has been generated from DNA copies, not the original DNA that was extracted from the organism. In addition to the resulting amplification errors and bias, epigenetic DNA modifications are erased during amplification. SMRT sequencing does not require amplification, thereby eliminating such bias. SMRT sequencing also directly detects many types of DNA base modifications as part of the sequencing process (reviewed in [13]).
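The QV values cited above are on the standard Phred scale; as a quick reference, the conversion between a quality value and the implied per-base error rate (the standard formula, not anything specific to SMRT sequencing) is:

```python
import math

def qv_to_error_rate(qv: float) -> float:
    """Expected per-base error rate for a Phred-scaled quality value (QV)."""
    return 10 ** (-qv / 10)

def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a given per-base error rate."""
    return -10 * math.log10(error_rate)

# QV50 corresponds to one expected error per 100,000 consensus bases.
print(qv_to_error_rate(50))
```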
The scientific value resulting from these performance characteristics has been described in over 100 publications to date, spanning a wide range of biological application areas [14]. In several cases, the community has carried out direct comparisons of the quality of genomic data from different sequencing technologies, e.g. in the area of de novo assemblies of bacterial genomes [2], [3], [4] and [7]. These publications signal a shift from fragmented and incomplete draft genomes from short-read sequence data, often represented by dozens to hundreds of contigs [3], to a new paradigm whereby fully finished, highly accurate microbial genomes can be obtained from SMRT sequencing data in an efficient, automated workflow, and several institutions have already implemented the routine generation of such high-quality genomes into their production workflows. The publications also highlight the importance for simultaneous fulfillment of the performance categories outlined above: for example, the GC-rich and repeat-rich genomes of Streptomyces strains have been very difficult to resolve with short-read technologies, resulting in over 450 contigs and over 10% of genome sequence missing due to large coverage gaps [7]. In contrast, the automated SMRT sequencing-based, near-finished assembly covered the entire 8.7 Mb genome in seven contigs, the largest of which contained > 90% of the genome [7]. It is also worth noting that genomic data characteristics strongly affect sequence depth requirements, resulting in marked differences between sequencing technologies. For example, in a study comparing assemblies of the Potentilla micrantha chloroplast genome, the authors noted that as little as 120 × SMRT sequencing coverage was required to generate a finished, 1-contig de novo assembly comprising the entire genome, while the corresponding short-read assembly was still fragmented and incomplete despite > 9,000 × sequencing coverage, and was missing ~ 10% of the genome sequence [12].
While initially the new genome assembly methods utilizing the highly contiguous genomic data from SMRT sequencing were largely developed on microbial genomes, they are now being applied to larger genomes. Fig. 1 shows the de novo assembly for the yeast genome using the hierarchical genome assembly process (HGAP) developed for SMRT sequencing data [2], resulting in 30 contigs from the fully automated assembly workflow, relative to the 17 genomic elements (16 chromosomes plus mitochondrial DNA) present in the organism, i.e. each chromosome assembled into one or two contigs. With such high-quality assemblies, commonly used metrics to evaluate genome assemblies become less meaningful as they are more reflective of the organism's genome rather than the assembler's performance. For example, in this yeast assembly, the maximum contig length is 1.5 Mb because that is the longest genomic DNA element present in yeast (chromosome IV); it was assembled into a single contig.
Fig. 1.
Yeast (Saccharomyces cerevisiae) de novo assembly (green) using SMRT sequencing and HGAP, and comparison to the reference genome (strain S288C, blue). Data available at http://pacbiodevnet.com/.
A second example of more comprehensive genomic data from SMRT sequencing of larger genomes is an HGAP assembly of the Arabidopsis genome. Its comparison to results typically obtained with short-read technologies is shown in Table 1. The HGAP assembly contains the full genome (~ 12% was missing in the short-read assembly) with almost ten times fewer contigs, and almost 100-fold longer contigs on average. The longest contig spanned > 10% of the genome, and in several cases entire chromosome arms were represented as single contigs.
Table 1.
Arabidopsis thaliana Ler-0 strain de novo assembly using SMRT sequencing data and HGAP, and comparison to a short-read assembly (Data available at http://pacbiodevnet.com/ and http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/, respectively).
                        PacBio assembly   Short-read assembly (2011)   Improvement
Assembly size (bp)      124,572,784       110,357,164                  12%
# contigs               540               4,662                        8.6x
Contig N50 (bp)         6,190,353         66,600                       90x
Max contig length (bp)  12,982,390        462,490                      30x
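For readers unfamiliar with the contig N50 metric used in Table 1: it is the contig length at which the size-sorted, cumulative contig lengths first reach half the total assembly size. A minimal sketch:

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L together
    cover at least half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy example (not the assembly above): total is 100, half is 50,
# and the two largest contigs (40 + 30 = 70) are the first to pass it.
print(n50([40, 30, 20, 10]))  # 30
```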
The performance characteristics of SMRT sequencing data are increasingly applied to the human genome, as well as other large and complex genomes [6], [8], [10], [11], [15] and [16]. The lack of sequence context bias and the long read lengths have been employed to resolve regions that were previously difficult or even impossible to sequence by other methods, including attempts utilizing Sanger sequencing. For example, the gene encoding for MUC5AC, important for host-defense functions in the lung and other organs and implicated in cystic fibrosis and other diseases, contains a central large exon that had been intractable to sequencing due to its complex variable number tandem repeat (VNTR) structure, resulting in a ~ 50 kb gap in the human reference genome. By applying SMRT sequencing, a recent study demonstrated that this region could be resolved for the first time, and the high level of variation of this region between individuals was highlighted [6]. Similarly, in a paper entitled 'Sequencing the unsequenceable', 100%-GC DNA comprising the CGG trinucleotide repeat region in the FMR1 gene, responsible for fragile X syndrome, was shown to be amenable to SMRT sequencing [8]. Several groups have begun to apply SMRT sequencing over the entire human genome to leverage the long read lengths for the detection of various forms of structural variation, and to resolve regions which are difficult to access with short-read technologies due to their extreme DNA context or repeat content [15] and [16]. Long SMRT sequencing reads have also been demonstrated to be valuable in transcriptome sequencing for resolving full-length transcripts and alternative splice isoforms [17] and [18].
Outlook
The high scientific value of finished genomes has been emphasized, as they constitute an important prerequisite for comparative and functional genomics, metabolic reconstructions, forensics, and many other fields [19]. It is therefore important to establish standards for the quality of genomic data so that this level of genetic characterization can be reached more routinely. The performance characteristics of SMRT sequencing result in genomic data which more closely, comprehensively and contiguously reflect the organism's genetic and epigenetic constitution. New algorithms utilizing these data continue to be developed and optimized, e.g. HGAP, PacBioToCA, HBAR-DTK, PBJelly, Cerulean, and rDnaTools, to name just a few [20]. The resulting ability to generate high-quality, comprehensive genomic data in increasingly automated and cost-effective workflows is thereby anticipated to have a significant impact on improving our understanding of the genetic foundations of biology.
References
[1] P.S. Chain, D.V. Grafham, R.S. Fulton, M.G. Fitzgerald, J. Hostetler et al. Genome project standards in a new era of sequencing. Science, 326 (2009), pp. 236–237
[2] C.S. Chin, D.H. Alexander, P. Marks, A.A. Klammer, J. Drake et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods, 10 (2013), pp. 563–569
[3] S. Koren, G.P. Harhay, T.P. Smith, J.L. Bono, D.M. Harhay et al. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol., 14 (2013), p. R101
[4] J.G. Powers, V.J. Weigman, J. Shu, J.M. Pufky, D. Cox et al. Efficient and accurate whole genome assembly and methylome profiling of E. coli. BMC Genomics, 14 (2013), p. 675
[5] M.G. Ross, C. Russ, M. Costello, A. Hollinger, N.J. Lennon et al. Characterizing and measuring bias in sequence data. Genome Biol., 14 (2013), p. R51
[6] X. Guo, S. Zheng, H. Dang, R.G. Pace, J.R. Stonebraker et al. Genome reference and sequence variation in the large repetitive central exon of human MUC5AC. Am. J. Respir. Cell Mol. Biol. (2013). http://dx.doi.org/10.1165/rcmb.2013-0235OC
[7] B.C. Hoefler, K. Konganti, P.D. Straight. De novo assembly of the Streptomyces sp. strain Mg1 genome using PacBio single-molecule sequencing. Genome Announc., 1 (2013). http://dx.doi.org/10.1128/genomeA.00535-00513
[8] E.W. Loomis, J.S. Eid, P. Peluso, J. Yin, L. Hickey et al. Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene. Genome Res., 23 (2013), pp. 121–128
[9] C. Kingsford, M.C. Schatz, M. Pop. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinforma., 11 (2010), p. 21
[10] L.G. Maron, C.T. Guimaraes, M. Kirst, P.S. Albert, J.A. Birchler et al. Aluminum tolerance in maize is associated with higher MATE1 gene copy number. Proc. Natl. Acad. Sci. U. S. A., 110 (2013), pp. 5241–5246
[11] D.P. Melters, K.R. Bradnam, H.A. Young, N. Telis, M.R. May et al. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol., 14 (2013), p. R10
[12] M. Ferrarini, M. Moretto, J.A. Ward, N. Urbanovski, V. Stevanovi et al. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics, 14 (2013), p. 670
[13] B.M. Davis, M.C. Chao, M.K. Waldor. Entering the era of bacterial epigenomics with single molecule real time DNA sequencing. Curr. Opin. Microbiol., 16 (2013), pp. 192–198
[14] http://www.pacb.com/news_and_events/publications/
[15] K. Doi, T. Monjo, P.H. Hoang, J. Yoshimura, H. Yurino et al. Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing. Bioinformatics (2013). http://dx.doi.org/10.1093/bioinformatics/btt1647
[16] A.D. Patel, R. Schwab, Y.T. Liu, V. Bafna. Amplification and thrifty single-molecule sequencing of recurrent somatic structural variations. Genome Res. (2013). http://dx.doi.org/10.1101/gr.161497.161113
[17] K.F. Au, V. Sebastiano, P.T. Afshar, J.D. Durruthy, L. Lee et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc. Natl. Acad. Sci. U. S. A., 110 (2013), pp. E4821–E4830
[18] D. Sharon, H. Tilgner, F. Grubert, M. Snyder. A single-molecule long-read survey of the human transcriptome. Nat. Biotechnol., 31 (2013), pp. 1009–1014
[19] C.M. Fraser, J.A. Eisen, K.E. Nelson, I.T. Paulsen, S.L. Salzberg. The value of complete microbial genome sequencing (you get what you pay for). J. Bacteriol., 184 (2002), pp. 6403–6405 (discussion 6405)
[20] https://github.com/PacificBiosciences/DevNet/wiki/Compatible-Software
Copyright © 2014 Published by Elsevier Inc.
http://www.sciencedirect.com/science/article/pii/S2213596014000075
By Nick Loman on February 27, 2014
An outsiders guide to bacterial genome sequencing on the Pacific Biosciences RS
It had to happen eventually. My Twitter feed in recent times had become unbearable with the insufferably smug PacBio mafia (that’s you Keith, Lex, Adam and David) crowing about their PacBio completed bacterial genomes. So, if you can’t beat ‘em, join ‘em. Right now we have a couple of bacterial genomic epidemiology projects that would benefit from a complete reference genome. In these cases our chosen reference genomes are quite different in terms of gene content, gene organisation and have divergent nucleotide identities to the extent where we are worried about how much of the genome is truly mappable. And in terms of presenting the work for publication, there is a certain aesthetic appeal to having a complete genome.
And so, after several false starts relating to getting the strains selected and enough pure DNA isolated, we finally sent some DNA off to Lex Nederbragt at the end of last year for Pacific Biosciences RS (PacBio) sequencing!
This week I received the data back, and I thought it would be interesting to document a few things I have learnt about PacBio sequencing during this intriguing process.
It’s all about the library, stupid
The early PacBio publications in 2011 showing the results of de novo assembly of PacBio data weren't great, giving N50 results not materially better than 454 sequencing. This was despite the vastly longer read lengths achieved by the instrument; even then, mean read lengths of 2kb were achievable. Since then, incremental improvements to all aspects of the sequencing workflow have resulted in dramatic improvements to assembly performance, such that single contig bacterial genome assemblies are routinely achievable. This is probably best illustrated by the HGAP paper published last year, where single contig assemblies for three different bacterial species including E. coli were demonstrated. HGAP is PacBio's assembly pipeline.
The main improvements have been:
¦The use of Covaris G-Tubes for generation of long fragments (between 6-20kb)
¦The use of BluePippin, an automated gel electrophoresis size selector, for long fragments (see Lex's blog post)
¦Improvements to the sequencing polymerase (for example, protection against laser-induced damage) and to the sequencing chemistry
¦Bioinformatics algorithm improvements: initially the use of Illumina/PacBio hybrid assemblies, and then the HGAP pipeline, which corrects the longest reads in the dataset with shorter reads before doing a traditional overlap-layout-consensus assembly.
You need a lot of DNA (like, a lot)
There is a trade-off between the amount of input DNA and which library prep you can do. 5 micrograms for 10kb libraries, ideally 10 micrograms for 20kb libraries. Not always a trivial amount to get your hands on, even for fast-growing bacteria. This is one of the things that would limit our use of PacBio for metagenomics pathogen discovery right now, because this amount of microbial DNA from a culture-free sample is basically impossible to get.
In fact, we managed to get a library made from 2.6ug of DNA, but in this case the BluePippin size cut-off had to be dropped to 4kb (from 7kb).
Input DNA   Library prep   BluePippin cut-off   Number of reads   Average read length   Mbases
2.6ug       10kb           4kb                  36,696            5,529                 202.9
                                                70,123            5,125                 359.4
>5ug        10kb           7kb                  49,970            6,898                 334.7
                                                58,755            6,597                 387.6
>10ug       20kb           7kb                  42,431            6,829                 289.9
                                                59,156            7,093                 419.6
Table 1. PacBio library construction parameters and accompanying output statistics (per SMRTcell, 180 minute movie; two SMRTcells shown per library)
(As an aside, I wonder if Oxford Nanopore's shorter-than-expected read lengths, reported in David Jaffe's AGBT talk, may just be a question of feeding the right length fragments into the workflow. Short fragments in equals short reads out.)
Choose the right polymerase
For bacterial de novo assemblies, all the experts I spoke to recommended the P4-C2 enzyme. This polymerase doesn't generate the very longest reads, but is recommended for de novo assembly because the newer P5-C3 chemistry has systematic error issues with homopolymers (as presented by Keith Robison at AGBT). P5-C3 is therefore recommended for scaffolding assemblies, or could be used in conjunction with P4-C2.
Longer reads may mean reduced loading efficiency
You want long fragments, but we were warned by several centres that 20kb libraries load less efficiently than 10kb libraries, meaning throughput is reduced. It was suggested we would need 3 SMRTcells to get 100x coverage for an E. coli sized genome of 5Mb. However, that didn't really seem to hold in our case (Table 1).
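The back-of-the-envelope arithmetic behind that suggestion is just genome size times target coverage divided by per-cell yield; a quick sketch, using a yield of the order shown in Table 1 (the 200 Mb/cell figure is illustrative, not a spec):

```python
import math

def smrtcells_needed(genome_size_bp, target_coverage, yield_bp_per_cell):
    """SMRTcells required to reach a target depth of coverage."""
    return math.ceil(genome_size_bp * target_coverage / yield_bp_per_cell)

# 100x of a 5 Mb genome at ~200 Mb per SMRTcell:
print(smrtcells_needed(5_000_000, 100, 200_000_000))  # 3
```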
Shop around for best prices
As you almost certainly can’t afford your own PacBio, and even if you could your floor wouldn’t support its weight, you will probably be using an external provider like I did. Prices vary, but the prices I had per SMRTcell were around £350 and quotes for 10kb libraries around £400, with 20kb libraries being more expensive. In the end I went with Lex Nederbragt and the Oslo CEES – not the very cheapest but I know and trust Lex not to screw up my project and to communicate well, an important consideration (see Mick Watson’s guide to choosing a sequencing provider). In the UK, the University of Liverpool CGR have just acquired a PacBio and also would be worth a try. TGAC also provide PacBio sequencing. In the US, Duke provide a useful online quote generator and the prices seem keen.
What language you speaking?
It’s both refreshing and a bit unnerving to be doing something so familiar as bacterial assembly, but having to wrap your head around a bunch of new nomenclature. Specifically, the terms you need to understand are the following:
¦Polymerase reads: these are basically just ‘raw reads’
¦Subreads: aka 'reads of insert'. This is, I think, the sequence between the adaptors, factoring in that the PacBio uses a hairpin adaptor permitting reading of a fragment and its reverse strand. This term also relates to the soon-to-be-obsolete circular consensus sequencing mode. Lex has a description here: http://flxlexblog.wordpress.com/2013/06/19/longing-for-the-longest-reads-pacbio-and-bluepippin/
¦Seeds: in the HGAP assembly process, these are the long reads which will be corrected by shorter reads
¦Pre-assembled reads: a bit confusing, these are the seeds which have been corrected. They are 'assembled' only in the sense that they are consensus sequences from the alignment of shorter reads to the long reads, with PacBio using a directed acyclic graph to generate the consensus
¦Draft assembly: the results of the Celera assembler, before polishing with the read set
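The polymerase read/subread relationship above can be illustrated with a toy example (the adapter sequence here is a made-up placeholder, not the real SMRTbell adapter, and real extraction also handles partial adapter hits and quality trimming):

```python
ADAPTER = "ATCTCTCTC"  # hypothetical stand-in for the hairpin adapter

def subreads(polymerase_read, adapter=ADAPTER):
    """Split a raw polymerase read into subreads at adapter occurrences."""
    return [s for s in polymerase_read.split(adapter) if s]

# A fragment read forward, through the hairpin, and back again:
read = "GATTACA" + ADAPTER + "TGTAATC" + ADAPTER + "GATTACA"
print(subreads(read))  # ['GATTACA', 'TGTAATC', 'GATTACA']
```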
The key parameter for HGAP assembly is the Mean Seed Cutoff
The seed length cutoff selects the set of longest reads which together give >30x coverage.
This parameter is critical and defines how many of the longer, corrected reads go into the draft Celera assembly process. The default is to try and get 30x coverage from the longest reads in the dataset. This is calculated from the genome size you specify, which ideally you would know in advance. If this drops below 6000 then 6000 will be used instead. You can also specify the mean seed cutoff manually. According to Jason Chin the trade-off here is simply the time taken to correct the reads, versus the coverage going in the assembly. I am not clear if there is also any quality trade-off. Tuning this value did seem to make important differences to the assembly (a lower cut-off gave better results). The HGAP2 (for it is this version you want) tutorial is helpful on tuneable parameters for assembly.
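The cutoff rule described above can be sketched as follows; this is my own reconstruction of the logic from the text (30x default, 6000bp floor), not PacBio's code:

```python
def seed_length_cutoff(read_lengths, genome_size_bp, target_coverage=30, floor=6000):
    """Walk down the reads from longest to shortest until their summed
    length reaches target_coverage x the genome size; the length of the
    last read taken is the seed cutoff, never allowed below `floor`."""
    needed = genome_size_bp * target_coverage
    running = 0
    cutoff = floor
    for length in sorted(read_lengths, reverse=True):
        running += length
        cutoff = length
        if running >= needed:
            break
    return max(cutoff, floor)

# Toy numbers: 3x of a 10 kb "genome" needs 30 kb of seed bases,
# which the three longest reads (12 + 10 + 9 kb) are the first to supply.
print(seed_length_cutoff([12000, 10000, 9000, 8000, 7000], 10_000, target_coverage=3))  # 9000
```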
SMRTportal is cool, but flakey
I used Amazon m2.2xlarge (32Gb RAM) instances with the latest SMRTportal 2.1.0 AMI. About half the assembly jobs I started failed, with different errors, despite my doing the same thing each time. Sometimes it worked with the same settings. I am not sure why this should be; maybe my VM was dodgy.
HGAP is slooooooow
Being used to smashing out a Velvet assembly in a matter of minutes, the glacial speed of the HGAP pipeline is a bit of a shock. On the Amazon AMI instance, assemblies were taking well over 24 hours. According to Keith Robison on Twitter, this is because much of the pipeline is single-threaded, with multi-threading only occurring on a per-contig basis; so if you are expecting a single contig, you are bottlenecked onto a single processor. We therefore chose the m2.2xlarge instance type because the high-memory instances have the fastest serial performance of the available instance types. This is actually important in a clinical context. Gene Myers (yes, THAT Gene Myers) presented at AGBT 2014 to say that he has a new assembler which can do an E. coli sized genome in 30 minutes; as far as I'm concerned, it can't come soon enough.
Single contig assemblies are cool
A very long contig, yesterday
Well, my first few attempts have given me two contigs, but that is cool enough. And it is pretty damn cool. If money was no object (and locally we are looking at a 20:1 cost ratio for PacBio sequencing over Illumina) then I would get them every time. As it is, for now, we will probably confine our use to when we really need to generate a quality reference sequence to map Illumina reads against, for example when investigating an outbreak without a good reference. Open pan-genome species like Pseudomonas and E. coli are good potential applications for this technology, where you have a reasonable expectation of large scale genome differences between unrelated genomes. Our Pseudomonas genomes went from 1000 contigs to 2 contigs, which does make a huge difference to alignments. As far as I can see it is pointless to use PacBio for monomorphic organisms, unless you are interested in the methylation patterns. Keith Robison wrote recently and eloquently predicting the demise of draft genome sequencing, but whilst the price differential remains I think this is premature.
Polished assemblies still need fixing
Inspecting the alignment of Illumina reads back to the polished assemblies reveals that errors remain; these are typically single-base insertions relative to the reference, which need further polishing (Torsten Seemann's Nesoni would be a good choice for this)
The Norwegian Sequencing Centre rocks
I'm very grateful to Lex Nederbragt, Ave Tooming-Klunderud and the rest of the staff of the Norwegian Sequencing Centre in Oslo for their help with our projects; they have been very helpful and I recommend them highly. Send them your samples and first born!
Also many thanks to those on Twitter who have answered my stupid questions about PacBio, particularly Keith Robison, Jason Chin, Adam Phillippy and Torsten Seemann.
When I have more time I will dig into the assemblies produced and look a bit more at what they mean for both the bioinformatics analysis and the biology.
Posted in Uncategorized
2 Responses
flxlex
February 27, 2014 at 2:16 pm | Permalink
Thanks, Nick! And congratulations on your results…
Two comments:
- your aside on Oxford read lengths is entirely correct: they will get much better lengths once they adjust the fragmentation and library prep conditions
- you could (should) run Quiver multiple times on the assembly, to get rid of the last indels
Lex
scbaker
February 27, 2014 at 4:43 pm | Permalink
Great post, Nick! This will be really helpful for those getting into PacBio sequencing. In terms of finding the right provider, I’d humbly recommend checking out our service – AllSeq’s Sequencing Marketplace http://allseq.com/information/need-sequencing
We’re adding PacBio providers all the time and we can help people find the best one for their project (whether it be price, turn around time, or application expertise).
Shawn
http://pathogenomics.bham.ac.uk/blog/2014/02/an-outsiders-guide-to-bacterial-genome-sequencing-on-the-pacific-biosciences-rs/ --http://seqanswers.com/forums/showthread.php?p=133937#post133937
A New Gold Standard for Accuracy in NGS: Mike Hunkapiller, PacBio
published by Ayanna Monteverdi on Wed, 02/26/2014 - 10:24
Guest:
Mike Hunkapiller, CEO, Pacific Biosciences
Bio and Contact Info
Listen (4:58) What is the theme for 2014 at PacBio?
Listen (2:50) Are you working on a clinical sequencer?
Listen (6:55) What are your thoughts on regulation and diagnostics?
Listen (3:12) What was your reaction to the Oxford Nanopore data just released at AGBT?
Listen (6:40) PacBio runs becoming the gold standard in microbial sequencing
Mike Hunkapiller, the CEO of Pacific Biosciences, joins us again this year as part of our annual series on NGS. Last year, Mike stressed the importance of PacBio's SMRT(TM) sequencing to do longer reads than the competition--namely Illumina. He says PacBio will continue to stay focused on further improving read length and accuracy this year as well. In fact, he says that the PacBio technology is becoming "the new gold standard" for microbial sequencing.
What does Mike think of the first data released by Oxford Nanopore recently? And what are PacBio's plans for clinical sequencing? Join us in the second installment of NGS 2014.
Today's Podcast is sponsored by Biotix - Makers of a Better Tip for Next Gen Sequencing. Find out how Biotix is setting a new standard in sample delivery here.
http://mendelspod.com/podcast/new-gold-standard-accuracy-ngs-mike-hunkapiller-pacbio
Welcome to the $1,000 genome
Posted by Biome on 25th February 2014
"What’s possible now
With Roche shutting down their 454 sequencing business and Life Technologies largely pushing their Ion Torrent systems over the SOLiD platform, there are really only 3 technologies currently available: Ion Torrent, Pacific Biosciences and Illumina.
Ion Proton, the higher throughput of Life Technologies’ Ion Torrent machines, currently runs the PI chip, capable of producing 60 to 80 million 200bp reads in a 4 hour run, a total output of 10Gb. The PII chip, having been scheduled for release early in 2013 has now been pushed back to mid-2014. The PII chip will apparently be capable of producing 300 million 100bp reads, resulting in a 30Gb output.
Pacific Biosciences’ RS II machine, the only single-molecule sequencer on the market, does not really compete with Illumina or Ion in terms of throughput; its P5-C3 chemistry produces only 375Mb of sequence per run. The real strength of the RS II is its long reads: the average read being 8.5Kb, with the longest being in excess of 30Kb. Recently published read correction strategies remove many of the errors, and now the SMRT technology of Pacific Biosciences (or ‘PacBio’) seems the weapon of choice for finishing genomes or de novo sequencing of new genomes."
The future for Illumina’s competitors
It will be really interesting to see how Life Technologies responds to Illumina’s latest developments. Their key advantage is speed, with the Ion Torrent platforms carrying out the sequencing component in hours rather than days. However, the throughput and cost-per-base do not match current Illumina platforms, never mind the new ones. To remain a viable business, Life Technologies, and its Ion Torrent platforms, must respond.
Pacific Biosciences’ SMRT technology has evolved significantly too and has become an essential tool for those wishing to close genomes, or sequence de novo new genomes. Intriguingly, Roche, a global health-care company, announced an agreement with Pacific Biosciences to develop DNA sequencing products for clinical diagnostics. This is not a space that Pacific Biosciences have been in up until now, and it is difficult to see how their RS II system can compete with Illumina and Ion Torrent in the clinic. Because of this, rumors of a new (benchtop?) PacBio machine abound on social media.
(more on link)
http://www.biomedcentral.com/biome/welcome-to-the-1000-genome/
PacBio Technology Roadmap for 2014 .Published on Feb 18, 2014
Edwin Hauw from Pacific Biosciences presents the PacBio® technology roadmap for 2014. What to expect: sample prep improvements for low-input DNA; a new chemistry for longer reads; assemblers for low coverage or diploid genomes; analysis tools for isoform sequencing, viral minor variant detection, and long-amplicon haplotype analysis; and much more.
Tomorrow’s news in Japan!! February 19, 2014: PacBio 54X (Japanese-language post)
http://pacbiobrothers.blogspot.com/2014/02/pacbio54x.html?spref=tw
Looking back at AGBT 2014
Posted on February 18, 2014 by Lex Nederbragt
I attended, for the first time, the Advances in Genome Biology and Technology (AGBT) meeting in Florida. With this post, I intend to summarise my experiences of the meeting. I will not cover everything that happened, but focus on the areas of my own interest.
First, I have already dedicated one post to one particular talk, the one by David Jaffe on the first data from the Oxford Nanopore MinION.
Here, I will add a few additional reflections on the MinION talk. I already alluded to the fact that the E. coli sequenced was a methylation-negative strain. Someone I spoke to claimed to know that the other species, Scardovia, also did not have methylated bases. This may indicate that methylated bases confuse the sequencing process. From methylation detection using the PacBio platform, we know that the signal from several bases upstream and downstream of a modified base is different from that in the absence of base modifications. It is speculative, but perhaps the MinION cannot (yet) sequence modified DNA.
Another aspect that was not touched upon by David Jaffe was the per-MinION throughput. The data presented allow one to calculate the number of bases available for each strain: around 6x in reads for the 5 Mbp E. coli strain, totalling 30 Mbp, and 13x coverage of the 1.6 Mbp Scardovia genome, totalling 21 Mbp. But we don’t know how many MinIONs were used for the data presented. Nor was anything said about a filtering step (by quality or length) on the raw reads before they were sent to Jaffe. So, as of yet, we still do not really know how many bases to expect from a MinION run.
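The arithmetic behind those throughput figures is just coverage times genome size; a minimal sketch using the numbers quoted above (the actual per-MinION yield, as noted, remains unknown):

```python
# Back-of-envelope yield estimate: total bases ~= coverage x genome size.
# The figures are the ones quoted in the post (E. coli ~5 Mbp at ~6x,
# Scardovia ~1.6 Mbp at ~13x); how many MinIONs produced them was not said.
def estimated_yield_mbp(coverage: float, genome_size_mbp: float) -> float:
    """Total sequenced bases (in Mbp) implied by a coverage figure."""
    return coverage * genome_size_mbp

ecoli = estimated_yield_mbp(6, 5.0)        # ~30 Mbp
scardovia = estimated_yield_mbp(13, 1.6)   # ~21 Mbp
print(f"E. coli: {ecoli:.0f} Mbp, Scardovia: {scardovia:.1f} Mbp")
```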
Finally, it remains to be seen how evenly the genome is covered and whether there is any bias against (high or low) GC regions. With the exception of PacBio, all sequencing platforms show significant biases against extreme GC regions, hampering recovery of those regions. It is important to determine how the MinION performs in this respect.
As I’ve written, our application for the MinION Access Program was not granted. I have no problem with that, as the program was massively oversubscribed and Oxford Nanopore had to make a selection amongst the applicants. One blogger, however, disagrees and has set up a petition asking Oxford Nanopore to give me the possibility to test the MinION after all. I can’t really object to that initiative, although I would fully understand if even a massive number of signatures did not convince Oxford Nanopore – why grant me access when there are so many others that did not get into the program?
So, that was it for announcements from new sequencing platforms at AGBT. Oh, wait, almost forgot: there was actually another one. Drumroll… Genapsys announced their GENIUS system! Who? What? Well, I missed the talk, but as far as I can tell from twitter and the buzz at the meeting, I didn’t miss much. The presenter showed what was described as a lunch box sequencer. No data was shown. I’ll leave it at that. For those interested, see this Forbes piece (with some nice comments) and this GenomeWeb article.
There were representatives from two other nanopore platforms at the meeting. I spent some time talking to Arek Bibillo at his poster describing the Genia system, a biological pore that measures molecules cleaved off nucleotides as they become incorporated during DNA synthesis. The platform has the advantage of having one molecule at a time in the pore (instead of multiple bases), and the signal drops to background levels in between readings. I think the platform is promising, although they have not yet come very far. Apparently, the chips scale very well, with a 100,000-pore chip coming, and one with another factor of ten more pores planned.
Finally, I had a quick chat with Quantum Biosystems’ Nava Whiteford, a former Oxford Nanopore employee. They recently released the first data from their electronic nanopore. The data consisted of raw signals from reading a 21-mer miRNA. Quantum is at a really early stage, but I applaud them for being so open as to release raw data at this point. More data releases were promised.
A quick recap of what the big established sequencing platforms were presenting at AGBT:
Illumina had already wowed us at the JPMorgan conference (Genomeweb piece) and did not have much new to add to the news on the NextSeq 500 and HiSeq X Ten. This led to little buzz at AGBT.
Ion Torrent did not have any breakthroughs to report. In fact, I was disappointed to hear new planned release dates for the Ion Proton PII chip (early access May/June) and Ion Chef (early access ongoing), later than what we were told in October (the PII chip was then to be released in November 2013, Ion Chef before Christmas 2013).
PacBio can now truly be called an established platform. They had a massive presence with many talks and posters, including ours (which you can see here). The company’s biggest news was the release of 54x coverage of PacBio data from the human genome, and a corresponding assembly that outperforms the current reference. Have a look at PacBio’s blog for details. Every sequencing company ultimately wants to show that they can sequence the human genome. PacBio has, in that respect, in my opinion outperformed them all: they not only have the data, they have a fantastic de novo assembly. This means that de novo genome sequencing is going through a transformation: with the right funding and DNA amounts and qualities, very high quality assemblies are now possible, reaching and often surpassing the gold standard from before.
Jason Chin, chief bioinformatician at PacBio, gave a presentation on his ongoing work towards a true diploid assembler. Mixing PacBio reads of two inbred, but different, samples of Arabidopsis to mimic a diploid species, he showed promising results from the Falcon assembler. Fully resolved, heterozygous (phased) assemblies are becoming possible!
Talking of assembling PacBio reads, the currently available way of using the reads is either to generate around 60-100x coverage so the reads can be used to correct themselves, or to use 50x or more of short-read data to correct the raw PacBio reads. The corrected reads are then used in an assembly. The ultimate goal is obviously to use the raw reads natively, assembling them without correction. This is challenging due to the high single-pass error rate, which makes finding true overlaps difficult. Besides our poster on that subject, there was an interesting talk from assembly guru Gene Myers (of Celera fame). He basically skipped the whole short-read era, but now, with the long PacBio reads, he’s back: developing a new assembler, called Dazzler (for Dresden Assembler). The program can assemble small and large genomes from around 30x raw data fairly quickly, with promising results. Gene Myers said that he plans to release the software in a couple of months. Jim Knight, the developer of Newbler (and a former student of Gene Myers), had a poster showing the first results of using raw PacBio reads in a hybrid assembly with Newbler. Very exciting developments, as this may be particularly helpful for large genomes, for which generating high-coverage PacBio datasets is very expensive.
Roche/454, well, after their announcement that they will shut down 454 in mid 2016, the buzz on this platform was very quiet. The only thing I picked up was that the long reads from the GS FLX+ are coming to the GS Junior.
I liked very much the talk by Jeffery Schloss, entitled “Ambitious Goals, Concerted Efforts, Conscientious Collaborations – 10 Years Hence”. He described ten years of history of the National Human Genome Research Institute’s program to enable and support technological developments towards the $1000 genome. Interestingly, he mentioned that in the beginning, the goal was a $1000 genome of the quality of the mouse genome that was published in 2002, implying a de novo assembled genome. These days, the $1000 genome only refers to resequencing a sample to, say, 30x. Schloss did not touch upon when, and why, this change happened. Dale Yuzuki has a blog post on the talk.
Talking about references: I am very interested in augmented reference genomes, where alternative sequences (different haplotypes) are represented such that the reference becomes more complete. There were two talks describing different approaches towards this goal. Valerie Schneider, from the National Center for Biotechnology Information, gave a talk on “Taking Advantage of GRCh38”, the newly released human reference. For this release, many more alternative alleles were added. Valerie Schneider described new read mappers that take the alternatives into account, leading to better SNP calling. Another approach was described by Mark Garrison, in his talk about graph-based reference representations. Using prior information on variants, an extended reference can be built, again improving mapping results. I am very excited about these developments, as I hope this will help us represent the extensive variation we see in Atlantic cod.
Finally, my interest was piqued by two (or rather, three) technologies for taking genome assemblies to a higher level. Beyond sequencing and assembly, there is a further step: transforming the scaffolds into chromosome-scale reconstructions.
Joshua Burton gave a talk titled “Chromosome-Scale Scaffolding of de novo Genome Assemblies Based on Chromatin Interactions”. He described a technique using Hi-C data, in which chromosomal regions that are in close proximity get cross-linked, isolated and sequenced. The resulting data can be used to determine the order and orientation of scaffolds (the method has been published recently).
Another approach to the same problem is optical mapping. There, individual, very long DNA molecules are labelled at restriction sites and imaged, and the patterns are used to make longer ‘contigs’. Mapping the restriction sites present on scaffolds can then be used to link up the assembly scaffolds with the optical map. The technique can also be used for detecting structural variations. BioNanoGenomics performed a mapping experiment of the human genome in real time at the meeting. I talked to a representative and was quite impressed with the technology.
Nabsys is a company developing a different technical solution to the same problem, with an electronic readout of the labels. They do not yet have an instrument for sale, but what they showed looked promising, not least because they can achieve a higher tag density (0.5–1 kb for Nabsys versus something like 10 kb for BioNanoGenomics).
Notably absent at AGBT was OpGen, a direct competitor of BioNanoGenomics.
I feel these mapping approaches are going to be a very valuable addition to genomics, both for super-scaffolding assemblies and for cost-effective structural variation detection.
All in all, I enjoyed the meeting very much. I’ve (re)met many people, learned a lot, had good discussions and – I can’t deny it – had a lot of fun.
7 thoughts on “Looking back at AGBT 2014”
homolog.us says:
February 18, 2014 at 17:52
Very nice summary. Thanks for posting !
Reply
Dale Yuzuki (@DaleYuzuki) says:
February 18, 2014 at 18:00
Thanks for posting this Lex, and it was great to meet you in-person.
Here’s a permalink to Jeffrey Schloss’ talk: http://www.yuzuki.org/post-agbt-2014-thoughts-jeffrey-schloss-plenary-talk-nhgri-2/ (alas got the WordPress ‘white screen of death’ and finally cut the cord on my prior hosting service, and the new one is working out very nicely).
Also agree on the promise of both BionanoGenomics and Nabsys – single molecule maps to determine higher-order rearrangements is an unmet need, although the main question is one of scale (and affordability). I’ll have some comments about Nabsys soon. (When I’m able to ‘dig out’ from a rather substantial backlog!)
Reply
lexnederbragt says:
February 18, 2014 at 20:29
Thanks! Looking forward to your upcoming post(s).
Reply
homolog.us says:
February 18, 2014 at 18:20
Nick Loman (@pathogenomenick) commented:
“Not sure a petition warranted to get my friend @lexnederbragt on the programme he said he didn’t want to go on! http://flxlexblog.wordpress.com/2013/10/25/would-you-buy-that-washing-machine/ … ;)”
The main rationale is that we would like to get an open, thorough and objective analysis of how the nanopore technology works, and it will be especially beneficial to have a comparative analysis done with other relevant technologies. Given your experience with PacBio and your sharing of technology-related information on the blog, it would be a loss to the community if you are not allowed early access.
The petition text and the link are attached below, and we got plenty of signatures. I also requested Nick Loman by email to suggest an appropriate person at Oxford Nanopore to whom we can send it.
——————————————————————————————————————————-
“Request for Early Access to Minion Nanopore for Lex Nederbragt’s Group
Genome scientist Lex Nederbragt has been very active in the online community about sharing information on next-generation sequencing technologies. Also, his blog is generally considered fair, impartial and objective regarding various technologies. We are saddened to learn that his request for early access to Minion has been rejected. Since Nanopore sequencing is considered an important technology by many researchers (incl. us), we believe our knowledge and decision about the technology will be greatly enhanced by evaluation done by Dr. Nederbragt.”
http://flxlexblog.wordpress.com/2014/02/18/looking-back-at-agbt-2014/
Tuesday, February 18, 2014
AGBT Day 3 Highlights: Single Contigs, Dazzling Assemblers, Novel Isoforms & Honey Algorithms
Friday morning’s talks were exceptional, and included genomics heavy-hitters Dick McCombie and Gene Myers — both scientists who were truly influential in sequencing the human genome so many years ago. They have kept pushing boundaries, and their talks were fascinating.
Cold Spring Harbor Laboratory’s McCombie offered a presentation based on a late-breaking abstract showing the importance of de novo assembly — rather than resequencing, which can miss structural differences — using SMRT® Sequencing. He showed data from genome sequences of two yeast species (S. cerevisiae and S. pombe), both of which were generated using P5-C3 chemistry with BluePippin™ size selection from Sage Science. For S. cerevisiae, 15 of 16 chromosomes assembled into single contigs, with the final chromosome represented in two contigs. For S. pombe, one chromosome and the mitochondrial genome came together into individual contigs, while the other two chromosomes were split into two contigs each. McCombie’s team also worked with the Arabidopsis data set released by PacBio and compared it to an Illumina® sequencing-based assembly of the same plant. Contig N50 increased from 65 Kb with the MiSeq® platform to 8.4 Mb with the PacBio® platform. Finally, he showed data from a rice genome sequenced for him by PacBio. (He told attendees he had to contract the project out since his own PacBio RS II was running at capacity.) The mean read length was 10 Kb and the longest read produced was more than 54 Kb, earning McCombie the award for longest read presented at the conference.
Gene Myers, who recently joined the Max Planck Institute for Molecular Cell Biology and Genetics in Dresden, Germany, said that PacBio long reads had reinvigorated his excitement about genome assembly with the promise of being able to produce reference-quality genomes. Myers has developed a tool called Dazzler (the Dresden Azzembler) that significantly accelerates the process of assembling PacBio sequence data. Dazzler works by scrubbing data prior to assembly in order to make the entire process more efficient; Myers reported a comparison of the human genome data set we just released showing a 36-fold speedup over BLASR. The tool can fully assemble an E. coli genome from PacBio reads on a regular laptop in just 10 minutes.
Later in the day, our CSO Jonas Korlach gave a talk showcasing the Iso-Seq™ method for full isoform characterization using SMRT Sequencing. He showed papers from the laboratories of Mike Snyder and Wing Wong, both at Stanford, who used PacBio long reads to fully analyze transcriptomes. Even in well-studied cell lines, Korlach noted, scientists were finding novel transcript isoforms and even novel genes thanks to information provided in these long reads. He also spoke about a metagenomics project looking at a mock human microbiome data set from NIAID, in which SMRT Sequencing was able to fully resolve more than half of the organisms in the community and get the rest into assemblies of a few contigs. The project also resolved all plasmids and yielded methylome data for the microbiome.
The evening session on genomic technologies development featured two more PacBio users. David Wheeler from Baylor’s Human Genome Sequencing Center presented sequence data for tumor/normal pairs; his group is generating 10x coverage for tumors and 5x for the matched normal tissue. He focused on structural rearrangements such as tandem duplications and said that many of these elements were driven by the movement of repeat regions around the genome. They could clearly be resolved using the PacBio technology along with two new algorithms from Adam English called Honey-tails and Honey-spots. In the other presentation, Sean McGrath from the Genome Institute at Washington University used SMRT Sequencing for gene isoform identification and prediction. His data from a cancer cell line and from a hookworm showed the ability of PacBio sequencing to identify more genes than short-read technologies had been able to identify, and also preserved the 5’ and 3’ UTR information in many cases. http://blog.pacificbiosciences.com/2014/02/agbt-day-3-highlights-single-contigs.html?utm_content=buffer7a41f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
(Comments on Twitter about Oxford Nanopore Technologies) Chris Cole @drchriscole · 6h ago:
Catching up with the @nanopore #AGBT talk. Can see why it's underwhelming: reads lower quality than @illumina and shorter than @PacBio Retweet 1 Favorite 1 7:26 AM - 17 Feb 2014 ·
(Pacific Biosciences Inc Cornell Connection)
A Cornell faculty member and a PhD graduate of Cornell’s School of Applied and Engineering Physics founded the company.
http://smallbizdev.cornell.edu/companies/pacific-biosciences-inc
The Wall Street Journal news - February 10, 2014: Pacific Biosciences Highlights Increased Focus on Human Genome Research at AGBT 2014; Releasing 54x Coverage Human Genome Data for de novo Assembly
MENLO PARK, Calif., Feb. 10, 2014 (GLOBE NEWSWIRE) -- Pacific Biosciences of California, Inc. (Nasdaq:PACB), provider of the PacBio(R) RS II DNA Sequencing System, announced that its Single Molecule, Real-Time (SMRT(R) ) Sequencing technology will be featured in nine podium presentations and 29 posters at this year's Advances in Genome Biology and Technology (AGBT) meeting, with more than half presenting research on human and other complex genomes. In addition, on Wednesday the company will publicly release a long-read dataset for generating the first de novo human genome assembly from PacBio-only sequence reads.
"Recent performance increases on the PacBio RS II -- notably substantial improvements in throughput -- are allowing scientists to approach much larger genome sequencing projects. The list of recent accomplishments using SMRT Sequencing includes agriculturally valuable plants such as spinach, model organisms such as Drosophila and Arabidopsis, and now, very high-quality work on human genomes," said Michael Hunkapiller, Chief Executive Officer of Pacific Biosciences. "We are delighted with the diversity of applications that will be showcased by our customers this year at AGBT, including unprecedented work on the human genome."
Pacific Biosciences used its P5-C3 sequencing chemistry to generate 54x coverage on a well-studied human haploid cell line (CHM1htert), which is being utilized as part of a National Institutes of Health project to sequence and assemble an alternate reference genome (the so-called "platinum genome"), an effort led by Rick Wilson of Washington University in St. Louis and Evan Eichler of the University of Washington, in collaboration with investigators from the National Center for Biotechnology Information (NCBI).
The human genome dataset is being deposited in the public domain to offer the bioinformatics and scientific communities an additional dataset to accelerate the understanding of genome-wide variation at all genome size scales, and to improve assembly techniques. To demonstrate the value of using PacBio long-read data to create de novo assemblies of human genomes following the general Hierarchical Genome Assembly Process (HGAP), Pacific Biosciences collaborated with Google to leverage the Google Cloud Platform for the most computationally intensive part of the assembly pipeline. In a single day, the pipeline executed 405,000 CPU hours to align the long reads to each other. These data were transferred back to the company to complete the assembly process, which resulted in a 3.25 Gb assembly with a contig N50 of 4.38 Mb, and with the longest contig being 44 Mb. This represents over an order of magnitude better N50 than the most recent reference-guided assembly using Illumina(R) sequencing and BAC-clone finishing on the same sample, which had a total assembly size of 2.83 Gb and a contig N50 of 144 kb.
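Contig N50, the metric quoted throughout these announcements, is the length such that contigs of that size or longer cover at least half the total assembly. A minimal sketch with toy contig lengths (not the actual assembly data):

```python
def n50(contig_lengths):
    """N50: the length L of the contig at which the cumulative sum of
    contig lengths, taken from longest to shortest, first reaches half
    of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

# Toy example: total = 100, half = 50; cumulative sums are 40, then 70,
# so the N50 contig is the 30-length one.
print(n50([40, 30, 20, 10]))  # -> 30
```

By this measure, the 4.38 Mb N50 quoted above means half of the 3.25 Gb assembly sits in contigs of 4.38 Mb or longer — roughly a 30-fold improvement over the 144 kb N50 of the reference-guided Illumina assembly.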
Deanna Church, currently of Personalis, Inc. and previously at NCBI as a founding member of the Genome Reference Consortium, commented: "The human reference assembly is central to all modern sequence analysis; therefore it is critical that the assembly is of the highest possible quality. The Genome Reference Consortium has been working towards this goal for years, and the CHM1 resource was invaluable for some of the improvements in GRCh38 (the latest human reference version released by NCBI). However, even the latest GRCh38 assembly has regions that need improvement, and additional sequencing technologies like SMRT Sequencing will clearly be necessary to continue improving the reference assembly. Personalis has also worked to develop advanced human reference versions and applauds new accomplishments by others in this area."
PacBio's 54x data initiative was a follow-on project from the October 2013 release of a 10x coverage dataset of a human genome for detecting structural variation relative to the human reference genome. Human genomes harbor many potentially medically relevant structural variations, which are often difficult or impossible to resolve using short-read technologies. To date, most studies on human variation consist of resequencing and comparing to a human reference genome. However, in order to comprehensively assess genetic variation between humans, de novo assemblies and subsequent comparison between genomes are desirable.
The unique value of this dataset will be described in several presentations at AGBT, including a talk by Gene Myers of the Max Planck Institute for Molecular Cell Biology and Genetics titled "A De Novo Whole Genome Shotgun Assembler for Noisy Long Read Data." The data will also be discussed in a presentation by PacBio's Senior Director of Bioinformatics Jason Chin titled "String Graph Assembly For Diploid Genomes With Long Reads," and in a company workshop on Friday hosted by Chief Scientific Officer Jonas Korlach.
The human genome dataset will be summarized and accessible via the PacBio blog on the morning of Wednesday, February 12. More information about PacBio-related activities at AGBT 2014 is available at www.pacb.com/agbt.
About the PacBio RS II and SMRT Sequencing
Pacific Biosciences' Single Molecule, Real-Time (SMRT) Sequencing technology achieves the industry's longest read lengths and highest consensus accuracy,(i,ii) along with the least degree of bias.(iii) These characteristics, combined with its ability to detect many types of DNA base modifications (e.g., methylation) as part of the sequencing process, mean the PacBio RS II provides a window into critical biological processes and medically, agriculturally, and industrially relevant genetic and genomic variation that can only be revealed with SMRT Sequencing technology.
http://online.wsj.com/article/PR-CO-20140210-904720.html
February 14th, 2014 Dazzler Assembler for PacBio Reads – Gene Myers
you know you’re in too deep when you are watching the #AGBT14 hashtag like a day trader after an IPO
We acted like those day traders, because Gene Myers, author of Celera assembler, presented on PacBio assembly in AGBT today. Thanks to tweets from @lexnederbragt, @OmicsOmicsBlog, @infoecho and many others, we got a snapshot of the talk.
Gene Myers started working on the assembly problem 30 years ago and is returning to the assembly field after 10 years. There was a big intellectual battle between Pavel Pevzner’s de Bruijn graph assembler and Myers’ OLC approach, and Myers conceptually combined the two in his string graph paper. Obviously Myers was not happy to see the de Bruijn graph stealing the show, which he expressed in the talk as ‘short reads were not intellectually satisfying to me’.
The assembler he developed here is ‘PacBio-only’ and not a hybrid assembler. Possibly he does not want to touch the short reads at all, because that would require him to acknowledge that de Bruijn graphs have some value. It is better to wish those annoying short reads to go away :). For hybrid assembly, an earlier talk mentioned ECTools as very helpful.
Getting back to Myers’ talk, here are the conceptual blocks being discussed in twitter.
Conceptual blocks
1. “High error, as long as it is truly random, is problematic w.r.t. efficiency and consensus, not quality.”
2. “Sampling is perfectly Poisson and the errors are random! The location of the noise is random; with a Poisson distribution of DNA fragments on the genome, the noise does not matter.”
3. 20x coverage is good enough. “Do the math. 20x coverage with 15% error will give you a Q70 base.”
4. “In some sense, the string graph is the answer to the assembly problem.” [Notes from Homolog.us - there is no conceptual difference between de Bruijn graph and string graph. So, we hope people stop putting those two approaches as polar extremes and start to combine them.]
5. Before building the string graph, he needs to take care of “chimeric reads, contaminant reads, unclipped primer sequence, excessively erroneous reads”.
6. “Everyone I know has a bigger cluster than I do.” – Dazzler’s focus is on efficiency.
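The “20x coverage with 15% error gives a Q70 base” claim can be sanity-checked with a small exact calculation. This is a sketch under simplifying assumptions of my own (independent substitution errors spread uniformly over the three wrong bases, plain majority voting, ties counted as errors), not a model stated in the talk; the answer lands in the same ballpark as the quoted Q70:

```python
from math import comb, log10

def consensus_error_prob(n=20, p_err=0.15):
    """Probability that a majority-vote consensus over n reads calls the
    wrong base at a position, assuming independent substitution errors
    spread uniformly over the three incorrect bases (a simplification).
    Ties are conservatively counted as errors."""
    p_ok = 1.0 - p_err
    p_wrong = p_err / 3.0
    bad = 0.0
    # Exhaustively enumerate counts (c, w1, w2, w3) summing to n, where c
    # reads show the true base and w1..w3 show the three wrong bases.
    for c in range(n + 1):
        for w1 in range(n - c + 1):
            for w2 in range(n - c - w1 + 1):
                w3 = n - c - w1 - w2
                if max(w1, w2, w3) >= c:  # a wrong base ties or beats truth
                    coef = comb(n, c) * comb(n - c, w1) * comb(n - c - w1, w2)
                    bad += coef * p_ok**c * p_wrong**(n - c)
    return bad

p = consensus_error_prob()
print(f"P(consensus wrong) ~ {p:.1e}, i.e. roughly Q{-10 * log10(p):.0f}")
```

The exact Phred value shifts a few points depending on how ties are handled and whether indel errors are modelled, but the calculation supports the gist of the claim: random errors at 15% wash out almost completely at 20x.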
Innovations
Myers solved the efficiency problem and then worked on consensus.
Here is the workflow – Overlap, scrub, correct, overlap, scrub, assemble
More detail: align at 80% -> scrub -> correct -> align at 95% -> scrub -> assemble -> consensus.
Super easy?
The main innovations are avoiding BLASR for alignment, and “scrubbing” to clean up reads, done using pilegrams. Details were not presented in this 10-minute talk.
Only FASTA files are needed as input; quality values come in afterwards, for consensus with Quiver. He showed E. coli, Arabidopsis, and human assembly times.
Quiver takes 40 core-minutes to run on E. coli!
Uses FASTA at the beginning; at the end, Quiver(-like?) polishing on raw reads.
It is a low-memory assembler (16 GB), but it uses a distributed file system.
Useful tweets:
Myers: G.bax.h5 ‘is a moose’ of a file; fasta to .dexta down to 1/14th the size.
Myers: Can correction be bypassed? A hard question. All pure strategies – PacBio only, then assemble.
Gene Myers: “If you are a bioinformatician without a distributed file system…shame on you”
needs only 16Gb of RAM but needs a distributed file system
GM, for Dazzler, No job takes more than 16G of memory, must have a distributed file system
Speed
Assembly times on a non-cluster computer using Dazzler: E. coli 10 mins; Arabidopsis 1 hr; human 5 days!!
Availability:
“Dazzler will be available on Apple app-store :)” (joke by @MeekIsaac).
Myers said the code is still not ready for public consumption and it will take a couple of months to get cleaned up.
Question on Transcriptome Assembly
Myers:
Q: Will you look at transcriptome assemblies?
A: A hybrid strategy would be a good thing. May not have enough PacBio copies.
Future Plan
Myers also focusing on minimizing disk consumption of pipeline #agbt14
Can do the Homo sapiens data set within 150 Gb (would require 2 Tb without compression) #AGBT14
Myers: working on compressing the raw data (bax.h5 file) 14-fold
Myers: bypassing correction? Plugging our poster (number 211), Celera people are doing this now #agbt14
Shoutouts
Jason Chin’s previous talk & HGAP paper.
Lex Nederbragt’s poster.
http://www.homolog.us/blogs/blog/2014/02/14/dazzle-assembler-pacbio-reads-gene-myers/
Dr. Robert Sebra’s work with PacBio and BluePippin made news all over the world!! http://pacbiobrothers.blogspot.com/2014/02/blue-pippin-pacbio.html
Precise Sizing and SMRT Sequencing Offer Unprecedented Read Length for Clinical Studies
“Those are easy projects because we can sequence the epigenome and finish the entire genomic assembly in a few days while maintaining a low cost.” That genome-plus-epigenome capability explains much of the demand for PacBio sequencing, because no other platform offers the ability to look at genome-wide methylation and other base modifications. Factor in the cost, Sebra says, and it’s the obvious choice. (lots more!)
http://www.sagescience.com/wp-content/uploads/2014/02/sage_pacbio_cs.pdf
At the Icahn Institute for Genomics and Multiscale Biology, scientists use automated DNA sizing together with long-read sequencing to analyze clinical samples, conduct routine surveillance on microbes, and more.
Case Study :: Mount Sinai
At the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai in New York City, technology development expert Robert Sebra, Ph.D., sees tremendous need for long-read, high-accuracy clinical sequencing for use in microbial surveillance, detection of repeat expansions, and more. To meet that demand, he relies on Single Molecule, Real-Time (SMRT®) Sequencing from Pacific Biosciences with BluePippin™ automated DNA size selection from Sage Science. Together, these tools offer a powerful solution and industry-leading read lengths that allow Sebra and other researchers to resolve repeat elements and structural variants, rapidly close microbial genomes, and measure epigenetic marks.
Sebra, an assistant professor of genetic and genomic sciences, is no stranger to the SMRT Sequencing platform: he spent five years working at PacBio helping to develop that technology. Ultimately, his belief in the system led him to join the Icahn Institute, where he would get to use the PacBio® sequencer in the field. “There was a lot to be gained by taking the technology and applying it in a clinical setting,” says Sebra, who came to Mount Sinai in 2012. “I had experienced firsthand the value of long-read sequencing and wanted to apply it to human and infectious disease research.”
Since its founding by Eric Schadt in 2011, the Icahn Institute has attracted some 150 leading scientists and clinicians who bring a network-based approach to various biological questions, many of them focused on cancer, Alzheimer’s disease, allergy and asthma, and infectious disease. Among the institute’s well-stocked core facilities are two PacBio RS II sequencers and a BluePippin instrument, which are used together for projects requiring extra-long reads.
Sebra’s idea that this kind of approach would be useful in a hospital environment was prescient. “I can’t emphasize enough the tremendous potential that I see for long-read sequencing in tackling hard-to-sequence samples in the clinical arena. The technology has led to novel results creating a rapid growth of interest as data become more accessible,” he says. Indeed, the institute has churned through some 1,800 SMRT Cells in the past year and shows no signs of slowing down. Sebra and his colleagues have already demonstrated the extraordinary value of long-read DNA sequencing for microbial and human clinical samples, and they have a slew of other projects in the pipeline.
Technology Focus
The move to a hospital and genomics institute may have offered Sebra many new opportunities to apply long-read sequencing, but it didn’t change his passion for technology development. He works with researchers and clinicians throughout the institute, helping them determine which technology solution best fits the biological question they are trying to answer. For Sebra, that means he has to be well-versed in the entire range of applications for next-generation sequencing platforms. “For PacBio, that application for diagnostics, and SMRT Sequencing is an obvious win-win for achieving these attributes in the infectious disease arena while also offering potential for novel discovery.”
As he applies long-read sequencing to these projects where it will make the biggest impact, Sebra continually
looks for ways to generate the longest possible reads. One complementary technology for the PacBio workflow is BluePippin, an automated DNA size selection platform from Sage Science. Removing smaller fragments from the sequencing library ensures that the PacBio platform focuses on the longest fragments, so accurate sizing can improve average read length considerably. “You could do a traditional pulsed field gel every time you’re trying to size select, but it takes too much time, doesn’t scale well, and the DNA input requirement is really high,” Sebra says. “BluePippin is fast and cheap, and it’s the only option for size selecting in a high-throughput fashion. We purchased one as soon as it was available.”
Since bringing in BluePippin in 2012, Sebra’s team has run more than 100 libraries using the BluePippin+PacBio combo — in fact, he says, “For projects requiring near finished genome assembly, I don’t think we’ve prepared a library without BluePippin size select since owning the instrument.” He has been pleased with the amount of size-selected library the technology yields, noting that in virtually every experiment it produces more than enough to sequence a genome to completion on the PacBio RS II. He generally excludes all fragments smaller than 10 kb to target the ultra-long fragments, but says that in cases where input DNA is especially low or the genome is quite large and requires more library, he lowers that threshold to 7 kb.
Pipeline at Work
Sebra has been pleased with the results of pairing these platforms, noting that the size selection step
has exceeded his expectations for overall improvement in read length and throughput of SMRT Sequencing. The boost to mean read length from adding BluePippin size selection ranges from about 30 percent to 125 percent, depending on the input quality, he says. Two studies — one microbial, the other human — offer a snapshot of how the pipeline is performing for ongoing efforts at the institute.
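As a rough illustration of how dropping short fragments lifts the average, here is a small Python sketch that applies a size cutoff to a library and reports the resulting boost in mean length. The fragment lengths are made up for illustration, not data from the institute:

```python
# Hypothetical sketch: estimate the mean-read-length boost from size selection
# by dropping library fragments below a cutoff, as BluePippin does physically.
# The fragment lengths below are illustrative, not real library data.

def mean_length_boost(fragment_lengths, cutoff_bp):
    """Return (mean before, mean after, percent boost) for a size cutoff."""
    kept = [n for n in fragment_lengths if n >= cutoff_bp]
    if not kept:
        raise ValueError("cutoff removes every fragment")
    before = sum(fragment_lengths) / len(fragment_lengths)
    after = sum(kept) / len(kept)
    return before, after, 100.0 * (after - before) / before

lengths = [2_000, 3_500, 5_000, 8_000, 12_000, 15_000, 20_000]
before, after, boost = mean_length_boost(lengths, cutoff_bp=7_000)
print(f"mean {before:.0f} bp -> {after:.0f} bp (+{boost:.0f}%)")  # mean 9357 bp -> 13750 bp (+47%)
```

The actual boost in practice depends on the shape of the input fragment distribution, which is why Sebra quotes a wide 30-125 percent range.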
In one project, Sebra and his colleagues are working on an ambitious, big-picture study for infectious disease surveillance that could be used internally at hospitals as well as to test external samples. Methicillin-resistant Staphylococcus aureus, or MRSA, is especially important to surveillance programs “because of the potential in characterizing community-acquired isolates,” Sebra says. The idea for this type of program is to sequence microbial samples and then conduct a phylogenetic analysis to figure out the source and history of an infection.
In one infectious disease study, the team sequenced multiple MRSA isolates using PacBio with and without BluePippin sizing, finding that prior to sizing, 50 percent of the bases were in reads of 5 kb or longer, while after sizing that threshold more than doubled to 12.5 kb. Full sequencing, from sample prep through to genome assembly, took about 48 hours and cost as little as $300 per isolate, with genomes often assembling into a single contig, Sebra notes. “The big take-home message was that we can do low-contig assemblies with just a couple of SMRT Cells,” he adds. “We could rapidly assemble isolate genomes, including plasmids, to rapidly source that isolate and improve patient treatment.” That’s one of the reasons that the PacBio technology is critical for this kind of surveillance program: those long reads allow for phasing clinically relevant plasmids in a separate circular contig. With the success of the MRSA study, Sebra says, it is now easy to “imagine scaling that approach across all infectious disease isolates.”
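The “50 percent of bases in reads of X or longer” figure quoted above is the read-length N50. A minimal sketch of how it can be computed from a list of read lengths (the lengths below are illustrative only):

```python
# A minimal sketch of the statistic quoted above: the read length L such that
# reads of length L or longer contain at least half of all sequenced bases
# (the read-length N50). The input lengths below are illustrative only.

def half_bases_length(read_lengths):
    """Smallest length whose reads (that long or longer) hold >= 50% of bases."""
    total = sum(read_lengths)
    running = 0
    # Walk reads from longest to shortest, accumulating bases until half.
    for n in sorted(read_lengths, reverse=True):
        running += n
        if 2 * running >= total:
            return n

print(half_bases_length([2_000, 4_000, 6_000, 8_000, 10_000]))  # 8000
```

On real data this is computed over millions of subread lengths, but the definition is exactly the same.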
In a separate ongoing project, Sebra and his collaborators have sequenced a standard human genome sample — known to the scientific community as NA12878 — to above 30x coverage using PacBio with BluePippin. “With informatics strength from the Bashir group, our goal is to better resolve the structural elements larger than 10,000 base pairs that were unachievable with any other technology up to this point,” he says. “We want to discover which regions of the genome are missing in the current reference so we can better associate those with disease.”
There are many genetic landscapes, from trinucleotide repeats to copy number variants or inserted elements, that are linked to disease severity, Sebra says — but they are impossible to detect in assemblies where the reads are too short to span them. By applying long-read sequencing, he and his partners hope to rescue these missing regions. Ultimately, that could make studies like genome-wide association studies more fruitful. In the clinic, Sebra envisions working with clinicians to develop targeted panels of genes with known repeats or other structural variants “to better diagnose disease severity” in a patient. The effect of BluePippin sizing was also significant in the human study, increasing the mean subread length from about 2,800 bp to almost 8,000 bp. Size selection also helps to focus sequencing on pieces of the genome that otherwise may not achieve high coverage due to mapping complexity. “Without size selection, you’ll greatly reduce the coverage of redundant regions of the genome,” Sebra says. Armed with both platforms, Sebra and collaborators are pushing ahead with their human genome work, hoping to reach even higher coverage with SMRT Sequencing to generate a more complete human reference.
Advice for Others
Many people attribute the success of Sebra’s PacBio pipeline to his years working at the sequencing company and assume that these kinds of results are out of reach for new users. That couldn’t be further from the truth, says Sebra, noting that the work done on these instruments is reproducible across users with varying levels of expertise. “Other people can absolutely roll out this pipeline,” he says. “It’s quite scalable and easy to teach these techniques. In particular, user-friendly assembly pipelines such as HGAP2 enable researchers of varying degrees of expertise to conduct complete experiments from isolation to assembly.”
He notes that the single most important ingredient for this sequencing workflow is DNA quality. “It really comes down to the DNA prep, and isolating the DNA with care, to avoid physical and chemical damage before going into the BluePippin size-selection cassette and then onto the PacBio system for sequencing,” he says. That helps to optimize both technologies to ensure the longest reads possible for the highest-quality assemblies.
As for whether the BluePippin addition is right for other scientists, there’s a simple way to determine that, according to Sebra. “If your throughput of runs is high enough, a BluePippin is really pretty affordable. Size selection reduces the number of SMRT Cells required to achieve a particular sequencing goal, so it pays for itself pretty quickly.”
The Pippin system is an automated gel electrophoresis platform designed to save scientists time and money in DNA size selection. The platform uses optical fluorescence detection of DNA separations to automatically collect size-selected fragments from pre-cast agarose gel cassettes. DNA is electro-eluted from agarose according to user-input settings, and up to five samples may be independently size selected per cassette. Samples are collected in buffer and removed by standard pipettes. Compared to manual gel purification, DNA fragments are collected with much higher accuracy and reproducibility — and with no contamination. For additional information, contact us at info@sagescience.com or 978-922-1932, or visit our website at www.sagescience.com.
© 2014 The New York Times - By Anne Eisenberg, Feb. 8, 2014 - The Path to Reading a Newborn’s DNA Map
What if laboratories could run comprehensive DNA tests on infants at birth, spotting important variations in their genomes that might indicate future medical problems? Should parents be told of each variation, even if any risk is still unclear? Would they even want to know?
New parents needn’t confront these difficult questions just yet. The more than four million babies born in 2014 in the United States will likely be screened in traditional ways — by public health programs that check for sickle cell anemia and several dozen other serious, treatable conditions. So far, DNA-based tests of infants play only a small part in screening.
But that may change in the next few years, as technology that can sequence and analyze the entire genome of a child becomes available, potentially detecting a range of inherited genetic conditions at birth. It’s the same type of analysis that now can tell adults — if they choose to ask for it — whether they are at high risk for a certain type of cancer, for example. As the technology becomes more sophisticated, it will inevitably expand into the world of newborns.
To begin dealing with the complexities of that new world, the National Institutes of Health have awarded $5 million in four pilot grants under a research program.
Genomic sequencing may reveal many problems that could be treated early in a child’s life, avoiding the diagnostic odyssey that parents can endure when medical problems emerge later, said Dr. Cynthia Powell, winner of one of the research grants and chief of the division of pediatric genetics and metabolism at the University of North Carolina School of Medicine.
The research projects are unusual in that they tightly link technical and clinical problems with ethical ones, said Dr. Edward McCabe, chief medical officer at the March of Dimes in White Plains. “So often in biomedical research, there is a siloed way of thinking,” in which technical problems are considered independent of their social implications, he said. “Here we have transdisciplinary thinking at the core.”
Jaime King, a professor at the University of California Hastings College of the Law, is, among other things, trying to create screening guidelines to suggest which genetic conditions might be mandatory for testing and follow-up, and which might be optional and left up to parents.
“This will take the wisdom of Solomon,” she said. “We will be debating it for years to come.”
One of the drawbacks of DNA tests for children — as well as for adults — is that they reveal many mutations that don’t pose problems for the people who carry them. “Many changes in the DNA sequence aren’t disease-causing,” said Dr. Robert Nussbaum, chief of the genomic medicine division at the School of Medicine of the University of California, San Francisco, and leader of one of the pilot grants. “We aren’t very good yet at distinguishing which are and which aren’t.”
For this reason, Dr. Jeffrey Botkin, a professor of pediatrics and chief of medical ethics at the University of Utah, said it was far too soon to consider comprehensive genome sequencing as a primary screening tool for newborns.
“You will get dozens of findings per child that you won’t be able to adequately interpret,” he said. “Imagine trying to explain dozens of these unknown variants to parents. And it’s an enormous psychological burden on the parents, who can’t know whether the child has an actual problem or not.” The ethical issues of sequencing are sharply different when children, rather than adults, undergo comprehensive DNA tests, he said.
Adults can decide which test information, if any, they want to receive about themselves. But children won’t usually have that option — their parents will decide. “We can’t presume that children would want or benefit from the information,” he said.
The first human genome was decoded only a decade ago, but genomics has already become a multibillion-dollar industry. If screening of newborns gains traction, established genomics companies that make DNA testing and analysis tools — companies like Illumina and Pacific Biosciences — may be beneficiaries.
Dr. Powell of the University of North Carolina wants the pilot projects to establish research data before consumer-directed programs offering comprehensive genome sequencing and analysis become widespread.
“It won’t be long before people are offered this kind of genomic screening commercially,” she said. “We want to look at it in a much more controlled setting before consumer companies get hold of it.”
Dr. Eric Green, director of the National Human Genome Research Institute, which is part of the N.I.H. and is a funding organization for the pilot grants, agreed that skepticism was warranted. “We are not ready now to deploy whole genome sequencing on a large scale,” he said, “but it would be irresponsible not to study the problem,” using the time before the technology matures to hash out difficult issues.
“We are doing these pilot studies so that when the cost of genomic sequencing comes down, we can answer the question, ‘Should we do it?’ ” he said.
“There will be an industry setting up around this,” he said. “We need to be ahead of that. We should get data to see if there is any value in sequencing all newborns.”
EMAIL: eisenberg@nytimes.com.
A version of this article appears in print on February 9, 2014, on page BU3 of the New York edition with the headline: The Path to Reading a Newborn’s DNA Map.
http://www.nytimes.com/2014/02/09/business/the-path-to-reading-a-newborns-dna-map.html?_r=0
February 12th, 2014 | Genome Assembly, Pacbio
Hybrid Assembly – (ii): The Error Models of PacBio Reads
Now that we have contigs assembled from short Illumina reads aligned onto long PacBio reads, the question of which one to trust often pops up in our minds. Let us explain the issues more clearly.
——————————————————————–
A. Often we come across regions where three-quarters of a long Illumina contig matches very well with the PacBio read (after allowing for the roughly 15% raw error rate of PacBio reads, i.e., ~85% identity), but the remaining contig is not seen anywhere nearby.
Possibilities:
Illumina is correct.
One can make a case that Illumina contig is built from hundreds of short overlapping regions, and therefore the Illumina contig is more accurate.
PacBio is correct.
One can also argue that the particular genomic region is different in two chromosomes and the PacBio read is capturing a different chromosome compared to what is assembled from Illumina. Possibly the chromosomal region has a large insertion/deletion.
——————————————————————–
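For concreteness, the ~85% identity criterion in case A can be expressed as an edit-distance calculation. This toy function is our own illustration, not code from the blog, and real aligners (e.g., BLASR) do this with far more sophisticated heuristics:

```python
# Toy illustration of the matching criterion in case A: percent identity
# between a contig segment and a noisy long read, from Levenshtein distance.
# A hit would clear the bar if identity >= ~0.85 (i.e., ~15% raw error).

def identity(a, b):
    """1 - edit_distance/max_len, via the classic dynamic-programming table."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # (mis)match
        prev = cur
    return 1 - prev[n] / max(m, n)

print(f"{identity('ACGTACGTAC', 'ACGTTACGAC'):.2f}")  # 0.80
```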
B. We also come across regions, where the Illumina contig matches PacBio closely, but has a large gap inside. The gap is usually filled with homopolymers.
Possibilities:
Once again, one can argue about both possibilities mentioned in A.
——————————————————————–
C. The third case of ambiguity is multiple copies of the same Illumina contig matching a single PacBio read.
Possibilities:
PacBio is correct.
By its design, k-mer based de Bruijn graph assembly compresses duplicated regions into one block. Therefore, the contig assembly method used for Illumina reads is incapable of resolving tandem repeat regions.
Illumina is correct.
PacBio library preparation circularizes the chromosomal fragments (SMRTbell templates), and the polymerase then goes over them again and again. Therefore, a raw PacBio read can contain multiple copies of the same chromosomal region, which the initial processing step splits at the adapter junctions into separate subreads. It is possible that this processing step missed a few circularized junctions.
—————————————————————–
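Case C’s point about de Bruijn compression can be demonstrated in a few lines: two and three tandem copies of a repeat unit produce the identical k-mer set whenever k is shorter than the unit, so the graph alone cannot recover copy number. A small sketch of our own, not from the blog:

```python
# Our own illustration (not the blog's analysis) of why k-mer de Bruijn
# assembly compresses tandem repeats: two and three copies of a repeat unit
# yield the exact same k-mer set whenever k is shorter than the unit, so the
# graph alone cannot tell the copy numbers apart.

def kmer_set(seq, k):
    """All distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

unit = "ACGTACGGA"                      # 9 bp repeat unit (made up)
two_copies = "TTTT" + unit * 2 + "CCCC"
three_copies = "TTTT" + unit * 3 + "CCCC"

# Identical k-mer sets -> identical de Bruijn graph -> copy number is lost.
print(kmer_set(two_copies, 5) == kmer_set(three_copies, 5))  # True
```

A long read that physically spans the whole repeat array does not suffer from this collapse, which is the argument for trusting PacBio in case C.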
Yesterday, we took time to meticulously work through a few cases to understand what is going on, and we will share the examples here. They are anecdotal cases rather than a systematic analysis of the entire data set, but they illustrate the points mentioned in A, B and C and help you appreciate the issues.
This commentary will be expanded with many figures and examples.
http://www.homolog.us/blogs/blog/2014/02/12/hybrid-assembly-ii-error-models-pacbio-reads/
(Bobby Sebra's #AGBT14 poster: an all-in-one Infectious Disease Pipeline using @PacBio long reads. Thursday 1:00-2:30. pic.twitter.com/2qTCXwEU2g) https://twitter.com/IcahnInstitute/status/433706208562647040/photo/1
Wednesday, February 12, 2014-Data Release: ~54x Long-Read Coverage for PacBio-only De Novo Human Genome Assembly
We are pleased to make publicly available a new shotgun sequence dataset of long PacBio® reads from a human DNA sample. We previously released sequence data using Single Molecule, Real-Time (SMRT®) Sequencing of ~10x coverage of this sample, sufficient for reference-based detection of structural variation. Today we expand on that release with additional data that increases the total sequencing coverage to ~54x. This long-read data has enabled the generation of the first de novo human genome assembly from PacBio-only sequence reads. Download the 54x long-read coverage dataset.
The dataset was generated from sequencing a well-studied human cell line (CHM1htert), which is being utilized as part of a National Institutes of Health project to sequence and assemble an alternate reference genome (the “platinum genome”). This NIH project is being led by Rick Wilson from Washington University in St. Louis and Evan Eichler from the University of Washington in collaboration with investigators from the National Center for Biotechnology Information.
This new PacBio-only genome assembly marks a continuation of recent data releases highlighting the power of long reads for generating high-quality de novo genome assemblies of increasing size and complexity (Figure 1). For the human genome, it is a follow-on from the October 2013 release of a ~10x coverage dataset for detecting structural genomic variation. Our aim is to help scientists resolve the many structural variants that have been difficult or impossible to characterize using short-read technologies. Identifying these variants, such as large deletions, inversions, and repeat elements, is a prerequisite to understanding many diseases and thereby offers great potential in biomedical research and clinical treatment. Thus, it is essential to have full and accurate representations of these variants in human genome data. In addition, we believe that higher-quality de novo assemblies of human genomes will enable a greater understanding of genetic variation in genomes at all size scales in a hypothesis-free manner, without bias from conventional reference-guided approaches.
Figure 1. Progress of PacBio-only de novo assembly. (For sources, see References 1-6 below.)
Just one sequencing library type was required for this effort, in the form of ~20 kb long-insert shotgun libraries, which were size-selected using the BluePippin™ platform from Sage Science and sequenced with our P5-C3 chemistry on the PacBio RS II using 180-minute movies.
Below are some sequencing statistics of the dataset:
• Total number of reads: 21,856,161
• Total number of post-filtered bases: 167,851,128,644 bp
• Average throughput/SMRT Cell: 608 Mb
• Average read length: 7,680 bp
• Half of sequenced bases in reads greater than: 10,739 bp
• Longest DNA insert sequenced: 42,774 bp
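The figures above are internally consistent; a quick sanity check (assuming a round ~3.1 Gb human genome size, which is our assumption, not stated by PacBio) recovers the quoted average read length and the ~54x coverage:

```python
# Quick consistency check of the release statistics: average read length is
# total bases / total reads, and total bases / genome size gives coverage.
# The 3.1 Gb genome size is an assumed round figure for illustration.

total_reads = 21_856_161
total_bases = 167_851_128_644
genome_size = 3.1e9  # assumed approximate human genome size

mean_read = total_bases / total_reads
coverage = total_bases / genome_size
print(f"mean read length ~{mean_read:,.0f} bp, coverage ~{coverage:.0f}x")
# mean read length ~7,680 bp, coverage ~54x
```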
Figure 2. Subread length distribution. A subread is a DNA insert sequenced between two SMRTbell™ hairpin adapters. The solid black line (right y axis) denotes the amount of sequenced bases greater than a given subread length (x axis).
This project also offered opportunities to apply the current Hierarchical Genome Assembly Process (HGAP) tool chain for generating a first PacBio-only de novo assembly of a human genome. This assembly represents the initial result straight out of the assembly pipeline, and we and our collaborators are now working on curating and polishing the assembly. We teamed up with Google to use the Google® Cloud Platform for the most computationally intensive part of the HGAP pipeline. In a single day, the platform executed 405,000 CPU hours to align the long reads to each other. The output alignment data was transferred back to PacBio for generating pre-assembled reads using a modified version of FALCON. We then used Celera® Assembler 8.1 to generate the assembly, and our consensus caller Quiver was applied for the final sequence. The pipeline produced a 3.25 Gb assembly with a contig N50 of 4.38 Mb, and the longest contig of 44 Mb. In comparison, the most recent reference-guided assembly using Illumina® sequencing and BAC-clone finishing on the same sample had a total assembly size of 2.83 Gb and a contig N50 of 144 kb (Figure 3).
Figure 3. Historical comparison of human genome de novo assemblies, from the 2007 HuRef assembly to the 2014 PacBio-only assembly. Data sources: HuRef (Venter); BGI YH; KB1; NA12878; RP11_0.7; 2013 CHM1.
This project will be highlighted in several presentations at this week’s Advances in Genome Biology and Technology (AGBT) conference, including more details on the assembly process during Jason Chin’s presentation, entitled “String Graph Assembly For Diploid Genomes With Long Reads,” on Friday at 8:30 pm. The data will also be highlighted in the PacBio workshop from our CSO Jonas Korlach on Friday at 2:40 pm. By releasing this dataset, we hope to support the bioinformatics community, along with our own efforts, to further develop and optimize computational algorithms and genome assembly pipelines for large-scale genome assemblies and structural variant detection using SMRT Sequencing. We look forward to the generation of many additional high-quality human genome de novo assemblies to reveal new insights into human genetics.
http://blog.pacificbiosciences.com/2014/02/data-release-54x-long-read-coverage-for.html?utm_content=bufferf6577&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
First word cloud of tweets: "54x", "Data", "Human", "Genome" are the buzzwords. https://twitter.com/erlichya/status/433473367367098368/photo/1
PacBio Pipeline Off to a Strong Start for 2014
Posted on February 10, 2014 by nsengamalay. It has been a busy January for our PacBio RS II instrument. We are excited to report a new record yield from a single SMRT Cell: 896,457,524 passed-filter bases! It seems we are not far off from hitting 1 Gb.
http://www.igs.umaryland.edu/labs/grc/2014/02/10/pacbio-pipeline-off-to-a-strong-start-for-2014/
PacBio Releases 54x Coverage Human Genome Data
February 10, 2014 by nextgenseek -- Just ahead of the upcoming AGBT 2014 at Marco Island, Florida, PacBio released 54x coverage human genome sequence data for public use. The new PacBio data was generated using its P5-C3 sequencing chemistry on a well-studied human haploid cell line (CHM1htert). PacBio unveiled the P5-C3 chemistry in fall 2013; it produces sequence data with read lengths greater than 8,500 bp and about 50% of reads over 10,000 bp in length.
CHM cells have a diploid genome resulting from replication of a haploid paternal (sperm) genome. It is the same cell line being used to sequence and assemble an alternate reference genome (the “platinum genome”) by Rick Wilson of Washington University in St. Louis and Evan Eichler of the University of Washington, in collaboration with investigators from the National Center for Biotechnology Information (NCBI). So a variety of sequence data is already available for the same cell line, and the addition of high-depth long PacBio reads will be a great complement.
PacBio said one of the main reasons it is making the human genome dataset public is to “accelerate the understanding of genome-wide variation at all genome size scales, and to improve assembly techniques”. PacBio will give multiple presentations on the data at the AGBT including
• Gene Myers’ talk “A De Novo Whole Genome Shotgun Assembler for Noisy Long Read Data”
• PacBio’s Senior Director of Bioinformatics Jason Chin’s talk “String Graph Assembly For Diploid Genomes With Long Reads”
PacBio also claimed that the use of PacBio long-read data to create de novo assemblies of human genomes using Hierarchical Genome Assembly Process (HGAP) in collaboration with Google has resulted in a 3.25 Gb assembly with a contig N50 of 4.38 Mb, and with the longest contig being 44 Mb. Compare this to total assembly size of 2.83 Gb and a contig N50 of 144 kb from the most recent reference-guided assembly using Illumina and BAC-clone finishing on the same sample.
The PacBio data is not available yet, but will be released on February 12th on the PacBio blog. PacBio released 10x coverage data from the same cell line just ahead of ASHG 2013. PacBio also released both DNA and RNA-seq data from multiple organisms last year. Here is a link pointing to some of the PacBio long-read data.
Announcing the release of the high-depth PacBio human data, Michael Hunkapiller, Chief Executive Officer of Pacific Biosciences, said:
Recent performance increases on the PacBio RS II — notably substantial improvements in throughput — are allowing scientists to approach much larger genome sequencing projects. The list of recent accomplishments using SMRT Sequencing includes agriculturally valuable plants such as spinach, model organisms such as Drosophila and Arabidopsis, and now, very high-quality work on human genomes. We are delighted with the diversity of applications that will be showcased by our customers this year at AGBT, including unprecedented work on the human genome.
http://nextgenseek.com/2014/02/pacbio-releases-54x-coverage-human-genome-data/
(AGBT) February 12-15, 2014. Poster number 211.
Towards correction-free assembly of raw PacBio reads. http://figshare.com/articles/Towards_correction_free_assembly_of_raw_PacBio_reads/928231