LT.Swing trade!
David Glazer
Shared publicly - Today - 9:43 AM Great use of Google Cloud for genome assembly. (If you don't know what the jargon means, this is about putting together a 3-billion-piece jigsaw puzzle, out of random chunks of a few hundred-to-thousand pre-assembled pieces, without looking at the picture on the cover.)
Pacific Biosciences collaborated with Google to leverage the Google Cloud Platform for the most computationally intensive part of the assembly pipeline. In a single day, the pipeline executed 405,000 CPU hours to align the long reads to each other. These data ... resulted in a 3.25 Gb assembly with a contig N50 of 4.38 Mb, and with the longest contig being 44 Mb. This represents over an order of magnitude better N50 than the most recent reference-guided assembly ... on the same sample, which had a total assembly size of 2.83 Gb and a contig N50 of 144 kb. https://plus.google.com/+DavidGlazer/posts/7m1T8EXPSYV#+DavidGlazer/posts/7m1T8EXPSYV
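For anyone unfamiliar with the "contig N50" figure quoted above, here is a minimal sketch of how that statistic is commonly computed. The function name and the toy contig lengths are made up for illustration; they are not PacBio's data or code.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together hold at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example with made-up contig lengths (in bases).
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A higher N50 means more of the assembly sits in long contigs, which is why the jump from 144 kb to 4.38 Mb is being highlighted.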
Whole story from incomplete link on 1/28/14: PacBio Demos First De Novo Animal Genome as it Plans
Longer Reads, Increased Throughput
January 28, 2014
By Monica Heger
Researchers have sequenced and de novo assembled the Drosophila melanogaster genome on Pacific
Biosciences' RS II — the first time an animal genome has been sequenced and assembled solely with
PacBio technology — and have produced a genome with fewer gaps and longer contigs than the current
reference.
Sergey Koren, a bioinformaticist at the National Biodefense Analysis and Countermeasures Center and
University of Maryland, developed software for error correction of PacBio reads dubbed PBcR, and
presented on the Drosophila assembly at the International Plant and Animal Genome meeting in San
Diego earlier this month.
Additionally, the company is planning this year to increase its throughput four-fold to achieve 1 gigabase
of data per SMRT cell and average read lengths greater than 10-15 kilobases, as well as improvements to
sample prep and new methods for assembly of diploid genomes.
The Drosophila genome, estimated to be around 140 megabases, but potentially as large as 220
megabases, was sequenced in six days using 42 SMRT cells to 90-fold coverage and produced average
read lengths of 10 kilobases. Using the Celera assembler, the researchers constructed a haploid assembly
in 128 contigs with an N50 length of 15 megabases and a maximum contig length of 24.6 megabases.
Total turnaround time from sample to final assembly was six weeks.
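The numbers in this paragraph can be cross-checked with some back-of-the-envelope arithmetic. This is only a rough sketch: it assumes the 90-fold coverage was calculated against the 140-megabase estimate, which the article does not state, and the variable names are my own.

```python
# Figures quoted in the article. Using 140 Mb as the coverage
# denominator is an assumption, not something the article confirms.
GENOME_SIZE_BP = 140_000_000        # lower genome-size estimate
UPPER_GENOME_SIZE_BP = 220_000_000  # upper genome-size estimate
FOLD_COVERAGE = 90
SMRT_CELLS = 42
MEAN_READ_LENGTH_BP = 10_000


def total_bases(genome_size_bp, fold_coverage):
    """Total sequenced bases implied by a given fold coverage."""
    return genome_size_bp * fold_coverage


def coverage(total_bp, genome_size_bp):
    """Fold coverage implied by a given amount of sequence."""
    return total_bp / genome_size_bp


sequenced_bp = total_bases(GENOME_SIZE_BP, FOLD_COVERAGE)
bases_per_cell = sequenced_bp / SMRT_CELLS
approx_reads = sequenced_bp / MEAN_READ_LENGTH_BP
coverage_if_220mb = coverage(sequenced_bp, UPPER_GENOME_SIZE_BP)

print(f"total sequenced:      {sequenced_bp / 1e9:.1f} Gb")
print(f"per SMRT cell:        {bases_per_cell / 1e6:.0f} Mb")
print(f"approx. reads:        {approx_reads / 1e6:.2f} million")
print(f"coverage at 220 Mb:   {coverage_if_220mb:.1f}x")
```

Under these assumptions the run works out to roughly 300 Mb per SMRT cell, which is in the same ballpark as the planned four-fold increase to about 1 gigabase per cell.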
PacBio scientists collaborated on the project with researchers involved in the
Berkeley Drosophila Genome Project, and researchers from the University of Maryland and the
University of Manchester.
According to Sue Celniker, co-director of the Berkeley Drosophila Genome Project, the PacBio-only
assembly is a huge improvement over the reference genome, which is currently in its fifth iteration.
Researchers involved in the Berkeley Drosophila Genome Project have spent over 10 years working on
the reference genome using a combination of Sanger sequencing, BAC clones, and other manual and
labor-intensive approaches. Yet, using just one next-gen sequencing technology, and over just six weeks,
the PacBio technology was able to piece together regions that have proved particularly troublesome, like
heterochromatin and the Y chromosome, she said.
"There's been some persistent repeats that we couldn't get through, that [PacBio] did," she told In
Sequence. "Having those very long reads allows you to get through large arrays of repeats."
Researchers are still evaluating and comparing the PacBio assembly to the reference, so Celniker said she could not precisely say how many of the remaining gaps the PacBio assembly was able to close. However, it is already clear that in some cases the long reads were able to generate a more contiguous
sequence than the reference. For instance, chromosome 2R was reduced to two pieces in the PacBio
haploid assembly from 27 pieces in the reference. Chromosome 2L was reduced to between 4 and 6
pieces from 6 pieces, and chromosomes 3L and 3R were reduced to 1 and 3 pieces in the PacBio
assembly from 22 and 15 pieces, respectively.
Additionally, in the most recent release of the Drosophila reference genome, only around 1 percent of
chromosome Y is represented. While the BDGP researchers have since assembled around 7.5 percent of
the Y chromosome, the team anticipates that more than half of the Y chromosome will be assembled with
the PacBio data.
Part of the reason for less Y representation in the reference genome is that the fly DNA was taken from
embryos, so there is no way to know whether male or female DNA was being used, Casey Bergman, a
senior lecturer in computational and evolutionary biology at the University of Manchester, told IS. But in
the PacBio collaboration, only male flies were used, he said.
Bergman's lab became involved with the project last summer after it released a dataset of whole-genome shotgun sequences of the Drosophila reference strain generated using PacBio technology, as well as
Illumina sequences that it used to error-correct the PacBio reads. The company contacted Bergman to
collaborate on generating data and doing de novo assembly using its newer sequencing chemistry.
Bergman said that this Drosophila genome validates PacBio's technology for use in de novo assembly, and
shows the value of long reads. Genomes that have been assembled using short-read sequencing
technology, like the panda genome, are put together in contigs that are tens of kilobases, he said. But,
the Drosophila has an N50 of 12 megabases. "That is chromosome-sized segments. It is what was
declared finished for many genomes 10 years ago, and is of much higher contiguity and sequence
quality," he said.
Short-read sequencing technology is valuable for applications like identifying genes or fragments of
genes, and enables many genomes to be sequenced cost-effectively, but it doesn't give you the long-range
architecture, Bergman said.
The PacBio-only assembly also has some advantages over the hybrid PacBio/Illumina assembly,
Bergman said.
One problem with error correction, he said, is that Illumina technology does not sequence well through
repetitive regions, so the Illumina-corrected reads in those repetitive regions are not as good. "You don't
really get the gain in the regions of the genome where you need them for the long-range assemblies," he
said.
Adam Phillippy with the National Biodefense Analysis and Countermeasures Center, who worked on the
assembly, agreed. In theory, a hybrid assembly approach is beneficial because it combines two orthogonal
technologies and can take advantage of the strengths of both, he said. And indeed, in many genomic regions, a hybrid assembly works well. But, since short reads do not align well to certain regions, like
repeats, it is difficult to use short reads for error correction in those regions.
"Short reads are notoriously hard to map against a repetitive genome," Phillippy said. "It's much easier to
align long reads to long reads, so you assemble the repeats much more effectively."
Phillippy and Koren last year published a study in Genome Biology, estimating a cost of about $1,000
for de novo sequencing and assembly of microbes with PacBio technology. Additionally, the researchers
compared self-correction to hybrid correction and found that self-correction was often better in terms of
accuracy and contiguity.
Phillippy said that he expects these conclusions for microbial genomes to carry over to larger genomes,
especially as throughput and read lengths continue to increase, and the Drosophila genome is the first
evidence of that.
Further improvements
Looking ahead, Jonas Korlach, PacBio's CSO, said that the company is planning further improvements to
its read lengths and throughput this year.
The company plans to increase throughput to 1 gigabase per SMRT cell and average read lengths to
greater than 10-15 kilobases. An increase in read length will be achieved by several factors, Korlach said.
The company continues to study different polymerases and is working out ways to optimize the signal
from the nucleotide.
For instance, in its latest sequencing chemistry, P5-C3, the company incorporated a protective scaffolding
strategy, which reduces photo damage to the polymerase and enables longer reads. Korlach said that the
company continues to improve upon this strategy. Additionally, the company has found that "nicks or
damage to the DNA template can stall the polymerase and thereby reduce read length," so researchers are
looking at ways to do more "efficient DNA damage repair during sample preparation."
Korlach added that the company is also looking at ways to improve loading efficiency, which would also
increase throughput. Each SMRT cell contains 150,000 zero-mode waveguides, each of which has the
potential to be occupied by a polymerase and template complex. However, the current method of loading
is limited by Poisson statistics, meaning that only about one-third of the ZMWs will be occupied with one
polymerase-template complex with the remainder occupied by either none or more than one complex,
Korlach said. "However, we believe through improvements in loading, we can at least double the amount
that we are currently loading per SMRT cell."
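The "about one-third" figure follows from Poisson statistics. Here is a minimal sketch, assuming the number of polymerase-template complexes landing in each ZMW is Poisson-distributed with mean `lam` (a hypothetical parameter name; the article gives no loading rate). The single-occupancy fraction peaks at 1/e, about 37 percent, when the mean is one complex per ZMW.

```python
import math

ZMWS_PER_CELL = 150_000  # figure quoted in the article


def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson mean of lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def loading_fractions(lam):
    """Fractions of ZMWs that are empty, singly loaded, or multiply loaded."""
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    multi = 1.0 - empty - single
    return empty, single, multi


# Compare a few assumed mean loadings; single occupancy is best at lam = 1.
for lam in (0.5, 1.0, 2.0):
    empty, single, multi = loading_fractions(lam)
    usable = round(single * ZMWS_PER_CELL)
    print(f"lam={lam}: empty={empty:.3f} single={single:.3f} "
          f"multi={multi:.3f} usable ZMWs~{usable}")
```

This is why purely random loading cannot do much better than about a third of the ZMWs, and why beating that limit would require a loading method that is not purely Poisson.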
These improvements, which will be delivered over the year, will come in the form of software upgrades
and a new sequencing kit, Korlach said. None will require a hardware upgrade or installation.
Monica Heger tracks trends in next-generation sequencing for research and clinical applications for
GenomeWeb's In Sequence and Clinical Sequencing News. E-mail Monica Heger or follow her GenomeWeb
Twitter accounts at @InSequence and @ClinSeqNews. -- http://files.pacb.com/pdf/MediaCoverage_Demos_FirstDeNovoAnimalGenome.pdf
Publication date Jan 23, 2014 -- Hierarchical genome assembly method using single long insert library
The present invention is generally directed to a hierarchical genome assembly process for producing high-quality de novo genome assemblies. The method utilizes a single, long-insert, shotgun DNA library in conjunction with Single Molecule, Real-Time (SMRT®) DNA sequencing, and obviates the need for additional sample preparation and sequencing data sets required for previously described hybrid assembly strategies. Efficient de novo assembly from genomic DNA to a finished genome sequence is demonstrated for several microorganisms using as little as three SMRT® cells, and for bacterial artificial chromosomes (BACs) using sequencing data from just one SMRT® Cell. Part of this new assembly workflow is a new consensus algorithm which takes advantage of SMRT® sequencing primary quality values, to produce a highly accurate de novo genome sequence, exceeding 99.999% (QV 50) accuracy. The methods are typically performed on a computer and comprise an algorithm that constructs sequence alignment graphs from pairwise alignment of sequence reads to a common reference.
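The "99.999% (QV 50)" pairing uses the standard Phred scale, where QV = -10 * log10(error probability). A minimal sketch of the conversion follows; the function names and the 5 Mb genome size are illustrative assumptions, not taken from the patent.

```python
import math


def accuracy_to_qv(accuracy):
    """Convert a per-base accuracy (e.g. 0.99999) to a Phred quality value."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0")
    return -10.0 * math.log10(error)


def qv_to_accuracy(qv):
    """Convert a Phred quality value back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def expected_errors(genome_size_bp, qv):
    """Expected number of wrong bases in a consensus of the given size."""
    return genome_size_bp * 10.0 ** (-qv / 10.0)


print(f"99.999% accuracy -> QV {accuracy_to_qv(0.99999):.1f}")
print(f"99.99%  accuracy -> QV {accuracy_to_qv(0.9999):.1f}")
# Hypothetical 5 Mb microbial genome, chosen only for illustration.
print(f"expected errors in 5 Mb at QV 50: {expected_errors(5_000_000, 50):.0f}")
```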
Advances in biomolecule sequence determination, in particular with respect to nucleic acid and protein samples, have revolutionized the fields of cellular and molecular biology. Facilitated by the development of automated sequencing systems, it is now possible to sequence an entire genome, for example, of a micro-organism. However, the quality of the sequence information must be carefully monitored, and may be compromised by many factors related to the biomolecule itself or the sequencing system used, including the composition of the biomolecule (e.g., base composition of a nucleic acid molecule), experimental and systematic noise, variations in observed signal strength, and differences in reaction efficiencies. As such, processes must be implemented to analyze and improve the quality of the data from such sequencing technologies.
The standard of sequencing accuracy was set to 99.99% by the National Human Genome Research Institute (NHGRI) in 1998. While a single base-call for each position in a template may not achieve such accuracy, with increases in coverage multiple overlapping sequencing reads for a template sequence having lower raw read accuracy can be used to determine a consensus sequence with acceptably high accuracy. Consensus calling algorithms attempt to distinguish sequencing error from variants (e.g., SNP's) using multiple “queries” for a given position. A variety of such algorithms have been developed to address changes in sequencing coverage, error profiles, and information accompanying base-calls as new sequencing systems are developed, e.g., [...]
Most third party genome assemblers, e.g., Celera®Assembler®, assume that the overlap between the reads can be detected with high identity. For example, an overlap might be called when the identity in the alignment between two reads is above 94%. While it is not necessary to assemble the sequence of an entire genome using such stringent requirements, (e.g., the ALLORA assembler from Pacific Biosciences, Menlo Park, Calif., can use reads that only have 70% identity between each other), it remains preferable to construct inputs whose overlap can be detected with high identity before passing them to a third party assembler. Moreover, when there are repeats in a genome, it is also favorable to generate input that can clearly distinguish the different repeats. Finally, it is also preferable that some artifacts, e.g., chimeric reads and high quality region identification errors, due to sequencing reactions, can be filtered out before the assembly step.
Sequencing technologies that combine reads from libraries of different lengths of DNA have been developed to generate reads that can satisfy the more stringent input requirements for third party assemblers. However, most of these methods require preparation and separate sequencing of multiple DNA libraries.
The hierarchical genome assembly process starts with using the longer reads to put other reads together, in a similar manner to a sequence assembly process. The method utilizes certain special features of SMRT® sequencing wherein the read length distribution is not a constant but an exponential one. It is understood that, for a typical sequencing run with a long inserted library, the probability P(l) of obtaining a read with read length l, is proportional to exp(-l/L), where L is the average read length. In other words, SMRT® sequencing produces not only shorter fragments but also a number of longer ones. An alignment algorithm (e.g., as implemented in a program such as BLASR, from Pacific Biosciences, Menlo Park, Calif.) can be used to align all the reads to a longer read, thereby creating a mini-assembly for each long read.
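To see what the exponential read-length model implies, here is a small sketch. It assumes the density is exactly (1/L) * exp(-l/L), the normalized form of the proportionality stated above; the function names and the 3,000-base mean are illustrative only.

```python
import math


def fraction_of_reads_longer_than(cutoff, mean_length):
    """Share of reads with length >= cutoff under an exponential model."""
    return math.exp(-cutoff / mean_length)


def fraction_of_bases_in_reads_longer_than(cutoff, mean_length):
    """Share of all sequenced bases carried by reads with length >= cutoff.

    Integrating l * (1/L) * exp(-l/L) from the cutoff to infinity and
    dividing by the mean L gives (1 + cutoff/L) * exp(-cutoff/L).
    """
    x = cutoff / mean_length
    return (1.0 + x) * math.exp(-x)


MEAN_LENGTH = 3_000  # hypothetical average read length, in bases
for cutoff in (3_000, 6_000, 9_000):
    reads = fraction_of_reads_longer_than(cutoff, MEAN_LENGTH)
    bases = fraction_of_bases_in_reads_longer_than(cutoff, MEAN_LENGTH)
    print(f"cutoff {cutoff}: {reads:.1%} of reads, {bases:.1%} of bases")
```

The point is that long reads are a minority of reads but carry a disproportionate share of the bases, which is what makes it feasible to use only the longest reads as a backbone.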
In order to utilize all continuous long reads (CLR's) from raw sequencing data, for example as generated by the PacBio® RS®, the longer portion of the raw reads, using a pre-specified length cutoff, Icutoff, are extracted to provide the “seeds” for constructing pre-assemblies. These seed reads are used to recruit other reads as a scaffold. It is desirable to achieve about 15-20× genome coverage of such seed sequences so that a sufficient amount of coverage of pre-assembled reads will be generated for the subsequent assembly. The pre-assembled reads are constructed by aligning all reads to each of the seed reads. Each read is mapped to multiple targeted seed reads using the program BLASR (Chaisson and Tesler 2012). The number of read hits mapping to the seed sequences is controlled by the “-bestn” parameter when calling the program BLASR for mapping. Such number should be smaller than the total coverage of the seed sequences on the genome. If the “-bestn” number is too high, it is likely that reads from similar repeats will be mapped to each other, which could result in consensus errors. Conversely, if the chosen “-bestn” number is too low, the quality of the pre-assembly consensus may be decreased. The optimal choice might also depend on DNA fragment library construction, which can affect the subread length distribution. A preferred value of “-bestn” is 12 reads to map to the seed reads. Further study will allow a reasonable choice for optimized results.
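The seed-selection step described above can be sketched roughly as follows. This is not the patent's implementation: `choose_seed_cutoff`, the toy read lengths, and the toy genome size are all hypothetical, and the sketch only picks a length cutoff that reaches a target seed coverage. It does not run BLASR or any alignment.

```python
def choose_seed_cutoff(read_lengths, genome_size, target_coverage):
    """Pick a length cutoff so that reads at or above it reach the target coverage.

    Returns the length of the shortest read needed, or None if all reads
    together fall short of genome_size * target_coverage bases.
    """
    needed = genome_size * target_coverage
    running = 0
    # Take reads from longest to shortest until enough bases are collected.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running >= needed:
            return length
    return None


# Toy data: made-up read lengths and a made-up 1,000-base "genome".
reads = [9000, 7000, 5000, 3000, 2000, 1000, 1000]
cutoff = choose_seed_cutoff(reads, genome_size=1000, target_coverage=15)
seeds = [length for length in reads if length >= cutoff]
print(cutoff, seeds)
```

In the workflow described in the text, all reads would then be mapped to these seeds, with the number of hits per read limited by the "-bestn" setting.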
(For the full story, use the link) http://www.google.com/patents/US20140025312?utm_content=buffer8c1e8&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Sorry about that! I'll see what I can do. Unfortunately, I watch very few movies. This is where I got the link: https://twitter.com/Craigledee (Gordon Gekko) http://wallstreet.wikia.com/wiki/Gordon_Gekko
Tuesday, January 28, 2014 (PacBio Blog) At Plant & Animal Genome Workshop, Users Showcase Projects Enabled by SMRT Sequencing
Earlier this month, we hosted a workshop at the International Plant & Animal Genome (PAG) conference in San Diego entitled “A SMRT® Sequencing Approach to Reference Genomes, Annotation, and Haplotyping.” PacBio users presented data on various projects that have benefited from long-read sequence data, including several that had previously been attempted with short-read technologies without success. We were delighted to see reports on newer features of SMRT Sequencing, including full-length isoforms, automated haplotyping, and more. Here’s a recap, as well as links to video recordings of the presentations:
Chongyuan Luo, a scientist from Joe Ecker’s lab at the Salk Institute for Biological Studies, offered a presentation on genomic and epigenetic variations across model organism Arabidopsis thaliana. He used SMRT Sequencing to resolve three strains of the plant, sequencing each to more than 50x coverage. Compared to short-read sequence data, PacBio® data correctly identified more than 200,000 SNPs previously missed in each strain; most were enriched in the peri-centromere region. Because of that, Luo recommends using only PacBio data for a genome assembly. His team also achieved their goal of detecting structural variants that have been underrepresented by genome assemblies from short-read data.
Watch recording: Resolving the Complexity of Genomic and Epigenomic Variations in Arabidopsis
Shane Brubaker, bioinformatics director at Solazyme, Inc., talked about the need for a high-quality reference genome for a strain of algae that his company uses to produce renewable oil. The company first tried short-read sequence data, but couldn’t get through the GC-rich genome. Using PacBio sequencing, the team not only fully sequenced the genome — assembling it into just a few contigs per chromosome that even included centromere sequence — but also built a tool to perform automated haplotyping and later conducted allele-specific expression analysis. The final assembly accurately represented the diploid genome, Brubaker said, noting that CCS reads alone exceeded Sanger quality at far lower cost. “You can now get a reference assembly that is essentially finished quality without doing all those gap-closing steps,” he said. Watch recording: Assembly, Haplotyping, and Annotation of a High GC Algal Genome
Allen Van Deynze, director of research at the University of California, Davis, Seed Biotechnology Center, spoke about a spinach genome sequencing project. The plant is important in its own right, but sequencing became more urgent in an effort to find genes that confer resistance against a downy mildew that is destructive to the crop. Van Deynze reported a draft genome sequence using SMRT Sequencing (Quiver polishing was still underway at the time of the workshop) that already showed a marked improvement in N50 contig length compared to a previous short-read assembly of the genome. Watch recording: A De Novo Draft Assembly of Spinach Using Pacific Biosciences Technology
From USDA’s Agricultural Research Service, molecular biologist Sean Gordon discussed the need for long-read sequencing to map an organism’s transcriptome. His team analyzed the wood-decaying fungus Plicaturopsis crispa first with short reads and found that they were missing exons and other important information. “There is no path from short reads to accurate isoforms,” he said. They switched to SMRT Sequencing so they could observe, rather than infer, full-length transcripts. Gordon showed one particular gene to illustrate the success of the approach: with short-read sequencing, this gene was predicted to have six isoforms; with PacBio, the team observed and confirmed 118 isoforms instead. He also noted that generating a transcriptome from PacBio data does not require a reference genome. His team did have a reference for P. crispa, however, which they used to double-check the PacBio results and found them to be highly accurate. Gordon said that the long reads also enabled unexpected findings, such as abundant read-through transcription, in which multiple ORFs occurred in a transcript. (The recording is not available at this time.)
Finally, our own Edwin Hauw spoke about the PacBio technology roadmap (link: http://blog.pacificbiosciences.com/2014/01/looking-ahead-2014-pacbio-technology.html) for the coming year. Sample prep improvements are expected to reduce input DNA requirements (down to 10-100 ng), improve preps for longer insert sizes, and streamline kits. A new C4 chemistry is expected to extend average read lengths to 10-15 Kb this year, with the long-term goal of generating about 1.6 Gb per SMRT Cell. PacBio is also planning to focus on data analysis improvements, including an easy-to-use GUI for isoform sequencing and tools for viral minor-variant detection and long-amplicon haplotype analysis. In addition, Hauw told users that PacBio is working to provide better assemblers for diploid de novo genomes or low-coverage genomes, as well as a faster version of Quiver and regional methylation detection, including 5mC without bisulfite conversion, with an expected release date later in the year. Watch recording: SMRT Sequencing Road Map
http://blog.pacificbiosciences.com/2014/01/at-plant-animal-genome-workshop-users.html
The Science Web; DNA sequencing company announce totally predictable plans.
Posted on January 29, 2014
Pacific Biosciences, the American DNA sequencing company, have announced typically predictable plans for their platform, the Pac Bio RS II.
The company announced on Tuesday that they would focus on “more reads, longer reads, and higher quality data”. Of course, this has been the goal of all sequencing technologies since the 1970s, and many researchers expressed surprise that PacBio seemed to have been following a different strategy up until now.
“With our unique 15% error rate, our aim was to produce a technology that would serve only a small set of niche markets”, sources at the company said. “Therefore, we produced just a few thousand really poor quality, relatively short reads” they continued.
All of this changed in 2012, when a new CEO and Chief Scientific Officer joined the company, and introduced the decades-old strategy of actually producing something useful.
“We came in and we took a look at the data, and we asked what the strategy was” said Michael Caterpiller, CEO. “The board thought that their really crappy error rate gave them a unique niche in the market, and that niche needed to be defended. Some in the company actually thought we should increase the error rate, and blow everyone else out of the market in terms of poor quality data” he continued.
“So we thought – why not introduce the strategy of every other sequencing company and increase throughput, read length and quality?” continued Jonas Coolback, chief scientific officer. “It was a revolutionary idea – all of a sudden, we’d made PacBio relevant again. We started producing what researchers had wanted since the very beginning”.
Share prices in PacBio have increased steadily since 2012, and many now see the technology as the future of genome sequencing. “I don’t know why they didn’t do this sooner” said Gordon Gecko of the Satanic Investment Bank.
http://thescienceweb.wordpress.com/2014/01/29/dna-sequencing-company-announce-totally-predictable-plans/
PacBio Demos First De Novo Animal Genome as it Plans Longer Reads, Increased Throughput
January 28, 2014
http://www.genomeweb.com/sequencing/pacbio-demos-first-de-novo-animal-genome-it-plans-longer-reads-increased-through?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+genomeweb%2Finsequence+%28In+Sequence%29
On the previous post, I left out an important part of Mick Watson's comment (PACBIO): 3 hrs ago -- @ctitusbrown very glad this fell in your lap. Be interested in what Moleculo can bring to the table when you have PACBIO !!!!
Sun 26 January 2014-- Living in an Ivory Basement Stochastic thoughts on science, testing, and programming.
Posted the Chick Genome Improvement Grant
By C. Titus Brown
In science.
tags: assembly
I've just posted the narrative for a recently funded USDA grant on improving the quality of the chick genome assembly on the lab's research page. The issues are laid out in detail in the grant, but, basically, the question is: how can we improve the quality of the assembly? The answer, we think, is to pursue a range of strategies that include additional sequencing to get at the microchromosomes, as well as improved assembly merging and scaffolding tools capable of dealing with a range of sequencing data types.
For this genome in particular, we now have Sanger, 454, Illumina, PacBio, and Moleculo data. How do you cross-evaluate that data, much less combine it all? Interesting questions!
We do know that reasonably sizeable chunks of the chick genome are missing or unscaffolded, in part because they're hard to sequence and in part because they're hard to assemble. The PacBio data is already leading to significant improvement in galGal 4, and now we're trying to figure out how to make use of the Moleculo data, too.
One particularly interesting approach I'm working on is to release some or all of the data so that assembler authors can experiment with all of this data. In particular, it should be possible to release a small subset of the data for whatever is not represented in the current assembly; this certainly includes a bunch of microchromosomes. I'll keep you posted.
--titus
p.s. Remember when I didn't work on euk genome assembly? Yeah, me too. I can already tell I'm going to long for the days of "simple" metagenome and transcriptome assembly work ;)
http://ivory.idyll.org/blog/2013-posted-chick-improvement-grant.html
(Comments on this article)
Mick Watson, 3 hrs ago -- @ctitusbrown very glad this fell in your lap. Be interested in what Moleculo can bring to the table when you have [truncated] 10:22 AM - 26 Jan 2014
Titus Brown, 4 hrs ago -- I just posted our grant narrative from grant to improve chick genome: chick-improvement-grant.html … Sanger, Illumina, PacBio, Molecula, oh my! 9:20 AM - 26 Jan 2014
Titus Brown, 3 hrs -- anticipate much improved assembly by q3; PacBio scaffolding 10:42 AM - 26 Jan 2014
PacBio Blog-Thursday, January 23, 2014- PacBio Service Provider DNA Link Sees Soaring Global Demand for SMRT Sequencing
Since its founding in 2000, the service provider team at Korea-based DNA Link has sought to differentiate itself from other facilities by being an early adopter of new technologies. The company started during the height of the Human Genome Project and initially offered genotyping assays such as SNaPshot® and TaqMan®. DNA Link later adopted microarrays, and when the next-generation sequencing (NGS) wave hit, the scientists quickly embraced the technology.
DNA Link scientists purchased their PacBio® system virtually as soon as it was commercially available. Today, customer demand for Single Molecule, Real-Time (SMRT®) Sequencing is soaring — making DNA Link’s expertise in running the platform a prime asset. Most projects run on the PacBio system focus on de novo sequencing, but there is growing interest in SNP detection, haplotype phasing, and characterizing repeat regions, particularly in plant and animal genomes.
DNA Link serves several hundred scientists in its home country, which is currently its largest market, but interest from abroad is rising. This month, the company will start operations at its first US branch, located in San Diego, to help build a global presence for the rapidly growing team.
One of the qualities that differentiates the service team is its interaction with customers. Along with the company’s sales team, led by Kevin Koo, DNA Link scientists frequently visit the institutions where their customers are based and encourage in-person consultations at the start of each project. “We do that more than 10 times per week,” Koo says. While it is not possible to extend that model to customers abroad, genomic services leader Gun Eui Lee and his team make sure to attend to those clients with conference calls and emails. This attention to customer service helps to ensure expectations are understood and their scientific recommendations are communicated at the start of a new project. “Their success is our success,” says Koo, “so we want to give them the right direction.”
Other attributes that set DNA Link apart are its competitive pricing and veteran staff of scientists and bioinformaticians. “We have very experienced senior staff,” Koo says, “and they provide the consistent, high-quality data that’s our strong point.”
Much of that data rolls off the PacBio pipeline that DNA Link first established in 2011. “We saw great potential in PacBio sequencing because of the unique long reads,” Lee says. From 2012 to 2013, Lee’s lab more than tripled the number of SMRT Cells used for service projects, reflecting customer demands. “That increase is really amazing,” he says. Demand for PacBio sequencing has been robust within Korea and abroad — so much so that DNA Link is now running more SMRT Cells than flow cells or chips for any of its other sequencing platforms.
Scientific interest for the PacBio service centers primarily on de novo sequencing, Lee says. His sequencing team has used SMRT Sequencing on a number of different organisms, including plant, human, animal, and bacterial genomes. Plant genomes in particular have been of interest, he says, because of their challenging repeat sequences. Short-read assemblies for these large, complex genomes have provided limited information — “but by adding these PacBio long reads, we’ve found that a lot of those repetitive regions were fixed,” Lee adds.
For more on DNA Link, read the full profile or visit their website http://blog.pacificbiosciences.com/2014/01/pacbio-service-provide.r-dna-link-sees.html?utm_content=buffere1ce0&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Originally Published: 13 September 2013// Updated by Next-Generation-Sequencing on Jan. 22, 2014
Reducing assembly complexity of microbial genomes with single-molecule sequencing
Abstract
Background
The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Third-generation, single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem.
Results
To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These single-library assemblies are also more accurate than typical short-read assemblies and hybrid assemblies of short and long reads.
Conclusions
Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to $1,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of completed genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.
This link Changes Daily!! Only good for 12hrs more!! http://paper.li/mtwolfinger/1342906758 // Scroll down for story!!
Immigrant Innovators • Dr. Jonas Korlach. The Administration • Champions of Change (From THE WHITE HOUSE)!!!!
Dr. Jonas Korlach is Chief Scientific Officer at Pacific Biosciences. He co-invented the company’s SMRT technology with Stephen Turner, Ph.D., Pacific Biosciences Founder and Chief Technology Officer, when the two were graduate students at Cornell University. SMRT technology dramatically improves the accuracy and speed of DNA sequencing. Dr. Korlach joined Pacific Biosciences as the company's eighth employee in 2004. Dr. Korlach is the recipient of multiple grants, an inventor on 33 issued U.S. patents, and an author of numerous scientific studies on the principles and applications of SMRT technology, including publications in Nature, Science, and PNAS. He received both his Ph.D. and his M.S. degrees in Biochemistry, Molecular and Cell Biology from Cornell, and received M.S. and B.A. degrees in Biological Sciences from Humboldt University in Berlin, Germany. //// http://www.whitehouse.gov/champions/immigrant-innovators/dr.-jonas-korlach-
Join the PacBio® team as we discuss the recently released human MCF-7 Iso-Seq transcriptome dataset. The webinar will cover both the bioinformatics pipeline that was used in generating the data and explore the pipeline results.
• Bioinformatics Pipeline
– Identification of full-length reads using polyA tail signal and adapters
– Iterative clustering of reads at the isoform-level
– Final consensus clustering using Quiver
• Pipeline Results
– Novel transcripts
– Alternative splice forms
– Alternative polyadenylation
– Fusion genes
Who should attend:
Researchers and bioinformaticians interested in full-length transcriptome sequencing.
Recommended pre-reading:
•Full-length cDNA sequencing protocol
•Basic understanding of PacBio sequencing data format
Presenter Biography
Elizabeth Tseng
Senior Bioinformatics Scientist, Pacific Biosciences
Elizabeth obtained her doctorate degree in Computer Science & Engineering from the University of Washington in 2012. Her thesis work focused on the computational discovery of bacterial non-coding RNAs and gut microbiome. After joining PacBio, she decided to give prokaryotes a break and now supports and develops eukaryotic transcriptome-related collaborations.
(Click on your preferred time below to register:)
Wednesday, January 22 8:00 a.m. PST
Wednesday, January 22 5:00 p.m. PST
http://programs.pacificbiosciences.com/l/1652/2013-12-06/2tw4q9
Thanks, just over 3X my $2.33 purchase price!! Article from Fool.com //// http://www.fool.com/investing/general/2014/01/21/why-inteliquent-pacific-biosciences-of-california.aspx
PacBio Blog
Tuesday, January 21, 2014
Genome Research Paper: Resolve Complex Genomic Regions for a ‘Fraction of the Cost’ With SMRT Sequencing
A new Genome Research paper describes the application of Single Molecule, Real-Time (SMRT®) Sequencing to resolve repeat-heavy genomic regions in important reference genomes such as human and chimpanzee. In the process, the authors drew some important conclusions about cost, pooling, and coverage requirements for this type of work.
“Reconstructing complex regions of genomes using long-read sequencing technology” comes from lead author John Huddleston and senior author Evan Eichler at the University of Washington, along with collaborators at Washington University, the University of Bari, Bilkent University, and Pacific Biosciences.
In the paper, Eichler and his collaborators note the steep cost of finishing a BAC clone to high quality using Sanger sequencing. That problem has proliferated as short-read sequencing leaves more genomes in draft form, the authors write. “Although we can generate much more sequence, the short sequence read data and inability to scaffold across repetitive structures translates into more gaps, missing data, and more incomplete references assemblies,” according to the paper.
To find a more cost-effective alternative, they tested the PacBio® sequencing platform in complex genomic regions. In the first project, the team sequenced eight BAC clones representing a 1.3 Mbp region of chromosome 17q21.31 from a hydatidiform mole sample and assembled results using HGAP and Quiver. The region is known for having high-identity segmental duplications as well as large structural polymorphisms. They report an average of 245x coverage per clone; each clone assembled into a single contig, and six of the eight clones only required a single SMRT Cell. After comparing differences between the PacBio assembly and an existing high-quality Sanger assembly, the authors say, the new sequence showed 99.994% identity to Sanger. To validate the mismatches, “we targeted 44 differences using Illumina® sequencing and find that PacBio and Sanger assemblies share a comparable number of validated variants, albeit with different sequence context biases,” they add.
In the second project, Huddleston et al. performed similar work on a nearly 800 Kb region of the chimpanzee genome that has a significant number of duplications. The study, involving five BAC clones, again demonstrated the accuracy of the sequencing platform and assembly protocol. A validation procedure using BAC-end and fosmid-end sequences confirmed “the order, orientation, and sequence accuracy of the clone-based assembly of this complex region of the chimpanzee genome,” the scientists write.
In the paper, the team drew several conclusions from their efforts. One was that results in the first project could have been generated with almost exactly the same degree of accuracy using 100x coverage instead of more than 200x. This was confirmed using random downsampling of the data set. Separately, they demonstrated successful pooling of BACs, showing that pools of two or three samples could be properly separated post-sequencing to achieve high-quality assemblies of each clone.
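The downsampling check described above can be illustrated with a minimal Python sketch. This is not the authors' procedure: it assumes reads are represented only by their lengths and that coverage is simply total read bases divided by the clone length. The function name, the seed, and the toy numbers (a 200 kb clone, 5 kb reads) are hypothetical; only the 245x and 100x figures come from the text.

```python
import random

def downsample_to_coverage(read_lengths, target_length, target_coverage, seed=0):
    """Randomly pick reads until their summed length reaches
    target_coverage x target_length. Returns indices of the chosen reads.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    order = list(range(len(read_lengths)))
    rng.shuffle(order)  # random order = random subsample without replacement
    needed = target_coverage * target_length
    chosen, total = [], 0
    for i in order:
        if total >= needed:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen

# Toy example (made-up sizes): a 200 kb clone covered at 245x by 5 kb reads.
clone_length = 200_000
reads = [5_000] * (245 * clone_length // 5_000)
subset = downsample_to_coverage(reads, clone_length, 100)
coverage = sum(reads[i] for i in subset) / clone_length
print(f"kept {len(subset)} of {len(reads)} reads, coverage = {coverage:.0f}x")
```

The downsampled subset would then be assembled again and compared with the full-coverage assembly, which is the comparison the paper reports.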
One of the goals of this study was to evaluate whether PacBio sequencing could bring back the quality of Sanger-finished genomes without the prohibitive cost. They conclude that SMRT Sequencing can indeed accomplish this task “for a fraction of the cost and time of traditional finishing approaches.” The authors report that sequencing a single BAC clone with Sanger costs $4,000 to $5,000, while the same task using PacBio costs approximately $625. That cost would decrease further using pooled BACs, they note.
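As a quick worked check of the "fraction of the cost" claim, the per-clone figures quoted above can be compared directly. This sketch uses only the dollar amounts from the text; the variable names are illustrative.

```python
# Per-BAC-clone costs quoted in the text (US dollars).
sanger_low, sanger_high = 4_000, 5_000
pacbio = 625

fold_low = sanger_low / pacbio    # how many times cheaper at the low Sanger estimate
fold_high = sanger_high / pacbio  # how many times cheaper at the high Sanger estimate

print(f"PacBio is {fold_low:.1f}x to {fold_high:.1f}x cheaper per BAC clone")
print(f"i.e. {pacbio / sanger_high:.1%} to {pacbio / sanger_low:.1%} of the Sanger cost")
```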
Finally, the authors suggest taking advantage of existing BAC-end sequence data to select clones that span gaps in important draft genomes and using SMRT Sequencing to increase the quality of those assemblies. “The approach we have described provides a strategy to resolve these more structurally complex regions during the final stages of assembly, ensuring that the 1000-2000 genes mapping therein become incorporated within future mammalian genome assemblies,” they write.
For more from Evan Eichler’s lab, check out his presentation discussing this work from ASHG 2013: Resolving Complex Regions of Genomes using Long-Read Sequencing Technology. http://blog.pacificbiosciences.com/2014/01/genome-research-paper-resolve-complex.html?utm_content=buffera25da&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
PacificBiosciences / FALCON -- January 19, 2014--- Running an Amazon EC2 instance that has HBAR-DTK + Falcon pre-installed
The stable version of StarCluster does not support the c3 instance. For assembly, using one node of c3.8xlarge instance is more convenient. In my test, I can finish a single E. coli genome within almost one hour. Namely, one can assemble a bacterial genome for less than 5 bucks. (from Jason Chin @infoecho: Running an AWS instance with HBAR-DTK+FALCON pre-installed to assemble a bacteria genome in 1 hr with PacBio reads https://github.com/PacificBiosciences/FALCON/blob/master/examples/readme.md)
PacBio Blog
Thursday, January 16, 2014
Looking Ahead: The 2014 PacBio Technology Roadmap
By Jonas Korlach, Chief Scientific Officer
2013 was an eventful and exciting year for PacBio. As I described in the 2013 roadmap post a year ago, we have applied numerous improvements to SMRT® Sequencing, resulting in longer read lengths, greater sequencing throughput, new and improved data-analysis methods, and more efficient workflows. We are very pleased that these advances resulted in so many publications, conference presentations, and social media contributions, with the number of peer-reviewed scientific publications from the scientific community now exceeding 100. On behalf of all of us at Pacific Biosciences, I would like to express my heartfelt gratitude to the scientific community for their time and efforts to apply PacBio® sequencing to solve their research questions, and for their invaluable help to drive applications for SMRT Sequencing forward. We all very much look forward to working together with you in 2014!
As with any relatively new technology, significant improvement and optimization potential exists upon the initial introduction. SMRT Sequencing is no exception to this, and we intend to continue to leverage this potential to the benefit of the research community. Consistent with the technology improvements in previous years, we are targeting another ~4-fold increase in the throughput per SMRT Cell to achieve average read lengths greater than 10-15 kb and overall sequence data outputs in excess of 1 Gb per SMRT Cell, while at the same time preserving SMRT Sequencing's high consensus accuracy, lack of sequencing bias, and ability to detect many epigenetic base modifications. The improvements will be accomplished by a combination of sequencing chemistry upgrades through polymerase and nucleotide engineering, improvements in the polymerase loading efficiency, and software upgrades.
In addition to the sequencing process itself, we will continue to develop improvements for the other two aspects relevant to sequencing. For library preparation, more streamlined protocols will become available, including automated library preparation methods on liquid handling robots. Further, we are developing improved protocols that better ensure the integrity of large inserts (10-20 kb) during the generation of high-quality, long-insert DNA libraries. In addition, protocols with a further reduction in the amount of DNA input, as well as improved barcoding and multiplexing solutions, will become available. With regard to data analysis, our ongoing progress to support and accelerate the analysis of larger genomes, including the human genome, will continue, with improvements to the speed of components such as our mapping tool BLASR and consensus caller Quiver. New methods achieving the assemblies and appropriate representation of organisms with diploid genomes will become available, thereby providing a significant advance in the genetic characterization of virtually all higher organisms, and their corresponding heterozygosity and structural genetic variation. Our Iso-Seq application for the analysis of full-length transcripts and splice isoforms will become more streamlined and include a graphical interface for greater ease of use.
We are indebted to the community for helping with the development of new sample preparation methods and analysis tools for these and many other application spaces, and we anticipate a continuation of these very important contributions. We will continue to release new data sets to the public as we have done in the past, e.g. the Arabidopsis de novo assembly, the long-read human genome dataset for structural variation, the MCF7 Iso-Seq dataset, bacterial methylomes, and the recent Drosophila de novo assembly, to provide the scientific community with examples of what value PacBio data bring to the characterization of the genome, epigenome, and transcriptome of the organism under study, and to help researchers design their own studies.
I am very excited about the prospects for this coming year, and wish you the best of success in your research! http://blog.pacificbiosciences.com/2014/01/looking-ahead-2014-pacbio-technology.html
Jan 14, 2014 -- De novo assembly of complex genomes using single molecule sequencing, Michael Schatz
Summary
• Long read sequencing of eukaryotic genomes is here
• Recommendations
– < 100 Mbp: HGAP/PacBio2CA @ 100x PB C3-P5; expect near perfect chromosome arms
– < 1 Gbp: HGAP/PacBio2CA @ 100x PB C3-P5; expect high quality assembly: contig N50 over 1 Mbp
– > 1 Gbp: hybrid/gap filling; expect contig N50 to be 100 kbp – 1 Mbp
– > 5 Gbp: Email mschatz@cshl.edu
• Caveats
– Model only as good as the available references (esp. haploid sequences)
– Technologies are quickly improving, exciting new scaffolding technologies---- http://schatzlab.cshl.edu/presentations/2014-01-14.PAG.Single%20Molecule%20Assembly.pdf
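The size-based recommendations in the summary above can be written as a small lookup, sketched here in Python. The function name is hypothetical, the slide's "GB" is read as Gbp, and the handling of the exact 1 Gbp and 5 Gbp boundaries is an assumption, since the slide does not say which tier they fall in.

```python
def schatz_recommendation(genome_size_bp):
    """Map a genome size (in bp) to the (strategy, expectation) pair from the slide.

    Hypothetical helper; boundary handling at exactly 1 Gbp and 5 Gbp is assumed.
    """
    if genome_size_bp < 100e6:
        return ("HGAP/PacBio2CA @ 100x PB C3-P5", "near perfect chromosome arms")
    if genome_size_bp < 1e9:
        return ("HGAP/PacBio2CA @ 100x PB C3-P5",
                "high quality assembly: contig N50 over 1 Mbp")
    if genome_size_bp <= 5e9:
        return ("hybrid/gap filling", "contig N50 of 100 kbp to 1 Mbp")
    return ("email mschatz@cshl.edu", "no expectation given")

# Genome sizes mentioned elsewhere in this thread.
for name, size in [("Arabidopsis", 120e6), ("D. melanogaster", 180e6)]:
    strategy, expectation = schatz_recommendation(size)
    print(f"{name}: {strategy} -> {expectation}")
```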
• Successfully assembled “diploid”-like long-read data generating assembly with N50 > 2 Mbp using only PacBio data.
• With enough PacBio data, one can start assembly from reads >10 kb: It reveals the diploid structure as quasi-linear chains in the string graph.
• Toward an improved diploid assembler:
– More rigorous theoretical framework for the diploid / polyploid graph traversing problem
– Generate diploid consensus: need an efficient aligner to create string graphs from long reads
– Phasing: combining SV discovery to SNP calling to “unzip” the bubbles
– More testing cases:
- Real biological diploids
- Other diploid genome might have different structure
[Figure: Arabidopsis, 120 Mbp genome, two strains (Ler-0 & Col-0) sequenced separately. Pre-assembled read length distribution compared with common repeat element lengths (transposons, 45S rDNAs, retrotransposons).]
Acknowledgements
We thank Joe Ecker and Chongyun Lou (HHMI & Salk Institute) for providing the Col-0 DNA sample. We also thank Adam Phillippy (NBACC) and Michael Schatz (CSHL) for insightful discussions about assembly algorithms. --https://s3.amazonaws.com/files.pacb.com/pdf/String+Graph+Assembly+For+Diploid+Genomes+with+Long+Reads.pdf?utm_content=buffer8d567&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
PacBio Blog
Monday, January 13, 2014
Data Release: Preliminary de novo Haploid and Diploid Assemblies of Drosophila melanogaster
Model organisms such as yeast, Arabidopsis and Drosophila have been essential to progress in genetic and biomedical research for more than 100 years. Model organisms are the best, fastest, most effective way to advance science especially when human experimentation may not be feasible. Numerous biological principles have been elucidated using model organisms, including Nobel-prize winning discoveries by Thomas Hunt Morgan that genes are carried on chromosomes; by Hermann Muller for the discovery that X-ray irradiation causes mutations; and by Edward B. Lewis, Christiane Nüsslein-Volhard, and Eric Wieschaus for their discoveries revealing the genetic control of early embryonic development - all using the fruit fly Drosophila melanogaster. D. melanogaster also played a crucial role in the development of genomics by being the first multicellular organism to be sequenced and assembled by a whole genome shotgun method, followed by directed clone-based finishing. The resulting D. melanogaster reference genome sequence provides an invaluable test bed for developing new genome sequencing and assembly technologies.
In collaboration with Dr. Casey Bergman at the University of Manchester and Drs. Susan Celniker and Roger Hoskins of the Berkeley Drosophila Genome Project (BDGP) at Lawrence Berkeley National Laboratory, we have sequenced adult males from a subline of the ISO1 (y; cn, bw, sp) strain of D. melanogaster. This is the same stock used in the official BDGP reference assemblies since the first genome sequence release in 2000. The DNA was size-selected for >15 kb elution using the BluePippin™ system (Sage Science), and in total, ~15 Gb of sequence was generated from a 20 kb library using P5-C3 sequencing chemistry on the PacBio® RS II.
Total number of bases: 15,208,567,933 bp
Total number of reads: 1,514,730
Average read length: 10,040 bp
Half of sequenced bases in reads greater than: 14,214 bp
PacBio RS II instrument time for sequencing: 6 days
Number of SMRT® Cells: 42
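The statistic "half of sequenced bases in reads greater than" is what is commonly called the read-length N50. A minimal Python sketch of how it is computed follows, along with a check of the average read length from the totals reported above. The function name and the toy read lengths are illustrative; the real per-read lengths are not reproduced here.

```python
def read_n50(read_lengths):
    """Return the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

# Toy example: total is 20 bases; the two longest reads (6 + 5 = 11) pass the halfway mark.
toy_n50 = read_n50([2, 3, 4, 5, 6])

# Figures reported above for the D. melanogaster data release.
total_bases = 15_208_567_933
total_reads = 1_514_730
smrt_cells = 42

mean_read_length = total_bases / total_reads  # consistent with the reported 10,040 bp
bases_per_cell = total_bases / smrt_cells     # roughly 362 Mb per SMRT Cell

print(toy_n50, round(mean_read_length), round(bases_per_cell / 1e6))
```

Because long reads carry a disproportionate share of the bases, the N50 (14,214 bp here) sits well above the mean read length.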
Some preliminary analyses and step-by-step instructions for downloading, mapping, and visualizing these raw data are described on the Bergman lab blog. This analysis shows that the depth of coverage for this dataset is >90x for reads mapping to autosomes in the D. melanogaster Release 5 reference genome sequence. Dr. Bergman also shows that individual PacBio long reads can uniquely localize repetitive transposable elements up to ~10 kb in size, and can be used to fill at least one of the rare but persistent gaps remaining in the euchromatic portion of the reference genome.
Most assembly algorithms collapse contigs into a single copy of the genome (haploid); however, this does not reflect the true underlying state in diploid genomes such as flies or human, which have two copies of each chromosome, one inherited from the mother and the other from the father. In the case of this inbred subline, with limited allelic variation, both assembly strategies are feasible. We attempted both a traditional haploid assembly as well as a first-ever diploid assembly of the D. melanogaster genome, with preliminary results summarized below. The longest contigs in both the haploid (25 Mb) and diploid (21 Mb) assemblies span almost the entire length of chromosome arm 3L.
The preliminary haploid assembly using the PacBio Corrected Reads (PBcR) pipeline in Celera® Assembler 8.1 was carried out in collaboration with Drs. Sergey Koren and Adam Phillippy at the University of Maryland. They were able to assemble entire chromosome arms de novo into fewer pieces than the current version of the reference genome (Release 5) – an effort that has spanned two decades, cost millions of dollars, and involved laborious BAC design and sequencing as well as manual finishing and optical mapping. This level of completeness in a de novo assembly is unprecedented in a metazoan genome, and required a total of 6 weeks from initial fly collection and sorting to final analysis and assembly. The actual sequencing time was 6 days. The contig count for each chromosome is tabulated below.
The larger number of contigs for chromosome X can be explained because only adult males were sequenced here (with a 50:50 ratio of chrX to chrY), whereas the Release 5 genome is from mixed-sex embryos. Chromosome X and Y assemblies can further be improved using a higher coverage of reads, and this analysis is underway. Additional analysis and data associated with the haploid assembly are provided here by Dr. Sergey Koren and Dr. Adam Phillippy, including 70x of high-quality, pre-assembled reads using the Celera Assembler PBcR pipeline. Dr. Koren will be presenting the full results at the International Plant and Animal Genome XXII meeting on Tuesday, January 14th.
We also attempted a diploid assembly using an early version of FALCON (Fast Alignment and CONsensus), a new assembly algorithm that is currently being developed at PacBio to explore de novo assembly of diploid genomes. The FALCON assembler extends the Hierarchical Genome Assembly Process (HGAP), with faster implementations of the aligner and consensus algorithms for the pre-assembly and overlap steps of the process. A string-graph is used for the layout stage, which preserves structural and phasing information in polymorphic and heterozygous diploid genomes. The algorithm outputs “primary contigs” and their “associated contigs”, capturing alternative local variants associated with a primary contig. Details were presented at the Genome Informatics Conference in Cold Spring Harbor in November 2013 with slides available here. We assessed the results of our preliminary FALCON assembly by aligning the assembled contigs (light green) to the euchromatic arms (dark green) of the D. melanogaster reference genome using nucmer from the Mummer3 package. Alignments to 2L, 2R, 3L, 3R, 4, X and Y are shown below with sequence identity of each alignment block color-coded in blue indicating >99.96% identity and yellow indicating >99.9% identity. The blue arches represent “unused edges” in the FALCON string graph, and can be interpreted as potential connection points if a more aggressive contig joining threshold were desired.
By sequencing sex-selected male flies, we were able to increase the coverage of reads from the Y-chromosome, which remains one of the most challenging regions left to assemble in the reference genome. Only ~1% of chromosome Y is represented in the reference (Release 5). More recently the BDGP has assembled 7.5% of the Y and we anticipate more than 50% of the Y-chromosome can be assembled with the new data. Ken Wan and Sue Celniker identified contigs containing Y linked genes. The self-self dot-plot below shows that FALCON was able to de novo assemble a 650 kb region of the heterochromatic region of chromosome Y containing a very complex nested repeat structure. In addition, the gene Pp1-Y2 (a testis-specific phosphatase) which spans several gaps of unknown size in the Release 5 sequence is now entirely contained in this single contig.
A development version of FALCON is available on github to developers and bioinformaticians interested in this new assembly process.
We are also releasing an updated dataset of the Arabidopsis thaliana genome using the latest P5-C3 chemistry for comparison to our initial data release using P4-C2 chemistry. In total, we have now released four datasets from three important model organisms: S. cerevisiae (yeast), A. thaliana (flowering plant), and D. melanogaster (fruit fly). This data is freely available and we invite the research community to explore the collection:
http://blog.pacificbiosciences.com/
Bergman Lab--13 Jan 2014 — 10:01 AM -- High Coverage PacBio Shotgun Sequences Aligned to the D. melanogaster Genome
Shortly after we released our pilot dataset of Drosophila melanogaster PacBio genomic sequences in late July 2013, we were contacted by Edwin Hauw and Jane Landolin from PacBio to collaborate on the collection of a larger dataset for D. melanogaster with longer reads and higher depth of coverage. Analysis of our pilot dataset revealed significant differences between the version of the ISO1 (y; cn, bw, sp) strain we obtained from the Bloomington Drosophila Stock Center (strain 2057) and the D. melanogaster reference genome sequence. Therefore, we contacted Susan Celniker and Roger Hoskins at the Berkeley Drosophila Genome Project (BDGP) at the Lawrence Berkeley National Laboratory in order to use the same subline of ISO1 that has been used to produce the official BDGP reference assemblies from Release 1 in 2000 to Release 5 in 2007. Sue and Charles Yu generated high-quality genomic DNA from a CsCl preparation of ~2000 BDGP ISO1 adult males collected by Bill Fisher, and provided this DNA to Kristi Kim at PacBio in mid-November 2013 who handled library preparation and sequencing, which was completed early December 2013. Since then Jane and I have worked on formatting and QC’ing the data for release. These data have just been publicly released without restriction on the PacBio blog, where you can find a description of the library preparation, statistics on the raw data collected and links to preliminary de novo whole genome assemblies using the Celera assembler and PacBio’s newly-released FALCON assembler. Direct links to the raw, preassembled and assembled data can be found on the PacBio “Drosophila sequence and assembly” DevNet GitHub wiki page.
Here we provide an alignment of this high-coverage D. melanogaster PacBio dataset to the Release 5 genome sequence and some initial observations based on this reference alignment. Raw, uncorrected reads were mapped using blasr (-bestn 1 -nproc 12 -minPctIdentity 80) and converted to .bam format using samtools. Since reads were mapped to the Release 5 genome, they can be conveniently visualized using the UCSC Genome Browser by loading a BAM file of the mapped reads as a custom track. To browse this data directly on the European mirror of the UCSC Genome Browser click here.
Credits: Many thanks to Edwin, Kristi, Jane and others at PacBio for providing this gift to the genomics community, to Sue, Bill and Charles at BDGP for collecting the flies and isolating the DNA used in this project, and to Sue and Roger for their contribution to the design and analysis of the experiment. Thanks also to Jane, Jason Chin, Edwin, Roger, Sue and Danny Miller for comments on the draft of this blog post.(Must see links)
https://twitter.com/caseybergman/status/422777445926699008/photo/1
http://bergmanlab.smith.man.ac.uk/?p=2176 /// http://cbcb.umd.edu/software/PBcR/dmel.html ///
Long reads of the year 2013
January 8, 2014 by Next-Gen Sequencing Data
A bit late to the yearly review posts, but here it is: Long Reads of the year 2013. As you can see, these “Long Reads” are slightly different. Here we summarize a few “long read” sequence datasets that became publicly available last year and point to where one can download them. They are awesome resources and great to start playing with in the new year.
One of the most exciting things that happened in “next-gen sequencing” this year is the availability of “long” sequence reads, be it genomic or transcriptomic. Two sequencing technologies that already have “long reads” and got a lot of attention this year are Illumina’s Moleculo and PacBio. And Oxford Nanopore data is just around the corner. With Oxford Nanopore’s early access program, we might see some data by February 2014 (AGBT 2014?).
The year 2013 started with Illumina acquiring Moleculo for its long-read technology. Another big change is that PacBio got more social (possibly realizing the threat from Illumina) :). PacBio started blogging in mid 2012, but had just two blog posts in 2012. Then 2013 came, PacBio got really prolific, and it now has over 55 posts. In addition, PacBio also started making its data publicly available through the blog.
Moleculo and PacBio sequence data from Drosophila
After acquiring Moleculo, Illumina launched its Fast Track Long Read sequencing service using Moleculo long-read technology. As part of the early access launch, Illumina shared a long-read data set from Dr. Dmitri Petrov’s group at Stanford, comprising two libraries of Drosophila melanogaster, each run on a single HiSeq lane and producing ~30 Gb of data. Visit Illumina’s BaseSpace to get the data.
Around the same time, Casey Bergman’s lab made PacBio long reads publicly available. The raw PacBio data is 1,357,183,439 bp, giving ~7.5x coverage of the 180 Mb male D. melanogaster genome. The 63G PacBio data can be downloaded from the Bergman lab website. In addition, the Bergman lab had Illumina data from the same sample and combined it with the PacBio reads to offer error-corrected sequence data.
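The ~7.5x figure follows from dividing the raw yield by the genome size. A minimal Python sketch, using only the two numbers quoted above (variable names are illustrative):

```python
# Raw-yield coverage estimate: total sequenced bases / genome size.
raw_bases = 1_357_183_439      # pilot PacBio dataset, from the text
genome_size = 180_000_000      # 180 Mb male D. melanogaster genome, from the text

raw_coverage = raw_bases / genome_size
print(f"raw coverage = {raw_coverage:.1f}x")
```

Note that this is an estimate from raw yield. Depth measured from mapped reads can differ, since not every read maps and coverage varies between chromosomes.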
Another possible Moleculo dataset comes from the first publication using Moleculo technology. The Moleculo team worked on the project before the technology was named Moleculo, and the results came out in a paper in eLife. However, it looks like the data is not freely available. Are there other Moleculo data out in the wild?
PacBio RNA-seq data from Human MCF-7
PacBio generated long-read sequencing data of RNA from MCF-7, a human breast cancer cell line, and made it available on its website. The data was obtained with P4-C2 sequencing chemistry and contains 44,531 non-redundant transcript-length consensus sequences with read lengths ranging from 400 bp to 4,900 bp (an average length of 1,929 bp). Here is the PacBio blog post offering more details on the “long read” data.
Long-Read Shotgun Sequencing of a Human Genome
PacBio released data generated with P5-C3 sequencing chemistry, containing over 3.6 M reads with an average length of 8,849 bases (half of the sequenced bases are in reads greater than 10,985 bp). The data is from an interesting human cell line derived from a complete hydatidiform mole (CHM).
A hydatidiform mole is defined as a pregnancy with no embryo and clinically presents in approximately 1 in 1,500 pregnant women in North America. The CHM cells have a diploid genome, typically XX, that is a result of replication of a haploid paternal (sperm) genome. Through the corresponding absence of allelic variation, this sample has been used to generate a haploid reference genome sequence, and many associated resources are available, including physical maps, genotypes (iSCAN), and a large-insert BAC library (CHORI-17). It is also one of the targets for the production of a higher quality “platinum” genome assembly.
Visit PacBio blog for accessing the data.
PacBio RNA-seq data
Mike Snyder’s group from Stanford did the first long-read survey of the human transcriptome and generated 476,000 CCS reads from cDNA, with an average length of 1 kb, to investigate the isoform complement of a diverse pool of RNA samples representing 20 human tissues and organs. Data from the 454 platform on the same samples, with an average read length of 522 bp, is also available. PacBio RNA-seq Data on ENA: PRJEB3969
PacBio RNA-seq data from hESC cell line
Wing Wong’s team from Stanford published in PNAS a new method that uses PacBio and Illumina reads to identify isoforms. The team used C2 chemistry to generate over 7.5 M reads of average length 2-3 kb from the hESC cell line H1. Data can be accessed at GSE51861.
Comments
Lex Nederbragt says:
January 8, 2014 at 4:18 pm
Great idea, this post! Some comments:
The Drosophila moleculo data is available through Illumina’s basespace (free registration required).
PacBio released several bacterial genome datasets, from projects illustrating the potential for finished genomes using this platform.
Lex Nederbragt says:
January 8, 2014 at 6:59 pm
And then I forgot to include the Arabidopsis PacBio long reads, as well as the reads generated from the Human Microbiome Project ‘mock community’ sample – both released by the company and available through pacbiodevnet.com
http://nextgenseek.com/2014/01/long-sequence-reads-to-play-with-during-the-holidays/
Twitter discussion today among researchers working in drug-discovery biology, bioinformatics, microbial forensics and next-generation sequencing, among others: http://storify.com/sahasurya/pacbio-error-correction?utm_source=t.co&utm_campaign=&awesm=sfy.co_sMhP&utm_content=storify-pingback&utm_medium=sfy.co-twitter // http://pag14.mapyourshow.com/5_0/sessions/sessiondetails.cfm?ScheduledSessionID=18A1
Adam Phillippy @aphillippy: @SahaSurya pbtools tries to strike the balance between those two extremes and @mike_schatz will surely have results to show next week at PAG (1:29 PM - 8 Jan 2014)
John Davey @johnomics: @aphillippy @druvus @SahaSurya See PAG schedule http://pag14.mapyourshow.com/5_0/sessions/s....- @mike_schatz, should I start PacBioToCA now or wait til Tuesday? (1:43 PM - 8 Jan 2014)
PacBio Blog
Wednesday, January 8, 2014
SMRT Sequencing for Plant and Animal Genomes: Learn More at PAG 2014
Many recent studies have demonstrated the use of Single Molecule, Real-Time (SMRT®) Sequencing for larger genomes, from complete reference genomes to de novo discovery of transcript isoforms. These advances include understanding genome complexity and variation and enabling improved leverage of haplotype information for biotechnology. Some of these efforts will be presented at the workshop we’re hosting at this year’s International Plant & Animal Genome (PAG) conference in San Diego. Sign up now to reserve your seat or receive the recording after the event.
Workshop details:
A SMRT® Sequencing Approach to Reference Genomes, Annotation, and Haplotyping
Tuesday, January 15
1:30 – 3:30 p.m.
San Diego Room, Town & Country Hotel
Presentations:
* Joe Ecker, Professor, Salk International Council Chair in Genetics, HHMI/Gordon and Betty Moore Investigator, The Salk Institute for Biological Studies
Resolving the Complexity of Genomic and Epigenomic Variations in Arabidopsis
* Sean Gordon, Research Molecular Biologist, USDA-ARS
A Fungal Transcriptome Uses Complex and Double-Edged Isoforms to Split Wood
* Shane Brubaker, Director of Bioinformatics, Solazyme, Inc.
Assembly, Haplotyping, and Annotation of a High GC Algal Genome
* Allen Van Deynze, Director of Research, UC Davis, Seed Biotechnology Center
A De Novo Draft Assembly of Spinach Using Pacific Biosciences’ Technology
We’ll also be hosting a new grant competition for PAG attendees, co-sponsored by Sage Science. Simply write a short description of why the genome you want to sequence is the “Most Interesting Genome in the World” and you could win free library construction with BluePippin™ automated DNA size selection and up to three sequencing runs on the new PacBio® RS II sequencing system. Entries must be submitted by January 31st at www.pacb.com/smrtgrant. Stop by our booth at PAG (#231) to find out more.
See you in sunny San Diego! http://blog.pacificbiosciences.com/2014/01/smrt-sequencing-for-plant-and-animal.html?utm_content=bufferf4376&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Published on 7 January 2014 by Philip Ball (Philip Ball is a science writer. His latest book is Curiosity: How Science Became Interested in Everything).
Impressive hardware at Pacific Biosciences, a genome sequencing company. http://aeon.co/magazine/nature-and-cosmos/science-is-becoming-a-cult-of-hi-tech-instruments/
January 2nd, 2014 SPAdes 3.0 Released with PacBio Hybrid Assembly Module
We received an email from Anton Korobeynikov of Algorithmic Biology Lab about the release of SPAdes 3.0. Anton mentioned several months back that they were working to incorporate PacBio reads in hybrid mode, and now it is official! You can download the software from here and the list of changes is given below.
We look forward to reading the manuscript that presents their algorithms. Anton told us that they are working on it, but if it does not come soon, we will have to start poking around the code.
http://www.homolog.us/blogs/blog/2014/01/02/spades-3-0-released-support-pacbiohybrid-assembly/
SPAdes 3.0 from the Algorithmic Biology Lab, St. Petersburg Academic University of the Russian Academy of Sciences
http://bioinf.spbau.ru/en/spades
GenomeWeb Daily News Index: Future is Better for Some than for Others
January 2, 2014 by 2012pharmaceutical
NanoString, Accelerate, PacBio Shares Sharply up in September; Myriad, Sequenom Down
Reporter: Aviva Lev-Ari, PhD, RN
Related Stories
NanoString, PacBio, Fluidigm Lead December Charge in Omics Tools, MDx Stocks
January 2, 2014 / GenomeWeb Daily News
http://pharmaceuticalintelligence.com/2014/01/02/genomeweb-daily-news-index-future-is-better-for-some-than-for-others/?utm_source=twitterfeed&utm_medium=twitter
Seven Major Trend Changes of 2013 – (i) Sequencing Technology
December 30th, 2013 -- This commentary is our modest attempt to capture the essence of over 500 blog posts published here in 2013. A major trend change is defined as a situation where social perception morphed substantially between the beginning and the end of 2013.
1. Sequencing Technology – PacBio has Arrived
Researchers had essentially given up on PacBio technology by the end of 2012. Only two or three blogs covered it in a positive light, ours being one of them. For example, readers may take a look at the following 2012 posts of ours related to PacBio.
HDF5 Data Format for PacBio Sequences
Mixing Illumina and PacBio Data for Genome Finishing
Excellent Slides from Fisherman Lex on Combining PacBio and Short Reads
Basic Local Alignment with Successive Refinement (BLASR) for PacBio
HGAP – Very Accurate de novo Genome Assembly from PacBio Data
PBSIM: PacBio reads simulator.
In contrast, the general perception among scientists was well captured by the concluding paragraph of a 2012 paper published in BMC Genomics.
A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers
The limited yield and high cost per base currently prohibit large scale sequencing projects on the Pacific Biosciences instrument. The PGM and MiSeq are quite closely matched in terms of utility and ease of workflow. The decision on whether to purchase one or the other will hinge on available resources, existing infrastructure and personal experience, available finances and the type of applications being considered.
Or by this satirical commentary from early 2012.
A new sequencing technology enters the ring: SHTseq(TM)
Longer Reads-Better Data – noSHT (What would you do with 100Mb reads?).
We are able to generate super-long reads with our ARSesnsors. Using CrapBio-SHTseq technology we regularly get 10Mb reads and we have even seen reads of 100Mb which completely sequenced E. coli 20 times in a single read. Our base calling accuracy is 25%, but with genomes with extreme AT/GC bias it reaches 40%. Although this is lower than other platforms the longer reads allow you to extract much more information from our reads than old-fashioned 2nd generation sequencers. Also this error is totally randomly distributed (unlike homopolymer errors in other technologies!) and there is no decline in base calling accuracy toward the ends of reads. The last base in a read is just as good as the first base.
Cleaning up SHT™ with Illumina data
If, for whatever reason, you need accurate sequence data we have developed a hybrid assembler that can incorporate Illumina error correcting reads. With our HybridAssemblyReadDenoisingSHT data you can simply upload your 100x Illumina data with your sample and get reads returned to you with 99.999% accuracy*.
Fast forward by one year. The experts present at the #UCDAssemble workshop made the following forecast.
From ‘technology unsuitable for large-scale sequencing projects’ to ‘the only thing used to sequence bacterial genomes’ is a major shift in perception. Long reads have definitely arrived.
How did ‘extremely noisy’ PacBio reads turn out to be useful? Scientists started to realize that clean short reads also introduced noise through their short length, and that this noise showed up as lower assembly quality at the next level of analysis. So, they were trading one form of noise for another. The following two blog posts explain this point in detail.
End of Short-Read Era? – (Part I)
End of Short-Read Era? – (Part II)
In other major sequencing technology-related shifts of 2013, (i) Illumina acquired Moleculo Inc. for longer reads, (ii) Roche closed its 454 sequencing business and announced a collaboration with PacBio, (iii) NabSys unveiled its instrument and (iv) Ion Torrent and BGI announced a partnership, with BGI buying 37 new Proton instruments.
Nanopore technology in general and Oxford Nanopore in particular continue to be the wildcards of the sequencing world. Lack of actual sequencing data from Oxford Nanopore has been a big complaint and researchers perceive the company as ‘secretive’. Its patent submissions provide some insight into where the company is heading.
Everything You Want to Know about Oxford Nanopore
Readers interested in staying ahead of the crowd regarding the changing dynamics of the sequencing world are encouraged to follow these excellent blogs.
Omics! Omics! by Keith Robison
In Between Lines of Code by Lex Nederbragt
Pathogens: Genes and Genomes by Nick Loman
Opiniomics by Mick Watson
—————————————————————————————————————–
In the following post, we cover -
Seven Major Trend Changes of 2013 – (ii) Bioinformatics
http://www.homolog.us/blogs/blog/2013/12/30/six-emerging-themes-2013/
PacBio Blog
Friday, December 27, 2013
Breakpoint Detection in Cancer Structural Variants with PacBio May Yield Patient-Specific Data
A new publication from scientists at the University of California, San Diego, demonstrates the use of Single Molecule, Real-Time (SMRT®) Sequencing to identify structural variation (SV) breakpoints in cancer.
“Amplification and thrifty single molecule sequencing of recurrent somatic structural variations” was published in Genome Research and comes from authors Anand Patel, Richard Schwab, Yu-Tsueng Liu, and Vineet Bafna.
In the paper, the scientists report development of a new method, Amplification of Breakpoints (AmBre), to detect important structural variant breakpoints. AmBre relies on a PCR-based approach for amplification of the structural variant, followed by sequencing on the PacBio® platform to resolve the exact breakpoints. The method was tested on several cancer cell lines that contained extensive genomic rearrangements, including deletions of tumor suppressor genes.
The authors note that breakpoints of structural variation are far more individualized than the structural variants themselves; they posit that these breakpoints have “utility as patient specific tumor biomarkers.” A reliable way to detect breakpoints, then, could have clinical relevance for cancer patients. In addition, the method can also be used to validate structural variants found with other sequencing (exome or genome) or microarray-based methods.
The team used SMRT Sequencing of pooled amplicons in a single SMRT Cell as well as a custom-built algorithm to sort reads by breakpoint and then call a consensus sequence representing a particular structural variant. The AmBre approach was validated on cancer cell lines including A549, CEM, and Detroit562 by successfully identifying CDKN2A deletion breakpoints. It was then applied (and confirmed by Sanger sequencing) to cell lines MCF7 and T98G for which the breakpoints had not been identified in spite of previous efforts, including whole genome sequencing of the MCF7 cell line. Interestingly, the SNP-array estimate for the MCF7 breakpoint is 15 kb away from the AmBre detected breakpoint, likely due to repeat elements close to the upstream MCF7 breakpoint. The authors note that "Repetitive sequences are known to confound structural variation analysis and possibly explains why previous genome sequencing studies of MCF7, have not annotated the CDKN2A deletion breakpoints".
The authors also show that AmBre captures more complex rearrangements, like interchromosomal translocations, by resolving the RUNX1-RUNX1T1 gene fusion which forms from a translocation between chromosome 21 and chromosome 8. In addition, "the AmBre assay, unlike other methods, can target DNA with a SV in the context of high background of germline DNA", a feature "important for sensitive detection of tumor DNA and establishing a patient specific tumor DNA marker for monitoring tumor burden." They demonstrate successful targeting of SVs by AmBre in heterogeneous samples where tumor DNA was present in as little as 1:1000 of the sample.
The paper was also reported on by In Sequence in an article entitled UCSD Team Develops PacBio Sequencing Method to ID Structural Variant Breakpoints (free access).
http://blog.pacificbiosciences.com/2013/12/breakpoint-detection-in-cancer.html?utm_content=bufferec4f0&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer
Chemical & Engineering News
December 23, 2013
DNA Sequencing: Zero-Mode Waveguides Turn 10
Analytical devices allow researchers to track the sequencing of long, continuous stretches of DNA
A decade ago, researchers at Cornell University reported an analytical device, which they called a zero-mode waveguide, for using light to detect single biomolecules in samples of any concentration. That device was remarkably simple—just an array of nanometer-scale holes in a metal film on a fused-silica surface (Science 2003, DOI: 10.1126/science.1079700). Today, that unpretentious device is the basis of the DNA-sequencing technology from Pacific Biosciences, based in Menlo Park, Calif.
Waveguides are devices that are used to direct light and sound waves—optical fibers are one example. Zero-mode waveguides are so named because the holes, which serve as the waveguides, are smaller than the wavelength of light used, so the light doesn’t pass through the holes.
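The "smaller than the wavelength" argument can be made concrete with the textbook cutoff formula for an idealized circular waveguide. The sketch below is only an illustration under stated assumptions: it treats the hole as a perfectly conducting cylinder (real metal films at optical frequencies deviate from this), and the 70 nm diameter, 532 nm wavelength and water-like refractive index of 1.33 are hypothetical values, not figures from the article.

```python
import math

# First zero of the derivative of the Bessel function J1; it sets the
# lowest (TE11) cutoff of an ideal circular waveguide.
TE11_ROOT = 1.8412

def cutoff_wavelength_nm(diameter_nm, refractive_index=1.0):
    """Longest vacuum wavelength that can still propagate through the hole."""
    radius = diameter_nm / 2
    return refractive_index * 2 * math.pi * radius / TE11_ROOT

def intensity_decay_length_nm(diameter_nm, wavelength_nm, refractive_index=1.0):
    """Depth at which intensity falls to 1/e when the light is below cutoff."""
    k_cutoff = TE11_ROOT / (diameter_nm / 2)
    k_light = 2 * math.pi * refractive_index / wavelength_nm
    if k_light >= k_cutoff:
        return math.inf  # light propagates, so there is no evanescent decay
    kappa = math.sqrt(k_cutoff ** 2 - k_light ** 2)
    return 1 / (2 * kappa)

# Hypothetical example values (assumptions, not from the article).
hole_nm, laser_nm, n_water = 70, 532, 1.33
print(cutoff_wavelength_nm(hole_nm, n_water))
print(intensity_decay_length_nm(hole_nm, laser_nm, n_water))
```

When the laser wavelength exceeds the cutoff, no mode propagates and the light intensity dies off within a short distance of the hole's entrance, which is what confines the observation volume.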
From the beginning, the Cornell team’s waveguides were intended to be used for optically monitoring the progress of DNA sequencing. “We knew the fundamental limitation in fluorescence-based sequencing would be background noise caused by other nucleotides,” says Stephen W. Turner, founder and chief technology officer of Pacific Biosciences, who was on the team that invented the zero-mode waveguides. “We needed to reduce the observation volume.”
Turner and his Cornell colleagues tried several approaches. “Most of them worked to some degree, but the zero-mode waveguide worked so well and was so much better and simpler than the others that it was the hands-down winner,” Turner says. The technology is now the heart of Pacific Biosciences’ PacBio RS II sequencing instrument.
Each waveguide in a disposable PacBio RS II reaction cell serves as a tiny vessel for a DNA-sequencing reaction. Single polymerase enzymes are immobilized in a waveguide, which provides a window to observe sequencing in real time by precisely following the incorporation of fluorescently labeled nucleotides. The prototype device had a few thousand waveguides, but today’s PacBio RS II uses reaction vessels with 150,000 waveguides that can be monitored simultaneously. The combination of the waveguide and specially designed DNA polymerases allows sequence read lengths that are thousands of base pairs long—some longer than 30,000 base pairs, which is the most among DNA sequencers.
Other sequencing devices sometimes have trouble working through DNA regions containing a lot of guanine and cytosine (GC) bases, but not the PacBio RS II. “The place where PacBio is truly amazing is its lack of GC bias,” comments Jay Shendure, associate professor of genome sciences at the University of Washington. “That, along with the long read lengths, is enabling its utility in niche applications such as accurate assembly of microbial genomes with high or low GC content, as well as accurate assembly of challenging regions of mammalian genomes.”
http://cen.acs.org/articles/91/i51/DNA-Sequencing-Zero-Mode-Waveguides.html?utm_content=bufferd3df9&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer
On Twitter today:
Michael Schatz @mike_schatz: latest @pacbio stats: 82x coverage over 10kbp, 8.7x over 20kbp, max 36,861bp. (10:57 AM - 27 Dec 13)
Reply from Pacific Biosciences @PacBio: Happy Holidays Mike! RT @mike_schatz: latest @pacbio stats: 82x coverage over 10kbp, 8.7x over 20kbp, max 36,861bp. (1:01 PM - 27 Dec 13)
Sounds good? https://twitter.com/PacBio
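A figure such as "82x coverage over 10kbp" is most naturally read as the fold coverage contributed only by reads at least that long. A minimal sketch of that calculation follows; the function name, the read lengths and the genome size are all made up for illustration and are not the data behind the tweet.

```python
def coverage_from_reads(read_lengths, genome_size, min_length=0):
    """Fold coverage contributed by reads of at least min_length bases."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    long_bases = sum(n for n in read_lengths if n >= min_length)
    return long_bases / genome_size

# Hypothetical toy data: a 100 kb genome and six reads.
reads = [4_000, 12_000, 25_000, 8_000, 15_000, 36_000]
genome = 100_000

print(coverage_from_reads(reads, genome))          # all reads
print(coverage_from_reads(reads, genome, 10_000))  # reads of 10 kbp or more
print(coverage_from_reads(reads, genome, 20_000))  # reads of 20 kbp or more
print(max(reads))                                  # longest read
```

Raising the length threshold can only lower the number, which is why the 20 kbp figure in the tweet is smaller than the 10 kbp one.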
December 19, 2013 at 11:47 am
#UCDAssemble Workshop Covering Genome Assembly ...
PacBio Makes a Big Stir !!! Readers interested in sequence assembly may take a look at the tweets under hashtag #UCDAssemble. It is attended by a number of researchers, who are working on writing genome assembly programs or are heavy users of such programs (e.g. Jared Simpson, Jason Chin, Titus Brown, Lex Nederbragt, Nick Loman). The workshop is technically-oriented with a number of hands on exercises. We will summarize the key points for those without twitter access.
Conference-related documents are available here.
http://www.homolog.us/blogs/blog/2013/12/19/ucdassemble-workshop-covering-genome-assembly/ ( https://twitter.com/search?q=%23UCDAssemble )
Fragile Expedition
By Aaron Krol
December 18, 2013 | At the farthest reaches of the long arm of the X chromosome, there sits a stretch of DNA that looks like something a cat would type by sitting on the Ctrl and V keys. Translated into standard nucleic acid notation, it reads:
CGG CGG CGG CGG CGG CGG CGG CGG…
and so on. This kind of genetic duplication – common in non-coding regions, but rarer in active genes – is called a tandem repeat, and just how many repeats can be found in this position varies from person to person. Some people have only five copies of this sequence, while the longest recorded stretch of repeats went on for well over a thousand copies. And although it might seem strange for anyone but a deeply postmodern poet to describe a long cycle of three repeating letters as “interesting” or “mysterious,” this genetic locus is both.
It’s interesting because it houses the single most common known cause of autism – in fact, the single most common cause of mental impairment of any type that can be passed from parent to child. The gene at this locus is called FMR1, and the protein it codes for, FMRP, is essential for normal brain development. Any X chromosome with more than 45 copies of the CGG repeat on FMR1 is considered to harbor a mutant variant of the gene. At around 200 copies, the tandem repeats silence the gene, shutting down production of FMRP and causing “fragile X syndrome.” Individuals affected by fragile X, particularly boys, will show slowed cognitive development, sometimes to the extent of autism, as well as behavioral quirks like hyperactivity and extremely repetitive rituals. People with fragile X also have distinctive physical features, including an elongated face, stuck-out ears, a skinny physique, and unusually flexible fingers. Oddly, if you look under an electron microscope, their X chromosomes have some telltale visual signs of their own: a “pinch” at the locus where their tandem repeats have multiplied uncontrollably, which causes a short knob of genetic material to dangle off the end of the chromosome like an earlobe.
The locus is also mysterious, because it’s almost impossible to sequence – even though we know pretty much what the sequence will look like. That’s because modern sequencers work by reading DNA in small fragments, on the order of 100-400 bases long. By doing this enough times, sequencers generate a huge set of overlapping fragments. A computer finds the stretches where the fragments match, and stitches them back together to produce the unified sequence.
This works very well for most of the genome, but if a sequencer runs into a 600-base stretch of nothing but CGG, it won’t have any idea how to align the fragments: many of them will be nothing but CGG from start to end. This makes it impossible to tell how long the repeats go on; and if there is an interruption somewhere down the line, say an AGG slipped in, it won’t be clear where it fits, or even whether these interruptions occur more than once. Sequencers just discard this information, and leave this section of FMR1 a black hole in their readouts.
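The ambiguity described above can be reproduced with a toy example. The sketch below is an idealization under stated assumptions: the flanking sequences are invented, reads are error-free, and read depth is ignored. It builds two loci with different repeat counts and AGG positions, shows that they yield exactly the same set of 30-base reads, and then shows that a single read spanning both flanks recovers the count and the interruption.

```python
# Hypothetical flanking sequences; the real FMR1 flanks are different.
LEFT_FLANK = "ACGTTGCAAT"
RIGHT_FLANK = "TTACCATCAT"

def make_locus(n_units, agg_at=None):
    """Build flank + n_units triplets + flank, with an optional AGG at one unit index."""
    units = ["CGG"] * n_units
    if agg_at is not None:
        units[agg_at] = "AGG"
    return LEFT_FLANK + "".join(units) + RIGHT_FLANK

def distinct_short_reads(locus, read_len):
    """Every distinct error-free substring of length read_len (idealized short reads)."""
    return {locus[i:i + read_len] for i in range(len(locus) - read_len + 1)}

def describe_spanning_read(read):
    """Count triplets and locate AGG units in one read that covers both flanks."""
    start = read.find(LEFT_FLANK)
    end = read.rfind(RIGHT_FLANK)
    if start == -1 or end == -1 or end < start + len(LEFT_FLANK):
        raise ValueError("read does not span both flanks")
    tract = read[start + len(LEFT_FLANK):end]
    units = [tract[i:i + 3] for i in range(0, len(tract), 3)]
    agg_positions = [i for i, unit in enumerate(units) if unit == "AGG"]
    return len(units), agg_positions

locus_a = make_locus(100, agg_at=30)
locus_b = make_locus(200, agg_at=120)

# Short reads cannot tell the two loci apart...
same_short_reads = distinct_short_reads(locus_a, 30) == distinct_short_reads(locus_b, 30)
print(same_short_reads)

# ...but one long read across the whole tract can.
print(describe_spanning_read(locus_a))
print(describe_spanning_read(locus_b))
```

Real long reads contain errors, so an actual pipeline would need alignment and consensus steps rather than exact string matching.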
Dr. Paul Hagerman, an expert in fragile X genetics at the UC Davis School of Medicine, Dept. of Biochemistry and Molecular Medicine. Image credit: Hagerman lab
That’s a problem Dr. Paul Hagerman has grappled with for years. His lab at UC Davis has been working on fragile X for over a decade, during a fruitful period for genetic discovery. Among his lab’s contributions was the discovery of a new condition connected with carrying an intermediate number of CGG repeats, between about 55 and 200, called the "premutation" range. The condition is fragile X-associated tremor/ataxia syndrome (FXTAS), and it presents with late-onset neurodegenerative symptoms and sometimes childhood seizures. Having an intermediate number of repeats can also result in fragile X-associated primary ovarian insufficiency (FXPOI), which causes women to experience early menopause. These discoveries have shifted our understanding of the premutation phenotype, making it more important than ever to understand where FMR1 variants fall on the spectrum.
Yet without sequencers, it’s hard to tell where an individual sits in this range. For years, the Hagerman lab could rely only on imprecise measures like Southern blotting – good enough to diagnose fragile X syndrome, but little help nailing down a carrier’s genotype.
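The repeat ranges quoted in the article (more than 45 copies counted as mutant, a premutation range of about 55 to 200, and silencing at around 200) can be written down as a small lookup. This is only a sketch of the article's approximate numbers: the function name, the exact boundary handling and the label for the unnamed 46 to 54 band are assumptions, and it is not a clinical classification.

```python
def classify_fmr1(cgg_repeats, normal_max=45, premutation_min=55, full_min=200):
    """Map a CGG repeat count to the approximate ranges quoted in the article."""
    if cgg_repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if cgg_repeats <= normal_max:
        return "normal"
    if cgg_repeats < premutation_min:
        # The article does not name this band; the label is a placeholder.
        return "above normal, below premutation"
    if cgg_repeats < full_min:
        return "premutation"
    return "full mutation"

for count in (5, 45, 50, 120, 600):
    print(count, classify_fmr1(count))
```

The thresholds are keyword arguments so that the "about" and "around" in the text can be adjusted without touching the logic.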
Unmasking the Fragile X Gene
The progress of genetics has been characterized by huge, much-heralded leaps forward, followed by laborious backtracking to make some useful sense of the terrain that was hurdled past. When the human genome was first sequenced in 2003, scientific boosters predicted a revolution in medicine. A decade later, it’s a simple matter for the average person to have her genome tested for up to a million common mutations, yet it would be foolhardy in the extreme to base a health regimen on the results.
For fragile X, the leap came almost as soon as Dr. Hagerman got wind of SMRT technology. Designed by Pacific Biosciences, or PacBio for short, and released commercially in 2011, this sequencing method breaks the genome into much larger fragments than alternative sequencers: the record read length on a PacBio instrument, set in October of this year, topped 40,000 bases. Even an average read on a SMRT sequencer is thousands of bases long. This is technology capable of eating through hundreds of CGG repeats in a single bite.
UC Davis acquired an early SMRT sequencer in 2011, and in October of 2012, the Hagerman lab published the first fully-characterized sequences of the FMR1 gene in the unmutated, premutated and mutated ranges. The achievement was a milestone in fragile X research, rescuing FMR1 from a nebulous realm of genetic guesswork. For the first time, cutting-edge technology seemed to promise a world where anyone could learn exactly what fragile X-related variants lurked in their DNA, and what this meant for their future health and that of their children.
Today, Dr. Hagerman is still working to make good on that promise.
New Challenges
“We have a dual interest [in fragile X sequencing],” says Dr. Hagerman. “One is to develop a diagnostic procedure, and the second is to do screening.” Hagerman wants to design two separate genetic tests. The first will offer a complete description of the FMR1 locus that any doctor would accept as clinically valid, and another, preliminary test will flag individuals with any level of FMR1 mutation.
Both these projects differ in key ways from the headline sequencing runs of Hagerman’s 2012 paper. A diagnostic procedure has to meet some strenuous goalposts before either clinicians or the FDA will accept it for use on actual patients. It has to preserve its accuracy across the entire conceivable range of repeats that people might carry. It has to reliably capture rare, single-point mutations that can affect how fragile X and related disorders develop. It has to work on mosaic individuals, for whom different cells may carry different numbers of repeats: a common phenomenon in FMR1 disorders, because large numbers of tandem repeats are unstable and can spontaneously duplicate. “All of those features are important in developing a clinical diagnostic tool so you can use this for genetic counseling,” says Dr. Hagerman.
Hagerman’s earlier work also used a common shortcut that needs to be ironed out of the process. To obtain enough genetic material to sequence, Hagerman’s lab performed PCR amplification on their samples, using polymerases to copy the relevant DNA many times over. This boosts the signal fed into the SMRT sequencers, but it can also add extra CGG repeats to the sequence, again because tandem repeats are naturally prone to duplication. A diagnostic test will have to work with an unamplified raw sample to be considered valid.
The need for a genetic diagnostic is pressing. Although existing methods can diagnose fragile X syndrome itself, they say relatively little about the risks to family members of affected individuals. Relatives of those with fragile X tend to carry FMR1 genes in the premutation range, putting them at risk for FXTAS or FXPOI, conditions that almost always go undiagnosed today. Carriers can also pass full mutations onto their children, as the tandem repeats multiply during meiosis – a risk that grows greater with age. To assess all these dangers, the precise length of the repeat needs to be quantified.
“The tests that currently exist have well-known limitations,” Hagerman told Bio-IT World. Southern blots, for instance, “are very inaccurate in the carrier range… PCR methods are much more accurate in the premutation range, but they’re much less accurate in the full mutation range.” A full genetic description will also capture single-point mutations that can make a major difference in the fragile X family of disorders. For instance, Hagerman’s lab was instrumental in the discovery that just one or two AGG sequences scattered in the CGG repeats can dramatically reduce the risk of a carrier passing on a full mutation. “The probability of a mother… giving birth to a son who would have the full fragile X syndrome can be reduced as much as tenfold by a single AGG interruption,” says Dr. Stephen Turner, founder and CTO of PacBio. Current diagnostics don’t detect AGG inserts, but SMRT sequencers can.
A Key Partner
Hagerman’s early success in sequencing FMR1 has been encouraging enough to attract an NIH grant to develop a diagnostic test. Awarded in September of this year, the grant will culminate in a 300-person trial of the fine-tuned test before UC Davis prepares it for clinical release. PacBio is closely involved in the project, which could establish a unique position for the company in clinical use at a time when the sequencing industry is scrambling to enter the clinical market. Illumina, the world’s largest sequencing company, recently received FDA approval for a cystic fibrosis diagnostic, a landmark in medical genetics. QIAGEN, a multibillion-dollar diagnostics company, is preparing to release the new GeneReader instrument specifically for clinical use.
PacBio is smaller than either of these companies, but its SMRT sequencers are the only instruments that can access fragile X mutations. “We are right now in the business of facilitating sequencing that can’t be done using other techniques,” Turner, who is also principal investigator for the NIH grant, told Bio-IT World. “The longest read length of any technology other than PacBio is about one thousand bases, so they’re far short of being able to cover [the range of FMR1 mutations].”
Two aspects of SMRT technology – the acronym stands for single molecule, real time – give PacBio the coverage needed to delve into the black boxes of the genome. The first is a DNA polymerase engineered from the phi29 bacteriophage. Bacteriophages, viruses that invade bacteria and replicate inside their hosts, use polymerases to copy their genetic sequences. The phi29 phage is remarkable for replicating its entire, nearly 20,000-base genome in a single step, using just one enzyme. Armed with this highly accurate, long-reading polymerase, a SMRT sequencer can chew through thousands of bases using just one enzyme and one molecule of sample DNA.
“The more important thing is that we’re watching it in real time,” says Turner. “The other technologies have very brittle and regimented recipes, where they apply the polymerase to a mixture of molecules for a preset period of time.” A typical process is to flood the DNA sample with enzymes and nucleotides, wash away the mixture, and then retroactively determine the sequence by checking which nucleotides successfully bonded with the sample. SMRT sequencing, however, checks the bases one by one as they’re incorporated into the polymerase. “[We give] each base in the sequence precisely the amount of time it needs to incorporate,” says Turner. “No more, no less.”
A manufacturing floor for SMRT sequencers. Image credit: Pacific Biosciences
Both these elements serve to give a clearer picture of fragile X genetics. Because SMRT sequencing targets just one molecule of sample DNA at a time, it is ideally suited to detecting mosaicism: if an individual has different FMR1 alleles in different chromosomes, the SMRT sequencer will record those sequences separately, rather than returning blended results. In addition, the real time analysis lets SMRT sequencers detect DNA methylation, a chemical modification of nucleotides that causes the gene silencing in fragile X syndrome. Methylated bases incorporate into the PacBio polymerase at a predictably different rate than unmodified bases, a difference that is automatically recorded as the sequencer reads the DNA in real time. This means that a SMRT diagnostic test should be able to tell individuals not only the sequence of their FMR1 genes, but also the extent to which they are silenced.
To PacBio, this precision is a validation of the company’s recent move into clinical functions. “Clearly at some point in the future, Pacific Biosciences technology is poised to play an important role [in the clinic],” Turner told Bio-IT World. PacBio signed a $75 million licensing deal with Roche this September to develop in vitro diagnostics based on SMRT technology, making it clear that the fragile X diagnostic is not an isolated project.
Turner emphasizes that this deal will not prohibit outside groups from marketing their own SMRT-based tests, so if a fragile X diagnostic does emerge from Dr. Hagerman’s lab, it is unlikely to fall under the terms of the Roche agreement.
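The mosaicism point made earlier in this section, that single-molecule reads keep different alleles separate instead of blending them, can be sketched with a simple tally. Everything here is hypothetical: the per-read repeat counts are invented, and the 5% support threshold is an arbitrary assumption rather than anything PacBio or the Hagerman lab describes.

```python
from collections import Counter

def summarize_alleles(per_read_counts, min_fraction=0.05):
    """List (repeat_count, fraction_of_reads) for alleles with enough support."""
    if not per_read_counts:
        return []
    tally = Counter(per_read_counts)
    total = len(per_read_counts)
    return sorted(
        (count, n_reads / total)
        for count, n_reads in tally.items()
        if n_reads / total >= min_fraction
    )

# Hypothetical mosaic sample: each number is the CGG count seen in one read.
reads = [30] * 60 + [95] * 30 + [140] * 10

print(summarize_alleles(reads))   # one entry per distinct allele
print(sum(reads) / len(reads))    # a blended average that matches no molecule
```

A method that reports only an averaged signal would see something like the second number, while per-molecule counts expose the separate populations.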
Statewide Screening
A genetic diagnostic for fragile X syndrome will be a major step forward for both the families affected by the disorder, and future research into its genetics. “We’re going to position ourselves to understand new findings much better,” says Turner, “because we have the full sequence.”
However, even the best clinical test wouldn’t address one of the most important limitations to diagnosis today. Fragile X is just one of many causes of autism and cognitive impairment, and although its physical signs help clinicians identify the disorder, many cases go undiagnosed for months or years before the symptoms become obvious. This can set back treatments that have a chance to improve the lives of individuals with fragile X. There is no cure for the syndrome, says Dr. Hagerman, but “it’s very clear that early intervention is beneficial. Both medical intervention and behavioral, educational intervention has a very good effect on outcome.” Certain medications can help lessen children’s anxiety, OCD-like symptoms, or hyperactivity. More importantly, early behavioral therapy often makes a crucial difference in acclimating children with fragile X to social situations.
Those with full fragile X mutations are at least generally diagnosed during childhood. People with premutations, however, may never receive a diagnosis. Children may have unexplained seizures; adults with FXTAS may undergo loss of memory and motor control that could have been alleviated by early intervention; women with FXPOI may delay having children only to discover that they have a condition causing early menopause. These individuals may never have any cause to seek out the genetic test that could have warned them about their carrier status.
That’s why Dr. Hagerman is also trying to develop a cheaper, faster screening test, with the ultimate goal of testing all infants in his home state of California. The screen would only detect whether the FMR1 repeat region is longer than normal, but those individuals flagged in screening could then receive the full diagnostic.
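The logic of such a screen can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it counts the longest uninterrupted run of CGG units in a read and flags anything above a cutoff. The cutoff of 44 repeats is an assumed value for "normal" used here for illustration, and real FMR1 repeat tracts can contain interruptions that this toy version does not handle.

```python
import re

def longest_cgg_run(sequence):
    """Return the length, in repeat units, of the longest uninterrupted CGG run."""
    runs = re.findall(r"(?:CGG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)

def flag_for_followup(sequence, normal_max=44):
    """True if the repeat run is longer than the assumed normal maximum."""
    return longest_cgg_run(sequence) > normal_max

# Hypothetical reads: flanking bases are invented for the example.
short_allele = "ACGT" + "CGG" * 30 + "TTT"
long_allele = "ACGT" + "CGG" * 70 + "TTT"
print(longest_cgg_run(short_allele), flag_for_followup(short_allele))
print(longest_cgg_run(long_allele), flag_for_followup(long_allele))
```

Flagged samples would then go on to the full diagnostic, which looks at sequence, mosaicism, and methylation.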
“At present,” says Hagerman, “there is no way to do that test on a cost-effective basis for large numbers of individuals.” The challenges are very different from those facing a clinical diagnostic. Mosaicism, methylation, and single-point mutations could all be ignored, and Hagerman’s 2012 sequencing was already more than accurate enough. Instead, he says, the questions are, “How low can you push the cost? Can you get it down to, say, a dollar a test? And can you do tens of thousands in a reasonable timeframe?”
These are questions PacBio hopes to be well-positioned to address. “The nice thing about the PacBio method and SMRT sequencing is that you, in principle, can do a high degree of multiplexing,” adds Hagerman. “So you can pool large numbers of samples in the same tube and sequence.” Users of SMRT sequencers can “barcode” their samples when multiplexing, attaching unique 16-base-pair sequences to the DNA samples so that, after sequencing, it’s easy to tell which sequence came from which source.
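As a rough illustration of how barcoded reads get sorted back to their sources, here is a minimal Python sketch. It assumes the 16-base barcode sits at the very start of each read and must match exactly; the barcode sequences and sample names are invented for the example, and real software would have to tolerate sequencing errors in the barcode.

```python
from collections import defaultdict

BARCODE_LENGTH = 16  # barcode length mentioned in the article

def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using an exact match on the leading barcode.

    Reads whose barcode is not in the table go to an 'unassigned' bin.
    The barcode is stripped from each read before it is stored.
    """
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LENGTH], read[BARCODE_LENGTH:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        bins[sample].append(insert)
    return dict(bins)

# Hypothetical barcodes and pooled reads.
barcodes = {
    "ACGTACGTACGTACGT": "sample_A",
    "TTGGCCAATTGGCCAA": "sample_B",
}
pooled_reads = [
    "ACGTACGTACGTACGT" + "CGGCGGCGG",
    "TTGGCCAATTGGCCAA" + "CGGCGG",
    "G" * 16 + "AAAA",
]
print(demultiplex(pooled_reads, barcodes))
```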
This makes it easier to run mass screenings, but the price of testing remains a concern. One of the reasons SMRT technology remains favored for niche applications that other sequencers can’t cover is that PacBio’s costs can be much higher than the industry standard. With hundreds of thousands of children born in California every year, Dr. Hagerman’s screening ambitions rest heavily on that $1-per-test figure. And while the fragile X diagnostic is covered by the NIH grant, Hagerman’s lab has yet to locate a funding source to help develop their screen.
Screening also faces political obstacles. Newborn screening can sometimes discover unwanted results. Huntington’s disease, for example, is caused by a similar tandem repeat mutation in the HTT gene on chromosome 4. However, no early treatment or intervention has ever been proven effective in treating Huntington’s – meaning that a screen is likely to cause serious anxiety to those carrying the mutation, without helping them fight the disease. To secure support for a fragile X screen, Hagerman will need to address this possibility. “I think the issue,” he says, “is going to be, what is the benefit of screening? What the state will want to know – what any state would want to know – is, why is there an advantage? Is there an early intervention that would justify newborn screening?”
To Dr. Hagerman, these questions are already settled. “I think most people now are accepting of the fact that early intervention for fragile X is beneficial,” he says. “Even for the premutation individuals, there’s very clearly a benefit of early intervention – particularly because some significant fraction of children will develop seizures… You want to get at those early and aggressively.” Still, implementing a screening program large enough to make a widespread difference will require some political persuasion, in addition to the scientific challenge.
Hagerman is prepared to break down those barriers. “We’re optimistic that we can meet the milestones… We’re really talking about, eighteen months to two years from now, having a clinical diagnostic test,” he says. Just having the diagnostic available could make the screen easier to add into the mix, because there will be a clear next step for those infants flagged in screening. It could also encourage broader adoption of the technology needed to run the screens, as SMRT sequencers remain fairly uncommon pieces of equipment. Dr. Hagerman sometimes has to ship his samples to Washington State when the UC Davis instrument is unavailable. “In the scheme of things,” he says, however, “putting a PacBio sequencer in [a diagnostic] facility is not such an onerous task. After all, we see mass spectrometers that cost way more in such facilities for doing screening and protein analysis.”
California is a well-chosen location to test the feasibility of Dr. Hagerman’s vision. Earlier this year, the state made another genetic screen – a noninvasive prenatal test for chromosomal disorders like Down syndrome – available to all pregnant women considered at elevated risk. The state is also large and influential, and a statewide screen for fragile X could serve as a model for other regions ready to embrace genetic diagnostics. Meanwhile, other previously unsequenceable tandem repeat disorders, like myotonic dystrophy and Friedreich's ataxia, now look open to diagnostics using SMRT sequencers. Whether sooner or later, these kinds of genetic tests are likely to play a major role in improving public health from birth to old age.
As the example of fragile X shows, the road to that future will not always be smooth – but if researchers are determined enough, medical feats once thought impossible can gradually become routine.
http://www.bio-itworld.com/2013/12/18/fragile-expedition.html?utm_source=dlvr.it&utm_campaign=Buffer&utm_content=buffer50730&utm_medium=twitter
European Biotechnology News - Exploiting the genome
04.12.2013 - PacBio, the UK’s Sanger Institute and Public Health England will analyse the genomes of 3,000 bacterial pathogens.
Using Pacific Biosciences' Single Molecule, Real-Time (SMRT) technology, sequencing will be carried out at Europe’s largest sequencing hub, the Wellcome Trust’s Sanger Institute in Hinxton, UK, over the next three years. The finished genome sequences are to be stored in the GenBank database.
Most bacteria currently have no genome references, and combining reference genomes with the wealth of historical and biological information existing for these strains will generate a data set of enormous value for clinical microbiology as well as basic research.
PacBio’s 3rd generation SMRT sequencing technology achieves the longest read lengths and highest consensus accuracy in the industry. Because the technology can directly detect base modifications, the epigenomes for bacteria can also be obtained with no additional data acquisition, providing unprecedented insight into the role of DNA methylation in bacterial pathogenicity.
Most recently, PacBio signed an exclusive agreement with global personalised healthcare major F. Hoffmann-La Roche to develop a SMRT sequencer suitable for clinical sequencing. The SMRT platform can uniquely read out a sequence without the error-prone PCR step required in other technologies.
http://www.european-biotechnology-news.com/news/news/2013-04/exploiting-the-genome.html?utm_content=bufferbc6ee&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer
Jonathan Eisen's Lab - PacBio Sequence Assembly Workshop
Posted on December 11, 2013 - PacBio is hosting an evening symposium next week as part of another workshop I’m organizing on campus. All are encouraged to attend! Plenty of food available afterwards.
PacBio Sequence Assembly Workshop
Tuesday, December 17th 2013, 4 pm – 7 pm
The Auditorium, 1005 GBSF
4:00 pm Welcome & Introductions
4:00 – 4:30 pm Shane Brubaker, Solazyme
“Assembly, haplotyping, and annotation of a high GC algal genome.”
4:30 – 5:00 pm Jason Chin, PacBio
“String graph assembly for diploid genomes with long reads.”
5:00 – 5:30 pm Lex Nederbragt, University of Oslo
“Using PacBio reads to improve and validate the assembly of the complex Atlantic cod genome.”
5:30 – 6:00 pm Lawrence Hon, PacBio
“Larger genome hybrid assembly with PacBio.”
6:00 – 7:00 pm Reception & Discussions
Light Refreshments Will Be Served in GBSF Lobby
http://phylogenomics.wordpress.com/2013/12/11/pacbio-sequence-assembly-workshop/
Pacific Biosciences of California CEO Michael Hunkapiller Acquires 100,000 Shares (PACB)
Posted by Joseph Griffin on Dec 13th, 2013
Pacific Biosciences of California (NASDAQ:PACB) CEO Michael Hunkapiller purchased 100,000 shares of Pacific Biosciences of California stock on the open market in a transaction dated Thursday, December 12th. The shares were purchased at an average cost of $4.13 per share, with a total value of $413,000.00. Following the acquisition, the chief executive officer now directly owns 1,800,000 shares in the company, valued at approximately $7,434,000. The transaction was disclosed in a legal filing with the Securities & Exchange Commission.
Shares of Pacific Biosciences of California (NASDAQ:PACB) opened at $4.52 on Friday. Pacific Biosciences of California has a 52-week low of $1.53 and a 52-week high of $6.50. The stock’s 50-day moving average is $4.11 and its 200-day moving average is $3.70. The company’s market cap is $299.0 million.
Pacific Biosciences of California (NASDAQ:PACB) last posted its quarterly earnings results on Tuesday, October 22nd. The company reported ($0.31) earnings per share (EPS) for the quarter, missing the consensus estimate of ($0.30) by $0.01. The company had revenue of $7.40 million for the quarter, compared to the consensus estimate of $7.08 million. During the same quarter last year, the company posted ($0.41) earnings per share. Pacific Biosciences of California’s revenue was up 164.3% compared to the same quarter last year. On average, analysts predict that Pacific Biosciences of California will post ($1.25) earnings per share for the current fiscal year.
Separately, analysts at Maxim Group raised their price target on shares of Pacific Biosciences of California from $4.00 to $8.00 in a research note to investors on Thursday, September 26th. They now have a “buy” rating on the stock.
Pacific Biosciences, Inc. (NASDAQ:PACB) develops a deoxyribonucleic acid (DNA) sequencing platform.
http://tickerreport.com/banking-finance/90083/pacific-biosciences-of-california-ceo-michael-hunkapiller-acquires-100000-shares-pacb/
Convey Computer’s Implementation of PacBioToCA Algorithm Speeds DNA Sequence Assembly, Delivering Up to Fifteen Times Acceleration
PacBioToCA Is the Newest Addition to Convey’s Expanding Bioinformatics Suite, Helping to Speed Genomic Research
The Convey hybrid-core system allows customers worldwide to enjoy increased application performance with lower ownership costs.
“The combination of the PacBioToCA algorithm and a Convey HC system allows our customers to dramatically speed up research for projects in areas such as functional genomics, comparative genomics, and beyond.” -- Kevin Corcoran, Pacific Biosciences
Richardson, TX (PRWEB) December 11, 2013
Convey Computer™ Corporation announced today the newest addition to Convey’s expanding bioinformatics suite, PacBioToCA, an application that facilitates the assembly of genomes sequenced with Pacific Biosciences® long-read technology. Optimized to take advantage of the highly parallel processing architecture of the Convey hybrid-core (HC) server, PacBioToCA delivers six to fifteen times acceleration.
Researchers running PacBioToCA on Convey HC systems for sequencing and assembly are seeing exceptional results. “The speed up is significant; but even more importantly, researchers are now able to test more parameters,” commented Dr. George Vacek, Director of Convey Computer’s Life Sciences business unit. “Achieving results in a matter of days instead of weeks allows them to refine their approach and get better answers.”
The PacBio® RS II DNA Sequencing System, from Pacific Biosciences (NASDAQ: PACB), helps scientists solve genetically complex problems. Their single-molecule sequencing instruments can generate industry-leading sequence read lengths that dramatically improve genome and transcriptome assembly.
Researchers are attracted to the exceptionally long PacBio reads because they can deliver higher quality assemblies. Prior to the development of algorithms optimized for PacBio read data (such as PacBioToCA), single-pass error rates had been perceived to limit their utility in de novo assembly.
Last year, Dr. Sergey Koren, Bioinformatics Scientist at the National Biodefense Analysis and Countermeasures Center, and his colleagues developed an assembly strategy that uses short sequences (either from PacBio circular consensus sequencing or short read technologies) typical of high-throughput sequencers to correct the errors in PacBio reads. This strategy was subsequently extended to use shorter single-molecule reads to correct the longest ones. These techniques deliver high-accuracy long reads, resulting in gold standard genome assemblies.
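The core idea of using accurate short sequences to fix a noisy long read can be shown with a toy Python sketch. This is not the PacBioToCA implementation: it assumes the short reads have already been placed on the long read at known offsets, it handles only substitution errors (no insertions or deletions), and it simply takes a majority vote at each position. All names and sequences are invented for illustration.

```python
from collections import Counter

def correct_long_read(long_read, placed_short_reads):
    """Correct a long read by majority vote over overlapping short reads.

    placed_short_reads is a list of (offset, sequence) pairs, where offset
    is the assumed start position of the short read on the long read.
    Positions with no short-read coverage keep their original base.
    """
    votes = [Counter() for _ in long_read]
    for offset, short in placed_short_reads:
        for i, base in enumerate(short):
            position = offset + i
            if 0 <= position < len(long_read):
                votes[position][base] += 1
    corrected = []
    for original, counter in zip(long_read, votes):
        corrected.append(counter.most_common(1)[0][0] if counter else original)
    return "".join(corrected)

# The true sequence is ACGTACGTAC; the long read has one wrong base at index 4.
noisy_long_read = "ACGTTCGTAC"
short_reads = [(0, "ACGTA"), (2, "GTACG"), (3, "TACGT")]
print(correct_long_read(noisy_long_read, short_reads))
```

In practice the hard and expensive part is finding where each short read belongs on each long read, which is the alignment step the press release says was accelerated.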
For larger genomes, the PacBioToCA algorithm can be time-consuming; therefore, Koren collaborated with Convey to optimize the PacBioToCA algorithm for Convey’s highly parallel HC systems. The optimized version of PacBioToCA runs much faster on the Convey HC servers because the alignment algorithm it uses is significantly faster on a Convey HC-2ex server than the best implementation on a standard server.
“It has been shown that long PacBio reads processed with PacBioToCA lead to such high-quality assemblies, researchers are saved the significant subsequent cost of manual finishing,” explained Kevin Corcoran, Senior Vice President of Market Development at Pacific Biosciences. “The combination of the PacBioToCA algorithm and a Convey HC system allows our customers to dramatically speed up research for projects in areas such as functional genomics, comparative genomics, and beyond.”
Convey’s groundbreaking hybrid-core computing architecture tightly integrates advanced computer architecture and compiler technology with commercial, off-the-shelf hardware—namely Intel® Xeon® processors and Xilinx® Field Programmable Gate Arrays (FPGAs). Particular algorithms are optimized and translated into code that’s loaded onto the FPGAs at runtime to accelerate applications that use these algorithms. The systems help customers dramatically increase performance over industry standard servers while reducing energy costs associated with high-performance computing.
“Adding PacBioToCA to the Convey Bioinformatics Suite reflects our ongoing commitment to the bioinformatics and life sciences community,” concluded Vacek. “We enjoy working with innovators to bring solutions to the industry that will help solve the challenges of the rapidly changing area of sequencing. We look forward to continuing to collaborate with Pacific Biosciences and others on optimization of bioinformatics workflows.”
Convey’s expanding bioinformatics suite is made up of a number of personalities including the Convey GraphConstructor™ for de novo short read assembly, Smith-Waterman for local sequence alignment, and Burrows-Wheeler Aligner for fast reference mapping.
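For readers unfamiliar with Smith-Waterman, here is a minimal Python sketch of the scoring part of the algorithm. It is a plain software version with a simple linear gap penalty and says nothing about how Convey implements it in hardware; the scoring values (match 2, mismatch -1, gap -2) are arbitrary choices for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses the standard dynamic-programming recurrence with a linear gap
    penalty, keeping only two rows of the score matrix in memory.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                             # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,             # gap in b
                current[j - 1] + gap,          # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best

# The shared core ACGT scores 4 matches x 2 = 8 despite different flanks.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The work grows with the product of the two sequence lengths, which is why alignment dominates run time on large data sets and is a natural target for acceleration.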
About Convey Computer Corporation
Based in Richardson, Texas, Convey Computer breaks power, performance and programmability barriers with the world’s first hybrid-core computer—a system that marries the low cost and simple programming model of a commodity system with the performance of a customized hardware architecture. Using the Convey hybrid-core systems, customers worldwide in industries such as life sciences, research, big data, and the government/military enjoy order of magnitude performance increases while reducing acquisition and operating costs. http://www.conveycomputer.com
http://www.prweb.com/releases/Convey/PacBio/prweb11058735.htm