Pacific Biosciences of California Inc (PACB): Whole story from incomplete link on 1/28...

Paulieme

Re: Paulieme post# 301

Friday, 01/31/2014 6:04:06 PM

Whole story from incomplete link on 1/28/14:

PacBio Demos First De Novo Animal Genome as it Plans Longer Reads, Increased Throughput
January 28, 2014
By Monica Heger
Researchers have sequenced and de novo assembled the Drosophila melanogaster genome on Pacific
Biosciences' RS II — the first time an animal genome has been sequenced and assembled solely with
PacBio technology — and have produced a genome with fewer gaps and longer contigs than the current
reference.
Sergey Koren, a bioinformaticist at the National Biodefense Analysis and Countermeasures Center and
University of Maryland, developed software for error correction of PacBio reads dubbed PBcR, and
presented on the Drosophila assembly at the International Plant and Animal Genome meeting in San
Diego earlier this month.
Additionally, the company is planning this year to increase its throughput four-fold to achieve 1 gigabase
of data per SMRT cell and average read lengths greater than 10-15 kilobases, as well as improvements to
sample prep and new methods for assembly of diploid genomes.
The Drosophila genome, estimated to be around 140 megabases, but potentially as large as 220
megabases, was sequenced in six days using 42 SMRT cells to 90-fold coverage and produced average
read lengths of 10 kilobases. Using the Celera assembler, the researchers constructed a haploid assembly
in 128 contigs with an N50 length of 15 megabases and a maximum contig length of 24.6 megabases.
Total turnaround time from sample to final assembly was six weeks.
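The figures in this paragraph imply a rough per-cell yield, which can be checked with simple arithmetic. The sketch below assumes the 90-fold coverage is measured against the stated genome size estimate (the article does not say which estimate was used), and the function name is only illustrative.

```python
def per_cell_yield_mb(genome_mb, fold_coverage, n_cells):
    """Average megabases per SMRT cell implied by a coverage target."""
    total_mb = genome_mb * fold_coverage  # total sequenced bases, in Mb
    return total_mb / n_cells


# Figures quoted in the article: 90-fold coverage from 42 SMRT cells.
# The genome size is given as roughly 140 Mb, possibly as large as 220 Mb.
for genome_mb in (140, 220):
    total_gb = genome_mb * 90 / 1000
    per_cell = per_cell_yield_mb(genome_mb, 90, 42)
    print(f"{genome_mb} Mb genome: {total_gb:.1f} Gb total, "
          f"about {per_cell:.0f} Mb per SMRT cell")
```

With the 140-megabase estimate this works out to 300 megabases per cell, which is in the same range as the roughly 250 megabases implied by a planned four-fold increase to 1 gigabase per cell.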
PacBio scientists collaborated on the project with researchers involved in the
Berkeley Drosophila Genome Project, and researchers from the University of Maryland and the
University of Manchester.
According to Sue Celniker, co-director of the Berkeley Drosophila Genome Project, the PacBio-only
assembly is a huge improvement over the reference genome, which is currently in its fifth iteration.
Researchers involved in the Berkeley Drosophila Genome Project have spent over 10 years working on
the reference genome using a combination of Sanger sequencing, BAC clones, and other manual and
labor-intensive approaches. Yet, using just one next-gen sequencing technology, and over just six weeks,
the PacBio technology was able to piece together regions that have proved particularly troublesome, like
heterochromatin and the Y chromosome, she said.
"There's been some persistent repeats that we couldn't get through, that [PacBio] did," she told In
Sequence. "Having those very long reads allows you to get through large arrays of repeats."
Researchers are still evaluating and comparing the PacBio assembly to the reference, so Celniker said she
could not say precisely how many of the remaining gaps the PacBio assembly was able to close.
However, it is already clear that in some cases the long reads were able to generate a more contiguous
sequence than the reference. For instance, chromosome 2R was reduced to two pieces in the PacBio
haploid assembly from 27 pieces in the reference. Chromosome 2L was reduced to between 4 and 6
pieces from 6 pieces, and chromosomes 3L and 3R were reduced to 1 and 3 pieces in the PacBio
assembly from 22 and 15 pieces, respectively.
Additionally, in the most recent release of the Drosophila reference genome, only around 1 percent of
chromosome Y is represented. While the BDGP researchers have since assembled around 7.5 percent of
the Y chromosome, the team anticipates that more than half of the Y chromosome will be assembled with
the PacBio data.
Part of the reason for less Y representation in the reference genome is that the fly DNA was taken from
embryos, so there is no way to know whether male or female DNA was being used, Casey Bergman, a
senior lecturer in computational and evolutionary biology at the University of Manchester, told IS. But in
the PacBio collaboration, only male flies were used, he said.
Bergman's lab became involved with the project last summer after it released a dataset of whole-genome
shotgun sequences of the Drosophila reference strain generated with PacBio technology, as well as
Illumina sequences that it used to error-correct the PacBio reads. The company contacted Bergman to
collaborate on generating data and doing de novo assembly using its newer sequencing chemistry.
Bergman said that this Drosophila genome validates PacBio's technology for use in de novo assembly, and
shows the value of long reads. Genomes that have been assembled using short-read sequencing
technology, like the panda genome, are put together in contigs that are tens of kilobases, he said. But,
the Drosophila has an N50 of 12 megabases. "That is chromosome-sized segments. It is what was
declared finished for many genomes 10 years ago, and is of much higher contiguity and sequence
quality," he said.
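For readers unfamiliar with the N50 figure quoted here, it is the contig length at which contigs of that size or larger contain at least half of the assembled bases. A minimal sketch of the calculation follows; the contig lengths are invented toy numbers for illustration, not data from the Drosophila or panda assemblies.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together hold at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Hypothetical assemblies (lengths in megabases), for illustration only.
few_long_contigs = [24, 15, 12, 8, 5, 3, 1]
many_short_contigs = [5] * 10

print(n50(few_long_contigs))    # a few long contigs give a high N50
print(n50(many_short_contigs))  # many equal short contigs give a low N50
```

The metric rewards assemblies where most of the sequence sits in a small number of long contigs, which is why it is used here as a measure of contiguity.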
Short-read sequencing technology is valuable for applications like identifying genes or fragments of
genes, and enables many genomes to be sequenced cost-effectively, but it doesn't give you the
long-range architecture, Bergman said.
The PacBio-only assembly also has some advantages over the hybrid PacBio/Illumina assembly,
Bergman said.
One problem with error correction, he said, is that Illumina technology does not sequence well through
repetitive regions, so the Illumina-corrected reads in those repetitive regions are not as good. "You don't
really get the gain in the regions of the genome where you need them for the long-range assemblies," he
said.
Adam Phillippy with the National Biodefense Analysis and Countermeasures Center, who worked on the
assembly, agreed. In theory, a hybrid assembly approach is beneficial because it combines two orthogonal
technologies and can take advantage of the strengths of both, he said. And indeed, in many genomic
regions, a hybrid assembly works well. But, since short reads do not align well to certain regions, like
repeats, it is difficult to use short reads for error correction in those regions.
"Short reads are notoriously hard to map against a repetitive genome," Phillippy said. "It's much easier to
align long reads to long reads, so you assemble the repeats much more effectively."
Phillippy and Koren last year published a study in Genome Biology, estimating a cost of about $1,000
for de novo sequencing and assembly of microbes with PacBio technology. Additionally, the researchers
compared self-correction to hybrid correction and found that self-correction was often better in terms of
accuracy and contiguity.
Phillippy said that he expects these conclusions for microbial genomes to carry over to larger genomes,
especially as throughput and read lengths continue to increase, and the Drosophila genome is the first
evidence of that.
Further improvements
Looking ahead, Jonas Korlach, PacBio's CSO, said that the company is planning further improvements to
its read lengths and throughput this year.
The company plans to increase throughput to 1 gigabase per SMRT cell and average read lengths to
greater than 10-15 kilobases. An increase in read length will be achieved by several factors, Korlach said.
The company continues to study different polymerases and is working out ways to optimize the signal
from the nucleotide.
For instance, in its latest sequencing chemistry, P5-C3, the company incorporated a protective scaffolding
strategy, which reduces photo damage to the polymerase and enables longer reads. Korlach said that the
company continues to improve upon this strategy. Additionally, the company has found that "nicks or
damage to the DNA template can stall the polymerase and thereby reduce read length," so researchers are
looking at ways to do more "efficient DNA damage repair during sample preparation."
Korlach added that the company is also looking at ways to improve loading efficiency, which would also
increase throughput. Each SMRT cell contains 150,000 zero-mode waveguides, each of which has the
potential to be occupied by a polymerase and template complex. However, the current method of loading
is limited by Poisson statistics, meaning that only about one-third of the ZMWs will be occupied with one
polymerase-template complex with the remainder occupied by either none or more than one complex,
Korlach said. "However, we believe through improvements in loading, we can at least double the amount
that we are currently loading per SMRT cell."
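The "about one-third" figure matches a simple Poisson model of loading. The sketch below assumes each zero-mode waveguide receives an independent Poisson-distributed number of complexes with mean lam. That model and the choice lam = 1 (the value that maximizes single occupancy) are assumptions for illustration; the article does not give PacBio's actual loading parameters.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k complexes in a well when the mean is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


ZMWS_PER_CELL = 150_000  # figure quoted in the article
lam = 1.0                # assumed mean complexes per ZMW

p_empty = poisson_pmf(0, lam)
p_single = poisson_pmf(1, lam)
p_multi = 1.0 - p_empty - p_single

print(f"empty: {p_empty:.3f}, single: {p_single:.3f}, multiple: {p_multi:.3f}")
print(f"singly loaded ZMWs per cell: about {ZMWS_PER_CELL * p_single:,.0f}")

# Single occupancy lam * exp(-lam) peaks at lam = 1, where it equals 1/e.
best_lam = max((x / 100 for x in range(1, 301)),
               key=lambda x: poisson_pmf(1, x))
print(f"mean that maximizes single occupancy: {best_lam:.2f}")
```

Under this model single occupancy cannot exceed 1/e, or roughly 37 percent, so doubling the loaded fraction would need a loading method that is not purely random. The article does not say how the company intends to achieve this.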
These improvements, which will be delivered over the year, will come in the form of software upgrades
and a new sequencing kit, Korlach said. None will require a hardware upgrade or installation.
Monica Heger tracks trends in next-generation sequencing for research and clinical applications for
GenomeWeb's In Sequence and Clinical Sequencing News. E-mail Monica Heger or follow her GenomeWeb
Twitter accounts at @InSequence and @ClinSeqNews.

http://files.pacb.com/pdf/MediaCoverage_Demos_FirstDeNovoAnimalGenome.pdf