Monday, August 18, 2014---Genome-Wide Methylation in Human Microbiome Samples
Scientists in Florida and Finland recently published a report of their work studying methylation patterns in two human microbiome samples. While microbiome studies have become quite popular, the authors note there have been no prior papers detailing genome-wide methylation of bacteria found in those studies. Their goal was to ascertain how much added functional variation might occur based on methylation patterns.
“The methylome of the gut microbiome: disparate Dam methylation patterns in intestinal Bacteroides dorei,” published in Frontiers in Microbiology, comes from lead author Michael Leonard and senior author Eric Triplett at the University of Florida plus a team of collaborators from hospitals and universities across Finland.
The scientists used Single Molecule, Real-Time (SMRT®) Sequencing for its ability not just to sequence bacterial genomes to closure, but also to read methylation patterns across those genomes. They studied two stool samples from children at high risk for developing type 1 diabetes; both stool samples were dominated by Bacteroides dorei. In both strains, after sequencing to closure using the PacBio® sequencer, the team looked at GATC motifs for Dam methylation, which is believed to change gene expression in bacteria.
A marked difference between the genomes was discovered during methylation analysis: the first strain lacked Dam methylation entirely, while the second contained more than 20,000 methylated GATC sites. (Indeed, that strain only had three GATC sites that were not found to be methylated.) Scientists determined that the first genome lacked the DamMT gene, though both strains had other methylation patterns. “Another interesting observation is that of all of the methylation motifs observed in these two genomes, none is methylated in both genomes,” the authors report. “This suggests that the primary source of methyltransferases in these genomes is through lateral transfer, often from phage.”
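The Dam analysis above comes down to enumerating GATC motifs across a closed genome and checking which are called methylated. A minimal sketch of that bookkeeping, with an illustrative sequence and invented methylation calls (the real analysis works from SMRT kinetic data, not a hand-made set):

```python
# Minimal sketch: enumerate candidate Dam methylation sites (GATC motifs)
# in a genome string and count how many fall in a set of methylated
# positions. Sequence and methylation calls are illustrative only.

def gatc_sites(genome):
    """Return 0-based start positions of every GATC motif.

    GATC is its own reverse complement, so a single-strand scan
    finds the sites on both strands."""
    positions = []
    start = genome.find("GATC")
    while start != -1:
        positions.append(start)
        start = genome.find("GATC", start + 1)
    return positions

genome = "TTGATCAAGATCGGATCCTAGATC"
sites = gatc_sites(genome)
methylated = {2, 8, 20}           # hypothetical calls from methylation analysis
covered = sum(1 for p in sites if p in methylated)
print(len(sites), covered)        # prints: 4 3
```

At genome scale the same scan yields the denominators reported in the paper, e.g. the more than 20,000 GATC sites in the second strain, of which all but three were methylated.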
Based on these remarkable differences, the scientists conclude that DNA sequence alone is not enough to understand the function of bacterial strains in a microbiome sample. “This work suggests that future microbiome studies should consider the methylome when describing the bacterial diversity in the gut,” the authors write. “Such analyses are no longer difficult given the latest sequencing technologies.”
http://blog.pacificbiosciences.com/2014/08/genome-wide-methylation-in-human.html
Chemical & Engineering News
Issue Date: August 18, 2014
Cover Story
Next-Gen Sequencing Is A Numbers Game
As technical and cost barriers fall, instrument firms move their systems into research and clinical markets
By Ann M. Thayer
Sequencing the human genome the first time cost $3 billion and took 13 years to complete. That was in 2003. Eleven years later, next-generation sequencing technology has brought the single-genome price close to $1,000 and cut the time to days.
These advances have enabled new opportunities for genomic studies. For example, Genomics England, set up by the U.K.’s Department of Health, plans to sequence 100,000 human genomes by the end of 2017. The project will focus on individuals with cancer and rare diseases in the hope of transforming diagnosis and treatment.
Next-generation sequencing, or NGS, is moving quickly into research, clinical, and diagnostic applications. As it does, users and regulators are learning how to handle the technology and the resulting genetic information. While researchers move up the learning curve, instrument developers are close behind. Armed with a dizzying array of technologies, they are competing with each other to introduce faster sequencing, higher accuracy, and even lower costs.
DOMINATOR: Illumina rules the market for next-generation sequencers. SOURCES: Mizuho Securities USA, Frost & Sullivan

So far, Illumina leads the race. In January, the San Diego-based firm launched its HiSeq X Ten system with a price tag of $10 million. Consisting of 10 ultra-high-throughput sequencers, each capable of generating up to 1.8 terabases of data in less than three days, the system can sequence about 18,000 human genomes per year.
Illumina uses a sequencing-by-synthesis method. After DNA fragments are amplified on a chip, sequencing occurs by synthesizing a DNA strand complementary to the target strand by enzymatically attaching fluorescently labeled nucleotides one at a time. When reactions occur, the labels are optically imaged to identify what was attached, and the cycle is repeated.
Although Illumina says it has pushed the per-genome price under $1,000—including instrument and operating costs, but not overhead and data handling—the X Ten system is cost-effective only for large-volume users. The company has already found at least 13 such customers, including the Boston-based Broad Institute, the U.K.’s Wellcome Trust Sanger Institute, Australia’s Garvan Institute of Medical Research, and the Sidra Medical & Research Center in Qatar.
Even before the X Ten, Illumina commanded more than 70% of the $1.3 billion NGS market, according to the research firm Frost & Sullivan. Two-thirds of that money is spent on reagents and consumables and the rest on instruments sold by Illumina, Thermo Fisher Scientific, Roche, and Pacific Biosciences, usually for between $50,000 and $750,000.
“Newer instruments will generally fall into the category of low- to mid-throughput, benchtop-sized sequencers for both the research and clinical spaces,” says Christi Bird, life sciences senior industry analyst at Frost & Sullivan. They will tend to sell for less than $100,000. Although she expects some market share losses as new firms enter the business, the erosion will largely be offset by expansion of the pie overall. Frost expects the overall NGS market to grow about 16% per year through 2018.
SEQUENCING GENEALOGIES: Trace the emergences, mergers, and milestones of the power players in gene sequencing technology over the past 17 years. Companies’ techniques are listed below their names.

Sales of X Ten systems no doubt look good on Illumina’s bottom line, but the firm insists that selling a range of systems is important to its growth strategy. It has developed a portfolio to target different markets, applications, and throughput needs, explains Joel Fellis, senior manager for systems and genomic services marketing. “We’re interested in making sequencing much more widely accessible, easier to use, and focused on end-to-end solutions.”
In January, for example, Illumina launched the NextSeq 500, a $250,000 benchtop machine priced between its small-scale MiSeq and workhorse HiSeq 2500 systems. In about a day, the NextSeq can run a whole 3 billion-base genome, 20 prenatal test samples, 16 exomes, 48 gene-expression samples, or 96 targeted panels. “It really is expanding our customer base,” Fellis explains.
Besides a core life sciences analysis market worth about $5 billion per year, Illumina targets the $12 billion research and clinical oncology market, the $2 billion reproductive and genetic health area, and about $1 billion in emerging prospects, such as infectious disease and food. Clinical sequencing will be a “turning point” as it drives the NGS business into diagnostics, Frost’s Bird predicts.
In November 2013, Illumina became the first company to receive Food & Drug Administration clearance for an NGS system used in diagnostics. The approval covers its MiSeqDx instrument, reagents for isolating and copying genes, and gene analysis software. As a result, labs will be able to use the system to develop and validate tests that involve sequencing any part of a patient’s genome. FDA also approved Illumina kits to detect cystic fibrosis-related mutations.
With the new instrument, Illumina is testing the diagnostic waters for the industry in a fast-changing regulatory climate. “Illumina’s recent approval enhances marketing, increases the FDA’s comfort level, and stays ahead of potential shifts in regulatory oversight,” Mizuho Securities USA stock analyst Peter Lawson pointed out in a recent report to clients.
Not surprisingly, Illumina and other NGS companies plan to develop and register more diagnostic systems, as do some leading diagnostic developers that have bought NGS start-ups. The German diagnostics firm Qiagen acquired U.S.-based Intelligent Bio-Systems in 2012. Similarly, in April, the U.S. lab products and diagnostics supplier Bio-Rad Laboratories acquired GnuBio of Cambridge, Mass.
Both the GnuBio and Qiagen NGS systems are being designed to go from sample through DNA library preparation, sequencing, and data analysis. Bio-Rad isn’t giving a timeline for selling a GnuBio system, but beta systems were available in 2013. Qiagen anticipates launching its benchtop GeneReader system for clinical applications this year, analysts say.
Meanwhile, England-based QuantuMDx is developing what it says will be a low-cost, simple-to-use device for 15-minute bedside diagnoses. It uses disease-specific cartridges and sequencing on nanowire biosensors. It plans to commercialize its handheld, chip-based device in 2015 for, it claims, “the price of a smartphone.”
California-based GenapSys has a bread-loaf-sized device using what it calls GENIUS, for Gene Electronic Nano-Integrated Ultra-Sensitive, technology. The four-year-old company is aiming for a $50 genome and point-of-care diagnostic use. According to analysts, performance data on the system could appear this year, with commercialization targeted for 2015 at a cost of a few thousand dollars.
Until these technologies are ready to compete, Illumina will enjoy the early-mover advantage. Since it started selling sequencers in 2007, its sales have nearly quadrupled, reaching $1.4 billion in 2013. Goldman Sachs stock analyst Isaac Ro believes that Illumina will continue to dominate the NGS market for the next few years.
BUYING POWER: Accuracy ranked highest in importance for users of next-generation sequencing systems. NOTE: Based on a survey of 108 end users. SOURCE: Frost & Sullivan

Thermo Fisher holds second place in the NGS market, with about 16% of sales. Its Life Sciences Solutions business has a decades-long history in gene sequencing and as a result offers several of the major technologies. Applied Biosystems Inc., a predecessor company, supplied Sanger sequencing instruments to decode the first human genome. In 2007, ABI launched its first NGS system based on sequencing by oligonucleotide ligation and detection, known as SOLiD.
Unlike highly accurate but less parallelizable Sanger methods, NGS systems carry out massive numbers of reactions, or sequence reads, at one time. Like Illumina’s approach, SOLiD uses sequencing by synthesis of amplified DNA fragments on either a bead or chip. Instead of nucleotides, it uses fluorescently labeled probes that are repeatedly ligated to the growing strand, optically imaged, and cleaved off. How long these processes can be kept going determines the “read length” that can be sequenced in a run.
The first lower-cost, nonoptical system appeared in 2010 after Life Technologies—now part of Thermo Fisher and formed from the 2008 merger of ABI and Invitrogen—acquired Ion Torrent for $725 million. Its systems use sequencing by synthesis, but with unlabeled nucleotides on a semiconductor chip. The chip electrically senses the release of hydrogen ions when bases attach. The full sequence is read by sequentially adding bases and tracking reactions across millions of microwells.
Today, Thermo Fisher continues to sell all the sequencing products, although NGS is growing the fastest. “The applications really drive what is the right choice of technology,” says Mark P. Stevenson, president of the Life Sciences Solutions unit. Sanger sequencing, although saddled with lower throughput and higher cost than NGS methods, is easy to use and offers long read lengths. “Over the years we have updated Sanger sequencing with faster reactions and newer software,” Stevenson adds.
“If you have just a few samples to run and want to know a small part of the DNA very accurately, then Sanger sequencing is still the best method,” Stevenson says. The industry considers it a “gold standard,” and the method is widely used in clinical diagnostics and for DNA analysis in forensics.
Thermo Fisher reports that it has sold more than 15,000 Sanger sequencers and more than 2,500 Ion Torrent systems. In 2013, the Ion Torrent business generated revenues of about $185 million, according to Goldman Sachs’s Ro. He predicts the business will grow about 30% this year, 20% in 2015, and then about 10% annually through 2018.
Since acquiring the Ion Torrent technology, Thermo Fisher has improved its performance, but the method has “just begun to be optimized,” Stevenson says. Although Thermo Fisher has looked at other technologies such as nanopore detection and even made small investments, they have “some way to go before having the same throughput or accuracy as the Ion Torrent,” he maintains.
Similarly, Illumina remains focused on its core chemistry, according to Ro. “While the company continues to research nanopore and single-molecule technologies, it is not yet convinced that the quality of data can be as high as sequencing by synthesis,” he says. This month, Illumina did receive a $592,000 National Institutes of Health grant to create a sequencing system around a hybrid protein and solid-state nanopore array.
The more gene sequencing technology is used, the more researchers are finding out what it can—and can’t—do. For example, medical and research centers are generating data on millions of genomic variants. Not only has handling all those data become a challenge, but much of the data is not understood. To overcome this, NIH is supporting a four-year, $25 million Clinical Genome Resource program evaluating which variants play a role in disease relevant to medical care.
Whole-genome and large population studies are expected to yield some of the necessary associations. However, a recent Stanford School of Medicine study found that even though NGS methods generally capture, or cover, most of the genome, “depending on the sequencing platform, 10 to 19% of inherited disease genes were not covered to accepted standards for single-nucleotide-variant discovery” (J. Am. Med. Assoc. 2014, DOI: 10.1001/jama.2014.1717).
The problem is even bigger. “Variations in the genetic blueprint are not just confined to single-base changes—the famous single-nucleotide polymorphisms that people go after—but are present at all different size scales,” explains Jonas Korlach, chief scientific officer at Pacific Biosciences and a company founder. Thousands of bases can be involved in structural variations such as insertions, deletions, inversions, and repeats, many of which have connections to cancer, Huntington’s disease, and other disorders.
For example, fragile X syndrome, the leading cause of heritable cognitive impairment and autism, arises from the expansion of a nucleotide repeat sequence in a specific gene. But sequencing the region has proven extremely difficult, Korlach says. Sometimes such DNA simply can’t be amplified. “Those pieces will often just fall out of the sample prep altogether, and they will never get to the sequencer,” he says.
Pacific Biosciences’ technology is designed to overcome problems that stem from the gene sequencers themselves. If the region of interest is present many times in the genome, the read length must be long enough to cover the region and more. Otherwise, “it looks like a sky piece in the jigsaw puzzle when you don’t have any tree branch to tell you where it might go,” Korlach says. “The piece is scientifically useless because you won’t be able to place it on the reference map of the human genome.”
Illumina and Ion Torrent technologies have read lengths up to a few hundred base pairs, while Sanger sequencing covers several hundred. In contrast, Pacific Biosciences’ technology has average reads of about 8,500 bases. Some users have reached tens of thousands of bases. Its RS II system costs about $700,000.
Pacific Biosciences’ single-molecule real-time sequencing is a sequencing-by-synthesis approach that doesn’t use an amplified set of DNA fragments and doesn’t require stopping and starting the reaction to add reagents and image results. Reactions on individual DNA molecules are tracked in real time across 150,000 nanoscale wells where isolated polymerases read the DNA and incorporate fluorescently tagged nucleotides. Because detection occurs only at the bottom of the wells, the background noise from the other reactions is reduced.
Stability of the sequencing process depends in large part on the polymerase. Pacific Biosciences has modified a simple bacteriophage enzyme, slowing it down so that it incorporates about three bases per second and its detector can keep up. To prevent inadvertent photo damage that could stop the process, the company has put a protective scaffold on the enzyme.
Although fast and cheap sequencing will yield much useful knowledge, it has come at a price because of the shorter read lengths, Korlach argues. Pacific Biosciences “wanted to build a technology first and foremost that gives the highest quality of sequence information,” he says.
The 10-year-old company launched its first sequencer in 2011 and has since improved its chemistry, detection, and throughput. On target for 70% sales growth this year, to about $47 million, Pacific Biosciences has installed more than 100 systems and has a market share of a few percent. Its business has seen “a nice boost as the platform continues to improve and be useful in several niches,” Mizuho’s Lawson says.
Long reads and high accuracy are critical for de novo sequencing, or deciphering a genome without comparison to an existing version. In February, Pacific Biosciences published a de novo human reference genome, one of just a few ever assembled. It is now focusing on providing nonhuman reference genomes. For example, it is collaborating with Sanger Institute and Public Health England to complete the sequences of 3,000 microbial strains.
To branch into the rapidly growing human diagnostics field, Pacific Biosciences signed a deal in late 2013 with Roche worth up to $75 million. The companies plan to develop a system for clinical use that Roche will sell. Pacific Biosciences will get income from manufacturing the instrument, software, and certain consumables.
In June, Pacific Biosciences also joined with the Dutch diagnostics firm GenDx. The companies will offer products for full-length human leukocyte antigen gene sequencing, which is gaining in clinical use. HLA sequencing is difficult in part because of high levels of sequence homology, but it gives clues to autoimmune and other diseases.
As new technologies such as Pacific Biosciences’ rise, others are falling by the wayside. In late 2013, after an unsuccessful $6.8 billion attempt to acquire Illumina, Roche decided to close down its 454 Life Sciences NGS business and sunset its midrange sequencers by the end of 2016. The business still accounts for about 10% of the NGS market. Roche acquired 454 Life Sciences in 2007, two years after 454 launched the first NGS instrument based on a sequencing-by-synthesis method. It is called pyrosequencing and uses a luciferase to detect the release of pyrophosphate and emit light that is detected by a camera.
HANDHELD: Oxford Nanopore Technologies’ MinION uses electronic sensing for single-molecule sequencing. Credit: Oxford Nanopore Technologies

In its favor, the 454 technology offered high accuracy and read lengths of up to 1,000 bases. But “from a technological perspective, it had reached its maturity point in being able to compete with some of the newer technologies,” says Vinod Makhijani, vice president and project leader on the business development team for Roche’s sequencing unit. “The throughput of the instruments had pretty much reached its maximum, and we were unable to significantly lower the cost, so the market started to move away from 454.”
Just when it looked like Roche was out of the business, in June it agreed to spend up to $350 million to acquire five-year-old Genia Technologies. The California firm is developing single-molecule, semiconductor-based sequencing. Nucleotides are identified through base-specific tags that are cleaved and detected electrically as they go through protein nanopores. Roche believes that Genia’s technology can reduce sequencing costs while increasing speed and sensitivity.
Later in June, Roche signed a deal to invest up to $15 million in Seattle-based Stratos Genomics. Its sequencing-by-expansion approach aims to convert a DNA template into a larger surrogate molecule using a polymerase and custom expandable nucleotides. The result, which the company calls an Xpandomer, contains reporter molecules that mirror the DNA sequence and can be read off when a single molecule passes through a nanopore. Stratos believes the approach can overcome resolution and signal-to-noise problems seen with other nanopore technologies.
Single-molecule, nanopore, and semiconductor technologies are considered a step beyond current NGS methods. “We obviously wanted to go with platforms that we consider disruptive,” Makhijani says. “All of these technologies that we are looking at offer significant scalability.” Like other new technologies, they will likely enter the research market first and have the potential to evolve into clinical diagnostics.
Other small companies with intriguing but unproven technologies are close behind. Quantum Biosystems, a Japanese start-up, released raw data in February for its silicon-chip-based, direct electrical detection method. About a year ago, England-based Base4 signed a deal with Japan’s Hitachi High-Technologies to build a nanopore-based sequencer.
Most interest has been in the U.K.’s Oxford Nanopore Technologies as it moves closer to launching a new sequencing device. Its MinION uses protein nanopores held in a polymer membrane to sequence single-stranded DNA in real time. Individual bases are identified through changes in electrical current as a linear, single-stranded DNA molecule moves through a nanopore.
The nine-year-old company is conducting an early-access program for the MinION. The disposable device, which is about the size of a USB memory stick, is expected to sell for less than $900.
Still, interest in Oxford Nanopore’s device may have hit a lull. In a survey of gene sequencing system users published in January, Mizuho’s Lawson found that about 50% of respondents expect the firm to provide “the next big leap in sequencing technology.” The number was down from 70% in 2012, “likely due to the delays and slow pace of commercialization,” he says.
Full information on MinION’s performance will come when the access program is complete. In February, analysts heard an early-user report that indicated read lengths were averaging 5,000 bases, but errors were also popping up.
Despite such fits and starts, participants in the NGS field expect the move toward faster, cheaper, and better tools to continue. “There still is a lot of room for prices to drop,” Frost’s Bird says. Rapidly falling sequencing costs don’t necessarily hurt the market, however, because they drive sequencing throughput and make the technology accessible to more users.
In response to strong market growth and new opportunities, the number of companies will expand threefold over the next five years, Bird predicts. Unable to sustain such a large competitor base, the business will then enter a new phase, she says. It will be marked by a mergers and acquisitions race among top competitors and large-company entrants, along with many start-up failures.
http://cen.acs.org/articles/92/i33/Next-Gen-Sequencing-Numbers-Game.html
Posted August 14, 2014-- New Results
Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing
Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James Drake, Jane M Landolin, Adam M Phillippy
Abstract
We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
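The core idea behind MHAP is that two overlapping reads share many k-mers, and a small MinHash sketch of each read's k-mer set lets you estimate that sharing without comparing the sets directly. The toy below illustrates the MinHash principle only; the hashing scheme, k, and sketch size are invented for illustration and are not MHAP's actual implementation:

```python
# Toy MinHash illustration: sketch each read's k-mer set with the minimum
# value of several seeded hash functions, then estimate Jaccard similarity
# by comparing sketch entries. Parameters are illustrative, not MHAP's.
import hashlib

def kmers(read, k=5):
    """Set of all k-length substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def minhash_sketch(read, k=5, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    sketch = []
    for seed in range(num_hashes):
        sketch.append(min(
            int(hashlib.sha1(f"{seed}:{km}".encode()).hexdigest(), 16)
            for km in kmers(read, k)))
    return sketch

def estimated_jaccard(s1, s2):
    """Fraction of matching sketch slots estimates k-mer set similarity."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

a = "ACGTACGTGGTTACGTACGA"
b = "ACGTACGTGGTTACGAACGA"   # one substitution relative to a
print(estimated_jaccard(minhash_sketch(a), minhash_sketch(b)))
```

Because sketches are small and fixed-size, candidate overlaps among millions of noisy long reads can be found by comparing sketches rather than full reads, which is where the reported speedup comes from.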
http://biorxiv.org/content/early/2014/08/14/008003
The most recent version of this article [btu392] was published 2014-07-17 by Oxford University Press---proovread: large-scale high accuracy PacBio correction through iterative short read consensus
Abstract
Motivation: Today, the base code of DNA is mostly determined through sequencing by synthesis as provided by the Illumina sequencers. Although highly accurate, resulting reads are short, making their analyses challenging. Recently, a new technology, Single Molecule Real-Time (SMRT) sequencing, was developed which could address these challenges as it generates reads of several thousand bases. But, their broad application has been hampered by a high error rate. Therefore, hybrid approaches which use high quality short reads to correct erroneous SMRT long reads have been developed. Still, current implementations have great demands on hardware, work only in well-defined computing infrastructures and reject a substantial amount of reads. This limits their usability considerably, especially in the case of large sequencing projects.
Results: Here we present proovread, a hybrid correction pipeline for SMRT reads, which can be flexibly adapted on existing hardware and infrastructure from a laptop to a high performance computing cluster. On genomic and transcriptomic test cases covering Escherichia coli, Arabidopsis thaliana and human, proovread achieved accuracies up to 99.9% and outperformed the existing hybrid correction programs. Furthermore, proovread corrected sequences were longer and the throughput was higher. Thus, proovread combines the most accurate correction results with an excellent adaptability to the available hardware. It will therefore increase the applicability and value of SMRT sequencing.
Availability: proovread is available at the following URL: http://proovread.bioapps.biozentrum.uni-wuerzburg.de
Contact: frank.foerster@biozentrum.uni-wuerzburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
http://bioinformatics.oxfordjournals.org/content/early/2014/07/10/bioinformatics.btu392.short?rss=1
Wednesday, August 6, 2014-Plant and Animal Genomes: New Web Resource Available
http://blog.pacificbiosciences.com/2014/08/plant-and-animal-genomes-new-web.html
(Arizona Genomics Institute) PacBio® A New Sequencing Revolution
FREE: Workshop, Lunch & Tour of Arizona Genomics Institute to see PacBio® RSII!!!
Rare Opportunity - Register Now - Seating is Limited!
Thursday, August 14th, 2014
9am-1pm
BIO5 Institute - Keating Bldg, Rm 103
Program:
Sequencing with Long Reads: New and Upcoming Applications of PacBio® SMRT® Technology
Jonas Korlach, Chief Scientific Officer, Pacific Biosciences
Targeted PacBio® Sequencing: BAC Libraries, Physical Maps, Platinum Sequencing
Rod A. Wing, Director, Arizona Genomics Institute (AGI)
Microbial Genome Sequencing
David Baltrus, Assistant Professor, School of Plant Sciences & Microbial Sciences
Sequencing Large Eukaryotic Genomes
Yeisoo Yu, Sequencing Group Leader, Arizona Genomics Institute
Dave Kudrna, BAC/EST Resource & Physical Mapping Center Group Leader
Full-Length Transcript Sequencing with PacBio® Iso-Seq™ Method: Going Beyond Short Read Assembly
Jonas Korlach, Chief Scientific Officer, Pacific Biosciences
Lunch:
Open Discussion Q&A
Tour:
Arizona Genomics Institute -- see PacBio® RS II
http://www.bio5.org/2014-pacbio
The Dazzler DB: Organizing an Assembly Pipeline
by Gene Myers -- Major Update: July 29, 2014
1. The database stores the source PacBio read information in such a way that it can recreate the original input data, thus permitting a user to remove the (effectively redundant) source files. This avoids duplicating the same data, once in the source file and once in the database.
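The round-trip property described above -- import the reads, then regenerate the original source data on demand -- can be sketched as follows. The in-memory layout and single-line FASTA records here are invented for illustration and are not the actual Dazzler DB format:

```python
# Toy sketch of the "recreate the original input" property: pack reads
# into one in-memory store, then regenerate equivalent FASTA-formatted
# source text from it, so the source file itself becomes redundant.

def import_fasta(text):
    """Parse FASTA text into a list of (header, sequence) records."""
    records, header, seq = [], None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def export_fasta(records):
    """Regenerate FASTA text from the stored records."""
    return "\n".join(f">{h}\n{s}" for h, s in records) + "\n"

source = ">read/1\nACGTACGT\n>read/2\nGGTTAACC\n"
db = import_fasta(source)
assert export_fasta(db) == source   # lossless round trip
```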
http://dazzlerblog.wordpress.com/2014/06/01/the-dazzler-db/
Tuesday, July 29, 2014-- Novel Study of Genome-wide PT Modifications in Bacteria Performed with SMRT Sequencing
A recent paper from scientists in China and the United States demonstrates a novel view of phosphorothioate (PT) DNA modifications in two bacterial genomes. Scientists from Shanghai Jiao Tong University, Massachusetts Institute of Technology, Wuhan University, and Pacific Biosciences teamed up to deploy Single Molecule, Real-Time (SMRT®) Sequencing to generate the first genome-wide view of PT modifications and to better understand their function. “Genomic mapping of phosphorothioates reveals partial modification of short consensus sequences” by Cao et al. was published in Nature Communications.
The authors note that PT modifications, which replace a non-bridging phosphate oxygen with sulphur, were only recently discovered to occur naturally in bacteria. (PT modifications are used by scientists to stabilize synthetic DNA molecules against nuclease degradation.) Today, these modifications have been seen in more than 200 bacteria and archaea, but the detailed genome-wide distribution and biological functions have not been clear.
To look at these events across whole genomes, the scientists used SMRT Sequencing, which can distinguish PT modifications as the polymerase is sequencing DNA. They studied Escherichia coli B7A, which uses the DndF-H proteins known to be associated with PT modifications, as well as Vibrio cyclitrophicus FF75, which lacks those proteins. The PacBio® RS II was used to fully sequence each genome and to assess PT modifications across the genomes.
The scientists found that in E. coli, PT modifications occur on both strands of a particular motif, but only 12 percent of possible motif sites were modified. In V. cyclitrophicus, the modifications are seen only on one DNA strand at CpsCA sequence contexts, but still in just 14 percent of possible sites. The authors also described an iodine-cleavage method in conjunction with Illumina® sequencing which was used to cross-validate the findings; however, that method requires both DNA strands to be modified so was only applied to the E. coli case. “The results raise questions about how Dnd modification proteins (DndA-E) select their DNA targets,” the authors write. “Emerging evidence suggests that DndD is a DNA nicking enzyme and that DndE binds selectively to nicked DNA, with both activities critical to incorporation of PT into the DNA backbone.”
The partial modification seen in both bacteria suggests that overexpression of DndA-E proteins could increase the levels of PT modifications, according to the paper. “These results point to a novel [restriction-modification] system involving site-specific PT modifications without a predictable consensus beyond four nucleotides and with partial modification of sites in the presence of a restriction activity,” the scientists report.
“Such consistency for two bacteria in which PT has very different functions points to a conserved mechanism of DNA target selection by the DNA-modifying DndA-E proteins, a mechanism that we have shown likely involves direct interaction of the modifying proteins with the consensus sequence,” the authors conclude. http://blog.pacificbiosciences.com/
Tuesday, July 22, 2014---At ISMB, Gene Myers’ Keynote Offers History, Future of Genome Assembly
At ISMB 2014 in Boston earlier this month, Gene Myers of the Max-Planck Institute for Molecular Cell Biology and Genetics, presented a keynote address entitled “DNA Assembly: Past, Present, and Future.” Myers received the prestigious Senior Scientist Accomplishment Award from the International Society for Computational Biology (ISCB) at the event.
The ISCB Senior Scientist Accomplishment Award honors respected leaders in computational biology and bioinformatics for their significant contributions to these fields through research, education, and service. Myers is being honored as the 2014 winner for his outstanding contributions to the bioinformatics community, particularly for his work on sequence comparison algorithms, whole-genome shotgun sequencing methods, and for his recent endeavors in developing software and microscopic devices for bioimage informatics.
His talk chronicled the history of sequence assembly methods, highlighting the different technologies from Sanger sequencing to today and the various algorithmic approaches to the problem, weaving throughout the ideas of string graphs and de Bruijn graphs.
Myers believes the demand for lower-cost sequencing “after the genome” has hampered progress on the production of high-quality de novo genome reconstructions, resulting instead in ‘Swiss cheese genomes’. He said that generating genomes consisting of lots of small contigs was never his vision for assembly.
He spent nearly a decade out of the “DNA sequencing scene” (see his blog post “On Perfect Assembly”) because the cost-over-quality movement caused him to lose interest as a mathematician, until the advent of long-read sequencers renewed Myers’ engagement in assembly methods. He writes: “What I perceived early in 2013 was that the relatively new PacBio ‘long read’ DNA sequencer was reaching sufficient maturity that it could produce data sets that might make this possible, or at least get us much, much closer to the goal of near perfect, reference quality reconstructions of novel genomes.” Myers noted that some in the industry had misunderstood the accuracy profile of the system, but he recognized the power of the Poisson sampling and random distribution of errors and decided last year to purchase a PacBio® RS II and “get back into the genome assembly game.”
Myers now has two PacBio RS II sequencers and, as he has discussed in his blog and presentations this year at AGBT and ISMB, he is not concerned with error rates associated with PacBio sequencing because the error is truly random (“unlike any previous technology”), and therefore “the ideal of near perfect de novo assembly is again possible.”
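The power of truly random errors can be made concrete with a little binomial arithmetic: if per-read errors land independently at random, the chance that a majority of reads at a base are wrong shrinks rapidly with coverage. A simplified sketch (it ignores indels and quality values and counts any tie as a wrong call, so it is illustrative only, not a real consensus caller):

```python
from math import comb

def consensus_error(p, cov):
    """P(majority of reads at a base are in error), with per-read error
    rate p and coverage cov, assuming errors land independently and at
    random across reads (ties counted as wrong)."""
    wrong_from = -(-cov // 2)  # ceil(cov/2): at least half the reads in error
    return sum(comb(cov, k) * p**k * (1 - p)**(cov - k)
               for k in range(wrong_from, cov + 1))

# With ~15% random errors, the consensus converges quickly as coverage grows
for cov in (1, 5, 15, 31):
    print(f"coverage {cov:2d}: consensus error {consensus_error(0.15, cov):.2e}")
```

Systematic (non-random) errors would not average out this way, which is the distinction Myers draws.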
He described his most recent algorithmic work on an assembler called the Dazzler (the Dresden AZZembLER) that can assemble 1-10 Gb genomes directly from a shotgun, long-read data set produced by PacBio RS II sequencers. Using Dazzler, he reported generating a de novo assembly of a human genome with an N50 of 5.5 Mb, which represents an improvement of over 1 Mb compared to our HGAP assembly in February, and with much reduced computational requirements and time. More information is available on his blog. In conclusion, he noted that long-read sequencers will enable de novo, reference-quality reconstructions, enhance comparative genomics and diversity studies, and give us an accurate picture of large-scale structural variation.
We are glad to see Myers back in the DNA sequencing scene, and very excited about the possibilities SMRT® Sequencing holds for genome assemblies! http://blog.pacificbiosciences.com/2014/07/at-ismb-gene-myers-keynote-offers.html?utm_content=bufferf6127&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
July 23, 2014 by nextgenseek /// Personal Allele-Specific Transcriptomics by PacBio Long Reads
Just a few weeks ago, Mike Snyder’s team at Stanford published an interesting paper in PNAS. Snyder’s team is again using PacBio long reads to understand and characterize the human transcriptome.
“Defining a personal, allele-specific, and single-molecule long-read transcriptome”
by Hagen Tilgner, Fabian Grubert, Donald Sharon, and Michael P. Snyder
In this paper, Tilgner et al. focused on defining a personal transcriptome at the allele level.
Diploids like humans have two copies/alleles of every autosomal gene: one allele from the mother and the other from the father. A number of studies have shown that preferential expression of mom’s or dad’s allele is pretty common. Until now, however, these studies have used short-read RNA-seq technology to study the transcriptome at the allele level.
This paper is the first attempt to characterize personal transcriptomics, where an individual’s genetic variations and allele-level isoforms are defined and quantified across transcripts’ full length. Snyder’s team used PacBio to produce a personal transcriptome at the allele level, generating long-read PacBio RNA-seq data from the GM12878 cell line (lymphoblastoid) and its parental cell lines (GM12891 and GM12892). In addition to the long reads from PacBio, the team also used Illumina sequencing to give the most comprehensive view of a personal transcriptome. By sequencing the transcriptomes of the trio with PacBio, this paper provides a glimpse of allele-level expression at full transcript length.
They sequenced 711,000 circular consensus (CCS) reads/molecules from unamplified, polyA-selected RNA from the GM12878 cell line. The average CCS length is 1,188 bp, with some reads up to 6 kb. The sequencing effort in this paper is much larger than their earlier effort to define transcriptomes using PacBio long-read technology. The early part of the paper is all about how long-read PacBio fares in finding isoforms and improving GenCode annotations:
- How is a CCS read able to capture the entire exon-intron structure of a transcript in a single read?
- How does gene detection by long-read PacBio compare to Illumina 101 bp reads?
- How do the long reads enhance GenCode annotations?
One of the interesting parts of the paper was finding out the parental origin of a given PacBio long read, i.e. whether the read originated from mom or dad. This basically helps quantify allele-specific gene expression.
Allele Specific Expression
Traditionally, using Illumina reads, one aligns reads to the genome, looks at known SNP locations, and counts the number of reads carrying the reference allele and the number carrying the alternate allele to quantify allele-specific expression. Accounting for the alignment bias induced by the reference genome, either at the alignment step or later, this approach gives an idea of how much a gene shows allele-specific expression. It yields SNP-level allelic expression, not isoform-level or gene-level expression; therefore, quantifying the allelic expression of a transcript with multiple variants is challenging using Illumina reads.
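That SNP-counting step is simple enough to sketch in a few lines (the pileup string, alleles, and counts below are hypothetical, just to show the arithmetic):

```python
from collections import Counter

def allele_counts(pileup_bases, ref, alt):
    """Count aligned reads supporting the reference vs. the alternate
    allele at a single known SNP position."""
    counts = Counter(pileup_bases)
    return counts[ref], counts[alt]

# Hypothetical pileup at one SNP: 6 reads show A (ref), 2 show G (alt)
ref_n, alt_n = allele_counts("AAAGAGAA", ref="A", alt="G")
allelic_ratio = ref_n / (ref_n + alt_n)  # 0.75: skewed toward the ref allele
```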
Allele Specific Expression by Principal Component Analysis (PCA)
Interestingly, this paper addressed the quantification of allele-specific expression in a PCA/SVD framework instead of looking at known SNP locations. Possible reasons for this approach are that:
- it gives read-level inference of parental origin instead of SNP-level inference;
- the random errors in PacBio data may make it difficult to quantify allele-level expression at the SNP level.
At a high level, in the PCA framework the SNP profiles of aligned reads are the input data, and the parental origin of each read is the unknown variable to be inferred from its SNP profile. The SNP profile (a reads-by-mismatches matrix) is created by coding each position as 1, 0, or -1 depending on whether the nucleotide is different from the reference, the same as the reference, or absent.
Parental origin of PacBio Reads by PCA
Assuming the PacBio errors are random and there are no other factors affecting the reads, one can do PCA on the SNP profiles of aligned reads, and one of the principal components will correspond to the parental origin of the reads. With enough reads, a gene with multiple SNPs will carry enough signal to be captured as one of the principal components. The basic idea is that instead of computing ASE from single SNPs, the PCA approach uses information from multiple SNPs (real or error) to compute the parental origin of reads. Each column in the reads-by-mismatches matrix is a potential SNP; PCA performs dimensionality reduction, and the PC that explains the most variance gives the parental-origin information for each read.
An advantage of the approach is that we do not need to know the SNPs a priori. A limitation is that it works only when the data come from trios: although a principal component can separate maternal and paternal reads from an individual, it cannot tell which set is maternal and which is paternal. The way around the problem is to include the parents’ reads in the PCA, which resolves the labeling. The approach will also fail if a gene has few SNPs, few reads, or reads with many errors.
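Under those assumptions, the core of the scheme fits in a few lines of linear algebra. A toy sketch (simulated reads, a simplified ±1 match/mismatch coding rather than the paper’s 1/0/-1 coding, and PCA done via SVD; none of this reproduces the paper’s actual pipeline):

```python
import numpy as np

# Simulate a reads-by-mismatches matrix for one gene in a toy diploid
rng = np.random.default_rng(0)
n_reads, n_snps, err = 40, 8, 0.1

origin = rng.integers(0, 2, n_reads)         # 0 = maternal, 1 = paternal
haps = np.array([np.ones(n_snps), -np.ones(n_snps)])
reads = haps[origin].copy()
flips = rng.random((n_reads, n_snps)) < err  # random sequencing errors
reads[flips] *= -1

# PCA via SVD on the centered matrix; PC1 captures the dominant variance,
# which here is the maternal/paternal haplotype difference
X = reads - reads.mean(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ vt[0]
groups = pc1 > 0  # sign is arbitrary: labeling the groups maternal vs.
                  # paternal requires the parents' reads, as noted above
```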
It would be great to dig into the nitty-gritty details of how this method compares with the SNP-based approach and what its limitations are. Some other time, if needed :)
http://nextgenseek.com/2014/07/personal-allele-specific-transcriptomics-by-pacbio-long-reads/
PACB Tweets and replies // https://twitter.com/pacbio
Join our seminar @ Texas A&M - Applications of SMRT Seq: 7/22 11-12 in ILSB Auditorium https://twitter.com/PacBio/status/489160154403725313/photo/1
Just some excited chat on the web today about PACBIO!!! (11 hrs ago)
#PP09 @aphillippy: 90% of any bacterial genome on GenBank could be closed with a single @PacBio SMRT cell - less than $1000. (6 hrs ago)
@aphillippy that's amazing. Thank goodness @PacBio survived Illumina's relentless PR blitzes (5 hrs ago)
@timtriche @druvus @PacBio wait 'til you watch @aphillippy's talk (that I'll post), which will really impress and settle a lot of doubts.
Friday, July 11, 2014---ISMB 2014: The World Cup of Bioinformatics
We’re eager for the #ISMB conference — it’s the 22nd annual Intelligent Systems for Molecular Biology event — kicking off this weekend in Boston. As we continue to push our technology to deliver longer read lengths, we have been honored to work with many leading bioinformaticians to optimize the processing and analysis of our data.
Several of those experts will be speaking at ISMB this year. On Sunday, attendees will hear from Adam Phillippy of the National Biodefense Analysis and Countermeasures Center. He’ll be presenting at noon on producing complete genome assemblies using Single Molecule, Real-Time (SMRT®) Sequencing data. Adam’s team recently developed a new assembler called MHAP that dramatically reduces CPU power needed for building assemblies, so we are eager to hear more.
Later that day, Gene Myers from the Max Planck Institute of Molecular Cell Biology and Genetics in Dresden, Germany, will give the 2014 ISCB Accomplishment by a Senior Scientist Award keynote presentation entitled “DNA Assembly: Past, Present, and Future,” in which he'll reflect on genome assembly challenges throughout his career. According to his abstract, Myers’ talk will also cover “the surprising transition from skepticism of whole-genome shotgun sequencing to an irrational acceptance of NGS whole-genome shotgun over short reads.” He’ll speak about Dazzler, a new tool he developed to assemble genomes as large as 10 Gb directly from long PacBio® reads.
There are several other terrific keynotes scheduled for the meeting. On Monday, Harvard’s Zak Kohane will give a talk outlining the opportunities he sees for biomedical quantitative analysis experts to participate in the healthcare revolution happening today. On Tuesday, Russ Altman will offer a presentation on using informatics to better understand drug response from the molecular to the population level.
With a history of two decades of high-profile talks, ISMB is arguably the World Cup of the bioinformatics world. We hope to see you there! http://blog.pacificbiosciences.com/2014/07/ismb-2014-world-cup-of-bioinformatics.html
Wednesday, July 9, 2014---Optimizing Eukaryotic De Novo Genome Assembly: Webinar Recording Available
Our webinar on eukaryotic genome assembly attracted a great crowd, and now we’re making the full recording available to the community. The session featured great hands-on information and best practices for working with Single Molecule, Real-Time (SMRT®) Sequencing data. “Optimizing Eukaryotic Genome Assembly with Long-Read Sequencing” featured three excellent speakers — Michael Schatz and James Gurtowski from Cold Spring Harbor Laboratory and Sergey Koren from the National Biodefense Analysis and Countermeasures Center — and was hosted by our own CSO Jonas Korlach.
Schatz kicked off the session with an overview of assemblers for PacBio® data (as well as recommendations for when to use each one) and a look at the challenges of short-read assemblies. He also set expectations around long-read data, noting that for genomes less than 100 Mb, users should expect a nearly perfect assembly from the automated workflow. Genomes up to 1 Gb should be represented in a high-quality assembly with a contig N50 of at least 1 Mb. Genomes larger than that will have shorter contig N50 stats and will require larger computational power, he added.
Next, Gurtowski gave an in-depth look at hybrid assemblies in which shorter reads are used to correct errors in longer reads. He provided step-by-step instructions for the use of ECTools, a new portfolio of publicly available assembly tools developed in the Schatz lab. He noted that the pipeline was developed to be modular, so users could run the whole workflow or just pick out the elements that would be most helpful to them. Finally, Gurtowski alerted attendees that the choice of assembler for the pre-assembly step is dependent on the data, so he recommends using several and evaluating results across them.
Koren presented data on chromosome-scale assembly, reporting the new MinHash Alignment Process (MHAP) he developed to dramatically reduce the need for processing power in genome assemblies. (Adam Phillippy also spoke about this tool at our recent user group meeting.) Koren used the example of a Drosophila assembly to show that traditional assemblers required 629,000 CPU hours, while MHAP completed the same assembly with just 1,086 CPU hours and even produced slightly higher quality. He also performed a live demo of the automated MHAP pipeline, showing how to tune parameters such as memory usage as you go.
After the speakers completed their presentations, there was a lively Q&A session that is also captured in the webinar recording. Discussions ranged from the impact of highly polymorphic regions on assembly quality to the highly technical, such as the use of unitigs or contigs for ECTools and how to combine PacBio data generated with different chemistries.
View the webinar recording
http://blog.pacificbiosciences.com/2014/07/optimizing-eukaryotic-de-novo-genome.html?utm_content=buffer6a5f7&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Journal of Microbiological Methods
Available online 27 June 2014
Improved performance of the PacBio SMRT technology for 16S rDNA sequencing
Jennifer J. Mosher, Brett Bowman, Erin L. Bernberg, Olga Shevchenko, Jinjun Kan, Jonas Korlach, Louis A. Kaplan
Abstract
Improved sequencing accuracy was obtained with 16S amplicons from environmental samples and a known pure culture when upgraded Pacific Biosciences (PacBio) hardware and enzymes were used for the single molecule, real-time (SMRT) sequencing platform. The new PacBio RS II system with P4/C2 chemistry, when used with previously constructed libraries (Mosher et al., 2013), surpassed the accuracy of the Roche/454 pyrosequencing platform. With accurate read lengths of > 1400 base pairs, the PacBio system opens up the possibility of identifying microorganisms to the species level in environmental samples.
Keywords
PacBio; SMRT sequencing; 16S
http://www.sciencedirect.com/science/article/pii/S0167701214001675
I started buying PACB 03/27/2013 @ around $2.00. Recent purchases @ $4.49 to $6.06. Long term! (Anymore I only share my DD, for what it's worth.)
Discovering standard and non-standard RNA transcripts
How to detect canonical splicing, circular RNAs, trans-splicing, and fusion transcripts
When?
October 23rd - 24th 2014
Where?
Leipzig, Germany // sponsored by Pacific Biosciences (PACB) and Illumina Inc. (ILMN)
http://www.ecseq.com/workshops/workshop_2014-04.html
Vacancy Details: Potato genomics (Computational Biologist)
Improving potato genetics using long-read sequencing data
Expected/Ideal Start Date 01 Aug 2014
Months Duration 36
Main purpose of the job To develop genetic resources in potato through the use of genomic technologies: assemble the genome of a diploid self-compatible relative from a mix of short (Illumina) and long read (PacBio/Nanopore) technologies; use next generation sequencing to harness induced variance (TILLING) that affects key traits and naturally occurring variance that confers resistance to common potato diseases.
Department Genomics
Group Details Plant and Microbial Genomics (PMG) group is a multi-disciplinary team with a strong technical expertise base in genetics, genomics, molecular biology, plant biology, and bioinformatics. We are applying many of the latest sequencing technology improvements, and our own innovations, to the areas of plant and microbial genomics, and the interaction between the two.
See http://www.tgac.ac.uk/plant-and-microbial-genomics/ for more details
Advert Text Potatoes are the fourth most valuable crop in the world, yet despite a fully sequenced genome the available genetic resources are surprisingly sparse. One major hindrance is that potatoes (S. tuberosum) are an autotetraploid outbreeder, which hinders examination of recessive alleles. For this reason we will establish a tuber forming close relative S. verrucosum (an inbreeding diploid) as a genetic model by sequencing its genome and demonstrate its efficacy by cloning forward genetic EMS induced mutants. In a further grant we will harness naturally occurring potato variation in dominant traits i.e. NB-LRR based disease resistance using an improved version of RenSeq (Juppe et al. 2013).
The post holder will work at the cutting edge of next generation sequencing in a multi-disciplinary team with expertise in genetics, genomics, molecular biology, plant biology, bioinformatics and algorithm development. The successful applicant will use and optimise tools for assembling large plant gene sequences and an entire genome using ultra-long Pacific Biosciences (and, as available, Nanopore) reads, as well as using shorter Illumina reads to verify and correct the assembly.
Key relationships Internal: working closely with Plant Microbial Genomics and Platform and Pipeline staff developing the wet lab (molecular biology) techniques, reporting to the Plant Microbial Genomics group leader.
Interacting with: all TGAC staff, especially the sequencing production teams, and the Computational Biology and Pipeline bioinformatics groups.
External: key biology collaborators at James Hutton Institute, University of Dundee, The Sainsbury Laboratory and Simplot (a world leading Agri-Biotechnology company), platform providers e.g. PacBio, Oxford Nanopore Technologies, and Illumina as well as potato geneticists and breeders. The post holder will represent TGAC at key meetings nationally and worldwide.
Main Activities & Responsibilities (percentage of time)
- Establish optimal systems for handling and assembling 3rd-generation long-read data, e.g. PacBio and/or Nanopore. (30%)
- Work with key members of the Plant Microbial Genomics and Computational Biology groups to develop optimal experimental and computational approaches to assemble the inbred diploid potato species S. verrucosum. (20%)
- Use hybrid sequence data (long and short reads) to separate large, highly related gene families. (15%)
- Use hybrid sequence data to identify naturally occurring variance in large gene families that confers desirable traits. (15%)
- Use next generation sequencing short reads to map EMS-induced alleles, e.g. from bulk segregant pools. (10%)
- Prepare talks and publications on in-depth analysis of population genomics data sets generated at TGAC, or with collaborators. (10%)
http://jobs.tgac.ac.uk/Details.asp?vacancyID=7851
The Hunt for a New Human Reference Genome
By Aaron Krol
June 30, 2014 | The human reference genome is a linchpin of modern genetics, but it’s also a bit of an historical oddity. Currently known as GRCh38, or “build 38” for short, it is a direct descendant of the original Human Genome Project, and has touched almost every genomic study since. The reference genome acts as a template that makes it much cheaper and easier to assemble new human genomes: when a sequencing project breaks a subject’s DNA into millions of short fragments, those reads can be placed in their correct locations by matching them to build 38.
There’s no special reason that a certain genome has to be used as the reference, but build 38 has a lot going for it. It remains the most accurate and complete human genome ever assembled, and is regularly updated by the Genome Reference Consortium (GRC), made up of teams from the National Center for Biotechnology Information (NCBI), the Genome Institute at Washington University in St. Louis, the U.K.’s Wellcome Trust Sanger Institute, and the European Bioinformatics Institute — all key participants in the Human Genome Project. The GRC released its latest major overhaul of the reference genome last December, replacing build 37, and adds minor updates four times a year. (For more on the biggest changes that accompanied build 38, see “Getting to Know the New Reference Genome Assembly.”)
But build 38 also carries some baggage. Most importantly, a mix of several donor sources was used in the Human Genome Project, and more have been incorporated into the reference since. And while the reference genome is haploid — it features just one copy of each chromosome — almost all its sources are diploid, with two copies of each chromosome that may be dramatically different from each other in areas of heavy structural variation.
This confusion of sources can cause problems. The reference can end up with two versions of a structural variant mashed together, to create a new genotype that never occurs in nature. There may even be artificial gaps in the sequence, where two structural variants from different sources fail to meet in the middle. The GRC is constantly searching for and cleaning up these events, but they’re not always easy to find. Even harder to capture are haplotypes: sets of variants that tend to travel together, even if they occur tens of thousands of bases apart. With its motley heritage, the reference genome in many regions has no natural haplotype, just a patchwork of its various clone sources.
“The way the reference is currently created is that clone A and clone B and clone C can all be completely different haplotypes,” says Tina Graves-Lindsay, leader of the Reference Genomes Group at Washington University, and a contributor to the GRC. “So you have no linkage across any large region, for knowing your assembly aligns the same way to each haplotype.”
This can affect the main purpose of the reference genome, of placing DNA fragments in the right order. If two haplotypes differ by a large structural variation, like an inversion or duplication, where the same sequence appears in different places or in reverse order, reads can be aligned differently depending on the haplotype of the reference. Without the context of the whole haplotype, it becomes much harder to resolve this kind of error.
To get a better sense of variation across large regions, it would be better to have a naturally haploid reference, one that can show the real sequence of at least one set of actual human chromosomes. Graves-Lindsay is now part of a team at Washington University that is working on just that: a whole alternate reference genome, as accurate as build 38, but sequenced from just one sample with half the usual number of chromosomes. Its curators call it the platinum genome, and it relies on a very unusual donor source.
One Set of Chromosomes
In 2002, Evan Eichler, a geneticist then at Case Western Reserve University, wrote to the National Human Genome Research Institute to request the creation of a new BAC library — a sort of bacterial storage system for long DNA fragments, used to hold onto interesting genetic material for repeated sequencing. The Human Genome Project had not yet been completed, but Eichler was already finding gaps and errors that could be fixed by sequencing a haploid human genome. To get one, he recommended a BAC library covering the entire genome of a hydatidiform mole.
Hydatidiform moles are the result of a type of abnormal pregnancy, where an egg that by some accident has no nuclear DNA is impregnated by an ordinary sperm. The sperm then doubles its own DNA, resulting in two identical copies of each chromosome in every cell as the mole starts to divide. Hydatidiform moles are rare, but several have been isolated and turned into cell lines, and one of those, called CHM1, has become an industry standard.
Eichler’s proposed BAC library was eventually created from CHM1 — the library is called CHORI-17 — and Eichler, now at the University of Washington in Seattle, has been working with it for around ten years, in collaboration with the Genome Institute at Washington University. At first, says Graves-Lindsay, who has been regularly involved in the partnership, the goal was just to go back over the most confusing parts of the reference genome and repair them.
“We really started with the BAC sequencing, initially to fix regions,” she says. “There are definitely regions that cannot be sorted out without a single haplotype. And we actually found that there were a lot of gaps in the reference that are due to two different haplotypes on either side.”
Thanks to long work on CHORI-17, updates to builds 37 and 38 corrected several unresolved genes. These included SRGAP2, a very complex gene that is duplicated in three different places across the length of chromosome 1, and the immunoglobulin heavy locus, where several similar DNA segments are shuffled and reshuffled together to express a highly variable set of antibodies.
The success patching up specific structural variants, however, soon underlined the need to show how these variants behave together. “The more we worked on it,” says Graves-Lindsay, “the more we realized that having the complete sequence would be good also.” In 2011, the Washington University/University of Washington team sequenced all of CHM1 on Illumina sequencers, creating their first assembly of a haploid human genome, which was made freely available in the NCBI’s GenBank database.
This assembly was a useful starting point, but it had some limitations. Like all whole genomes created with Illumina’s instruments — which are fast and highly accurate, but split their samples into small fragments just one or two hundred bases long — the new CHM1 assembly had to be guided by build 38. This meant it was vulnerable to the same confusions around large structural variants that a haploid reference genome was meant to overcome. Adding information from CHORI-17 could fix some of these problems, but not all, and not quickly.
The Illumina assembly was also far from complete, covering just over 92% of build 37 upon release. While that has since improved, today the assembly is still divided in over 40,000 segments, or contigs, with gaps in between that cannot be resolved. The hardest work was still to come in bridging the distance from this first CHM1 assembly to the “platinum genome.”
No Reference Required
Elsewhere, however, a different assembly of CHM1 was in the works. The sequencing company Pacific Biosciences, based in Menlo Park, California, had struggled to carve out a market since the release of its first sequencer in 2010. The company’s technology was neither as cheap nor as fast as market leader Illumina, but it did have one noteworthy advantage. With the release of new chemistry in October 2013, PacBio was delivering half its reads in fragments of 8,000 bases or more, over an order of magnitude longer than any of its competitors.
Long reads make it exponentially easier to put together whole genomes de novo, without using a reference genome, in part because there are fewer total fragments and less confusion about the order they belong in. Over the course of 2013, PacBio released a series of de novo genomes, starting with a few bacteria and building up to yeasts and fruit flies, to sell potential customers on its long-reading machines. But Jonas Korlach, the company’s CSO, wanted to tackle a human sample, and he naturally reached out to someone who worked with human genomes on a regular basis.
“I had asked Evan Eichler in the summer,” Korlach told Bio-IT World, “if we want to show that the long reads from PacBio can be really useful for getting an improved de novo assembly, what sample should we use? And Evan immediately said we should use the CHM1 sample.” (Eichler was also until recently a member of PacBio’s advisory board.)
By February 2014, PacBio was ready to release its own assembly of CHM1. It joined just a handful of de novo human assemblies ever performed, and thanks to the long read lengths, it had some properties that previous efforts couldn’t match. “Our assembly, straight out of the pipe, came to an N50 of 4.4 Mb,” says Korlach, meaning half of all assembled bases sit in contigs at least 4.4 million DNA bases long. By comparison, the Washington University assembly of CHM1 has a contig N50 of just 144,000 bases.
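The N50 statistic quoted here is easy to compute: sort contigs from longest to shortest and walk down until at least half of the assembled bases are covered. A minimal sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """Smallest contig length L such that contigs of length >= L together
    contain at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Toy assembly (100 bases total): cumulative 40, then 65 >= 50 -> N50 is 25
print(n50([10, 25, 40, 15, 10]))  # 25
```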
Longer contigs mean fewer contigs, and fewer gaps in between them. Overall, says Korlach, “the assembly was about forty times more contiguous than any of the previous approaches, except of course the very first Human Genome Project.” Intriguingly, the PacBio assembly is also longer overall than any previous human genome, by about 400 million bases. “We’re looking at that carefully now,” Korlach adds, “but we already see indications that it’s because you recover and resolve highly repetitive regions” — areas like the telomeres and centromeres, which aren’t fully represented in build 38 because they’re too repetitive to sequence with current technologies.
A great deal of validation still needs to be done on PacBio’s version of CHM1; that work is now being carried out both within the company and at outside institutions that have downloaded the freely available data. But as the most complete human genome since the reference itself, this assembly looks like a much more secure model for the platinum genome, a project Korlach enthusiastically supports.
“The ultimate goal would be to get a human genome that goes from one telomere, through the centromere, to the other telomere — a chromosome represented by continuous sequence,” he says. “That would be a great advance for science, to really have a sense of completion, and to know all the bases in at least one human genome.”
PacBio has previously contributed in a small way to the GRC’s efforts; its assembly of the MUC5AC gene, a highly repetitive gene that may be involved in chronic obstructive lung disease, is the canonical sequence in build 38. Now, the company’s first whole human genome is playing a central part in the effort to add a second high-quality reference to human geneticists’ arsenals.
Toward a Platinum Genome
It will take a mix of many data sources, each with their advantages and disadvantages, to piece a useful platinum genome together. “We’ve got the Illumina sequence, the PacBio sequence, we’ve got lots of clones sequenced,” says Graves-Lindsay. “So we plan to use all of those resources to check the accuracy of our final assembly.” The team is also referring to a third assembly of CHM1 on a different technology, an optical system from a company called BioNano, which is useful for ordering structurally similar regions.
The Genome Institute at Washington University was an early adopter of PacBio instruments, and the team is now using its PacBio sequencer on BAC clones from the CHORI-17 library. They’re still focusing mainly on the thorniest regions, so that improved sequence can be added to build 38 as quickly as possible. “As soon as we fix a region, it will be a part of the reference as a patch,” says Graves-Lindsay. “So that’s the piecemeal goal, to get the sequence out there as best we can.”
The major challenge is getting consistently high accuracy. Most high-throughput sequencing technologies are now in the region of 99.9% accurate for each DNA base call, but over the length of a whole genome, that still leaves a lot of room for error. The original Human Genome Project used the much more painstaking and expensive Sanger sequencing method, which is scrupulously accurate; by comparing different data sources against each other, Graves-Lindsay and her colleagues hope to achieve the same quality at a fraction of the cost.
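To put that per-base figure in perspective, a back-of-envelope sketch (the 3.1 Gb genome size is a rough assumption of mine, not a number from the article):

```python
genome_size = 3_100_000_000   # approximate haploid human genome, in bases
per_base_accuracy = 0.999     # the ~99.9% per-base figure quoted above

expected_errors = genome_size * (1 - per_base_accuracy)
print(f"~{expected_errors:,.0f} miscalled bases")  # → ~3,100,000 miscalled bases
```

Even at 99.9% accuracy, roughly three million bases would be wrong across a whole genome, which is why cross-checking independent data sources against each other matters so much here.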
In the medium term, a first full draft of the platinum genome is still in the making. The Genome Institute plans to deposit that resource in GenBank, just as it already has with its Illumina CHM1 assembly, so that researchers anywhere in the world can access it. At first, its most promising use will likely be in haplotype studies, helping to clarify which variants tend to be inherited together.
“I think the biggest utility for the single-allelic [haploid] representation is likely to be the context, that allele A or variant A always goes with this variant that’s down the road a little bit,” says Graves-Lindsay. “Especially if you want allelic context in a large region, if you’ve got a single allele, you’ll be able to figure out how things work together.”
In the longer term, she imagines the platinum genome could be curated to the same degree as build 38. One danger of any reference genome is that structural differences between the reference’s haplotype and a given sample will be too great to bridge, leading to regions that can’t be assembled. (This is especially relevant because build 38 is based almost entirely on DNA from U.S. donors. Korlach remembers speaking to Japanese customers at the Advances in Genome Biology & Technology conference: “they said the human reference genome is great, but it doesn’t really apply to the kinds of genomes that they’re interested in.”)
To get around this, the GRC has been diligently adding “alternate scaffolds” to builds 37 and 38, where highly variable regions can be represented in a number of different ways. Graves-Lindsay and her colleagues want to do the same for the platinum genome — possibly even with whole haplotypes, so that alternate sequences stretch great distances across chromosomes.
“Our intention is to continue to add additional sequences,” she says. “There will probably be a complete sequence, and then hopefully you’ll be able to layer these either on the reference, or on the single haplotype, to the point where you have all the different alleles layered on.” That would be the most powerful resource for human genetics, able to correctly assemble whole genomes from any human sample, as well as illuminate the way variants stay linked with one another across long stretches of the chromosomes.
Build 38 and its predecessors have been incredible tools for genetics, making it possible to sequence human genomes en masse, and collecting the highest quality sequence for nearly all regions of the genome in one place. But the reference remains bound to the unique circumstances of the Human Genome Project, the race to build a human genome as quickly as possible from whatever sources worked. Although it remains the best-curated genome available, it probably looks quite different from a genome built for reference-guided assembly from the ground up.
As the researchers at the Genome Institute and in Eichler’s lab continue to pore over CHM1’s DNA, this strange cell line may one day offer a new foundation for the daily work of human genetics.
http://www.bio-itworld.com/2014/6/30/hunt-new-human-reference-genome.html?utm_source=dlvr.it&utm_medium=twitter
Long Amplicon Analysis: Highly Accurate, Full-length, Phased, Allele-Resolved Gene Sequences from Multiplexed SMRT® Sequencing Data
Brett N. Bowman(1), Patrick Marks(1), N. Lance Hepler(1), Kevin Eng(1), John Harting(1), Takashi Shiina(2), Shingo Suzuki(2), Swati Ranade(1)
1. Pacific Biosciences of California, Inc., Menlo Park, United States of America
2. Tokai University School of Medicine, Isehara, Japan

The correct phasing of genetic variations is a key challenge for many applications of DNA sequencing. Allele-level resolution is strongly preferred for histocompatibility sequencing, where recombined genes can exhibit different compatibilities than their parents. In other contexts, gene complementation can provide protection if deleterious mutations are found on only one allele of a gene. These problems are especially pronounced in immunological domains, given the high levels of genetic diversity and recombination seen in regions like the Major Histocompatibility Complex. A new tool for analyzing Single Molecule, Real-Time (SMRT) Sequencing data, Long Amplicon Analysis (LAA), can generate highly accurate, phased, full-length consensus sequences for multiple genes in a single sequencing run.
Poster sections: Introduction; Motivation; Amplicon Analysis Overview; Mixed Long Amplicon Analysis; SMRT® Sequencing; Full-Length HLA Class I; Conclusions; References.
Full poster (must-see link): https://s3.amazonaws.com/files.pacb.com/pdf/Long_Amplicon_Analysis_Highly_Accurate_Full-length_Phased_Allele_Resolved_Gene_Sequences_from_Multiplexed_SMRT_Sequencing_Data.pdf
Homolog.us – Bioinformatics
Frontier in Bioinformatics
June 23rd, 2014 ///-- Another PacBio Development – Adam Phillippy’s New MHAP Module
http://wgs-assembler.sourceforge.net/wiki/index.php?title=PBcR -------------------------------------
Alignment speed is the biggest bottleneck in PacBio assembly. Therefore, those working on PacBio reads will find the following release helpful.
June 14, 2014 – PBcR and CA 8.2 alpha are now available as source or pre-compiled for Linux. PBcR now incorporates a novel probabilistic overlapper for self-correction of sequences named MHAP. This allows assembly of prokaryotic genomes in < 30 minutes on a typical desktop and assembly of small eukaryotic genomes in < 2 days. If you use MHAP, please cite the Biology of Genomes poster (Berlin K., Koren S., et al. Reducing assembly complexity of genomes with single-molecule sequencing. Biology of Genomes, 2014). For best results with MHAP, java 1.7r51 or newer is recommended.
How good is it? Here is the plain language description -
———————————————-
Given that alignment is a big bottleneck, we previously checked whether BWA-mem could improve the execution time. Heng Li came up with a set of optimal parameters to get the best alignment. Readers may find the following comment from Irek in that thread helpful -
Hey, I just finished a comparison for new PacBio RS II data (CCS and CLR), using bwa-sw, mem, blasr, ssaha2, smalt, last and agile.
I checked speed, memory, and mapping status of reads, then went for a precision-recall assessment, and finished with an analysis of error-model recognition.
Actually, the new version of SMALT looks like a winner, and it’s really fast and memory efficient.
As for mem vs. blasr: for CCS they are comparable in terms of precision-recall, but for CLR, mem definitely loses.
-------------------------------------------------------------------------------------------------------------
June 23rd, 2014 | Category: pacbio-- http://www.homolog.us/blogs/blog/2014/06/23/another-pacbio-development-adam-phillippys-new-pacbio-module/
By Bio-IT World Staff | June 19, 2014
PacBio Users Share New Tools and Applications at Meeting in Baltimore
This Tuesday, the Institute for Genome Sciences (IGS) at the University of Maryland, Baltimore played host to the annual East Coast user group meeting for Pacific Biosciences. While PacBio has never been in a position to challenge market leader Illumina on cost or throughput, its SMRT sequencer is the most differentiated device currently being sold, thanks to read lengths more than an order of magnitude longer than its competitors. With over one hundred SMRT sequencers now running around the world, geneticists and bioinformaticians are reporting new uses for PacBio’s technology faster than ever before. The ability to sequence genomes in fragments thousands of base pairs long has opened up projects that would be monumentally difficult, if not impossible, with standard next-generation sequencing.
The projects described at the PacBio user group meeting may be of particular interest now, as a second long-reading instrument, Oxford Nanopore’s MinION, is undergoing careful scrutiny in its early access period.
Whole Genomes from Scratch
One of the fastest-growing applications of SMRT sequencing has been de novo assembly of whole genomes, especially in microbes. “It was only last year that we published the first paper that discussed PacBio de novo assembly methods,” remembered Luke Hickey, PacBio’s Director of Business Development, in his address at the meeting. “Since then, there’s been a tremendous amount of work done.”
While short-read technologies almost always rely on a reference genome to place reads in context, with longer reads it becomes exponentially easier to piece a genome together from scratch. As Luke Tallon, Scientific Director of the IGS, observed, at the “golden threshold” of five- to seven-kilobase reads, it becomes possible to build an entire E. coli genome in one end-to-end contig, eliminating gaps. This is well within the capabilities of the current PacBio chemistry, which usually delivers half its data in reads over 10 kilobases long. De novo assembly is more reliable in catching structural variation than reference-guided assembly, and is not vulnerable to mistakes in the reference itself.
Tallon and his colleagues at the IGS have been relying heavily on the Institute’s SMRT Sequencer for an ongoing project to build new reference genomes of clinically relevant microbes for the NCBI’s GenBank database. In one early batch of 50 different Staphylococcus aureus samples, the IGS was able to assemble 32 of the samples in single contigs, a level of completeness that makes it easier to validate genomes. The IGS will be expanding this project to over 550 different microbes, as reliable long-read data makes it possible to speed up the process of creating new reference genomes.
While the IGS works to improve the back-end resources available to microbiologists, other groups are bringing SMRT sequencing into more active settings. Sean Conlan of the National Human Genome Research Institute reported on a study of a carbapenem-resistant Klebsiella pneumoniae outbreak. Like many outbreaks, this one was complex in its origins and transmission — and further complicated by the presence of dozens of plasmids, including at least two that carried genes for carbapenem resistance.
To trace the relationships between bacteria and plasmids isolated from different patients, Conlan’s group had to sequence a large number of samples, covering enough of the genome that similarities between key regions of unrelated plasmids and chromosomes would not be misleading. PacBio instruments let the team reliably close whole plasmids and chromosomes, capturing all the data they needed to trace the outbreak. Previously, this work had been done through a painstaking process that integrated short-read data, targeted PCR and optical mapping.
In one case, Conlan believes his team’s data may have overturned the reference sequence for a complex, repetitive region in a key antibiotic resistance plasmid. “You start to doubt your references when you’re dealing with data of this quality,” he said.
Scaling to Human
While bacteria and archaea, with their small, haploid genomes, still lead the way in de novo sequencing, PacBio users are also turning their attention to more genetically complex organisms. PacBio itself has released reads for the Drosophila, spinach, goat, and human genomes, among others. The human genome in particular is an important frontier for de novo sequencing, which can be more informative than the reference-guided resequencing that is now the standard. As Richard McCombie, a psychiatric geneticist at the Cold Spring Harbor Laboratory, said, “Resequencing is great in some ways. We can do it very inexpensively; it’s under $3,000 to do a whole genome now on an Illumina machine. But it misses some structural variants, and it misses some regions of the genome.”
One intriguing project centered on SMRT sequencing of human DNA is taking place at the Genome Institute of Washington University. There, a rare type of haploid human sample, the result of an abnormal pregnancy in which only the sperm contributed DNA to the embryo, has become the basis for improving some of the roughest regions of the human reference genome. PacBio’s assembly of this sample has an N50 contig length of over 4 megabases — by far the most contiguous human genome ever constructed other than the reference genome itself.
Tina Graves-Lindsay, the leader of the reference genomes group at Washington University, reported on efforts to integrate pieces of this assembly into the reference genome. The haploid sample can provide clearer information in areas where high levels of structural variation between alleles have led to ambiguous assemblies when using diploid samples. To refine the data supplied by PacBio, the Genome Institute has been building libraries of BAC clones that cover the most disputed regions, and sequencing those with a SMRT Sequencer. “Many of these clones will actually end up in the reference, so if the region is a mess in the reference, we can use this sequence to fix it,” said Graves-Lindsay. The long-read data has already helped to resolve questions about the SRGAP2 and IGH genes.
“Our ultimate goal for this is to end up with a single-allelic representation of the entire genome,” she added, which could be used as a complete reference with very little structural ambiguity.
The Best Tools for the Job
New instruments and computational tools have been essential to improving the quality of PacBio data. At the user group meeting, the BluePippin device from Sage Science was repeatedly credited with opening up the long-read potential of SMRT Sequencers. By allowing users to choose fragment sizes during library preparation, up to the kilobase range, the BluePippin ensures that long-read sequencing is not limited by the DNA sample. More than one user reported that their N50 read lengths doubled after using a BluePippin.
Meanwhile, software tools created both at PacBio and by the company’s user community have helped to make sense of long reads. The HGAP tool has become the standard for de novo assembly of bacterial genomes, and PBJelly, the brainchild of Adam English at the Human Genome Sequencing Center (HGSC) at Baylor College of Medicine, has automated the process of using long reads to fill in gaps in draft genomes. The still-experimental FALCON is PacBio’s follow-up to HGAP, for assembling diploid genomes.
Several presenters at the user group meeting shared new tools that can be added to this arsenal, all of them freely available. William Salerno, from the HGSC, described PBHoney, which tweaks PBJelly to focus on finding structural variants. In the same way that PBJelly searches for reads that span or extend into gaps in an assembly, PBHoney searches for reads that span error events in the assembly caused by structural variants. It can also identify reads that map partly to the edge of an error event, with dangling “tails” that map somewhere else, and use that information to resolve long inversions and duplications. The HGSC has tested PBHoney on both E. coli and human samples, and successfully resolved structural variants in both.
For scientists using SMRT sequencers on larger genomes, Adam Phillippy, the principal bioinformatician at the National Biodefense Analysis and Countermeasures Center (NBACC), presented a dramatically faster alternative to BLASR, the standard tool for overlapping long reads that is used as part of HGAP. The NBACC first saw the need to replace BLASR after assembling a Drosophila melanogaster genome using PacBio reads. The project itself was very successful, putting together a more contiguous assembly than the species’ reference genome in just six weeks. “There was so much data generated that we were able to assemble the genome using only reads greater than 17kb,” said Phillippy. “It was unlike anything I had seen at the time. There were entire chromosome arms that were assembled into a single piece.”
However, that assembly also underscored the high compute demands of BLASR, which took up over 600,000 CPU hours, or more than 90% of the project’s total compute time. To reduce those demands, Phillippy’s colleague Sergey Koren turned to an algorithm developed for the AltaVista search engine in the 1990’s. Although this algorithm had never been used in genetics, it provided a novel way to rapidly track the similarity between two sets of data — whether that data is duplicate web pages, or long DNA fragments that need to be overlapped.
The algorithm was repurposed as MHAP, which reduces each long read to a much smaller numerical “sketch” while preserving the information needed to detect overlaps. MHAP divides each read into a set of k-mers, then runs each k-mer through hundreds of hash functions. For each hash function, the program stores only the smallest hash value produced by any k-mer. This allows a dramatically faster search for overlaps between reads, by comparing small sets of numbers instead of massive strings of DNA bases.
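The technique being described is MinHash sketching. MHAP’s actual implementation (in Java) differs in detail, and all names and parameters below are illustrative, but a minimal Python sketch conveys the principle: reads that share many k-mers tend to share per-hash minima, so comparing short sketches estimates the overlap between their k-mer sets.

```python
import hashlib
import random

def kmer_hash(kmer, key):
    """64-bit keyed hash of a k-mer (BLAKE2b with an 8-byte key)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big")

def minhash_sketch(seq, keys, k=16):
    """For each hash function (identified by its key), keep only the
    smallest hash value seen over all k-mers in the sequence."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [min(kmer_hash(km, key) for km in kmers) for key in keys]

def sketch_similarity(a, b):
    """Fraction of positions where two sketches share a minimum:
    an estimate of the Jaccard similarity of the k-mer sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two simulated reads from the same molecule, overlapping by 1,000 bases.
rng = random.Random(0)
molecule = "".join(rng.choice("ACGT") for _ in range(2000))
read_a, read_b = molecule[:1500], molecule[500:]

# 128 "hash functions", each just a different 8-byte key.
keys = [random.Random(i).getrandbits(64).to_bytes(8, "big") for i in range(128)]
s_a = minhash_sketch(read_a, keys)
s_b = minhash_sketch(read_b, keys)
print(round(sketch_similarity(s_a, s_b), 2))  # estimates the k-mer Jaccard (~0.5 here)
```

The speedup comes from the sketch size being fixed (here 128 integers) regardless of read length, so each overlap test is a constant-size comparison rather than an alignment.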
The results were impressive; running the Drosophila reads again through MHAP instead of BLASR reduced the compute time from 600,000 CPU hours to just 1,000, while improving the assembly. The NBACC found that the entire process could be performed through Amazon Web Services for under $300. Smaller, prokaryotic genomes, like E. coli, could be assembled in half an hour on a desktop computer — and as genomes get larger, MHAP realizes greater and greater efficiencies. A preliminary version of MHAP, written in Java, is available through SourceForge.
Continuing Improvements
Other niche uses of SMRT sequencing were discussed at the meeting, including the ability to trace methylation patterns of DNA, and the IsoSeq method, which uses long reads of RNA to capture alternative splicing, showing which different protein isoforms are likely to be present in a sample.
Meanwhile, PacBio continues to work on its chemistry, which has repeatedly doubled the SMRT Sequencers’ average read lengths year over year. Kevin Corcoran, the company’s Senior Vice President of Market Development, said that the next release of a combined chemistry update and improved library loading is slated for early 2015, and is predicted to bring the instrument’s N50 read length to between 12 and 18 kilobases.
Luke Hickey also announced that PacBio has entered a partnership with GenDx to release an HLA typing system. HLA typing has traditionally relied on the much slower and more labor-intensive Sanger sequencing, because the complexity of the region is too great to be reliably captured with short-read next-generation technologies. The combined GenDx-PacBio system will be the first commercial solution to use next-generation sequencing to deliver complete HLA genotypes.
“We’ve been thinking about larger problems at PacBio, and more diverse problems,” said Hickey in his concluding address. While PacBio is unlikely to seize a huge share of the sequencing market for the foreseeable future, the company is finding more and more niches for its unique chemistry. As the user group meeting in Baltimore demonstrated, most problems tackled by PacBio users could not be adequately addressed by any other sequencer on the market today.
http://www.bio-itworld.com/2014/6/19/pacbio-users-share-new-tools-applications-meeting-baltimore.html
June 19, 2014 by Lex Nederbragt--// Our review of “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data”, aka the HGAP paper
As it is out in the open that I was one of the reviewers of the ‘HGAP’ paper, I thought I might as well make my review publicly available.
I have posted the review report (from February 2013) online at Publons. The review was actually done together with a PhD student in the group, Ole Kristian Tørresen (I like to do reviews together with others, it leads to better reviews and is a great learning experience for students!).
Here are the first few paragraphs. Enjoy!
—
In this manuscript, the authors show how, using just a single sequencing technology, one can assemble finished bacterial genomes. The strategy one chooses for assembling a bacterial genome depends – as for any genome – on the goal of the project. For the remainder, we will assume the goal is as complete a reconstruction of the genome sequence as possible: as close as possible to a single, gapless, error-free contig per chromosome (or plasmid).
Before the method described in this manuscript became available, the best available strategy was to use the ALLPATHS_LG method described in Ribeiro et al (doi:10.1101/gr.141515.112). However, this approach requires three different libraries and two different sequencing technologies: paired-end and mate-pairs from Illumina, and long PacBio reads. The next best thing would be using PacBioToCA and Celera (or a comparable program), see Koren et al (doi:10.1038/nbt.2280). Here, two different libraries and two different sequencing technologies are needed: paired-end from Illumina, and long PacBio reads. The ALLPATHS_LG strategy so far outperforms the PacBioToCA results as reported in these two publications.
The novelty of the approach demonstrated in this manuscript is that one can use a single library, with a single sequencing technology. In the sense of making sequencing bacterial genomes practical, this is a clear novelty and a significant advantage over the alternatives. It will interest groups doing de novo assembly, whether routinely or occasionally, of bacterial genomes and other constructs of smaller size (such as BACs), provided they have access to PacBio sequencing or can order from a service facility.
The next question is then how well the method works. The authors have demonstrated this by applying the method to three bacterial genomes for which a high-quality reference genome is available. The results are indeed impressive. The remaining problems are centred around repeats for which probably not enough long, spanning reads were available.
The authors do not compare their results with those presented by Ribeiro et al, which is understandable as – with the exception of E. coli – different genomes were tested for each approach. It would be too much to ask the authors to apply their method to the two other reference genomes used in Ribeiro et al. However, a more in-depth comparison with the E. coli K12 MG1655 data would be a good addition to the manuscript (e.g., by running the final ALLPATHS_LG assembly through the same analysis pipeline). A quick glance at comparable tables shows that the ALLPATHS_LG strategy results in somewhat better assemblies (final quality scores are higher and fewer errors are present), but a thorough comparison was not performed for this report. Suffice it to say that for many, the significant simplicity of the method presented in this manuscript will outweigh the (very) slightly lower final quality.
An important aspect of a new method is reproducibility and ease of implementation for the intended users. Time did not permit a full attempt at redoing the analyses described. However, because we have access to the latest version of the PacBio smrtpipe software (version 1.4), we were able to reproduce one of the results: using the 8 SMRT Cells of the E. coli dataset described in the manuscript, and the PreAssembly module of smrtpipe, we generated a set of preassembled reads whose statistics perfectly match those described in Supplementary Table 1, second row (15252 reads, average length 5466 bp, N50 length 6291 bp). We have not tried assembling these reads. Regarding reproducibility, it would be good of the authors to provide recipes describing step-by-step how to go from the downloaded data to final assemblies. Regarding ease of implementation, could the authors provide information on running times and memory use of the different steps (preassembly, assembly and Quiver)? What are the hardware requirements for running the software? This will help the reader judge what it would take to implement the method compared to other such methods.
In conclusion, the manuscript as presented here is a significant advance in the field, scientifically sound, clearly written, and of interest to the intended audience. It could be improved on the reproducibility aspect (with the addition of recipes).
http://flxlexblog.wordpress.com/2014/06/19/our-review-of-nonhybrid-finished-microbial-genome-assemblies-from-long-read-smrt-sequencing-data-aka-the-hgap-paper/
GitHub--Home
Magdoll edited this page June 11, 2014 · 30 revisions
Welcome to the Iso-Seq wiki!
Iso-Seq Datasets
•PacBio In-House Transcriptome Datasets
Code Repository Information
•cDNA_primer-tofu-CHANGELOG
Background Information
•Understanding PacBio transcriptome data
•A hands on tutorial of three aligners: BLAT, BLASR, and GMAP
•Iso-Seq protocol: Bioinformatics study of common concerns
•Error Correction using Illumina short reads
•Glossary for PacBio transcriptome
PacBio Datasets
•PacBio In House Transcriptome Datasets
RS_IsoSeq (official pipeline) tutorial
•RS_IsoSeq (v2.2.0) Tutorial #1. Getting full length reads
•RS_IsoSeq (v2.2.0) Tutorial #2. Isoform level clustering (ICE and Quiver)
•Iso Seq FAQ
GitHub code (pbtranscript-tofu) tutorial
•What is pbtranscript tofu? Do I need it?
•Setting up virtualenv and installing pbtranscript tofu
•tofu Tutorial #1. Getting full length transcripts
•tofu Tutorial #2. Isoform level clustering (ICE and Quiver)
•tofu Tutorial #3. Removing redundant transcripts
•tofu runtime and memory usage
•Iso Seq FAQ (same link as above)
--------------------------------------------------------------------------------
https://github.com/PacificBiosciences/cDNA_primer/wiki
(June 10, 2014: Pacific Biosciences got in touch with quantitative ecologist Karthik Ram of the University of California (UC), Davis.)
Give, and It Will Be Given to You
By Eli Kintisch
“We’re trying to bring the culture across disciplines and lower the bar to sharing.” —Karthik Ram
In 2007, quantitative ecologist Karthik Ram sought to find out why certain insect parasites appeared in some sand dunes but not others. Ram, who was a graduate student at the time, thought that asking scientists for field data used in the papers they published was no big deal. But the scientists he e-mailed ignored his requests, so Ram, then at the University of California (UC), Davis, had to collect extra insect samples.
Later, as he studied how climate change was impacting vegetative growth as a postdoc at UC Santa Cruz, Ram found that colleagues weren’t willing to hand over the raw measurements behind published data, or the algorithms that supported the authors’ conclusions. So, Ram spent a year reproducing the data sets so he could use them in his analyses. "I did it all myself, even though I knew that others had done this work before. I personally felt a little bit cheated," Ram says. "Aren't research papers meant to be recipes, allowing colleagues to reproduce their conclusions? But usually they’re not. And nobody thought about publishing the code they used at the end of their paper."
CREDIT: Donald Strong
Karthik Ram measures the persistence of entomopathogenic nematodes at a California coastal grassland, Bodega Bay.Seven years on, the practice of science is becoming more open, and a culture of sharing preprints, data sets, and scientific code is spreading. Ram—one of the pioneers—is prodding and enabling that shift. In 2011, he and his colleagues created rOpenSci, a platform and repository that boasts dozens of open-source data-and-analysis packages serving fields ranging from climate science to vertebrate biology via human genetics. Today, the Alfred P. Sloan Foundation awarded the project, which operates out of U.C. Berkeley, a second round of funding, bringing its total funding to $481,000. rOpenSci is one of a growing community of tools—Dryad, Mendeley, figshare, GitHub, and arXiv are others—that help scientists more easily share data and other resources. “We’re trying to bring the culture across disciplines and lower the bar to sharing,” says Ram, today an assistant researcher at Berkeley. “More and more people are seeing the value in sharing their data.”
An evolving culture
For more than a century, the peer-reviewed paper has been the main way that scientists share their work. But in the 1980s and 1990s, respectively, open-science adopters started sharing work via preprint servers: Research Papers in Economics—RePEc—for economists and then arXiv for physicists. As part of the Human Genome Project, the government required researchers to make their genomic data and related code available freely. In other fields, though, sharing drafts of papers—or for that matter, data or code—hasn’t really caught on.
The World Wide Web’s power to host myriad collaborative tools—GitHub, a code-sharing repository platform used by millions of software engineers, is a leading example—has inspired scientific societies, journal publishers, and universities to apply steady pressure for science that’s more open. The National Science Foundation (NSF), the National Institutes of Health, and other agencies in the United States and abroad have data-sharing requirements. Enforcement, though, is spotty.
Scientific publisher the Public Library of Science (PLOS) became a leader in the movement when it announced a new policy in February requiring authors in all its journals to archive the raw data used in PLOS papers. “Data availability allows validation, replication, reanalysis, new analysis, reinterpretation, or inclusion into meta-analyses, and facilitates reproducibility of research,” PLOS editors wrote, adding that sharing would provide “better ‘bang for the buck’” from scientific research.
In frequent talks around the country, Ram hears a lot of skepticism from scientists, he admits. Scientists generally believe that sharing is a good idea in principle, he finds, but in practice many are reluctant. After the PLOS announcement, for example, a firestorm erupted on Twitter under the hashtag #plosfail. The policy would “radically change the way [researchers] do science, at great cost of personnel time,” wrote DrugMonkey, an anonymous blogger. Neuroscientist Erin McKiernan, of the Monterrey Institute of Technology and Higher Education’s Cuernavaca campus in Mexico, explained that in that country “data acquired are like gold, and it is absolutely crucial that researchers here get as many publications out of one data set as possible.”
Biologist Terry McGlynn at California State University, Dominguez Hills in Carson fears other scientists might use data he posts online and not collaborate with him. Once sharing data sets “gets as much recognition and credit [as papers] in the academic sphere,” he says, “I’d be a lot more interested in sharing.”
Incentives for sharing
What, then, are the incentives for scientists to share? Hiring and tenure committees, in their deliberations over the fate of faculty, still focus by and large on traditional metrics: publications in high-impact journals, citations, and grants. They continue to generally undervalue shared data sets, methods, and analytical tools—despite the fact that such work is “wickedly important” to the scientific enterprise, as Tom Daniel, former chairman of the Department of Biology at the University of Washington in Seattle, puts it in an e-mail to Science Careers. His university, he says in an interview, is “discussing” rewarding such contributions, but he describes those discussions as “a work in progress.”
Yet, some scientists say that sharing data has paid big professional dividends. Genomicist Casey Bergman, of The University of Manchester in the United Kingdom, says that ever since his postdoc days, his career has “definitely benefited” from shared genomic data and software. So now that he is a faculty scientist, he has made a point of sharing data and resources. After his group utilized a new gene-sequencing tool, they posted some unpublished genome data online. Biotech firm Pacific Biosciences got in touch. The result was a new collaboration and what Bergman calls “an amazing genomics data set” on fruit flies. “Small groups can benefit from embracing open data just as big consortia have in the past,” he says.
In 2012, bioinformaticist C. Titus Brown, of Michigan State University (MSU) in East Lansing, posted a draft paper on a preprint server describing a new sequence-analysis technique. Since then, hundreds of scientists have used it. Even though it isn’t formally published, the technique has been cited in 15 peer-reviewed papers. Brown believes that this and other influential software he has developed, which have led to grants and job offers, should help him get tenure. “My career has developed in large part because I’ve been open about everything,” he writes in an e-mail to Science Careers.
Ecologist Ethan White of Utah State University, Logan believes that sharing can help you win grants because “funding agencies want to know that the money they spend will benefit science as a whole.” Adds Ram: “You collected data via publicly funded work; it’s not yours to hoard forever.”
Courtesy of Michigan State University
C. Titus Brown
Sharing 101
One of the top complaints about sharing data is that it takes a lot of time, and indeed it often does. So, how can scientists start sharing their data without too much extra effort? Ram and other sharing gurus have plenty of tips and exhortations.
Think about sharing before you begin collecting data. Metadata—the background info on the data you’ve collected—is crucial for colleagues to be able to use your data for new research. But creating it is difficult if you wait to do it until long after your experiments or fieldwork, White says. Ram urges scientists to prepare to share from a project’s first day. It need not be drudgery, he says, noting that rOpenSci has a tool that can create automated workflows to continually update metadata for ecology.
Don’t worry about getting scooped. That’s a very low risk, says evolutionary genomicist Ian Dworkin at MSU. “Young scientists might be worried about releasing data into the wild on a database, concerned that somebody might beat [them] to the punch,” he says. It might happen, but it would be “a fringe case,” he adds.
Don’t post your data on your personal website. You may change institutions, or the site could be shuttered in the future, Ram says. If that happens, the data could be lost, or you might need to start over again. Your data will be safer—and more visible—if you file it with similar data in a place where it can easily be searched for and found. Many universities offer data repository services (check your institution’s library), and hundreds of research repositories have been established; Dryad and figshare are popular ones that span many fields.
• Don’t try to create your own license, White says. Standard licenses, like one from Creative Commons, can help you avoid the ambiguity created when researchers try to write their own sharing agreements.
• Use common, appropriate tools. rOpenSci was built for scientists motivated to share but who don’t know how, Ram says. It’s built around data-and-analysis packages that use R, an open-source statistical analysis language that uses brief lines of written code instead of a series of pull-down menus, allowing for easier sharing and iteration. Learning R, and how to follow rOpenSci’s myriad recipes, requires about a day of training, Ram says. Some packages include actual data; others enable searching of data sets posted on the site, or in full-text journals or other repositories.
• Sharing preprints can prevent embarrassing errors from getting published, Ram says. Posting code you used to generate your figures can allow colleagues to check your work and improve it. “Better to have colleagues catch your mistakes before they appear in a journal,” he says.
• Your work includes products, not just publications. As NSF put it in 2012: “products may include, but are not limited to, publications, data sets, software, patents, and copyrights.” When you post a tool online and someone uses it to generate new results, that’s a citation. Be sure to list it on your CV.
• Link your data, methods, and papers online. Someday, Ram hopes, it will be standard practice for a scientist’s full work on a project—the data, the metadata, the code, and the paper itself—to sit together online, freely and indefinitely.
• Just get started, says Brown. White has written a guide called “Nine simple ways to make it easier to (re)use your data.”
As for his own career, Ram says he’s glad he has become a sharing evangelist. He enjoys helping a wide range of scientists share—and discover—data online. The new funding from the Sloan Foundation will allow him to add a second colleague to rOpenSci’s paid team and to create new data and analysis packages to reach new fields, including several social science fields. “I thought I would follow the script [and] work towards a tenured position in academia,” he says. “I never planned my career to take this path.”
Eli Kintisch is a contributing correspondent for Science.
http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2014_06_10/caredit.a1400146
Single Molecule, Real-Time Sequencing of Full-length cDNA Transcripts Uncovers Novel Alternatively Spliced Isoforms----https://s3.amazonaws.com/files.pacb.com/pdf/Single+Molecule+Real+Time+Sequencing+of+Full+length+cDNA+Transcripts+Uncovers+Novel+Alternatively+Spliced+Isoforms.pdf?utm_content=buffer1f78d&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
(Just a comment about the Genia purchase) -- opiniomics: bioinformatics, genomes, biology etc. "I don't mean to sound angry and cynical, but I am, so that's how it comes across"
Genia single molecule sequencing
6 Replies
I’m at ICG8 in Shenzhen and have rare access to the internet! I have to use my University VPN as both wordpress and Twitter are inaccessible within China <- WEIRD!
Genia presented yesterday; they are one of the new breed of single-molecule sequencing companies. Working out of Mountain View, CA, they have about 25 employees. Their stated aim is to produce a sequencer for less than $1000 and to be able to sequence human genomes for less than $100.
Their technology is interesting in that it uses both a biological nanopore and a polymerase. The polymerase sits atop the nanopore, and a template single molecule of DNA is provided to the polymerase. Single nucleotides are added to the mix and each nucleotide has a “NanoTag” tail. As the polymerase incorporates the nucleotides, the nanopore sucks in the NanoTag, and this event is measured electronically. The polymerase moves on and the nanopore measures the next base etc etc.
They have done proof-of-principle sequencing of templates of about 10 bp. They envisage longer templates in future, and even circular templates (presumably to increase accuracy). The measurement rate is around 5-10 bases per second.
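The one-tag-per-incorporation scheme described above is easy to picture with a toy model: each nucleotide's NanoTag produces a distinct current level in the pore, and base calling is just mapping levels back to bases. The current levels and mappings below are invented for illustration only; they are not Genia's actual signal values.

```python
# Toy model of tag-based nanopore readout. The pA levels are hypothetical.
TAG_LEVEL = {"A": 10.0, "C": 20.0, "G": 30.0, "T": 40.0}
LEVEL_TAG = {v: k for k, v in TAG_LEVEL.items()}
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}

def synthesize(template):
    """As the polymerase incorporates each complementary nucleotide,
    the pore captures its NanoTag and reports one current level per base."""
    return [TAG_LEVEL[COMP[b]] for b in template]

def decode(levels):
    """Base calling: map each measured level back to the incorporated
    nucleotide, then complement to recover the template sense."""
    return "".join(COMP[LEVEL_TAG[l]] for l in levels)

trace = synthesize("GATTACA")
print(decode(trace))  # GATTACA
```

In reality the trace is noisy analog current, not clean discrete levels, which is presumably why the slides looked "very messy"; a real base caller has to segment and classify the signal rather than look up exact values.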
We were shown the now familiar nanopore traces which the speaker claimed showed “clear, accurate data” but which looked very messy to me.
They have a current machine at the “alpha” stage, and a “beta” version will be available “next year”.
http://biomickwatson.wordpress.com/2013/10/31/genia-single-molecule-sequencing/
Monday, June 2, 2014--Intro to the Iso-Seq Method: Full-length transcript sequencing
With the recent launch of SMRT Analysis v2.2, we’re excited to introduce analysis software support for the new Iso-Seq™ method for sequencing full-length transcripts and gene isoforms, with no assembly required! Today we’ll take a deeper look at the Iso-Seq method to explain its unique scientific value and review publications from those already applying Single Molecule, Real-Time (SMRT®) Sequencing to this exciting area of research.
In plant and animal genomes, along with all higher eukaryotic organisms, the majority of genes are alternatively spliced to produce multiple transcript isoforms. In humans, for example, there is evidence for alternative splicing of more than 95% of genes [1], with an average of more than five isoforms per gene. Gene regulation through alternative splicing can dramatically increase the protein-coding potential of a genome that contains a limited number of protein-coding genes. Somewhat surprisingly, alternatively spliced isoforms from a single gene can also have very different, even antagonistic, functions [2]. Therefore, understanding the functional biology of a genome requires knowing the full complement of isoforms. Microarrays and high-throughput cDNA sequencing have become incredibly useful tools for studying transcriptomes, yet these technologies provide only small snippets of transcripts, and building complete transcripts to study gene isoforms has been challenging.
Thanks to the extraordinarily long reads available with PacBio® sequencing, the new Iso-Seq method provides full-length reads spanning entire transcript isoforms all the way from the polyA-tail to the 5' end. It is no longer necessary to reconstruct transcripts or infer isoforms based on combining local information since each sequence represents an individual full-length cDNA molecule. The method combines isoform-level resolution with the best of whole-transcriptome sequencing to enable direct gene isoform sequencing across an entire transcriptome. We’re pleased to report that scientists are now using the Iso-Seq method to routinely sequence full-length isoforms in a wide variety of organisms and are applying the approach to improve annotations in reference genomes, characterize gene isoforms in important gene families, and find novel genes even in the most comprehensively studied human cell lines.
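As a rough illustration of what "full-length" means here, a read can be screened for the 5' cDNA primer at its start and a polyA tail near its end, loosely in the spirit of the Iso-Seq classification step. The primer sequence and thresholds below are placeholders, not PacBio's actual parameters.

```python
def is_full_length(read, five_prime="AAGCAGTGGT", min_polya=8):
    """Simplified full-length check: the read must begin with the 5' cDNA
    primer and carry a run of A's (polyA tail) near its 3' end. Both the
    primer sequence and the thresholds are hypothetical placeholders."""
    has_5p = read.startswith(five_prime)
    has_polya = "A" * min_polya in read[-(min_polya + 20):]
    return has_5p and has_polya

# A synthetic full-length read: primer + transcript body + polyA tail
fl_read = "AAGCAGTGGT" + "ATGGCCATTGTAATGGGCCGC" + "A" * 12
print(is_full_length(fl_read))                  # True
print(is_full_length("ATGGCCATTGTAATGGGCCGC"))  # False: no primer, no tail
```

Because both transcript ends are visible on a single molecule, a read passing this kind of check represents one complete isoform with no assembly or inference required.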
Here is a selected list of recent publications, presentations, and sample data:
• Au et al. (2013) Characterization of the human ESC transcriptome by hybrid sequencing. PNAS 110: E4821-4830.
• Sharon et al. (2013) A single-molecule long-read survey of the human transcriptome. Nature Biotechnol 31: 1009-1014.
• Thomas et al. (2014) Long-read sequencing of chicken transcripts and identification of new transcript isoforms. PLoS One. 9: e94650.
• Treutlein et al. (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS 111: E1291-1299.
• Zhang et al. (2014) PacBio sequencing of gene families - a case study with wheat gluten genes. Gene 533: 541-546.
• Brinzevich et al. (2014) HIV-1 interacts with human endogenous retrovirus K (HML-2) envelopes derived from human primary lymphocytes. J Virology 88: 6213-6223.
• Larsen et al. (2012) Application of circular consensus sequencing and network analysis to characterize the bovine IgG repertoire. BMC Immunology 13: 52.
• Webinar: No Assembly Required - Extremely Long Reads for Full-length Transcript Isoform Sequencing
• Human MCF-7 Iso-Seq Dataset
For more detailed information on cDNA sequencing with PacBio, don’t miss this primer on GitHub and the shared protocol on SampleNet.
[1] Pan et al. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics 40: 1413-1415.
[2] Boise et al. (1993) Bcl-x, a bcl-2-related gene that functions as a dominant regulator of apoptotic cell death. Cell 74: 597-608.
http://blog.pacificbiosciences.com/
Near-perfect de novo assemblies of eukaryotic genomes using PacBio long-read sequencing
James Gurtowski, Schatz Lab, 5/29/2014
http://schatzlab.cshl.edu/presentations/2014.05.29.SFAF.EukaryoticAssembly.pdf
May 13, 2014 -- Anthony Nolan Revolutionizes HLA Typing With Breakthrough DNA Sequencing Technology From Pacific Biosciences
HAMPSTEAD, LONDON and MENLO PARK, Calif., May 13, 2014 (GLOBE NEWSWIRE) -- Anthony Nolan, the UK blood cancer charity, and Pacific Biosciences of California, Inc., (Nasdaq:PACB) announced today that Anthony Nolan is the world's first stem cell registry to invest in an innovative new technology for advanced tissue typing.1 The charity, which led the way 40 years ago when it created the world's first bone marrow register, is continuing its record for innovation by purchasing two PacBio® RS II systems, which enable Single Molecule, Real-Time (SMRT®) DNA Sequencing of full-length HLA genes.
Anthony Nolan will offer unparalleled detail and accuracy across its entire tissue typing service. The PacBio RS II system was selected because it is the only system available that can sequence full-length HLA genes due to its industry-leading read lengths and consensus accuracy.
Professor Steven Marsh, Anthony Nolan's Director of Bioinformatics, said: "Anthony Nolan has, from its inception, always been a scientifically pioneering organization. Investment in Pacific Biosciences SMRT technology will enable us to conduct allele-level typing, as standard. By providing the highest resolution typing available, we will be able to unambiguously phase HLA alleles for research in tissue transplantation and other applications, with the goal of making bone marrow and blood stem cell transplants more successful."
Professor Marsh continued: "Anthony Nolan intends to use this new technology to comprehensively HLA type new and existing donors as well as improve and extend services to our current customer base. Allied with this, Anthony Nolan's strategy seeks to offer services to new customers requiring full HLA typing for first-time donors, re-typing existing donors, confirmatory typing when donor/patient matches have been found, and typing for HLA-related disease association and drug hypersensitivity. This ground-breaking technology means Anthony Nolan staff and our customers will gain extra confidence that they have the most comprehensive data available as we strive toward ultimately improving transplant outcomes for patients in the future."
"We are proud to have innovative leaders like Anthony Nolan adopt our platform as we bring our installed base to more than 100 systems worldwide," said Michael Hunkapiller, President and Chief Executive Officer of Pacific Biosciences. "Together with Professor Steven Marsh, the designer and curator of the worldwide IMGT/HLA Database, we are excited that Anthony Nolan will now begin enhancing this critically important resource with full-length HLA genes. We anticipate that the unique advantages of SMRT Sequencing will also provide significant contributions to the IPD-KIR Sequence Database."
About Anthony Nolan
Anthony Nolan, now in its 40th anniversary year, was the world's first bone marrow register. The blood cancer charity has been saving lives for four decades by matching remarkable people willing to donate their bone marrow to patients in desperate need of a transplant.
About blood cancer
Every 20 minutes someone in the UK is diagnosed with a blood cancer. Around 1,800 people in the UK need a bone marrow (or stem cell) transplant each year. This is usually their last chance of survival. 63% of UK patients will not find a matching donor from within their families; instead they turn to Anthony Nolan to find them an unrelated donor.
About Pacific Biosciences
Pacific Biosciences of California, Inc. (Nasdaq:PACB) offers the PacBio® RS II DNA Sequencing System to help scientists solve genetically complex problems. Based on its novel Single Molecule, Real-Time (SMRT®) technology, the company's products enable: targeted sequencing to more comprehensively characterize genetic variations; de novo genome assembly to more fully identify, annotate and decipher genomic structures; and DNA base modification identification to help characterize epigenetic regulation and DNA damage. By providing access to information that was previously inaccessible, Pacific Biosciences enables scientists to increase their understanding of biological systems.
_____________________
1 Tissue typing is a process carried out at the time potential donors join the register, before a blood stem cell transplant. Transplants are used to treat blood cancers (e.g. leukemia) and other serious blood disorders. The human leukocyte antigen (HLA) of the donor must exactly, or very closely, match the HLA (tissue type) of the patient requiring the transplant.
http://investor.pacificbiosciences.com/releasedetail.cfm?ReleaseID=847568
PacBio Blog - Tuesday, May 6, 2014--Retroviral Study Reveals Potential for Influencing HIV Replication
Scientists from the Icahn School of Medicine at Mount Sinai in New York City and the MRC National Institute for Medical Research in London published a paper using Single Molecule, Real-Time (SMRT®) Sequencing to gain a better understanding of how human endogenous retroviruses may be interacting with HIV infection. They pursued a new avenue of research that could shed light on how to interfere with HIV replication.
“HIV-1 interacts with HERV-K (HML-2) Envelopes derived from human primary lymphocytes” was recently published in the Journal of Virology, a publication of the American Society for Microbiology. Daria Brinzevich and George R. Young were lead authors on the work.
The scientists conducted a study uniquely suited to the extremely long reads provided by the PacBio® platform, noting that this technology was needed to accurately parse the complexity in expression among a specific group of human endogenous retroviruses (HERVs). “Applied to the sequencing of PCR products, PacBio reads maintain the entire product as an uninterrupted sequence, allowing reliable identification against reference libraries with the equivalent levels of similarity as those of HERV-Ks,” the authors write.
In this project, the scientists dug deeper into evidence that expression of the endogenous retroviruses that make up almost 5% of the human genome is upregulated when a person is infected with HIV-1. “HIV-1 infection in human cells is equivalent to a co-infection by several retroviruses,” they explain. They used SMRT Sequencing to analyze the expression profiles of the HERV-K group of retroviruses in lymphocytes from five healthy people.
The team found nearly 4,000 HERV-K sequences in these lymphocytes, compared to a previous study from other scientists that found fewer than 1,000 of these sequences in 11 samples. They posit that the higher number seen here reflects the greater sensitivity of PacBio sequencing as well as the difference in cell types analyzed.
In all, the authors identified more than 30 different transcripts for HERV-K envelopes, including two that produce full-length proteins — one of which was found to incorporate into HIV-1 particles. “These findings imply that some HERV-Ks interact specifically with HIV possibly shaping the properties of the lentivirus,” they write. “Future studies are needed to determine the extent of their influence on the HIV-1 life cycle and whether their expression can be harnessed to hinder HIV-1 replication.”
http://blog.pacificbiosciences.com/2014/05/retroviral-study-reveals-potential-for.html?utm_content=bufferfbe3f&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
PacBio Blog
Monday, May 5, 2014---Webinar Recap: New Insights in Genome and Transcriptome Research
This week, our CSO Jonas Korlach hosted a webinar entitled “Gain New Insights in Genome and Transcriptome Research with Greater than 10,000 bp Reads.” He spoke to attendees about the PacBio® technology, elements of sequencing, and applications of the ultra-long reads generated by Single Molecule, Real-Time (SMRT®) Sequencing. Here’s a quick recap.
Jonas offered a look at how PacBio’s technology performs on the four key characteristics to consider for any sequencing work: contiguity, accuracy, uniformity, and originality. For contiguity, or how much of a DNA fragment can be sequenced in a single pass, the PacBio platform outperforms all other technologies: half of the data generated in a SMRT Sequencing run is captured in reads longer than 10 kb, and many reads exceed 30 kb. Jonas added that the PacBio RS II also has industry-leading consensus accuracy, routinely at QV50 (>99.999%). Regarding uniformity, or the ability to sequence all of the DNA present in a sample evenly, the PacBio platform has been shown in an independent study to have the least bias of any sequencing technology. Finally, originality refers to the ability to sequence native, unamplified DNA, avoiding amplification bias and allowing epigenetic marks to be detected as part of sequencing, which Jonas noted is a unique attribute of SMRT Sequencing.
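The QV50 figure maps to the quoted accuracy via the standard Phred relation, QV = -10 log10(error probability). A quick sketch of that conversion (standard formula, not PacBio-specific code):

```python
import math

def qv_to_accuracy(qv):
    """Phred quality value -> accuracy: QV = -10 * log10(p_error),
    so accuracy = 1 - 10^(-QV/10)."""
    return 1.0 - 10 ** (-qv / 10.0)

def accuracy_to_qv(accuracy):
    """Inverse: accuracy -> Phred quality value."""
    return -10.0 * math.log10(1.0 - accuracy)

print(qv_to_accuracy(50))              # 0.99999, i.e. the ">99.999%" quoted above
print(round(accuracy_to_qv(0.99999)))  # 50
```

The scale is logarithmic, so each 10-point QV increase cuts the error rate tenfold: QV30 is 99.9% accurate, QV40 is 99.99%, QV50 is 99.999%.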
Jonas then turned to examples of how scientists are using the PacBio technology for a wide range of applications, starting with high-quality de novo assemblies and finishing of genomes ranging from microbial to large and complex genomes like the human genome. With plant genomes, such as Arabidopsis, SMRT Sequencing assemblies have been shown to include hundreds of thousands of SNPs that were missed by short-read assemblies of the same organism. A scientist working on the spinach genome saw his assembly go from 1.3 million contigs with short-read sequencing to fewer than 10,000 contigs with PacBio. In a human data set we released, the contig N50 of 4.4 Mb represented a 30-fold improvement over the previous best de novo human assembly with a contig N50 of 144 kb.
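Contig N50, the statistic quoted above, is the contig length at which contigs that long or longer cover at least half of the total assembly. A minimal computation (the contig lengths in the example are made up for illustration):

```python
def n50(contig_lengths):
    """Contig N50: walk contigs from longest to shortest and return the
    length at which the running total first reaches half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Hypothetical assembly of 100 units: 40 + 30 = 70 >= 50, so N50 = 30
print(n50([40, 30, 15, 10, 5]))  # 30
```

N50 rewards long contigs rather than counting them, which is why it is the usual headline metric for comparing assembly contiguity across technologies.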
The long reads produced by SMRT Sequencing have also been particularly useful for detecting various forms of structural genomic variation, thereby resolving complex or even previously “unsequenceable” genomic regions, and elucidating the complex landscape of alternative splice isoforms in transcriptomes, Jonas noted. Other applications include genome editing, demonstrating SMRT Sequencing as a powerful tool to quantify outcomes of genome-editing experiments, as well as epigenomics since PacBio technology can directly detect base modifications.
Finally, Jonas spoke about upcoming improvements in sample preparation, read length, and data analysis. To see full details, you can view the recorded webinar. http://blog.pacificbiosciences.com/2014/05/webinar-recap-new-insights-in-genome.html?utm_content=buffer63c59&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Thursday, April 24, 2014--SMRT Sequencing of Chicken Heart Transcripts Yields New Genes and Isoforms
The Gallus gallus (common chicken) genome was initially published in 2004, but the latest RefSeq and Ensembl annotations remain incomplete. The chicken is an important model organism, especially for research on embryogenesis and heart development. In a new paper published in PLOS One, researchers representing the Cardiovascular Development Consortium of the Bench to Bassinet Program and Pacific Biosciences describe work to improve the chicken genome annotation using SMRT® DNA Sequencing.
In “Long-Read Sequencing of Chicken Transcripts and Identification of New Transcript Isoforms,” the consortium describes how they used SMRT sequencing to generate full-length cDNA reads from embryonic chicken hearts, combined these with short-read sequences from five different adult chicken tissues, and identified more than 9,000 novel transcript isoforms, as well as more than 500 genes not currently included in the Ensembl annotations.
“All of the important work being done now to uncover the regulatory mechanisms that control when and how genes are active rely heavily on a foundation built with solid genome assemblies and annotations,” the authors note.
Database searches yielded homologs for three of the new gene regions, FOXE3, RASA1, and FAM179B, but the remaining genes remain uncharacterized, including 121 gene regions that exhibited tissue-specific expression and might play key roles in chicken biology.
The authors suggest that beyond the results benefiting the community of researchers interested in the chicken genome, the methodology employed in this study will help other researchers improve efforts to annotate the genomes of other model organisms.
http://blog.pacificbiosciences.com/2014/04/smrt-sequencing-of-chicken-heart.html
(A good read)Published: April 15, 2014--Long-Read Sequencing of Chicken Transcripts and Identification of New Transcript Isoforms
http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0094650?utm_content=buffer351c4&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
Applications of PacBio Single Molecule, Real-Time (SMRT) DNA Sequencing
Apr 25th 12:00 pm - 1:00 pm
Room 3320, Faculty of Pharmaceutical Sciences - 2405 Wesbrook Mall, Vancouver, BC
Details:
Genomic data have become commonplace in most branches of the biological sciences and have fundamentally altered the way research is conducted.
However, the predominance of high-throughput, short-read sequence data from second-generation technologies has commonly resulted in fragmented and partial genomic and transcriptomic data, limiting the type of genetic variation able to be characterized. Long, unbiased reads from SMRT sequencing now allow for a return to more contiguous and comprehensive views of genomes and transcriptomes.
I will present several examples highlighting this transition, including improved de novo genome assemblies, characterizing structural genetic diversity, and resolving previously unsequenceable genomic regions. In the area of RNA-seq, I will highlight numerous examples of full-length transcript recall and splice isoform characterization in human and other complex transcriptomes that demonstrate that the identification of transcript isoforms and even gene content is far from complete.
Speaker Bio:
Luke Hickey is Director of Marketing and Business Development at Pacific Biosciences. He has 15 years of experience in the genome technology industry, holding various R&D, Marketing and Commercial roles at Pacific Biosciences, Affymetrix, Ingenuity Systems, and Incyte Genomics.
He currently leads development of applied markets for PacBio’s single molecule long read sequencing technology. In this role, he has contributed to co-developing novel methods and applications of the technology resulting in numerous high impact publications, including; Sequencing of the Fragile X gene, full length transcript isoform sequencing, compound somatic mutation phasing, gene editing outcome profiling, long read de novo assembly methods (HGAP/Quiver), human structural variation sequencing, centromere sequencing, haplotype phasing, and methylation profiling for epigenetic analysis.
He is on the steering committee for the 100k Pathogen project, an active participant in the NIST Genome in a Bottle Initiative, the 1,000 Genome Project, and a member of Toastmasters International. He is based in San Francisco, California with his wife and three children.
Date and Location:
Date: Friday, April 25th, 2014.
Time: 12pm - 1pm
Location: Room 3320, Faculty of Pharmaceutical Sciences. 2405 Wesbrook Mall, Vancouver, BC.
http://www.genomebc.ca/news-events/event/applications-pacbio-single-molecule-real-time-smrt-dna-sequencing/?eID=28
PacBio Blog -- Wednesday, April 16, 2014 -- Innovation Centre in Quebec Uses SMRT Sequencing for Cost-Effective, Complete Microbial Genomes
At the McGill University and Génome Québec Innovation Centre, many projects conducted in the sequencing core facility fall under the umbrella of life sciences rather than biomedical research. To the scientists responsible for making the core facility operate as smoothly as possible, that makes a world of difference.
“When you’re in the life sciences in addition to human biomedical [research], you’re out there in the world of things that haven’t been sequenced before, or haven’t been sequenced particularly well,” says Ken Dewar, a principal investigator at the Innovation Centre.
To navigate this type of uncharted territory, scientists at the center rely on long-read sequencing from their PacBio® RS II platform to cost-effectively close microbial genomes, traverse repeat-heavy genomic regions, and perform full-length transcript sequencing. By leveraging the dramatically increased read lengths PacBio sequencing provides, they have driven down costs and improved completeness of their assemblies.
At the core facility, Alexandre Montpetit is dedicated to running the next-generation sequencing platforms. His primary affiliation is with Génome Québec, and he has an adjunct appointment at McGill. He and his colleagues have been champions of long-read sequencing for years, so when PacBio unveiled its platform with industry-leading read length, it was an obvious choice for the center to adopt the technology.
“We’ve always had a focus on sequencing things for the first time or assembling genomes for the first time, not for the thousand-and-first time,” Dewar says. “PacBio was a natural fit.”
At the center, SMRT Sequencing has been used in diverse research areas. Some examples include generation of high-quality assemblies in microbial sequencing, analysis of long, repetitive genomic regions, and sequencing of full-length human gene isoforms. Microbial sequencing encompasses a number of applications, including biotech industry efforts to improve microbial biofermentation and microbiome studies, from environmental remediation projects on the Alberta tar sands to veterinary research on microbes present in cattle rumen.
In the two years they’ve been running the SMRT Sequencing platform, the Innovation Centre scientists have seen remarkable progress in what they have been able to achieve. Continued improvements in read lengths — partly due to new reagent kits from PacBio and partly due to more streamlined sample prep protocols developed at the center — have already made a major difference.
One major step was achieving complete bacterial sequencing and assembly in less than a day, a feat that may enable the core facility to serve as a rapid response center for organizations that study pathogen outbreaks and other urgent problems. In 2013, tests conducted with researchers at the Canadian Food Inspection Agency and other government agencies demonstrated that the Innovation Centre scientists could sequence a sample and fully assemble the genome and plasmid elements — all in 20 hours or less.
Indeed, the Innovation Centre team is routinely able to deliver affordable, high-quality, finished genomes. “A single bacterial genome, a library prep, and two SMRT Cells of sequencing — which is generally a little bit overkill — is less than $1,000,” Dewar says. “More and more often, we are getting a completely closed, finished-quality genome for that.”
High-quality assemblies aren’t just for bacteria. “We’ve shown recently that we can assemble a fungal genome of 20 or 30 megabases with four or eight SMRT Cells and get only 10 or 20 contigs — which often represents the number of chromosomes in the genome,” Montpetit says.
Read the full case study to learn more about how the Innovation Centre has deployed SMRT Sequencing, their shift from hybrid to PacBio-only assemblies, and how they differentiate their bioinformatics analysis service. http://blog.pacificbiosciences.com/2014/04/innovation-centre-in-quebec-uses-smrt.html
PacBio Blog
Thursday, April 10, 2014
Automating PacBio 10 kb Template Preparation
During our recent sample prep webinar, PacBio® scientists described new automated options for the 10 kb SMRTbell™ template preparation workflow. In case you didn’t have the chance to attend, here’s a quick recap.
The webinar was hosted by Marty Badgett, our consumables product manager, with field application scientist Mike Weiand and technical support scientist Kristi Spittle Kim. The goal of the session was to introduce two new automation solutions that were developed based on interest from customers for more streamlined sample preparation.
Both of the systems, available from PerkinElmer and Agilent Technologies, are designed to run up to 96 samples with the SMRTbell HT template prep kit, our new high-throughput kit. As Marty told webinar attendees, these solutions boost flexibility and scalability while minimizing hands-on time, sample tracking errors, and project cost. They also improve consistency by reducing sample-to-sample and run-to-run variability that can be introduced by manual preparation. We recommend automated systems like these for production-mode facilities or for large projects where run-to-run consistency is essential. Both systems can now process anywhere from eight to 96 samples in eight hours or less.
We partnered with PerkinElmer on its Sciclone NGSx and Agilent on its Bravo NGS workstation to offer PacBio customers more automation with sample prep platforms that are often already in their labs. We anticipate additional solutions from other vendors later this year as well. Also, we are currently working on 2 kb and 20 kb protocols with each vendor, which may be released later this year.
The Sciclone and Bravo automated options for the 10 kb workflow were validated at customer sites. The validation testing showed good consistency across a range of organisms and various input sample amounts for both devices. Yield was comparable to what customers expect from manual prep.
The SMRTbell HT kit is delivered in a 4x24 format. To get more information about the kit or the automated solutions from PerkinElmer or Agilent, contact your local sales rep. And for details on how these extremely long reads can make a difference in your genomic and transcriptomic studies, don’t miss the next webinar with our CSO Jonas Korlach. “Gain New Insights in Genome and Transcriptome Research with >10 kb Reads” will be hosted on April 29 and April 30; just click your preferred date to register. http://blog.pacificbiosciences.com/2014/04/webinar-recap-automating-pacbio-10-kb.html
April 7th, 2014
Homolog.us – Frontier in Bioinformatics
BWA-MEM vs BLASR for PacBio – Need Help with Benchmarking
New readers can follow us on twitter – @homolog_us - where we report on every new blog post.
A few months back, we posted -
Is BWA-MEM as Good as BLASR for Aligning PacBio Reads? – Part 1
Is BWA-MEM as Good as BLASR for Aligning PacBio Reads? – Part 2
Those commentaries were about using BWA-MEM, Heng Li’s new long-read aligner, to align PacBio reads. Although Heng did not have PacBio in mind when writing the code, we noticed many similarities between the algorithms of BWA-MEM and BLASR. After comparing the two methods, we found that BWA-MEM indeed aligned PacBio reads much faster than BLASR but, on the negative side, produced somewhat fragmented alignments.
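A split read appears in SAM output as one primary record plus one or more supplementary records (FLAG bit 2048, i.e. 0x800), so fragmentation can be quantified by counting those records for each aligner. A minimal, self-contained sketch on a made-up three-record SAM fragment (read names and coordinates are invented; fields are whitespace-aligned here for readability):

```shell
# Toy stand-in for bwa-mem/blasr SAM output: read1 is split into a
# primary record plus one supplementary record (FLAG 2048).
cat > toy.sam <<'EOF'
read1 0    chr1 100  60 500M * 0 0 * *
read1 2048 chr1 5000 60 300M * 0 0 * *
read2 0    chr1 9000 60 800M * 0 0 * *
EOF

# int($2/2048) % 2 tests the 0x800 bit without gawk's non-portable and().
awk '!/^@/ { total++; if (int($2/2048) % 2 == 1) supp++ }
     END   { printf "records=%d supplementary=%d\n", total, supp }' toy.sam
# prints: records=3 supplementary=1
```

The ratio of supplementary to primary records is a crude but serviceable fragmentation score when comparing the two aligners on the same read set.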
———————————————————————-
Today, we have two pieces of good news for our readers.
i) Heng Li got interested in tuning BWA-MEM to align PacBio reads. He found a set of BWA-MEM parameters that minimize the fragmentation of alignments. The updates are not yet integrated into the main BWA release, and he needs help with benchmarking.
Here is how to run the PacBiofied BWA-MEM.
Check out the github HEAD and run BWA-MEM on PacBio reads with:
bwa mem -x pacbio ref.fa reads.fa
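For readers starting from scratch, the full workflow might look like the following sketch (the repository is BWA's standard github location; ref.fa and reads.fa are placeholder file names):

```shell
# Build BWA from the github HEAD -- the pacbio preset is not yet
# in a tagged release, so a source build is needed.
git clone https://github.com/lh3/bwa.git
cd bwa && make && cd ..

# Index the reference once, then align with the PacBio-tuned preset.
./bwa/bwa index ref.fa
./bwa/bwa mem -x pacbio ref.fa reads.fa > aln.sam
```

The -x pacbio preset bundles seeding and scoring parameters tuned for the indel-dominated PacBio error profile; individual options placed after it on the command line should override the preset values.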
ii) Mark Chaisson, the author of BLASR, has also made a number of updates to the BLASR code to make it run much faster than before. The changes are not yet integrated into the official version distributed by PacBio.
If you are interested in comparing the speed and accuracy of BWA-MEM and BLASR, please feel free to comment here on what you find; both Heng and Mark will be very happy to hear. You can use the official version of BLASR for the time being, until we hear back from Mark/PacBio about integrating Mark’s latest changes.
Here are the optimal settings for using BLASR (from Mark) -
For human, I use -bestn 2 -maxAnchorsPerPosition 100 -advanceExactMatches 10 -affineAlign -affineOpen 100 -affineExtend 0 -insertion 5 -deletion 5 -extend -maxExtendDropoff 20. The -bestn 2 setting is there so that downstream code can detect larger structural rearrangements.
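Put together as a command line, Mark's settings would look roughly like this (a sketch: reads.fa and human.fa are placeholder file names, and the single-dash option style matches BLASR builds of that era):

```shell
# BLASR with Mark's recommended human-genome settings; -sam requests
# SAM output and -out names the output file.
blasr reads.fa human.fa -sam -out aln.sam \
    -bestn 2 -maxAnchorsPerPosition 100 -advanceExactMatches 10 \
    -affineAlign -affineOpen 100 -affineExtend 0 \
    -insertion 5 -deletion 5 -extend -maxExtendDropoff 20
# -bestn 2 keeps the two best hits per read so downstream code can
# detect larger structural rearrangements.
```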
——————————————————————-
Needless to say, after this warm-up work, Heng will most likely update his recent arXiv paper to incorporate the library released by PacBio.
How Noisy Are Those Variant Calls from Short Reads Really?
As we were writing up this work, Pacific Biosciences released deep resequencing data for the CHM1 cell line. It could be used to isolate errors caused by the Illumina sample preparation and sequencing. However, mapping and variant calling from PacBio human data is still in the early phase. We decided to leave out the comparison to the PacBio data for now.
http://www.homolog.us/blogs/blog/2014/04/07/bwa-mem-vs-blasr-for-pacbio-need-help-with-benchmarking/
April 4th, 2014
PacBio P4-C2, P5-C3, etc. – What Do They Mean?
We had been pondering those cryptic terms and found, by asking some people around, that the P stands for polymerase and the C stands for chemistry. Therefore, P4-C2 means a fourth-generation polymerase paired with second-generation chemistry.
Polymerase
That got us curious about what the actual DNA polymerase sequences are for the 2nd, 3rd, and 4th generations. Everyone told us that those would be top-secret information at PacBio, as unavailable as the Coca-Cola formula. We do not like to take no for an answer, so we contacted our friendly NSA collaborator to get to the company's deepest secrets. Here is what we found. This document explains the various polymerase sequences -
Provided are compositions comprising recombinant DNA polymerases that include amino acid substitutions, insertions, deletions and/or heterologous or exogenous features that confer modified properties upon the polymerase for enhanced single molecule sequencing.
Compositions that include modified recombinant DNA polymerases that include amino acid substitutions, insertions, deletions and/or heterologous or exogenous features that confer modified properties upon the polymerase for enhanced single molecule sequencing are a feature of the invention. Relative to a wild-type Φ29 DNA polymerase, these modifications can include any one of, or any combination of: an L253 mutation and a mutation at one or more of T368, E375, A484, or K512; an E375 and K512 mutation, and a mutation at one or more of L253, T368 or A484; an I93 mutation; an S215 mutation; an E420 mutation; a P477 mutation; a D66R mutation; a K135R mutation; a K138R mutation; an L253T mutation; a Y369G mutation; a Y369L mutation; an L384M mutation; a K422A mutation; an I504R mutation; an E508K mutation; an E508R mutation; a D510K mutation; or at least one mutation or combination of mutations selected from those listed in Tables 6, 9 and 10. The modified polymerases can exhibit desirable features described in detail hereinbelow, e.g., reduced reaction rates at one or more steps of the polymerase kinetic cycle, decreased branching fractions, increased closed complex stability, enhanced metal ion coordination and/or reduced exonuclease activity, etc.
There is a lot more in the document to keep a biochemist busy for an entire lifetime.
Chemistry
How about chemistry? Our friendly NSA collaborator took us to the following top-secret documents -
Phospholink nucleotides for sequencing applications
The present invention provides labeled phospholink nucleotides that can be used in place of naturally occurring nucleotide triphosphates or other analogs in template directed nucleic acid synthesis reactions and other nucleic acid reactions and various analyses based thereon, including DNA sequencing, single base identification, hybridization assays, and others.
Recombinant Polymerases With Increased Phototolerance
Provided are compositions comprising recombinant DNA polymerases that include amino acid substitutions, insertions, deletions, and/or exogenous features that confer modified properties upon the polymerase for enhanced single molecule sequencing. Such properties include increased resistance to photodamage, and can also include enhanced metal ion coordination, reduced exonuclease activity, reduced reaction rates at one or more steps of the polymerase kinetic cycle, decreased branching fraction, altered cofactor selectivity, increased yield, increased thermostability, increased accuracy, increased speed, increased readlength, and the like. Also provided are nucleic acids which encode the polymerases with the aforementioned phenotypes, as well as methods of using such polymerases to make a DNA or to sequence a DNA template.
Method for sequencing using branching fraction of incorporatable nucleotides
Provided are methods for enhanced sequencing of nucleic acid templates. Also provided are reaction conditions that increase branching fractions during polymerization reactions. Also provided are compositions comprising modified recombinant polymerases that exhibit branching fractions that are higher than the branching fractions of the polymerases from which they were derived. Provided are compositions comprising modified recombinant polymerases that exhibit delayed translocation relative to the polymerases from which they were derived. Also provided are compositions comprising modified recombinant polymerases that exhibit increased nucleotide or nucleotide analog residence time at an active site of the polymerase. Provided are methods for generating polymerases with the aforementioned phenotypes and methods of using such polymerases to sequence a DNA template or make a DNA. Also provided are methods and nucleic acid sequencing systems for determining which labeled nucleotide is incorporated at a site during a template-dependent polymerization reaction.
Isolation of polymerase-nucleic acid complexes
Compositions, methods and systems are provided for the isolation of polymerase-nucleic acid complexes. Complexes can be separated from free enzyme by using hook molecules to target single stranded regions on the nucleic acid. Active complexes can be isolated from mixtures having both active and inactive complexes by initiating nucleic acid synthesis so as to open up a portion of a double stranded region rendering that region single stranded. Hook molecules are targeted to bind the sequences that are thus exposed. The hook molecules bound to active polymerase-nucleic acid complex are isolated, and the active polymerase-nucleic acid complexes released.
Generation of modified polymerases for improved accuracy in single molecule sequencing
Provided are compositions comprising recombinant DNA polymerases that include amino acid substitutions, insertions, deletions and/or heterologous or exogenous features that confer modified properties upon the polymerase for enhanced single molecule sequencing. Such properties can include reduced reaction rates at one or more steps of the polymerase kinetic cycle, increased closed polymerase/DNA complex stability, enhanced metal ion coordination, reduced exonuclease activity, decreased branching fractions, and the like. Polymerases that exhibit branching fractions that are less than the branching fractions of the polymerases from which they were derived, or branching fractions that are less than about 25% for a phosphate-labeled nucleotide analog, are also provided. Also provided are nucleic acids which encode the polymerases with the aforementioned phenotypes, as well as methods of using such polymerases to make a DNA or to sequence a DNA template.
More Details
The following papers and documents may be helpful for understanding the technology better.
Real-time DNA sequencing from single polymerase molecules
http://nar.oxfordjournals.org/content/38/15/e159.long
http://nar.oxfordjournals.org/content/38/15/e159/F2.expansion.html
http://www.pacificbiosciences.com/img/press_release_assets/PacBio_Response_to_Helicos_Complaint_090110_final.pdf
http://www.homolog.us/blogs/chem/2014/04/04/pacbio/