LT.Swing trade!
May 29, 2015 (My favorite part!-“We are the industry leading technology in terms of de novo assemblies, and our customers are very happy with those capabilities,” PacBio Chief Scientific Officer Jonas Korlach tells Bio-IT World. “But that doesn’t mean we’re going to rest on our laurels. We always want to push the limits.”)
PacBio Aims for Haplotyped Whole Genome Assemblies in Partnership with RainDance
By Aaron Krol
Pacific Biosciences has worked hard over the past year and a half to demonstrate that its DNA sequencer, the RS II, can do something no other instrument can: assemble a whole human genome from scratch, without using an existing reference genome as a template. The process, known as de novo assembly, is harder and more costly than mapping to a reference genome, but it also reveals important information about the most complex types of variation in our genetic makeup.
PacBio can pull off this feat because its sequencer produces “long reads,” fragments of the genome spanning thousands of DNA base pairs. These reads also make PacBio’s technology ideal for haplotyping: figuring out which genetic variants in an individual’s DNA come from chromosomes inherited from the father, and which from the mother. Recently, however, some enterprising companies have taken up the challenge of turning the “short reads” produced by PacBio’s competitors, just a few hundred base pairs long, into the kind of information needed to power haplotyping and whole genome assembly. Most prominently, this February 10X Genomics introduced GemCode, a platform meant to be combined with market leader Illumina’s short-read sequencers. (See, “10X Genomics at AGBT.”)
“We are the industry leading technology in terms of de novo assemblies, and our customers are very happy with those capabilities,” PacBio Chief Scientific Officer Jonas Korlach tells Bio-IT World. “But that doesn’t mean we’re going to rest on our laurels. We always want to push the limits.”
PacBio has now announced a collaboration with RainDance Technologies to develop a solution very similar to GemCode, which will allow the already sizeable RS II reads to be extended to 100 kilobases or more. Like the GemCode platform, this solution will involve isolating extremely long DNA fragments (using RainDance’s “digital droplets” system for isolating single molecules) and attaching short DNA barcodes to those fragments. After the fragments have been chopped up and sequenced on the RS II, a software program will recognize the barcodes and use them to reassemble the longer DNA elements captured in the RainDance instrument.
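The barcode step described above can be illustrated with a toy sketch. Neither company has published the actual read layout, so the fixed-length barcode at the start of each read, the `BARCODE_LEN` value and the function name are all assumptions made for illustration only.

```python
from collections import defaultdict

BARCODE_LEN = 6  # hypothetical barcode length, not taken from the article


def group_by_barcode(reads, barcode_len=BARCODE_LEN):
    """Bin reads by their leading barcode so that pieces cut from the same
    long source molecule end up together, ready for reassembly."""
    groups = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        groups[barcode].append(insert)
    return dict(groups)


if __name__ == "__main__":
    reads = ["AAAAAAGATTACA", "CCCCCCTTGACA", "AAAAAACATTAG"]
    for barcode, inserts in group_by_barcode(reads).items():
        print(barcode, inserts)
```

Each bin would then be assembled on its own, which is what lets reads of 10 to 30 kb be stitched into much longer elements.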
Because this procedure is expected to work on existing devices, the development cycle may be fairly fast. Both partners say that, while the collaboration is in its early stages, the key elements are past proof of principle. They also expect their solution to be better suited to de novo assembly than the 10X Genomics pipeline, which Korlach points out is “still based on short-read Illumina sequencing, with all the limitations in terms of read length… We absolutely see that there’s a big advantage to sequencing 10 to 30kb pieces and stitching those together into longer contiguous elements, rather than relying on something that’s 250 or 300 bases.”
One difference between the two technologies may be their ability to handle regions of the genome made up of short tandem repeats, small DNA sequences that are repeated many times in a row, which among other things are relevant to genetic diseases like Huntington’s and fragile X syndrome. While GemCode has not been on the market long enough for much information on its performance in these areas to be available, studies have shown that another barcoding system based on short reads, Moleculo, has been unable to perfectly resolve short tandem repeats.
Still, the GemCode platform should be well-suited to resolving many other types of long structural variants, as well as for haplotyping, applications that have been among the biggest selling points for PacBio in a challenging market. That makes it critical for PacBio to stay a step ahead on delivering the most complete long-range genomic information.
Aiming High
The resemblance of the PacBio-RainDance partnership to GemCode is not a coincidence. The microfluidics system used by 10X in its GemCode instrument works by capturing single DNA molecules in beads of oil and separating them into microwells, a process that RainDance has used for other applications since the launch of its first instrument in 2008. In fact, RainDance believes 10X is infringing on its patents and has filed a lawsuit against the company.
“We have over 175 patents in our portfolio that cover all aspects from droplet formation and manipulation to barcoding [and] analysis,” says RainDance President and CEO Roopom Banerjee. “Candidly, we’ve published on certain applications of incarnations of what 10X has done years before 10X was founded.”
Banerjee is also extremely bullish on how the final PacBio-RainDance solution will stack up against GemCode. Among other advantages, he predicts that the RainDance instrument will allow for higher throughput, longer reassembled DNA fragments, and more diverse barcodes, which will allow a greater number of genomic targets to be parallelized in each sequencing run.
Most importantly, Banerjee expects his company to undercut 10X on cost. “I’ve already heard from many customers who are evaluating 10X’s technology that $500 a sample is just not economically feasible for them to be able to deploy commercially,” he says, citing 10X’s quoted price. “You can do a whole exome today for roughly $300 to $400 a sample, with the exome capture being about $50. So if it now costs $500 to haplotype and phase that exome, who’s going to do that?”
The economic argument only goes so far: low-cost whole exomes and genomes are primarily a hallmark of Illumina, which is a major reason short-read sequencing is dominant. Even with an efficient barcoding system, de novo assembly will remain something of a luxury in genomics for some time to come. The partnership between RainDance and PacBio is non-exclusive, however, and Banerjee anticipates that more collaborations are in the cards. (It’s worth noting that there is no formal relationship between 10X Genomics and Illumina; 10X simply designed GemCode as an add-on service for the huge market of Illumina users.)
In the meantime, there are a number of genomics labs that might be willing to pay a premium for a higher quality of assembly with barcoded long reads. Production-scale centers like Human Longevity, Inc., and the Broad Institute of MIT and Harvard use RS II sequencers to support their higher-throughput batteries of Illumina instruments, and are increasingly interested in the large structural variants that short reads can’t resolve. The RainDance instruments also promise to capture useful amounts of DNA from very small samples, which could be useful in applications like sequencing tumor or pathogen DNA from whole blood.
“We’re really excited about translational applications,” says Banerjee, but notes that both PacBio and RainDance instruments are currently sold for research use only. In the short term, the partners expect to see most of their uptake in basic research, from health-related fields like oncology to far-flung areas like agricultural genomics. Banerjee suggests one exception might be HLA typing for organ and tissue transplants, which has lower regulatory barriers than diagnostic testing and badly needs long-range information to resolve the complex structure of the human genome’s immune-related complexes.
Korlach agrees that this technology will be primarily used for research, at least at the outset, although he also points out that PacBio recently completed its second development milestone in a partnership with Roche aimed at developing mass market diagnostics.
Like Banerjee, Korlach is confident that the partnership will deliver the best long-range sequencing information available, maintaining PacBio’s position as the go-to technology for assembling whole genomes. “To us, it’s quite clear, and the data is also in the scientific literature that shows there are quite a lot of limitations to trying to do this with short reads.”
http://www.bio-itworld.com/2015/5/29/pacbio-aims-haplotyped-whole-genome-assemblies-partnership-raindance.html?utm_source=dlvr.it&utm_medium=twitter
GRC a PacBio Certified Service Provider; Co-sponsoring SMRTest Microbe Grant Program at ASM 2015
Posted on May 28, 2015 by nsengamalay
We are pleased to announce that the GRC is the first PacBio certified service provider on the East Coast. This recently announced program is a partnership between PacBio and select sequence providers who have completed the certification process and offer the highest quality sequencing and analysis services using the PacBio technology. We offer a full range of PacBio services, including whole genome sequencing, transcriptome sequencing via Iso-Seq, targeted amplicon sequencing, and other customized applications. Our analysis team has expertise in genome assembly and annotation, variant analysis, transcriptome analysis, and base modification detection. We look forward to continuing our strong partnership with PacBio and offering the highest quality sequencing and analysis to our customers and collaborators.
As part of this new partnership, the GRC is proud to co-sponsor the SMRTest Microbe Grant Program. One lucky winner will receive sequencing and analysis services from the GRC. To enter, submit a short grant application detailing your project and how it would benefit from the long reads and high consensus accuracy of SMRT Sequencing. The deadline for submissions is June 27, 2015.
For more information on our full range of sequencing and analysis services, visit our Laboratory Services and Analysis Services pages. Please contact us if you have any questions or would like a quote.
If you are attending ASM 2015 next week, please stop by the IGS booth (#776) to learn more about the grant program and all of our sequencing and analysis services. See you in New Orleans!
http://www.igs.umaryland.edu/labs/grc/2015/05/28/grc-now-a-pacbio-certified-service-provider-co-sponsoring-smrtest-microbe-grant-program-at-asm-2015/
Tuesday, May 26, 2015---New MHAP Algorithm Delivers Fast, High-Quality Genome Assemblies
A new publication in Nature Biotechnology reports the development of a lightning-fast genome assembly pipeline optimized for long reads. Scientists from the University of Maryland and the National Biodefense Analysis and Countermeasures Center created the MinHash Alignment Process, known as MHAP, to dramatically reduce assembly time and improve assembly quality. Their results are worth celebrating: assembly times were 600-fold faster compared to existing methods. “Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes,” the authors write. In the best cases, entire chromosome arms assembled into single pieces from telomere to centromere!
MHAP takes a probabilistic approach to overlap-based assembly of long reads. MinHash represents longer text or a string of information as a set of fingerprints, allowing the assembly process to occur with more compact data that’s less computationally intensive. The authors’ MHAP overlapping method has been integrated into Celera Assembler for the assembly of gigabase-sized genomes, and is reported in their new paper “Assembling Large Genomes with Single-Molecule Sequencing and Locality-Sensitive Hashing.”
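For readers curious about the fingerprint idea, here is a minimal MinHash sketch. It is not the MHAP code: the k-mer size, the number of hash functions and the trick of salting one hash with a seed are illustrative assumptions. It shows how two reads can be compared through short lists of minimum hash values instead of their full k-mer sets.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # Salting with the seed stands in for a family of independent hash
    # functions (an illustrative choice, not MHAP's scheme).
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Fingerprint: the smallest hash value over all k-mers, once per seed.
    Requires len(seq) >= k."""
    ks = kmers(seq, k)
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """The fraction of matching positions estimates k-mer set similarity."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact k-mer Jaccard similarity, for comparison with the estimate."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    read1 = "ACGTACGTTGCAAGCTTAGGCTAACGTTAGC"
    read2 = "TTGCAAGCTTAGGCTAACGTTAGCCATGCAA"
    print(estimate_jaccard(minhash_sketch(read1), minhash_sketch(read2)))
    print(exact_jaccard(read1, read2))
```

Pairs of reads whose sketches agree at many positions are candidates for overlap, so the expensive alignment only has to run on those pairs.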
While the technical approach to MHAP is very clever, what impressed us most from this publication were the results. Lead authors Konstantin Berlin and Sergey Koren, along with their collaborators, tested the algorithm on five different genomes to gauge its performance. The organisms included Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and a human cell line (CHM1). For E. coli, the assembly of 85x SMRT® Sequencing reads took 20 minutes on a 16-core desktop computer; for S. cerevisiae, assembly time was less than two hours and resulted in a four-fold improvement in N50 size compared to previous assemblies. “For E. coli, the total cost of PBcR-MHAP assembly and Quiver polishing is currently less than $2 [using cloud computing],” Berlin et al. write.
As genomes got larger, the speedup became more pronounced: the D. melanogaster assembly was 600 times faster than previous methods — from 629,000 CPU hours to just 1,086 CPU hours — and produced a contig N50 longer than the scaffold N50 of the Sanger reference assembly, and “hundreds of fold more contiguous” than a synthetic long-read assembly. For yeast, the results were even more impressive. The authors report that “a majority of the 16 chromosomes were completely resolved from telomere to telomere” in their MHAP long-read assembly of S. cerevisiae W303.
The team also assembled a human genome, using the publicly available PacBio® long-read CHM1 haploid human genome dataset. The assembly’s contig N50 is an order of magnitude larger than the original Sanger-based human assembly, the authors report, and may resolve more than 50 of the 800-plus annotated gaps in the latest human reference genome. Zooming in on the highly complex MHC region of the genome, the scientists found that 97 percent of the locus is represented in just two contigs, compared to more than 60 contigs covering the same region in a recently published short-read assembly of the same cell line.
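Since N50 comes up repeatedly in these results, a short sketch of the usual definition may help: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. The contig lengths in the example are made up.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L together
    contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A larger N50 means more of the genome sits in long, unbroken pieces, which is why the comparisons above are framed in those terms.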
With MHAP, the authors anticipate that pent-up demand for rapid, affordable, and high-quality assembly can now be met. “These results demonstrate that [PacBio] single-molecule sequencing alone can produce near-complete eukaryotic genomes,” they write. We’re certainly excited to see more reference-grade eukaryotic genome assemblies generated using the new MHAP method.
http://blog.pacificbiosciences.com/2015/05/new-mhap-algorithm-delivers-fast-high.html
Jason Chin "Just shameless plugin what I am going to talk about tomorrow for diploid genome assembly with @PacBio data #SFAF2015"
https://twitter.com/infoecho/status/603614689344540674/photo/1
Published: May 27, 2015-- HLA Typing for the Next Generation
Abstract
Allele-level resolution data at primary HLA typing is the ideal for most histocompatibility testing laboratories. Many high-throughput molecular HLA typing approaches are unable to determine the phase of observed DNA sequence polymorphisms, leading to ambiguous results. The use of higher resolution methods is often restricted due to cost and time limitations. Here we report on the feasibility of using Pacific Biosciences’ Single Molecule Real-Time (SMRT) DNA sequencing technology for high-resolution and high-throughput HLA typing. Seven DNA samples were typed for HLA-A, -B and -C. The results showed that SMRT DNA sequencing technology was able to generate sequences that spanned entire HLA Class I genes that allowed for accurate allele calling. Eight novel genomic HLA class I sequences were identified: four were novel alleles, three were confirmed as genomic sequence extensions and one corrected an existing genomic reference sequence. This method has the potential to revolutionize the field of HLA typing. The clinical impact of achieving this level of resolution HLA typing data is likely to be considerable, particularly in applications such as organ and blood stem cell transplantation where matching donors and recipients for their HLA is of utmost importance.
Citation: Mayor NP, Robinson J, McWhinnie AJM, Ranade S, Eng K, Midwinter W, et al. (2015) HLA Typing for the Next Generation. PLoS ONE 10(5): e0127153. doi:10.1371/journal.pone.0127153
Received: January 7, 2015; Accepted: April 12, 2015; Published: May 27, 2015
Copyright: © 2015 Mayor et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Data Availability: All relevant data are within the paper and supporting information files. In addition, the genomic sequences described have been submitted to both the EMBL and IMGT/HLA Databases. Accession numbers for all sequences are provided in the manuscript.
Funding: Pacific Biosciences provided support in the form of salaries for authors SR, KE, C-SC, BB and PM. The authors affiliated with Pacific Biosciences were involved in the study design, data collection and analysis, and have reviewed, commented on and approved the content of the manuscript. The authors have reviewed the authors' roles in the online form and confirm they are correct.
Competing interests: Anthony Nolan Research Institute does not have any conflicts of interest to declare. SR, KE, C-SC, and BB are or were employees of Pacific Biosciences of California, Inc., a company commercializing DNA sequencing technologies at the time that this work was completed. This does not alter the authors' adherence to PLOS ONE policies on sharing data and materials.
Introduction
The HLA genes are located within one of the most gene rich regions of the human genome, the Major Histocompatibility Complex (MHC), on the short arm of chromosome 6 (6p21.3). Many of these genes, including HLA, encode proteins that have a critical role in immune responses[1, 2]. The MHC is divided into three distinct regions referred to as class I, II and III, with the HLA genes being located within the class I and class II regions. The HLA genes are known to be the most polymorphic genes of the human genome[1, 3]. This polymorphism is predominantly found within the six classical HLA genes: the class I genes HLA-A,-B and-C and the class II genes HLA-DRB1,-DQB1 and -DPB1. Over 12,200 HLA alleles have been identified to date (December 2014), with in excess of 9,200 being variants of the HLA class I genes alone [www.ebi.ac.uk/imgt/hla] [4, 5].
HLA proteins function as antigen presentation molecules presenting self and non-self peptides to T-cells, a fundamental step in the initiation of certain adaptive immune responses. Much of the described polymorphism within the HLA class I genes is located within the exons that encode the peptide binding groove and the points at which T-cells interact with the molecule itself. This diversity has evolved as a mechanism to ensure on-going pathogen recognition and eradication by increasing the repertoire of peptide motifs that can be bound and presented to T cells [6, 7]. Over-dominant selection is also thought to have driven the extent of polymorphism. HLA heterozygosity is favoured in a population because it increases the number of peptide motifs that can be presented by the co-dominantly expressed HLA molecules[7]. This strong heterozygote advantage is of particular importance in the event of infection by a pathogen that is specifically able to evade presentation by a particular HLA allele by ensuring that an individual is capable of initiating immune responses against the pathogen by presentation of the peptide by the second allele.
As the majority of described polymorphisms are located within the peptide binding groove that is encoded by exons 2 and 3 of the HLA class I genes, and that these differences have such an important functional relevance, many of the routinely used high-throughput HLA typing methods are focused on identifying variation within this limited region. A common problem encountered is the inability to determine the phase of polymorphisms identified in a single individual, a problem that is exacerbated by the extensive genetic diversity seen in HLA genes [8, 9]. The result of this is ambiguous HLA types and the reporting of HLA typing strings. The high workload, cost and time required to generate true allele-level HLA typing using current methods makes it preclusive for most histocompatibility laboratories.
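The phasing problem can be shown with a toy example. The sketch below, with made-up genotypes and a hypothetical function name, lists every pair of haplotypes that fits a set of unphased genotype calls. With two heterozygous sites there are already two equally valid answers, and the count doubles with each additional heterozygous site.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """genotypes: one (allele, allele) tuple per site, with phase unknown.
    Returns the set of unordered haplotype pairs that fit those calls."""
    pairs = set()
    for choice in product([0, 1], repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(frozenset((hap1, hap2)))
    return pairs


if __name__ == "__main__":
    # Two heterozygous sites: A/G at the first, C/T at the second
    for pair in compatible_haplotype_pairs([("A", "G"), ("C", "T")]):
        print(sorted(pair))
```

A read that spans both sites on a single molecule settles which pair is real, which is the appeal of sequencing a whole gene in one piece.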
The recent development of second-generation sequencing methods has been of great interest to the HLA typing community due to the possibility of sequencing a single DNA strand in isolation. These techniques provide an opportunity for single allele definition at primary HLA typing as opposed to cross-referencing results from different molecular techniques and serological testing. Previously, sequencing an entire HLA gene in isolation was achieved through the use of PCR-cloning processes, which are lengthy and often problematic. Second-generation sequencing methods have the potential to negate the use of such challenging laboratory practices. These technologies offered the first realistic solution to the problem of phasing polymorphisms throughout the HLA gene, enabling definitive allele typing. Consequently, many second-generation technologies have now been optimised for use by the HLA typing market and the use of Sequence-Based Typing (SBT) protocols is common [10–14]. A current limitation of these methods is the read lengths that can be generated, resulting in the need for multiple over-lapping sequences to achieve full gene and even partial gene sequencing. A common concern with these methodologies is that incorrectly aligned fragments could result in HLA typing errors. It is possible that in a system as polymorphic as the HLA genes, incorrect phasing of SNPs that are distant to each other across the gene but otherwise show complete homology could result in an incorrect allele being assigned. Additionally, rare or novel alleles formed by a recombination event may be missed if the consensus sequence analysis tools are biased towards the more common alleles.
The ideal solution to resolve both HLA ambiguity and the potential problems caused by phasing multiple fragments would be to produce multiple long sequence reads encompassing whole gene PCR amplicons, in isolation. The development of Pacific Biosciences’ Single Molecule Real Time (SMRT) DNA sequencing technology offers the first realistic option to achieve this goal [15]. The SMRT sequencing method is able to generate exceptionally long read lengths that would allow coverage of the 3 kb or more of a HLA class I gene sequence, and thus resolve the phase of the polymorphisms seen. In addition, the technology has the potential to generate read lengths in excess of 20 kb that could allow for sequencing of entire HLA class II genes, which, at over 10 kb for some genes, are substantially longer than the HLA class I genes.
SMRT DNA sequencing makes use of SMRTbell templates, single stranded hairpin adaptors that can be ligated on to the ends of PCR products. The function of these adaptors is to turn an essentially linear PCR amplicon into a circular molecule. The advantage of generating a circular molecule is that the enzyme added to facilitate the reaction is capable of processively generating sequence from both strands of the PCR amplicon until either the enzyme expires or the end of the run-time is achieved. Under optimal experimental conditions, the result of the continuous sequencing process is the generation of a Continuous Long Read (CLR); one exceptionally long read which contains multiple regions of sequence specific to the PCR amplicon (known as sub-reads) interspersed with the sequence of the SMRTbell adaptors (Fig 1). This novel method of generating DNA sequence means that it is possible to interrogate the same DNA strand multiple times within a single experiment, achieving exceptionally high depth of sequence coverage.
Fig 1. SMRTbell adaptors are ligated onto the ends of a blunt-ended PCR amplicon to facilitate continuous sequencing of both strands of the amplicon. The entire sequence generated may include multiple copies of the sense and anti-sense strands of the PCR amplicon in a single read known as the Continuous Long Read (CLR). Post-sequencing bioinformatic processing is able to break down the CLR into shorter sub-reads, each of which encompasses the sequence of one strand of the amplicon. These sub-reads can then be compared and used to create a consensus sequence.
doi:10.1371/journal.pone.0127153.g001
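The idea of breaking a CLR into sub-reads and building a consensus can be sketched in a few lines. This is a toy model, not the SMRT Analysis pipeline: the adaptor sequence is a placeholder, adaptors are assumed to be read without error, passes are assumed to alternate strictly between strands, and sub-reads are assumed to be equal in length with substitution errors only.

```python
from collections import Counter

ADAPTOR = "GGGGCCCC"  # placeholder, not the real SMRTbell adaptor sequence
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def split_subreads(clr, adaptor=ADAPTOR):
    """Cut the continuous long read at every exact adaptor match."""
    return [piece for piece in clr.split(adaptor) if piece]


def orient(subreads):
    """Every second pass reads the opposite strand, so flip it back."""
    return [s if i % 2 == 0 else reverse_complement(s)
            for i, s in enumerate(subreads)]


def consensus(subreads):
    """Majority vote per column; assumes equal-length sub-reads, no indels."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*subreads))


if __name__ == "__main__":
    # Three passes over the toy insert AACGT; the last pass has one error
    clr = "AACGT" + ADAPTOR + "ACGTT" + ADAPTOR + "AACGA"
    print(consensus(orient(split_subreads(clr))))
```

Real sub-reads contain insertions and deletions, so production tools align them before voting; the sketch only shows why repeated passes over one molecule can outvote a random error.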
Here we describe the results of a study to determine whether the SMRT DNA sequencing methodology could be adapted for use in the Anthony Nolan Histocompatibility Laboratory to facilitate stem cell donor registry typing. The aims of this study were fourfold; i) to determine whether the methods were suitable for adaptation with Anthony Nolan DNA samples and PCR amplicons; ii) to determine if basic levels of multiplexing were possible; iii) to see if genomic HLA class I sequences could be generated; and iv) to determine the accuracy and specificity of the sequences generated.
Material and Methods
Seven DNA samples were selected for HLA class I genotyping using Pacific Biosciences’ SMRT sequencing methodology. In recent years, Anthony Nolan has changed from blood to Oragene saliva (DNA Genotek, Ottawa, Canada) as their primary source of DNA when recruiting donors to our stem cell donor register and thus DNA from this source makes up a large part of our workload. In accordance with our in-house protocol, blood samples are still requested from donors who are short-listed as potential matches for confirmatory HLA typing, virology and other associated tests. To ensure that DNA from both starting materials were suitable for use with SMRT sequencing methods, two samples were selected from each category. Written consent for HLA typing was obtained from all blood and saliva sample donors at the point of collection for the purposes of transplantation. These consent forms are in accordance with the Human Tissue Authority (HTA) UK, European Federation for Immunogenetics (EFI), and Clinical Pathology Accreditation (CPA) regulatory body guidelines. Anthony Nolan’s Medical Advisory committee has also reviewed and approved the consent form. Specific approval from the local ethics committee was not sought, as the purpose of the study is to assess the method of HLA typing in comparison with existing HLA typing techniques. No new genetic information outside of the HLA genes that would affect the donors of the material used in this study has been gained.
Three B-Lymphoblastoid Cell Lines (B-LCLs) were also selected from a well-characterized panel that have been extensively analysed for their HLA genes. HLA class I genotyping was undertaken for all donor-derived DNA samples using Luminex LABType SSO typing kits (One Lambda, CA, USA). HLA genotype information for the B-LCLs was obtained from the IMGT/HLA Database website [www.ebi.ac.uk/ipd/imgt/hla/][8].
In addition to the aforementioned selection criteria, samples were chosen that included a) as many commonly seen HLA alleles as possible (the definition of ‘common’ in this case relates to those alleles seen frequently in our tested population, typically British and Irish north-west European caucasoids); b) alleles with genomic sequences available in the IMGT/HLA Database where possible; c) alleles that have known indels; and d) one DNA sample that is well characterized and known to be homozygous for HLA-A,-B and-C. The homozygous cell line chosen was the B-LCL COX (sample AN5) and was selected for HLA typing using the SMRT sequencing method as it is known to be consanguineous and has previously undergone in-depth sequence analysis of the entire Major Histocompatibility Complex (MHC) region on chromosome 6, which includes the HLA gene family [16, 17]. This previous in-depth analysis ensured full-length HLA class I gene sequences were available for the constituent alleles. In addition, the HLA haplotype observed in this cell line has remained evolutionarily well-conserved and is one of the most frequently observed haplotypes in our tested population [18].
HLA class I amplicons were generated using primers as described for the 13th International Histocompatibility workshop [19]. This protocol enables amplification of the entire HLA Class I gene from 5´ to 3´ UTR. Fragment sizes were estimated to be 3500 bp, 3400 bp and 3450 bp for HLA-A,-B and-C respectively. The amplification method used TaKaRa LA DNA polymerase (Takara Bio Europe SAS, Saint-Germain-en-Laye, France). Agarose gel electrophoresis was used to confirm amplification and correct fragment size, as well as to check for non-specific product contamination. A 10 KB sizing marker was included to confirm size specificity (HyperLadder I; Bioline Reagents LTD, London, UK).
The HLA class I amplicons were sequenced according to Pacific Biosciences’ standard protocol for PCR amplicons greater than 3 KB in length. Briefly, HLA class I amplicons underwent quality and quantity confirmation using a Bioanalyser instrument and the Agilent DNA 12000 kit (Agilent Technologies, Santa Clara, CA, USA). As we aimed to test basic multiplexing capabilities of the SMRT DNA sequencing system, the HLA class I amplicons for a single sample were pooled at equimolar concentrations. After performing DNA damage and end repair, the SMRTbell adaptors were blunt-end ligated onto the PCR amplicons in the pool. Following the ligation of the adaptors, an adaptor-specific sequencing primer and enzyme were bound to the templates. For this study, sequencing was enabled by the use of the P4 enzyme and C2 chemistry. Finally, the SMRTbell templates were loaded on to MagBeads, magnetic beads that facilitate even sample loading into the SMRT Cell. DNA samples were sequenced on the PacBio RS II SMRT DNA Sequencing System with a movie capture time of 120 minutes. All stages of the sequencing process, including library preparation, SMRT Cell loading and data collection were achieved within three working days.
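Equimolar pooling means each amplicon contributes the same number of molecules rather than the same mass. The sketch below is not from the paper: it uses the common approximation of 660 g/mol per base pair of double-stranded DNA, the concentrations in the example are invented, and only the fragment sizes (3500, 3400 and 3450 bp) come from the text above.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair, a common approximation for dsDNA


def ng_per_ul_to_nM(conc_ng_per_ul, length_bp):
    """Convert a mass concentration to a molar concentration (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def equimolar_volumes(amplicons, target_fmol):
    """amplicons maps a name to (ng/uL, length in bp). Returns the volume in
    uL of each amplicon that contains target_fmol femtomoles (1 nM = 1 fmol/uL)."""
    return {name: target_fmol / ng_per_ul_to_nM(conc, length)
            for name, (conc, length) in amplicons.items()}


if __name__ == "__main__":
    # Concentrations are hypothetical; lengths are the estimates in the text
    amplicons = {"HLA-A": (12.0, 3500), "HLA-B": (8.5, 3400), "HLA-C": (10.2, 3450)}
    for name, volume in equimolar_volumes(amplicons, target_fmol=20).items():
        print(f"{name}: {volume:.2f} uL")
```

Because the three amplicons are nearly the same length, the volumes here depend mostly on how well each PCR worked.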
The DNA sequences derived from Pacific Biosciences’ SMRT sequencing technologies underwent post-processing using the SMRT analysis tool v2.1, and were assigned HLA types using Anthony Nolan in-house Bioinformatics methods. The PacBio methodology provides a number of sequences for analysis for each sample. The optimal consensus sequences for each run were selected by Anthony Nolan and Pacific Biosciences’ researchers and analysis was performed. HLA types were assigned based on identity to known sequences within the IMGT/HLA Database. Where novel sequences were reported, assignment of a HLA type was based on aligning the novel consensus sequence at the cDNA, gDNA and protein levels, to identify the nearest known HLA allele.
Sanger sequencing was used to determine the accuracy of regions of DNA sequence obtained from SMRT sequencing that either differed to the existing genomic sequences for the expected allele, or if no genomic sequence were available, differed to that seen in the closest matching allele. SBT was enabled using BigDye Terminator sequencing kit V3.1 (Applied Biosystems, Foster City, California, USA) and utilised primers designed in-house. Fragments were sequenced on an ABI 3730XL Genetic Analyser (Applied Biosystems, Foster City, California, USA). As the majority of the tested samples were heterozygous for each of the HLA class I loci, generic PCR and SBT was not sufficient in some cases to enable confirmation of discrepancies. For these samples, cloning of full-length HLA gene PCR amplicons was used to allow separation of the two alleles. HLA class I PCR products were cloned using the Zero Blunt TOPO cloning kit (Life Technologies, Paisley, UK) before targeted sequencing as previously described.
Results
Seven DNA samples were selected for SMRT DNA sequencing based on a set of defined inclusion criteria, which included different starting material from which the DNA was extracted and the inclusion of as many commonly seen HLA class I alleles as was feasible. Each of the seven samples tested generated sequence data of sufficient quality to create a consensus sequence for all of the alleles expected. The number of sub-reads achieved for each allele varied because of allelic imbalance during PCR amplification; such imbalance is not routinely detected with HLA typing strategies that do not allow sequencing of single gene sequences in isolation. Despite these imbalances, the minimum depth of coverage was still in excess of 150x (median 462.5; range 154–2931); that is, for each allele more than 150 sub-reads passed the quality checks in the post-processing stage of SMRT data analysis and could be used to generate a consensus sequence (Table 1). All of the consensus sequences generated achieved a mean Quality Value (QV) of over 70 (mean QV 74.079, range 71.937–80).
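To put the reported QV figures in context, the following is a minimal sketch that converts a quality value to an expected per-base error probability. It assumes the consensus QVs are Phred-scaled (QV = -10 log10 p), which the text above does not state explicitly; the function names are illustrative only.

```python
import math


def qv_to_error_probability(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_probability_to_qv(p: float) -> float:
    """Inverse conversion: per-base error probability to a Phred-scaled QV."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # QV values taken from the results paragraph above (threshold, mean, maximum).
    for qv in (70, 74.079, 80):
        p = qv_to_error_probability(qv)
        print(f"QV {qv}: about {p:.2e} expected errors per base")
```

Under this assumption, a QV of 70 corresponds to roughly one expected error in ten million consensus bases.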
A comparison of the HLA typing results expected based on HLA class I typing by Anthony Nolan and that obtained through SMRT sequencing can be found in Table 2. Samples that were thought to be homozygous at a particular locus were expected to generate a single consensus sequence. Alleles thought to be the same but observed in different individuals (for example, HLA-A*03:01:01:01 in samples AN1 and AN4) were considered as different consensus sequences. Therefore, a total of 38 possible consensus sequences were expected.
Thirty of these 38 possible HLA consensus sequences immediately showed complete identity with reference sequences available in the IMGT/HLA Database (Table 2). Both alleles for each of the three HLA class I loci were accurately called in three samples, AN1, AN5 and AN7. SMRT sequencing of sample AN5 correctly identified this sample to be homozygous for all three HLA class I loci tested. Unexpectedly, an additional HLA gene sequence was also identified in this sample. Sequences corresponding to the HLA pseudogene allele HLA-H*02:01:01:01 were identified, although the numbers of reads were low (n = 12). Despite the suboptimal number of reads, the HLA-H sequences were correctly called for this sample. This non-specific product is due to the co-amplification of HLA-H in the reactions for another class I product, presumably HLA-A, because of sequence similarities between the two genes. Samples AN6 and AN7 were included in this test cohort as they contain alleles that have notable non-coding deletions in their genomic DNA (gDNA) sequences (B*73:01 and C*17:01). SMRT sequencing methods were able to generate consensus sequences that accurately identified these two alleles.
Four consensus sequences generated with SMRT DNA sequencing methods matched alleles that only had either partial gene or Coding DNA Sequences (CDS) available in the IMGT/HLA Database. As described previously, HLA types were assigned to the consensus sequences by comparison to CDS sequences available on all HLA alleles as well as with genomic sequences of closest related alleles (Table 3). Data for sample AN6 (HLA-C*15:05:01 expected) was identical to the reference genomic sequence used as a comparison (HLA-C*15:05:02) except for the single nucleotide difference in exon 1 that differentiates the two alleles (gDNA 24T>C). Samples AN3 (HLA-B*14:01:01 expected) and AN4 (HLA-B*27:05:18 expected) showed sequence variation in intron sequences in addition to those that define the differences between the observed allele and that to which it was being compared. Sanger sequencing was used to confirm allele identity (AN6) or to confirm the existence of novel non-coding variants (AN3 and AN4). All positions tested matched those generated with SMRT sequencing technology confirming the accuracy of the method. Data for the second HLA-B allele in sample AN3 initially suggested an intron 5 variant of HLA-B*27:05:02 (gDNA 2086C>T). Sanger sequencing of the region of interest confirmed the nucleotide substitution when compared to the existing genomic reference sequence. An analysis of all available intron 5 sequences for HLA-B*27 alleles showed all other alleles had the variant base at the queried position (2086T), possibly suggesting that there was an error in the original sequence submitted to the IMGT/HLA Database. The original DNA source used to generate the HLA-B*27:05:02 genomic sequence was identified and re-sequenced. The data confirmed that there was an error in the original genomic sequence and that the consensus sequence generated by SMRT DNA sequencing was correct. 
These novel genomic HLA sequences have been submitted to the IMGT/HLA Database as extensions or corrections to existing alleles (Table 3).
Four of the 38 tested alleles showed novel genomic HLA sequences when compared to the expected sequences (Table 4); AN2 (HLA-A*68 variant), AN6 (HLA-B*52 variant) and AN3 (two HLA-C variants, C*02 and C*08). Sanger SBT of the regions of interest for each of the four alleles confirmed the variant bases, and thus the novel alleles identified using SMRT sequencing. These novel genomic sequences have been submitted to the IMGT/HLA Database and have been officially named according to the WHO Nomenclature Committee for Factors of the HLA System (Table 4) [3].
When Sanger sequencing confirmations of novel genomic sequences were included, the final analysis showed absolute concordance between the consensus sequences generated with Pacific Biosciences’ SMRT DNA sequencing method and the expected/observed alleles.
A common concern with all sequencing-based typing methods (second generation and in some cases, Sanger sequencing) is the accuracy of the technology to determine the correct number of consecutive nucleotides within homopolymer regions. This is of utmost importance in HLA testing as a single nucleotide insertion/deletion will change the HLA type of an individual, which can have serious clinical consequences. In order to assess the accuracy of SMRT DNA sequencing technology, we determined the number of homopolymer regions present in each of the 38 HLA sequences generated that consisted of five or more nucleotides (Table 5). The total number of bases sequenced was 130117 bp within which 487 homopolymer regions were identified. The most frequently observed nucleotide repeat regions were 5-mers, which occurred multiple times for each of the four nucleotides (range 12–209 times). The longest homopolymer region found in the tested alleles was a 9-mer; this occurred 13 times but only for the T nucleotide. Noticeably fewer homopolymer regions were observed for the A, C and G nucleotides, particularly in 7-, 8- and 9-mers. In all cases, SMRT DNA sequencing methodology accurately determined the correct number of nucleotides present in each allele. Additionally, 99.354% of the homopolymer regions had a mean QV of 70 or more across the homopolymer region (mean QV 74.084; range 64.286–80). Details on individual QV data for each of the 38 HLA sequences generated can be found in the supplementary information (S1 Table).
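The homopolymer tally described above (runs of five or more identical nucleotides, counted per base and per length) can be sketched as follows. This is an illustrative reimplementation, not the authors' analysis pipeline; the function names and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import groupby


def homopolymer_runs(seq: str, min_len: int = 5):
    """Return (start, base, length) for every run of one base with length >= min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def tally_runs(runs):
    """Count runs by (base, length), e.g. how many T 9-mers were seen."""
    return Counter((base, length) for _, base, length in runs)


if __name__ == "__main__":
    # Hypothetical toy sequence: one T 9-mer, one G 5-mer, and an A 4-mer that is too short.
    example = "ACG" + "T" * 9 + "AC" + "G" * 5 + "AC" + "A" * 4 + "T"
    runs = homopolymer_runs(example)
    print(runs)
    print(tally_runs(runs))
```

Comparing the run lengths found in a consensus sequence against those in the reference allele is one simple way to check for homopolymer insertions or deletions.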
Discussion
Next generation sequencing technologies have offered the first feasible laboratory-based solution to the problem of phasing the complex polymorphisms seen in the HLA gene family. Limitations in read length have meant that a shotgun approach has to be applied, with multiple fragments covering an entire region of interest being necessary [10, 12, 20–24]. The SMRT DNA sequencing method from Pacific Biosciences has overcome the need to sequence multiple overlapping fragments allowing sequencing of a single fragment in excess of 20 kb in one sequencing reaction. The implications of this technology in the field of HLA typing could be enormous, allowing for true allelic HLA typing in a single experimental set-up and making redundant the need for multiple experiments on different typing platforms, cross-referencing of results and/or the need for re-sequencing using an allele specific protocol. We have described here the results of a feasibility study which shows that whole HLA class I gene sequencing is possible using the SMRT DNA sequencing platform. The sequence data generated was high quality and allowed for accurate allele calling. In addition, all stages of the experimental set-up were completed within three working days and sequence data were captured over 120 minutes. In combination, these factors make the SMRT DNA sequencing method amenable for use in a high-throughput HLA typing laboratory.
The primary aim in testing this methodology was to determine whether accurate genomic consensus sequences could be generated using our current in-house protocol for full gene HLA class I amplification and with our DNA samples using SMRT sequencing technology. Our findings have confirmed that our blood and saliva specimens and subsequent DNA extraction procedures are suitable for the isolation of high molecular weight genomic DNA, an essential prerequisite for the PCR amplification of HLA whole gene amplicons. The PCR primers and amplification conditions led to specific amplification of the genes of interest, namely HLA-A, -B and -C. There was minimal co-amplification of HLA-H, most likely with HLA-A primers, but this did not have a detrimental effect on allele calling. Some allelic imbalance was observed in the data generated for HLA-B. The most likely explanations for this observation are either that SMRT DNA sequencing is a more sensitive methodology and is therefore more likely to identify imbalance in the PCR which is not seen in SBT, or that possible nucleotide differences between the primer and allele sequences caused inefficient or inhibited binding.
Differences between the numbers of reads seen for each locus of a single sample were also observed. A potential reason for these differences is that there was imbalance during the equimolar pooling stages, with some loci being over- or under-represented. Additionally, the kit used to quantify the PCR amplicons prior to equimolar pooling is limited to quantifying samples within a range of 0.5–50 ng/µl. As all amplicons in this experiment were of concentrations towards the upper limits of the kit, it is possible that the sizing and quantification values were affected, which in turn affected the volumes required for equimolar pooling and caused the imbalance between loci. The use of single molecule sequencing methodologies is challenging our previous perceptions of what constitutes ‘good’ or ‘successful’ PCR amplifications, with significantly lower quantities of amplicons required for most processes.
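The equimolar pooling arithmetic discussed above can be sketched as follows. This is a rough illustration, not the laboratory's protocol: it assumes the conventional approximation of about 660 g/mol per base pair of double-stranded DNA, and the amplicon concentrations and lengths in the usage example are hypothetical. Only the 0.5–50 ng/µl quantification range comes from the text.

```python
AVG_BP_MASS = 660.0          # g/mol per base pair of dsDNA (conventional approximation)
KIT_RANGE = (0.5, 50.0)      # ng/ul quantification range quoted in the text


def molarity_nM(conc_ng_per_ul: float, length_bp: int) -> float:
    """Convert a mass concentration (ng/ul) to a molar concentration (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def pooling_volumes(amplicons: dict, fmol_each: float) -> dict:
    """Volume (ul) of each amplicon needed to contribute fmol_each femtomoles.

    amplicons maps a name to (concentration in ng/ul, length in bp).
    Raises ValueError when a concentration lies outside the kit's range,
    because the measured value may then be unreliable.
    """
    volumes = {}
    for name, (conc, length) in amplicons.items():
        if not KIT_RANGE[0] <= conc <= KIT_RANGE[1]:
            raise ValueError(f"{name}: {conc} ng/ul is outside the kit range {KIT_RANGE}")
        # 1 nM equals 1 fmol/ul, so volume = amount / molarity.
        volumes[name] = fmol_each / molarity_nM(conc, length)
    return volumes


if __name__ == "__main__":
    # Hypothetical measurements for one sample.
    sample = {"HLA-A": (45.0, 3300), "HLA-B": (38.0, 3400), "HLA-C": (48.0, 3400)}
    for name, vol in pooling_volumes(sample, fmol_each=50).items():
        print(f"{name}: {vol:.2f} ul")
```

Because the volume scales inversely with the measured concentration, a quantification error near the upper limit of the kit translates directly into a pooling imbalance of the kind described above.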
Despite imbalance issues, significant depth of coverage was achieved for all alleles that were sequenced and allowed for accurate HLA allele assignment. Future experiments where the extent of multiplexing different DNA samples or HLA loci is tested should consider the effect of allelic imbalance on the depth of coverage achievable, although these issues should be easily rectified with additional amplification optimisation. Additionally, future experiments should either take final concentrations of samples and quantification kit limitations into consideration before proceeding with the sequencing experiment, or alternatively alter PCR conditions so that lower quantities of amplicon are generated.
The concentration of amplicons for all three class I loci was sufficient for pooling at equimolar levels prior to library preparation. The multiplexing of the three amplicons from a single sample in a designated SMRT cell allowed for 150x read depth for all alleles present with the resultant sequence reads being successfully aligned and assigned to the relevant HLA class I genes with analysis software. The amplicon lengths were similar enough to negate the potential problem of loading bias towards smaller PCR products in a pool when dispensed into SMRT Cells. The generated sequence exhibited complete coverage from the sites of the PCR primers. Depending on the loci and alleles present, this was inclusive of the terminal 300 bp in the 5´UTR, exons, introns and the leading 200 bp in the 3´UTR.
The quality of the HLA class I genomic sequences generated can partly be confirmed by the high percentage of those reaching QV70, in some cases higher, but also by the accurate assignment of HLA types to these sequences. Of particular interest was the accuracy of the data produced for the homopolymer regions present in the different alleles due to the known cross-platform problem of enzymes incurring slippage when sequencing through long stretches of a single continuous base. In all cases, SMRT DNA sequencing technology was able to call the correct number of bases for each allele. The longest homopolymer region sequenced here was a 9-mer and although this was seen multiple times, only 9-mers of the T nucleotide were observed. Thus it remains to be seen whether the technology can adequately sequence through longer homopolymer regions and whether different bases introduce other problems.
The accuracy of the methodology for sequencing the tested samples was substantiated by the correct identification of novel HLA class I alleles, each of which was separately confirmed using Sanger-based sequencing methods. The high number of novel alleles found in this small test cohort (4/38 sequences; 10.5%) highlights the extensive polymorphism seen in the HLA genes outside of the routinely typed exons, much of which may as yet be unknown. As previously stated, most histocompatibility laboratories would like to be able to generate allele-level resolution for all samples processed, but this is often unattainable due to financial, time and experimental constraints. SMRT DNA sequencing technology could offer a resolution to these issues, providing sequences for ultra-high resolution HLA typing in a single sequencing reaction and being achievable in less time than it would take using current methodologies.
The down-stream uses of HLA typing data are varied and include assessing compatibility between donors and recipients prior to transplantation, drug hypersensitivity and disease associations. The potential impact of using SMRT DNA sequencing in the future to generate such high-resolution HLA typing on many of these areas of medicine is likely to be considerable. For example, high resolution HLA typing has been shown to significantly improve outcome when stem cell transplant recipients and their unrelated donors are matched for both alleles at five of the classical HLA loci (HLA-A, -B, -C, -DRB1 and -DQB1, a 10/10 match) [25–28], as it is thought that disparity at these important compatibility loci can contribute to complications such as graft-versus-host disease and consequently, to mortality. SMRT DNA sequencing has the potential to detect previously unidentified polymorphisms in regions of the HLA genes that could be significantly contributing to these complications. This could ultimately result in considerable improvement in survival rates post transplant.
Currently many histocompatibility laboratory regulatory bodies are defining the standards that will be necessary for clinical typing and reporting of HLA types by various sequencing platforms, particularly regarding the minimum depth of coverage required. At this time, no clinical governance has been established. The depth of sequence coverage described to date in HLA studies that have utilised next generation sequencing methods has varied [10, 12, 20–24]. Here we have demonstrated a minimum of 150x depth of coverage for each of the alleles tested, with the added advantage that each of the sub-reads is a full genomic sequence. However, as this was a feasibility study, we have not tested the maximum capabilities of the SMRT DNA sequencing method, with a maximum of six individual amplicons (two different alleles per HLA gene; three HLA genes per DNA sample tested) being sequenced on a single SMRT Cell. In order for this technology to be economically and practically viable for use in our clinical laboratories, the degree of multiplexing must be significantly higher. What effect this would have on the depth of coverage achievable for a single allele is yet to be determined, but it is reasonable to assume that it would be notably lower than experienced in this study. Thus, the potential of SMRT sequencing for routine HLA typing at this current time will in some part be dictated by the cost per sample, but also by the requirements of the histocompatibility laboratory regulatory bodies. Preliminary data from our group suggests that multiplexing 48 samples for three HLA class I genes is possible and produces accurate typing results, suggesting that this technology is viable for use in a high-throughput clinical laboratory (unpublished data).
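As a back-of-the-envelope illustration of why per-allele depth is expected to fall with more multiplexing, the sketch below scales an observed depth by the ratio of amplicons per SMRT Cell. It assumes that the usable sub-read yield of a cell stays constant and is shared evenly among amplicons. Neither assumption is tested in the study, and real yields depend on chemistry, loading and PCR balance, so this is only a rough proportional estimate; the function name is hypothetical.

```python
def scaled_depth(observed_depth: float, observed_amplicons: int, planned_amplicons: int) -> float:
    """Estimate per-allele depth after changing the number of amplicons per SMRT Cell.

    Assumes constant total yield per cell, divided evenly among amplicons.
    """
    if observed_amplicons <= 0 or planned_amplicons <= 0:
        raise ValueError("amplicon counts must be positive")
    return observed_depth * observed_amplicons / planned_amplicons


if __name__ == "__main__":
    # Figures from the study: up to six amplicons per cell, minimum depth 154, median 462.5.
    # Planned design from the text: 48 samples x 3 genes x 2 alleles.
    planned = 48 * 3 * 2
    for label, depth in (("minimum", 154), ("median", 462.5)):
        print(f"{label}: ~{scaled_depth(depth, 6, planned):.1f}x under the constant-yield assumption")
```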
The number of DNA samples tested here was low, although multiple genes were sequenced for each sample. It is important that larger and more diverse cohorts of DNA samples are sequenced using SMRT DNA technology to confirm suitability for HLA typing. Future studies should also test the maximum multiplexing capabilities of the SMRT sequencing system, both with increased numbers of samples and the number of HLA loci included per SMRT Cell. It also remains to be seen whether accurate and high-quality HLA class II consensus sequences can be generated on this platform, which would be necessary for clinical use of this technology.
This method offers a realistic solution to the issues encountered in clinical HLA typing and has the potential to significantly improve clinical prognoses.
Supporting Information
S1 Table. Accession numbers and QV values of all HLA genomic sequences generated using SMRT DNA sequencing method and submitted to EMBL.
doi:10.1371/journal.pone.0127153.s001
(DOCX)
Author Contributions
Conceived and designed the experiments: NPM JR AJMM SR KE C-SC HB JAM KL SGEM. Performed the experiments: NPM JR AJMM KE WM WPB C-SC BB PM. Analyzed the data: NPM JR AJMM SR KE WM WPB C-SC BB PM HB JAM KL SGEM. Contributed reagents/materials/analysis tools: SR C-SC HB JAM KL SGEM. Wrote the paper: NPM JR AJMM SR KE WM WPB C-SC BB PM HB JAM KL SGEM.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127153
Large Genome Assembly with PacBio Long Reads
PacBio long reads can be used in a number of ways to generate and improve de novo assemblies for large genomes. You can take several different approaches:
1. PacBio-only de novo assembly. Using just PacBio reads from a long insert library, the reads are often preprocessed before being assembled using an Overlap-Layout-Consensus algorithm. The best known implementation of this is HGAP.
2. Hybrid de novo assembly. Using a combination of PacBio and short read data, the reads are used together during assembly to generate a hybrid assembly.
3. Gap filling. Starting with an existing mate-pair based assembly, the internal gaps (consisting of Ns) inside the scaffolds are filled using PacBio sequences.
4. Scaffolding. Using an existing assembly (such as an assembly based on short read data), PacBio reads are used to join contigs.
Figure 1. Illustration of PacBio assembly approaches
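For the gap-filling approach listed above, the internal gaps are runs of N characters within a scaffold. The following is a minimal sketch for locating such gaps in a scaffold sequence held as a string; it is not how PBJelly or any other tool named here works internally, and the function name is hypothetical.

```python
import re


def find_gaps(scaffold: str, min_len: int = 1):
    """Return (start, end) half-open coordinates of every run of N/n in the scaffold."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]


if __name__ == "__main__":
    toy_scaffold = "ACGTNNNNACGTNNAC"  # hypothetical example
    for start, end in find_gaps(toy_scaffold):
        print(f"gap at {start}-{end} ({end - start} bp)")
```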
Below we discuss what software is available, choosing software, and additional considerations.
Software Options
Name: Description

PacBio-only
HGAP: A workflow to first preassemble reads, assemble the preassembled reads using Celera® Assembler, then polish using Quiver.
•Supports up to 100 Mb from SMRT Portal, which is part of SMRT Analysis.
•Larger genomes are possible from the command line using either smrtpipe.py or the Makefile-based smrtmake.
HBAR-DTK: An experimental toolkit for running HGAP-style assemblies.
Falcon: An experimental diploid assembler, tested on ~100 Mb genomes. 2014 AGBT presentation by Jason Chin.
PBcR self-correction: A mode within PBcR (aka pacBioToCA) to do self-correction in the same style as HGAP. Celera® Assembler 8.2 uses the MHAP algorithm for faster overlap calculation during the self-correction phase.
Celera® Assembler: Celera® Assembler 8.1 now offers a way to directly assemble subreads.
Sprai: A preassembly-based assembler that aims to generate longer contigs.

Hybrid
pacBioToCA: An error correction module in Celera® Assembler originally designed to align short reads to PacBio reads and generate consensus sequences. These error corrected reads can then be assembled by Celera® Assembler.
ECTools: A set of tools that uses contigs instead of short reads for correction.
Spades: A short read assembler that added PacBio hybrid assembly support as of version 3.0.
Cerulean: Cerulean starts with an assembly graph from Abyss and extends contigs by resolving bubbles in the graph using PacBio long reads. Was successfully run on genomes <100 Mb.
dbg2olc: dbg2olc uses Illumina contigs as anchors to build an overlap graph with PacBio reads, allowing very fast performance.

Gap Filling
PBJelly 2: PBJelly upgrades genomes by using PacBio reads to fill in gaps in scaffolds. Has been shown to work with genomes >1 Gb. Part of the PBSuite of applications including PB Honey. See also PAG 2014: Kim Worley, "Improving Genomes using Long Reads and PB Jelly 2".

Scaffolding
AHA: AHA ("A Hybrid Assembler") is designed to join existing contigs using PacBio reads. Limited to genomes <200 Mb; part of SMRT Analysis.
PBJelly 2: The new version of PBJelly has support for joining scaffolds.
Considerations
Coverage and Choosing Software
The choice of algorithms depends on how much PacBio sequencing can be obtained and what types of short read data are available. We recommend PacBio-only de novo assembly when it is possible to get at least 50X PacBio coverage. HGAP performs better with more coverage; with higher coverage, a greater number of the longest reads becomes available for assembly. For larger genomes, PBcR in Celera Assembler 8.2 beta uses MHAP, which offers faster assembly times.
For a hybrid assembly involving both PacBio and short read sequencing, PBcR and ECTools can work well with around 20X PacBio coverage. If a high quality set of scaffolds exists, then PBJelly 2 can be used. We recommend at least 5X PacBio coverage to fill gaps; higher coverage enables better consensuses in gap filled regions and increases the number of addressable gaps, as random sampling at lower coverage can lead to coverage gaps.
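The coverage guidance above can be summarised as a small decision helper. This is only a sketch of the rules of thumb stated in the text (at least 50X for PacBio-only, around 20X for hybrid, at least 5X for gap filling); the function name, the order of the checks and the treatment of "around 20X" as a hard threshold are assumptions made for illustration.

```python
def suggest_approach(pacbio_coverage: float,
                     has_short_reads: bool = False,
                     has_scaffolds: bool = False) -> str:
    """Suggest an assembly approach from PacBio coverage and available data."""
    if pacbio_coverage >= 50:
        return "PacBio-only de novo assembly (e.g. HGAP or PBcR self-correction)"
    if has_short_reads and pacbio_coverage >= 20:
        return "Hybrid de novo assembly (e.g. PBcR or ECTools)"
    if has_scaffolds and pacbio_coverage >= 5:
        return "Gap filling of existing scaffolds (e.g. PBJelly 2)"
    return "Coverage too low for the approaches described; consider more sequencing"


if __name__ == "__main__":
    print(suggest_approach(60))
    print(suggest_approach(22, has_short_reads=True))
    print(suggest_approach(8, has_scaffolds=True))
```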
Figure 2. PacBio algorithm suggestions from a PAG 2014 presentation by Mike Schatz
Repetitive Content
One of the biggest challenges with de novo assembly is repeat content. In general, the solution is to work with insert sizes that can span repeats and identify unique anchoring sequence on each side. PacBio long reads are uniquely useful in sequencing long inserts, given that they can read from one end of the insert to the other.
Ploidy
Most existing assemblers were designed for haploid genomes. When a diploid genome has little structural variation between the chromosome copies, then a haploid approach can work well, with the occasional structural heterozygosity appearing as separate contigs. In diploid genomes with larger structural variation or multiploid genomes, assemblies based on haploid assemblers are increasingly fragmented. For these genomes, consider Falcon - though it is considered experimental code. Note also that Celera® Assembler can be configured to favor merging haplotypes.
If possible, select strains to minimize heterozygosity, which helps facilitate assembly. This includes using inbred lines, double haploid strains, and other effectively haploid genomes. For example, the human hydatidiform mole that was sequenced is a double haploid genome.
Coverage Bias with Short-Read Data
Short read data has coverage bias in regions with extreme GC composition because short read technologies require amplification. Even if PCR-free sample preparation methods are used, ultimately there is bridge amplification during sequencing.
In addition, with error correction approaches such as PBcR, short reads made of simple repeats are difficult to use given that the kmers used to seed overlaps are at high frequency and thus often filtered out (see PAG 2014, Mike Schatz slide 12).
Computational Requirements
De novo assembly algorithms using PacBio reads generally use an overlap-layout-consensus algorithm to arrange long reads (such as Celera® Assembler, which HGAP and pacBioToCA both use). Because the overlap phase requires an all-by-all alignment, computation time scales quadratically with the genome size. For genomes approaching one gigabase or larger, assembly requires significant computational resources. For example, the initial overlap step in preassembly for the 54X human assembly required 405,000 CPU hours. Compute times are also described in the pacBioToCA-based drosophila assembly. There are efforts to reduce the computational burden, such as Dazzler (blog) and MHAP (blog post, webinar).
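The quadratic scaling mentioned above follows from counting read pairs in a naive all-by-all comparison. The sketch below illustrates this with hypothetical genome sizes and a hypothetical mean read length; it counts candidate pairs only and does not model the run time of any particular overlapper.

```python
def read_count(genome_size_bp: int, coverage: float, mean_read_len_bp: int) -> int:
    """Approximate number of reads needed for a given coverage."""
    return round(genome_size_bp * coverage / mean_read_len_bp)


def all_by_all_pairs(n_reads: int) -> int:
    """Number of unordered read pairs in a naive all-by-all comparison."""
    return n_reads * (n_reads - 1) // 2


if __name__ == "__main__":
    coverage = 50
    mean_read_len = 10_000  # hypothetical mean read length in bp
    for genome_size in (100_000_000, 200_000_000, 400_000_000):  # hypothetical sizes
        n = read_count(genome_size, coverage, mean_read_len)
        print(f"{genome_size / 1e6:.0f} Mb: {n} reads, {all_by_all_pairs(n)} pairs")
```

At fixed coverage and read length, doubling the genome size doubles the read count and roughly quadruples the number of pairs to compare.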
Hybrid assembly using PBcR also adds a layer of computational complexity, since aligning 100X of short reads to PacBio reads is a computationally intensive task. One way to reduce computational time is to align short read contigs to PacBio reads, such as through ectools, which effectively compresses down the short read data. This type of approach also has the advantage of increasing the mappability of short read data, since assembled contigs are longer than the individual reads.
Draft Genome Quality
Gap filling of mate pair-based scaffolded assemblies is particularly sensitive to the quality of the starting assembly. When aligning PacBio reads across gaps in the scaffolds, misassemblies in the scaffolds can result in improper alignments and incorrectly-filled or unfillable gaps.
Large insert libraries
Even though this is a discussion of assembly algorithms, key to a successful assembly is the longest reads possible through careful sample preparation. We recommend the largest insert libraries possible (e.g. 20 kb) using BluePippin™ size selection (see 20 kb Template Preparation Using BluePippin Size-Selection) and sequencing with the P6-C4 chemistry.
Datasets and Example Projects
• Human dataset, see also blog post
• 2014 PAG presentation by Allen Van Deynze discussing the spinach assembly
•Arabidopsis dataset
• Drosophila dataset, see also this blog post
•Saccharomyces cerevisiae dataset and assembly
• Neurospora Crassa (Fungus) Genome, Epigenome, and Transcriptome, see also this poster at AGBT 2014
•Other datasets
Additional Links
•http://www.homolog.us/blogs/blog/2014/02/21/opinionated-history-genome-assembly-algorithms/
•PAG 2014: Michael Schatz, “De novo assembly of complex genomes using single molecule sequencing”
• 2014 AGBT presentation by Richard McCombie discussing the assembly of rice and yeast, including a coverage titration of the Arabidopsis dataset and assembly performance of ectools versus HGAP.
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
Isoform Sequencing
Full-length transcript sequencing, no assembly required
RNA sequencing (RNA-Seq) is a broadly used method for studying gene expression. However, short-read sequencing approaches cannot span full-length transcripts, making it difficult to accurately characterize the diverse landscape of isoforms from alternative splicing and transcriptional regulation.
Using the extraordinarily long reads generated by SMRT Sequencing, the Iso-Seq method provides reads that span entire transcript isoforms, from the 5' end to the 3' polyA-tail. It is now possible to directly sequence full-length transcripts ranging up to 10 kb. Generation of accurate, full-length transcript sequences greatly simplifies analysis by eliminating the need for transcript reconstruction to infer isoforms using error-prone assembly of short RNA-Seq reads.
The Iso-Seq method has been applied in a wide variety of organisms to improve annotations in reference genomes, characterize alternatively spliced isoforms in important gene families, and find novel genes in the most comprehensively studied human cell lines. In-depth isoform sequencing can be done across the entire transcriptome or for targeted genes. Understanding the complete representation of a sample's gene isoforms increases the sensitivity and specificity of quantitative functional genomics studies. Isoform sequencing also provides information to efficiently detect or validate novel gene fusions and has been used to determine allele-specific isoform expression.
Benefits:
•Sequence full-length mRNA transcripts, with no assembly required
•Characterize gene-isoform expression across an entire transcriptome, or within targeted genes
•Discover novel genes and gene isoforms even in well characterized samples
•Perform de novo gene annotation, with or without a reference genome
•Complete information about alternatively spliced exons, transcriptional start sites, polyadenylation sites and strand orientation
•Improve quantitation accuracy for functional genomics studies
http://www.pacb.com/applications/isoseq/index.html
PacBio Explores Targeted Sequencing, Focusing on Complex Genomic Regions
May 06, 2015 | Monica Heger
NEW YORK (GenomeWeb) – Pacific Biosciences is making advances in the targeted sequencing space, including a partnership with Roche NimbleGen, the firm said this week. During a conference call discussing its 2015 first quarter results, CEO Mike Hunkapiller highlighted recent publications from customers who have turned to PacBio technology to target complex regions of the genome.
During the call, Hunkapiller also provided an update on the company's development agreement with Roche, saying that it expects to deliver a clinical sequencing system to Roche by the second half of next year. The most recent development milestone in the firms' pact has triggered a $10 million payment from Roche to PacBio.
PacBio has now earned half of the $40 million milestone revenue and the "pathway to the finish line is becoming more definite," Hunkapiller said. As planned, the company is on track to "deliver a product to Roche in the second half of next year."
Hunkapiller also noted that for the first time China's BGI has purchased a PacBio RS II system. He said that after an initial evaluation, BGI would likely "purchase additional units to integrate into its sequencing service business."
In addition, the firm said it no longer plans to provide updates on the number of instruments it has sold or installed during a quarter, due to competitive reasons. However, in a note following the earnings call, William Quirk, an analyst with Piper Jaffray, wrote that he estimated that the firm "validated" approximately 12 instruments and added another 12 orders to its pipeline, for an instrument backlog of 15.
Yesterday, PacBio also announced an agreement with RainDance Technologies to co-develop and commercialize technology that would create synthetic long reads around 100 kb in size that could be used for de novo whole-genome assembly.
Targeted sequencing
Hunkapiller said that PacBio last week partnered with Roche NimbleGen to develop a workflow using the NimbleGen SeqCap EZ enrichment technology to enrich DNA fragments up to 6 kb.
"This approach can provide a very comprehensive view of structural variants and haplotype information," Hunkapiller said.
The company also recently launched sample prep barcoding kits, "which allow customers to pool multiple samples onto a single SMRT cell," Hunkapiller said, "which can dramatically bring down the cost of sequencing large numbers of samples for applications like HLA typing."
He added that the product launch supports its goal of reducing the cost of sequencing on the RS II to "expand the use across a broader spectrum of applications."
Researchers at Baylor College of Medicine developed a method based on this approach using NimbleGen capture technology that they called PacBio-LITS, which they used to identify breakpoint junctions of low copy repeat-associated complex structural rearrangements on chromosome 17 in patients diagnosed with Potocki–Lupski syndrome. They published their work in BMC Genomics.
Similarly, a group from George Washington University developed a targeted sequencing method on the PacBio platform to search for biomarkers associated with ovarian hyperstimulation syndrome, and a group from Uppsala University used targeted sequencing on the RS II to search for mutations in the BCR-ABL1 gene fusion.
Customers sequencing more complex genomes
A sign that customers are turning to PacBio technology for increasingly complex projects is the rising consumable revenue and rising average annual consumable revenue per instrument, Hunkapiller said. In the first quarter, average consumable revenue per system was around $133,000. Total consumable revenues grew 69 percent in the quarter, to $4.3 million, he added.
"We have a long way to go on per-instrument usage," Hunkapiller said, adding that the company is seeing system usage rise at an increasing number of sites, "particularly as they get into more complicated genome projects, which require multiple SMRT cell runs per project," as opposed to microbial sequencing projects, which often need only one SMRT cell.
Hunkapiller said that going forward, increasing throughput will be key, which will bring down the cost. "We expect to get an increase of four-fold in throughput per dollar," he said, which will give the company a more competitive position in the human genome sequencing space.
https://www.genomeweb.com/business-news/pacbio-explores-targeted-sequencing-focusing-complex-genomic-regions
From the American Society for Microbiology-----------------Complete Genome Sequence of the Clinical Beijing-Like Strain Mycobacterium tuberculosis 323 Using the PacBio Real-Time Sequencing Platform
Juan Germán Rodríguez (a), Camilo Pino (b), Andreas Tauch (c), Martha Isabel Murcia (a)
Author Affiliations
(a) Departamento de Microbiología, Facultad de Medicina, Grupo MICOBAC-UN, Universidad Nacional de Colombia, Sede Bogotá, Colombia
(b) Facultad de Ingeniería, Grupo BioLISI, Universidad Nacional de Colombia, Sede Bogotá, Colombia
(c) Institut für Genomforschung und Systembiologie, Centrum für Biotechnology (CeBiTec), Universität Bielefeld, Bielefeld, Germany
ABSTRACT
We report here the whole-genome sequence of the multidrug-resistant Beijing-like strain Mycobacterium tuberculosis 323, isolated from a 15-year-old female patient who died shortly after the initiation of second-line drug treatment. This strain is representative of the Beijing-like isolates from Colombia, where this lineage is becoming a public health concern.
FOOTNOTES
Address correspondence to Martha Isabel Murcia, mimurciaa@unal.edu.co.
Citation Rodríguez JG, Pino C, Tauch A, Murcia MI. 2015. Complete genome sequence of the clinical Beijing-like strain Mycobacterium tuberculosis 323 using the PacBio real-time sequencing platform. Genome Announc 3(2):e00371-15. doi:10.1128/genomeA.00371-15.
Received 12 March 2015.
Accepted 19 March 2015.
Published 30 April 2015.
http://genomea.asm.org/content/3/2/e00371-15.abstract
Thursday, April 30, 2015---In Study, Continuous Long Reads Outperform Synthetic Long Reads for Resolving Tandem Repeats
Scientists from Argentina and Brazil published the results of a study comparing long-read approaches to characterize the genome structure of a highly complex region of the Y chromosome in Drosophila melanogaster. They found that Single Molecule, Real-Time (SMRT®) Sequencing outperformed synthetic long reads in accurately representing tandem repeats.
The study aimed to resolve the structure of the autosomal gene Mst77F, which had previously been found to have multiple tandem copies; the region, however, was known to be grossly misassembled in the reference. The scientists, from Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas and Universidade Federal do Rio de Janeiro, used Illumina TruSeq Synthetic Long-Reads technology with Celera Assembler as well as PacBio® long-read sequence data assembled with MHAP to interrogate the genomic region. Results were published in the journal G3: Genes, Genomes, Genetics in a paper entitled “Long-read single molecule sequencing to resolve tandem gene copies: The Mst77Y region on the Drosophila melanogaster Y chromosome.”
Lead author Flavia Krsticevic and collaborators report that the synthetic long reads failed to completely cover the region of interest. The resulting assembly “is incomplete and fragmented,” the scientists write. “The scaffolds are small (all below 15 kb), and hence provide little information on the genomic structure and context of the Mst77Y region.”
The authors note that synthetic long reads can accurately resolve repetitive regions “as long as there is only one copy of a repeat in each 10 kb fragments; i.e., the repeats should be interspersed.” Tandem repeats, on the other hand, pose a major challenge to this approach. “It is worth noting also that several biologically interesting and poorly known regions of the Drosophila genome such as other recently duplicated genes, the histone and rDNA clusters, and the centromeres, have a tandem repeat organization, and in these cases synthetic long reads are predicted to have limited utility,” Krsticevic et al. write.
In contrast, the team found that SMRT Sequencing generated data fully covering the genomic region, which assembled into a single contig using MHAP. The assembly revealed 18 copies of the gene, some of them present in identical copies, covering 96 kb. The team independently validated the findings, demonstrating that six previously detected versions of this gene were likely PCR artifacts and discovering six new versions of the gene that had never been identified before. Their validation found two single-base errors across the entire span. “Thus, the assembly of this region seems to be essentially perfect,” they write.
The scientists took advantage of the D. melanogaster reference genome that PacBio generated and made publicly available last year; it served as a comparison point to the PacBio sequence they produced. We’re glad to see that the community is finding resources like this to be helpful. http://blog.pacificbiosciences.com/2015/04/in-study-continuous-long-reads.html
Monday, April 27, 2015--New Solutions for Comprehensive and Efficient Targeted Sequencing and Multiplexing of Samples
We are proud to announce the introduction of several new solutions for targeted sequencing and sample multiplexing on the PacBio® Sequencing System.
New Targeted Sequencing Workflow through Collaboration with Roche NimbleGen
Today we announced a new workflow that combines Roche NimbleGen’s SeqCap EZ enrichment technology with large DNA fragments (up to 6 kb) and our Single Molecule, Real-Time (SMRT®) Sequencing to provide a more comprehensive view of variants, transgene integration sites, and haplotype information over multi-kilobase contiguous regions. The laboratory workflow is described in a shared protocol. For each targeted region, SAMtools are used to phase and bin reads by haplotype, and then Quiver is applied to polish each haplotype to high consensus accuracy. This entire bioinformatics workflow is summarized on GitHub.
The long fragments capture haplotype information that provides a more inclusive view of the targeted region of interest, allowing a comprehensive look at variants in exons, introns, and intergenic regions, as well as the generation of accurate haplotypes as demonstrated in this application note. In a recent publication by Richard Gibbs and colleagues from Baylor College of Medicine, this new enrichment method was used for de novo sequencing and detection of structural variation involved in human genetic disease.
“The use of NimbleGen Sequence Capture technology to enrich large DNA molecules for targeted sequencing on the PacBio platform leverages the advanced capabilities of both systems to achieve unprecedented efficiency for targeted genetic phasing and structural variant elucidation,” said Rebecca Selzer, Ph.D., President, Roche NimbleGen. “The ability to focus SMRT Sequencing onto discrete regions of interest within a genome demonstrates that the technical advantages inherent in long, single molecule sequencing can be part of a higher-throughput analysis strategy.”
Multiplexing and Barcoding Workflows
We have also introduced two multiplexing workflows and associated products for barcoding samples prior to running on SMRT Cells. The new barcoding kits allow you to pool many samples and thus reduce the cost per sample and the total time for sample preparation. By employing the barcoding strategies, you can efficiently focus on even more defined genomic regions. And by barcoding samples, you have the flexibility to multiplex samples or targets within a sample, or a combination of both.
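The pooling idea described above can be illustrated with a minimal Python sketch of demultiplexing: reads from a pooled run are sorted back into per-sample bins by the barcode at the start of each read. The barcode sequences, sample names, and exact-match rule here are assumptions made for illustration; they are not PacBio's barcode set, and production tools typically score barcodes with tolerance for sequencing errors.

```python
from collections import defaultdict

# Hypothetical barcodes and sample names, for illustration only.
BARCODES = {
    "ACGTACGT": "sample_A",
    "TGCATGCA": "sample_B",
}

def demultiplex(reads, barcodes=BARCODES):
    """Group pooled reads by the barcode found at the start of each read."""
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode so only the insert sequence is kept.
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No known barcode matched this read.
            bins["unassigned"].append(read)
    return dict(bins)

pooled = ["ACGTACGTGGGTTT", "TGCATGCAAAACCC", "ACGTACGTCCCC", "NNNNNNNNAAAA"]
result = demultiplex(pooled)
```

Because every read carries its sample's barcode, many samples can share one SMRT Cell and still be separated afterward, which is where the per-sample cost saving comes from.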
Kevin Corcoran, Senior Vice President of Market Development for Pacific Biosciences, commented: “The new barcoding methods allow greater numbers of samples to be studied simultaneously and also improve the efficiency of SMRT Sequencing, by increasing SMRT Cell sample capacity and streamlining the workflow.”
For more information, visit our web page for target enrichment and web page for barcoding methods. http://blog.pacificbiosciences.com/2015/04/new-solutions-for-comprehensive-and.html
Wednesday, April 15, 2015---In Genome-wide Study, Long Reads Prove Critical for Structural Variant Discovery
In a paper just published in BMC Genomics, a team of scientists led by Baylor’s Human Genome Sequencing Center reports a thorough analysis of structural variation in a personal genome. What makes this study special is the large number of different technologies applied and the sheer volume of data gathered and analyzed for this single genome. The paper also includes the first known analysis of structural variation in a diploid human genome using SMRT® Sequencing, with 10x coverage from PacBio® long reads.
Lead authors Adam English and William Salerno and their collaborators at a number of institutions describe the results obtained from a structural variant calling tool they have developed called Parliament. (Check out the full paper: “Assessing structural variation in a personal genome—towards a human reference diploid genome.”)
Structural variants account for the majority of variable bases in a human genome, according to the authors, who note that it will be important to detect and characterize these elements to understand their clinical relevance. Despite their significance, these variants are not as well understood as single nucleotide variants and other small variants. Through this effort to establish new ways to find and analyze structural variants, the scientists determined that short-read sequencing technologies alone miss a good amount of this kind of variation in a genome.
Working with a well-characterized genome, the team combined array CGH data with genome sequence data from Illumina, SOLiD, and PacBio systems, as well as a genome map from BioNano Genomics for the most comprehensive data set possible. That information was fed into Parliament, a pipeline for consensus structural variant calling that can be used with multiple data sets and detection approaches. The data sets were analyzed in various permutations within Parliament, which identified more than 31,000 loci representing possible structural variants. Of those, nearly 10,000 — spanning almost 60 Mb and nearly 2% of the reference genome — were supported by deep-dive genome analysis.
Of the 9,777 confirmed structural variants, the authors report that 3,801 were identified solely by PacBio long-read sequence data, “indicating the importance of read length when characterizing structural variation.” English et al. make the case for using multiple data sources to improve structural variant detection. “The addition of long-read data can more than triple the number of [structural variants] detectable in a personal genome,” they write.
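To make the "identified solely by long reads" figure concrete, here is a rough Python sketch of how calls from several technologies might be reconciled. It is not Parliament's actual algorithm: the simple any-overlap merge rule, the function names, and the toy intervals are all assumptions for illustration. Only the 3,801 and 9,777 counts come from the paper as quoted above.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def support_by_technology(calls):
    """Map each distinct call to the set of technologies that detected it.

    `calls` is {technology: [(chrom, start, end), ...]}. A call that overlaps
    an earlier representative interval is treated as the same variant.
    """
    merged = []  # list of (representative_interval, set_of_technologies)
    for tech, intervals in calls.items():
        for iv in intervals:
            for rep, techs in merged:
                if overlaps(rep, iv):
                    techs.add(tech)
                    break
            else:
                merged.append((iv, {tech}))
    return merged

# Toy calls, invented for illustration.
calls = {
    "short_read": [("chr1", 100, 200)],
    "long_read": [("chr1", 150, 260), ("chr2", 500, 900)],
}
merged = support_by_technology(calls)
long_only = [rep for rep, techs in merged if techs == {"long_read"}]

# Share of confirmed variants reported as found only by long reads.
share = 3801 / 9777
```

With the reported counts, `share` works out to roughly 39% of the confirmed variants.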
The team has made Parliament publicly available through cloud-based service provider DNAnexus. “Implementation of Parliament on local compute requires independent installation of multiple discovery tools and a local assembler, imposing a burden of systems administration and resource consumption,” English et al. write, explaining why they chose DNAnexus to handle the computational side of this project. The cloud provider is now hosting the workflow established by the team — including a pipeline that takes BAM files and generates structural variant calls — as well as data generated in this project.
The authors hope their work helps establish a gold-standard catalog of human structural variation. “The present work identifies upper (4.5%) and lower (1.8%) estimates of the extent of structural variation in a personal genome and characterizes the impact of various resequencing methods,” they write. They also note that “as with [single nucleotide variants], many [structural variants] in a personal genome represent rare or private variants not observed in databases,” highlighting the need to sequence many individuals to obtain a deeper understanding of the extent and diversity of structural variants in the human population and their link to disease. http://blog.pacificbiosciences.com/2015/04/in-genome-wide-study-long-reads-prove.html
"(We found that the Illumina Synthetic Long Reads assembly failed in the Mst77Y region,The PacBio MHAP assembly of the Mst77Y region seems to be very accurate)" -- Long-Read Single Molecule Sequencing To Resolve Tandem Gene Copies: The Mst77Y Region on the Drosophila melanogaster Y Chromosome
Flavia J. Krsticevic (1), Carlos G. Schrago (2) and A. Bernardo Carvalho (2,*)
Author Affiliations
(1) Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas - CONICET
(2) Universidade Federal do Rio de Janeiro
* To whom correspondence should be addressed. E-mail: bernardo1963@gmail.com
Abstract
The autosomal gene Mst77F of Drosophila melanogaster is essential for male fertility. In 2010 Krsticevic et al. found 18 Y-linked copies of Mst77F ("Mst77Y"), which collectively account for 20% of the functional Mst77F-like mRNA. The Mst77Y genes were severely misassembled in the then available genome assembly, and were identified by cloning and sequencing PCR products. The genomic structure of the Mst77Y region, and the possible existence of additional copies remained unknown. The recent publication of two long-read assemblies of D. melanogaster prompted us to re-investigate this challenging region of the Y chromosome. We found that the Illumina Synthetic Long Reads assembly failed in the Mst77Y region, most likely due to its tandem duplication structure. The PacBio MHAP assembly of the Mst77Y region seems to be very accurate, as revealed by comparisons with the previously found Mst77Y genes, a BAC sequence, and Illumina reads of the same strain. We found that the Mst77Y region spans 96 kb and originated from a 3.4 kb transposition from chromosome 3L to the Y chromosome, followed by tandem duplications inside the Y chromosome and invasion of transposable elements, which account for 48% of its length. Twelve of the 18 Mst77Y genes found in 2010 were confirmed in the PacBio assembly, the remaining six being PCR-induced artifacts. There are several identical copies of some Mst77Y genes, coincidentally bringing the total copy number to 18. Besides providing a detailed picture of the Mst77Y region, our results highlight the utility of PacBio technology in assembling difficult genomic regions such as tandemly repeated genes.
http://www.g3journal.org/content/early/2015/04/09/g3.115.017277.abstract
Senior Software Engineer
Pacific Biosciences of California, Inc. Menlo Park, CA--(Job Description)-----------
Do you want to use your C++ software engineering talent to solve problems of real scientific and medical importance? Are you bored with the idea of simply contributing to yet another social networking site or fixing bugs in some large company’s search engine? Do you have a desire to be part of a company creating cutting edge technology that is uncovering the mysteries of life itself? Do you want to work at a company that uses robots and lasers and nanotechnology to see a single DNA molecule sequence in real-time? Pacific Biosciences is seeking a talented C++ software engineer to build automated test systems, infrastructure code and optimized algorithm implementations, including well designed APIs for a complex system that produces SMRT (Single Molecule, Real Time) sequencing data which addresses a myriad of diverse scientific application areas. The candidate will develop robust, reliable and performant software infrastructure components, documentation and tests that enable our team to rapidly create a diversity of software solutions.
Responsibilities:
• Work in an exciting multi-disciplinary organization of software and hardware engineers, bioinformaticians, chemists, and molecular biologists developing state-of-the-art, single-molecule, genomic analysis systems.
• Design and write automated functional and unit tests and support developers to troubleshoot system issues or assist in system-level integration.
• Create a variety of infrastructure components that will be the foundational software layers used to build advanced analysis software.
• Quickly identify solutions to complex data processing and automation problems for use in production instrument software and internal tools; provide methods or prototypes for concept evaluation.
• Develop, integrate and test analysis pipeline components for deployment in production software; develop optimized implementations to maximize performance and throughput on available hardware.
• Write design and functional specifications as well as test plans for peer review; maintain software development practices adhering to company standards for coding and unit/functional test coverage.
Skills & Requirements
• M.S. in computer science, electrical engineering or in physical science disciplines; candidates with a similar B.S. degree and a high level of relevant experience will also be considered.
• An expert in C++ with at least 5 years of implementation experience on Linux.
• At least 5 years of experience developing reusable components, automated tests, utilizing formal software development processes, employing best practices.
• Experience maintaining software projects under source control with p4, git, svn, or similar.
• Extensive experience with multithreading programming and optimization techniques including SSE/SIMD vector processing instructions is required.
• Experience with the Intel Xeon Phi computing platform is a plus.
• Understanding of FDA compliance regulations a plus.
• Preference is shown to candidates with strong analytical and development skills who demonstrate the capability to bring solutions beyond the prototype stage for deployment in performance-critical production software.
All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, national origin, protected veteran status, or on the basis of disability, gender identity, and sexual orientation.
http://careers.stackoverflow.com/jobs/85051/senior-software-engineer-pacific-biosciences-of
Thursday, April 9, 2015
Advances in Genome Assembly With Long Reads-
This symposium is co-hosted by the National Science Foundation and the University of Minnesota Supercomputing Institute for Advanced Computational Research.
Support from PacBio is gratefully acknowledged.
Genome assembly is in the midst of a paradigm shift due to the plummeting cost of long-read sequencing and other large-scale physical mapping technologies. Specifically, Single Molecule, Real-Time (SMRT) sequencing by Pacific Biosciences (PacBio), a third-generation sequencing technology, is quickly supplanting Illumina and other second-generation sequencing technologies for whole genome assembly applications. For less than $700 in sequencing costs at Mayo, microbial genomes can be sequenced and assembled into a single contiguous high-quality genome by this technology in more than 90% of cases, whereas nearly any depth of Illumina sequencing will result in a fragmented genome consisting of hundreds of pieces. As costs continue to decline and raw sequences become longer, complete sequence assembly is already becoming economical for fungi and simple plants. Personalized human genomes are not far off. Concurrent with these developments, Oxford Nanopore’s MinION sequencer is moving beyond prototype status to offer long reads on a sequencer as small as a flash drive, and physical mapping technology companies such as BioNano and Dovetail Genomics are enabling researchers to build ever bigger physical superscaffolds with confidence.
In this symposium, we present perspectives from some of the world’s leading computational scientists in the field who are experimenting with these new technologies toward the goal of creating bigger, better and more contiguous genome sequences for use in real biological contexts.
To register for the symposium, including lunch, please fill in the Symposium Registration Form.
To submit a poster to the Poster Session, please fill in the Poster Submission Form.
Please note that there is a separate registration for the optional lecture/hands-on tutorial in the morning. To attend this tutorial, please fill out the tutorial sign-up page. Space is limited. Due to high demand, two additional hands-on tutorials have been added Friday, April 10 and Tuesday, April 21.
Schedule
9:00 - 11:30: Lecture and hands-on tutorial on PacBio SMRT Portal and Assembly; Jon Badalamenti, Bond Lab [NOTE: Separate registration is required for this tutorial; please visit the tutorial sign-up page.]
11:30 - 12:30: Lunch
12:30 - 12:40: Welcome
12:45 - 1:45: Michael Schatz, CSHL: Oxford Nanopore and PacBio assembly approaches
1:45 - 2:45: Poster session
2:45 - 3:30: Jason Miller, JCVI: PacBio / Hybrid assembly
3:30 - 3:50: Peng Zhou, NSF-funded graduate student: Medicago project, University of Minnesota
3:50 - 4:30: Joann Mudge, NCGR: PacBio and BioNano
4:30 - 4:50: Wrap-up / discussion
https://www.msi.umn.edu/events/genome-assembly-mini-symposium
(PacBio Blog) Thursday, March 26, 2015--In Chronic Myeloid Leukemia Study, SMRT Sequencing Detects Resistance Mutations Early, New Splice Isoforms and More
Scientists from Uppsala University report in a recent paper that using the Iso-Seq™ method with SMRT® Sequencing allowed them to detect and monitor mutations in the BCR-ABL1 fusion gene for patients with chronic myeloid leukemia (CML). Screening mutations in this region is important for determining the point at which these patients become resistant to tyrosine kinase inhibitor (TKI) therapies, and is currently performed in the clinic using Sanger sequencing, quantitative RT-PCR, and other assays.
The paper, “Clonal distribution of BCR-ABL1 mutations and splice isoforms by single-molecule long-read RNA sequencing,” was published last month in BMC Cancer from lead author Lucia Cavelier and collaborators. In it, the scientists describe sequencing samples from six patients who experienced poor response to cancer treatment; samples were collected at diagnosis and at subsequent follow-up periods and sequenced on the PacBio® system.
The team checked for mutations in the BCR-ABL1 fusion transcript, generating on average 10,000 full-length sequences of the gene from a single SMRT Cell. Short-read sequencers have been tried for this kind of work, the authors note, but their inability to span the entire transcript, as well as concerns about bias introduced by nested PCR, have limited their utility.
“Here we present for the first time an assay to directly investigate the entire 1,578 bp BCR-ABL1 major fusion transcript, amplified from a single PCR reaction and sequencing on the Pacific Biosciences (PacBio) RS II system,” Cavelier et al. write. “In addition to enabling a rapid workflow at a relatively low cost, the PacBio system produces reads sufficiently long to span across a full length BCR-ABL1 molecule.” They report that the process, which took two to three days to complete, had a 0% false positive rate, attributed to the random error mode of PacBio sequencing data, “which results in highly accurate base calls for molecules that are sequenced at high coverage.”
For each of the six patients studied, the authors report, SMRT Sequencing confirmed the mutations that had already been found with Sanger sequencing. It also detected five low-frequency mutations that were missed by the Sanger pipeline. In one case, the scientists found that PacBio sequencing successfully detected a mutation four months earlier than it was found by Sanger sequencing, indicating that the technology may ultimately accelerate the identification of genetic markers that are important for diagnosis or drug response monitoring.
In addition, long reads from SMRT Sequencing allowed the team to distinguish multiple transcript isoforms for BCR-ABL1 from individual samples. “These results corroborate previous findings that propose alternative splicing as a common mechanism among CML patients undergoing TKI treatment,” the authors write.
Importantly, PacBio data also made it possible to differentiate compound mutations from independent mutations in other molecules, information that cannot be gleaned from Sanger sequencing. “This feature is of major clinical relevance as compound mutations show different resistance profiles compared to individual mutants,” Cavelier et al. report.
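The distinction between compound and independent mutations comes down to whether two mutations appear on the same molecule. A minimal Python sketch, assuming each full-length read has already been reduced to the set of mutations it carries (the mutation labels and read counts below are hypothetical, not data from the paper):

```python
from collections import Counter

def mutation_combinations(reads):
    """Count how often each set of mutations occurs together on one molecule.

    `reads` is a list of sets; each set holds the mutations observed on a
    single full-length read (that is, one transcript molecule).
    """
    return Counter(frozenset(r) for r in reads)

# Hypothetical mutation labels and read counts, for illustration only.
reads = [
    {"mutA", "mutB"},   # both on the same molecule: compound
    {"mutA", "mutB"},
    {"mutA"},           # only one mutation on this molecule
    {"mutB"},
    set(),              # unmutated molecule
]
counts = mutation_combinations(reads)
compound = {combo: n for combo, n in counts.items() if len(combo) > 1}
```

A method that reports only per-position mutation frequencies would see mutA and mutB each at the same frequency here, whether or not they ever share a molecule; the per-read tally is what separates the two cases.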
By Bio-IT World Staff
March 3, 2015 | The 2015 Advances in Genome Biology & Technology conference wrapped up over the weekend in Marco Island, Florida, after four days of presentations from the front lines of genome analysis. With less than the usual amount of razzle-dazzle on display in this year’s product launches, the event was stolen by some outstanding scientific achievements pulled off with existing platforms. Pacific Biosciences, this year’s gold sponsor, highlighted several of these in a star-studded workshop Friday afternoon to show off the feats that can be accomplished with its SMRT (single molecule real time) sequencers, the instruments of choice for recovering long-range structural information on the genome. Speakers included J. Craig Venter, who runs the world’s largest genome sequencing center at his company Human Longevity, Inc., and is best known for competing with the Human Genome Project to produce the first whole human genome sequence; Deanna Church, who has helped shape improvements to the human reference genome in her work with the Genome Reference Consortium; and Gene Myers, one of the world’s premier bioinformaticians and co-author of the foundational genome analysis tool BLAST.
In a piece this January looking back on last year’s milestones in genomics, Bio-IT World wrote that “2014 could be looked at as the Year of PacBio, when the [midsize] company proved there was room in the market for a pricier instrument that won’t flinch at high GC coverage, large indels, or de novo assembly.” The present moment might eventually come to be seen as the peak of PacBio’s powers, a window in which the company was truly producing the most comprehensive, highest-quality genomes money could buy.
PacBio’s commercial future is murky: companies like 10X Genomics are toying with more affordable ways to get reliable long-range genomic information, and if Oxford Nanopore gets a handle on its error rates and releases the production-scale PromethION, they’re likely to undercut PacBio on price while delivering the same top-of-the-line features. But whatever its market prospects, scientifically PacBio is driving some of the most innovative sequencing projects going on today. Among other accomplishments, the PacBio workshop at AGBT presented multiple users’ de novo assemblies of whole human genomes — until very recently, a vanishingly rare type of project because no high-throughput instrument could deliver the type of data needed to put together a whole human genome without aligning reads to a reference genome.
De Novo Assemblies as a Commodity?
Today, the very presence of SMRT sequencers on the market has encouraged bioinformaticians to build a whole suite of analytical tools to deal with multi-kilobase reads. As the AGBT workshop made clear, PacBio users now have something like a standard pipeline for going all the way from raw reads to a whole genome. A typical workflow might use Gene Myers’ DALIGNER to find local alignments between reads, FALCON for assembly, and Quiver for variant calling. As PacBio CEO Mike Hunkapiller announced in his opening remarks, DNAnexus recently used this DALIGNER-FALCON pipeline to create a new diploid assembly of J. Craig Venter’s genome, following a sequencing effort that took less than a month to generate all the required raw data on SMRT instruments.
Diploid assembly, correctly distinguishing between the maternal and paternal copies of each chromosome, is the gold standard for a full genome sequence. This ability sets FALCON assemblies apart from even the human reference genome — which, as Deanna Church memorably pointed out in her own presentation, has historically included “Franken-alleles” stitched together from different copies of the same chromosomes.
DNAnexus also appears to have set a world record for the fastest human genome assembly last week, patching together the genome of a peculiar breast cancer cell line, SK-BR-3, in less than 21 hours. The process wrapped up at 10:30 on Friday morning, just in time for a shout-out at the workshop from W. Richard McCombie of the Cold Spring Harbor Laboratory. DNAnexus will now be making this workflow available to all customers through its cloud-based informatics service, offering rapid assembly to any labs with the sequencing capacity to drive through enough PacBio reads.
All this is starting to make de novo assembly look less like a titanic enterprise, and a little more like a commodity. Venter, giving the first talk at the workshop, revealed plans to produce an extraordinary 30 new reference genomes at Human Longevity, Inc., combining two SMRT sequencers with his bank of 20 ultra-high-throughput Illumina HiSeq X instruments. “I’m delighted with the focus I’m hearing here, on getting back to assembled genomes,” said Venter. “If we’re going to understand each of our genomes, we need to do de novo assembly.”
The collection of new reference-grade assemblies at Human Longevity isn’t just a matter of showing off; getting new reference genomes from donors with diverse ethnic and geographical backgrounds will help with all future interpretation of large structural variants, which differ widely between human populations and are difficult to square with a single reference assembly. (Sadly unmentioned was whether and when Human Longevity might share its reference genomes with the wider scientific community.)
Venter, of course, has a knack for thinking big. His 30 reference assemblies will represent just a small fraction of the one million whole genomes he intends to sequence by 2020. In his presentation, Venter even spoke glibly about the pace at which he hopes to see his massively expensive bank of sequencers (an investment in excess of $21 million) become obsolete, based on the historical trend toward ever-cheaper sequencing. “We’re counting on $30 genomes in three or four years, and hopefully we can truck away to the dumpster all the machines we have [now],” Venter said.
Many of our readers should also be interested to hear that Venter casually mentioned looking to hire around 200 new bioinformaticians for his company in 2015.
The second speaker, Gene Myers, was also keenly interested in the possibilities PacBio has opened up for relatively straightforward de novo assembly. Myers spent many years in the 2000s more or less out of the limelight, reportedly because he was dissatisfied with the industry’s trend toward using short-read sequencers and reference alignment for most applications. However, he reemerged at AGBT last year, after a conversation with Hunkapiller in which Myers learned that SMRT sequencers deliver long reads with both random sampling of the genome and random, unbiased error rates at any point in the genome.
“As a mathematician, when Mike used this word ‘random’ in those two places I got incredibly excited,” said Myers at this year’s workshop. “Because I understood, from theory alone, that what that meant was immediately that perfect assembly was back on the table.”
Since then, Myers has been hard at work making perfect assembly a reality. In addition to building DALIGNER, he has also started work on a new tool called DAscrub, which was a major focus of his workshop presentation. The purpose of DAscrub is to clean up raw PacBio reads, which are error-prone and vulnerable to sequencing artifacts, without sacrificing valuable data. Myers presented an E. coli assembly, built from 30x coverage of the sample, that yielded a complete circular genome without requiring any correction steps between running DALIGNER and performing full assembly, other than using DAscrub to clear out artifacts.
Key Genomes
None of these advances in de novo assembly will do much to advance science if we don’t choose samples that truly have something to teach us. The last three speakers at PacBio’s AGBT workshop rounded out the afternoon with some compelling applications for this burgeoning technology.
Deanna Church, formerly of the National Center for Biotechnology Information and now Senior Director of Genomics and Content at genetic diagnostics company Personalis, shared her thoughts on using long-read data to update the human reference genome, and in particular to deal with regions of high structural complexity and large differences between human haplotypes. This is a subject Church has spoken about with Bio-IT World before; in fact, in his opening remarks Hunkapiller quoted an interview we ran with Church in April 2013, in which she said that “if we are truly going to be successful in having genomics affect clinical medicine and we want to understand variation within individuals, we have to have de novo assembly.”
At AGBT, Church noted that the reference genome is essential even when working with de novo assemblies, both as a resource for calling variants, and as a coordinate system for describing those variants. That means missing or confounded sequence in the reference can cause problems for interpretation no matter how scrupulous a new genome may be.
Church touted the addition of many alternate loci in the latest update to the human reference genome, which allow geneticists to consider multiple “paths” through variable regions. She also urged bioinformaticians to update their tools to take these alternate loci into account, something that few groups have done to date. “In aggregate, these alt loci contribute an additional 3.6 megabases of novel sequence that contain 153 unique genes,” said Church. “So if you are not using these sequences in your analyses, you are missing part of the exome, and you are missing some important sequence.”
At the same time, Church acknowledged that the patchwork of alternate loci, in the long term, is not the most efficient way to represent large structural variants across the genome. In a question-and-answer session, she mentioned the Global Alliance for Genomics and Health, which is working on an alternative way to represent chromosomal positions as a branching graph that spans an entire chromosome. “I think this movement to this graph-based representation is really the way we have to go,” she said, “because it allows us to represent this complexity in a much more natural way.” While Church expects it to take some time before this structure is ready to be as widely adopted as the current standards for representing genetic variation, she did say that the alternate loci provide a “graph-lite” approach in the current human reference assembly.
The fourth speaker, Jeong-sun Seo of Seoul National University and Macrogen, presented on a critical new resource for genomics, a diploid assembly of a whole Asian genome. “We have to consider seriously ethnic differences for personalized medicine,” Seo reminded the audience. Ultimately, Seo’s work on this new assembly, of a genome donated by an Altaic Korean individual, is meant to support an Asian Genome Project recruiting 10,000 patient volunteers for whole genome sequencing across South Korea, Japan, China, and Mongolia.
Like Human Longevity, Macrogen has a bank of HiSeq X instruments and has been using a cross-platform approach to generating new reference assemblies. Interestingly, Seo mentioned that his team is also using an Irys device from BioNano, which uses fluorescent markers to map out very large structural variation on the order of hundreds of kilobases. In an interview with Bio-IT World, BioNano CEO Erik Holmlin recently told us that the Irys has been paired with SMRT sequencing but declined to reveal more details; Seo’s presentation offers at least one example of both techniques for getting long-range genomic information being used in parallel.
Highlighting the magnitude of difference between the Korean assembly his group performed and the standard reference genome, Seo noted that on chromosome 20 alone, he was able to pinpoint nearly 500 structural variants, totaling over 210 kilobases inserted or deleted relative to the reference. He also shared one example of a phenotypic difference that appears to be traceable to one of these structural variants, an 8-kilobase insertion in the NINL gene related to pigmentation. “NINL is the most significantly differentially expressed gene between Asians and Caucasians,” Seo observed, a fact that can likely be attributed to this large insertion. Other structural variants that differ widely between ethnic groups are likely to have direct relevance to health and disease risks.
The final speaker was W. Richard McCombie, whose own assembly of interest was the previously mentioned SK-BR-3 cell line, collected from a Her2-positive case of breast cancer. The SK-BR-3 genome is profoundly disordered, so much so that Hunkapiller, introducing McCombie’s talk, said that looking at this genome, “you wonder how in the heck was this thing alive?”
McCombie, much like Myers, believes that short-read sequencing has been a mixed blessing for the genomics community, offering more data than ever before but at the cost of distracting researchers from profoundly important sources of variation. He quoted Evan Eichler’s term “the seduction of next-gen sequencing,” which he called “very appropriate. You can get really good SNP data from a very large number of individual genomes… but you do miss… a lot of the structural variants.”
Turning to the SK-BR-3 genome, McCombie showed some detailed data, derived from SMRT sequencing, on complex translocations between chromosomes 8 and 17, which occurred across multiple different sites on both chromosomes. With more detailed information on exactly how these regions are arranged, which translocations have undergone inversion, and the complete sequence of gene fusions, McCombie’s team is now trying to reconstruct the history of the structural events that have produced the SK-BR-3 chromosome 17, particularly at the locus where the Her2 gene resides. Happily, McCombie announced that all his data on this genome is publicly available online, and that he will soon be releasing methylation data as well, something that can be recovered routinely off SMRT sequencers.
PacBio is still very much a niche player in sequencing, and with a notably lower throughput and higher costs than its competitors, that’s unlikely to change any time soon. Nonetheless, the company has done a remarkable job drawing attention to features like haplotypes and structural variants that cannot be captured by short-read sequencing. While the genomics community never really forgot about these factors, they have been shortchanged in the drive for more and cheaper data in the next-generation sequencing era.
Today, it seems possible that projects like those presented at PacBio’s AGBT workshop are just the leading edge of a cultural shift in genomics toward full representations of genomic variation and more routine use of de novo assembly. The full force of that shift will have to wait for technology that brings long-read data in reach of the average user. But whether that comes from future PacBio instruments, a new contender like Oxford Nanopore, a parallel platform like 10X Genomics, or a combination of all three, this year’s AGBT demonstrated that the groundwork has been laid to make the best use of this data once we have it.
http://www.bio-itworld.com/2015/3/3/pacbio-agbt.html
March 2, 2015
Pacific Biosciences Enables Reference-Quality De Novo Human Genome Assemblies
Researchers at AGBT Conference Present PacBio Data for J. Craig Venter, Asian Genome Reference, Ashkenazi Jewish Reference, and Breast Cancer Genome
MENLO PARK, Calif., March 2, 2015 (GLOBE NEWSWIRE) -- Pacific Biosciences of California, Inc., (Nasdaq:PACB) provider of the PacBio® RS II Sequencing System, today announced that its Single Molecule, Real-Time (SMRT®) Sequencing was featured in a number of presentations during last week's Advances in Genome Biology & Technology (AGBT) conference, including demonstrations of the technology's ability to create reference-quality de novo human genome assemblies.
Presentations at the conference highlighted the power of PacBio's long and accurate sequencing reads to resolve difficult regions and access novel genetic variation. At the company's workshop, J. Craig Venter, Ph.D., of Human Longevity, Inc. presented data about his highly studied genome, which has now been sequenced using the PacBio RS II and assembled in the cloud on the DNAnexus platform, creating a higher resolution version of this reference genome at a fraction of the original time and cost. Deanna Church, Ph.D., who has played a key role in the public efforts to create a human reference genome, discussed the importance of having more high-quality de novo human genomes, and Gene Myers, Ph.D., from the Max Planck Institute discussed his work to develop computational methods to enable perfect assemblies using SMRT Sequencing data. To highlight the importance of population-specific reference genomes, Jeong-Sun Seo, M.D., Ph.D., of the Seoul National University College of Medicine and co-founder of Macrogen, Inc. discussed progress with the Asian Genome Project, which is also using the PacBio RS II for de novo genome assembly of Asian subpopulations. In addition, W. Richard McCombie from Cold Spring Harbor presented analysis of structural re-arrangements and gene amplifications in a breast cancer cell line genome.
"As a result of continual performance improvements with the PacBio RS II, it is now feasible to return to reference-quality de novo human genome assemblies and no longer rely on a single reference genome that does not adequately represent the variation in the global population," said Michael Hunkapiller, Ph.D., CEO of Pacific Biosciences. "With the performance improvements planned for this year, we expect the cost to generate a human genome on the PacBio RS II to drop to around $10,000, which is not a high premium to pay for the superior quality and completeness that SMRT Sequencing provides. This cost will only continue to drop as we maintain our track record of performance improvements."
Evan Eichler, Ph.D., from the University of Washington presented more data about his work on characterizing complex variation in the human genome using SMRT Sequencing. This work was originally published in the journal Nature. In the poster sessions, Mark Salit, Ph.D., from the National Institute of Standards and Technology and Robert Sebra, Ph.D., from the Icahn School of Medicine at Mount Sinai discussed aspects of their collaboration to create a genome reference for the Ashkenazi Jewish population using a mother, father, child trio. In addition, the Genome Reference Consortium presented de novo assemblies for two human cell lines targeted for "platinum-grade" references.
Jonas Korlach, Ph.D., Chief Scientific Officer of Pacific Biosciences, added: "We are excited to see how our customers are using SMRT Sequencing for an increasing number of important human and other complex genome studies, including characterizing variation beyond SNPs, developing population-specific genome references, and resolving the genetic basis of disease. We are also delighted to support the efforts by many in the community to raise the bar on the completeness and quality of genome information."
More information about the data presented at the workshop is available here: http://programs.pacificbiosciences.com/l/1652/2015-02-23/312lbw. To learn more about how to access PacBio de novo genome assembly data using the DNAnexus cloud-based platform, please visit: https://dnanexus.com/falcon.
http://investor.pacificbiosciences.com/releasedetail.cfm?ReleaseID=899111
SKBR3 PacBio Sequencing and Assembly
Cold Spring Harbor Laboratory and Ontario Institute for Cancer Research
Genomic instability is one of the hallmarks of cancer, leading to widespread copy number variations, chromosomal fusions, and other structural variations in many cancers. The breast cancer cell line SK-BR-3 is an important model for HER2+ breast cancers, which are among the most aggressive forms of the disease and affect one in five cases. Through short read sequencing, copy number arrays, and other technologies, the genome of SK-BR-3 is known to be highly rearranged with many copy number variations, including an approximately twenty-fold amplification of the HER2 oncogene, along with numerous other amplifications and deletions. However, these technologies cannot precisely characterize the nature and context of the identified genomic events and other important mutations may be missed altogether because of repeats, multi-mapping reads, and the failure to anchor alignments to both sides of a variation.
To address these challenges, we have sequenced SK-BR-3 using PacBio long read technology. Using the new P6-C4 chemistry, we generated more than 70x coverage of the genome with average read lengths of 9-13kb (max: 71kb). PacBio read coverage is highly correlated with the copy number assignments made using short read sequencing technologies, although the long reads provide more consistent coverage across repetitive elements. Furthermore, using the structural variation analysis program LUMPY and our new hybrid mapping and de novo assembly algorithm for analyzing split-read alignments, we have developed a detailed map of structural variations in this cell line. We have tentatively identified more than 900 intra-chromosomal and 300 inter-chromosomal variations, including many of the previously known gene fusions in SK-BR-3. Taking advantage of the newly identified breakpoints, we have developed an algorithm to reconstruct the mutational history of this cancer genome. From this we have characterized the amplifications of the HER2 region, discovering a complex series of nested duplications and translocations between chr17 and chr8, two of the most frequent translocation partners in primary breast cancers. To our knowledge, this establishes the most complete cancer reference genome to date.
http://schatzlab.cshl.edu/data/skbr3/
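To put the coverage figure in the SK-BR-3 abstract above in perspective, here is a rough back-of-the-envelope sketch of how many reads 70x coverage implies at the quoted average read lengths. The genome size is an assumption (roughly one haploid human genome; SK-BR-3 is heavily amplified and rearranged, so the real figure differs), and the function name is purely illustrative, not taken from any PacBio or CSHL software.

```python
def reads_needed(coverage: int, genome_bp: int, mean_read_bp: int) -> int:
    """Number of reads needed so that coverage * genome_bp bases are sequenced.

    Uses integer ceiling division, so a partial read counts as a whole read.
    """
    total_bases = coverage * genome_bp
    return -(-total_bases // mean_read_bp)


# Assumption: ~3.1 Gb, roughly one haploid human genome (not stated in the abstract).
ASSUMED_GENOME_BP = 3_100_000_000

# The abstract quotes average read lengths of 9-13 kb at more than 70x coverage.
for mean_read_bp in (9_000, 13_000):
    print(mean_read_bp, reads_needed(70, ASSUMED_GENOME_BP, mean_read_bp))
```

Under these assumptions the answer lands in the tens of millions of reads, which gives a sense of the sequencing capacity a project like this requires.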
Tuesday, February 24, 2015--AGBT 2015: Seeing the Genome in a New Light (Sunshine?)
Like many others, we’re looking forward to an exciting week of science and sun at the 16th annual Advances in Genome Biology and Technology (AGBT) conference! We’re hosting a lunch workshop on Friday, February 27, in the Palms Ballroom from 12:00 pm to 2:00 pm EST. We hope you can join us onsite (please reserve your seat) and even if you’re not at the conference, you can watch the live stream.
Here’s the agenda:
Towards Comprehensive Genomics – Past, Present and Future
The Human Genome: From One to One Million
J. Craig Venter, Human Longevity Inc.
Is Perfect Assembly Possible?
Gene Myers, Max Planck Institute
Finishing Genomes: Why Does It Matter?
Deanna Church, Personalis
De Novo Assembly of a Human Diploid Genome for the Asian Genome Project
Jeong-Sun Seo, Macrogen Inc. and Seoul National University College of Medicine
PacBio Long Read Sequencing and Structural Analysis of a Breast Cancer Cell Line
W. Richard McCombie, Cold Spring Harbor Laboratory
After reviewing the packed AGBT agenda, we’ve already spotted several can’t-miss presentations. These speakers and talks look especially promising and we’ll be covering several of them on the blog later this week:
Evan Eichler, University of Washington: “Resolving the Complexity of Human Genetic Variation by Single-Molecule Sequencing”
Matthew Blow, Joint Genome Institute: “Sequencing-Based Approaches for Genome-Scale Functional Annotation”
Tim Smith, U.S. Meat Animal Research Center: “A Genome Assembly of the Domestic Goat from 70x Coverage of Single Molecule Real Time Sequence”
Amy Ly, The Genome Institute at Washington University: “PacBio Application – Influenza Viral RNA-Seq”
Somasekar Seshagiri, Genentech: “Spectrum of Diverse Genomic Alterations Define Non-Clear Cell Renal Carcinoma Subtypes”
Gene Myers, Max Planck Institute: “Low Coverage, Correction-Free Assembly for Long Reads”
AGBT is also known for its excellent poster sessions, and we’ll be spending plenty of time in the poster hall this year. If you’re interested in learning more about SMRT® Sequencing results, be sure to stop by some of these posters.
And if you need a break from the marathon, feel free to put your feet up in our suite (Lanai #189) during our open hours:
Wednesday: 8:00 p.m. – 11:00 p.m.
Thursday: 3:00 p.m. – 6:00 p.m. and 8:00 p.m. – 11:00 p.m.
Friday: 3:00 p.m. – 6:00 p.m.
We look forward to seeing you in Marco Island and for those tuned in at home via our blog for lots of updates and live streaming of the workshop! http://blog.pacificbiosciences.com/
pacific-biosciences Company Profile
Michael Hunkapiller, President & CEO (100% approval rating)
Revenue: $51.1M (trailing four quarters)
Employees: 318
Stock direction: 100% believe Pacific Biosciences will go up
PacBio Forecast 2015----(From The NGS Expert Blogs) As already predicted, Illumina is not the only company announcing innovations for its NGS portfolio. Here you can read about the changes Pacific Biosciences plans this year. I think the good news for many users of PacBio machines is that the company is not talking about new instruments, but about improvements that affect already installed machines (GenomeWeb):
•PacBio plans to improve the sequencing chemistry, including the active loading of single polymerase enzymes onto the chip
•PacBio plans to improve the workflows for an easier and faster handling of samples
•PacBio plans to improve bioinformatics for faster de novo genome assemblies & better full-length HLA analysis
With these changes, PacBio wants to raise the data output to more than 4 gigabases per SMRT Cell and increase the average read lengths to 15-20 kb.
Read more about it here.
I still wonder if there will be news from PacBio this year about a new system? Maybe a benchtop like everyone has?
Join our research scientists as they present two webinars to help you optimize the Iso-Seq™ method to meet your research goals. Each webinar will be presented twice (8 AM PST/5 PM PST) for live viewing and will also be recorded.
•Wednesday Feb 11: Iso-Seq™ Method: Sample Prep and Experimental Design for Full-Length cDNA Sequencing
•Thursday Feb 12: Iso-Seq™ Analysis & Beyond: Advanced Bioinformatics for Transcriptome Sequencing Using Long Reads
Iso-Seq™ Method: Sample Prep and Experimental Design for Full-Length cDNA Sequencing
In this webinar we will present the recent Iso-Seq template preparation protocol updates for creating full-length cDNAs and discuss considerations for experimental design.
Topics Covered
•Why full-length transcript sequences matter
•Overview of Iso-Seq template preparation methods
•Updates to protocol allowing for sequencing of transcripts up to ~10 kb
•Methods for size fractionation
•Applications of the Iso-Seq method
•Targeted transcript sequencing
•Normalization
Who should attend:
Biologists and technicians who are running Iso-Seq experiments and anyone who is interested in learning more about using long read sequencing to study full-length transcripts.
Recommended Pre-reading:
•General information on Isoform Sequencing
•User Bulletin - Guidelines for Preparing cDNA Libraries for Isoform Sequencing (Iso-Seq™)
Click on your preferred time below to register:
•Wednesday, February 11 8:00 a.m. PST
•Wednesday, February 11 5:00 p.m. PST
Presenter Biography
Tyson A. Clark, Ph.D.
Senior Manager, Next Generation Applications and Technologies
In 2002, Tyson received his doctorate degree in Molecular Biology from the University of California, Santa Cruz. He was a pioneer in the use of microarrays to study alternative splicing events on a genome-wide scale. After working in the gene expression space at Affymetrix, Tyson joined Pacific Biosciences in 2009. His energy is focused on the development of new applications for use on the PacBio RS II.
Iso-Seq™ Analysis & Beyond: Advanced Bioinformatics for Transcriptome Sequencing Using Long Reads
In this webinar we will demonstrate how to run the Iso-Seq bioinformatics software pipeline that is part of PacBio's SMRTAnalysis software suite. Both the web portal interface (SMRT Portal) and the command line version will be introduced. In addition, we will be using the community version of Iso-Seq (pbtranscript-tofu) and other community tools to perform additional analyses.
Topics Covered
•Iso-Seq Bioinformatics Pipeline*:
  - running from SMRTPortal web interface
  - running from command line
  - interpreting Iso-Seq output results
  - troubleshooting
•Beyond Iso-Seq**:
  - visualizing isoforms
  - comparing against reference transcript annotations
  - look for fusion gene candidates
  - open reading frame prediction
  - and more...
* Demonstrated using the latest official Iso-Seq protocol from SMRTAnalysis 2.3
** Enabled through use of community tools
Who should attend:
Bioinformaticians interested in running and analyzing PacBio Iso-Seq datasets. Familiarity with Unix and Python is expected.
Recommended Pre-reading:
•Basic understanding of the PacBio sequencing format
•Official Iso-Seq resource page
Click on your preferred time below to register:
•Thursday, February 12 8:00 a.m. PST
•Thursday, February 12 5:00 p.m. PST
Presenter Biography
Elizabeth Tseng
Staff Scientist, Pacific Biosciences
Elizabeth obtained her doctorate degree in Computer Science & Engineering from the University of Washington in 2012. Her thesis work focused on the computational discovery of bacterial non-coding RNAs and gut microbiome. After joining PacBio, she decided to give prokaryotes a break and now supports and develops eukaryotic transcriptome-related collaborations.
PacBio Blog ---
Wednesday, February 4, 2015--High-Quality Genome Assembly and Transcriptome of Cotton Using SMRT Sequencing
A recent research partnership with KeyGene, a Dutch plant genomics and crop improvement company, has resulted in an integrated whole-genome assembly and transcriptome of Gossypium hirsutum, or tetraploid cotton. This is the first known complete assembly for a polyploid crop with a genome larger than 2 Gb.
KeyGene has a long established reputation for generating high-quality data even for very complex genomes. For this project, the cotton genome was sequenced with 38x coverage using Single-Molecule, Real-Time (SMRT®) Sequencing. Assembly of PacBio® long reads reduced the number of contigs from more than 1 million in an existing short-read assembly to fewer than 22,000, representing a 47-fold increase in contiguity.
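As a quick sanity check on the contiguity figure above, the fold change can be worked out from the contig counts. This is a minimal sketch using only the rounded bounds quoted in the post ("more than 1 million" and "fewer than 22,000"); the function name is illustrative. The rounded bounds give roughly 45-fold, so the quoted 47-fold presumably reflects exact contig counts that are not given here.

```python
def fold_reduction(contigs_before: int, contigs_after: int) -> float:
    """How many times fewer contigs the new assembly has than the old one."""
    return contigs_before / contigs_after


# Rounded bounds from the post; the exact counts are not stated.
approx_fold = fold_reduction(1_000_000, 22_000)
print(round(approx_fold, 1))
```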
KeyGene also studied gene expression in cotton, using Pacific Biosciences’ Iso-Seq™ method to generate full-length transcript reads. These were then used for evidence-based annotation of the new reference. They analyzed expression patterns in leaf, stem, and root tissues and discovered novel tissue- and haplotype-specific splice variants. The biological significance of these variants is undergoing investigation at KeyGene.
To improve the reference even further, KeyGene incorporated its proprietary Whole Genome Profiling (WGP™) technology, building a physical map based on partially sequenced BAC fragments. This step further reduced the number of contigs in the final cotton assembly, resulting in the most comprehensive tetraploid cotton reference to date.
KeyGene has incorporated all of the resulting data into its proprietary crop-specific genome database, which will be available to their commercial partners around the world engaged in breeding and genetic improvement of cotton.
For more about KeyGene’s scientific work, check out our case study featuring Michiel van Eijk, the company’s Chief Scientific Officer.
Tuesday, January 20, 2015----Looking Ahead: The 2015 PacBio Technology Roadmap
By Jonas Korlach, Chief Scientific Officer
All of us at Pacific Biosciences are very proud of the momentum SMRT® Sequencing achieved in 2014, especially due to the more than 500 customer publications now in the literature describing its many applications. We remain deeply thankful to all the scientists who have applied our technology to gain new insights into genomes, transcriptomes, and epigenomes. By applying SMRT Sequencing to a wide variety of applications, our customers are demonstrating that long, unbiased reads have brought about new quality standards for many fields of genomic research. This exciting level of scientific activity and collaboration also provides us with important feedback to further optimize and develop sequencing applications for the PacBio® RS II.
In 2015, we plan to continue our track record of delivering improvements in all aspects of SMRT Sequencing. Sample preparation developments include improved and streamlined sample preparation protocols, barcoding solutions for multiplexing many samples in a SMRT Cell run, and protocols for improved yields of very long-insert libraries and full-length cDNA libraries.
With regard to sequencing runs, as was the case in the previous three years, we expect to deliver another ~4-fold increase in throughput, reaching >4 Gb of data per SMRT Cell run, with average read lengths increasing to 15-20 kb. We plan to accomplish this through a combination of improvements in the sequencing chemistry, protocol workflows, and software. An example is active loading to increase the efficiency of loading one polymerase per ZMW at frequencies greater than the Poisson limit. In the area of data analysis, we will continue to work with the bioinformatics community to create faster algorithms for de novo genome assemblies, further developing solutions like our FALCON assembler for resolving diploid or polyploid genomes, and streamlined analysis workflows for other applications such as Iso-Seq™, full-length HLA, and others.
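The "Poisson limit" mentioned above can be made concrete with a short calculation. This is a minimal sketch, assuming polymerases land in ZMWs independently and at random, so that the number per well follows a Poisson distribution with mean λ. Under that assumption the fraction of wells holding exactly one polymerase is λ·e^(−λ), which peaks at λ = 1 with a value of 1/e, or about 37%. The function names are illustrative and not taken from any PacBio software.

```python
import math


def single_occupancy_fraction(mean_per_well: float) -> float:
    """P(exactly one polymerase in a well) under Poisson loading: lam * exp(-lam)."""
    return mean_per_well * math.exp(-mean_per_well)


def best_poisson_loading(steps: int = 1000, max_mean: float = 5.0):
    """Scan loading means on a grid; return (mean, fraction) with the highest single occupancy."""
    grid = [max_mean * i / steps for i in range(1, steps + 1)]
    best_mean = max(grid, key=single_occupancy_fraction)
    return best_mean, single_occupancy_fraction(best_mean)


best_mean, best_fraction = best_poisson_loading()
print(f"best mean loading = {best_mean:.2f}, single-occupancy fraction = {best_fraction:.3f}")
```

Any loading scheme that fills more than about 37% of wells with exactly one polymerase is therefore doing better than random loading allows, which is what loading "at frequencies greater than the Poisson limit" refers to.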
It is exciting to think about the new frontiers in genomics research that will be opened up by continued innovation and performance increases in SMRT Sequencing, for example:
• High-quality population and disease-specific human reference genomes
• Comprehensive views of tissue and disease-specific transcriptome architectures
• High-quality plant and animal reference genomes and transcriptomes
• Comprehensive characterization of structural variation in genomes
• Large-scale microbial genome and epigenome studies
Examples of these successes have already been featured at last week’s Plant & Animal Genome meeting, with over 50 researchers presenting their work on the use of SMRT Sequencing in the plant and animal research space. Of course, we are also looking forward to next month’s AGBT conference, where advances in the human genomics research space will be highlighted as part of the conference program and during our workshop on February 27.
We are excited to interact with many of you at these and other forums as we support the efforts voiced by many in the community to “bring the ‘W’ back into whole-genome sequencing,” e.g. at the NHGRI event held last year on “Future Opportunities for Genome Sequencing and Beyond: A Planning Workshop for the National Human Genome Research Institute.” We wish you continued success in your research, and thank you again for your support!
Balti and Bioinformatics On Air: 21st January 2015
09 Jan 2015
The plan this year for the triumphant Balti and Bioinformatics series is to alternate between virtual, "on-air" meetings (where sadly you will need to provide your own balti curry) and real life ones which will be mainly held in Birmingham, but may be in other places in England or Wales. Ideally I would plan to run 6 meetings a year.
So ... to kick us off:
Balti and Bioinformatics On-Air
This meeting's theme is open data and reproducible bioinformatics.
Please register over at the Google Hangout page: https://plus.google.com/events/cbtuikle0h2619obgjrgfu74424
Wednesday 21st January, 4pm GMT (=11am EST, =8am PST, 00:00 China)
20 minute talks each (interactive Q&A through Google Hangouts enabled)
Draft schedule:
+0m C. Titus Brown, UC Davis: Self-interest: can it be a strategy for convincing scientists to share pre-publication data in a useful way?
+30m Scott Edmunds, GigaScience: New models for open data publishing
+50m Jane Landolin, Pacific Biosciences: Open Pacific Biosciences data for model organisms
+70m Michael Barton, JGI: nucleotid.es for de novo assembly benchmarking and Docker
+90m Nick Loman, U. Birmingham: Nanopore data updates and the "poreathon".
+100m Dave Lunt, University of Hull: ReproPhylo - Reproducible phylogenetics.
+110m Discussion (are we on the right track? Challenges? Containers and VMs - beneficial or the wrong direction?)
From Sage Science-- Posted on December 18, 2014 by Alex--- New PacBio Isoform Sequencing Protocol Recommends SageELF
If you’re performing isoform sequencing on the PacBio platform, check out this new protocol on DevNet. PacBio recommends size fractionation of cDNAs into four pooled fractions using our SageELF. There’s also an optional step in the protocol for larger libraries to use SageELF for the removal of shorter fragments prior to sequencing.
We’re glad to see the new protocol. Scientists are already doing impressive work with long PacBio reads to more accurately assess transcriptomes, and it’s great to know that our instrument can help people achieve insightful results.
To get a better sense of how the instruments function in a pipeline, check out this poster from researchers at the University of Washington and PacBio. It illustrates a gene expression study of various human cell types, yielding some transcripts longer than 10 Kb.
Spoiler alert: the RS 4 is amazing! -- Mick Watson @BioMickWatson · 11 hours ago
Rumours about a new PacBio machine again. Hmmm!
@AndyLarrea @BioMickWatson @PacBio Spoiler alert: the RS 4 is amazing!
— Michael Schatz (@mike_schatz) December 16, 2014
Thursday, December 11, 2014---Review Article: Long-Read Sequencing Offers Better Understanding of Pluripotency
A new review article offers a nice overview of attempts to characterize the transcriptome of human stem cells using RNA-seq, the Iso-Seq™ method, and more. Kin Fai Au and Vittorio Sebastiano, scientists at the University of Iowa and Stanford University, respectively, contributed the review to Current Opinion in Genetics & Development.
“The introduction of the RNA-Seq technology based on [second-generation sequencing technology] has provided a remarkable step forward providing a fast and inexpensive way to determine the transcriptome of a given cell type and several remarkable works have been done using this type of approach,” Au and Sebastiano write. “Nonetheless tasks like de novo discovery of genes, gene isoforms assembly or transcript and isoform abundance determination are still challenging and far from being achieved.”
They report on a previous paper from Au in which Single Molecule, Real-Time (SMRT®) Sequencing was combined with short-read sequencing to detect isoforms in a well-characterized human embryonic stem cell line. Long reads led to the detection of hundreds of novel isoforms and long noncoding RNAs. Long intergenic noncoding RNAs (lincRNAs) are a topic of interest in the review article, where Au and Sebastiano note that they “have a very high degree of repetitive elements and it is therefore extremely challenging to determine the correct gene annotation and the abundance due to the difficulties in aligning short read data to the genome.” With long-read sequencing, they add, sequence data spans unique sections of the lincRNAs and makes it possible to accurately map reads to the correct region.
The authors cite recent studies demonstrating more transcriptional activity in the human genome than has been expected. “Transcription occurs across 80–90% of the human genome, in contrast with the assumption that only 3% (or less) of the genome is actually coding for proteins,” they write. LincRNAs and other noncoding RNAs may explain the difference between those numbers.
Au and Sebastiano call for more studies of stem cells using long-read sequencing technology to establish a better view of the transcriptional activity in these important cells with accurate detection of noncoding RNAs characterized by highly repetitive sequence. “Given such complexity of the epigenetic status for most of the genes, it is essential to identify the transcripts and the isoforms that are indeed functionally relevant (even if expressed at low levels) in [pluripotent stem cells],” they write. http://blog.pacificbiosciences.com/2014/12/review-article-long-read-sequencing.html
Tuesday, September 3, 2013--- New Data Release: Arabidopsis Assembly Offers Glimpse of De Novo SMRT Sequencing for Larger Genomes
Advances in our chemistries, throughput, and read length are pushing the envelope in the way we tackle larger genomes. We recently sequenced the Landsberg erecta ecotype (Ler-0) of Arabidopsis thaliana and produced a successful assembly solely using PacBio® data. The data set resulting from this sequencing effort and assembly using SMRT® Portal is now available via Devnet for anyone who wants to give it a test drive.
A few stats on Arabidopsis and the assembly using PacBio sequence data:
Genome size: 124.6 Mb
GC content: 33.92%
Raw data: 11 Gb
Assembly coverage: 15.37x
Polished Contigs: 540
Max Contig Length: 12.98 Mb
N50 Contig Length: 6.19 Mb
Sum of Contig Lengths: 124.57 Mb
Arabidopsis thaliana Ler-0 was sequenced using our latest P4 enzyme and C2 chemistry with a 20 Kb insert library; size selection was performed with an 8 Kb to 50 Kb elution window on a BluePippin™ device from Sage Science. We generated 11 Gb of unfiltered bases for the assembly, and used a seed read cutoff of just over 9,000 bases for preassembled reads. Assembly of the genome was performed using SMRT Portal 2.0, including polishing with Quiver. Our scientists were pleased to see that our currently available bioinformatics platform, which has demonstrated consistent utility in building high-quality assemblies for microbial genomes, worked beautifully for the more complex Arabidopsis genome as well.
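For readers unfamiliar with the N50 statistic quoted in the list above, here is a minimal sketch of how it is usually computed from a list of contig lengths. The toy lengths are hypothetical and are not taken from the Arabidopsis assembly; the function name is mine, not part of SMRT Portal.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))  # prints 8
```

A high N50 relative to the genome size, as in the stats above, means that most of the assembly sits in a small number of long contigs.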
We have released the input files including the preassembled reads into the Celera® Assembler for those interested in running the bioinformatics analysis and evaluation. Along with the assembled Celera results, we have also included the Quiver polished assembly for comparison. More information on the HGAP approach can be found in the recently published paper in Nature Methods.
Here are some graphs summarizing the quality metrics of this genome assembly:
[Figure: Distribution of sequencing coverage across the assembled Arabidopsis Ler-0 genome]
http://blog.pacificbiosciences.com/2013/08/new-data-release-arabidopsis-assembly.html
Thursday, December 4, 2014-- A New Reference Genome for Shigella: SMRT Sequencing of a Historic Sample
In a special issue of The Lancet dedicated to World War I, an article by scientists from the Wellcome Trust Sanger Institute used Single Molecule, Real-Time (SMRT®) Sequencing to decode the genome of the first isolate ever collected of Shigella flexneri.
The bacterium, a descendant of E. coli and first identified as a separate strain in 1902, was responsible for severe dysentery among World War I troops due to poor hygienic conditions in the trenches. Today, S. flexneri is one of the leading causes of diarrheal death among children in developing countries and other areas of poor sanitation.
Hoping to learn more about the evolution of S. flexneri in the last hundred years, the Sanger team used the PacBio® RS II to construct the reference genome of an isolate collected from a British soldier in 1915. They then compared that genome sequence to other isolates collected in the time since, finding that the pathogen’s acquisition of new genetic material has almost exclusively centered on heightened virulence and increased antimicrobial resistance.
In a video describing the work, lead author Kate Baker explained the need to use PacBio long-read sequencing for this project. “Normally when we sequence a bacterial genome, we shred it into sections and we try and put it back together like a jigsaw. With Shigella, though, throughout their genome they have hundreds of repeated elements so when we try to put the jigsaw back together it’s not obvious how we reconstruct the genome,” she said. “Understanding these repeat elements in Shigella is actually a really important part of their evolution because it’s what allows them to exchange DNA with other bacteria.”
With SMRT Sequencing, Baker and her team got a comprehensive look at the Shigella genome. As Baker and her collaborators compared the genomes representing different evolutionary points, they found that modern isolates had more pathogenicity islands and new virulence factors, like Shigella enterotoxin. They also determined that the 1915 isolate was already resistant to penicillin and erythromycin. More results from their work can be found in the paper: “The extant World War 1 dysentery bacillus NCTC1: a genomic analysis.”
Interestingly, the 1915 S. flexneri isolate included in this project was the very first contribution to Public Health England’s National Collection of Type Cultures, the longest-running collection of human pathogen samples in the world. Dubbed NCTC1, the sample was taken from a soldier who died in France. As part of their work, the Sanger scientists traced the sample’s history and for the first time identified the soldier it came from as Private Ernest Cable, who died on March 13, 1915.
This work is part of an ongoing collaboration with the Wellcome Trust Sanger Institute and Public Health England to complete the genome sequences of 3,000 bacterial strains from samples at NCTC.
http://blog.pacificbiosciences.com/2014/12/a-new-reference-genome-for-shigella.html
Large Genome Assembly with PacBio Long Reads
lhon edited this page Nov 13, 2014 · 53 revisions
PacBio long reads can be used in a number of ways to generate and improve de novo assemblies for large genomes. You can take several different approaches:
1. PacBio-only de novo assembly. Using just PacBio reads from a long insert library, the reads are often preprocessed before being assembled using an Overlap-Layout-Consensus algorithm. The best known implementation of this is HGAP.
2. Hybrid de novo assembly. Using a combination of PacBio and short read data, the reads are used together during assembly to generate a hybrid assembly.
3. Gap filling. Starting with an existing mate-pair based assembly, the internal gaps (consisting of Ns) inside the scaffolds are filled using PacBio sequences.
4. Scaffolding. Using an existing assembly (such as an assembly based on short read data), PacBio reads are used to join contigs.
Figure 1. Illustration of PacBio assembly approaches
Below we discuss what software is available, choosing software, and additional considerations.
Software Options
Name: Description
PacBio-only
HGAP: A workflow to first preassemble reads, assemble the preassembled reads using Celera® Assembler, then polish using Quiver.
• Supports up to 100 Mb from SMRT Portal, which is part of SMRT Analysis.
• Larger genomes are possible from the command line using either smrtpipe.py or the Makefile-based smrtmake.
HBAR-DTK: An experimental toolkit for running HGAP-style assemblies.
Falcon: An experimental diploid assembler, tested on ~100 Mb genomes. 2014 AGBT presentation by Jason Chin.
PBcR self-correction: A mode within PBcR (aka pacBioToCA) to do self-correction in the same style as HGAP. Celera® Assembler 8.2 uses the MHAP algorithm for faster overlap calculation during the self-correction phase.
Celera® Assembler: Celera® Assembler 8.1 now offers a way to directly assemble subreads.
Sprai: A preassembly-based assembler that aims to generate longer contigs.
Hybrid
pacBioToCA: An error correction module in Celera® Assembler originally designed to align short reads to PacBio reads and generate consensus sequences. These error corrected reads can then be assembled by Celera® Assembler.
ECTools: A set of tools that uses contigs instead of short reads for correction.
Spades: A short read assembler that added PacBio hybrid assembly support as of version 3.0.
Cerulean: Cerulean starts with an assembly graph from Abyss and extends contigs by resolving bubbles in the graph using PacBio long reads. Was successfully run on genomes <100 Mb.
dbg2olc: dbg2olc uses Illumina contigs as anchors to build an overlap graph with PacBio reads, allowing very fast performance.
Gap Filling
PBJelly 2: PBJelly upgrades genomes by using PacBio reads to fill in gaps in scaffolds. Has been shown to work with genomes >1 Gb. Part of the PBSuite of applications including PB Honey. See also PAG 2014: Kim Worley, "Improving Genomes using Long Reads and PB Jelly 2".
Scaffolding
AHA: AHA ("A Hybrid Assembler") is designed to join existing contigs using PacBio reads. Limited to genomes less than 200 Mb; part of SMRT Analysis.
PBJelly 2: The new version of PBJelly has support for joining scaffolds.
Considerations
Coverage and Choosing Software
The choice of algorithms depends on how much PacBio sequencing can be obtained and what types of short read data are available. We recommend PacBio-only de novo assembly when it is possible to get at least 50X PacBio coverage. HGAP can work with the minimum recommended coverage, but with higher coverage a greater number of the longest reads becomes available for assembly. For larger genomes, PBcR in Celera Assembler 8.2 beta uses MHAP, which offers faster assembly times.
For a hybrid assembly involving both PacBio and short read sequencing, PBcR and ECTools can work well with around 20X PacBio coverage. If a high quality set of scaffolds exists, then PBJelly 2 can be used. We recommend at least 5X PacBio coverage to fill gaps; higher coverage enables better consensuses in gap filled regions and increases the number of addressable gaps, as random sampling at lower coverage can lead to coverage gaps.
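As a rough illustration of the coverage guidance above, the sketch below computes fold coverage and maps it onto the three thresholds mentioned in the text (50X, around 20X, and 5X). The function names and the example numbers are my own assumptions, and the cutoffs are treated as hard limits only for simplicity; the wiki presents them as recommendations.

```python
def fold_coverage(total_bases, genome_size):
    """Fold coverage = sequenced bases divided by genome size."""
    return total_bases / genome_size


def suggest_approaches(pacbio_coverage, has_short_reads=False, has_scaffolds=False):
    """Map PacBio coverage onto the approaches described in the text.

    Thresholds follow the wiki's recommendations; using them as strict
    cutoffs is a simplifying assumption for this sketch.
    """
    suggestions = []
    if pacbio_coverage >= 50:
        suggestions.append("PacBio-only de novo assembly")
    if has_short_reads and pacbio_coverage >= 20:
        suggestions.append("hybrid assembly")
    if has_scaffolds and pacbio_coverage >= 5:
        suggestions.append("gap filling")
    return suggestions


# Hypothetical project: 30 Gb of PacBio data for a 1 Gb genome, plus short reads.
cov = fold_coverage(30e9, 1e9)
print(cov, suggest_approaches(cov, has_short_reads=True))
```

With these made-up inputs the coverage is 30X, so only the hybrid route clears its threshold.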
Figure 2. PacBio algorithm suggestions from a PAG 2014 presentation by Mike Schatz
Repetitive Content
One of the biggest challenges with de novo assembly is repeat content. In general, the solution is to work with insert sizes that can span repeats and identify unique anchoring sequence on each side. PacBio long reads are uniquely useful in sequencing long inserts, given that they can read from one end of the insert to the other.
Ploidy
Most existing assemblers were designed for haploid genomes. When a diploid genome has little structural variation between the chromosome copies, then a haploid approach can work well, with the occasional structural heterozygosity appearing as separate contigs. In diploid genomes with larger structural variation or polyploid genomes, assemblies based on haploid assemblers are increasingly fragmented. For these genomes, consider Falcon, though it is considered experimental code. Note also that Celera® Assembler can be configured to favor merging haplotypes.
If possible, select strains to minimize heterozygosity, which helps facilitate assembly. This includes using inbred lines, double haploid strains, and other effectively haploid genomes. For example, the human hydatidiform mole that was sequenced is a double haploid genome.
Coverage Bias with Short-Read Data
Short read data has coverage bias in regions with extreme GC composition because short read technologies require amplification. Even if PCR-free sample preparation methods are used, ultimately there is bridge amplification during sequencing.
In addition, with error correction approaches such as PBcR, short reads made of simple repeats are difficult to use given that the kmers used to seed overlaps are at high frequency and thus often filtered out (see PAG 2014, Mike Schatz slide 12).
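To make the point about high-frequency k-mers concrete, here is a small sketch that counts k-mers across a set of reads and drops any seed whose count exceeds a cutoff. The reads, the value of k, and the cutoff are hypothetical, and this is not how PBcR itself implements its filter; it only shows why a read made of a simple repeat can end up with no usable seeds.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def usable_seeds(read, counts, k, max_count):
    """Keep only the k-mers of `read` that are not over-represented."""
    return [
        read[i:i + k]
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] <= max_count
    ]


# Hypothetical short reads: two copies of a simple repeat and one unique read.
reads = ["ATATATATAT", "GATTACAGGC", "ATATATATAT"]
counts = count_kmers(reads, k=4)
print(usable_seeds("ATATATATAT", counts, k=4, max_count=2))  # prints []
print(len(usable_seeds("GATTACAGGC", counts, k=4, max_count=2)))  # prints 7
```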
Computational Requirements
De novo assembly algorithms using PacBio reads generally use an overlap-layout-consensus approach to arrange long reads (for example Celera® Assembler, which HGAP and pacBioToCA both use). Because the overlap phase requires an all-by-all alignment, computation time scales quadratically with the genome size. Assembling genomes approaching one gigabase or larger therefore requires significant computational resources. For example, the initial overlap step in preassembly for the 54X human assembly required 405,000 CPU hours. Compute times are also described in the pacBioToCA-based drosophila assembly. There are efforts to reduce the computational burden, such as Dazzler (blog) and MHAP (blog post, webinar).
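A back-of-the-envelope sketch of the quadratic scaling: at fixed coverage and read length, the number of reads grows linearly with genome size, so the number of read pairs in a naive all-by-all comparison grows with its square. The 50X coverage and 10 kb mean read length below are assumed values for illustration, and real overlappers avoid comparing every pair.

```python
def reads_needed(genome_size, coverage, mean_read_length):
    """Approximate number of reads for a given fold coverage."""
    return round(genome_size * coverage / mean_read_length)


def pairwise_comparisons(num_reads):
    """Number of unordered read pairs in a naive all-by-all comparison."""
    return num_reads * (num_reads - 1) // 2


# Assumed values: 50X coverage, 10 kb mean read length.
for genome_size in (100_000_000, 1_000_000_000):
    n = reads_needed(genome_size, coverage=50, mean_read_length=10_000)
    print(genome_size, n, pairwise_comparisons(n))
```

Under these assumptions, going from a 100 Mb to a 1 Gb genome multiplies the read count by 10 and the number of candidate pairs by roughly 100.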
Hybrid assembly using PBcR also adds a layer of computational complexity, since aligning 100X of short reads to PacBio reads is a computationally intensive task. One way to reduce computational time is to align short read contigs to PacBio reads, such as through ectools, which effectively compresses down the short read data. This type of approach also has the advantage of increasing the mappability of short read data, since assembled contigs are longer than the individual reads.
Draft Genome Quality
Gap filling of mate pair-based scaffolded assemblies is particularly sensitive to the quality of the starting assembly. When aligning PacBio reads across gaps in the scaffolds, misassemblies in the scaffolds can result in improper alignments and incorrectly-filled or unfillable gaps.
Large insert libraries
Even though this is a discussion of assembly algorithms, key to a successful assembly is the longest reads possible through careful sample preparation. We recommend the largest insert libraries possible (e.g. 20 kb) using BluePippin™ size selection (see 20 kb Template Preparation Using BluePippin Size-Selection) and sequencing with the P6-C4 chemistry.
Datasets and Example Projects
• Human dataset, see also blog post
• 2014 PAG presentation by Allen Van Deynze discussing the spinach assembly
• Arabidopsis dataset
• Drosophila dataset, see also this blog post
• Saccharomyces cerevisiae dataset and assembly
• Neurospora crassa (Fungus) Genome, Epigenome, and Transcriptome, see also this poster at AGBT 2014
• Other datasets
Additional Links
• http://www.homolog.us/blogs/blog/2014/02/21/opinionated-history-genome-assembly-algorithms/
• PAG 2014: Michael Schatz, “De novo assembly of complex genomes using single molecule sequencing”
• 2014 AGBT presentation by Richard McCombie discussing the assembly of rice and yeast, including a coverage titration of the Arabidopsis dataset and assembly performance of ectools versus HGAP
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
Posted on November 27, 2014 by lexnederbragt
My review of “Long-read, whole-genome shotgun sequence data for five model organisms”
Two days ago, a paper appeared in Nature Scientific Data by Kristi Kim et al, titled “Long-read, whole-genome shotgun sequence data for five model organisms”. This paper describes the release of whole-genome PacBio data by Pacific Biosciences and others, for five model organisms, Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster, using quite recent chemistries.
Beyond the datasets described in the paper, Pacific Biosciences also released whole-genome data for the human genome, and very recently, for Caenorhabditis elegans using the latest P6/C4 chemistry. Check out PacBio devnet, also for data for other applications.
I think it is fantastic that Pacific Biosciences releases these datasets as a service to the community – and obviously to showcase their technology. Company-generated data often represents the best possible data, as it is produced by people with a great deal of experience with the technology. It remains to be seen if ‘regular’ owners of PacBio RS II instruments can reach the same level of data quality. Nonetheless, these datasets are very helpful for teaching (see my previous blog post), comparisons with other technologies (I wish I could make time to thoroughly compare PacBio data to Moleculo data available from the same species), as well as development of new software applications.
I was a reviewer for the Nature Scientific Data paper. My full review report can be found on publons. Here I reprint the first paragraph.
I am happy to say that the authors addressed all the points that I raised in my review report in the final published version.
--------------------------------------------------------------------------------
This Data Descriptor manuscript describes eight PacBio sequence datasets from five model organisms sequenced using the latest chemistries. The data is already available for the research community; the manuscript provides the necessary background for understanding how the data was generated and analysed. These sequencing datasets represent a highly valuable contribution to the community. Tools for working with the data from the PacBio RS are being developed at an increasing rate, and testing these tools requires high-quality data in combination with a (very close) reference genome sequence. The data described in this manuscript provide just that. I applaud Pacific Biosciences, and the authors and research groups involved, in releasing these valuable datasets without restrictions. Such releases greatly speed up research, greatly enable the development and testing of new software and applications, and are a fantastic tool for teaching purposes. I hope other companies in a similar position follow suit.
Continue reading at publons.
https://flxlexblog.wordpress.com/2014/11/27/my-review-of-long-read-whole-genome-shotgun-sequence-data-for-five-model-organisms/
Published online 25 November 2014 -- Long-read, whole-genome shotgun sequence data for five model organisms
http://www.nature.com/articles/sdata201445
Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing
Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James Drake, Jane M Landolin, Adam M Phillippy
doi: http://dx.doi.org/10.1101/008003
Affiliations: Konstantin Berlin (University of Maryland, College Park, MD); Sergey Koren (National Biodefense Analysis and Countermeasures Center); Chen-Shan Chin (Pacific Biosciences); James Drake (Pacific Biosciences); Jane M Landolin (Pacific Biosciences); Adam M Phillippy (National Biodefense Analysis and Countermeasures Center)
Abstract
We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
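The abstract names MinHash as the core idea behind MHAP. The sketch below is a generic, minimal MinHash over k-mer sets, not the MHAP implementation: it estimates the Jaccard similarity of two sequences by comparing per-seed minimum hash values. The k-mer size, number of hashes, the use of salted BLAKE2b as the hash family, and the example sequences are all my own assumptions.

```python
import hashlib


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """One member of a hash family, selected by `seed` (salted BLAKE2b)."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=64):
    """For each hash function, keep the minimum hash over the k-mer set."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching minima approximates the Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Hypothetical reads sharing a common stretch of sequence.
read_a = "ACGTTGCAAGGCTTAACCGGATATCGCGATTACAGG"
read_b = "GGCTTAACCGGATATCGCGATTACAGGTTTACCGAT"
print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

The appeal of the approach is that two fixed-size sketches can be compared quickly, so likely overlaps can be shortlisted before any expensive alignment is attempted.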
http://biorxiv.org/content/early/2014/08/14/008003
2014 Nov 19;14(1):289. [Epub ahead of print]
A precise chloroplast genome of Nelumbo nucifera (Nelumbonaceae) evaluated with Sanger, Illumina MiSeq, and PacBio RS II sequencing platforms: insight into the plastid evolution of basal eudicots.
Wu Z, Gui S, Quan Z, Pan L, Wang S, Ke W, Liang D, Ding Y.
Abstract
Background: The chloroplast genome is important for plant development and plant evolution. Nelumbo nucifera is one member of relict plants surviving from the late Cretaceous. Recently, a new sequencing platform PacBio RS II, known as 'SMRT (Single Molecule, Real-Time) sequencing', has been developed. Using the SMRT sequencing to investigate the chloroplast genome of N. nucifera will help to elucidate the plastid evolution of basal eudicots.
Results: The sizes of the de novo assembled complete chloroplast genome of N. nucifera were 163,307 bp, 163,747 bp and 163,600 bp with average depths of coverage of 7×, 712× and 105× sequenced by Sanger, Illumina MiSeq and PacBio RS II, respectively. The precise chloroplast genome of N. nucifera was obtained from PacBio RS II data proofread by Illumina MiSeq reads, with a quadripartite structure containing a large single copy region (91,846 bp) and a small single copy region (19,626 bp) separated by two inverted repeat regions (26,064 bp). The genome contains 113 different genes, including four distinct rRNAs, 30 distinct tRNAs and 79 distinct peptide-coding genes. A phylogenetic analysis of 133 taxa from 56 orders indicated that Nelumbo with an age of 177 million years is a sister clade to Platanus, which belongs to the basal eudicots. Basal eudicots began to emerge during the early Jurassic with estimated divergence times at 197 million years using MCMCTree. IR expansions/contractions within the basal eudicots seem to have occurred independently.
Conclusions: Because of long reads and lack of bias in coverage of AT-rich regions, PacBio RS II showed a great promise for highly accurate 'finished' genomes, especially for a de novo assembly of genomes. N. nucifera is one member of basal eudicots, however, evolutionary analyses of IR structural variations of N. nucifera and other basal eudicots suggested that IR expansions/contractions occurred independently in these basal eudicots or were caused by independent insertions and deletions. The precise chloroplast genome of N. nucifera will present new information for structural variation of chloroplast genomes and provide new insight into the evolution of basal eudicots at the primary sequence and structural level.
PMID:25407166[PubMed - as supplied by publisher] http://www.ncbi.nlm.nih.gov/pubmed/25407166?dopt=Abstract&utm_source=dlvr.it&utm_medium=twitter
18 November 2014
"A true platinum sequence will be assembled from just one genome, however, because only then can scientists be sure there are no remaining gaps. To this end, a team led by Richard Wilson at Washington University in St. Louis, Missouri, reported a draft sequence of the entire CHM1 genome earlier this month (K. M. Steinberg et al. Genome Res. http://doi.org/w7b; 2014). Researchers at the firm Pacific Biosciences in Menlo Park, California, are similarly working on the whole CHM1 genome, but are using sequencers that work with longer stretches of uninterrupted DNA, and so produce fewer gaps than typical sequencers. The firm released a draft genome assembly in February. The hope is that the method will speed up the platinum genome’s arrival."
nature.com---- ‘Platinum’ genome takes on disease
Disease sites targeted in assembly of more-complete version of the human genome sequence.
Ewen Callaway 18 November 2014
Geneticists have a dirty little secret. More than a decade after the official completion of the Human Genome Project, and despite the publication of multiple updates, the sequence still has hundreds of gaps, many in regions linked to disease. Now, several research efforts are closing in on a truly complete human genome sequence, called the platinum genome.
“It’s like mapping Europe and somebody says, ‘Oh, there’s Norway. I really don’t want to have to do the fjords’,” says Ewan Birney, a computational biologist at the European Bioinformatics Institute near Cambridge, UK, who was involved in the Human Genome Project. “Now somebody’s in there and mapping the fjords.”
The efforts, which rely on the DNA from peculiar cellular growths, are uncovering DNA sequences not found in the official human genome sequence that have potential links to conditions such as autism and the neuro-degenerative disease amyotrophic lateral sclerosis (ALS).
In 2000, then US President Bill Clinton joined leading scientists to unveil a draft human genome. Three years later, the project was declared finished. But there were caveats: that human ‘reference’ genome was more than 99% complete, but researchers could not get to 100% because of method limitations.
Sequencing machines cannot process entire chromosomes, so scientists must first make many identical copies of the DNA and cut them into short stretches, with the breaks in different places. After sequencing, a computer program looks for overlapping patterns to ‘stitch’ the resulting segments back together.
This approach worked for most of the genome, because DNA sequences are almost identical across its three billion ‘letters’ (the As, Cs, Ts and Gs). But in some parts, big differences exist between the versions of chromosomes that an individual inherits from the mother and father. Attempts to stitch together these regions to sequence the DNA led to gaps when the differing sequences gave conflicting solutions.
The problem can be likened to assembling a single jigsaw puzzle from the mixed-up pieces of similar, but not identical, puzzles. If one puzzle piece is identical across the sets, any copy of it will do. But if one set contains a much larger version of the matching piece, or if a piece is missing, the puzzle will not fit together. In particular, long, repetitive stretches near genes vexed the computer algorithms used to analyse the data. And the problem was made worse because DNA from multiple people was used, adding to the variation between the genomes.
As a result, when a person’s genome is sequenced — for instance, to look for the cause of a disease — crucial bits of DNA may be overlooked because they do not have counter-parts in the published genome. “There’s a whole level of genetic variation that we’re missing,” says Evan Eichler, a genome scientist at the University of Washington in Seattle, a leading proponent of the platinum-genome efforts. To plug the gaps, researchers need a supply of human cells with just a single version of each chromosome, to remove the possibility of conflicting solutions — a single set of puzzle pieces, in other words.
Sperm and egg cells contain a single copy of each chromosome, but these cells cannot divide and produce copies of themselves. So in recent years, geneticists have turned to cells from growths called hydatidiform moles, created when a sperm fertilizes an egg that is missing its own genetic material (see ‘To simplify a sequence’). The fertilized cell copies its genome and starts dividing, just as the cells in a normal fertilized egg would. The resulting ball of cells, which is usually removed in the first trimester of pregnancy, contains identical copies of each human chromosome.
Cells taken from one such mole were used in the early 1990s to create a cell line called CHM1. In a Nature paper published on 10 November, Eichler and his colleagues describe how they used sections of the CHM1 genome to fill about 50 especially troublesome holes in the official human genome sequence. They also shortened many more gaps, including in genes linked to ALS and Fragile X syndrome, a neuro-developmental disease with autism-like symptoms (M. J. P. Chaisson et al. Nature http://doi.org/w69; 2014). In total, the team mapped around 1 million DNA letters that were missing in the original reference genome.
A true platinum sequence will be assembled from just one genome, however, because only then can scientists be sure there are no remaining gaps. To this end, a team led by Richard Wilson at Washington University in St. Louis, Missouri, reported a draft sequence of the entire CHM1 genome earlier this month (K. M. Steinberg et al. Genome Res. http://doi.org/w7b; 2014). Researchers at the firm Pacific Biosciences in Menlo Park, California, are similarly working on the whole CHM1 genome, but are using sequencers that work with longer stretches of uninterrupted DNA, and so produce fewer gaps than typical sequencers. The firm released a draft genome assembly in February. The hope is that the method will speed up the platinum genome’s arrival.
“The chances of actually achieving this, for one genome, are looking much better”, says Deanna Church, a genome scientist at the firm Personalis in Menlo Park. Still, Birney says that the human reference genome is more about “constant improvement” than completion. “For sure, somebody’s going to be fiddling around with this in 10–20 years’ time.”
Journal name: Nature. Volume: 515. Pages: 323. Date published: 20 November 2014.
http://www.nature.com/news/platinum-genome-takes-on-disease-1.16375
Published online 10 November 2014-- Nature | Letter
Resolving the complexity of the human genome using single-molecule sequencing
Mark J. P. Chaisson, John Huddleston, Megan Y. Dennis, Peter H. Sudmant, Maika Malig, Fereydoun Hormozdiari, Francesca Antonacci, Urvashi Surti, Richard Sandstrom, Matthew Boitano, Jane M. Landolin, John A. Stamatoyannopoulos, Michael W. Hunkapiller, Jonas Korlach & Evan E. Eichler
The human genome is arguably the most complete mammalian reference assembly [1–3], yet more than 160 euchromatic gaps remain [4–6] and aspects of its structural variation remain poorly understood ten years after its completion [7–9]. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing [10]. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.
(For "At a glance", open the link below.)
http://www.nature.com/nature/journal/vaop/ncurrent/full/nature13907.html --- Researchers Use PacBio Sequencing to Create More Complete Human Genome Reference -- http://finance.yahoo.com/news/researchers-pacbio-sequencing-create-more-160000801.html