LT.Swing trade!
Wednesday, December 11, 2013
Data Release: Human MCF-7 Transcriptome
Understanding the biology of a genome requires knowing the full complement of mRNA isoforms. In recent years, microarrays, high-throughput cDNA sequencing, and RNA-seq have become very useful tools for studying transcriptomes. High-throughput cDNA sequencing is accurate but laborious, while the inherently complex nature of the transcriptome makes transcript assembly computationally intractable. Recently, Steijger et al. (1) showed that complete isoform reconstruction from RNA-seq short-read data remains challenging even when all constituent exons are identified.
A number of recent publications have demonstrated the utility of full-length transcript sequencing by taking advantage of the long read lengths of SMRT® Sequencing technology (2)–(4). SMRT Sequencing produces reads that originate from independent observations of single molecules; no assembly is needed if a read spans the entire length of the transcript. To demonstrate the capabilities of PacBio® Isoform Sequencing (Iso-Seq) technology and show a glimpse of the complexity of eukaryotic transcriptomes, we generated a deep dataset of full-length cDNA sequencing of RNA from MCF-7, a human breast cancer cell line. The sequencing data was collected from several internal training sessions where different library preparation techniques were tested. We are releasing the underlying data in an effort to aid the design of future PacBio Iso-Seq experiments and to spur advances in the development of bioinformatics tools for analyzing full-length transcripts.
In our final dataset, we obtained 44,531 non-redundant transcript-length consensus sequences ranging from 400 bp to 4,900 bp, with an average length of 1,929 bp (Fig. 1a). The total percentage of consensus bases that disagreed with the hg19 genome is 0.27%, out of which 0.16% are due to substitutions and thus could likely be true SNPs (Fig. 1b). About half of the transcribed loci have one observed isoform, while most of the rest have 2-5 isoforms (Fig. 2). We compared our predicted full-length transcripts against the known annotations and found that we were able to recover full-length alternative splice forms (Fig. 3), alternative polyadenylation, novel transcripts, and known fusion genes (Fig. 4). We encourage interested researchers to explore the dataset.
Materials & Methods
Full-length cDNA was generated from polyA RNA using standard cDNA synthesis kits (Clontech® SMARTer™ and Invitrogen® Superscript® kits). To capture longer, rarer transcripts in sufficient abundance, parts of the double-stranded cDNA were size selected into three fractions, which were subsequently amplified and converted into SMRTbell™ templates. Details on the sample preparation can be found on Sample Net. SMRTbell libraries were sequenced using the P4-C2 sequencing chemistry with 2-hour movies.
After sequencing, we computationally determined the completeness of the sequences using polyA-tail signals and library adapters. To obtain a non-redundant set of full-length, high-quality transcript sequences without bias from other sequencing platforms, we developed a de novo, isoform-level clustering algorithm that uses only PacBio data. Briefly, the algorithm iteratively clusters reads to generate consensus sequences that represent the original transcripts. The algorithm takes into account the existence of the polyA-tail signal to differentiate isoforms with alternative stop sites. The final consensus sequences were called using Quiver and filtered to create the final polished, full-length, non-redundant dataset. Details of the clustering algorithm will be described in two upcoming webinars on Wednesday, January 22 at 8 AM PST and 5 PM PST.
Some statistics from the sequencing and results are listed below:
• Number of SMRT Cells: 119
  • No size selection: 12
  • 1-2 kb: 37
  • 2-3 kb: 37
  • > 3 kb: 33
• Total number of post-filtered bases: 14,062,161,755
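As a quick consistency check on the figures above, the per-fraction SMRT Cell counts can be summed and a mean yield per cell derived. This is only a sketch: the dictionary keys and variable names are labels chosen here, and the per-cell mean is a derived number, not one reported in the post.

```python
# SMRT Cell counts per size fraction, as listed in the post.
cells_per_fraction = {
    "no size selection": 12,
    "1-2 kb": 37,
    "2-3 kb": 37,
    "> 3 kb": 33,
}
reported_total_cells = 119
post_filter_bases = 14_062_161_755

# Sum the fractions and compare with the reported total.
total_cells = sum(cells_per_fraction.values())

# Derived figure (not stated in the post): average yield per SMRT Cell.
mean_bases_per_cell = post_filter_bases / total_cells

print(f"cells summed from fractions: {total_cells}")
print(f"matches reported total: {total_cells == reported_total_cells}")
print(f"mean post-filter bases per cell: {mean_bases_per_cell:,.0f}")
```

The mean hides the fact that yield probably differs between size fractions, so it is best read as a rough planning number.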
Figure 1. (see link) (a) Length distribution of polished, non-redundant transcript sequences. Each transcript sequence represents a unique isoform. (b) Breakdown of differences to hg19. Consensus sequences were mapped to hg19 using GMAP (version 2013-07-20) with default parameters. Different error categories were aggregated over all 44,531 transcript sequences. Some errors are likely to be due to real biological differences from the reference sequence.
Figure 2. (see link) Number of isoforms per locus. Transcripts that overlap in genomic coordinates by at least 1 bp are grouped together to form non-overlapping transcribed loci. Total number of loci: 14,385. The majority (61%) of transcribed loci have only 1 or 2 transcripts, while 0.6% of them have 20 or more isoforms. This is consistent with other studies of full-length cDNA sequencing of a single sample type [5].
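The locus grouping described in the Figure 2 caption can be sketched as a standard interval merge. This is an illustration, not the code used for the data release: the function name, the (chromosome, start, end) tuple layout with 1-based inclusive coordinates, the decision to ignore strand, and the toy coordinates are all assumptions made here.

```python
from collections import Counter

def group_into_loci(transcripts):
    """Group (chrom, start, end) transcripts that overlap by >= 1 bp.

    Coordinates are assumed 1-based and inclusive; strand is ignored.
    Returns a list of loci, each a list of the transcripts it contains.
    """
    loci = []
    current = []
    current_chrom, current_end = None, None
    for chrom, start, end in sorted(transcripts):
        # Overlap by at least 1 bp: same chromosome and start inside the locus.
        if current and chrom == current_chrom and start <= current_end:
            current.append((chrom, start, end))
            current_end = max(current_end, end)
        else:
            if current:
                loci.append(current)
            current = [(chrom, start, end)]
            current_chrom, current_end = chrom, end
    if current:
        loci.append(current)
    return loci

# Toy input (hypothetical coordinates, not taken from the dataset).
toy = [
    ("chr1", 100, 500),
    ("chr1", 450, 900),
    ("chr1", 901, 1200),  # adjacent but not overlapping: separate locus
    ("chr2", 100, 300),
    ("chr1", 500, 600),
]
loci = group_into_loci(toy)
isoforms_per_locus = Counter(len(locus) for locus in loci)
print(len(loci), dict(isoforms_per_locus))
```

The `isoforms_per_locus` counter is the kind of histogram Figure 2 summarizes: how many loci have one isoform, how many have two, and so on.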
Figure 3. (see link) UCSC browser screenshot of the CREM gene region. PacBio transcripts (top, red) capture multiple isoforms of the CREM gene, including alternatively spliced exons and alternative polyadenylation sites.
Figure 4. (see link) Known cancer fusion gene BCAS4/BCAS3 identified. PacBio transcripts (top, red) show three different fusion variants of the BCAS4/BCAS3 genes. All three variants contain a portion of the 5’ region of the BCAS4 gene (chr20q13) and a portion of the 3’ region of the BCAS3 gene (chr17q23).
References
1. T. Steijger, J. F. Abril, P. G. Engström, et al., “Assessment of transcript reconstruction methods for RNA-seq,” Nat. Methods, vol. 10, no. 12, pp. 1177–1184, Nov. 2013.
2. D. Sharon, H. Tilgner, F. Grubert, and M. Snyder, “A single-molecule long-read survey of the human transcriptome,” Nat. Biotechnol., vol. 31, no. 11, pp. 1009–1014, Nov. 2013.
3. W. Zhang, P. Ciclitira, and J. Messing, “PacBio sequencing of gene families-a case study with wheat gluten genes,” Gene, 2013.
4. K. F. Au, V. Sebastiano, P. T. Afshar, J. D. Durruthy, L. Lee, B. A. Williams, H. van Bakel, E. E. Schadt, R. A. Reijo-Pera, J. G. Underwood, and W. H. Wong, “Characterization of the human ESC transcriptome by hybrid sequencing,” Proc. Natl. Acad. Sci. U. S. A., Nov. 2013.
(link) http://blog.pacificbiosciences.com/2013/12/data-release-human-mcf-7-transcriptome.html?m=1
UCSD Team Develops PacBio Sequencing Method to ID Structural Variant Breakpoints
December 10, 2013
http://www.genomeweb.com/sequencing/ucsd-team-develops-pacbio-sequencing-method-id-structural-variant-breakpoints?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+genomeweb%2Finsequence+%28In+Sequence%29
Posted on December 10, 2013 (Christmas came early: 2x Heliconius genome with PacBio, mean length 5.1 kb) http://img.ly/xpbC
PacBio Sequence Assembly Workshop
Tuesday, December 17th 2013, 4 pm – 7 pm
The Auditorium, 1005 GBSF
4:00 pm Welcome & Introductions
4:00 – 4:30 pm Shane Brubaker, Solazyme: “Assembly, haplotyping, and annotation of a high GC algal genome.”
4:30 – 5:00 pm Jason Chin, PacBio: “String graph assembly for diploid genomes with long reads.”
5:00 – 5:30 pm Lex Nederbragt, University of Oslo: “Using PacBio reads to improve and validate the assembly of the complex Atlantic cod genome.”
5:30 – 6:00 pm Lawrence Hon, PacBio: “Larger genome hybrid assembly with PacBio.”
6:00 – 7:00 pm Reception & Discussions
Light Refreshments Will Be Served in GBSF Lobby
(http://f.cl.ly/items/1D441w1A3z3R2q3X2044/UCD_PacBio_Assembly_Workshop_Agenda_Dec17th2013.pdf)
PACBIO Blog
Monday, December 9, 2013
New Publication Characterizes the Complex Methylomes of Helicobacter pylori
A new paper in Nucleic Acids Research describes the genome-wide methylation state of two strains of Helicobacter pylori, using Single Molecule, Real-Time (SMRT®) Sequencing. The paper represents the first comprehensive study of the myriad of DNA base modifications present across the genome of this major human pathogen.
The collaborative study, entitled "The complex methylome of the human gastric pathogen Helicobacter pylori" was led by the laboratory of Sebastian Suerbaum at the Institute of Medical Microbiology & Hospital Epidemiology and German Center for Infection Research, Hannover, Germany, and includes researchers from New England Biolabs, the DSMZ German Collection of Microorganisms and Cell Cultures Braunschweig, and Pacific Biosciences.
H. pylori chronically infects more than half of the world's population and has been implicated in the formation of ulcers and gastric cancer. Its genome is known for having a “large number of restriction-modification (R-M) systems, and strain-specific diversity in R-M systems has been suggested to limit natural transformation, the major driving force of genetic diversification in H. pylori,” the authors write. They characterized strains 26695 and J99-R3, observing that the methylation patterns were significantly different between the two strains - 17 methylated sequence motifs in the former and 22 in the latter.
Among the motifs were 12 patterns associated with nine recognition sites that were not previously associated with any known methyltransferases. According to the paper, “The combined strategy of allelic disruption of candidate genes and additional functional tests led to the identification of new MTase activities in H. pylori responsible for methylation of eight of the nine novel recognition sites and, furthermore, to the description of unexpected new features of R-M systems.” Those features included “frameshift-mediated changes of sequence specificity and the interaction of one MTase with two alternative specificity subunits resulting in different methylation patterns.”
“This pathogen is not only remarkable due to its abundance of active MTases but [it] also harbors R-M systems with exceptional versatility,” Suerbaum and his collaborators report. “The data show that R-M systems are more complex than previously thought and with the further use of SMRT sequencing for the discovery and functional characterization of R-M systems, our knowledge is likely to rapidly increase.”
Suerbaum presented aspects of this research at the American Society for Microbiology conference earlier this year in a presentation entitled "Comprehensive methylome analysis of the human gastric pathogen, Helicobacter pylori."
http://blog.pacificbiosciences.com/2013/12/new-publication-characterizes-complex.html?utm_content=buffer11afa&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer ------ ((Introduction to SMRT Sequencing . PacificBiosciences))
( Older interesting read )
March 22, 2011, 12:24 PM
Adventures in Extreme Science
From Crick and Watson through J. Craig Venter, we had all our eggs in one basket — molecular biology, gene mapping, whatever you want to call it. It failed. And now we're counting on this guy.
By Tom Junod
Read more: Eric Schadt Profile - Interview with Eric Schadt Pacific Biosciences - Esquire
Visit us at Esquire.com ------ http://www.esquire.com/features/eric-schadt-profile-0411-2
PacBio Blog
Wednesday, December 4, 2013
In RNA-seq Study, Long PacBio Reads Allow for Detection of Full-Length and Novel Isoforms
A new paper out in PNAS details the usefulness of long reads for isoform sequencing. “Characterization of the human ESC transcriptome by hybrid sequencing” comes from lead author Kin Fai Au and senior author Wing Wong at Stanford University as well as a number of collaborators.
The authors detail the problem that they see with current RNA-seq studies: the inability to capture full-length mRNA isoforms (averaging about 2,500 bases) by using reads of just a few hundred base pairs. “We are still far from achieving the original goals of RNA-Seq analysis, namely the de novo discovery of genes, the assembly of gene isoforms, and the accurate estimation of transcript abundance at the gene or the isoform level,” Au et al. write. They note that isoform detection or prediction with short reads is even more difficult when the full set of possible isoforms is not known going into the project.
The scientists describe a new approach, combining short-read Illumina® and long-read PacBio® sequence data and pairing that with a computational tool to predict isoforms as a more comprehensive means of examining transcripts. They tested the method in a well-characterized line of human embryonic stem cells (hESCs) and validated their findings with follow-on qPCR and knockdown studies.
By adding SMRT® Sequencing to the study, the authors report direct detection of more than 8,000 full-length, RefSeq-annotated isoforms, as well as prediction of nearly 5,500 other isoforms using the Isoform Detection and Prediction computational tool. “Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified,” the scientists write. They add that long noncoding RNAs are especially likely to be lost in short-read studies and that consequently there is “significant downward bias in the current strategy for genome-wide discovery” of these genetic elements.
The authors use one particular example to demonstrate the complexity of isoform analysis. “Several long reads with up to four junctions were mapped to the locus chr6:167,641,267–167,660,912 (hg19, the same below), where no annotated genes in RefSeq, Ensembl, UCSC Known Genes, or GENCODE are reported. The long reads indicated complex expression from this locus with at least three different isoforms transcribed,” they write.
“In our approach the error-corrected long reads are ideal for narrowing down the isoforms expressed in a sample, thus enabling much more reliable abundance quantification from [second-generation sequencing] reads,” report Au et al. They note that results from studies of their Isoform Detection and Prediction tool show that it is “effective in using the information from the PacBio long reads to significantly improve isoform identification.” In addition, these results "suggest that gene identification, even in well-characterized cell lines and tissues, is far from complete."
http://blog.pacificbiosciences.com/2013/12/in-rna-seq-study-long-pacbio-reads.html?utm_content=bufferadf82&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer
(Same post as the last one, with the Yahoo link added.)
Tuesday, December 3, 2013
PacBio Partners with Sanger Institute and Public Health England to Finish 3,000 Bacterial Genomes
We are pleased to announce a new collaboration with the Wellcome Trust Sanger Institute and Public Health England to complete the sequences of 3,000 bacterial genome strains from PHE’s National Collection of Type Cultures (NCTC). Sequencing will be performed on the PacBio® RS II DNA Sequencing System at the Sanger Institute. The three-year project could double the number of finished microbial genomes in GenBank.
The NCTC is one of the world’s premier collections for bacterial strains, but most bacteria in NCTC currently have no genome references. Combining reference genomes with the wealth of historical and biological information existing for these strains will generate a data set of enormous value for basic and clinical microbiology.
“This is a crucial set of reference bacteria, and it is critical to have fully finished genomes for them,” said Dr. Julian Parkhill, head of Pathogen Genomics at the Wellcome Trust Sanger Institute. “The collection of 3,000 additional finished genomes, including plasmids and other genomic elements, and epigenomes, will be a wonderful resource for the entire microbiology community.”
Our CSO Jonas Korlach commented: “SMRT® Sequencing has become the gold standard for finishing microbial genomes and we are delighted to be part of this NCTC project to add a wealth of information to the public databases for these important microbes.”
http://blog.pacificbiosciences.com/2013/12/pacbio-partners-with-sanger-institute.html
http://finance.yahoo.com/news/pacific-biosciences-wellcome-trust-sanger-123000490.html
IDP: Isoform Detection and Prediction Using Second Generation Sequencing and PacBio Sequencing
December 1, 2013 by nextgenseek
http://nextgenseek.com/2013/12/idp-isoform-detection-and-prediction-using-second-generation-sequencing-and-pacbio-sequencing/
(Nature Biotechnology, corrected online 25 November 2013) A single-molecule long-read survey of the human transcriptome
http://www.nature.com/nbt/journal/v31/n11/full/nbt.2705.html
PacBio® at Sanger Institute: De Novo Assembly, Methylation Analysis, and Detection of Rare Variants (Published on Nov 25, 2013)
Harold Swerdlow, who runs the R&D department at Wellcome Trust Sanger Institute, discusses his team's use of the PacBio RS sequencer. He says the system is uniquely suited for de novo
Sometimes!!
Wednesday, November 20, 2013
New Publication Demonstrates Long-Read Sequences Needed to Thoroughly Resolve Short Tandem Repeats
In a new paper reporting a protocol for using short-read sequence data to locate short tandem repeats (STRs), scientists find that long-read sequence information is necessary to resolve regions with repeat complexity, extreme GC content, and other challenging factors. Their solution is to use short-read data to find STRs, and then to use long-read sequencing to fully characterize those repeat expansions.
The Bioinformatics publication is entitled “Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing” and came from scientists Koichiro Doi, Shinichi Morishita, et al. at the University of Tokyo. They focused on resolving STRs across whole genomes because of their links to genetic disease, noting that exome capture is insufficient to fully characterize these repeat units, which are often found in non-exonic regions of the genome.
Doi et al. report that short-read sequence data has traditionally proven inadequate for elucidating STRs that span more than 100 bp, the average length of a short read. In this project, they developed an efficient computational program along with ab initio procedures to sense and locate STRs by scanning massive short-read data sets and analyzing frequency distributions of approximate STRs based on length.
However, they note, “As genomic regions of GC content > 70% are difficult to cover with an ample number of Illumina® reads, our method is unlikely to detect long expansions of STRs with high GC contents. STRs in reads originating in centromeres, telomeres, or retrotransposons are too numerous to map to unique genomic positions.”
To fully analyze longer STRs, the team utilized Single Molecule, Real-Time (SMRT®) Sequencing on 11 samples from patients with a brain disease. Through this approach, they report, “we were able to determine a divergent set of [two] 3-3.1 kb STR sequences in eleven SCA31 samples, showing the instability of STR expansions.” By combining both methods (genome scanning with short-read data to find STR locations, then sequencing those structural variants with the PacBio® platform), the scientists were able to rapidly home in on long STRs implicated in human disease.
Looking ahead, the authors suggest that there is much to be learned about STRs longer than 1 kb and whether STR expansions occur more often in germline or somatic cells. “Analysis of the stability of STR expansions in germline and somatic cells of a specific disease might eventually lead to the recognition of a functional role of STRs,” they write.
http://blog.pacificbiosciences.com/2013/11/new-publication-demonstrates-long-read.html?utm_content=buffer960d6&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer#SMRTSeq
BioTechniques: The International Journal of Life Science Methods (November 2013)
Microsatellite marker discovery using single molecule real-time circular consensus sequencing on the Pacific Biosciences RS
Abstract
Microsatellite sequences are important markers for population genetics studies. In the past, the development of adequate microsatellite primers has been cumbersome. However, with the advent of next-generation sequencing technologies, marker identification in genomes of non-model species has been greatly simplified. Here we describe microsatellite discovery on a Pacific Biosciences single molecule real-time sequencer. For the Greater White-fronted Goose (Anser albifrons), we identified 316 microsatellite loci in a single genome shotgun sequencing experiment. We found that the capability of handling large insert sizes and high-quality circular consensus sequences provides an advantage over short-read technologies for primer design. Combined with a straightforward amplification-free library preparation, PacBio sequencing is an economically viable alternative for microsatellite discovery and subsequent PCR primer design.
Introduction
Microsatellites are important and highly informative markers for population genetics, evolutionary biology, and ecology. But the development of sufficient microsatellite markers for population genetics studies of non-model organisms has been laborious. The first high-throughput attempts to identify microsatellite markers were performed using 454 pyrosequencing of genomic shotgun libraries or microsatellite enriched libraries (1-3). Approaches using ultra-short read sequencing for microsatellite identification have also been explored (4, 5). Typically, these cloning-free approaches are performed at low genome coverage, employing relatively short sequence reads and often recovering small fractions of the target genome. Shotgun genome sequencing for microsatellite discovery is generally less biased, but potential markers are typically sequenced only once, inherently linking the reliability of the sequence to the single pass accuracy of the sequencing chemistry. In contrast, microsatellite-enriched libraries are biased by the choice of capture probe, restriction enzyme, or PCR amplification steps. Pacific Biosciences (Menlo Park, CA) has developed a single molecule, real-time (SMRT) DNA sequencing system, the PacBio RS (6). Although SMRT sequencing has a single pass accuracy of ~85%, the sequencing library format (SMRTbell) allows multi-pass sequencing of the same circular template, thereby generating highly accurate circular consensus sequencing (CCS) data reaching >99.999% accuracy (7).
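To get a feel for why repeated passes over the same SMRTbell template raise consensus accuracy, here is a deliberately simplified toy model. It assumes each pass calls a given base correctly with independent probability 0.85 (the single-pass figure quoted above) and that the consensus is a simple majority vote. Real CCS consensus calling is probabilistic and has to handle insertions and deletions, so this sketch is not the PacBio algorithm and its numbers should not be compared directly with the accuracy reported in the article. The function name is chosen here for illustration.

```python
from math import comb

def majority_vote_accuracy(p_single, n_passes):
    """Probability that more than half of n independent passes are correct.

    Toy binomial model: each pass is right with probability p_single.
    """
    needed = n_passes // 2 + 1  # strict majority
    return sum(
        comb(n_passes, k) * p_single**k * (1 - p_single) ** (n_passes - k)
        for k in range(needed, n_passes + 1)
    )

# Accuracy under the toy model for an increasing number of passes.
for n in (1, 3, 5, 9, 15):
    print(n, f"{majority_vote_accuracy(0.85, n):.6f}")
```

Under this model three passes already lift per-base accuracy from 0.85 to about 0.94, and the gain keeps compounding with more passes. The model is likely pessimistic, because in practice wrong calls from different passes need not agree with each other.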
Method Summary
This study reports the use of single molecule consensus sequencing using the Pacific Biosciences RS for microsatellite discovery. The advantage over other next-generation sequencing systems is the random error model and the capability for sequencing the same library molecule several times, thereby generating high quality consensus sequences. The relatively long library molecules in combination with high consensus accuracy are excellent templates for primer design as the reads cover larger flanking regions of the identified microsatellites compared to Illumina, 454 or Ion Torrent sequencing reads.
Figure 1 shows the genomic shotgun approach for microsatellite discovery using the PacBio real-time sequencing system, which is based on the CCS approach. CCS represents a combination of unbiased random shotgun sequencing with high consensus coverage of the template molecule. We have therefore explored the usability of CCS reads for microsatellite discovery and subsequent primer design.
Figure 1. Overview of microsatellite discovery using genomic shotgun circular consensus sequencing (CCS).
Material and methods
Genomic DNA from a single Anser albifrons individual was sheared to approximately 3 kb fragments for SMRTbell template library generation. Sequencing was performed on a PacBio RS using C2/C2 chemistry (movie time 90 min) by GATC Biotech (Konstanz, Germany).
The CCS reads were subjected to microsatellite analysis and primer design using msatcommander (v1.08) (8) with a threshold of at least five repetitions for di-, tri-, tetra-, penta-, and hexa-nucleotide repeats, excluding mononucleotide repeats. Primer design parameters were: product size 90–500 bp; primer: max size 25 / min size 18, min Tm 47°C, max Tm 63°C, min GC 40%, max GC 60%, and msatcommander option “combine loci”.
PCR was performed in a 25 µL reaction mix containing: 1 × complete buffer (2.0 mM MgCl2; Bioron Diagnostics, Ludwigshafen, Germany), 10 pM of each primer, 0.5 mM of each dNTP, except dATP (0.25 mM), and 0.25 mM of radiolabeled (33P-α)-dATP, 1 unit of Taq polymerase, and 50–100 ng of template DNA. The following PCR protocol was employed: (1) initial denaturation, 5 min at 94°C; (2) 35 cycles of 45 s of denaturation at 94°C, 60 s of annealing at 52°C, and 2 min of extension at 72°C; (3) final 10 min extension at 72°C. Samples were genotyped by autoradiography in 6% acrylamide: N,N’-methylenebisacrylamide 24:1 denaturing gels using X-ray Hyperfilm (Kodak, Taufkirchen, Germany). The autoradiograms were analyzed by eye and scored.
Filtered subreads and CCS reads were generated using the PacBio SMRT analysis software (v1.3.1). The filtered subreads were mapped to the complete mitochondrial genome of A. albifrons (GenBank: NC_004539.1) using the BWA Smith-Waterman Aligner (BWA-SW; v0.6.2) with mapping parameters “-b5 -q2 -r1 -z20” (9).
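The repeat threshold described above (di- to hexa-nucleotide motifs, at least five tandem copies, mononucleotide runs excluded) can be illustrated with a plain regular-expression scan. This is a stand-in written for this post, not msatcommander and not its search strategy. The function names and the toy sequence are hypothetical, and only perfect (uninterrupted) repeats are reported.

```python
import re

MIN_REPEATS = 5            # at least five tandem copies
MOTIF_LENGTHS = range(2, 7)  # di- to hexa-nucleotide motifs

def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit.

    Rejects homopolymers such as "AA" and redundant motifs such as "ACAC".
    """
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )

def find_microsatellites(seq):
    """Return sorted (start, end, motif, copies) tuples, 0-based half-open."""
    seq = seq.upper()
    hits = []
    for k in MOTIF_LENGTHS:
        # One motif of length k followed by at least MIN_REPEATS - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, MIN_REPEATS - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                copies = (m.end() - m.start()) // k
                hits.append((m.start(), m.end(), motif, copies))
    return sorted(hits)

# Toy read: an (AC)6 repeat, a poly-A run that must be ignored, and a (GAT)5 repeat.
toy_read = "TTG" + "AC" * 6 + "GGT" + "A" * 12 + "CC" + "GAT" * 5 + "T"
for hit in find_microsatellites(toy_read):
    print(hit)
```

Returning coordinates alongside the motif makes it easy to check how much flanking sequence remains on either side for primer placement, which is the advantage of long CCS reads that the article emphasizes.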
Results and discussion
After quality filtering of zero-mode waveguides (ZMW), the run yielded 16,180 reads with 43 Mb of sequence data. The reads were further split into 31,200 subreads, with subreads ranging from 1 to 41 per insert (mean: 10, median: 7.5). For 1,457 of the inserts with at least 3 full subreads, a CCS read could be generated (length mean: 1,867, median: 1,861), translating into 2.72 Mb of sequence data. A vast majority of the reads had an average predicted error rate of <1%, i.e., Phred 20 score (Figure 2) (10).
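The correspondence between a predicted error rate of 1% and a Phred score of 20 follows from the standard definition Q = -10 · log10(p). The short sketch below applies that formula and also cross-checks the reported CCS yield against the read count and mean length. The function and variable names are chosen here for illustration.

```python
import math

def phred_from_error(p_error):
    """Phred quality score for a given error probability."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q):
    """Error probability for a given Phred quality score."""
    return 10.0 ** (-q / 10.0)

# An error rate of 1% corresponds to Phred 20.
print(f"Q for p=0.01: {phred_from_error(0.01):.1f}")
print(f"p for Q=30:   {error_from_phred(30):.4f}")

# Cross-check of the reported CCS yield (read count times mean length).
ccs_reads = 1_457
mean_ccs_length = 1_867
approx_ccs_bases = ccs_reads * mean_ccs_length
print(f"approximate CCS yield: {approx_ccs_bases / 1e6:.2f} Mb")
```

The product of read count and mean length lands on the 2.72 Mb quoted in the text, so the reported figures are internally consistent.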
Figure 2. Average Phred quality scores of circular consensus sequences (CCSs).
In 281 CCS reads, 316 microsatellites were identified, and 251 flanking PCR primer pairs could be designed. The distribution of putative target loci consisted of 213 di-, 50 tri-, 28 tetra-, and 25 penta-nucleotide motifs. Of the microsatellite-containing reads, 255 contained a single microsatellite and 26 reads contained 2 to 5 motifs. This is equivalent to a (combined) locus-to-primer conversion of ~90%. This yield is higher than the values obtained from 454 Titanium or Illumina shotgun sequencing, which usually show conversion of 40%–60% of the loci due to read length constraints and depending on stringency of analysis (2, 4, 5, 11).
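The counts in this paragraph can be tied together with a few lines of arithmetic. One assumption is made here and is not stated in the article: the ~90% conversion is taken to be primer pairs divided by the 281 microsatellite-containing reads, treating each read as one "combined" locus.

```python
# Counts reported in the paragraph above.
motif_counts = {"di": 213, "tri": 50, "tetra": 28, "penta": 25}
reads_with_one_motif = 255
reads_with_several_motifs = 26
primer_pairs = 251

total_microsatellites = sum(motif_counts.values())
reads_with_microsatellites = reads_with_one_motif + reads_with_several_motifs

# Assumed denominator: one combined locus per microsatellite-containing read.
conversion = primer_pairs / reads_with_microsatellites

print(f"microsatellites: {total_microsatellites}")
print(f"reads with microsatellites: {reads_with_microsatellites}")
print(f"locus-to-primer conversion: {conversion:.1%}")
```

With that denominator the ratio comes out just under 90%, in line with the rounded figure in the text. Dividing by all 316 individual microsatellites instead would give a lower value, which is why the "combined" reading seems the more plausible one.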
For CCS reads containing more than one motif and generating several primer pairs, we chose the most promising target for PCR amplification, based on its microsatellite motif characteristics. Although birds are known to have a lower microsatellite density than other vertebrates (12, 13), a reasonable number of loci could be extracted from our sequence data. These higher primer yields from relatively few reads are likely the result of the long CCS reads, which have a higher chance of containing microsatellites and also leave enough flanking region for subsequent primer design. The random error model of PacBio sequencing (14) gives the CCS reads high accuracy at three or more circular sequencing passes, which makes them favorable in this context compared with reads from 454 and Illumina sequencing with their sequence-specific error models (15, 16).
We tested the performance of the primer pairs generated by msatcommander based on CCS reads of 10 individuals from 3 goose species: A. albifrons, A. anser, and A. erythropus. Our first primer evaluation showed successful PCR products for A. anser from 48 of 50 pairs, and 46 out of 50 pairs amplified the desired locus for both A. albifrons and A. erythropus. The autoradiograms were analyzed by eye and scored. To evaluate if the frequency of the observed genotypes is higher than expected under genetic equilibrium, genotypic linkage disequilibrium per pair of loci was tested using the software Genepop (v4.1) (17); the results indicated no linked loci (P < 0.05). A thorough analysis of the microsatellite markers will be presented in a separate publication (Frias Soler et al., manuscript in preparation).
Apart from microsatellite discovery, we also analyzed how many reads matched mitochondrial sequences and found that 59 subreads could be mapped to the mitochondrial genome of A. albifrons. The reads covered 12,957 bp out of 16,737 bp from the published reference sequence (77.42%). In principle, one could co-retrieve the complete mitochondrial genome from a single SMRT-cell, at least in a draft stage, to be improved later.
In conclusion, we show that, even with a small fraction of an avian target genome, one can generate enough primer pairs for microsatellite loci to perform population genetics studies. The unique combination of randomly sub-sampling a small fraction of the genome and long high-quality CCS reads is advantageous for primer design over short-read technologies that allow only single-pass reads at low genome coverages. For technical reasons, our library had an average insert size of 1.8 kb and was slightly above the optimal range for the C2 chemistry read length, yielding only a few CCS reads from the longest fraction of the reads. As CCS accuracy is correlated with the number of sequencing passes of the template molecule, shorter library molecules would result in a higher number and better quality of the CCS reads (7).
In light of current sequencing chemistry and system upgrades (RS II; P5 polymerase / C3 chemistry) with average read lengths of ~8,500 bases, our library would have been covered by at least 3–4 sequencing passes. A single sequencing run costing ~$600 (including library preparation) should therefore generate a minimum of 30,000 CCS reads, a 20-fold increase compared with our study, yielding ~6,500 potential loci even in microsatellite-poor bird genomes. Several price calculations for different PacBio sequencing chemistry combinations can be found in the recent study by Koren et al. (18). This puts the cost of microsatellite discovery using PacBio sequencing between that of 454 and Illumina sequencing (4). Our approach provides a useful alternative when cost reduction using multiplexing is not practicable, and microsatellite array length is to be determined directly from the data for prioritized locus testing.
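The projection in this paragraph can be reproduced by scaling the observed results linearly with the number of CCS reads. Linear scaling is an assumption of this sketch (it ignores any change in read length or quality between chemistries), and the cost-per-locus figure is derived here, not quoted from the article.

```python
# Observed in the study.
observed_ccs_reads = 1_457
observed_loci = 316

# Projected for a single run on the upgraded system, as quoted in the text.
projected_ccs_reads = 30_000
run_cost_usd = 600

# Assumption: loci scale linearly with the number of CCS reads.
fold_increase = projected_ccs_reads / observed_ccs_reads
projected_loci = observed_loci * fold_increase
cost_per_locus = run_cost_usd / projected_loci  # derived, not from the article

print(f"fold increase in CCS reads: {fold_increase:.1f}")
print(f"projected loci: {projected_loci:.0f}")
print(f"approximate cost per potential locus: ${cost_per_locus:.2f}")
```

The fold increase and projected locus count agree with the rounded "20-fold" and "~6,500" figures in the text.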
Author contributions
Markus A. Grohme designed the study, performed data analysis, and wrote the manuscript. Roberto Frias Soler performed PCR experiments. Michael Wink provided the sample and was involved in writing the manuscript. Marcus Frohme provided funding, aided in writing the manuscript, and supervised the work.
Acknowledgments
Funding was provided by the Ministry of Science, Research and Culture (MWFK) of the federal state of Brandenburg (Germany) in the program “Knowledge and Technology Transfer for Innovation” (FKZ 80143246 / GenoSeq) based on the European Fund of Regional Development (EFRE).
Competing interests
The authors declare no competing interests.
http://www.biotechniques.com/BiotechniquesJournal/2013/November/Microsatellite-marker-discovery-using-single-molecule-real-time-circular-consensus-sequencing-on-the-Pacific-Biosciences-RS/biotechniques-348097.html?utm_source=BioTechniques+Newsletters+%26+e-Alerts&utm_campaign=9b623cf325-etoc&utm_medium=email&utm_term=0_5f518744d7-9b623cf325-87364122
PacBio sequencing of gene families — A case study with wheat gluten genes
mainly accumulate in storage proteins called gliadins and glutenins. Gliadins contain α/β-, γ- and ω-types whereas glutenins contain HMW- and LMW-types. Known gliadin and glutenin sequences were largely determined through cloning and sequencing by capillary electrophoresis. This time-consuming process prevents us from intensively studying the variation of each orthologous gene copy among cultivars. The throughput and sequencing length of the Pacific Biosciences RS (PacBio) single molecule sequencing platform make it feasible to construct contiguous and non-chimeric RNA sequences. We assembled 424 wheat storage protein transcripts from ten wheat cultivars by using just one single-molecule-real-time cell. The protein genes from wheat cultivar Chinese Spring are comparable to known sequences from NCBI. We demonstrated real-time sequencing of gene families with high throughput and low cost. This method can be applied to studies of gene amplification and copy number variation among species and cultivars.
http://sciencealerts.com/stories/2577085/PacBio_sequencing_of_gene_families__A_case_study_with_wheat_gluten_genes.html?utm_source=dlvr.it&utm_medium=twitter (http://www.sciencedirect.com/science/article/pii/S0378111913013681)
November 12th, 2013- PacBio Posts Slides from User Group Meeting
Lance Hepler, Center for AIDS Research, UC San Diego
Hepler used the PacBio RS to study intra-host diversity in HIV-1. He compared PacBio’s performance to that of 454®, the platform he and his team previously used. Hepler noted that in general, there was strong agreement between the platforms; where results differed, he said that PacBio data had significantly better reproducibility and accuracy.
George Weinstock, Washington University St. Louis
Weinstock discussed his overall approach to human microbiome projects, including both targeted 16S sequencing with PacBio, as well as shotgun sequencing of the whole sample. In a pilot project, Weinstock’s team created a mock microbiome of 24 samples with a 300-fold range of concentration; PacBio sequencing was able to accurately identify the taxa for all 22 species where 16S amplification succeeded, yielding highly accurate full-length 16S consensus sequences.
John Huddleston, University of Washington
Huddleston is looking at challenging regions in the human genome, noting that assembly accuracy needs to be quite high to resolve breakpoints and reconstruct duplication architectures. His team is working with BACs to validate the use of the PacBio platform as a faster, more cost-effective alternative to Sanger. In one study, his team found that PacBio results had 99.994% identity with Sanger results and showed uniform coverage across the clone.
Lisbeth Guethlein, Stanford University School of Medicine
Guethlein looked at highly repetitive and variable regions of the orangutan genome. Guethlein reported that “PacBio managed to accomplish in a week what I have been working on for a couple years” (with Sanger), and the results were concordant.
Alisha K. Holloway, Gladstone Institute
Holloway presented data from transcript identification work in chicken. Because she uses chicken to model human heart development, she needs good annotations of RNA produced at various developmental stages to figure out where problems arise. Unlike short-read technologies, PacBio provided reads long enough to span entire transcripts and dramatically improved gene annotation.
Chongyuan Luo, Salk Institute for Biological Studies
Luo from the Ecker lab spoke about studying the genome and epigenome of several Arabidopsis thaliana strains using SMRT® Sequencing. PacBio sequence data detected 40 percent more SNPs than short-read technology, indicating that some regions may not have been covered well enough with short reads to find all SNPs.
(LOTS MORE)!!! http://www.homolog.us/blogs/blog/2013/11/12/pacbio-posts-slides-user-group-meeting/
Top N Genome Scientists to Follow on Twitter: 2013 Edition.
( Many of these Scientists Follow PACBIO.,Pros & Cons)! http://nextgenseek.com/2013/11/top-n-genome-scientists-to-follow-on-twitter-2013-edition/
Thursday, November 7, 2013-Event Recap: Fall User Group Meeting Presentations & Review
In September we were excited to have 100+ customers gather in Palo Alto, Calif., to discuss their use of Single Molecule, Real-Time (SMRT®) Sequencing and hear about what’s next for the PacBio® RS II. Many thanks to all the scientists who attended and shared their experiences. For anyone who couldn’t make it, we’ve included some highlights from each talk below (and links to full presentations when possible):
Chongyuan Luo from the Ecker lab at the Salk Institute for Biological Studies spoke about studying the genome and epigenome of several Arabidopsis thaliana strains using SMRT Sequencing. Luo noted that Arabidopsis is the only plant to have its complete genome sequenced, though structural variation has not yet been well characterized. PacBio sequence data detected 40 percent more SNPs than short-read technology, indicating that some regions may not have been covered well enough with short reads to find all SNPs.
We were delighted to have two speakers from the Joint Genome Institute, which has become a power user of the PacBio technology. Alex Copeland offered an overview of the institute’s microbial and fungal reference assembly pipeline, where de novo genome sequencing is especially important. He described their experience with a 10x increase in read length and total throughput in three years on the PacBio platform. He also discussed the evolution of their pipeline from Sanger sequencing to the Illumina® and PacBio platforms, going from a median of 49 contigs per microbial genome with Sanger, to 69 with Illumina sequencing, and to 10 or fewer with the PacBio system. Copeland noted that after producing 100x coverage of long reads (10 kb inserts), PacBio users can reliably assemble a genome into 10 contigs or fewer. He said that the team has shifted to a PacBio-only pipeline, and that they are finishing genomes on the platform for less than $2,000.
Fellow JGI scientist Matthew Blow spoke next on bacterial epigenomics, an important genome component that his team looks at with every microbe sequenced. Blow and his colleagues are studying methyltransferases, their link to restriction enzymes, related sequence motifs, and sites that remain unmodified. A recent analysis of global patterns in DNA modifications in bacteria revealed that of 198 analyzed genomes, 169 (>90%) had modified DNA bases, with the most common being N6-methyladenine (80%). Novel motifs constituted ~20%, and the average number of modified motifs per genome was 3, with a maximum of 12. Blow noted that JGI is seeking collaborators for additional projects to explore the biological functions of DNA modifications.
Bart Weimer from the UC Davis School of Veterinary Medicine spoke about the 100K Foodborne Pathogen Genomes project. He noted that sequencing is critical for pathogen identification both because microbial evolution can erase the markers currently used for tracking, and because 16S classification does not correlate with phylogenetic serotype clustering. The goal of the 100K project is to provide a useful, comprehensive database that will allow users to find clinically relevant information about new strains in outbreak situations. Weimer was enthusiastic about the additional information provided by PacBio sequence data, such as methylation and phage elements — both useful in tracking and identifying pathogens. “I get the sequence, I get the structural variation, I get the SNPs and most importantly I get the epigenetic information. Sequencing is almost a byproduct,” he exclaimed.
Lance Hepler from UC San Diego’s Center for AIDS Research used the PacBio RS to study intra-host diversity in HIV-1. He compared PacBio’s performance to that of the 454® sequencer, the platform he and his team previously used. Hepler noted that in general, there was strong agreement between the platforms; where results differed, he said that PacBio data had significantly better reproducibility and accuracy. “PacBio does not suffer from local coverage loss post-processing, whereas 454 has homopolymer problems,” he noted. Hepler said they are moving away from using 454 in favor of the PacBio system.
From Washington University in St. Louis, George Weinstock discussed his overall approach to human microbiome projects, including both targeted 16S sequencing with the PacBio platform and shotgun sequencing of the whole sample. In a pilot project, Weinstock’s team created a mock microbiome of 24 samples with a 300-fold range of concentration; PacBio sequencing was able to accurately identify the taxa for all 22 species where 16S amplification succeeded, yielding highly accurate full-length 16S consensus sequences. He also presented a proof-of-concept study wherein the PacBio system outperformed Sanger sequencing in using full-length 16S sequencing for high-throughput identification of bacteria in clinical isolates of hospital-acquired infections.
We had a couple of talks on characterizing complex genomic regions. Lisbeth Guethlein from Stanford University School of Medicine looked at highly repetitive and variable regions of the orangutan genome. Guethlein reported that “PacBio managed to accomplish in a week what I have been working on for a couple years” (with Sanger sequencing), and the results were concordant. “Long story short, I was a happy customer.” In a separate presentation, John Huddleston from the University of Washington discussed sequencing challenging regions in the human genome, noting that assembly accuracy needs to be quite high to resolve breakpoints and reconstruct duplication architectures. His team is working with BACs to validate the use of the PacBio platform as a faster, more cost-effective alternative to Sanger. In one study, his team found that PacBio results had 99.994% identity with Sanger results and showed uniform coverage across the clone.
In the afternoon, talks turned to the transcriptome. Vince Magrini from the Genome Institute at Washington University described a proof-of-principle RNA-seq study using SMRT Sequencing in a nematode to help elucidate transcriptional regulation and its effect on life cycle. Using PacBio data added more than 1,500 genes to what had been found in the reference sequence. In another talk, Alisha Holloway from the Gladstone Institutes presented data from transcript identification work in chicken. Because she uses chicken to model human heart development, she needs good annotations of RNA produced at various developmental stages to figure out where problems arise. Unlike short-read technologies, PacBio sequencing provided reads long enough to span entire transcripts and dramatically improved gene annotation. Finally, Kin Fai Au from Stanford University spoke about gene isoform identification and prediction in embryonic stem cells, commenting that long reads are essential to examining these long regions and resolving alternative splice isoforms.
Robert Sebra from the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai presented data on how to use BluePippin™ size selection from Sage Science to increase subread lengths of PacBio data. He noted that the BluePippin sizing step also cleans up DNA quality, compensating for any drop in yield. With size selection, Sebra said that his team could generate microbial assemblies from a single SMRT Cell; without the step, more sequencing was needed.
Two of the speakers came from PacBio: Senior VP Kevin Corcoran and CSO Jonas Korlach. Corcoran updated attendees about the latest on our sequencing platform, including upcoming advances such as polymerase photodamage protection, the P5-C3 chemistry offering 8,500-base average reads, three-hour movies, Quiver for diploid sequencing, and more. In the closing presentation, Korlach spoke about where the PacBio platform is heading, including use for large customer projects that include large numbers of samples, higher complexity metagenomic studies, and assemblies of larger genomes. He also mentioned upcoming technology improvements, such as library prep automation and new data analysis algorithms.
http://blog.pacificbiosciences.com/2013/11/event-recap-fall-user-group-meeting.html?utm_content=buffer46246&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer
FOR MARYLAND INVESTIGATORS, LONG READS OFFER NEW PATH TO FINISHED GENOMES. This is the full story, much better than the PACB BLOG: http://ow.ly/qy(xiE (“For other teams considering whether SMRT Sequencing is the right choice for them, Tallon says: ‘If you’re going to be working with small genomes, value complete or nearly finished genome sequences, and are looking at base modifications in addition to the genome sequence, there’s no better platform out there.’”)!!!!!!!
Wednesday, November 6, 2013-At Institute for Genome Sciences, Long Reads Offer New Path to Finished Genomes
The Genomics Resource Center (GRC) at the Institute for Genome Sciences (IGS) has a scientific pedigree and a sample-to-interpretation service commitment that place it in a league of its own. The team operates under a simple mantra: ‘If it can be sequenced, we can do it.’
Both GRC and IGS were founded in 2007 when a high-powered team of investigators formerly at The Institute for Genomic Research (TIGR), led by Claire Fraser, joined the University of Maryland School of Medicine. “The group of faculty and senior staff that came here to start the institute was heavily focused on infectious disease research,” says Luke Tallon, scientific director and founding leader of the GRC. “Our primary goal in joining the medical school was to extend our pathogen genomics expertise into host-pathogen studies and direct clinical genomics applications.”
In addition to its infectious disease and genomics expertise, TIGR was also renowned for its bioinformatics talent — a trait that continues with the group at IGS. The GRC team of 15 staff members is evenly split between wet lab and bioinformatics, and more than half of the institute’s 100-plus employees are bioinformaticians. “One of our strengths is that we go beyond generating efficient, high-quality sequence data. We have teams of analysts and engineers who can assist investigators with downstream analysis and interpretation,” Tallon says.
The GRC has had a mandate to stay on the cutting edge of sequencing technology since its inception. “We are continuously monitoring and evaluating new technologies,” Tallon says. A few years ago, these evaluations led IGS to Pacific Biosciences and its single molecule, real-time (SMRT®) sequencer.
The instrument’s strength in de novo microbial sequencing and other long-read applications made it particularly well-suited to the type of research at IGS. Though there are many factors they consider when choosing a new instrument, an important one is “the relative value of the type of data you’re going to get,” Tallon says. “Small genomes are such a significant part of what we do, and we’re moving more and more into microbial transcriptomes and methylation studies. It’s the only platform that allows us to do all of that really well.”
Lisa Sadzewicz, administrative director of the facility, notes that anticipated demand is also an important factor in bringing in a new instrument for a core facility. “We are not driven by only one customer or one small group,” she says. “In order to drive down costs, you need to have a wide community from which you can draw projects and samples to fill the capacity of the instrument.”
The PacBio sequencer, which GRC upgraded to the PacBio® RS II last spring, is now a workhorse for generating finished or nearly complete microbial genomes as well as genome-wide methylation data. “We’re now analyzing base modifications and methylation patterns routinely with most of our small genomes,” Tallon says. “We’re also doing metagenomic sequencing using the RS II and exploring ways we can use the long reads to get full-length genes and transcripts out of our metagenome and metatranscriptome samples.”
The team’s attention to optimizing the sequencing workflow has resulted in a high-performance pipeline for finishing genomes. “Prior to PacBio, we couldn’t close genomes without manual finishing efforts,” Tallon says. With SMRT Sequencing, his team is consistently finishing genomes. “Our biggest challenge is getting sufficient high-quality DNA to start a project. If we get that, the genomes are going to close more often than not.”
For more on the GRC’s use of the PacBio system, including details on how the lab optimized its SMRT pipeline, read the complete profile. You can also visit the GRC blog or attend their upcoming applications seminar on Thursday, November 7 at 11:00 AM to hear how the PacBio RS II can advance your research. http://blog.pacificbiosciences.com/2013/11/at-institute-for-genome-sciences-long.html?utm_content=bufferb6ae0&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer
Error-Corrected PacBio Sequences for the D. melanogaster Reference Strain
Posted 06 Nov 2013. Using PacBio and Illumina whole genome shotgun sequences we recently released for the D. melanogaster reference strain, Sergey Koren and Adam Phillippy at the University of Maryland have recently run their pacBioToCA method to generate error-corrected PacBio reads for this dataset, which they have kindly made available here for re-use without restriction. This pilot data set is not at high enough coverage and thus a whole genome assembly was not attempted. Nevertheless, both the raw and error-corrected datasets should be of use to better understand the nature of PacBio data and the pacBioToCA pipeline as applied to Drosophila genomes.
The 2057_PacBio_corrected.tgz archive contains the following files:
•pacbio.blasr.spec – the specification file used for the pacBioToCA run.
•corrected.trim.fastq.bz2 – the error-corrected PacBio reads.
•corrected.trim.lens – a file containing columns with corrected read id, read length, running total BP, running mean, and running coverage assuming 140Mbp genome size.
•corrected.trim.names – a look-up table to map read IDs in the pacBioToCA output to the original PacBio input IDs.
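To make the column description of corrected.trim.lens concrete, here is a minimal Python sketch of how such running columns could be derived from a list of read lengths. The function name, the read ordering, and the sample reads are illustrative assumptions; this is not the pacBioToCA code itself, and the actual file may order reads differently.

```python
def running_stats(reads, genome_size=140_000_000):
    """Return rows of (read_id, length, running_total_bp, running_mean, running_coverage).

    `reads` is an iterable of (read_id, length) pairs. The 140 Mbp default
    mirrors the genome size mentioned in the file description; everything
    else here is a hypothetical illustration.
    """
    rows = []
    total = 0
    for count, (read_id, length) in enumerate(reads, start=1):
        total += length
        rows.append((read_id, length, total, total / count, total / genome_size))
    return rows


if __name__ == "__main__":
    # Hypothetical read IDs and lengths, for illustration only.
    for row in running_stats([("read_1", 6883), ("read_2", 1410), ("read_3", 950)]):
        print("\t".join(str(field) for field in row))
```

The last row's running total divided by the genome size gives the overall coverage of the corrected set.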
Sergey also sent along a quick summary of the correction:
The correction used the latest CA from the repository (as of 10/15/13). The max number of mappings each Illumina sequence could have was set to 20, the repeat separation was set to 10. The genome size was set to 180Mbp. BLASR was used to align the Illumina sequences to PacBio sequences for correction. The input for correction was 673 Mbp of PacBio data (max: 8,231, mean: 1,587) which corresponds to approximately 4.8X of a 140Mbp genome. The correction produced 469Mbp (max: 6,883, mean: 1,410) or 3.35X. The throughput was approximately 69%. To estimate the accuracy of the data, it was mapped back to the D. melanogaster reference. The sequences average 99.85% identity. Approximately 10% of the sequences mapped in more than a single piece to the reference. Some of these mappings were due to short indel differences between the reference and the reads or N’s in the reference. Some sequences mapped across large distances in the reference. I did not confirm if these split mappings were supported by the uncorrected reads or not.
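The coverage and throughput figures in Sergey's summary follow from simple division. The short Python sketch below reproduces them from the quoted totals, assuming the 140 Mbp genome size used for the coverage numbers (not the 180 Mbp setting passed to the correction run). Variable names are illustrative.

```python
GENOME_SIZE_MBP = 140   # genome size behind the quoted 4.8X and 3.35X figures
RAW_MBP = 673           # PacBio input to correction, from the summary
CORRECTED_MBP = 469     # corrected output, from the summary


def coverage(total_mbp, genome_mbp):
    """Fold coverage = total sequenced bases / genome size."""
    return total_mbp / genome_mbp


raw_cov = coverage(RAW_MBP, GENOME_SIZE_MBP)              # about 4.8X
corrected_cov = coverage(CORRECTED_MBP, GENOME_SIZE_MBP)  # 3.35X
throughput = CORRECTED_MBP / RAW_MBP                      # fraction of bases kept, about 0.69 to 0.70

if __name__ == "__main__":
    print(f"raw: {raw_cov:.2f}X, corrected: {corrected_cov:.2f}X, kept: {throughput:.1%}")
```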
We thank both Sergey and Adam for taking the time to run their pacBioToCA pipeline and making these data available to the community, and hope these data are of use to others in their research.
http://bergmanlab.smith.man.ac.uk/?p=2151#.UnqOzFOzu7I.twitter
Monday, November 4, 2013-Comparative Transcriptome Analysis: Insights from a Single SMRT Cell
In a new paper published in the journal Gene, scientists from Rutgers University and King’s College London report the use of a single SMRT® Cell to sequence and assemble more than 400 wheat-storage protein transcripts from 10 strains of the crop.
In “PacBio sequencing of gene families — A case study with wheat gluten genes,” authors Wei Zhang, Paul Ciclitira, and Joachim Messing note that traditional studies of these cDNA sequences are so costly and labor-intensive that they have not allowed for intensive study of “the variation of each orthologous gene copy among cultivars.”
That kind of study for complex traits “usually requires positional information from sequencing entire genomes,” a task that would be prohibitive for this type of cross-strain interrogation. “Comparative transcriptome analysis of gene families,” the scientists note, offers an alternative way to study multigenic traits “without a need to re-sequence the related genomes in their entirety.”
For transcriptome sequencing, short-read technologies eliminate the cost problem, the authors add, but the short sequences “are a critical barrier to assemble repetitive genes, which may result in inadvertently joining of different gene copies into chimeric molecules.”
PacBio® sequencing, on the other hand, offers not only the needed throughput but also read lengths capable of resolving long, complex genetic regions, Zhang et al. write. The paper reports a proof-of-principle study designed to determine whether SMRT Sequencing is a viable and scalable option for investigations of variation across several different crop strains.
The authors chose 10 wheat cultivars from around the world, used barcoded PCR primers for each, and pooled the samples to run on a single SMRT Cell. Sequence data had an average read length of 3,050 bp and included nearly 33,000 circular-consensus sequencing reads in the final analysis.
The scientists then compared results of one of their cultivars, a common type of wheat known as Chinese Spring, to information on the same cultivar from the NCBI protein database, finding high rates of concordance. “The accuracy of the assembly in Chinese Spring was validated with 99% identity from cDNAs obtained by conventional sequencing methods,” they report. They also succeeded in sidestepping the chimera problem of short-read sequencers: “With the redundancy in sequencing coverage and the length of the sequences, our assemblies avoid chimeric joining of different gene copies.”
Zhang et al. note that their method should be useful for other phylogenetic studies as well. “We suggest our method as an efficient, low-cost method for profiling gene expression of gene families from cultivars, which genome has not been [sequenced] or is only available as a draft sequence,” they write. http://blog.pacificbiosciences.com/2013/11/comparative-transcriptome-analysis.html
homologs Tutorials
SMRT Technology from Pacific Biosciences (PacBio)
http://www.homolog.us/Tutorials/
Posted on October 29, 2013 by nsengamalay Join us for an IGS-sponsored PacBio seminar Thursday, 11/7 at 11:00 in Discovery Auditorium, Biopark II. Come see how SMRT sequencing can advance your research! Click here for details.
This entry was posted in PacBio. http://www.igs.umaryland.edu/labs/grc/2013/10/29/339/
New Products: PacBio's P5-C3 Chemistry
October 29, 2013 http://www.genomeweb.com/sequencing/new-products-pacbios-p5-c3-chemistry?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+genomeweb%2Finsequence+%28In+Sequence%29
(Full-text access for premium subscribers only.)
Proof-of-Principle Study Points to PacBio Promise for Profiling Transcriptome
October 29, 2013
By Andrea Anderson (Full-text access for premium subscribers only.) http://www.genomeweb.com/node/1301956?utm_source=SilverpopMailing&utm_medium=email&utm_campaign=In%20Sequence:%20Oxford%20Nano%20Shows%20off%20MinIon;%20Illumina,%20Ion%20Torrent%20Product%20Plans;%20PacBio%20Q3%20Earnings%20%20-%2010/29/2013%2003:00:00%20PM
bluecheaps4me- Yes, it is a shame!
Mick Watson - 23 Oct. "There it is RT @EpgntxEinstein: Roche to develop a smaller (cheaper) PacBio machine focused on clinical sequencing" ?? (Mick Watson is Genomicist / Bioinformatician, Director of ARK-Genomics, a high-throughput facility at The Roslin Institute).
Rumor????
Tuesday, October 22, 2013-Data Release: Long-Read Shotgun Sequencing of a Human Genome
In order to help evaluate the utility of long, unbiased sequence reads for characterizing structural variation in the human genome using our recently released P5-C3 scaffolding sequencing chemistry, we have collected 10x long-read, shotgun coverage of a human genome sample. The human genome harbors many structural variations, including variable number tandem repeats, deletions, insertions, inversions, and repetitive mobile elements, which are often difficult to resolve using short-read technologies. We hope this data set will be of value to the bioinformatic and scientific community studying various forms of structural variation across the human genome. To access it, simply send us an email and you will receive instructions for downloading the data set.
In collaboration with Evan Eichler (Howard Hughes Medical Institute, University of Washington), we sequenced CHM1TERT, a well-studied cell line derived from a complete hydatidiform mole (CHM). A hydatidiform mole is defined as a pregnancy with no embryo and clinically presents in approximately 1 in 1,500 pregnant women in North America. The CHM cells have a diploid genome, typically XX, that is a result of replication of a haploid paternal (sperm) genome. Through the corresponding absence of allelic variation, this sample has been used to generate a haploid reference genome sequence, and many associated resources are available, including physical maps, genotypes (iSCAN), and a large-insert BAC library (CHORI-17). It is also one of the targets for the production of a higher quality “platinum” genome assembly.
We prepared ~20 kb DNA fragment libraries, size-selected with the BluePippin™ system from Sage Science, and sequenced with 3-hour movies using the P5-C3 sequencing chemistry. Some sequencing statistics are listed below:
Total number of reads: 3,679,463
Total number of post-filtered bases: 32,559,803,198
Average read length: 8,849 bp
Half of sequenced bases in reads greater than: 10,985 bp
5% of sequenced DNA inserts longer than: 18,060 bp
Longest DNA insert sequenced: 41,460 bp
PacBio® RS II instrument time for sequencing: 10 days
Number of SMRT® Cells: 66
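Two of the statistics above can be checked or reproduced with a few lines of Python. The average read length is total bases divided by total reads, and "half of sequenced bases in reads greater than" is what is usually called the read N50. The sketch below is illustrative only: the `n50` helper and the toy read lengths are assumptions, since the per-read lengths are not part of this post.

```python
TOTAL_READS = 3_679_463         # from the statistics above
TOTAL_BASES = 32_559_803_198    # post-filtered bases, from the statistics above

mean_read_length = TOTAL_BASES / TOTAL_READS  # rounds to the reported 8,849 bp


def n50(lengths):
    """Return the length L such that reads of length >= L contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no reads given")


if __name__ == "__main__":
    print(f"mean read length: {mean_read_length:.0f} bp")
    # Toy lengths for illustration only; the real per-read data is not shown here.
    print("toy N50:", n50([2, 2, 2, 3, 3, 4, 8, 8]))
```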
(see link)
Figure 1. Subread length distribution. A subread is a DNA insert sequenced between two SMRTbell™ hairpin adapters. The solid black line (right y axis) denotes the amount of sequenced bases greater than a given subread length (x axis).
We also mapped the data against the human reference genome (GRCh37) and found generally even coverage across the reference, with numerous examples of structural variations highlighted by the long reads. A mapping coverage summary and a few examples highlighting structural variation are given below.
Figure 2. Uniform sequencing coverage upon mapping against the GRCh37 human genome reference. (A) Example coverage for chromosome 3. The gap in the center is due to lack of sequence in the reference (~3 million N bases) of the centromere. (B) Coverage histogram over all non-N bases of the GRCh37 reference.
(see link)
Figure 3. Examples of large deletions. The sharp breakpoints from the even shotgun read structure, combined with the lack of read coverage, indicate a 114.2 kb and a 4.9 kb deletion in this ~375 kb region of chromosome 3. The individual sequence reads are shaded by length (reads in black are >10 kb). Both deletions have been validated and are polymorphic in the human population.
Figure 4. Sequence structure of the Fragile X Mental Retardation (FMR1) Triplet CGG Repeat. (A) Read mapping to the reference genome sequence shows many insertions (green vertical lines) across this region on the X chromosome. (B) Consensus building from the reads and dot plot comparison reveals the true structure including an additional AGG-(CGG)9 repeat block in the CHM1 genome. http://blog.pacificbiosciences.com/2013/10/data-release-long-read-shotgun.html?utm_content=buffer3a5d4&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer
October 22nd, 2013- PacBio Achieves 500Mb/SMRT Cell Throughput in Newly Released Human Data
The stats are quite fascinating. They have 66 SMRT cells producing 32,559,803,198 bases of post-filtered nucleotides. Therefore, on average, each SMRT cell produced 493 Mb of sequence. A few days back, we asked about the typical throughput of PacBio machines at SEQanswers. Lex Nederbragt and Genomax reported 150-300 Mb with pre-P4 chemistry and only one exceptional case of 730 Mb with P4 chemistry. On the other hand, 500 Mb per SMRT cell appears to be common with P5-C3 chemistry here. Other stats:
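The per-cell figure is just the post-filtered base count divided by the number of SMRT cells. A quick Python check is below; the ~3.1 Gb human genome size used for the rough coverage estimate is an assumption added here, not a number from the post.

```python
TOTAL_BASES = 32_559_803_198    # post-filtered bases reported for the data set
SMRT_CELLS = 66                 # number of SMRT cells reported
ASSUMED_GENOME_BP = 3.1e9       # assumption: approximate human genome size

mb_per_cell = TOTAL_BASES / SMRT_CELLS / 1e6        # average yield per cell in Mb
approx_coverage = TOTAL_BASES / ASSUMED_GENOME_BP   # rough fold coverage

if __name__ == "__main__":
    print(f"{mb_per_cell:.0f} Mb per SMRT cell, roughly {approx_coverage:.1f}x coverage")
```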
Average read length: 8,849 bp
Half of sequenced bases in reads greater than: 10,985 bp
5% of sequenced DNA inserts longer than: 18,060 bp
Longest DNA insert sequenced: 41,460 bp
PacBio® RS II instrument time for sequencing: 10 days
Their blog post has informative charts on read size distribution and one example of a 114.2 kb deletion in the human genome !!!
http://www.homolog.us/blogs/blog/2013/10/22/pacbio-releases-library-covering-human-genome-10x/
Saturday, October 19, 2013
Pacific Biosciences has been squeezing 454 in applications requiring long reads. My second (of two) 454 experiments was in de novo bacterial genome assembly, and some gain over pure Illumina was had -- but just after I got that dataset I got my first PacBio dataset with HGAP correction, and it was game over for any other current platform in that application. (A good article). http://omicsomics.blogspot.com/2013/10/ripples-from-454s-shutdown-announcment.html
Platform Comparison Study Finds PacBio Yields Most Complete Assembly of E. coli Genomes
October 15, 2013
By Monica Heger Full-text access for premium subscribers only http://www.genomeweb.com/sequencing/platform-comparison-study-finds-pacbio-yields-most-complete-assembly-e-coli-geno
Pacific Biosciences data analysis (published October 15, 2013): A microbial clock provides an accurate estimate of the postmortem interval in a mouse model system
We took advantage of the long reads provided by the Pacific Biosciences (PacBio) sequencing platform to gain more detailed taxonomic resolution of the abundant bacteria and eukaryotes found in the early and late stage decomposition communities. In both cases high quality circular consensus sequences (CCS) were used. The 16S PacBio data consists of sequences of 800–900 bp in length. The taxonomic identity of the sequences was assessed by assigning sequences to OTUs using the February 2011 release of the Greengenes (DeSantis et al., 2006; McDonald et al., 2012) 97% reference dataset. Taxonomy assignments of the long PacBio reads agreed with the short Illumina reads at the family level, and the long reads made it possible to determine the taxonomic identity of the reads to the genus and in most cases species level. Taxonomy of the PacBio sequences was also verified by placing sequences within a phylogenetic tree using the RAxML EPA algorithm (Berger et al., 2011). The 18S PacBio sequences were roughly 1,200 base pairs in length. The 18S data was clustered into OTUs at 97% similarity using the open-reference protocol described above and the curated Silva 108 database. Initial taxonomy assignment was done by BLAST (Altschul et al., 1990) with an e-value threshold of 1e-10. Taxonomy assignment was refined by placing the sequences within the Silva 108 reference tree using maximum likelihood with RAxML EPA. These resulting taxonomy assignments were used to resolve the taxonomy of highly abundant community members. Bacterial and microbial eukaryotic taxa found, sorted by abundance at each site, are reported in Supplementary File 1B. Genus and species level taxonomy was reported in relevant text of the manuscript. Because PacBio data were generated from only a small subset of samples we did not use these data for comparative analyses, and all statistical analyses were conducted using the Illumina HiSeq data.
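The e-value cutoff mentioned in the methods text can be illustrated with a short Python sketch. It assumes BLAST hits were written in the default tabular layout (`-outfmt 6`), where the e-value is the 11th tab-separated column; the paper does not say which output format was used, and the `filter_hits` helper, the inclusive comparison, and the sample rows are all made up for illustration.

```python
import io

EVALUE_CUTOFF = 1e-10  # threshold quoted in the methods text above


def filter_hits(handle, cutoff=EVALUE_CUTOFF):
    """Yield (query, subject, evalue) for hits at or below the cutoff.

    Assumes the default 12-column BLAST tabular layout; this is an
    assumption, not something the paper states.
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        evalue = float(fields[10])  # 11th column holds the e-value
        if evalue <= cutoff:
            yield fields[0], fields[1], evalue


# Hypothetical hits: the first passes the cutoff, the second does not.
sample = io.StringIO(
    "read1\trefA\t99.1\t1200\t10\t1\t1\t1200\t5\t1204\t1e-50\t2100\n"
    "read2\trefB\t82.0\t150\t27\t0\t1\t150\t40\t189\t3e-05\t60.2\n"
)
for query, subject, evalue in filter_hits(sample):
    print(query, subject, evalue)
```

In practice the same cutoff can be applied when BLAST is run rather than afterwards; the post-hoc filter here only shows what the threshold means.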
(Go to the link; very long article.)
http://elife.elifesciences.org/content/2/e01104
Roche Shutting Down 454 Sequencing Business
October 15, 2013
http://www.genomeweb.com/sequencing/roche-shutting-down-454-sequencing-business
Who is WvSchaik? He is an Assistant Professor in the Dept. of Medical Microbiology at UMC Utrecht (Utrecht, The Netherlands), with an interest in microbial genomics and the evolution of antibiotic resistance.
Nature Biotechnology | Research | Article
Published online 13 October 2013: A single-molecule long-read survey of the human transcriptome
Abstract: Global RNA studies have become central to understanding biological processes, but methods such as microarrays and short-read sequencing are unable to describe an entire RNA molecule from 5' to 3' end. Here we use single-molecule long-read sequencing technology from Pacific Biosciences to sequence the polyadenylated RNA complement of a pooled set of 20 human organs and tissues without the need for fragmentation or amplification. We show that full-length RNA molecules of up to 1.5 kb can readily be monitored with little sequence loss at the 5' ends. For longer RNA molecules more 5' nucleotides are missing, but complete intron structures are often preserved. In total, we identify ~14,000 spliced GENCODE genes. High-confidence mappings are consistent with GENCODE annotations, but >10% of the alignments represent intron structures that were not previously annotated. As a group, transcripts mapping to unannotated regions have features of long, noncoding RNAs. Our results show the feasibility of deep sequencing full-length RNA from complex eukaryotic transcriptomes on a single-molecule level.
(WvSchaik) "Is there even a need for Nanopore sequencing now that PacBio has improved its technology so much in the last year or so?" Tweet 12:46 PM - 13 Oct 13
http://www.nature.com/nbt/journal/vaop/ncurrent/pdf/nbt.2705.pdf
PacBio Blog:
Monday, October 7, 2013
Characterizing Structural Variation in the Human Genome: ASHG 2013 Workshop and Presentations
We are excited to participate in the annual American Society of Human Genetics meeting again this year on October 22-26 in Boston, MA. With so many new PacBio® technology advances since last year, we wanted to give you a preview of how users are applying SMRT® Sequencing to better elucidate a variety of complex regions in the human genome.
On Thursday, October 24, we’ll be hosting a luncheon workshop from 12:30 p.m. to 2:00 p.m. entitled ‘Characterizing Structural Variation in the Human Genome Using Long-Read SMRT Sequencing.’ Join us in room 152 of the convention center (BCEC) to hear from several speakers who will share their experience using SMRT Sequencing on human genomes. Here is the speaker lineup:
• Evan Eichler, Ph.D., University of Washington: Reconstructing Complex Regions of Genomes Using Long-Read Sequencing Technology
• Ali Bashir, Ph.D., Mt. Sinai School of Medicine: Highlighting Unexplored Genomic Regions with SMRT Sequencing
• Swati Ranade, Ph.D., Pacific Biosciences: Targeting Complex Structural Motifs in Genes and Haplotypes with Long-Read SMRT Sequencing
Register to attend the workshop or if you’re not attending ASHG this year, sign up to receive a recording of the workshop. There will also be several talks during ASHG sessions featuring SMRT Sequencing. We encourage you to attend for a great perspective on how long-read sequence data is transforming the way in which scientists are studying the human genome.
• Megan Y. Dennis, University of Washington: Palindromic GOLGA core duplicon promotes 15q13.3 microdeletion, inversion polymorphisms, and large-scale primate structural variation. Wednesday, 2:15 p.m., Westin Boston Waterfront Hotel, Grand Ballroom AB.
• Dan Geraghty, Fred Hutchinson Cancer Research Center: Complete resequencing of extended genomic regions using fosmid targeting and PacBio’s Single Molecule Real-Time (SMRT®) long-read sequencing technology. Thursday, 2:45 p.m., BCEC Room 210.
• Ayal Hendel, Stanford University: Sensitive and quantitative measurement of nuclease-mediated genome editing at human endogenous loci using SMRT sequencing. Thursday, 3:15 p.m., BCEC Room 210.
• Paul J. Hagerman, UC Davis School of Medicine: Mechanisms of pathogenesis in fragile X-associated disorders. Saturday, 10:00 a.m., BCEC Room 205.
And of course the PacBio team will be available to answer any questions about our sequencing platforms, numerous applications, and more. Visit us at exhibit hall booth #806.
http://blog.pacificbiosciences.com/2013/10/characterizing-structural-variation-in.html?utm_content=buffer61484&utm_source=buffer&utm_medium=twitter&utm_campaign=Buffer
Published: 3 October 2013
Using data generated solely from the Pacific Biosciences RS, we were able to generate the most complete and accurate de novo assemblies of E. coli strains. We found that the addition of other sequencing technology data offered no improvements over use of PacBio data alone. In addition, the sequencing data from the PacBio RS allowed for sensitive and specific calling of covalent base modifications.
http://www.biomedcentral.com/1471-2164/14/675/abstract
An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome
Published: 1 October 2013
Abstract (provisional)
Background
Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read-lengths and biased genome coverage leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error-rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome.
Results
Following error-correction, a total of 28,638 PacBio RS reads were recovered with a mean read length of 1,902 bp totalling 54,492,250 nucleotides and representing an average depth of coverage of 320x the chloroplast genome. The dataset covered the entire 154,959 bp of the chloroplast genome in a single contig (100% coverage) compared to seven contigs (90.59% coverage) recovered from an Illumina dataset, and revealed no bias in coverage of GC rich regions. Post-assembly the data were largely concordant with the Illumina data generated and allowed 187 ambiguities in the Illumina data to be resolved. The additional read length also permitted small differences in the two inverted repeat regions to be assigned unambiguously.
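The read-length and coverage figures in the Results paragraph can be cross-checked with simple division. This is a minimal sketch using only the numbers quoted in the abstract; the function names are illustrative. Dividing total nucleotides by genome length gives roughly 352x, somewhat above the reported 320x. The abstract does not say how its depth was calculated (it may, for example, count only aligned bases, which is a guess), so the gap should not be read as an error in the paper.

```python
# Numbers quoted in the abstract above.
READS = 28_638
TOTAL_BASES = 54_492_250
GENOME_BP = 154_959  # P. micrantha chloroplast genome length


def mean_read_length(total_bases, reads):
    """Mean read length in bp."""
    return total_bases / reads


def naive_depth(total_bases, genome_bp):
    """Depth of coverage if every sequenced base mapped to the genome."""
    return total_bases / genome_bp


print(f"mean read length: {mean_read_length(TOTAL_BASES, READS):.1f} bp")  # 1902.8 bp
print(f"naive depth: {naive_depth(TOTAL_BASES, GENOME_BP):.1f}x")          # 351.7x
```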
Conclusions
This is the first report to our knowledge of a chloroplast genome assembled de novo using PacBio sequence data. The PacBio RS data generated here were assembled into a single large contig spanning the P. micrantha chloroplast genome, with a higher degree of accuracy than an Illumina dataset generated at a much greater depth of coverage, due to longer read lengths and lower GC bias in the data. The results we present suggest PacBio data will be of immense utility for the development of genome sequence assemblies containing fewer unresolved gaps and ambiguities and a significantly smaller number of contigs than could be produced using short-read sequence data alone.
http://www.biomedcentral.com/1471-2164/14/670/abstract
http://sciencealerts.com/stories/2513977/An_evaluation_of_the_PacBio_RS_platform_for_sequencing_and_de_novo_assembly_of_a_chloroplast_genome.html?utm_source=dlvr.it&utm_medium=twitter
NanoString, Accelerate, PacBio Shares Sharply up in September; Myriad, Sequenom Down
October 01, 2013
By a GenomeWeb staff reporter
http://www.genomeweb.com/nanostring-accelerate-pacbio-shares-sharply-september-myriad-sequenom-down?utm_source=twitterfeed&utm_medium=twitter&utm_campaign=Feed%3A+genomeweb%2Fgenomeweb-daily-news+%28GenomeWeb+Daily+News%29