Wednesday, December 11, 2013 5:11:55 PM
Understanding the biology of a genome requires knowing the full complement of mRNA isoforms. In recent years, microarrays, high-throughput cDNA sequencing, and RNA-seq have become very useful tools for studying transcriptomes. High-throughput cDNA sequencing is accurate but laborious, while the inherently complex nature of the transcriptome makes transcript assembly computationally intractable. Recently, Steijger et al. (1) showed that complete isoform reconstruction from RNA-seq short-read data remains challenging even when all constituent exons are identified.
A number of recent publications have demonstrated the utility of full-length transcript sequencing by taking advantage of the long read lengths of SMRT® Sequencing technology (2)–(4). SMRT Sequencing produces reads that originate from independent observations of single molecules; no assembly is needed if a read spans the entire length of the transcript. To demonstrate the capabilities of PacBio® Isoform Sequencing (Iso-Seq) technology and show a glimpse of the complexity of eukaryotic transcriptomes, we generated a deep dataset of full-length cDNA sequencing of RNA from MCF-7, a human breast cancer cell line. The sequencing data was collected from several internal training sessions where different library preparation techniques were tested. We are releasing the underlying data in an effort to aid the design of future PacBio Iso-Seq experiments and to spur advances in the development of bioinformatics tools for analyzing full-length transcripts.
In our final dataset, we obtained 44,531 non-redundant transcript-length consensus sequences ranging from 400 bp to 4,900 bp, with an average length of 1,929 bp (Fig. 1a). In total, 0.27% of consensus bases disagreed with the hg19 genome, of which 0.16% is due to substitutions that could be true SNPs (Fig. 1b). About half of the transcribed loci have one observed isoform, while most of the rest have 2-5 isoforms (Fig. 2). We compared our predicted full-length transcripts against the known annotations and found that we were able to recover full-length alternative splice forms (Fig. 3), alternative polyadenylation, novel transcripts, and known fusion genes (Fig. 4). We encourage interested researchers to explore the dataset.
Materials & Methods
Full-length cDNA was generated from polyA RNA using standard cDNA synthesis kits (Clontech® SMARTer™ and Invitrogen® Superscript® kits). To capture longer, rarer transcripts in sufficient abundance, parts of the double-stranded cDNA were size selected into three fractions, which were subsequently amplified and converted into SMRTbell™ templates. Details on the sample preparation can be found on Sample Net. SMRTbell libraries were sequenced using the P4-C2 sequencing chemistry with 2-hour movies.
After sequencing, we computationally determined the completeness of the sequences using polyA-tail signals and library adapters. To obtain a non-redundant set of full-length, high-quality transcript sequences without bias from other sequencing platforms, we developed a de novo, isoform-level clustering algorithm that uses only PacBio data. Briefly, the algorithm iteratively clusters reads to generate consensus sequences that represent the original transcripts. The algorithm takes into account the existence of the polyA-tail signal to differentiate isoforms with alternative stop sites. The final consensus sequences were called using Quiver and filtered to create the final polished, full-length, non-redundant dataset. Details of the clustering algorithm will be described in two upcoming webinars on Wednesday, January 22 at 8 AM PST and 5 PM PST.
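The completeness check described above can be illustrated with a minimal sketch. It is not the pipeline used for this dataset: the adapter sequences, the minimum polyA length, and the function name are hypothetical placeholders, and exact string matching stands in for the error-tolerant matching that real reads would need.

```python
import re

# Hypothetical placeholder sequences; the actual library adapters are not
# given in the text.
FIVE_PRIME_ADAPTER = "GATTACAGATTACA"
THREE_PRIME_ADAPTER = "CCTTGGAACCTTGG"
MIN_POLYA_LENGTH = 15  # assumed minimum tail length, not from the source

# A run of at least MIN_POLYA_LENGTH A's ending at the end of the insert.
POLYA_TAIL = re.compile(r"A{%d,}$" % MIN_POLYA_LENGTH)


def trim_full_length(read):
    """Return the transcript with adapters and polyA tail removed,
    or None if the read does not look full-length."""
    has_both_adapters = (read.startswith(FIVE_PRIME_ADAPTER)
                         and read.endswith(THREE_PRIME_ADAPTER))
    if not has_both_adapters:
        return None
    insert = read[len(FIVE_PRIME_ADAPTER):-len(THREE_PRIME_ADAPTER)]
    tail = POLYA_TAIL.search(insert)
    if tail is None:
        return None  # both adapters present, but no polyA tail
    return insert[:tail.start()]


body = "ATGGCCTTGACG"
full = FIVE_PRIME_ADAPTER + body + "A" * 20 + THREE_PRIME_ADAPTER
partial = FIVE_PRIME_ADAPTER + body + "A" * 20  # 3' adapter missing
print(trim_full_length(full))
print(trim_full_length(partial))
```

A read counts as full-length here only when all three signals are present. Note that a transcript whose last genomic base is an A would lose that base to the tail trimming in this simplified version.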
Some statistics from the sequencing and results are listed below:
• Number of SMRT Cells: 119
  • No size selection: 12
  • 1-2 kb: 37
  • 2-3 kb: 37
  • > 3 kb: 33
• Total number of post-filtered bases: 14,062,161,755
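As a quick sanity check on the figures above, the per-fraction SMRT Cell counts can be summed and a mean yield per cell derived. This is only a sketch: the variable names are illustrative, and the mean is taken across all cells because the per-fraction yields are not given.

```python
# SMRT Cell counts per size fraction, copied from the list above.
smrt_cells = {
    "no size selection": 12,
    "1-2 kb": 37,
    "2-3 kb": 37,
    "> 3 kb": 33,
}
total_bases = 14_062_161_755  # post-filtered bases, from the list above

total_cells = sum(smrt_cells.values())
mean_bases_per_cell = total_bases / total_cells

print(f"SMRT Cells: {total_cells}")
print(f"Mean post-filtered yield per cell: {mean_bases_per_cell / 1e6:.1f} Mb")
```

The fraction counts add up to the stated 119 cells, and the mean works out to roughly 118 Mb of post-filtered bases per cell.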
Figure 1. (see link) (a) Length distribution of polished, non-redundant transcript sequences. Each transcript sequence represents a unique isoform. (b) Breakdown of differences to hg19. Consensus sequences were mapped to hg19 using GMAP (version 2013-07-20) with default parameters. Different error categories were aggregated over all 44,531 transcript sequences. Some errors are likely to be due to real biological differences from the reference sequence.
Figure 2. (see link) Number of isoforms per locus. Transcripts that overlap in genomic coordinates by at least 1 bp are grouped together to form non-overlapping transcribed loci. Total number of loci: 14,385. The majority (61%) of transcribed loci have only 1 or 2 transcripts, while 0.6% have 20 or more isoforms. This is consistent with other studies of full-length cDNA sequencing of a single sample type [5].
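The locus grouping described in the Figure 2 caption amounts to merging overlapping intervals. The sketch below is one way to do that, under assumptions not stated in the source: coordinates are 0-based and half-open, strand is ignored, and the function name and example transcripts are made up for illustration.

```python
from collections import Counter


def group_into_loci(transcripts):
    """Group (chrom, start, end, name) tuples into loci.

    Transcripts on the same chromosome that overlap by at least 1 bp,
    directly or through a chain of overlaps, share a locus.
    """
    loci = []
    current = None
    # Sorting by chromosome and start lets one pass merge all overlaps.
    for chrom, start, end, name in sorted(transcripts):
        overlaps = (current is not None
                    and chrom == current["chrom"]
                    and start < current["end"])  # half-open: >= 1 bp shared
        if overlaps:
            current["end"] = max(current["end"], end)
            current["names"].append(name)
        else:
            current = {"chrom": chrom, "start": start, "end": end,
                       "names": [name]}
            loci.append(current)
    return loci


# Made-up example transcripts.
example = [
    ("chr1", 100, 500, "t1"),
    ("chr1", 450, 900, "t2"),   # overlaps t1
    ("chr1", 900, 1200, "t3"),  # touches t2 but shares no base
    ("chr2", 100, 300, "t4"),
]
loci = group_into_loci(example)
isoforms_per_locus = Counter(len(locus["names"]) for locus in loci)
print(len(loci), dict(isoforms_per_locus))
```

Counting the transcripts in each resulting locus gives the kind of isoforms-per-locus histogram shown in Figure 2. Whether the original analysis separated loci by strand is not stated, so this sketch does not.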
Figure 3. (see link) UCSC browser screenshot of the CREM gene region. PacBio transcripts (top, red) capture multiple isoforms of the CREM gene, including alternatively spliced exons and alternative polyadenylation sites.
Figure 4. (see link) Known cancer fusion gene BCAS4/BCAS3 identified. PacBio transcripts (top, red) show three different fusion variants of the BCAS4/BCAS3 genes. All three variants contain a portion of the 5’ region of the BCAS4 gene (chr20q13) and a portion of the 3’ region of the BCAS3 gene (chr17q23).
References
1. T. Steijger, J. F. Abril, P. G. Engström, et al., “Assessment of transcript reconstruction methods for RNA-seq,” Nat. Methods, vol. 10, no. 12, pp. 1177–1184, Nov. 2013.
2. D. Sharon, H. Tilgner, F. Grubert, and M. Snyder, “A single-molecule long-read survey of the human transcriptome,” Nat. Biotechnol., vol. 31, no. 11, pp. 1009–1014, Nov. 2013.
3. W. Zhang, P. Ciclitira, and J. Messing, “PacBio sequencing of gene families-a case study with wheat gluten genes,” Gene, 2013.
4. K. F. Au, V. Sebastiano, P. T. Afshar, J. D. Durruthy, L. Lee, B. A. Williams, H. van Bakel, E. E. Schadt, R. A. Reijo-Pera, J. G. Underwood, and W. H. Wong, “Characterization of the human ESC transcriptome by hybrid sequencing,” Proc. Natl. Acad. Sci. U. S. A., Nov. 2013.
(link) http://blog.pacificbiosciences.com/2013/12/data-release-human-mcf-7-transcriptome.html?m=1