Friday, July 05, 2013 3:32:37 PM
Pacific Biosciences published a paper earlier this year on an approach to sequencing and assembling a bacterial genome that leads to a near-finished or finished genome. The approach, dubbed Hierarchical Genome Assembly Process (HGAP), relies on PacBio reads only, without the need for short reads. This is how it works:
•generate a high-coverage dataset of the longest reads possible; aim for 60-100x in raw reads
•pre-assembly: use the reads from the shorter part of the raw read length distribution to error-correct the longest reads; set the length cutoff so that the longest reads make up about 30x coverage
•use the long, error-corrected reads in a suitable assembler, e.g. Celera, to produce contigs
•map the raw PacBio reads back to the contigs to polish the final sequence (rather, recall the consensus using the raw reads as evidence) with the Quiver tool
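To make the cutoff in the pre-assembly step concrete, here is a minimal Python sketch of how such a seed-length cutoff could be picked. It is not the HGAP implementation: the function name `seed_length_cutoff` and its parameters are hypothetical, and it assumes you have a list of raw read lengths and a rough estimate of the genome size.

```python
def seed_length_cutoff(read_lengths, genome_size, target_coverage=30):
    """Return a length cutoff such that reads at least this long
    together give roughly `target_coverage` x of the genome.

    Hypothetical helper for illustration; `genome_size` is an
    estimate supplied by the user, in the same unit as the reads (bp).
    """
    if not read_lengths:
        raise ValueError("no read lengths given")
    needed = genome_size * target_coverage
    total = 0
    # Walk from the longest read downwards until enough bases are collected.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed:
            return length
    # Too little data to reach the target: every read becomes a seed.
    return min(read_lengths)


if __name__ == "__main__":
    # Toy numbers only: six "reads" and a genome of 1 bp.
    print(seed_length_cutoff([10, 9, 8, 7, 6, 5], genome_size=1))
```

Reads at or above the returned length would be the seeds to correct; everything shorter would be used to correct them.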
The approach is very well explained on this website. As an aside, the same principle can now be used with the PacBioToCA pipeline.
In principle, this approach could result in a finished genome, i.e. a gapless contig per chromosomal element (chromosomes and plasmids). A more theoretical study confirms this:
“Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library.” (Koren et al, arXiv:1304.3752)
As always, the proof is in the pudding. There have been reports, and even a publication here and there, that the HGAP approach actually works. In this blog post I would like to add our experiences, which in short are that HGAP can indeed result in (close-to) finished genomes.
At the Norwegian Sequencing Centre, with which I am affiliated, we recently received several bacterial genome DNA samples for PacBio sequencing. Given our very positive first experiences with size selecting PacBio libraries using the BluePippin, see my previous post, we decided to use this instrument also for these samples. Four of the samples yielded very nice libraries, which were sequenced, two SMRTcells each, on our (recently upgraded) PacBio RSII instrument.
Raw reads
We have never seen such long reads:
Sample            PB_0027   PB_0028   PB_0029   PB_0031
Count             80512     58524     45514     84169
Sum               595 Mbp   462 Mbp   351 Mbp   669 Mbp
Av. length (bp)   7393      7893      7714      7951
N50 (bp)          10662     11205     11109     11162
Largest (bp)      24397     25552     23992     25678
Note that the average read length is much longer than the roughly 4.6 kbp given in the specifications of the RSII.
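For readers who want to compute the same kind of summary for their own data, here is a small Python sketch. It assumes the read lengths have already been extracted into a list; the function names `n50` and `summarize` are my own and not part of any PacBio tool.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def summarize(lengths):
    """Count, sum, mean, N50 and largest for a list of read lengths (bp)."""
    return {
        "count": len(lengths),
        "sum": sum(lengths),
        "mean": sum(lengths) / len(lengths),
        "n50": n50(lengths),
        "largest": max(lengths),
    }


if __name__ == "__main__":
    # Toy lengths, not the data from this post.
    print(summarize([2, 3, 4, 5, 6]))
```

The N50 is weighted by bases rather than by reads, which is why it sits well above the average for a long-tailed read length distribution.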
These reads were then used in HGAP. We have smrtpipe, the analysis suite of Pacific Biosciences, installed, so I could simply make a file with the names of the input files, a default HGAP settings xml file, and run the whole thing on one of our big servers. The assemblies took about two days when given 32 CPUs and a lot of memory – I haven’t logged how much RAM they actually used.
Pre-assembly
Here are the results of pre-assembly, the correction of the longest 30x in raw reads with the rest of the reads:
Sample            PB_0027   PB_0028   PB_0029   PB_0031
Cutoff (bp)*      12106     12077     10371     12780
Count             9186      8636      10059     9252
Sum               107 Mbp   100 Mbp   106 Mbp   110 Mbp
Av. (bp)          11594     11562     10540     11876
N50 (bp)          12519     12770     11513     13120
Largest (bp)      20043     19090     19030     18681
*Cutoff: minimum length of seeds for error-correction.
After pre-assembly, there was roughly 100 Mbp or more in error-corrected, potentially high-quality reads, with an N50 higher than one sometimes sees for the contigs of a short-read bacterial genome assembly!
Assembly
These 8 – 10 thousand reads were assembled by Celera, with Quiver polishing, into:
PB_0027 PB_0028 PB_0029 PB_0031
Contigs 3.4 Mbp 3.2 Mbp 4.3 Mbp 1.8 Mbp
45 kbp 76 kbp 80 kbp 1.3 Mbp
64 kbp 1.1 Mbp
45 kbp 0.95 Mbp
17 kbp 95 kbp
Wow, mostly one laaarge contig (and I checked, these are without ‘N’ bases) and a few shorter ones. The exception was the last strain, which assembled into a few large pieces that together, as far as I understand, are larger than expected. A further step for this assembly is to try the Minimus2 tool, to see whether there is enough overlap between the contigs to further reduce their number, a step generally recommended for HGAP assemblies. I haven’t tried this yet for this assembly.
So, it looks like ‘it just works’. Well, there was at least one case where a misassembly is suspected. Looking at the coverage plot (of the remapped raw PacBio reads) for the 4.3 Mbp contig of PB_0029, we saw this:
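Checking contigs for ‘N’ bases is easy to script. The sketch below is one possible way to do it in plain Python, assuming the assembly is available as FASTA text; the function name `contig_stats` is hypothetical and no particular assembler output layout is implied.

```python
def contig_stats(fasta_lines):
    """Return {contig_name: (length, n_count)} for FASTA-formatted lines.

    `fasta_lines` can be an open file handle or any iterable of strings.
    """
    stats = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            stats[name] = [0, 0]
        elif name is not None:
            stats[name][0] += len(line)
            stats[name][1] += line.upper().count("N")
    return {key: tuple(value) for key, value in stats.items()}


if __name__ == "__main__":
    # Toy records, not the contigs from this post.
    example = [">c1 first contig", "ACGT", "NNAC", ">c2", "acgn"]
    for contig, (length, n_count) in contig_stats(example).items():
        print(contig, length, n_count)
```

With a real file this would be called as `contig_stats(open(path))`, where `path` points to your own assembly FASTA.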
Mapping coverage of raw PacBio reads to the largest contig of the PB_0029 assembly
The sudden jump in coverage after 1.2 Mbp points to a fusion of the sequences of two chromosomes, and in fact this is quite likely the case given what is known about these strains. For the others, it remains to be seen whether the smaller pieces are in fact plasmids, or should be part of the major chromosome.
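A jump like this can also be looked for automatically. The following Python sketch finds the single split point that best separates a coverage profile into two levels. It assumes the coverage has already been averaged per window along the contig; the function name, the window size and the coverage values in the example are all made up for illustration and are not the values behind the plot.

```python
def find_coverage_jump(window_cov, min_windows=5):
    """Return (index, left_mean, right_mean) for the best two-level split.

    `window_cov` holds the mean coverage per window along one contig.
    The split is placed before window `index`. Each side must contain
    at least `min_windows` windows.
    """
    n = len(window_cov)
    if n < 2 * min_windows:
        raise ValueError("too few windows for the requested minimum segment size")
    total = sum(window_cov)
    best = None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += window_cov[i - 1]
        if i < min_windows or n - i < min_windows:
            continue
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        # Between-segment sum of squares: larger means a cleaner split.
        score = i * (n - i) / n * (left_mean - right_mean) ** 2
        if best is None or score > best[0]:
            best = (score, i, left_mean, right_mean)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Synthetic profile: 100 kbp windows, lower coverage for the first 1.2 Mbp.
    synthetic = [50] * 12 + [100] * 31
    print(find_coverage_jump(synthetic))
```

A flat profile will still return some split point, so the two means need to be compared before reading anything into the result.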
A few remarks before I conclude:
•these four samples are clearly success stories
•all had modest GC percentages, around 35 – 50%
•we also have had a sample that didn’t fragment very well and only yielded a 2 kbp insert library (giving CCS reads after sequencing)
•another strain didn’t behave as well either, resulting in reads averaging 3.5 kbp – assembly for this one has not been started yet
•there is no reference genome for these samples, so assembly accuracy, and per-base quality, could not be assessed fully
Conclusion
It looks like, for well-behaved samples, the approach of combining PacBio library creation, BluePippin size selection (optional, but highly recommended) and sequencing of two SMRTcells works very well to give finished or near-finished bacterial genome assemblies. I want to emphasise the following, though: even though the assembly looks great, it is afterwards up to the biologist/researcher to make sure the contigs actually make sense given:
•the remapping of the reads
•what is known about the species (e.g. expected number of chromosomes)
•what is known about the sample (e.g., presence of plasmids)
•other, independent evidence, e.g. Illumina reads, optical mapping results, etc.
The title of this post asks: “De novo bacterial genome assembly: a solved problem?”. I dare to say we’re pretty close…
A bioinformatician’s side-note
The bottleneck of the HGAP process was the two consensus calling steps: when the consensus of the longest reads is being called (based on the mapped shortest ones), and especially the Celera contig consensus calling. The latter takes one contig at a time, and since these are now becoming millions of basepairs long, this can take many hours, perhaps even half the total assembly time. By the way, overlapping the error-corrected reads was done in minutes… So, if someone is interested in developing a parallelised consensus caller that can work on parts of long contigs and stitch the consensus sequences back together when done, we bioinformaticians doing HGAP would be very grateful…
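To show what I mean by working on parts of a contig, here is a rough Python sketch of the windowing and stitching logic only. Everything in it is hypothetical: `toy_consensus` is a stand-in that just uppercases the draft, where a real caller would use the reads mapped to the window plus its flanks, and it assumes each call returns the consensus for the core region of its window only. Threads are used to keep the sketch short; real CPU-bound consensus calling would need separate processes or cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def make_windows(contig_length, window=100_000, flank=1_000):
    """Return (core_start, core_end, ctx_start, ctx_end) tuples.

    Core regions tile the contig without overlap; the context adds
    `flank` bases on each side so window edges still see neighbours.
    """
    windows = []
    for core_start in range(0, contig_length, window):
        core_end = min(core_start + window, contig_length)
        ctx_start = max(0, core_start - flank)
        ctx_end = min(contig_length, core_end + flank)
        windows.append((core_start, core_end, ctx_start, ctx_end))
    return windows


def toy_consensus(draft, win):
    """Placeholder caller: returns the uppercased core of the window."""
    core_start, core_end, ctx_start, ctx_end = win
    context = draft[ctx_start:ctx_end]
    return context[core_start - ctx_start:core_end - ctx_start].upper()


def parallel_consensus(draft, call=toy_consensus, window=100_000,
                       flank=1_000, workers=4):
    """Call consensus per window concurrently and stitch the cores in order."""
    wins = make_windows(len(draft), window, flank)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of `wins`, so joining is safe.
        pieces = list(pool.map(lambda w: call(draft, w), wins))
    return "".join(pieces)


if __name__ == "__main__":
    print(parallel_consensus("acgt" * 10, window=7, flank=2))
```

Because the cores do not overlap, stitching is plain concatenation; the hard part a real tool would have to solve is making each window's call consistent with its neighbours at the boundaries.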
Acknowledgments
This post would not have been possible without the excellent skills of the NSC lab team, and I thank the owners of the bacterial samples described here for permission to use the metrics in this post. I apologize in advance for not being able to share the (raw and assembled) data presented here…
It is a better read if you click on the link: http://flxlexblog.wordpress.com/2013/07/05/de-novo-bacterial-genome-assembly-a-solved-problem/#more-420