Pacific Biosciences of California Inc (PACB): Wednesday, August 26, 2015 The Road...

Pacific Biosciences of California Inc (PACB)

Paulieme

Re: None

Thursday, 08/27/2015 5:15:07 PM

Wednesday, August 26, 2015

The Road to Hell is Paved with Bioinformatics Formats

If you really want to raise a bioinformaticist's blood pressure, loudly declare that your new tool generates output in brand new data formats. This leads to the frequent observation that a large fraction of bioinformatics work is simply converting formats. It is probably the consensus that the field is awash in too many formats, though it is equally clear that we can't agree on which should survive. Between some recent news and a Twitter thread on the subject that erupted last night, there was a bunch of fodder for me to collect in a Storify -- and to lay out my own idiosyncratic views.

For example, today Pacific Biosciences announced at their bioinformatics conference that they are moving off HDF5 for read data and will go to unaligned BAM. From the point of view of existing bioinformatics tools, that's a win -- many tools can consume BAM. Except, of course, most tools that take in unaligned data. And while BAM has its merits, metadata is stored in a very simple tag-value format. HDF5 has the problem that a lot of the tools for it aren't well developed; one reason I tried out Julia last year was to deal with the HDF5 files from Oxford Nanopore's MinION; Perl's HDF5 library choked on them.

One sentiment you'll find in the Storify is a hope that Oxford will abandon HDF5 - but it is certainly not my wish. HDF5 is a sophisticated structured (and compressed) format, enabling a rich representation of metadata. For Nanopore and probably many future sequencing technologies, converting to FASTQ or BAM would lose a lot of information. Indeed, tools which use signal-level data from MinION are already appearing, such as nanopolish.

Sadly, the history of bioinformatics seems to be littered with an aversion to richly-structured data formats. NCBI tried to push ASN.1 as a format for Genbank back when I was a graduate student, but as far as I can tell it never caught on outside NCBI. XML, which is similar, seems to have made limited headway, showing up in schemes such as Systems Biology Markup Language (SBML), but seemingly avoided as often as used. Actually, there is a phylogenetic tree file I was playing with the other day in NeXML format, which is great -- except when I handed it off to a colleague whose tree viewer couldn't use it.

The big advantage of XML, ASN.1, HDF5 and YAML is that the nightmare of parsing bad formats is eliminated; there are real standards here. Compare this with all the ways a Genbank flatfile can be botched (I see this complaint routinely on Twitter), in part because Genbank files (and I think PDB; haven't munged one in a while) retain the mainframe-era penchant for the lateral position of text being meaningful. Format validators exist for the standardized formats, meaning that it is straightforward to prove that a given file is legal. Now, whether it is gibberish is a different question; NCBI had to invest a lot of time early on cleaning up bad ranges and such in the Genbank data they inherited.

In contrast, take FASTQ format -- please. It is very simple, which is a little nice, but it is mostly written and read by computers, which should be able to deal with complexity. The cost of that simplicity is an inability (inherited from FASTA format) to carry any sort of standardized structured metadata. For example, just storing what quality encoding scheme is in use would be worth a mint, let alone which platform generated the data. FASTQ doesn't seem to break many programs -- in contrast with FASTA, in which all sorts of arguments erupt over what is legal and illegal metadata encoding in the header (hint: there is no standard, so anything goes!).
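To make the quality-encoding complaint concrete: because a FASTQ file never states its encoding, tools are left to guess it from the characters they see. The sketch below is a common style of heuristic, not a standard; the function name and the cutoffs (ASCII 59 and 74) are my own assumptions for illustration, and a file of uniformly mid-range qualities is genuinely ambiguous.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Heuristic only: returns None when the observed characters
    fit both encodings or when there is no data at all.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    # Characters below ';' (59) cannot occur in the +64 family.
    if lowest < 59:
        return 33
    # Assumed cutoff: values above 'J' (74) with nothing low
    # suggest an older +64 style encoding.
    if highest > 74:
        return 64
    return None  # ambiguous: the file itself cannot tell us
```

A single metadata field in the file would make the whole function unnecessary.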

Also consider this vision: a lot of high throughput sequencing consists of reading FASTQ files, making minimal changes to them (primarily trimming) and writing new FASTQ files. Imagine if instead of keeping a trail of FASTQ files, the reads were all stored in a simple relational database (SQLite is a gem for this sort of thing) and all those transformations stored as edits or replacements of the original sequence, with the metadata trackable throughout. Sure, those indexes and structures and metadata would consume some space, but far more would be saved. It's (at least to me) a beautiful idea -- except no tool out there could use it.
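A minimal sketch of that idea, using Python's built-in sqlite3 module. The table layout, column names and function names here are hypothetical, invented for illustration; no existing tool uses this schema. The original read is stored once, and each trimming step is recorded as an interval to keep, in coordinates of the original read.

```python
import sqlite3

def open_read_store(path=":memory:"):
    """Create (or open) a read store; the schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS reads (
            read_id  TEXT PRIMARY KEY,
            sequence TEXT NOT NULL,
            quality  TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS edits (
            edit_id    INTEGER PRIMARY KEY,
            read_id    TEXT NOT NULL REFERENCES reads(read_id),
            tool       TEXT NOT NULL,      -- which program made the edit
            keep_start INTEGER NOT NULL,   -- half-open interval, in
            keep_end   INTEGER NOT NULL    -- original-read coordinates
        );
    """)
    return conn

def add_read(conn, read_id, sequence, quality):
    conn.execute("INSERT INTO reads VALUES (?, ?, ?)",
                 (read_id, sequence, quality))

def record_trim(conn, read_id, tool, keep_start, keep_end):
    """Record a trim without touching the stored original read."""
    conn.execute(
        "INSERT INTO edits (read_id, tool, keep_start, keep_end) "
        "VALUES (?, ?, ?, ?)",
        (read_id, tool, keep_start, keep_end))

def current_read(conn, read_id):
    """Apply every recorded trim, in order, to the original read."""
    seq, qual = conn.execute(
        "SELECT sequence, quality FROM reads WHERE read_id = ?",
        (read_id,)).fetchone()
    start, end = 0, len(seq)
    rows = conn.execute(
        "SELECT keep_start, keep_end FROM edits "
        "WHERE read_id = ? ORDER BY edit_id", (read_id,))
    for keep_start, keep_end in rows:
        # Successive trims intersect the interval still kept.
        start, end = max(start, keep_start), min(end, keep_end)
    return seq[start:end], qual[start:end]
```

Because the edits table names the tool behind each trim, the provenance of every read stays queryable, which a trail of intermediate FASTQ files cannot offer.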

On the bright side, some formats have died. When I was an undergraduate, every database and sequence handling program seemed to have its own format. One of my first graduate school projects was a multi-format parser so I could read files in FASTA, Genbank, EMBL, SwissProt (very similar to EMBL) and GCG formats, and write FASTA, Genbank and GCG -- but that was hardly the whole spectrum of formats in use those days (though it was the set I needed to deal with). I wish I could say that nobody is reinventing that wheel, but at Starbase I'm frequently asking for files sent to me to be converted out of Geneious or DNA*STAR formats. You'd think these folks would quit inventing proprietary binary formats, but noooo. Reminiscent of ABI, which for a long time kept the binary format for Sanger tracefiles a secret, with regular changes to screw up anyone who had hacked the format. Yet another advantage to formats such as XML that come with a format definition -- it is often possible to write parsers that simply ignore the parts of the file they don't understand, so long as the XML (or ASN.1 or similar) is valid.
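As a rough illustration of that last point, here is a sketch using Python's standard xml.etree.ElementTree. The element and attribute names (record, id, sequence) are made up for the example and do not correspond to any real bioinformatics schema. The reader picks out only what it understands and silently skips everything else, while a file that is not well-formed XML is rejected outright.

```python
import xml.etree.ElementTree as ET

def read_records(xml_text):
    """Extract the fields this parser knows about; ignore the rest.

    Raises xml.etree.ElementTree.ParseError if the input is not
    well-formed XML.
    """
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("record"):
        records.append({
            "id": rec.get("id"),
            # Unknown child elements are simply never looked at.
            "sequence": rec.findtext("sequence", default=""),
        })
    return records
```

A newer version of the hypothetical format could add, say, an instrument element to each record, and this reader would keep working unchanged.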

While I'm ranting, a related curse is that every programmer feels a need to come up with a different way of naming the parameters and a different way of collecting them in a file. I feel a little churlish complaining, as I love these three tools, but wouldn't it be nice if MIRA, Celera Assembler and SPAdes had parameter file formats that were at least closely related to each other? I really admire the effort that went into Nucleotid.es; just contemplating putting together all those config files would probably cause me to nix the project.

So in summary, I'd vote for rich formats which are precisely defined so that the headache of parsing, mis-parsing & crashing parsers can be behind us. Stop coming up with yet another tab-delimited mess (though I'm afraid this is a bit of a do-as-I-say-not-as-I-do -- I'm terrible about imposing such on myself). If you start a project, try to look around for something existing to at least steal generously from, instead of inventing yet another idiosyncratic format for bioinformaticians to curse out.

http://omicsomics.blogspot.com/2015/08/the-road-to-hell-is-paved-with.html
Posted by Keith Robison at 11:52 PM