picture home | pixelblog | qt_tools

David Van Brink // Tue 2007.08.7 11:20

Basics: Know Your Document Format

An application typically stores its state as a file. As an application user, you expect this file to work, even if you don’t touch it for a while. This note offers some primal advice on implementing a file format in your application.

Basics: Know Your Document Format

Your Application’s Document

Be in control of your file format — it’s your most persistent API. Application versions will come and go, but you and your users expect the documents to last forever.

One lesson I’ve learned is simple: there are no shortcuts for mastering your own file format. You have to design it and implement it. By all means leverage existing parsers, such as for XML or RIFF or what not. But I honestly believe that there is no appropriate automation for translating your in-memory-model to a file format.

Tools like JAXB which generate XML serializing and deserializing code based on a schema can be quite useful, but I always add an additional translation layer between these generated classes and the “real” API.

Too Much Model

I’ve seen bad results from “built-in serializers”; one application developer I worked with relied entirely on the built-in serializer of Apple’s “Yellow Box” (this was essentially “Cocoa” running on Windows, after NeXT, before Mac OS X). Every build of the application produced a different file format, and some daisy chain of machine generated updaters was able to continue to read it in. The real problem was that we ourselves couldn’t reliably answer the question, “What’s in the file?”. We couldn’t easily write other tools to manipulate or debug our own documents, or stimulate test cases. (Staccato Systems’ “Synth Builder”.)

Too Much File

At the other extreme, I’ve seen bad results from reading an hierarchical file structure into memory, and attempting at all times to keep it identical to its on-disk representation. This required a lot of work, and bugs manifested as extra data in the file which other readers just needed to ignore. But not all readers ignored it, and so these stray file elements became de-facto parts of the file format. (Altera’s “SOPC Builder”, and its PTF file format, before our total overhaul.)

Just Right

It’s safest to define your file format separately from your in-memory model. They may be similar, but it’s valuable to implement the file reader and writer as a separate entity from your model. This will involve a translation layer where you read in your file format, perhaps to an in-memory tree structure of some sort, and then populate your in-memory objects. There are opportunities to make this more or less automatic, as appropriate to your particular development effort. But, importantly, this translation is very well-defined and therefore very testable.

Generally…

Hierarchy.

A file format will usually be hierarchical. That’s just the common design pattern that seems to cover the bases. As with your object design, it’s handy to make the hierarchy something that maps intuitively to the problem domain. This is easier said than done, but is an opportunity for artfulness… to say the least. As to the low level file format itself, in general in this day and age you can probably just skip any tedious meetings or debates and create an XML-based structure. If you need to store lots of binary data, and file size is a concern, your XML document can refer to neighbor files. And under no circumstances let the argument that XML is too user-unfriendly or heavyweight influence the decision; these remarks come from those who don’t know from Koolaid. Really.

No Redundancy.

The persistent document should capture each datum once and only once. As a thought experiment, imagine someone manually editing the file, perhaps even in a text editor. Ideally, they should be able to modify an entry in the file and have a predictable feature of the document change when loaded.

Human Readable.

And speaking of the above thought experiment… This isn’t always possible, and in rare cases is undesirable (for reasonable or less-reasonable reasons). But usually it’s a great benefit to be able to examine your document in a text editor. It’s great for debugging, that much I can promise. Another advantage of a text-based file format is it removes any questions of byte ordering or floating point format. What number is hex 01 02? You have to define whether it reads as 258 or 513. But if your file says “x = 258;” then it’s perfectly clear.

Easy In, Rigorous Out.

When reading in your document, be generous. If something is missing, insert a reasonable default value. This strategy can be used with cautiousness and, dare I utter it, cleverness to extend the file format and still be backwards compatible. On the converse however: Always write out everything! It is not safe to skip an entry because its value happens to be “the default”. Someday you’ll change the default, but old documents mustn’t change.

Be Careful With Scripts.

It may be tempting to use a scripting language as a file format. For example, TCL is very easy to embed in C or Java programs, but it’s not ideal as a document file format. Some will argue that philosophically code and data are the same thing. My counterargument used to be something about predictable computation time and the halting problem: if you read in a script, it might run briefly or it might run forever. But I have, I hope, an even more compelling distinction: code can erase your hard disk but data can’t. It’s that simple.

When I’ve seen scripts used as a file format, invariably, eventually, there exists some need to “parse” the script without “executing” it, which is of course absurd. If you depend on being able to parse a subset of the language then you’ve reduced it to data. Use a data format, not a scripting language.

Closing Remarks

One day in the far future (but sooner than you expect) it will be time to rewrite your application from the ground up, or create a new application which interacts in some way with the old one. In either case, a well-defined and knowable file format will render these tasks possible.


(c) 2003-2011 omino.com / contact poly@omino.com