
Blog Archive

An XML annoyance

A couple of days ago, I looked into XML as a possible exchange format for a program I was working on. Using an off-the-shelf XML parser wasn't an option, so I needed a format simple enough to parse myself. XML seemed like a good fit due to its hierarchical, tag-based nature, and if I was going to use a simple text-based format, picking a ubiquitous one seemed like a good idea. I've acquired a bit of a distaste for XML over the years, primarily from seeing people convert 10MB of binary data to 100MB of XML for parsing in an interpreted language. For a simple file with a few data items, though, it makes a lot of sense.

The first set of warning bells went off when I pulled down the XML 1.0 standard from the W3C and discovered it was 35 pages long. W3C standards don't seem to be organized well in general: they delve immediately into details without giving a good overview first. Well, I could deal with that -- I've survived ISO standard documents before, and these aren't that bad. Much of the standard deals with document type declarations (DTDs) and validation, which a simple parser could leave out.

That is, until I discovered the horrors of the internal DTD subset.

The internal DTD subset allows you to embed the DTD directly into the document. That's fine, and since it's wrapped in <!DOCTYPE>, in theory it should be easy to skip. Well, it would be, were it not for two little problems called character entities and attribute value defaults:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE data [
    <!ENTITY foo "The quick brown fox quickly jumped over the lazy dog's back.">
    <!ENTITY bar "&foo;">
    <!ATTLIST text
              mode  CDATA   "preformatted">
]>
<data>
    <text>&bar;</text>
</data>

If you load this XML into a web browser like Firefox or Internet Explorer, you'll see the effects of the DTD, which is to introduce a mode attribute into the text tag, and to expand the &bar; character entity. These two features have a number of annoying consequences:
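The same effect can be reproduced outside a browser. The following is only a rough sketch, assuming Python's standard xml.etree.ElementTree module (which is built on expat); the sample document is my own completion of the fragment above, and the helper names read_text_element and strip_doctype are made up for illustration. It reads the text element once with the internal subset in place, then again after naively cutting the <!DOCTYPE> block out, which is what a "just skip it" parser would effectively be doing.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document, modeled on the fragment in the post.
DOC = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE data [
    <!ENTITY foo "The quick brown fox quickly jumped over the lazy dog's back.">
    <!ENTITY bar "&foo;">
    <!ATTLIST text
              mode  CDATA   "preformatted">
]>
<data>
    <text>&bar;</text>
</data>
"""


def read_text_element(document: bytes):
    """Return (character data, attributes) of the first <text> child of the root."""
    text = ET.fromstring(document).find("text")
    return text.text, dict(text.attrib)


def strip_doctype(document: bytes) -> bytes:
    """Naively drop the <!DOCTYPE ... ]> block.

    Assumption: the sequence "]>" never appears inside the internal subset,
    which holds for the sample above but not for XML in general.
    """
    return re.sub(rb"<!DOCTYPE.*?\]>", b"", document, flags=re.DOTALL)


if __name__ == "__main__":
    # With the internal subset: &bar; is expanded and the default
    # "mode" attribute shows up even though the document never wrote it.
    print(read_text_element(DOC))

    # Without it: &bar; is an undefined entity, so the parse fails.
    try:
        read_text_element(strip_doctype(DOC))
    except ET.ParseError as err:
        print("without the DTD:", err)
```

With the DTD in place, the parser hands back content that never literally appears in the element. With the DTD removed, the document body is unchanged but can no longer be parsed at all, because &bar; is undefined.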

Suddenly XML didn't seem like a simple tag-based format anymore. I guess there's always CSV or INI....

Unfortunately, it seems that this has led to some compatibility problems in XML. The idea behind XML is that well-formedness is both strictly defined and strictly enforced in order to prevent the format from decaying. TinyXML was once recommended to me, and it's one of the parsers that doesn't parse the internal DTD subset, which means it doesn't really parse XML. SOAP apparently forbids DTDs as well, and both MSXML 6.0 and .NET 2.0 prohibit them by default. The result is that there's now an effectively undocumented subset of XML. Ugh.
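For illustration, a "no DTDs allowed" policy of the kind described above might look roughly like this in Python. This is a sketch using the standard xml.parsers.expat module, not a description of how SOAP stacks, MSXML, or .NET actually implement the restriction; DTDForbidden and require_no_dtd are hypothetical names.

```python
import xml.parsers.expat


class DTDForbidden(ValueError):
    """Raised when a document contains a <!DOCTYPE> declaration."""


def require_no_dtd(document: bytes) -> None:
    """Parse the document, rejecting it as soon as a DOCTYPE is seen.

    Well-formedness errors still surface as xml.parsers.expat.ExpatError.
    """
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called by expat at the start of <!DOCTYPE ...>, before any
        # entity or attribute-list declarations are processed.
        raise DTDForbidden(f"DOCTYPE {name!r} is not allowed")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)


if __name__ == "__main__":
    require_no_dtd(b"<data><text>plain</text></data>")  # accepted
    try:
        require_no_dtd(b'<!DOCTYPE data [<!ENTITY foo "x">]><data>&foo;</data>')
    except DTDForbidden as err:
        print("rejected:", err)
```

A reader built this way accepts only DTD-free documents, which is exactly the kind of subset the XML standard itself never defines.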

I really wonder how much benefit there was in including user-defined character entities and attribute defaults in the XML standard. It seems to me that if these two features had been omitted, there could have been a clear delineation in the standard between DTD/validation and data, and the core non-validating part could have been made much simpler.


This blog was originally open for comments when this entry was first posted, but comments were later closed and then removed, due to spam and a migration away from the original blog software. Unfortunately, it would have been a lot of work to reformat the comments to republish them. The author thanks everyone who posted comments and added to the discussion.