An XML annoyance
A couple of days ago, I looked into XML as a possible exchange format for a program I was working on. Using an off-the-shelf XML parser wasn't an option, so I needed a format simple enough to parse with my own code. XML seemed like a good fit due to its hierarchical tag-based nature, and if I was going to use a simple text-based format, using a ubiquitous one seemed to be a good idea. I've acquired a bit of a distaste for XML over the years, primarily from seeing people convert 10MB of binary data to 100MB of XML for parsing in an interpreted language. For a simple file with a few data items, though, it makes a lot of sense.
The first set of warning bells went off when I pulled down the XML 1.0 standard from the W3C and discovered it was 35 pages long. W3C standards don't seem to be organized well in general, since they delve immediately into details without giving a good overview first. Well, I could deal with that -- I've survived ISO standard documents before, and these aren't that bad. Much of the standard deals with document type declarations (DTDs) and validation, which could be omitted.
That is, until I discovered the horrors of the internal DTD subset.
The internal DTD subset allows you to embed the DTD directly into the document. That's fine, and since it's wrapped in <!DOCTYPE>, in theory it should be easy to skip. Well, it would be, were it not for two little problems called user-defined entities and attribute value defaults:
    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE data [
        <!ENTITY foo "The quick brown fox quickly jumped over the lazy dog's back.">
        <!ENTITY bar "&foo;">
        <!ATTLIST text
            mode CDATA "preformatted">
    ]>
    <data>
        <text>
            &bar;
        </text>
    </data>
If you load this XML into a web browser like Firefox or Internet Explorer, you'll see the effects of the DTD: it introduces a mode attribute on the text element and expands the &bar; entity reference. These two features have a number of annoying consequences:
- All XML parsers, including non-validating ones, must parse the internal DTD subset. This means the parser needs a second tag-parsing path, since DTD declarations don't follow the same attribute=value format that the rest of XML uses.
- The internal DTD subset cannot be ignored, since it can change the interpretation of the data.
- Entity references can now expand to arbitrary lengths. This prohibits in-place conversion and requires dynamic memory allocation. Even more fun is the possibility of nested expansion, which makes the billion laughs attack possible.
- XML parsers must both parse elements and interpret them, due to the need to inject attribute defaults.
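Both effects in the list above can be seen with a real non-validating parser. Here is a minimal sketch, assuming Python's standard xml.parsers.expat module as the parser; the helper name parse_text_element is made up for illustration. It feeds the sample document to the parser and records what an application would see for the text element.

```python
import xml.parsers.expat

# The sample document from above, reproduced as-is.
DOC = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE data [
<!ENTITY foo "The quick brown fox quickly jumped over the lazy dog's back.">
<!ENTITY bar "&foo;">
<!ATTLIST text
mode CDATA "preformatted">
]>
<data>
<text>
&bar;
</text>
</data>
"""

def parse_text_element(doc):
    """Hypothetical helper: return (attributes, character data) of <text>."""
    attrs_seen = {}
    chunks = []
    state = {"in_text": False}

    def on_start(name, attrs):
        if name == "text":
            # attrs includes defaults injected from the ATTLIST declaration.
            attrs_seen.update(attrs)
            state["in_text"] = True

    def on_end(name):
        if name == "text":
            state["in_text"] = False

    def on_chars(data):
        # Character data may arrive in several pieces, so collect them all.
        if state["in_text"]:
            chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_chars
    parser.Parse(doc, True)
    return attrs_seen, "".join(chunks).strip()

attrs, text = parse_text_element(DOC)
print(attrs)
print(text)
```

The document never writes a mode attribute or the fox sentence inside the text element, yet the application receives both. A parser that skipped the DOCTYPE block would instead have to fail on &bar; as an undeclared entity and would report no attributes at all.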
Suddenly XML didn't seem like a simple tag-based format anymore. I guess there's always CSV or INI....
Unfortunately, it seems that this has led to some compatibility problems in XML. The idea behind XML is that well-formedness is both strictly defined and strictly enforced in order to prevent the format from decaying. TinyXML was once recommended to me, and it's one of the parsers that doesn't parse the internal DTD subset, which means it doesn't really parse XML. SOAP apparently forbids DTDs as well, and both MSXML 6.0 and .NET 2.0 prohibit them by default. The result is that there's now an effectively undocumented subset of XML. Ugh.
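For a sense of what "no DTDs allowed" looks like in practice, here is a minimal sketch of that policy. It assumes Python's xml.parsers.expat module; the names DTDForbidden and parse_without_dtd are invented for this example and are not meant to mirror how MSXML or .NET implement their setting. The idea is simply to abort as soon as the parser reports a DOCTYPE declaration.

```python
import xml.parsers.expat

class DTDForbidden(ValueError):
    """Hypothetical error type: the document carries a DOCTYPE declaration."""

def parse_without_dtd(data):
    """Parse `data` (bytes) and return element names in document order.

    Raises DTDForbidden if the document contains any DOCTYPE declaration,
    which also rules out the internal DTD subset.
    """
    names = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Raising here propagates out of parser.Parse() and stops parsing.
        raise DTDForbidden("DOCTYPE not allowed (root name %r)" % name)

    def on_start(name, attrs):
        names.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(data, True)
    return names

print(parse_without_dtd(b"<data><text>hi</text></data>"))

try:
    parse_without_dtd(b'<!DOCTYPE data [<!ENTITY foo "x">]><data>&foo;</data>')
except DTDForbidden as exc:
    print("rejected:", exc)
```

This accepts exactly the DTD-free subset described above: plain documents parse normally, while anything with a DOCTYPE is refused before entities or attribute defaults can come into play.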
I really wonder how much benefit there was in including user-defined entities and attribute defaults in the XML standard. It seems to me that if these two features had been omitted, there could have been a clear delineation in the standard between DTD/validation and data, and the core non-validating part could have been made much simpler.