§ JSON >> XML (at least for me)

A few days ago, in a similar mood to the one that caused me to start an Atari emulator, I decided to write my own XML parser.

I've had an increasing interest in language parsers ever since I got to the point of parsing algebraic infix expressions and simple C-like languages. I've written about XML annoyances before, but I don't actually have much occasion to work with XML at the code level, because:

And yet, one of the advantages of XML is that it keeps people from creating their own interchange formats, which are typically far more broken. Since I occasionally do need to import and export little bits of metadata, I wanted to see just how much would be involved in having a little XML parser on the side. It wouldn't need to be terribly fast, as we're talking about a couple of kilobytes of data at most being parsed on a fast CPU, but it would need to be small to be usable. And I just wanted to see if I could do it. So I sat down with the XML 1.0 spec, and started writing a parser.

I have to say, my opinion of XML has dropped several notches in the process (er, lower than it already was), and I'm convinced that we need a major revision or a replacement. I got as far as having a working non-validating, internal-subset-only parser that passed all of the applicable tests in the XML test suite, but after writing more than 2000 lines of code just for the parser and not having even started the DOM yet, I had already run into the following:

All of this adds up to a lot of flexibility, and thus overhead, that simply isn't necessary for most uses of XML that I've seen. For those of you who say who cares and modern systems are fast, I'd like to remind you that every piece of complexity is a piece that can go wrong: an export/import failing, a parser glitch turning into an exploit, or a source of stability problems. This can be true even of a parser that is 100% compliant with the standard, if the parser does not have guards against infinite entity expansion or unbounded recursion depth. It'd be so much easier if someone would just go through and strip XML down to an "embedded subset" that only contains what most programmers really think of as XML and actually use, but I don't see this happening any time soon.
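
To illustrate, the sort of guard I mean is only a couple dozen lines. Here's a minimal C++ sketch -- the names and limits are invented for illustration, not taken from any shipping parser -- of capping nesting depth and entity expansion:

    // Sketch of defensive limits for a recursive-descent parser.
    // The names and limits are hypothetical, chosen only for illustration.
    #include <cstddef>
    #include <stdexcept>

    class ParseGuard {
    public:
        void EnterElement() {
            if (++mDepth > kMaxDepth)
                throw std::runtime_error("document nested too deeply");
        }

        void LeaveElement() { --mDepth; }

        // Called for every character produced by entity expansion, so a
        // "billion laughs" document fails fast instead of eating memory.
        void CountExpansion(size_t chars) {
            mExpanded += chars;
            if (mExpanded > kMaxExpansion)
                throw std::runtime_error("entity expansion limit exceeded");
        }

    private:
        static const size_t kMaxDepth = 256;
        static const size_t kMaxExpansion = 1 << 20;  // 1MB of expanded text
        size_t mDepth = 0;
        size_t mExpanded = 0;
    };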

So, in the end, I stopped working on the XML parser and started working on a JSON parser instead. First, it's so much easier to work off of a spec that essentially fits on one page and doesn't have spaghetti hyperlinks like a Choose Your Own Derivation Adventure book. Second, it's so much simpler. Names? Parsed just like strings, which can contain every character except quotes, backslashes, and control codes. Entities? Just a reduced set of C-like escapes in strings, and thankfully sans octal. Comments? None. Processing instructions? None. Normalization? None. And as a bonus, it's ideal for serializing property sets or tables. The JSON parser and DOM combined were less than half the size of the XML parser at under 1K lines, and took less than a day total to write -- and half of that is just UTF-8/16/32 input code (surrogates suck).
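
To give a feel for why the transcoding half is the annoying half, here's roughly what decoding a \uXXXX escape with surrogate pairing looks like -- a sketch with invented names, not the actual parser code:

    #include <cstdint>
    #include <stdexcept>

    // Sketch: parse four hex digits at p, advancing p. Invented helper.
    static uint32_t Hex4(const char*& p, const char* end) {
        if (end - p < 4)
            throw std::runtime_error("truncated \\u escape");
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i) {
            char c = *p++;
            v <<= 4;
            if (c >= '0' && c <= '9')      v += c - '0';
            else if (c >= 'a' && c <= 'f') v += c - 'a' + 10;
            else if (c >= 'A' && c <= 'F') v += c - 'A' + 10;
            else throw std::runtime_error("bad hex digit in \\u escape");
        }
        return v;
    }

    // Decode a \uXXXX escape with p just past the 'u', pairing UTF-16
    // surrogates into a single code point.
    uint32_t DecodeUnicodeEscape(const char*& p, const char* end) {
        uint32_t cp = Hex4(p, end);
        if (cp >= 0xD800 && cp < 0xDC00) {   // high surrogate: needs a mate
            if (end - p < 6 || p[0] != '\\' || p[1] != 'u')
                throw std::runtime_error("unpaired high surrogate");
            p += 2;
            uint32_t lo = Hex4(p, end);
            if (lo < 0xDC00 || lo >= 0xE000)
                throw std::runtime_error("bad low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp < 0xE000) {
            throw std::runtime_error("unpaired low surrogate");
        }
        return cp;
    }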

To be fair, there are a few downsides to JSON, although IMO they're minor in comparison:

Still, JSON looks much more lightweight for interchange. I'm especially pleased that native parsing support is making it into the next round of browser versions, which hopefully will improve its uptake and therefore available tools.

Comments


This comment is kind of off-topic

Personally, I love these blogs, they're always interesting (even the bits over my head)

Couple the blogs with the updates etc to virtualdub and it gives an interesting overview of your programming likes/dislikes
I'm going to bet you HATE working on gui interfaces ;-)
That was something like 2 days (??) on a challenge, but not 5 minutes on the gui - oh the humanity of it all

On topic though, xml, it's the old joke about a camel

karl - 01 01 10 - 21:12


darn, the bracketed 'sobs' were stripped out!!
now it looks like a whine !!

karl - 01 01 10 - 21:14


Wait... JSON doesn't allow a BOM? I'm pretty sure it is to be treated as a zero-width non-breaking space if you don't handle it specially, which means "can go anywhere whitespace is legal". I'm pretty sure it is implicitly allowed, json.org does say: "Whitespace can be inserted between any pair of tokens."

nielsm - 01 01 10 - 23:44


JSON is good, but there's something even better: Lua. Embrace Lua for everything, from configuration files to scripting/vdub batch jobs, to full-blown high-level programming of your core software. (Not CPU emulation or blend/blit inner loops of course, but almost everything else.) You'll thank me in a year or so :-)

Ivan-Assen Ivanov - 02 01 10 - 00:41


Have you looked at YAML? JSON is basically just a subset of YAML; using the full-blown format, you get much nicer formatting and comments. There are also several parsers available for different languages.

Blacktiger - 02 01 10 - 02:31


google for rapidxml

guga40k - 02 01 10 - 04:30


Actually you don't really need a real BOM, as there is an "implied" BOM, according to the RFC:
http://www.ietf.org/rfc/rfc4627
See 3. Encoding

Then there is ECMAScript 5, which specifies JSON (again :p).

7.1: "<BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files. <BOM> characters are treated as white space characters (see 7.2)."

And 7.2: "White space characters may occur between any two tokens and at the start or end of input."

The ECMAScript 5 spec even tells you what Unicode encoding must be used, in section 6.

So to be on the safe side, you should check for both: a real BOM and the RFC's implied-BOM detection.
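
The detection the RFC describes comes down to checking the pattern of NUL bytes in the first four octets, since the first two characters of a JSON text are always ASCII. A rough C++ sketch (invented names, just to illustrate):

    #include <cstddef>

    // Sketch of RFC 4627's "implied BOM": the pattern of NUL bytes in
    // the first four octets identifies the encoding.
    enum JsonEncoding { kUtf8, kUtf16BE, kUtf16LE, kUtf32BE, kUtf32LE };

    JsonEncoding DetectEncoding(const unsigned char* b, size_t len) {
        if (len >= 4) {
            if (!b[0] && !b[1] && !b[2] &&  b[3]) return kUtf32BE; // 00 00 00 xx
            if ( b[0] && !b[1] && !b[2] && !b[3]) return kUtf32LE; // xx 00 00 00
            if (!b[0] &&  b[1] && !b[2] &&  b[3]) return kUtf16BE; // 00 xx 00 xx
            if ( b[0] && !b[1] &&  b[2] && !b[3]) return kUtf16LE; // xx 00 xx 00
        }
        return kUtf8;  // short or NUL-free input: treat as UTF-8
    }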


About the duplicate names:
The RFC indeed only states "The names within an object SHOULD be unique".
But ECMAScript 5, OTOH, states "In the case where there are duplicate name Strings within an object, lexically preceding values for the same key shall be overwritten.", so you can use that.
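
In other words, last one wins, so a map-backed object store gives you the ES5 behavior for free. A toy sketch:

    #include <map>
    #include <string>

    int main() {
        // "Lexically preceding values... shall be overwritten": storing
        // members in a map makes later duplicates replace earlier ones.
        std::map<std::string, std::string> obj;
        obj["key"] = "first";
        obj["key"] = "second";   // duplicate name: earlier value replaced
        return obj["key"] == "second" ? 0 : 1;
    }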

About the production for numbers: They are basically all IEEE floats. Just look how XSchema defines the float type ;)

Nils - 02 01 10 - 05:42


Personally I never worked with XML and from the horror stories I heard here and on http://www.thedailywtf.com, I think I'm rather lucky :)

It's good to see that people recognize that strict XML is a major nuisance and try to implement more lightweight formats instead.

ggn - 02 01 10 - 06:16


Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

s/regular expressions/XML/

Jon (link) - 02 01 10 - 06:24


If you are looking to write these config files by hand and not just use them as a program interchange format, use the superset of JSON: YAML. It has Pythonic whitespace indentation, and if you don't like that syntax you can fall back to JSON and it just works.

Here is an online parser to play with http://yaml-online-parser.appspot.com/

Paul Tarjan (link) - 02 01 10 - 07:53


This is a very odd critique. Writing an XML parser might be hard, but who cares? Every major programming language provides an XML parser for you, and most of them provide more than one.

Let's think about all the other things that would be "hard" to write, and reject them as well:

* An operating system
* Video card drivers
* A web browser

All of those things require thousands of lines of code and I couldn't write any of them myself over a weekend. Therefore, I will stop using them.

Mike - 02 01 10 - 17:26


> This is a very odd critique. Writing an XML parser might be hard, but who cares? Every major programming language provides an XML parser for you, and most of them provide more than one.

Believe it or not, sometimes people actually have size or performance criteria to care about when selecting file formats or parsing libraries. It's not a good idea to take an 8MB library in an installer stub, or use an unnecessarily complex format when transferring 40GB of data. But apparently, you're too intent on providing a snarky answer to consider that perhaps not every set of requirements can be met by throwing off-the-shelf libraries together.

Phaeron - 02 01 10 - 22:12


@Karl:
> That was something like 2 days (??) on a challenge, but not 5 minutes on the gui - oh the humanity of it all

You're crazy if you think GUIs take 5 minutes. I spend more time than that in a layout editor just figuring out how I want user flow to work in a dialog.

@nielsm:
> Wait... JSON doesn't allow a BOM? I'm pretty sure it is to be treated as a zero-width non-breaking space if you don't handle it specially, which means "can go anywhere whitespace is legal". I'm pretty sure it is implicitly allowed, json.org does say: "Whitespace can be inserted between any pair of tokens."

Nope, it doesn't. JSON strictly defines whitespace as one of LF, CR, tab, or space. It doesn't include non-breaking spaces, zero-width spaces, or any other kinds of whitespace defined in Unicode. I believe this is also true of XML, which only treats U+FEFF specially for the BOM. It'd be complex and probably unnecessary to accommodate all of the kinds of whitespace. I did some research into this, and this is apparently one of the common areas of laxity in JSON parsers. Presumably it's not allowed because JavaScript eval() won't take kindly to finding an alien character at the start of the "code" that it receives.
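
The entire whitespace production fits in one line of C++, which is rather the point (trivial sketch):

    // JSON's entire whitespace grammar: the four characters the RFC
    // lists, and nothing else (no U+00A0, no U+FEFF, no Unicode spaces).
    inline bool IsJsonWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }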

@Ivan-Assen Ivanov:
> JSON is good, but there's something even better: Lua. Embrace Lua for everything, from configuration files to scripting/vdub batch jobs, to full-blown high-level programming of your core software. (Not CPU emulation or blend/blit inner loops of course, but almost everything else.) You'll thank me in a year or so :-)

Yuck, no. I already went down that path with VirtualDub's configuration files and sometimes I regret doing it, because in practice what it means is that no other application can properly process the files without including a substantial portion of the original program's core. The only reason it isn't completely insane in VirtualDub's case is that the scripting language doesn't have any flow control structures, and thus a script can always be compiled down into a functional form (and can never loop infinitely).

Lua is also far too lenient for my tastes: no distinct integer/real types (which is a PITA if you are using Direct3D without D3DCREATE_FPU_PRESERVE), and silent breakage on misspelled identifiers.

@Blacktiger:
> Have you looked at YAML? JSON is basically just a subset of YAML, using the full blown format you can get much nicer formatting and comments. There are also several parsers available for different languages.

YAML's turning into a bit of a kitchen sink, unfortunately. My general rule of thumb is that the more flexibility you add into a format, the easier it gets to write and the harder it gets to read. Unfortunately, this is often the reverse of what you want, as you generally want the asymmetry going the other way: hopefully, data is read at least as often as it is written.

IMO, if you want a really nice format for humans, you have to resort to a domain-specific format -- you can't really do it by creating a bloated format that tries to do everything generically. Those are called programming languages.

@guga40k:

> google for rapidxml

From the RapidXml website:
"RapidXml is not a W3C compliant parser, primarily because it ignores DOCTYPE declarations."

Translation: RapidXml doesn't actually parse XML, because it ignores the internal DTD subset.

I deliberately did NOT do this in my prototype, because it's useless to evaluate the complexity of an XML parser that doesn't actually parse XML. I already know that I can not-parse XML really quickly.

@Nils:
> Actually you don't really need a real BOM, as there is an "implied" BOM, according to the RFC:

The problem with this rule is that it only makes sense if you know that the text is JSON. It spells trouble if you are trying to push JSON through a generic text facility, like a text editor or a stream reader. XML's allowance of the BOM is nice in that if you do need to read the XML through such a facility it won't fark up the encoding even if it has no idea about XML. Not only does JSON not allow the BOM, it actually prohibits it.

> Then there is ECMAScript 5, which specifies JSON (again :p).

Careful: ECMAScript/JavaScript do not specify JSON. JSON is a _subset_. Just because the BOM may be allowed in JS does not mean you're allowed to use it in JSON. For instance, single quotes around member names aren't allowed even though they will compile. The format spec must not be tied to the language spec if the format is to be well-supported by parsers written in different languages.

> About the production for numbers: They are basically all IEEE floats. Just look how XSchema defines the float type ;)

Actually, they aren't. They follow the text format used by most languages to represent them, but there are no specifications on the range or precision of numbers. You can represent quad floats and bignums in JSON's numeric format. You're also free to make a JSON parser that attempts to identify integers separately from floats.
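
For instance, a parser could legally do something like the following sketch (invented names; the grammar itself promises nothing about range or precision), reading integers exactly when they fit and falling back to double otherwise:

    #include <cerrno>
    #include <cstdint>
    #include <cstdlib>
    #include <string>

    // Sketch of one legal policy: read a JSON number as a 64-bit integer
    // when it has no fraction/exponent and fits, else fall back to double.
    // Assumes the token was already validated by the lexer.
    struct JsonNumber {
        bool    isInt;
        int64_t i;
        double  d;
    };

    JsonNumber ParseJsonNumber(const std::string& tok) {
        JsonNumber n = { false, 0, 0.0 };
        if (tok.find_first_of(".eE") == std::string::npos) {
            errno = 0;
            char* end = 0;
            long long v = std::strtoll(tok.c_str(), &end, 10);
            if (!errno && end == tok.c_str() + tok.size()) {
                n.isInt = true;
                n.i = v;
                return n;
            }
        }
        n.d = std::strtod(tok.c_str(), 0);  // overflow or fractional part
        return n;
    }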

Phaeron - 02 01 10 - 23:12


> Believe it or not, sometimes people actually have size or performance criteria to care about when selecting file formats or parsing libraries.

Phaeron, I think Mike brought up a good point... that yes, XML parsing may be difficult, but that doesn't make XML any less useful. XML is very good for exporting/importing data between programs (it is easy to convert data to and from an XML schema), and it is very good for splitting and merging files. It is not good, as you pointed out, as a container for 40GB of data. If you are just moving data from one place to another, XML certainly is not appropriate.

Michael - 03 01 10 - 03:08


1. How many people know or care about JSON, and how many people use XML? I think there is no need to answer this.

2. If you need to store gigabytes of data, you can use a binary XML format such as the one used in MKV. Why would an ordinary user want to edit 40GB of XML text data? Needless to say, there are libraries for that.

3. I cannot imagine a situation where a full-featured XML parser is not fast enough for reasonable amounts of data on today's processors. Again, gigabytes of XML text data are nonsense; the text format overhead is just too much.

Mirage - 03 01 10 - 06:44


Considering Doug "eval is evil" Crockford wrote JSON, and a parser in JS for it, I don't think he banned the BOM because eval() would choke on it. Maybe he was negligent about it?

Iain Dalton - 03 01 10 - 17:24


@Mirage: many people actually use JSON without knowing about it. The fact that a common practice (storing data in a JavaScript-like text array, which is probably one of the dumbest array notation syntaxes - which doesn't make it bad, mind you) has been somewhat formalized into JSON doesn't make it an oddity.

Mitch 74 (link) - 03 01 10 - 20:41


Off topic, but since you mention XML, take a look at http://www.codeproject.com/KB/recipes/Fl..

Michael - 04 01 10 - 00:21


JSON is dog slow. You should try C++ XML data binding to get rid of XML parsers completely.

http://www.artima.com/cppsource/xml_data..

Robin - 04 01 10 - 02:01


@Robin: JSON might be dog slow, but it's also damn simple.

Which was the point.

Mitch 74 (link) - 04 01 10 - 21:06


CDATA sections. Not only do I have to check for the starting <![CDATA sequence, which partially overlaps with the prefixes for processing instructions (PIs) and comments, but I also have to scan every text section for ]]> just so I can ban it, even though I don't see why this is necessary. XML doesn't ban > in text spans.

Banning > in text spans doesn't eliminate ambiguity - banning ]]>, however, does, it seems to me. i.e.:

<node>
<![CDATA[
]]>
</node><node>
]]>
</node>

How many nodes is that? Treat either the first ]]> or second ]]> as the end of the CDATA and you still end up with valid XML.

yawnmoth - 05 01 10 - 04:24


But.... using XML means you don't have to write the parser! This advantage is quickly lost once you try doing it, reducing XML into any random self-invented format.

Gabest - 06 01 10 - 04:52


> But.... using XML means you don't have to write the parser! This advantage is quickly lost once you try doing it, reducing XML into any random self-invented format.

From Phaeron's "02 01 10 - 22:12" post:

> > This is a very odd critique. Writing an XML parser might be hard, but who cares? Every major programming language provides an XML parser for you, and most of them provide more than one.

> Believe it or not, sometimes people actually have size or performance criteria to care about when selecting file formats or parsing libraries. It's not a good idea to take an 8MB library in an installer stub, or use an unnecessarily complex format when transferring 40GB of data. But apparently, you're too intent on providing a snarky answer to consider that perhaps not every set of requirements can be met by throwing off-the-shelf libraries together.

Anonymous Coward - 06 01 10 - 09:37


@yawnmoth:
> How many nodes is that? Treat either the first ]]> or second ]]> as the end of the CDATA and you still end up with valid XML.

This is actually a fairly simple problem -- you simply have to choose the definition. For instance, in regular expressions, you can either use * to do a maximally greedy match, or *? to do a minimal match. Think about it: XML parsers don't have difficulty finding the start of the next tag even though the whole file is a sea of angle brackets. In this case, choosing the first instance would solve the problem, and avoid the need to scan all text spans for this multi-character sequence.
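
A first-match scan is also cheap -- something like this sketch (hypothetical helper, not my prototype's code):

    #include <cstring>

    // Sketch of the "first match wins" rule: take the first ]]> after
    // the <![CDATA[ opener as the terminator, so ordinary text spans
    // never need to be scanned for the sequence at all.
    const char* FindCDataEnd(const char* p, const char* end) {
        while (p < end) {
            p = static_cast<const char*>(std::memchr(p, ']', end - p));
            if (!p)
                break;
            if (end - p >= 3 && p[1] == ']' && p[2] == '>')
                return p;           // first ]]> ends the CDATA section
            ++p;
        }
        return 0;                   // unterminated section: parse error
    }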

It's possible that there's a good rationale for this restriction, but I didn't see it in the XML spec. It might be just for compatibility with SGML.

Phaeron - 06 01 10 - 15:19


Sorry if someone said this before; I usually read the comments, but it was tl;dr today.

However, YAML is a simple way to store data (or metadata), and I think that the latest version (1.2) is explicitly JSON compatible.

Also, it allows for that BOM stuff and can be used with ASCII rather than Unicode. I think. The spec was also tl, but I did skim it intently.

I first heard about YAML a long time ago, before I even heard about Python, and it seemed back then to be a much better solution than XML.

I think XML is... bad. HTML was a good idea, but XML takes it too far in the wrong direction. Luckily, it will keep plenty of mediocre programmers busy writing and debugging parsers.

Anyhow, YAML seems good, and might be what you want, esp. as the latest version is intentionally related to JSON.

Kentaro (link) - 06 01 10 - 16:54


This is actually a fairly simple problem -- you simply have to choose the definition. For instance, in regular expressions, you can either use * to do a maximally greedy match, or *? to do a minimal match.

(Minimally) greedy matching would still require you to scan the data that'll go in the CDATA for a ]]>. Since there's no way to escape ]]>, you either have to remove it, error out of the XML parser if it's there, or accept the fact that the user will be able to break out of the XML structure and, in so doing, create an XML file that's not formatted as you're expecting it to be formatted.

And say you do the opposite - say you assume the last ]]> matches the first

yawnmoth - 07 01 10 - 07:01


Grr... the unescaped < broke my post.

Anyway, continuing from where I left off,

Say you do the opposite - say you assume the last ]]> matches the first <![CDATA. At that point you can only have one CDATA field and that's it.

Or maybe you're proposing there be some sort of flag where a CDATA can have variable greediness? i.e. one CDATA may be (minimally) greedy and the other may be maximally greedy? I guess that could work, but once you used a maximally greedy CDATA, no CDATAs could be used afterwards, be they minimally or maximally greedy.

yawnmoth - 07 01 10 - 07:05


I love me some JSON, but no comments? Wat?

commenter - 26 01 10 - 11:27


Marc Kerbiquet has written an XML parser in assembly (it may be the fastest available).

http://tibleiz.net/asm-xml/

(off topic: strange yellow overlays (XP) with a black point in the middle when screenshotting videos - does anyone know about that?!)

greetings and thanks for VirtualDub. love it

Nils (link) - 29 01 10 - 11:34


You should also look at vtd-xml as the latest and most advanced XML technology

http://vtd-xml.sf.net

tom - 09 02 10 - 20:48


Check out YAML. I'm super pleased with YAML, which I believe is technically a superset of JSON. It has some great advantages like human-readability, comments, and the ability to embed YAML or JSON within it. I'd love to see YAML's acceptance grow. Please check it out. http://en.wikipedia.org/wiki/YAML

Andy - 24 02 10 - 08:36


The use of DTDs in XML for entities has other problems too. For example, many implementations use their own HTTP implementation that does no caching, causing unnecessary load on the servers, not to mention that it provides a single point of failure. See http://hsivonen.iki.fi/no-dtd/ .

Yuhong Bao (link) - 26 01 11 - 19:15
