HTML 5 Parsing

One of the biggest wins of HTML 5 is a detailed specification outlining how parsing of HTML documents should work. For too many years browsers have simply tried to guess and copy what others were doing, in hopes that their parser would work well enough to not cause too many problems with HTML markup found in the wild.

While some parts of HTML 5 are certainly more contentious than others, the parsing section is almost universally appreciated by browser vendors. Once browsers start to implement it, users will enjoy the improved compatibility as well.

One of the first implementations of the HTML 5 parsing rules was actually created to power the HTML 5 validator. (If you’re interested in testing it out, should validate as HTML 5.) This particular implementation is in Java, provides SAX and DOM interfaces for use, and is open source.

This is particularly interesting because Henri Sivonen (the author of the validator) just recently landed (Warning: Massive web page) a brand new HTML 5 parsing engine in Gecko, destined for the next version of Firefox.

What’s interesting about this particular implementation is that it’s actually an automated conversion of Henri’s Java HTML 5 parser to C++. This conversion happens automatically and changes will be pushed upstream to the Mozilla codebase.

Normally I would balk at the mention of a wholesale, programmatic conversion of a Java codebase over to C++, but the results have been very surprising: a 3% boost in pageload performance.

And this is on top of the litany of bug fixes and compliance checks that this code base will be providing. You can examine some of the progress that went into constructing the patch in the Mozilla bug.

If you’re interested in giving the new parser a try (it’s doubtful that you’ll see many obvious changes – but any help in hunting down bugs would be appreciated) you can download a nightly of Firefox, open about:config, and set html5.enable to true.

If there was ever a time to start playing around with the jump to HTML 5, now would be it. Since HTML 5 is largely a superset of the features provided by HTML 4 and XHTML 1, it ends up being surprisingly easy to ‘upgrade’: just start by swapping out your current (X)HTML Doctype for the HTML 5 Doctype:

<!DOCTYPE html>
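As a rough sketch of what that swap looks like in a full page, assume a document that used to begin with the HTML 4.01 Strict Doctype (`<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">`). The title, id, and text below are made up for illustration. Only the first line changes:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Unchanged from the HTML 4 version of this hypothetical page. -->
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Example page</title>
</head>
<body>
  <!-- Existing HTML 4 style markup stays as it was under the new Doctype. -->
  <div id="content">
    <h1>Hello</h1>
    <p>Nothing else on the page has to change.</p>
  </div>
</body>
</html>
```

Running the result through the HTML 5 validator mentioned earlier is a reasonable way to check that nothing else needs attention.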

From there you can check the site HTML 5 Doctor for additional details on how to get the new HTML 5 elements working in all browsers.

Posted: July 7th, 2009
