Pure JavaScript HTML Parser

Recently I was having a little bit of fun and decided to go about writing a pure JavaScript HTML parser. Some might remember one of my projects, env.js, which ported the native browser JavaScript features to the server side (powered by Rhino). One thing that was lacking from that project was an HTML parser (it parsed strict XML only).

I’ve been toying with the idea of porting env.js to other platforms (Spidermonkey derivatives and the ECMAScript 4 Reference Implementation), and if I were to do so I would need an HTML parser. Given that, the easiest route was to just write an HTML parser in pure JavaScript.

I did some digging to see what people had previously built, but the landscape was pretty bleak. The only one that I could find was one made by Erik Arvidsson: a simple SAX-style HTML parser. Considering that this contained only the most basic parsing, and none of the actual, complicated HTML logic, there was still a lot of work left to be done.

(I also contemplated porting the HTML 5 parser, wholesale, but that seemed like a herculean effort.)

However, the result is one that I’m quite pleased with. It won’t match the compliance of html5lib, nor the speed of a pure XML parser, but it’s able to get the job done with little fuss – while still being highly portable.


4 Libraries in One!

There were four pieces of functionality that I wanted to implement with this library:

A SAX-style API

Handles tags, text, and comments with callbacks. For example, let’s say you wanted to implement a simple HTML to XML serialization scheme. You could do so using the following:

var results = "";

HTMLParser("<p id=test>hello <i>world", {
  // Called for each opening tag; unary is true for empty elements
  start: function( tag, attrs, unary ) {
    results += "<" + tag;

    for ( var i = 0; i < attrs.length; i++ )
      results += " " + attrs[i].name + '="' + attrs[i].escaped + '"';

    results += (unary ? "/" : "") + ">";
  },
  // Called for each closing tag, including the ones the parser infers
  end: function( tag ) {
    results += "</" + tag + ">";
  },
  // Called for each run of text
  chars: function( text ) {
    results += text;
  },
  // Called for each comment
  comment: function( text ) {
    results += "<!--" + text + "-->";
  }
});

results == '<p id="test">hello <i>world</i></p>'
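
Because the parser only reports events, the same four callbacks can drive other kinds of output. As a rough sketch, here is a handler that keeps only the text and drops the markup. To stay runnable without htmlparser.js, the events are fired by hand, in the order I would assume the parser reports them for the input above; makeTextCollector and result are hypothetical names, not part of the library.

```javascript
// Hypothetical handler factory: same callback shape as above (start, end,
// chars, comment), plus a result() accessor added for this sketch.
function makeTextCollector() {
  var parts = [];
  return {
    start: function( tag, attrs, unary ) { /* ignore markup */ },
    end: function( tag ) { /* ignore markup */ },
    chars: function( text ) { parts.push( text ); },
    comment: function( text ) { /* ignore comments */ },
    result: function() { return parts.join(""); }
  };
}

// Simulated event stream for "<p id=test>hello <i>world".
// The ordering is an assumption based on the serialized output shown above.
var collector = makeTextCollector();
collector.start( "p", [ { name: "id", escaped: "test" } ], false );
collector.chars( "hello " );
collector.start( "i", [], false );
collector.chars( "world" );
collector.end( "i" );
collector.end( "p" );

console.log( collector.result() ); // "hello world"
```

With the real library you would pass the collector object as the second argument to HTMLParser instead of calling its methods yourself.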

XML Serializer

Now, there’s no need to worry about implementing the above, since it’s included directly in the library, as well. Just feed in HTML and it spits back an XML string.

var results = HTMLtoXML("<p>Data: <input disabled>")
results == '<p>Data: <input disabled="disabled"/></p>'
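
The disabled="disabled" in that output comes from filling in a value for an attribute that was written without one. A minimal sketch of that step might look like the following; serializeAttr is a hypothetical helper, and both the fallback rule and the quote escaping are my assumptions here rather than the library's exact behavior.

```javascript
// Hypothetical helper: write one attribute in XML form.
function serializeAttr( name, value ) {
  // Assumption: a missing value (as in <input disabled>) falls back to the name.
  var v = ( value === undefined || value === null ) ? name : String( value );

  // Assumption: only double quotes are escaped, so the attribute stays quoted.
  var escaped = v.replace( /"/g, "&quot;" );

  return name + '="' + escaped + '"';
}

console.log( serializeAttr( "disabled" ) );        // disabled="disabled"
console.log( serializeAttr( "src", "test.jpg" ) ); // src="test.jpg"
```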

DOM Builder

If you’re using the HTML parser to inject into an existing DOM document (or within an existing DOM element) then htmlparser.js provides a simple method for handling that:

// The following is appended into the document body
HTMLtoDOM("<p>Hello <b>World", document)

// The following is appended into the specified element
HTMLtoDOM("<p>Hello <b>World", document.getElementById("test"))

DOM Document Creator

This is a more-advanced version of the DOM builder – it includes logic for handling the overall structure of a web page, returning a new DOM document.

A couple points are enforced by this method:

  • There will always be an html, head, body, and title element.
  • There will only be one html, head, body, and title element (if the user specifies more, they will be moved to the appropriate locations and merged).
  • link and base elements are forced into the head.

You would use the method like so:

var dom = HTMLtoDOM("<p>Data: <input disabled>");
dom.getElementsByTagName("body").length == 1
dom.getElementsByTagName("p").length == 1

While this library doesn’t cover the full gamut of possible weirdness that HTML provides, it does handle a lot of the most obvious stuff. All of the following are accounted for:

  • Unclosed Tags:
    HTMLtoXML("<p><b>Hello") == '<p><b>Hello</b></p>'
  • Empty Elements:
    HTMLtoXML("<img src=test.jpg>") == '<img src="test.jpg"/>'
  • Block vs. Inline Elements:
    HTMLtoXML("<b>Hello <p>John") == '<b>Hello </b><p>John</p>'
  • Self-closing Elements:
    HTMLtoXML("<p>Hello<p>World") == '<p>Hello</p><p>World</p>'
  • Attributes Without Values:
    HTMLtoXML("<input disabled>") == '<input disabled="disabled"/>'
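
Most of the cases above fall out of keeping a stack of open elements. The following is a minimal sketch of that idea, not the library's actual code: the element lists are small made-up subsets, the tokenizer ignores attributes and comments, and balance, isIn, and closeTo are hypothetical names.

```javascript
// Deliberately tiny, assumed element lists (a real parser needs far more).
var EMPTY = [ "br", "hr", "img", "input" ];
var BLOCK = [ "div", "p", "ul", "li", "table" ];
var INLINE = [ "a", "b", "em", "i", "span" ];
var CLOSE_SELF = [ "p", "li", "td", "tr" ];

function isIn( list, tag ) {
  return list.indexOf( tag ) !== -1;
}

function balance( html ) {
  var out = "";
  var stack = []; // currently open tags, innermost last
  var token = /<(\/?)(\w+)>|([^<]+)/g; // attribute-less tags and text only
  var m;

  // Close every open element at index pos and above.
  function closeTo( pos ) {
    while ( stack.length > pos )
      out += "</" + stack.pop() + ">";
  }

  while ( (m = token.exec( html )) !== null ) {
    if ( m[3] !== undefined ) { // text
      out += m[3];
      continue;
    }

    var tag = m[2].toLowerCase();

    if ( m[1] === "/" ) { // end tag: close back to the matching start tag
      var pos = stack.lastIndexOf( tag );
      if ( pos !== -1 )
        closeTo( pos );
      continue;
    }

    // A block element ends any inline elements left open before it
    if ( isIn( BLOCK, tag ) )
      while ( stack.length && isIn( INLINE, stack[ stack.length - 1 ] ) )
        closeTo( stack.length - 1 );

    // Elements like p close an open sibling of the same name
    if ( isIn( CLOSE_SELF, tag ) && stack[ stack.length - 1 ] === tag )
      closeTo( stack.length - 1 );

    if ( isIn( EMPTY, tag ) ) {
      out += "<" + tag + "/>"; // empty elements are never pushed
    } else {
      out += "<" + tag + ">";
      stack.push( tag );
    }
  }

  closeTo( 0 ); // close whatever is still open at the end of input
  return out;
}

console.log( balance( "<p><b>Hello" ) );      // <p><b>Hello</b></p>
console.log( balance( "<b>Hello <p>John" ) ); // <b>Hello </b><p>John</p>
console.log( balance( "<p>Hello<p>World" ) ); // <p>Hello</p><p>World</p>
```

The stack is what makes the end-of-input cleanup and the "close back to the matching tag" behavior cheap: both are just pops. Attribute handling is left out so the stack logic stays visible.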

Note: It does not take into account where in the document an element should exist. Right now you can put block elements in a head or th inside a p and it’ll happily accept them. It’s not entirely clear how the logic should work for those, but it’s something that I’m open to exploring.

You can test a lot of this out in the live demo.

While I doubt this will cover every weird HTML case, it should handle most of the obvious ones, which at least makes HTML parsing in JavaScript feasible.

Posted: May 5th, 2008
