Recently I was having a little bit of fun and decided to go about writing a pure JavaScript HTML parser. Some might remember my one project, env.js, which ported the native browser JavaScript features to the server-side (powered by Rhino). One thing that was lacking from that project was an HTML parser (it parsed strict XML only).
I’ve been toying with porting env.js to other platforms (SpiderMonkey derivatives and the ECMAScript 4 Reference Implementation), and to do so I would need an HTML parser. Given that, the easiest route was to just write an HTML parser in pure JavaScript.
I did some digging to see what people had previously built, but the landscape was pretty bleak. The only one that I could find was one made by Erik Arvidsson – a simple SAX-style HTML parser. Considering that this contained only the most basic parsing, and none of the actual, complicated HTML logic, there was still a lot of work left to be done.
(I also contemplated porting the HTML 5 parser, wholesale, but that seemed like a herculean effort.)
However, the result is one that I’m quite pleased with. It won’t match the compliance of html5lib, nor the speed of a pure XML parser, but it’s able to get the job done with little fuss – while still being highly portable.
htmlparser.js:
4 Libraries in One!
There were four pieces of functionality that I wanted to implement with this library:
A SAX-style API
Handles tag, text, and comments with callbacks. For example, let’s say you wanted to implement a simple HTML to XML serialization scheme – you could do so using the following:
var results = "";

HTMLParser("<p id=test>hello <i>world", {
  start: function( tag, attrs, unary ) {
    results += "<" + tag;

    for ( var i = 0; i < attrs.length; i++ )
      results += " " + attrs[i].name + '="' + attrs[i].escaped + '"';

    results += (unary ? "/" : "") + ">";
  },
  end: function( tag ) {
    results += "</" + tag + ">";
  },
  chars: function( text ) {
    results += text;
  },
  comment: function( text ) {
    results += "<!--" + text + "-->";
  }
});

results == '<p id="test">hello <i>world</i></p>'
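To make the callback flow concrete, here is a minimal sketch that replays, by hand, the events a SAX-style parser would presumably emit for the input above. The `makeXmlHandler` helper and the exact event order are assumptions for illustration; only the handler shape (`start`, `end`, `chars`, `comment`) and the `name`/`escaped` attribute fields come from the example itself.

```javascript
// Hypothetical helper: builds the same kind of handler as the example,
// but keeps the output in a closure instead of a global variable.
function makeXmlHandler() {
  var out = "";
  return {
    start: function (tag, attrs, unary) {
      out += "<" + tag;
      for (var i = 0; i < attrs.length; i++) {
        out += " " + attrs[i].name + '="' + attrs[i].escaped + '"';
      }
      out += (unary ? "/" : "") + ">";
    },
    end: function (tag) { out += "</" + tag + ">"; },
    chars: function (text) { out += text; },
    comment: function (text) { out += "<!--" + text + "-->"; },
    result: function () { return out; }
  };
}

// Assumed event sequence for "<p id=test>hello <i>world": the parser is
// expected to close the dangling <i> and <p> itself once the input runs out.
var handler = makeXmlHandler();
handler.start("p", [{ name: "id", value: "test", escaped: "test" }], false);
handler.chars("hello ");
handler.start("i", [], false);
handler.chars("world");
handler.end("i");
handler.end("p");

var replayed = handler.result();
console.log(replayed);
```

The point of the design is that the handler never sees malformed markup: all the repair work happens in the parser, and the callbacks receive a balanced stream of events.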
XML Serializer
Now, there’s no need to worry about implementing the above, since it’s included directly in the library, as well. Just feed in HTML and it spits back an XML string.
var results = HTMLtoXML("<p>Data: <input disabled>");

results == '<p>Data: <input disabled="disabled"/></p>'
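The `disabled` to `disabled="disabled"` expansion can be sketched as follows. This is not the library's actual code: `BOOLEAN_ATTRS`, `escapeAttr`, and `normalizeAttr` are hypothetical names, the attribute list is an abbreviated guess, and the entity-based escaping is my own choice for XML output.

```javascript
// Assumed, abbreviated list of attributes that may appear without a value.
var BOOLEAN_ATTRS = { checked: true, disabled: true, multiple: true, readonly: true, selected: true };

// Escape the characters that would break a double-quoted XML attribute.
function escapeAttr(value) {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/"/g, "&quot;");
}

// Hypothetical helper: `value` is undefined when the source had no "=value".
// An explicitly empty value (alt="") is kept empty rather than filled in.
function normalizeAttr(name, value) {
  if (value === undefined) {
    value = Object.prototype.hasOwnProperty.call(BOOLEAN_ATTRS, name) ? name : "";
  }
  return { name: name, value: value, escaped: escapeAttr(value) };
}

// Serialize a list of normalized attributes for an XML start tag.
function serializeAttrs(attrs) {
  var out = "";
  for (var i = 0; i < attrs.length; i++) {
    out += " " + attrs[i].name + '="' + attrs[i].escaped + '"';
  }
  return out;
}

var inputTag = "<input" + serializeAttrs([normalizeAttr("disabled")]) + "/>";
console.log(inputTag);
```

Checking for `undefined` rather than any falsy value matters here: it is what separates a valueless attribute from one that was deliberately set to the empty string.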
DOM Builder
If you’re using the HTML parser to inject into an existing DOM document (or within an existing DOM element) then htmlparser.js provides a simple method for handling that:
// The following is appended into the document body
HTMLtoDOM("<p>Hello <b>World", document)

// The following is appended into the specified element
HTMLtoDOM("<p>Hello <b>World", document.getElementById("test"))
DOM Document Creator
This is a more-advanced version of the DOM builder – it includes logic for handling the overall structure of a web page, returning a new DOM document.
A couple points are enforced by this method:
- There will always be a html, head, body, and title element.
- There will only be one html, head, body, and title element (if the user specifies more, they will be moved to the appropriate locations and merged).
- link and base elements are forced into the head.
You would use the method like so:
var dom = HTMLtoDOM("<p>Data: <input disabled>");

dom.getElementsByTagName("body").length == 1
dom.getElementsByTagName("p").length == 1
While this library doesn’t cover the full gamut of possible weirdness that HTML provides, it does handle a lot of the most obvious stuff. All of the following are accounted for:
- Unclosed Tags:
HTMLtoXML("<p><b>Hello") == '<p><b>Hello</b></p>'
- Empty Elements:
HTMLtoXML("<img src=test.jpg>") == '<img src="test.jpg"/>'
- Block vs. Inline Elements:
HTMLtoXML("<b>Hello <p>John") == '<b>Hello </b><p>John</p>'
- Self-closing Elements:
HTMLtoXML("<p>Hello<p>World") == '<p>Hello</p><p>World</p>'
- Attributes Without Values:
HTMLtoXML("<input disabled>") == '<input disabled="disabled"/>'
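Most of the cases above fall out of a single stack of open elements. The sketch below shows that idea on a pre-tokenized input; it is an illustration under assumptions, not the library's implementation. The `balance` function, the token format, and the abbreviated element tables are all hypothetical, and attributes are left out to keep it short.

```javascript
// Assumed, abbreviated element tables; a real parser would need fuller lists.
var EMPTY = { br: 1, hr: 1, img: 1, input: 1, link: 1, meta: 1, base: 1 };
var BLOCK = { div: 1, p: 1, ul: 1, ol: 1, li: 1, table: 1, blockquote: 1 };
var INLINE = { a: 1, b: 1, i: 1, em: 1, span: 1, strong: 1 };
var CLOSE_SELF = { p: 1, li: 1, td: 1, th: 1, tr: 1, dd: 1, dt: 1, option: 1 };

function has(table, tag) {
  return Object.prototype.hasOwnProperty.call(table, tag);
}

// tokens: array of ["start", tag], ["end", tag] or ["chars", text].
function balance(tokens) {
  var out = "";
  var stack = []; // currently open elements, innermost last

  function top() { return stack[stack.length - 1]; }
  function closeTop() { out += "</" + stack.pop() + ">"; }

  tokens.forEach(function (token) {
    var type = token[0], value = token[1];
    if (type === "chars") {
      out += value;
    } else if (type === "start") {
      // A block element ends any inline elements that are still open.
      if (has(BLOCK, value)) {
        while (stack.length && has(INLINE, top())) closeTop();
      }
      // Elements like <p> implicitly close a preceding sibling of the same kind.
      if (has(CLOSE_SELF, value) && top() === value) closeTop();
      if (has(EMPTY, value)) {
        out += "<" + value + "/>"; // empty elements never go on the stack
      } else {
        out += "<" + value + ">";
        stack.push(value);
      }
    } else if (type === "end") {
      // Close everything down to the nearest matching open element;
      // a stray end tag with no match is ignored.
      var pos = stack.lastIndexOf(value);
      if (pos !== -1) {
        while (stack.length > pos) closeTop();
      }
    }
  });

  // Input exhausted: close whatever is still open.
  while (stack.length) closeTop();
  return out;
}

console.log(balance([["start", "b"], ["chars", "Hello "], ["start", "p"], ["chars", "John"]]));
```

Because the stack only tracks what is open, not where an element is allowed to appear, a sketch like this accepts any nesting it is given.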
Note: It does not take into account where in the document an element should exist. Right now you can put block elements in a head or th inside a p and it’ll happily accept them. It’s not entirely clear how the logic should work for those, but it’s something that I’m open to exploring.
You can test a lot of this out in the live demo.
While I doubt this will cover all weird HTML cases – it should handle most of the obvious ones – at least making HTML parsing in JavaScript feasible.
Daniel Luz (May 5, 2008 at 9:14 am)
Great stuff! I’ll sure try it later today. But I guess a closing slash is missing in the XML part of this line:
HTMLtoXML("<img src=test.jpg>") == '<img src="test.jpg">'
As it is now, that’s more like an example of unquoted attributes :)
John Resig (May 5, 2008 at 9:15 am)
@Daniel: My mistake – I was just writing the examples by hand – you can see that it works properly in the demo.
Ara Pehlivanian (May 5, 2008 at 9:16 am)
Great work! This would have come in handy as a comment validator back when I was running my site in application/xhtml+xml, or even when I was overriding document.write and manually parsing 3rd party scripts.
Travis (May 5, 2008 at 9:36 am)
hmm:
hello world<br/>foo<br />bar
changes into:
hello world<br/>foo<br /="/"/>bar
(one of our co-ops noticed it)
Henri Sivonen (May 5, 2008 at 9:37 am)
Since porting the html5lib Python or Ruby parser would take manual effort, I think it would be interesting to see if Google Web Toolkit can compile the Validator.nu HTML parser from Java to JavaScript. If not, porting the trunk of the Validator.nu HTML parser line-by-line should be a better and more mechanic match to languages that look roughly Java-ish or C-ish. (The trunk is being heavily refactored to allow interesting things including straight-forward or even automated porting to C or C++ or perhaps JavaScript with and Gecko-style parser suspendability.)
Philip Taylor (May 5, 2008 at 9:39 am)
Input like
<>
seems to get stuck in an infinite loop.
The HTML 5 parsing algorithm isn’t really that hard to implement – I’ve got a rough JS version here. It’s pretty incomplete (it doesn’t handle things like <script> content, error handling in tables is probably dodgy, it hasn’t followed recent updates to the specification, etc), but it seems to work as a proof-of-concept, and it could probably become reasonably correct with another few days of work. And it’s only, uh, four thousand lines of code. Maybe there’s still room for smaller, less correct parsers…
Sunny (May 5, 2008 at 9:39 am)
Awesome :) Two hiccups when trying it out, though :
<img alt="" src="test.jpg" />
=><img alt="alt" src="test.jpg" /="/"/>
John Resig (May 5, 2008 at 9:49 am)
@Travis and Sunny: Fixed! In my defense, that’s not valid HTML ;-)
@Sunny: Also fixed the alt="" issue.
@Philip: Fixed! Didn’t have any sort of exception handling – was an easy addition. I assume that this parser work is quite new – definitely wasn’t able to find anything back when I was building this in January. Glad to see that some progress is being made! But yeah, 4000 lines is a little bit on the “heavy” side.
Mislav (May 5, 2008 at 9:53 am)
@Travis, Sunny: that’s in fact invalid HTML, but parsers in web browsers seem to ignore the self-closing bit (or maybe they parse it as some weird attribute?), so web authors started happily using them while living in an illusion that they were writing XHTML.
But, I agree that Resig’s parser should handle this nicer than this. Maybe just ignore it.
@Resig: great stuff.
Geoffrey Sneddon (May 5, 2008 at 10:27 am)
A bug I found very quickly:
HTMLtoXML("") == ''
Kirk Cerny (May 5, 2008 at 10:29 am)
Right now you can put block elements in a head or th inside a p and it’ll happily accept them.
Sounds like you need to make a W3C Html Validator in JavaScript.
Philip Taylor (May 5, 2008 at 10:31 am)
John: My tokeniser implementation in JS (and C++ and Perl and OCaml…) was done and described quite a while ago, but I didn’t work on the tree construction part until roughly February, so it is fairly recent. It’s crazy, but fun :-)
John Resig (May 5, 2008 at 10:37 am)
@Geoffrey: I’m not sure I see your point – what would you expect the output to be? That looks valid to me.
@Kirk: Heh, well, not a full validator – but enough to force it into the right shape.
@Philip: Yeah, I can only imagine. Keep up the good work!
Simon Brüchner (May 5, 2008 at 10:45 am)
Just read an article about HTML vs. XHTML: http://www.debuggable.com/posts/xhtml-is-a-joke:4819bf98-4978-4027-896e-2ea44834cda3 which says that XHTML isn’t that required…
Simon Brüchner (May 5, 2008 at 10:46 am)
… but nevertheless cool JS John!!!
Travis (May 5, 2008 at 11:20 am)
Aw c’mon, I was expecting a full JS implementation of Tidy! ;-) Nice work.
Iraê (May 5, 2008 at 12:47 pm)
Very cool implementation!
I’ll see how it plays with AdobeAIR and Jaxer. I’m sure it will be fun!
Geoffrey Sneddon (May 5, 2008 at 6:02 pm)
@John: Numeric character entity references in XML 1.0/1.1 must match a character in the Char production: U+FFFF (a non-character) does not match it, and therefore an entity representing it is non well-formed XML. The entity should be treated as an invalid Unicode character, being replaced with U+FFFD (�) or ?, or totally removed.
Dmitry Baranovskiy (May 5, 2008 at 11:49 pm)
Great stuff! This script could be a saver for WYSIWYG editors. However I found small issue: It recognise as a block level element, but not .
Mike (May 6, 2008 at 3:49 am)
How about converting tags to lowercase?
Rodrigo Asensio (May 6, 2008 at 9:29 am)
Nice work, I will use it to generate html on the fly from js.
thanks!
d3x (May 6, 2008 at 11:00 am)
How about a valid innerHTML method?
Myk Melez (May 6, 2008 at 2:01 pm)
Very cool. Is there a way to make it ignore script tags? I’m thinking it could be useful for parsing untrusted HTML snippets.
Doeke Zanstra (May 7, 2008 at 2:41 am)
One note about your env.js:
You know javascript knows nothing about threads. Maybe you could simulate this behaviour, by using java’s synchronized? Just an idea.
^love *encounter ~flow (May 8, 2008 at 6:39 am)
“this library doesn’t cover the full gamut of possible weirdness that HTML provides, it does handle a lot of the most obvious stuff” — good! did you have a look at http://www.crummy.com/software/BeautifulSoup/ ? it does a wonderful job at healing broken X/HT/MLish stuff and never balks. i use it to parse pointy brackets in http://code.google.com/p/shuttlepod/, and it works like a charm. the good thing is you most of the time get a representation that matches both your expectation, the intention of the author, and the interpretation of the browser. with classical XML parsers, what you get is more often than not an error message, and that is most likely not what you want. i never grokked exactly how L. Richardson set up the rules for healing HTML, but i can say it does work for me. plus, B.S. leaves any idiosyncrytic non-standard stuff as-is in the result, so it makes a very good foundation for the templating engine i’m writing ` tag with a “ and a “ added>.
oh, and default attributes à la “ => “. very good thing that. i’ve always thought of attributes that are mentioned but not filled as representing `true`, plain and simple. HTML can be very declarative, almost like a configuration file, and i think a configuration language should allow the plain mention of names: haveMoney, willTravel, both true. no need to add a nonce value. check it out: `checked` is already more expressive than `checked=’checked’`. does HTML 5 allow that?
~flow
^love *encounter ~flow (May 8, 2008 at 6:44 am)
ok that got swallowed. again, with pointy brackets written as parentheses: “… foundation for the templating engine i’m writing (imagine having a `(video/)` tag with a `(switch/)` and a `(slider default=’30%’/)` added) …”. so that is about server-side custom tags, which BeautifulSoup parses beautifully.
second ommission: “…oh, and default attributes à la `(x a)` => `(x a=’a’)`. very good thing that.”
sorry for the dub-up.
Stephan Schmidt (May 8, 2008 at 8:44 am)
Any plans to support namespaces?
Thanks
-stephan
Alex Robinson (May 16, 2008 at 10:54 am)
Ran into the following parse errors, when attempting to feed html in the wild through the parser…
HTMLtoXML(”)
-> “htmlparser.js”, line 121: exception from uncaught JavaScript
throw: Parse Error:…
HTMLtoXML(”)
-> “htmlparser.js”, line 121: exception from uncaught JavaScript
throw: Parse Error:…
HTMLtoXML(‘\n/* */\n’)
-> “htmlparser.js”, line 121: exception from uncaught JavaScript
throw: Parse Error:…
HTMLtoXML(‘\n/* */\n’)
->
/* */
(NB. the comment pops out of the style tag!)
All that aside, brilliant work John.
Alex Robinson (May 16, 2008 at 11:02 am)
(Gah! I totally misread the note. I thought it meant that code would be wrapped and angle brackets converted automatically. Try again)
HTMLtoXML('<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"></html>')
-> "htmlparser.js", line 121: exception from uncaught JavaScript throw: Parse Error:…
HTMLtoXML('<meta http-equiv="content-type" content="text/html; charset=utf-8">')
-> "htmlparser.js", line 121: exception from uncaught JavaScript throw: Parse Error:…
HTMLtoXML('<style type="text/css">\n/* <![CDATA[ */\n/* ]]> */\n</style>')
-> "htmlparser.js", line 121: exception from uncaught JavaScript throw: Parse Error:…
HTMLtoXML('<style type="text/css">\n/* */\n</style>')
-> <style type="text/css"></style>
/* */
(NB. the comment pops out of the style tag!)
STEVO (May 22, 2008 at 11:02 am)
<script>
code
</script>
==>
<script></script>
code
any chance of a fix for this?
ta
Ryan (May 26, 2008 at 11:43 am)
Great work! Kinda like Sarissa, but in full JS with full control.
Some problems with Sarissa that also is a problem with htmlparser.js:
some text with this inside
Also I had some problems with & in Sarissa, but it seems to work ok with your code.
Ryan (May 26, 2008 at 11:46 am)
Ugg:
<div> some text with this < inside </div>
Weston Ruter (June 4, 2008 at 1:47 pm)
Hey John, I’ve incorporated this HTML Parser into an implementation of document.write() for XHTML, which I know you’ve also worked on: http://weston.ruter.net/projects/xhtml-document-write/
Hansor (June 5, 2008 at 7:12 am)
XMLtoDOM doesn’t work 4 me, though.
HTMLtoDOM("Data dddtestddd", document);
Gets me:
"String contains an invalid character" code: "5"
on line 273
with FF2.0 on Linux.
Nice project, though!
Tobi Buschi (June 22, 2008 at 9:02 am)
Great library!
Is there an easy way to indent the xml-code?