I’ve been working on a browser-based word game, naturally written in JavaScript, and have been encountering some interesting technical challenges along the way. I’ve written up my thought process here for others to learn from (note that most of this happened over the course of a month or so). I’ve often found that while a final solution to a problem may be rather elegant and “make perfect sense” when looking at it, it’s only through much trial and error that the solution was arrived at.
To start, in my game, the user is frequently re-arranging letters – causing the game to look up words in a dictionary to see if they are valid or not.
I’ve taken multiple passes at implementing a solution to this problem, ranging all the way from “I don’t care about performance, I just want it to work” all the way up to “thousands of people could be playing simultaneously, how do I scale?”
For my seed dictionary I used a variation of the Scrabble dictionary, which can be found via some creative Googling. The full dictionary ends up being around 916KB (with words separated by an endline).
Server-Side Solution
The first pass was stupid simple. It worked, but just barely. I took the dictionary and split it up into 26 files – one for each letter of the alphabet – putting every word into the file matching its first letter.
I then made a little PHP script to handle user requests – reading in the relevant portion of the dictionary and returning “pass” if the word was found, or “fail” otherwise.
<?php

# Get the word to be checked from the user
$word = $_GET['word'];

# Get the first letter of that word
$first = substr( $word, 0, 1 );

# Open the corresponding file
$handle = fopen( "words/" . $first . ".txt", "r" );

if ( $handle ) {
    # Keep going until the end of the file
    while ( !feof( $handle ) ) {
        # Get the word in the dictionary
        # (removing the endline)
        $line = trim( fgets( $handle ) );

        # And see if the word matches
        if ( $line == $word ) {
            # If so return "pass" to the client
            echo "pass";
            exit();
        }
    }

    fclose( $handle );
}

# If we made it to the end, then we failed
echo "fail";
?>
(Please don’t use the above code, it’s terrible.)
Thus you would call the PHP script like so:
/words.php?word=test
And it would return either “pass” or “fail” (meaning that it would only work on the same domain).
So while the above worked, it wasn’t nearly fast enough – and it consumed a ton of memory on the server: each request was reading in large portions of potentially 100KB+ files. Time for a better solution.
A Better Server-Side Solution
My next attempt was to write a simple little web server (in Perl, this time) that pre-loaded the entire dictionary into memory, and handled all the requests via JSONP.
#!/usr/bin/perl

use CGI;

# Read the dictionary into the word hash
my %words = map { $_, 1 } split( /\n/, `cat dict/dict.txt` );

# Instantiate and start the web server
use base qw(Net::Server::HTTP);
__PACKAGE__->run( port => 8338 );

sub process_http_request {
    # Make a CGI object for the request
    my $cgi = CGI->new;

    # Get the word from the user
    my $word = $cgi->param( "word" );
    $word =~ s/\W//g;

    # And the callback name
    my $callback = $cgi->param( "callback" );
    $callback =~ s/\W//g;

    print "Content-type: text/javascript\n\n";

    if ( $word && $callback ) {
        # Dump back the results in a JSONP format
        print $callback . '({"word":"' . $word . '","pass":' .
            ( defined $words{ $word } ? 'true' : 'false' ) . '})';
    }
}
You would call the above using something like this:
http://example.com:8338/?word=test&callback=wordFound
There are a bunch of things that I don’t like about the above code (like using a hash to look up the words when more memory-efficient structures exist, and instantiating a CGI object on every request – things like that). Also, if I were to do this today I’d probably write it in Node.js, which I’m becoming more familiar with.
Compared with the previous solution, though, there are a number of advantages. Since the dictionary is loaded into memory only once, and into a hash, lookups become very, very fast – I was timing entire HTTP requests at only a handful of milliseconds, which is great. Additionally, since this solution utilized JSONP, it was possible to set up a server (or multiple servers) dedicated to looking up words and have clients connect to them cross-domain.
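For completeness, here’s roughly how a client would consume that endpoint via JSONP (a minimal sketch – the callback name matches the example URL above):

// Define the global callback that the server's JSONP response invokes
function wordFound( data ) {
    console.log( data.word + ( data.pass ? " is" : " is not" ) + " a word" );
}

// Inject a script tag pointing at the dictionary server
var script = document.createElement( "script" );
script.src = "http://example.com:8338/?word=test&callback=wordFound";
document.getElementsByTagName( "head" )[ 0 ].appendChild( script );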
A Client-Side Solution
It was at this point that I realized a couple of problems with the previous server-side solutions. If my game was going to work offline (or as a distributable mobile application) then a constant connection to a dictionary server wasn’t going to be possible. (And once slow mobile connections were taken into account, the responsiveness of the server just wasn’t going to be sufficient.)
The dictionary was going to have to live on the client.
Thus I wrote a simple Ajax request to load the text dictionary and make a lookup object for later use.
// The dictionary lookup object
var dict = {};

// Do a jQuery Ajax request for the text dictionary
$.get( "dict/dict.txt", function( txt ) {
    // Get an array of all the words
    var words = txt.split( "\n" );

    // And add them as properties to the dictionary lookup
    // This will allow for fast lookups later
    for ( var i = 0; i < words.length; i++ ) {
        dict[ words[i] ] = true;
    }

    // The game would start after the dictionary was loaded
    // startGame();
});

// Takes in an array of letters and finds the longest
// possible word at the front of the letters
function findWord( letters ) {
    // Clone the array for manipulation
    var curLetters = letters.slice( 0 ),
        word = "";

    // Make sure the word is at least 3 letters long
    while ( curLetters.length > 2 ) {
        // Get a word out of the existing letters
        word = curLetters.join("");

        // And see if it's in the dictionary
        if ( dict[ word ] ) {
            // If it is, return that word
            return word;
        }

        // Otherwise remove another letter from the end
        curLetters.pop();
    }
}
Note that having the dictionary on the client actually allowed for an interesting new form of gameplay. Previously the player had to select a specific word and send it to the server – and only then would the server say whether that word was valid. Since the dictionary now lives on the client (making lookups instantaneous, in comparison) we can change the logic a bit: the game now looks for the longest word at the start of the user’s letters. For example, if the user had the letters “rategk” then the function would work like so:
findWord( [ "r", "a", "t", "e", "g", "k" ] ) // => returns "rate" findWord( [ "k", "t", "a", "g", "e", "k" ] ) // => returns undefined
Naturally, something similar could’ve been done on the server side – but that wasn’t readily apparent while working on the server. This is a case where a performance optimization actually created a new, emergent form of gameplay.
But here’s the rub: we’re now sending the entire dictionary down to the client – and at 916KB, that’s absolutely massive.
Optimizing the Client-Side Solution
Compression and Caching
Turning on Gzip compression on the server reduces the dictionary file size from 916KB to a much-more-sane 276KB. Additionally configuring the cache-control settings of your server will ensure that the dictionary won’t be requested again for a very long time (assuming that the cache in the browser isn’t cleared).
There are some excellent articles already written on these techniques:
- Configuring your Cache Settings in Apache
- Configuring your Compression Settings in Apache
- Julien Lecomte also wrote a useful PHP script to do both simultaneously
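For reference, a minimal sketch of what those articles cover (assuming mod_deflate and mod_expires are enabled – the directives and lifetimes here are illustrative, not a drop-in config):

# Compress text responses on the fly
AddOutputFilterByType DEFLATE text/plain text/javascript

# And tell browsers to hold on to them for a long time
ExpiresActive On
ExpiresByType text/plain "access plus 1 year"
ExpiresByType text/javascript "access plus 1 year"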
Content Delivery Networks
Of course, all of this is a bit of a given when you use a Content Delivery Network – which I most certainly do. Right now I’m using Amazon CloudFront, due to its relatively simple API, but I’m open to other solutions. This means that the dictionary file will be positioned on a large number of servers around the globe and served to the user in the fastest manner possible (using both gzip and proper cache headers).
Cross-Domain Requests
There’s a problem, though: we can’t load our dictionary from a CDN! Since the CDN is located on another server (or on another sub-domain, as is the case here) we’re at the mercy of the browser’s same-origin policy prohibiting those types of requests. All is not lost though – with a simple tweak to the dictionary file we can load it across domains.
First, we replace all endlines in the dictionary file with a space. Second, we wrap the entire line with a JSONP statement. Thus the final result looks something like this:
dictLoaded('aah aahed aahing aahs aal... zyzzyvas zzz');
This allows us to do an Ajax request for the file and have it work as we would expect it to – while still benefitting from all the caching and compression provided by the browser.
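Generating that wrapped file is a one-time build step. A quick Node.js sketch (the file paths match the ones used elsewhere in this post; the script itself is hypothetical):

var fs = require( "fs" );

// Read the newline-delimited dictionary
var txt = fs.readFileSync( "dict/dict.txt", "utf8" );

// Replace endlines with spaces and wrap the result in the JSONP callback
var out = "dictLoaded('" + txt.trim().split( "\n" ).join( " " ) + "');";

fs.writeFileSync( "dict/dict.js", out );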
I alluded to another problem already: if the browser doesn’t cache the file for some reason, or if the cache runs out of space and the file is expunged, it’ll be downloaded again. I want to try and reduce the number of times the 276KB file will be downloaded – if not for the users then to reduce my bandwidth bill.
Local Storage
This is where we can use a great feature of HTML 5: Local Storage. Mark Pilgrim has a great tutorial written up on the subject that I recommend to all. In a crude nutshell: You now have an object that you can stuff strings into and they’ll be persisted by the browser. Most browsers give you around 5MB to play with – which is more than enough for our dictionary file. It also has great cross-browser support – with the simple API working across all modern browsers.
With a little tweak to our Ajax logic (taking into account the JSONP request, the CDN, and Local Storage) we now end up with a revised solution:
// See if the property that we want is pre-cached in the localStorage
if ( window.localStorage !== null && window.localStorage.gameDict ) {
    dictReady( window.localStorage.gameDict );

// Load in the dictionary from the server
} else {
    jQuery.ajax({
        url: cdnHREF + "dict/dict.js",
        dataType: "jsonp",
        jsonp: false,
        jsonpCallback: "dictLoaded",
        success: function( txt ) {
            // Cache the dictionary, if possible
            if ( window.localStorage !== null ) {
                window.localStorage.gameDict = txt;
            }

            // Let the rest of the game know
            // that the dictionary is ready
            dictReady( txt );
        }
        // TODO: Add error/timeout handling
    });
}
This gives us an incredibly efficient solution, allowing us to load the dictionary from a CDN (gzipped and with proper cache headers) – and avoiding subsequent requests to get the file if we already have it cached.
Improving Memory Usage
There’s one final tweak that we can make. If you remember, in the previous dictionary lookup we loaded the entire dictionary into an object and then checked to see if a specific property existed. While this works, and is fast, it also ends up consuming a lot of memory (more so than the raw 916KB, at least).
To avoid this we can be a little bit tricky. Instead of putting the words into an object we can just leave the entire dictionary string intact and then do searches using a JavaScript String’s indexOf method.
Thus inside the dictReady callback we’ll have something like the following:
function dictReady( txt ) {
    dict = " " + txt + " ";
}
and our revised findWord function will look something like this:
// Finding and extracting words
function findWord( letters ) {
    // Copy all the tiles
    var curRack = letters.slice( 0 ),
        word = "";

    // We're going to keep going through the available letters
    // looking for a long-enough word
    while ( curRack.length >= 3 ) {
        // Find a "word" from the available tiles
        word = curRack.join("");

        // ... and see if it's in the dictionary
        if ( dict.indexOf( " " + word + " " ) >= 0 ) {
            // We've found the word and we can stop
            return word;
        }

        // ... otherwise we need to remove the last letter
        curRack.pop();
    }
}
All together, as of this moment, this is the best solution to this particular problem that I can think of. It’s likely that I’ll find some additional tweaks (or ways of improving memory usage) in the future – but at the very least this solution keeps HTTP requests to a minimum, bandwidth usage to a minimum, memory usage to a minimum, and lookups fast.
Daniel (March 15, 2011 at 6:28 pm)
Have you looked into bloom filters? They seem to be a great answer for your problem. Very compact, quick lookup, 0% false negatives. The downsides: false positives are possible (you can control the odds of that happening by sizing your filter), and you can’t traverse the list of known words (only test for their presence). Implementing one in JavaScript shouldn’t be hard, using an array of 32-bit values or a string of Unicode characters, for instance.
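To make that concrete, a toy sketch (the sizing, hash choice, and names here are arbitrary – purely illustrative):

function BloomFilter( bits, hashes ) {
    this.bits = bits;
    this.hashes = hashes;
    // The bit array, packed into 32-bit integers
    this.buckets = [];
    for ( var i = 0; i < Math.ceil( bits / 32 ); i++ ) {
        this.buckets[ i ] = 0;
    }
}

// A simple seeded string hash (djb2-style)
BloomFilter.prototype.hash = function( word, seed ) {
    var h = 5381 + seed;
    for ( var i = 0; i < word.length; i++ ) {
        h = ( ( h << 5 ) + h + word.charCodeAt( i ) ) | 0; // h * 33 + c
    }
    return ( h >>> 0 ) % this.bits;
};

BloomFilter.prototype.add = function( word ) {
    for ( var i = 0; i < this.hashes; i++ ) {
        var bit = this.hash( word, i );
        this.buckets[ bit >> 5 ] |= 1 << ( bit & 31 );
    }
};

BloomFilter.prototype.test = function( word ) {
    for ( var i = 0; i < this.hashes; i++ ) {
        var bit = this.hash( word, i );
        if ( !( this.buckets[ bit >> 5 ] & ( 1 << ( bit & 31 ) ) ) ) {
            return false; // definitely not in the set
        }
    }
    return true; // probably in the set (false positives possible)
};

// var filter = new BloomFilter( 1 << 20, 7 );
// filter.add( "rate" );
// filter.test( "rate" ); // => true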
Curtis Lassam (March 15, 2011 at 6:30 pm)
I’m sure “Make Pilgrim” will be very happy that you mentioned him. ^_^
Ben Newman (March 15, 2011 at 6:36 pm)
You can use a “trie” data structure[1] to get O(max word length) lookups and save tons of space by merging common prefixes.
If you’re really space-conscious, you can even do a second pass and merge common suffixes, too, which produces a data structure called a Directed Acyclic Word Graph (DAWG).[2]
[1] http://en.wikipedia.org/wiki/Trie
[2] http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
Luke Brookhart (March 15, 2011 at 6:40 pm)
John, I recently wrote a small web app for finding words: http://wordfinder.mobi
While I see you’re trying to do a client-side lookup, I opted for the serverside solution, along with a MySQL database, since my app does complex matching.
Daniel (March 15, 2011 at 6:40 pm)
Also, an (indexed) database should be very efficient for lookups and memory usage, both on the server side and in Web SQL (more so than localStorage, since SQLite will know how to manage things without necessarily keeping the whole database in memory). I’m not familiar with IndexedDB, but since it’s also (presumably) backed by SQLite, it should be good.
Brian (March 15, 2011 at 6:43 pm)
Two small mistakes:
Make Pilgrim -> Mark Pilgrim
“persisted by the server” -> “persisted by the browser”
Paul Sayre (March 15, 2011 at 6:58 pm)
Another option is to hash the words into a number. Store those numbers in an ordered array. To look up a word, hash the word into a number and do a binary search on the array. Not sure how much memory that would take, but it would be fast as heck!
David Pio (March 15, 2011 at 6:58 pm)
Agree with Ben..my first thought was “Trie”
DAWG sounds pretty cool too…especially the acronym SUP DAWG
Randall (March 15, 2011 at 6:59 pm)
There’s definitely a risk of going way too far down the rabbit hole of premature optimization. But you could have 26 dictionary strings, one for words starting with each letter, or 676 for words starting with (or consisting of) each pair of letters. If you do one of those, you could save a tiny bit of RAM by leaving the first letter (or two) off each entry.
And I’m sure it’s possible to do all sorts of algorithms — binary searches, radix tries represented as one long string, who knows what — but, honestly, scanning a reasonable-sized string is so dang fast the fancy stuff probably isn’t worth it.
Two fun ideas from that “probably not useful” zone: if the lookup delay is still noticeable, move the lookup to a time the user won’t notice it — after a tile is placed instead of after the play’s submitted. If the browser supports workers, it could even be in the background. And if the load time is a problem (I think it’s not), you could use a server-side dict until the client dict is all loaded.
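For instance, the worker version might be sketched like so (the file name is hypothetical):

// On the main page: spawn a worker and ask it about words
var worker = new Worker( "lookup-worker.js" );

worker.onmessage = function( e ) {
    console.log( e.data.word + ": " + e.data.pass );
};

worker.postMessage( "rate" );

// And inside lookup-worker.js (assuming it has loaded `dict`
// as the space-delimited string from the post):
// self.onmessage = function( e ) {
//     self.postMessage({
//         word: e.data,
//         pass: dict.indexOf( " " + e.data + " " ) >= 0
//     });
// };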
gossi (March 15, 2011 at 7:02 pm)
Hmm, you could’ve simply loaded a JavaScript file with the dict in it through a script tag, instead of using an Ajax request. The script tag could be inserted dynamically, so your logic stays the same – but it removes the security issues with JSONP that have been rumoured around the web (I don’t know the details) and the request isn’t prohibited by the browser’s cross-origin policy.
dict.js:
dict = "my words for my … dictionary";
// err, yes, there are some differences in how browsers handle parallel/async script loads – that would probably reduce my idea a little – nevertheless I wanted to mention it.
kikito (March 15, 2011 at 7:35 pm)
+1 to Luke’s Trie comment. It’s exactly the right tool for the job.
Jesper (March 15, 2011 at 7:59 pm)
“window.localStorage !== null”. Shouldn’t that be “!=”? When localStorage is unavailable, it will be undefined, which == null but !== null so I’m thinking it’ll go on to crash at “window.localStorage.gameDict”.
Nick (March 15, 2011 at 8:00 pm)
The indexOf solution is definitely not algorithmically optimal.
The complexity of indexOf is linear in the number of characters in the string.
Dictionary lookups can (and should) be done in logarithmic time or better.
A Trie is definitely an excellent structure to play around with.
Its complexity (number of comparisons) is bounded: O(min(a, log(b))) – Where a is the length of the lookup word and b is the size (in words) of the dictionary.
Lucatoni (March 15, 2011 at 8:01 pm)
It’s funny that a JavaScript guy gets the idea to make a web server in Perl. How oldschool is that? Did a server-side variant in Node.js ever occur to you?
Henrik Hinrichs (March 15, 2011 at 8:07 pm)
+1 for Ben Newman’s Trie structure.
I played around with dictionary games before, and last year I wrote some Node.js code that takes a wordlist and actually generates a JS library or JSON. It also performs very fast when searching for correct words, and can be enhanced to contain word descriptions (taking another creative googling session).
The JS library also gzips better than the original wordlist – and is also smaller in the end.
You can find the MIT licensed code on github – https://github.com/mutaphysis/wordlistparser – feel free to use or ignore it.
Dave Ward (March 15, 2011 at 9:12 pm)
I was going to respond with “use a trie!” earlier too, but I figured I’d put some code where my mouth was. I’ve been implementing something very similar to this in C#, so I figured it would be fun to convert to JS.
Say you had a wordlist of ['bar', 'bars', 'foo', 'rat', 'rate'], you might boil that down to a trie structure like this to save space:
var trie = {
b: {
a: {
r: {
end: true,
s: {
end: true
}
}
}
},
f: {
o: {
o: {
end: true
}
}
},
r: {
a: {
t: {
end: true,
e: {
end: true
}
}
}
}
};
Not only does that sort of structure save space when you have a large wordlist, but you can traverse it for the longest match *very* quickly with an approach like this:
function findLongestWord(letters, trieContext) {
trieContext = trieContext || trie;
// If the trie contains a key matching the next letter,
// descend into that level of the trie and do it again.
if (trieContext[letters[0]]) {
// However, don't return this back up until we know
// there's a match farther down there somewhere.
var whatLiesBelow = findLongestWord(letters.slice(1), trieContext[letters[0]]);
// But if there is, it's the longest result; concatenate
// and send it back up the stack.
if (whatLiesBelow)
return letters[0] + whatLiesBelow;
}
// If none of that worked, check to see if we're at the end
// of any valid words and return this terminal letter if so.
if (trieContext[letters[0]] && trieContext[letters[0]].end) {
return letters[0];
}
// Else, there's just nothing interesting at this point.
// Return false back up and hope we had better luck earlier.
return false;
}
// Returns 'rat'
findLongestWord('ratt');
// Returns false
findLongestWord('unrate');
// Returns false
findLongestWord('abc');
// Returns 'rate'
findLongestWord(['r', 'a', 't', 'e']);
Apart from the performance/memory optimization, one nice thing about the trie is that it can handle wildcards easily (e.g. blank tiles in Scrabble) if you decide that you want them in the future. Adding wildcard support to an indexOf() approach massacres lookup performance.
Grant (March 15, 2011 at 9:55 pm)
We used a trie for the first version of Scrabbly (now Word²). compact, easy to use and quite nice, performance-wise.
David Stafford (March 15, 2011 at 10:17 pm)
I wrote this almost five years ago. It’s a Flash-based anagram / crossword solver:
http://www.davidst.com/bananagram/
A couple years ago I wanted to know if javascript could be fast enough so I redid the anagram finder here:
http://www.davidst.com/jsanagram/
It only does anagrams (and doesn’t support the “use all letters” option, either.) I didn’t get the crosswords in there. (There wasn’t any point. I was testing performance and anagrams are the hard part.)
Don’t bother looking at the code. I was just scrapping about trying to figure out how to make it work and I didn’t know anything about javascript at the time. It’s pretty ugly.
Thomas (March 15, 2011 at 11:18 pm)
Everyone is yelling to use a trie. Sure. It’s a great solution, and pretty dang perfect in this case. But just as a thought exercise you could consider your dataset as a perfectly balanced tree (because it’s never changing, your distinction for balancing can be arbitrary). If all you need to do is look up a word that you know you are looking for, then a binary search will get you there hella-fast (pick the “middle” word in your set, check if it’s greater than – alphabetically in this case – or less than your target word, then split the remaining half and recurse). If you make each word a fixed length (which will probably add a few KB to your dataset), then you can even do it with byte-offsets into a string, and all your comparisons are integers. I’ve done it that way with a simple database I built, and it could look up a fixed-length record in a 100MB dataset in about 20 milliseconds (if the file was in memory…)
Good luck! Have fun!
John Resig (March 15, 2011 at 11:26 pm)
@Curtis, Brian: Thanks for the typo fixes!
@All: Thanks for the Trie, DAWG, and Bloom filter suggestions! I’ll do a follow-up post and talk about those specifically. Note that this post was mostly written to get people thinking about practical forms of optimization (that hopefully transcend dictionaries themselves – which only have limited applications in JavaScript applications). I’ll definitely do another post digging deeper into dictionary lookups themselves. Thanks!
Doug (March 15, 2011 at 11:35 pm)
Yeah Steve Hanov’s been working on a similar problem. He is using trie and DAWG structures as others suggested, and shared his code:
http://stevehanov.ca/blog/index.php?id=115
http://stevehanov.ca/blog/index.php?id=114
He uses it for the site http://rhymebrain.com/
Dhruv (March 15, 2011 at 11:42 pm)
@Thomas I quite like your suggestion about the perfectly balanced binary tree (in an array) and I too have experienced it to be the fastest solution. The “keeping-all-words-fixed-length-in-a-string” trick is very useful if you are working with files, but I guess for in-memory stuff an array should be just fine.
Having said that though, I guess performance doesn’t matter so much (as memory usage) for a browser app. As John Resig noticed, perf. was an issue on the server since one server would be serving multiple requests (memory could be compromised since it would be amortized across all the requests) whereas in the browser, memory is a problem because you are eating up everyone’s RAM (which comes at a slight premium).
Joe (March 15, 2011 at 11:55 pm)
Couldn’t you just set up a cname redirect from your site to cloudfront, allowing you to access the dictionary without a callback?
Patrick Hall (March 16, 2011 at 12:52 am)
Hi John,
Using Bloom Filters by Maciej Ceglowski is a great intro to that topic.
As long as you’re messing around I’d encourage you to test with some non-ASCII text sooner rather than later. You’ll thank yourself later for bringing up all the Unicode bugs early on…
gazeteler (March 16, 2011 at 1:22 am)
very useful.. thanks
Emmanuel Briot (March 16, 2011 at 4:25 am)
Before I looked at the comments, I was also thinking trie, and made a few pseudo-code examples.
Starting with the string you had in your post, “aah aahed aahing aahs aal” (25 characters), it would result in a trie that looks like this (the first column is the level of the node):
0 a
1 a
2 h#
3 e
4 d#
3 i
4 n
5 g#
3 s#
2 l#
That can be encoded for transmission from server to client as follows. A simple convention: ‘+’ goes up one level, letters start a new level, ‘#’ indicates a valid end of word. (‘-’ is often a valid letter in words, so don’t use it in place of ‘+’.)
aah.ed#+ing#++s#+l# (19 characters)
I will leave as an exercise to verify that this can indeed be decoded into the initial trie tree.
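(For the impatient, a quick JavaScript sketch of such a decoder – under my reading of the convention, ‘#’ marks a valid word at a node with no children, and so also pops a level, while ‘.’ marks a valid word whose node still has children:)

function decode( encoded ) {
    var stack = [], words = [];

    for ( var i = 0; i < encoded.length; i++ ) {
        var ch = encoded.charAt( i );

        if ( ch === "+" ) {
            // Go up one level
            stack.pop();
        } else if ( ch === "#" ) {
            // Valid end of word with no children; pop the level
            words.push( stack.join( "" ) );
            stack.pop();
        } else if ( ch === "." ) {
            // Valid end of word, but children follow
            words.push( stack.join( "" ) );
        } else {
            // A letter starts a new level
            stack.push( ch );
        }
    }

    return words;
}

// decode( "aah.ed#+ing#++s#+l#" )
// => [ "aah", "aahed", "aahing", "aahs", "aal" ]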
Now, the client needs to decode this into a lookup structure, since the string as such cannot be used for easy lookup. I do not think using JavaScript objects is a good approach, since they don’t really save memory (lots of dictionaries and lookups in there). I suggest instead using a simple array of tuples: each tuple is a letter and the index of the next letter at the same level (and with the same parent). The letter should be encoded as a Unicode integer value, not a String, to save space (say 2 bytes for the letter and 2 for the index – that’s four bytes per letter in the trie tree, less than if we were using a hashtable or n-ary tree, but more than the encoded trie). We also need to indicate which letter is a final end. That might require an additional byte, unless we encoded that in the index itself as a special bit.
0 => (a, , -1) # No other letter on level 0
1 => (a, , -1)
2 => (h, #, 9) # other letter at the same level is "l", index 9
3 => (e, , 5)
4 => (d, #, -1)
5 => (i, , 8)
6 => (n, , -1)
7 => (g, #, -1)
8 => (s, #, -1)
9 => (l, #, -1)
Here is some pseudo-code that verifies whether a word is valid. I haven’t checked this in detail.
current = 0;
parent = -1;

for each letter in word {
    // current points to first candidate entry
    while (current != -1) {
        if (trie[current][0] == letter) {
            // We found the entry for the current letter,
            // move to its first child node
            parent = current;
            current++;
            // Move to next letter
            break;
        } else {
            current = trie[current][2]; // next letter, same level
        }
    }
}

valid_word = (current != -1 && trie[parent][1] == '#');
Adam Burmister (March 16, 2011 at 5:21 am)
Here’s a bloom filter I implemented a while ago now: https://github.com/adamburmister/JavaScript-Bloom-Filter/blob/master/bloom_filter.js
nmm (March 16, 2011 at 5:33 am)
A good tip is to not parse strings with JavaScript code if all you’re doing is simple splitting. Instead wrap the string with JSON and use JSON.parse(). The difference can be HUGE depending on the browser – I’ve seen a 20x decrease in time.
So instead of string.split("1 2 3 4 5", " ");
do JSON.parse("[1,2,3,4,5]");
I don’t remember the exact way of splitting(splice?) in javascript so consider the above code pseudojavascript.
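In working form, the comparison would be something like:

// Parse by splitting a delimited string...
var viaSplit = "1 2 3 4 5".split( " " );

// ...versus handing a JSON array to the native parser
var viaJSON = JSON.parse( "[1,2,3,4,5]" );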
For strings you will of course have to transfer two extra quote-marks per word but gzip should handle that quite well since it’s a consistent pattern of “,”
For the lookup i guess a sorted list of the words with binary search would have quite fast lookups and pretty low memory usage.
In addition to that, I don’t know if it’s such a great idea to use localStorage as a cache. Browsers have a dedicated cache for a reason.
Umair Jabbar (March 16, 2011 at 6:05 am)
Hi,
Thanks for sharing this. I am working on a user friendly typing speed calculator and this one here gave me some good inspiration.
Helen Neely (March 16, 2011 at 6:38 am)
@Joe – I think what John is looking for here is for his design to also work on mobile devices if and when he ports it to those platforms. Your suggestion is still valid, but redirecting mobile users would require network data connectivity – which isn’t cheap.
Teddy (March 16, 2011 at 7:37 am)
This is cool! I wrote my own ‘word game’ about 4 years ago, b/c I couldn’t stand how loud the shuffling of the dice is. I also didn’t like upside down or sideways-oriented dice.
It was all server-side and brute force dictionary matches, for generation of answers. I was able to match words with up to 13 letters before my PHP code just took way too long to run.
Kevin Hakanson (March 16, 2011 at 9:27 am)
Are there any memory size optimizations knowing that 26 letters (a-z) can fit in 5 bits but a character in JavaScript takes 16? Maybe put three characters into a 3-gram (http://en.wikipedia.org/wiki/N-gram) since 3×5 = 15 < 16? Of course, this is a new problem space for me, so I’m just brainstorming – and I haven’t done much bit shifting in JavaScript.
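A sketch of the packing step (the helper name is mine, purely illustrative):

// Pack three a-z letters (5 bits each) into one 15-bit number,
// which fits in a single JavaScript character code
function packTrigram( a, b, c ) {
    var base = "a".charCodeAt( 0 );
    return ( ( a.charCodeAt( 0 ) - base ) << 10 ) |
           ( ( b.charCodeAt( 0 ) - base ) << 5 ) |
             ( c.charCodeAt( 0 ) - base );
}

// Each trigram becomes one character of a packed string
var packed = String.fromCharCode( packTrigram( "c", "a", "t" ) );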
Doeke (March 16, 2011 at 10:40 am)
What about regular expressions? You could also do something like this:
function findWord( letters ) {
var test = "";
for(var i = letters.length-1; i>0; i--) {
test = "(" + letters[i] + test + ")?";
}
test = "\\b" + letters[0] + test + "\\b";
var result = new RegExp(test).exec( dict );
if (result) return result[0];
}
Jeremy Martin (March 16, 2011 at 10:59 am)
Funny timing on this post. I’ve been working on a word-finder in Node.js for the last weekend or so. It uses the trie structure along with some clever prime factorization logic for the node filtering (plus a few trade secrets ;)), and the results are insanely fast.
For example, using the Words With Friends dictionary, I can locate 974 possible matches from the input string “qewfjvdslejsdfjuamcm” (20 chars) with an average look up time of 1.3ms.
Another advantage of this data structure is that you can do efficient (and algorithmically sane) “blank tile” searches. For example, searching for “abcd**” (indicating two blank tiles) returns 1234 matches in only 1.7ms.
Protip: sort the words before creating the trie, and then store the sorted word on the end node of a given path. If you sort alphabetically, you get space optimization, and if you sort based on character frequency, you get early pruning :)
Jeremy Martin (March 16, 2011 at 11:07 am)
Correction: I meant to say to sort the characters in each word before creating the trie, and store the _un_sorted word on the last node in that word’s node path.
Milo van der Leij (March 16, 2011 at 11:34 am)
I’m so happy to find so many other people who immediately started hashing out data structures and pseudocode! I agree with the trie solution. Precompile the dictionary into a JSON trie and send that (gzipped, cached) to the client.
@Jeremy – loving the idea of sorting the letters before creating/using the trie. Having the entire word present at the end of a trie path seems to defeat the space-saving benefit of a trie, though.
Jeremy Martin (March 16, 2011 at 11:45 am)
@Milo
It’s definitely a space/cpu tradeoff (the characteristics of which are both relative to the dictionary input). If you’re doing a “pure trie”, then storing the words at the end doesn’t make any sense, since you already have the characters present in each node (and they’ll be in order if you haven’t sorted them). In my solution each node actually stores a prime number, based on the character it represents. That way I can do super fast node matching based on the result of modding it against a product derived from the input string. In JavaScript this requires some trickery since that product quickly grows too large for a long to store accurately, but it can be made to work. Anyway, storing the word on the node was necessary in my case, but definitely not for every implementation.
Mathias Nater (March 16, 2011 at 11:54 am)
Hi
Working on hyphenator.js (http://code.google.com/p/hyphenator/) I ran into the same question:
How can I effectively transmit the relatively large pattern files (~100KB each) to the client?
Currently, I’m using the following format:
//pseudo
patterns = {
2 : 'patternsOfLength2',
3 : 'patternsOfLength3',
n : 'patternsOfLengthN'
};
This allowed me to save some new-line characters.
After loading those packed patterns, the client ‘unpacks’ the patterns by splitting the lines and stores them in a more verbose object.
If available, the unpacked object is stored in localStorage, thus the unpacking has only to be done once.
The same should work with words. A packer can be found here: http://hyphenator.googlecode.com/svn/trunk/compressor.html
Since the original hyphenation algorithm by F. M. Liang uses a trie, I experimented with tries, too.
But the format proposed by Dave Ward above turned out to be too verbose for transmission. The format proposed by Emmanuel Briot looks very promising, though.
Thanks for sharing your thoughts.
Mathias
Ryan (March 16, 2011 at 1:31 pm)
Tone those self-congratulations down a bit eh Jeremy?
Jeremy Martin (March 16, 2011 at 2:55 pm)
@Ryan Fair enough, I suppose (albeit unwarranted). The point wasn’t to be self-congratulatory though, so much as to put some numbers behind the efficiency of using a trie. I didn’t invent the structure, and the idea of using prime factorizations for anagram-style problems isn’t original either. Rather than simply chiming in with “ya, tries are fast”, it seems more useful to provide some numbers around it.
Randall (March 16, 2011 at 5:07 pm)
Can’t help but think we’re partly missing the point talking about algorithms. John’s post is about a practical all-around approach (simple, compact, fast enough), not squeezing every bit of speed he can out of lookup.
Nonetheless, I’ll probably come back later with my own entry in the algorithm-golf tournament :)
Christian Harms (March 16, 2011 at 5:09 pm)
@Mathias: I like this idea, but I don’t know if it uses less memory than the trie structure. By the way, you don’t have to split the strings if the words are sorted – you can use the good old binary search with index access.
Stephen Jones (March 16, 2011 at 6:11 pm)
I agree Emmanuel Briot’s compression is good but you might consider using the shared prefix format used for Unix ISpell (I think). Words are sorted and compressed to a shared prefix count and suffix. So,
span, spangle, spar, sparse, sparsely, toast
is compressed to:
span, 4gle, 3r, 4se, 6ly, 0toast
Or, more compactly as: span4gle3r4se6ly0toast
(where 0 prefix means new root word)
Emmanuel’s example “aah.ed#+ing#++s#+l#” becomes “aah3ed3ing3s2l”, a savings of 25%.
You can compress further by using punctuation characters for common suffixes: ! = ing, * = ment, etc.
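A quick JavaScript sketch of the prefix step (the helper names are mine):

function commonPrefixLength( a, b ) {
    var i = 0;
    while ( i < a.length && i < b.length && a.charAt( i ) === b.charAt( i ) ) {
        i++;
    }
    return i;
}

function compressPrefixes( words ) {
    var sorted = words.slice( 0 ).sort(),
        out = [ sorted[ 0 ] ],
        prev = sorted[ 0 ];

    for ( var i = 1; i < sorted.length; i++ ) {
        // Emit the shared-prefix count, then the unshared suffix
        var n = commonPrefixLength( prev, sorted[ i ] );
        out.push( n + sorted[ i ].slice( n ) );
        prev = sorted[ i ];
    }

    return out.join( "" );
}

// compressPrefixes( [ "span", "spangle", "spar", "sparse", "sparsely", "toast" ] )
// => "span4gle3r4se6ly0toast"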
David Pio (March 16, 2011 at 6:26 pm)
@Thomas in regards to a static, balanced tree for binary search purposes, I’d say it’s still not as good as a trie for this specific case of looking up words in a dictionary.
The reason being that the computational complexity for a binary search is O(log N) where N is the # of words in the dictionary
A trie search’s complexity is O(M) where M is the # of letters in the word
In this scenario the length of a word is typically MUCH smaller than the number of words in a dictionary
Mike Koss (March 16, 2011 at 8:44 pm)
A fun enough challenge to encourage me to write a Trie solution in JavaScript:
http://lookups.pageforest.com
You can use this client application to convert any text dictionary into a Trie, and then do lookups against it. Here, for example, is the generated Trie from the words in this message(!)
{
"t": {
"o": 1,
"rie": 1,
"ext": 1,
"h": {
"is": 1,
"e": {
"n": 1,
"": 1
}
}
},
"solution": 1,
"javascript": 1,
"lookups": 1,
"pageforest": 1,
"c": {
"hallenge": 1,
"an": 1,
"lient": 1,
"o": {
"m": 1,
"nvert": 1
}
},
"you": 1,
"use": 1,
"a": {
"": 1,
"pplication": 1,
"n": {
"y": 1,
"d": 1
},
"gainst": 1
},
"d": {
"ictionary": 1,
"o": 1
},
"i": {
"n": {
"": 1,
"to": 1
},
"t": 1,
"s": 1
},
"h": {
"ttp": 1,
"ere": 1
},
"f": {
"un": 1,
"or": 1,
"rom": 1
},
"e": {
"n": {
"ough": 1,
"courage": 1
},
"xample": 1
},
"generated": 1,
"w": {
"rite": 1,
"ords": 1
},
"me": {
"": 1,
"ssage": 1
}
}
And the Trie can be served into your code via a saved JSONP file:
http://lookups.pageforest.com/docs/blog-sample/trie?callback=loadTrie
I think the trie can itself be compacted into a pure string format, yielding a file much compressed compared with the size of the dictionary (with all the JSON overhead, it’s coming in at about 20% higher for the full dictionary I found online).
gradbot (March 16, 2011 at 9:52 pm)
I just finished working on a Boggle solver. It uses a data structure similar to a trie but with only one character per level since the algorithm needs to step down one letter at a time.
http://gradbot.blogspot.com/2011/03/solving-boggle.html
Randall (March 17, 2011 at 1:44 am)
Here’s binary search with Mathias and Christian’s ideas:
Make the dictionary with this command line:
sort dict.txt | perl -ne '
chomp;
$a{length $_} .= $_;
END {
print "loadDict([\n";
print $_ ? qq| "$_",\n| : " undefined,\n" for @$a;
print"])";
}
' > dict.jsonp
Do lookups like so:
<script>
var dict;
function loadDict(d) { dict=d; }
function exists(word) {
var l = word.length;
var words = dict[l].length/l;
var low = 0, high = words-1, mid = Math.floor(words/2);
while (high>=low) {
var found = dict[l].substr(l*mid, l);
if ( word == found )
return true;
if ( word < found )
high = mid-1;
else
low = mid+1;
mid = Math.floor((low+high)/2);
};
return false;
}
</script>
<script src="dict.jsonp"></script>
Randall (March 17, 2011 at 2:14 am)
That had bugs! Pasted in a broken dict maker, and exists(‘i’) crashed because there were no 1-letter words in the dict. Take 2:
Dictionary:
sort dict.txt | perl -e '
while(<>) {
chomp;
$a[length $_] .= $_;
}
print "loadDict([\n";
print $_ ? qq| "$_",\n| : " undefined,\n" for @a;
print " undefined\n])";
' > dict.jsonp
Lookups:
<script>
var dict;
function loadDict(d) { dict=d; }
function exists(word) {
var l = word.length;
if ( !dict[l] ) return false;
var words = dict[l].length/l;
var low = 0, high = words-1, mid = Math.floor(words/2);
while (high>=low) {
var found = dict[l].substr(l*mid, l);
if ( word == found )
return true;
if ( word < found )
high = mid-1;
else
low = mid+1;
mid = Math.floor((low+high)/2);
};
return false;
}
</script>
<script src="dict.jsonp"></script>
Can def be done better. Back to other things! :)
Emmanuel Briot (March 17, 2011 at 4:44 am)
@Stephen Jones: starting from the first word list in the Google query linked by John, and using your suggested common-prefix compression, I have the following results:
# Wordlist.txt (80612 words) from
# http://www.calvin.edu/~rpruim/scrabble/ospd3.txt
# one word per line: 625324 bytes
# with gzip: 238963 bytes
# sort and compress common prefixes: 234369 bytes (37.4%)
# encode common suffixes: 226331 bytes
# with gzip -7 (the default): 99843 bytes
So we do achieve much better compression rates than a simple gzip, while keeping the string very easy to decode.
import os

def compress():
    dict = open('wordlist.txt')
    words = dict.readlines()
    words.sort()
    prev = words[0].strip()
    all_suffixes = (("ing", "!"),   # 4869 words
                    ("ment", "*"),  # 136 words
                    ("ly", "@"),    # 2064 words
                    ("ion", "$"),   # 498 words
                    ("age", "%"),   # 675 words
                    ("ted", "^"),   # 1019 words
                    ("ous", "("),   # 535 words
                    ("ity", "&"))   # 256 words
    compressed = [prev]
    for w in words[1:]:
        w = w.strip()
        for suff, repl in all_suffixes:
            w = w.replace(suff, repl)
        prefix = os.path.commonprefix([prev, w])
        if len(prefix) > 9:  # never occurs on this list anyway
            prefix = prefix[:9]
        compressed.append("%d%s" % (len(prefix), w[len(prefix):]))
        prev = w
    return compressed
Angus Turnbull (March 17, 2011 at 4:46 am)
Interesting post! On another topic, a different approach to JSONP would be to save your data-structure-du-jour as a set of pixel values in a PNG (using its inbuilt compression, and perhaps some tweaking to use 5/6/7-bit values to represent your letters).
Load this up on the client’s CANVAS and extract the pixel values, optionally saving the result to localStorage. The disadvantage would be no cross-domain/CDN access without perhaps remapping the DNS of a subdomain.
Mike Koss (March 17, 2011 at 8:45 am)
An update on my Trie implementation. I added a packed format for the node representation. The same example I use above now becomes:
t1solution,javascript,lookups,pageforest,c4you,use,a6d8i9h11f12e13generated,w15me16
o,rie,ext,h1
is,e1
!n
hallenge,an,lient,o1
m,nvert
!pplication,n1gainst
y,d
ictionary,o
n1t,s
!to
ttp,ere
un,or,rom
n1xample
ough,courage
rite,ords
!ssage
The format works as follows:
1. Each Trie node is represented as one line of the file.
2. If the node is “terminal” (i.e., the string leading to that node is a word in the dictionary), then the line begins with a ‘!’ character.
3. The rest of the line is a list of character strings.
4. Strings that are “terminal” are comma-separated.
5. Strings that reference other nodes are appended with the numeric LINE OFFSET to the node that they reference.
Using this encoding format (and without suffix sharing), the ospd3.txt file compresses from 625,324 bytes down to 299,924 bytes.
This format is also very efficient to use “as-is” in a trie traversal (it does not have to be reconstituted into a collection of JavaScript objects). After loading, I would split the lines, and then the relative node references can be traversed in constant time.
Stephen Jones (March 17, 2011 at 9:04 am)
@Emmanuel Briot:
Great follow-through! Your data and analysis is great.
I know these compression tricks could go on forever. But, I’ll pull a Steve Jobs here and say, “Just one more thing…”
There’s another suffix compression trick that *nix spell checkers use: Encode all possible suffixes for a word as a suffix pattern.
Say you have a run of words like:
flex, flexed, flexer, flexes, flexing, flexible, flexibly, flexibility
You’d compress all of these down to: flex:FF
Where FF is an 8-bit bitmap of possible suffixes for the base word:
1=none,2=ed,4=er,8=es,16=ing,32=ible,64=ibly,128=ibility
(You could extend this to 12, 16 or 32 bit bitmaps for more suffixes)
“flex:FF” is 7 characters vs. 29 for “flex4ed4er4es4ing5ble7y6ility” (75% compression) or slightly less vs. your suffix codes.
Since a lot of verbs (and some nouns) follow these kinds of patterns, you’d save a lot that way.
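In JavaScript, expanding such an entry back into words might look like this (the mapping and names are mine; result order isn’t guaranteed):

// Bit values for each possible suffix of a base word
var SUFFIX_BITS = {
    "": 1, "ed": 2, "er": 4, "es": 8,
    "ing": 16, "ible": 32, "ibly": 64, "ibility": 128
};

// Expand a "flex:FF"-style entry back into full words
function expand( base, bits ) {
    var words = [];
    for ( var suffix in SUFFIX_BITS ) {
        if ( bits & SUFFIX_BITS[ suffix ] ) {
            words.push( base + suffix );
        }
    }
    return words;
}

// expand( "flex", 0xFF )
// => all eight words, from "flex" through "flexibility"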
OK, no more tweaks for me.
John Resig (March 17, 2011 at 9:51 am)
Hey everyone – I’ve posted a follow-up post here:
http://ejohn.org/blog/javascript-trie-performance-analysis/
Brock Wilcox (awwaiid) (March 18, 2011 at 7:59 am)
It’s very much a side-note, but I updated the perl example to use a more modern PSGI setup, http://github.com/awwaiid/dictserver
Very basic benchmarking on my machine says that the original Net::Server::HTTP version got around 350 reqs/sec, and that the PSGI version gets 2500 reqs/sec or more (with both starman and twiggy). Maybe I’ll do a blog entry with some graphs.
I’d LOVE to see the Node.js equivalent! Node.js uses libev, which can also be used by the PSGI/twiggy/AnyEvent stack, so in theory they should be pretty comparable technologies. I might also do a Mojolicious version, all in the name of self-education on the various technologies and techniques.
Stephen (March 18, 2011 at 10:13 am)
Hi John, what about making the entire word list a tree, branching on each letter? Each branch can be a sub-object. This may save some space in the JSON representation, and should be quite fast in searches. I’m not sure whether this tree search will be faster than the native implementation of indexOf, but the tree search only has to take n steps (n = length of word), while indexOf has to go through the entire word file.
Sylvain (March 18, 2011 at 12:46 pm)
I have a dictionary application in JavaScript at artimap.com and I am open-sourcing the refactored code at ogbuzz.com. I found that in a real dict app you check for way more than the existence of a word; you also want to make suggestions to the user, and for that the use of an unsorted array is ideal, because you want to search through ALL the words in the dictionary anyway. @Thomas Some operations could be done faster if the array was sorted, though. @John JavaScript dictionaries have wide usage possibilities as part of web-site search engines and spreadsheet analyzers.
David Wilson (March 19, 2011 at 5:49 pm)
Hi John,
Last year I worked on an application with very similar requirements to yours. I chose the “trie-like” structure described by Mike Koss, with “$” as the key to indicate end of string.
I also discovered I could lop a *huge* chunk off the download size by arranging to precompress the gzipped contents using 7zip – lopping 11% off gzip -9’s output, from 227kb to 201kb.
The htaccess file for that, along with various other bits, is below:
https://bitbucket.org/dmw/translink-timetables-pure-js/
Alex Wilson (March 20, 2011 at 8:24 pm)
I often set values to true in a table; I find it a good replacement where some iterations would otherwise be used.