John Resig - Injecting Word Breaks with JavaScript

Injecting Word Breaks with JavaScript

Recently Eduardo Lundgren pinged me wondering if I had an alternate solution to injecting wbr tags inside a long word.

The wbr tag tells the browser where a possible line break can be inserted, should the need arise. (Opera has some problems with rendering them correctly, but it can be rectified using some CSS.) By adding wbr tags into words at strategic locations you can allow a content area to resize gracefully while still being readable.

I looked at his simplified solutions for a moment and came up with this solution:

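function wbr(str, num) {
  // After every run of `num` word characters that is followed by another
  // word character, insert a <wbr> tag between the two.
  return str.replace(RegExp("(\\w{" + num + "})(\\w)", "g"), function(all, text, char){
    return text + "<wbr>" + char;
  });
}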

You would use it like so:

wbr("Hello everyone how are you doing? " +
  "I'm writing an extravagently long string.", 6);

"Hello everyo<wbr>ne how are you doing? I'm writin<wbr>g an extrav<wbr>agently long string."

Now, this is an incredibly simple solution, and breaks like "writin<wbr>g" are quite undesirable. After I wrote the above I did some more digging and read about various hyphenation algorithms that exist.

Looking in the above article I found a recent JavaScript library which provides a full solution (breaking in appropriate places for multiple languages). Of course, the resulting code checks in at about 80kb (15kb base library + 65kb English word library) so you’ll need to strongly consider if that solution is appropriate for your situation.

Posted: May 24th, 2008

12 Comments

DrSlump (May 24, 2008 at 3:04 pm)

Instead of <wbr>, I’ve lately been using the zero-width space Unicode character (U+200B), which in my tests is supported in all modern browsers, plus IE6 with the kanji charset installed.
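A minimal sketch of that alternative, assuming the same pattern as the post's wbr function. The name zwsp is hypothetical, and "\u200B" is the JavaScript escape for U+200B:

```javascript
// Hypothetical variant of wbr(): same pattern, but the break opportunity is
// a zero-width space (U+200B) rather than a <wbr> tag.
function zwsp(str, num) {
  var re = new RegExp("(\\w{" + num + "})(\\w)", "g");
  return str.replace(re, "$1\u200B$2");
}

var broken = zwsp("everyone", 6);
// broken === "everyo\u200Bne" (nine characters, one of them invisible)
```

Unlike a tag, the inserted character becomes part of the text itself, so it travels along when the string is copied or stored.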

Another issue is the use of a dynamically generated regular expression on each call. In my tests (a couple of years ago), compiling regexps was quite slow across browsers, so if speed is a concern, the regular expressions could be pregenerated or reused across calls.

By the way, the function can be made shorter and perform the wbr injection at the exact number of chars with the following:
str.replace(new RegExp("(\\w{" + num + "})(?=\\w)", "g"), "$1<wbr>");

John Resig (May 24, 2008 at 3:12 pm)

@DrSlump: Above I linked to the Quirksmode page for the wbr tag where he outlines the different solutions. Could you verify that it works correctly in IE6?

As far as compilation is concerned it’s only occurring once per call (trivial) and unless you’re caching it for every single word width that may be called, I’m not sure what the benefit would be. Of course, if you’re only using one width then you might as well just write that regexp straight out and skip the function.

I like your revision to the solution – nice and simple.

DrSlump (May 24, 2008 at 3:39 pm)

@john

It works inconsistently in IE6. On some installs it works; on others it shows a ‘square’ character (code point not implemented in the font). I’ve found that it depends on the fonts installed: installing the Microsoft package for Asian languages solves the issue. A friend even told me that installing Office 2007 fixed the ‘square’ chars too, but I can’t confirm.
Anyway, my case is special because I stopped supporting IE6 about a year ago. I make sites work in it, but if they don’t render exactly as they are supposed to I don’t waste much time on it, and to my surprise clients seem to understand (development time is money, after all).

My concern with the regexp compilation is that this is the kind of function you might end up using in a loop, so if compilation occurs on each call, the time will add up quickly. A way I’ve worked around this in the past is to create a dictionary/hash object to cache the dynamically built regexps, although they were quite a bit more complex than this one. Something like this:

var regexpCache = {};
function wbr(str, num) {
  // Compile the regexp for this width only once, then reuse it.
  if ( !regexpCache[num] ) {
    regexpCache[num] = RegExp("(\\w{" + num + "})(?=\\w)", "g");
  }
  return str.replace( regexpCache[num], "$1<wbr>" );
}

Another issue I found with injecting word breaks which I didn’t mention before is that some ‘non-word’ characters are not considered as breakable by the browsers. I had problems with dots for example. In the end I solved it the easy/dirty way by replacing the \w with [^\s-].
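A rough sketch of that substitution, applied to the lookahead version of the pattern from my earlier comment. The name wbrAny and the example.com URL are only for illustration:

```javascript
// Variant of wbr() that uses the character class [^\s-] in place of \w,
// so runs containing dots, slashes and similar characters are split too.
function wbrAny(str, num) {
  var re = new RegExp("([^\\s-]{" + num + "})(?=[^\\s-])", "g");
  return str.replace(re, "$1<wbr>");
}

var out = wbrAny("www.example.com/path", 6);
// out === "www.ex<wbr>ample.<wbr>com/pa<wbr>th"
```

With \w, the same input would only break inside "example", because the dots and the slash end each run of word characters.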
Sebastian Redl (May 24, 2008 at 4:48 pm)

I think anything but correct hyphenation is more an annoyance than a help. So if you really need hyphenation (which should only be the case if you have very narrow columns), I think the 80k are worth it. I also think that this can and should be handled by a server-side script instead of JavaScript.

On a nitpicky side note, it’s “extravagantly”.

Anyway, if you’re willing to accept breaks like “extrav-agantly”, then the “writin-g” problem is easily solved:

/(\w{6})(?=\w{2})/g

This won’t try to break words that are just one character longer than the hyphenation limit.
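As a sketch, here is that pattern dropped into a wbr-style function. The name wbrMin2 is hypothetical, and the "$1<wbr>" replacement is assumed, since only the pattern is given above:

```javascript
// Only break after `num` word characters when at least two more word
// characters follow, so a single trailing letter is never orphaned.
function wbrMin2(str, num) {
  var re = new RegExp("(\\w{" + num + "})(?=\\w{2})", "g");
  return str.replace(re, "$1<wbr>");
}

var out = wbrMin2("I'm writing an extravagently long string.", 6);
// out === "I'm writing an extrav<wbr>agently long string."
```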
Sander Aarts (May 24, 2008 at 6:33 pm)

I agree with Sebastian that incorrect hyphenation is annoying. But if you really want to use a client-side script like yours, why not insert &shy; (shy hyphen) instead of the non-standard <wbr>? That character at least adds a hyphen when it breaks the line. For Fx2, U+200B (zero-width space) could be used.

Sander Aarts (May 24, 2008 at 6:35 pm)

Hmmm, that was to be expected of course.

I’ll try again:
(shy hyphen)
(zero-width space)

Sander Aarts (May 24, 2008 at 6:37 pm)

That’s really weird, I really used & on both occasions this time.

Rebecca Murphey (May 25, 2008 at 12:44 am)

We’ve struggled with this problem at length, with users entering extremely long strings that would then break our fixed-width layout, especially in IE6.

I had worked out a pleasantly small bit of js that used jQuery to basically count the approximate number of characters that would fit on a line in a given container, and then break any words that had more than that many characters (while managing not to break any long strings inside HTML tags, such as href attributes). This seemed ideal, because it meant we didn’t need to make assumptions about how soon to hyphenate.
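The tag-skipping part of that idea might look roughly like the sketch below. It is not the original script: the name wbrOutsideTags is hypothetical, the width is passed in rather than measured from the container with jQuery, and the tag-matching regexp is deliberately naive (it assumes no ">" inside attribute values).

```javascript
// Insert <wbr> into text, but leave anything inside HTML tags untouched.
function wbrOutsideTags(html, num) {
  var re = new RegExp("(\\w{" + num + "})(?=\\w)", "g");
  // Splitting on a capturing group keeps the tags in the result array:
  // even indexes hold text, odd indexes hold the captured tags.
  var parts = html.split(/(<[^>]*>)/);
  return parts.map(function (part, i) {
    return i % 2 === 1 ? part : part.replace(re, "$1<wbr>");
  }).join("");
}

var out = wbrOutsideTags(
  '<a href="http://example.com/averyveryverylongpath">averyveryverylonglabel</a>', 8);
// out === '<a href="http://example.com/averyveryverylongpath">averyver<wbr>yverylon<wbr>glabel</a>'
```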

On pages with a small number of elements, it worked like a charm, but as the DOM grew, even with regex caching, the time the thing took to run was brutal. We’ve shelved the approach for now and are, unfortunately, just using overflow:hidden to deal with the problem while trying to develop some methods for dealing with long strings at input time.

On another note, I was glad to see that FF3 is adding support for soft hyphens, because FF’s lack of support for them made that a non-viable solution for us. Hopefully it’s a problem we can revisit soon.

Glenn (May 26, 2008 at 7:56 am)

The 65kb English word library contains 18kb of quote characters.

Phil H (May 26, 2008 at 12:12 pm)

1. Fixed layouts
Fixed layouts with pixel values are generally not the best idea. Just divide by 16, swap px for em and see what happens. Font resizing just zooms the whole page, no hyphenation necessary.

2. Long words
Given that the best line length for legibility is somewhere in the 20-40em range, very few normal words will break your layout. Perhaps just process the text when you are about to insert it into the database: use a proper hyphenation script to insert zero-width spaces or wbrs or whatever. You can always replace them all later. This way each chunk of text gets hyphenated properly once, instead of badly repeatedly.

3. URLs
Unless you maintain a site dedicated to long strings of text, livingwithspacebarphobia or organic chemistry, URLs are the vast majority of long strings. Why treat them differently? Because they are not expected to be shown in their raw form; most people would prefer the long URL to be hidden behind the usual linktext.

Options:
1. Replace the linktext with a shortened form – keep some of the beginning (so we can see the domain) and perhaps the text between the last slash and the following dot (the page name), to make something like ‘slashdot.org/…/index…’.
2. Steal the title text of the linked page – in processing the form data, follow each link and get hold of the page title, and use that (or, again, a shortened form) as the link text.
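
Option 1 could be sketched along these lines. The helper name shortenLinkText is made up for illustration, and the rule is as naive as described: it assumes the page name sits between the last slash and the next dot, which many URLs will not satisfy.

```javascript
// Hypothetical helper: shorten a URL for use as link text, keeping the host
// and the "page name" (text between the last slash and the following dot).
function shortenLinkText(url) {
  var rest = url.replace(/^[a-z]+:\/\//i, ""); // drop the scheme
  var firstSlash = rest.indexOf("/");
  var lastSlash = rest.lastIndexOf("/");
  if (firstSlash === -1 || firstSlash === lastSlash) {
    return rest; // no path, or a single path segment: nothing to elide
  }
  var host = rest.slice(0, firstSlash);
  var page = rest.slice(lastSlash + 1).split(".")[0];
  return host + "/\u2026/" + page + "\u2026";
}

var text = shortenLinkText("http://slashdot.org/articles/2008/index.html");
// text === "slashdot.org/\u2026/index\u2026"
```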

The overriding principle? Sanitize user input. Even when all they are doing is typing in words and pasting in URLs, users can cause unexpected problems. And do the sanitizing on input, not as a reformatting exercise on output – the input processing happens once and the result is served up many times.

Steve S (June 4, 2008 at 6:32 pm)

@Phil H:

I don’t think you really want to insert zero-width space Unicode characters, or other hyphenation markers, into a database that you will perform queries on.

Also, there is no clever way of shortening URLs that would break your layout while maintaining some of their semantics. Look at the URL of this page: you can’t assume last slashes or dots.

Finally, fetching the page title introduces a lot of new problems: the page, or one of the five or ten you’re fetching, may load slowly, and its title may be dynamically generated, or change over time.

Generally, I like the wbr function as a quick fix, but proper hyphenation seems to belong on the server, not on the client. The idea of a large number of users wasting bandwidth and CPU time to all hyphenate the same text seems somewhat misguided to me.

Mathias (June 9, 2008 at 4:52 am)

Hi

I’m the author of Hyphenator.js (http://code.google.com/p/hyphenator/) and I’d like to answer some points that were discussed here.

Hyphenation isn’t as much of a problem in English as it is in other languages like German (which is my first language, please excuse shortcomings). In German, words are longer on average, because we often build long compound words.

Hyphenator.js is more a “Proof of Concept” and a way for me to learn about JavaScript than a deploy-phase library, although I use it on my website. Personally, I hope that it will be obsolete when browsers support CSS3 hyphenate (http://www.w3.org/TR/css3-text/#hyphenate).

The use of the soft hyphen for word-breaking and the zero-width space for URL-breaking is a decision that I made with regard to the HTML standard (http://www.w3.org/TR/html401/struct/text.html#h-9.3.3), which has no <wbr> tag. Using soft hyphens has no negative effect on FF<3 and works well in Gecko 1.9.

The size of the pattern data is an issue (in German the file is even bigger!), but take into account that it may be cached by the browser.
I’m also looking for better solutions, but a packed structure also means longer execution.

Hyphenation could be done easily and at lower cost on the server side, true. But I personally think that hyphenation belongs to the layout, which belongs to the browser. Furthermore, I suspect that search engines do not properly handle texts that are served hyphenated.

I was very happy to find a link to my project here!
Regards,
Mathias Nater
