Recently I internationalized a Node/Express web application that I’ve been working on and it seems to have gone fairly well (users in multiple languages are using it happily and I’m seeing a marked increase in traffic because of it!). Not much of what I’m writing up here is particular to Node, per se, just a general strategy for internationalizing a web application.
I’ve used enough internationalized web sites, and travelled to enough foreign countries attempting to use English-language sites from back in the US, that I knew what kind of features I wanted:
- Full parity between languages. Wherever possible the same content should be available to everyone.
- Use sub-domains to contain different language versions. It’s overkill/expensive to use different TLDs and it’s annoying to have to twiddle query strings or paths to match a language.
- No automatic translations of content to the user’s native language. There is nothing worse than arriving at a site and being forced into a version that you can’t read, either because it’s poorly translated, or you’re being GeoIP detected, or you’re on a computer whose language settings don’t match your own. If you visit a URL it should always be in the same language.
- No automatic redirects to a site with the user’s native language. Same as the last one. If I visit “foo.com” I should not be automatically routed to “es.foo.com” because it thinks that I speak Spanish. Instead, give the user a notification in their native language and allow them to visit the page themselves.
The end result would be a simple URL system that works like so:
domain.com – Main site (English)
ja.domain.com – Japanese site
XX.domain.com – Other languages
Due to the full translation parity and the lack of URL modification, for every page you visit you can view the same page in another language just by changing the sub-domain.
For example:
domain.com/search?q=mountain
ja.domain.com/search?q=mountain
Both work identically; the second is simply presented in Japanese.
This has the benefit that in the header of the page I can link the user to the same exact page but in their native language.
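As a minimal sketch of that header link (the domain and the list of language prefixes are placeholders, not the actual site’s), the alternate URL can be built by swapping the sub-domain on the current URL:

```javascript
// Sketch: link to the same page in another language by swapping the
// sub-domain. The language prefixes here are illustrative placeholders.
function alternateURL(currentURL, lang) {
    var url = new URL(currentURL);
    // Strip any existing language prefix (e.g. "ja.") from the hostname.
    var host = url.hostname.replace(/^(?:ja|es|fr)\./, "");
    // The main (English) site lives on the bare domain.
    url.hostname = lang === "en" ? host : lang + "." + host;
    return url.toString();
}

alternateURL("http://domain.com/search?q=mountain", "ja");
// → "http://ja.domain.com/search?q=mountain"
```

Because the path and query string are untouched, the link always lands on the exact same page in the other language.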
Additionally I can use the rel="alternate" hreflang="x" technique to help Google understand the structure of my site better. I can put this in the header of my page and Google will show the language-preferred version of the site in the Google results.
<link rel="alternate" hreflang="ja" href="http://ja.domain.com/" />
Server
Encouraging users to find the correct content is a key implementation detail, considering that the content is not translated automatically nor is the user redirected to content in their native language. While the links in the header are a good start I also wanted to show a message at the top of the page encouraging the user to view the content (with the message being written in their native language).
As it turns out this can be particularly tricky to implement. The easiest way to do it is simply to check the user’s request headers, look at what’s listed in their Accept-Language header, and display the message based upon that. This really only works if your content is always dynamic and is never being cached.
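For the dynamic case, a minimal Accept-Language check might look like the following sketch (the supported-language list and fallback are assumptions for this example, not the post’s actual code):

```javascript
// Pick the best-matching supported language from an Accept-Language
// header such as "ja,en-US;q=0.8,en;q=0.6". Falls back to supported[0].
function preferredLanguage(header, supported) {
    if (!header) return supported[0];
    // Parse each entry into a { tag, quality } pair and sort by quality.
    var candidates = header.split(",").map(function(part) {
        var pieces = part.trim().split(";q=");
        return { tag: pieces[0].toLowerCase(), q: pieces[1] ? parseFloat(pieces[1]) : 1 };
    }).sort(function(a, b) { return b.q - a.q; });

    for (var i = 0; i < candidates.length; i++) {
        // Match "ja" against both "ja" and region variants like "ja-JP".
        var base = candidates[i].tag.split("-")[0];
        if (supported.indexOf(base) !== -1) return base;
    }
    return supported[0];
}

preferredLanguage("ja,en-US;q=0.8,en;q=0.6", ["en", "ja"]); // → "ja"
```

In an Express handler this would read `req.headers["accept-language"]` — which is exactly what breaks down once a cache sits in front of the server.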
If that’s not the case for your application, how and where you do your caching matters.
In my particular application I’m using nginx in front of a proxied collection of Node/Express servers. This means that everything coming from the Node server is cached (including any messages to the user telling them to visit another page).
As a result, in order to display this message to the user we’re going to need to manage the logic for this on the client. Unfortunately this is where we hit another stumbling block: It’s not possible to reliably determine the desired language of the user using just JavaScript/the DOM.
Thus we’ll need to get the server to pass us some extra information on what the user’s desired language is.
To do this I used the nginx AcceptLanguage Module and then set a cookie with the desired language and passed it to the client. This is the relevant nginx configuration to make that happen.
set_from_accept_language $lang en ja;
add_header Set-Cookie lang=$lang;
And now on the client-side all that needs to happen is reading the cookie for the desired language and displaying the redirect message if the current language and the desired language don’t match.
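A sketch of that client-side check (the cookie name matches the nginx config above; the sub-domain scheme matches the URL structure described earlier, and the banner-rendering itself is left out):

```javascript
// Read a value out of a cookie string like "foo=1; lang=ja".
function cookieValue(cookieString, name) {
    var match = cookieString.match(new RegExp("(?:^|;\\s*)" + name + "=([^;]*)"));
    return match ? decodeURIComponent(match[1]) : null;
}

// A two-letter sub-domain selects a language; the bare domain is English.
function siteLanguage(hostname) {
    var match = hostname.match(/^([a-z]{2})\./);
    return match ? match[1] : "en";
}

// In the browser this would run as:
//   var desired = cookieValue(document.cookie, "lang") || "en";
//   var current = siteLanguage(location.hostname);
//   if (desired !== current) { /* show the translated redirect banner */ }
```

Since the cookie is attached by nginx after the cached response is served, the banner logic stays correct no matter how aggressively the pages themselves are cached.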
This gives the best of all worlds: nginx continues to aggressively cache the results from my Node servers and the client displays a message in the user’s native language encouraging them to visit the appropriate sub-domain.
i18n Logic
I’ve written a new Node i18n module which makes the following strategy possible.
None of the i18n logic is particularly out of the ordinary but there are a few strategies I took that helped to simplify things.
- All translations are stored in named JSON files.
- Those files are loaded and used for in-place translation in the application.
- Translations are done using the typical __("Some string.") technique (wherein "Some string." is replaced with the translated string, if it exists; otherwise "Some string." is returned instead).
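The lookup itself can be sketched in a few lines (the dictionary contents here are hypothetical, not the module’s actual file format):

```javascript
// Hypothetical translation dictionary, as it might appear in a ja.json file.
var ja = {
    "My Site Title": "私のサイト",
    "Welcome to:": "ようこそ:"
};

// Minimal __() lookup: return the translation if one exists,
// otherwise fall back to the original string.
function makeTranslator(dict) {
    return function __(str) {
        return dict.hasOwnProperty(str) ? dict[str] : str;
    };
}

var __ = makeTranslator(ja);
__("Welcome to:");     // → "ようこそ:"
__("Missing string."); // → "Missing string." (falls back to the original)
```

The fallback behavior is what makes the workflow safe: an untranslated string degrades to English rather than breaking the page.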
Since all requests are handled by a single set of servers, translation logic cannot be shared – it must be initialized and used on a request-by-request basis. I’ve seen other i18n solutions, like i18n-node, that assume the server will only ever be serving up pages in a single language, and this tends to fail in practice – especially in the shared-state, asynchronous realm of Node. For example: whenever a request comes in and sets the current language, it sets the current language on the shared i18n object – and given the asynchronous nature of Node, other requests may be in flight at the same time, so the displayed language of those requests changes as a result.
You’ll want to make sure that, at minimum, your current language state is stored relative to the current request to avoid this problem. (My new i18n node module fixes this, for example.)
In practice it means that you’ll be adding an i18n property to the request object, likely as a piece of Express middleware, like so:
app.use(function(req, res, next) {
    req.i18n = new i18n(/* options... */);
    next();
});
Workflow
I have it so that the i18n logic behaves differently depending on whether the server is in development mode or in production mode.
When in development mode:
- Translation JSON files are read on every request.
- Translation files are updated automatically any time a new string is detected.
- Warnings and debug messages are shown.
In production mode:
- All JSON translation files are cached the first time they’re read.
- Translation files are never updated dynamically.
- No warnings or debug messages are shown.
The two major differences here are the caching and the auto-updating of the JSON files. When in development it’s quite useful to have the translation files reload on every request, in case a change has been made to their contents; in production they really should be considered static files.
Additionally, the workflow of having the translation files update every time a new string is found is actually quite useful: It helps you to catch strings that you may have forgotten to translate. Naturally doing this in production (frequently hitting the disk) is not a good idea.
Marking Up Strings
Strings that need to be translated can be found in a number of locations: Inside your application source, inside templates, inside JavaScript files, and even (god forbid) inside CSS files.
In the case of my application I had no strings inside my JavaScript or CSS files. This worked out nicely because I had already written my application in such a way that content is never being dynamically constructed from strings inside my client-side JavaScript. If I were to do that I would use a template and put the template directly into the HTML of my page, using something like my JavaScript Micro-Templating solution.
I consider it especially important to avoid having any translatable strings in your JavaScript or CSS files, as those are files you’ll want to heavily cache and likely put onto a CDN. Naturally, you could dynamically replace those strings as part of your build process and generate a number of script/CSS files, one for each language you support, but how much extra work you want to introduce into your build is up to you.
Inside my application I made sure that the only time I ever attempted to translate a string was inside of an Express view handler (meaning that I had access to the request object, which is where I bound my i18n object).
An example of using the i18n object inside a view:
module.exports = {
    index: function(req, res) {
        res.render("index", {
            title: req.i18n.__("My Site Title"),
            desc: req.i18n.__("My Site Description")
        });
    }
};
For my templates I use the confusingly-named swig, but the technique for actually using the i18n methods will be roughly the same for most templating systems:
{% extends "page.swig" %}

{% block content %}
<h1>{{ __("Welcome to:") }} {{ title }}</h1>
<p>{{ desc }}</p>
{% endblock %}
A string is wrapped with a __(...) call and is replaced with the translated string when the template is rendered.
Translation
So far the actual translation process has been relatively simple. It’s just me doing the translation and I’m not out-sourcing anything (at least not yet). Additionally my site is relatively simple, only a couple dozen strings at the moment.
I’ve been able to temporarily “cheat” in a few ways (at least until I hire a real translator):
- Use Open Source Translations. There are already massive Open Source projects out there that have done a lot of hard work in translating their UIs. For example Drupal has all of their translations online in easy-to-download formats. I was able to find a number of strings that I needed by going through their files.
- Look at already-localized sites. This is another cheat but look for other sites that have some of the same features of your site and have already gone through the hard work of localizing into multiple languages, like Google. (In my case I was working on a search engine so a number of Google’s strings directly matched strings on my site.)
- Google Translate. I know, I know – but I was really surprised at how much better Google Translate has gotten as of late, especially for translating single words or concepts. It’s able to tell you the exact meanings for different possible translations, which is really impressive.
Conclusion
I’m only a couple of weeks into having implemented this process on my site, so I’m sure some things are likely to change as I start to scale up more. I’ve already substantially increased traffic to my site, especially as Google has started to index the newly-translated site. As I mentioned before, I have a new Node i18n module that complements the above process and hopefully makes it easier for others to follow, as well.
Nicolae Vartolomei (January 11, 2013 at 5:18 pm)
> For example:
> domain.com/search?q=mountain
> ja.domain.com/search?q=mountain
>
> Both work identically, just the second one is presented in Japanese.
And how do we go with url i18n?..
John Resig (January 11, 2013 at 5:26 pm)
@Nicolae: I don’t translate the URLs, at least for my application it isn’t terribly important. I looked around at a number of similar web sites (including a number of web sites that were in Japanese-only, for example, and even they generally used English inside their URLs). That is something that I could consider at some point but it would be a considerable amount of extra work, for sure.
figital (January 11, 2013 at 5:40 pm)
Is there a way to detect whether the browser knew the URL was:
ja.domain.com/search?q=?????? (mountain I think)
or wound up being converted to (or from):
ja.domain.com/search?q=%E3%82%B5%E3%83%B3%E3%82%BB%E3%83%B3%E3%82%84%E3%81%BE
Not *translating* the URL … but being able to detect that the user saw the string the way they or you thought it should look?
For example : “works in English now but we might hire a translation team next year for a Japanese version but we’d like to know *now* that URLs will function appropriately then?” (both looking good visually in the URL bar AND translating correctly to between browser and server)
David W (January 11, 2013 at 5:40 pm)
What about saving the language pref on the user profile? Facebook comes in Swedish to me wherever I go. Always on http://www.facebook.com
John Resig (January 11, 2013 at 5:42 pm)
@figital: That’s a good question – I don’t know of any way to detect that off-hand, unfortunately. As far as I know the URL always shows up as a URL-encoded string – although if that’s not the case I’d love to know!
@David: In the case of my web site I don’t have any user accounts (at least not yet) so I haven’t encountered that issue as of yet.
figital (January 11, 2013 at 6:02 pm)
I would guess it is detectable, I haven’t had a reason to look into it yet (unfortunately I only speak English and have only had the chance to work on Spanish/Portuguese multilingual sites circa 2001).
In this case I did copy and paste some Unicode from a URL in Chrome into Windows Notepad and then into the text area before submitting my previous post. Between hitting submit and this page redisplaying the comment, it is now “??????”.
HTTP GET/POST must understand, then the server, then the middleware, then the database. Whoa! (right-to-left Arabic/Hebrew must be awkward in a query string) Wikipedia seems like a great place to fiddle/test/sniff. Good luck! (sorry … didn’t mean to massively feature-creep you!)
Johann (January 11, 2013 at 6:08 pm)
For translations, I can recommend proz.com. Just use a new email for this since IMHO it’s impossible to turn off all proz marketing and some translators will share your email with their colleagues, too.
Alex Sexton (January 11, 2013 at 6:33 pm)
Whatchya using for gender and pluralization of the strings?
John Resig (January 11, 2013 at 6:57 pm)
@figital: I don’t think that applies, I think the non-ASCII content is encoded before the request is sent to the server (and before the browser URL updates) thus it’ll still show as a URL-encoded string. Not sure if there’s another way to represent it, perhaps it could be converted and redirected on the server.
@Johann: Awesome suggestion, thank you!
@Alex: I’m very familiar with your work (on both Jed and messageformat) – I hope to have a need for them some day! In my personal application I have no need for gender (there are no user accounts or socialization) and only very limited pluralization requirements (which can be mitigated with just basic one/other string replacement). Perhaps as the application grows something like that might be useful. We’re looking at possibly using your work for some upcoming i18n work at Khan Academy, we’ll see how that progresses!
Adam (January 11, 2013 at 7:33 pm)
This is a really great write up. I’m working on a node.js module for the Gengo translation API in the coming month and finding a work flow that fits with i18n modules like @Alex’s Jed is a key aspect of that. A good localization strategy should allow the translation process to be as effortless as managing your native language copy once things are set up.
I’ll keep my eyes peeled for your release next week.
Ben Atkin (January 11, 2013 at 8:32 pm)
Aren’t most resources still conceptually the same when they’re presented in different languages? Why shouldn’t they have the same URL? I agree that being forced to read something in a different language is a problem, but how about having a selector that uses a cookie to store the current language? This would stop the proliferation of URLs for a single resource, and enable people who are following a lot of deep links to your site to read it in the language of their choice.
Ben Atkin (January 11, 2013 at 8:34 pm)
That said, I think your choice of using a subdomain rather than the pathinfo or a query string is a good one.
Rob Miller (January 11, 2013 at 8:45 pm)
I’m working on my first i18n for a PHP/JS web app, and think I have come up with a way to save my volunteer translators some time. I’m scraping all of my text nodes into a Google Spreadsheet. In the adjacent columns I’m using =GoogleTranslate() to take a first pass at the translation, then inviting native speakers into the spreadsheet to fix it, instead of asking them to start from scratch. Import the table via phpMyAdmin, then jsonify via PHP calls.
John Resig (January 11, 2013 at 8:55 pm)
@Ben: There is a huge advantage to having separate pages in distinct languages, namely for the benefit of search engines. In my initial, limited testing I’ve been getting substantially increased traffic from the new language that I’m targeting (thanks to generating new sitemaps for the new subdomains).
There is also an advantage to being able to willingly visit a single page in another language. The site that I’m building is a reference catalog and it can be useful for a user to view a page in another language while browsing through pages in their native language. While I agree that in most cases users will interact with the site in an all-or-nothing way being able to have full control over that experience is something that I want to grant to the user.
@Rob Miller: That’s a very interesting idea, I like that – thank you for passing it along!
Jacky (January 11, 2013 at 10:53 pm)
I don’t think using subdomain is a good idea unless what you presented to the user is related to the location (i.e. Japan) instead of just language.
For example, a user in Japan might be interested in knowing content related to Japan but want to display the interface as English.
Dan Wolff (January 12, 2013 at 3:04 am)
You make a lot of assumptions in your extremely simplistic {{ __("Welcome to:") }} {{ title }}. I’m going to set title=Facebook for this example.
First of all, it would make for more natural language if the result were “Welcome to Facebook” (without the colon), so you’d have {{ __("Welcome to") }} {{ title }}. But this assumes that all languages have the same structure, which is certainly not true. E.g. in Finnish it should be “Tervetuloa Facebookiin” (the “to” becomes the “-iin” case ending), which is certainly impossible with your much too simple system.
Also remember that there are synonyms in English which will not be synonyms in every other language; where one construct is used in English, multiple may be needed for other languages. What is plural in English might not be in other languages, or there are multiple plural forms for different numbers.
You might end up with __("Welcome to {{title}}", {title: __("site-name")}), where the Finnish message would be Tervetuloa {{GRAMMAR:illative|{{title}}}}, where GRAMMAR would be a function making its best guess at the given form (illative) based on the title.
You should check out how MediaWiki translates its messages – that’s the gold standard for me. It supports different messages for variations of plural, grammar and gender. See http://translatewiki.net/
Ryan Petrich (January 12, 2013 at 5:33 am)
Instead of setting a cookie that can then be read out via JavaScript, why not include the redirect message directly in the HTML and set Vary: Accept-Language? Does nginx not honor Vary headers that an origin server returns?
Florian (January 12, 2013 at 5:36 am)
Nice article. I recently wrote my own library for translating web apps on the client-side, so this was an interesting read. In case you want to take a look: https://github.com/js-coder/x18n
David (January 12, 2013 at 8:15 am)
John,
Using country flags for languages is something wrong. We all know about that.
Do we need language organizations around the world to pick their “language icons”, like countries choose their flags, or like central banks choose their currency characters ?
I think you are influential enough to initiate that. :)
John Resig (January 12, 2013 at 11:43 am)
@Jacky: I agree, if it was location-centric I’d probably just buy a new country-specific domain name.
@Dan: Sorry for the poor demo snippet — I actually never do that in my particular application, was just trying to come up with something contrived for people to see. Great point about the Mediawiki translation project!
@Ryan: I’m not sure if that particular solution would change anything as the redirect message would still have to be hidden (or shown) somehow – and that would most likely still have to happen on the client (hence the need for the cookie).
@Florian: I actually really like the API that you’ve designed — great work!
@David: A couple quick points: 1) If anyone ever comes up with a location-agnostic language icon for English, I’ll implement it in a heartbeat. Using the Japanese flag for Japanese seems very safe. 2) Based upon the traffic to my site so far, all of my English-speaking audience is from the US. 3) The only time you’d be seeing the “US” flag (which does represent en-US, in my case) is when you’re on the Japanese version of the site.
Although attempting to get into a massive standardization battle over language icons sounds like my version of hell, haha!
Ruben Verborgh (January 12, 2013 at 1:00 pm)
While this approach is pragmatic, the idea that a URL points to a language-specific version is different from the notion of URLs in the HTTP spec. A URL points to a resource, and a client can do content negotiation on this resource to obtain a representation that is optimal for the browser (e.g., HTML for graphical browsers, JSON for Web apps) and the user. An aspect thereof is user language, so the same URI can resolve to an English or Japanese version.
Is this only nice in theory? No, and here’s a simple example: what happens if a user “likes” something (or tweets about it, or…)? While en.domain.com/jquery and ja.domain.com/jquery are actually the same resource, they have different URLs, so likes from speakers of different languages will not add up together.
The downside of pragmatism is that we end up with a situation where people just propose different solutions that work for them (and some of them – like yours – are good), instead of adhering to a global standard, and build interfaces around that. The “a computer that’s not your own” is not a common scenario anymore today, when most people own more than one device that can browse the Web. If Web builders stick to the standards, then those browsers can help the user show the content (s)he wants. If everybody makes different systems that work for them, browsers can only work sub-optimally.
Ariel Flesler (January 12, 2013 at 1:43 pm)
Great article John!
Stefan Hayden (January 13, 2013 at 3:05 pm)
On the question of automatic translations and automatic redirects I take a slightly different tack. I make those assumptions that you say not to make. But only once. So if you go to whatever.com/login I might redirect you to whatever.com/es/login (based on browser language) but at that point you’re free to switch back.
I really think these assumptions are good to make as most times they will be correct. But as long as you can switch out to whatever you choose then you are never stuck.
Also, as to language icons, I would just not use them. It seems like the best answer is to just use the name of the language in its native form. There is no need to show all the language options translated into Spanish. Just list them: DEUTSCH ESPAÑOL.
DjebbZ (January 14, 2013 at 5:54 am)
Did you check the i18next-node module? https://github.com/jamuhl/i18next-node/
Boris (January 14, 2013 at 10:34 am)
It’s never a good idea to use flags for languages.
It causes more trouble than it’s worth. All you need is text across the top, written in that language: “English | Espanol | etc”. Native speakers will be drawn to it immediately. You then don’t clutter your JP site with too much English text and flags, and you don’t upset Brits / Aussies by using US flag.
Jonathan (January 15, 2013 at 6:27 am)
@Boris: I disagree. Flags are more catchy and attract immediate attention.
Btw, John’s blog has been mentioned in the list of programming and software blogs here-> http://www.talkora.com/technology/List-of-programming-and-software-development-blogs_105 (look for entry #20 in the list)
Ciantic (January 15, 2013 at 7:16 am)
Does anyone have working scanners for HTML and JavaScript files with __("text")-style syntax? Especially for HTML template files.
Mature environments tend to have xgettext scanners ready, and it really helps the translation process with tools such as PoEdit, which is capable of using the xgettext scanner.
stephband (January 15, 2013 at 12:17 pm)
Flags are a no-no here in Switzerland, where we support three native languages plus English in one country, and the localisation of those languages is different from in their mother countries.
FR | DE | IT | EN does the job with much less political baggage.
(And yes, I hate seeing the US flag for English – but then, you can’t argue that it should be the British flag either, but the English flag, which I’d bet far fewer people would recognise.)
Jason Mulligan (January 18, 2013 at 10:58 am)
Interesting approach. Why not use the Accept-Language & Content-Language headers and have 1 domain? HTTP affords all the tools to make I18N easy to implement.