How I built a translation engine in a weekend

and managed to go out on Friday and Saturday nights

Google Translate is powered by a technique called machine translation. It can translate sentences from one natural language to another without human intervention. I recently heard that Mozilla was starting a project to create an open source machine translation engine. That immediately resonated with me, and I decided to give it a try and build my own English-Japanese translator in JavaScript.

Day 1: Get ready

Late on Friday afternoon, I had a general idea of what I wanted to implement: a machine translation engine using syntax to translate sentences. I wanted something simple and couldn't wait to see it working. I realized I had this "fire in the belly" and jumped directly into the action. I spent a small amount of time gathering what I needed: a corpus of translated sentence pairs in English and Japanese, and a part-of-speech tagger for each language. I started fiddling around with the tools with as little coding as possible.

Day 2: Put it together

I had a late night and woke up on Saturday around 11 o'clock. I spent a few hours cleaning up what I had done the day before, creating a dedicated folder and project in my IDE. Then I started coding and refactoring. I also spent some time sketching solutions on paper, to make sure I hadn't missed an important point.

Day 3: Make it work

I got back to the project in the early afternoon. I mostly coded that day. I was so excited to see my translation engine work that I sat in front of my computer for ~10 hours. At the end of the day, I had something working. After a bit of cleaning, I created the repo on GitHub and pushed the code.

After day 3

Whatever happened after day 3 is not important. Most of the work had been done over the weekend and I had a working prototype. Obviously, it is just a toy system, nothing comparable to Google Translate, but I'm happy I was able to do it in a rather limited amount of time. And most importantly, I can iterate on it and progressively make it better. Try it for yourself!

Conclusion

I was able to achieve this because JavaScript is well suited to fast prototyping. If it were a business, I could have started generating profit from day 4. I really love the idea of hacking together a quick and dirty prototype and seeing how it works. Next time you have an idea for a business, do some quick prototyping and launch it as early as possible!

And if you're wondering, I had dinner in a lovely Italian restaurant on Saturday night :-)


The problem of user language lists in JavaScript

I've always wondered why JavaScript cannot get the list of all the languages configured in the browser. This list is made available to servers via the Accept-Language HTTP header.

On the other hand, JavaScript can only get the first language using:

console.log(navigator.language); // 'en', 'fr', 'de' or whatever

Browser inconsistencies

Getting the language of the browser is part of the HTML5 specification, but implementations vary widely.

Internet Explorer

Let's start with IE, which has (surprisingly) the most complete set of (non-standard) features about the environment language. navigator.userLanguage gets the first language set by the user (it can be changed in Internet Options > General > Languages; see Internet Explorer Dev Center).

navigator.browserLanguage returns the language of the browser UI. You can't change this value; it is determined by the version of the executable you installed (this property and all its subtleties are described in Internet Explorer Dev Center).

Finally, navigator.systemLanguage gives you the locale used by the OS (see Internet Explorer Dev Center).
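To compare the three values on a given machine, a small helper can collect them in one object. This is only a sketch: describeIELanguages is a hypothetical name, and it takes the navigator object as a parameter so that it can be tried with a fake object outside IE.

```javascript
// Hypothetical helper: gathers the three IE-specific properties described
// above. `nav` is a navigator-like object; a property that the browser does
// not provide comes back as null.
function describeIELanguages(nav) {
  return {
    user: nav.userLanguage || null,       // first language chosen by the user
    browser: nav.browserLanguage || null, // language of the browser UI
    system: nav.systemLanguage || null    // locale used by the OS
  };
}

// In IE: console.log(describeIELanguages(navigator));
```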

Firefox & Safari

navigator.language returns the first language in the list of languages.

In Firefox, you can define it in Options > Content > Languages.

Safari uses the language set at the system level (See navigator.systemLanguage of IE above). You cannot just override it in Safari.

Some details are available on the navigator.language page on MDN.

Chrome

navigator.language always returns the language of the UI, and there is no way to change it. The value is similar to navigator.browserLanguage in IE.

Chrome extensions can retrieve the full list of languages as set by the user thanks to chrome.i18n.getAcceptLanguages() (Not sure why this API is async):

chrome.i18n.getAcceptLanguages(function(requestedLocales) {
  // 'requestedLocales' is an array of strings.
});
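If the callback style gets in the way, the call can be wrapped in a Promise. The sketch below is one possible way to do it, not part of the Chrome API: the wrapper name is my own, and it receives the chrome.i18n object as a parameter (an assumption made so that it can be exercised with a fake object outside an extension).

```javascript
// Hypothetical wrapper around the callback-based API.
// `i18n` is expected to expose getAcceptLanguages(callback), like chrome.i18n.
function getAcceptLanguages(i18n) {
  return new Promise(function(resolve) {
    i18n.getAcceptLanguages(function(requestedLocales) {
      // 'requestedLocales' is an array of strings.
      resolve(requestedLocales);
    });
  });
}

// Inside a Chrome extension:
// getAcceptLanguages(chrome.i18n).then(function(locales) {
//   console.log(locales);
// });
```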

What's wrong with the current approach?

Well, apart from the semantic inconsistencies, knowing only the main language is a serious limitation. Let's illustrate this with my personal experience. My browsers have the following configuration:

  1. Japanese
  2. English (GB)
  3. English
  4. English (US)
  5. French
  6. Spanish
  7. Korean
  8. German

(Yeah, I want to train my Japanese, so I listed it first!)

A website available in several languages will show its interface in Japanese, or in English if Japanese is not available.

But a local app will only get the first language, Japanese. So if the app does not support this language, it has nothing else to go on, and the UI may be shown in a language that I do not understand at all.

Also, if the app knew the full list, it could use it to detect visitors who speak rare languages and ask them to help with translation.
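With the full list at hand, the fallback a website gets for free is easy to reproduce in an app. Here is a minimal sketch, assuming the app has its own list of supported language codes; pickLanguage and its parameters are hypothetical names, and the comparison is case-sensitive for simplicity.

```javascript
// Hypothetical helper: returns the first preferred language that the app
// supports, or `defaultLanguage` when nothing matches.
function pickLanguage(preferred, supported, defaultLanguage) {
  for (var i = 0; i < preferred.length; i++) {
    var tag = preferred[i];
    if (supported.indexOf(tag) !== -1) {
      return tag;
    }
    // Fall back from a regional variant to its base language: 'en-GB' -> 'en'.
    var base = tag.split('-')[0];
    if (supported.indexOf(base) !== -1) {
      return base;
    }
  }
  return defaultLanguage;
}

// My configuration, written as language codes.
var preferred = ['ja', 'en-GB', 'en', 'en-US', 'fr', 'es', 'ko', 'de'];
console.log(pickLanguage(preferred, ['fr', 'en'], 'en')); // 'en'
console.log(pickLanguage(preferred, ['de', 'es'], 'en')); // 'es'
```

An app that only knows about 'ja' has to jump straight to its default, whereas here Spanish wins over German because it comes earlier in my list.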

How to fix it?

First of all, an easy cross-browser way to get the preferred language is to do:

// Try the standard property first, then the IE-specific ones.
/** @const */ var DEFAULT_VALUE = 'en';
/** @const */ var PREFERRED_LANGUAGE = navigator.language || navigator.userLanguage ||
    navigator.browserLanguage || navigator.systemLanguage || DEFAULT_VALUE;

That's fine as long as your app supports this language.

If not, you could always ping a server, read the Accept-Language HTTP header there, send it back to JavaScript in the response and parse it. That means you need a scriptable server, so you cannot do it for offline apps or for apps hosted on a static web server like GitHub Pages. Oh, and Macs have only one language configured at a time: on Safari you will never get more than one element in this header's list.
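The parsing step could look like the sketch below. It assumes the server simply echoes the raw header value back as a string. parseAcceptLanguage is a hypothetical name, and the sample header is made up for illustration. Entries carry an optional quality value (q, defaulting to 1), and the result is sorted from most to least preferred.

```javascript
// Hypothetical helper: turns an Accept-Language header value into an array
// of language tags sorted by preference.
function parseAcceptLanguage(header) {
  return header.split(',')
    .map(function(item, index) {
      var parts = item.trim().split(';');
      var quality = 1; // An entry without a q parameter counts as q=1.
      for (var i = 1; i < parts.length; i++) {
        var match = /^\s*q\s*=\s*([0-9.]+)\s*$/i.exec(parts[i]);
        if (match) {
          quality = parseFloat(match[1]);
        }
      }
      return {tag: parts[0].trim(), quality: quality, index: index};
    })
    .filter(function(entry) {
      // Drop empty entries, the '*' wildcard and languages refused with q=0.
      return entry.tag !== '' && entry.tag !== '*' && entry.quality > 0;
    })
    .sort(function(a, b) {
      // Highest quality first; keep the header order for equal qualities.
      return (b.quality - a.quality) || (a.index - b.index);
    })
    .map(function(entry) { return entry.tag; });
}

var header = 'ja,en-GB;q=0.8,en;q=0.6,fr;q=0.4';
console.log(parseAcceptLanguage(header)); // ['ja', 'en-GB', 'en', 'fr']
```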

What to do next?

The good news is things are being worked on.

First, there is this bug on the WHATWG bug tracker to discuss this feature, with all the details here. Firefox is already thinking of implementing it.

To put it in a nutshell, the current consensus is to have a new JavaScript property called navigator.languages that returns an array of language codes sorted by preference order:

console.log(navigator.languages);
// The configuration above would give something like:
// ['ja', 'en-GB', 'en', 'en-US', 'fr', 'es', 'ko', 'de'];
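Until the property is available everywhere, code can use it when present and otherwise fall back to the single-language chain from the "How to fix it?" section. This is a sketch under that assumption: getLanguages is a hypothetical name, and the navigator object is passed in so the function can be tried with fake objects.

```javascript
// Hypothetical helper: always returns an array of language codes.
// `nav` is a navigator-like object.
function getLanguages(nav, defaultLanguage) {
  if (nav.languages && nav.languages.length > 0) {
    // Copy the list so that callers can modify the result freely.
    return Array.prototype.slice.call(nav.languages);
  }
  // No full list: fall back to the single preferred language.
  var single = nav.language || nav.userLanguage || nav.browserLanguage ||
      nav.systemLanguage || defaultLanguage;
  return [single];
}

// In a browser: console.log(getLanguages(navigator, 'en'));
```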

Firing an event when the language changes is also being discussed. Firefox OS already allows this in certified apps with a proprietary API:

navigator.mozSettings.addObserver('language.current', function(event) {
  console.log('New language', event.settingValue); // Default value is 'en-US'.
});

To conclude

Making the full list of languages configured in a browser available to JavaScript can't be done on the front end at all, and only partially with a back end. Yet this information is stored in the OS/browser itself. Am I the only one who thinks something is wrong here?

This topic is hard and has been around for a while, so some of the resources I consulted might be outdated. Use the comments below if you spot any errors.
