Building a Voice Assistant in JavaScript

tl;dr: You can build a voice assistant in JavaScript with existing open source libraries but the result sucks.

Over the last couple of days, I explored voice in the browser. Here are my findings.

What's available

Most of the voice-related JavaScript libraries I found are simple wrappers around the Web Speech API. Under the hood, browsers rely on external services for this API, so I ruled these libraries out immediately: I wanted something that works offline.

What I found:

pocketsphinx.js for speech recognition
speak.js for speech synthesis

Both these projects are brought to the web thanks to Emscripten and asm.js.

Building a (not so) intelligent voice assistant

Thanks to a discussion with my friend Thomas this weekend, CENode.js was brought to my attention. It's an open source project to enable human-machine conversation based on a human-friendly format. It can be used to simulate intelligence.

I had all the elements at hand to build a browser-based, offline voice assistant (Think Siri, OK Google, Amazon Echo...).

Coding (or rather gluing)

Gluing all parts together was unsurprisingly easy. Pocketsphinx.js calls a function when the web worker has a final hypothesis:

worker.onmessage = (evt) => {
  if (evt.data.hyp !== undefined && evt.data.final) {
    gotHypothesis(evt.data.hyp);
  }
};

I can feed this hypothesis to CENode.js by creating a new card:

function gotHypothesis(hypothesis) {
  const card = "there is a nl card named '{uid}' that is to the agent 'agent1' and is from the individual 'User' and has the timestamp '{now}' as timestamp and has '" + hypothesis.replace(/'/g, "\\'") + "' as content.";
  node.add_sentence(card);
}

Then I just have to wait until the deck is polled for new cards addressed to me (my name is User). The new card contains the answer to my request and I can just do:

speak(card.content);

The speak function calls espeak.synth() from speak.js.
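The polling step could be sketched roughly as follows. This is only an illustration, not the code from this project: getCards, createDeckPoller, and the name and to fields are hypothetical stand-ins for however your CENode.js setup exposes its cards. Only card.content and the name User come from the description above.

```javascript
// Minimal sketch of polling a deck for cards addressed to "User".
// ASSUMPTIONS: getCards() and the `name` / `to` fields are hypothetical;
// adapt them to however your CENode.js instance exposes its cards.
function createDeckPoller(getCards, onCard, recipient = "User") {
  const seen = new Set(); // names of cards that were already handled

  return function poll() {
    for (const card of getCards()) {
      if (card.to === recipient && !seen.has(card.name)) {
        seen.add(card.name);
        onCard(card);
      }
    }
  };
}

// Usage with stubbed data. In the browser, onCard would call
// speak(card.content) and poll would run on a timer,
// e.g. setInterval(poll, 1000).
const deck = [
  { name: "card1", to: "agent1", content: "What is Saturn?" },
  { name: "card2", to: "User", content: "Saturn is a planet." },
];
const spoken = [];
const poll = createDeckPoller(() => deck, (card) => spoken.push(card.content));
poll();
poll(); // the second pass finds nothing new
```

Remembering the names of handled cards keeps the same answer from being spoken on every polling pass.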

Adding a grammar

That's it for the JavaScript code, but an important part is still missing. Pocketsphinx.js needs to be fed a grammar and the pronunciation of every word used in that grammar. The grammar restricts the set of phrases that can be detected, so I'll conveniently reuse the requests supported by CENode.js.

Considering the CENode.js demo about astronomy, here's a list of some requests that it can understand (let me know if I missed any):

What orbits Mars?
What does Titan orbit?
What is Saturn?
What is a star?
List instances of type planet

The first step is to enumerate all the variations of these requests that make sense in English and that CENode.js will understand. I found 124 of them. This is the corpus I'll use in the next steps.
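One way to enumerate such a corpus is to expand request templates over lists of names. The sketch below is an assumption about how this could be done, not the method used here: the templates come from the requests listed above, while the slot names, the slot values, and the expand helper are purely illustrative and will not reproduce the 124 phrases.

```javascript
// Illustrative slot values only; the real demo knows many more names.
const slots = {
  body: ["Mars", "Titan", "Saturn"],
  type: ["planet", "star", "moon"],
};

// Request templates taken from the list above, with {slot} placeholders.
const templates = [
  "What orbits {body}?",
  "What does {body} orbit?",
  "What is {body}?",
  "What is a {type}?",
  "List instances of type {type}",
];

// Hypothetical helper: replace the first placeholder with each possible
// value, then recurse until no placeholder is left.
function expand(template, values) {
  const match = template.match(/\{(\w+)\}/);
  if (!match) return [template];
  const options = values[match[1]] || [];
  return options.flatMap((v) => expand(template.replace(match[0], v), values));
}

const corpus = templates.flatMap((t) => expand(t, slots));
```

Phrases that do not make sense in English (or that CENode.js rejects) would still need to be filtered out by hand.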

The grammar used by pocketsphinx.js is a Finite State Grammar. Think of a finite state machine where each state adds a word to the phrase:

s = 0 // s is the state with initial value 0.

[]    -> [What] -> [does] -> [Titan] -> [orbit]
s = 0    s = 1     s = 2     s = 3      s = 4

The grammar for this simple phrase is:

const grammar = {
  numStates: 5,
  start: 0,
  end: 4,
  transitions: [
    { from: 0, to: 1, word: "WHAT" },
    { from: 1, to: 2, word: "DOES" },
    { from: 2, to: 3, word: "TITAN" },
    { from: 3, to: 4, word: "ORBIT" },
    { from: 4, to: 4, word: "<sil>" }
  ]
};

The more phrases you add, the more complex it gets. At each state, pocketsphinx.js uses the grammar to work out which words can come next. If your grammar is correct, you cannot end up with meaningless sentences.
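To make the state machine idea concrete, here is a small sketch that walks the grammar object shown above. It only illustrates the concept and says nothing about how pocketsphinx.js works internally; nextWords and accepts are hypothetical helper names.

```javascript
// The single-phrase grammar from above.
const grammar = {
  numStates: 5,
  start: 0,
  end: 4,
  transitions: [
    { from: 0, to: 1, word: "WHAT" },
    { from: 1, to: 2, word: "DOES" },
    { from: 2, to: 3, word: "TITAN" },
    { from: 3, to: 4, word: "ORBIT" },
    { from: 4, to: 4, word: "<sil>" },
  ],
};

// Words that may follow from a given state (silence excluded).
function nextWords(g, state) {
  return g.transitions
    .filter((t) => t.from === state && t.word !== "<sil>")
    .map((t) => t.word);
}

// True if the sequence of words leads from the start state to the end state.
// A set of states is tracked because the same word may leave a state twice.
function accepts(g, words) {
  let states = new Set([g.start]);
  for (const word of words) {
    const next = new Set();
    for (const t of g.transitions) {
      if (states.has(t.from) && t.word === word) next.add(t.to);
    }
    if (next.size === 0) return false; // no transition for this word
    states = next;
  }
  return states.has(g.end);
}

const inGrammar = accepts(grammar, ["WHAT", "DOES", "TITAN", "ORBIT"]);
const outOfGrammar = accepts(grammar, ["WHAT", "TITAN", "ORBIT"]);
```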

I wrote a small script to build grammars from a corpus of phrases that I may publish if anyone is interested.
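For readers who want to try this themselves, here is one possible approach. It is a sketch written for this post's example format and is not the script mentioned above: phrases that share their first words share states (a prefix tree), and the last word of every phrase leads to a single shared end state. buildGrammar is a hypothetical name.

```javascript
// Build a grammar object ({ numStates, start, end, transitions }) from phrases.
function buildGrammar(phrases) {
  const children = [new Map()]; // children[state] maps a word to the next state
  const finals = [];            // [fromState, word] pairs that close a phrase
  const seenFinals = new Set();

  for (const phrase of phrases) {
    // Upper-case the phrase and drop punctuation such as "?".
    const words = phrase.toUpperCase().replace(/[^A-Z' ]/g, " ").trim().split(/\s+/);
    if (!words[0]) continue; // skip empty phrases

    let state = 0;
    for (const word of words.slice(0, -1)) {
      if (!children[state].has(word)) {
        children[state].set(word, children.length);
        children.push(new Map());
      }
      state = children[state].get(word);
    }

    const last = words[words.length - 1];
    const key = state + " " + last;
    if (!seenFinals.has(key)) {
      seenFinals.add(key);
      finals.push([state, last]);
    }
  }

  const end = children.length; // the shared end state gets the last number
  const transitions = [];
  children.forEach((map, from) => {
    for (const [word, to] of map) transitions.push({ from, to, word });
  });
  for (const [from, word] of finals) transitions.push({ from, to: end, word });
  transitions.push({ from: end, to: end, word: "<sil>" }); // trailing silence
  return { numStates: end + 1, start: 0, end, transitions };
}

const single = buildGrammar(["What does Titan orbit?"]);
const shared = buildGrammar(["What is Saturn?", "What is Mars?"]);
```

For the single phrase, this yields the same five states and transitions as the hand-written grammar above. It makes no attempt to merge common phrase endings, so larger corpora produce more states than strictly necessary.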

Adding pronunciation

For each word in the corpus, pocketsphinx.js needs to know how to pronounce it.

I fed my corpus to the Sphinx Knowledge Base Tool. I took the *.dic file from the resulting gz archive, split it on line breaks, split each line on tabs, and got something like:

const wordList = [
  ["WHAT", "W AH T"],
  ["WHAT(2)", "HH W AH T"],
  ["ORBIT", "AO R B AH T"],
  ["MARS", "M AA R Z"],
  // ...
];
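The splitting described above can be sketched in a few lines, together with a quick check that every word of the grammar has a pronunciation. This assumes the *.dic content is already available as a string with one tab-separated entry per line; parseDictionary, missingWords, and the sample text are hypothetical and only for illustration.

```javascript
// Turn the text of a *.dic file into [word, pronunciation] pairs.
function parseDictionary(dicText) {
  return dicText
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const [word, pronunciation] = line.split("\t");
      return [word, pronunciation];
    });
}

// List the words that have no entry in the word list.
// Alternative pronunciations such as "WHAT(2)" count as "WHAT".
function missingWords(words, wordList) {
  const known = new Set(wordList.map(([word]) => word.replace(/\(\d+\)$/, "")));
  return words.filter((word) => !known.has(word));
}

// Sample input shaped like the excerpt above.
const sampleDic = "WHAT\tW AH T\nWHAT(2)\tHH W AH T\nORBIT\tAO R B AH T\nMARS\tM AA R Z\n";
const parsed = parseDictionary(sampleDic);
const missing = missingWords(["WHAT", "ORBITS", "MARS"], parsed);
```

Running such a check before loading the grammar makes it easier to spot a word that was left out of the dictionary.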

The result

Et voilà! That's all I needed to get an astronomy-centred voice assistant that can tell me which satellites orbit which planets, along with other detailed information. I have to say it's pretty cool, especially because I'm a big fan of astronomy and I got this project done in under 2 days.

As a side note, I only tested it on Firefox Nightly on my laptop. Results may differ on Chrome or on mobile devices.

The result, though, is not really convincing. First of all, pocketsphinx.js voice recognition is bad. I can probably blame it on my accent (I'm French, remember), but it's so frustrating to say "What orbits Saturn?" hundreds of times with no result while my colleague got understood on the first try! The good thing at least is that it can't return phrases outside of the grammar, so it will always output something that makes sense.

CENode.js is also super limited and I don't really know how to improve it beyond the example provided (the format is well documented though). Some simple phrases that fall outside what it understands fail miserably:

What is orbited by Titan? --> fails
What does Titan orbit?    --> succeeds

But if you ask "What is Saturn?", you get:

Saturn is a planet. Saturn orbits the star 'sun' and is orbited by the moon 'Titan'...

The passive voice does not seem to be supported in requests.

Finally, the output of speak.js is pretty terrifying. Whichever voice you choose, they all sound robotic. They have a cool retro feeling if that's what you're after (and you can argue that it works better for something related to astronomy and space), but they're far from the quality of commercial products.

Going further

This experiment was limited in time, so I didn't want to spend too much on it, but if I had had a real project to build, I would have seriously looked at:

The code is messy, so I don't think it's worth sharing. I'll leave it as an exercise to the reader, unless somebody is really interested.

What I can share though is the FSG generator from a corpus. It needs some cleaning but the code can run on node or in the browser.

There are probably tons of mistakes and approximations in this post. I'm by no means a voice expert, so please correct me if need be. Also, if you have any ideas on how to improve it or if I missed an existing library, please do let me know in the comments. Thanks!


Welsh keyboard for Firefox OS

I recently worked on a way to automate the creation of dictionaries for word suggestions when typing in Firefox OS (more on that one day, hopefully). As a test case, I generated a Welsh keyboard, as it is one of the few languages I know a little that didn't have a keyboard in Firefox OS (the Japanese keyboard is another exception though). Typing on a mobile is hard enough, so word suggestions and autocorrect are key to a better experience. If you speak Welsh, even a little, and use Firefox OS on the master branch, please help us test the keyboard and assess its quality.

Install and activate the Welsh keyboard

Make sure you are using the latest version of B2G. If your device is a Flame or a Sony Z3C running Firefox OS, then you should get new updates automatically. Otherwise, you'll want to flash the latest version.

Here's a quick demo of how to install it on your device.

Test the keyboard

The dictionary built into the keyboard is based on the Welsh Wikipedia, which has, I believe, a decent number of pages (around 64,000 articles, making it the 64th biggest language edition out of 290).

In my experience, the results are very satisfying. I found the words suggested by this keyboard very useful. But my knowledge of Welsh is very limited, so it may not be as helpful to a fluent speaker (or rather, writer). In particular, I'm afraid that it may be too formal or lack the slang and colloquial expressions frequently used in SMS or on social networks.

We need you!

If you speak Welsh, please try it for yourself. I'd love to hear your feedback: if you have any comments, please use the field below, or reach me on Twitter at @g_marty. If this method proves to give good results, I'd love to have a go at producing dictionaries for more languages.

Diolch!


Japanese keyboard for Firefox OS

tl;dr: You can't efficiently type in Japanese on a Firefox OS device, unless you buy an fx0 device or you build Gaia yourself and are very patient!

Firefox OS is targeting a global audience. You can set your interface in 90 different languages and there are over 70 language-specific keyboards! But developing a keyboard is more challenging for some languages than others, and Japanese is one of them.

I won't go into much detail about why a Japanese IME is hard to implement, but the main reason is that there is a massive number of suggestions that need to be stored on the device somehow. In Japanese, any word can usually be written in several different ways. Storing such a dictionary can cause serious issues on low-end devices with limited memory and processing power.

But that's not a reason to give up typing in Japanese on Firefox OS.

fx0

Earlier this year, KDDI released the fx0, a Firefox OS powered device, on the Japanese market. This phone comes with a Japanese keyboard called iWnn IME, developed by Omron Software.

The implementation is efficient and powerful. It comes with many options to configure the keyboard to perfectly suit your needs and habits. Such high quality is not really surprising, as the same keyboard is already available on other platforms (notably Android).

This keyboard has been open sourced in the iwnn-ime-sample GitHub repo. However, the licence expires on 30th September 2015 (I'm not sure what happens afterwards).

The readme file claims it's possible to flash it on a Flame running Firefox OS v2.0. So I tried, and everything works except for the core feature: word suggestions. To provide them in an optimised way, the keyboard queries words from a local server running on the device, and this part doesn't work. The keyboard can still be used to input kana, but you won't get any word suggestions.

I then flashed it on a master build (I tried both a Flame and a Sony Z3C). I encountered the same issue with word suggestions, together with a few other minor bugs. But if you use it as a kana-only keyboard, then this is probably acceptable.

Drop me a comment below if you need help on how to install it.

iWnn IME for Firefox OS

Gaia jp-kanji keyboard

There is already a Japanese keyboard in Gaia. To install it, you must build Gaia yourself and flash your device. Here is the suggested command line:

$ GAIA_KEYBOARD_LAYOUTS=en,jp-kanji make reset-gaia

Once flashed, you must enable the new keyboard in Settings > Keyboards > Select keyboards.

This IME comes with word suggestions, but it is neither as user-friendly nor as fast as the iWnn IME one. That is why it is not available in current builds. I love how the design fits in the Firefox OS UI though.

There are plans to build a better Japanese keyboard for Firefox OS, but nothing has happened yet.

Gaia jp-kanji Keyboard

So unfortunately, there is no easy way to install a Japanese keyboard on Firefox OS, but we're working on it! If you're wondering what I'm doing with my device, I use the iWnn IME keyboard, but because of its limitations, I avoid typing in Japanese on my mobile whenever I can :-)
