a tokenizer in JavaScript for Japanese by Kudo Taku. The license allowed me to use it in natural. I’m fully satisfied: no dictionary to maintain and no heavy tools to develop! It uses a statistical model to decide where to cut tokens, and it turned out to be quite effective given how lightweight it is.
The only sentence that failed to be tokenized properly is the famous “すもももももももものうち。” (which can also be written “李も桃も桃のうち。”, meaning “plums and peaches are both types of peach”). But this one is a bit tricky: written entirely in hiragana, it gives no kanji to hint at where the words begin and end.
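To give a rough idea of what “using a statistical model to cut tokens” can look like, here is a toy sketch. It is not the actual tokenizer's code or feature set: the function names, the features and every weight below are made up by hand for illustration, whereas a real model would learn its weights from a segmented corpus. The principle is simply to score each gap between two characters and to cut wherever the score is positive.

```javascript
// Toy illustration only: the features and weights are hypothetical and
// hand-picked, not learned from a corpus, and not those of any real tokenizer.

// Map a character to a coarse script class.
function charType(ch) {
  if (/[\u3040-\u309F]/.test(ch)) return 'H'; // hiragana
  if (/[\u30A0-\u30FF]/.test(ch)) return 'K'; // katakana
  if (/[\u4E00-\u9FFF]/.test(ch)) return 'M'; // kanji (CJK unified ideographs)
  if (/[0-9A-Za-z]/.test(ch)) return 'A';     // ASCII letters and digits
  return 'O';                                 // anything else
}

// Negative bias: by default, do not cut.
const BIAS = -1;

// Weight of a script change between the left and right character.
const TYPE_BIGRAM_WEIGHTS = { HM: 2, MH: 2, HK: 2, KH: 2, MK: 2, KM: 2 };

// Weight of specific characters on the left of the gap (here: one particle).
const LEFT_CHAR_WEIGHTS = { 'を': 2 };

// Score every gap between adjacent characters and cut where the score is > 0.
function segment(text) {
  const chars = Array.from(text);
  if (chars.length === 0) return [];
  const tokens = [];
  let current = chars[0];
  for (let i = 1; i < chars.length; i++) {
    const left = chars[i - 1];
    const right = chars[i];
    const score = BIAS
      + (TYPE_BIGRAM_WEIGHTS[charType(left) + charType(right)] || 0)
      + (LEFT_CHAR_WEIGHTS[left] || 0);
    if (score > 0) {
      tokens.push(current);
      current = right;
    } else {
      current += right;
    }
  }
  tokens.push(current);
  return tokens;
}

console.log(segment('私は猫です')); // [ '私', 'は', '猫', 'です' ]
```

Such a crude model breaks down quickly (it would split a verb like 食べる between the kanji and the hiragana, for instance), which is why a real tokenizer needs many more features and trained weights. It does show why no dictionary is needed, though: everything the segmenter knows lives in a small table of numbers.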
Now that we can tokenize Japanese, we can also:
I once wanted to extend one of my previous NLP projects, which was written in PHP, to Japanese, but I couldn’t find a tokenizer at the time, so I reluctantly dropped the idea. I hope I’ll be able to help natural enough that such decisions won’t have to be made in the future.
As a side note, there are also many tokenizers for Chinese out there, in several programming languages. I will have to check their licenses in my spare time to see whether they are compatible with natural’s. But I’m not sure how to benchmark their accuracy, as I don’t speak Chinese :-(