The only sentence that failed to tokenize properly is the famous "すもももももももものうち。" (which can also be written "李も桃も桃のうち。", meaning "plums and peaches are both types of peach"). This one is tricky, though: written entirely in hiragana, it is a long run of も with no kanji to hint at the word boundaries.
Now that we can tokenize Japanese, we can also:
- Remove stop words
- Stem tokens
- Compute N-grams
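
As a rough illustration of two of these steps, here is a minimal sketch in plain JavaScript that removes stop words and computes bigrams from an already-tokenized sentence. The token array, the two-particle stop-word list, and the `removeStopWords` and `ngrams` helpers are all assumptions made up for this example; they are not natural's API or its stop-word list, and stemming is left out.

```javascript
// Hypothetical output of a Japanese tokenizer for "すもももももももものうち。"
// (the intended segmentation, not what the tokenizer actually returned).
const tokens = ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち'];

// Illustrative stop-word list containing only two particles.
// This is an assumption for the example, not the list shipped with natural.
const stopWords = new Set(['も', 'の']);

// Keep only the tokens that are not in the stop-word set.
function removeStopWords(tokens, stopWords) {
  return tokens.filter((token) => !stopWords.has(token));
}

// Return every run of n consecutive tokens, in order.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

const content = removeStopWords(tokens, stopWords);
const bigrams = ngrams(content, 2);

console.log(content); // [ 'すもも', 'もも', 'もも', 'うち' ]
console.log(bigrams); // [ [ 'すもも', 'もも' ], [ 'もも', 'もも' ], [ 'もも', 'うち' ] ]
```

Both helpers only need an array of strings, so they work on the output of any tokenizer, whatever the language.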
I wanted to extend one of my previous NLP projects to Japanese, but I couldn't find a tokenizer at the time (the project was developed in PHP), so I reluctantly dropped the idea. I hope I can contribute enough to natural that nobody has to make that kind of decision in the future.
As a side note, there are also many tokenizers for Chinese out there, in several programming languages. I will have to check their licenses in my spare time to see whether they are compatible with natural's. But I'm not sure how to benchmark their respective accuracies, as I don't speak Chinese :-(