Analyzers for Japanese text

Lucene has two analyzers for Japanese full-text search:

  1. analyzers-kuromoji
  2. analyzers-icu

Are these supported by MongoDB Atlas Search?

Hi @Supriya_Bansal,

MongoDB Atlas Search analyzers include lucene.cjk, which works well for Japanese.

Please let me know if you have any additional questions.

Best regards,
Pavel

Thank you @Pavel_Duchovny
Japanese text mixes several scripts, such as Kanji and Kana (Hiragana and Katakana). Would CJK suffice for all of these?


I’m a Japanese Solutions Architect.
You should use lucene.kuromoji as the analyzer for Japanese.
lucene.cjk just tokenizes text with StandardTokenizer, applies some normalization, and forms bigrams of the CJK characters.
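
To make the bigram behavior concrete, here is a rough Python sketch. It is not Lucene's actual implementation: the character ranges and the `cjk_bigrams` helper are simplifications I made up for illustration.

```python
import re

# Rough ranges for Hiragana, Katakana, and common CJK ideographs,
# plus plain ASCII words. This is an assumption for the sketch, not
# the exact rule set Lucene uses.
RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+|[A-Za-z0-9]+")

def cjk_bigrams(text):
    """Split text into lowercased ASCII words and overlapping CJK bigrams."""
    tokens = []
    for match in RUN.finditer(text):
        run = match.group()
        if run[0].isascii():
            # Non-CJK words are kept whole and lowercased.
            tokens.append(run.lower())
        elif len(run) == 1:
            # A lone CJK character has no neighbor to pair with.
            tokens.append(run)
        else:
            # Overlapping two-character windows over the CJK run.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

print(cjk_bigrams("東京都"))  # ['東京', '京都']
```

Note that 東京都 (Tokyo Metropolis) yields the bigram 京都 (Kyoto), so a search for Kyoto can match a document that only mentions Tokyo. That is the kind of false positive a dictionary-based tokenizer avoids.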

Kuromoji tokenizes using an internal dictionary, so it can achieve higher accuracy.
But because it relies on a dictionary, in some cases it cannot handle a word that is not in the dictionary. In that case, you can create a custom analyzer with the nGram tokenizer and combine the results of both.
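
As a sketch of what that could look like, the Python below builds an Atlas Search index definition that indexes one field with lucene.kuromoji and, through a multi analyzer, with a custom nGram analyzer, plus a `$search` stage that queries both. The field name `description`, the analyzer name `japaneseNgram`, the multi name `ngram`, and the gram sizes are all assumptions for illustration; adjust them to your data.

```python
import json

def build_index_definition(field="description"):
    """Index definition: kuromoji as the main analyzer, nGram as a fallback."""
    return {
        "analyzers": [
            {
                # Hypothetical custom analyzer name.
                "name": "japaneseNgram",
                # Gram sizes are a guess; tune them for your data.
                "tokenizer": {"type": "nGram", "minGram": 2, "maxGram": 3},
            }
        ],
        "mappings": {
            "dynamic": False,
            "fields": {
                field: {
                    "type": "string",
                    "analyzer": "lucene.kuromoji",
                    # Same field indexed a second time with the nGram analyzer.
                    "multi": {
                        "ngram": {"type": "string", "analyzer": "japaneseNgram"}
                    },
                }
            },
        },
    }

def build_search_stage(query, field="description"):
    """$search stage that scores matches from either analyzer."""
    return {
        "$search": {
            "compound": {
                "should": [
                    {"text": {"query": query, "path": field}},
                    {"text": {"query": query,
                              "path": {"value": field, "multi": "ngram"}}},
                ]
            }
        }
    }

print(json.dumps(build_index_definition(), indent=2, ensure_ascii=False))
print(json.dumps(build_search_stage("東京"), indent=2, ensure_ascii=False))
```

The printed index definition can be pasted into the Atlas JSON editor, and the stage can be used as the first stage of an aggregation pipeline. Documents matched by kuromoji and by nGram both contribute to the score, and words missing from the dictionary can still be found through the nGram path.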