Analyzers for Japanese text

Supriya_Bansal · February 2, 2021, 7:53pm

Lucene has two analyzers for Japanese full-text search:

analyzers-kuromoji
analyzers-icu

Are these supported by MongoDB Atlas?

Pavel_Duchovny · February 3, 2021, 7:12am

Hi @Supriya_Bansal,

MongoDB Atlas Search analyzers include lucene.cjk, which works for Japanese:

https://docs.atlas.mongodb.com/reference/atlas-search/analyzers/language

Please let me know if you have any additional questions.

Best regards,
Pavel

Supriya_Bansal · February 16, 2021, 7:42pm

Thank you @Pavel_Duchovny
Japanese text mixes several scripts, such as Kanji and Kana. Would CJK suffice for all of these?

Chie_Hayashida · February 19, 2024, 5:22am

I’m a Japanese Solutions Architect.
You should use the lucene.kuromoji analyzer for Japanese.
lucene.cjk only tokenizes text with StandardTokenizer, applies some normalization, and forms bigrams of CJK characters.
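
To make the bigram behavior concrete, here is a minimal Python sketch of overlapping two-character tokens. It is a simplification, not Lucene's code: the function name cjk_bigrams is hypothetical, and the sketch skips the script detection and normalization that the real analyzer performs.

```python
from typing import List


def cjk_bigrams(text: str) -> List[str]:
    """Return overlapping two-character tokens for a run of CJK characters.

    Simplified illustration only: assumes the whole input is CJK text.
    """
    if not text:
        return []
    if len(text) == 1:
        # A lone character has no neighbor to pair with, so keep it as is.
        return [text]
    return [text[i:i + 2] for i in range(len(text) - 1)]


# "東京都" (Tokyo Metropolis) yields the tokens "東京" (Tokyo) and "京都" (Kyoto).
print(cjk_bigrams("東京都"))
```

Because "京都" appears as a token, a search for Kyoto can match a document that only mentions Tokyo Metropolis. A dictionary-based tokenizer is meant to avoid this kind of false match.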

Kuromoji tokenizes using an internal dictionary, so it can achieve higher accuracy. But because it relies on a dictionary, it cannot handle some words that are not in the dictionary. In such cases, you can create a custom analyzer with an nGram tokenizer and combine the results.
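
As a rough sketch of that setup, the Python below builds an Atlas Search index definition that analyzes one field with lucene.kuromoji and also indexes it with a custom nGram analyzer through a multi analyzer, then builds a query that searches both. The field name description, the analyzer name jaNgram, the multi name ngram, and the gram sizes are all assumptions for illustration; check the exact options against the Atlas Search documentation before using them.

```python
import json

# Hypothetical names: field "description", custom analyzer "jaNgram",
# multi analyzer "ngram". Gram sizes are illustrative, not tuned.
index_definition = {
    "analyzers": [
        {
            "name": "jaNgram",
            "tokenizer": {"type": "nGram", "minGram": 2, "maxGram": 3},
        }
    ],
    "mappings": {
        "dynamic": False,
        "fields": {
            "description": {
                "type": "string",
                # Dictionary-based analysis for the main field.
                "analyzer": "lucene.kuromoji",
                # Same text indexed a second time with the n-gram analyzer.
                "multi": {
                    "ngram": {"type": "string", "analyzer": "jaNgram"}
                },
            }
        },
    },
}


def build_search_stage(query: str) -> dict:
    """Build a $search stage that matches on either analysis of the field."""
    return {
        "$search": {
            "compound": {
                "should": [
                    {"text": {"query": query, "path": "description"}},
                    {
                        "text": {
                            "query": query,
                            "path": {"value": "description", "multi": "ngram"},
                        }
                    },
                ]
            }
        }
    }


print(json.dumps(index_definition, ensure_ascii=False, indent=2))
print(json.dumps(build_search_stage("東京"), ensure_ascii=False, indent=2))
```

With the two should clauses, a document can match through the Kuromoji tokens, the n-gram tokens, or both, so words missing from the dictionary still have a chance to match.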