What strategy should I use to implement an efficient full text search?

Hi all,

I work on wprn.org with an Atlas cluster handling the search. I was writing an issue on my repo to increase the score of the longest matched n-gram, but I realized I might have misunderstood what Atlas does under the hood regarding full-text search (FTS), so I would like to double-check.

Right now, here is how I proceed:

  • I get the search string and split it into an array, using the space character as the separator
  • I remove all the strings that match a list of 1217 stop words
  • I send the resulting array to the server-side resolver, which searches with it:
    $search: {
      text: {
        query: search,
        path: [
          'name',
          'description',
          'contact.lastname',
          'contact.entity',
          'team.lastname',
          'team.entity',
        ],
      },
    },
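To make the steps above concrete, here is a minimal sketch of the client-side part. The `STOP_WORDS` set is a tiny hypothetical stand-in for my real list of 1217 words, the function names are only for illustration, and I split on any run of whitespace rather than on a single space so that double spaces do not produce empty strings.

```javascript
// Hypothetical stand-in for the real list of 1217 stop words.
const STOP_WORDS = new Set(['a', 'an', 'and', 'in', 'of', 'the']);

// Fields searched, copied from the $search stage above.
const PATHS = [
  'name',
  'description',
  'contact.lastname',
  'contact.entity',
  'team.lastname',
  'team.entity',
];

// Split the search string on whitespace and drop stop words.
// The comparison is case-insensitive; the original casing is kept.
function tokenize(searchString) {
  return searchString
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .filter((word) => !STOP_WORDS.has(word.toLowerCase()));
}

// Build the $search stage that the resolver runs with the filtered terms.
function buildSearchStage(searchString) {
  return {
    $search: {
      text: {
        query: tokenize(searchString),
        path: PATHS,
      },
    },
  };
}
```

With this, `tokenize('early analysis of covid variants')` drops `of` and keeps the four remaining words, which is the array the `text` operator receives.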

The downside with my approach is that matching a phrase does not boost the score. Ideally, the longest sequence of words that matches should get the highest score.

For instance, if we search for "early analysis of covid variants", the boost levels could be:

early analysis of covid variants: 4
early analysis of covid: 3
analysis of covid variants: 3
covid variants: 2
early analysis: 2

The approach I planned to choose was to insert all the n-grams I find in the search string into my search array. Each n-gram would be boosted by the number of words it contains, not counting stop words.

For a long search string, this would greatly increase the number of strings I am searching for. So before I commit to this, I wanted to check with the community whether there are better approaches.
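Here is a rough sketch of what I have in mind for the n-gram approach, not something I have run against my cluster. It assumes each n-gram becomes a `phrase` clause with a `score.boost` inside a `compound.should`, next to the plain `text` clause. The stop-word set and function names are hypothetical, and the rule that an n-gram must start and end on a non-stop word and contain at least two of them is my own assumption.

```javascript
// Hypothetical stand-in for the real list of 1217 stop words.
const STOP_WORDS = new Set(['a', 'an', 'and', 'in', 'of', 'the']);
const PATHS = ['name', 'description', 'contact.lastname',
  'contact.entity', 'team.lastname', 'team.entity'];

const isStopWord = (word) => STOP_WORDS.has(word.toLowerCase());

// Enumerate contiguous n-grams that start and end on a non-stop word
// and contain at least two non-stop words.
// boost = number of non-stop words in the n-gram.
function boostedNgrams(searchString) {
  const words = searchString.trim().split(/\s+/).filter(Boolean);
  const ngrams = [];
  for (let start = 0; start < words.length; start++) {
    if (isStopWord(words[start])) continue;
    let contentWords = 0;
    for (let end = start; end < words.length; end++) {
      if (isStopWord(words[end])) continue;
      contentWords += 1;
      if (contentWords >= 2) {
        ngrams.push({
          phrase: words.slice(start, end + 1).join(' '),
          boost: contentWords,
        });
      }
    }
  }
  return ngrams;
}

// Combine the plain term search with one boosted phrase clause per n-gram.
function buildPhraseSearchStage(searchString) {
  const terms = searchString.trim().split(/\s+/)
    .filter((word) => word.length > 0 && !isStopWord(word));
  const should = boostedNgrams(searchString).map(({ phrase, boost }) => ({
    phrase: { query: phrase, path: PATHS, score: { boost: { value: boost } } },
  }));
  if (terms.length > 0) {
    should.unshift({ text: { query: terms, path: PATHS } });
  }
  return { $search: { compound: { should } } };
}
```

For "early analysis of covid variants" this yields the five phrases listed above plus "analysis of covid" with a boost of 2. With k non-stop words the sketch produces k(k-1)/2 phrase clauses, which is the growth I am worried about. Also note that a document matching the full phrase matches the shorter phrases too, so the scores of the matching `should` clauses add up, and how `phrase` treats a word like "of" depends on the analyzer configured on the index.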

Am I doing it the right way?

Subsidiary question: I generate a static score for each item I search that I use as a basic sort. It is based on popularity (views/time) metrics and ratings. Do you guys think it is a good idea to use it as a coefficient of the search boost?
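
To show what I mean by using the static score as a coefficient, here is a sketch based on the `function` score option, multiplying the relevance score by a stored field. The field name `staticScore` and the function name are hypothetical, and this assumes the field is numeric and indexed as a number in the search index.

```javascript
// Fields searched, copied from my current $search stage.
const PATHS = ['name', 'description', 'contact.lastname',
  'contact.entity', 'team.lastname', 'team.entity'];

// Multiply the text relevance score by a per-document static score.
// `scoreField` is a hypothetical field name; documents missing the
// field fall back to a coefficient of 1, leaving relevance unchanged.
function buildWeightedSearchStage(terms, scoreField = 'staticScore') {
  return {
    $search: {
      text: {
        query: terms,
        path: PATHS,
        score: {
          function: {
            multiply: [
              { score: 'relevance' },
              { path: { value: scoreField, undefined: 1 } },
            ],
          },
        },
      },
    },
  };
}
```

Calling `buildWeightedSearchStage(['covid', 'variants'])` gives a stage where a document's final score is its relevance times its `staticScore`.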