MapReducer returning the number of words of each size from a document value

I have a database called books.

Its documents contain several fields such as Book_name, authors, etc.

One of these fields is called “the_book”, and its value is a string holding the content of the book itself.

For example:

"_id" : ObjectId("60b3576fb220dae53d75c995"),
"Book_name" : "blablabla",
"authors" : "Buddy",
"the_book" : "this is what this book is about"

What I am trying to do is to code a mapReducer that returns a pair of <_id,_value> where _id is equal to the size of each word in the_book and where _value is equal to the number of words of this size.

For example, if we only had one book such as "the_book":"SUPPOSING that Truth is a woman--what then?" we should get:

{_id:"1", _value:"1"}, {_id:"2", _value:"1"}, {_id:"4", _value:"3"}, {_id:"5", _value:"2"}, {_id:"9", _value:"1"}

since we have one word “a” of size 1, one word “is” of size 2, three words “then”, “what” and “that” of size 4, two words “truth” and “woman” of size 5, and one word “supposing” of size 9.

If we had more books, it should return the total count of words of each size across all the books.

I guess the mapper has to emit a pair of <word,size_of_word> for each word in “the_book” value and the reducer has to sum this up to get the <_id,_value> requested.
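
A minimal sketch of that idea in plain JavaScript, outside MongoDB. One adjustment to the guess above: for the reducer to be a simple sum, the mapper would emit <size_of_word, 1> per word rather than <word, size_of_word>. The names `mapBook`, `reduceCounts` and `runMapReduce` are hypothetical, the small harness only imitates the emit/group/reduce flow, and the whitespace split is deliberately naive (it does not handle punctuation).

```javascript
// Mapper (hypothetical): emit one (wordLength, 1) pair per word.
// Assumption: words are separated by whitespace only.
function mapBook(doc, emit) {
  const words = doc.the_book.split(/\s+/).filter(Boolean);
  for (const word of words) {
    emit(word.length, 1);
  }
}

// Reducer (hypothetical): sum the 1s emitted for a given word length.
function reduceCounts(key, values) {
  return values.reduce((total, v) => total + v, 0);
}

// Tiny stand-in for the map-reduce engine: group emitted values by key,
// then reduce each group and sort the output by word length.
function runMapReduce(docs) {
  const groups = new Map();
  for (const doc of docs) {
    mapBook(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  return [...groups.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([key, values]) => ({ _id: key, value: reduceCounts(key, values) }));
}

const result = runMapReduce([{ the_book: "this is what this book is about" }]);
console.log(result);
// [ { _id: 2, value: 2 }, { _id: 4, value: 4 }, { _id: 5, value: 1 } ]
```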

I’m having trouble splitting the string into an array of words, since the delimiter isn’t always just a space (in the example above there is also --).
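
One possible way around the delimiter problem, sketched in plain JavaScript: instead of splitting on delimiters, collect the runs of letters and treat everything else as a separator. The helper name `splitWords` is hypothetical, and the pattern assumes that a word consists of ASCII letters only.

```javascript
// Collect runs of ASCII letters; spaces, dashes and punctuation
// all act as delimiters because they are simply not matched.
function splitWords(text) {
  return text.match(/[a-z]+/gi) || [];
}

const words = splitWords("SUPPOSING that Truth is a woman--what then?");
console.log(words);
// [ 'SUPPOSING', 'that', 'Truth', 'is', 'a', 'woman', 'what', 'then' ]

const lengths = words.map((w) => w.length);
console.log(lengths);
// [ 9, 4, 5, 2, 1, 5, 4, 4 ]
```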

Thanks for the help !

Hi @buddy_jewgah,

Here is my attempt.

[
  {
    '$addFields': {
      'sizes': {
        '$map': {
          'input': {
            '$filter': {
              'input': {
                '$regexFindAll': {
                  'input': '$the_book', 
                  'regex': '[a-z]*', 
                  'options': 'i'
                }
              }, 
              'as': 'val', 
              'cond': {
                '$ne': [
                  '$$val.match', ''
                ]
              }
            }
          }, 
          'as': 'val', 
          'in': {
            '$strLenCP': '$$val.match'
          }
        }
      }
    }
  }, {
    '$unwind': {
      'path': '$sizes'
    }
  }, {
    '$group': {
      '_id': '$sizes', 
      'count': {
        '$sum': 1
      }
    }
  }
]

I ran it in Compass on the complex example you provided.

The trick of my solution is to use a regular expression to identify the different words. The good thing is that the regex can be as simple or as complex as you need.
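
To illustrate how the choice of regex changes what counts as a word, here is a small sketch in plain JavaScript (run outside MongoDB; the names `lettersOnly`, `withApostrophes` and `wordLengths` are hypothetical). The first pattern uses + instead of *, so it never produces empty matches, which is what the $filter step above is there to remove. The second is one possible way to keep contractions such as "Don't" together. Whether these patterns behave identically inside $regexFindAll is an assumption not checked here.

```javascript
// Runs of ASCII letters only; "+" means no empty matches are produced.
const lettersOnly = /[a-z]+/gi;

// Same, but an apostrophe between letters stays inside the word.
const withApostrophes = /[a-z]+(?:'[a-z]+)*/gi;

// Return the length of every word the given pattern finds in the text.
function wordLengths(text, pattern) {
  return (text.match(pattern) || []).map((w) => w.length);
}

const sample = "Don't stop--what then?";

console.log(wordLengths(sample, lettersOnly));
// [ 3, 1, 4, 4, 4 ]   ("Don", "t", "stop", "what", "then")

console.log(wordLengths(sample, withApostrophes));
// [ 5, 4, 4, 4 ]      ("Don't", "stop", "what", "then")
```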

That’s the result I get:

{ "_id" : 9, "count" : 1 }
{ "_id" : 2, "count" : 1 }
{ "_id" : 1, "count" : 1 }
{ "_id" : 4, "count" : 3 }
{ "_id" : 5, "count" : 2 }

It’s most probably not the most optimized solution. I guess it’s possible to use $reduce in the first stage but it was too much for my small head…

It would be a lot faster if you didn’t have to use $unwind.

Cheers,
Maxime.