I have a database called books.
Its documents contain several fields such as Book_name, authors, etc.
One of these fields is called “the_book”, and its value is a string holding the content of the book itself.
For example:
"_id" : ObjectId("60b3576fb220dae53d75c995"),
"Book_name" : "blablabla",
"authors" : "Buddy",
"the_book" : "this is what this book is about"
What I am trying to do is write a mapReduce job that returns pairs of <_id,_value>, where _id is a word length occurring in the_book and _value is the number of words of that length.
For example, if we only had one book, such as "the_book":"SUPPOSING that Truth is a woman--what then?"
we should get:
{_id:”1”, _value:”1”}, {_id:”2”, _value:”1”}, {_id:”4”, _value:”3”}, {_id:”5”, _value:”2”}, {_id:”9”, _value:”1”}
since we have one word “a” of size 1, one word “is” of size 2, three words “then”, “what” and “that” of size 4, two words “truth” and “woman” of size 5 and one word “supposing” of size 9.
If we had more books, it should return the total number of words of each size across all the books.
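To make the expected result concrete, here is a minimal reference sketch in plain JavaScript, outside MongoDB. The helper name countWordLengths is hypothetical, and it assumes that a "word" is any run of ASCII letters, which is only one possible reading of the requirement.

```javascript
// Hypothetical reference helper (not a MongoDB API): counts words by length
// across a list of book texts.
// Assumption: a "word" is a run of ASCII letters; anything else is a delimiter.
function countWordLengths(texts) {
  var counts = {};
  texts.forEach(function (text) {
    // match() returns null when the text contains no letters at all
    var words = text.match(/[A-Za-z]+/g) || [];
    words.forEach(function (word) {
      counts[word.length] = (counts[word.length] || 0) + 1;
    });
  });
  return counts;
}

var single = countWordLengths(["SUPPOSING that Truth is a woman--what then?"]);
console.log(JSON.stringify(single)); // {"1":1,"2":1,"4":3,"5":2,"9":1}
```

Passing several strings in the array gives the totals over all of them, which is the behaviour wanted for more than one book.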
I guess the mapper has to emit a pair of <word,size_of_word> for each word in “the_book” value and the reducer has to sum this up to get the <_id,_value> requested.
I’m having trouble splitting the string into an array of words, since the delimiter isn’t always just a space (in the example above there is also --).
Thanks for the help!
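A minimal sketch of one possible map/reduce pair, assuming words are runs of ASCII letters and every other character (spaces, --, ?, and so on) is a delimiter. It emits <word length, 1> rather than <word, size_of_word>, so that the reducer only has to sum. The names mapFn and reduceFn are placeholders. Note that mapReduce output documents use the field name value (not _value), and the keys here would be numbers rather than strings.

```javascript
// Mapper sketch: MongoDB calls this with `this` bound to one document
// and provides the global emit() function.
// Assumption: anything that is not an ASCII letter separates words.
var mapFn = function () {
  var words = this.the_book.split(/[^A-Za-z]+/);
  for (var i = 0; i < words.length; i++) {
    // split() leaves an empty string at the edges, e.g. after the final "?"
    if (words[i].length > 0) {
      emit(words[i].length, 1);
    }
  }
};

// Reducer sketch: sums the counts emitted for one word length.
var reduceFn = function (key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
};

// In the mongo shell this could be run roughly as:
//   db.books.mapReduce(mapFn, reduceFn, { out: { inline: 1 } })
```

If words with accents or apostrophes need to count as single words, the character class in the regular expression would have to be widened accordingly.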