Build a schema for post and comments: Array of comments objectID OR post objectID in each comment

My situation: I have a post collection where each post have many comments and comments collections.

post_id: "post_1"
comments: [comment_1, comment_2, comment_3, ..., comment_10000]
content: "Here is a row of post collection"

and

comment_id: "comment_1",
content: "Here is a row of comment collection",

Now, I can easy retrieve a post with comments by query:

post.find()

then lookup or populate (using mongoose). I call this as SCHEMA_1, follow one-to-many pattern.

After, I think about one-to-sqiullions patterns, then build other SCHEMA_2, that using reverse reference in comment collection, like:

comment_id: "comment_1",
post_id: "post_1"

Now, I aslo can easy retrieve comments of a post by query:

comment.find( {post_id} )

Just one query with post_id had index.

OVERALL query for read:

  • With SCHEMA_1: post.find() then lookup(or populate with mongoose)
  • With SCHEMA_2: comment.find( {post_id} ) - just one query with post_id had index

MY QUESTION IS: Which schema is better performance for read query ? I don’t actually deeply understand how lookup work behind with array of references. In SCHEMA_1, if for each reference of comment, lookup need travel comment collection one (like correlate subquerry in SQL), it’s terrible if array of objectID is large.
And if I follow one-to-sqiullition patterns, I have a thought that performance is better because it just run one query, no need to JOIN.

How can I understand this problem ?

Hello @Dung_Ha_Quy, welcome to the MongoDB Community forum!

The questions you had posted are valid, but, the answers will be clear (I think) if you know the amount of data and the important queries in your application - that is the application’s functionality in little more detail. These aspects can help determine how to organize or model the data.

For example, in your application a post can have about 10 to 20 comments, with each comment has a few (about 10 lines) of text. This will help think about storing all the comment data within each post itself. Then the design and the queries like will be simple as all the related data is stored together.

db.post.find( { postId: "My Data Model" } )

The query will fetch post related data like title, content, date, etc., and also all the comments associated with the post. You can use projection to control the fields required in a particular page of the application. You can also, query for specific comment. Both the postId and commentId fields can be indexed for fast access.

But, if the number of comments is very large, and ever growing (only some extraordinary posts have many comments - an unlikely scenario), then store commentId values within the post as an array field. The comments are stored within a separate collection.

Then, you can always query a post for its information with a simple query. To find a specific comment information you will need to do a “join” operation - the $lookup aggregation query to access details of comments. Note that even the lookup operations can use indexes. Ideally, you use lookup queries where the performance may not be very important.

An option, would be to store the postId value (and even the post title) in the comments collection. This will help query the comments collection using the postId (or post title), without a $lookup of the post collection.

Note that, for each option you will also have to take into consideration the other operations on the collection like - inserts, updates and deletes.

Some relevant questions that can help in modeling the data:

  • How many comments you think a post will have? How much content is likely to be in each comment? This will help determine whether to include all comment details within a post or use another collection for comments.
  • Can you identify 2 to 3 important queries in your application? What are application’s main functions (for example, the web pages and the kind of data you want to show in a particular page, etc.).

@Prasad_Saya Thank you so much for your answer !
Actually, i have a little bit unclearly about performance, exactly PERFORMANCE of READ query if array of comment objectID is large, maybe UNBOUNED. And I just don’t know the comparision performance of two scenarior - to read amounts of comment of a post:

  • Using lookup if store array comment objectID in post ( I’m worry about performance here because I saw from document: Each objectID need to travel to comment collection one, so array of 10000 objectID will need to travel comment collection 10000 times, or maybe I misunderstand this ??? )

  • Just store postID in each comment, then using only ONE query. Is this better performance than lookup above ?.

And of course, there are other CRUD operator, but I just focus on READ operator.
Can you explain to me more about lookup operation, thank u !

First of all, you should not have an unbounded array field in your document - this will be a bad design and will lead to problems later on. There can be an array field with large number of bounded elements. You can define index on an array field - these are called as Multikey Indexes.

I suggest, you create some samples of data in your collections and try some queries - including the $lookup queries. Use the explain with “executionStats” mode and study the query plans.