Separating Data That is Accessed Together

Lauren Schaefer, Daniel Coupal • 7 min read • Published Feb 12, 2022 • Updated May 31, 2022

We're breezing through the MongoDB schema design anti-patterns. So far in this series, we've discussed four of the six anti-patterns.

Normalizing data and splitting it into different pieces to optimize for space and reduce data duplication can feel like second nature to those with a relational database background. However, separating data that is frequently accessed together is actually an anti-pattern in MongoDB. In this post, we'll find out why and discuss what you should do instead.

If you prefer to learn by video (or you just like hearing me repeat, "Data that is accessed together should be stored together"), watch the video above.

Separating Data That is Accessed Together

Much like you would use a join to combine information from different tables in a relational database, you can use MongoDB's $lookup operation to join information from more than one collection. $lookup is great for rarely used operations or analytical queries that can run overnight without a time limit. However, $lookup is not so great when you're frequently using it in your applications. Why?

$lookup operations are slow and resource-intensive compared to operations that don't need to combine data from more than one collection.
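As a rough mental model (not a description of how the server actually executes the stage), $lookup behaves like a left outer join: for every input document, it has to find the matching documents in a second collection and attach them as an array. The in-memory sketch below uses plain JavaScript arrays in place of collections; the `lookup` helper and the sample field names are hypothetical.

```javascript
// Hypothetical in-memory stand-in for a $lookup stage (left outer join).
function lookup(localDocs, foreignDocs, localField, foreignField, as) {
  return localDocs.map((doc) => ({
    ...doc,
    // Every input document triggers a search of the other collection.
    [as]: foreignDocs.filter((f) => f[foreignField] === doc[localField]),
  }));
}

// Sample data, assumed for illustration only.
const countries = [{ _id: "finland" }, { _id: "sweden" }];
const delegates = [
  { first_name: "Andy", country_id: "finland" },
  { first_name: "Donna", country_id: "finland" },
];

// Documents with no match still appear, with an empty array attached.
const joined = lookup(countries, delegates, "_id", "country_id", "delegates");
console.log(JSON.stringify(joined, null, 2));
```

A read of a single document that already contains its related data skips this matching work entirely.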

The rule of thumb when modeling your data in MongoDB is:

Data that is accessed together should be stored together.

Instead of separating data that is frequently used together across multiple collections, leverage embedding and arrays to keep the data together in a single collection.

For example, when modeling a one-to-one relationship, you can embed a document from one collection as a subdocument in a document from another. When modeling a one-to-many relationship, you can embed information from multiple documents in one collection as an array of documents in another.

Keep in mind the other anti-patterns we've already discussed as you begin combining data from different collections. Massive, unbounded arrays and bloated documents can both be problematic.

If combining data from separate collections into a single collection will result in massive, unbounded arrays or bloated documents, you may want to keep the collections separate and duplicate some of the data that is used frequently together in both collections. You could use the Subset Pattern to duplicate a subset of the documents from one collection in another. You could also use the Extended Reference Pattern to duplicate a portion of the data in each document from one collection in another. In both patterns, you have the option of creating references between the documents in both collections. Keep in mind that whenever you need to combine information from both collections, you'll likely need to use $lookup. Also, whenever you duplicate data, you are responsible for ensuring the duplicated data stays in sync.

As we have said throughout this series, each use case is different. As you model your schema, carefully consider how you will be querying the data and what the data you will be storing will realistically look like.

Example

What would an Anti-Pattern post be without an example from Parks and Recreation? I don't even want to think about it. So let's return to Leslie.

Leslie decides to organize a Model United Nations for local high school students and recruits some of her coworkers to participate as well. Each participant will act as a delegate for a country during the event. She assigns Andy and Donna to be delegates for Finland.

Leslie decides to store information related to the Model United Nations in a MongoDB database. She wants to store the following information in her database:

Basic stats about each country
A list of resources that each country has available to trade
A list of delegates for each country
Policy statements for each country
Information about each Model United Nations event she runs

With this information, she wants to be able to quickly generate the following reports:

A country report that contains basic stats, resources currently available to trade, a list of delegates, the names and dates of the last five policy documents, and a list of all of the Model United Nations events in which this country has participated
An event report that contains information about the event and the names of the countries that participated

The Model United Nations event begins, and Andy is excited to participate. He decides he doesn't want any of his country's "boring" resources, so he begins trading with other countries in order to acquire all of the world's lions.

Leslie decides to create collections for each of the categories of information she needs to store in her database. After Andy is done trading, Leslie has documents like the following.

Code Snippet
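A sketch of what one document per collection plausibly looks like, reconstructed from the values in the final Finland document at the end of this example. The `country_id` and `countries` reference fields are assumed names, not taken from the original, and ids and dates are shown as plain strings and `Date` objects rather than ObjectId and ISODate values so the snippet runs in plain JavaScript.

```javascript
// Countries collection
const country = {
  _id: "finland",
  official_name: "Republic of Finland",
  capital: "Helsinki",
  languages: ["Finnish", "Swedish", "Sámi"],
  population: 5528737,
};

// Resources collection (`country_id` is an assumed reference field)
const resources = {
  country_id: "finland",
  lions: 32563,
  military_personnel: 0,
  pulp: 0,
  paper: 0,
};

// Delegates collection
const delegates = [
  { country_id: "finland", first_name: "Andy", last_name: "Fryer" },
  { country_id: "finland", first_name: "Donna", last_name: "Beagle" },
];

// Policies collection (full policy text omitted)
const policies = [
  {
    _id: "5ef34ec43e5f7febbd3ed7fb",
    country_id: "finland",
    "date-created": new Date("2011-11-09T04:00:00.000Z"),
    title: "Country Defense Policy",
  },
  {
    _id: "5ef357bb3e5f7febbd3ed7fd",
    country_id: "finland",
    "date-created": new Date("2011-11-10T04:00:00.000Z"),
    title: "Humanitarian Food Policy",
  },
];

// Events collection (`countries` is an assumed list of participants)
const events = [
  {
    _id: "5ef34faa3e5f7febbd3ed7fc",
    "event-date": new Date("2011-11-10T05:00:00.000Z"),
    topic: "Global Food Crisis",
    countries: ["finland"],
  },
  {
    _id: "5ef35ac93e5f7febbd3ed7fe",
    "event-date": new Date("2012-02-18T05:00:00.000Z"),
    topic: "Pandemic",
    countries: ["finland"],
  },
];
```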

When Leslie wants to generate a report about Finland, she has to use $lookup to combine information from all five collections. She wants to optimize her database performance, so she decides to leverage embedding to combine information from her five collections into a single collection.
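To make the cost concrete, here is a sketch of what the report query might look like against the five separate collections. The collection names follow the post; the `country_id` and `countries` reference fields are assumptions, so treat this as an illustration of the shape rather than Leslie's actual query.

```javascript
// One $match on the country, then one $lookup per additional collection.
const countryReportPipeline = [
  { $match: { _id: "finland" } },
  { $lookup: { from: "Resources", localField: "_id", foreignField: "country_id", as: "resources" } },
  { $lookup: { from: "Delegates", localField: "_id", foreignField: "country_id", as: "delegates" } },
  { $lookup: { from: "Policies", localField: "_id", foreignField: "country_id", as: "policies" } },
  // Assumes each event stores an array of participating country ids.
  { $lookup: { from: "Events", localField: "_id", foreignField: "countries", as: "events" } },
];

// In mongosh this would run as: db.Countries.aggregate(countryReportPipeline)
const lookupCount = countryReportPipeline.filter((stage) => "$lookup" in stage).length;
console.log(`$lookup stages per report: ${lookupCount}`);
```

Every report pays for all four joins, which is the work the following steps remove one collection at a time.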

Leslie begins working on improving her schema incrementally. As she looks at her schema, she realizes that she has a one-to-one relationship between documents in her Countries collection and her Resources collection. She decides to embed the information from the Resources collection as subdocuments in the documents in her Countries collection.

Now the document for Finland looks like the following.

Code Snippet
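A sketch of what that document plausibly looks like at this stage, taken from the corresponding fields of the final Finland document shown at the end of this example:

```javascript
// Finland after embedding its Resources document as a subdocument.
const finland = {
  _id: "finland",
  official_name: "Republic of Finland",
  capital: "Helsinki",
  languages: ["Finnish", "Swedish", "Sámi"],
  population: 5528737,
  resources: {
    lions: 32563,
    military_personnel: 0,
    pulp: 0,
    paper: 0,
  },
};

// Resources are now read straight off the country document.
console.log(finland.resources.lions);
```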

As you can see above, she has kept the information about resources together as a subdocument in her document for Finland. This is an easy way to keep data organized.

She has no need for her Resources collection anymore, so she deletes it.

At this point, she can retrieve information about a country and its resources without having to use $lookup.

Leslie continues analyzing her schema. She realizes she has a one-to-many relationship between countries and delegates, so she decides to create an array named delegates in her Countries documents. Each delegates array will store objects with delegate information. Now her document for Finland looks like the following:

Code Snippet
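A sketch of the delegates portion of that document, again based on the final Finland document at the end of this example (the other fields are left out here to keep the focus on the new array):

```javascript
// Finland after embedding its delegates as an array of objects.
const finland = {
  _id: "finland",
  official_name: "Republic of Finland",
  // capital, languages, population, and resources omitted for brevity
  delegates: [
    { first_name: "Andy", last_name: "Fryer" },
    { first_name: "Donna", last_name: "Beagle" },
  ],
};

// Delegate names come from the country document itself, with no join.
const delegateNames = finland.delegates.map((d) => `${d.first_name} ${d.last_name}`);
console.log(delegateNames.join(", "));
```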

Leslie feels confident about storing the delegate information in her country documents since each country will have only a handful of delegates (meaning her array won't grow infinitely), and she won't be frequently accessing information about the delegates separately from their associated countries.

Leslie no longer needs her Delegates collection, so she deletes it.

Leslie continues optimizing her schema and begins looking at her Policies collection. She has a one-to-many relationship between countries and policies. She needs to include the titles and dates of each country's five most recent policy documents in her report. She considers embedding the policy documents in her country documents, but the documents could quickly become quite large based on the length of the policies. She doesn't want to fall into the trap of the Bloated Documents Anti-Pattern, but she also wants to avoid using $lookup every time she runs a report.

Leslie decides to leverage the Subset Pattern. She stores the titles and dates of the five most recent policy documents in her country document. She also stores a reference to each full policy document, so she can easily gather all of the information for a policy when needed. She leaves her Policies collection as-is. She knows she'll have to maintain some duplicate information between the documents in the Countries collection and the Policies collection, but she decides duplicating a little bit of information is a good tradeoff to ensure fast queries.
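Keeping that duplicated subset in sync is the application's job. One possible approach, sketched below under assumptions, is to mirror each newly written policy into the country's `recent-policies` array and trim it to five entries. The helper names are hypothetical, and the second function only builds an update document; it relies on the `$push` modifiers `$each`, `$sort`, and `$slice`.

```javascript
// Build the small summary that gets duplicated into the country document.
function policySummary(policy) {
  return {
    "policy-id": policy._id,
    "date-created": policy["date-created"],
    title: policy.title,
  };
}

// In-memory version: append the summary, sort oldest first, keep the newest `limit`.
function withRecentPolicy(country, policy, limit = 5) {
  const recent = [...(country["recent-policies"] || []), policySummary(policy)]
    .sort((a, b) => a["date-created"] - b["date-created"])
    .slice(-limit);
  return { ...country, "recent-policies": recent };
}

// Rough server-side equivalent: an update document that could be passed to
// updateOne on the Countries collection.
function recentPolicyUpdate(policy, limit = 5) {
  return {
    $push: {
      "recent-policies": {
        $each: [policySummary(policy)],
        $sort: { "date-created": 1 },
        $slice: -limit,
      },
    },
  };
}

// Usage: add six policies and confirm only the newest five are kept.
let finland = { _id: "finland" };
for (let day = 1; day <= 6; day++) {
  finland = withRecentPolicy(finland, {
    _id: `policy-${day}`,
    "date-created": new Date(Date.UTC(2011, 10, day)),
    title: `Policy ${day}`,
  });
}
console.log(finland["recent-policies"].map((p) => p.title));
```

Writing the policy and updating the country are two separate operations in this sketch, so a real application would also need to decide what happens if one of them fails.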

Her document for Finland now looks like the following:

Code Snippet
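A sketch of the policy portion of that document, based on the final Finland document at the end of this example, along with how a stored reference might be followed when the full policy is needed. The `policies` array stands in for the Policies collection, its `text` field is hypothetical, and ids are plain strings here rather than ObjectId values.

```javascript
// Finland with a subset of each recent policy embedded.
const finland = {
  _id: "finland",
  // basic stats, resources, and delegates omitted for brevity
  "recent-policies": [
    {
      "policy-id": "5ef34ec43e5f7febbd3ed7fb",
      "date-created": new Date("2011-11-09T04:00:00.000Z"),
      title: "Country Defense Policy",
    },
    {
      "policy-id": "5ef357bb3e5f7febbd3ed7fd",
      "date-created": new Date("2011-11-10T04:00:00.000Z"),
      title: "Humanitarian Food Policy",
    },
  ],
};

// Stand-in for the Policies collection, which keeps the full documents.
const policies = [
  { _id: "5ef34ec43e5f7febbd3ed7fb", title: "Country Defense Policy", text: "(full policy text)" },
  { _id: "5ef357bb3e5f7febbd3ed7fd", title: "Humanitarian Food Policy", text: "(full policy text)" },
];

// Follow the reference only when the complete policy is required.
// In mongosh: db.Policies.findOne({ _id: ref })
const ref = finland["recent-policies"][0]["policy-id"];
const fullPolicy = policies.find((p) => p._id === ref);
console.log(fullPolicy.title);
```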

Leslie continues examining her query for her report on each country. The last $lookup she has combines information from the Countries collection and the Events collection. She has a many-to-many relationship between countries and events. She needs to be able to quickly generate reports on each event as a whole, so she wants to keep the Events collection separate. She decides to use the Extended Reference Pattern to solve her dilemma. She includes the information she needs about each event in her country documents and maintains a reference to the complete event document, so she can get more information when she needs to. She will duplicate the event date and event topic in both the Countries and Events collections, but she is comfortable with this as that data is very unlikely to change.

After all of her updates, her document for Finland now looks like the following:


//  Countries collection

{
   "_id": "finland",
   "official_name": "Republic of Finland",
   "capital": "Helsinki",
   "languages": [
      "Finnish",
      "Swedish",
      "Sámi"
   ],
   "population": 5528737,
   "resources": {
      "lions": 32563,
      "military_personnel": 0,
      "pulp": 0,
      "paper": 0
   },
   "delegates": [
      {
         "first_name": "Andy",
         "last_name": "Fryer"
      },
      {
         "first_name": "Donna",
         "last_name": "Beagle"
      }
   ],
   "recent-policies": [
      {
         "policy-id": ObjectId("5ef34ec43e5f7febbd3ed7fb"),
         "date-created": ISODate("2011-11-09T04:00:00.000+00:00"),
         "title": "Country Defense Policy"
      },
      {
         "policy-id": ObjectId("5ef357bb3e5f7febbd3ed7fd"),
         "date-created": ISODate("2011-11-10T04:00:00.000+00:00"),
         "title": "Humanitarian Food Policy"
      }
   ],
   "events": [
      {
         "event-id": ObjectId("5ef34faa3e5f7febbd3ed7fc"),
         "event-date": ISODate("2011-11-10T05:00:00.000+00:00"),
         "topic": "Global Food Crisis"
      },
      {
         "event-id": ObjectId("5ef35ac93e5f7febbd3ed7fe"),
         "event-date": ISODate("2012-02-18T05:00:00.000+00:00"),
         "topic": "Pandemic"
      }
   ]
}
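With everything the country report needs in one place, the report can be assembled from a single document read. The sketch below assumes a hypothetical `countryReport` helper and a trimmed copy of the Finland document; ids are left out and dates are plain `Date` objects so it runs outside the shell.

```javascript
// Build the country report from one document, with no $lookup.
function countryReport(country) {
  return {
    name: country.official_name,
    capital: country.capital,
    population: country.population,
    resources: country.resources,
    delegates: country.delegates.map((d) => `${d.first_name} ${d.last_name}`),
    recentPolicies: country["recent-policies"].map((p) => ({
      title: p.title,
      date: p["date-created"],
    })),
    events: country.events.map((e) => ({ topic: e.topic, date: e["event-date"] })),
  };
}

// In mongosh the document would come from: db.Countries.findOne({ _id: "finland" })
const finland = {
  _id: "finland",
  official_name: "Republic of Finland",
  capital: "Helsinki",
  population: 5528737,
  resources: { lions: 32563, military_personnel: 0, pulp: 0, paper: 0 },
  delegates: [
    { first_name: "Andy", last_name: "Fryer" },
    { first_name: "Donna", last_name: "Beagle" },
  ],
  "recent-policies": [
    { "date-created": new Date("2011-11-09T04:00:00.000Z"), title: "Country Defense Policy" },
    { "date-created": new Date("2011-11-10T04:00:00.000Z"), title: "Humanitarian Food Policy" },
  ],
  events: [
    { "event-date": new Date("2011-11-10T05:00:00.000Z"), topic: "Global Food Crisis" },
    { "event-date": new Date("2012-02-18T05:00:00.000Z"), topic: "Pandemic" },
  ],
};

const report = countryReport(finland);
console.log(JSON.stringify(report, null, 2));
```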

Summary

Data that is accessed together should be stored together. If you'll be frequently reading or updating information together, consider storing the information together using nested documents or arrays. Carefully consider your use case and weigh the benefits and drawbacks of data duplication as you bring data together.

Be on the lookout for a post on the final MongoDB schema design anti-pattern!

When you're ready to build a schema in MongoDB, check out MongoDB Atlas, MongoDB's fully managed database-as-a-service. Atlas is the easiest way to get started with MongoDB and has a generous, forever-free tier.

This is part of a series: MongoDB Schema Design Anti-Patterns
