Bloated Documents

Lauren Schaefer, Daniel Coupal6 min read • Published Feb 12, 2022 • Updated May 31, 2022

MongoDB Schema

Rate this article

Welcome (or welcome back!) to the MongoDB Schema Anti-Patterns series! We're halfway through the series. So far, we've discussed three anti-patterns: massive arrays, massive number of collections, and unnecessary indexes.

Today, let's discuss document size. MongoDB has a 16 MB document size limit. But should you use all 16 MBs? Probably not. Let's find out why.

If your brain feels bloated from too much reading, sit back, relax, and watch this video.

Bloated Documents

Chances are pretty good that you want your queries to be blazing fast. MongoDB wants your queries to be blazing fast too.

To keep your queries running as quickly as possible, WiredTiger (the default storage engine for MongoDB) keeps all of the indexes plus the documents that are accessed the most frequently in memory. We refer to these frequently accessed documents and index pages as the working set. When the working set fits in the RAM allotment, MongoDB can query from memory instead of from disk. Queries from memory are faster, so the goal is to keep your most popular documents small enough to fit in the RAM allotment.

The working set's RAM allotment is the larger of:

50% of (RAM - 1 GB)
256 MB.

For more information on the storage specifics, see Memory Use. If you're using MongoDB Atlas to host your database, see Atlas Sizing and Tier Selection: Memory.

One of the rules of thumb you'll hear frequently when discussing MongoDB schema design is data that is accessed together should be stored together. Note that it doesn't say data that is related to each other should be stored together.

Sometimes data that is related to each other isn't actually accessed together. You might have large, bloated documents that contain information that is related but not actually accessed together frequently. In that case, separate the information into smaller documents in separate collections and use references to connect those documents together.

The opposite of the Bloated Documents Anti-Pattern is the Subset Pattern. The Subset Pattern encourages the use of smaller documents that contain the most frequently accessed data. Check out this post on the Subset Pattern to learn more about how to successfully leverage this pattern.

Example

Let's revisit Leslie's website for inspirational women that we discussed in the previous post. Leslie updates the home page to display a list of the names of 100 randomly selected inspirational women. When a user clicks on the name of an inspirational woman, they will be taken to a new page with all of the detailed biographical information about the woman they selected. Leslie fills the website with 4,704 inspirational women—including herself.

Leslie says she's big enough to admit that she's often inspired by herself

Initially, Leslie decides to create one collection named InspirationalWomen, and creates a document for each inspirational woman. The document contains all of the information for that woman. Below is a document she creates for Sally Ride.

Code Snippet

Leslie notices that her home page is lagging. The home page is the most visited page on her site, and, if the page doesn't load quickly enough, visitors will abandon her site completely.

Leslie is hosting her database on MongoDB Atlas and is using an M10 dedicated cluster. With an M10, she gets 2 GB of RAM. She does some quick calculations and discovers that her working set needs to fit in 0.5 GB. (Remember that her working set can be up to 50% of (2 GB RAM - 1 GB) = 0.5 GB or 256 MB, whichever is larger).

Leslie isn't sure if her working set will currently fit in 0.5 GB of RAM, so she navigates to the Atlas Data Explorer. She can see that her InspirationalWomen collection is 580.29 MB and her index size is 196 KB. When she adds those two together, she can see that she has exceeded her 0.5 GB allotment.

The Atlas Data Explorer shows the size of the InspirationalWomen collection is 580.29 MB and the size of its three associated indexes is 196 KB.

Leslie has two choices: she can restructure her data according to the Subset Pattern to remove the bloated documents, or she can move up to a M20 dedicated cluster, which has 4 GB of RAM. Leslie considers her options and decides that having the home page and the most popular inspirational women's documents load quickly is most important. She decides that having the less frequently viewed women's pages take slightly longer to load is fine.

She begins determining how to restructure her data to optimize for performance. The query on Leslie's homepage only needs to retrieve each woman's first name and last name. Having this information in the working set is crucial. The other information about each woman (including a lengthy bio) doesn't necessarily need to be in the working set.

To ensure her home page loads at a blazing fast pace, she decides to break up the information in her InspirationalWomen collection into two collections: InspirationalWomen_Summary and InspirationalWomen_Details. She creates a manual reference between the matching documents in the collections. Below are her new documents for Sally Ride.

Code Snippet

Leslie updates her query on the home page that retrieves each woman's first name and last name to use the InspirationalWomen_Summary collection. When a user selects a woman to learn more about, Leslie's website code will query for a document in the InspirationalWomen_Details collection using the id stored in the inspirationalwomen_id field.

Leslie returns to Atlas and inspects the size of her databases and collections. She can see that the total index size for both collections is 276 KB (180 KB + 96 KB). She can also see that the size of her InspirationalWomen_Summary collection is about 455 KB. The sum of the indexes and this collection is about 731 KB, which is significantly less than her working set's RAM allocation of 0.5 GB. Because of this, many of the most popular documents from the InspirationalWomen_Details collection will also fit in the working set.

The Atlas Data Explorer shows the total index size for the entire database is 276 KB and the size of the InspirationalWomen_Summary collection is 454.78 KB.

In the example above, Leslie is duplicating all of the data from the InspirationalWomen_Summary collection in the InspirationalWomen_Details collection. You might be cringing at the idea of data duplication. Historically, data duplication has been frowned upon due to space constraints as well as the challenges of keeping the data updated in both collections. Storage is relatively cheap, so we don't necessarily need to worry about that here. Additionally, the data that is duplicated is unlikely to change very often.

In most cases, you won't need to duplicate all of the information in more than one collection; you'll be able to store some of the information in one collection and the rest of the information in the other. It all depends on your use case and how you are using the data.

Summary

Be sure that the indexes and the most frequently used documents fit in the RAM allocation for your database in order to get blazing fast queries. If your working set is exceeding the RAM allocation, check if your documents are bloated with extra information that you don't actually need in the working set. Separate frequently used data from infrequently used data in different collections to optimize your performance.

Check back soon for the next post in this schema design anti-patterns series!

Check out the following resources for more information:

Rate this article

This is part of a series

MongoDB Schema Design Anti-Patterns

Up Next

Separating Data That is Accessed Together

Continue

More in this series

Tutorial

Microservices Architecture With Java, Spring, and MongoDB

Apr 17, 2024 | 4 min read

Tutorial

Building with Patterns: The Subset Pattern

Sep 23, 2022 | 3 min read

Tutorial

MongoDB Time Series with C++

Apr 03, 2024 | 6 min read

Tutorial

Client-Side Field Level Encryption (CSFLE) in MongoDB with Golang

Feb 03, 2023 | 15 min read

Bloated Documents
Example
Summary
Related Links

MongoDB

Bloated Documents

Bloated Documents

Example

Summary

Related Links

This is part of a series

MongoDB Schema Design Anti-Patterns

Up Next

More in this series

Related

Microservices Architecture With Java, Spring, and MongoDB

Building with Patterns: The Subset Pattern

MongoDB Time Series with C++

Client-Side Field Level Encryption (CSFLE) in MongoDB with Golang

Table of Contents