
Massive Arrays

Published: Apr 22, 2020

  • MongoDB
  • Schema Design

By Lauren Schaefer and Daniel Coupal


Design patterns are a fundamental part of software engineering. They provide developers with best practices and a common language as they architect applications.

At MongoDB, we have schema design patterns to help developers be successful as they plan and iterate on their schema designs. Daniel Coupal and Ken Alger co-wrote a fantastic blog series that highlights each of the schema design patterns. If you really want to dive into the details (and I recommend you do!), check out MongoDB University's free course on Data Modeling.

Sometimes, developers jump right into designing their schemas and building their apps without thinking about best practices. As their apps begin to scale, they realize that things are bad.

Oh, this is bad. I should not have done this.

We've identified several common mistakes developers make with MongoDB. We call these mistakes "schema design anti-patterns."

Throughout this blog series, I'll introduce you to six common anti-patterns. Let's start today with the Massive Arrays anti-pattern.

#Massive Arrays

One of the rules of thumb when modeling data in MongoDB is that data that is accessed together should be stored together. If you'll frequently be retrieving or updating data together, you should probably store it together. Data is commonly stored together by embedding related information in subdocuments or arrays.

The problem is that sometimes developers take this too far and embed massive amounts of information in a single document.

Consider an example where we store information about employees who work in various government buildings. If we were to embed the employees in the building document, we might store our data in a buildings collection like the following:

// buildings collection
{
  "_id": "city_hall",
  "name": "City Hall",
  "city": "Pawnee",
  "state": "IN",
  "employees": [
    {
      "_id": 123456789,
      "first": "Leslie",
      "last": "Yepp",
      "cell": "8125552344",
      "start-year": "2004"
    },
    {
      "_id": 234567890,
      "first": "Ron",
      "last": "Swandaughter",
      "cell": "8125559347",
      "start-year": "2002"
    }
  ]
}

In this example, the employees array is unbounded. As we begin storing information about all of the employees who work in City Hall, the employees array will become massive, potentially pushing the document past MongoDB's 16 MB document size limit. Additionally, reading arrays and building indexes on them gradually become less performant as array size increases.
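To get a feel for how an unbounded array eats into the 16 MB limit, here is a rough back-of-the-envelope sketch. It assumes that the UTF-8 length of the JSON form is a usable stand-in for BSON size (it is only an approximation), and `approxBytes` and `estimateMaxEmployees` are hypothetical helper names, not MongoDB APIs.

```javascript
// MongoDB's maximum BSON document size: 16 MB (16 * 1024 * 1024 bytes).
const MAX_DOC_BYTES = 16 * 1024 * 1024;

// ASSUMPTION: UTF-8 JSON length is only a rough stand-in for BSON size.
function approxBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// Hypothetical helper: estimates how many employees of a typical size fit
// in one building document before it reaches the size limit.
function estimateMaxEmployees(building, typicalEmployee) {
  const baseBytes = approxBytes({ ...building, employees: [] });
  const perEmployeeBytes = approxBytes(typicalEmployee) + 1; // +1 for the comma separator
  return Math.floor((MAX_DOC_BYTES - baseBytes) / perEmployeeBytes);
}

const cityHall = { _id: "city_hall", name: "City Hall", city: "Pawnee", state: "IN" };
const leslie = {
  _id: 123456789,
  first: "Leslie",
  last: "Yepp",
  cell: "8125552344",
  "start-year": "2004",
};

console.log(estimateMaxEmployees(cityHall, leslie));
```

The hard limit is only part of the story: the read and indexing costs described above typically show up long before a document gets anywhere near it, and real employee subdocuments are usually larger than this sample.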

The example above illustrates the massive arrays anti-pattern.

So how can we fix this?

Instead of embedding the employees in the buildings documents, we could flip the model and instead embed the buildings in the employees documents:

// employees collection
{
  "_id": 123456789,
  "first": "Leslie",
  "last": "Yepp",
  "cell": "8125552344",
  "start-year": "2004",
  "building": {
    "_id": "city_hall",
    "name": "City Hall",
    "city": "Pawnee",
    "state": "IN"
  }
},
{
  "_id": 234567890,
  "first": "Ron",
  "last": "Swandaughter",
  "cell": "8125559347",
  "start-year": "2002",
  "building": {
    "_id": "city_hall",
    "name": "City Hall",
    "city": "Pawnee",
    "state": "IN"
  }
}

In the example above, we are repeating the information about City Hall in the document for each City Hall employee. If we are frequently displaying information about an employee and their building in our application together, this model probably makes sense.
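If you already have data in the embedded-array shape, flipping the model could look something like the following in-memory sketch. `flipToEmployeeDocs` is a hypothetical helper written for illustration; a real migration would also need to read from and write to the database.

```javascript
// Hypothetical migration helper: turns one building document with an embedded
// employees array into one document per employee, each embedding the building.
function flipToEmployeeDocs(buildingDoc) {
  // Separate the array from the rest of the building's fields.
  const { employees = [], ...building } = buildingDoc;
  // Copy the building into every employee document.
  return employees.map((employee) => ({ ...employee, building: { ...building } }));
}

const cityHallWithEmployees = {
  _id: "city_hall",
  name: "City Hall",
  city: "Pawnee",
  state: "IN",
  employees: [
    { _id: 123456789, first: "Leslie", last: "Yepp", cell: "8125552344", "start-year": "2004" },
    { _id: 234567890, first: "Ron", last: "Swandaughter", cell: "8125559347", "start-year": "2002" },
  ],
};

const employeeDocs = flipToEmployeeDocs(cityHallWithEmployees);
console.log(JSON.stringify(employeeDocs, null, 2));
```

Each resulting document has a fixed size no matter how many people work in the building, which is what removes the unbounded growth.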

The disadvantage with this approach is we have a lot of data duplication. Storage is cheap, so data duplication isn't necessarily a problem from a storage cost perspective. However, every time we need to update information about City Hall, we'll need to update the document for every employee who works there. If we take a look at the information we're currently storing about the buildings, updates will likely be very infrequent, so this approach may be a good one.
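To make the update cost concrete, here is a small sketch of what changing a building field would involve under this model. `buildingFieldUpdate` is a hypothetical helper that only builds the filter and update documents, and the new name "Pawnee City Hall" is made up for illustration. The resulting objects are in the shape accepted by the `updateMany` collection method.

```javascript
// Hypothetical helper: builds the filter and update documents needed to
// change fields of a building that is embedded in every employee document.
function buildingFieldUpdate(buildingId, changes) {
  const set = {};
  for (const [field, value] of Object.entries(changes)) {
    set[`building.${field}`] = value; // dot notation targets the embedded field
  }
  return {
    filter: { "building._id": buildingId }, // matches every employee of the building
    update: { $set: set },
  };
}

const renameSpec = buildingFieldUpdate("city_hall", { name: "Pawnee City Hall" });

// With a connected shell or driver, the call would look like:
//   db.employees.updateMany(renameSpec.filter, renameSpec.update)
console.log(JSON.stringify(renameSpec, null, 2));
```

A single statement covers the change, but the server still has to rewrite one document per employee, which is why the frequency of building updates matters.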

If our use case does not call for information about employees and their building to be displayed or updated together, we may want to instead separate the information into two collections and use references to link them:

// buildings collection
{
  "_id": "city_hall",
  "name": "City Hall",
  "city": "Pawnee",
  "state": "IN"
}

// employees collection
{
  "_id": 123456789,
  "first": "Leslie",
  "last": "Yepp",
  "cell": "8125552344",
  "start-year": "2004",
  "building_id": "city_hall"
},
{
  "_id": 234567890,
  "first": "Ron",
  "last": "Swandaughter",
  "cell": "8125559347",
  "start-year": "2002",
  "building_id": "city_hall"
}

Here we have completely separated our data. We have eliminated massive arrays, and we have no data duplication.

The drawback is that if we need to retrieve information about an employee and their building together, we'll need to use $lookup to join the data together. $lookup operations can be expensive, so it's important to consider how often you'll need to perform $lookup if you choose this option.
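As a sketch of what that join might look like, the function below builds an aggregation pipeline for the referenced model above. `employeeWithBuildingPipeline` is a hypothetical helper name; the `$match`, `$lookup`, and `$unwind` stages are standard aggregation stages, and the collection and field names come from the example documents.

```javascript
// Hypothetical helper: builds a pipeline that fetches one employee and
// joins in the building referenced by building_id.
function employeeWithBuildingPipeline(employeeId) {
  return [
    // Narrow to a single employee first so the join runs on one document.
    { $match: { _id: employeeId } },
    {
      $lookup: {
        from: "buildings",          // collection to join with
        localField: "building_id",  // reference stored on the employee
        foreignField: "_id",        // key in the buildings collection
        as: "building",             // output field (always an array)
      },
    },
    // Turn the one-element array into a single embedded document.
    // Note: by default, $unwind drops employees with no matching building.
    { $unwind: "$building" },
  ];
}

// With a connected shell or driver, the call would look like:
//   db.employees.aggregate(employeeWithBuildingPipeline(123456789))
console.log(JSON.stringify(employeeWithBuildingPipeline(123456789), null, 2));
```

Placing `$match` first keeps the join limited to the documents you actually need, which helps contain the cost of `$lookup`.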

If we find ourselves frequently using $lookup, another option is to use the extended reference pattern. The extended reference pattern is a mixture of the previous two approaches where we duplicate some—but not all—of the data in the two collections. We only duplicate the data that is frequently accessed together.

For example, if our application has a user profile page that displays information about the user as well as the name of the building and the state where they work, we may want to embed the building name and state fields in the employee document:

// buildings collection
{
  "_id": "city_hall",
  "name": "City Hall",
  "city": "Pawnee",
  "state": "IN"
}

// employees collection
{
  "_id": 123456789,
  "first": "Leslie",
  "last": "Yepp",
  "cell": "8125552344",
  "start-year": "2004",
  "building": {
    "name": "City Hall",
    "state": "IN"
  }
},
{
  "_id": 234567890,
  "first": "Ron",
  "last": "Swandaughter",
  "cell": "8125559347",
  "start-year": "2002",
  "building": {
    "name": "City Hall",
    "state": "IN"
  }
}

As we saw when we duplicated data previously, we should be mindful of duplicating data that will frequently be updated. In this particular case, the name of the building and the state the building is in are very unlikely to change, so this solution works.
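One way to derive the extended reference is to copy a fixed list of fields from the building into each employee. The sketch below assumes that only `name` and `state` are read together with the employee, as in the profile page example; `toExtendedReference` and `withExtendedReference` are hypothetical helper names.

```javascript
// ASSUMPTION: only these building fields are displayed with the employee.
const EXTENDED_REFERENCE_FIELDS = ["name", "state"];

// Hypothetical helper: copies only the selected fields from a building.
function toExtendedReference(building, fields = EXTENDED_REFERENCE_FIELDS) {
  return Object.fromEntries(
    fields.filter((field) => field in building).map((field) => [field, building[field]])
  );
}

// Hypothetical helper: replaces the building_id reference with the embedded
// subset, mirroring the shape of the example documents above.
function withExtendedReference(employee, building) {
  const { building_id, ...rest } = employee;
  return { ...rest, building: toExtendedReference(building) };
}

const cityHallDoc = { _id: "city_hall", name: "City Hall", city: "Pawnee", state: "IN" };
const ron = {
  _id: 234567890,
  first: "Ron",
  last: "Swandaughter",
  cell: "8125559347",
  "start-year": "2002",
  building_id: "city_hall",
};

console.log(JSON.stringify(withExtendedReference(ron, cityHallDoc), null, 2));
```

Keeping the field list in one place makes it easier to see exactly which building fields would need to be propagated if they ever changed.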

#Summary

Storing related information that you'll be frequently querying together is generally good. However, storing information in massive arrays that will continue to grow over time is generally bad.

As is true with all MongoDB schema design patterns and anti-patterns, carefully consider your use case—the data you will store and how you will query it—in order to determine what schema design is best for you.

Be on the lookout for more posts in this anti-patterns series in the coming weeks.

When you're ready to build a schema in MongoDB, check out MongoDB Atlas, MongoDB's fully managed database-as-a-service. Atlas is the easiest way to get started with MongoDB. With a forever-free tier and promo code LAUREN200 for when you're ready to move beyond the free tier, you're on your way to realizing the full value of MongoDB.

Check out the following resources for more information:

More from this series

MongoDB Schema Design Anti-Patterns
  • Massive Arrays
  • Massive Number of Collections
  • Unnecessary Indexes
  • Bloated Documents
  • Separating Data That is Accessed Together
  • Case-Insensitive Queries Without Case-Insensitive Indexes

Related

MongoDB University M320: Data Modeling
Building with Patterns
Data Modeling Introduction

© MongoDB, Inc.