Schema Design Anti-Patterns Video Series is Live on YouTube

Hey Peeps,

All three videos in my schema design anti-patterns series are now live. You can check them out at Schema Design Anti-Patterns Series - YouTube.

If you prefer blog posts to videos, check out the blog series that covers the same topics: A Summary of Schema Design Anti-Patterns and How to Spot Them | MongoDB

Comment here if you have any questions. :slight_smile:

5 Likes

Hello @Lauren_Schaefer,

I enjoyed all 3 parts of your video series very much, together with building-with-patterns-a-summary. I think it’s really great, and I hope there will be more of these; some real-world use cases and examples would definitely make them even better.

Having said that, most of the videos, guides, and patterns seem to focus on the read side, so I thought maybe there could be a series that focuses on writes? Like what’s the best way to handle writes when the need arises.

For example, as mentioned in the extended-reference-pattern, we duplicate a few fields that are frequently accessed but rarely updated into the collection so that reads are fast, which I agree with. But I don’t know the best way to update my collections in the event that the duplicated data changes. Let me try to give an example of that.

I have x collections, each holding the same subset of data via the extended-reference-pattern.

// collection-a
{
  name: 'a1',
  epd: {
    profile_key: '12345',
    profile_name: 'helloworld',
  } 
}

// collection-b
{
  name: 'b1',
  epd: {
    profile_key: '12345',
    profile_name: 'helloworld',
  } 
}

// continue for up to collection-n

// collection epd
{
  profile_key: '12345',
  profile_name: 'helloworld',
  // and some more fields
}

Assuming that profile_name rarely gets updated, but if it ever does, what would be the best way to handle this kind of update across numerous collections? Is there a good pattern for this sort of thing?

This is just one example, but I hope it brings the point across: most of the guides only tell us the pros of the patterns and name the cons (e.g., data duplication), but they don’t tell us how to handle the duplicated data properly whenever it does get updated. I think that’s just as important as identifying the pattern that resolves the read issue.

Another quick example would be the subset-pattern: it tells us to keep only a few documents in collection-a so we can quickly access a subset of the information, but it doesn’t tell us how to update those docs effectively/efficiently.

// movies
{
  title: 'fast',
  reviews: [
    // keep top 10 here
  ]
}

// reviews
// all the rest of the reviews

Say I have a review that will be promoted into the movie’s top 10 reviews, and one of the current top 10 reviews will be demoted to the reviews collection. Is there any recommended pattern/practice for this sort of thing?

I hope I’m making some sense here, because these are the things I don’t quite fully understand.

Lastly, thank you for all the videos and blog posts, which are very useful, and keep them coming!

1 Like

Hey @Joseph_Gan!

Great questions! I’ll be thinking about how we can answer these better in the future. In the meantime, let’s get to your specific questions.

As with almost everything in MongoDB, the answer is “it depends on your use case.” I’ll walk through one possible solution, but know that there may be a better way based on your use case.

Regarding the extended-reference pattern question:
One option would be to handle the updates at the application level. You could run something like db.collA.updateMany({'epd.profile_key': '12345'},{$set: {'epd.profile_name': 'new'}}) whenever you make an update to the profile name. You’d have to ensure that your development team knew about this data duplication and handled it correctly every time.

Another option would be to use change streams or triggers. These allow you to watch for updates in a collection and take a specified action whenever an update occurs. That way you wouldn’t have to remember to update the duplicated documents in your application code every time.
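For example, here’s a minimal sketch of that idea using the Node.js driver. The connection string, database name, and collection names are assumptions based on your example, and change streams require a replica set:

// Minimal sketch, assuming the Node.js driver and the collection
// names from the example above. Change streams require a replica set.
const { MongoClient } = require('mongodb');

async function watchProfileNames() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('test');

  // Watch the epd collection for updates that touch profile_name
  const changeStream = db.collection('epd').watch(
    [{ $match: {
        operationType: 'update',
        'updateDescription.updatedFields.profile_name': { $exists: true }
    } }],
    { fullDocument: 'updateLookup' }
  );

  for await (const change of changeStream) {
    const { profile_key, profile_name } = change.fullDocument;

    // Fan the new value out to every collection that duplicates it
    for (const name of ['collection-a', 'collection-b' /* ...collection-n */]) {
      await db.collection(name).updateMany(
        { 'epd.profile_key': profile_key },
        { $set: { 'epd.profile_name': profile_name } }
      );
    }
  }
}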

Another thing to consider is how detrimental data inconsistency would be. You may want to wrap your updates in a transaction to ensure that the update occurs in all of the collections or none of the collections.
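As a rough sketch of what that could look like, reusing the client and db from the sketch above (transactions also require a replica set, and the collection names are again from your example):

// Rough sketch: update the duplicated field everywhere, all-or-nothing.
const session = client.startSession();
try {
  await session.withTransaction(async () => {
    await db.collection('epd').updateOne(
      { profile_key: '12345' },
      { $set: { profile_name: 'new' } },
      { session });

    const filter = { 'epd.profile_key': '12345' };
    const update = { $set: { 'epd.profile_name': 'new' } };
    await db.collection('collection-a').updateMany(filter, update, { session });
    await db.collection('collection-b').updateMany(filter, update, { session });
    // ...and so on for the rest of the collections
  });
} finally {
  await session.endSession();
}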

Regarding the Subset Pattern question:

Again, the best way to do this will depend on how you’ve implemented it and your particular use case.

I’d likely keep all of the information about every review in the reviews collection – regardless of whether it’s also in the movies collection. In the reviews array in the movies collection, I’d keep just a summary of each review (e.g., the reviewer’s name and rating). Whenever you’re adding or updating a review, check whether its rating is high enough to put it in the movies collection. If so, remove the lowest-rated embedded review and add the new one.
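For example, here’s a sketch of that flow with the Node.js driver (field names like rating and text are assumptions for illustration):

// Sketch: keep the 10 highest-rated review summaries embedded in the
// movie doc; the full review always lives in the reviews collection.
async function addReview(db, movieTitle, review) {
  const { insertedId } = await db.collection('reviews').insertOne(review);

  // Push the summary, sort the array by rating descending, and slice
  // it back to 10 -- whatever falls off is effectively "demoted", and
  // nothing is lost because its full copy is still in reviews.
  await db.collection('movies').updateOne(
    { title: movieTitle },
    { $push: {
        reviews: {
          $each: [{ reviewId: insertedId, rating: review.rating, text: review.text }],
          $sort: { rating: -1 },
          $slice: 10
        }
    } }
  );
}

The $each/$sort/$slice combination trims the embedded array in a single atomic update, so there’s no separate demote step to coordinate.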

Again, we could consider whether using a transaction, change stream, and/or trigger would help.

Thank you for your response.

Regarding the extended-reference-pattern, when you mentioned updating at the application level, you meant that whenever the profile name is updated, it triggers an update to all collections:

db.collA.updateMany({'epd.profile_key': '12345'},{$set: {'epd.profile_name': 'new'}})
db.collB.updateMany({'epd.profile_key': '12345'},{$set: {'epd.profile_name': 'new'}})
db.collC.updateMany({'epd.profile_key': '12345'},{$set: {'epd.profile_name': 'new'}})
// and so on

The con of this would be that the application needs to know which collections exist and must be updated whenever a new collection is added. Am I right to say so? Taking the example slightly further: if there are fewer than 100k documents per such collection (maybe 1 or 2 would have more, say up to a million), would there be any difference between using updateMany and bulkWrite? When should I use one over the other?

Going back to the other options you mentioned: triggers don’t look like an option for me since it’s an on-prem setup, so my current thinking is to go with something like a change stream that listens for the change and then updates all the collections that need to be updated. Would I need to be mindful of concurrency issues here, where some other action triggers an update to the document at the same time as the change stream’s update?

// profile.name gets updated
// change stream detects the change and runs the update across collections (a-z)
// another operation triggers an update to a collection-f doc
// change stream updates the docs with the new profile.name but overwrites the previous update (above)

Thanks for the suggestion regarding the subset-pattern.

@Joseph_Gan When I say make the update at the application level, I mean that whenever the application makes an update to the profile_name, the application would need to make updates to all of the collections where profile_name is set. It would look like the code block you created.

Yes, you’re totally right about the cons. The application developers would need to be aware of all of the places where profile_name is duplicated and make those extra updates everywhere in the application that changes it.

updateMany() is a wrapper on top of bulkWrite(), so the performance is the same.

I’m trying to think through the example you gave regarding concurrency. I can’t think of why you would want to set the profile_name in collection-f explicitly. You would probably want to only make the update in the epd collection and then push the changes out everywhere else.

updateMany() is a wrapper on top of bulkWrite(), so the performance is the same.

I was trying to find the docs somewhere but I couldn’t. I remember that you can perform a batch execute with bulkWrite? Something similar to the post written here. Am I able to do the same for updateMany operations?

Regarding the concurrency, I wasn’t making myself clear. I meant that another operation triggers an update to a collection-f doc on a different field:

// profile.name gets updated
// change stream detects the change and runs the update across collections (a-z)
// another operation triggers an update to a collection-f doc on a different field:
colF.update(id, { anotherField: 'not profile.name' });
// the change stream then updates the doc with the new profile.name but overwrites the previous update (above),
// i.e., the change stream's update sets `profile.name` but reverts `anotherField` to its previous state

If I’m patching only the specific field (e.g., with $set), this should not matter, but would it be a problem if I use a full-document update operation?

Hi @Joseph_Gan,

I found that updateMany() is a wrapper for bulkWrite() by using the mongo shell.

I typed

db.foo.updateMany

and the following was returned:

function(filter, update, options) {
    var opts = Object.extend({}, options || {});

    // Pipeline updates are always permitted. Otherwise, we validate the update object.
    if (!Array.isArray(update)) {
        // Check if first key in update statement contains a $
        var keys = Object.keys(update);
        if (keys.length == 0) {
            throw new Error(
                "the update operation document must contain at least one atomic operator");
        }
        // Check if first key does not have the $
        if (keys[0][0] != "$") {
            throw new Error('the update operation document must contain atomic operators');
        }
    }

    // Get the write concern
    var writeConcern = this._createWriteConcern(opts);

    // Result
    var result = {acknowledged: (writeConcern && writeConcern.w == 0) ? false : true};

    // Use bulk operation API already in the shell
    var bulk = this.initializeOrderedBulkOp();

    // Add the updateMany operation
    var op = bulk.find(filter);
    if (opts.upsert) {
        op = op.upsert();
    }

    if (opts.collation) {
        op.collation(opts.collation);
    }

    if (opts.arrayFilters) {
        op.arrayFilters(opts.arrayFilters);
    }

    op.update(update);

    try {
        // Update all documents that match the selector
        var r = bulk.execute(writeConcern);
    } catch (err) {
        if (err instanceof BulkWriteError) {
            if (err.hasWriteErrors()) {
                throw err.getWriteErrorAt(0);
            }

            if (err.hasWriteConcernError()) {
                throw err.getWriteConcernError();
            }
        }

        throw err;
    }

    if (!result.acknowledged) {
        return result;
    }

    result.matchedCount = r.nMatched;
    result.modifiedCount = (r.nModified != null) ? r.nModified : r.n;

    if (r.getUpsertedIds().length > 0) {
        result.upsertedId = r.getUpsertedIdAt(0)._id;
    }

    return result;
}

For more information about bulkWrite, visit https://docs.mongodb.com/manual/core/bulk-write-operations/. For more information about updateMany, visit https://docs.mongodb.com/manual/reference/method/db.collection.updateMany/index.html.

If you want to update many documents, you can choose to use bulkWrite or updateMany and you’ll get the same performance.
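And to your earlier batching question: yes, bulkWrite() lets you batch several operations, including updateMany-style ones, into a single call. Here’s a sketch using the field names from your example:

// Sketch: two updateMany operations batched into one bulkWrite call.
await db.collection('collection-a').bulkWrite(
  [
    { updateMany: {
        filter: { 'epd.profile_key': '12345' },
        update: { $set: { 'epd.profile_name': 'new' } }
    } },
    { updateMany: {
        filter: { 'epd.profile_key': '67890' },
        update: { $set: { 'epd.profile_name': 'another' } }
    } }
  ],
  { ordered: false } // unordered batches let the server process ops in parallel
);

Note that bulkWrite() operates on a single collection, so you’d still need one call per collection when fanning a change out across collection-a through collection-n.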


Regarding concurrency: I see what you mean now.

MongoDB intelligently handles locking for you. If you have two simultaneous transactions that touch the same document, one transaction will occur before the other. You have two options.

  1. You can pull the latest copy of the document, make changes to it, and send it back to the database.
  2. You can choose to update only the fields you want to update.

See https://docs.mongodb.com/manual/faq/concurrency/ for more details.
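To make option 2 concrete with your collection-f example (reusing db from the earlier sketches; docId is a placeholder for the target document’s _id):

// Each writer $sets only its own fields, so the two updates can
// interleave without clobbering each other.

// Writer 1: the change stream handler
await db.collection('collection-f').updateMany(
  { 'epd.profile_key': '12345' },
  { $set: { 'epd.profile_name': 'new' } }
);

// Writer 2: some other operation touching a different field
await db.collection('collection-f').updateOne(
  { _id: docId }, // docId: placeholder
  { $set: { anotherField: 'not profile.name' } }
);

// Neither update touches the other's field, so no overwrite occurs.
// The lost-update scenario you described only happens with a
// full-document replacement based on a stale copy of the document.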

Hi @Lauren_Schaefer,

Thank you for the explanation. It’s much clearer to me now! Appreciate it.

Cheers

1 Like

You’re welcome! :raised_hands: