Data model for change tracking of document revisions

I would love to hear some opinions on a data model I’m working on.

I have an existing system that tracks changes to documents, such that for every current revision of a distinct document there is a history of everything that has happened to it since insertion. I dislike my existing system and am considering some alternative approaches.

  1. Separate collections (current system).
  2. Single collection, separate documents.
  3. Single collection, single document.
  4. Change streams.

Briefly, a document looks something like this:

{ "id":"foo", "revision":1, "value":"First version"}

Modifying the value field increments the version number and a log of its previous state is written along with the update. In reality there are many more document properties, but this is the gist of it. More about my design goals after I’ve described the approaches I’m considering.


1) Separate collections.
The current system maintains a master collection where all documents are updated, thus maintaining a single current state. For every update, I insert its previous state into a separate “revisions” collection.

The main reason I dislike this is that updates are not atomic and for reliability I need to use transactions. This approach also doubles my collections, which may not sound important, except I have lots. That’s another story, but for now halving my collections would be a bonus.


2) Single collection, separate documents.
Another route I’m investigating is to store every revision as a separate document in the same collection. This means that every “update” is actually an insert. Listing all current documents would be done with an aggregate query such that only the document with the highest revision number is returned.

My main concern with this approach is that two people making an update at roughly same time (having both pulled version n into application memory) could end up both inserting the same incremented revision number (n+1). I’m also unsure about the performance of the aggregate query as the number of revisions for each document becomes larger.


3) Single collection, single document.
My favourite route is to store a complete history of each document within the document. Each update pushes the dirty state of the document being modified into an array.

This works well and is atomic, but I am concerned about the 16MB limit of document size. Although unlikely it is certainly a possibility and I would need a backup plan for overflow data.


4) Change streams.
I’ve done some experiments with change streams whereby every update is detected by a constantly running background process which logs the change event to another collection.

I feel like this is too precarious to rely upon for data integrity. If an event was missed for some reason (e.g. momentary downtime) then the data is gone. I am keen that every update from my application is guaranteed to log the change, as long as the main document update succeeds.


In terms of design goals:

Performance: The most common action will be querying/listing the current revision of documents. The next most common action is probably updating them. The least common action is likely to be querying historical revisions. The performance of that last part is not of much concern.

Integrity: I want to ensure that if a document update succeeds that the change logging also succeeds. One without the other must be impossible.

I’m open to suggestions for new approaches, or tips to make these ones work well.

Thanks for reading!

Hello @timw, that is a lot of detail about your design plans! At the first look, I am leaning on the option “3) Single collection, single document” , but for a question.

The question is what do you do with the logged data and the performance requirement? A thought is that to overcome the document limit of 16 MB (you had already mentioned about this), you can periodically archive the log data to another collection. And, does this fit into your usage plans of this data?

Another point is, do you know how many operations are going to be in the history log for a document over a period of time, say in 1, 2, 5 and 10 years? And what is the size of each log document? And, accordingly you can know how much log data can be stored for a document (within the parent document). Note that 16 MB is a lot of data.

The other options would be Change Streams (this will be a separate history collection) or populate a separate collection in a Transaction. That is assuming your deployment is a Replica Set or a Sharded Cluster.

Thanks for the reply. I’m leaning toward option 3 too. It nicely encapsulates everything and I think has the lowest risks compared with the other ideas.

I mentioned briefly that my performance requirement for querying the historical data is not critical. It will mostly be for analytics and rollbacks. It’s much more important that the reading and writing of the current revision is performant.

As to the 16MB limit. Hitting this would be a real edge case. Most editable data will be around 1KB and a single document would be very unlikely to every receive anything close to 16,000 revisions during “normal” use! I’d expect even 100 revisions to be unusually high. However there is nothing to stop a user creating a 1MB document, which would then be seriously restricted to 16 edits. This would be very unusual and I could place a size limit to reduce the risk.

As long as there exists a hard limit I must have a plan for it. People do weird things, and when they’re paying customers it’s difficult to tell them they’re “using it wrong”. Furthermore the system provides a remote API, so it’s not hard to imagine a situation where two automated processes end up competing with one another and creating an infinite series of back-and-forth edits. I’ve learned to rule nothing out as impossible :slight_smile:

I’m not keen to systematically offload historical data from all documents, as this would add complexity I’m trying to remove. I would however be ok with an overflow method. Possibly adding a pointer to another document if one became too large. Regarding performance of historical data, I don’t see this as a problem and it would rarely be used - if ever.

See the Outlier Pattern, this addresses creating an overflow document linked to the original document. This method applies in your situation as the “outlier” or “overflow” document may be required one in a n (some large number) number of documents. The main issue is that the outlier documents are to be handled differently in your application.

Thanks for the link. Exactly the situation here!