Multi-tenancy and shared data

We’re currently planning a large refactoring of our monolithic system, and we’re considering replacing most of our relational database with MongoDB.

We realize this means rethinking our data model from square one.

One aspect of the refactoring is multi-tenancy, and we can’t quite figure out how to solve it.

Let’s say you have a core system with permissions, default data, and a bunch of different types for all types of things (no pun intended).

It would be nice to share these core system assets between tenants, which might live in different databases that share the same data structure.

These things could be handled in the backend models, fetching the right data in the right place.
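
For illustration, a minimal sketch of what we have in mind, assuming pymongo; the database and collection names here are placeholders:

```python
# Sketch only: shared "core" data in one database, tenant data in another.
# All names (core, tenant_..., permissions) are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI


def get_permissions(tenant_id: str) -> list:
    """Merge shared default permissions with tenant-specific overrides."""
    core_db = client["core"]                   # shared between tenants
    tenant_db = client[f"tenant_{tenant_id}"]  # per-tenant database

    defaults = {d["key"]: d for d in core_db.permissions.find({})}
    overrides = {d["key"]: d for d in tenant_db.permissions.find({})}

    # Tenant-specific documents win over the shared defaults.
    defaults.update(overrides)
    return list(defaults.values())
```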

But let’s say tenants are a part of the same organization and their data should also be shared amongst themselves.

Here lies our question: has someone here faced a similar problem?
Are there solutions in place in MongoDB to make this implementation easier, so that the backend logic for this “sharing” doesn’t become overly complex?

Hi @Nicklas_Ring

Welcome to the MongoDB community!

MongoDB has various efficient ways to implement multi-tenancy. However, in order to better assist you, please answer the following:

  1. How many tenants are there?
  2. Are they all roughly the same size, or do they differ drastically?
  3. Are their query patterns different or alike?
  4. What is the expected data size?
  5. What is the expected growth over the next 2 years?
  6. Are all tenants and the application in a single DC, or multiple?
  7. What are the security considerations? Can developers see different tenants’ data? Can tenants see other tenants’ data?
  8. Are you considering a replica set or sharding? If a replica set, will you shard when the data grows?
  9. What MongoDB version do you expect to use?

Best
Pavel

Hi Pavel,

  1. Currently 400+ tenants in our most active region, but this could increase to thousands in larger markets.
  2. Sizes differ drastically, yes. If we count tenants that are part of a branch (the branch as a single tenant, since they need to share data), they differ even more.
  3. Their query patterns differ for some tenants that use the system in a “special” way. We’ve had issues with this before and had to make optimizations specifically for them.
  4. Depends on how we structure things, but 500 MB+ per tenant; hard to answer.
  5. In markets with the largest potential there might be an increase of 200+ tenants, and if things explode maybe 1000, but I’d say that is unlikely; nevertheless, we should plan for it.
  6. Single DC, tenants do not have isolated domains.
  7. Tenants should not be able to see other tenants’ data when they are not part of a branch; when a tenant is part of a branch, all tenants within the branch should have a common data pool (for lack of a better wording).
  8. We’ve been looking at sharding, but we don’t know if that is a good fit; a replica set seems like a less complex solution. We introduce enough complexity as it is with multi-tenancy, and handling multiple regions within the same platform is also a possibility.
  9. Latest MongoDB version

Thanks :slight_smile:

Hi @Nicklas_Ring,

So if I am reading correctly, your current expected size is ~400 × 500 MB ≈ 200 GB.

Additionally, it sounds like the branch is the isolation level, where a single branch should not have access to others’ data.

Also, you could benefit from tiering branches into different replica sets based on their activity/size and/or geolocation; this way you can size and place hardware and tune queries more specifically.
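
As a minimal sketch of such tiering (the tier map and connection strings below are made up for illustration):

```python
# Sketch only: route each branch to a replica set sized for it.
# The tier map and connection strings are made up for illustration.
from pymongo import MongoClient

REPLICA_SET_URIS = {
    "large": "mongodb://rs-large.example.net/?replicaSet=rsLarge",
    "default": "mongodb://rs-default.example.net/?replicaSet=rsDefault",
}
BRANCH_TIER = {"branch_42": "large"}  # in practice, loaded from config

_clients = {}


def client_for_branch(branch_id: str) -> MongoClient:
    """Return (and cache) the client for the replica set hosting this branch."""
    tier = BRANCH_TIER.get(branch_id, "default")
    if tier not in _clients:
        _clients[tier] = MongoClient(REPLICA_SET_URIS[tier])
    return _clients[tier]
```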

Since the number of tenants may grow significantly, having a collection per tenant in a database per branch might have scalability problems, as it may potentially result in many files on disk for indexes and collections. The WiredTiger storage engine might consume more memory and hold more open file handles with this approach.

I would think about the following options:

  1. Within the replica set, have a database per branch, where the collections inside have a tenant ID field that is indexed and used in every query.

Advantages: better data isolation, as roles can be defined at the database level. A better structure if running across different replica sets, as in each case there will be a database per branch. A more structured privilege system relying on MongoDB roles and permissions.

Disadvantages: the application user will need more privileges to dynamically create databases and indexes for new branches on the fly.
If sharding is considered, you will eventually need to shard more collections, which may result in a less scalable solution. You may still have many collections, pushing WiredTiger to its file limits.
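
A minimal sketch of option 1, assuming pymongo; database, collection, and role names are illustrative only:

```python
# Sketch only: one database per branch, tenant_id indexed on every collection.
# Database, collection, and role names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")


def provision_branch(branch_id: str) -> None:
    """Create per-branch structures on the fly (needs elevated privileges)."""
    db = client[f"branch_{branch_id}"]
    db.orders.create_index([("tenant_id", 1)])
    # Database-level role: read/write on this branch's database only.
    db.command(
        "createRole",
        f"branch_{branch_id}_rw",
        privileges=[{
            "resource": {"db": f"branch_{branch_id}", "collection": ""},
            "actions": ["find", "insert", "update", "remove"],
        }],
        roles=[],
    )


def orders_for_tenant(branch_id: str, tenant_id: str):
    # Every query is scoped by the indexed tenant_id field.
    return client[f"branch_{branch_id}"].orders.find({"tenant_id": tenant_id})
```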

  2. Have the branch ID and tenant ID indexed as a compound index, in one main database and collection per replica set for all tenants.

Advantages: fewer collections, which means less file overhead. Easier to shard in the future and use sharding as a scale-out approach. Requires fewer administration privileges from the application.

Disadvantages: less isolated, as application bugs might query the wrong branch’s data. If branches use the data very differently, you might end up with lock contention and lots of indexes on a single collection, which might impact write performance.
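
And a minimal sketch of option 2 under the same assumptions:

```python
# Sketch only: one main database/collection for all tenants, scoped by a
# compound (branch_id, tenant_id) index. Names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["main"]

# Compound index so every query can be scoped by branch and tenant.
db.orders.create_index([("branch_id", 1), ("tenant_id", 1)])


def orders_for_tenant(branch_id: str, tenant_id: str):
    # The application must include both fields in every query.
    return db.orders.find({"branch_id": branch_id, "tenant_id": tenant_id})

# If you later scale out, the same compound key can become the shard key:
#   client.admin.command("shardCollection", "main.orders",
#                        key={"branch_id": 1, "tenant_id": 1})
```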

I recommend reviewing the following information:
https://www.mongodb.com/article/schema-design-anti-pattern-summary

Best
Pavel

@Pavel_Duchovny thanks for the great response and sorry to revive an old thread, but we were wondering:

In your first recommended approach (add a tenant ID field to the collection and use it in queries), does MongoDB have any features that help ensure the isolation between tenants is maintained at the database level rather than at the application level?

For example, in DynamoDB there is an option for fine-grained access control based on IAM roles that prevents an application from accessing another tenant’s data, based on the partition key.

Hi @Darren_O_Connor ,

If you use Realm application services to access your Atlas database, you can set collection rules for read and write operations.

Potentially, a rule can verify that the tenant_id of the fetched documents equals the requesting user’s tenant.
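
For illustration only, and without committing to the exact rule schema (please verify against the current documentation), such a per-collection rule could look roughly like this, using the %%user expansion:

```python
# Shape is approximate -- verify against the current App Services docs.
# A per-collection rule whose document filter only matches documents
# belonging to the calling user's tenant.
tenant_rule = {
    "roles": [{
        "name": "tenantAccess",
        "apply_when": {},
        "document_filters": {
            "read": {"tenant_id": "%%user.custom_data.tenant_id"},
            "write": {"tenant_id": "%%user.custom_data.tenant_id"},
        },
        "read": True,
        "write": True,
    }]
}
```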

Thanks
Pavel