Creating a new database vs a new collection vs a new cluster

Shruthi_s1 · March 16, 2021, 12:13am

When would I create a new database in MongoDB - does creating a separate database offer any advantage? From my understanding all the databases in a cluster share the same hardware resources so there’s no advantage there. Same for collections.
When would I create a new collection? Only when the data looks different from other collections? Or is there some sort of configuration, isolation, etc. that creating a new collection provides over using the same collection?

Pavel_Duchovny · March 16, 2021, 6:01am

Hi @Shruthi_s1,

Welcome to the MongoDB community.

It’s true that collections and databases share the same resources, and potentially the same connection string from your drivers.

Moreover, a database is a logical context and does not necessarily influence the number of files or the eventual data size.

Database separation is usually done when the same collections are used in different contexts, or as a security measure where I have a user per database that can read and write data in that database only.
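
As a rough sketch of the user-per-database idea, the `createUser` command document below grants the built-in `readWrite` role on a single database. The helper function, user name, password and database name are all hypothetical placeholders, not something from this thread.

```python
def build_create_user_command(username, password, db_name):
    """Build a createUser command granting readWrite on one database only.

    The values passed in below are hypothetical placeholders.
    """
    return {
        "createUser": username,  # the command name has to be the first key
        "pwd": password,
        # readWrite is a built-in role; scoping it to db_name limits the
        # user to reading and writing in that database
        "roles": [{"role": "readWrite", "db": db_name}],
    }


command = build_create_user_command("billing_app", "change-me", "billing")
print(command["roles"])
```

With PyMongo, a document like this could be sent with `client["billing"].command(command)` by a user who is allowed to create users.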

However, separating collections is a data design consideration.

It’s important to remember that we want to access as few documents as possible to fetch our information, while keeping a good trade-off with index sizes, write performance and concurrency patterns.

In general, a completely separate business logic entity should store its data in a separate collection, with its own indexes and access pattern.

There are other considerations, like relationships, compression and future scale-out (sharding), that can influence a good choice of schema and collection separation.

I recommend reading the following:

https://www.mongodb.com/article/schema-design-anti-pattern-summary/

https://www.mongodb.com/article/mongodb-schema-design-best-practices/

Thanks.
Pavel

Prasad_Saya · March 16, 2021, 6:32am

Hello @Shruthi_s1, welcome to the MongoDB Community forum!

Databases and Collections:

In general, an application has some data associated with it. A typical web application has a database where the application’s data is stored. For example, a financial accounting application has various modules like accounts receivables, payables and general ledger - this is a categorization of an application at a very high level. As such each of these modules can be an application by itself. The data is also categorized by its functionality.

If you are building such an application, it is likely that the application’s data is stored in three databases - one for each module. And, within the accounts receivables module there are various functions like customer management, invoice management, etc. Each of these kinds of data is different and is stored in a different collection within the accounts receivables database.

Another example is a blogging application. There are users, blog posts and reviews. The data is stored in different collections - users, blogs and reviews (or maybe just users and blogs, with the reviews stored in blogs). It would be impractical to store user and blog information in the same collection, because user data is different: it has different fields and structure - user name, password, email, etc. A post’s data is a title, content, the user who wrote it, reviews, etc. These do not belong together in the same collection. Data is inserted, updated, and queried per collection: to get user data you go to the users collection.
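
To make the difference in shape concrete, here is a small sketch with plain Python dicts standing in for documents and collections. The field names and values are made up for illustration, and no MongoDB driver is involved.

```python
# Hypothetical documents for the blogging example; field names are illustrative.
user_doc = {
    "_id": 1,
    "name": "alice",
    "email": "alice@example.com",
    "password_hash": "<hashed password>",
}

post_doc = {
    "_id": 101,
    "title": "Why separate collections?",
    "content": "...",
    "author_id": user_doc["_id"],  # reference to a document in `users`
    "reviews": [{"reviewer_id": 2, "text": "Helpful, thanks"}],
}

# A plain dict stands in for a database holding two collections.
blog_db = {"users": [user_doc], "posts": [post_doc]}


def find_author(db, post):
    """Look up the user document that a post refers to."""
    return next(u for u in db["users"] if u["_id"] == post["author_id"])


print(find_author(blog_db, post_doc)["name"])
```

Apart from `_id`, the two documents share no fields, which is a hint that they belong in separate collections.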

So, you can think of a collection as a grouping of similar data, and of a database as a grouping of related collections, i.e., data serving a larger functionality or a module. You would not want to store customer and invoice information in the same collection - it is impractical to store and use. By analogy, it is like putting salt and pepper in different containers - different containers for different ingredients serving different purposes.

MongoDB Clusters:

MongoDB can be deployed as a standalone server, a replica-set or a sharded cluster. These configurations serve different purposes.

A standalone is a single server where all the databases (and their collections) are stored. If the server goes down, your application and its users have to wait until the server is up and running again.

A replica-set replicates the data on multiple database servers. The advantage is that if one of the servers dies, the other servers, with their replicated data, go on serving the application and its users.
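
From the application’s side, a replica-set usually shows up in the connection string: the driver is given several members plus the replica-set name, so it can find the current primary after a failover. A minimal sketch, assuming three hypothetical hosts and a replica-set named `rs0`:

```python
# Hypothetical host names and replica-set name, for illustration only.
hosts = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"]
replica_set_name = "rs0"

# Standard connection string format: comma-separated hosts, then options.
uri = "mongodb://" + ",".join(hosts) + "/?replicaSet=" + replica_set_name
print(uri)
```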

A sharded cluster has multiple shards - each shard is a replica-set - and the application’s data is distributed among these shards. For example, the customer data is stored on multiple shards. If there are five shards and one hundred customers, you can think of each shard as storing the data of about twenty customers (the actual distribution is based on criteria like the shard key).
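
The "about twenty customers per shard" idea can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: it hashes a hypothetical customer id and takes it modulo the shard count, whereas MongoDB assigns ranges of shard key values to shards.

```python
import hashlib

NUM_SHARDS = 5


def toy_shard_for(customer_id):
    """Map a customer id to a shard number with a simple hash.

    Illustration only: this is not how MongoDB places data on shards.
    """
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Start every shard at zero so all five appear in the result.
counts = {shard: 0 for shard in range(NUM_SHARDS)}
for customer_id in range(1, 101):  # one hundred hypothetical customers
    counts[toy_shard_for(customer_id)] += 1

print(counts)
```

The counts always add up to one hundred, but they need not be exactly twenty per shard; how even the spread is depends on the key and how it is hashed.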

MongoDB shards at the collection level, so a sharded cluster can hold both sharded and un-sharded collections.
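
Because sharding is per collection, the `shardCollection` command names one specific namespace. The sketch below builds such a command document; the helper function, database, collection and shard key field are hypothetical, and a hashed shard key is just one possible choice.

```python
def build_shard_collection_command(db_name, collection_name, shard_key_field):
    """Build a shardCollection command for a single collection.

    All names passed in below are hypothetical placeholders.
    """
    return {
        # The namespace is "<database>.<collection>"
        "shardCollection": f"{db_name}.{collection_name}",
        # "hashed" asks for a hashed shard key on this field
        "key": {shard_key_field: "hashed"},
    }


command = build_shard_collection_command("receivables", "customers", "customerId")
print(command)
```

With PyMongo, this would be run against the `admin` database of a sharded cluster, for example `client.admin.command(command)`. Other collections in the same database stay un-sharded unless they are sharded separately.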

Finally:

How do you determine which cluster, database or collection to use? It is a broad subject. In fact, it’s a combination of several subjects, like data modeling (or database design) and application design, and the answer also depends on the requirements of the application.

Some useful references:

system · July 8, 2021, 9:41pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.