Any case where "Collection per entity" is a good model?

Joao_Pinela · May 4, 2021, 3:36pm

Hello All,

I often come across suggestions to split one collection into per-entity collections.
For example, in an IoT application with a “sensor_temperature” collection that would grow too large, the idea would be to create collections like

sensor_temperature_houseA
sensor_temperature_houseB
sensor_temperature_houseC

and so on, even if you have 1 million different houses.
That way house A does not need to see (or know about) house B’s data. But does this make any sense?
In any case at all?
If so, in which cases or conditions, and why?

Thank you so much.
Best Regards,
JP

Asya_Kamsky · May 4, 2021, 6:34pm

That sounds like it could be a recipe for disaster or it could be a very good way of doing it, depending on how the data is collected and how it’s used.

If you think of each house as a separate tenant/customer/user then maybe there are valid reasons to split the data, but remember that even if it’s all in one collection you can split it (eventually sharding) by house/tenant_id (plus other fields, as it makes sense).
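
As a rough sketch of the single-collection option in Python: every reading carries a house identifier, and a per-house query is just a filter on it. The field names (`house_id`, `ts`) and the helper `house_filter` are assumptions made for illustration. The dicts follow the shape of MongoDB query filters and index keys, but nothing here talks to a server.

```python
from datetime import datetime

def house_filter(house_id, start=None, end=None):
    """Build a MongoDB-style query filter for one house, optionally time-bounded."""
    query = {"house_id": house_id}
    time_range = {}
    if start is not None:
        time_range["$gte"] = start
    if end is not None:
        time_range["$lt"] = end
    if time_range:
        query["ts"] = time_range  # "ts" is a hypothetical timestamp field
    return query

# Compound index keys (also a candidate shard key) that lead with the tenant field,
# so queries for one house only touch that house's index range.
INDEX_KEYS = [("house_id", 1), ("ts", 1)]

# All readings for houseA in May 2021
may_2021 = house_filter("houseA", datetime(2021, 5, 1), datetime(2021, 6, 1))
```

With this layout, house A's queries never need to mention house B, even though both live in the same collection.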

So, like always, in MongoDB the answer is “it depends”

Asya

Joao_Pinela · May 4, 2021, 6:40pm

Hello Asya,

thank you for your feedback

One good reason, I would think, is data isolation/privacy. That is really the only reason, because otherwise you can “split” within a single collection by a simple attribute such as “house_id”.

I don’t see any other benefits. Performance-wise, weighed against the added coding complexity, there doesn’t seem to be one.
Would it be THAT much better, if there isn’t a requirement for data isolation for privacy?

Thank you again.

Best Regards,
JP

MaBeuLux88 · May 4, 2021, 7:03pm

EDIT: I started to type this hours ago. Then I went on a food break and didn’t see the 2 previous answers.

Hi @Joao_Pinela,

My answer might not be the only truth but, let’s try.

First, I would say that having MANY collections in MongoDB is generally a bad idea. It’s better to have a few very large collections with many documents in them rather than MANY MANY collections with a limited number of documents in each.
Also, this will make any aggregation involving the entire data set a lot more complex, because you would have to $unionWith all the collections to calculate the average temperature, for example.
I think if you HAVE to split your data set into a FEW collections, I would split on something with much lower cardinality so the number of collections stays completely under control.
For example, I would use the year or month_year.

sensor_temp_2020
sensor_temp_2021
OR
sensor_temp_01_2021
sensor_temp_02_2021

At least here, if you need to calculate the average temperature for 2020 and 2021, it’s trivial with the first option and more complicated with the second.

If you need the averages per month, I would go for either option; in that case, both aggregations are trivial.
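
To make the $unionWith cost concrete, here is a small Python sketch that only builds the aggregation pipeline as a list of dicts; it does not connect to MongoDB. The helper name `union_average_pipeline` and the `temperature` field are assumptions for illustration. The pipeline is meant to run on the first collection in the list and pull in each of the others with one $unionWith stage.

```python
def union_average_pipeline(collections, field="temperature"):
    """Pipeline to run on collections[0]: union the rest, then average `field`."""
    # One $unionWith stage per additional collection
    stages = [{"$unionWith": {"coll": name}} for name in collections[1:]]
    # A single group over everything gives the overall average
    stages.append({"$group": {"_id": None, "avg": {"$avg": f"${field}"}}})
    return stages

# Two yearly collections: one $unionWith stage plus the $group
yearly = union_average_pipeline(["sensor_temp_2020", "sensor_temp_2021"])

# 24 monthly collections for the same two years: 23 $unionWith stages plus the $group
monthly_names = [f"sensor_temp_{m:02d}_{y}" for y in (2020, 2021) for m in range(1, 13)]
monthly = union_average_pipeline(monthly_names)
```

The pipeline grows by one stage per collection, so a collection per house would mean one stage per house for any query across the whole data set.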

I think it all comes down to “how are you going to query your data?”

Another GREAT pattern for IoT data with too many documents would be to use the bucket pattern.

Basically, instead of storing 1 temperature per document, you store the entire day or month of temperatures in a single document using arrays. This can reduce your number of documents very significantly. But don’t make jumbo documents either. A few hundred KB at most would be my recommendation.

Also, I would use Online Archive to automatically archive the old values into S3 to reduce costs while keeping that data queryable: federated queries still let you query both the “hot” data in Atlas and the archived data.

I hope this helps.
Cheers,
Maxime.

Joao_Pinela · May 4, 2021, 8:36pm

thanks @MaBeuLux88 .

OK, from your answer I see that splitting per entity, as in the following collection names,

sensor_temp_house0001_2020
sensor_temp_house0001_2021
sensor_temp_house0002_2020
sensor_temp_house0002_2021
sensor_temp_house0003_2020
sensor_temp_house0003_2021
…
sensor_temp_house9999_2020
sensor_temp_house9999_2021

could make sense, depending on the access patterns, as long as you don’t have more than maybe 1000 houses: with 1000 houses you could only store 10 years of data (1000 × 10 = 10,000 collections, the suggested maximum in the Massive Number of Collections article).

This means that if you had many MANY houses, or many users (like 1M users), this pattern is simply not a good idea.

I see. It depends on the number of entities.
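
The arithmetic behind that limit, as a tiny Python sketch. The 10,000 figure is the suggested maximum quoted above; the function names are made up for illustration.

```python
SUGGESTED_MAX_COLLECTIONS = 10_000  # figure quoted from the article mentioned above

def collection_count(houses, years):
    """One collection per house per year."""
    return houses * years

def max_years(houses, limit=SUGGESTED_MAX_COLLECTIONS):
    """Whole years of data that fit under the limit for a given number of houses."""
    return limit // houses

print(collection_count(1000, 10))  # 10000
print(collection_count(9999, 2))   # 19998
print(max_years(1_000_000))        # 0
```

So the house0001 to house9999 naming above would already pass the suggested maximum in its second year.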

Thank you @MaBeuLux88 and all

best regards,
JP

MaBeuLux88 · May 5, 2021, 1:46pm

The bucket pattern I mentioned in my previous answer is usually the go-to solution for IoT to reduce the number of documents in the collection.
For example, maybe you could bucket your sensor readings by house per month.

If you take one measurement every hour, you would have 31*24 = 744 values per doc which is totally manageable I think. The document would look something like:

{
  "_id": "house0001_05_2021",
  "values": [
    {
      "date": ISODate(...),
      "v": 34.3
    }, 
    {...}
  ]
}

Again, this may or may not be a valid solution; it depends on the access patterns. But with hourly readings it would divide the number of documents in the collection by about 744.
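
A minimal Python sketch of how flat readings could be grouped into such per-house, per-month buckets. It works on in-memory dicts only; the input field `house_id` and the helper name are assumptions, while the `_id` format and the `values` / `date` / `v` layout follow the document above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bucket_by_house_month(readings):
    """Group flat readings into one document per house per month."""
    buckets = defaultdict(list)
    for r in readings:
        # e.g. "house0001_05_2021" for a reading taken in May 2021
        key = f"{r['house_id']}_{r['date']:%m_%Y}"
        buckets[key].append({"date": r["date"], "v": r["v"]})
    return [
        {"_id": key, "values": sorted(vals, key=lambda x: x["date"])}
        for key, vals in sorted(buckets.items())
    ]

# One reading per hour for all of May 2021 (31 days)
start = datetime(2021, 5, 1)
hourly = [
    {"house_id": "house0001", "date": start + timedelta(hours=h), "v": 20.0}
    for h in range(31 * 24)
]
docs = bucket_by_house_month(hourly)
print(len(docs), len(docs[0]["values"]))  # 1 744
```

In a live system the bucket would more likely be filled incrementally as readings arrive; this batch version only illustrates the resulting shape.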

Maybe another solution could be around the granularity. For example, maybe after one year you don’t need to keep all the details, and you could squash the readings for one day into a single averaged value. This could be done with Realm Scheduled Triggers, for instance.
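
The squashing step itself could look roughly like this in Python. This is only the averaging logic on in-memory data, assuming the same `date` / `v` reading shape as the bucket document above; the scheduling and the write back to the database are not shown, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def squash_to_daily(values):
    """Collapse readings ({"date", "v"}) into one averaged value per calendar day."""
    per_day = defaultdict(list)
    for reading in values:
        per_day[reading["date"].date()].append(reading["v"])
    return [
        # Keep a datetime at midnight so the result has the same shape as the input
        {"date": datetime(day.year, day.month, day.day), "v": mean(vs)}
        for day, vs in sorted(per_day.items())
    ]

daily = squash_to_daily([
    {"date": datetime(2021, 5, 1, 0), "v": 10.0},
    {"date": datetime(2021, 5, 1, 12), "v": 20.0},
    {"date": datetime(2021, 5, 2, 0), "v": 30.0},
])
# daily holds two entries: 15.0 for May 1 and 30.0 for May 2
```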

It’s really down to what the data is for and how it’s consumed.

Cheers,
Maxime.

Joao_Pinela · May 5, 2021, 1:55pm

I see. Thank you for the help and perspective.
