[General] Confusion on immutable nature of shard keys

I am struggling to understand why one would choose anything except a monolithic field as the shard key.

Initially as one starts to think about the pros of having a collection sharded you go: ‘wow amazing’. A decent example is provided in one of the lectures showing a case like lastnames.

Now at my place of work our existing databases are large tables of client information and a very common way of searching our DB is by email or name. Later on when we start to touch on targeted searches and you think back on a example like lastnames the takeaway is again: Great! additional improvements to real world issues for employees aka TIME.

But then we get a lecture explicitly stating that shard key values become immutable… It’s at this point my dream of amazingly fast database queries shattered. I’ll explain, currently in my real world example of our client database of about 200k records, there exists 0 fields that we could realistically make immutable. Names, Surnames, emails, tell numbers, addresses, countries, the list goes one, at some point in time these have required an some form of update, because of very obvious reasons. People move, get married, make spelling errors on sign up, get new phones etc. etc.

Thus, if I understood correctly, in the real world no optimal candidate for “shard key” would be viable and you will be forced to use something like a client ID or the likes that numerically increases and are unique. To avoid hotspotting it will be hashed and any hope of targeted searches will forever be lost.

If I misunderstood the entire point of shard keys in this manner, and anyone could provide clarity. It would be appreciated. :slight_smile:

Hi @Morne_99418,

Here we will take a look on our two database needs: Scaling out and targeted search.
Firstly, we will have to consider scaling out our data. If we are storing tons of data and we expect high throughput operations, we should consider Sharding. It will distribute the load over multiple servers and ensure distributed reads/writes, better storage capacity and higher availability even if some shards are unavailable.

Please consider the following shard key specifications for evenly distributing the data across multiple shards:

A sharded collections must also have an index that supports the shard key i.e. the index can be an index on the shard key or a compound index where the shard key is a prefix of the index. This ensures the queries to be targeted as it reduces the need for collection scan. Please refer the following doc for more details on this:

For scenarios like this, I would find a compound shard key (may be a compound index) which is a combination of a unique immutable field (eg: Client_id) with some other field. Please go through the following link for getting an idea of choosing shard key as per your requirements as it has various scenarios explained:

I hope it helps!!
Please let me know, if you have any further doubts.

Kindest Regards,
Sonali

1 Like