
Questions about hashed sharding

Does the hash function return a different value for shard keys when a new shard is added to or removed from the existing cluster?

This will help me understand how data/record distribution happens when a node fails in a very big cluster, and how much performance impact there would be.

Welcome to the community @Deepak_Pengoria!

A hash function returns a consistent result for a given input. In the case of hashed sharding in MongoDB, a single field value is hashed. Changing the number of shards does not affect the hashed value of a shard key.

Data in a sharded collection is distributed based on chunks which represent contiguous ranges of shard key values. Chunk ranges are automatically split based on data inserts and updates, and balanced across available shards when the imbalanced distribution of chunks for a collection reaches a migration threshold.
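As a rough sketch of how chunk-based routing works, here are some hypothetical chunk ranges over the signed 64-bit hashed key space (the boundary values and shard names are invented for illustration, not taken from a real cluster):

```javascript
// Hypothetical chunks: each owns a contiguous range [min, max) of the
// hashed key space and currently lives on one shard.
const chunks = [
  { min: -(2n ** 63n),          max: -3074457345618258602n, shard: 'shardA' },
  { min: -3074457345618258602n, max: 3074457345618258602n,  shard: 'shardB' },
  { min: 3074457345618258602n,  max: 2n ** 63n,             shard: 'shardC' },
];

// Routing: find the chunk whose range contains the hashed key value.
function shardFor(hashedKey) {
  return chunks.find(c => hashedKey >= c.min && hashedKey < c.max).shard;
}

console.log(shardFor(329356589482501449n));   // -> 'shardB'
console.log(shardFor(-3285311604932298140n)); // -> 'shardA'
```

When the balancer migrates a chunk, only the range-to-shard mapping changes; documents keep the same hashed key values.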

If you are new to sharding, I would definitely give consideration to whether hashed sharding is the best approach for your use case. The default range-based sharding approach supports efficient range-based queries based on the shard key index and allows compound shard keys which can be useful for zone-based data locality.

Hashed sharding is best suited to distributing inserts based on a monotonically increasing value like an ObjectID or a timestamp. For more info, see Hashed vs Ranged Sharding.

Each shard is backed by a replica set which determines the level of data redundancy and fault tolerance. A production replica set typically includes three data bearing mongod processes, which allows any single mongod to be unavailable (for example, due to maintenance or failure) while still maintaining availability and data replication.

You can provision additional replica set members to increase your fault tolerance.

For more insight I recommend taking some of the free online courses in the MongoDB for DBA Learning Path at MongoDB University.

Regards,
Stennie

Thanks for the quick response. Could you please point me to the hash function implementation/source code that calculates the shard key/partition key when partitioning the records/data?

Hi Deepak,

You can find the server implementation on GitHub in the mongodb/mongo repository: src/mongo/db/hasher.cpp.

If you just want to see what the hashed values look like, there is a convertShardKeyToHashed() helper in the mongo 4.0+ shell.

For example, two ObjectIDs generated in close sequence will be roughly ascending:

> o = new ObjectId()
ObjectId("5ea77f07fcb94966883f5d5e")

> o2 = new ObjectId()
ObjectId("5ea77f0afcb94966883f5d5f")

However, the hashed values will be very different:

> convertShardKeyToHashed(o)
NumberLong("329356589482501449")

> convertShardKeyToHashed(o2)
NumberLong("-3285311604932298140")

Since data in a sharded collection is distributed based on contiguous chunk ranges, naive sharding on {_id:1} with default ObjectIDs would result in new inserts always targeting the single shard which currently has the chunk with the maximum _id value. This creates a “hot shard” scenario with inserts targeting a single shard plus the additional overhead of frequently rebalancing documents to other shards. Adding more shards will not help scale inserts with this poor choice of monotonically increasing shard key.

If you instead shard based on hashed ObjectId values (using { _id: 'hashed'}), consecutive inserts will land in different chunk ranges and (on average), different shards. Since inserts should be well distributed across all shards, rebalancing activity should be infrequent and only necessary when you add new shards for scaling (or remove existing shards).

As mentioned earlier, you should definitely consider whether hash-based sharding is the best approach for your use case. If there is a candidate shard key based on one or more fields present in all of your documents, you will have more options for data locality using the default range-based sharding approach. See also Choosing a Good Shard Key.

Regards,
Stennie
