Suggestions for Choosing a Shard Key

I need help choosing a shard key for my MongoDB sharded cluster.
Scenario: My application is built on .NET Core 2.1. It reads websites and updates details in the database. I have a list of around 1 million websites that need to be crawled. The application finds new pages that are not already in my database and saves them to the database.

Cluster and Server Details: I have 3 shards (one primary and 2 secondaries each) on Dell R820 machines, each with 512 GB of RAM. I run my application, which is multithreaded, on 4 Dell R620 machines.

Database Structure: I mainly have 2 databases, one for the list of all home pages and one for Pages.

HomePages:

_id

URL (shard key)

Pages:

_id

URL (shard key and unique indexed to avoid duplicate entries in collection)

HomePageURL

AlreadyRead (indexed field)

So the application reads home pages and saves the inner pages found on each home page in the Pages database. The other part of the application gets pages from the Pages database where AlreadyRead is 0, updates it to 1, and crawls each page to save any other pages found on it in the database. But this part gets slower as the data size grows, which I think is because of a wrong shard key: it is set to the URL field, and the command goes to all shards (I am assuming). I am saving the URL without http or www. And if I set HomePageURL as the shard key, it distributes the data unevenly across shards (which I have already experienced; 92% of the data ended up on one shard).

To cut a long story short: considering the above scenario, what would be the best shard key? Or do I have to choose a compound shard key?

Hi @Meva159,

Welcome to MongoDB community!

It sounds like your queries are either on HomePageURL or on HomePageURL, URL, AlreadyRead.

Based on this, I believe you should consider a shard key on HomePageURL, URL, while the index that the shard key uses can potentially be a unique one based on:

HomePageURL : 1 ,
URL : 1,
AlreadyRead : 1

An index can back the shard key as long as the shard key fields form a prefix of the index. Such an index could be unique.

Please note that resharding a collection is a painful task. MongoDB 4.4 only allows refining a shard key by adding fields, and this only goes forward: fields can be added but not removed.

To fully reshard a collection you will need to:

  1. Dump it.
  2. Drop the collection and the metadata in the config servers.
  3. Recreate new collection/indexes and new shard key.
  4. Presplit it for load.
  5. Load the dumped data via mongos.
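
As a rough sketch of steps 2 to 4, the commands could be assembled as plain dictionaries like the ones below. The namespace `crawler.Pages`, the split boundaries, and the helper names are assumptions made up for illustration; the key shown is the URL, HomePageURL, ApplicationNumber combination discussed later in this thread. Steps 1 and 5 (typically done with `mongodump` and `mongorestore`) are not shown.

```python
# Hypothetical namespace; substitute the real database and collection names.
DB, COLL = "crawler", "Pages"
NS = f"{DB}.{COLL}"

# Candidate compound shard key (field order matters).
NEW_KEY = {"URL": 1, "HomePageURL": 1, "ApplicationNumber": 1}


def drop_command():
    # Step 2: drop the old collection (run against its database via mongos).
    return {"drop": COLL}


def shard_collection_command():
    # Step 3: shard the recreated collection (run against the admin database).
    return {"shardCollection": NS, "key": dict(NEW_KEY)}


def presplit_commands(url_boundaries):
    # Step 4: one split command per boundary. This sketch names every
    # shard key field in the split point and uses the lowest plausible
    # values for the trailing fields.
    return [
        {"split": NS, "middle": {"URL": b, "HomePageURL": "", "ApplicationNumber": 0}}
        for b in url_boundaries
    ]


if __name__ == "__main__":
    print(drop_command())
    print(shard_collection_command())
    for cmd in presplit_commands(["h", "p"]):  # assumed boundaries
        print(cmd)
```

With a driver such as PyMongo, each dictionary could be passed to `Database.command` on a connection to a mongos. Good split boundaries depend on the real URL distribution, so the two shown here are placeholders.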

Best

@Pavel_Duchovny Thanks for replying with your valuable feedback. Basically, my queries are based on AlreadyRead. I pick up the URL from PagesDB where AlreadyRead is 0, update it to 1, and read its HTML. Then I save the new pages found on that page in PagesDB with AlreadyRead set to 0. I have a unique index on the URL field to avoid duplicate entries in PagesDB.
From what I’ve read in the MongoDB documentation, it seems I should have a compound shard key on HomePageURL, URL, and possibly some other field, maybe ApplicationNumber, as my application runs on multiple servers. Do you think choosing such a shard key and applying a unique index on the URL field will solve my problem?

Hi @Meva159,

So if your queries are on AlreadyRead: 0, perhaps you can index it separately from your shard key, as it will be executed as a scatter-gather query anyhow. This index might be partial :wink:
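
To make the partial index idea concrete, here is a minimal sketch of what such an index definition could look like as a `createIndexes` command document. The collection name `Pages` is taken from the thread; the index name `unread_only` and the helper functions are hypothetical. The second helper only mimics, for a simple equality filter, which documents would end up in the index.

```python
def partial_unread_index(collection="Pages"):
    """Build a createIndexes command for an index holding only unread pages."""
    return {
        "createIndexes": collection,
        "indexes": [
            {
                "key": {"AlreadyRead": 1},
                "name": "unread_only",  # hypothetical index name
                # Only documents matching this filter are indexed.
                "partialFilterExpression": {"AlreadyRead": 0},
            }
        ],
    }


def is_indexed(doc, command):
    """Return True if doc satisfies the index's simple equality filter."""
    flt = command["indexes"][0]["partialFilterExpression"]
    return all(doc.get(field) == value for field, value in flt.items())


if __name__ == "__main__":
    cmd = partial_unread_index()
    print(is_indexed({"URL": "abc.com/about", "AlreadyRead": 0}, cmd))
    print(is_indexed({"URL": "abc.com/contact", "AlreadyRead": 1}, cmd))
```

Because pages that were already read drop out of the index, it stays small as the collection grows. A query has to include the `AlreadyRead: 0` condition for the planner to consider it.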

I think that if the shard key cannot answer your queries, it should be designed to give the most even distribution and avoid hot shards for your writes. Therefore it sounds like the compound key might be right.

Uniqueness can be enforced across shards only when the shard key index is a unique one.

Perhaps you can have a unique shard key on just URL but use a hash function on it.
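
MongoDB's built-in hashed indexes do not support a unique constraint, so one possible reading of this suggestion is to compute the hash in the application and store it in its own field. The sketch below assumes a hypothetical `URLHash` field holding a SHA-1 digest and uses three equal ranges as a stand-in for three shards; it only shows how evenly such a value spreads, not how MongoDB would actually place chunks.

```python
import hashlib


def url_hash(url: str) -> str:
    """Hex SHA-1 digest of the URL, to be stored in a hypothetical URLHash field."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def bucket(url: str, buckets: int = 3) -> int:
    """Map the leading 32 bits of the digest onto one of `buckets` equal ranges."""
    prefix = int(url_hash(url)[:8], 16)
    return prefix * buckets // 2**32


def distribution(urls, buckets: int = 3):
    """Count how many URLs land in each range."""
    counts = [0] * buckets
    for u in urls:
        counts[bucket(u, buckets)] += 1
    return counts


if __name__ == "__main__":
    # All pages belong to one site, yet the hashes spread across the ranges.
    urls = [f"abc.com/page{i}" for i in range(3000)]
    print(distribution(urls))
```

A unique ranged shard key on such a field would reject duplicate URLs (barring hash collisions) while spreading pages of the same site across shards.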

Best
Pavel

@Pavel_Duchovny ,
Thanks once again for taking time out to write with your valuable feedback.
I’ve already indexed AlreadyRead. What I am asking is this: when the application saves new pages in PagesDB, it gets slower as the data size grows, as the shard key is only on the URL field. It seems the query hops across multiple shards to write the data, making it slower.
So will it fix this problem if I make the shard key as follows:
URL: 1, HomePageURL: 1, ApplicationNumber: 1
My application runs on different servers, so I can include the application number in my shard key.
The queries that will then run are as follows:
select URL from PagesDB where AlreadyRead is 0 and ApplicationNumber is 1
And while saving the new pages in PagesDB, the application will pass URL, HomePageURL, and ApplicationNumber in order to target the specific shard.
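
Whether an operation can be routed to specific shards depends on whether it supplies the shard key or a leading prefix of it. The sketch below applies a deliberately simplified version of that rule to the proposed key; the function name and sample documents are made up for illustration, and real mongos routing considers more than this single check.

```python
# Proposed compound shard key from this thread, in order.
SHARD_KEY = ["URL", "HomePageURL", "ApplicationNumber"]


def can_target(filter_doc, shard_key=SHARD_KEY):
    """Simplified model: an operation can be narrowed to specific shards
    only when it constrains the leading shard key field."""
    return shard_key[0] in filter_doc


# The read query: AlreadyRead is 0 and ApplicationNumber is 1.
read_filter = {"AlreadyRead": 0, "ApplicationNumber": 1}

# A new page being saved, carrying every shard key field.
new_page = {
    "URL": "abc.com/about",
    "HomePageURL": "abc.com",
    "ApplicationNumber": 1,
    "AlreadyRead": 0,
}

if __name__ == "__main__":
    print("read targeted:", can_target(read_filter))
    print("insert targeted:", can_target(new_page))
```

Under this model the insert is targeted, but the read is still broadcast to every shard, because ApplicationNumber is not the leading field of the key.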

This is what I am thinking right now, your feedback on this will be much appreciated.

Hi @Meva159,

To optimise the write workload you should consider adding HomePageURL and ApplicationNumber to the shard key, but this is only possible on a 4.4 server with FCV (featureCompatibilityVersion) 4.4.

I guess that if writes are evenly distributed, queries will eventually run faster too, as more resources will be available.

Thanks
Pavel

@Pavel_Duchovny
Wouldn’t that distribute the data unevenly across the shards? I previously created a shard key on HomePageURL, but it created hot shards, with some shards holding more than 90% of the data. My application is designed so that a specific website is bound to a specific application. That means if an application picks up a home page (say abc.com) from HomePagesDB, that particular application will read all the pages of that specific website (abc.com/about, abc.com/contact, abc.com/careers, and so on).
Also, I am using MongoDB v4.2.9 right now, but I can upgrade to 4.4.
I am actually new to MongoDB and still learning its basics. So if I upgrade to 4.4 and choose to make a compound shard key, the documentation suggests that I first need to create a compound index and then shard the collection based on that compound index. Am I interpreting it correctly?

Hi @Meva159,

Ok so having just HomePageURL does not make sense as it will create a hot shard.

However, adding ApplicationNumber and HomePageURL to URL should create more split points and randomise the write access to different shards.

Refining a shard key, which means adding fields to it, is only available in 4.4.
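
Refinement keeps the existing key as a prefix and appends fields, and an index supporting the new key has to exist before the command is run. The sketch below checks the prefix rule and builds a `refineCollectionShardKey` command document; the namespace `crawler.Pages` and the helper name are assumptions for illustration.

```python
def refine_command(ns, current_key, new_key):
    """Build a refineCollectionShardKey command after checking that the
    new key starts with the current key and adds at least one field."""
    current, new = list(current_key.items()), list(new_key.items())
    if len(new) <= len(current) or new[: len(current)] != current:
        raise ValueError(
            "new key must start with the current key and add at least one field"
        )
    return {"refineCollectionShardKey": ns, "key": dict(new_key)}


if __name__ == "__main__":
    current = {"URL": 1}

    # Appending fields to the existing key passes the check.
    print(refine_command(
        "crawler.Pages",  # hypothetical namespace
        current,
        {"URL": 1, "HomePageURL": 1, "ApplicationNumber": 1},
    ))

    # Putting HomePageURL first is a different key, not a refinement.
    try:
        refine_command("crawler.Pages", current, {"HomePageURL": 1, "URL": 1})
    except ValueError as exc:
        print("rejected:", exc)
```

This is why a key that leads with HomePageURL would need the full dump and reload when the current key is URL, while URL followed by extra fields can be reached by refinement.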

You currently do not have the ability to completely reshard the collection. This can only be done by exporting, recreating, and importing, which is a painful process that we are working on improving.

The way to avoid hot shards while using just HomePageURL is to define it as a hashed shard key, but this still means resharding the whole collection.

Let me know if this clears things up.

Read more here:

Best
Pavel

@Pavel_Duchovny
Once again, many thanks for your feedback. I got your point, and as I said earlier, I can upgrade to 4.4 and am open to resharding my collection.
One last thing: does it make sense to create a compound key on HomePageURL as hashed, together with URL and ApplicationNumber?

Hi @Meva159,

This only makes sense if this combination is monotonically increasing:
https://docs.mongodb.com/manual/core/sharding-shard-key/#monotonically-changing-shard-keys

If not, then I don’t see a point in creating a compound hashed key. Also, compound hashed keys are available in 4.4 as well.
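
For reference, a compound hashed key in 4.4 may contain a single hashed field, and a hashed key cannot carry a unique constraint. The sketch below checks a candidate key specification against those two rules; the helper name and return shape are made up for illustration.

```python
def validate_shard_key(key, unique=False):
    """Check a shard key spec against two rules for hashed keys:
    at most one hashed field, and no unique constraint when hashed."""
    hashed = [field for field, kind in key.items() if kind == "hashed"]
    if len(hashed) > 1:
        raise ValueError("at most one hashed field is allowed")
    if hashed and unique:
        raise ValueError("a hashed shard key cannot be unique")
    return {"hashed_field": hashed[0] if hashed else None, "fields": list(key)}


if __name__ == "__main__":
    # The combination asked about above.
    candidate = {"HomePageURL": "hashed", "URL": 1, "ApplicationNumber": 1}
    print(validate_shard_key(candidate))

    # Asking for uniqueness on the same key is rejected.
    try:
        validate_shard_key(candidate, unique=True)
    except ValueError as exc:
        print("rejected:", exc)
```

The second rule matters here because the Pages collection relies on a unique index on URL to avoid duplicates.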

Best
Pavel

Thanks. Creating a compound shard key solved the problem :slight_smile:
One more thing: is it necessary to create indexes separately on the fields used in the compound shard key?

Hi @Meva159,

The index will be created automatically if the collection is empty. Otherwise I suggest building the index with a rolling or background build.

Best
Pavel
