I had a problem with the lab “Sharding a collection” in chapter 3, and I haven’t been the only one. There are a few other posts in the forum about the same issue.
In a nutshell, the problem is that we mongoimport a file into our primary shard, and the resulting collection contains 516784 documents, as it should. We then enable sharding for the database which contains that collection (having already added a second shard to the cluster), and shard the collection. We then run the validation script validate_lab_shard_collection, and it tells us that we’ve got the wrong number of documents:
Incorrect number of documents imported - make sure you import the entire dataset.
I repeated the process several times: dropping the collection, importing it again, creating the index and then sharding the collection. Each time, I monitored the number of documents in the collection from 3 different mongo shells (one connected to the mongos on port 26000, and the others connected to whichever port is the current primary node of each of the two shards). I have reached the following conclusions:
- As soon as we start sharding the collection, the balancer has a lot of work to do, migrating chunks from the primary shard to the other shard(s) (although there’s only one other shard in this lab)
- When the balancer starts working, all the chunks are on the primary shard
- The balancer can take a while to distribute the chunks evenly amongst the shards
- Whilst chunk migration is in progress, the total number of documents in a collection can appear to be larger than it should be
- Whilst chunk migration is in progress, the document counts reported for the collection by each individual shard, when added together, can also come to more than the number of documents we imported
And, critical to my question:
- Whilst a chunk migration is in progress, the documents in that chunk appear to be in both the shard that the chunk is being moved from, and the shard that it’s being moved to
The solution to this appears to be to monitor the number of documents which appear to be in the collection according to the mongos, and wait until that number of documents matches the number of documents we originally imported (at which point the validation script will succeed).
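A minimal sketch of that wait loop, in plain JavaScript so that it can be tried outside the shell. The function name waitForCount, its parameters, and the intermediate readings are all made up for illustration. Only the target of 516784 comes from the lab. The count itself is passed in as a function, so nothing here assumes a particular shell helper or driver call.

```javascript
// Poll a count function until it reports the expected number of documents.
// getCount and pause are supplied by the caller (both are hypothetical hooks),
// so this helper does not depend on any particular shell or driver.
function waitForCount(getCount, expected, pause, maxAttempts) {
  const seen = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const n = getCount();
    seen.push(n);
    if (n === expected) {
      return { ok: true, attempts: attempt, seen: seen };
    }
    pause(); // wait before asking again
  }
  return { ok: false, attempts: maxAttempts, seen: seen };
}

// Stand-in for the real count: invented readings that start too high
// (as observed during chunk migration) and settle on the imported total.
const readings = [530112, 521340, 516784];
let next = 0;
const result = waitForCount(
  () => readings[Math.min(next++, readings.length - 1)],
  516784,
  () => {}, // no-op pause for the simulation
  10
);
console.log(result.ok, result.attempts);
```

In a mongo shell connected to the mongos, getCount would presumably be a function that returns the collection's count(), and pause a call such as sleep(1000). I have not tested it in that form.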
This is all very well, but my day job is to provide IT services to an organisation which requires “exactly once”. They’re not going to like the same document being returned more than once just because it’s in a chunk which is being migrated.
So… Is there a way to query the cluster so that, whilst a chunk is being migrated, a document in that chunk is returned exactly once: either from the shard that it’s being migrated from, or from the shard that it’s being migrated to, but never from both?