Best practice for using _id for natural PKs

If there is a natural PK such as invoice number or part id and this PK is immutable, is it best practice to use _id for the natural PK rather than adding a separate field for it?

What if the natural PK has two components, e.g. country code and passport number? Would best practice be to make _id a document with two fields? Or would it be better to leave _id alone, create two separate fields for the PK, and add a unique compound index on them?

My opinion…

There’s no harm in replacing the auto-generated _id value with a natural, manually generated PK. However, besides the burden of supplying unique values for this field yourself, you need to bear in mind the following points:

  1. the auto-generated _id (an ObjectId) embeds a timestamp of when the document was created, so you’ll be losing this functionality; you can of course store a creation timestamp yourself in a separate field
  2. the auto-generated _id also contains a machine id + process id component (a 5-byte random value in current drivers), which means that if there’s a requirement to merge documents or collections (especially from different machines), the values will almost certainly still be unique
  3. if there’s ever a requirement to pre-fill some fields before the PK is known, you won’t be able to insert these pre-filled values until the PK is available
  4. if for any reason you forget to assign a value to the _id field during document creation, the auto-generated value will be inserted instead, which means that the document will not follow your convention. Because _id cannot be changed once it is set, fixing this means inserting a corrected copy and deleting the original, or catching the problem before the document gets inserted.

Re your composite PKs example, these two fields are meaningful values and shouldn’t be combined. Plus, a person’s country code and/or passport number can change so this may not be the best PK for certain use cases.

So if you’re comfortable with the above points, then go ahead and replace the _id field with your own. Otherwise, leave the _id field as-is and maintain a separate unique indexed field(s).
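To make the two options from the question concrete, here is a minimal sketch of the document shapes involved, written as plain Python dicts. The field names (`country`, `number`, `name`), the sample values and the helper `passport_id` are made up for illustration. One thing to keep in mind with an embedded-document _id is that field order is significant, so it helps to build the value in one place.

```python
def passport_id(country_code: str, passport_number: str) -> dict:
    """Build an embedded-document _id with a fixed field order (hypothetical helper)."""
    # MongoDB compares embedded documents field by field, in order, so
    # {"country": ..., "number": ...} and {"number": ..., "country": ...}
    # would count as two different _id values.
    return {"country": country_code, "number": passport_number}


# Option A: the natural composite key is stored in _id itself.
doc_a = {"_id": passport_id("NZ", "LA123456"), "name": "Example Holder"}

# Option B: _id is left to be auto-generated and the key lives in its own
# fields; uniqueness would then come from a unique compound index on
# (country, number), created separately on the collection.
doc_b = {"country": "NZ", "number": "LA123456", "name": "Example Holder"}
```

With option B the key fields stay individually queryable, at the cost of one extra index on top of the default _id index.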


As far as I know, an _id value is guaranteed to be unique only within a collection, so there is some probability (very small :d) of the same _id appearing in two collections.

@Shubham_Ranjan, am I right? :)

I think only @kanikasingla (Curriculum Support Engineer) and @danielcoupal (Curriculum Engineer) look after this course, so let’s see what they say.

_id = epoch time + machine id + process id + counter (that starts from a random value), concatenated into 12 bytes

When you consider the combination of these values, the chance of getting a duplicate between two collections on the same machine is very, very slim. The only way there can be a crossover between two collections is when the epoch time, process id and counter are all the same, and in practical terms the chance of that happening is near zero. I guess that if it also included a hash of the collection name, the possibility of a crossover would be reduced even further.


Hi @babaikawow,

As @007_jb mentioned, the chance of getting the same _id is very, very slim.

By default _id is an ObjectId, whose first 4 bytes are a timestamp (Unix epoch time in seconds). The timestamp alone can repeat within the same second, so uniqueness also relies on the remaining bytes. And since _id acts as the primary key, inserting a document with an _id value that already exists in the collection results in a duplicate key error.

Kanika

@kanikasingla,
@babaikawow is talking about two separate collections potentially in the same db :slight_smile:

My bad :frowning:

Interesting question, but that case will also have a very slim chance of producing the same _id. Machine id, process id, timestamp and counter value all being the same at one time would be very rare.

Kanika

I did a little research and got the following result; maybe it will be interesting for you :D

There is a great explanation of the problem on [stackoverflow](https://stackoverflow.com/questions/4677237/possibility-of-duplicate-mongo-objectids-being-generated-in-two-different-colle). Look at the answer :slight_smile:

I also ran a small experiment. Imagine that we have two collections, Books and Newspapers.
Remember that an ObjectId consists of 12 bytes: 4 for the timestamp, 5 for a random value, and 3 for an incrementing counter.

Within one second I inserted (using nodejs) 1 document into Books; it automatically got the generated ObjectId 5de64eabba9830b9cad7a0c4.
After that I inserted 1 document into Newspapers, and it got the ObjectId 5de64eabba9830b9cad7a0c5.

So the parts of the two ObjectIds are:

Timestamp   "Random value"    Increment counter
5de64eab     ba9830b9ca       d7a0c4
5de64eab     ba9830b9ca       d7a0c5
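
The same split can be reproduced with a few lines of standard-library Python. This is only a sketch that assumes the 4 + 5 + 3 byte layout described above; `split_object_id` and `ObjectIdParts` are names invented here, not part of any driver.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class ObjectIdParts(NamedTuple):
    created: datetime   # first 4 bytes: seconds since the Unix epoch
    random_hex: str     # next 5 bytes: per-process random value
    counter: int        # last 3 bytes: incrementing counter


def split_object_id(hex_id: str) -> ObjectIdParts:
    """Split a 24-character hex ObjectId into its three parts."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")
    return ObjectIdParts(
        created=datetime.fromtimestamp(seconds, tz=timezone.utc),
        random_hex=raw[4:9].hex(),
        counter=int.from_bytes(raw[9:12], "big"),
    )


books = split_object_id("5de64eabba9830b9cad7a0c4")
newspapers = split_object_id("5de64eabba9830b9cad7a0c5")

# Same second, same random value, counter one step apart.
print(books.created == newspapers.created)
print(books.random_hex == newspapers.random_hex)
print(newspapers.counter - books.counter)
```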

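So the increment counter is shared across collections rather than kept per collection, and the chance of a duplicate in two collections is really minimal: the 3-byte counter only wraps around after 2^24 = 16,777,216 values, so more than that many ObjectIds would have to be generated within a single second, which I think is practically impossible.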
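
To illustrate the wraparound argument, here is a toy generator that follows the 4 + 5 + 3 byte layout. It is a simplified model written for this thread, assuming one random value and one counter per process; `ToyObjectIdGenerator` is a made-up name and this is not how any real driver is implemented.

```python
import os
import time
from typing import Optional

COUNTER_SIZE = 2 ** 24  # a 3-byte counter has 16,777,216 distinct values


class ToyObjectIdGenerator:
    """Simplified model of ObjectId generation; not a real driver."""

    def __init__(self) -> None:
        self._random = os.urandom(5)  # fixed for the life of the process
        self._counter = int.from_bytes(os.urandom(3), "big")  # random start

    def next_id(self, now: Optional[int] = None) -> str:
        """Return the next id as 24 hex characters."""
        seconds = int(time.time()) if now is None else now
        counter = self._counter
        # The counter wraps back to 0 after 2^24 - 1.
        self._counter = (self._counter + 1) % COUNTER_SIZE
        raw = seconds.to_bytes(4, "big") + self._random + counter.to_bytes(3, "big")
        return raw.hex()


# One generator serves every collection, so ids created in the same second
# differ only in the counter part.
gen = ToyObjectIdGenerator()
id_for_books = gen.next_id(now=1575374507)
id_for_newspapers = gen.next_id(now=1575374507)
```

In this model two ids can only coincide if the timestamp and random value match and the counter has gone all the way around within that same second.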

There are some other scenarios, but they are also highly unlikely.

MongoDB lets us insert a document with any valid ObjectId as “_id”, so we can deliberately create a duplicate across two collections that way, for example. MongoDB enforces the uniqueness of “_id” only per collection, not per database.

Hope this information is useful :smiley:
I would be grateful for any clarifications or comments.
