MongoDB Compass Sample Accuracy

Hi everyone,

I just started the course and it’s my first time studying MongoDB/NoSQL databases.

I’m struggling to understand how MongoDB Compass samples are supposed to help us.

If you analyze the video.movies schema multiple times, the results for the ‘genre’ field vary a lot.
This occurs because the sample analyzes only 1000 documents, while the collection actually contains more than a million documents.

So I wonder how useful the sample is, since its accuracy is based, in this case, on roughly 0.1% of the current data (1000 out of more than a million documents).

Thanks in advance.


Hi Fabio,

Did you get any response on this?
I had a similar doubt while analyzing the citibike.trips data.

In citibike.trips, the value ‘W 52 St & 5 Ave’ for ‘start station name’ shows as 1% of the sample of 1000.
Based on that estimate, the filter {'start station name': 'W 52 St & 5 Ave'} should return 19902 documents,
but it returns only 7475, which is about 0.4% of the collection.
Hence we cannot rely on these estimates.

Let me know if you get any response, or if I am doing anything wrong in my calculation. Thanks,
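
The calculation above can be checked with a short sketch. It assumes the collection total implied by the post (19902 documents corresponding to 1%, so about 1,990,200 documents; that total is inferred here, not read from Compass), and the function names are only illustrative. It also estimates how much a 1000-document sample can be expected to wobble for a value this rare.

```python
import math

SAMPLE_SIZE = 1000          # documents in the Compass sample (from the post)
PREDICTED_MATCHES = 19902   # 1% extrapolated to the whole collection (from the post)
ACTUAL_MATCHES = 7475       # documents returned by the filter (from the post)

# Assumption: the collection total is inferred from "1% -> 19902 documents".
TOTAL_DOCS = PREDICTED_MATCHES * 100


def true_share(matches: int, total: int) -> float:
    """Fraction of the whole collection that matches the filter."""
    return matches / total


def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n random documents."""
    return math.sqrt(p * (1 - p) / n)


p = true_share(ACTUAL_MATCHES, TOTAL_DOCS)
se = standard_error(p, SAMPLE_SIZE)
expected_in_sample = p * SAMPLE_SIZE

print(f"share in full collection: {p:.4%}")
print(f"expected matches per {SAMPLE_SIZE}-document sample: {expected_in_sample:.2f}")
print(f"standard error of the sampled share: {se:.4%}")
```

With only a handful of matching documents expected in each sample, every chance hit or miss moves the displayed percentage by 0.1 percentage points, so estimates for rare values are necessarily coarse.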

Hi Ismail,

Unfortunately I still have this doubt :frowning:

Hi @Fabio_05467, @Ismail_37955,

Sorry for the late response.

What is sampling and why is it used?

Sampling in MongoDB Compass is the selection of a subset of data from a particular collection and the analysis of the documents within that sample set.

Sampling is a common technique in statistical analysis because analyzing a subset of the data gives similar results to analyzing all of it. In addition, sampling allows results to be generated quickly rather than performing a computationally-expensive collection scan.
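
To give a feel for how "similar" the results are, here is a minimal simulation sketch. It does not use Compass or MongoDB at all: it simply draws repeated random samples of 1000 from a made-up population in which one value has a 30% share and another has a 0.4% share. Both shares, the number of repetitions, and the helper name `sample_share` are assumptions chosen for illustration, and independent draws are used as an approximation of sampling from a very large collection.

```python
import random


def sample_share(population_share: float, sample_size: int, rng: random.Random) -> float:
    """Simulate one sample and return the share of documents holding the value."""
    hits = sum(1 for _ in range(sample_size) if rng.random() < population_share)
    return hits / sample_size


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed shares: a common value (30%) and a rare value (0.4%).
common = [sample_share(0.30, 1000, rng) for _ in range(20)]
rare = [sample_share(0.004, 1000, rng) for _ in range(20)]

print(f"common value: min {min(common):.1%}, max {max(common):.1%}")
print(f"rare value:   min {min(rare):.1%}, max {max(rare):.1%}")
```

The common value stays close to its true share in relative terms from run to run, while the rare value can swing by a large factor, which is consistent with the variation reported earlier in this thread.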

Won’t sampling miss documents?

Sampling is chosen for its efficiency: the amount of time required to perform a sample is minimal, on the order of a few seconds. Increasing the sample confidence will demand more processing power and time. Furthermore, sophisticated outlier detection requires an inspection of every document in a MongoDB deployment, which would be unfeasible for large data sets. The MongoDB team is in the process of conducting user tests on large data sets to find a reasonable balance.

And if you want to analyze a specific set of data, you can always select that data using a query filter.
For example, to analyze the documents (in citibike.trips) that have tripduration: null, just type { tripduration: { $eq: null } } in the filter field of the query bar and click Analyze.
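
If you need an exact figure rather than a sampled estimate, you can also count matching documents outside Compass. The sketch below assumes the PyMongo driver; `count_documents` and `estimated_document_count` are PyMongo collection methods, while the helper name `share_matching` and the connection string in the commented usage are hypothetical.

```python
def share_matching(collection, query):
    """Return (matching, total, share) for `query` over the whole collection.

    `collection` is expected to behave like a PyMongo Collection.
    """
    matching = collection.count_documents(query)      # exact count of matches
    total = collection.estimated_document_count()     # fast count from metadata
    share = matching / total if total else 0.0
    return matching, total, share


# Hypothetical usage against a real deployment (the URI is a placeholder):
# from pymongo import MongoClient
# trips = MongoClient("mongodb://localhost:27017")["citibike"]["trips"]
# print(share_matching(trips, {"start station name": "W 52 St & 5 Ave"}))
```

Note that an exact count has to visit every matching document (or index entry), so on a large collection it can be much slower than a sample.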

For more details, please refer to the Schema documentation and the Compass FAQ entry "What is sampling and why is it used?"

Kanika