Any way to get faster inserts?

Alright, so maybe the best way to start is to break it into 11 clients, each one with its own file, no?

I am doing benchmarking for research purposes at my university on PostgreSQL, TimescaleDB and MongoDB, measuring query time, insert time, delete time, etc… Now, in the insert test I want the comparison to be equal for the 3 databases, so I want to use only one client for MongoDB as I did with the others…

Hi @harris,

Ok now I understand…

You should check out the following project and maybe use it, and/or just look at the code to learn the write pathways:

Thanks,
Pavel

Yes, I have checked it in the past, but I didn't find anything too helpful… The insert command I used for TimescaleDB and PostgreSQL was COPY, and for the MongoDB naive schema (1 document per row) I used insert_many, so with the MongoDB recommended schema (1 document with 12 subdocuments inside) I should use the same bulk method as the other ones… Anyway, I will try to fix the last code I sent here.
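For reference, the naive load was basically just one bulk insert_many per file, roughly like this (a minimal sketch; the connection string and collection name are placeholders, and sorted_files/process_file are the same names used in the code below):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
naive_col = client["benchmark"]["naive"]            # placeholder db/collection names

for file in sorted_files:
    df = process_file(file)
    naive_col.insert_many(df.to_dict("records"))    # one document per row, one bulk call per file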

I tried something like this. What do you think?

import math

for file in sorted_files:
    df = process_file(file)
    rows = df.to_dict("records")  # plain dicts, not DataFrame slices, so they are BSON-serializable
    row_count = len(rows)

    # top up the last partially filled bucket, if there is one
    next_bucket = mycol1.find_one({"nsamples": {"$lt": 12}})
    if next_bucket:
        available_in_next_bucket = 12 - len(next_bucket['samples'])
        rows_for_update = rows[0:available_in_next_bucket]
        mycol1.update_one(
            {'_id': next_bucket['_id']},
            {
                "$push": {"samples": {"$each": rows_for_update}},
                "$inc": {"nsamples": len(rows_for_update)}
            })
    else:
        # no partially filled bucket exists, so new buckets start from row 0
        available_in_next_bucket = 0

    # number of new buckets required past the initial bucket
    update_count = math.ceil((row_count - available_in_next_bucket) / 12)
    if update_count < 1:
        continue

    new_docs = []
    for i in range(update_count):
        # rows for the next bucket, offset by what went into the existing bucket
        bucket_rows = rows[available_in_next_bucket + i * 12:
                           available_in_next_bucket + (i + 1) * 12]
        new_docs.append({
            "samples": bucket_rows,
            "nsamples": len(bucket_rows)
        })
    mycol1.insert_many(new_docs)
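If the per-file find_one + update_one + insert_many is still not fast enough, another thing I might try (just a sketch, not measured against my data; connection details are placeholders) is letting the server fill the buckets itself with ordered upserts in a single bulk_write per file:

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
mycol1 = client["benchmark"]["recommended"]         # placeholder db/collection names

for file in sorted_files:
    df = process_file(file)
    ops = [
        UpdateOne(
            {"nsamples": {"$lt": 12}},                              # any bucket with room left
            {"$push": {"samples": row}, "$inc": {"nsamples": 1}},   # append one sample
            upsert=True,                                            # start a new bucket when none has room
        )
        for row in df.to_dict("records")
    ]
    # ordered=True so each upsert sees the buckets created/filled by the previous ones
    mycol1.bulk_write(ops, ordered=True)

This sends one operation per row but only one round trip per file; whether it beats the update-plus-insert approach is something I would have to measure.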

@harris,

Just FYI, we have some major improvements to the time series use case in the upcoming MongoDB roadmap.

You will be able to build and run time-series workloads faster, more easily, and less expensively than ever before.

Thanks,
Pavel

In my opinion, you are mixing too much application logic and database access to be able to get an accurate measure.

You are reading files, so your performance might be restricted by the media from which you are reading.

I recommend completely separating your file-reading logic and bucket creation from the inserts/updates done to the database; otherwise you will have difficulty determining what is slow. Once all your buckets are created, then and only then do the database operations.
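For example, something along these lines (a rough sketch reusing your process_file/sorted_files names; build_buckets() and the connection details are placeholders):

import time
from pymongo import MongoClient

def build_buckets(df, size=12):
    # pure application logic: no database access here
    rows = df.to_dict("records")
    return [{"samples": rows[i:i + size], "nsamples": len(rows[i:i + size])}
            for i in range(0, len(rows), size)]

# phase 1: read the files and build every bucket in memory (not timed)
all_buckets = []
for file in sorted_files:
    all_buckets.extend(build_buckets(process_file(file)))

# phase 2: time only the database work
client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
col = client["benchmark"]["recommended"]            # placeholder db/collection names
start = time.perf_counter()
col.insert_many(all_buckets)
print(f"insert took {time.perf_counter() - start:.3f} s")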

But this brings another issue: all these buckets need to fit in RAM, otherwise you might introduce another factor (swapping) that will influence your performance negatively.

Another issue to consider is resource contention between your client and the database, so it is better to have a dedicated database server with enough resources for your use case. You should also make sure you can isolate the traffic.

Since you are simulating time series in an IoT-like scenario, I think that @Pavel_Duchovny’s idea to split it into 11 clients is the best approach, because in real life your database access will come from many sources. With a single client you are serializing everything, which is not realistic.
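A rough sketch of that split (one process per file, reusing the hypothetical build_buckets() helper from the sketch above; the connection details are placeholders):

from multiprocessing import Pool
from pymongo import MongoClient

def insert_file(file):
    # create the client inside the worker: MongoClient is not fork-safe
    client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
    col = client["benchmark"]["recommended"]            # placeholder db/collection names
    col.insert_many(build_buckets(process_file(file)))
    client.close()

if __name__ == "__main__":
    # one worker per file, i.e. 11 concurrent "clients"
    with Pool(processes=len(sorted_files)) as pool:
        pool.map(insert_file, sorted_files)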