Increase WT allocation_size

Hi, on this link https://source.wiredtiger.com/3.1.0/tune_page_size_and_comp.html there is a note that allocation_size can be tuned between 512B and 128MB. How do we modify that variable and start a mongod process with an allocation_size of, for example, 16KB? The default is 4KB.

This does not work

    replica1:PRIMARY> db.adminCommand({ "setParameter": 1, "wiredTigerEngineRuntimeConfig": "allocation_size=64KB" })
    { "ok" : 0, "errmsg" : "WiredTiger reconfiguration failed with error code (22): Invalid argument", "code" : 2, "codeName" : "BadValue" }

    replica1:PRIMARY> db.createCollection("users", { storageEngine: { wiredTiger: { configString: "allocation_size=64KB" } } })
    { "ok" : 0, "errmsg" : "22: Invalid argument", "code" : 2, "codeName" : "BadValue" }

Hi, does anyone have any ideas? Thanks in advance!!

What version of MongoDB are you using?
It looks like the parameter value is not in the required format.
Please check mongod.log around the time when you ran this command. It may give more details.
Some docs suggest the value should be a power of 2.
Instead of giving 64KB, try 64×1024 (65536).
The MongoDB documentation does not give much detail, but refer to the WiredTiger docs.
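In concrete terms, the suggestion is to pass the size in bytes rather than with a unit suffix, e.g. (untested, and it may still fail for other reasons):

```javascript
// 64 * 1024 = 65536 bytes, instead of the "64KB" suffix form
db.adminCommand({
  setParameter: 1,
  wiredTigerEngineRuntimeConfig: "allocation_size=65536"
})
```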

Word of caution as per mongo doc

WARNING
Avoid modifying the wiredTigerEngineRuntimeConfig unless under the direction of MongoDB engineers as this setting has major implications across both WiredTiger and MongoDB.

Hi, Ramachandra, we are using v3.6.

What does your mongod.log say?
Did you try with the value 65536 (64×1024)?

I tried that and it’s the same error. I don’t see anything in the error log that can point me in some direction.

Hi @Al_Tradingsim

What is the motivation for changing this parameter? Are you seeing issues that necessitate changing it?

Note that changing internal WiredTiger parameters is neither supported nor encouraged, since the defaults were designed and tested for the vast majority of use cases. Changing the allocation size could put the deployment into untested territory.

If you’re having certain issues, please describe the issue in more detail, along with your MongoDB version and what you have tried (other than changing WiredTiger parameters).

Best regards,
Kevin

Thank you for your response Kevin

We are working on a dynamic market scanner. It lets users define and run custom dynamic scans across our data set. Scanners are filters built upon a set of primitives (e.g. last price or previous day closing price), but also function results like total volume, that is, the sum of all the trade sizes of the day up to the current timestamp.

MongoDB’s aggregation pipeline seems to be the perfect match for this new feature, because it can express many of the primitives we need without precomputing the values, which is an essential requirement for a dynamic scanner.

So far we have found that simple primitives like closing prices are pretty fast, as they essentially need just a lookup across the symbols at a given timestamp. Unfortunately this is not the case for aggregated primitives, like the total volume, which has to scan thousands of rows of the selected symbols in that day.

We tried different setups, and we found that nested documents are faster than flat ones, because they need fewer disk accesses. Disk is obviously playing a big role here, and for that reason we have pretty fast and expensive NVMe disks. We ran a set of benchmarks to test disk performance, and we found that our NVMe disks come close to memory bandwidth when reading blocks of 512KB: 5.5 GB/s vs 8.5 GB/s.

This should mean that sequential reads from disk can be almost as fast as memory, and for our scenario it means we should be able to read 3 GB of uncompressed data in about half a second. It turns out that Mongo is way slower than that.
While exploring the issue, we found that Mongo is actually allocating blocks at 4KB (wiredTiger.block-manager.file allocation unit size).
So, we tried the same disk benchmarks with a 4KB block size, assuming this is the block size Mongo is reading from the disks. The benchmarks show an 850MB/s maximum bandwidth, ~7 times less than the optimum. This matches what we are seeing in our Mongo benchmarks: the aggregation pipeline is 6 times faster on nested documents than on unwound flat ones.
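For reference, a sequential-read comparison like the one described can be sketched with fio (the file path, size, and queue depth here are illustrative, not the exact benchmark we ran):

```shell
# Sequential read at 512KB blocks (the optimum found above)
fio --name=seqread-512k --filename=/data/fio-testfile --rw=read \
    --bs=512k --size=4G --direct=1 --ioengine=libaio --iodepth=32

# Same test at 4KB blocks (WiredTiger's default allocation unit)
fio --name=seqread-4k --filename=/data/fio-testfile --rw=read \
    --bs=4k --size=4G --direct=1 --ioengine=libaio --iodepth=32
```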

So, we are wondering if we can improve overall MongoDB performance by increasing the WiredTiger file allocation unit size to 512K, matching the optimum block size from our disk benchmarks. Is that possible? Are there any other tricks to achieve the NVMe's maximum read speed from Mongo?

Let me know what you think and feel free to ask me any more detail.

And to answer your question regarding the Mongo version, we are on Mongo 3.6.

Hi @Al_Tradingsim

I think you have done an impressive amount of work in figuring out how the disk performed with various block sizes.

Having said that, there may be further optimizations that could be done on your schema & query design that may or may not necessitate tuning internal knobs. I would suggest exploring all possible optimization avenues (query, indexing, schema design, etc.) before turning to WiredTiger allocation sizes, as this is the riskiest approach and may lead to unintended consequences. Is this something that is possible in your use case?

Unfortunately this is not the case for aggregated primitives, like the total volume, which has to scan thousands of rows of the selected symbols in that day.

Is this the specific query that is not as performant as you need? Could you provide some example documents and the aggregation, and also the required result?

We tried different setups, and we found that nested documents are faster than flat ones, because they need fewer disk accesses.

I’m curious if this means that your working set cannot fit in RAM, since in most cases, you want to avoid having to read from disk as much as possible and do most work from RAM. Could you elaborate on your provisioned hardware?

Best regards,
Kevin

Hi @kevinadi.
I’m the lead developer at tradingsim working on this issue.
The specific query is an aggregation pipeline of timesales data. A document is a simple object: {sym: 'symbol', price: XXX, size: YYY, timestamp: ZZZ}. The pipeline works on a subset of the symbols and in a user defined date range, it groups by minutes & select some prices in the group (last, first, max, min) sums the size, and finally it sorts the results. We have tens of thousands of symbols and millions of timesales across many years of data, queried by tens of users concurrently with very low latency requirements (sub second).
As @Al_Tradingsim said in the previous post, we have already tried different schemas, starting from flat objects (82 bytes average size) to nested docs in minute blocks (770 bytes on average). The next step will be using nested docs in daily blocks, but this needs substantial changes to the pipeline code and a full data reload, which is a significant effort.
Our benchmarks show that reading data in 512KB blocks from our NVMes is on par with the RAM bandwidth for the same amount of data (5.5 GB/s vs 8.5 GB/s). Therefore, our guess is that big read blocks can outperform small ones when accessing the disks. This is an optimization that could boost many queries we actually run, not just this specific pipeline.
About the hardware: we have a cluster of 3 Xeon Gold 5222 servers: 184GB RAM, 12TB RAID-0 NVMe disks.
RAID details:

Personalities : [raid0]
md0 : active raid0 nvme3n1p1[3] nvme2n1p1[2] nvme0n1p1[0] nvme1n1p1[1]
      12501934080 blocks super 1.2 512k chunks

Hi @Ivano_Picco, welcome to the community.

Actually, the command to change allocation_size failed because it is constrained by at least two other parameters: internal_page_max and leaf_page_max. Both of them must be multiples of allocation_size. Since internal_page_max defaults to 4k and leaf_page_max defaults to 32k, setting allocation_size larger than 4k will fail. To be able to set allocation_size larger than 4k, you must also increase those two numbers.
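For example, something along these lines might pass WiredTiger’s validation (untested and unsupported, so treat it strictly as a sketch):

```javascript
// allocation_size=64KB requires both page maxima to be multiples of 64KB
db.createCollection("users", {
  storageEngine: {
    wiredTiger: {
      configString: "allocation_size=64KB,internal_page_max=64KB,leaf_page_max=64KB"
    }
  }
})
```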

Having said that, this is a very use-case-specific tuning and should only be attempted when everything else on the MongoDB and hardware side has failed to produce the desired outcome, since tuning those numbers could have unintended consequences, wasted disk space being one of them. Have you tried experimenting with different readahead settings? If it is set too high, you might see lower performance.
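Readahead can be checked and adjusted per block device with blockdev; the device name below matches the RAID array shown earlier, and the value 32 (in 512-byte sectors) is only an illustrative starting point, not a recommendation:

```shell
# Show the current readahead, in 512-byte sectors
sudo blockdev --getra /dev/md0

# Set readahead to 32 sectors (16KB); experiment to find what suits the workload
sudo blockdev --setra 32 /dev/md0
```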

A key performance indicator is checking the query’s explain results and seeing how many times the query yields (which typically indicates a disk bottleneck), how many documents were returned vs. documents examined (which indicates query targeting inefficiency), whether the right indexes are being used, etc. I would start from this area before going deep into WiredTiger tuning.
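As a sketch, the relevant counters show up under executionStats (the collection name and filter here are illustrative):

```javascript
var stats = db.timesales.find({
  sym: "AAPL",
  timestamp: { $gte: ISODate("2020-01-02T00:00:00Z") }
}).explain("executionStats").executionStats;

// A large gap between returned and examined suggests poor query targeting
printjson({
  nReturned:         stats.nReturned,
  totalDocsExamined: stats.totalDocsExamined,
  totalKeysExamined: stats.totalKeysExamined
})
```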

If you haven’t seen it yet, there are also a series of blog posts for time series data which may be worth checking: Time Series Data and MongoDB: Part 1, Part 2, and Part 3.

Another tool that could be useful is Keyhole, where you can quickly examine the database’s performance. It can work with seed data, where you can supply your example documents so the tests are more tailored to your use case. See the blog posts linked in the GitHub description for details on Keyhole’s operation, and also other avenues for MongoDB performance analysis.

Best regards,
Kevin
