Best practices for syncing large data sets

I’m working on a project that involves syncing wearable data with MongoDB Realm. Data can be collected continuously for days at a time, and 24 hours’ worth of data can easily reach 2-3 GB.

This data never needs to be accessed by the client, so ideally, it would be write-only and would live exclusively in its Atlas collection.

My understanding is that MongoDB Realm more or less syncs down everything in a partition that the user has read access to. If this is the case, the footprint our app takes up on the device could hit the device’s storage capacity very quickly.

I think our use case is a good candidate for query-based sync, but unfortunately, that’s not quite ready yet. In the meantime, is there a workaround that would allow us to get all the benefits of sync while choosing which collections we want to keep on the device?

So far, we’ve tried inserting write-only documents using cloud functions called from the iOS SDK, but if the internet connection is lost, we’d lose that data. Any help is greatly appreciated!

Hi @Obi_Anachebe,

If some of the data does not need to be synced, why not configure the partition read rules so that the unneeded data is never synced?

Or are the clients producing all of the data?

My thought is that data not produced by the device would need to be accessed over the internet… Does that work?
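
For example, partition-based Sync lets the read permission delegate to a function that receives the requested partition value (%%partition), so you decide per partition whether it syncs down at all. A rough sketch only, assuming a canReadPartition function and a user=<id> partition format:

// canReadPartition — referenced from the Sync configuration’s read rule, e.g.
// { "%%true": { "%function": { "name": "canReadPartition", "arguments": ["%%partition"] } } }.
// Sketch only: the partition format ("user=<id>") is an assumption.
exports = async function(partition) {
  // Only let a user open partitions that belong to them; anything else
  // (for example an archive partition) never syncs down to the device.
  return partition === `user=${context.user.id}`;
};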

I recommend viewing this:

Scalable Realm Sync Design - YouTube

Thanks
Pavel

Hey, @Pavel_Duchovny! The clients are the ones producing the data (wearable data is captured from the mobile app), so we need the client to have write permissions.

Would configuring our permission rules to be “date aware” allow us to only sync down data within a certain time range, to limit our footprint on the device?

For example, we could set the partition key to be:

user=\(user.id)&timestamp=1612248970621

And our cloud function that assesses read permissions would only grant read access to partitions with a timestamp within the last 24 hours, while our write-permission cloud function would grant write access regardless of the timestamp.
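
Roughly, the read-permission function I have in mind would look something like this (just a sketch — it’s assumed to be wired into the Sync read rule and called with the requested partition value):

// Grants read access only when the partition belongs to the calling user AND its
// timestamp falls within the last 24 hours, so older data never syncs down.
// (Sketch — the parameter names come from the partition format above.)
exports = async function(partition) {
  const params = new URLSearchParams(partition);

  // The partition must belong to the requesting user.
  if (params.get("user") !== context.user.id) return false;

  // Only partitions stamped within the last 24 hours are readable.
  const ts = parseInt(params.get("timestamp"), 10);
  if (Number.isNaN(ts)) return false;

  const DAY_MS = 24 * 60 * 60 * 1000;
  return Date.now() - ts <= DAY_MS;
};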

Would this allow us to not have every single document duplicated on the device?

That’s an interesting solution! Have you considered having one partition where the client can write and a database trigger that changes the partition of the data to another one where the client doesn’t have read access? This will also appear to ‘delete’ the data from the client’s point of view.
This may simplify the rules and avoid having to change the partition on the device every day.

Great suggestion! This worked perfectly. I set up an insert trigger that looks something like this:

exports = async function(changeEvent) {
  const docId = changeEvent.documentKey._id;

  // The partition key is formatted like "user=<userId>&timestamp=<ms>",
  // so it can be parsed with URLSearchParams to pull out the user id.
  const partition = changeEvent.fullDocument._partitionKey;
  const urlParams = new URLSearchParams(partition);
  const userId = urlParams.get("user");
  if (!userId) return;

  // Move the document into a partition the client has no read access to,
  // which removes it from the device on the next sync.
  const collection = context.services.get("mongodb-atlas").db("MY_DB").collection("MY_COLLECTION");
  return await collection.updateOne({ _id: docId }, { $set: { _partitionKey: "write_only=" + userId } });
};

Now let’s say I only wanted to keep items from that collection that were created within the last X hours on the client. How would I go about doing this?

My idea was to run a scheduled trigger every hour that updates an item’s partition if it was created more than X hours ago.

I’m not sure adding timestamp info to the partition key would work because my understanding is that the realm partition value has to match the object’s partition exactly to be synced to the client.

Hi @Obi_Anachebe,

Then I think a scheduled trigger will work for you.

You may also consider Online Archive and querying the data through federated queries… but Sync only works with non-archived data…

However, if that’s too complex, you can change the partition key every X hours with a trigger. I suggest doing the updates in batches rather than with a single updateMany command.
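
Something along these lines for the scheduled trigger (only a sketch — the database/collection names, the recordTime field being epoch milliseconds, and the batch size are assumptions):

// Scheduled trigger (e.g. hourly): move readings older than X hours into the
// write-only partition so they stop syncing down to the device.
exports = async function() {
  const HOURS_TO_KEEP = 24; // "X"
  const cutoff = Date.now() - HOURS_TO_KEEP * 60 * 60 * 1000;

  const collection = context.services
    .get("mongodb-atlas")
    .db("MY_DB")
    .collection("MY_COLLECTION");

  // Only documents still in a client-readable "user=..." partition.
  const oldDocs = await collection.find(
    { _partitionKey: { $regex: "^user=" }, recordTime: { $lt: cutoff } },
    { _id: 1, _partitionKey: 1 }
  ).toArray();

  // Group ids by user so each user's documents can be re-partitioned together.
  const idsByUser = {};
  for (const doc of oldDocs) {
    const userId = new URLSearchParams(doc._partitionKey).get("user");
    if (!userId) continue;
    (idsByUser[userId] = idsByUser[userId] || []).push(doc._id);
  }

  // Update in modest batches instead of one huge updateMany, so Sync does not
  // have to translate one massive change set at once.
  const BATCH_SIZE = 500;
  for (const [userId, ids] of Object.entries(idsByUser)) {
    for (let i = 0; i < ids.length; i += BATCH_SIZE) {
      await collection.updateMany(
        { _id: { $in: ids.slice(i, i + BATCH_SIZE) } },
        { $set: { _partitionKey: "write_only=" + userId } }
      );
    }
  }
};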

Thanks,
Pavel

One issue I’ve been running into since implementing this is that, over time, we’ve built up a large backlog of sync operations: inserting the initial document, updating it, and then changing the partition key, which deletes and re-inserts the document.

It’s gotten so bad that it can take up to 15 minutes to sync down initial data. And sometimes, new inserts don’t show up in Atlas right away. Is there any way around this, or is this just a side effect of this approach?

For reference, an account can have between 100k and 500k documents, with each document averaging about 2.8 KB. Here is the document schema:

class Reading: Object {
    @objc dynamic var _id: ObjectId = .generate()
    @objc dynamic var acc: Motion?
    @objc dynamic var deviceId: String?
    @objc dynamic var deviceSessionId: ObjectId?
    let ecg = List<Int64>()
    let hr = RealmProperty<Int64?>()
    let leadOn = RealmProperty<Bool?>()
    let recordTime = RealmProperty<Int64?>()
    let rr = RealmProperty<Double?>()
    let rri = List<Int64>()
    let rwl = List<Int64>()
    @objc dynamic var type: String?
    @objc dynamic var data: String?
    @objc dynamic var userId: String?

    static override func primaryKey() -> String? {
        return "_id"
    }
}

class Motion: EmbeddedObject {
    let x = RealmProperty<Int64?>()
    let y = RealmProperty<Int64?>()
    let z = RealmProperty<Int64?>()
}

Hi @Obi_Anachebe ,

To better investigate this and involve engineering, you can open a support case with us.

Reference this conversation there for context.

Thanks
Pavel
