
How to Automate Continuous Data Copying from MongoDB to S3

Published: May 03, 2021

  • MongoDB
  • Atlas
  • Online Archive

By Joe Karlsson


Modern always-on applications rely on automatic failover and real-time data access. MongoDB Atlas supports automatic backups out of the box, but you may still want to copy your data into another system, whether to run advanced analytics or to isolate an analytical workload from your operational one. For cases like these, it can be incredibly useful to set up automatic, continuous replication of your data.

In this post, we are going to set up an automated way to continuously copy data from a MongoDB database into an AWS S3 bucket in the Parquet data format by using MongoDB Atlas Database Triggers. We will first set up a MongoDB Atlas Data Lake to consolidate a MongoDB database and our AWS S3 bucket. Next, we will set up a Trigger to automatically add a new document to a collection every minute, and another Trigger to automatically copy our data to our S3 bucket. Lastly, we will run a test to ensure that our data is being continuously copied into S3 from MongoDB.

# What is Parquet?

For those of you not familiar with Parquet, it's an amazing file format that does a lot of the heavy lifting to ensure blazing fast query performance on data stored in files. This is a popular file format in the Data Warehouse and Data Lake space as well as for a variety of machine learning tasks.

One thing we frequently see users struggle with is getting NoSQL data into Parquet as it is a columnar format. Historically, you would have to write some custom code to get the data out of the database, transform it into an appropriate structure, and then probably utilize a third-party library to write it to Parquet. Fortunately, with MongoDB Atlas Data Lake's $out to S3, you can now convert MongoDB Data into Parquet with little effort.
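To see what "columnar" means in practice, here is a small, self-contained Node.js sketch (illustrative only; real Parquet also adds typed schemas, encoding, compression, and metadata) that pivots row-shaped documents, like the ones our trigger will insert later, into per-field columns:

```javascript
// Three MongoDB-style documents as a document store holds them: row by row.
const rows = [
  { time: "2021-05-03T00:00:00Z", aNumber: 12.5, type: "event" },
  { time: "2021-05-03T00:01:00Z", aNumber: 47.1, type: "event" },
  { time: "2021-05-03T00:02:00Z", aNumber: 88.9, type: "event" }
];

// Pivot rows into columns: one contiguous array per field, which is
// roughly how a columnar format like Parquet lays data out on disk.
function toColumnar(docs) {
  const columns = {};
  for (const doc of docs) {
    for (const [field, value] of Object.entries(doc)) {
      (columns[field] = columns[field] || []).push(value);
    }
  }
  return columns;
}

const columnar = toColumnar(rows);

// An analytic query like "average of aNumber" now scans one compact
// array instead of deserializing every full document.
const avg =
  columnar.aNumber.reduce((sum, n) => sum + n, 0) / columnar.aNumber.length;
```

This is why analytic engines read Parquet so quickly: a query touching one field reads only that field's column, skipping the rest of each record entirely.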

# Prerequisites

In order to follow along with this tutorial yourself, you will need to do the following:

  1. Create a MongoDB Atlas account, if you do not have one already.
  2. Create an AWS account with privileges to create IAM Roles and S3 Buckets (to give Data Lake access to write data to your S3 bucket).

# Create a MongoDB Atlas Data Lake and Connect to S3

We need a MongoDB Atlas Data Lake because, once our data sources are connected to it, we can use its $out to S3 feature to replicate our MongoDB data, converting it to Parquet and saving it to our S3 bucket in one step.

The first thing you'll need to do is navigate to the "Data Lake" tab on the left-hand side of your Atlas Dashboard and then click "Create Data Lake" or "Configure a New Data Lake."

Screenshot from MongoDB Atlas showing the Create a Data Lake page with red arrows pointing to the Create Data Lake and Configure a New Data Lake buttons.

Then, you need to go ahead and connect your S3 bucket to your Atlas Data Lake. This is where we will write the Parquet files. The setup wizard should guide you through this pretty quickly, but you will need access to your credentials for AWS.

Note: For more information, be sure to refer to the documentation on deploying a Data Lake for an S3 data store. (Be sure to give Atlas Data Lake "Read and Write" access to the bucket so it can write the Parquet files there.)

Screenshot from MongoDB Atlas Data Lake configuration modal with the AWS S3 data store option selected.

Select an AWS IAM role for Atlas.

  • If you have already created a role that Atlas is authorized to use to read and write to your S3 bucket, select that role.
  • If you are authorizing Atlas for an existing role or are creating a new role, be sure to refer to the documentation for how to do this.

Enter the S3 bucket information.

  • Enter the name of your S3 bucket. I named my bucket mongodb-data-lake-demo.
  • Choose Read and write so that the Data Lake can write documents to your S3 bucket.

Assign an access policy to your AWS IAM role.

  • Follow the steps in the Atlas user interface to assign an access policy to your AWS IAM role.
  • Your role policy for read-only access should look similar to the following (substitute your bucket's ARN for the placeholder):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "<bucket arn>"
      ]
    }
  ]
}
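Since this tutorial writes Parquet files into the bucket, the role needs write permissions as well. Per the AWS IAM model, a read-and-write policy adds the s3:PutObject and s3:DeleteObject actions; a sketch (again with a placeholder for your bucket's ARN) might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:GetBucketLocation",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "<bucket arn>"
      ]
    }
  ]
}
```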
  • Define the path structure for your files in the S3 bucket and click Next.
  • Once you've connected your S3 bucket, we're going to create a simple data source to query the data in S3, so we can verify we've written the data to S3 at the end of this tutorial.

# Connect Your MongoDB Database to Your Data Lake

Now, we're going to connect our Atlas Cluster, so we can write data from it into the Parquet files on S3. This involves picking the cluster from a list of clusters in your Atlas project and then selecting the databases and collections you'd like to create Data Sources from and dragging them into your Data Lake.

Screenshot from MongoDB Atlas showing the MongoDB Atlas Cluster option selected and the cluster with our event data in it also selected.

Screenshot from MongoDB Atlas Data Lake configuration page showing how the Atlas Cluster and S3 data source have been moved into our new Data Lake.

# Create a MongoDB Atlas Trigger to Create a New Document Every Minute

Now that we have all of our data sources set up in our brand-new Data Lake, we can set up a MongoDB Trigger to automatically generate a new document every minute for our continuous replication demo. Triggers let you execute server-side logic in response to database events or on a predefined schedule; Atlas offers both Database and Scheduled Triggers. Here we will use a Scheduled Trigger to insert the sample documents that our second Trigger will later archive in our S3 bucket.

  1. Click the Atlas tab in the top navigation of your screen if you have not already navigated to Atlas.
  2. Click Triggers in the left-hand navigation.
  3. On the Overview tab of the Triggers page, click Add Trigger to open the trigger configuration page.
  4. Enter these configuration values for our trigger:
Screenshot from MongoDB Atlas Trigger configuration page showing the options for our new Trigger. Trigger type: Scheduled. Name: Create_Event_Every_Min_Trigger. Enable: On. Schedule Type: Basic. Link Data Sources: Cluster 1 and atlas_data_lake. Select an Event Type: Function.

And our Trigger function looks like this:

exports = async function () {

  // Link to the Atlas cluster service attached to this Trigger.
  const mongodb = context.services.get("NAME_OF_YOUR_ATLAS_SERVICE");
  const db = mongodb.db("NAME_OF_YOUR_DATABASE");
  const events = db.collection("NAME_OF_YOUR_COLLECTION");

  // Insert one sample event document per scheduled run.
  const event = await events.insertOne(
    {
      time: new Date(),
      aNumber: Math.random() * 100,
      type: "event"
    }
  );

  return JSON.stringify(event);
};

Lastly, click Run and check that your database is getting new documents inserted into it every 60 seconds.
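A quick way to do that check is to compare the newest document's time field against the current clock. This small sketch captures the logic (the isTriggerHealthy helper is hypothetical, not an Atlas API; the 90-second threshold just allows for scheduling jitter):

```javascript
// Given the newest document's `time`, decide whether the minutely
// trigger is keeping up. 90 seconds of slack tolerates jitter.
function isTriggerHealthy(newestEventTime, now = new Date()) {
  return now - newestEventTime <= 90 * 1000;
}

// In mongosh you would feed it the most recent document, e.g.:
//   const newest = db.NAME_OF_YOUR_COLLECTION.find()
//     .sort({ time: -1 }).limit(1).next();
//   isTriggerHealthy(newest.time);
const healthy = isTriggerHealthy(new Date(Date.now() - 30 * 1000)); // → true
```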

Screenshot from the MongoDB Atlas webpage showing the data that was generated by the Create_Event_Every_Min_Trigger with a red box around the new data.

# Create a MongoDB Atlas Trigger to Copy New MongoDB Data into S3 Every Minute

Alright, now is the fun part. We are going to create a new MongoDB Trigger that copies our MongoDB data every 60 seconds utilizing MongoDB Data Lake's $out to S3 aggregation pipeline. Create a new Trigger and use these configuration settings.

Screenshot from MongoDB Atlas Trigger configuration page showing the options for our new Trigger. Trigger type: Scheduled. Name: archive_on_s3_every_min. Enable: On. Schedule Type: Basic. Link Data Sources: Cluster 1 and atlas_data_lake. Select an Event Type: Function.

Your Trigger function will look something like this. But there's a lot going on, so let's break it down.

  • First, we connect to our new Data Lake. This is different from the previous Trigger, which connected to our Atlas database. Be sure to put your Data Lake service name into context.services.get; you must connect through your Data Lake to use $out to S3.
  • Next, we build an aggregation pipeline whose first stage matches the documents inserted over the last 60 minutes. Because the $out stage below always writes to the same filename, each run overwrites the file with this rolling window of recent data.
  • Then, we use the $out aggregation stage to write the data from the previous stage to S3.
  • In format, we specify parquet and set a maxFileSize and maxRowGroupSize.

    • maxFileSize determines the maximum size of each partition.
    • maxRowGroupSize determines how records are grouped into "row groups" inside the Parquet file, which, like file size, affects query performance against your Parquet files.
exports = function () {

  // Connect to the Data Lake service, not the Atlas cluster:
  // $out to S3 is only available through the Data Lake.
  const datalake = context.services.get("NAME_OF_YOUR_DATA_LAKE_SERVICE");
  const db = datalake.db("NAME_OF_YOUR_DATA_LAKE_DATABASE");
  const events = db.collection("NAME_OF_YOUR_DATA_LAKE_COLLECTION");

  const pipeline = [
    {
      // Match the documents inserted over the last 60 minutes.
      $match: {
        "time": {
          $gt: new Date(Date.now() - 60 * 60 * 1000),
          $lt: new Date(Date.now())
        }
      }
    },
    {
      // Write the matched documents to S3 as a Parquet file.
      "$out": {
        "s3": {
          "bucket": "mongodb-data-lake-demo",
          "region": "us-east-1",
          "filename": "events",
          "format": {
            "name": "parquet",
            "maxFileSize": "10GB",
            "maxRowGroupSize": "100MB"
          }
        }
      }
    }
  ];

  return events.aggregate(pipeline);
};
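The $match stage selects a rolling window ending at the Trigger's run time. A quick sketch (the windowFor helper is hypothetical, not part of the Trigger API) shows why a one-hour window paired with a one-minute schedule means consecutive snapshots overlap heavily, so a document inserted between runs is always captured:

```javascript
// Build the same window bounds the $match stage computes at run time.
function windowFor(runTime, windowMs) {
  return {
    $gt: new Date(runTime - windowMs),
    $lt: new Date(runTime)
  };
}

const MINUTE = 60 * 1000;
const HOUR = 60 * MINUTE;

const firstRun = Date.parse("2021-05-03T12:00:00Z");
const w1 = windowFor(firstRun, HOUR);           // this run's window
const w2 = windowFor(firstRun + MINUTE, HOUR);  // next scheduled run

// Consecutive windows overlap by 59 minutes, so a document inserted
// between two runs appears in the next snapshot (and the ~59 after
// it) before it finally ages out of the rolling window.
const overlapMs = w1.$lt.getTime() - w2.$gt.getTime();
```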

If all went well, you should see your new Parquet file in your S3 bucket. I've enabled versioning on the bucket so you can see that the file is automatically updated every 60 seconds.

Screenshot from AWS S3 management console showing the new events.parquet document that was generated by our $out trigger function.

# Wrap Up

In this post, we walked through how to set up automated, continuous copying of data from a MongoDB database into an AWS S3 bucket in the Parquet data format by using MongoDB Atlas Data Lake and MongoDB Atlas Triggers. First, we set up a new MongoDB Atlas Data Lake to consolidate a MongoDB database and our AWS S3 bucket. Then, we set up one Trigger to automatically add a new document to a collection every minute, and another to automatically copy those newly generated documents into our S3 bucket.

We also discussed why Parquet is a great format for your MongoDB data when you need to use columnar-oriented tools like Tableau for visualizations or machine learning frameworks that work with data frames; Parquet files can be quickly and easily loaded into pandas DataFrames in Python.

If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.
