How to Get MongoDB Data into Parquet in 10 seconds or Less

Updated: Jun 11, 2021 |

Published: Jun 11, 2021

By Benjamin Flast

For those of you not familiar with Parquet, it’s an amazing file format that does a lot of the heavy lifting to ensure blazing fast query performance on data stored in files. This is a popular file format in the Data Warehouse and Data Lake space as well as for a variety of machine learning tasks.

One thing we frequently see users struggle with is getting NoSQL data into Parquet, because Parquet is a columnar format. Historically, you would have to write some custom code to get the data out of the database, transform it into an appropriate structure, and then probably utilize a third-party library to write it to Parquet. Fortunately, with MongoDB Atlas Data Lake’s $out to S3, you can now convert MongoDB data into Parquet with little effort.

In this blog post, I’m going to walk you through the steps necessary to write data from your Atlas Cluster directly to S3 in the Parquet format and then finish up by reviewing some things to keep in mind when using Parquet with NoSQL data. I’m going to use a sample data set that contains taxi ride data from New York City.

#Prerequisites

In order to follow along with this tutorial yourself, you will need the following:

  • An Atlas cluster with some data in it. (It can be the sample data.)
  • An AWS account with privileges to create IAM Roles and S3 Buckets (to give us access to write data to your S3 bucket).

#Create an Atlas Data Lake and Connect to S3

The first thing you’ll need to do is navigate to the “Data Lake” tab on the left-hand side of your Atlas dashboard and then click “Create Data Lake” or “Configure a New Data Lake.”

https://mongodb-devhub-cms.s3.us-west-1.amazonaws.com/mongodb_atlas_data_lake_6f9b5bd0e6.png

Then, you need to connect your S3 bucket to your Atlas Data Lake. This is where we will write the Parquet files. The setup wizard should guide you through this pretty quickly but you will need access to your credentials for AWS. (Be sure to give Atlas Data Lake “Read and Write” access to the bucket so it can write the Parquet files there.)

https://mongodb-devhub-cms.s3.us-west-1.amazonaws.com/mongodb_atlas_data_lake_configuration_page_9f9d17bae0.png

Once you’ve connected your S3 bucket, we’re going to create a simple data source to query the data in S3 so we can verify we’ve written the data to S3 at the end of this tutorial. Our new setup tool makes it easier than ever to configure your Data Lake to take advantage of the partitioning of data in S3. Partitioning allows us to only select the relevant data to process in order to satisfy your query. (I’ve put a sample file in there for this test that will fit how we’re going to partition the data by _cab_type).
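
To make the idea of partitioning concrete, here is a toy sketch of how a partitioned layout lets a query skip irrelevant files. It is an illustration only, not how Atlas Data Lake is implemented. It assumes objects are stored under "taxi-trips/<_cab_type>/", and the file names and the helper functions cabTypeOf and pruneKeys are hypothetical.

```javascript
// Toy model of partition pruning (illustration only, not Atlas Data Lake's code).
// Assumes S3 keys look like "taxi-trips/<_cab_type>/<file>"; file names are hypothetical.
function cabTypeOf(key) {
  const parts = key.split("/");
  // The partition value is the path segment right after the "taxi-trips" prefix.
  return parts.length >= 3 && parts[0] === "taxi-trips" ? parts[1] : null;
}

// Keep only the keys whose partition value matches the query's filter.
function pruneKeys(keys, cabType) {
  return keys.filter((key) => cabTypeOf(key) === cabType);
}

const keys = [
  "taxi-trips/green/part-0.parquet",
  "taxi-trips/green/part-1.parquet",
  "taxi-trips/yellow/part-0.parquet",
];

// A query filtering on _cab_type "green" only needs the files under that prefix.
const toScan = pruneKeys(keys, "green");
console.log(toScan);
```

Because the partition value lives in the path, files for other cab types never have to be opened when the query filters on _cab_type.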

mongoimport --uri mongodb+srv://<USERNAME>:<PASSWORD>@<MONGODB_URI>/<DATABASE> --collection <COLLECTION> --type json --file <FILENAME>

https://mongodb-devhub-cms.s3.us-west-1.amazonaws.com/mongodb_atlas_data_lake_config_page_path_components_840f20a4ad.png

https://mongodb-devhub-cms.s3.us-west-1.amazonaws.com/data_lake_visual_config_builder_22cc77069b.png

#Connect Your Data Lake to an Atlas Cluster

Now we’re going to connect our Atlas cluster, so we can write data from it into the Parquet files. This involves picking the cluster from a list of clusters in your Atlas project and then selecting the databases and collections you’d like to create Data Sources from and dragging them into your Data Lake.

https://mongodb-devhub-cms.s3.us-west-1.amazonaws.com/atlas_cluster_data_source_button_11737377af.png

https://mongodb-devhub-cms.s3.us-west-1.amazonaws.com/atlas_cluster_connected_in_data_lake_df7052880b.png

#$out to S3 in Parquet

Now we’re going to connect to our Data Lake using the mongo shell and execute the following command. This is going to do quite a few things, so I’m going to explain the important ones.

  • First, you can use the ‘filename’ field of the $out stage to have Atlas Data Lake partition files by “_cab_type”, so all the green cabs will go in one set of files and all the yellow cabs will go in another.
  • Then in the format, we’re going to specify parquet and determine a maxFileSize and maxRowGroupSize.

    • maxFileSize is going to determine the maximum size each partition will be.
    • maxRowGroupSize is going to determine how records are grouped inside of the Parquet file in “row groups” which will impact performance querying your Parquet files, similarly to file size.
  • Lastly, we’re using a special Data Lake aggregation option, “background: true”, which simply tells Atlas Data Lake to keep executing the query even if the client disconnects. (This is handy for long-running queries or environments where your network connection is not stable.)

db.getSiblingDB("clusterData").getCollection("trips").aggregate([
  {
    "$out" : {
      "s3" : {
        "bucket" : "ben.flast",
        "region" : "us-east-1",
        "filename" : {
          "$concat" : [
            "taxi-trips/",
            "$_cab_type",
            "/"
          ]
        },
        "format" : {
          "name" : "parquet",
          "maxFileSize" : "10GB",
          "maxRowGroupSize" : "100MB"
        }
      }
    }
  }
], {
  background: true
})
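
If you plan to run this kind of export for more than one collection, it can help to build the $out stage from parameters. The following is a minimal sketch of that idea, not an official helper: the function name buildParquetOut and the bucket name "my-example-bucket" are hypothetical, while the stage shape and the default sizes mirror the command above.

```javascript
// Hypothetical helper that builds a $out-to-S3 stage with the same shape as the command above.
function buildParquetOut(opts) {
  return {
    $out: {
      s3: {
        bucket: opts.bucket,
        region: opts.region,
        // Partition files by the value of the given field, e.g. "taxi-trips/green/".
        filename: { $concat: [opts.prefix + "/", "$" + opts.partitionField, "/"] },
        format: {
          name: "parquet",
          maxFileSize: opts.maxFileSize || "10GB",
          maxRowGroupSize: opts.maxRowGroupSize || "100MB",
        },
      },
    },
  };
}

const outStage = buildParquetOut({
  bucket: "my-example-bucket", // hypothetical bucket name
  region: "us-east-1",
  prefix: "taxi-trips",
  partitionField: "_cab_type",
});

// Only runs inside the mongo shell, where the global `db` exists.
if (typeof db !== "undefined") {
  db.getSiblingDB("clusterData").getCollection("trips").aggregate([outStage], { background: true });
}
```

Keeping the stage in a plain function also makes it easy to inspect the generated document before you run the aggregation.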

https://mongodb-devhub-cms.s3.us-west-1.amazonaws.com/aws_s3_interface_f21ce56fa9.png

https://mongodb-devhub-cms.s3.us-west-1.amazonaws.com/aws_s3_yellow_cab_folder_55e21a7e89.png

#Blazing Fast Queries on Parquet Files

Now, to give you some idea of the performance improvements you can see for object store data, I’ve written three sets of data, each with 10 million documents: one in Parquet, one in uncompressed JSON, and another in compressed JSON. I then ran a count command on each of them, with the following results.

db.trips.count() 10,000,000

Type | Data Size (GB) | Count Command Latency (Seconds)
JSON (Uncompressed) | ~16.1 | 297.182
JSON (Compressed) | ~1.1 | 78.070
Parquet | ~1.02 | 1.596
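
To put those numbers in perspective, here is a small sketch that computes how much faster and smaller the Parquet data set is relative to each JSON variant. It assumes the Parquet row is read as roughly 1.02 GB and 1.596 seconds; the variable names are only for illustration.

```javascript
// Figures taken from the results above (sizes in GB, latencies in seconds).
// Assumption: the Parquet row reads as ~1.02 GB and 1.596 seconds.
const results = [
  { type: "JSON (Uncompressed)", sizeGB: 16.1, seconds: 297.182 },
  { type: "JSON (Compressed)", sizeGB: 1.1, seconds: 78.07 },
  { type: "Parquet", sizeGB: 1.02, seconds: 1.596 },
];

const parquet = results.find((r) => r.type === "Parquet");

// Ratio of each format's count latency and size to Parquet's.
const comparison = results.map((r) => ({
  type: r.type,
  speedup: r.seconds / parquet.seconds,
  sizeRatio: r.sizeGB / parquet.sizeGB,
}));

for (const c of comparison) {
  console.log(c.type + ": " + c.speedup.toFixed(1) + "x latency, " + c.sizeRatio.toFixed(1) + "x size");
}
```

Under that reading, the count on Parquet comes out well over a hundred times faster than on uncompressed JSON, while the file is about the same size as the compressed JSON.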

#In Review

So, what have we done and what have we learned?

  1. We saw how quickly and easily you can create a Data Lake in MongoDB Atlas.
  2. We connected an Atlas cluster to our Atlas Data Lake.
  3. We used Data Lake to write Atlas cluster data to S3 in Parquet format.
  4. We demonstrated how fast and space-efficient Parquet is when compared to JSON.

#A Couple of Things to Remember About Atlas Data Lake

  1. Parquet is a super fast columnar format that can be read and written with Atlas Data Lake.
  2. Atlas Data Lake takes advantage of various pieces of metadata contained in Parquet files, not just the maxRowGroupSize. For instance, if the first stage in your aggregation pipeline was $project: {fieldA: 1, fieldB: 1}, we would only read those two columns from the Parquet file, which results in faster performance and lower costs as we are scanning less data.
  3. Atlas Data Lake writes Parquet files flexibly so if you have polymorphic data, we will create union columns so you can have ‘Column A - String’ and ‘Column A - Int’. Atlas Data Lake will read union columns back in as one field but other tools may not handle union types. So if you’re going to be using these Parquet files with other tools, you should transform your data before the $out stage to ensure no union columns.
  4. Atlas Data Lake will also write files with different schemas if it encounters data with varying schemas throughout the aggregation. It can handle different schemas across files in one collection, but other tools may require a consistent schema across files. So if you’re going to be using these Parquet files with other tools, you should do a $project with $convert expressions before the $out stage to ensure a consistent schema across generated files.
  5. Parquet is a great format for your MongoDB data when you need to use columnar oriented tools like Tableau for visualizations or machine learning frameworks that use data frames. Parquet can be quickly and easily converted into Pandas data frames in Python.
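
As a rough sketch of the advice in points 3 and 4, the pipeline below uses a $project stage with $convert expressions to pin each field to a single type before the $out stage. Apart from _cab_type, the field names (fare_amount, passenger_count), the helper toType, and the bucket name are hypothetical, so substitute the ones from your own collection.

```javascript
// Hypothetical helper: wrap a field in $convert so every document yields the same type.
// Values that cannot be converted (or are missing) become null instead of failing the query.
function toType(field, type) {
  return { $convert: { input: "$" + field, to: type, onError: null, onNull: null } };
}

const pipeline = [
  {
    // Project only the fields you need, each with one fixed type,
    // so no union columns or per-file schema differences are written.
    $project: {
      _id: 0,
      _cab_type: toType("_cab_type", "string"),
      fare_amount: toType("fare_amount", "double"), // hypothetical field
      passenger_count: toType("passenger_count", "int"), // hypothetical field
    },
  },
  {
    $out: {
      s3: {
        bucket: "my-example-bucket", // hypothetical bucket name
        region: "us-east-1",
        filename: { $concat: ["taxi-trips/", "$_cab_type", "/"] },
        format: { name: "parquet", maxFileSize: "10GB", maxRowGroupSize: "100MB" },
      },
    },
  },
];

// Only runs inside the mongo shell, where the global `db` exists.
if (typeof db !== "undefined") {
  db.getSiblingDB("clusterData").getCollection("trips").aggregate(pipeline, { background: true });
}
```

Setting onError and onNull to null is one possible choice here. It keeps the export running when a value cannot be converted, at the cost of writing nulls that you may want to check for afterwards.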