MongoDB to Redshift ETL: Full GuideMay 3, 2023
It’s a common story for any company with a modern data stack: your main transactional database does the heavy lifting to back your applications, but when it’s time for analytics, you need a copy of all that data in your data warehouse.
More precisely, you might find yourself having to move data from MongoDB to Redshift.
These two systems complement each other well: while MongoDB can handle a huge volume of structured and unstructured data, Redshift can be easily integrated with downstream applications for real-time analysis.
If you need to set up MongoDB to Redshift ETL (preferably in real time), this article will ease your work. We’ll cover a step-by-step guide to exporting data from MongoDB to Amazon Redshift using two methods.
What is MongoDB?
MongoDB is one of the most popular open-source NoSQL databases. Since the database is schemaless, it provides scope for rapid development and flexible deployment at scale. MongoDB is a flexible and easy-to-use database that stores data in JSON-like documents. Each document stores data in key-value pairs.
When it comes to managing large volumes of structured and unstructured data, MongoDB is the preferred choice over relational databases. MongoDB offers high scalability and flexibility with its superior technical features such as replication, load balancing, ad-hoc queries, sharding, indexing, and many more.
MongoDB supports Linux, macOS, and Windows operating systems. It is compatible with C, C++, Go, Node.js, Python, and PHP.
What is Amazon RedShift?
Amazon Redshift is a petabyte-scale fully managed data warehouse storage service hosted by Amazon Web Services (AWS). Unlike traditional databases that place values in rows and columns, Redshift stores data in a column-based format. Because of its columnar data storage, defined schema, and massively parallel processing, Redshift has become foundational for analytical reporting.
Amazon Redshift is popular for the following features:
- Massive Parallel Processing (MPP) executes the most complicated queries, even on elephant-sized data.
- Redshift stores data in columnar format which highly reduces the number of disk I/O requests to optimize analytical query performance.
- Whenever a user submits a query, Redshift verifies the result cache for a valid and cached copy of the query result. If the match is found in the result cache, it doesn’t execute the query. It rather uses a cached result to drastically reduce the query runtime.
MongoDB to Redshift ETL
You can quickly migrate your data from MongoDB to Amazon Redshift in two different ways:
- Manually Move Data from MongoDB to Redshift
- Using a SaaS Alternative to Move data from MongoDB to Redshift (in real time)
Method 1: MongoDB to Redshift ETL using Custom MongoDB Script
This section includes a step-by-step guide to moving data manually from MongoDB to Redshift.
In order to complete the process, first, you need to generate a .csv file of the MongoDB database. MongoDB provides two different export formats: JSON and CSV. Use the
mongoexport command to export the data.
mongoexport is a command line tool used to export data stored in MongoDB database in CSV or JSON format. This tool is a part of MongoDB tool packages. You need to run
mongoexport in the command prompt, not in the Mongo shell.
Open the command prompt, connect to the MongoDB server, and follow the below steps.
Step 1: Use
mongoexport to export the collection from MongoDB to a JSON or a CSV file.
plaintextmongoexport --collection=collection_name --db=db_name --out=outputfile.csv
db_name: Name of the database.
collection_name: Name of the collection to be exported.
outputfile.json: Name of the .csv file you want to export.
Step 2: Perform a transformation on the .csv file and upload the structured CSV or JSON to an AWS S3 bucket with either of the below methods.
2.1: As MongoDB supports flexible schema, it is difficult to understand a collection and create a compatible table in Amazon Redshift. Therefore, before uploading the file to the S3 bucket, you need to create a table schema.
2.2: You can also install the AWS CLI to upload files into S3 from your local machine. The AWS CLI makes it easy to upload files into the S3 bucket. If you already have installed the AWS CLI, use the below command to upload .csv files in the S3 bucket. After moving .csv files into the S3 bucket, you can create a table schema from the command prompt.
plaintextAWS S3 CP D:\outputfile.csv S3://S3bucket01/outputfile.csv
Step 3: Load data to Amazon Redshift with the COPY command from the S3 bucket to the predefined table.
If you are following Step 2 (2.1), use the following COPY command to move files from the S3 bucket to Redshift.
plaintextCOPY table_name from 's3://S3bucket_name/table_name-csv.tbl' 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' csv;
If you are following Step 2 (2.2), then run the below command to move files from the S3 bucket to Redshift using COPY command. To load files in CSV format, add csv at the end in your COPY command.
plaintextCOPY db_name.table_name FROM ‘S3://S3bucket_name/outputfile.csv’ 'aws_iam_role=arn:aws:iam::<aws-account-id>:role/<role-name>' csv;
IAM_ROLE: This parameter is used to authenticate the AWS account id and role.
After successfully executing the above commands, you can check if all the tables are copied correctly in
STL_LOAD_ERRORS stores a history of all records and displays the errors that occurred while loading the data from the respective file. For reference, check the list of possible load errors here.
If you don’t get any
STL_LOAD_ERRORS then your MongoDB to Redshift ETL is successful.
Challenges in MongoDB to Redshift ETL
While the above manual process might look trouble-free and cost-effective, it comes with the following challenges:
- Schema Detection: Redshift stores data in a columnar form, and understanding the schemaless database of MongoDB before storing it in Redshift is a burdensome task. In a collection, different documents can have the same field names with different data types. For instance, a field called mobile_no can be stored as a string or number. Fields like this will have to be converted while creating a table schema in Redshift.
- Complex structure of objects and arrays: MongoDB can have multiple nested arrays and objects in a document. To understand this dynamic nested structure is a challenging job before uploading a file in your S3 bucket.
- Data Type Incompatibility: Redshift does not support all the data types that are supported by MongoDB. Therefore, this might be a major concern while migrating files using the COPY command. For efficient data storage and query performance in the Redshift data warehouse, it is important to understand the Redshift data types.
- Real-time Complexities: Data in MongoDB is updated quickly. This can lead to a scenario where you regularly keep on uploading CSV files!
Thankfully, alternatives exist. You can use Estuary Flow to create a real-time ETL pipeline from MongoDB to Redshift.
Method 2: MongoDB to Redshift ETL using SaaS Alternatives
The above manual approach works perfectly, but the most productive method in terms of saving costs, time, and resources is SaaS alternatives. SaaS alternatives can help you make your data available to downstream applications without extensive overhead. Additionally, you don't need to handle the challenges that come with manual deployment. It can overcome the limitations of the above method:
- Latency: The problem with the manual method is by the time you export CSV files, your source data might change or update. You might end up exporting multiple CSV files to capture real-time updates, which can create an inaccurate picture of the data.
- Time-consuming: You need to transform each document of MongoDB into a tabular format. This process will take time to create and upload your CSV files in S3. ETL tools can automate the transformation and apply a basic schema on unstructured data.
- Repetitive: With a data pipeline that continuously processes new data, there’s no need to repeat a manual process.
Estuary Flow helps you easily migrate your data from MongoDB to Amazon Redshift or various other destinations, including Google Sheets, Snowflake, and more. This is the ultimate time-saving and streamlined approach to transferring your data easily. Once deployed, the pipeline operates continually in real-time.
Follow these steps to build a real-time data pipeline from MongoDB to Redshift with Estuary Flow.
Step 1: Register for Estuary Flow or log in if you already have an account. Once you log in, click on Captures.
Step 2: In the Captures window, click on + New Capture. The Create Capture window will appear.
Step 3: In this window, type MongoDB in the search connectors. You’ll see the MongoDB tile. Click on Capture.
Step 4: Now fill in the Capture details, providing a unique name for your capture.
Step 5: In the Endpoint config, add the address, username, password, and database name to capture from.
Step 6: Once you have filled in all the details, click on Next. Flow will initiate a connection with your MongoDB database and identify data collections.
Step 7: Click Save and Publish.
Step 8: Once the dialog box shows that you’ve published the capture successfully, click Materialize Collections.
Step 9: Search and choose Amazon Redshift. Click on Materialize.
Step 10: Provide the Materialization name and Endpoint config details. Click on Next. You can add or remove collections at this point.
Step 11: Finally, click on Save and Publish.
For more help on this method, you can visit the following links:
Give it a quick try to replicate your MongoDB data to Redshift in real time, as your first pipeline is free!
Migrating data from MongoDB to the Redshift warehouse is definitely a great way to streamline data for analytical purposes. If you are used to MongoDB querying and COPY commands, using the manual method is an easy and cost-effective approach. But considering manual efforts, programmatic challenges, and latency issues, the manual method isn’t an ideal approach for all real-time scenarios. However, using the second method, Estuary Flow, you can streamline your ETL without requiring new investments in infrastructure or development.
Keywords: mongodb, redshift