3 Ways to Move Data From MongoDB to SnowflakeMarch 26, 2023
You’ve found yourself trying to figure out how to migrate data from MongoDB into Snowflake. Great! Let’s get started.
Before anything else, you first need to understand the given use case, and why this body of work is important to fulfilling business objectives. This is important so the solution you come up with both fulfills the requirements and adheres to your organization’s larger data strategy.
For this example, let’s assume the following:
- You’ve been asked to move source data from MongoDB into your company’s data warehousing tool, Snowflake. You’re using a cloud provider like AWS, or Azure.
- You need this pipeline to capture both historical and incremental changes happening within the source data. You need those changes to become available in Snowflake in near real-time.
- This data is going to be used for downstream analytics purposes in a 3rd-party business intelligence tool, queried with SQL.
- You have a fairly low appetite for open-source tools and frameworks.
Given these parameters, there are a variety of options to choose from. Below, I’ll walk through a few possible ones.
While considering different approaches, you should be focused on:
- How well it fulfills the given requirements
- If it’s scalable
- How easy it is to maintain.
It’s also important to consider which existing resources are available for you to use, and if adding a new tool makes sense. In isolation, no given method is inherently “good” or “bad.” Rather, any approach you take has certain positives and certain shortcomings that must be closely considered against any given situation.
Method #1: Connect MongoDB to Snowflake With Native Cloud Provider Tools & Snowpipe
Snowpipe is a tool offered by Snowflake that allows you to move away from scheduled batch loading jobs. It replaces a more traditional approach such as using the SQL COPY into command and allows your data to become available on a more real-time basis.
At a high level, Snowpipe performs this by loading the data into a staging environment in much smaller portions.
You would be using Snowpipe in addition to the native services offered by your cloud provider, like AWS or Azure. Here are just two examples, one for each cloud provider, of how your platform infrastructure could be designed, and given that, how you could move data from MongoDB into a Snowflake warehouse on that cloud provider.
- AWS — You could have a Kinesis delivery stream responsible for landing your MongoDB source data into an S3 bucket. If you have an SNS system enabled, you can use that associated successful run id to load data into Snowflake via Snowpipe.
- Azure — You could trigger Snowpipe using an Event Grid message for Blob storage events. Basically, your MongoDB files would be loaded into an external Azure stage. You’d create a blob storage event message, that would notify Snowpipe via Event Grid when those files are ready to be dropped into Snowflake. Snowpipe would then copy those queued files into a target table you created in Snowflake.Snowflake has a complete guide on connecting
Snowpipe to Azure blob storage, which you can find here.
There are a few important considerations with this approach.
First, it requires a thorough understanding of NoSQL databases like MongoDB, Snowflake, and your cloud provider. Troubleshooting and identifying the root cause of problems within a complex data pipeline like this requires significant domain expertise. This can be a difficult hurdle to overcome for immature or small data teams who “wear many hats”.
Second, this approach can be difficult to assign ownership over, especially regarding long-term management. This is because the resources you’re using to create this data pipeline are typically owned by a totally different team outside of Data. Choosing this approach would require significant discussion across other engineering teams to establish clear boundaries of ownership and long-term responsibility.
Method #2: Connect MongoDB to Snowflake With a custom Python Script via Data Orchestration Tool
A second option is to write an ETL pipeline in Python, orchestrated by Airflow or a similar tool.
Assuming you’re using AWS, you could use the MongoToS3Operator to move data from MongoDB to your S3 bucket, then use the SnowflakeOperator to move said data from the S3 bucket into a Snowflake table.
Astronomer, a popular managed service for Airflow, provides useful pipeline examples in their GitHub repository. They also offer a tutorial you can follow to practice building an ETL pipeline with Python on your local machine.
Pros of connecting MongoDB to Snowflake with Python and Airflow include:
- Approaching this programmatically offers customization and flexibility for each pipeline.
- Airflow has a large offering of custom sensors, hooks, and operators that can address a wide variety of potential use cases your organization can easily scale with.
- Airflow offers a visual monitoring and management interface, which is helpful when you’re troubleshooting a pipeline error and need to see what’s going on at each step.
Of course, there are cons to consider.
Cons of this approach:
- You maintain the code for each pipeline. This can become difficult as scale, and complexity increase, or if the pipeline needs to be changed regularly.
- The process is slower. This is because you have to wait for an Engineer to modify the code itself before you see the changes you requested be enabled. Depending on the urgency of a particular request, and an Engineering team’s current bandwidth, these requests can remain in the backlog for weeks, or months at a time.
- It can be difficult to maintain consistency across each pipeline if a clearly defined standard doesn’t exist for the team to follow. Each developer has their own “style” of approaching problems, syntax, etc. This approach requires teams to spend significant time establishing and documenting a quality standard for design, coding, use of frameworks, hooks, and operators, and more to ensure pipelines are well maintained and expectations are clear.
Method #3: Connect MongoDB to Snowflake using a managed data pipeline tool
There are a number of available SaaS tools that can handle your data ingestion needs for you.
This approach is helpful for teams that:
- Need to make data available downstream quickly as possible.
- Prefer not to deal with challenges associated with manual deployment.
The managed tool will quickly get your source data from MongoDB into Snowflake via a preconfigured connector. The individual fields from your MongoDB source data may become varchar data type columns as JSON objects in Snowflake.
Once your data becomes available in Snowflake, you can begin using SQL to explore your source data. Because of the nature of this ingestion process, it’s important to build a clear strategy around schemas and data organization, and create clear documentation the entire team can access.
There are a number of pros that come with choosing a managed service for your data pipeline. Here are just a couple:
- You can execute much faster. This is true for both making data available to your stakeholders, and resolving issues within the pipeline itself. This is because the managed tool is handling what’s going on “under the hood” for you. You no longer need to spend hours troubleshooting in your CLI, or guessing which part of the pipeline is causing your log file error.
Basically, it allows you to skip all the annoying parts of a more typical data ingestion process and move faster. This is particularly relevant for teams who own the data ingestion process, but have limited Platform, Data Engineering, or other specialty resources.
- It allows you to expand client offerings and services. Most managed tools offer hundreds of existing connectors for various sources and destinations. This gives you more bandwidth to support additional clients, and more complex use cases that may be unfeasible otherwise.
Estuary Flow is one such data pipeline tool. It allows you to create a data flow connecting MongoDB to Snowflake in a web UI.
What’s the best way to move data from MongoDB to Snowflake?
As we’ve learned, the best method depends on your use case, your team, and the other technology resources at your disposal.
But to put it as succinctly as possible…
- If you have a large team of engineers and excellent data project management frameworks, take advantage of your cloud provider and Snowpipe.
- If you’re operating at a smaller scale, but still prefer the flexibility of a programmatic approach, use Python and an orchestration platform.
- If you’re a mixed team of professionals looking to quickly expand and scale with less engineering burden, you might benefit from a managed data pipeline tool.
Which would work best for your use case?
Keywords: airflow, astronomer, aws, azure, mongodb, python, snowflake, snowpipe