Estuary

Snowflake OpenFlow Deep Dive: Architecture, Use Cases, and Comparisons

Learn what Snowflake OpenFlow is, how its architecture works, and how it compares to Estuary and Fivetran. A deep dive into ingestion patterns, performance, and costs.


For a long time, Snowflake users have relied on COPY INTO commands for bulk loading, Snowpipe for micro-batching, or tools like Fivetran for ELT. But things have changed recently. With the surge of GenAI and an influx of unstructured data, the old “land, then transform” approach has started to crack.

In 2024, Snowflake acquired Datavolo, a company built by the creators of Apache NiFi (a system used to process and distribute data). The deal came before the announcement of Snowflake OpenFlow, and it enabled Snowflake to redesign its data ingestion layer so that users can build and extend processors that move data from a source to any destination.

In this article, we’ll be focusing on Snowflake OpenFlow’s architecture, how it works, the ways in which you can use it, who its competitors are, and where its strengths and limitations lie, with a special focus on the 2026 data landscape.

Key Takeaways

  • Snowflake OpenFlow enables in-flight data processing. It allows teams to transform, enrich, and route data before it lands in Snowflake.

  • OpenFlow is built on Apache NiFi via the Datavolo acquisition. As such, it brings flow-based programming and visual orchestration into the Snowflake platform.

  • OpenFlow is best suited for unstructured data, streaming ingestion, and AI pipelines. This is where traditional ELT tools fall short.

  • OpenFlow complements tools like Estuary and Fivetran. It trades ultra-low latency and simplicity for flexibility and customization.

What Is Snowflake OpenFlow?

Snowflake OpenFlow is an integration service that connects virtually any data source to any destination. It features hundreds of processors that support structured and unstructured text, images, audio, video, and sensor data.

Snowflake OpenFlow architecture flow, showing data sources, analytics, data engineering, AI, and applications integration, with a focus on platform management, governance, and observability.
Source image - Snowflake OpenFlow overview

Most connectors simply move data from one place to another, but OpenFlow is a flow-based orchestration engine, which means it lets engineers process data while it’s still “flowing”, before it even touches a Snowflake table.

Let’s circle back to the classic ELT process. We start by loading the raw data into the Bronze layer. Then, we use SQL to transform it and clean it before moving it into the Silver layer. This setup works just fine for CSVs or JSON, but when we apply it to other types of data and processes, it starts to break down. Some examples include:

  1. Unstructured data: SQL won’t be able to extract what you need from a PDF or an image.
  2. Real-time routing: In some cases, the same data stream must be routed to multiple destinations, like a vector database and an Iceberg lake, and SQL is not capable of achieving that.
  3. Complex API logic: If you are dealing with multi-step OAuth2 flows or paginated REST APIs, SQL won’t be able to help you out.
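The routing case in item 2 can be sketched in plain Python. This is only an illustration of fan-out in a flow-based design, not OpenFlow's API: the `route` function, the rule names, and the two in-memory sinks are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def route(record: Record,
          sinks: Dict[str, Callable[[Record], None]],
          rules: Dict[str, Callable[[Record], bool]]) -> List[str]:
    """Send one record to every sink whose rule matches; return the sink names used."""
    delivered = []
    for name, matches in rules.items():
        if matches(record):
            sinks[name](record)
            delivered.append(name)
    return delivered


# Hypothetical in-memory stand-ins for a vector database and an Iceberg lake.
vector_db: List[Record] = []
iceberg_lake: List[Record] = []

sinks = {"vector_db": vector_db.append, "iceberg": iceberg_lake.append}
rules = {
    "vector_db": lambda r: r.get("kind") == "document",  # only text worth embedding
    "iceberg": lambda r: True,                            # everything is archived
}

used = route({"kind": "document", "body": "quarterly report"}, sinks, rules)
```

The same record reaches both destinations in one pass, which is the behavior a single SQL statement in the warehouse cannot express.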

Datavolo Acquisition and Flow-Based Programming (FBP) Model

A headline for an article reads: "Snowflake to acquire Datavolo: Empowering data engineers with hybrid, multimodal pipelines, and open source flexibility". The Snowflake and Datavolo logos appear on the right side.
Screenshot taken from Snowflake's announcement of the Datavolo acquisition.

Even though the acquisition was mentioned earlier, it’s worth revisiting since it was a strategic move to bring flow-based programming (FBP) into the warehouse. In effect, it brings Apache NiFi 2.0 into Snowflake. NiFi was originally developed at the NSA to manage massive, complicated data streams.

Now, with OpenFlow, you get all that NiFi power without the burden of managing servers, complex technical setups, or dedicated teams.

Architecture: Control Plane vs. Data Plane

Snowflake uses a modern split-plane architecture to keep your data secure while allowing your system to scale easily.

Control Plane

You can access the control plane directly through Snowsight (Snowflake’s web interface). Through your workspace, you can visually build your flows by dragging and dropping processors, connecting them, and managing versions with Git. This plane is in charge of all the instructions and metadata. The actual data never touches it, which helps keep information private and secure.

Data Plane (“Runtime”)

This is where the actual data processing occurs. It can be deployed in two ways:

  1. Managed (SPCS): The runtime runs inside Snowpark Container Services. It is a fully managed option operated by Snowflake, and it automatically scales based on your credit usage.
  2. BYOC (Bring Your Own Cloud): If your company needs to keep everything in-house due to strict data residency requirements (or other compliance reasons), you can deploy an OpenFlow Agent in your own VPC on AWS or Azure. Your data remains local, but instructions still come from the Snowflake Control Plane.

Image Registry

Snowflake uses a system image registry to ensure the platform remains secure. When you add a new processor (such as a Python tool to read PDFs), the data plane agent pulls a signed and verified container image from this registry. This guarantees that every piece of code running in your environment is authorized and kept up to date.

FlowFiles in Snowflake OpenFlow: Content, Metadata, and Backpressure

Every piece of information is wrapped in a FlowFile, which has two main components. The content refers to the actual data payload, such as the raw bytes of an image or a specific row from a spreadsheet. The attributes store key-value metadata attached to the payload, like the file’s origin, its unique ID, or the type of data it contains.
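To make the two components concrete, here is a minimal Python sketch of a FlowFile as an immutable pair of content and attributes. It is a model for illustration, not OpenFlow's or NiFi's actual class; the helper names and the attribute keys chosen here are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class FlowFile:
    """Illustrative model: raw payload plus key-value metadata."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)

    def with_attribute(self, key: str, value: str) -> "FlowFile":
        # Return a new FlowFile instead of mutating, so earlier steps stay reproducible.
        return FlowFile(self.content, {**self.attributes, key: value})


def make_flowfile(content: bytes, source: str, mime_type: str) -> FlowFile:
    # Attribute keys are illustrative examples of origin, unique ID, and data type.
    return FlowFile(content, {
        "uuid": str(uuid.uuid4()),
        "source": source,
        "mime.type": mime_type,
    })


ff = make_flowfile(b"id,name\n1,Ada\n", "sftp://example/in.csv", "text/csv")
tagged = ff.with_attribute("route", "bronze")
```

Processors can then make routing decisions from `attributes` alone, without parsing `content`.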

Handling “Spiky” Data

Backpressure helps handle sudden surges of data. As such, it's one of OpenFlow's greatest assets.

In traditional pipelines, a spike in data can crash the consumer. With OpenFlow, however, every connection acts as a safety buffer. You can also set limits (for instance, “pause if the queue reaches 5,000 FlowFiles”) to give the ingestion engine room to breathe and prevent the runtime from crashing during peak loads.
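The buffering idea can be sketched as a bounded queue that refuses new items once a threshold is reached. This is a simplified model under stated assumptions (a single object-count threshold, a hypothetical `Connection` class), not the runtime's real implementation.

```python
from collections import deque


class Connection:
    """A queue between two processors with an object-count backpressure threshold."""

    def __init__(self, threshold: int = 5_000) -> None:
        self.threshold = threshold
        self._queue = deque()

    def backpressure_engaged(self) -> bool:
        return len(self._queue) >= self.threshold

    def offer(self, item) -> bool:
        """Enqueue if below the threshold; return False so the producer can pause."""
        if self.backpressure_engaged():
            return False
        self._queue.append(item)
        return True

    def poll(self):
        """Remove and return the oldest item, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


conn = Connection(threshold=3)
accepted = [conn.offer(i) for i in range(5)]  # producer bursts five items
conn.poll()                                   # consumer drains one
resumed = conn.offer(99)                      # producer is allowed to continue
# accepted == [True, True, True, False, False]; resumed is True
```

The producer is slowed down rather than the consumer being overwhelmed, which is why a spike degrades throughput instead of crashing the pipeline.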

Snowflake Connection Service

This service acts as a bridge between NiFi and Snowflake. It manages the key-pair authentication and ensures that the NiFi stream is mapped to the right Snowflake endpoint.

Let’s have a look at a JSON representation of a Controller Service config:

plaintext
{
  "controllerService": "SnowflakeConnectionService",
  "properties": {
    "accountUrl": "https://xy12345.snowflakecomputing.com",
    "user": "OPENFLOW_INGEST_USER",
    "privateKey": "${snowflake_private_key_secret}",
    "warehouse": "INGEST_WH",
    "database": "RAW_DB",
    "schema": "PUBLIC"
  }
}

Note: The privateKey is mapped directly to a Snowflake Secret to ensure credentials are not exposed in plaintext on the canvas.
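A quick way to enforce that rule is a pre-deployment check that rejects configs whose sensitive properties are not `${...}` references. The sketch below is an assumption-laden illustration: the `plaintext_secrets` helper and the reference pattern are hypothetical, and the config simply mirrors the JSON shown above.

```python
import json
import re

# Assumed reference syntax: the whole value must be a single ${name} placeholder.
PARAM_REF = re.compile(r"^\$\{[A-Za-z0-9_.]+\}$")


def plaintext_secrets(config: dict, sensitive=("privateKey",)) -> list:
    """Return the sensitive property names whose values are not placeholders."""
    props = config.get("properties", {})
    return [k for k in sensitive if k in props and not PARAM_REF.match(props[k])]


config = json.loads("""
{
  "controllerService": "SnowflakeConnectionService",
  "properties": {
    "user": "OPENFLOW_INGEST_USER",
    "privateKey": "${snowflake_private_key_secret}",
    "warehouse": "INGEST_WH"
  }
}
""")

problems = plaintext_secrets(config)  # empty list when every secret is a reference
```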

High-Value Data Ingestion Patterns in Snowflake OpenFlow

OpenFlow is an ingestion and replication tool, not a full-scale transformation (ETL) engine. It is designed to move raw or lightly processed data into Snowflake with high efficiency and native AI integration, and it focuses on the "E" (Extract) and "L" (Load) phases so that data lands in a query-ready state. This section introduces five high-value patterns:

Pattern 1: Database Replication and CDC

With OpenFlow, you no longer need to poll your database with repeated SELECT * queries. Instead, it uses log-based Change Data Capture (CDC) to stream changes.

plaintext
-- Run on the source Postgres database
-- Changing wal_level requires a server restart to take effect
ALTER SYSTEM SET wal_level = logical;
CREATE PUBLICATION snowflake_export FOR ALL TABLES;
SELECT * FROM pg_create_logical_replication_slot('snowflake_slot', 'pgoutput');

The CaptureChangePostgreSQL processor reads the database logs and creates FlowFiles for every insert, update, and delete. These then stream directly into Snowflake.
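Conceptually, the target only has to replay those change events in order. The following sketch shows that replay logic on an in-memory dictionary; the event shape (`op`, `key`, `row`) is an assumption made for illustration and is not the processor's actual output format.

```python
from typing import Any, Dict

Table = Dict[int, Dict[str, Any]]


def apply_change(table: Table, event: Dict[str, Any]) -> None:
    """Apply one insert, update, or delete event to the replica, keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        table[key] = event["row"]   # upsert keeps replay idempotent
    elif op == "delete":
        table.pop(key, None)        # deleting a missing key is a no-op
    else:
        raise ValueError(f"unknown operation: {op}")


replica: Table = {}
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "Ada"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "Bob"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "Ada L."}},
    {"op": "delete", "key": 2},
]
for event in events:
    apply_change(replica, event)
# replica now holds only row 1, with the updated name
```

Because only changed rows travel over the wire, the source database is not rescanned on every sync.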

Pattern 2: Streaming Event Ingestion

Snowpipe Streaming is Snowflake’s low-latency ingestion API designed for real-time event data. This method sends data directly to a table and skips the staging step entirely.

plaintext
CREATE OR REPLACE TABLE raw_events (
  event_id STRING, -- Snowflake has no UUID column type; store the UUID as a string
  payload VARIANT,
  processed_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

The PutSnowflakeStreaming processor sends NiFi records to the Snowflake table buffer and is capable of achieving latencies as low as 2 seconds.
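Low latency in this style of ingestion generally comes from small, frequent batches. The sketch below shows a generic buffer-and-flush loop; it does not use the Snowpipe Streaming SDK, the `RowBuffer` class is hypothetical, and the 2.0 second default only echoes the figure above.

```python
import time
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class RowBuffer:
    """Collect rows and hand them to `flush` when the batch is big or old enough."""

    def __init__(self, flush: Callable[[List[Row]], None], max_rows: int = 500,
                 max_age_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_rows = max_rows
        self._max_age_s = max_age_s
        self._clock = clock
        self._rows: List[Row] = []
        self._first_at = 0.0

    def add(self, row: Row) -> None:
        if not self._rows:
            self._first_at = self._clock()  # age is measured from the first buffered row
        self._rows.append(row)
        too_big = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_big or too_old:
            self.flush_now()

    def flush_now(self) -> None:
        if self._rows:
            batch, self._rows = self._rows, []
            self._flush(batch)
```

Note that age is only checked when a row arrives, so a real implementation would also need a timer to flush a quiet buffer.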

Pattern 3: Internal Staging

If you’re building a RAG application, you’ll need a place to land PDFs or audio files. In this case, you can use PutSnowflakeInternalStage to upload these files directly to Snowflake.

plaintext
CREATE OR REPLACE STAGE docs_stage
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- Querying the metadata of the binary files
SELECT * FROM DIRECTORY(@docs_stage);

OpenFlow moves the files from the source (such as a local SFTP) into the stage. Snowflake’s Directory Tables make these files immediately available to your AI functions, no extra steps required.

Pattern 4: Managed Iceberg

Vendor lock-in is no fun, but you can avoid it by writing your data in Apache Iceberg format while letting Snowflake manage the catalog.

plaintext
CREATE OR REPLACE ICEBERG TABLE customer_data (
  id INT,
  name STRING
)
  EXTERNAL_VOLUME = 'my_s3_volume'
  CATALOG = 'SNOWFLAKE';

The PutIcebergTable processor writes Parquet files to your S3 bucket and updates the table metadata. As a result, engines like Apache Spark can query the data immediately.

Pattern 5: The Cortex AI Transform

This is arguably OpenFlow’s best feature. Basically, you can extract entities from your data stream before it lands in a table, which allows you to clean or enrich your data in-flight.

Let’s have a look at the conceptual logic of the Cortex processor:

plaintext
-- Logic executed in-flight by the Cortex Processor
SELECT SNOWFLAKE.CORTEX.EXTRACT_ENTITIES(
  attribute.content,
  ['company', 'person', 'location']
);

We place a CortexProcessor between a source and a sink (destination). It then sends the content of each FlowFile to an LLM, extracts entities, and stores them as attributes that can be used in your table.
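The enrichment step can be modeled as a function that sits between source and sink and copies extracted entities into attributes. In this sketch the LLM call is replaced by a hard-coded stub, and every name (`enrich`, `fake_extractor`, the `entities.*` attribute keys) is hypothetical rather than part of OpenFlow or Cortex.

```python
from typing import Callable, Dict, List, Tuple

FlowFile = Tuple[bytes, Dict[str, str]]            # (content, attributes)
Extractor = Callable[[str, List[str]], Dict[str, List[str]]]


def enrich(flowfile: FlowFile, extract: Extractor, labels: List[str]) -> FlowFile:
    """Run the extractor on the content and store its results as new attributes."""
    content, attributes = flowfile
    entities = extract(content.decode("utf-8"), labels)
    enriched = dict(attributes)                     # leave the input untouched
    for label in labels:
        enriched[f"entities.{label}"] = ",".join(entities.get(label, []))
    return content, enriched


def fake_extractor(text: str, labels: List[str]) -> Dict[str, List[str]]:
    """Stand-in for an LLM call: looks words up in a tiny hard-coded dictionary."""
    known = {"Snowflake": "company", "Datavolo": "company", "Ana": "person"}
    found: Dict[str, List[str]] = {label: [] for label in labels}
    for word in text.replace(",", " ").split():
        label = known.get(word)
        if label in found:
            found[label].append(word)
    return found


source = (b"Ana wrote about Snowflake and Datavolo", {"source": "blog"})
_, attrs = enrich(source, fake_extractor, ["company", "person", "location"])
```

Swapping `fake_extractor` for a real model call would leave the surrounding flow unchanged, which is the appeal of treating enrichment as one more processor.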

Security and Governance in Snowflake OpenFlow

OpenFlow integrates directly into Snowflake’s security framework, so ingestion pipelines follow the same Role-Based Access Control and encryption standards as your data warehouse. Let’s have a look at the options in more detail:

RBAC (Role-Based Access Control)

To build a flow, you need to have the OPENFLOW_ADMIN role. Moreover, for it to run successfully, the data plane must have a role that has the USAGE grant on the specific warehouse and the INSERT grant on the target table. Otherwise, the flow may encounter permission issues.
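A pre-flight check for those two grants might look like the sketch below. It works on plain tuples and never talks to Snowflake; the object names are borrowed from the earlier examples in this article, and the `missing_grants` helper is hypothetical.

```python
from typing import List, Set, Tuple

Grant = Tuple[str, str, str]  # (privilege, object type, object name)

# The two grants named above, using object names from earlier examples.
REQUIRED: Set[Grant] = {
    ("USAGE", "WAREHOUSE", "INGEST_WH"),
    ("INSERT", "TABLE", "RAW_DB.PUBLIC.RAW_EVENTS"),
}


def missing_grants(granted: Set[Grant], required: Set[Grant] = REQUIRED) -> List[Grant]:
    """Return the required grants the runtime role does not hold, in sorted order."""
    return sorted(required - granted)


role_grants = {("USAGE", "WAREHOUSE", "INGEST_WH")}
gaps = missing_grants(role_grants)
```

Running a check like this before starting a flow turns a mid-run permission error into an explicit to-do list.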

Secrets Management

When your code or app needs to connect to an external system (like Salesforce or Stripe), it typically requires an API_KEY. Typing this secret value directly into your code is risky, especially if that code is publicly available. A safer approach is to reference the secret value using Snowflake secrets management, such as: SECRET_VALUE = ${secrets.api_key}.

That way, the secret value isn’t stored in your code.
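The mechanics of such a reference can be illustrated with a small resolver that substitutes placeholders at run time. This is a sketch of the general pattern only: the `resolve` function and the dictionary standing in for a secret store are assumptions, not Snowflake's implementation.

```python
import re
from typing import Callable, Optional

# Matches placeholders of the form ${secrets.<name>}.
SECRET_REF = re.compile(r"\$\{secrets\.([A-Za-z0-9_]+)\}")


def resolve(template: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Replace each placeholder with the value returned by `lookup`."""
    def _substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        value = lookup(name)
        if value is None:
            raise KeyError(f"secret not found: {name}")
        return value
    return SECRET_REF.sub(_substitute, template)


# A dict stands in for the secret store; real values would never live in code.
store = {"api_key": "s3cr3t"}
line = resolve("SECRET_VALUE = ${secrets.api_key}", store.get)
```

Only the placeholder is committed to version control, while the value is looked up where the flow actually runs.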

Data Lineage

Every action in OpenFlow is recorded and can be reviewed in Snowflake Trail. You can access these logs at any time and trace what happened in case something has gone wrong.

Snowflake OpenFlow vs. Estuary vs. Fivetran: The 2026 Data Integration Landscape

The right tool for your data pipeline depends on whether you prioritize simplicity, real-time speed, or native AI capabilities. In this section, we compare OpenFlow against its biggest competitors:

OpenFlow vs. Estuary

Estuary is good for moving data instantly. Even though it combines batch and streaming into a single platform, it’s designed for streaming above all else. You should choose it if you need your data to be identical in two places at the same time (subsecond latency).

On the other hand, OpenFlow is a great choice if you are dealing with “messy” data, such as files and complex APIs, or when you need to run AI models on the data during ingestion.

OpenFlow vs. Fivetran

Fivetran is very easy to set up. You just provide your login details and watch it automatically move your data into your storage. It’s ideal for standard SaaS-to-warehouse pipelines where you don’t want to manage a canvas. It works particularly well with popular apps everyone uses, like Salesforce or Zendesk.

As mentioned in the previous comparison, OpenFlow is better suited for handling unstructured data, or in those situations when Fivetran’s settings aren’t flexible enough for your use case.

Estuary vs. OpenFlow vs. Fivetran

To summarize, the following table compares Estuary, OpenFlow, and Fivetran across use cases, latency, pricing, transformation style, AI readiness, CDC, unstructured data handling, and governance and security.

| Criterion | Estuary | Snowflake OpenFlow | Fivetran |
| --- | --- | --- | --- |
| Ideal use cases | Real-time dashboards, operational syncing, database data migration | Complex AI pipelines, unstructured data ingestion | Standard marketing/sales analytics (Salesforce, Google Ads, and similar) |
| Minimum latency | Sub 100ms | ~2s | 1 minute (real-time requires self-hosted HVR) |
| Pricing model | Volume-based | Compute-based | Row-based |
| Transformation style | Streaming ETL | Visual flow (300+ drag & drop processors) | Post-load ELT (primarily uses dbt once data is already in the warehouse) |
| AI/LLM readiness | High (native vector database sinks with Pinecone and AI API calling) | Excellent (access to Snowflake Cortex AI) | Limited (provides data models for RAG, but doesn’t have real-time AI processing) |
| CDC method | Log-based (WAL/Binlog); highly efficient for databases | Log-based via NiFi processors (CaptureChangePostgreSQL, and similar) | Log-based; very reliable but can be expensive at high volumes |
| Unstructured data handling | Good (handles files and JSON streams well) | Excellent (specifically built for this kind of data) | Minimal (primary focus is on structured/tabular data from SaaS APIs) |
| Governance and security | SOC2, HIPAA; external to warehouse | Native (inherits Snowflake RBAC, secrets, and Horizon governance) | SOC2, HIPAA, ISO; external to warehouse |

Pricing and Lock-In

Every company takes vendor lock-in into account. Although OpenFlow is powered by Apache NiFi, the managed features are Snowflake-only, which means you have to be a Snowflake user.

Refer to pattern 4 earlier in this article to see how to avoid vendor lock-in.

You will have to pay for OpenFlow using Snowflake Credits, but there’s also the option of bringing your own cloud (BYOC) if that’s what you prefer. This may also help you save on high-bandwidth jobs since you bypass Snowflake’s compute markup and only pay for your own vCPU and RAM. In any case, pricing is based on the runtime size (small, medium, or large nodes, which we will discuss further in the following section) and how long those nodes are active. Essentially, you are paying for the service to stay active regardless of your data volume.

By contrast, Fivetran uses a Monthly Active Rows (MAR) model, which lets you pay for the rows updated or inserted. If your data volume is low, Fivetran is often cheaper, but if you have massive datasets, the per-row cost can escalate quickly.

Similarly, Estuary uses a “pay-for-what-you-use” model; however, it bills you based on how much data you move (GB per month) rather than the number of rows or the size of the machine.
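The difference between the three billing models is easiest to see as arithmetic. The functions below are a rough sketch, and every rate passed to them is a made-up placeholder rather than a published price from any vendor.

```python
def compute_based(hours_active: float, credits_per_hour: float,
                  price_per_credit: float) -> float:
    """Cost driven by how long the runtime stays up, independent of data volume."""
    return hours_active * credits_per_hour * price_per_credit


def row_based(active_rows: int, price_per_million_rows: float) -> float:
    """Cost driven by the number of rows inserted or updated in the month."""
    return active_rows / 1_000_000 * price_per_million_rows


def volume_based(gb_moved: float, price_per_gb: float) -> float:
    """Cost driven by the amount of data moved in the month."""
    return gb_moved * price_per_gb


# Hypothetical month: a runtime that is always on for 720 hours.
always_on = compute_based(hours_active=720, credits_per_hour=1.0, price_per_credit=3.0)
light_rows = row_based(active_rows=2_000_000, price_per_million_rows=10.0)
light_volume = volume_based(gb_moved=100, price_per_gb=0.5)
```

With these placeholder rates, the always-on runtime costs the same whether it moves 1 GB or 1 TB, while the other two models scale with usage. Plugging in your own rates and volumes shows where the crossover point lies.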

Snowflake OpenFlow Performance, Scaling, and Limitations

Scaling the Runtime

There are three runtime options available, which you can choose from depending on your specific use case:

  • Small: 1 vCPU / 2GB RAM (normally a good option for light API polling)
  • Medium: 4 vCPU / 16GB RAM (works well for standard CDC and Snowpipe Streaming)
  • Large: 16 vCPU / 64GB RAM (when dealing with heavy AI/ML or image processing)
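Choosing a size then comes down to picking the smallest runtime that covers your requirements. The sketch below encodes the three sizes listed above; the `pick_runtime` helper and the idea of estimating vCPU and RAM needs up front are assumptions for illustration.

```python
from typing import List, Optional, Tuple

# (name, vCPU, RAM in GB), ordered from smallest to largest.
RUNTIMES: List[Tuple[str, int, int]] = [
    ("Small", 1, 2),
    ("Medium", 4, 16),
    ("Large", 16, 64),
]


def pick_runtime(vcpu_needed: int, ram_gb_needed: int) -> Optional[str]:
    """Return the smallest runtime that satisfies both requirements, or None."""
    for name, vcpu, ram_gb in RUNTIMES:
        if vcpu >= vcpu_needed and ram_gb >= ram_gb_needed:
            return name
    return None


choice = pick_runtime(vcpu_needed=2, ram_gb_needed=8)  # a typical CDC workload, say
```

A `None` result means the workload does not fit a single node of any listed size and needs to be split or rethought.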

Limitations

OpenFlow may be very powerful, but there are a few things to keep in mind before deciding if it’s the right fit for your team.

First, availability is still limited. OpenFlow is available on AWS and Azure, while many Google Cloud (GCP) regions are still in the preview phase. If you run on GCP, consider waiting until your region is fully available.

Second, there’s a steep learning curve since the platform is built on Apache NiFi. You will need some basic knowledge of flow-based programming to be able to use this "hands-on" tool.

Finally, not all connectors are available. The most popular ones are, but if you need a less common or niche ELT connector, you may be out of luck. You can still build your own logic, but it won’t be as fast or as convenient as using one of the provided and maintained connectors.

Conclusion

Snowflake OpenFlow is arguably the most significant change to the Snowflake ingestion story in a decade. With it, Snowflake ingestion has evolved from a basic loader into a powerful orchestrator that sits at the center of the AI pipeline. Beyond data movement, it provides a visual, well-governed, and scalable infrastructure for building pipelines.

For data engineers in 2026, the question is no longer “How do I move this data?” but “How much value can I add to this data before it lands?” If this issue has been on your mind lately, OpenFlow might just be what you’re looking for.

FAQs

    Can I run Snowflake OpenFlow in my own VPC?

    Yes, through BYOC (Bring Your Own Cloud). You can deploy an OpenFlow Agent to keep data processing local to your AWS or Azure VPC while receiving instructions from the Snowflake control plane.

    How does Snowflake OpenFlow's architecture work?

    Snowflake’s OpenFlow is based on a modern split-plane architecture and flow-based programming. Not only does this approach keep data more secure, but it also works well with complex data streams and allows for easier scaling. There are two planes:
    - Control plane, or the visual "drag-and-drop" canvas where you design your flows through Snowsight (Snowflake’s web interface). It handles all the instructions and metadata, but the actual data never touches this plane.
    - Data plane, or the processing plane. It can be managed (running inside Snowpark Container Services) or BYOC, where the agent runs in your own AWS/Azure VPC.

    What is a FlowFile made of?

    - Content: The actual data (such as the bytes of an image or a CSV row).
    - Attributes: Metadata (key-value pairs) like the file source, timestamp, or unique IDs.

    How does OpenFlow handle sudden spikes in data?

    It uses a feature called backpressure. Every connection between processors acts as a buffer. You can even set OpenFlow to "pause" ingestion if the queue reaches a certain limit.

    Can OpenFlow process data with AI before it lands in Snowflake?

    Yes. OpenFlow features a Cortex AI Transform pattern that allows you to clean or enrich data in-flight. You can place a Cortex processor in your flow between a source and a sink. It will send content from FlowFiles to an LLM, extract entities, and store them as attributes.

    Should I choose OpenFlow, Estuary, or Fivetran?

    The choice really depends on your goal. For example, Fivetran is great for standard SaaS-to-warehouse pipelines since there is no mess and no canvas management. On the other hand, you should go for Estuary if you work with real-time dashboards and need subsecond latency. Finally, OpenFlow is best used for unstructured data or complex AI pipelines.

    How is OpenFlow priced?

    OpenFlow’s pricing is compute-based. You pay for the runtime nodes (small, medium, or large) using Snowflake Credits for as long as they are active.


About the author

Ana Escobar Llamazares

Ana is a results-driven Data Platform Engineer with a focus on building scalable, high-performance architectures. Combining a passion for emerging technologies with a commitment to continuous technical evolution, she specializes in engineering the foundational platforms that power (real-time) data initiatives. She's dedicated to the philosophy that 'the path is made by walking'; continuously upskilling to solve complex engineering challenges.
