Estuary

What to Do After Data Lands in Snowflake: dbt, AI & Reverse ETL

Once data lands in Snowflake, learn how to transform it with dbt, power AI workflows with Cortex, and route it back to operational systems with reverse ETL.

Blog post hero image
Share this article

After data lands in Snowflake, the three highest-value next steps are transforming it with dbt, activating it for AI workloads through Snowflake Cortex, and routing it back to operational systems via reverse ETL. Each of these turns Snowflake from a storage destination into the central hub of a working data architecture, and each depends on how fresh and reliable your ingestion layer is to begin with.

If you've just gotten your Snowflake pipeline running reliably, congratulations! That's an achievement in and of itself. It's also where the interesting work starts.

The teams getting the most out of Snowflake treat it as a starting point, not an endpoint. This post walks through what those three downstream layers actually look like in practice, why they're worth investing in, and how the latency choices you made at ingestion ripple through every one of them.

This guide is for data teams that already have data landing in Snowflake and are deciding how to transform, activate, and operationalize it.

This is the final post in the Right-Time Snowflake Playbook series. Previous posts covered choosing the right ingestion method, understanding hidden ingestion costs, and evaluating Snowflake integration tools.

How do you transform raw data after it lands in Snowflake?

The common approach is to use dbt, a SQL-based data modeling tool that helps you transform raw ingested tables into clean, modeled data that analysts and downstream systems can rely on.

Raw data landing in Snowflake is rarely ready for analysis. Column names are inconsistent. Tables need to be joined. Fields need to be normalized, filtered, or enriched. Getting from raw ingested state to something analysts can actually query requires a transformation layer, and dbt has become the default choice for that job.

dbt projects can run inside your Snowflake environment to define transformations as version-controlled SQL and produce auditable models without overwriting the original data.

What dbt does

dbt lets data teams define expected data models in SQL, run transformations against raw data, and produce clean tables or views that downstream consumers can reliably query. It does this without destroying the original data, so you get a progressive, auditable transformation pipeline that's easy to backtrack through if something looks wrong.

This makes a big difference when you're ingesting from multiple sources with different schemas. And it also helps cut through the noise and make data more user-friendly for your analysts when working with complex sources. dbt gives you a consistent layer that standardizes everything into models your analysts and business users can depend on.

How to add dbt to your Snowflake stack

If you're using Snowflake natively, you can run dbt Core directly within the Snowflake environment, keeping your transformation layer inside your existing infrastructure.

Most third-party integration platforms also have native dbt integrations.

  • Fivetran supports both dbt Core and dbt Cloud, triggering transformation runs automatically after each sync.
  • Estuary integrates with dbt Cloud, enabling transformation runs as part of pipeline orchestration. More general materialization triggers can also kick off dbt Core jobs via a workflow like GitHub Actions.
  • Airbyte offers dbt integrations for post-sync transformations.

The practical effect is a clean ELT pipeline where transformations run automatically as fresh data arrives, keeping downstream models current without manual intervention.

How dbt timing affects pipeline latency

dbt integrations also make clear why data latency matters beyond the ingestion layer. If your pipeline delivers fresh data every few minutes but your dbt transformations only run once an hour, your effective data freshness is determined by the transformation cadence, not the ingestion cadence. Right-time thinking applies throughout the pipeline, not just at the source.

Note: Fivetran and dbt Labs announced a merger in October 2025, pending regulatory approval. Both companies have committed to keeping dbt Core open source under its current license regardless of the transaction's outcome.

How do you use Snowflake data for AI workloads?

Snowflake Cortex is the native way to run AI workloads directly on data in your warehouse, without exporting it to a separate AI infrastructure layer. Cortex comprises a suite of AI capabilities, including Cortex Search for semantic search and RAG, Cortex Analyst for natural language querying of structured data, and Cortex Agents for multi-step tasks that combine both. The quality of any of these outputs is bounded by the freshness of the data your ingestion pipeline delivers.

AI workloads have become one of the most compelling reasons to invest in data freshness. Models that operate on stale data produce stale results, and as organizations build AI features directly into their products, the quality of those features is directly tied to the quality of the underlying data pipeline.

What Cortex offers

Besides Cortex Code, Snowflake’s platform-wide agent, Cortex includes several distinct capabilities worth understanding separately:

  • Cortex Search enables semantic search across textual data stored in Snowflake. This powers internal knowledge search, customer-facing search features, and Retrieval-Augmented Generation (RAG) applications, where an LLM retrieves relevant context from your own data before generating a response.
  • Cortex Analyst lets you ask natural language questions about structured data and receive SQL queries as output, lowering the barrier to self-service analytics for business users who know what they want to know but not how to write the query.
  • Cortex Agents builds on both, enabling agents that combine results from Cortex Analyst and Cortex Search to complete multi-step tasks and answer complex business questions that require pulling from multiple data sources.
  • Cortex Fine-tuning allows LLM customization so you can use pre-trained models while still specializing the model. This is especially useful when your use case requires deep domain knowledge.

Why your ingestion strategy affects your AI outputs

For AI use cases built on Cortex, the value of your outputs is directly tied to how current your data is. A RAG application built on Cortex Search is only as useful as the data it can retrieve. If your ingestion pipeline updates that data once a day, your AI application is working from a 24-hour-old view of the world. If your ingestion pipeline updates it in near real-time, your AI application can answer questions about what's happening now.

This is one of the clearest cases for right-time architecture: if you're investing in AI capabilities, your ingestion strategy needs to match the latency requirements of those features, not just the requirements of your dashboards.

What is reverse ETL, and how does it work with Snowflake?

Reverse ETL is the pattern of sending cleaned, aggregated data from a warehouse like Snowflake back out to operational systems: CRMs, marketing platforms, product databases, and customer-facing applications. This is so that data can drive action, not just analysis. The most important technical decision in any reverse ETL setup is how changes are detected when extracting data from Snowflake, because that choice determines compute cost, egress fees, and how fresh the downstream systems can be.

Reverse ETL has become an increasingly important part of modern data architectures as organizations look to close the loop between analytics and operations. Not every data journey ends at the warehouse. The aggregated, cleaned data sitting in Snowflake is often most valuable when it's pushed back out to the systems where work actually happens.

Common reverse ETL use cases

  • Sales operations: Enriching CRM records with product usage data, health scores, or churn risk signals calculated in Snowflake.
  • Marketing: Syncing audience segments built from behavioral data back to ad platforms for targeting.
  • Product: Feeding personalization engines with user attributes or recommendations computed in the warehouse.
  • Customer success: Pushing health metrics to support tooling so teams can act on the data, not just analyze it.

How data gets extracted from Snowflake matters

Not all reverse ETL implementations are equivalent. The key distinction is how changes are detected and how much Snowflake compute gets consumed in the process. There are three common patterns:

  1. Full-refresh extraction re-queries all data on each cycle and reprocesses everything to identify what's changed. This works but is expensive: you're paying Snowflake compute costs to re-examine data that hasn't changed, and latency is bounded by your query interval.
  2. Time Travel–based extraction uses Snowflake's native Time Travel feature to capture only data that has changed since the last sync. This reduces compute consumption compared to full refresh, but it's still dependent on Snowflake's Time Travel retention window and query scheduling rather than continuous capture.
  3. Log-based CDC captures changes incrementally as they occur at the database level, independent of query scheduling. This is the approach Estuary's Snowflake capture connector uses, and it delivers lower latency and minimal compute overhead for high-frequency or high-volume reverse ETL workflows.

For many use cases, the difference won't be material. For teams running frequent syncs at high data volumes or building operational workflows where latency matters, it's worth understanding which extraction model you're using before you build on it.

The egress cost factor

Snowflake charges data egress fees for data leaving the warehouse, in addition to the compute costs of querying it. When evaluating reverse ETL options, account for both, particularly for high-volume or frequently-run pipelines. An extraction method that minimizes data transferred can meaningfully reduce egress costs compared to full-refresh approaches.

Complete Snowflake data flow: ingestion to activation

A complete Snowflake data flow has four layers: ingestion brings data in from source systems, transformation cleans and models it, analysis and AI put it to work inside the warehouse, and reverse ETL routes it back out to operational systems.

Four boxes in horizontal order depict the four layers of a Snowflake data flow, starting with ingestion on the left, then transformation, analysis & Ai,, and reverse ETL.
The 4 layers of a complete Snowflake data flow, from ingestion, to transformation and analytics within Snowflake, and ending with reverse ETL back out to operational systems.

The architecture that emerges from thinking through all of these layers looks something like this:

  1. Ingestion: Data arrives from source systems into Snowflake via your chosen ingestion method (batch COPY INTO, Snowpipe, or Snowpipe Streaming), using a third-party platform to manage connectors and reliability.
  2. Transformation: dbt runs automatically after each ingestion cycle, converting raw tables into clean, standardized data models.
  3. Analysis and AI: Analysts query the clean models directly. Cortex powers AI applications that need semantic search or natural language querying against the warehouse.
  4. Reverse ETL: Changed data flows back out to operational systems via efficient incremental extraction, keeping CRMs, ad platforms, and product databases in sync with the warehouse.

Each layer builds on the previous one, and the quality of each layer is directly shaped by the latency and reliability of the one before it. An ingestion pipeline that delivers stale or inconsistent data doesn't just affect your dashboards. It affects your dbt models, your Cortex AI outputs, and the operational decisions your reverse ETL pipeline is designed to inform.

This is ultimately why right-time architecture matters beyond the ingestion layer. It's not about achieving low latency for its own sake. It's about ensuring that every downstream system, whether it is analytical, operational, or AI-powered, is working from data that's fresh enough to be useful for the decisions that depend on it.


This is the final post in the Right-Time Snowflake Playbook series. If you've been reading along, you now have a practical framework for designing your Snowflake ingestion strategy, evaluating the real costs of different approaches, choosing between platform options, and thinking about where your data goes after it lands. We hope it's been useful and that the next pipeline you build is faster, cheaper, and easier to maintain than the last one.

For the complete guide including detailed setup walkthroughs for COPY INTO, Snowpipe, Snowpipe Streaming, Airbyte, Estuary, and Fivetran, download the full Snowflake integration whitepaper.

Ready to try Estuary? Start for free — no credit card required.

FAQs

    What should I do with data after it lands in Snowflake?

    After data lands in Snowflake, the three highest-value next steps are transforming it with dbt, activating it for AI workloads through Snowflake Cortex, and routing it back to operational systems via reverse ETL. Each layer turns Snowflake from a storage destination into the central hub of a working data architecture.
    dbt transforms raw ingested tables into clean, standardized models. Snowflake Cortex runs AI and natural language workloads directly against those models. Reverse ETL routes the cleaned, aggregated data back out to operational systems like CRMs and ad platforms. Together, they form a complete pipeline from raw ingestion to downstream action.
    The most efficient approach is log-based CDC, which captures changes incrementally as they occur rather than re-querying all data on each cycle. Compared to full-refresh or Time Travel-based extraction, log-based CDC delivers lower latency, lower Snowflake compute costs, and reduced egress fees for high-volume or high-frequency sync workflows.

Start streaming your data for free

Build a Pipeline

About the author

Picture of Emily Lucek
Emily LucekDeveloper Advocate / Data Engineer

Emily is an engineer and technical content creator with an interest in developer education. At Estuary, she works with data pipelines for both streaming and batch data and finds satisfaction in transforming a mess of information into usable data. Previous roles familiarized her with FinTech data and working closely with REST APIs.

Streaming Pipelines.
Simple to Deploy.
Simply Priced.
$0.50/GB of data moved + $.14/connector/hour;
50% less than competing ETL/ELT solutions;
<100ms latency on streaming sinks/sources.