
Moving data from Snowflake to Databricks looks simple at first, but most teams quickly discover that the two platforms do not natively sync in real time. Snowflake is optimized for scalable SQL analytics, while Databricks excels at machine learning, Delta Lake storage, and large-scale processing. When your AI pipelines, feature engineering workloads, or streaming use cases depend on fresh data, traditional batch exports or cloud storage hops are not fast or dependable enough.
The short answer is this: you can move data from Snowflake to Databricks using several methods, including cloud storage staging, ETL tools, and custom Spark jobs, but the only way to achieve reliable real-time CDC replication with sub-second latency is to use a right-time data platform like Estuary. Right-time means you choose exactly when data moves, whether sub-second, near real time, or batch, without rewriting pipelines or maintaining separate tools.
This guide explains every method for syncing Snowflake and Databricks, the tradeoffs of each approach, and why right-time data movement has become the new standard for analytics and AI teams. You will also learn how to set up a dependable Snowflake-to-Databricks pipeline using Estuary that delivers continuous updates, handles schema changes automatically, and requires no Spark code or orchestration.
Key Takeaways
- You can move data from Snowflake to Databricks using several methods, but only right-time platforms support reliable real-time CDC replication suitable for AI and machine learning workloads.
- Traditional ETL pipelines rely on batch jobs, which introduce delays, require ongoing maintenance, and break easily during schema changes or data spikes.
- A right-time data platform like Estuary lets you choose exactly when data moves, whether sub-second, near real time, or batch, without rewriting pipelines or managing multiple tools.
- Estuary continuously captures changes from Snowflake and delivers them directly into Databricks Delta Lake with exactly-once guarantees and automatic schema evolution.
- The result is faster model training, fresher analytics, and a more efficient architecture that unifies your warehouse and lakehouse without complex orchestration or custom Spark jobs.
Why Teams Sync Snowflake and Databricks
Snowflake and Databricks often sit together in the modern data stack because each platform excels at different parts of the analytics and AI lifecycle. Snowflake provides a simple and scalable environment for SQL workloads, BI reporting, and operational analytics. Databricks is built for advanced processing, large-scale data engineering, machine learning, and open formats like Delta Lake and Parquet. When combined, they give teams the flexibility to support both analytical and computationally heavy workflows without forcing everything onto a single system.
Many organizations want to connect Snowflake and Databricks so each platform can do what it does best:
- Analytics, dashboards, and business reporting in Snowflake
- Feature engineering, model training, and AI pipelines in Databricks
- Delta Lake storage for ML workflows
- Notebook-driven experimentation and data science in Databricks
These workflows depend heavily on data freshness. A model trained on stale data quickly becomes inaccurate. A feature store built from lagging tables produces weaker predictions. Real-time inference pipelines require fast and continuous updates to deliver relevant outputs.
Without a reliable method to keep Snowflake and Databricks in sync, teams end up with duplicated datasets, inconsistent versions of truth, and unnecessary compute cost. Unifying the two systems through dependable right-time data movement solves this problem by making sure that every downstream operation in Databricks always sees the latest version of Snowflake data.
Suggested Read: Databricks vs Snowflake
Core Challenges of Moving Data from Snowflake to Databricks
Syncing Snowflake and Databricks is harder than it looks because the two platforms do not have a native real-time connection. Most teams end up stitching together batch pipelines or custom scripts, which creates delays and reliability issues.
Here are the main challenges:
1. No direct, real-time path between the platforms
Snowflake cannot stream changes directly into Databricks, so you must build or manage a pipeline.
2. Batch pipelines slow down AI and analytics
Staging data in cloud storage and loading it with Auto Loader works for daily syncs but not for real-time or near real-time updates.
3. CDC is difficult to build yourself
Handling inserts, updates, deletes, checkpoints, and ordering from Snowflake streams requires significant engineering effort.
4. Schema changes break most pipelines
A new column or type change in Snowflake often requires manual updates to keep Databricks in sync.
5. High operational overhead
Orchestrators, Spark jobs, monitoring, alerting, and retries quickly become complex and expensive.
6. Risk of inconsistent or duplicate data
Without strong guarantees, Databricks may receive partial loads or duplicates, which directly impacts ML and AI accuracy.
All the Ways to Move Data from Snowflake to Databricks
There are several patterns for moving data from Snowflake to Databricks. They all live in the same general world of ETL, ELT, or data replication, but they make very different tradeoffs around latency, complexity, and control.
Below are the main approaches.
1. Batch file exports with Databricks Auto Loader
How it works
You export data from Snowflake to cloud storage (for example S3, GCS, or ADLS), then use Databricks Auto Loader to ingest those files into Delta Lake tables.
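As a rough sketch of this pattern, the snippet below unloads a Snowflake table to cloud storage with the Python connector and then picks the files up with Auto Loader. The account, credentials, stage, bucket paths, and table names are placeholders, and the two halves run in different environments: the unload wherever you schedule it, the ingest in a Databricks job or notebook.

```python
import snowflake.connector

# 1) Snowflake side: unload the table to cloud storage as Parquet.
#    Account, credentials, stage, and table names are placeholders.
conn = snowflake.connector.connect(
    account="xy12345.us-east-1",
    user="EXPORT_USER",
    password="...",
    warehouse="EXPORT_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
conn.cursor().execute("""
    COPY INTO @export_stage/orders/
    FROM orders
    FILE_FORMAT = (TYPE = PARQUET)
    HEADER = TRUE
    OVERWRITE = TRUE
""")

# 2) Databricks side: ingest whatever has landed with Auto Loader, then stop.
#    Bucket paths and the target table name are also placeholders.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/orders")
    .load("s3://my-bucket/exports/orders/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("main.analytics.orders"))
```

Each run is still a batch: freshness is bounded by how often you schedule the export.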
When it fits
- Daily or hourly refreshes
- Reporting and BI that do not need real-time updates
Tradeoffs
- Simple, but always batch
- Requires managing export schedules and file layouts
- Sensitive to schema changes
2. Snowflake external tables and Auto Loader
How it works
Snowflake writes data to external storage that both Snowflake and Databricks can see. Databricks uses Auto Loader or standard reads to load from there.
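A minimal Databricks-side sketch, assuming a shared external location and the table name shown: a standard batch read picks up the files Snowflake has written, and mergeSchema lets new columns flow into the Delta table.

```python
# Read the files Snowflake has written to the shared external location.
# The path and target table name are assumptions for illustration.
df = spark.read.parquet("s3://shared-lake/snowflake_external/orders/")

# Append into a Delta table; mergeSchema allows new columns to be added.
(df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("main.analytics.orders_raw"))
```

A plain read like this reprocesses every file on each run; tracking which files are new is exactly the bookkeeping Auto Loader would otherwise handle for you.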
When it fits
- Incremental batch ingestion
- Migration and offloading scenarios
Tradeoffs
- Still batch focused
- Extra storage cost and complexity
- Limited support for true change data capture
3. Managed ETL and replication platforms
This is where tools like Fivetran, Airbyte, and Estuary all live. They connect to Snowflake for you and handle most of the extract and load work into Databricks or Delta Lake. The key difference is how they treat time.
There are two main flavors.
3.1 Schedule-based ETL tools (Fivetran, Airbyte, Matillion, and similar)
These tools typically run syncs on a schedule and move data in batches.
Strengths
- Very quick to get started
- Great for standard SaaS-to-warehouse or warehouse-to-lakehouse replication
- Good fit when you only need data refreshed every few hours or once a day
Limitations
- Latency is tied to the schedule, not to individual changes
- Real-time or near real-time ML and streaming use cases are harder
- CDC support, where available, is often layered on top of a batch-oriented design
Best use cases
- Dashboards and reporting
- Periodic data warehouse to lakehouse syncs
- When cost and simplicity matter more than low latency
3.2 Real-time and CDC replication platforms (Estuary)
Estuary belongs in the same broad ETL and replication category, but is designed to support continuous change data capture with very low latency. Instead of waiting for scheduled syncs, Estuary streams inserts, updates, and deletes from Snowflake into Databricks Delta Lake as they happen.
How it works
- Connects to Snowflake using native CDC
- Streams each change into Estuary collections
- Materializes data into Databricks Delta tables in near real-time
- Also supports right-time control, letting teams choose sub-second, near real-time, or batch delivery
Strengths
- True CDC replication built for continuous updates
- Sub-second or near real-time latency
- Exactly-once delivery and schema enforcement
- Smooth handling of schema changes with no manual work
- No need for staging files or custom Spark jobs
Best for
- AI and ML feature pipelines that depend on fresh Snowflake data
- Real-time analytics and streaming use cases
- Continuous sync between warehouse and lakehouse
4. Custom Spark or Python pipelines
How it works
Engineering teams write their own Spark jobs, Python scripts, and orchestrations to pull from Snowflake and load into Databricks.
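A typical hand-rolled job looks something like the sketch below: pull an incremental slice through the Snowflake Spark connector, then merge it into Delta Lake. The connection options, secret scope, table names, key column, and updated_at watermark are all assumptions.

```python
from delta.tables import DeltaTable

# Snowflake Spark connector options; values here are placeholders.
# dbutils.secrets is available in Databricks notebooks and jobs.
sf_options = {
    "sfURL": "xy12345.us-east-1.snowflakecomputing.com",
    "sfUser": "ETL_USER",
    "sfPassword": dbutils.secrets.get("snowflake", "password"),
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "TRANSFORM_WH",
}

# Pull only rows changed in the last hour (assumes an updated_at column).
changes = (spark.read
    .format("snowflake")
    .options(**sf_options)
    .option("query", """
        SELECT * FROM orders
        WHERE updated_at > DATEADD('hour', -1, CURRENT_TIMESTAMP())
    """)
    .load())

# Upsert into the Delta target by primary key.
target = DeltaTable.forName(spark, "main.analytics.orders")
(target.alias("t")
    .merge(changes.alias("s"), "t.ORDER_ID = s.ORDER_ID")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```

Even this simplified version only handles inserts and updates; propagating deletes, persisting the watermark, and surviving schema changes are where most of the maintenance effort goes.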
When it fits
- Highly specialized logic that is hard to express in managed tools
- Teams that want full control and have staff to maintain it
Tradeoffs
- Significant engineering and maintenance overhead
- CDC and schema evolution are non-trivial to implement correctly
- Monitoring, retries, and observability are all custom work
5. Snowflake to Kafka to Databricks
How it works
Snowflake change events are pushed or replicated into Kafka, then Databricks consumes from Kafka as a streaming source.
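A minimal Databricks consumer sketch, assuming change events already arrive on a Kafka topic as JSON with the fields shown; the broker address, topic name, event schema, and table names are placeholders.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Assumed shape of the change events on the topic.
event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read the topic as a stream; broker and topic names are placeholders.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "snowflake.orders.changes")
    .option("startingOffsets", "latest")
    .load())

# Kafka delivers bytes; decode the value and parse the JSON payload.
parsed = (events
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*"))

# Continuously append the parsed changes to a Delta table.
(parsed.writeStream
    .option("checkpointLocation", "s3://lake/_checkpoints/orders_changes")
    .toTable("main.analytics.orders_changes"))
```

Getting Snowflake changes onto the topic in the first place, and turning this appended change log back into an up-to-date table, are separate problems you still have to solve.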
When it fits
- Organizations that already run Kafka at scale
- Event-driven architectures that treat data as streams
Tradeoffs
- Operationally complex
- Not ideal for straightforward table replication
- Another piece of infrastructure to manage
Why Right-Time Data Movement Matters
Not all pipelines need the same level of freshness. Some workloads run perfectly well with daily updates, while others require data within seconds. What most teams discover is that latency is not a single requirement. It varies across analytics, machine learning, and operational workflows. This is why right-time data movement matters.
Right-time simply means choosing the timing that fits each use case without building separate pipelines or tools. Some examples:
- Sub-second updates for real-time inference, fraud detection, or anomaly alerts
- Near real-time updates for ML feature stores, operational analytics, and fast experimentation
- Batch updates for dashboards, reporting, or compliance workloads
Traditional ETL pipelines force teams into a batch-only mindset. Real-time streaming systems force them into continuous processing even when they do not need it. Right-time data movement removes this constraint by supporting all timing models inside one platform.
When moving data from Snowflake to Databricks, right-time replication ensures that machine learning pipelines, notebooks, and Delta Lake tables always see the most current version of your data. This leads to more accurate models, faster experimentation, and simpler architectures that do not require multiple separate systems for batch and streaming.
Step-by-Step Guide: Snowflake to Databricks with Estuary
You can set up a Snowflake to Databricks pipeline in a few minutes. The process requires no Spark jobs, no orchestration tools, and no staging in cloud storage. Estuary handles capture, storage, transformation, and delivery for you.
Below is the complete setup workflow.
Step 1: Connect Snowflake as Your Source
- Log in to the Estuary Dashboard. If you don’t have an account yet, create one for free — no credit card required.
- In the left sidebar, click on Sources, then hit the + New Source button.
- From the list of connectors, select Snowflake and click Capture.
- Enter your Snowflake credentials:
  - Host: Your Snowflake account URL (e.g. xy12345.us-east-1.snowflakecomputing.com)
  - Database and Warehouse: Where your source data lives.
  - User and Password: A Snowflake user with appropriate roles (we recommend creating a dedicated ESTUARY_USER).
- Estuary will auto-discover your Snowflake schema. Select one or more tables to sync. Estuary will now capture all inserts, updates, and deletes in real time using CDC.
Step 2: Set Up Databricks as the Destination
- From the dashboard sidebar, go to Destinations and click + New Materialization.
- Select Databricks from the list and click Materialize.
- Fill in your Databricks configuration details:
  - Address: Host and port for your SQL warehouse.
  - HTTP Path: From your SQL warehouse.
  - Catalog Name: Name of your Unity Catalog.
  - Personal Access Token: A personal access token generated in your Databricks workspace.
- Link the collections from your Snowflake source to this Databricks materialization. Estuary will ensure schema compatibility.
Step 3: Save and Activate the Pipeline
- Click Save & Publish to activate your pipeline. Estuary begins streaming data from Snowflake to Databricks immediately.
- From the dashboard, you can:
- Monitor sync status and latency in real time
- View row counts and throughput
- Edit schemas and transformations
- Enable logging and error alerts
- Want to transform data in-flight? Use Estuary’s UI for field mappings, or go deeper with SQL and TypeScript derivations.
Need to scale across more tables or use cases? Repeat the same flow. Estuary supports multiple pipelines and horizontal scaling.
You’ve now built a production-ready, real-time data pipeline from Snowflake to Databricks in minutes, with zero code and full observability.
Handling Schema Evolution Automatically
Snowflake schemas change over time. New columns are added, types are updated, and tables evolve. Estuary is designed so these changes flow through to Databricks with minimal work from you.
Snowflake side
- The Snowflake CDC connector manages its own streams and transient staging tables in a schema such as ESTUARY_STAGING.
- When you add or change columns in a source table, Snowflake streams expose the updated schema.
- Estuary picks up the new structure on the next polling cycle (default 5 minutes, configurable with the interval field in the capture spec).
- The corresponding Flow collection schema is updated so new fields are available downstream.
You do not need to manually manage streams or staging tables.
Databricks side
- The Databricks materialization writes into Delta Lake tables using merge by default.
- As the collection schema evolves, new fields can be added to the Databricks table as part of normal syncs, assuming your Delta table allows schema evolution.
- You can optionally enable delta_updates for high-volume workloads, and columnMapping in Delta if your environment needs name-based schema evolution.
Most of the time, new fields simply appear in the target table after the next sync.
What you actually need to do
Usually nothing, except:
- Update any transformations if they rely on fields that changed or were removed.
- Verify your Delta tables are configured to allow schema evolution and, if needed, enable column mapping for more complex changes.
You do not need to rebuild the pipeline, recreate tables, or write Spark code just because the Snowflake schema changed.
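If you want to double-check the Delta side, a small sketch like the one below (the table name is a placeholder) enables automatic schema merging for the session and turns on name-based column mapping for a target table.

```python
# Let MERGE and append operations add new columns automatically in this session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Enable name-based column mapping on the target table, which more complex
# changes such as renames rely on. The table name is a placeholder.
spark.sql("""
    ALTER TABLE main.analytics.orders SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5'
    )
""")
```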
Common Challenges and How Estuary Solves Them
Even well-designed Snowflake to Databricks pipelines face predictable issues. Estuary addresses these challenges so data stays consistent and pipelines stay reliable.
1. Large initial backfills
Challenge: Copying the full contents of a Snowflake table into Databricks can be slow and resource-heavy.
Estuary solution: Performs an automated backfill once, then switches to CDC. You can control the process with backfill settings or adjust the Snowflake polling interval for cost or freshness.
2. Latency and warehouse cost
Challenge: Real-time syncs often require keeping a Snowflake warehouse running.
Estuary solution: Uses a configurable polling interval (default 5 minutes) so the warehouse only runs when needed. Lower intervals give fresher data, while higher intervals reduce cost.
3. CDC complexity
Challenge: Handling inserts, updates, deletes, ordering, and checkpoints manually is error prone.
Estuary solution: Manages Snowflake streams, staging tables, and checkpointing internally with exactly-once guarantees.
4. Schema evolution
Challenge: New columns or type changes often break pipelines.
Estuary solution: Automatically detects schema changes from Snowflake and updates the downstream Delta Lake table when possible.
5. Consistency and duplicate prevention
Challenge: Many DIY and batch pipelines introduce duplicate rows or partial updates.
Estuary solution: Materializes into Delta Lake using merge or delta updates with consistent keys and strong ordering.
6. Operational overhead
Challenge: Multiple tools, orchestrators, and storage layers increase complexity.
Estuary solution: Capture, storage, transformation, and delivery all happen in one platform with unified monitoring.
Conclusion
Moving data from Snowflake to Databricks is no longer just a matter of linking two platforms. It is about keeping analytics, machine learning, and operational systems aligned with the freshest version of your data. Batch pipelines can still work for slow-changing workloads, but real-time or near real-time use cases require continuous replication that can adapt as schemas evolve and tables grow.
Estuary makes this possible by capturing Snowflake changes through CDC, shaping them in Flow collections, and delivering them directly into Delta Lake with strong consistency and minimal operational work. This approach gives teams dependable pipelines, predictable costs, and the flexibility to choose the timing that fits each workload, whether sub-second, near real-time, or scheduled batch.
If your goal is to unify Snowflake analytics with Databricks processing, Estuary provides a straightforward and reliable way to keep both systems in sync and ready for modern AI and data engineering needs.
Next Steps
- Explore the Estuary Demo: See how right-time data pipelines work in action.
- Start Your First Integration: Set up your Snowflake to Databricks pipeline in minutes with Estuary’s no-code interface.
- Learn from the Documentation: Explore configuration, delta updates, and advanced transformations.
- Talk to an Expert: Have specific latency, compliance, or architecture needs? Connect with our team.
FAQs
How do I replicate Snowflake tables to Delta Lake with real-time updates?
Use a CDC-based platform such as Estuary: connect Snowflake as a capture source, select the tables you want to sync, and materialize them into Databricks Delta tables. Changes then stream continuously instead of waiting on scheduled batch jobs.
How does schema evolution work when syncing Snowflake to Databricks?
Estuary detects new or changed columns through the Snowflake CDC connector and updates the Flow collection schema automatically. The new fields are added to the Delta table on the next sync, provided the table allows schema evolution.
How are inserts, updates, and deletes handled in Databricks when using Snowflake CDC?
Each change event captured from Snowflake is applied to the Delta table with a merge on consistent keys (or with delta updates for high-volume workloads), so the target stays free of duplicates and partial loads.

About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
