redshift

17 min read

March 16, 2026

How to Load Data into Amazon Redshift: 4 Methods Compared

Learn 4 proven methods to load data into Amazon Redshift: COPY command, AWS Glue, Zero-ETL, and Estuary for real-time CDC. Step-by-step guide with code examples and a comparison table.

Dani Pálma Head of Data & Marketing

Share this article

Hayden AI

From Postgres to Analytics: How Hayden AI Powers Data Movement with Estuary.

Read Success Story

Share this article

Summarize this page with AI

Table of Contents

Start Building For Free

Amazon Redshift is one of the most widely used cloud data warehouses in the world. Getting data into it reliably, on schedule, and at scale is where most engineering teams run into real challenges. The wrong method can mean stale dashboards, brittle pipelines, or AWS bills that spiral out of control.

This guide covers four proven methods to load data into Amazon Redshift, each suited to a different use case:

Method 1: The COPY Command (native bulk loading from S3)
Method 2: AWS Glue (serverless ETL with complex transforms)
Method 3: Amazon Zero-ETL Integration (near-real-time CDC from Aurora and RDS)
Method 4: Estuary (real-time CDC from any database, SaaS tool, or API)

For each method, you will find an explanation of how it works, when to use it, setup steps, code examples, and an honest look at the trade-offs. The comparison table and decision guide at the end will help you choose the right approach for your situation.

What to Know Before You Start

Redshift Uses Massively Parallel Processing

Redshift distributes query and load work across multiple compute nodes and slices. This means loading data via bulk files is dramatically faster than inserting rows one at a time. The COPY command, which AWS explicitly recommends as the most efficient way to load a table, reads multiple files in parallel and distributes them across all slices simultaneously.

Redshift is a Destination, Not a Pipeline

Redshift does not extract or transform data on its own. Everything in this guide describes tools and services that move data from your source systems into Redshift. The pipeline lives outside of Redshift.

Two Deployment Options

Provisioned clusters: You choose node types and counts. You pay for uptime whether the cluster is active or idle.
Redshift Serverless: Scales automatically. You pay per RPU-second. Great for variable workloads, but continuous streaming pipelines can prevent auto-pause and raise costs unexpectedly.

S3 as the Staging Layer

Almost every method in this guide moves data through Amazon S3 at some point. The COPY command reads directly from S3. Estuary uses S3 as a high-performance staging area. For best results, your S3 bucket and Redshift cluster should always be in the same AWS region.

Supported File Formats

Redshift supports CSV, JSON, Parquet, ORC, and Avro, as well as Gzip and Zstandard compression. Parquet is the recommended format for analytical workloads because of its columnar structure and compression efficiency.

Maintenance After Large Loads

After any significant bulk load, run VACUUM to reclaim deleted space and ANALYZE to update table statistics. These commands help the query optimizer make better decisions and keep performance healthy over time.

Method 1: The COPY Command

Best for: bulk loading large files from Amazon S3, DynamoDB, Amazon EMR, or SSH remote hosts

What It Is

The COPY command is Redshift's native built-in bulk loading tool. It is the fastest and most cost-effective way to load a table because it uses Redshift's MPP architecture to read multiple files simultaneously across all compute slices. AWS documentation explicitly states it is the most efficient loading method available.

How It Works

You stage data files in S3 and run a COPY command that tells Redshift where to find them, what format they are in, and which IAM role to use for authentication. Redshift reads the files in parallel across all nodes and loads them into the target table in one operation.

Basic COPY Example

plaintextCOPY your_table
FROM 's3://your-bucket/data-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS PARQUET;

plaintext-- For CSV files with a header row:
COPY your_table
FROM 's3://your-bucket/data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
CSV IGNOREHEADER 1
REGION 'us-east-1';

Auto-COPY: Continuous File Ingestion

AWS launched continuous auto-copy for Redshift in 2024. A single SQL command sets up a job that automatically loads new files landing in an S3 path, without any scheduling infrastructure.

plaintextCOPY store_sales
FROM 's3://your-bucket/store_sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS PARQUET
JOB CREATE job_store_sales AUTO ON;

File Preparation Tips

Split files into equal sizes between 1 MB and 1 GB. Redshift assigns one file per slice, so equal-sized files mean even parallel load distribution.
Compress files with Gzip or Zstandard before uploading to reduce transfer time and cost.
Run only one COPY command per target table at a time. Multiple concurrent COPY commands on the same table are serialized and offer no speed benefit.
Use IAM roles for authentication rather than long-lived access keys where possible.

Pros and Cons

Pros	Cons
Built into Redshift, no additional tools needed	No support for operational databases or SaaS sources directly
Extremely fast for large batch loads from S3	No schema drift handling, manual changes required
Supports CSV, JSON, Parquet, ORC, and Avro	Error handling is manual via STL_LOAD_ERRORS system table
Auto-COPY enables hands-off continuous S3 ingestion	Not suitable for sub-minute data freshness requirements
Low cost, uses existing Redshift compute	S3 staging must be managed by your team

Method 2: AWS Glue

Best for: complex data transformation within the AWS ecosystem before loading to Redshift

What It Is

AWS Glue is Amazon's serverless, fully managed ETL service. It runs Apache Spark under the hood, which gives it the ability to handle complex transformations, join datasets from multiple sources, and write to Redshift. You write transformation logic in Python or Scala, and Glue manages the compute infrastructure.

How It Works

An AWS Glue pipeline has three main components.

The Crawler scans your source data and registers its schema in the Glue Data Catalog, a central metadata store.
The ETL Job reads from the catalog, applies your PySpark or Scala transformation logic, and writes the result to Redshift.
A Scheduler or event trigger kicks off the job on your chosen cadence.

Supported Sources

AWS Glue connects to over 70 data sources natively. These include Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Kinesis, MongoDB, and any JDBC-compatible database. Connecting to non-AWS SaaS tools typically requires custom development or third-party connectors in the AWS Marketplace.

PySpark Job Example

Manage imports:

python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

Initialize contexts:

python
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Read data from a table:

python
# Read from source catalog
source_df = glueContext.create_dynamic_frame.from_catalog(
	database='my_database',
	table_name='orders'
)

Add mappings:

python
# Apply transforms (filter, rename, cast)
transformed = ApplyMapping.apply(
	frame=source_df,
	mappings=[('order_id', 'long', 'order_id', 'long'),
          	('amount', 'double', 'revenue', 'double')]
)

Store transformed data to Redshift:

python
# Write to Redshift
glueContext.write_dynamic_frame.from_jdbc_conf(
	frame=transformed,
	catalog_connection='redshift_connection',
	connection_options={
    	'dbtable': 'analytics.orders_clean',
    	'database': 'analytics'
	},
    redshift_tmp_dir='s3://my-temp-bucket/glue-tmp/'
)

Pros and Cons

Pros	Cons
Serverless, no cluster management	Cold start latency of several minutes per job run
Handles complex multi-step transforms	Requires Apache Spark expertise
Auto schema discovery via Glue crawler	Primarily AWS-ecosystem sources, weak SaaS support
Deep AWS Lake Formation and EMR integration	Expensive for high-frequency or continuous workloads
Supports 70+ AWS and JDBC data sources	Schema changes require re-running the crawler

Method 3: Amazon Zero-ETL Integration

Best for: near-real-time CDC from Aurora, RDS, and DynamoDB sources already running in the same AWS region

What It Is

Amazon Zero-ETL integration is a native AWS feature that replicates data from Aurora and RDS databases directly into Redshift in near real time, without requiring any ETL pipeline. AWS manages the Change Data Capture at the storage layer. Data is available in Redshift within seconds of being written to the source database.

Supported Sources

Aurora MySQL-Compatible Edition
Aurora PostgreSQL-Compatible Edition
Amazon RDS for MySQL
Amazon RDS for PostgreSQL
Amazon DynamoDB (15-30 minute refresh interval, not continuous CDC)
Enterprise SaaS applications via AWS Glue Zero-ETL integrations: Salesforce, Zendesk, ServiceNow, SAP, and others

Setting Up Zero-ETL

Enable binary logging on your Aurora or RDS source. For Aurora PostgreSQL, confirm you are running version 16.4 or higher.
Open the AWS Console and navigate to your Aurora or RDS cluster.
Choose Zero-ETL integrations and create a new integration, selecting your Redshift cluster or Serverless workgroup as the target.
Configure a resource policy on the Redshift namespace to authorize the integration source.
Create a destination database in Redshift to receive the replicated data.
Monitor integration status via Redshift system views.

Note: All source tables must have a primary key. Tables without a primary key cannot be replicated and will be excluded from the integration. For Aurora PostgreSQL, you must also configure at least one data filter pattern specifying which databases and tables to replicate.

Key Limitations

Source and target must be in the same AWS region. Cross-region replication is not supported.
Non-AWS sources are out of scope. Oracle, SQL Server, MongoDB, Google Cloud, and on-premises databases cannot use Zero-ETL.
Replicated tables in Redshift are read-only. You cannot run INSERT, UPDATE, or DELETE on them directly.
DDL changes on source tables, such as adding or renaming columns, can trigger a full table resync. On large tables this can make the table unavailable for queries for 20 to 90 minutes.
Continuous CDC activity on Redshift Serverless prevents auto-pausing, which can cause significantly higher RPU costs for teams on the serverless tier.
DynamoDB Zero-ETL refreshes every 15-30 minutes at minimum, not continuously.
Each Redshift data warehouse supports a maximum of 50 zero-ETL integrations.

Pros and Cons

Pros	Cons
No pipeline to build or manage for supported sources	AWS sources only, no external databases or cross-cloud support
Near-real-time latency from Aurora and RDS sources	Replicated tables are read-only in Redshift
No additional cost beyond existing Aurora and Redshift charges	DDL changes can cause 20-90 minute table unavailability
Handles initial backfill automatically	Continuous CDC prevents Redshift Serverless from auto-pausing
Supports SaaS apps via AWS Glue integration	All tables must have a primary key
	Max 50 integrations per Redshift warehouse

Method 4: Estuary

Best for: real-time CDC from any database, SaaS tool, or API into Redshift, with automatic schema evolution and managed infrastructure

What It Is

Estuary is a right-time data platform that delivers sub-second Change Data Capture from databases and SaaS tools into Amazon Redshift. It is fully managed, meaning you configure pipelines through a web UI and Estuary handles the infrastructure, scaling, retries, and schema changes automatically.

Unlike Zero-ETL, Estuary works with any source regardless of cloud provider. The Redshift connector stages data in S3 for high-performance bulk loading, giving you real-time CDC with the loading efficiency of the COPY command.

How It Works

Estuary pipelines run as three connected stages:

Capture: Estuary connects to your source and captures changes as a continuous stream of documents.
Collection: Changes are stored in a durable, replayable, schema-enforced collection.
Materialization: The Redshift connector reads from the collection and writes to your Redshift tables using S3 as a staging layer, then issues a COPY command.

Supported Sources

Estuary supports 120+ source connectors, including SQL databases (PostgreSQL, MySQL, SQL Server), NoSQL databases (MongoDB, Amazon DocumentDB), object storage (Amazon S3, GCS), SaaS sources (Salesforce, HubSpot, Shopify), and many more. See the full connector list.

Step-by-Step Setup

Step 1: Sign up

Create a free account at dashboard.estuary.dev/register. The free tier includes 2 connector instances and up to 10 GB per month, letting you test out your pipeline without any commitment.

Step 2: Set up your source capture

In the Estuary dashboard, go to Sources and click + New Capture. Search for your source connector, enter your credentials, and click Next.

Estuary runs a discovery phase that automatically infers your source schema and proposes collections. Review auto-discovered collections and then Save and Publish your capture.

Note that some sources will require additional setup to connect. For example, PostgreSQL sources require logical replication to be enabled on your database. Estuary's capture connector docs walk through this per source.

Step 3: Configure network access for Redshift

Estuary connects to your Redshift cluster from a set of fixed egress IP addresses. You have two options:

Public access with IP allowlist: make your cluster publicly accessible and allow Estuary's egress IPs on port 5439. All connections are SSL-only.
SSH tunnel: if your cluster is in a private VPC, set up an EC2 bastion host and configure an SSH tunnel. Note that S3 staging traffic uses HTTPS and bypasses the tunnel regardless.

Step 4: Create your S3 staging bucket

Create an S3 bucket in the same region as your Redshift cluster. Create an IAM user with read and write access to the bucket and note its Access Key ID and Secret Access Key. The IAM policy should look something like:

json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject","s3:PutObject","s3:DeleteObject","s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::your-staging-bucket",
      "arn:aws:s3:::your-staging-bucket/*"
    ]
  }]
}

Step 5: Create the Redshift materialization

In the Estuary dashboard, go to Destinations and click + New Materialization. Search for Amazon Redshift and enter your connection details:

Field	Description	Example
Address	Cluster host and port	my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439
User	Redshift database user	etl_user
Password	Database password	your_password
Database (optional)	Target database name	analytics
Schema (optional)	Database schema (defaults to public)	public
S3 Staging Bucket	Bucket name	my-redshift-staging
AWS Access Key ID	IAM key for S3 bucket	AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key	IAM secret for S3 bucket	wJalrXUtnFEMI/K7MDENG...
Region	Bucket and cluster region	us-east-1
Bucket Path (optional)	S3 key prefix for staged files	estuary/staging/

Click Source from Capture to link your capture, then click Save and Publish. Estuary performs an initial backfill of existing data and switches to continuous CDC mode.

Step 6: Verify the pipeline

In the dashboard, check that the materialization shows a Running status. Then run a quick verification in Redshift:

sql
-- Confirm table was created
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

sql
-- Check row count
SELECT COUNT(*) FROM public.orders;

sql
-- Inspect Estuary metadata table (for exactly-once bookkeeping)
SELECT * FROM public.flow_checkpoints_v1;

YAML Configuration Reference

Estuary connectors use a backing specification file for configuration. If you want to manage your pipeline programmatically rather than through the UI, you can use flowctl to work with your connectors’ spec files.

Alternatively, you can edit the specification file directly in the UI using the Advanced Specification Editor.

plaintextmaterializations:
  your_prefix/redshift-main:
    endpoint:
  	   connector:
        image: ghcr.io/estuary/materialize-redshift:v2
        config:
      	address: my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439
      	user: etl_user
      	password: secret_password
      	database: analytics
      	schema: public
      	awsAccessKeyId: AKIAIOSFODNN7EXAMPLE
      	awsSecretAccessKey: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      	bucket: my-redshift-staging
      	region: us-east-1
      	bucketPath: estuary/staging/
    bindings:
  	   - resource:
          table: orders
        source: your_prefix/postgres/orders
  	   - resource:
      	table: customers
        source: your_prefix/postgres/customers

Update Modes

Standard mode (default): Estuary reads the current document state, merges incoming changes, and writes the result as an upsert keyed on the document's primary key. This keeps Redshift tables as a current-state mirror of your source.
Delta updates mode: Estuary appends each change event as a new row without merging. Useful for audit logs and change history. Enable per binding with delta_updates: true.

Performance Tips

Tip: Run at most one Redshift materialization per schema. Estuary uses a table-level lock on metadata tables during each transaction commit. Multiple materializations sharing a schema must take turns acquiring this lock, which serializes commits and reduces throughput. Add all collections as bindings to a single materialization instead.

Tip: For large initial backfills, set maxTxnDuration to 10-30 minutes in your shard configuration to allow large batches to accumulate before committing. This significantly improves backfill speed compared to many small transactions.

Tip: For steady-state streaming where data freshness of a few hours is acceptable, configure a syncFrequency (e.g., 4h) to batch updates. Fewer, larger syncs are much more efficient in Redshift than frequent small commits.

Schema Handling

Estuary creates the necessary Redshift tables automatically when you publish a materialization. Manually pre-created tables are not supported.

All table and column identifiers are treated as lowercase by default. If your source has fields that differ only by case, such as myField and MyField, the materialization will fail. Rename fields from the capture UI or use projections in your collection spec before they reach Redshift.

Source-Specific Guides

PostgreSQL to Redshift: Stream PostgreSQL to Redshift in real time
MySQL to Redshift: Sync MySQL to Redshift in real time
MongoDB to Redshift: Connect MongoDB to Redshift

Ready to try Estuary? Connect your first source to Amazon Redshift in minutes. Start free

Pros and Cons

Pros	Cons
Free tier	Free tier limited to 10 GB/month and 2 connectors
Works with 120+ sources across any cloud or on-premises	S3 staging bucket and IAM credentials required
Sub-second CDC latency from source to Redshift	One materialization per schema recommended for performance
Exactly-once delivery with automatic retry and recovery	IAM AssumeRole not yet supported; long-lived access keys required for S3
Managed data movement platform
Automatic schema evolution and table creation
Supports both standard merge and delta update modes

Side-by-Side Comparison: All 4 Methods

Feature	COPY Command	AWS Glue	Zero-ETL	Estuary
Setup complexity	Medium	High	Medium	Low
Real-time CDC	No	No	Near-real-time (AWS only)	Yes, sub-second
Source types	S3, DynamoDB, EMR, SSH	70+ AWS and JDBC sources	Aurora, RDS, DynamoDB, SaaS apps	120+ databases, SaaS, APIs, files
Non-AWS sources	No	Limited via JDBC	No	Yes
SaaS connectors	No	Marketplace only	Limited (via AWS Glue)	Yes, built-in
Schema evolution	Manual	Semi-auto via crawler	Limited, DDL causes resyncs	Automatic
Exactly-once delivery	No	No	Yes	Yes
Latency	Batch (hours to days)	Minutes (cold start)	Seconds (Aurora/RDS)	Configurable, from seconds to batch syncs
Infrastructure needed	S3 + IAM role	Serverless (AWS manages)	None (AWS manages)	S3 + IAM role
Best use case	Bulk loads from S3	AWS ETL with transforms	Aurora and RDS near-real-time	Any source, real-time, multi-connector

Which Method Should You Choose?

Your situation	Best method
You have data files in S3 and need scheduled bulk loads	COPY Command
You need complex multi-step data transformation before loading, all within AWS	AWS Glue
Your source is Aurora or RDS and you want near-real-time sync with no pipeline to build	Zero-ETL Integration
Your source is PostgreSQL, MySQL, MongoDB, Salesforce, or any non-AWS system	Estuary
You need real-time CDC with exactly-once semantics and automatic schema handling	Estuary
You have sources across multiple clouds or on-premises databases	Estuary
You want one tool to connect all your databases and SaaS tools to Redshift	Estuary
You want to avoid Spark expertise, cold starts, and AWS-only limitations	Estuary

Best Practices for Loading Data into Redshift

Split files into equal sizes between 1 MB and 1 GB. Redshift assigns one file to each node slice. Equal-sized files give every slice the same amount of work and maximize parallel load efficiency.
Compress files before staging in S3. Gzip and Zstandard are supported natively and reduce both transfer time and cost without any extra configuration.
Keep your S3 bucket in the same region as your Redshift cluster. Cross-region data transfer adds latency and cost with no benefit.
Run VACUUM and ANALYZE after large loads. VACUUM reclaims deleted row space and restores sort order. ANALYZE updates statistics that the query optimizer uses to build efficient query plans.
Use staging tables for bulk transformations. Load raw data into a staging table first, apply your transformation logic, then use ALTER TABLE APPEND to merge clean data into the production table. This is faster than INSERT INTO SELECT.
Run only one COPY per table at a time. Multiple concurrent COPY commands targeting the same table are serialized by Redshift. Running them in parallel offers no speed benefit and can slow things down.
Use IAM roles rather than static access keys wherever possible. Roles are easier to rotate, scope, and audit than long-lived access key pairs.

Troubleshooting Common Redshift Load Errors

Error or Symptom	Likely Cause	Fix
Rows in STL_LOAD_ERRORS after COPY	Type mismatch, extra columns, or malformed data in source files	Query STL_LOAD_ERRORS for the file name and line number. Fix the source file or use COPY options like IGNOREHEADER, MAXERROR, or ACCEPTINVCHARS.
Schema mismatch fails the COPY	Source file has more or fewer columns than the target table	Update the table schema to match the source, or use a JSONPath or column list in the COPY command.
Serializable isolation violation in Estuary logs	Multiple Estuary materializations are sharing the same Redshift schema	Ensure each materialization uses a unique endpoint schema. Use per-binding schema overrides to still land tables in a shared target schema.
Zero-ETL table unavailable for 20-90 minutes	DDL change on source table triggered a full table resync	Plan schema changes during low-traffic windows. Monitor resync progress via Redshift system views.
Redshift Serverless bills higher than expected	Continuous CDC from Zero-ETL prevents auto-pause	Set RPU-hour limits in the console, or migrate continuous CDC workloads to a provisioned RA3 cluster.
Permission denied on COPY or staging	IAM role missing S3 read permissions or not attached to the cluster	Attach the IAM role with the correct S3 policy to the Redshift cluster in the AWS Console.
S3 region mismatch error	S3 staging bucket is in a different region than the Redshift cluster	Move or recreate the staging bucket in the same region as your Redshift cluster.

Conclusion

There is no single right method for loading data into Amazon Redshift. Each approach covered in this guide fits a different set of requirements.

The COPY command remains the fastest and most cost-effective option when your data already lives in S3. AWS Glue is the right choice when you need complex transformation logic before data lands in Redshift and you are fully within the AWS ecosystem.

Zero-ETL is an excellent option if your source is Aurora or RDS and you want near-real-time sync without building and maintaining a pipeline, though its AWS-only constraint rules it out for most teams with diverse source systems.

Estuary covers what the other methods cannot: real-time CDC from any database or SaaS tool, regardless of cloud provider, with exactly-once delivery, automatic schema evolution, and managed infrastructure. If your team needs a single platform that connects all your sources to Redshift and keeps them in sync continuously, Estuary is the most operationally straightforward path to get there.

Start Loading Data into Amazon Redshift with Estuary

Stop building and maintaining batch pipelines that leave your analytics team waiting. Estuary delivers low-latency CDC from any database or SaaS tool into Amazon Redshift, exactly-once and ready to scale.

About the author

Dani PálmaHead of Data & Marketing

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Daniel focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.

How to Load Data into Amazon Redshift: 4 Methods Compared

What to Know Before You Start

Redshift Uses Massively Parallel Processing

Redshift is a Destination, Not a Pipeline

Two Deployment Options

S3 as the Staging Layer

Supported File Formats

Maintenance After Large Loads

Method 1: The COPY Command

What It Is

How It Works

Basic COPY Example

Auto-COPY: Continuous File Ingestion

File Preparation Tips

Pros and Cons

Method 2: AWS Glue

What It Is

How It Works

Supported Sources

PySpark Job Example

Pros and Cons

Method 3: Amazon Zero-ETL Integration

What It Is

Supported Sources

Setting Up Zero-ETL

Key Limitations

Pros and Cons

Method 4: Estuary

What It Is

How It Works

Supported Sources

Step-by-Step Setup

YAML Configuration Reference

Update Modes

Performance Tips

Schema Handling

Source-Specific Guides

Pros and Cons

Side-by-Side Comparison: All 4 Methods

Which Method Should You Choose?

Best Practices for Loading Data into Redshift

Troubleshooting Common Redshift Load Errors

Conclusion

Start Loading Data into Amazon Redshift with Estuary

Start streaming your data for free

About the author

Streaming Pipelines.

Simple to Deploy.

Simply Priced.