Estuary

How to Load Data from S3 to Redshift: 3 Methods Compared

Discover a step-by-step guide on how to load data from S3 to Redshift using the COPY command, AWS Glue, and Estuary. Simplify and optimize your data pipeline process.

s3 to redshift
Share this article

Moving data from Amazon S3 to Amazon Redshift is one of the most common workflows for analytics teams. S3 is often used as a low-cost landing zone for raw files, while Redshift is used to query, model, and analyze that data at scale.

There are three common ways to load data from S3 to Redshift:

  1. Use the Redshift COPY command for fast bulk loading.
  2. Use AWS Glue for managed ETL and transformation workflows.
  3. Use Estuary when you want an automated, low-maintenance pipeline with near real-time sync.

The right method depends on how often your data changes, how much transformation you need, and how much infrastructure you want to manage.

Quick Comparison: S3 to Redshift Methods

MethodBest ForProsLimitations
Redshift COPY commandBulk loading files from S3Fast, native, cost-effectiveRequires SQL, IAM setup, manual orchestration
AWS GlueBatch ETL and transformation jobsServerless, AWS-native, good for transformationsMore setup, often batch-oriented, can be complex
EstuaryAutomated or real-time pipelinesLow-code, continuous sync, easier operationsBasic technical knowledge

Before You Load Data from S3 to Redshift

Whichever method you pick, you will need:

  • An S3 bucket containing your source files
  • A Redshift cluster or Redshift Serverless workgroup, and a target database
  • IAM permissions that let Redshift or Glue read your S3 bucket (IAM roles for COPY)
  • Data files in a supported format such as CSV, JSON, Parquet, Avro, ORC, or delimited text
  • A plan for schema changes, duplicate rows, failed loads, and refresh frequency

For the COPY and Glue methods you also create the destination table yourself. For Estuary, the connector creates the tables, which changes the prerequisites slightly (covered in Method 3).

Method #1: Using COPY Command to Load Data from S3 to Redshift

The COPY command is the fastest native way to bulk load files from S3 into Redshift. It is the right choice when your files are already prepared and you want a direct load into an existing table.

It fits teams that load large batches of already-clean files, are comfortable with SQL and IAM roles, and do not need real-time sync.

When to Use the COPY Command

Use COPY when:

  • You need to load large batches of files from S3
  • Your data is already clean or lightly transformed
  • You are comfortable writing SQL
  • You can manage IAM roles and load orchestration
  • You do not need real-time synchronization

Step 1: Create or Identify the S3 Bucket

Upload your files to an S3 bucket. Organize them by source, table, and date when possible.

Example structure:

plaintext
s3://company-data/events/year=2026/month=06/day=08/

This makes it easier to manage incremental loads and troubleshoot failed files.

Step 2: Create the Redshift Target Table

Before running COPY, create the destination table in Redshift.

Example:

sql
CREATE TABLE public.events ( event_id VARCHAR(255), user_id VARCHAR(255), event_type VARCHAR(100), event_time TIMESTAMP, properties SUPER );

Make sure the Redshift schema matches the structure of your S3 files.

Step 3: Configure IAM Permissions

Redshift needs permission to read objects from your S3 bucket. The recommended approach is to attach an IAM role to your Redshift cluster or workgroup and grant that role the required S3 permissions.

At minimum, the role usually needs permission to list the bucket and read the objects being loaded.

Step 4: Run the COPY Command

Example for JSON data:

sql
COPY public.events FROM 's3://company-data/events/' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' FORMAT AS JSON 'auto';

Example for CSV data:

sql
COPY public.customers FROM 's3://company-data/customers/customers.csv' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' CSV IGNOREHEADER 1;

Watch out for: duplicate and failed loads

This is where COPY bites teams in production. COPY does not deduplicate. If a load fails partway and you re-run it against the same S3 prefix, you can load the same rows twice and silently double your counts. The recovery is manual: truncate and reload, or load into a staging table and MERGE into the target. Two habits prevent it. Load through a staging table and MERGE on a key rather than appending directly, and use a COPY manifest so the command loads exactly the files you intend, not whatever happens to be in the prefix.

A single malformed row also aborts the entire batch by default. For large loads, set an explicit error tolerance and inspect the load-error system tables so one bad file does not block the rest.

COPY is native, fast, and cheap, but you own the orchestration, scheduling, error handling, and monitoring for every recurring load.

Method #2: Load Data from S3 to Redshift Using AWS Glue

AWS Glue is a serverless data integration service. It can discover data, catalog metadata, transform files, and write data into Redshift.

AWS Glue is a better fit than a plain COPY command when you need transformations, schema handling, or managed ETL jobs.

Important Note About Glue Crawlers

A Glue crawler does not load S3 data into Redshift. A crawler scans your data and creates or updates metadata tables in the Glue Data Catalog. To actually move data into Redshift you still need a Glue ETL job or Glue Studio workflow. Teams routinely lose a day to this misunderstanding, so plan for the ETL job from the start.

When to Use AWS Glue

Use AWS Glue when:

  • You need to transform data before loading it
  • You want serverless ETL inside AWS
  • You need scheduled batch jobs
  • You already use the Glue Data Catalog
  • You need to read from multiple AWS data sources

Step 1: Prepare the S3 Source Data

Store your files in S3 and use a consistent structure. For example:

plaintext
s3://company-data/orders/ s3://company-data/customers/ s3://company-data/events/

Use partitioned folders if your data is time-based.

Step 2: Create an IAM Role for AWS Glue

Create an IAM role that allows Glue to:

  • Read from the S3 source bucket
  • Write temporary files if needed
  • Connect to Redshift
  • Access any required secrets or credentials

Step 3: Use a Glue Crawler to Catalog S3 Data

Create a Glue crawler and point it to the S3 path. The crawler detects the schema and creates metadata tables in the Glue Data Catalog.

This is useful for understanding file structure, but it is not the same as loading the data into Redshift.

Step 4: Create a Glue ETL Job

Create an AWS Glue job that reads from the S3 source or Glue Catalog table, applies any transformations, and writes the data to Redshift.

Typical transformations include:

  • Renaming columns
  • Casting data types
  • Filtering records
  • Flattening nested JSON
  • Handling null values
  • Deduplicating records

Step 5: Schedule and Monitor the Job

You can run the job manually or schedule it. Monitor job runs, failures, and data quality checks to ensure the Redshift tables are updated correctly.

AWS Glue Pros

  • Serverless ETL
  • Good for transformations
  • Integrates with S3, Redshift, and the Glue Data Catalog
  • Useful for scheduled batch pipelines

AWS Glue Cons

  • More complex than the COPY command
  • Debugging jobs can take time
  • Not always the best fit for low-latency pipelines
  • Requires AWS Glue, IAM, Redshift, and S3 configuration

Method #3: Load Data from S3 to Redshift Using Estuary

Estuary is a data integration platform that can help automate data movement from S3 to Redshift without requiring teams to manually manage COPY commands, Glue jobs, or orchestration scripts.

This method is useful when you want a managed pipeline that can keep Redshift updated as source data changes.

When to Use Estuary

Use Estuary when:

  • You want an easier setup than custom AWS jobs
  • You need automated S3 to Redshift pipelines
  • You want to reduce pipeline maintenance
  • You need near real-time or continuous data movement
  • You prefer a low-code data integration workflow

Here's the following steps to use Estuary for loading data from S3 to Redshift:

Prerequisite: Set up Redshift to connect with Estuary. Learn how here.

Step 1: Setup Estuary Account

  • Log in to your Estuary account. If you don’t have one yet, you can register for free.

Step 2: Source Data Capture Setup

  • Once you’ve logged in, locate the Captures option on the left side of the Estuary dashboard. On the Captures page, click on the New Capture button. This step is to set up the source end of the data pipeline.
s3 to redshift - new capture page
  • Search for S3 in Source Connectors.
s3 to redshift - S3 connector search result
  • Then, click on the Capture button. You’ll be directed to the S3 Capture connector page.
s3 to redshift - S3 connector page
  • Fill in all the required fields, then click on Next. Then click on Save and Publish. This will capture your data in S3 into a Estuary collection.

Step 3: Destination Configuration

  • The next step is to set up the destination end of the pipeline. In this case, it’s Amazon Redshift. Click on Materializations on the left side of the Estuary dashboard. On the Materializations page, click on New Materialization.
s3 to redshift - New materialization button
  • Search for Amazon Redshift in Search Connectors, then click on the Materialization button.
s3 to redshift - Redshift connector search result
  • You’ll be redirected to the Redshift connector page, where you must fill in the required fields like Address, database user name, and Password. You will also have to provide the details of Amazon S3 buckets, along with the access and secret access key for read/write permission. 
s3 to redshift - Redshift connector page
  • Once you’ve filled in the required information, click on Next, then click on Save and Publish.

This will set up the automated data transfer process from S3 to Redshift. For more information on the process, here’s all the documentation to help out:

Estuary Pros

  • Easier setup than building custom Glue jobs
  • Supports automated data movement
  • Reduces operational overhead
  • Good fit for teams that want managed pipelines
  • Useful when data needs to stay fresh in Redshift

Estuary Cons

  • Less customizable than fully custom AWS code
  • Requires connector configuration and platform onboarding

Which S3 to Redshift Method Should You Choose?

Choose the COPY command if you need a fast, low-cost way to bulk load files and your team is comfortable with SQL and AWS permissions.

Choose AWS Glue if you need to transform data before loading it, run scheduled ETL jobs, or manage data discovery through the Glue Data Catalog.

Choose Estuary if you want an easier, managed way to automate S3 to Redshift data pipelines and reduce the time spent maintaining ingestion jobs.

Best Practices for Loading S3 Data into Redshift

Use Columnar File Formats When Possible

For analytics workloads, columnar formats like Parquet are often more efficient than raw CSV because they reduce scan costs and improve query performance.

Compress Large Files

Compression can reduce storage and transfer costs. It can also improve load performance depending on file size and format.

Keep File Sizes Reasonable

Avoid loading thousands of tiny files if possible. Very small files can create overhead. Very large files can be harder to retry when failures happen.

Use IAM Roles Instead of Static Credentials

Use IAM roles for Redshift and Glue access to S3. Avoid embedding long-lived AWS access keys in SQL scripts or application code.

Monitor Load Errors

For COPY jobs, inspect load errors and failed rows. For Glue jobs, monitor job logs, retries, and failed transformations.

Plan for Schema Changes

S3 files can change over time. Decide how your pipeline should handle new columns, missing fields, type changes, and nested data.

Build Idempotent Loads

Design pipelines so rerunning a job does not create duplicate records. Use staging tables, merge logic, or deduplication keys where appropriate.

Conclusion

For fast, cost-effective bulk loads, the Redshift COPY command is hard to beat. For serverless ETL with transformations, AWS Glue fits. For an automated pipeline that creates tables, stages through S3, and keeps Redshift current with less manual work, Estuary is the simpler path.

To set up an automated S3-to-Redshift pipeline, the next step is to create an S3 capture and a Redshift materialization, which takes minutes. Start for free, or talk to our team about your setup.

FAQs

    Does the COPY command deduplicate rows if I re-run it?

    No. COPY appends. Re-running it against the same files can double-load rows. Load into a staging table and MERGE on a key, or use a manifest, to avoid duplicates.
    The connector creates the tables from the inferred schema. You do not pre-create them, and the user you configure needs create-table permission on the target schema.
    Yes. Use the SUPER column type to store semi-structured JSON and query it with Redshift's JSON path operators, or flatten it during a Glue or Estuary transformation.
    Use batch (COPY or Glue) when hourly or daily freshness is fine. Use an automated pipeline like Estuary when Redshift needs to reflect new S3 files continuously with minimal manual work.

Start streaming your data for free

Build a Pipeline

About the author

Picture of Jeffrey Richman
Jeffrey RichmanData Engineering & Growth Specialist

Jeffrey is a data engineering professional with over 15 years of experience, helping early-stage data companies scale by combining technical expertise with growth-focused strategies. His writing shares practical insights on data systems and efficient scaling.

Streaming Pipelines.
Simple to Deploy.
Simply Priced.
$0.50/GB of data moved + $.14/connector/hour;
50% less than competing ETL/ELT solutions;
<100ms latency on streaming sinks/sources.