AWS ETL Tools Explained: Glue, Lambda, Kinesis, and a Right-Time Alternative

AWS ETL pipelines are built by combining services like Glue, Lambda, and Kinesis. This guide explains how each tool works, their trade-offs, and when to use a right-time data platform for CDC and continuous ingestion.

Building reliable ETL pipelines on AWS requires more than choosing a single service. AWS provides multiple tools such as Glue, Lambda, and Kinesis, each optimized for different parts of the data lifecycle, from batch processing to event-driven transformations and real-time streaming. Understanding where each tool fits and where its limitations begin is essential for designing scalable and maintainable data pipelines.

This article explains how the most commonly used AWS ETL tools differ in terms of latency, scalability, cost predictability, and operational complexity. It also introduces an alternative approach for teams that need continuous data movement, change data capture (CDC), or cross-cloud ingestion alongside AWS analytics services.

Throughout this guide, “real-time” refers to continuous or streaming data processing, not simply running batch jobs more frequently.

What Are AWS ETL Tools?

AWS ETL tools are services used to extract data from source systems, transform it into analysis-ready formats, and load it into AWS storage or analytics destinations such as Amazon S3, Amazon Redshift, or DynamoDB.

Unlike traditional ETL platforms, AWS does not provide a single, unified ETL tool. Instead, ETL pipelines on AWS are typically built by combining multiple services, each responsible for a specific part of the data workflow. These services may handle batch processing, event-driven transformations, real-time streaming, orchestration, or delivery, depending on the use case.

In practice, AWS ETL architectures often involve:

  • Extraction

Ingesting data from operational databases, SaaS applications, event streams, or files using services like AWS Glue crawlers, Amazon Kinesis, AWS Lambda triggers, or custom integrations.

  • Transformation

Cleaning, enriching, and restructuring data using Spark-based jobs (AWS Glue), event-driven functions (AWS Lambda), or streaming processors (Kinesis consumers, Flink).

  • Loading and Delivery

Writing transformed data into AWS destinations such as Amazon S3, Amazon Redshift, OpenSearch, or downstream analytics systems using Glue jobs, Amazon Data Firehose, or custom consumers.

AWS ETL tools are designed to support large-scale, cloud-native workloads, offering elastic scaling and deep integration with the AWS ecosystem. However, because ETL functionality is distributed across multiple services, teams must often manage orchestration, schema handling, retries, and cost controls themselves.

As a result, AWS ETL tools work best when carefully selected and combined based on workload requirements such as batch vs streaming, latency sensitivity, data volume, and operational complexity.

AWS ETL Tools: Comparison Table

To help you evaluate your options quickly, here’s a side-by-side comparison of major AWS ETL tools. We’ve broken down their real-time capabilities, best use cases, latency profiles, cost predictability, scalability, and pre-built connector availability.

| Tool | Real-Time Processing | Best Use Case | Latency Profile | Cost Predictability | Scalability | Pre-Built Connectors |
|---|---|---|---|---|---|---|
| AWS Glue | Limited (Streaming ETL via Spark) | Batch ETL and large-scale data preparation | Medium to High (Spark startup + micro-batch) | Low (DPU-based, variable) | Auto-scales | AWS-focused |
| AWS Lambda | Yes (Event-driven) | Lightweight event-based transformations | Very Low (milliseconds) | Moderate | Auto-scales | AWS-only |
| AWS Kinesis | Yes (Native streaming) | High-throughput real-time streaming analytics | Very Low (streaming) | Low (complex shard-based pricing) | Scales (auto in on-demand; managed in provisioned) | Limited |
| AWS Data Pipeline | No | Legacy batch orchestration | High (batch-oriented) | Moderate | Limited | Deprecated |

Note: 

  • “Real-time” refers to continuous or streaming data processing, not simply scheduling batch jobs more frequently.
  • AWS Glue supports streaming ETL jobs built on Spark Structured Streaming, but these typically introduce higher latency and operational complexity compared to purpose-built streaming systems.

Top AWS ETL Tools

Here are four of the most widely used ETL tools in AWS, with a closer look at how each one works, how it is priced, and where its limitations begin.

1. AWS Glue

AWS Glue is a fully managed, serverless ETL service designed for large-scale batch data processing within the AWS ecosystem. It is commonly used to extract data from AWS data sources, apply transformations using Apache Spark, and load the results into data lakes or data warehouses such as Amazon S3, Amazon Redshift, and Amazon Athena.

Glue is best suited for organizations that need batch-oriented ETL pipelines tightly integrated with AWS services and are comfortable operating Spark-based workloads.

Key Features

  • Serverless Spark-Based ETL: AWS Glue runs ETL jobs on a serverless Apache Spark environment. AWS automatically provisions, scales, and manages the underlying infrastructure, removing the need to manage clusters or servers.
  • AWS Glue Data Catalog: Glue includes a centralized metadata repository that stores table definitions, schemas, and partitions. The Data Catalog is commonly shared across AWS analytics services such as Athena, Redshift Spectrum, and EMR, making it a core metadata layer in many AWS architectures.
  • Batch and Streaming ETL Support: Glue primarily targets batch ETL workloads but also supports streaming ETL jobs built on Spark Structured Streaming. These jobs can process data from sources like Kinesis or Kafka, though they typically operate with micro-batch latency and higher operational overhead than purpose-built streaming systems.
  • Flexible Transformation Options: Transformations can be implemented using Python (PySpark), Scala, or SQL. Glue provides built-in transforms for common operations such as filtering, joins, deduplication, and aggregation, while also allowing fully custom Spark logic (a minimal job sketch follows this list).
  • AWS Glue DataBrew: Glue DataBrew is a visual, no-code data preparation tool that allows analysts and less technical users to profile, clean, and transform data interactively without writing Spark code.
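
To make the Spark-based model concrete, here is a minimal sketch of what a Glue ETL job written in PySpark can look like. The database, table, and S3 path names are placeholders; the structure follows the common Glue pattern of reading a DynamicFrame from the Data Catalog, applying transforms, and writing the result to S3.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup: resolve arguments and initialize contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog (placeholder names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Keep only well-formed rows and the columns needed downstream.
cleaned = Filter.apply(frame=orders, f=lambda row: row["amount"] is not None)
cleaned = cleaned.select_fields(["order_id", "amount", "order_date"])

# Write the curated dataset to S3 as partitioned Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={
        "path": "s3://example-bucket/curated/orders/",
        "partitionKeys": ["order_date"],
    },
    format="parquet",
)

job.commit()
```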

AWS Glue Pricing Model

AWS Glue pricing is based primarily on Data Processing Units (DPUs). A DPU bundles a fixed amount of CPU, memory, and disk resources, and AWS charges per DPU-hour consumed by each job.

Key pricing characteristics to be aware of:

  • Costs scale with job duration and concurrency: Longer-running jobs, higher worker counts, and parallel job execution increase total DPU-hours consumed.
  • Startup and idle time count toward cost: Spark job initialization and any idle execution time are billed, which can significantly impact cost for short or infrequently run jobs.
  • Worker type selection affects pricing: Glue offers different worker types (for example, standard, G.1X, G.2X), each with different memory and CPU profiles that influence both performance and cost.

Because pricing depends on job configuration, runtime behavior, and workload variability, total costs can be difficult to predict, especially for large or highly parallel ETL workloads.
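
As a rough illustration of how DPU-based billing adds up, the back-of-the-envelope calculation below multiplies DPUs by runtime and an assumed per-DPU-hour rate. The rate shown is illustrative and varies by region and Glue version, so check current AWS pricing; minimum billing durations also apply to short jobs.

```python
# Back-of-the-envelope Glue cost estimate (illustrative rate; check current AWS pricing).
DPU_HOURLY_RATE = 0.44  # assumed USD per DPU-hour; varies by region and Glue version


def estimate_glue_job_cost(num_dpus: int, runtime_minutes: float, runs_per_day: int) -> float:
    """Estimate daily cost for a Glue job billed per DPU-hour."""
    dpu_hours_per_run = num_dpus * (runtime_minutes / 60)
    return dpu_hours_per_run * DPU_HOURLY_RATE * runs_per_day


# Example: 10 DPUs, a 20-minute job (including Spark startup), run hourly.
daily_cost = estimate_glue_job_cost(num_dpus=10, runtime_minutes=20, runs_per_day=24)
print(f"Estimated daily cost: ${daily_cost:.2f}")  # ~$35.20/day at the assumed rate
```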

Typical Use Cases

  • Large-Scale Batch ETL: Processing large datasets stored in Amazon S3 and preparing them for analytics or reporting.
  • Data Lake Integration: Transforming raw data into curated datasets for use with Athena, Redshift Spectrum, or downstream analytics tools.
  • Metadata-Driven Pipelines: Environments where the Glue Data Catalog acts as a shared metadata layer across multiple AWS analytics services.

AWS Glue Limitations

  • Cold Start and Latency: Glue jobs incur startup time due to Spark initialization, making them unsuitable for low-latency or real-time use cases.
  • Cost Predictability Challenges: DPU-based pricing makes it harder to estimate costs in advance compared to throughput-based or flat-rate models.
  • AWS-Centric Design: Glue integrates best with AWS-native services and may require additional effort or custom development to work efficiently with non-AWS data sources.

When to Use AWS Glue

AWS Glue is a strong choice for batch-oriented ETL pipelines operating entirely within AWS, particularly when Spark-based transformations and integration with the AWS analytics ecosystem are required. For use cases that demand low-latency streaming, cross-cloud ingestion, or more predictable cost models, teams often complement or replace Glue with other streaming or data movement platforms.

2. AWS Lambda

AWS Lambda is a fully managed, serverless event-driven compute service, not a traditional ETL platform. While it is often used within data pipelines, Lambda is best suited for lightweight, short-lived transformations that run in response to events rather than for full ETL workflows involving large datasets or complex dependencies.

Lambda functions are triggered by events from AWS services such as Amazon S3, DynamoDB, API Gateway, or Amazon Kinesis. This makes Lambda a useful component for real-time or near real-time data processing scenarios where small transformations or routing logic must be applied immediately as data arrives.
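
As an example of this event-driven pattern, the sketch below shows a minimal Lambda handler for a Kinesis event source. The event shape follows the standard Kinesis-to-Lambda integration (base64-encoded record data); the field names, transformation logic, and destination are placeholders.

```python
import base64
import json


def handler(event, context):
    """Minimal Lambda consumer for a Kinesis event source mapping."""
    transformed = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Lightweight, per-record transformation (placeholder logic).
        if payload.get("event_type") == "purchase":
            transformed.append({
                "order_id": payload["order_id"],
                "amount_usd": round(payload["amount_cents"] / 100, 2),
            })

    # In a real pipeline this would be written to S3, DynamoDB, Firehose, etc.
    print(f"Processed {len(event['Records'])} records, kept {len(transformed)}")
    return {"processed": len(event["Records"]), "kept": len(transformed)}
```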

Key Features

  • Event-Driven Execution: Lambda runs code automatically in response to events, enabling immediate processing of data changes such as file uploads, database updates, or incoming stream records.
  • Serverless Architecture: AWS manages infrastructure provisioning, scaling, and availability automatically. Lambda scales horizontally based on incoming event volume without requiring manual capacity planning.
  • Tight AWS Integration: Lambda integrates natively with services like Amazon S3, DynamoDB, Kinesis, SNS, and SQS, allowing it to act as glue logic within broader AWS data workflows.
  • Multi-Language Support: Lambda supports multiple runtimes including Python, Java, Node.js, and others, with support for custom runtimes when needed.

Typical Use Cases

  • Event-Driven Transformations: Applying small transformations, validations, or enrichments to data as it arrives from S3 events, DynamoDB Streams, or Kinesis records.
  • Stream Processing Helpers: Acting as a lightweight consumer for Kinesis or DynamoDB Streams to filter, aggregate, or route records to downstream systems.
  • Pipeline Triggers and Glue Logic: Starting ETL jobs in AWS Glue, notifying downstream systems, or coordinating steps across AWS services (a small trigger sketch follows this list).
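
For the trigger pattern above, a Lambda function can start a Glue job with a single boto3 call. This is a minimal sketch under assumed names: the job name and arguments are placeholders, and production code would add error handling and idempotency checks.

```python
import boto3

glue = boto3.client("glue")


def handler(event, context):
    """Kick off a Glue ETL job in response to an event (e.g., an S3 upload)."""
    response = glue.start_job_run(
        JobName="curate-orders",  # placeholder Glue job name
        Arguments={"--input_path": "s3://example-bucket/raw/orders/"},
    )
    print(f"Started Glue job run: {response['JobRunId']}")
    return response["JobRunId"]
```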

Limitations

  • Execution Time Limits: Lambda functions have a maximum execution time of 15 minutes, which makes them unsuitable for long-running ETL jobs or large batch transformations.
  • Resource Constraints: Memory and CPU limits restrict Lambda’s ability to process large datasets efficiently compared to Spark-based or distributed ETL systems.
  • Not a Full ETL Platform: Lambda does not provide built-in scheduling, state management, schema handling, or bulk data processing capabilities typically expected from ETL tools.

When to Use AWS Lambda

AWS Lambda is best used alongside ETL and streaming platforms rather than as a replacement for them. It excels at event-driven processing and orchestration tasks, but for large-scale batch ETL or continuous streaming pipelines, services like AWS Glue, AWS Kinesis, or external data movement platforms are more appropriate.

3. AWS Kinesis

Amazon Kinesis is a fully managed service that enables real-time processing of streaming data at any scale. Kinesis is commonly used for applications requiring continuous data ingestion and real-time analysis, such as log and event data monitoring, IoT data processing, and media streaming.

Components of AWS Kinesis

  • Kinesis Data Streams: Allows you to capture, store, and process streaming data in real time. It can ingest massive volumes of data from sources like web applications, financial transactions, or IoT devices. You can process this data in real time using consumer applications such as AWS Lambda, Apache Spark, or other streaming analytics platforms (a producer sketch follows this list).
  • Kinesis Video Streams: Designed for securely streaming and processing live video data from devices like security cameras, mobile devices, and IoT sensors. It is useful for building applications that require real-time video analysis, such as video analytics for surveillance, video conferencing, or smart home applications.
  • Amazon Data Firehose (formerly Kinesis Data Firehose): A fully managed service for delivering streaming data to destinations such as Amazon S3, Amazon Redshift, and Amazon OpenSearch Service. It can batch, transform, compress, and encrypt data before delivery, and it scales to match incoming throughput.
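
To illustrate the producer side of Kinesis Data Streams, the sketch below writes a record with boto3. The stream name and payload fields are placeholders; high-volume producers would typically batch writes with put_records and choose partition keys that distribute load evenly across shards.

```python
import json

import boto3

kinesis = boto3.client("kinesis")


def publish_event(event: dict) -> None:
    """Write a single event to a Kinesis data stream (placeholder stream name)."""
    kinesis.put_record(
        StreamName="clickstream-events",
        Data=json.dumps(event).encode("utf-8"),
        # The partition key determines shard assignment; use a high-cardinality field.
        PartitionKey=event["user_id"],
    )


publish_event({"user_id": "u-123", "page": "/pricing", "ts": "2024-01-01T12:00:00Z"})
```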

Key Features of AWS Kinesis

  • Scalability: Kinesis can scale to handle large volumes of streaming data. In on-demand mode, Kinesis manages capacity and shard scaling automatically. In provisioned mode, teams must plan and adjust shard capacity to avoid bottlenecks or over-provisioning.
  • Real-Time Analytics: Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) supports real-time analysis of streaming data and can be used alongside services like Lambda and Redshift for downstream workflows.
  • Enhanced Fan-Out: Kinesis Data Streams' Enhanced Fan-Out enables multiple consumer applications to process the same stream concurrently, each with its own dedicated read throughput, improving efficiency and reducing latency.

Use Cases

  • High-Throughput, Real-Time Data Streaming: Kinesis is ideal for applications requiring continuous, real-time ingestion and processing of large volumes of streaming data, such as log aggregation, clickstream data from websites, or IoT sensor data, where low-latency processing is crucial.
  • Multiple Concurrent Consumers: Kinesis is better suited for scenarios where multiple applications need to consume and process the same stream simultaneously (e.g., analytics, monitoring, and alerting systems) due to its Enhanced Fan-Out feature, offering dedicated throughput for each consumer.

Drawbacks

  • Pricing Complexity: Kinesis pricing can be difficult to predict due to charges based on shard hours, PUT payload units, data retention, and data transfer, which can lead to unexpected costs, especially for high-throughput applications.
  • Shard Management Overhead: While on-demand mode scales capacity automatically, managing shards (the basic unit of capacity) in provisioned mode can be complex, requiring careful tuning to avoid bottlenecks or over-provisioning for consistent performance (see the sizing sketch after this list).
  • Limited Data Retention: By default, Kinesis Data Streams retain data for 24 hours, but retention can be extended up to 365 days at additional cost.
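
The sizing sketch below illustrates the kind of capacity math provisioned mode requires, using the published per-shard write limits (1 MB/s or 1,000 records/s). Treat it as a starting point only: real workloads also need headroom for traffic spikes and uneven partition keys.

```python
import math

# Published per-shard write limits for provisioned Kinesis Data Streams.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1_000


def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Estimate a provisioned shard count from expected write throughput."""
    by_bytes = (records_per_sec * avg_record_kb / 1024) / SHARD_WRITE_MB_PER_SEC
    by_count = records_per_sec / SHARD_WRITE_RECORDS_PER_SEC
    return max(1, math.ceil(max(by_bytes, by_count)))


# Example: 5,000 records/s at ~2 KB each -> ~10 shards by volume, 5 by record count.
print(shards_needed(records_per_sec=5_000, avg_record_kb=2))  # 10
```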

4. AWS Data Pipeline

Note: As of July 25, 2024, AWS closed new customer access to AWS Data Pipeline. Existing customers can continue to use it, but AWS does not plan to add new features.

AWS Data Pipeline is a web service that automates the movement and transformation of data across various AWS services and on-premises data sources. It enables you to create, schedule, and manage complex data workflows, ensuring that data is efficiently processed, integrated, and moved between different locations.

Key Features of AWS Data Pipeline

  • Visual Interface: AWS Data Pipeline provides a visual drag-and-drop interface that simplifies the creation of data workflows. Using the Pipeline Designer, you can easily configure data sources, transformations, and destinations on a visual canvas. This interface allows users to define the flow of data between different services such as Amazon S3, Amazon RDS, DynamoDB, and on-premises systems.
  • Customizable Alerts: To enhance workflow monitoring, AWS Data Pipeline offers customizable notifications and alerts. These notifications can be triggered based on specific events, such as successful completion, failures, or delays in pipeline execution. Alerts can be integrated with Amazon SNS (Simple Notification Service) to send real-time notifications via email, SMS, or other channels, enabling you to react immediately when events occur in your data workflow.
  • Security: AWS Data Pipeline integrates with AWS Identity and Access Management (IAM) to define granular access controls over your data workflows. It supports data encryption both at rest and in transit, ensuring that sensitive data is protected throughout the pipeline. You can also assign IAM roles to pipelines, allowing them to access necessary AWS services securely without exposing long-term credentials.

Use Cases

AWS Data Pipeline is an effective tool for orchestrating data movement between different AWS services and external systems. It supports a wide range of data integration tasks, such as:

  • On-Premises Data Integration: AWS Data Pipeline supports seamless integration with on-premises data sources and services, making it ideal for hybrid environments where data resides outside of AWS.
  • Complex Scheduling and Orchestration: If you need more advanced scheduling options or dependency management between tasks, Data Pipeline offers more flexibility for custom workflows and for coordinating different data sources.
  • Custom or Non-Spark Workflows: If you need to run non-Spark jobs or custom scripts across different environments, Data Pipeline is better suited, as Glue focuses on serverless Spark ETL jobs.

Drawbacks

  • Manual Resource Management: AWS Data Pipeline often requires users to provision and manage underlying resources, like EC2 instances or EMR clusters, which adds complexity and can lead to higher operational overhead compared to fully serverless solutions.
  • Limited Scalability and Flexibility: Compared to newer tools like AWS Glue, Data Pipeline doesn't scale as easily for large-scale data processing or provide as much flexibility for handling complex data transformations.
  • Outdated Interface and Less Automation: The user interface is less intuitive, and it lacks some automation features found in modern data orchestration tools, which can make creating and managing data workflows more cumbersome.

Operational Challenges of AWS ETL Pipelines

While AWS ETL tools offer powerful capabilities, they also come with some limitations that businesses should consider when choosing an ETL platform.

  1. Limited Source Connectors:  AWS provides a wide array of pre-built connectors to integrate with various services, but these may not cover every potential use case. For more specialized integrations, organizations may need to invest in additional development work, which can complicate workflows and increase costs. This is particularly relevant for integrating data from non-AWS platforms or legacy systems.
  2. Vendor Lock-In: Relying heavily on AWS services for ETL pipelines can create a strong dependency on their ecosystem, making it difficult to migrate or integrate with non-AWS platforms down the road.  Over time, switching to alternate providers can become very costly and complex, reducing your freedom to choose the best tools for your evolving business needs.
  3. Steep Learning Curve:  Services like AWS Glue, AWS Kinesis, and AWS Lambda offer advanced features, but their complexity can present a steep learning curve, particularly for new users or teams unfamiliar with the AWS ecosystem. Mastery of these tools (and cost estimation for them!) often requires significant training and a deep understanding of AWS's broader architecture, making it more challenging for businesses without prior AWS experience to effectively implement these tools.
  4. Cost Estimation Challenges: AWS ETL tools have complex pricing models that make it difficult for organizations to accurately predict expenses. The multitude of variables—such as data processing units, storage tiers, data transfer costs, and varying rates for different services and regions—can lead to unforeseen charges. This complexity hampers effective budgeting and financial planning, as estimating the total cost of ownership becomes a challenging task, especially when scaling operations or dealing with fluctuating workloads.

In practice, AWS-native ETL services work well for many AWS-centric architectures, but teams may encounter added complexity when pipelines require CDC, continuous ingestion, or integrations across multiple clouds and SaaS systems.

While AWS offers powerful tools for building ETL pipelines, many organizations find these solutions difficult to scale, integrate, and maintain, especially when working with real-time or cross-cloud data.

That’s where Estuary comes in. Estuary is the Right-Time Data Platform that helps teams move data into AWS when they choose — sub-second, near real time, or batch — while reducing operational overhead for ingestion, CDC, and delivery.

Estuary: A Right-Time Alternative to Native AWS ETL Tools

Native AWS ETL services such as Glue, Lambda, and Kinesis are powerful, but building end-to-end pipelines with them often requires stitching together multiple services, managing operational complexity, and navigating unpredictable costs. This becomes especially challenging when pipelines need to support change data capture (CDC), cross-cloud sources, or low-latency data delivery.

Estuary is the Right-Time Data Platform, designed to move data when teams choose, whether sub-second, near real time, or batch. Rather than replacing AWS analytics services, Estuary complements them by handling ingestion, CDC, and delivery, while AWS remains the system of record for storage, analytics, and machine learning.

Estuary is commonly used to ingest data into AWS destinations such as Amazon S3, Amazon Redshift, DynamoDB, and Kinesis, without requiring teams to build or operate custom streaming infrastructure.

How Estuary Works with AWS

Estuary captures data from operational databases, SaaS platforms, and event streams, then delivers it into AWS destinations using managed, fault-tolerant pipelines. Where supported, Estuary uses log-based change data capture and streaming ingestion to reduce reliance on scheduled polling and minimize latency. Delivery frequency can be configured based on workload requirements and cost considerations.

Data pipelines are defined declaratively and managed by Estuary, reducing the need to coordinate multiple AWS services or maintain custom orchestration logic.

Key Capabilities

  • Right-Time Data Delivery: Estuary allows teams to control when data moves, from continuous streaming to near-real-time micro-batches or scheduled batch delivery. This flexibility helps balance freshness, cost, and downstream system constraints.
  • Change Data Capture (CDC): Estuary supports CDC for databases such as PostgreSQL, MySQL, Oracle, MongoDB, and DynamoDB. Inserts, updates, and deletes are captured incrementally and propagated downstream, keeping AWS analytics systems synchronized with source systems.
  • Schema Enforcement and Evolution: Schemas are enforced at ingestion time, and compatible schema changes are handled automatically. This reduces pipeline breakage when upstream data evolves and avoids manual schema management in downstream systems.
  • Managed Connectors Across AWS and Beyond: Estuary provides a broad catalog of managed connectors for AWS services, databases, and SaaS platforms. This enables ingestion from sources both inside and outside AWS without custom connector development.
  • Operational Simplicity: Pipelines are managed as long-running services with built-in fault tolerance, retries, and recovery. Teams do not need to manage Spark clusters, shard counts, or custom retry logic.
  • Deployment Flexibility: Estuary supports fully managed deployments as well as Private Deployment and Bring Your Own Cloud (BYOC) options, allowing organizations to meet security, compliance, and data residency requirements.

Cost Model and Predictability

Unlike many AWS ETL services that rely on multi-variable pricing models, Estuary pricing is based on data throughput. This can make costs easier to estimate and reason about as pipelines scale, especially for CDC and streaming workloads where execution time can vary.

When to Use Estuary Instead of Native AWS ETL Tools

Estuary is often chosen when teams need:

  • Continuous or near-real-time data synchronization into AWS
  • CDC pipelines without building custom streaming architectures
  • Cross-cloud or hybrid ingestion alongside AWS analytics services
  • Predictable costs for long-running ingestion workloads
  • Reduced operational overhead compared to Spark- or shard-based systems

Native AWS tools remain a strong fit for tightly scoped, AWS-only batch or event workflows. Estuary is complementary in scenarios where data movement spans multiple systems, requires CDC, or must operate continuously with minimal operational burden.

Example: DynamoDB to Amazon Redshift

A common use case is replicating data from Amazon DynamoDB into Amazon Redshift for analytics. With Estuary, changes in DynamoDB are captured incrementally and delivered into Redshift on a continuous or near-real-time basis. Schema mapping and ongoing synchronization are handled automatically, reducing the need for custom Lambda functions, Kinesis consumers, or Glue jobs.

Conclusion: Choose the Right AWS ETL Tool for the Future

AWS-native ETL tools such as Glue, Lambda, and Kinesis each play an important role in modern data architectures. Glue is well-suited for large-scale batch ETL, Lambda enables event-driven transformations, and Kinesis provides low-latency streaming for real-time analytics. Used together, these services can support a wide range of data workflows within the AWS ecosystem.

However, as data pipelines evolve to include continuous ingestion, CDC, and cross-cloud sources, teams often encounter increasing operational complexity and cost variability when relying solely on native AWS services. In these scenarios, platforms designed specifically for ongoing data movement can complement AWS tools by simplifying ingestion, reducing orchestration overhead, and improving cost predictability.

Estuary is one such platform, enabling right-time data delivery into AWS destinations without requiring teams to build or manage custom streaming infrastructure. Combined thoughtfully with AWS analytics and storage services, it supports scalable, maintainable pipelines that adapt as data volume, latency requirements, and architectural complexity grow.

FAQs

    Which AWS ETL tool is best for beginners?

    AWS Glue and Lambda are powerful but require familiarity with AWS infrastructure and pricing. Managed platforms like Estuary are often easier for beginners because they provide a UI, automatic schema handling, and fewer services to configure.

    How much do AWS ETL tools cost?

    AWS ETL pricing depends on usage: Glue is billed per DPU-hour, Kinesis by shard capacity and data volume, and Lambda by execution time. Costs can be hard to predict for continuous workloads compared to throughput-based pricing models.

    How do I move data from on-premises or hybrid environments into AWS?

    AWS-native options typically require combining multiple services such as DMS, networking, and downstream ETL tools. Managed platforms like Estuary simplify this by supporting secure hybrid ingestion and continuous delivery into AWS services.

    Do I have to use only AWS-native tools to build ETL pipelines?

    No. Many production pipelines combine AWS analytics services with external data movement platforms, especially for CDC, SaaS ingestion, or cross-cloud data pipelines.

About the author

Dani Pálma, Head of Data & Marketing

Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
