
Building Compliance-Ready Data Pipelines: GDPR, SOX, and Beyond

Learn how to build compliance-ready data pipelines for GDPR and SOX using PII masking, audit logs, Airflow orchestration, and right-time data governance with Estuary.


When I entered the world of analytics and data engineering around 10 years ago, I wasn’t entirely sure what the word “compliance” meant. At the end of the day, I didn’t even care that much. I just wanted to write and ship good quality code and focus on the fun part of the job.

However, as my career progressed, I became more and more involved in the work of regional controllers and regulatory reporting teams. That’s when it finally clicked: compliance was at the core of their function. Their role was to build a foundation that would allow regulators to trust the business by assuring them that we were capable of protecting both proprietary and customer data.

Since they were my direct “customers”, I wanted to play my part and help them protect the company from fines and lasting damage to reputation. For starters, I had to stop treating compliance as an afterthought and start designing compliance-ready pipelines that scaled well and withstood the test of time.

In this article I’ll share practical examples of how to build analytical pipelines that comply with GDPR and SOX requirements. Though these two regulations are considered pillars of data privacy and financial governance, they are just the starting point. The same principles apply to many other regional codes and standards that dictate how data should be handled. Let’s cover it all.

Key Takeaways

  • GDPR requires protecting personal data through PII masking, minimization, and controlled access, while SOX focuses on auditability, traceability, and data integrity.
  • Compliance-ready pipelines must include privacy controls, immutable audit logs, and reliable lineage across all data flows.
  • Airflow can orchestrate compliant workflows, but traditional pipelines often fail under schema drift and fragmented logic.
  • Estuary is the right-time data platform that unifies CDC, batch, and streaming with built-in, schema-driven redaction and automated governance to simplify GDPR and SOX compliance.

Why Compliance Matters in Modern Pipelines

Ever since I took on the role of data engineering lead at a FinTech company, compliance has been shaping pretty much everything our team develops and deploys. The daily interactions with compliance and regulatory reporting teams have made it clear that there are layers of local guidelines and sector-specific rules (beyond GDPR and SOX) that govern exactly how, when, and where data can be used or moved.

I have also learned that compliance shouldn’t be a burden that only a few carry. It needs to be embedded proactively into the data platform as part of a shift-left strategy.

One of the biggest challenges my team faced was this: how do we build pipelines that handle all these compliance requirements smoothly while still allowing us to iterate quickly? The solution lay in an approach where privacy, audit trails, and access controls are not treated as secondary but as core features of our data flows.

With less time spent on reactive work and smoother technical process reviews with auditors, we are now able to focus on adding value where it matters most.

Understanding GDPR and SOX Requirements: A Data Engineering Perspective

Both GDPR and SOX regulations impose concrete guidelines that affect everything from data ingestion to transformation and serving:

  • GDPR requires you to handle personal data carefully. You can only collect the data you actually need, anonymize everything that’s not needed, and enable users to get their data deleted. Data engineers are responsible for embedding GDPR principles directly into pipelines. We need to mask PII, encrypt sensitive fields, and make sure personal info doesn’t end up somewhere it shouldn’t.
  • SOX has a different purpose. It’s focused on making every financial data change fully traceable with user IDs, timestamps, and operation details. Data engineers need to automate immutable logging, build replay capabilities, and enforce data fidelity.

These rules set the guardrails for how we design our pipeline architectures and go about our day-to-day work. If you ignore them, it doesn’t take long before you run into compliance issues, and the business ends up dealing with the fallout.
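To make the encryption side of the GDPR bullet a bit more concrete, here is a minimal sketch of field-level encryption using the cryptography package. This is an illustrative assumption rather than part of the pipeline shown later in this article (which uses hashing instead), and key management is deliberately simplified:

python
# Minimal sketch: encrypt a sensitive field so it can still be recovered by
# authorized consumers (unlike hashing) while never being stored in plain text.
# Assumes the `cryptography` package; key handling is simplified on purpose.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice, load this from a secrets manager
cipher = Fernet(key)

email = "jane.doe@example.com"
token = cipher.encrypt(email.encode("utf-8"))         # store this instead of the raw value
restored = cipher.decrypt(token).decode("utf-8")      # only on authorized access paths
assert restored == email

Hashing (shown later) is one-way and fits analytical use cases; reversible encryption like this can be the better fit when authorized processes still need to recover the original value.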

Area | GDPR | SOX
Primary Focus | Personal data protection | Financial data integrity and preparation for audits
Data Collection | Only collect what's necessary | Comprehensive; captures all financial transactions
Retention | Delete when no longer needed; honor deletion requests | Retain records for mandated periods (typically 7 years)
Key Pipeline Requirements | PII masking, encryption, anonymization | Immutable audit logs, timestamps, user tracking
Change Handling | Must support data modification and erasure | Must preserve complete change history (no deletions)
Logging Priority | Consent and access tracking | Full traceability of every data operation
Replay/Recovery | Not a core requirement | Essential; must reconstruct historical states
Scope | Any personal data (EU residents) | Financial records (public US companies)

Airflow DAG for Compliance

Below is a simple yet realistic example of a data pipeline that meets some of the compliance requirements of both GDPR and SOX. The pipeline is fully operational and shows how masking and audit logging can be embedded directly into the workflow.

Orchestration via Airflow DAG

The pipeline, named gdpr_sox_compliance_ppl, is orchestrated via Airflow (running as a Docker service) and consists of three tasks (imported directly from the computation/ folder):

  • extract_data_main → generates mock transactional data via the Faker package and stores it in a table named trx_pii_data within a DuckDB database
  • mask_pii_main → imports the mask_pii auxiliary function and applies it to the PII columns in the original trx_pii_data table to derive the trx_clear_data dataset
  • capture_audit_log_main → imports the capture_audit_log auxiliary function and uses it to capture a comprehensive audit log with metadata
python
import importlib
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Import the computation scripts
extract_script = importlib.import_module('computation.extract_data_main')
mask_script = importlib.import_module('computation.mask_pii_main')
audit_script = importlib.import_module('computation.capture_audit_log_main')

default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'start_date': datetime(2025, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': False,
    'catchup': False,
}

with DAG(
    'gdpr_sox_compliance_ppl',
    default_args=default_args,
    description='Pipeline for data extraction, GDPR PII masking, and SOX audit logging',
    schedule_interval='@daily',
    catchup=False,
    max_active_runs=1,
    tags=['compliance', 'gdpr', 'sox'],
) as dag:

    # Task 1: Extract data and store in DuckDB
    extract_data_main = PythonOperator(
        dag=dag,
        task_id='extract_data_main',
        provide_context=True,
        python_callable=extract_script.extract_data,
    )

    # Task 2: Mask PII data for GDPR compliance
    mask_pii_main = PythonOperator(
        dag=dag,
        task_id='mask_pii_main',
        provide_context=True,
        python_callable=mask_script.mask_pii_data,
    )

    # Task 3: Capture audit log for SOX compliance
    capture_audit_log_main = PythonOperator(
        dag=dag,
        task_id='capture_audit_log_main',
        provide_context=True,
        python_callable=audit_script.capture_audit_log_data,
    )

    # Define task dependencies
    extract_data_main >> mask_pii_main >> capture_audit_log_main
Example of GDPR & SOX Compliance PPL Being Executed In Airflow
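The extract step itself isn’t reproduced in this article. As a rough sketch of what extract_data_main might do (the Faker fields beyond full_name and email, the database filename, and the row count are assumptions, not the actual repository code), it could look like this:

python
# Hypothetical sketch of the extract task: generate mock transactions with Faker
# and store them in a DuckDB table named trx_pii_data.
import duckdb
import pandas as pd
from faker import Faker

def extract_data(**context):
    fake = Faker()
    rows = [
        {
            "id": i,
            "full_name": fake.name(),
            "email": fake.email(),
            "amount": round(fake.pyfloat(min_value=1, max_value=5000), 2),   # assumed column
            "created_at": fake.date_time_this_year().isoformat(),            # assumed column
        }
        for i in range(1_000)
    ]
    df = pd.DataFrame(rows)
    con = duckdb.connect("compliance.duckdb")   # assumed database file
    # DuckDB can query the in-scope DataFrame `df` directly via replacement scans
    con.execute("CREATE OR REPLACE TABLE trx_pii_data AS SELECT * FROM df")
    con.close()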

Masking PII for GDPR Compliance

I leverage the mask_pii_main task, which obfuscates PII data using the SHA-256 hashing algorithm to meet GDPR requirements. The trx_pii_data table is traversed and both the full_name and email fields are hashed to prevent exposure downstream, while retaining useful non-identifiable data for analysis:

python
import hashlib

def mask_pii(df):
    """Mask PII data using SHA-256 hashing for GDPR compliance."""
    def hash_val(val):
        return hashlib.sha256(val.encode('utf-8')).hexdigest()

    df_masked = df.copy()
    df_masked['full_name'] = df_masked['full_name'].apply(hash_val)
    df_masked['email'] = df_masked['email'].apply(hash_val)
    return df_masked

This is what the mock data generated by the extract_data_main task and stored in the trx_pii_data table looks like:

mock data generated by the extract_data_main task and stored in trx_pii_data

Now, notice the effect of masking on the full_name and email fields:

effect of masking on the full_name and email fields
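For completeness, here is a hedged sketch of how the mask_pii_main task callable could wire mask_pii between the two tables; the database filename and the module path of the auxiliary function are assumptions:

python
# Sketch of the task callable: read the raw table, apply mask_pii,
# and write the masked result to trx_clear_data.
import duckdb

from computation.helpers import mask_pii   # hypothetical module path for the auxiliary function

def mask_pii_data(**context):
    con = duckdb.connect("compliance.duckdb")              # assumed database file
    df = con.execute("SELECT * FROM trx_pii_data").df()    # load the raw PII table
    df_masked = mask_pii(df)                               # hash full_name and email
    con.execute("CREATE OR REPLACE TABLE trx_clear_data AS SELECT * FROM df_masked")
    con.close()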

Capturing Audit Logs (SOX)

To meet SOX requirements, I then execute capture_audit_log_main, which generates a time-stamped audit log recording what data was processed, when, and by which operation. This lays the foundation for reliable audit trails:

python
import datetime
import hashlib

import pandas as pd

def capture_audit_log(df, operation_type='update', user_id=None, include_checksums=True):
    """
    Capture a comprehensive audit log with metadata for SOX compliance.
    operation_type can take one of the following values: 'create', 'update', 'delete', 'read'.
    """
    now = datetime.datetime.now(datetime.UTC)
    batch_id = hashlib.sha256(f"{now.isoformat()}{len(df)}".encode()).hexdigest()[:16]

    audit_data = {
        'audit_id': [f"AUD-{batch_id}-{i:06d}" for i in range(len(df))],
        'record_id': df['id'].values,
        'operation_time': now.isoformat(),
        'operation_type': operation_type,
        'user_id': user_id or 'system',
        'batch_id': batch_id,
        'record_count': len(df),
        'timestamp_utc': now.timestamp(),
    }

    # Optional: add per-row checksums for additional data integrity verification
    if include_checksums:
        checksums = df.apply(
            lambda row: hashlib.sha256(
                ''.join(str(v) for v in row.values).encode()
            ).hexdigest()[:16],
            axis=1,
        )
        audit_data['row_checksum'] = checksums.values

    # Add column-level metadata
    audit_data['columns_accessed'] = ','.join(df.columns)

    audit_df = pd.DataFrame(audit_data)
    return audit_df

Finally, this is what the audit log table derived from the function above looks like:

audit log table derived from the function

This high-level pipeline captures a key part of a data engineer’s daily work: safeguarding personal data at scale while ensuring strong auditability.

If you want to learn more about the complete code, check out my GitHub repository.

Challenges of Managing Compliance with Traditional Pipelines

While Airflow and other open-source orchestrators offer a lot of flexibility and control, they also introduce several challenges that make compliance difficult to maintain at scale.

For starters, pipelines can break the moment schema drift occurs and cause failures even in basic masking or logging tasks. Such fragility puts compliance at risk and sometimes triggers SEV1-2 incidents.
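As a hedged illustration of the kind of guardrail that usually has to be bolted on by hand, a masking task can validate the incoming schema before touching any data and fail fast instead of silently skipping PII columns; the expected column set below is an assumption based on the example above:

python
# Sketch of a pre-masking schema check that surfaces drift as an explicit failure.
EXPECTED_COLUMNS = {"id", "full_name", "email", "amount", "created_at"}   # assumed contract

def validate_schema(df):
    actual = set(df.columns)
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS
    if missing or unexpected:
        raise ValueError(
            f"Schema drift detected: missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )
    return df

Even with a check like this, every pipeline needs its own copy of the contract, which is exactly the kind of duplication that becomes hard to govern at scale.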

On top of that, having logging and masking logic scattered across different processes adds a ton of maintenance overhead and makes it difficult to enforce policies consistently. And because there’s no real-time validation or strong governance baked in, compliance gaps often show up only after something has already gone wrong. The mix of specialized tools for CDC, batch, and streaming just adds another layer of complexity. All this creates a tangled ecosystem that’s tough to manage day to day.

These technical hurdles turn into real operational pains, such as more incidents to troubleshoot, longer system downtimes, and a higher chance of running into compliance issues.

Building Compliance-Ready Pipelines with Estuary

While developing a longer-term strategy to address these compliance challenges, I came across Estuary. Their SaaS platform redefines how compliance is embedded in data pipelines by unifying ingestion, governance, and delivery into a single consistent framework.

Instead of fragile, fragmented tasks, Estuary treats the pipeline as a unified whole. It combines CDC, batch, and streaming into a single, scalable platform that lets teams choose the “right time” data sync for their workloads, whether that’s sub-second, near-real-time, or scheduled batch. Not only does this consolidation make governance simpler, but it also gets rid of fragile, point-to-point jobs, giving teams tighter control over latency, cost, and compliance.

Estuary’s engineering team has recently shipped a new feature called schema-driven redaction. It lets data teams automatically protect sensitive data by applying redaction rules directly within schemas. You can specify whether you want fields to be completely removed or replaced with secure, salted SHA-256 hashes (similar to the Airflow hands-on example but out-of-the-box!). This ensures that personal information is never exposed in storage or error logs. The system also manages the hashing salt for consistency across data processing tasks, which provides a reliable, automated privacy safeguard that goes beyond manual scripting. (See Estuary PR #2383 for technical details)
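As a rough conceptual illustration only (this is not Estuary’s schema syntax or implementation; see the PR above for the real details), schema-driven redaction boils down to applying a per-field policy, such as dropping a field or replacing it with a salted SHA-256 hash, as records are processed:

python
# Conceptual sketch of schema-driven redaction: a per-field policy that either
# drops a field or replaces it with a salted SHA-256 hash. Purely illustrative.
import hashlib

REDACTION_POLICY = {      # hypothetical policy, as it might be declared alongside a schema
    "full_name": "hash",
    "email": "hash",
    "ssn": "drop",
}
SALT = b"platform-managed-salt"   # Estuary manages the salt for you; hardcoded here only for illustration

def redact(record: dict) -> dict:
    out = {}
    for field, value in record.items():
        action = REDACTION_POLICY.get(field)
        if action == "drop":
            continue   # the field never reaches storage or logs
        if action == "hash":
            out[field] = hashlib.sha256(SALT + str(value).encode("utf-8")).hexdigest()
        else:
            out[field] = value
    return out

The point of enforcing this at the platform level is that the policy lives next to the schema, so every pipeline touching the data applies the same redaction consistently instead of relying on per-task scripts.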

Estuary also offers automated schema evolution to reduce breakages, exactly-once delivery for audit integrity, private and BYOC deployments for strict data control, and built-in immutable audit logs with lineage for easy compliance. All this speeds up pipeline builds and reduces failures, so teams get to focus on business value instead.

Strengthening Compliance with Right-Time Data Pipelines

Building compliance right into data pipelines is how you maintain trust at scale. As your data and requirements grow, you need pipelines that are governed end-to-end.

Estuary is the right-time data platform that helps you manage all of it. It brings all your pipeline modes together in one place, protects data, and supports deployment options that fit enterprise security policies.

If you’re leading a data engineering team that has to juggle GDPR, SOX, and similar regulatory standards, you don’t need to look into other options.

Ready to simplify GDPR and SOX compliance across your data pipelines? Talk to an Estuary expert and see how right-time data movement makes governance effortless.

FAQs

    What makes a data pipeline compliance-ready?

    A compliance-ready data pipeline treats privacy, auditability, and governance as core design principles, not afterthoughts. Personal data is masked or minimized as it flows through the system, access is controlled by schema and policy, and every data change is logged immutably with clear lineage. This ensures compliance holds even during failures, reprocessing, or schema changes.

    Why do traditional Airflow pipelines struggle with GDPR and SOX compliance?

    Traditional Airflow pipelines rely on scattered scripts for masking, logging, and validation, which makes them fragile under schema drift and hard to govern consistently. Compliance gaps often appear only after incidents or audits. Without built-in schema enforcement and lineage, maintaining GDPR and SOX compliance becomes increasingly complex as pipelines grow.


About the author

Antonello Benedetto, Lead Data Engineer

Experienced Data Engineering Lead with nearly a decade of expertise in designing and delivering robust analytical data solutions for financial markets. Backed by a strong academic foundation in computer science and quantitative finance, I currently lead data engineering initiatives at Wise.
