
When I entered the world of analytics and data engineering around 10 years ago, I wasn’t entirely sure what the word “compliance” meant. At the end of the day, I didn’t even care that much. I just wanted to write and ship good quality code and focus on the fun part of the job.
However, as my career progressed, I became more and more involved in the work of regional controllers and regulatory reporting teams. That’s when it finally clicked: compliance was at the core of their function. Their role was to build a foundation that would allow regulators to trust the business by assuring them that we were capable of protecting both proprietary and customer data.
Since they were my direct “customers”, I wanted to play my part and help them protect the company from fines and lasting damage to reputation. For starters, I had to stop treating compliance as an afterthought and start designing compliance-ready pipelines that scaled well and withstood the test of time.
In this article, I’ll share practical examples of how to build analytical pipelines that comply with GDPR and SOX requirements. Though these two regulations are considered pillars of data privacy and financial governance, they are just the starting point: the same principles apply to many other regional codes and standards that dictate how data should be handled. Let’s cover it all.
Key Takeaways
- GDPR requires protecting personal data through PII masking, minimization, and controlled access, while SOX focuses on auditability, traceability, and data integrity.
- Compliance-ready pipelines must include privacy controls, immutable audit logs, and reliable lineage across all data flows.
- Airflow can orchestrate compliant workflows, but traditional pipelines often fail under schema drift and fragmented logic.
- Estuary is the right-time data platform that unifies CDC, batch, and streaming with built-in schema-driven redaction and automated governance to simplify GDPR and SOX compliance.
Why Compliance Matters in Modern Pipelines
Ever since I took on the role of data engineering lead at a FinTech company, compliance has been shaping pretty much everything our team develops and deploys. The daily interactions with compliance and regulatory reporting teams have made it clear that there are layers of local guidelines and sector-specific rules (beyond GDPR and SOX) that govern exactly how, when, and where data can be used or moved.
I have also learned that compliance shouldn’t be a burden that only a few carry. It needs to be embedded proactively into the data platform as part of a shift-left strategy.
One of the biggest challenges my team faced was: how can we build pipelines that can handle all these compliance requirements smoothly while still allowing us to iterate quickly? The solution lay in an approach where privacy, audit trails, and access controls are not treated as secondary but as core features of our data flows.
With less time spent on reactive work and smoother technical process reviews with auditors, we are now able to focus on adding value where it matters most.
Understanding GDPR and SOX Requirements: A Data Engineering Perspective
Both GDPR and SOX regulations impose concrete guidelines that affect everything from data ingestion to transformation and serving:
- GDPR requires you to handle personal data carefully. You can only collect the data you actually need, anonymize everything that’s not needed, and enable users to get their data deleted. Data engineers are responsible for embedding GDPR principles directly into pipelines. We need to mask PII, encrypt sensitive fields, and make sure personal info doesn’t end up somewhere it shouldn’t.
- SOX has a different purpose. It’s focused on making every financial data change fully traceable with user IDs, timestamps, and operation details. Data engineers need to automate immutable logging, build replay capabilities, and enforce data fidelity.
These rules set the guardrails for how we design our pipeline architectures and go about our day-to-day work. If you ignore them, it doesn’t take long before you run into compliance issues, and the business ends up dealing with the fallout.
| Area | GDPR | SOX |
|---|---|---|
| Primary Focus | Personal data protection | Financial data integrity and preparation for audits |
| Data Collection | Only collect what's necessary | Comprehensive - captures all financial transactions |
| Retention | Delete when no longer needed; honor deletion requests | Retain records for mandated periods (typically 7 years) |
| Key Pipeline Requirements | PII masking, encryption, anonymization | Immutable audit logs, timestamps, user tracking |
| Change Handling | Must support data modification and erasure | Must preserve complete change history (no deletions) |
| Logging Priority | Consent and access tracking | Full traceability of every data operation |
| Replay/Recovery | Not a core requirement | Essential - must reconstruct historical states |
| Scope | Any personal data (EU residents) | Financial records (public US companies) |
Airflow DAG for Compliance
Below is a simple yet realistic example of a data pipeline that meets some of the compliance requirements of both GDPR and SOX. The pipeline is fully operational and shows how masking and audit logging can be embedded directly into the workflow.
Orchestration via Airflow DAG
The pipeline, named gdpr_sox_compliance_ppl, is orchestrated via Airflow (running as a Docker service) and is made up of three tasks (imported directly from the computation/ folder):
- extract_data_main → generates mock transactional data via the Faker package and stores it in a table named trx_pii_data within a DuckDB database (a quick sketch of this step follows the list)
- mask_pii_main → imports the mask_pii auxiliary function and applies it to the PII columns of the original trx_pii_data table to derive the trx_clear_data dataset
- capture_audit_log_main → imports the capture_audit_log auxiliary function and uses it to capture a comprehensive audit log with metadata
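Since the extract step itself isn’t reproduced later in the article, here is a minimal sketch of what computation/extract_data_main.py could look like. This is my own illustration (it assumes the Faker, pandas, and duckdb packages, and the columns beyond id, full_name, and email are placeholders), not the exact code from the repository:

```python
# Hypothetical sketch of computation/extract_data_main.py - illustrative only.
import duckdb
import pandas as pd
from faker import Faker


def extract_data(**context):
    """Generate mock transactional data and store it in DuckDB as trx_pii_data."""
    fake = Faker()
    records = [
        {
            'id': i,
            'full_name': fake.name(),
            'email': fake.email(),
            'amount': round(fake.pyfloat(min_value=1, max_value=5000), 2),
            'trx_date': fake.date_this_year().isoformat(),
        }
        for i in range(1000)
    ]
    df = pd.DataFrame(records)

    con = duckdb.connect('compliance.duckdb')  # assumed database file name
    try:
        con.register('raw_batch', df)
        con.execute("CREATE OR REPLACE TABLE trx_pii_data AS SELECT * FROM raw_batch")
    finally:
        con.close()
```

The DAG definition that wires the three tasks together looks like this: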
```python
import importlib
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Import the computation scripts
extract_script = importlib.import_module('computation.extract_data_main')
mask_script = importlib.import_module('computation.mask_pii_main')
audit_script = importlib.import_module('computation.capture_audit_log_main')

default_args = {
    'owner': 'data-engineering',
    'depends_on_past': False,
    'start_date': datetime(2025, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': False,
}

with DAG(
    'gdpr_sox_compliance_ppl',
    default_args=default_args,
    description='Pipeline for data extraction, GDPR PII masking, and SOX audit logging',
    schedule_interval='@daily',
    catchup=False,
    max_active_runs=1,
    tags=['compliance', 'gdpr', 'sox'],
) as dag:
    # Task 1: Extract data and store in DuckDB
    extract_data_main = PythonOperator(
        task_id='extract_data_main',
        python_callable=extract_script.extract_data,
    )

    # Task 2: Mask PII data for GDPR compliance
    mask_pii_main = PythonOperator(
        task_id='mask_pii_main',
        python_callable=mask_script.mask_pii_data,
    )

    # Task 3: Capture audit log for SOX compliance
    capture_audit_log_main = PythonOperator(
        task_id='capture_audit_log_main',
        python_callable=audit_script.capture_audit_log_data,
    )

    # Define task dependencies
    extract_data_main >> mask_pii_main >> capture_audit_log_main
```

Masking PII for GDPR Compliance
I leverage the mask_pii_main task, which obfuscates PII data using a SHA256 hashing algorithm to meet GDPR requirements. The trx_pii_data table is traversed and both the full_name and email fields are hashed to prevent exposure downstream, while retaining useful non-identifiable data for analysis:
```python
import hashlib


def mask_pii(df):
    """Mask PII columns using SHA-256 hashing for GDPR compliance."""
    def hash_val(val):
        return hashlib.sha256(val.encode('utf-8')).hexdigest()

    df_masked = df.copy()
    df_masked['full_name'] = df_masked['full_name'].apply(hash_val)
    df_masked['email'] = df_masked['email'].apply(hash_val)
    return df_masked
```

This is what the mock data generated by the extract_data_main task and stored in the trx_pii_data table looks like:
Now, notice the effect of masking on the full_name and email fields:
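The mask_pii_main task wrapper that actually reads trx_pii_data from DuckDB, applies mask_pii, and writes the trx_clear_data table isn’t shown above. Here is a minimal sketch of how it could be wired, assuming the duckdb Python client and a local database file (the module path and file name are my assumptions, not the repository’s exact code):

```python
# Hypothetical sketch of computation/mask_pii_main.py - illustrative only.
import duckdb

from computation.mask_pii import mask_pii  # assumed module layout


def mask_pii_data(**context):
    """Read raw PII data, mask it, and persist the GDPR-safe table."""
    con = duckdb.connect('compliance.duckdb')  # assumed database file name
    try:
        # Load the raw transactional data produced by extract_data_main
        df = con.execute("SELECT * FROM trx_pii_data").fetchdf()

        # Apply SHA-256 masking to the PII columns (full_name, email)
        df_masked = mask_pii(df)

        # Persist the masked dataset for downstream analytical use
        con.register('masked_view', df_masked)
        con.execute("CREATE OR REPLACE TABLE trx_clear_data AS SELECT * FROM masked_view")
    finally:
        con.close()
```

Materializing the output with CREATE OR REPLACE keeps the task idempotent, so a retried Airflow run overwrites rather than duplicates the masked rows.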
Capturing Audit Logs (SOX)
To meet SOX requirements, I then execute capture_audit_log_main, which generates a time-stamped audit log recording what data was processed, when, and by which operation. This lays the foundation for reliable audit trails:
```python
import datetime
import hashlib

import pandas as pd


def capture_audit_log(df, operation_type='update', user_id=None, include_checksums=True):
    """
    Capture a comprehensive audit log with metadata for SOX compliance.
    operation_type can take one of the following values: 'create', 'update', 'delete', 'read'.
    """
    # datetime.UTC is the Python 3.11+ alias for datetime.timezone.utc
    now = datetime.datetime.now(datetime.UTC)
    batch_id = hashlib.sha256(f"{now.isoformat()}{len(df)}".encode()).hexdigest()[:16]

    audit_data = {
        'audit_id': [f"AUD-{batch_id}-{i:06d}" for i in range(len(df))],
        'record_id': df['id'].values,
        'operation_time': now.isoformat(),
        'operation_type': operation_type,
        'user_id': user_id or 'system',
        'batch_id': batch_id,
        'record_count': len(df),
        'timestamp_utc': now.timestamp(),
    }

    # Optional: add per-row checksums for additional data integrity verification
    if include_checksums:
        checksums = df.apply(
            lambda row: hashlib.sha256(
                ''.join(str(v) for v in row.values).encode()
            ).hexdigest()[:16],
            axis=1
        )
        audit_data['row_checksum'] = checksums.values

    # Add column-level metadata
    audit_data['columns_accessed'] = ','.join(df.columns)

    audit_df = pd.DataFrame(audit_data)
    return audit_df
```

Finally, this is what the audit log table derived from the function above looks like:
This high-level pipeline captures a key part of a data engineer’s daily work: safeguarding personal data at scale while ensuring strong auditability.
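One detail the snippet above leaves open is where the audit records end up. For SOX, the log should be append-only: records are inserted, never updated or deleted. Here is a minimal sketch of how the audit DataFrame could be persisted into an insert-only DuckDB table (my assumption for illustration; the actual storage step lives in the repository):

```python
# Hypothetical persistence step for the audit log - illustrative only.
import duckdb


def persist_audit_log(audit_df, db_path='compliance.duckdb'):
    """Append audit records to an insert-only audit table."""
    con = duckdb.connect(db_path)
    try:
        con.register('audit_batch', audit_df)

        # Create the table on the first run with the same schema as the batch
        existing = {row[0] for row in con.execute("SHOW TABLES").fetchall()}
        if 'audit_log' not in existing:
            con.execute("CREATE TABLE audit_log AS SELECT * FROM audit_batch LIMIT 0")

        # Only ever INSERT - never UPDATE or DELETE - so the full change
        # history required by SOX is preserved.
        con.execute("INSERT INTO audit_log SELECT * FROM audit_batch")
    finally:
        con.close()
```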
If you want to learn more about the complete code, check out my GitHub repository.
Challenges of Managing Compliance with Traditional Pipelines
While Airflow and other open-source orchestrators offer a lot of flexibility and control, they also introduce several challenges that make compliance difficult to maintain at scale.
For starters, pipelines can break the moment schema drift occurs and cause failures even in basic masking or logging tasks. Such fragility puts compliance at risk and sometimes triggers SEV1-2 incidents.
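To make that concrete with the masking helper from earlier: if an upstream producer renames or drops the email column, mask_pii raises a KeyError mid-run and the whole DAG run fails. The usual workaround is explicit, hand-written guards like the sketch below (my own illustration), and this kind of defensive code piles up quickly:

```python
# Hypothetical guard against schema drift - illustrative only.
REQUIRED_PII_COLUMNS = {'full_name', 'email'}


def validate_schema(df):
    """Fail fast with a clear error instead of an opaque KeyError inside mask_pii."""
    missing = REQUIRED_PII_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Schema drift detected; missing PII columns: {sorted(missing)}")
```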
On top of that, having logging and masking logic scattered across different processes adds a ton of maintenance overhead and makes it difficult to enforce policies consistently. And because there’s no real-time validation or strong governance baked in, compliance gaps often show up only after something has already gone wrong. The mix of specialized tools for CDC, batch, and streaming just adds another layer of complexity. All this creates a tangled ecosystem that’s tough to manage day to day.
These technical hurdles turn into real operational pains, such as more incidents to troubleshoot, longer system downtimes, and a higher chance of running into compliance issues.
Building Compliance-Ready Pipelines with Estuary
While developing a longer-term strategy to address these compliance challenges, I came across Estuary. Their SaaS platform redefines how compliance is embedded in data pipelines by unifying ingestion, governance, and delivery into a single consistent framework.
Instead of fragile, fragmented tasks, Estuary treats the pipeline as a unified whole. It combines CDC, batch, and streaming into a single, scalable platform that lets teams choose the “right-time” data sync for their workloads, whether that’s sub-second, near-real-time, or scheduled batch. Not only does this consolidation make governance simpler, but it also gets rid of fragile point-to-point jobs, giving teams tighter control over latency, cost, and compliance.
Estuary’s engineering team has recently shipped a new feature called schema-driven redaction. It lets data teams automatically protect sensitive data by applying redaction rules directly within schemas. You can specify whether you want fields to be completely removed or replaced with secure, salted SHA-256 hashes (similar to the Airflow hands-on example but out-of-the-box!). This ensures that personal information is never exposed in storage or error logs. The system also manages the hashing salt for consistency across data processing tasks, which provides a reliable, automated privacy safeguard that goes beyond manual scripting. (See Estuary PR #2383 for technical details)
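To make the salted-hashing idea concrete, here is a plain-Python illustration of the general technique (not Estuary’s implementation, where the salt is managed for you): a shared salt keeps the same input hashing to the same token across independent tasks, while making precomputed dictionaries of raw SHA-256 hashes useless. The PII_HASH_SALT environment variable is a hypothetical name I chose for the example:

```python
import hashlib
import os

# A managed platform would handle the salt for you; here it is read from an
# environment variable so every task hashes with the same secret value.
SALT = os.environ.get("PII_HASH_SALT", "change-me")


def salted_hash(value: str) -> str:
    """Hash a PII value with a shared salt so results stay consistent across tasks."""
    return hashlib.sha256(f"{SALT}{value}".encode("utf-8")).hexdigest()


# The same email hashed in two different jobs yields the same token,
# so joins on masked columns still work downstream.
assert salted_hash("jane.doe@example.com") == salted_hash("jane.doe@example.com")
```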
Estuary also offers automated schema evolution to reduce breakages, exactly-once delivery for audit integrity, private and BYOC deployments for strict data control, and built-in immutable audit logs with lineage for easy compliance. All this speeds up pipeline builds and reduces failures, so teams get to focus on business value instead.
Strengthening Compliance with Right-Time Data Pipelines
Building compliance directly into data pipelines is how you maintain trust at scale. As your data and requirements grow, you need pipelines that are governed end to end.
Estuary is the right-time data platform that helps you manage all of it. It brings all your pipeline modes together in one place, protects data, and supports deployment options that fit enterprise security policies.
If you’re leading a data engineering team that has to juggle GDPR, SOX, and similar regulatory standards, you don’t need to look into other options.
Ready to simplify GDPR and SOX compliance across your data pipelines? Talk to an Estuary expert and see how right-time data movement makes governance effortless.
FAQs
Why do traditional Airflow pipelines struggle with compliance at scale?
Because masking, logging, and access controls live in scattered, hand-maintained tasks. Schema drift can silently break those tasks, there is no built-in real-time validation or governance, and the mix of separate tools for CDC, batch, and streaming multiplies the places where compliance gaps can appear, so issues often surface only after something has already gone wrong.
About the author
Experienced Data Engineering Lead with nearly a decade of expertise in designing and delivering robust analytical data solutions for financial markets. Backed by a strong academic foundation in computer science and quantitative finance, I currently lead data engineering initiatives at Wise.