Data breaches are becoming all too common as we digitize more of our lives. It’s more important than ever for organizations to lock down data, remove PII where they can, and use data tools they can trust, especially if they handle any sort of sensitive personal information.
In this shifting landscape, engineers may find that some data systems they want to use haven’t been cleared by their security team to hold all of the organization’s data. A company might have implemented rigorous control over access to their own database, but what if they want to transfer data to a warehouse so big queries don’t impact their main servers? Does this external warehouse have the same security protocols as the in-house database? Do data analysts working with the warehouse really need access to individual customer names? And whose fault is it if the data leaks?
One way to limit the risk is to simply not share all your data with your warehouse. In fact, it’s good practice to limit sharing Personally Identifiable Information (PII) wherever possible.
What is PII?
PII, or Personally Identifiable Information, is any data that can be used to identify a specific individual. This of course includes name, address, and birthdate, but it can also include data that looks anonymized. Even if a UUID, or universally unique identifier, is used in place of a name, that UUID can still be used to track certain habits and may be considered PII. Consider financial data, which can track behavioral patterns in spending. Or location data: you don’t need a name to personally identify an individual’s home and work sites from their mobile data.
In some cases, there are specific legal definitions of personal information, notably with GDPR (the General Data Protection Regulation in the EU) and CCPA (the California Consumer Privacy Act).
The GDPR, for example, includes these details in its definition of personal data: “factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of [a] natural person.” This covers a wide swath of data in a variety of formats. Similarly, the CCPA calls out “records of products purchased, internet browsing history, geolocation data, [and] fingerprints” as personal information.
These may be the most well-known data and privacy regulations, but multiple countries and regions have their own variants. Privacy and data protection are increasing concerns, and companies need to stay abreast of applicable local and international laws. Estuary CEO David Yaffe discusses these concerns a bit more in “Keep Privacy and Governance in Mind When Developing or Updating Your Systems.”
This doesn’t mean that companies need to avoid this data wholesale; some apps legitimately benefit from some of these data categories. But you do need to be cognizant of the regulations that apply, and you may need to plan ways to remove an individual’s data from all of your connected systems on request.
Industries with Particular Data Obligations
Some industries, by their nature, deal with data even more sensitive than your average PII. The healthcare industry must take HIPAA into account, while PCI DSS provides security standards for payment data in the finance and retail realm.
If you’re trying to keep data transfer and storage as in-house as possible, it’s a lot to keep track of. Choosing a data pipeline tool like Estuary can help simplify this complicated landscape. Estuary is GDPR, CCPA, and CPRA compliant, as well as HIPAA compliant and SOC 2 Type II certified. We’re always looking for opportunities to provide additional data security options.
How to Handle PII in ETL and ELT Pipelines
If the data landscape now looks like a minefield, let’s work through some specific examples of how to mitigate risk. One area that can be of particular concern is data transfer. Consider ELT pipelines, where data is loaded as-is into storage before any transformations are applied.
This type of ELT, or Extract-Load-Transform pipeline, can be useful when you want to materialize the same data to different locations, perhaps each with their own requirements for how the data should be transformed. However, if you want to anonymize data before storage, or block or remove fields altogether, ETL (Extract, Transform, Load) is the way to go. That way, you can transform your data before any of the downstream data systems receive information they shouldn’t.
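To make that concrete, here’s a minimal sketch of an ETL-style transform step in SQL. The table and column names (raw_invoices, postal_code) are hypothetical, not tied to any particular connector:

-- Hypothetical ETL transform step: select only non-sensitive fields
-- and coarsen anything identifying before the load step runs.
SELECT
    invoice_id,
    amount_due,
    amount_paid,
    -- Keep just a postal prefix: useful for regional reporting,
    -- but no longer enough to pinpoint an individual customer
    substr(postal_code, 1, 3) AS region
FROM raw_invoices;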
If you’re using Estuary for your data pipeline, you have some options for removing PII in a controlled and effective manner.
Option 1: Use the Estuary Web App to Limit Materialized Data
You can create low- or no-code pipelines directly in the Estuary web app. This lets you get started using ETL pipelines to remove PII, even with a very low tech burden. You can simply select the fields you need in the Flow app when materializing data to a downstream system.
Let’s say you’re working with Stripe invoices and want to save a subset of this data to your destination. You really only want to track fields like amount_due, amount_paid, and amount_remaining for reporting purposes. Of course, it matters who is paying or has remaining balances so you can follow up if there’s an outstanding bill, but for this report you’re simply sharing general revenue flow with certain stakeholders. Those stakeholders don’t need to know customer details.
Unfortunately, Stripe provides quite a lot of data through its Invoices API, including the customer name, email, and address. Our reporting will be more focused and secure if we only select the fields we really need.
When you create a materialization in Estuary Flow, you can manage this from the Source Collections section. Assuming you’re already connected to Stripe as a source connector, you can select Stripe’s Invoices collection to materialize.
This will bring up the Resource Configuration on the right-hand side of the screen. The configuration automatically attempts to determine which fields to include or exclude from your materialization. You can also edit this yourself under Field Selection.
If you only want to keep a few fields, you can Exclude All and then pick individual fields to include. Alternatively, you can exclude specific fields, such as those identified as PII.
Either way, we can block all of the invoices’ customer fields before we Save and Publish our materialization. Only the selected fields will make their way downstream to our destination.
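Under the hood, this choice lives in the materialization’s specification. As a rough sketch (the endpoint and connector configuration are elided, and the names are placeholders based on our Stripe example), the binding’s field selection might look something like this:

materializations:
  Artificial-Industries/reporting:
    # Endpoint/connector configuration elided for brevity
    bindings:
      - source: Artificial-Industries/stripe/Invoices
        resource:
          table: invoices_report
        fields:
          recommended: true # start from Flow's recommended fields...
          exclude: # ...then explicitly drop Stripe's customer PII
            - customer_name
            - customer_email
            - customer_address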
Option 2: Create a Derived Collection to Limit Data Across Multiple Materializations
Manually selecting fields to materialize might feel tedious, especially if you’re working with a large schema, and doubly so if you plan to materialize the same data to multiple destinations. Manually duplicating work is also prone to human error, which tends to be a phrase you don’t want to see in conjunction with PII.
In these cases, you may want to set up a derivation to programmatically manage your subset of data. You can use SQL or TypeScript to transform your data before saving it to a new collection. You can then use this collection with multiple materializations.
Using the same Stripe invoice example from earlier, let’s run through a SQL derivation using GitPod. If you’d rather use TypeScript or develop locally with flowctl, check out How to transform data using TypeScript or Edit a Flow specification locally.
Step 1: Create a Derivation
From the Collections tab in the Estuary dashboard, select New Transformation.
This will open a modal to pick the collection(s) you want to use in your derivation, the language you want to use, and the name of the newly derived collection. In this case, we’re interested in data from the original Invoices collection and will save to invoices-lite.
Select Proceed to GitPod. This will create a provisioned cloud environment. flowctl, the CLI tool used to manage and test Flow pipelines, will come preinstalled, and some helpful stub documents will be created for you.
Step 2: Edit Collection Schema
You will be able to choose your preferred code editor when you set up your workspace. From here, you can explore the generated file structure. For a derivation that simply removes fields from the collection specification, there are only a couple of files we’ll be interested in.
The first is the new collection schema itself. This is a nested flow.yaml file, located in the directory named after your organization.
Here, we’ll define exactly what fields we want to include in our collection, leaving out any PII we don’t want to share with downstream systems. Using our Stripe invoices reporting example, we may only be interested in a handful of properties, such as the ID (which will be our key), when the invoice was created, and the amounts that were due, paid, and remaining.
The resulting schema would look something like this:
---
collections:
  Artificial-Industries/invoices-lite:
    schema:
      type: object
      properties:
        id:
          type: string
        amount_due:
          type: integer
        amount_paid:
          type: integer
        amount_remaining:
          type: integer
        created:
          type: integer
      required:
        - id
    key:
      - /id
    derive:
      using:
        sqlite: {}
      transforms:
        - name: Invoices
          source: Artificial-Industries/stripe/Invoices
          shuffle: any
          lambda: invoices-lite.lambda.Invoices.sql
Note that, besides the schema properties, we’re also defining what data we’re deriving and how we’re deriving it. We perform our derivation using sqlite, our data source is our original Stripe Invoices collection, and the lambda is how we will transform the source collection.
Speaking of which, let’s look at that invoices-lite.lambda.Invoices.sql file, which is in the same directory as the schema.
Step 3: Edit SQL Lambda
The schema defines the broad strokes of how we plan to handle our data, while the lambda gets into the nitty-gritty. For our use case, a simple SELECT statement will be sufficient, but you can also perform joins, reductions, and other transformations.
Our SELECT statement returns all the fields that we defined as properties in our schema:
-- Returns a very small subset of available data
SELECT
    -- Key id
    $id,
    -- Selected invoice details
    $amount_due,
    $amount_paid,
    $amount_remaining,
    $created;
That’s all the editing we need to do for our use case. In case you’re curious about that extra SQL file that GitPod generated, though, check out migrations.
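If you ever need more than a plain field subset, the same lambda can reshape values as it selects them. Here’s a hedged sketch of one such variant: it generalizes exact outstanding balances into coarse bands. The band names and thresholds are made up, and the collection schema would need a matching balance_band string property:

-- Hypothetical variant: generalize exact balances into coarse bands
-- so downstream reports never see precise per-customer amounts.
-- Stripe amounts are in cents, so 10000 means $100.
SELECT
    $id,
    $created,
    CASE
        WHEN $amount_remaining = 0 THEN 'settled'
        WHEN $amount_remaining < 10000 THEN 'under_100'
        ELSE 'over_100'
    END AS balance_band;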
Step 4: Test and Publish Changes
Now we need to transfer our edited files from GitPod back to Estuary.
First, we should test to make sure we don’t have typos, type mismatches, or other little bugs waiting to turn into big issues. We can preview what our derived data will look like and confirm that it matches our expected output.
In the terminal, run:
flowctl preview --source flow.yaml
If no errors occur and the data format looks correct, Ctrl-C out of the process. Then you can push the changes back to Estuary with:
flowctl catalog publish --source flow.yaml
Note: if you receive authentication errors when using flowctl, your session may have timed out. Run flowctl auth login and paste in your auth token to refresh your session.
Back in the Estuary dashboard, you should now see a new invoices-lite collection! You can use this derived collection like any other collection.
Now, when multiple materializations need our reduced dataset, it’s quick to select our derived collection as the source without worrying about choosing fields for each one.
Great! But that’s not quite right for my use case
Some PII is so sensitive, you want to minimize access to it as much as possible. Think health records or bank information. While Estuary is HIPAA compliant and works hard to maintain security certifications, we understand that some businesses require more specialized and nuanced data practices.
To help handle such cases, Estuary offers private deployments. These allow you to make use of Estuary Flow’s powerful data pipeline management while remaining in your own private network. This can be especially helpful for enterprises and companies in highly regulated industries.
We’re also looking into more ways to handle incoming PII, not just outgoing, allowing you to protect data even earlier in the ETL process. Stay tuned: we expect to officially announce this feature in the first half of 2025. 👀
To receive updates on that, as well as other Estuary improvements and resources, sign up for our newsletter (subscribe below) or follow our Data Flow Digest on LinkedIn.
If you have any questions, don’t hesitate to reach out to us via our Community Slack channel or send us an email!
FAQs
1. Why is removing PII before data storage crucial for compliance?
In general, the fewer connected systems that have access to PII, the less likely it is for that PII to leak. Individual certifications will have their own requirements and standards, and companies may mitigate risk by determining that certain downstream vendors will not have access to PII. Additionally, for consumer protection laws, limiting the systems that have access to an individual’s data will simplify the process of removing their information on request.
2. Can I remove PII from large datasets using ETL pipelines?
Yes! Your exact mileage will vary depending on the tools you use for ETL and whether you’re processing data in real-time or batch. Estuary Flow, for example, can apply transformations on streaming data with millisecond latency and can handle individual documents up to 16 MB each. Flow is also smart about handling large datasets or high rates of change depending on the data’s final destination.
3. How does data anonymization fit into PII removal?
Data anonymization is the process of removing PII from data. Be careful with the concept, though: someone might think that using a UUID instead of a name (“cust11235813”) is anonymous enough, but if that UUID tracks an individual’s habits, the data might not be fully anonymized. This article looked at removing entire subsets of data, but in some cases, generalizing data into broader categories or applying other techniques may be sufficient to anonymize data.
About the author
Emily is a software engineer and technical content creator with an interest in developer education. She has experience across Developer Relations roles from her FinTech background and is always learning something new.