
Safely Sharing Data Between Companies



Data is the lifeblood of the modern enterprise.  Our ability to consume, process, and act on it has allowed us to enjoy a digital transformation that’s fueled an unprecedented rate of innovation over the past 20 years.  The internet as we know it didn’t exist until the 1990s, and just 25 years later it allows 4.5 billion people to connect and access anything they can dream of in seconds.

Sharing data is one of the pillars of that transformation.  Regardless of whether we look at healthcare, manufacturing, transportation, or finance, allowing for fast and easy transfer of information between companies (while of course upholding privacy) is a key to advancement.  Without it, we wouldn’t be able to check Google for when our next train arrives, understand how stocks are performing, or even book restaurant reservations without having to speak to a human.

Why is data sharing so hard?

While data sharing is not complicated in theory, unlimited use cases, unending technology options, and a complete lack of standards in most industries make it difficult in practice.  To simplify it, there are four things that need to be addressed:

  1. Schema
    • Schema refers to the shape of your data and defines the fields you care about.
    • When two companies work together, they rely on each other to keep schemas consistent, which can be challenging as products change.
  2. Technology
    • Different companies use different technologies; working with everyone requires support for many distinct options and interfaces.
  3. Privacy laws
    • Privacy laws are complex and different regions are bound by different laws.  Adhering to all of them requires specific opt-in and consent which is challenging to implement and manage. 
  4. Trade secrets
    • Every company has secrets that can’t be compromised, whether that’s your client list, revenue numbers, or other proprietary information that gives you an edge.

How do we minimize friction?

We’ll always have to be thoughtful about sharing data between companies to ensure that we’re not jeopardizing privacy or trade secrets, but there are steps that we can take to reduce risk and friction while making integrations more seamless.  A high-level goal should be to make new integrations as easy as specifying schema and technology.

Schema   

Schema governs not only the shape of your data but also the fields which are required for it to be valid.  Specifying a strong schema ensures that all data which makes its way from one company to another is actually usable, while providing a place to catch potential issues as they arise.  Agreeing on that schema up front is beneficial since it provides a standard that can be used repeatedly.  Some industries go a step further by developing broad standards, such as HL7 and FHIR in healthcare or OpenRTB in advertising.

Schema is important for both incoming and outgoing data, and an idealized setup would be a centralized repository that contains the necessary schema information for all integrations.  Such a repository is key both for updating integrations that already exist and for adding new ones.  Schema isn’t “one size fits all,” though, which makes standardized integrations tricky.  Vendors who claim a “Salesforce integration” don’t mean that they support all of the hundreds of fields Salesforce has, but rather that they support some of them, and that may or may not meet your use case.

As an example, imagine that I work for a health insurance company whose major goal is to share with hospitals the amount of money we’re willing to pay for each procedure.  Our repository should contain a listing of procedures and payment amounts; there’s no need to add customer data or revenue information.  Hospitals might accept much more information through their APIs, but locking our schema down to this use case protects us and simplifies the integration.
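
As a rough sketch of what that locked-down schema could look like, the snippet below validates a record with Python’s jsonschema library; the field names and values are hypothetical, and jsonschema is just one of many ways to enforce a schema.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical schema for the insurer example: only the fields the
# integration actually needs, with everything else rejected outright.
PROCEDURE_SCHEMA = {
    "type": "object",
    "properties": {
        "procedure_code": {"type": "string"},
        "payment_amount_usd": {"type": "number", "minimum": 0},
    },
    "required": ["procedure_code", "payment_amount_usd"],
    "additionalProperties": False,  # customer data, revenue, etc. can never slip through
}

record = {"procedure_code": "99213", "payment_amount_usd": 92.50}

try:
    validate(instance=record, schema=PROCEDURE_SCHEMA)
except ValidationError as err:
    # A natural place to alert the owning team rather than silently dropping data.
    print(f"Record rejected: {err.message}")
```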

Technology  

These days, data can be stored and sent using a plethora of technologies, which fall into several categories that define how data is pulled from partners, stored internally, and sent to third parties.  These essentially include:

  • Files that live in cloud storage, FTP, or another server
  • Databases and warehouses
  • APIs
  • Streaming systems

Realistically, partners will ask for any and all integration types, and supporting them all results in custom, one-off development.  Unfortunately, as the number of supported integration types increases, so do the moving parts and therefore the engineering burden.  Inevitably something breaks, such as a partner’s FTP server no longer containing recent data or an API’s data format changing.  Unless a system is designed thoughtfully up front, every one of these events incurs wasted engineering time spent troubleshooting and coordinating, as well as business expense.

The options to minimize this type of loss are to either a) use a partner with pre-built integrations, or b) create a robust testing, monitoring, and alerting framework.  Extremely detailed alerts help ensure fast triage and fixing of issues.  Even if your system is built perfectly, partners will unintentionally make breaking changes which you’ll need to know about.
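
As a minimal sketch of that kind of monitoring, assuming partner files land in a mounted directory (the path, threshold, and alerting hook below are all hypothetical), a freshness check might look like this:

```python
import time
from pathlib import Path

DROP_DIR = Path("/mnt/partner_ftp/daily_feed")  # hypothetical mount of the partner's drop
MAX_AGE_SECONDS = 26 * 60 * 60                  # expect at least one new file per day

def newest_file_age(directory: Path) -> float:
    """Return the age in seconds of the most recently modified file."""
    mtimes = [f.stat().st_mtime for f in directory.iterdir() if f.is_file()]
    if not mtimes:
        return float("inf")  # nothing there at all is also an alert condition
    return time.time() - max(mtimes)

def check_freshness() -> None:
    age = newest_file_age(DROP_DIR)
    if age > MAX_AGE_SECONDS:
        # Swap in your real alerting system (PagerDuty, a Slack webhook, etc.).
        print(f"ALERT: no new partner file for {age / 3600:.1f} hours")

if __name__ == "__main__":
    check_freshness()
```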

Privacy laws  

For the most part, these kick in when companies share customer data with each other.  One great tactic is to avoid sharing anything that could be considered PII by aggregating data.  For example, instead of sharing data on a particular user, count how many users had the attribute you care about and share that count.  Most of the time, you can get the same value while better protecting users.  If that’s not possible because you need to share user-level data, it’s probably best to work with companies that specialize in de-identification, like LiveRamp or Neustar.
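
As a toy illustration of the aggregation tactic (the records and field names are invented), only counts above a minimum group size ever leave your system:

```python
from collections import Counter

# User-level records stay internal; only aggregates are shared externally.
users = [
    {"user_id": "u1", "region": "EU", "clicked_promo": True},
    {"user_id": "u2", "region": "EU", "clicked_promo": True},
    {"user_id": "u3", "region": "US", "clicked_promo": False},
]

MIN_GROUP_SIZE = 2  # suppress small groups that could identify individuals

counts = Counter(u["region"] for u in users if u["clicked_promo"])
shareable = {region: n for region, n in counts.items() if n >= MIN_GROUP_SIZE}

print(shareable)  # {'EU': 2} -- no user_id ever leaves the building
```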

Trade secrets  

In reality, which information can be shared externally without qualifying as a trade secret is a business decision.  Teams should strive to make it up front, one time, and not repeatedly every time a new integration is created.  A common tactic is to involve the internal team that controls governance, asking them early on to enumerate the types of information that are (or aren’t) acceptable to share externally, with the goal of creating a standardized repository of what can be shared.  Any new integration that falls outside the scope of what’s permissible should be escalated to the governance team for remediation.
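
One lightweight way to encode that standardized repository is an allow-list that every outgoing record is checked against; the approved field names below are hypothetical:

```python
# Fields the governance team has approved for external sharing.
APPROVED_FIELDS = {"procedure_code", "payment_amount_usd", "effective_date"}

def vet_outgoing_record(record: dict) -> dict:
    """Raise if the record contains anything outside the approved list."""
    unapproved = set(record) - APPROVED_FIELDS
    if unapproved:
        # Escalate to the governance team instead of silently sending the data.
        raise ValueError(f"Fields not approved for sharing: {sorted(unapproved)}")
    return record

vet_outgoing_record({"procedure_code": "99213", "payment_amount_usd": 92.50})
```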

This isn’t the “simple” path that most engineers would take when building an integration.  The most common and easiest way is to locate the data’s source and pass it through to partners directly.  The problem with this method is that data is always in flux: if a team doesn’t take the time to test and validate the specific fields being passed, an upstream change in the data’s source can lead to major problems.  Information that violates internal privacy and governance rules could leak to partners, and no one would ever be the wiser.

What does the final ideal system look like?

In an idealized state, your team would have a pre-built “drop-down” of technology connections for both incoming and outgoing integrations, something as simple as “just add credentials” to get it going.

When data comes from a partner to your system, it should land in whatever shape it arrives in.  This way, you get it into your system as quickly as possible for use.  Users would be able to define its expected schema so that when the data doesn’t match reality, the proper team is alerted and can quickly triage with the partner.

Next comes the transformation tool, which has the task of getting data into a usable format for both internal and external purposes.  Ideally, specifying transformations is as easy as writing a SQL query, but a strong system should additionally support more complex use cases which may require a scripting language like Python.  Transformations live in version control (similar to dbt) so that anyone technical within a company can understand how they are made and create new ones.
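
As a minimal sketch of the “transformation is just a SQL query” idea, the example below uses SQLite purely for illustration (tables and columns are invented); a real pipeline would apply the same query continuously rather than as a one-off batch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw data lands in whatever shape it arrives in.
conn.execute("CREATE TABLE raw_claims (procedure_code TEXT, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO raw_claims VALUES (?, ?, ?)",
    [("99213", 92.5, "US"), ("99213", 88.0, "US"), ("99397", 120.0, "EU")],
)

# The transformation itself is just SQL, easy to keep in version control.
conn.execute("""
    CREATE VIEW avg_payment_by_procedure AS
    SELECT procedure_code, AVG(amount) AS avg_payment
    FROM raw_claims
    GROUP BY procedure_code
""")

for row in conn.execute("SELECT * FROM avg_payment_by_procedure"):
    print(row)  # ('99213', 90.25) and ('99397', 120.0)
```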

If we’re talking about ideal state, transformations should be both instantaneous and continuous, creating derived collections of data that are populated in real-time by services or partners. 

The last step is a service which loads data into partners’ systems.  This should parallel the simplicity of the aforementioned pre-built “drop-down” of technology connections for getting data into the system.  With this level of simplicity, a business user could set up new integrations without engineers having to be involved.

Building this type of system is a burden for many companies and one that should be standardized.  Estuary Flow is designed to help ease that burden. You can read more about it here.

About the author

David Yaffe, Co-founder and CEO

David Yaffe is a co-founder and the CEO of Estuary. He previously served as the COO of LiveRamp and the co-founder / CEO of Arbor, which was sold to LiveRamp in 2016. He has an extensive background in product management, having served as head of product for Doubleclick Bid Manager and Invite Media.
