Putting an end to Unreliable Analytics
July 26, 2021
When building a product or service, it’s imperative to know that input data will be as expected. If that information comes from another department, relying on them not to change it inevitably ends in disappointment since they might not even know that their data is being used and certainly won’t know how.
Good engineers know this and test everything so that they catch changes and mistakes. Humans are error-prone by nature, and a good engineer has the humility to accept that. With quality testing, you can find mistakes before they make it to production servers.
Code is relatively easy to test now. Compilers and test suites help to ensure that code functions properly within a single environment. Unfortunately, it’s still common and incredibly costly for errors to occur at integration points: the places where one system hands data or tasks off to another.
Integration points are everywhere. They occur whenever different technologies are used or when a service is handed off between teams. One of the most popular ways of ensuring accuracy and consistency during handoffs is to build well-documented APIs, which are maintained and exposed directly to other teams. This is essentially Amazon’s 2002 strategy of microservices.
APIs enable communication across technologies and teams because they provide a framework for each service to validate the data that it sends or receives. If there’s a problem, an error is returned immediately. That makes systems more reliable and easier to reason about, since problems surface as explicit failures that can be fixed quickly. Unfortunately, APIs are expensive to implement and maintain. Each one requires a constantly available service that’s documented and supported. As such, endpoints cost businesses tens of thousands to implement and maintain on an ongoing basis.
Analytics workloads often integrate data from many different sources without doing any runtime validation, which leads to runtime crashes or, worse, incorrect results. The “Data Products” in many organizations rely on integrating information from a variety of sources, oftentimes using complex transformations to make the pieces fit together. But these “Data Products” routinely have very weak test coverage, and the state of the art in testing them is nascent, with tooling only now being built.
JSON Schema can be a lighter-weight method of integrating data from any source through validation, without requiring a heavyweight API. It’s essentially a framework for ensuring that data in transit has its expected shape and structure. Runtime errors can therefore be caught before data lands in a system, by verifying validity at the system’s edges. Much simpler tests can be written to ensure that bad data never leaves or enters.
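To make the idea of validating at the edge concrete, here is a minimal sketch in Python. It is not a real JSON Schema implementation: it hand-rolls only three keywords (`type`, `required`, and `properties`) using the standard library, and the schema, field names, and the `validate` and `ingest` functions are all hypothetical names chosen for illustration. A production system would more likely use an existing JSON Schema validator library.

```python
import json

# Hypothetical schema for an incoming event; field names are illustrative only.
EVENT_SCHEMA = {
    "type": "object",
    "required": ["user_id", "amount"],
    "properties": {
        "user_id": {"type": "string"},
        "amount": {"type": "number"},
    },
}

# Maps the JSON Schema type names used in this sketch to Python types.
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def validate(instance, schema, path="$"):
    """Return a list of error strings; an empty list means the instance passed.

    Supports only the `type`, `required`, and `properties` keywords.
    """
    errors = []
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(instance, bool) and expected in ("number", "integer")
        if not isinstance(instance, _TYPES[expected]) or bool_as_number:
            errors.append(f"{path}: expected {expected}, got {type(instance).__name__}")
            return errors
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required field '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, f"{path}.{key}"))
    return errors


def ingest(raw_line):
    """Parse one JSON line and refuse it at the edge if it does not match."""
    record = json.loads(raw_line)
    errors = validate(record, EVENT_SCHEMA)
    if errors:
        raise ValueError("; ".join(errors))
    return record
```

Calling `ingest('{"user_id": "u1"}')` raises a `ValueError` naming the missing `amount` field, so the bad record fails loudly at the boundary instead of flowing into downstream transformations.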
We’re already seeing this trend with other technologies such as Protobuf and OpenAPI. They’ve both evolved to at least provide some schema checking and validation out of the box, lifting the responsibility for doing request/response validation from developers. While both are imperfect, OpenAPI seems to be headed in the right direction by fully adopting JSON Schema.
The major problem is that these frameworks don’t go far enough. We need a more comprehensive plumbing of OpenAPI-style input and output expectations into the data pipelines that are responsible for events after API serving has completed. We should be able to have one source of schematic truth that can be used both to build APIs against OpenAPI and to plumb into data pipelines, services, and stores.
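One way to picture a single source of schematic truth is a schema object that is defined once and consumed by both the API description and a pipeline step. The sketch below assumes a hypothetical `ORDER_SCHEMA` and hypothetical helper names. The OpenAPI document is only a minimal 3.1 skeleton with no paths, and the pipeline check looks only at required fields, so treat it as an illustration of the sharing pattern rather than a complete design.

```python
# Hypothetical shared schema; in practice it might live in a central repository.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["order_id", "total"],
    "properties": {
        "order_id": {"type": "string"},
        "total": {"type": "number"},
    },
}


def openapi_document(schema):
    """Embed the shared schema in a minimal OpenAPI 3.1 document skeleton."""
    return {
        "openapi": "3.1.0",
        "info": {"title": "Orders API (illustrative)", "version": "0.1.0"},
        "paths": {},
        "components": {"schemas": {"Order": schema}},
    }


def missing_fields(record, schema):
    """Return the required fields that a record lacks, per the shared schema."""
    return [key for key in schema.get("required", []) if key not in record]


def pipeline_filter(records, schema):
    """Split records into accepted and rejected using the same schema object."""
    accepted, rejected = [], []
    for record in records:
        if missing_fields(record, schema):
            rejected.append(record)
        else:
            accepted.append(record)
    return accepted, rejected
```

Because both consumers read the same object, adding a field to `required` tightens the API contract and the pipeline filter in one change, which is the property the paragraph above is asking for.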
Using these principles combined with tools like JSON Schema, we’re headed towards a future with end-to-end testing not only within the code that one team maintains, but across team and system boundaries. At some point, the forward-thinking enterprise will centralize its schemas so that changes can simply filter down to all products and services that use them. Data integrations and product development will become much faster and more reliable; working on them will look a lot more like building software than standing up tech infrastructure.
Analytics is a space with tons of low-level point solutions that need to be wired together, with no native error checking between them. Its main focus is integration, not processing. We should take a step back and think from first principles about what we’d want in a data analytics framework that can help with this. A usable system should verify data not only in testing but also in production, with checks at the boundaries of every system.