Data Infrastructure

Data Quality at the Source: Why Moving Validation Upstream Changes Everything

Yuki Nakashima, April 30, 2024

The data observability and quality tooling market has built itself around the wrong premise. The dominant pattern — instrument your data warehouse, run statistical tests against it, alert when distributions shift — catches quality problems after the bad data has already propagated through your pipelines, trained your models, and influenced your decisions. The damage is done by the time the alert fires.

The right frame for data quality is not "detect anomalies in the warehouse." It's "prevent bad data from entering the pipeline in the first place." This is a meaningful architectural shift, and it requires different infrastructure than post-hoc observability tooling.

Why the warehouse-centric model persists

Data quality tooling grew up around the warehouse because that is where data engineering teams had access. Warehouse data is queryable with SQL, and statistical tests are easy to write against columnar data. The data engineering team typically doesn't own the source systems (the application databases, the event streams, the external API integrations), so it detects problems where it can see them.

This is an organizational constraint masquerading as an architectural choice. The useful question is: if you could choose where to enforce quality, where would you choose? The answer is almost always at ingestion, at the source, before the data enters the pipeline. This is where schema enforcement can reject a bad record before anything downstream consumes it. This is where type validation is cheap. This is where the metadata about data provenance is still intact and actionable.
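To make the ingestion-time check concrete, here is a minimal sketch in Python. The schema, the field names, and the validate_record and ingest helpers are all hypothetical, chosen for illustration rather than taken from any particular tool. The point is only that a type check at the boundary is a few lines of code, and that a rejected record can carry its reasons with it.

```python
from typing import Any

# Hypothetical schema for an "order" event: field name -> required Python type.
ORDER_SCHEMA: dict[str, type] = {
    "order_id": str,
    "amount_cents": int,
    "created_at": str,
}


def validate_record(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return the list of violations; an empty list means the record passes."""
    errors: list[str] = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif type(record[field]) is not expected:
            # Exact type match, so a bool is not silently accepted as an int.
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


def ingest(records: list[dict[str, Any]], schema: dict[str, type]):
    """Split records into accepted rows and rejected rows with their reasons."""
    accepted, rejected = [], []
    for record in records:
        errors = validate_record(record, schema)
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    return accepted, rejected
```

Only the accepted list would continue into the pipeline. The rejected list, with its reasons attached, is what goes back to the producing team while the context of the failure is still available.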

The data contract pattern

The data contract movement — defining explicit schemas and quality constraints at the interface between data producers and data consumers — is the most promising direction for source-side quality enforcement. A data contract specifies what a producer agrees to provide: schema, field semantics, freshness guarantees, volume expectations. When a producer violates a contract, the violation is caught at the interface rather than detected downstream.
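As a rough illustration of what such a contract might look like in code, the sketch below models three of the four elements named above (schema, freshness, volume) as a Python dataclass and checks one batch against it. Field semantics are left to documentation here because they are hard to verify mechanically. The DataContract class, the check_batch function, and every field name are assumptions made for this example, not the format of any existing data contract tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: what the producer agrees to provide."""
    name: str
    schema: dict[str, type]    # field name -> required type
    max_staleness: timedelta   # freshness guarantee
    min_rows: int              # volume expectation per batch (lower bound)
    max_rows: int              # volume expectation per batch (upper bound)


def check_batch(
    contract: DataContract,
    rows: list[dict[str, Any]],
    newest_event_time: datetime,
    now: datetime,
) -> list[str]:
    """Return every contract violation found in one batch."""
    violations: list[str] = []

    # Schema: each row must carry each contracted field with the right type.
    for i, row in enumerate(rows):
        for field, expected in contract.schema.items():
            if field not in row:
                violations.append(f"row {i}: missing field {field}")
            elif type(row[field]) is not expected:
                violations.append(f"row {i}: {field} is not {expected.__name__}")

    # Volume: the batch size must fall inside the agreed range.
    if not contract.min_rows <= len(rows) <= contract.max_rows:
        violations.append(
            f"volume: {len(rows)} rows outside "
            f"[{contract.min_rows}, {contract.max_rows}]"
        )

    # Freshness: the newest event must be recent enough.
    if now - newest_event_time > contract.max_staleness:
        violations.append(f"freshness: newest event older than {contract.max_staleness}")

    return violations
```

A batch that returns an empty list is accepted at the interface. Anything else is a violation the producer can be told about before a downstream consumer ever reads the data.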

This pattern requires infrastructure that can enforce contracts at ingestion time, not just test against them post-hoc in the warehouse. It requires tooling that data producers (typically application engineers, not data engineers) can integrate into their systems with low friction. And it requires organizational change: contracts have to be established as a shared responsibility rather than a data engineering concern alone.
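One way to picture "low friction" on the producer side is a one-line decorator around whatever function the application already uses to publish events. The sketch below is an assumption about how that could look, not a description of an existing library: enforce_contract, ContractViolation, publish_signup, and the in-memory SENT list (a stand-in for a real message bus) are all hypothetical names.

```python
import functools
from typing import Any, Callable


class ContractViolation(Exception):
    """Raised when an event does not match the contracted schema."""


def enforce_contract(schema: dict[str, type]) -> Callable:
    """Wrap a publish function so that invalid events never leave the producer."""
    def decorator(publish: Callable[[dict[str, Any]], Any]) -> Callable:
        @functools.wraps(publish)
        def wrapper(event: dict[str, Any]) -> Any:
            for field, expected in schema.items():
                if field not in event:
                    raise ContractViolation(f"missing field: {field}")
                if type(event[field]) is not expected:
                    raise ContractViolation(
                        f"{field}: expected {expected.__name__}, "
                        f"got {type(event[field]).__name__}"
                    )
            return publish(event)
        return wrapper
    return decorator


SENT: list[dict[str, Any]] = []  # stand-in for a real message bus


@enforce_contract({"user_id": str, "plan": str})
def publish_signup(event: dict[str, Any]) -> None:
    SENT.append(event)
```

With this in place, publish_signup({"user_id": "u1", "plan": "pro"}) goes through, while publish_signup({"user_id": 42, "plan": "pro"}) raises ContractViolation inside the producing service, where the application engineer who owns the code will see it. Whether to raise, log, or quarantine on violation is a design choice each team would have to make.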

Implications for AI applications

For AI applications, the stakes of data quality are higher than for analytical applications. A bad row in a dashboard produces a slightly wrong chart. Bad data in a training set produces a model with subtly corrupted behavior — and the corruption may not be visible until the model is running in production, where it's expensive to diagnose and expensive to retrain.

The vector databases and feature stores in the Flintrock portfolio are all downstream consumers of data that may have quality problems. The quality of what Chalk computes as features is bounded by the quality of its source data. The quality of what Qdrant retrieves is bounded by the quality of the documents that were indexed. Source-side quality enforcement is a prerequisite for the downstream infrastructure to work well. This is why the data contract and source-side validation category matters to us as infrastructure investors — it's the floor that the rest of the stack rests on.