AI Infrastructure

The AI Observability Gap

Yuki Nakashima September 18, 2025

Software observability is a mature discipline. The three pillars (metrics, logs, and traces) have well-understood implementations and a deep ecosystem of tools. If your API endpoint starts returning 500s, your monitoring dashboard surfaces it, your distributed tracing shows which service is at fault, and your logs give you the detail needed to diagnose the root cause. The operational feedback loop for software systems is tight and well tooled.

AI system observability is not mature. The operational feedback loop for AI applications is loose, the tooling is fragmented, and the fundamental instrumentation primitives aren't settled. This gap is not a missed product opportunity. It is a hard problem rooted in genuine differences between AI systems and conventional software, and those differences call for new primitives, not just new dashboards.

What makes AI observability different

The central difference is that conventional software systems have largely deterministic outputs that can be evaluated against a specification. A function either returns the correct value or it doesn't. An API either returns a 200 or it doesn't. You can write a unit test, a contract test, or an integration test that says "this input should produce this output" and run it mechanically.

LLM outputs are probabilistic and usually long-form text. There is no single "correct" answer for most real queries. Instead there are better and worse answers along multiple dimensions (accuracy, completeness, tone, safety, helpfulness), and assessing them requires human judgment or approximate automated evaluation. The quality of an LLM response can't be captured in a single metric the way API latency or error rate can. This makes monitoring AI system quality fundamentally different from monitoring software system quality.

What's needed

The first primitive is trace-level logging of the full context for each LLM call: the prompt, the retrieved context (for RAG applications), the model response, any post-processing, and user feedback if available. This is the equivalent of distributed tracing for AI applications. Most production AI applications today log outputs but not the full context that produced them, which makes it hard, and often impossible, to diagnose quality issues retroactively.
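To make the idea concrete, here is a minimal sketch of what a per-call trace record could look like. It is not taken from any particular tool. The `LLMTrace` fields, the `write_trace` and `read_traces` helpers, and the JSON Lines file are all assumptions chosen for illustration, and a production system would more likely send these records to a tracing backend than to a local file.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LLMTrace:
    """One record per LLM call, capturing everything that produced the output.

    Field names are hypothetical; adapt them to your own application.
    """
    prompt: str
    retrieved_context: List[str]          # RAG chunks; empty for non-RAG calls
    raw_response: str                     # model output before post-processing
    final_response: str                   # what the user actually saw
    model: str
    user_feedback: Optional[str] = None   # filled in later if the user reacts
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def write_trace(trace: LLMTrace, path: str) -> None:
    """Append the trace as one JSON line so it can be queried or replayed later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(trace)) + "\n")


def read_traces(path: str) -> List[dict]:
    """Load every stored trace, e.g. to investigate a quality regression."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The point of keeping the raw response, the final response, and the retrieved context side by side is that a bad answer can later be attributed to retrieval, to the model, or to post-processing, rather than only being observed at the output.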

The second primitive is systematic evaluation at scale. The tooling for this exists, in frameworks like LangSmith, Weave, and RAGAS, but it isn't fully integrated into the operational workflow of most teams. The gap is between "we have an eval framework we run manually" and "eval runs automatically on every deployment and gates releases the way test suites gate software deployments." Closing it requires investment in test case curation, in automated eval metrics, and in the CI/CD integration that makes evaluation operational.
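As a rough sketch of what "eval gates the release" could mean in practice, the example below scores a candidate application against a small curated test set and returns a process exit code that a CI job could act on. Everything here is a stand-in: `TEST_CASES`, the keyword-based `keyword_score` metric, and the 0.9 threshold are assumptions for illustration, and it does not use the API of LangSmith, Weave, or RAGAS. A real setup would swap in a stronger metric, such as a rubric or model-graded score.

```python
from typing import Callable, List, Tuple

# Hypothetical curated cases: each pairs a question with phrases a good answer must mention.
TEST_CASES = [
    {"question": "What is the refund window?", "must_mention": ["30 days"]},
    {"question": "Which plans include SSO?", "must_mention": ["Enterprise"]},
]


def keyword_score(answer: str, must_mention: List[str]) -> float:
    """Fraction of required phrases present; a crude stand-in for a real eval metric."""
    if not must_mention:
        return 1.0
    hits = sum(1 for phrase in must_mention if phrase.lower() in answer.lower())
    return hits / len(must_mention)


def run_eval_gate(
    generate: Callable[[str], str], cases: List[dict], threshold: float = 0.9
) -> Tuple[bool, float]:
    """Return (passed, mean score) for the candidate over all cases."""
    if not cases:
        raise ValueError("eval gate needs at least one test case")
    scores = [keyword_score(generate(c["question"]), c["must_mention"]) for c in cases]
    mean = sum(scores) / len(scores)
    return mean >= threshold, mean


def gate_exit_code(
    generate: Callable[[str], str], cases: List[dict], threshold: float = 0.9
) -> int:
    """Map the gate result to a process exit code: 0 lets the deploy proceed, 1 blocks it."""
    passed, _ = run_eval_gate(generate, cases, threshold)
    return 0 if passed else 1
```

A CI step could then call `sys.exit(gate_exit_code(my_app, TEST_CASES))`, where `my_app` is whatever function wraps the deployed application, so a drop in eval score fails the pipeline the same way a failing unit test would.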

DAGWorks, which we backed at pre-seed, addresses a specific variant of this problem: pipeline-level observability for data and AI workflows. The insight is that many AI quality problems are actually data pipeline problems that manifest downstream in model outputs. Bad features produce bad model outputs. Stale index data produces irrelevant RAG responses. Observing the pipeline, not just the model output, is necessary for full-stack AI observability. The tooling for this integration, running from source data through pipeline computation to model output, doesn't yet exist as a coherent product category.
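One way to picture the pipeline-to-output link is to attach pipeline lineage to every model response, so that an irrelevant RAG answer can be traced back to a stale index. The sketch below is an illustration only and does not describe how DAGWorks or any other product works. `IndexMetadata`, `is_stale`, `annotate_response`, and the notion of a `source_snapshot` identifier are all hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class IndexMetadata:
    """Hypothetical metadata a pipeline could emit each time it rebuilds a RAG index."""
    index_name: str
    built_at: datetime       # when the pipeline last rebuilt the index (timezone-aware)
    source_snapshot: str     # identifier of the source data version that was indexed


def is_stale(meta: IndexMetadata, max_age: timedelta, now: Optional[datetime] = None) -> bool:
    """True if the index is older than the allowed age."""
    now = now or datetime.now(timezone.utc)
    return now - meta.built_at > max_age


def annotate_response(
    response: str, meta: IndexMetadata, max_age: timedelta, now: Optional[datetime] = None
) -> dict:
    """Attach pipeline lineage to a model output so a bad answer can be traced upstream."""
    return {
        "response": response,
        "index_name": meta.index_name,
        "source_snapshot": meta.source_snapshot,
        "index_stale": is_stale(meta, max_age, now),
    }
```

With records like this, a spike in poor answers can be checked against `index_stale` and `source_snapshot` first, before anyone starts debugging prompts or the model itself.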