ML Infrastructure

Production ML Is Still Hard, and That's an Infrastructure Opportunity

I spent five years building production ML systems before joining Flintrock. The single most consistent thing I observed across different companies, different scales, and different ML application domains is this: the difference between a model that performs well in a notebook and a model that performs well in production is almost entirely an infrastructure problem.

The research and tooling discourse around ML is dominated by model architecture, training techniques, and benchmark results. The operational reality is dominated by feature pipelines, serving infrastructure, monitoring systems, and data quality problems. Most ML practitioners spend the majority of their time on the latter, not the former.

The failure modes that actually matter

Training-serving skew is the most common silent killer of production ML. It occurs when the features a model was trained on (their computation logic, their data sources, their latency characteristics) differ from the features it receives at serving time. Nothing fails loudly. The model simply performs worse than it did during evaluation, and diagnosing the gap is expensive and time-consuming without dedicated tooling.
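To make the skew problem concrete, here is a minimal sketch of the kind of check dedicated tooling automates: compare the feature values the offline pipeline computed for an entity with the values logged at serving time for the same entity. The function name, the feature names, and the data layout (a dict keyed by entity ID) are all assumptions for illustration, not any particular tool's API.

```python
from math import isclose


def find_skew(offline, online, rel_tol=1e-6):
    """Return, per feature, the fraction of shared entities whose values disagree.

    `offline` and `online` both map an entity key to a dict of
    feature name -> numeric value (hypothetical layout for this sketch).
    """
    mismatches = {}
    counts = {}
    # Only entities present on both sides can be compared.
    for key in offline.keys() & online.keys():
        for name, offline_value in offline[key].items():
            if name not in online[key]:
                continue
            counts[name] = counts.get(name, 0) + 1
            if not isclose(offline_value, online[key][name], rel_tol=rel_tol):
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: mismatches.get(name, 0) / total for name, total in counts.items()}


# Hypothetical feature values for two users.
offline = {
    "u1": {"orders_7d": 3.0, "avg_basket": 20.0},
    "u2": {"orders_7d": 5.0, "avg_basket": 31.5},
}
online = {
    "u1": {"orders_7d": 3.0, "avg_basket": 20.0},
    "u2": {"orders_7d": 4.0, "avg_basket": 31.5},  # serving path disagrees
}

rates = find_skew(offline, online)
print(rates)
```

In this toy data, `orders_7d` disagrees for one of the two shared users while `avg_basket` matches everywhere. A real check would also have to handle missing values, non-numeric features, and the time at which each value was computed.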

Model drift is real but often oversimplified in the tooling conversation. The important version of model drift is not "the input distribution has shifted statistically." It is "the relationship between inputs and the target label has changed in a way that matters for business outcomes," which is usually called concept drift. Detecting it requires understanding what the model is actually predicting and tracking business-relevant metrics, not just covariate shift statistics.
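The distinction above can be sketched in a few lines. The example below computes a population stability index (one common covariate shift statistic, chosen here as an assumption) for a single input feature, and separately compares accuracy between a reference window and a current window. The data, bin edges, and function names are hypothetical.

```python
from math import log


def psi(expected, actual, edges):
    """Population stability index of one feature over fixed bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached.
            counts[sum(v >= e for e in edges)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))


def accuracy(predictions, labels):
    """Fraction of predictions that match the observed labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical windows: the inputs are identical, so the model's
# predictions are identical, but the observed labels have changed.
ref_x = [1, 2, 3, 4, 5, 6, 7, 8]
cur_x = [1, 2, 3, 4, 5, 6, 7, 8]
preds = [0, 0, 0, 0, 1, 1, 1, 1]
ref_labels = [0, 0, 0, 0, 1, 1, 1, 1]
cur_labels = [0, 0, 1, 1, 1, 1, 0, 0]

input_shift = psi(ref_x, cur_x, edges=[3, 6])
accuracy_drop = accuracy(preds, ref_labels) - accuracy(preds, cur_labels)
print(input_shift, accuracy_drop)
```

Here the input statistic reports no shift at all while accuracy falls by half, which is the case a covariate-only monitor would miss. In practice labels often arrive late, so the outcome metric usually has to be a delayed or proxy business measure.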

Dependency hell in ML serving environments is underappreciated. A Python model with non-trivial dependencies (GPU libraries, native extensions, framework version constraints) is genuinely difficult to package, version, and deploy reliably. Every platform team that serves ML models has built something to manage this. Most of those solutions are hand-rolled and poorly maintained. The opportunity for a tool that makes model packaging and serving reliable is clear.
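As a small illustration of what those hand-rolled solutions tend to start as, the sketch below compares the versions a model was packaged against with what is installed in the serving environment. It uses the standard library's `importlib.metadata.version`; the function name `check_pins`, the package pins, and the fake environment are assumptions made up for the example.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins, installed_version=version):
    """Return {package: (wanted, found)} for every pin the environment violates.

    `installed_version` defaults to importlib.metadata.version and is
    injectable so the check can be exercised without a real environment.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


# Hypothetical serving environment, standing in for a real one.
def fake_environment(name):
    table = {"numpy": "1.26.4", "torch": "2.2.0"}
    if name not in table:
        raise PackageNotFoundError(name)
    return table[name]


# Hypothetical pins recorded when the model was trained.
pins = {"numpy": "1.26.4", "torch": "2.3.0", "xgboost": "2.0.3"}
problems = check_pins(pins, fake_environment)
print(problems)
```

Exact version matching is the easy part. Native extensions, GPU driver compatibility, and transitive dependencies are where this kind of check stops being enough, which is the gap purpose-built packaging tools aim to close.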

Why these problems persist

Each of these problems has been recognized for years. The fact that they persist reflects something important about infrastructure investment cycles: the tools that would solve them require practitioners to build them, but practitioners are busy solving today's ML problems, not tomorrow's infrastructure problems. This creates a gap that fills slowly.

The gap is closing. The companies Flintrock has backed (BentoML on serving, Chalk on feature freshness, Prefect on pipeline observability) are each addressing a specific piece of the production ML problem. What makes them interesting investments is not just that they address real pain, but that they address it with OSS-first distribution that lets practitioners adopt incrementally without a procurement cycle.

What changes with LLMs

Large language model applications have different production challenges than traditional ML, but the fundamental pattern is the same: the interesting infrastructure problems emerge when you try to take something that works in a demo and make it reliable at scale. For LLM applications, the production challenges are retrieval quality, prompt versioning, output reliability, latency management, and cost optimization. Each of these is an infrastructure category. Each is currently either unsolved or handled by hand-rolled implementations that purpose-built tools would do better.

The investment opportunity in ML infrastructure is larger now than it was when we started Flintrock, not smaller. The proliferation of LLM applications has created a much larger market for the infrastructure that makes them production-ready. The team and thesis we built for traditional ML infrastructure are, if anything, better positioned now than we expected.