The MLOps tooling category spent most of its first decade solving problems specific to classical machine learning: training large models on labeled datasets, managing experiment runs, versioning model artifacts, serving predictions at low latency. These are real problems and the tooling for them matured significantly from 2018 to 2023. Then LLMs changed the development model for AI applications in ways that invalidated significant parts of the MLOps stack.
The central shift is this: in classical ML, the model is what you build. You invest enormous effort in data collection, feature engineering, training pipelines, and hyperparameter optimization to produce a model artifact that gets deployed. In LLM applications, the model is what you start with. OpenAI or Anthropic or Mistral provides the model; you build the layer on top of it — the retrieval system, the prompt architecture, the context management, the output processing. This inversion changes almost everything about what infrastructure you need.
What MLOps tools don't transfer
Training pipelines and experiment tracking are almost entirely irrelevant for most LLM application teams. If you're not fine-tuning — and most production LLM applications today are not — you don't have training pipelines to manage and you don't have experiments in the classical sense of model checkpoint comparisons. Your "experiments" are prompt variations, context window sizes, retrieval strategies, and model routing decisions.
Feature stores still matter, but in a different form. The features that matter for LLM applications are less about numerical aggregates computed from event history and more about document chunks, conversation context, and retrieved passages. The computational patterns are different even if the organizational need (consistent context preparation across training and serving) remains.
What LLMOps actually needs
Prompt version control and change management are not well served by existing tooling. Changing a system prompt is a significant application change, but the infrastructure for reviewing, testing, deploying, and rolling back prompt changes is largely hand-rolled. This is a genuine gap.
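To make the gap concrete, here is a minimal sketch of what a hand-rolled prompt registry might look like. It is one possible shape under simple assumptions, not a description of any existing tool: the names PromptVersion, PromptRegistry, propose, deploy, and rollback are all hypothetical, and a real system would add review gates, persistence, and tests on top.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    """One immutable revision of a system prompt (hypothetical type)."""
    text: str
    author: str
    note: str

    @property
    def digest(self) -> str:
        # Short content hash so reviewers can confirm which text is deployed.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


@dataclass
class PromptRegistry:
    """Append-only history plus a pointer to the deployed revision."""
    history: list[PromptVersion] = field(default_factory=list)
    active_index: int = -1  # -1 means nothing has been deployed yet

    def propose(self, text: str, author: str, note: str) -> int:
        # Record a new revision without deploying it; return its index.
        self.history.append(PromptVersion(text, author, note))
        return len(self.history) - 1

    def deploy(self, index: int) -> None:
        if not 0 <= index < len(self.history):
            raise IndexError(f"no prompt revision {index}")
        self.active_index = index

    def rollback(self) -> None:
        # Assumption: rollback means "step to the previous revision in
        # history", not "return to whatever was deployed before".
        if self.active_index <= 0:
            raise RuntimeError("no earlier revision to roll back to")
        self.active_index -= 1

    @property
    def active(self) -> PromptVersion:
        if self.active_index < 0:
            raise RuntimeError("no prompt deployed")
        return self.history[self.active_index]
```

Keeping revisions immutable and deploying by index means a rollback never rewrites history; it only moves a pointer, which is roughly the property teams want from a prompt change process.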
LLM evaluation infrastructure — systematic frameworks for measuring output quality, consistency, and safety across prompt changes and model upgrades — is immature. The difficulty is fundamental: LLM outputs are long-form text and their quality is often context-dependent and hard to evaluate automatically. The teams doing this well have built bespoke evaluation pipelines that combine automated heuristics with human review. Tooling that makes this systematic is needed.
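As an illustration of the "automated heuristics plus human review" pattern, the sketch below runs a set of cheap checks over model outputs and flags any output that fails one for a human to look at. The function name evaluate, the EvalResult type, and the three example checks are assumptions made for this example; real checks depend heavily on the application.

```python
from dataclasses import dataclass
from typing import Callable

# A check takes the model output and returns True when the output passes.
Check = Callable[[str], bool]


@dataclass
class EvalResult:
    case_id: str
    failed_checks: list[str]

    @property
    def needs_human_review(self) -> bool:
        # Any failed heuristic routes the case to a human reviewer.
        return bool(self.failed_checks)


def evaluate(outputs: dict[str, str], checks: dict[str, Check]) -> list[EvalResult]:
    """Run every check on every output and record which ones failed."""
    results = []
    for case_id, text in outputs.items():
        failed = [name for name, check in checks.items() if not check(text)]
        results.append(EvalResult(case_id, failed))
    return results


# Illustrative heuristics only; these thresholds and phrases are made up.
checks: dict[str, Check] = {
    "non_empty": lambda text: bool(text.strip()),
    "under_500_chars": lambda text: len(text) <= 500,
    "no_ai_disclaimer": lambda text: "as an ai" not in text.lower(),
}
```

Running the same test cases through evaluate before and after a prompt change or model upgrade gives a crude regression signal, with the flagged cases forming the human review queue.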
Retrieval pipeline observability is another gap. A RAG application's quality depends on the quality of its retrieval — whether the right documents are being returned, whether the relevance scoring is calibrated correctly, whether chunking strategies are appropriate for the query distribution. Existing observability tools weren't designed to surface retrieval-specific failure modes.
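One way to start surfacing retrieval failures is to log, per query, which document IDs were returned and compare them against a small hand-labeled set of relevant IDs. The sketch below computes two standard retrieval metrics, recall at k and reciprocal rank. It assumes such labels exist and that documents have stable string IDs, which the text above does not specify.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    # Use a set so a document returned twice is not counted twice.
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: list[float]) -> float:
    """Average a metric across logged queries; 0.0 for an empty log."""
    return sum(values) / len(values) if values else 0.0
```

Tracking these averages over time, and slicing them by query type, is a simple way to notice when a chunking or scoring change has quietly degraded retrieval.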
The infrastructure opportunity
We think the LLMOps tooling market will produce several durable companies, but it will not produce them by rebranding MLOps tools. The products that win will be built around the development workflow of teams building LLM applications from foundation models — which means prompt management, retrieval evaluation, output testing, and model routing as first-class primitives. The teams building these tools need to understand that workflow deeply, which means they should probably have built LLM applications themselves before building infrastructure for others.