Model Infrastructure

Model Serving at the Edge: Latency, Cost, and the Architecture That Follows

James Thornton, March 14, 2026

The canonical architecture for AI application deployment in 2023 was simple: route all model calls to a centralized inference API, whether OpenAI's, Anthropic's, or your own GPU cluster. This architecture is excellent for development and early production. It's operationally simple, it requires no model deployment expertise, and it lets you iterate on the application layer without worrying about the infrastructure layer. For small to medium traffic volumes, the economics are acceptable.

At scale, the economics change. For applications that require sub-100ms response times or that serve high request volumes, the latency and cost of centralized GPU inference become significant constraints: every request pays a network round trip to the central region, and inference spend grows with traffic. This pressure pushes the architecture toward edge serving, which means deploying smaller, specialized models closer to the users and workloads they serve.

What "edge" means in practice

Edge serving doesn't always mean IoT devices or mobile phones. In most enterprise contexts, "edge" means geographically distributed cloud regions, CDN nodes, or on-premises infrastructure. The goal is to shorten the network distance between the serving infrastructure and the users or data sources, which lowers latency and cuts the bandwidth cost of routing data to a centralized cluster.
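To make the network-distance point concrete, here is a rough sketch of how much of a 100 ms budget is consumed by distance alone. It assumes signals travel through fiber at about 200 km per millisecond (roughly two thirds of the speed of light); the function names and the example distances are illustrative, and real paths add routing, queuing, and connection overhead on top of this lower bound.

```python
# Back-of-envelope network latency from distance alone.
# Assumption: propagation in optical fiber is about 200 km per millisecond.
# These are lower bounds; real round trips are slower.

FIBER_KM_PER_MS = 200.0  # approximate propagation speed in fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber for a given path length."""
    return 2 * distance_km / FIBER_KM_PER_MS


def inference_budget_ms(total_budget_ms: float, distance_km: float) -> float:
    """Time left for model inference after the minimum network round trip."""
    return total_budget_ms - min_round_trip_ms(distance_km)


if __name__ == "__main__":
    # Hypothetical path lengths, chosen only for illustration.
    paths = [("same metro", 50), ("cross-continent", 4000), ("intercontinental", 10000)]
    for label, km in paths:
        rtt = min_round_trip_ms(km)
        left = inference_budget_ms(100, km)
        print(f"{label:>16}: RTT >= {rtt:.1f} ms, {left:.1f} ms left of 100 ms")
```

Under these assumptions, a 10,000 km path uses the entire 100 ms budget before the model has done any work, which is why serving location matters for latency-sensitive applications.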

Modal's architecture is relevant here. The serverless GPU compute model (spinning up isolated containers with GPU access on demand and paying per second of compute) makes edge deployment economics tractable for production workloads that would previously have required dedicated GPU instances. The ability to deploy a fine-tuned model to multiple regions and pay only for the compute each request actually uses, without managing persistent GPU cluster allocations, changes the cost structure of edge serving significantly.
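The trade-off between per-second billing and a dedicated instance can be sketched as a break-even utilization. The prices below are hypothetical placeholders, not Modal's or any provider's actual rates, and the sketch ignores cold starts, batching, and minimum billing increments.

```python
# Break-even between per-second serverless GPU billing and a dedicated GPU.
# All prices are hypothetical and exist only to illustrate the arithmetic.

SECONDS_PER_HOUR = 3600


def serverless_cost_per_day(requests_per_day: float,
                            gpu_seconds_per_request: float,
                            price_per_gpu_second: float) -> float:
    """Daily cost when paying only for GPU seconds actually used."""
    return requests_per_day * gpu_seconds_per_request * price_per_gpu_second


def dedicated_cost_per_day(price_per_gpu_hour: float) -> float:
    """Daily cost of one always-on GPU instance, busy or idle."""
    return 24 * price_per_gpu_hour


def break_even_utilization(price_per_gpu_second: float,
                           price_per_gpu_hour: float) -> float:
    """Fraction of the day a GPU must be busy before dedicated is cheaper."""
    return price_per_gpu_hour / (price_per_gpu_second * SECONDS_PER_HOUR)


if __name__ == "__main__":
    per_second = 0.001   # hypothetical serverless rate, $/GPU-second
    per_hour = 1.80      # hypothetical dedicated rate, $/GPU-hour
    u = break_even_utilization(per_second, per_hour)
    print(f"Dedicated wins above {u:.0%} utilization per region")
```

With many edge regions each seeing a fraction of total traffic, per-region utilization tends to sit below a break-even point like this, which is the case where usage-based billing helps most.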

The model selection that follows from edge constraints

The practical implication of edge serving is that deployment requirements constrain model selection. Large frontier models (70B+ parameters) are not practical for edge serving on current hardware: at 16-bit precision, the weights of a 70B model alone take roughly 140 GB, which usually means splitting the model across multiple accelerators. The edge serving opportunity is in the smaller, task-specific tier: models in the 3B to 13B parameter range that have been fine-tuned on domain-specific data to perform well on a narrow set of tasks.
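The size constraint follows from simple arithmetic on weight memory. The sketch below estimates weights-only memory from parameter count and numeric precision; it deliberately leaves out the KV cache, activations, and runtime overhead, and the 80% headroom factor and 24 GB GPU size are assumptions chosen for illustration.

```python
# Weights-only memory estimate: parameters x bytes per parameter.
# Excludes KV cache, activations, and framework overhead, so real
# requirements are higher than these figures.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits_on_gpu(params_billions: float, precision: str,
                gpu_memory_gb: float, headroom: float = 0.8) -> bool:
    """True if weights fit in an assumed usable fraction of GPU memory."""
    return weight_memory_gb(params_billions, precision) <= gpu_memory_gb * headroom


if __name__ == "__main__":
    for size in (3, 7, 13, 70):
        for precision in ("fp16", "int8", "int4"):
            gb = weight_memory_gb(size, precision)
            ok = fits_on_gpu(size, precision, gpu_memory_gb=24)
            print(f"{size:>3}B {precision}: {gb:6.1f} GB, fits on 24 GB GPU: {ok}")
```

On these assumptions, a 7B model at fp16 fits on a single 24 GB GPU, a 13B model needs quantization to fit, and a 70B model does not fit even at 4-bit precision.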

This creates an interesting dynamic in the model market. The frontier models from Anthropic, OpenAI, and Google are optimal for capability-first applications where response quality matters more than latency or cost. The fine-tuned small model tier is optimal for latency-sensitive, cost-sensitive applications where the task is well-defined and the model can be specialized. Feature serving infrastructure — Chalk's real-time feature computation, for example — fits the latter pattern: the model inputs are structured, the outputs are predictable, and low latency matters more than general capability.

Infrastructure implications

The edge serving trend creates infrastructure requirements that centralized inference does not. Versioning and deployment pipelines must roll a model out consistently across many distributed locations rather than a single region. Monitoring and evaluation for models running in dozens of edge locations require different observability infrastructure than monitoring a single centralized API. The tooling for managing edge AI deployment at scale is in its early stages. This is a real infrastructure gap and a category where we're actively looking for investments.
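As one illustration of what fleet-level observability involves, the sketch below checks per-region reports for version drift and latency regressions. The EdgeReport structure, region names, and thresholds are all hypothetical; a real system would pull these values from its own metrics pipeline.

```python
# Minimal fleet health check across edge locations.
# EdgeReport and its fields are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class EdgeReport:
    region: str
    model_version: str
    p95_latency_ms: float


def find_version_drift(reports: Iterable[EdgeReport],
                       expected_version: str) -> List[str]:
    """Regions serving a model version other than the expected one."""
    return sorted(r.region for r in reports if r.model_version != expected_version)


def find_latency_outliers(reports: Iterable[EdgeReport],
                          budget_ms: float) -> List[str]:
    """Regions whose p95 latency exceeds the budget."""
    return sorted(r.region for r in reports if r.p95_latency_ms > budget_ms)


if __name__ == "__main__":
    fleet = [
        EdgeReport("us-east", "v12", 41.0),
        EdgeReport("eu-west", "v12", 58.0),
        EdgeReport("ap-south", "v11", 130.0),  # stale version and slow
    ]
    print("version drift:", find_version_drift(fleet, "v12"))
    print("over budget:", find_latency_outliers(fleet, 100.0))
```

Even this small check has no equivalent in a single-endpoint deployment, where there is only one version and one latency distribution to watch.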