Data Infrastructure

Embedding the Database: Why In-Process Analytics Changes the Game

James Thornton November 21, 2024

The database industry spent thirty years building better client-server systems. The topology was settled: a database server runs on dedicated hardware, clients connect over a network, and every query and its results cross that network. This architecture made sense when storage was expensive, when hardware was specialized, and when organizations ran tens of machines rather than thousands. It made less sense in every subsequent decade, but the gravity of the existing ecosystem kept it in place.

DuckDB represents a genuine architectural shift, not because it's faster (though it is) or because the SQL dialect is better (though it is), but because it fundamentally changes the deployment model. Running the database in-process, in the same memory space as the application, eliminates the network round trip between application and database, eliminates the operational complexity of managing a database server, and makes the database a library rather than a service. This sounds like a technical detail, but it changes who can use analytical databases and what they can use them for.

What in-process analytics unlocks

The most immediate unlock is for data scientists and ML engineers working in Python notebooks. Querying a Parquet file on local disk or S3 without spinning up a cluster or provisioning a warehouse is now a single pip install away. The friction reduction from "set up Redshift and wait for cluster provisioning" to "import duckdb and run a SQL query" is not marginal: it's the difference between using SQL for exploratory analysis and not using it at all.

The second unlock is for application developers building data-intensive features. Embedding analytics directly into the application removes the dependency on a separate analytical database service. A product that surfaces usage statistics, trend data, or aggregate views to users can compute these directly without routing queries through an external warehouse. This makes a category of applications practical that previously required dedicated data engineering support.

LanceDB extends this model to vector data. It is an embedded multimodal vector database that runs in-process alongside the application, persists to local disk or object storage, and supports the combination of vector similarity search with structured filtering that LLM applications need. This is the same architectural bet as DuckDB applied to the AI application layer.
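The pairing of similarity search with structured filtering can be illustrated with a small brute-force sketch in plain Python. This is not LanceDB's API or its indexing strategy; the row layout, the lang field, and the function names are assumptions made up for the example. It only shows the shape of the query: apply a structured predicate, then rank the surviving rows by vector similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(rows, query, predicate, k=2):
    """Keep rows that pass the structured filter, then rank by similarity."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(
        key=lambda row: cosine_similarity(row["vector"], query),
        reverse=True,
    )
    return candidates[:k]


# Hypothetical in-memory "table": one embedding plus one metadata column.
docs = [
    {"id": 1, "lang": "en", "vector": [1.0, 0.0]},
    {"id": 2, "lang": "de", "vector": [0.9, 0.1]},
    {"id": 3, "lang": "en", "vector": [0.0, 1.0]},
    {"id": 4, "lang": "en", "vector": [0.7, 0.7]},
]

hits = filtered_search(
    docs,
    query=[1.0, 0.0],
    predicate=lambda row: row["lang"] == "en",
)
print([hit["id"] for hit in hits])
```

Row 2 is close to the query but is excluded by the filter, which is the behavior an application needs when similarity alone is not a sufficient condition for a match.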

Where the model breaks down

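The embedded model has real limitations. It doesn't scale horizontally: you can't run an embedded database across a fleet of application servers and expect consistent query results, because each instance manages its own state independently. It doesn't support concurrent writes from multiple processes without careful engineering; DuckDB, for example, lets either a single process open a database file for reading and writing or multiple processes open it read-only. It's not appropriate for transactional workloads that require ACID guarantees across writes from different sources.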

These limitations are not deficiencies to be fixed but constraints that define the use case. Embedded databases are for read-heavy analytical workloads, for development and prototyping, and for application-local computation. They're not replacements for distributed systems. The companies that win in this space will be those that are clear about what the model is good at and build deep features within those constraints, rather than trying to stretch the model into use cases where a client-server architecture was always the right answer.

The question for the investment thesis

The interesting question for us as investors is whether in-process databases build defensible positions. DuckDB is MIT-licensed and free. LanceDB is open source under Apache 2.0. The monetization paths for both involve cloud services and enterprise features. The question is whether the community moat (the data ecosystem integrations, the connectors, and formats like Lance) creates switching costs that a cloud product can capitalize on. The answer isn't settled yet, but the directional bet that a major architectural shift creates major business opportunity seems right to us.