2006 · distributed systems · paper #18 / 29

Stasis: Flexible Transactional Storage

by Sears & Brewer

An increasing range of applications requires robust support for atomic, durable and concurrent transactions. Databases provide the default solution, but force applications to interact via SQL and to forfeit control over data layout and access mechanisms. We argue there is a gap between DBMSs and file systems that limits designers of data-oriented applications. Stasis is a storage framework that incorporates ideas from traditional write-ahead logging algorithms and file systems. It provides applications with flexible control over data structures, data layout, robustness, and performance. Stasis enables the development of unforeseen variants on transactional storage by generalizing write-ahead logging algorithms. Our partial implementation of these ideas already provides specialized (and cleaner) semantics to applications. We evaluate the performance of a traditional transactional storage system based on Stasis, and show that it performs favorably relative to existing systems. We present examples that make use of custom access methods, modified buffer manager semantics, direct log file manipulation, and LSN-free pages. These examples facilitate sophisticated performance optimizations such as zero-copy I/O. These extensions are composable, easy to implement and significantly improve performance.

Why this paper matters

Stasis (2006) arrived at the tail end of the “one size fits all” DBMS generation and exposed a critical fissure: monolithic SQL engines forced applications into a straitjacket of fixed schemas, opaque storage layouts, and rigid durability semantics, while raw file systems offered none of the ACID guarantees that real systems demanded. Sears and Brewer argued that the gap between DBMSs and file systems was neither inevitable nor desirable; instead, they proposed a programmable storage substrate that externalized control over data structures, logging, buffer management, and even page formats to the application developer. This was radical because it decoupled transactional semantics from a fixed SQL dialect and enabled bespoke storage engines that could still leverage decades of write-ahead logging (WAL) research without being shackled by it.

By 2026, the observation has only grown more relevant: modern distributed databases increasingly expose knobs that let users steer durability (sync vs. async commit), replication factor, compaction cadence, and even serialization format, yet the underlying tension between generality and performance remains. Stasis framed that tension explicitly, showing how WAL could be generalized into a library rather than a black box. In an era where vector databases must juggle high-dimensional embeddings, frequent updates, and millisecond ingest latency while still guaranteeing point-in-time recovery, Stasis’ insight (durability as a composable primitive rather than a global policy) feels prescient. The paper also anticipated the rise of programmable storage in cloud-native systems, where infrastructure teams can swap storage backends (e.g., from RocksDB to Pebble) without rewriting the application logic that depends on durability guarantees.

Key contributions

Formalized a modular storage framework that separates transactional semantics from access methods, letting applications plug in custom data structures (LSM trees, B+ trees, append-only logs) while reusing a shared WAL and recovery engine.
Introduced the notion of “LSN-free pages,” showing that per-page LSNs are not required for recovery when logged updates can be safely reapplied from the log. Dropping the per-page header lets data be laid out contiguously, which paved the way for techniques like zero-copy I/O and direct log manipulation.
Provided a reusable buffer-manager API that supports user-defined eviction policies, prefetching, and even direct log-file manipulation, enabling zero-copy reads when the application can parse the log footer to locate the latest committed version of a record.
Demonstrated composable performance optimizations (zero-copy I/O, direct log access, custom eviction) that could be mixed and matched without re-implementing crash recovery or isolation.
Delivered a partial reference implementation whose micro-benchmarks showed competitive throughput and latency against established DBMSs, validating the architectural claim that flexibility does not imply overhead.
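The first contribution above, custom operations plugged into a shared log and recovery engine, can be illustrated with a small sketch. This is not the Stasis C API: `MiniWAL`, `register`, `op_set`, and `op_incr` are hypothetical names, and the sketch assumes a redo-only scheme in which uncommitted work is simply skipped at recovery, whereas a full write-ahead logging system would also log undo information.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical names throughout; a toy model, not the Stasis API.
Store = Dict[str, int]
RedoFn = Callable[[Store, str, int], None]
LogEntry = Tuple[int, str, str, int]  # (transaction id, operation, key, argument)

COMMIT = "COMMIT"


class MiniWAL:
    """Shared log plus recovery; the operations themselves are supplied by the application."""

    def __init__(self) -> None:
        self.log: List[LogEntry] = []
        self.ops: Dict[str, RedoFn] = {}

    def register(self, name: str, redo: RedoFn) -> None:
        self.ops[name] = redo

    def append(self, xid: int, op: str, key: str, arg: int) -> None:
        if op not in self.ops:
            raise KeyError(f"unregistered operation: {op}")
        self.log.append((xid, op, key, arg))

    def commit(self, xid: int) -> None:
        self.log.append((xid, COMMIT, "", 0))

    def recover(self) -> Store:
        # Pass 1: find the transactions that reached their commit record.
        committed: Set[int] = {xid for xid, op, _, _ in self.log if op == COMMIT}
        # Pass 2: redo only their operations, in log order.
        store: Store = {}
        for xid, op, key, arg in self.log:
            if op != COMMIT and xid in committed:
                self.ops[op](store, key, arg)
        return store


def op_set(store: Store, key: str, arg: int) -> None:
    store[key] = arg


def op_incr(store: Store, key: str, arg: int) -> None:
    store[key] = store.get(key, 0) + arg


wal = MiniWAL()
wal.register("set", op_set)
wal.register("incr", op_incr)
wal.append(1, "set", "a", 10)
wal.append(1, "incr", "a", 5)
wal.commit(1)
wal.append(2, "set", "b", 99)  # transaction 2 never commits
recovered = wal.recover()
```

The recovery loop never inspects what an operation does; it only dispatches by name, which is the sense in which the log and the access methods are separable here.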

Impact on modern systems


Stasis’ architectural DNA is visible in several lineage-defining systems.

FoundationDB (2013) adopted a layered storage model that pushes data modeling and indexing into application-level layers while centralizing durability and atomicity in a shared transaction log. The design echoes Sears and Brewer’s separation of transactional semantics from data layout, and FoundationDB’s layered API, where users supply custom tuple layers, mirrors Stasis’ pluggable storage interface. Production deployments reportedly push 1-2 million writes/sec at median latency under 5 ms by composing user-defined layers with a shared WAL, an echo of Stasis’ composability thesis.

CockroachDB (2014) likewise decouples storage format (originally RocksDB, later Pebble) from transactional guarantees via a shared Raft-based log. Each node runs a storage engine behind a common interface, while the Raft log provides replicated durability across nodes and the transaction layer above it provides serializable isolation. CockroachDB’s migration from RocksDB to Pebble without changes to its transactional layer reflects the same separation of log and data layout that Stasis advocated. In 2024, CockroachDB introduced “vectorized” ingestion that bypasses the SQL layer and streams directly into SSTables via the storage layer API; this path is enabled by the same architectural division. The system also relies on LSM trees internally, showing how custom data structures can be composed with a shared durability substrate.

The influence is not only in distributed SQL. MongoDB’s WiredTiger storage engine (adopted in 2014) exposes a pluggable interface for compression, encryption, and collators while centralizing write-ahead logging and recovery. The collator interface, which lets users define custom serialization for BSON fields, traces back to Stasis’ argument that data layout should be a first-class choice. By 2025, MongoDB’s vector search tier uses the same collator API to store embeddings in a custom columnar format, proving the idea scales from OLTP to vector workloads. This modularity extends to eventually consistent deployments, where storage layers can be configured for weaker consistency while retaining core durability primitives.

Even systems that abandoned distributed transactions inherited Stasis’ modularity. ScyllaDB (a 2015 C++ rewrite of Cassandra) replaced Cassandra’s monolithic storage with Seastar’s user-space I/O framework, allowing applications to steer memory mapping, page cache behavior, and direct I/O paths, choices that Stasis framed as part of transactional correctness. ScyllaDB’s 2023 “epoch-based recovery” further pushes recovery logic into the storage layer, echoing Stasis’ LSN-free page concept, in which recovery state is derived from the log rather than from per-page metadata. The system’s focus on CAP theorem trade-offs (prioritizing availability and partition tolerance while delegating consistency to the application) highlights how Stasis’ ideas permeate modern NoSQL designs.
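The intuition behind recovering a page that carries no LSN can be sketched in a few lines. The example assumes every logged update is a blind write (it overwrites bytes without reading the old value), so replaying from a conservative starting point gives the same page no matter which updates had already reached disk. `BlindWrite` and `redo_lsn_free` are hypothetical names, and the recovery scheme in the paper is more involved than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlindWrite:
    """Hypothetical log entry: overwrite bytes at an offset, never read the old value."""
    lsn: int
    offset: int
    data: bytes


def redo_lsn_free(page: bytearray, log: List[BlindWrite], start_lsn: int) -> None:
    """Replay every entry from start_lsn onward; the page stores no LSN to compare against."""
    for entry in log:
        if entry.lsn >= start_lsn:
            page[entry.offset:entry.offset + len(entry.data)] = entry.data


log = [
    BlindWrite(lsn=1, offset=0, data=b"AAAA"),
    BlindWrite(lsn=2, offset=2, data=b"BB"),
    BlindWrite(lsn=3, offset=4, data=b"CC"),
]

# Case 1: the crash happened before any update reached disk.
fresh = bytearray(8)
redo_lsn_free(fresh, log, start_lsn=1)

# Case 2: the first two updates were already on disk at crash time.
partial = bytearray(b"AABB\x00\x00\x00\x00")
redo_lsn_free(partial, log, start_lsn=1)
```

Both pages end in the same state. A read-modify-write operation such as an increment would not be safe to replay this way, which is why the sketch restricts itself to blind writes.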

Google’s Bigtable (begun in 2004, with its paper appearing alongside Stasis at OSDI 2006) shares Stasis’ philosophy of separating storage format from durability. While Bigtable’s SSTable format and Chubby-based locking are fixed, the system’s layered design, where tablets are mapped to files and recovery is log-centric, resonates with Stasis’ modular approach. Similarly, Amazon DynamoDB (2012) abstracts storage behind a key-value interface but relies on a shared durability layer (spanning multiple availability zones) that aligns with Stasis’ composable durability model. These systems suggest that the separation of concerns Stasis proposed is not confined to research prototypes but underpins some of the most scalable databases in production.

In practice, the most durable legacy is the idea that transactional correctness and data layout are separable concerns. Modern systems often allow a choice among LSM, B+ tree, and columnar layouts without rewriting the durability engine, a flexibility that was hard to come by under the 1990s monolithic DBMS model. Stasis showed how to build such flexibility without sacrificing serializability or crash safety, a lesson that underpins everything from FoundationDB’s tuple layers to CockroachDB’s swappable storage backends. A similar modular view applies to consensus: in Paxos Made Simple, which predates Stasis, the protocol is presented as a self-contained building block rather than something bolted onto a storage engine.
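A minimal sketch of that separability: the same log is replayed into two different layouts, and neither the log format nor the replay loop changes. `HashLayout`, `SortedLayout`, and `replay` are hypothetical names chosen for illustration; the sorted layout is a plain sorted array standing in for a B+ tree.

```python
import bisect
from typing import Dict, List, Optional, Protocol, Tuple

LogRecord = Tuple[str, int]  # (key, value); a deliberately simplified log format


class Layout(Protocol):
    def apply(self, key: str, value: int) -> None: ...
    def get(self, key: str) -> Optional[int]: ...


class HashLayout:
    """Unordered layout: fast point lookups, no range scans."""

    def __init__(self) -> None:
        self.table: Dict[str, int] = {}

    def apply(self, key: str, value: int) -> None:
        self.table[key] = value

    def get(self, key: str) -> Optional[int]:
        return self.table.get(key)


class SortedLayout:
    """Ordered layout (a sorted array standing in for a B+ tree): supports range scans."""

    def __init__(self) -> None:
        self.keys: List[str] = []
        self.values: List[int] = []

    def apply(self, key: str, value: int) -> None:
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key: str) -> Optional[int]:
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def scan(self, lo: str, hi: str) -> List[Tuple[str, int]]:
        """Return all pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_left(self.keys, hi)
        return list(zip(self.keys[i:j], self.values[i:j]))


def replay(log: List[LogRecord], layout: Layout) -> Layout:
    """The durability side: identical for every layout."""
    for key, value in log:
        layout.apply(key, value)
    return layout


shared_log: List[LogRecord] = [("b", 2), ("a", 1), ("b", 3), ("c", 4)]
hashed = replay(shared_log, HashLayout())
ordered = replay(shared_log, SortedLayout())
```

Switching layouts changes which queries are cheap (only `SortedLayout` offers `scan`), while the logged history and its replay stay the same.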

AI era: how LLMs and vector databases relate to this paper

Vector databases face a dual pressure: embeddings must be ingested at millisecond-scale latency while still supporting atomic updates and point-in-time recovery. Stasis’ separation of durability from access methods maps directly onto this tension. Pinecone, Weaviate, and Qdrant all use write-ahead logging to guarantee durability of vector indexes, but each exposes a custom in-memory index (HNSW, IVF, or DiskANN-style structures) that is rebuilt or incrementally updated from the log. The pattern mirrors Stasis’ “plug custom data structures into a shared WAL” design: the vector index is a user-supplied access method, while the WAL provides atomicity and crash safety. In 2025, Weaviate introduced “on-disk incremental index building,” allowing the HNSW graph to be reconstructed from the log footer without a full scan, a design in the spirit of Stasis’ LSN-free page idea applied to high-dimensional vectors.
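The "index as access method, WAL as source of truth" pattern can be sketched as follows. A brute-force `FlatIndex` stands in for HNSW or IVF, and the record format is an assumption made for illustration; none of these names come from Pinecone, Weaviate, or Qdrant.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = Tuple[float, ...]
WalRecord = Tuple[str, str, Optional[Vector]]  # (operation, id, vector or None)


class FlatIndex:
    """Brute-force nearest-neighbor index; a stand-in for HNSW, IVF, and similar structures."""

    def __init__(self) -> None:
        self.vectors: Dict[str, Vector] = {}

    def upsert(self, vec_id: str, vec: Vector) -> None:
        self.vectors[vec_id] = vec

    def delete(self, vec_id: str) -> None:
        self.vectors.pop(vec_id, None)

    def nearest(self, query: Vector, k: int = 1) -> List[str]:
        ranked = sorted(self.vectors, key=lambda i: math.dist(query, self.vectors[i]))
        return ranked[:k]


def rebuild(wal: List[WalRecord]) -> FlatIndex:
    """Recreate the in-memory index purely from the log, in log order."""
    index = FlatIndex()
    for op, vec_id, vec in wal:
        if op == "upsert" and vec is not None:
            index.upsert(vec_id, vec)
        elif op == "delete":
            index.delete(vec_id)
    return index


wal: List[WalRecord] = [
    ("upsert", "a", (0.0, 0.0)),
    ("upsert", "b", (1.0, 1.0)),
    ("upsert", "c", (5.0, 5.0)),
    ("delete", "a", None),
    ("upsert", "c", (0.9, 1.2)),  # a later record supersedes the earlier one
]
index = rebuild(wal)
top2 = index.nearest((1.0, 1.0), k=2)
```

Replacing `FlatIndex` with a graph-based index would change only the access method; `rebuild` and the log would stay as they are.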

RAG systems intensify the demand for composable storage. An LLM agent’s state (tool calls, memory buffers, retrieval contexts) must be durably logged yet quickly accessible. Vector stores handle this in different ways: Milvus manages its own buffers, keeping embeddings in memory while streaming updates to the WAL, whereas pgvector (a PostgreSQL extension) relies on PostgreSQL’s buffer manager and WAL. The ability to flush only the changed pages, or to parse the log directly to reconstruct the latest vector state, applies Stasis’ ideas about buffer-manager control and direct log manipulation. Milvus’ 2024 “LSM-vector” mode, which stores embeddings in an LSM tree backed by a shared WAL, is a textbook example of Stasis’ composability: the embedding index is an access method, the WAL is the durability substrate, and the buffer manager is a user-defined policy for eviction and prefetch.

Semantic indexes, where vector similarity is used to guide query routing, further depend on fast recovery. Qdrant’s 2024 “fast snapshot” feature rebuilds the HNSW index from the WAL footer without reading the entire log, cutting recovery time from minutes to seconds. This mirrors Stasis’ argument that recovery need not be page-centric; if the log contains sufficient metadata to locate the latest committed vector record, a full page scan is unnecessary. LLM inference caches raise a related buffer-management question: systems like vLLM and TensorRT-LLM manage key-value caches in custom block pools, although those caches are volatile and are recomputed after a restart rather than recovered from a log.
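One common way to avoid replaying an entire log is to pair a snapshot with the log offset it covers and replay only the suffix. The sketch below shows that general technique under simplifying assumptions (an in-memory list as the log, plain key-value records); `take_snapshot` and `recover` are hypothetical names and do not describe how Qdrant or any other product implements recovery.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # (key, value)


def take_snapshot(state: Dict[str, int], log_offset: int) -> dict:
    """Capture the state together with the number of log records it already reflects."""
    return {"log_offset": log_offset, "state": dict(state)}


def recover(snapshot: dict, log: List[Record]) -> Tuple[Dict[str, int], int]:
    """Start from the snapshot and replay only the records appended after it."""
    state = dict(snapshot["state"])
    tail = log[snapshot["log_offset"]:]
    for key, value in tail:
        state[key] = value
    return state, len(tail)


full_log: List[Record] = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# A snapshot taken after the first three records were applied.
snap = take_snapshot({"a": 3, "b": 2}, log_offset=3)
fast_state, fast_replayed = recover(snap, full_log)

# Baseline: an empty snapshot forces a full replay.
slow_state, slow_replayed = recover(take_snapshot({}, log_offset=0), full_log)
```

Both paths reach the same state; the snapshot path touches two records instead of five, and the gap grows with the length of the log.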

LLM-driven query planning also inherits Stasis’ modularity. Modern vector databases expose planner hooks that let users inject custom index traversal policies. These policies run in the storage layer, sometimes bypassing the SQL frontend entirely. In pgvector, for example, PostgreSQL’s planner can route a nearest-neighbor query to an HNSW index through the index access method interface, while the WAL guarantees atomicity. This division of labor (planning in the application, durability in the storage layer) is the separation Stasis advocated. The planner itself may use embeddings to optimize routing, creating a feedback loop where storage and retrieval policies are co-designed.

Agent state stores mirror Stasis’ idea of a programmable buffer manager. An LLM agent’s world state (observations, tool outputs, embeddings) must be durably logged while remaining quickly accessible. Systems like LangChain’s “durable memory” and custom-built state stores use a shared WAL with user-defined eviction policies to keep hot embeddings in memory while aging out cold state. The ability to hot-swap eviction policies without rewriting crash recovery is a direct legacy of Stasis’ buffer-manager API. Some systems even use eventually consistent durability models for non-critical state, further extending Stasis’ composability to cover trade-offs between consistency and performance.
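A programmable buffer manager of this kind can be sketched by passing the eviction policy in as a function. `BufferPool`, `lru`, and `mru` are hypothetical names, the "disk" is a dictionary, and the sketch omits the write-ahead rule itself (a real pool must make the relevant log records durable before writing a dirty page back).

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Set

# A policy receives page ids ordered from least to most recently used and picks a victim.
Policy = Callable[[List[str]], str]


class BufferPool:
    """Fixed-capacity page cache with a caller-supplied eviction policy."""

    def __init__(self, capacity: int, disk: Dict[str, bytes], pick_victim: Policy) -> None:
        self.capacity = capacity
        self.disk = disk
        self.pick_victim = pick_victim
        self.frames: "OrderedDict[str, bytes]" = OrderedDict()
        self.dirty: Set[str] = set()

    def _evict(self) -> None:
        victim = self.pick_victim(list(self.frames))
        data = self.frames.pop(victim)
        if victim in self.dirty:
            # Write-back on eviction. A WAL-based system would first ensure the
            # log records covering this page are durable (omitted in this sketch).
            self.disk[victim] = data
            self.dirty.discard(victim)

    def _load(self, page_id: str) -> None:
        if page_id not in self.frames:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_id] = self.disk.get(page_id, b"")
        self.frames.move_to_end(page_id)  # mark as most recently used

    def read(self, page_id: str) -> bytes:
        self._load(page_id)
        return self.frames[page_id]

    def write(self, page_id: str, data: bytes) -> None:
        self._load(page_id)
        self.frames[page_id] = data
        self.dirty.add(page_id)


def lru(order: List[str]) -> str:
    return order[0]


def mru(order: List[str]) -> str:
    return order[-1]


def run(policy: Policy):
    """Same access pattern under a given policy: write p1, read p2, read p3."""
    disk = {"p1": b"1", "p2": b"2", "p3": b"3"}
    pool = BufferPool(capacity=2, disk=disk, pick_victim=policy)
    pool.write("p1", b"x")
    pool.read("p2")
    pool.read("p3")  # the pool is full here, so the policy chooses a victim
    return list(pool.frames), disk


lru_frames, lru_disk = run(lru)
mru_frames, mru_disk = run(mru)
```

Under `lru` the dirty page `p1` is evicted and written back; under `mru` the clean page `p2` is dropped and `p1` stays cached. The write-back path in `_evict` is the same in both runs, so the policy can change without touching it.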

Embedding serving pipelines push Stasis’ ideas further by decoupling serialization from storage. Systems like Chroma and LanceDB treat embeddings as immutable vectors that are periodically compacted, with updates appended to a new log segment. This log-structured approach aligns with Stasis’ “LSN-free pages” concept, where recovery is driven by log parsing rather than page metadata. The immutability of vectors also simplifies concurrency control, as updates can be treated as new versions rather than in-place modifications, a pattern that echoes Stasis’ separation of durability from access methods.
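The append-then-compact pattern described here can be sketched with a single in-memory segment. The entry format and the names `append`, `latest`, and `compact` are assumptions made for illustration and are not taken from Chroma or LanceDB.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Entry = Tuple[str, int, Vector]  # (id, version, vector); entries are never modified


def append(segment: List[Entry], vec_id: str, vec: Vector) -> None:
    """An update is a new version appended to the segment, not an in-place change."""
    version = 1 + max((v for i, v, _ in segment if i == vec_id), default=0)
    segment.append((vec_id, version, vec))


def latest(segment: List[Entry]) -> Dict[str, Tuple[int, Vector]]:
    """Resolve the current state by scanning in order; later entries win."""
    current: Dict[str, Tuple[int, Vector]] = {}
    for vec_id, version, vec in segment:
        current[vec_id] = (version, vec)
    return current


def compact(segment: List[Entry]) -> List[Entry]:
    """Write a new segment holding only the newest version of each id."""
    return [(vec_id, ver, vec) for vec_id, (ver, vec) in sorted(latest(segment).items())]


segment: List[Entry] = []
append(segment, "a", (0.1, 0.2))
append(segment, "b", (0.3, 0.4))
append(segment, "a", (0.5, 0.6))  # supersedes version 1 of "a"
compacted = compact(segment)
```

Compaction shrinks the segment without changing what `latest` reports, so readers can move from the old segment to the new one without observing a different state.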

Finally, the rise of vector databases built on top of relational engines (e.g., pgvector, SQL Server’s vector extensions) shows how Stasis’ modularity can bridge paradigms. These systems reuse the relational engine’s WAL and buffer manager while exposing vector-specific access methods. The result is a hybrid storage model where traditional rows and embeddings coexist under the same durability substrate, proving that Stasis’ ideas scale beyond niche vector workloads.
