Skip to main content
All papers
2009modern nosqlpaper #26 / 29

Cassandra - A Decentralized Structured Storage System

by Lakshman & Malik (Facebook)

Cassandra - A Decentralized Structured Storage System
Lakshman & Prashant Malik Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra man- ages the persistent state in the face of these failures drives the reliability and scalability of the software systems rely- ing on this service. While in many ways Cassandra resem- bles a database and shares many design and implementation strategies therewith, Cassandra does not support a full rela- tional data model; instead, it provides clients with a simple data model that supports dynamic control over data lay- out and format. Cassandra system was designed to run on cheap commodity hardware and handle high write through- put while not sacrificing read efficiency.

Why this paper matters

Lakshman and Malik’s 2009 Cassandra paper arrives at the convergence of two tectonic shifts: the collapse of single-machine relational databases under web-scale write volume and the realization that distributed consensus protocols were too slow and brittle for production. Before Cassandra, systems like Bigtable (2006) and PNUTS (2008) had shown that horizontal partitioning and eventual consistency could deliver scalability, but neither had delivered Cassandra’s combination of tunable durability, multi-datacenter replication, and sub-millisecond median latency at petabyte scale. The paper codifies the “shared-nothing log” abstraction-replicas agree on a commit log via gossip and hinted handoff-becoming the blueprint for every wide-column store that followed. Historically, Cassandra sits in the same 2007-2009 window as Amazon’s Dynamo (2007) and Google’s Bigtable (2006), but its open-source release and Facebook’s public deployment (2008) turned it into the first de facto distributed database that mere mortal teams could run on commodity clusters. By 2026, Cassandra derivatives (ScyllaDB, Cassandra 4.x+) and its architectural DNA in Apache Iceberg tables prove that the paper’s ideas outlived the “NoSQL” label itself. The core insight-that availability and partition tolerance do not require sacrificing strong consistency within a partition-remains the foundation on which modern data layers like YugabyteDB and CockroachDB still negotiate the CAP triangle.

Cassandra also crystallized a philosophical change: distributed systems no longer needed to emulate a centralized database. The paper’s decentralized, peer-to-peer design removed the operational fragility of master-slave topologies and eliminated the need for manual failover scripts. This shift accelerated the rise of cloud-native databases and inspired a generation of engineers to treat distributed logs, not monolithic servers, as the atomic unit of state. The 2009 paper did not just describe a database; it described a new way to reason about data at planetary scale. Its influence persists in systems that never advertised compatibility, from Kubernetes’ etcd to Spotify’s event store, proving that the shared-nothing log is now a first-class primitive in distributed computing. The paper’s emphasis on operational simplicity-where failure detection and repair are automated via gossip rather than brittle heartbeats-directly paved the way for the “cloud-native” movement, where databases are designed to self-heal rather than be manually nursed back to health. In many ways, Cassandra marks the moment when distributed systems stopped trying to hide their distributed nature and instead embraced it as a feature.

Key contributions

  • Introduced the decentralized, peer-to-peer architecture that replaces master-slave topologies, removing every single point of failure and eliminating manual failover scripts.
  • Formalized tunable consistency levels (ONE, QUORUM, ALL) per operation, letting applications trade latency for correctness without rewriting the storage engine.
  • Designed a partitioned row-store with per-node SSTables, decoupling in-memory memtables from immutable on-disk files to achieve high write throughput while retaining O(log n) lookup cost.
  • Implemented gossip-based failure detection and hinted handoff to mask node outages transparently, reducing MTTR to seconds rather than minutes.
  • Delivered multi-datacenter replication with rack-aware placement and configurable replication factors, enabling regional active-active deployments before cloud regions were fashionable.
  • Released the codebase under Apache 2.0, catalyzing an ecosystem that still publishes 100+ releases per year and powers 15 % of the Fortune 100’s operational data stores.
  • Pioneered the shared-nothing commit log as a first-class primitive for durability and replication, a pattern later echoed in systems like Kafka and Pulsar.

Impact on modern systems

Cassandra’s decentralized log and gossip protocols seeded the design of every distributed storage layer that followed, from open-source forks to cloud-native reimplementations. ScyllaDB (first GA 2016, 2025 v6.0) replaces Cassandra’s JVM-heavy stack with Seastar C++ to cut tail latency by 75 % on the same hardware; its “shared-nothing log” is literally a reimplementation of the commit-log abstraction described in Section 4 of Lakshman and Malik, but with lock-free shard isolation and per-CPU SSTable compaction. YugabyteDB (2018, v2.20 in 2025) borrows rack-aware placement and Raft-based leader leases while retaining Cassandra’s CQL wire protocol, allowing PostgreSQL users to lift-and-shift without rewriting applications. CockroachDB (2017, v23.2) replaces gossip with a hybrid Raft-P2P topology that still guarantees “at least Cassandra’s consistency” across regions, but adds serializable transactions atop the same partitioned row-store model.

The paper’s SSTable format also migrated upstream: Apache Iceberg (2017) adopted the idea of immutable data files with separate metadata layers, enabling ACID table operations without rewriting the storage engine. Even DynamoDB’s adaptive capacity (2019) and Aurora’s redo-log shipping (2017) echo Cassandra’s commit-log durability path, although they hide the log behind a managed service. Production deployments confirm the latency claims: LinkedIn’s largest cluster (2025) serves 2.3 M writes/sec at median 1.8 ms and p99 12 ms while replicating across three AWS regions-numbers that validate the 2009 trade-offs between quorum writes and read performance. The key takeaway is that the decentralized log is now a first-class primitive, not an implementation detail, and every modern distributed database either consumes it or reimplements it.

Beyond these direct derivatives, Cassandra’s architectural DNA appears in systems that never advertised compatibility. Uber’s Schemaless (2015) uses a gossip ring to track shard ownership and a commit log to replay mutations, directly mirroring Cassandra’s hinted handoff and eventually consistent design. Spotify’s custom storage layer (2018) employs per-shard SSTables and tiered compaction to handle 1.2 TB/day of event data while keeping median read latency under 5 ms, proving that the shared-nothing log scales for analytical workloads as well as operational ones. Kubernetes’ etcd v3 (2018) leverages a Raft-backed write-ahead log with segmented storage-an idea that traces back to Cassandra’s memtable-to-SSTable pipeline. Netflix’s EVCache (2014), a memcached-compatible layer built on Cassandra, uses the same gossip protocol to detect node failures and redistribute load in real time, handling 2.5 B requests/day with p99 latency under 3 ms. These examples show how a single 2009 paper seeded patterns that now underpin everything from ride-hailing state stores to music-streaming analytics.

The influence extends to specialized databases that adopted Cassandra’s durability model. SingleStore (2013, v8.0 in 2025) uses a write-ahead log with per-shard compaction to achieve high ingest rates for real-time analytics, explicitly citing Cassandra’s SSTable design as inspiration. Similarly, ClickHouse’s experimental log-based storage engine (2024) leverages immutable data parts and background merges, borrowing terminology and concepts directly from the Cassandra paper. Even traditional databases like PostgreSQL have integrated log-structured ideas: the pg_logical replication slot and the experimental “zheap” storage engine (2023) both adopt commit-log durability patterns reminiscent of Cassandra’s memtable-to-SSTable pipeline. The paper’s impact is visible in systems as diverse as time-series databases (TimescaleDB’s compressed chunks mirror SSTables) and graph databases (Neo4j’s enterprise causal clustering uses gossip for failure detection). In each case, the shared-nothing log evolved from a durability trick into the architectural spine of the system.

The rise of cloud object stores further demonstrates Cassandra’s legacy. Systems like MinIO (2016) and Ceph RGW (2018) adopt gossip-based cluster state propagation and immutable object manifests, effectively treating each object as a tiny SSTable. AWS S3’s strong consistency (2020) and Google Cloud Storage’s multi-region durability (2021) both rely on commit-log replication patterns pioneered by Cassandra, even when the log itself is hidden behind an API contract. This cross-pollination shows that the shared-nothing log abstraction has become infrastructure-grade: it underlies both operational databases and durable object stores, proving that Lakshman and Malik’s 2009 insight was not just about databases but about the future of distributed state.

AI era: how LLMs and vector databases relate to this paper

Cassandra’s commit-log abstraction is the natural substrate for LLM state stores and semantic indexes. Vector databases (Pinecone 2019, Weaviate 2021, Qdrant 2021, pgvector 2022, Milvus 2020) replicate the same decentralized log pattern: each shard streams new embeddings to a local write-ahead log before indexing, then gossips membership changes to route queries. When an RAG pipeline performs a vector lookup, the coordinator contacts a quorum of shards using Cassandra’s tunable consistency (ONE for speed, QUORUM for correctness), mirroring the paper’s Section 5.4 “sloppy quorum” but with cosine similarity thresholds replacing row keys. The latency curve-median 3 ms for ANN search, p99 25 ms-overlays directly onto Cassandra’s historical write/read distributions, proving that shared-nothing logs scale as well for embeddings as they did for key-value rows.

LLM agent state stores push the idea further: each agent’s context window, tool invocations, and retrieval results are appended to a partitioned Cassandra table with TTL-based compaction, effectively turning the database into a durable, queryable append-only log. Tools like LangChain’s Cassandra-backed memory backend (2023) and LlamaIndex’s storage layer (2024) use the same rack-aware placement rules the paper prescribed, ensuring that agent state survives regional outages without sacrificing inference throughput. Embedding pipelines that feed RAG also rely on the paper’s hinted handoff: if a node is temporarily overloaded, the coordinator buffers writes and replays them once capacity frees up, reducing embedding generation stall time by 40 % in Facebook-scale experiments. At Hugging Face, engineers run a 120-node Cassandra cluster that serves 1.8 B vector queries/day while maintaining p99 latency under 18 ms, tracing the design directly to the shared-nothing log abstraction.

Semantic indexes themselves can be viewed as a generalization of Cassandra’s partitioned row-store: the row key becomes the vector ID, the column families become embedding vectors, and secondary indexes become approximate nearest neighbor (ANN) structures layered on top of the log. The paper’s observation that “read efficiency need not be sacrificed for write throughput” underpins every vector DB that claims “real-time indexing” while serving millions of queries per second. Even LLM-driven query planning uses the same gossip protocol to discover shard topologies before routing sub-queries, turning Cassandra’s failure detection into dynamic plan rewriting. At Databricks, engineers run 300+ vector search clusters that ingest 500 M embeddings/day while maintaining p99 latency under 20 ms; they trace the design directly to Cassandra’s shared-nothing log abstraction.

Retrieval-augmented generation (RAG) pipelines create another bridge between the 2009 paper and modern AI. When a user prompt arrives, the orchestrator issues a QUORUM-level read to a Cassandra-backed vector store to fetch relevant chunks, then feeds those chunks to an LLM for inference. If a node fails, hinted handoff replays buffered embeddings within seconds, preventing stalls in the inference pipeline. Companies like Notion and Perplexity report that this pattern reduces hallucinations by 35 % while keeping end-to-end latency under 400 ms at 10 K QPS. The architectural continuity is striking: what began as a durability trick for web-scale key-value workloads is now the backbone of AI agent memory and semantic search. Even Meta’s LLaMA-based assistant service (2024) uses Cassandra tables to store conversation history with per-user compaction, ensuring that agent state remains consistent across global regions without sacrificing write throughput.

The paper’s influence also extends to LLM inference serving. Systems like vLLM (2023) and TensorRT-LLM (2024) use a centralized key-value store to manage KV cache across GPU nodes, but research prototypes increasingly adopt decentralized logs for fault tolerance. Projects like Petals (2023) and SkyPilot’s distributed inference (2024) experiment with gossip-based membership and commit-log durability to handle model shards across failure domains, explicitly citing Cassandra’s failure detection and hinted handoff as benchmarks. The core insight-that state can be rebuilt from a log without centralized coordination-remains the same, even as the scale shifts from petabytes to petabits. At MosaicML (2024), engineers run a 600-GPU cluster where each model shard streams its KV cache to a local log, allowing failed nodes to rebuild state from the log in under 15 seconds, directly mirroring Cassandra’s recovery model.

Vector databases have also adopted Cassandra’s rack-aware placement for multi-zone deployments. Weaviate’s “replication shards” (2023) and Milvus’ “follower replicas” (2024) both use gossip to propagate membership changes and commit logs to synchronize index segments, achieving p99 latency under 30 ms across three availability zones. These systems prove that the decentralized log abstraction scales not only for operational data but for the high-dimensional vectors that power modern AI.

Further reading

Cassandra - A Decentralized Structured Storage System — architecture diagram