Six categories · one archive

Topics, not silos.

The papers and articles are organized into six categories. Each one is a doorway into a different era of distributed data thinking.

Six categories of distributed database knowledge — nosqlsummer taxonomy

Classic papers

7 papers · 2 articles

Foundational works that defined how we think about data and distributed computation — Codd, Lamport, Gray, Brewer.

Distributed systems

11 papers · 2 articles

Coordination, replication, consensus, and the practical engineering of internet-scale services.

Modern NoSQL

5 papers · 0 articles

Cassandra, PNUTS, YCSB, graph databases — the post-2008 NoSQL landscape and its benchmarks.

AI & databases

0 papers · 1 article

Vector databases, RAG, embeddings — how LLMs are reshaping distributed data systems.

Case studies

4 papers · 0 articles

CAP, BASE, harvest/yield — the trade-off frameworks engineers actually reach for.

Tutorials

2 papers · 3 articles

Walkthroughs of LSM-trees, CRDTs, and the data structures behind modern storage engines.

Why six categories?

In 2009, when the NoSQL movement first began to coalesce, the taxonomy was simple: a single category for "papers we're reading." The focus was on the handful of foundational works that catalyzed a shift away from traditional relational databases. But a decade and a half later, the landscape has fractured into distinct intellectual domains — each with its own language, challenges, and research communities. To reflect this evolution, we've expanded the taxonomy into six categories that mirror how the field has stratified: from theoretical underpinnings to practical deployment, from fault tolerance to AI infrastructure.

The most significant addition is AI & Databases. In 2009, vector databases and distributed inference were not part of the mainstream database stack, and embeddings were largely confined to machine-learning research. Today they are central to that stack, and their design is deeply influenced by decades-old systems papers: LSM-trees power storage layers, CRDTs enable real-time collaboration, and BASE semantics suit AI workloads that tolerate eventual consistency. The six categories don't exist in silos; they form a living network of influence. A modern distributed database draws on Classic Papers for its theoretical roots, borrows Distributed Systems techniques for fault tolerance, adopts Modern NoSQL patterns for scalability, looks to AI & Databases for vector search and retrieval, learns from Case Studies about real-world failure, and relies on Tutorials for reproducibility.

Deep dive into each category

Classic Papers

The papers in this category are not relics; they are the DNA of every database system you use today. E.F. Codd's relational model (1970) did more than lay the groundwork for SQL: it introduced data independence, normalization, and declarative querying. Jim Gray's transaction concept (1981) spelled out the atomicity, consistency, and durability guarantees later summarized as ACID, turning database systems from dumb storage into transactional engines. The access path selection paper (1979) introduced the idea that databases could optimize queries automatically, a concept now embedded in every query planner from PostgreSQL to Snowflake. NoSQL didn't reject the relational model so much as rebel against its implementations: clunky monoliths, rigid schemas, and vertical scaling. The return of relational thinking in distributed SQL systems like CockroachDB and YugabyteDB suggests that Codd's vision was never wrong, just premature. Codd's relational model paper is the natural starting point for anyone wanting to understand why NoSQL exists.

Distributed Systems

The papers in this category are the operating system of the internet. Leslie Lamport's time clocks paper (1978) introduced logical clocks, which remain the foundation of distributed tracing, vector clocks, and causality tracking in modern systems. The Byzantine Generals Problem (1982) didn't just describe fault tolerance; it defined the limits of consensus in the presence of malicious actors, foreshadowing blockchain consensus mechanisms decades before Bitcoin. Paxos Made Simple (2001) reduced consensus to its essence, and its lessons are echoed in Raft, etcd, and many distributed lock services. Nearly every distributed system you run today implements some version of these ideas, whether through leader election, quorum-based writes, or conflict resolution.
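As a minimal sketch of the logical clocks mentioned above, assuming two processes that exchange a single message (the class, method, and variable names are illustrative and not taken from Lamport's paper):

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Scalar logical clock for one process."""
    time: int = 0

    def tick(self) -> int:
        # Advance the clock for a local event (sending counts as an event).
        self.time += 1
        return self.time

    def receive(self, message_time: int) -> int:
        # On receipt, move past both the local clock and the sender's stamp.
        self.time = max(self.time, message_time) + 1
        return self.time


# Two hypothetical processes, A and B.
a, b = LamportClock(), LamportClock()
a.tick()                           # local event on A
sent_at = a.tick()                 # A sends a message stamped with its clock
b.tick()                           # unrelated local event on B
received_at = b.receive(sent_at)   # B receives A's message

# The send is always ordered before the matching receive.
assert sent_at < received_at
```

The timestamps give an ordering consistent with causality, but two events with different timestamps are not necessarily causally related; that limitation is what vector clocks address.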

Modern NoSQL

From the mid-2000s to the early 2010s, a Cambrian explosion of distributed storage systems reshaped the data landscape. Google's MapReduce (2004) showed how to process massive datasets in parallel. Google's Bigtable (2006) showed that a sparse, column-family data model could scale to petabytes. Amazon's Dynamo (2007) popularized eventual consistency and gossip-based membership in a production key-value store. Cassandra, built at Facebook and open-sourced in 2008, combined Dynamo's availability model with Bigtable's data model. These papers aren't just historical artifacts; they solved problems that remain central today: availability vs. consistency trade-offs, wide-column data modeling, and batch processing at internet scale.
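Dynamo exposes the availability vs. consistency trade-off through three tunable numbers: N replicas per key, a read quorum R, and a write quorum W. The sketch below is an illustration under simplifying assumptions (replicas are plain integer ids, and sloppy quorums and hinted handoff are ignored); the function name is hypothetical. It checks by brute force that every read quorum shares at least one replica with every write quorum exactly when R + W > N:

```python
from itertools import combinations


def quorums_always_intersect(n: int, r: int, w: int) -> bool:
    """Return True if every r-replica read set overlaps every w-replica write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


# N=3 with R=2, W=2: any read sees at least one replica holding the latest write.
assert quorums_always_intersect(3, 2, 2)
# N=3 with R=1, W=2: a read can land on the one replica the write skipped.
assert not quorums_always_intersect(3, 1, 2)

# The brute-force check agrees with the R + W > N rule for small clusters.
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert quorums_always_intersect(n, r, w) == (r + w > n)
```

Lowering R or W keeps the system available when more replicas are unreachable, at the cost of reads that may miss the most recent write.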

AI & Databases

The intersection of AI and databases is where theory meets the real world at scale. The INRIA CRDT paper introduced conflict-free replicated data types, which now appear in collaborative editors and replicated caches, and which are increasingly relevant as AI systems need eventually consistent state across distributed inference nodes. The LSM-tree paper (1996) described the storage structure behind RocksDB, which many production key-value stores and a number of vector databases build on. The BASE vs. ACID debate isn't just academic; it frames how modern AI systems trade consistency for availability, whether in distributed inference pipelines or real-time feature stores.
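To make the CRDT idea concrete, here is a minimal sketch of a grow-only counter, one of the simplest state-based CRDTs in the literature. The class and method names are illustrative, and replica ids are assumed to be unique strings:

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        # Each replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


# Two hypothetical replicas update independently, then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging the same state again changes nothing
print(a.value(), b.value())  # 5 5
```

Because merging needs no coordination, replicas can accept writes while partitioned and still agree once they exchange state, which is the property that makes CRDTs attractive for eventually consistent systems.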

Case Studies

Theory doesn't always survive contact with production. Yahoo PNUTS (2008) and Designing and Deploying Internet-Scale Services (2007) aren't about algorithms; they're about what happens when you try to run a distributed system at scale while public cloud platforms such as AWS were still in their infancy. PNUTS taught us about geo-distributed key-value stores, per-record consistency, and operational simplicity. The internet-scale services paper revealed the hidden costs of distributed systems: failure modes that textbooks ignore, capacity planning under uncertainty, and graceful degradation during partial outages. These case studies are essential because they show that distributed systems aren't just about correctness; they're about reliability under chaos.

Tutorials

Without reproducibility, even the best paper is just an idea. The YCSB benchmark (2010) didn't just enable fair comparisons between NoSQL systems — it shaped the entire category by forcing vendors to optimize for measurable workloads. The nosqlsummer reading guide itself is a tutorial: it distills complex papers into digestible concepts, introduces key terminology, and provides a roadmap for engineers who want to go deeper. Tutorials matter in an academic archive because they bridge the gap between research and practice.

Tree diagram of distributed database knowledge taxonomy

Cross-category reading paths

Distributed database knowledge isn't linear — it's a graph. The six categories form overlapping domains, and the best way to learn is to follow threads that connect them.

Path 1 — The storage engineer: Start with Classic Papers (Codd's relational model and Gray's transaction concept). Then move to Modern NoSQL, where Dynamo's availability model introduces eventual consistency. Next, dive into AI & Databases to understand LSM-trees, the storage structure behind RocksDB and the many key-value stores and vector databases built on it. Finally, study Case Studies like PNUTS to see how geo-distribution works in practice. For more on consensus and fault tolerance, see the distributed systems category.

Path 2 — The consensus specialist: Begin with Distributed Systems papers like Lamport's clocks and the Byzantine Generals problem to grasp the theoretical limits of fault tolerance. Then revisit Classic Papers to understand how transactions abstract away these complexities. Move to Case Studies like Designing and Deploying Internet-Scale Services to see how these ideas play out in real systems — where partial failures, network partitions, and human error are the norm.

Path 3 — The AI infrastructure engineer: Start in the AI & Databases category with CRDTs, LSM-trees, and BASE vs. ACID to understand the storage and consistency models that power modern AI systems. Then explore Modern NoSQL systems like Cassandra and Dynamo to see how availability and scalability are achieved in practice. Finally, study Distributed Systems papers on Byzantine fault tolerance to understand how to handle unreliable nodes in distributed inference pipelines.