June 25, 2026 · classic papers · 18 min read

Top 12 NoSQL Papers Every Engineer Should Read in 2026

1. Amazon Dynamo: Amazon’s Highly Available Key-Value Store — DeCandia et al. (2007)

Amazon’s Dynamo paper did not invent eventual consistency or consistent hashing, but it showed the industry how to combine them into a production key-value store that favors availability and partition tolerance. The system was designed for Amazon’s e-commerce workload, which must stay writable even during network partitions and server failures, because downtime translates directly into lost revenue.

Dynamo’s architecture relies on several key innovations:

  • Consistent Hashing: Unlike traditional hash-based partitioning, Dynamo uses a ring-based consistent hashing algorithm where both nodes and data items are mapped onto a circular space. This approach minimizes reorganization when nodes are added or removed, reducing the load on the system during scaling events. Each node is responsible for a range of the ring, and virtual nodes (vnodes) are used to distribute load more evenly across physical machines.

  • Vector Clocks for Versioning: To handle concurrent updates in a distributed system, Dynamo employs vector clocks, which capture causality by tracking updates from each node. This allows the system to detect conflicts and resolve them using application-specific reconciliation logic (e.g., last-writer-wins or application-defined merge functions).

  • Hinted Handoff: When a node is temporarily unavailable, Dynamo uses a technique called hinted handoff to ensure writes are not lost. The coordinator node writes the data to a designated “hinted” node, which will later forward the data to the intended recipient once it recovers. This mechanism ensures durability without sacrificing availability.

  • Sloppy Quorums: Dynamo introduces the “sloppy quorum”: reads and writes are performed on the first N healthy nodes in a key’s preference list, which may not be the first N nodes encountered when walking the ring. Combined with hinted handoff, this lets the system keep accepting requests during failures and partitions by routing them to available nodes, even if those nodes are not the key’s usual replicas.
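
The ring, virtual nodes, and preference list described above can be sketched in a few lines of Python. This is a minimal illustration, not Dynamo’s implementation: the `HashRing` class, the `vnodes_per_node` default, and the `is_healthy` callback are all hypothetical names chosen for this example. MD5 is used because the paper describes hashing keys with MD5.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a 128-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes (illustration only)."""

    def __init__(self, nodes, vnodes_per_node=64):
        # Each physical node owns several tokens, which evens out the load.
        self._tokens = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._tokens]

    def preference_list(self, key, n=3, is_healthy=lambda node: True):
        """Walk clockwise from the key; return the first n distinct healthy nodes."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        chosen = []
        for step in range(len(self._tokens)):
            _, node = self._tokens[(start + step) % len(self._tokens)]
            if node not in chosen and is_healthy(node):
                chosen.append(node)
                if len(chosen) == n:
                    break
        return chosen
```

Passing an `is_healthy` function that rejects a failed node yields the sloppy-quorum behavior: the walk simply continues to the next healthy node on the ring. Removing a node from the ring only reassigns the keys that node owned.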

The paper highlights that Dynamo was deployed across Amazon’s platform, powering services like shopping carts, session management, and product catalogs. Its design principles have directly influenced modern databases such as Apache Cassandra, Riak, and Voldemort. Even today, Dynamo’s emphasis on availability and partition tolerance remains a guiding light for engineers building systems that must operate under adverse network conditions, such as edge computing environments where nodes may be intermittently disconnected.

Real-world impact: The paper reports that, over two years of production use, applications received successful responses (without timing out) for 99.9995% of requests, with no data loss events. Its treatment of network partitions and of the trade-off between consistency and availability has become foundational for engineers working on globally distributed systems.
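
To make the vector-clock versioning described in this section concrete, here is a small sketch. The function names (`increment`, `descends`, `compare`, `merge`) are assumptions for illustration; Dynamo stores clocks as lists of (node, counter) pairs, which a plain dictionary models well enough here.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def descends(a: dict, b: dict) -> bool:
    """True if version a has seen every update that version b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two versions as equal, ordered, or concurrent (a conflict)."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    return "concurrent"


def merge(a: dict, b: dict) -> dict:
    """Clock for a reconciled version: the element-wise maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}
```

Two versions that compare as `"concurrent"` are the siblings that Dynamo hands back to the application for reconciliation; the reconciled value is then written with the merged clock, advanced by the coordinating node.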


2. Bigtable: A Distributed Storage System for Structured Data — Chang et al. (2006)

Google’s Bigtable paper describes a distributed storage system designed to handle structured data at petabyte scale; the paper names web indexing, Google Earth, and Google Finance among its users. Bigtable is built on the Google File System (GFS) and combines several ideas to achieve high throughput and low latency:

  • Sparse, Distributed Multi-Dimensional Sorted Map: Bigtable organizes data into a sorted map, where the key is a tuple of (row key, column key, timestamp). This structure allows efficient range scans and supports fast lookups. The row key is used for partitioning, enabling data to be distributed across multiple tablets (partitions), each managed by a single server.

  • Chubby Lock Service: Bigtable relies on Chubby, a highly available distributed lock service, for leader election, metadata storage, and failure detection. Chubby ensures that only one master node is active at a time, simplifying coordination and reducing the complexity of leader election algorithms.

  • Tablet Assignment and Load Balancing: Data is partitioned into tablets, which are dynamically assigned to tablet servers. Bigtable locates tablets through a three-level hierarchy: a file in Chubby points to the root tablet, the root tablet points to METADATA tablets, and those point to user tablets. This hierarchy allows the system to scale to millions of tablets without overwhelming any single server. Tablets that grow too large are split, and the master balances load by moving tablets between servers.

  • Column Families: Column keys are grouped into column families, which are the unit of access control and of disk and memory accounting. Families can in turn be grouped into locality groups, each stored in its own SSTables, so columns that are read together are stored together, and compression and in-memory caching can be tuned per locality group. This makes sparse data cheap to store and retrieve.

  • Compression and Caching: Clients can choose a compression format per locality group. The paper describes a widely used two-pass scheme: Bentley and McIlroy’s long-common-string compression followed by a fast general-purpose pass (the fast compressor, known inside Google as Zippy, was later open-sourced as Snappy). Tablet servers also keep a scan cache of key-value pairs and a block cache of SSTable blocks read from GFS, reducing latency for read operations.
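
The data model in the first bullet, a sorted map keyed by (row key, column key, timestamp), can be sketched with a sorted list and binary search. `TinyTable` is a hypothetical single-process toy, not Bigtable’s API; timestamps are stored negated so that the newest version of a cell sorts first, mirroring the paper’s decreasing-timestamp order.

```python
import bisect


class TinyTable:
    """Toy sorted map of (row, column, timestamp) -> value (illustration only)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same keys -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the most recent value of a cell, or None if it is absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, negated_ts = self._keys[i]
            yield row, column, -negated_ts, self._values[self._keys[i]]
            i += 1
```

Because rows are kept in lexicographic order, a range scan is one binary search followed by a sequential walk. That is why the paper’s example uses reversed hostnames such as `com.cnn.www` as row keys: pages from the same domain end up adjacent.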

Bigtable’s architecture has had a profound impact on the development of distributed storage systems. Apache HBase, an open-source implementation of Bigtable, has become a cornerstone of the Hadoop ecosystem, enabling large-scale data processing and analytics. Google Cloud Bigtable, a managed version of Bigtable, is widely used for time-series data, machine learning workloads, and real-time analytics.

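Performance metrics: The paper’s evaluation reports throughput as 1000-byte values read or written per second per tablet server. Random reads from disk are by far the slowest operation, while writes and sequential scans are much faster, and aggregate throughput grows (though not perfectly linearly) as tablet servers are added. At publication, Bigtable clusters stored petabytes of data across thousands of commodity servers, demonstrating the feasibility of distributed storage systems for massive-scale applications.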

Modern relevance: Bigtable’s influence extends beyond traditional databases. Its column-family model is a foundational concept in modern wide-column stores like Apache Cassandra and ScyllaDB, which combine Dynamo’s decentralized architecture with Bigtable’s data model. Bigtable’s memtable-plus-SSTable storage layout is also a practical realization of the LSM-Tree (which predates it by a decade), and that layout is now ubiquitous in NoSQL databases.


3. CAP Theorem — “Proving the Impossible” in Distributed Systems — Gilbert & Lynch (2002)

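The CAP theorem was conjectured by Eric Brewer in 2000 and proved by Seth Gilbert and Nancy Lynch in their 2002 paper “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.” It provides a theoretical framework for the trade-offs inherent in distributed systems: no system can guarantee all three of the following properties at once, so when a network partition occurs it must give up either consistency or availability: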

  1. Consistency (C): All nodes see the same data at the same time. This is often interpreted as linearizability, where operations appear to occur instantaneously at a single point in time.
  2. Availability (A): Every request receives a response, even if some nodes are down or unreachable.
  3. Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped by the network between nodes.

The theorem highlights that network partitions are inevitable in distributed systems, and engineers must make trade-offs when designing their systems. For example:

  • A system that prioritizes CA (Consistency and Availability) cannot tolerate network partitions. This is typical of traditional relational databases deployed in a single data center, where partitions are rare.
  • A system that prioritizes CP (Consistency and Partition Tolerance) will sacrifice availability during a partition. Examples include Google’s Spanner and etcd, which prioritize consistency to ensure strong guarantees.
  • A system that prioritizes AP (Availability and Partition Tolerance) will sacrifice consistency during a partition. Examples include Amazon’s Dynamo and Apache Cassandra, which prioritize availability and partition tolerance over strong consistency.

Criticisms and nuances: While the CAP theorem is often cited as a binary choice, the reality is more nuanced. Systems can offer different levels of consistency and availability depending on the context. For example:

  • Tunable Consistency: Many modern databases, such as Cassandra and DynamoDB, allow engineers to configure consistency on a per-request basis. For instance, a Cassandra read can return data from a single replica (weak consistency) or from a quorum of replicas; quorum reads are guaranteed to observe the latest acknowledged write only when the read and write replica counts overlap (R + W > N).
  • Eventual Consistency with Bounded Staleness: Some systems, such as Azure Cosmos DB, offer a bounded-staleness level in which reads lag writes by at most a configured number of versions or time interval. Cassandra and Riak, by contrast, promise no such bound: replicas converge eventually through read repair, hinted handoff, and anti-entropy. Bounded staleness provides a middle ground between strong consistency and plain eventual consistency.
  • Conflict Resolution: AP systems often employ conflict resolution mechanisms to handle inconsistencies. For example, Cassandra uses last-writer-wins (LWW) by default, while Riak supports application-defined merge functions to resolve conflicts.
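
The quorum arithmetic behind tunable consistency is short enough to check by brute force. In the sketch below, N is the replication factor, and R and W are the number of replicas that must answer a read or acknowledge a write; both function names are assumptions made for this example.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Rule of thumb: every read quorum intersects every write quorum iff R + W > N."""
    return r + w > n


def every_pair_intersects(n: int, r: int, w: int) -> bool:
    """Brute-force check of the same property over all possible replica sets."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

With N = 3, reading and writing at quorum (R = W = 2) overlaps, while R = W = 1 does not. Overlap is necessary for reading your latest write but is not by itself linearizability: sloppy quorums and timestamp-based last-writer-wins resolution can still surface stale or lost updates.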

Modern applications: The CAP theorem remains relevant in 2026, particularly in the context of edge computing and multi-cloud architectures. For example:

  • Edge Computing: Systems deployed across geographically distributed edge locations must prioritize partition tolerance and availability, often sacrificing strong consistency to ensure low latency and high availability.
  • Multi-Cloud Deployments: Organizations deploying databases across multiple cloud providers must account for network partitions between clouds, often opting for AP systems to ensure availability.
  • Serverless Architectures: Functions-as-a-Service (FaaS) platforms, such as AWS Lambda and Azure Functions, rely on distributed storage systems that must handle frequent scaling events and network partitions, often prioritizing availability and partition tolerance.

Beyond CAP: While the CAP theorem provides a useful framework, modern distributed systems often aim for a balance of all three properties. For example:

  • CRDTs (Conflict-Free Replicated Data Types): CRDTs are data structures whose replicas are guaranteed to converge without coordination. They let a system stay available during partitions while still offering strong eventual consistency, which is weaker than the linearizability the CAP theorem refers to.
  • Hybrid Consistency Models: Systems like Google’s Spanner (Paxos plus TrueTime) and CockroachDB (Raft plus hybrid logical clocks) remain CP systems, but they replicate across failure domains so that a majority is almost always reachable. They show that the practical cost of the CAP trade-off can be made small, not that it can be avoided.
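
As an illustration of the CRDT idea, here is a grow-only counter (G-Counter), one of the simplest CRDTs. The class and method names are chosen for this sketch. Each replica increments only its own slot, and merging takes the per-replica maximum, which makes merges commutative, associative, and idempotent.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state; safe to repeat or reorder."""
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)
```

Two replicas that accept increments independently during a partition reach the same total once they exchange state in either order, with no coordination and no conflict to resolve.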

4. The Log-Structured Merge Tree (LSM-Tree): A Disk-Based Structure for High Performance Databases — O’Neil et al. (1996)

The Log-Structured Merge-Tree (LSM-Tree) is a disk-based data structure designed to optimize write performance in high-throughput databases. Introduced by Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil in 1996, the LSM-Tree reduces the cost of random disk I/O by buffering writes in memory and moving them to disk in large sequential batches that are merged in the background. This approach is particularly effective for write-heavy workloads, such as those found in time-series databases, logging systems, and NoSQL databases.

How LSM-Trees Work

LSM-Trees consist of multiple levels (C-0, C-1, …, C-k), where:

- **`C-0`**: An in-memory memtable that buffers incoming writes. Once the memtable reaches a certain size (e.g., 100 MB), it is flushed to disk as an immutable `SSTable` (Sorted String Table) in `C-1`.

- **`C-1` to `C-k`**: On-disk levels that store SSTables. Each level is larger than the previous one (typically by a factor of 10) and uses a tiered or leveled compaction strategy to merge and compact SSTables.

- **Tiered Compaction**: Used in systems like Apache Cassandra, where SSTables are grouped into tiers and merged when a tier reaches a certain size.

- **Leveled Compaction**: The default in LevelDB and RocksDB, and an option in Cassandra and ScyllaDB. SSTables are organized into levels, and compaction merges files from one level into the next to reduce read amplification.
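
The write and read paths described above can be sketched as a toy, in-memory model. `TinyLSM` is hypothetical and deliberately simplified: there is no write-ahead log, "SSTables" are sorted Python lists instead of files, and compaction merges everything into a single table instead of working level by level.

```python
import bisect

TOMBSTONE = object()  # marker written in place of a deleted value


class TinyLSM:
    """Toy LSM-Tree: a dict memtable plus immutable sorted tables (illustration only)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []  # newest first; each entry is (sorted_keys, values)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def _flush(self):
        """Write the memtable out as one sorted, immutable table."""
        items = sorted(self.memtable.items(), key=lambda kv: kv[0])
        self.sstables.insert(0, ([k for k, _ in items], [v for _, v in items]))
        self.memtable = {}

    def get(self, key):
        """Check the memtable first, then tables from newest to oldest."""
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for keys, values in self.sstables:
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return None if values[i] is TOMBSTONE else values[i]
        return None

    def compact(self):
        """Merge all tables into one, keeping only the newest live value per key."""
        merged = {}
        for keys, values in reversed(self.sstables):  # oldest first
            merged.update(zip(keys, values))
        live = sorted(
            ((k, v) for k, v in merged.items() if v is not TOMBSTONE),
            key=lambda kv: kv[0],
        )
        self.sstables = [([k for k, _ in live], [v for _, v in live])] if live else []
```

Dropping tombstones is safe here only because the compaction covers every table, so no older copy of a deleted key can resurface. Real engines that compact a subset of tables must keep tombstones until they reach the oldest level.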

Key Innovations

  1. Write Optimization: By batching writes into sequential log files, LSM-Trees reduce the overhead of random disk I/O, which is a primary bottleneck in traditional B-tree-based storage engines. This design allows LSM-Trees to achieve significantly higher write throughput than B-trees, often by an order of magnitude.

  2. Compaction Strategies: LSM-Trees employ compaction to merge SSTables and discard obsolete data (overwritten values, and deleted entries together with their tombstones). The choice of compaction strategy impacts read performance, write amplification, and disk space usage:

    • Size-Tiered Compaction (STCS): SSTables are grouped into tiers based on size, and compaction merges smaller SSTables into larger ones. This strategy minimizes write amplification but can lead to read amplification as the number of SSTables grows.
    • Leveled Compaction (LCS): SSTables are organized into levels, with each level being 10x larger than the previous one. Compaction merges SSTables from one level into the next, reducing read amplification but increasing write amplification.
    • Universal Compaction: RocksDB’s name for its tiered-style compaction, which targets lower write amplification at the cost of higher read and space amplification.
  3. Bloom Filters: To optimize read performance, LSM-Trees often attach a Bloom filter to each SSTable. The filter can report that a key is definitely absent (false positives are possible, false negatives are not), so a read can skip most SSTables that do not contain the key, improving latency and throughput.

  4. Tombstone Handling: In distributed systems, tombstones (markers for deleted data) must be propagated to all replicas to ensure consistency. LSM-Trees handle tombstones by:

    • Tombstone Compaction: Removing tombstones during compaction to reclaim disk space.
    • Tombstone Expiration: Automatically expiring tombstones after a configurable period to prevent unbounded growth.
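
A Bloom filter of the kind mentioned in item 3 can be sketched as a bit array plus a handful of hash functions. The sizes below are arbitrary, and deriving the hash functions from seeded SHA-256 is an assumption made for simplicity; production engines use faster non-cryptographic hashes.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may return false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means the SSTable must be checked."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

On a read, the engine consults each SSTable’s filter and only touches disk for tables where `might_contain` returns `True`. The false-positive rate falls as more bits are allocated per key.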

Real-World Systems Using LSM-Trees

  1. RocksDB: A high-performance embedded key-value store developed at Facebook (now Meta) as a fork of Google’s LevelDB, RocksDB is an LSM-Tree-based storage engine used in systems like MyRocks (MySQL with RocksDB), TiKV, and Kafka Streams (for local state stores). RocksDB supports leveled and universal (tiered-style) compaction, TTL (Time-to-Live) for data expiration, and pluggable compression algorithms.

  2. Apache Cassandra: Cassandra uses an LSM-Tree-based storage engine (with a tiered compaction strategy) to achieve high write throughput and low latency. Cassandra’s storage engine is designed to handle large-scale, distributed workloads, making it a popular choice for time-series data, user profiles, and messaging systems.

  3. ScyllaDB: A Cassandra-compatible database written in C++, ScyllaDB is designed to maximize performance on modern hardware (e.g., multi-core CPUs and NVMe SSDs). It uses a shard-per-core, shared-nothing architecture and supports several compaction strategies, including size-tiered and leveled.

  4. InfluxDB: A time-series database optimized for high write throughput and fast queries. Its storage engine uses the Time-Structured Merge Tree (TSM), an LSM-Tree-like design specialized for time-series data. Its design emphasizes efficient compression and downsampling to reduce storage costs.

  5. Apache Kafka: Kafka’s storage layer is not an LSM-Tree; it is an append-only, segmented commit log. It is listed here because it relies on the same principle of sequential disk writes, which is what enables Kafka to handle high-throughput, low-latency event processing. (Kafka Streams, by contrast, does use RocksDB for local state.)

Performance Metrics

LSM-Trees are renowned for their write performance. The figures below are commonly quoted orders of magnitude, not guarantees; actual results depend heavily on hardware, value size, and configuration:

  • RocksDB: Benchmarks on modern hardware have reported write throughput above 1 million small writes per second, with point-read latencies in the single-digit millisecond range or better.
  • Cassandra: Can sustain write throughput of up to roughly 100,000 writes per second per node on well-provisioned hardware, with quorum read latencies typically under 10 ms.
  • ScyllaDB: Vendor benchmarks report near-linear scaling, with read and write throughput increasing in proportion to the number of nodes, and very low 99th percentile read latencies.

Trade-offs and Considerations

While LSM-Trees excel at write performance, they introduce several trade-offs:

  1. Read Amplification: Read operations may need to scan multiple SSTables, increasing latency and CPU usage. This is mitigated by Bloom filters, caching, and efficient compaction strategies.
  2. Write Amplification: Compaction rewrites the same data several times, consuming write bandwidth that would otherwise serve foreground traffic. Leveled compaction reduces read and space amplification but increases write amplification; size-tiered compaction does the reverse.
  3. Memory Usage: The memtable and Bloom filters consume memory, which must be carefully managed in memory-constrained environments.
  4. Background Compaction Overhead: Compaction processes run in the background and can compete with foreground read/write operations for disk and CPU resources. Systems like RocksDB and ScyllaDB employ techniques like rate limiting and prioritization to mitigate this impact.

Modern Innovations

In 2026, LSM-Trees continue to evolve with advancements in hardware and software:

  • Persistent Memory (PMem): Byte-addressable persistent memory such as Intel Optane (since discontinued by Intel) has been used to hold the memtable, the write-ahead log, or frequently accessed SSTables, reducing the cost of random reads and of logging.
  • Zoned Namespaces (ZNS) SSDs: ZNS SSDs divide the device into zones that must be written sequentially and hand data placement and garbage collection to the host. Because SSTables are written sequentially and deleted as a whole, LSM-Trees map naturally onto zones, reducing device-level write amplification.
  • Adaptive Compaction: Machine learning techniques are being explored to dynamically adjust compaction strategies based on workload patterns, further optimizing performance.
  • Hybrid Storage Engines: Some databases offer both LSM-Tree and B-tree storage so that each table can match its workload. For example, MySQL can run the B-tree-based InnoDB engine alongside the LSM-Tree-based MyRocks engine.

5. Paxos Made Simple — Lamport (2001)

For engineers building their own consensus prototypes in 2026, the interview “architecte logiciel: choix techno interview expert cloud” (software architect: technology choices, an interview with a cloud expert) describes how production teams choose between Paxos and Raft. The open-source LLM installation guide is also useful for understanding the dependencies that modern vector pipelines place on a stable consensus layer.

Leslie Lamport’s Paxos Made Simple demystifies the Paxos consensus algorithm, a foundational protocol for achieving fault tolerance in distributed systems. Paxos ensures that a distributed system can agree on a single value (e.g., a configuration change, a transaction, or a leader) even in the presence of failures, as long as a majority of nodes are operational. The paper breaks down Paxos into three key roles: proposers, acceptors, and learners, and simplifies the protocol into two phases: prepare/promise and accept/accepted.

The Paxos Protocol in Detail

Paxos operates in rounds, where each round consists of two phases:

  1. Phase 1a (Prepare) / 1b (Promise):

    • A proposer (e.g., a node proposing a new value) sends a prepare request with a proposal number n to a majority of acceptors. Proposal numbers must be unique and totally ordered; a common choice is a (round counter, proposer ID) pair.
    • An acceptor promises not to accept any proposals with a lower number and responds with:
      • The highest proposal number it has already accepted (if any).
      • The value associated with that proposal (if any).
    • If the proposer receives responses from a majority of acceptors, it proceeds to Phase 2.
  2. Phase 2a (Accept) / 2b (Accepted):

    • The proposer sends an accept request, carrying its own proposal number from Phase 1, with:
      • The proposal number.
      • The value to be accepted: the value of the highest-numbered accepted proposal reported in the Phase 1 responses, or the proposer’s own value if no acceptor reported one.
    • Acceptors accept the proposal if they haven’t promised a higher proposal number and respond with an accepted message.
    • If the proposer receives accepted responses from a majority of acceptors, the value is chosen.
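
The two phases can be sketched as a single-process simulation of single-decree Paxos. This is a teaching sketch under strong assumptions: messages are plain function calls, nothing is lost or reordered, and acceptor state is not persisted. The names `Acceptor` and `propose` and the (round, proposer ID) numbering scheme are choices made for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

ProposalNumber = Tuple[int, int]  # (round, proposer_id): unique and totally ordered


@dataclass
class Acceptor:
    promised: Optional[ProposalNumber] = None
    accepted_number: Optional[ProposalNumber] = None
    accepted_value: Any = None

    def on_prepare(self, n: ProposalNumber):
        """Phase 1b: promise only if n exceeds every number promised so far."""
        if self.promised is None or n > self.promised:
            self.promised = n
            return True, self.accepted_number, self.accepted_value
        return False, None, None

    def on_accept(self, n: ProposalNumber, value: Any) -> bool:
        """Phase 2b: accept unless a higher-numbered prepare has been promised."""
        if self.promised is None or n >= self.promised:
            self.promised = n
            self.accepted_number = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n: ProposalNumber, client_value: Any):
    """Run both phases; return the chosen value, or None if the round fails."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises.
    promises = [reply for reply in (a.on_prepare(n) for a in acceptors) if reply[0]]
    if len(promises) < majority:
        return None

    # Adopt the value of the highest-numbered accepted proposal, if any.
    prior = [(number, value) for _, number, value in promises if number is not None]
    value = max(prior, key=lambda p: p[0])[1] if prior else client_value

    # Phase 2: ask acceptors to accept (n, value).
    acks = sum(a.on_accept(n, value) for a in acceptors)
    return value if acks >= majority else None
```

The safety property shows up directly: once a value has been chosen, a later proposer with a higher number is forced to re-propose that same value, and a proposer with a stale number gets no promises.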

Why Paxos Matters

Paxos is the backbone of modern consensus systems, enabling critical functionality in distributed databases and configuration management:

  1. etcd: A distributed key-value store used for configuration management and service discovery in Kubernetes. etcd uses Raft (a Paxos-inspired consensus algorithm) to ensure strong consistency across a cluster of nodes. In 2026, etcd powers millions of Kubernetes clusters, managing configurations for cloud-native applications at scale.

  2. Chubby: Google’s distributed lock service, which uses Paxos to manage leader election and metadata storage. Chubby underpins services like Bigtable and Spanner, ensuring fault tolerance and consistency.

  3. Consul: A service mesh and service discovery tool by HashiCorp, Consul uses Raft to manage cluster state and provide strong consistency for service registries and health checks.

  4. Apache ZooKeeper: While ZooKeeper itself uses a custom consensus protocol (ZAB), its design is heavily influenced by Paxos. ZooKeeper provides distributed coordination services for applications like Hadoop, Kafka, and HBase.

Real-World Example: etcd in Kubernetes

In a Kubernetes cluster, etcd is the source of truth for all cluster state, including:

  • Pod and service definitions.
  • Node status and health.
  • Network policies and RBAC configurations.

etcd uses Raft to replicate state across multiple nodes, ensuring that:

  • Reads are strongly consistent by default (linearizable reads).
  • Writes are durable and replicated to a majority of nodes before being acknowledged.
  • The cluster can tolerate the failure of up to ⌊(n - 1)/2⌋ nodes while remaining operational (1 node in a 3-node cluster, 2 in a 5-node cluster).
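
For a majority-quorum protocol such as Raft, the relationship between cluster size and fault tolerance is simple integer arithmetic. The helper names below are illustrative only.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Largest number of nodes that can fail while a majority remains."""
    return (n - 1) // 2


for size in (1, 3, 4, 5, 7):
    print(size, majority(size), tolerated_failures(size))
```

A 4-node cluster tolerates no more failures than a 3-node cluster (one in both cases) while requiring a larger quorum, which is why odd cluster sizes are the usual recommendation.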

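Performance metrics: Indicative figures for a small cluster on fast SSDs and a low-latency network (results vary with hardware and load):

  • Write latencies of a few milliseconds, typically under 10 ms at high percentiles in a 3-node cluster.
  • Throughput of roughly 10,000 writes per second, the figure cited in the etcd project’s own benchmarks.
  • Leader failover on the order of the election timeout, which defaults to 1 second.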

Challenges and Optimizations

While Paxos is theoretically simple, implementing it in practice is non-trivial. Key challenges include:

  1. Leader Election: Paxos Made Simple calls for a distinguished proposer (leader) to guarantee progress but does not specify how to elect one. Multi-Paxos implementations fill this gap in their own ways, and Raft makes leader election an explicit part of the protocol.
  2. Performance Overhead: The two-phase protocol introduces latency and network overhead. Optimizations include:
    • Batch Processing: Combining multiple proposals into a single round to reduce the number of messages.
    • Pipelining: Allowing proposers to send new proposals without waiting for previous rounds to complete.
    • Fast Path: Some systems avoid a full log write for reads. etcd, for example, serves linearizable reads by having the leader confirm its leadership with a quorum (ReadIndex) instead of appending to the log, and offers serializable reads that skip consensus entirely at the cost of possibly stale data.
  3. Network Partitions: Paxos requires a majority quorum to make progress. During a network partition, only a side that contains a majority of nodes can continue; minority sides stall, and if no side has a majority the whole system stalls. This is a fundamental trade-off of strong consistency.

Alternatives to Paxos

While Paxos is the gold standard for consensus, several alternatives have emerged to address its complexity or improve performance:

  1. Raft: Introduced by Ongaro and Ousterhout in 2014, Raft simplifies consensus by explicitly separating leader election, log replication, and membership changes. Raft’s design is easier to understand and implement, making it the preferred choice for many distributed systems. etcd, Consul, and HashiCorp’s Vault all use Raft.

    • Leader-Based: Raft requires a single leader to coordinate replication, simplifying the protocol.
    • Membership Changes: Raft supports dynamic cluster membership changes without requiring a full reconfiguration.
    • Strong Guarantees: Raft provides the same safety guarantees as Paxos but with better understandability.
  2. Byzantine Fault Tolerance (BFT): Systems like PBFT (Practical Byzantine Fault Tolerance) and its successors (e.g., HotStuff, LibraBFT) handle malicious nodes in addition to crash failures. These systems are critical for blockchain and decentralized applications where nodes may behave maliciously.

    • PBFT: Requires 3f + 1 nodes to tolerate f malicious nodes. Early versions of Hyperledger Fabric used a PBFT-based ordering service.
    • HotStuff: A leader-based BFT protocol with linear communication complexity; it was the basis of LibraBFT in Facebook’s Libra (later Diem) project.
  3. Gossip Protocols: While not a consensus algorithm per se, gossip protocols (e.g., used in Cassandra and Riak) enable eventual consistency by propagating updates peer-to-peer. These protocols are simple and scalable but do not provide strong consistency guarantees.

Paxos in 2026: The State of the Art

In 2026, Paxos and its derivatives remain the cornerstone of distributed consensus, but several trends are shaping their evolution:

  1. Hybrid Consistency Models: Google’s Spanner combines Paxos with TrueTime (clock synchronization backed by GPS receivers and atomic clocks) to provide externally consistent, globally distributed transactions. CockroachDB pursues similar goals on commodity hardware with Raft and hybrid logical clocks. These systems demonstrate that strong consistency can be achieved at scale with careful engineering.
  2. Hardware Acceleration: Modern CPUs and accelerators (e.g., FPGAs, SmartNICs) are being used to offload consensus protocols, reducing latency and improving throughput. For example, research prototypes that run consensus on FPGAs have reported sub-millisecond latencies.
  3. Multi-Cloud Consensus: As organizations deploy applications across multiple cloud providers, consensus protocols are being adapted to handle heterogeneous environments. Systems like CockroachDB support multi-cloud deployments with strong consistency guarantees.
  4. Formal Verification: Tools like TLA+ and Ivy are used to specify and verify consensus protocols, ensuring correctness and reducing the risk of bugs. For example, Raft’s authors published a TLA+ specification with the protocol, and a machine-checked proof of its safety (the Verdi project, in Coq) followed later.

Case Study: CockroachDB

CockroachDB is a distributed SQL database that uses a Raft-inspired consensus protocol to provide globally consistent transactions. Key features include:

  • Geo-Partitioning: Data can be partitioned across regions, with Raft groups ensuring strong consistency within each partition.
  • Serializable ACID Transactions: CockroachDB provides full ACID guarantees across distributed transactions, a rarity in NoSQL systems.
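  • Survivability: Each range is replicated (three ways by default) and stays available as long as a majority of its replicas survive, so a range with r replicas tolerates ⌊(r - 1)/2⌋ failures.

Performance and operations depend heavily on topology and workload:

  • Latency: A write commits once a majority of replicas acknowledge it, so single-region deployments can commit in a few milliseconds, while cross-region quorums pay at least one inter-region round trip.
  • Throughput: Scales with the number of nodes and ranges; vendor-published figures should be read against their specific hardware and workload.
  • Automatic Recovery: Nodes can be added or removed dynamically without downtime.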

6. MapReduce: Simplified Data Processing on Large Clusters — Dean & Ghemawat (2004)

Google’s MapReduce paper introduced a programming model that abstracts away the complexities of parallelism, fault tolerance, and distributed execution, enabling developers to process vast amounts of data with minimal effort. The model consists of two primary phases:

  1. Map Phase: Input data is split into key-value pairs, and a user-defined map function processes each pair to generate intermediate key-value pairs. For example, in a word-counting job, the map function might emit each word as a key with an initial count of 1.
  2. Reduce Phase: Intermediate key-value pairs are grouped by key, and a user-defined reduce function aggregates the values for each key. In the word-count example, the reduce function sums the counts for each word.
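
The word-count example from the two phases above can be sketched as a single-process simulation. `run_mapreduce` is a hypothetical helper written for this article, not Google’s API; it performs the grouping ("shuffle") step in memory that a real cluster would perform across machines.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to each input pair, group by key, then apply reduce_fn."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


documents = [("d1", "the quick fox"), ("d2", "the lazy dog the end")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

Here `word_counts["the"]` is 3. The user supplies only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` stands in for what the framework does on the user’s behalf.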

Key Innovations

  1. Automatic Parallelization: MapReduce automatically parallelizes the map and reduce functions across a cluster of machines, distributing data and computation to maximize throughput.
  2. Fault Tolerance: MapReduce handles failures transparently by:
    • Re-execution: If a worker node fails, MapReduce reschedules the task on another node.
    • Speculative Execution: The paper calls these backup tasks. When a job is close to completion, MapReduce schedules duplicate executions of the remaining in-progress tasks (which may be stuck on a slow machine) and marks each task complete as soon as either copy finishes.
  3. Data Locality: MapReduce schedules tasks on nodes where the input data is already located, reducing network I/O and improving performance.
  4. Combiner Functions: An optional combiner function can be used to pre-aggregate data on the map worker, reducing the amount of data transferred to reducers.

Google’s Implementation

Google’s MapReduce implementation consists of:

  • Master Node: Coordinates the overall job, assigns tasks to workers, and tracks progress.
  • Worker Nodes: Execute map and reduce tasks. Map workers read input from the Google File System (GFS) and write intermediate data to local disk; reduce workers fetch that data and write the final output back to GFS.
  • Shuffle and Sort: Intermediate key-value pairs are partitioned by key

See also: distributed-database lexicon

See also: the 2026 paper-club reading list

The tradition of sharing technical knowledge through curated reading lists has roots in computing subcultures going back decades. One of the most persistent is the demoscene — a community of coders, musicians and graphic artists who push hardware to its limits in competition. The top 20 demoscene productions of all time is a compelling companion read that illustrates how the same spirit of peer-reviewed creative excellence applies to real-time graphics.