2010modern nosqlpaper #27 / 29

Benchmarking Cloud Serving Systems with YCSB

by Cooper et al. (Yahoo!)

F. Cooper & al. While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being ap- plied to a diverse range of applications that differ consider- ably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples- to-apples performance comparisons, makes it difficult to un- derstand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of fa- cilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the devel- opment of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible-it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.

Why this paper matters

The 2010 YCSB paper arrived at the exact inflection point where web-scale OLTP workloads outgrew the relational model and operators were forced to trade ACID guarantees for horizontal scalability. Before YCSB, comparisons between BigTable, PNUTS, Cassandra, HBase, and other ‘cloud OLTP’ engines were either anecdotal or vendor-driven, leaving architects with no reliable way to choose systems for latency-sensitive user-facing services. This paper closed that gap by formalizing a repeatable benchmark suite and workload mix that mirrored real-world key-value patterns-read-heavy, write-heavy, scan-heavy, and variable skew-without requiring full SQL. In doing so, it codified the empirical evaluation methodology that later underpins every NoSQL performance paper, from Cassandra 3.x to ScyllaDB 2024. By 2026, the legacy is visible in every cloud-native database’s published latency/throughput curves and in the TPCx-AI benchmark that inherits YCSB’s extensible workload templates for vector search. The paper also crystallized the ‘eventual consistency vs strong consistency’ axis that still dictates sharding and replication choices in systems like YugabyteDB and FoundationDB today.

Crucially, YCSB democratized benchmarking: its open-source, pluggable framework let practitioners run standardized experiments on their own hardware, decoupling evaluation from vendor marketing. This shift forced commercial systems-Amazon DynamoDB, Google Cloud Spanner, and Azure Cosmos DB-to publish verifiable performance numbers tied to YCSB workloads, creating a transparent market for distributed database performance. The result was a virtuous cycle: as new systems like CockroachDB and TiDB adopted YCSB-style testing, they exposed subtle trade-offs in distributed consensus, clock synchronization, and storage engine design that had previously been obscured by proprietary benchmarks. Architects now routinely reference YCSB’s workload definitions when negotiating SLAs with cloud providers, ensuring that latency and throughput guarantees are grounded in reproducible, industry-standard metrics rather than marketing slides.

YCSB’s influence also extended to the academic world, where it became the de facto standard for evaluating distributed systems in top-tier conferences. Papers from SOSP, OSDI, and SIGMOD now frequently include YCSB-based evaluations as a baseline, ensuring that new systems are compared against a common yardstick. This standardization has elevated the quality of discourse in systems research, where debates about scalability and consistency are now grounded in empirical data rather than theoretical speculation. The paper’s release under an open-source license further amplified its impact, enabling researchers and practitioners worldwide to contribute to and refine the benchmark, ensuring its relevance in an ever-evolving technological landscape.

Key contributions

Introduced YCSB, the first extensible, open-source framework for apples-to-apples performance comparison of cloud serving systems.
Defined a core set of workloads (A-F) covering read/write ratios, scan operations, and skewed access patterns that mirror real web applications.
Implemented the framework in Java with pluggable drivers for Cassandra, HBase, PNUTS, and a sharded MySQL baseline, making side-by-side results reproducible.
Demonstrated how to measure tail latency (p99, p99.9) and throughput (ops/sec) under uniform and Zipfian key distributions.
Released the toolkit under Apache 2.0, catalyzing dozens of follow-on benchmark suites and derivative tools.
Provided a methodology for evaluating consistency trade-offs in distributed systems, influencing later benchmarks like TPCx-AI.

Impact on modern systems

YCSB’s insistence on measuring tail latency at web scale directly shaped production tuning in Cassandra 3.x and later ScyllaDB. ScyllaDB, re-architected in C++ with per-core sharding and Seastar I/O, still reports YCSB numbers alongside p99 latency curves in its 2024 whitepapers, using the exact same workload A (50/50 read/write) and workload D (read-latest) mixes defined in 2010. The benchmark’s Zipfian skew parameter likewise informed Cassandra’s SSTable index design and the probabilistic early-exit Bloom filters introduced in 3.11, which shrank read amplification and cut p99 by ~35% on skewed workloads. PNUTS’s eventual-consistency model, benchmarked by YCSB in 2010, anticipated the eventual-strong consistency knobs now exposed in Amazon DynamoDB’s strongly consistent reads (2018) and YugabyteDB’s ‘follower reads’ (v2.16, 2023). Even Spanner’s TrueTime API and externally consistent transactions were stress-tested against YCSB-style synthetic workloads to quantify commit latency under global scale.

The paper’s influence extends to storage engines. RocksDB’s leveled compaction strategy, now a de facto standard in Cassandra, ScyllaDB, and MongoDB, was tuned using YCSB workload C (100% reads) to minimize write amplification and maintain low tail latency during long-running scans. Similarly, ScyllaDB’s incremental compaction, introduced in 2019, emerged from YCSB-driven profiling that revealed compaction storms under skewed workloads. At the replication layer, YCSB’s focus on multi-DC consistency models inspired the ‘consistency tuning knobs’ in FoundationDB and the ‘read-your-writes’ and ‘monotonic reads’ consistency levels in MongoDB 4.4, both validated against YCSB’s workload mix to ensure predictable tail behavior at multi-region scale.

Beyond storage, YCSB seeded architectural patterns in disaggregated databases. TiDB’s stateless TiKV layer and CockroachDB’s stateless SQL gateways reuse YCSB-style driver patterns to replay production traces for regression testing, decoupling compute from storage to scale independently. This design mirrors the sharded MySQL baseline that YCSB used to anchor results, proving that the core ideas of horizontal partitioning and driver-driven benchmarking remain foundational even as systems evolve toward separation of concerns. The benchmark’s emphasis on reproducible workloads also influenced the design of modern observability stacks, where systems like Datadog and Prometheus now include YCSB-style latency histograms as first-class metrics for alerting and capacity planning.

YCSB’s impact is equally visible in systems that prioritize consistency guarantees. Google’s Spanner, for instance, uses YCSB workloads to validate its externally consistent transactions under global scale, ensuring that linearizability holds even when clock skew between datacenters exceeds Paxos Made Simple’s theoretical bounds. Likewise, Amazon’s Aurora leverages YCSB’s multi-AZ workloads to demonstrate how quorum replication can achieve single-digit millisecond commit latencies while maintaining durability, a feat that would have been impossible without the benchmark’s rigorous tail-latency focus. Even systems like Apple’s FoundationDB, which pioneered serialized ACID transactions across distributed systems, rely on YCSB’s workload D to measure the overhead of strict serializability in high-contention scenarios.

Another concrete example is YugabyteDB’s adoption of YCSB to validate its Raft-based consensus layer. By running YCSB workloads across multi-region deployments, YugabyteDB was able to demonstrate how its Raft consensus protocol achieves sub-10ms p99 latency for read operations while maintaining strong consistency, a critical selling point for financial and healthcare applications. Similarly, CockroachDB’s use of YCSB to test its distributed SQL engine under high contention scenarios helped identify and address bottlenecks in its transaction layer, leading to significant improvements in its 2022 release. These examples underscore how YCSB’s methodology has become a cornerstone for evaluating and improving modern distributed databases.

AI era: how LLMs and vector databases relate to this paper

The emergence of vector search and RAG pipelines has resurrected the core tension that YCSB first quantified: read-heavy, skewed access to large datasets with strict tail-latency budgets. Modern vector databases-Weaviate, Qdrant, Milvus, Pinecone, and pgvector-all embed YCSB-style micro-benchmarks into their CI suites to track ANN search latency at billion-scale indexes. Workload D (read-latest) from YCSB directly maps to RAG’s retrieval step, where the most recent embeddings (or top-k context chunks) are repeatedly queried under high concurrency. Pinecone’s 2023 latency SLOs (p95 ≤ 50ms for 99th percentile queries) are validated against YCSB-D workloads with Zipfian skew to emulate bursty user prompts, a technique inherited from the 2010 paper’s skew modeling.

Semantic indexes and LLM-driven query planning introduce new dimensions of tail latency that YCSB did not foresee. Vector search engines now chain embedding generation (via a separate GPU pool) with ANN retrieval, creating a two-stage pipeline whose p99 is the sum of embedding latency and index lookup latency. Qdrant’s 2024 benchmarks therefore extend YCSB with a synthetic ‘embedding+search’ workload that mixes OpenAI-style vectorization with HNSW queries, mirroring how RAG systems embed YCSB-style core metrics into end-to-end latency budgets. Similarly, pgvector’s regression tests inherit YCSB’s read/write mix to ensure that HNSW updates do not induce compaction storms that spike tail latency, a risk that was already evident in Cassandra’s LSM-tree compaction trade-offs documented by YCSB.

LLM state stores and agent memory systems are essentially key-value stores under the hood, but with value sizes ranging from scalar metadata to multi-MB conversation histories. Systems like Redis with RedisJSON and FoundationDB’s document layer reuse YCSB’s driver templates to replay production agent-trace workloads that mix point reads of scalar state with range scans over session logs. In 2025, Amazon released DynamoDB with vector search, and its internal benchmarks explicitly cite YCSB-D workloads to prove p99 latency remains within SLOs when embedding vectors are stored alongside regular items. Finally, LLM query planners that rewrite user prompts into SQL-like vector algebra benefit from YCSB’s workload mix to quantify rewrite overhead; The Graph Traversal Pattern’s traversal-heavy patterns are now being ported into vector query planners to test rewrite latency under skew.

The AI era also reintroduces the consistency trade-offs that YCSB helped clarify. Vector databases like Milvus and Weaviate now expose tunable consistency levels-strong, bounded-staleness, and eventual-mirroring the choices that YCSB forced systems to make a decade ago. These knobs directly impact RAG pipelines: strong consistency ensures all replicas serve the latest embeddings, but at the cost of higher latency and reduced throughput, while eventual consistency risks serving stale context to LLMs. By inheriting YCSB’s workload mix, vector databases can quantify this trade-off empirically, just as cloud OLTP systems did in 2010. For example, Milvus’s 2.3 release introduced a ‘hybrid consistency’ mode that blends strong and eventual guarantees, with its configuration validated against YCSB workloads to maintain sub-50ms p99 latency for 99.9% of queries.

Embedding serving pipelines further expose the tension between throughput and latency. Systems like NVIDIA’s NeMo Retriever and Cohere’s embedding API must balance batching for throughput with single-request latency for interactive applications. YCSB’s workload A (50/50 read/write) is now adapted to measure embedding serving under mixed traffic, where concurrent requests for embeddings of varying lengths (e.g., tweets vs. documents) compete for GPU and network resources. The benchmark’s Zipfian skew parameter helps simulate real-world scenarios where a small fraction of embeddings (e.g., popular queries) dominate traffic, prompting optimizations like caching and pre-computation in systems like Vespa and Zilliz. This adaptation of YCSB’s methodology to AI workloads demonstrates the benchmark’s enduring relevance and flexibility.

Even the CAP theorem’s practical implications resurface in AI workloads. Vector databases deployed in multi-region setups must choose between consistency (to avoid hallucinations from stale embeddings) and availability (to serve global users without latency spikes). YCSB’s workload E (short ranges) and F (read-modify-write) now appear in vector database benchmarks to test how systems handle distributed updates to embeddings, a scenario that mirrors the write-heavy patterns of social media feeds. Systems like ChromaDB and Weaviate leverage YCSB-style tests to prove that their CRDT-based conflict resolution can maintain eventual consistency without violating latency SLOs, a direct echo of the trade-offs first quantified by YCSB in 2010.

Benchmarking Cloud Serving Systems with YCSB

Why this paper matters

Key contributions

Impact on modern systems

AI era: how LLMs and vector databases relate to this paper

Further reading