Interview: 50 Years from Relational to Vector Databases

We’re thrilled today to host Dr. Margaret Hollis, a distinguished independent database historian, to discuss the profound evolution of database systems over the last five decades. From the foundational theories of the relational model to the cutting-edge demands of vector embeddings, Dr. Hollis offers a unique perspective on the forces that have shaped, and continue to shape, how we store and retrieve information.
About Dr. Margaret Hollis
Dr. Margaret Hollis is a renowned independent database historian with over 30 years of experience tracking the intricate arc of database development. Formerly a research scientist at IBM Almaden, where she witnessed firsthand many pivotal moments in database innovation, Dr. Hollis now dedicates her time to chronicling the industry’s history and future. She is a prolific writer for the esteemed “Database Quarterly” and a sought-after speaker, known for her ability to distill complex technical shifts into compelling narratives.
The interview
Readers new to the field can pair this conversation with the Dynamo paper summary and the Bigtable architecture overview, both referenced repeatedly below.
Q: Dr. Hollis, thank you for joining us. You’ve witnessed the database world transform dramatically. Let’s start at the beginning: Edgar Codd’s seminal 1970 paper. What was its immediate impact, and why was it so revolutionary?
A: It’s a pleasure to be here. Codd’s 1970 paper, “A Relational Model of Data for Large Shared Data Banks,” was nothing short of a paradigm shift. Before Codd, data management was largely tied to physical storage structures, often hierarchical or network models. Developers had to navigate complex pointer-based systems, making data access and application development incredibly rigid and error-prone. Codd’s genius was to propose a mathematical, set-theoretic foundation for data. He introduced the concept of relations, tuples, and attributes, providing a logical view of data that was entirely independent of its physical representation. This abstraction was truly revolutionary.
The immediate impact wasn’t a sudden industry overhaul, but rather a profound intellectual awakening within research circles. It provided a theoretical bedrock that was provable, elegant, and offered a clear path to data independence. This meant applications could be written without intimate knowledge of how data was physically stored, dramatically improving maintainability and flexibility. While it took years for the practical implementations to catch up, Codd’s paper laid the theoretical groundwork that would define database systems for decades. It’s impossible to overstate its importance; it gave us a common language and a rigorous framework. You can read his original vision at /papers/codd-relational-model/.
Q: Codd’s theory was elegant, but how did it translate into practical systems? Can you describe the journey from Codd’s paper to the widespread adoption of SQL and relational databases in the 1980s?
A: The journey from theory to widespread practice was fascinating and arduous, largely driven by projects like IBM’s System R. Begun in 1974 at IBM’s San Jose Research Laboratory, the predecessor of Almaden, System R was a research prototype designed to demonstrate the feasibility of Codd’s relational model. It wasn’t just about proving the theory; it was about tackling the immense engineering challenges: query optimization, transaction management, concurrency control, and recovery. The team, including Don Chamberlin and Raymond Boyce, who developed SQL (initially SEQUEL), had to invent many of the techniques that are now commonplace in every relational database.
System R, which first became operational in 1976, showed that a relational system could achieve performance comparable to, and often better than, the existing hierarchical and network models, especially for complex queries. The development of SQL as a declarative query language was critical; it allowed users to specify what data they wanted, rather than how to navigate to it. This simplicity, combined with the underlying theoretical rigor, made relational databases incredibly attractive. By the 1980s, commercial products like Oracle, Ingres, and eventually IBM’s DB2, building on System R’s innovations, brought relational technology to the mainstream, fundamentally changing enterprise data management.
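To make the "what, not how" distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table, rows, and threshold are invented for illustration, and the hand-written loop only stands in for the record-at-a-time navigation that pre-relational systems required; it is not a model of any particular hierarchical or network database.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ada", "research", 120), ("Bob", "sales", 90), ("Cy", "research", 100)],
)

# Declarative: state WHAT is wanted; the engine chooses the access path.
declarative = conn.execute(
    "SELECT name FROM employee WHERE dept = ? AND salary > ? ORDER BY name",
    ("research", 95),
).fetchall()

# Navigational style: the program spells out HOW to walk and filter records.
navigational = []
for name, dept, salary in conn.execute("SELECT name, dept, salary FROM employee"):
    if dept == "research" and salary > 95:
        navigational.append((name,))
navigational.sort()

print(declarative)
print(navigational)
```

Both approaches return the same rows, but only the declarative form leaves the engine free to add an index or change the scan order without the application changing.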
Q: The relational model brought strong consistency and ACID guarantees. How critical were these concepts, and did their strictness ever become a limitation as data needs grew?
A: ACID properties (Atomicity, Consistency, Isolation, Durability) were absolutely critical. They addressed a fundamental challenge in data management: ensuring data integrity and reliability, especially in multi-user, concurrent environments. Before transactional guarantees were built into the database, developers had to manually manage complex locking mechanisms and recovery procedures, which was incredibly difficult to get right. Jim Gray’s /papers/transaction-concept/ paper formalized many of these ideas, providing the bedrock for reliable database operations. Atomicity meant that a transaction either fully completed or had no effect at all, and together with the other three properties it left the database in a consistent state even under concurrency and failure. This was a non-negotiable requirement for financial systems, inventory management, and any application where data accuracy was paramount.
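The all-or-nothing behavior described above can be sketched with Python's built-in sqlite3 module. The account table, the names, and the `transfer` helper are hypothetical; the only library behavior relied on is that a connection used as a context manager commits on success and rolls back when an exception escapes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds in one transaction; True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM account"))


overdraft_ok = transfer(conn, "alice", "bob", 500)   # violates the CHECK constraint
after_failed = balances(conn)
small_ok = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(overdraft_ok, after_failed, small_ok, after_success)
```

In the failed transfer the credit to the destination account has already been applied when the debit violates the constraint, and the rollback undoes it, so no partial update is ever visible.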
However, as you rightly point out, their strictness eventually became a limitation, particularly with the advent of the internet and the need for massive horizontal scaling. Achieving strong consistency across geographically distributed systems incurs significant performance penalties due to the overhead of distributed commit protocols and synchronization. For applications demanding extreme availability and partition tolerance, such as large-scale web services, the latency introduced by strict ACID guarantees could be prohibitive. This tension between consistency and availability would become a central theme in the next wave of database innovation, leading to the exploration of alternative consistency models.
Q: The 1990s saw the rise of object-oriented programming. There was a significant push for Object-Oriented Database Management Systems (OODBMS). Why didn’t they manage to unseat relational databases, and what lessons did that era teach us?
A: The object-oriented era was a fascinating, albeit ultimately unsuccessful, challenge to relational dominance. OODBMS emerged from the frustration developers felt with the "impedance mismatch" between object-oriented programming languages and relational databases. Representing complex objects—with their methods, inheritance, and encapsulation—in flat relational tables required a lot of boilerplate code and mapping layers, which was cumbersome. OODBMS promised to eliminate this by directly storing objects, offering a more natural persistence model for object-oriented applications. Companies like ObjectStore, Versant, and GemStone gained traction in niche markets, particularly CAD/CAM, telecommunications, and scientific applications where complex data structures were common.
However, OODBMS faced several hurdles. They lacked a universally accepted theoretical foundation comparable to Codd’s relational model, leading to fragmentation and proprietary query languages. The existing investment in relational technology and SQL was enormous, and the performance benefits of OODBMS for general-purpose business applications often didn’t outweigh the cost of retraining and migration. Furthermore, the relational ecosystem itself evolved: the databases gained user-defined types (UDTs) and other object-relational features, while object-relational mapping (ORM) tools in the application layer mitigated some of the impedance mismatch. The key lesson was that while data modeling is important, a robust query language, a strong theoretical basis, and a mature ecosystem are equally, if not more, crucial for widespread adoption.
Q: As the web exploded in the early 2000s, traditional relational databases began to show strain under unprecedented scale requirements. This led to the “NoSQL” movement. What were the core drivers behind NoSQL, and which systems exemplified this shift?
A: The explosion of the web, coupled with the rise of massive online services like Amazon and Google, exposed the Achilles’ heel of traditional relational databases: their inherent difficulty in scaling horizontally across commodity hardware. Relational systems were primarily designed for vertical scaling—making a single machine more powerful—which became economically and physically unfeasible for petabyte-scale data and millions of concurrent users. The core drivers for NoSQL were thus horizontal scalability, high availability, and the ability to handle schema-less or semi-structured data more flexibly than rigid relational schemas allowed.
The foundational systems that truly exemplified this shift were Google’s Bigtable (2006) and Amazon’s Dynamo (2007). Bigtable, a sparse, distributed, persistent multi-dimensional sorted map, powered many of Google’s internal services, demonstrating how to build a highly available, scalable system that could handle vast amounts of data. Similarly, Amazon’s Dynamo paper described a highly available key-value store designed to power Amazon’s shopping cart and other critical services. Dynamo prioritized availability over strong consistency, embracing eventual consistency, which was a radical departure from the ACID world. Other systems like Cassandra (inspired by Bigtable and Dynamo), MongoDB (document-oriented), and Redis (in-memory key-value) quickly followed, each offering different trade-offs in terms of data model, consistency, and performance, addressing specific use cases where traditional RDBMS fell short. The broader theoretical underpinning of these eventually-consistent designs is captured in the CAP theorem paper.
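The phrase "sparse, multi-dimensional sorted map" is easier to picture with a toy model. The sketch below keeps cells keyed by (row, column, timestamp) in sorted order inside a single Python process. The class name, methods, and sample rows are hypothetical, and the distributed and persistent aspects are deliberately left out.

```python
from bisect import bisect_left, insort


class TinySortedMap:
    """Toy, single-process stand-in for a (row, column, timestamp) -> value map."""

    def __init__(self):
        self._keys = []    # kept sorted: (row, column, -timestamp)
        self._cells = {}   # only cells that were written exist (sparse)

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._cells:
            insort(self._keys, key)
        self._cells[key] = value

    def get_latest(self, row, column):
        i = bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1


table = TinySortedMap()
table.put("com.cnn.www", "contents:", 1, "<html>v1")
table.put("com.cnn.www", "contents:", 2, "<html>v2")
table.put("org.example.www", "contents:", 1, "<html>other")

latest = table.get_latest("com.cnn.www", "contents:")
com_rows = list(table.scan("com", "con"))
print(latest, com_rows)
```

Because keys are sorted by row, a range scan over adjacent rows touches a contiguous slice of the key list, which is the property that makes row-key design matter so much in systems of this family.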
Q: The NoSQL movement introduced a spectrum of consistency models beyond strict ACID. Can you elaborate on the implications of eventual consistency and how it changed application development paradigms?
A: Eventual consistency was arguably the most significant conceptual shift introduced by the NoSQL movement, moving away from the ACID-era assumption that every read reflects the most recent committed write. In an eventually consistent system, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. In other words, replicas of the data will eventually converge. The implication is that for a period after an update, different users or applications might see different, potentially stale, versions of the data.
This paradigm shift profoundly impacted application development. Developers could no longer assume immediate data consistency; they had to design applications to be tolerant of transient inconsistencies. This meant implementing strategies like idempotent operations, conflict resolution mechanisms (e.g., “last writer wins” or application-specific logic), and explicit handling of potential data staleness. For use cases like social media feeds, shopping cart updates, or sensor data collection, where immediate consistency isn’t always critical and high availability is paramount, eventual consistency proved to be a powerful enabler. It allowed systems to achieve massive scale and fault tolerance by relaxing the strict consistency requirements that were bottlenecks in distributed relational systems. It forced a new way of thinking about data integrity and system design, trading off immediate data accuracy for higher availability and partition tolerance.
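As an illustration of the "last writer wins" strategy mentioned above, here is a minimal sketch of two replicas that temporarily disagree and then converge. The `Versioned` type, the replica names, and the assumption that timestamps are comparable across replicas are all simplifications made for the example; real systems often use vector clocks or application-specific merges instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int    # assumed comparable across replicas (a simplification)
    replica_id: str   # tie-breaker so every replica picks the same winner


def merge(a, b):
    """Last-writer-wins: keep the write with the larger (timestamp, replica_id)."""
    return a if (a.timestamp, a.replica_id) >= (b.timestamp, b.replica_id) else b


# Two replicas start in agreement, then only replica B sees a new write.
replica_a = Versioned("cart: shoes", timestamp=1, replica_id="a")
replica_b = Versioned("cart: shoes", timestamp=1, replica_id="a")
replica_b = merge(replica_b, Versioned("cart: shoes, socks", timestamp=2, replica_id="b"))

stale_read = replica_a.value  # replica A still serves the old cart

# Anti-entropy exchange: each side merges the other's state.
replica_a, replica_b = merge(replica_a, replica_b), merge(replica_b, replica_a)
converged = replica_a == replica_b
print(stale_read, replica_a.value, converged)
```

The merge is commutative and idempotent, so replicas converge regardless of the order or number of exchanges. The cost is that one of two concurrent writes is silently discarded, which is why the choice of conflict resolution is an application-level decision.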
Q: Despite the success of NoSQL, SQL didn’t disappear. The “NewSQL” movement emerged, attempting to combine the scalability of NoSQL with the transactional guarantees and familiarity of SQL. What were the key innovations here, and how successful have they been?
A: The NewSQL movement, which gained prominence in the early 2010s, was a direct response to the pendulum swinging perhaps too far towards eventual consistency and schema flexibility. Many enterprises, while needing scale, were unwilling to sacrifice the strong transactional guarantees, rich query capabilities, and mature tooling that SQL and ACID provided. NewSQL systems aimed to offer the best of both worlds: the horizontal scalability and fault tolerance of NoSQL, combined with the strong consistency, relational data model, and SQL interface of traditional RDBMS.
Key innovations included distributed transaction protocols that could span multiple nodes while still enforcing ACID properties, often leveraging techniques like two-phase commit or Paxos/Raft for consensus. Google’s Spanner, detailed in its 2012 paper, was a pioneering example: its TrueTime API exposes clock uncertainty as an explicit bound, which lets the system assign globally meaningful commit timestamps and provide external consistency across a globally distributed database. Other notable NewSQL systems like CockroachDB and TiDB built on these principles, providing distributed, fault-tolerant relational databases. They’ve been quite successful in hybrid environments and for specific applications that demand both scale and strict transactional integrity, such as financial trading platforms or complex inventory systems. While they add complexity compared to simpler NoSQL stores, they offer a compelling solution for organizations that need the analytical power of SQL with distributed resilience. The SQL ecosystem, with decades of tooling and expertise, proved incredibly resilient, as highlighted by discussions at events like the /papers/1995-sql-reunion/, demonstrating its enduring value.
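For readers unfamiliar with two-phase commit, the following is a deliberately simplified sketch of the voting and completion phases. The `Participant` class and node names are hypothetical, and everything that makes the protocol hard in practice (durable logging, timeouts, coordinator failure) is omitted; it does not represent how any of the systems named above implement transactions.

```python
class Participant:
    """Hypothetical in-memory participant; real ones log to durable storage."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local changes could be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting): every participant is asked to prepare.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (completion): the single decision is sent to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


ok_nodes = [Participant("eu"), Participant("us")]
bad_nodes = [Participant("eu"), Participant("us", can_commit=False)]
ok_result = two_phase_commit(ok_nodes)
bad_result = two_phase_commit(bad_nodes)
print(ok_result, [p.state for p in ok_nodes])
print(bad_result, [p.state for p in bad_nodes])
```

A single "no" vote aborts the transaction on every node. The extra round trip between voting and completion is one source of the latency cost that strict distributed transactions carry.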
Q: Let’s fast forward to the present. The rise of AI and machine learning has introduced a new type of data: vector embeddings. What are vector embeddings, and why do they necessitate a new category of databases, or at least new features in existing ones?
A: Vector embeddings are a truly transformative data type, emerging directly from the advancements in artificial intelligence, particularly deep learning. Essentially, a vector embedding is a numerical representation of an object—it could be a word, an image, a document, a user profile, or even a complex concept—in a high-dimensional space. These vectors are generated by machine learning models trained to capture semantic meaning, context, and relationships. Objects that are semantically similar will have vector embeddings that are “close” to each other in this high-dimensional space.
Traditional databases, whether relational or NoSQL, are optimized for structured queries based on exact matches, range queries, or string comparisons. They are not designed to efficiently perform “similarity searches” across millions of high-dimensional vectors, which involves calculating distances (e.g., cosine similarity, Euclidean distance) between vectors. A brute-force scan compares the query against every stored vector, so its cost grows linearly with both the number of vectors and their dimensionality, which becomes prohibitive at scale. This need for efficient nearest-neighbor search, and in practice approximate nearest-neighbor (ANN) search, is what necessitates vector databases or specialized vector indexing capabilities. Systems like pgvector for PostgreSQL (released around 2021), or dedicated vector databases like Pinecone and Weaviate, have emerged to address this by implementing specialized indexing algorithms (like HNSW or IVF_FLAT) that can quickly find vectors similar to a query vector, even in massive datasets.
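A small sketch shows what the brute-force baseline looks like and why it does not scale: every query touches every stored vector. The three-dimensional vectors and labels below are made up for readability; real embeddings typically have hundreds or thousands of dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Exact k-nearest neighbours by scanning every vector: O(N * d) per query."""
    scored = [(cosine_similarity(query, v), key) for key, v in vectors.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical toy embeddings, invented for illustration.
vectors = {
    "running shoe": [0.9, 0.1, 0.0],
    "trail sneaker": [0.8, 0.2, 0.1],
    "coffee mug": [0.0, 0.1, 0.9],
}
top_two = nearest([1.0, 0.0, 0.0], vectors, k=2)
print(top_two)
```

ANN indexes trade a small amount of recall for avoiding this full scan, which is the design choice that separates a vector index from a loop like the one above.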
Q: Are vector databases a replacement for existing data stores, or an augmentation of them?
A: Vector databases are overwhelmingly an augmentation, not a replacement, for existing data stores. They represent a specialized layer in the modern data stack, designed to handle a very specific, albeit increasingly critical, workload: semantic search and similarity matching. In most real-world applications, vector embeddings are generated from or associated with structured or semi-structured data that resides in traditional relational databases, document stores, or object storage.
Consider a retail application: product descriptions, images, and customer reviews might be stored as text and image files, with metadata in a relational database. Machine learning models would then generate vector embeddings for these product descriptions and images. These embeddings, along with a reference back to the original product ID, would be stored in a vector database. When a customer searches for “comfortable running shoes,” the search query is converted into a vector, which is then used to find similar product vectors in the vector database. The vector database returns the IDs of the most similar products, which are then used to retrieve the full product details from the relational database. This hybrid approach allows applications to leverage the semantic understanding provided by vectors while retaining the robust transactional capabilities and structured querying of traditional databases for other aspects of the data. This polyglot persistence strategy is becoming the norm.
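The retail flow described above can be sketched in a few lines: a vector lookup returns product IDs, and a relational query fetches the details. Here a plain dictionary stands in for the vector database and SQLite for the relational store. The products, the two-dimensional embeddings, and the query vector are all invented, and in a real system the query vector would come from the same embedding model as the product vectors.

```python
import math
import sqlite3

# Relational side: authoritative product details (hypothetical rows).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
db.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [
        (1, "Cushioned running shoe", 89.0),
        (2, "Leather dress shoe", 120.0),
        (3, "Trail running shoe", 99.0),
    ],
)

# Vector side: product ID -> embedding (toy 2-d vectors, invented).
vector_index = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.8, 0.3]}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(query_vector, k):
    # Step 1: similarity search yields only the IDs of the closest products.
    ranked = sorted(
        vector_index,
        key=lambda pid: cosine(query_vector, vector_index[pid]),
        reverse=True,
    )
    ids = ranked[:k]
    # Step 2: fetch full details from the relational store by ID.
    placeholders = ",".join("?" * len(ids))
    rows = db.execute(
        f"SELECT id, name, price FROM product WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[pid] for pid in ids]  # preserve the similarity ranking


results = search([1.0, 0.0], k=2)  # stands in for "comfortable running shoes"
print(results)
```

Keeping only IDs and vectors in the vector layer means prices and stock levels continue to be updated transactionally in one place.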
Q: One of the most talked-about applications of vector databases is Retrieval Augmented Generation (RAG) in the context of Large Language Models (LLMs). Can you explain how vector databases enable RAG?
A: RAG is a truly groundbreaking application that perfectly illustrates the power of vector databases in the age of LLMs. Large Language Models are incredibly powerful, but they have limitations: they can “hallucinate” (generate factually incorrect information), their knowledge is capped at their training data, and they struggle with very specific, up-to-date, or proprietary information. RAG addresses these issues by augmenting the LLM’s generation process with relevant, authoritative information retrieved from an external knowledge base.
Here’s where vector databases come in: the external knowledge base (which could be internal documents, web pages, company wikis, etc.) is first “chunked” into smaller, semantically meaningful pieces. Each chunk is then converted into a vector embedding and stored in a vector database. When a user asks a question, the question itself is converted into a vector. This query vector is then used to perform a similarity search in the vector database, retrieving the top N most semantically relevant chunks of information. These retrieved chunks are then passed to the LLM as context alongside the original user query. The LLM then uses this specific, relevant information to formulate its answer, significantly reducing hallucinations and grounding its responses in factual data. It’s a powerful pattern that allows LLMs to interact with and effectively utilize vast, dynamic, and domain-specific knowledge bases, making them far more reliable and useful for enterprise applications.
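The retrieval half of the pipeline described above can be sketched end to end without any external services. Everything here is a stand-in: `embed` is a toy word-count vector rather than a learned embedding model, the in-memory list replaces a vector database, the documents are invented, and the final prompt is simply printed instead of being sent to an LLM.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=12):
    """Split text into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, index, top_n=2):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_n]]


def build_prompt(question, context_chunks):
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical knowledge base, invented for illustration.
docs = [
    "Refunds are issued within 14 days of a returned order.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

question = "How many days do refunds take?"
top_chunks = retrieve(question, index, top_n=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Swapping the toy `embed` for a real model and the list for a vector index changes the quality and scale of retrieval, but not the shape of the flow: chunk, embed, search, then assemble the prompt.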
Q: Looking back at the last 50 years, what common patterns or cycles do you observe in database innovation? Are we simply rediscovering old ideas with new technologies, or is there genuine, continuous evolution?
A: That’s an insightful question, and the answer is a bit of both. We definitely see cyclical patterns. The pendulum swings between consistency and availability, between structured and schema-less, between centralized and distributed. For instance, the move from hierarchical/network to relational was about abstraction and data independence. The move to NoSQL was about scaling beyond relational limits, often sacrificing some of that abstraction or consistency for performance. NewSQL then tried to bring back the best of relational to the distributed world. Vector databases, while new in their data type, are still about efficient retrieval and indexing, just for a different kind of query. Even the relational model, which deliberately hid access paths from the user, depended on efficient data access underneath, as seen in the subsequent System R work on /papers/access-path-selection/.
However, it’s not merely rediscovery. Each cycle is built upon the technological advancements of the previous one. We’re not just re-implementing 1970s ideas; we’re applying them to vastly different scales, hardware, and application demands. The underlying compute power, network bandwidth, and storage capabilities today are orders of magnitude beyond what was available in the relational era. Furthermore, entirely new data types and use cases, like vector embeddings for AI, represent genuine evolution. We’re constantly raising the bar on what “data” means and what we expect to do with it. The core problems of data persistence, integrity, and retrieval remain, but the solutions become increasingly sophisticated, specialized, and capable of handling unprecedented complexity and volume. It’s a continuous, dynamic evolution, driven by both enduring principles and emerging technological needs.
Key takeaways
For practitioners deploying vector indexes on local infrastructure, the parallel discussion in Ollama LLM local PHP integration and the open-source LLM installation guide for Ollama, Mistral and Llama explains the runtime side of the same pipeline.
- Relational Foundations Endure: Codd’s 1970 relational model provided a robust theoretical and practical framework that shaped database systems for decades, emphasizing data independence and logical organization, despite later challenges.
- Cycles of Consistency vs. Availability: Database innovation often cycles between prioritizing strong consistency (ACID, relational, NewSQL) and high availability/scalability (NoSQL, eventual consistency), driven by changing application demands and hardware constraints.
- The Power of Abstraction and Query Languages: SQL’s declarative nature and the abstraction offered by relational systems proved incredibly powerful and resilient, influencing subsequent database designs even in non-relational contexts.
- Specialization for New Workloads: The rise of specific challenges, like massive web scale or AI-driven semantic search, has led to the emergence of specialized database types (NoSQL, Vector DBs) that optimize for particular data models, consistency tradeoffs, or query patterns.
- Polyglot Persistence is the New Norm: Modern applications increasingly adopt a “polyglot persistence” approach, combining different database technologies (relational, NoSQL, vector, graph) to leverage their respective strengths for various parts of an application’s data needs.
Further reading
- See also: the Dynamo-to-FoundationDB lineage
- See also: vector databases and RAG
- Codd’s Relational Model of Data
- The Transaction Concept: Virtues and Limitations
- Access Path Selection in a Relational Database Management System
- The 1995 SQL Reunion: People, Projects, and Politics
- Dynamo: Amazon’s Highly Available Key-value Store