Martin Kleppmann is the author of Designing Data-Intensive Applications (DDIA), a widely influential book on the principles behind reliable, scalable, and maintainable systems. Nine years after the first edition, a second edition has arrived, co-authored with Chris Riccomini, reflecting how cloud-native infrastructure, data systems that support AI, and industry practice have evolved. The episode covers Martin’s career path, what changed between editions, and his current research in academia.
From startups to LinkedIn to writing DDIA
Martin’s early career included two UK-based startups: a cross-browser testing service (GoTest, built on Selenium) that struggled with adoption, and Rapportive, a Gmail browser extension that showed social profiles next to emails.
Rapportive was acquired by LinkedIn in 2012 when the team was about five people. The team stayed together and continued building products inside LinkedIn.
At LinkedIn, Martin worked on stream processing on top of Kafka (which had just been open sourced) and on Samza, a stream processing framework.
Kafka was created at LinkedIn to solve a data integration problem: many upstream systems generated event streams, and many downstream systems (data warehouses, Hadoop/ML pipelines) needed to consume them. Kafka served as a general-purpose append-only log abstraction for moving data between systems.
Working at LinkedIn scale was Martin’s first exposure to truly large systems and gave him the foundational understanding that fed directly into the book.
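The append-only log idea described above can be sketched in a few lines. This is a minimal in-memory illustration, not Kafka's API: the class names `AppendOnlyLog` and `Consumer` and their methods are hypothetical. It shows the property that matters for data integration: producers only append, and every consumer keeps its own offset, so a slow consumer never holds back a fast one.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyLog:
    """In-memory stand-in for one log partition (hypothetical, not Kafka's API)."""
    records: list = field(default_factory=list)

    def append(self, record):
        """Add a record at the end and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Each consumer tracks its own position in the shared log."""
    log: AppendOnlyLog
    offset: int = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

A data warehouse loader and an ML pipeline would each hold their own `Consumer` over the same log and read the same records at different speeds.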
Writing the first edition
The book was motivated by Martin’s own experience of lacking foundational knowledge when debugging performance issues at Rapportive, and later by seeing how much senior engineers understood that had never been written down.
He learned by talking to senior engineers at LinkedIn, reading research papers, and reading blog posts, then distilled the essential ideas into a practitioner-focused overview rather than a theoretical textbook.
The three-part structure (foundational data systems, distributed data, derived data) emerged somewhat after the fact; the chapter topics were clear from the initial proposal, but the internal structure of each chapter was decided during writing.
Writing took about four years (roughly 2.5 years of full-time-equivalent work), and Martin missed the publisher deadline by about 2.5 years. O’Reilly was relaxed about this for the first edition but more deadline-driven for the second.
Reliability, scalability, and maintainability
Reliability means fault tolerance: the system continues working despite network interruptions, node crashes, or other failures. Much of the book covers techniques like replication that support this.
Scalability is about mechanisms for handling changes in load, especially horizontal scaling (adding more machines) rather than just buying a bigger machine. Martin also emphasizes scaling down: making very low-load services extremely cheap to run, which serverless systems have made more practical.
Maintainability is the third pillar, though the episode focuses more on reliability and scalability.
The second edition: what changed
The biggest structural update is the shift to cloud-native systems architecture: building on top of cloud services (especially object stores like S3) as the foundational abstraction rather than machines with local disks.
In the first edition, replication happened at the database level between machines with local disks. Now many systems build on object stores, and replication happens at the storage layer, which changes the nature of how databases are designed.
This idea is woven throughout the entire book rather than confined to one section.
MapReduce coverage was reduced: MapReduce is essentially dead as a technology people use directly; successors like Spark and Flink are what matter now. MapReduce is retained mainly as a learning tool for understanding partitioned batch processing.
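To show why MapReduce still works as a learning tool, here is a single-process sketch of its three phases applied to word counting. It is an illustration only: the function names and the toy partitioner are assumptions, and a real framework would run mappers and reducers on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs, num_partitions):
    """Group values by key, routing every key to exactly one partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # Toy deterministic partitioner: sum of code points modulo partition count.
        index = sum(map(ord, key)) % num_partitions
        partitions[index][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reducer: combine all values that share a key within one partition."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(documents, num_partitions=2):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    result = {}
    for partition in shuffle(pairs, num_partitions):
        # Partitions hold disjoint keys, so their outputs can simply be merged.
        result.update(reduce_phase(partition))
    return result
```

Because each key lands in exactly one partition, the reducers need no coordination with each other, which is the core idea of partitioned batch processing.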
New coverage was added for systems supporting AI, particularly:
Vector indexes, added to the storage engine chapter as another indexing strategy alongside B-trees and LSM trees.
Data frames, added as an important data model for training data alongside relational, graph, and document models.
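As a rough picture of what a vector index answers, the sketch below finds the stored vectors closest to a query by cosine similarity. It scans every vector, which is the baseline that real vector indexes avoid by using approximate structures; the function names and sample vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=1):
    """Return the ids of the k stored vectors most similar to query (exhaustive scan)."""
    ranked = sorted(
        vectors,
        key=lambda item_id: cosine_similarity(query, vectors[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical two-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.0, 1.0]}
```

The exhaustive scan costs time proportional to the number of stored vectors per query, which is why a dedicated index structure becomes necessary at scale.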
The second edition was co-authored with Chris Riccomini, an old LinkedIn colleague and author of The Missing README. Chris brought up-to-date knowledge of industry trends; Martin brought teaching experience and writing style.
Tradeoffs of using cloud services and managed abstractions
Using higher-level cloud abstractions means engineers no longer need to think about lower-level details, analogous to how garbage collection freed developers from manual memory management.
For higher-level business logic, this is fine. But someone still has to build and operate the lower-level abstractions, so the skills shift rather than disappear.
The book’s philosophy remains that even when using cloud services, understanding a bit about internals (e.g., how storage engines work) gives engineers a superpower for diagnosing performance issues and making good tradeoff decisions.
Multi-region and multi-cloud setups push toward higher availability but introduce consistency tradeoffs and higher cost. Martin notes that geopolitical risk (e.g., Europe being locked out of US cloud services) is making multi-cloud more seriously considered for critical workloads, even though it sits at the expensive end of the spectrum.
Sharding and scale in the cloud era
Achieving very high scale is still challenging because sharding still requires application-level engineering; it can’t be made entirely transparent.
However, the cloud has made it easier to scale down: serverless systems can spin instances up and down very cheaply, enabling extremely lightweight services (Martin’s personal website costs about 13 cents/month).
Sharding across multiple machines may be becoming slightly less pressing because individual machines are more powerful, and more workloads can run on a single machine. But it’s not going away, and replication remains important for fault tolerance even at smaller scales.
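A minimal sketch of hash-based sharding helps show why it cannot be fully transparent to the application. The names `shard_for` and `ShardedStore` are hypothetical, and the in-memory dictionaries stand in for separate machines.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard using a stable hash.

    Python's built-in hash() is salted per process for strings, so it is
    unsuitable for routing that must agree across processes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Key-value store split across shards; each dict stands in for one machine."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

Single-key reads and writes route cleanly, but anything that spans keys (a range scan, a join, a multi-key transaction) has to touch several shards, and changing `num_shards` with simple modulo routing moves most keys. Those are the parts the application still has to design around.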
The trouble with distributed systems
This chapter defends the theoretical models used in distributed systems by showing that the weird edge cases the theory assumes are real and common.
Network delays have no reliable upper bound; messages can take much longer than typical.
Crashes are ambiguous: a node might be disconnected, software-crashed, hardware-failed, or have its power cable unplugged.
Clocks are not precise enough to rely on for correctness.
The chapter draws on postmortems and real incidents (sharks biting undersea cables, cows stepping on cables) to illustrate that failures are not rare. The goal is to help people make educated tradeoffs between risk and cost, not to prescribe a specific level of reliability.
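The point about clocks can be made concrete with a last-write-wins register, a common conflict-resolution scheme that trusts timestamps. In this sketch the skew values and the names `LwwRegister` and `local_clock` are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LwwRegister:
    """Last-write-wins register: keeps the value carrying the highest timestamp."""
    value: object = None
    timestamp: float = float("-inf")

    def write(self, value, timestamp):
        """Apply the write only if its timestamp is newer; return whether it was applied."""
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp
            return True
        return False


def local_clock(true_time, skew):
    """What a node believes the time is, given its clock error in seconds."""
    return true_time + skew


register = LwwRegister()
# Node A's clock runs 5 seconds fast; node B's clock is accurate.
first = register.write("x=1", local_clock(true_time=100.0, skew=5.0))
# Two real seconds later, a newer write arrives via node B.
second = register.write("x=2", local_clock(true_time=102.0, skew=0.0))
```

The second write happened later in real time, yet it is silently discarded because node A's fast clock stamped the first write with a higher timestamp. No error is raised, which is what makes relying on clocks for correctness risky.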
Ethics and responsibility
The final chapter, “Doing the Right Thing,” argues that engineers building systems with societal impact have a responsibility to consider consequences and make intentional decisions about what kind of world they are creating.
Martin felt this had been ignored in his industry experience, especially in startups focused on growth and data harvesting for advertising.
Engineers are in a strong position to articulate societal risks (not just technical risks) to business leaders and should not sweep ethical concerns under the carpet.
Formal verification and AI
Formal verification ranges from model checking (using specification languages like TLA+ or Alloy, whose tools automatically explore the states of a bounded model looking for violations) to full formal proofs (mathematical proofs that an algorithm always satisfies a specification, using proof assistants like Isabelle, Coq, or Lean).
Unlike testing, which checks specific examples, formal proofs can reason about infinite state spaces and guarantee that an algorithm never violates its specification.
Writing formal proofs is very laborious and time-consuming, which is why Martin never used them in industry but found them valuable in academia for subtle, high-stakes algorithms.
Martin believes formal verification will become more important because:
LLMs are getting better at writing proofs, making them more accessible.
AI-generated code (vibe coding) creates a need for automated correctness checking, since humans can’t manually review all generated code.
In security contexts, a single bug can destroy the security of an entire system, making the exhaustive guarantees of formal verification especially valuable.
He recommends engineers start with model checking (TLA+, Alloy) before attempting full proof assistants.
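For a feel of what a model checker does before picking up TLA+ or Alloy, the sketch below explores every reachable state of a tiny model in plain Python. The model is an assumption made up for illustration: two processes each increment a shared counter with a separate read step and write step. This is not TLA+ syntax and not how any particular tool is implemented.

```python
from collections import deque

# State: (counter, ((pc, local), (pc, local))), one (pc, local) pair per process.
INITIAL = (0, (("read", None), ("read", None)))


def next_states(state):
    """Yield every state reachable by letting one process take one step."""
    counter, procs = state
    for i, (pc, local) in enumerate(procs):
        if pc == "read":
            new_counter, new_proc = counter, ("write", counter)
        elif pc == "write":
            new_counter, new_proc = local + 1, ("done", None)
        else:
            continue  # This process has finished.
        yield (new_counter, procs[:i] + (new_proc,) + procs[i + 1:])


def invariant(state):
    """Once every process is done, the counter must equal the number of processes."""
    counter, procs = state
    all_done = all(pc == "done" for pc, _ in procs)
    return not all_done or counter == len(procs)


def check(initial):
    """Breadth-first search over all reachable states; return a violating trace or None."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for successor in next_states(state):
            if successor not in parent:
                parent[successor] = state
                queue.append(successor)
    return None
```

Calling `check(INITIAL)` returns a step-by-step trace of an interleaving in which both processes read 0 before either writes, so one increment is lost. A unit test would only find this if it happened to hit that schedule; the exhaustive search finds it because it tries every schedule of the small model.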
Academia vs. industry
Academia allows much longer-term thinking. Martin gives the example of “local-first software,” which aims to reduce user dependence on centralized cloud providers. This goes against the commercial incentives of SaaS businesses (which depend on subscription lock-in), so it’s unlikely to be pursued by startups but is viable as academic research.
Industry is focused on shipping products on shorter timescales, with clearer requirements for infrastructure work.
Martin sees his role as bridging both: bringing research insights into industrial practice and informing research with real-world problems.
Local-first software
The vision is collaborative software (like Google Docs or Figma) that doesn’t depend on a single centralized cloud provider. Users could sync across multiple providers or peer-to-peer, and if one provider disappears, the system continues working.
This introduces hard engineering challenges, particularly around access control in decentralized settings:
If a user’s edit permissions are revoked concurrently with an edit they make, different devices may see the two events in a different order and permanently disagree about whether the edit was applied.
Solving this without consensus (to preserve high availability and offline work) is much harder than in a centralized model where one server makes the decision.
Clocks can’t be relied on because users can forge timestamps.
Martin’s team is close to solving this for Automerge, a CRDT library he works on.
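The revocation problem described in this section can be reproduced with a deliberately naive replica that applies operations in arrival order. The `Replica` class and the operation tuples are hypothetical and are not Automerge's API; the sketch shows the problem only, not the solution Martin's team is working on.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Applies operations in whatever order they arrive (naive on purpose)."""
    editors: set = field(default_factory=lambda: {"alice", "bob"})
    text: list = field(default_factory=list)

    def apply(self, op):
        kind, user, payload = op
        if kind == "revoke":
            self.editors.discard(payload)
        elif kind == "edit" and user in self.editors:
            self.text.append(payload)
        # Edits from users without permission are silently dropped.


# Two concurrent operations: alice revokes bob while bob is editing offline.
revoke = ("revoke", "alice", "bob")
edit = ("edit", "bob", "hello")

phone = Replica()
laptop = Replica()
for op in (revoke, edit):
    phone.apply(op)
for op in (edit, revoke):
    laptop.apply(op)
```

Both replicas have received the same two operations and agree that bob is no longer an editor, yet `phone.text` is empty while `laptop.text` contains bob's edit. Without a central server or consensus to pick one order, and without trustworthy timestamps, the replicas need some other deterministic rule that every device will evaluate the same way.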
Computer science education
Martin teaches concurrent and distributed systems (undergraduate), cryptographic protocol engineering (master’s), and other courses at Cambridge.
The distributed systems course is available on YouTube and goes deeper into algorithms than the book, including a full walkthrough of the Raft consensus algorithm.
AI has disrupted assessment: banning AI is unenforceable and counterproductive, so the challenge is ensuring students use it in ways that support learning rather than undermine it.
A boot camp at the start of the first year now exposes students to version control, unit testing, and generative AI basics.
The key distinction: in industry, the desired outcome is a working product, so using AI to get equivalent results faster is fine. In academia, the desired outcome is the thought process and learning, so AI use must preserve that.
Martin’s current research
Local-first software: ongoing work on CRDTs, access control in decentralized settings, and formal verification of algorithms, pursued through open-source work and academic research.
Using cryptography to prove things about the physical world, particularly sustainability:
Verifying carbon emissions numbers in supply chains without revealing commercially sensitive information (e.g., supplier identities).
Supporting EU regulations that require importers of coffee, cocoa, palm oil, etc. to prove their products didn’t come from recently deforested land, using satellite imagery and cryptographic proofs.
Advice for students and young professionals
Industry and academia are not mutually exclusive. Martin has seen the best PhD students come from industry, bringing real-world perspective, while academia teaches nuanced critical thinking and first-principles reasoning that industry often lacks.
He encourages people to weave between both rather than treating them as separate career paths.