How AWS S3 is built

The Pragmatic Engineer 1h18 5 min #61

Summary

  • AWS S3 is the world’s largest cloud storage service, holding over 500 trillion objects across hundreds of exabytes, served from tens of millions of hard drives on millions of servers in 120 availability zones across 38 regions. Mai-Lan Tomsen Bukovec, VP of Data and Analytics at AWS, has led S3 for 13 years and walks through how the system is engineered for extreme durability, availability, and consistency, and how it continues to evolve with new data primitives like tables and vectors.

S3’s origins and early design

  • S3 launched in 2006 as the first AWS service, built to give Amazon engineers a cheap, unstructured store for PDFs, images, and backups.
    • The original design used eventual consistency: a write was acknowledged only once data was durably stored, but a subsequent list or read might not yet reflect it. This was a deliberate trade-off favoring availability and durability over immediate read-after-write consistency.
    • At the time this worked well for e-commerce workloads — if an image didn’t appear immediately, a human would just refresh.
  • Pricing at launch was $0.15/GB/month, roughly a third to a fifth of competing storage, and AWS has continued cutting prices over the years (now around $0.023/GB/month).
    • The mission is to make storage so economical that customers never have to delete data for cost reasons, enabling data lakes (or “data oceans” as Sony’s CEO called them) to grow without constraint.
    • Features like Intelligent Tiering (automatic discounts of up to 40% on data untouched for 30+ days) and Glacier ($0.01/GB/month for archival) extend this philosophy across access patterns.
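To make the price points above concrete, here is a small arithmetic sketch. The prices and the 40% figure come from the summary; the function name, the decimal terabyte (1,000 GB), and the assumption that the full 40% discount applies to the whole data set are illustrative only, not AWS's billing rules.

```python
# Prices per GB-month as quoted in the summary above.
LAUNCH_PRICE = 0.15
CURRENT_PRICE = 0.023
GLACIER_PRICE = 0.01
TIERING_MAX_DISCOUNT = 0.40  # "up to 40%" on data untouched for 30+ days


def monthly_cost(gigabytes: float, price_per_gb: float, discount: float = 0.0) -> float:
    """Monthly storage cost in dollars, rounded to cents (hypothetical helper)."""
    return round(gigabytes * price_per_gb * (1.0 - discount), 2)


ONE_TB_IN_GB = 1000  # assumption: decimal terabyte

cost_at_launch = monthly_cost(ONE_TB_IN_GB, LAUNCH_PRICE)
cost_today = monthly_cost(ONE_TB_IN_GB, CURRENT_PRICE)
cost_tiered = monthly_cost(ONE_TB_IN_GB, CURRENT_PRICE, TIERING_MAX_DISCOUNT)
cost_glacier = monthly_cost(ONE_TB_IN_GB, GLACIER_PRICE)

print(cost_at_launch, cost_today, cost_tiered, cost_glacier)
```

Under these assumptions, one terabyte goes from $150 a month at launch to $23 today, and to roughly $14 if the maximum tiering discount applied to all of it.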

The shift to strong consistency

  • Around 2013–2015, frontier data customers like Netflix and Pinterest began building data lakes on S3 using Hadoop, extending unstructured storage to tabular data. By 2020, enterprises followed, storing Parquet files at exabyte scale.
    • These new workloads needed strong consistency — a read must always reflect the most recent write — which the original eventual-consistency model couldn’t guarantee.
  • S3’s engineers invented a new internal data structure called a replicated journal to deliver strong consistency without sacrificing availability.
    • In the replicated journal, writes flow through storage nodes sequentially; each node forwards to the next and learns the sequence number of the value. On reads, the sequence number is retrieved and compared, guaranteeing the latest value is returned.
    • This was paired with a new cache coherency protocol that introduced the concept of a failure allowance: the system is designed so that multiple servers can receive requests and some are allowed to fail, while the cache coherency protocol ensures correctness.
  • Strong consistency was launched as a free, default property of every S3 request — no latency increase, no extra cost, no opt-in required.
    • This was a deliberate decision: the team debated whether to pass along the additional hardware costs and chose not to, treating consistency as a foundational building block of the service.
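The replicated journal described above can be pictured with a toy model. This is a minimal sketch built only from the summary's description, not S3's actual implementation: the class names (JournalNode, ReplicatedJournal, CachedReader), the three-node chain, and the single in-process counter for sequence numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class JournalNode:
    """One node in the chain; keeps the latest (sequence number, value) per key."""
    entries: Dict[str, Tuple[int, str]] = field(default_factory=dict)

    def apply(self, key: str, seq: int, value: str) -> None:
        self.entries[key] = (seq, value)


class ReplicatedJournal:
    """Toy chain: a write visits every node in order before it is acknowledged."""

    def __init__(self, node_count: int = 3) -> None:
        self.nodes: List[JournalNode] = [JournalNode() for _ in range(node_count)]
        self.next_seq = 1

    def write(self, key: str, value: str) -> int:
        seq = self.next_seq
        self.next_seq += 1
        for node in self.nodes:  # each node learns the sequence number in turn
            node.apply(key, seq, value)
        return seq  # returned only after the last node has applied the write

    def latest(self, key: str) -> Optional[Tuple[int, str]]:
        return self.nodes[-1].entries.get(key)


class CachedReader:
    """Cache that checks its entry against the journal's sequence number."""

    def __init__(self, journal: ReplicatedJournal) -> None:
        self.journal = journal
        self.cache: Dict[str, Tuple[int, str]] = {}

    def read(self, key: str) -> Optional[str]:
        current = self.journal.latest(key)
        if current is None:
            return None
        cached = self.cache.get(key)
        if cached is None or cached[0] < current[0]:
            self.cache[key] = current  # cached copy is missing or stale
        return self.cache[key][1]


journal = ReplicatedJournal()
reader = CachedReader(journal)
journal.write("photo.jpg", "v1")
first_read = reader.read("photo.jpg")
journal.write("photo.jpg", "v2")
second_read = reader.read("photo.jpg")
print(first_read, second_read)  # v1 v2
```

The point of the sketch is the comparison in read: a cached value is served only if its sequence number matches the journal's latest one, so a read that follows an acknowledged write cannot return the older value.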

Correctness through formal methods

  • At S3’s scale, the team cannot rely on testing alone to verify correctness. They use automated reasoning (formal methods) to mathematically prove that their consistency model, cross-region replication, and API behaviors are correct.
    • These proofs are run on every code check-in to the index subsystem, ensuring no regression in the consistency model.
    • Formal methods are applied across multiple areas: consistency proofs cover every combination of edge cases; cross-region replication proofs verify that data arrives; API correctness proofs validate behavior.
    • As Mai-Lan puts it: “At a certain scale, math has to save you.”
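The idea of checking every combination rather than sampling a few can be shown in miniature. The sketch below is a bounded exhaustive check over a toy model, not the proof tooling the S3 team uses; the two read functions, the dictionary-based store and cache, and the depth of 4 operations are all hypothetical. It enumerates every sequence of writes and reads up to that depth and reports the first sequence that breaks read-after-write.

```python
from itertools import product
from typing import Callable, Dict, Optional, Tuple

Store = Dict[str, object]
ReadFn = Callable[[Store, Store], object]


def naive_read(store: Store, cache: Store) -> object:
    """Fill the cache once and never validate it again."""
    if "value" not in cache:
        cache.update(store)
    return cache["value"]


def validated_read(store: Store, cache: Store) -> object:
    """Refresh the cache whenever its sequence number is out of date."""
    if cache.get("seq") != store["seq"]:
        cache.update(store)
    return cache["value"]


def find_violation(read: ReadFn, depth: int = 4) -> Optional[Tuple[str, ...]]:
    """Try every write/read sequence of length `depth`; return a failing one."""
    for ops in product(("write", "read"), repeat=depth):
        store: Store = {"seq": 0, "value": None}
        cache: Store = {}
        for op in ops:
            if op == "write":
                store["seq"] += 1
                store["value"] = f"v{store['seq']}"
            elif read(store, cache) != store["value"]:
                return ops  # a read missed the most recent write
    return None


naive_counterexample = find_violation(naive_read)
validated_counterexample = find_violation(validated_read)
print(naive_counterexample is not None, validated_counterexample)
```

The unvalidated cache fails on some interleaving and the checker returns it as a counterexample, while the sequence-checked version passes every sequence in the bounded space. A real proof covers unbounded behavior, which enumeration like this cannot.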

Durability and how it’s verified

  • S3 promises 11 nines of durability (99.999999999%), a level far beyond typical availability targets.
    • Durability is managed primarily in the storage layer through a combination of software and physical data layout across fault domains (servers, racks, availability zones, regions).
    • Behind the S3 endpoint sit over 200 microservices, a significant portion dedicated to durability: health checks, repair systems, and auditor systems that inspect every single byte across the fleet and trigger automatic repair when needed.
    • The team can answer at any time what the actual durability has been over the past week, month, or year — the math is continuously validated against reality.
    • Servers and drives fail constantly; the system is designed with the assumption that failure is continuous, not exceptional.
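The audit-and-repair loop mentioned above can be sketched as follows. This is a simplified illustration, assuming full copies of each object and SHA-256 checksums; the summary does not say how S3 encodes or verifies data internally, and the function names and in-memory dictionaries are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(
    replicas: List[Dict[str, bytes]], expected: Dict[str, str]
) -> List[Tuple[int, str]]:
    """Check every copy of every object; rewrite bad or missing copies.

    Returns (replica index, key) for each copy that was repaired.
    """
    repaired: List[Tuple[int, str]] = []
    for key, digest in expected.items():
        healthy = next(
            (r[key] for r in replicas if key in r and checksum(r[key]) == digest),
            None,
        )
        if healthy is None:
            raise RuntimeError(f"no healthy copy of {key!r} left to repair from")
        for index, replica in enumerate(replicas):
            if key not in replica or checksum(replica[key]) != digest:
                replica[key] = healthy  # repair from a verified copy
                repaired.append((index, key))
    return repaired


payload = b"hello"
expected = {"obj": checksum(payload)}
replicas = [{"obj": payload}, {"obj": b"hellp"}, {}]  # one corrupt, one missing
repaired = audit_and_repair(replicas, expected)
print(repaired)  # [(1, 'obj'), (2, 'obj')]
```

Running the audit continuously, rather than waiting for a read to hit a bad copy, is what lets repair stay ahead of the constant background rate of drive and server failures.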

Correlated failure, crash consistency, and failure allowances

  • Correlated failure — when multiple components fail together due to shared fault domains (same rack, same availability zone) — is the primary threat to availability at scale.
    • S3 mitigates this by replicating data across separate availability zones, ensuring that no single correlated failure domain can take out all copies of any object’s data.
  • Crash consistency means the system always returns to a consistent state after a fail-stop failure. Engineers design every microservice assuming failure is always present, reasoning about the set of states a system can reach under failure.
  • Failure allowance is the designed-in capacity for components to fail without impacting customers. The cache and underlying hardware are sized so that the allowance is never exhausted in practice, tracked by a dedicated fleet of metric-collecting microservices.
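A small sketch shows why spreading copies across zones defuses correlated failure. The placement rule here (a CRC32 hash of the key choosing a starting zone and a server within each zone) and the fleet layout are assumptions for illustration; the summary does not describe S3's actual placement algorithm.

```python
import zlib
from typing import Dict, List, Tuple

Placement = List[Tuple[str, str]]  # (zone, server) pairs


def place_across_zones(
    key: str, servers_by_zone: Dict[str, List[str]], copies: int
) -> Placement:
    """Put each copy in a different zone (hypothetical placement rule)."""
    zones = sorted(servers_by_zone)
    if copies > len(zones):
        raise ValueError("zone-disjoint placement needs one zone per copy")
    digest = zlib.crc32(key.encode("utf-8"))
    start = digest % len(zones)
    chosen = [zones[(start + i) % len(zones)] for i in range(copies)]
    return [
        (zone, servers_by_zone[zone][digest % len(servers_by_zone[zone])])
        for zone in chosen
    ]


def survives_zone_failure(placement: Placement, failed_zone: str) -> bool:
    """True if at least one copy lives outside the failed zone."""
    return any(zone != failed_zone for zone, _ in placement)


fleet = {"az-a": ["a1", "a2"], "az-b": ["b1", "b2"], "az-c": ["c1", "c2"]}
spread = place_across_zones("photo.jpg", fleet, copies=3)
packed = [("az-a", "a1"), ("az-a", "a2")]  # two copies sharing one fault domain

print(all(survives_zone_failure(spread, zone) for zone in fleet))  # True
print(survives_zone_failure(packed, "az-a"))  # False
```

The packed placement has two copies but a single fault domain, so one zone outage removes both. Counting copies is not enough; what matters is how many independent domains they span.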

S3’s evolution: from objects to tables to vectors

  • S3 has continuously evolved beyond simple blob storage in response to how customers actually use it.
    • Parquet and Iceberg (circa 2019–2020): Customers began storing tabular data as Parquet files. Apache Iceberg provided table semantics (compaction, schema evolution) on top of those files. In December 2024, AWS launched S3 Tables, which natively manages Iceberg tables in S3, with 15+ new features added since launch.
      • S3 Tables lets users query data with SQL — the lingua franca of data — directly against S3, without needing a separate database. This makes data accessible to AI agents and humans alike.
    • S3 Vectors (launched July 2025; generally available as of the week before the recording): A new native data type for storing embeddings (long lists of numbers produced by AI models).
      • Unlike S3 Tables (which builds on existing object storage), vectors required a brand-new data structure optimized for high-dimensional nearest-neighbor search.
      • S3 precomputes vector neighborhoods (clusters of similar vectors) offline and asynchronously. At query time, only the relevant neighborhoods are loaded into fast memory for nearest-neighbor search, achieving sub-100ms warm query latency.
      • Scale: up to 2 billion vectors per index, up to 20 trillion vectors per vector bucket — at S3’s standard storage pricing, making vector storage dramatically cheaper than specialized vector databases.
    • Mai-Lan describes S3’s evolution as a “product shape”: a living, coherent form that maintains the core traits (durability, availability, consistency) while extending into new paradigms (conditionals, SQL, vectors) that remove constraints on how customers use data.
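The neighborhood idea behind S3 Vectors resembles cluster-based approximate nearest-neighbor search, which can be sketched in a few lines. This is a generic illustration of that technique, not S3's data structure: the fixed centroids, the Euclidean distance, the probes parameter, and the synthetic two-cluster data are all assumptions.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def build_neighborhoods(
    vectors: Sequence[Vector], centroids: Sequence[Vector]
) -> Dict[int, List[Vector]]:
    """Offline step: assign every vector to its nearest centroid."""
    neighborhoods: Dict[int, List[Vector]] = {i: [] for i in range(len(centroids))}
    for vector in vectors:
        nearest = min(
            range(len(centroids)), key=lambda i: math.dist(vector, centroids[i])
        )
        neighborhoods[nearest].append(vector)
    return neighborhoods


def nearest_neighbor(
    query: Vector,
    centroids: Sequence[Vector],
    neighborhoods: Dict[int, List[Vector]],
    probes: int = 1,
) -> Vector:
    """Query step: scan only the `probes` neighborhoods closest to the query."""
    ranked = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [v for i in ranked[:probes] for v in neighborhoods[i]]
    return min(candidates, key=lambda v: math.dist(query, v))


rng = random.Random(0)
centroids: List[Vector] = [(0.0, 0.0), (10.0, 10.0)]
vectors: List[Vector] = [
    (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
    for cx, cy in centroids
    for _ in range(50)
]

neighborhoods = build_neighborhoods(vectors, centroids)
query = (9.5, 9.8)
approximate = nearest_neighbor(query, centroids, neighborhoods, probes=1)
exact = min(vectors, key=lambda v: math.dist(query, v))
print(approximate == exact)  # True for this well-separated data
```

Here the query scans 50 of the 100 vectors and still finds the exact nearest neighbor, because the clusters are far apart. On real embeddings the result is approximate, and probing more neighborhoods trades latency for recall.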

Engineering culture and principles

  • Two guiding Amazon engineering tenets are in productive tension on the S3 team:
    • Respect what came before — S3 has worked for nearly two decades; changes must preserve its core properties.
    • Be technically fearless — the team must invent new capabilities (conditionals, tables, vectors) that extend storage in ways customers need now and in the future.
  • Simplicity is a constant discipline: each microservice does one or two things well, keeping the overall distributed system maintainable. The user-facing API remains simple (put/get) even as capabilities grow.
  • “Scale is to your advantage” is a core design principle: every new feature must perform better as S3 grows, not worse. The massive scale of S3 decorrelates customer workloads, meaning any application on top of S3 inherits the statistical benefits of that scale.
  • The team includes engineers straight out of school alongside veterans who have worked on S3 for 15 years. Common traits are deep ownership, relentless curiosity, and a personal commitment to the durability and usability of every byte.
  • The 50 TB object size limit (up from the previous 5 TB) reflects continuous optimization for new workloads like high-resolution video, and the team regularly revisits limits based on observed customer usage patterns.
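The claim that scale decorrelates workloads is a statistical one, and a short simulation can illustrate it. This is a toy model with made-up parameters (each workload independently bursts to 10 units of load 5% of the time and otherwise uses 1 unit); it is not drawn from S3 traffic data.

```python
import random
from typing import List


def simulate_load(workloads: int, steps: int, rng: random.Random) -> List[float]:
    """Total load per time step for independent, bursty workloads."""
    totals = []
    for _ in range(steps):
        total = 0.0
        for _ in range(workloads):
            total += 10.0 if rng.random() < 0.05 else 1.0  # hypothetical burst model
        totals.append(total)
    return totals


def peak_to_mean(load: List[float]) -> float:
    return max(load) / (sum(load) / len(load))


rng = random.Random(42)
single = peak_to_mean(simulate_load(workloads=1, steps=500, rng=rng))
combined = peak_to_mean(simulate_load(workloads=200, steps=500, rng=rng))
print(f"one workload: {single:.2f}, 200 workloads: {combined:.2f}")
```

A single bursty workload has to be provisioned for a peak several times its average, while the combined load of many independent workloads stays close to its mean. That smoothing is the benefit an application inherits by sitting on a large shared system.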

Closing recommendations

  • For technical reading, Mai-Lan recommends research on multimodal embedding models (AI models that create semantic understanding across text, images, audio, and other formats) as the next frontier for making vast data oceans truly searchable and usable.
  • For a non-technical read, he suggests books on supporting native bees and insects — drawing a parallel between ecosystems in nature and the interconnected systems in distributed computing.