Observability: the present and future, with Charity Majors — The Pragmatic Engineer

Charity Majors is a veteran infrastructure engineer (ex-Parse, ex-Facebook) and co-founder of Honeycomb, an observability startup. She’s also a co-author of Observability Engineering. In this episode, she explains what observability really means, why it’s so hard and expensive, how the industry is shifting from “Observability 1.0” to “2.0,” and what engineering teams get wrong when adopting it.

What observability actually is

Observability is about understanding your software at the intersection of code, systems, and users—not just catching errors or outages.
It’s not just an operational tool; it underpins development feedback loops, product decisions, and business impact.
Charity argues that observability should help engineers explain their work in the language of the business (money, outcomes), giving engineering a “first-class seat” at the executive table.

The three-pillar model (Observability 1.0) and its problems

The traditional model defines observability as metrics, logs, and traces—a framing coined by Peter Bourgon around 2017.
Vendors latched onto this because they had separate products to sell for each pillar.
In practice, teams store data about every request in many different tools: metrics stores, structured/unstructured logs, tracing tools, profiling tools, analytics tools, etc.
The problems:
- Correlation is manual: engineers sit in the middle, copy-pasting IDs or guessing that “this shape looks like that shape.”
- Cost multiplier: storing 15 copies of each request for 15 use cases is expensive and unsustainable.
- Predefined relationships: you must decide in advance what’s important and how data connects.

Observability 2.0: unified storage and structured data

The core shift is from many sources of truth to unified storage.
Instead of separate silos, you store wide, structured events (rich logs) in a single backend, typically a columnar store.
Benefits:
- No dead ends: click a log, turn it into a trace, visualize over time, derive metrics or SLOs, and jump directly into the events violating an SLO.
- High cardinality support: you can slice and dice by unique IDs (user ID, request ID, app ID) without breaking the bank.
Charity emphasizes that many vendors claim “unified observability” but often only offer a unified bill or unified visualization, not true unified storage.
The new wave of observability startups (many built on ClickHouse, columnar stores, OpenTelemetry-native, wide events) looks more like “cheaper Honeycombs” than “cheaper Datadogs.”
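
To make the "no dead ends" idea concrete, here is a minimal sketch of how one set of wide events could serve as both a trace and a metric. The field names (trace_id, span_id, parent_id, and so on) and the helper functions are hypothetical, chosen for illustration; no particular vendor's schema or API is implied.

```python
from collections import defaultdict

# Hypothetical wide events: one record per unit of work, with all context attached.
EVENTS = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "GET /feed",
     "start_ms": 0, "duration_ms": 120, "user_id": "u-17", "status": 200},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "db.query",
     "start_ms": 10, "duration_ms": 95, "user_id": "u-17", "status": 200},
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "name": "GET /feed",
     "start_ms": 5, "duration_ms": 30, "user_id": "u-42", "status": 500},
]


def as_trace(events, trace_id):
    """View the events of one request as a trace: its spans ordered by start time."""
    spans = [e for e in events if e["trace_id"] == trace_id]
    return sorted(spans, key=lambda e: e["start_ms"])


def error_count_by(events, field):
    """Derive a metric on demand: count of 5xx events grouped by any field."""
    counts = defaultdict(int)
    for e in events:
        if e["status"] >= 500:
            counts[e[field]] += 1
    return dict(counts)


print([s["name"] for s in as_trace(EVENTS, "t1")])  # ['GET /feed', 'db.query']
print(error_count_by(EVENTS, "user_id"))            # {'u-42': 1}
```

Because the metric is computed from the same records that make up the trace, there is no second copy of the data to correlate by hand.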

How Charity’s experience at Parse and Facebook shaped Honeycomb

At Parse (a mobile backend-as-a-service with ~1M apps), outages were frequent and hard to diagnose:
- The Ruby on Rails stack had fixed worker pools; one slow app could bring down the whole system.
- Logs were full of requests, but it was unclear which app actually caused the problem.
After Parse was acquired by Facebook, Charity used Scuba, Facebook’s in-memory columnar analytics tool:
- Scuba allowed real-time slicing and dicing on high-cardinality dimensions (app ID, user ID, query, latency).
- Time to pinpoint root cause dropped from hours (or never) to seconds.
Her co-founder Christine had built Parse’s analytics product on Cassandra and was frustrated by having to predefine questions in advance; she’d manually look up answers in Scuba.
Both realized this kind of interactive, high-cardinality exploration was life-changing—and not available in any external tool.
They left Facebook to build Honeycomb, aiming to productize that experience, without initially knowing terms like “product-market fit” or “observability category.”

Scuba at Facebook

Scuba is an in-memory columnar store, built about a decade before Charity used it, originally to debug MySQL/PHP issues.
It was “quick and dirty”: a C++ binary that shelled out to rsync for replication.
Its evolution is closer to business analytics than traditional monitoring or the three-pillar model.

Key observability concepts engineers should understand

Metrics (small m vs. Big M):
- Small m metrics: generic term for telemetry.
- Big M metrics: a number with tags appended—efficient but limited; they don’t store contextual relationships.
Structured data vs. metrics:
- The shift is from storing data in multiple metric/logging/tracing tools to storing wide, structured events in a unified backend.
Sampling:
- Often feared because logging vendors have long said “every log is sacred.”
- In practice, smart sampling is essential for managing cost while preserving debuggability.
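
One common way to sample without losing debuggability is to keep every error and slow request, keep only a fraction of routine traffic, and record the sample rate on each kept event so totals can be re-estimated later. The sketch below assumes that approach; the thresholds, field names, and function names are illustrative, not taken from the episode.

```python
import zlib


def sample_rate_for(event, base_rate=10, slow_ms=1000):
    """Return N, meaning 'keep 1 in N' events like this one."""
    if event["status"] >= 500 or event["duration_ms"] >= slow_ms:
        return 1  # always keep errors and slow requests
    return base_rate


def maybe_keep(event, base_rate=10):
    """Return the event annotated with its sample rate, or None if dropped."""
    rate = sample_rate_for(event, base_rate)
    # Hash the request ID so the keep/drop decision is deterministic.
    if zlib.crc32(event["request_id"].encode("utf-8")) % rate != 0:
        return None
    return {**event, "sample_rate": rate}


def estimated_total(kept_events):
    """Each kept event stands in for sample_rate original events."""
    return sum(e["sample_rate"] for e in kept_events)


traffic = [
    {"request_id": f"req-{i}", "status": 200, "duration_ms": 40} for i in range(1000)
]
traffic.append({"request_id": "req-err", "status": 503, "duration_ms": 15})

kept = [k for k in (maybe_keep(e) for e in traffic) if k is not None]
print(len(kept), "events stored, representing about", estimated_total(kept))
```

The single failing request is always retained, while the stored volume of healthy traffic drops to roughly a tenth.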
Cardinality:
- Refers to the number of unique values in a set (e.g., request IDs, user IDs).
- High cardinality data is the most valuable for debugging but also the most expensive in traditional metric-based systems.
- In time-series databases, every unique combination of metric name and tag values is stored as a separate time series, so adding a high-cardinality field (like IP address) can 100x your bill overnight.
- Teams using traditional tools spend most of their time governing cardinality instead of debugging.
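
The arithmetic behind the cardinality blowup can be sketched in a few lines. In the worst case, the number of time series for one metric is the product of the number of distinct values of each tag. The tag names and counts below are made up for illustration.

```python
from math import prod


def series_count(tag_cardinalities):
    """Upper bound on time series for one metric: product of distinct tag values."""
    return prod(tag_cardinalities.values())


before = {"endpoint": 50, "status": 5, "region": 4}
after = dict(before, client_bucket=100)  # hypothetical new tag with 100 distinct values

print(series_count(before))  # 1000
print(series_count(after))   # 100000
```

Adding a single tag with 100 distinct values multiplies the series count by 100, which is why a field like user ID or IP address is so costly in this model.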

Why observability is so expensive

System complexity and high standards: companies like banks or delivery services must understand every request.
Multiplier effect: storing many copies of each request across many tools.
Cardinality blowup: traditional metric-based tools can’t handle high cardinality efficiently; costs spiral as cardinality grows.
Charity notes that when money is no longer “free,” the old model becomes unsustainable.

The solution: structured data and columnar stores

Move away from tools backed by Big M metrics toward tools using structured data in columnar stores.
Emit fewer but wider logs—attach rich context to each event.
This enables:
- Real-time slicing and grouping by high-cardinality fields.
- Interactive exploration (e.g., “Bubble Up” in Honeycomb: select a spike, compare dimensions inside vs. outside to see what’s different).
Trade-offs exist but are mitigated by falling storage/compute costs and modern columnar databases that don’t require predefined schemas or indexes.
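
The inside-versus-outside comparison described above can be approximated with a naive sketch: for each field, compare how often each value appears in the selected events against the rest. This is an illustration of the idea only, not Honeycomb's actual algorithm; the function names and sample data are hypothetical.

```python
from collections import Counter


def value_shares(events, field):
    """Fraction of events carrying each value of the given field."""
    counts = Counter(e[field] for e in events)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def compare_selection(events, is_selected, fields):
    """Rank (difference, field, value) by how over-represented a value is inside."""
    inside = [e for e in events if is_selected(e)]
    outside = [e for e in events if not is_selected(e)]
    results = []
    for field in fields:
        in_shares = value_shares(inside, field)
        out_shares = value_shares(outside, field)
        for value, share in in_shares.items():
            results.append((share - out_shares.get(value, 0.0), field, value))
    return sorted(results, reverse=True)


events = (
    [{"app_id": f"app-{i % 4}", "build": "v1", "duration_ms": 50} for i in range(8)]
    + [{"app_id": "app-7", "build": "v1", "duration_ms": 2000} for _ in range(4)]
)

ranked = compare_selection(events, lambda e: e["duration_ms"] > 1000, ["app_id", "build"])
print(ranked[0])  # (1.0, 'app_id', 'app-7')
```

Here every slow event belongs to app-7 and none of the fast ones do, so that value rises to the top, while the build field shows no difference.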

Observability across the development lifecycle

Observability should underpin the entire development cycle, not just production incidents.
Use cases:
- CI/CD: visualize builds as traces, see where tests break or where time is spent.
- Progressive deployments: ship behind feature flags or canaries, then use observability to validate behavior with precision.
Charity’s metaphor: observability + feature flags + canaries is like putting on glasses before driving—you should feel in control, not constantly course-correcting.
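
As a rough illustration of the CI/CD use case, a build can be modeled as a list of spans, one per step, and then queried for where time went and where it first broke. The Span class, step names, and timings below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    start_s: float
    end_s: float
    ok: bool = True

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def slowest(spans):
    """The step that consumed the most wall-clock time."""
    return max(spans, key=lambda s: s.duration_s)


def first_failure(spans):
    """The earliest failed step, or None if the build passed."""
    failed = [s for s in spans if not s.ok]
    return min(failed, key=lambda s: s.start_s) if failed else None


build = [
    Span("checkout", 0, 12),
    Span("install-deps", 12, 95),
    Span("unit-tests", 95, 410),
    Span("integration-tests", 410, 520, ok=False),
]

print(slowest(build).name)        # unit-tests
print(first_failure(build).name)  # integration-tests
```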

Who owns observability?

Some companies assign ownership to SRE or DevOps teams, but the center of gravity is shifting to platform teams.
Platform teams manage the boundary between app code and infrastructure, and their customers are internal (product engineers).
The model is: “You own your code; we help you with the platform.”
Charity believes DevOps as a movement is in its fulfillment stage:
- The old split between “Dev builds, Ops operates” is fading.
- Increasingly, engineers write code and own it in production.
- The philosophy of collaboration and empathy remains, but the organizational split is disappearing.

Why observability is hard

Software is hard; observability adds a meta-layer of thinking about what future-you will need to know at 2 a.m.
Historically, tools required engineers to think in terms of physical resources (CPU, RAM) rather than user outcomes.
Charity herself “hated” observability and monitoring for years, always trying to delegate it.

Vendor lock-in and OpenTelemetry

Historically, vendor lock-in has been a huge problem in observability.
OpenTelemetry (OTel) is changing this:
- It’s a set of APIs, SDKs, and tools to instrument code once and export telemetry to any backend.
- Goal: point your telemetry firehose at any vendor without re-instrumenting.
- Charity describes it as now the #1 CNCF project by commits and committers, overtaking Kubernetes.
OTel provides:
- Consistent naming and structure via semantic conventions.
- Vendor-neutral pipelines, enabling vendors to compete on value, not lock-in.
Charity recommends that midsize companies adopt OpenTelemetry where possible:
- Improves consistency.
- Makes it easier to switch vendors if needed.
- Strengthens negotiating position with vendors.
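
The "instrument once, export anywhere" idea can be shown with a toy sketch. Note that this is not the OpenTelemetry API: the Exporter, Telemetry, and handle_checkout names are invented here purely to show how application code can stay unchanged while the destination is swapped.

```python
from typing import Protocol


class Exporter(Protocol):
    """Anything that can receive an event (hypothetical interface)."""

    def export(self, event: dict) -> None: ...


class InMemoryExporter:
    """Stands in for one backend: collects events in a list."""

    def __init__(self) -> None:
        self.events = []

    def export(self, event: dict) -> None:
        self.events.append(event)


class PrintExporter:
    """Stands in for a different backend: writes events to stdout."""

    def export(self, event: dict) -> None:
        print(event)


class Telemetry:
    """The only object application code talks to."""

    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record(self, name: str, **attributes) -> None:
        self._exporter.export({"name": name, **attributes})


def handle_checkout(telemetry: Telemetry, user_id: str) -> None:
    # Instrumented once; this function never names a vendor.
    telemetry.record("checkout", user_id=user_id, status=200)


memory = InMemoryExporter()
handle_checkout(Telemetry(memory), "u-1")           # send to backend A
handle_checkout(Telemetry(PrintExporter()), "u-1")  # send to backend B, same code
```

Switching vendors in this model means changing which exporter is constructed, not touching handle_checkout.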

Common mistakes engineering teams make

Waiting until production to add observability; instead, instrument early, just like writing tests.
Over-relying on static dashboards:
- Dashboards are often “public-facing” and not useful for deep debugging.
- Dynamic, interactive exploration is more valuable than fixed graphs.
Using SLOs disconnected from debugging data:
- SLOs should be the API for engineering teams—an agreement on service level and a budget for reliability work.
- When SLOs and observability data are connected, you can click on an SLO violation and immediately see which events are causing it.
- SLOs also hedge against micromanagement: if you’re meeting your obligations, how you spend your time is your own business.
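
A minimal sketch of an SLO computed directly from events shows why the connection matters: the same pass that yields compliance and remaining error budget also yields the exact events that violated it. The target, latency threshold, and field names are assumptions for illustration.

```python
def slo_report(events, target=0.99, threshold_ms=500):
    """Treat an event as 'good' if it is not a 5xx and finished within threshold_ms."""
    bad = [e for e in events if e["status"] >= 500 or e["duration_ms"] > threshold_ms]
    total = len(events)
    allowed_bad = (1 - target) * total  # the error budget, in events
    return {
        "compliance": 1 - len(bad) / total if total else 1.0,
        "budget_remaining": allowed_bad - len(bad),
        "violations": bad,  # the events to jump into when debugging
    }


events = [{"user_id": "u-1", "status": 200, "duration_ms": 80} for _ in range(996)]
events += [{"user_id": "u-9", "status": 200, "duration_ms": 900} for _ in range(3)]
events += [{"user_id": "u-9", "status": 503, "duration_ms": 20}]

report = slo_report(events)
print(round(report["compliance"], 3))        # 0.996
print(round(report["budget_remaining"], 1))  # 6.0
print({e["user_id"] for e in report["violations"]})  # {'u-9'}
```

With a 99% target over 1,000 events, the budget allows 10 bad events; 4 have been spent, and all of them point at the same user.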

Honeycomb’s engineering challenge: building their own database

Honeycomb built its own database, Retriever, a columnar store.
Charity had always advised against writing your own database, but in 2016, suitable alternatives (ClickHouse, Snowflake) weren’t available or mature.
Retriever has evolved:
- Initially on local SSDs on EC2.
- Around 2020, they serverless-ified it: data ages out to S3, and queries fan out to Lambda jobs for on-demand processing.
This custom storage engine has been a force multiplier, allowing Honeycomb to iterate quickly on features like traces and high-cardinality queries.
Charity frames this as spending 2 of their 3 “innovation tokens” on the storage engine—a risky bet that paid off.

Observability and AI

Charity sees three intersections:
1. Building/training models: observability for model performance, data drift, etc.
2. Developing with LLMs: understanding prompts, outputs, and behavior in applications.
3. Observing AI-generated code: as more code is written by AI, teams must understand software of “unknown origin” in production.
Key insights:
- Don’t use AI when you can compute: if you have enough context to calculate an answer, that’s faster, cheaper, and more reliable than guessing with AI.
- AI observability cannot be isolated from software observability:
  - Inputs come from many services; outputs affect users and downstream systems.
  - It’s a trace-shaped problem: you must trace from user input through the model to human feedback.
- Many AI observability startups focus only on the model in isolation, missing the broader software context.
Charity’s take: AI reliability starts with good software observability; without it, you’re flying blind.

Build vs. buy vs. open source

Building your own observability stack rarely makes sense unless you’re at Facebook/Google scale.
OpenTelemetry is open source and has a bright future as the standard instrumentation layer.
For backends, the trend is consolidation:
- Teams are tired of paying 5 vendors, each targeting 15–20% of cloud spend.
- A reasonable benchmark is 15–20% of cloud spend on observability, depending on the business.
Prometheus and Datadog are mature metric-based tools; Charity sees them as the last major “Big M metrics” products that will be built.
Metrics still have a place:
- Cheap long-term trend plotting.
- Counters and high-scale use cases where structured data is too expensive.
The goal is to invert the ratio: 80% structured data, 20% metrics (currently it’s the reverse for many teams).
Most companies prefer vendors over self-hosted open source because when things break at 2 a.m., you don’t want to also debug your observability stack.

Frontend and mobile observability

RUM (Real User Monitoring) is the frontend equivalent of backend request monitoring, organized around browser/user sessions.
Honeycomb launched its own RUM product to provide a unified view from mobile/browser to database.
Mobile observability remains a challenging, somewhat isolated space:
- Apple and Google’s store gating and restrictions make CI/CD and feature flagging harder.
- The build pipeline is “alien” compared to backend, so mobile teams often can’t adopt the same best practices.
- Crashlytics was acquired by Twitter, then neglected, spooking VCs and leaving mobile without a first-class, vendor-backed observability solution.
Charity believes mobile is not a small market, but it’s been underserved due to platform constraints.

When to invest in observability (for startups)

As early as you start writing tests.
Instrument while you’re writing code, not after.
Done correctly, observability accelerates development by providing fast feedback and rich mental models of your software.
The analogy: just like tests, once you’ve experienced good observability, you can’t unsee it.

Rapid fire

Engineer vs. manager: Charity loves being an engineer—“getting paid to solve puzzles all day.” She misses some aspects of engineering management but plans to return to a staff engineer role next.
Controversial belief: She disagrees with the “founder mode” idea that the CEO must approve everything. She sees that as egotistical and harmful to good decision-making, citing Steve Jobs as successful despite being a control freak, not because of it.
If not Honeycomb: She’d be a staff engineer somewhere, building things and turning off her brain at 5 p.m. (mostly).
Whiskey: Currently favors bourbon and rye, especially WhistlePig; her all-time favorite is the rare George T. Stagg (try Stagg Jr. as a substitute).
Book recommendation: Fluke by Brian Klaas, about chance, chaos, and why everything we do matters. She’s read it three times in the past year.
