Simon Willison: Engineering practices that make coding agents work - The Pragmatic Summit — The Pragmatic Engineer

Simon Willison, co-creator of Django, co-founder of Lanyrd, and maintainer of Datasette, now writes more code on his phone than on his laptop, using AI coding agents to ship features in minutes. His workflow illustrates a broader shift in software development: from writing code yourself, to reviewing agent-written code, to not reading the code at all and instead relying on tests and automated verification to trust that it works. The key practices that make this possible are red-green test-driven development, manual testing via tools like curl, and sandboxing agents in isolated environments to limit the blast radius of mistakes or attacks.

The core workflow: TDD as the foundation for trusting agents

Simon starts every coding session by telling the agent to run the existing test suite (typically uv run pytest) and to follow red-green TDD — write a failing test first, then implement the minimal code to pass it.
- This is only five tokens of instruction (“use red green TDD”), and all major coding agents understand it.
- The benefit is that agents won’t over-engineer: they write the minimum needed to pass the test, which keeps code focused and correct.
- Simon personally disliked TDD for most of his career because it felt tedious, but with agents the cost is negligible — the agent spends a few extra minutes, not the human.
- Tests are now “effectively free” to produce, so there’s no reason not to write them. They are no longer optional.
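As a minimal sketch of what one red-green cycle looks like, the example below writes the test first and then the smallest implementation that satisfies it. The `slugify` function and its test are hypothetical illustrations, not code from the talk.

```python
import re


# Step 1 (red): the test is written first. Before slugify() existed,
# calling this test failed with a NameError.
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello, World!") == "hello-world"


# Step 2 (green): the smallest implementation that makes the test pass.
def slugify(text: str) -> str:
    # Collapse every run of non-alphanumeric characters into one hyphen,
    # then trim any hyphens left at either end.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
```

Saved in a test file, this would be collected by `uv run pytest` like any other test; the point is the order of the two steps, not the function itself.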

Beyond automated tests: manual verification and Showboat

Passing tests don’t guarantee the system actually works in practice — a web server might fail to boot even when all tests pass.
Simon tells agents to start the server in the background and exercise it with curl, which catches bugs the test suite missed.
He recently released a tool called Showboat that builds a markdown log of manual tests the agent ran — curl commands, outputs, and commentary — giving the human a readable record of what was verified without reading the code itself.
- Showboat was only about 48 hours old at the time of the conversation but already proving useful.
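The idea of booting the server and exercising it over HTTP can be sketched in Python as well. This is an illustrative stand-in only: it does not use Showboat or reproduce its log format, and `HealthHandler` and `smoke_test` are hypothetical names. It starts a tiny server in a background thread, makes one real request (the equivalent of a curl call), and returns a short markdown record of what was observed.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the application under test."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the server quiet during the check


def smoke_test() -> str:
    """Boot the server, make one real HTTP request, return a markdown record."""
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Ignore any system proxy settings for this loopback request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        with opener.open(url, timeout=5) as response:
            status = response.status
            body = response.read().decode()
    finally:
        server.shutdown()
        server.server_close()
    return f"## GET /\n\nStatus: {status}\n\nBody: {body}\n"
```

A check like this fails if the server cannot boot at all, which is exactly the class of bug a passing unit-test suite can miss.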

Conformance-driven development: reverse-engineering standards from existing implementations

When a language-agnostic test suite exists (e.g., the WebAssembly specification with hundreds of tests), agents can be told to write code until the conformance suite passes.
- Simon used this to build a Python WebAssembly library that works despite being “janky as all get out.”
He also asked Claude to build a test suite for multipart file uploads that passes across six different web stacks (Go, Node.js, Django, Starlette, etc.), then used that suite to implement the same feature in Datasette.
- This is effectively reverse-engineering a standard from multiple implementations, then building a new implementation against that standard.
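A conformance suite of this kind is essentially data plus a small runner. The sketch below assumes a hypothetical set of query-string parsing cases stored as JSON (so any language could load them) and checks an implementation against them; here the implementation is the standard library's `urllib.parse.parse_qs`. The case names and the `run_conformance` helper are illustrative, not taken from the WebAssembly or multipart suites mentioned above.

```python
import json
from urllib.parse import parse_qs

# Hypothetical conformance cases. In practice this JSON would live in a
# shared file that every implementation, in any language, runs against.
CASES = json.loads("""
[
  {"name": "single pair", "input": "a=1", "expected": {"a": ["1"]}},
  {"name": "repeated key", "input": "a=1&a=2", "expected": {"a": ["1", "2"]}},
  {"name": "plus is space", "input": "q=red+green", "expected": {"q": ["red green"]}},
  {"name": "percent escape", "input": "q=%41", "expected": {"q": ["A"]}}
]
""")


def run_conformance(implementation, cases):
    """Return the names of the cases the implementation fails."""
    failures = []
    for case in cases:
        if implementation(case["input"]) != case["expected"]:
            failures.append(case["name"])
    return failures
```

An agent given this suite has a clear stopping condition: keep changing the implementation until `run_conformance` returns an empty list.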

Code quality: it’s a choice, not a given

For throwaway tools (single-page HTML/JS apps), code quality doesn’t matter — 800 lines of spaghetti is fine if it works.
For long-term maintained projects, code quality matters significantly, and letting an agent produce bad code is a choice the developer makes.
- If the agent produces 2,000 lines of poor code, the human can direct it to refactor — applying design patterns, restructuring — resulting in code that may be better than what the human would have written by hand, because the human would have run out of time or energy for that final refactor.
- Simon notes he’s ended up with higher-quality code via agents because he’s willing to prompt a refactoring pass and then walk the dog while the agent does the work.

Context and consistency: templates and patterns

Agents are highly consistent at following existing patterns in a codebase. If the code follows a convention, the agent will too.
Simon uses Cookiecutter templates to scaffold new projects — setting up file structure, testing frameworks, CI, and a README — so that even one or two example tests in the preferred style cause the agent to follow that style throughout.
- This mirrors how human teams work: the first person to introduce a pattern at a company sets the template everyone else copies.

Prompt injection and the lethal trifecta

Simon coined the term prompt injection (comparing it to SQL injection) to describe attacks where malicious instructions are embedded in data the agent reads — for example, documentation that tells the agent to execute a harmful command.
- The term was imperfect because, unlike SQL injection, there’s no reliable way to separate “instructions” from “data” in an LLM context.
He later introduced the term lethal trifecta to describe the three conditions that make an LLM-based system vulnerable:
1. Access to private data (API keys, emails, environment variables)
2. Exposure to malicious instructions (untrusted input)
3. An exfiltration vector (a way to send data back to the attacker)
The only guaranteed mitigation is to cut off one of the three legs — most commonly by preventing external communication, so the worst an attacker can do is make the bot lie in its responses.
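The three conditions can be written down as a simple check, which makes the "cut off one leg" mitigation concrete. This is a sketch for reasoning about a system's design, not a security tool; the `AgentCapabilities` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    reads_private_data: bool    # e.g. API keys, emails, environment variables
    sees_untrusted_input: bool  # e.g. web pages, issues, third-party docs
    can_send_data_out: bool     # e.g. outbound HTTP, email, posting comments


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three legs are present at the same time."""
    return (
        caps.reads_private_data
        and caps.sees_untrusted_input
        and caps.can_send_data_out
    )
```

For example, `AgentCapabilities(True, True, False)` describes an agent with no exfiltration vector, so `has_lethal_trifecta` returns `False` for it even though it still reads private data and untrusted input.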

Sandboxing: containing the damage

The most important defensive practice is running agents in sandboxed environments where a prompt injection or mistake can’t cause widespread damage.
- Claude Code for the Web runs in an Anthropic-managed container: the worst a prompt injection could do is steal the source code in that container, which is acceptable for open-source projects.
- Simon runs Claude Code with the --dangerously-skip-permissions flag on his Mac for convenience, despite being the world’s foremost expert on why that’s risky, but he avoids pointing it at untrusted repos.
- Docker containers and Apple containers are other options, though friction remains a barrier to consistent use.
For sensitive user data, Simon recommends investing in good mocking — buttons that generate synthetic test data (e.g., a user with 1,000 ticket types to test edge cases) — rather than copying production data to agent environments.
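A synthetic-data generator for the edge case mentioned above might look like the following sketch. The `User` and `TicketType` types, the field names, and the price range are all assumptions made for illustration; a real project would build instances of its own models instead.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TicketType:
    name: str
    price_cents: int


@dataclass
class User:
    email: str
    ticket_types: list = field(default_factory=list)


def make_user_with_ticket_types(count: int, seed: int = 0) -> User:
    """Build a fake user with `count` ticket types and no real customer data."""
    rng = random.Random(seed)  # seeded so the same fixture can be regenerated
    user = User(email="synthetic-user@example.com")
    for i in range(count):
        price = rng.randrange(0, 50_000, 100)  # whole-dollar prices below $500
        user.ticket_types.append(TicketType(name=f"Ticket type {i + 1}", price_cents=price))
    return user
```

Calling `make_user_with_ticket_types(1000)` gives the agent a realistic stress case to work against without any production data leaving its home.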

The recent inflection points

2022–2023: GitHub Copilot for autocomplete, then ChatGPT for conversational help. GPT-4 was the first model that was genuinely useful rather than making most things up.
Early 2025: Claude Code was released, about a year before this conversation. The combination of Claude Code with Sonnet 3.5 was the first time a model felt good enough at terminal-driven coding to be genuinely useful.
November 2025: Claude Opus 4.5 and GPT 5.1 produced code that was reliably good rather than janky — a major trust threshold.
March 2026 (the week before this conversation): Opus 4.6 and Codex 5.3 represented another inflection point — Simon is now “one-shotting” most tasks with two-sentence prompts and not even questioning whether they’ll work.
- He compares this to spell checking, which models couldn’t do reliably 18 months ago and now handle effortlessly — suggesting we’re still discovering what current models can do.

Using current models better rather than predicting the future

Simon tries not to predict more than a week ahead. His focus is on discovering what current models can already do — a process he estimates takes about six months after each new model release.
- When a model fails at something, note it and try again in six months. Sometimes it’ll succeed, and you may be the first person to discover that capability.
He wishes model vendors would clearly state what new models can do that previous versions couldn’t, but vendors themselves often don’t know the boundaries.

The human cost: this work is exhausting

Simon typically works on three projects simultaneously, switching between them when one agent is processing, but after about two hours he’s mentally exhausted.
- This contradicts fears of skill atrophy or laziness — operating multiple agents requires intense focus and decision-making.
- He thinks this exhaustion may be what prevents a single engineer from truly scaling to a thousand projects at once.
At the same time, engineers can be far more ambitious than before:
- Learning a new language no longer requires deep study — just start writing code in it and let the agent handle the details. Simon released three Go projects in two weeks despite not being a fluent Go programmer.
- Weird little experiments become viable — he had Claude build a custom cooking timer app for managing two Christmas dinner recipes simultaneously, which was unnecessary but fun.

What would Django look like if built today?

Django was created in 2003 to help journalists build web apps on newsroom deadlines — things that needed to ship in hours, not weeks.
Today, Simon can build a news-story app in two hours by prompting Claude, as long as code quality is not a concern.
The impact on open source is complex:
- Demand for generic open source libraries may decline — why use a date picker library when Claude can write exactly the one you need? Tailwind’s paid component library business has already felt this pressure.
- But agents love open source — they’re great at recommending and stitching together libraries, and the entire agent ecosystem is built on the back of open source.
- Open source maintainers are flooded with junk contributions from agents, to the point where some are asking GitHub to disable pull requests — something GitHub has never done, as open collaboration is its core value.
- Simon’s overall take: the situation is “really complicated.”
