Simon Willison, co-creator of the Django framework and a prolific open-source contributor, has been experimenting with large language models (LLMs) for software engineering productivity for nearly three years — starting with GPT-3 before ChatGPT’s public release. He shares an honest, experience-grounded perspective on what these tools can and cannot do, how his workflow has evolved, and why he believes experienced engineers who invest time in learning LLMs will gain a significant professional advantage.
How Simon got started with LLMs
Simon has had a side interest in machine learning for five or six years. He took Jeremy Howard’s fast.ai course around 2018 and experimented with GPT-2 in 2019, generating New York Times headlines by decade; the results were underwhelming.
GPT-3 in 2020–2021 was the turning point: it was the first model large enough to do useful things. Simon used it through OpenAI’s playground interface, with completion-style prompts, to solve problems with jq (a JSON query language he finds difficult).
He didn’t get serious about using LLMs for coding until after ChatGPT launched in November 2022, when he used it to learn Rust during Advent of Code that December — an exercise that revealed both the power and limits of current models.
The “scary” moment: existential dread as a tool builder
Simon’s main open source project is Datasette, a tool for exploring and publishing data from databases — his long-standing mission has been to let anyone ask questions of data without needing specialized skills.
When OpenAI launched ChatGPT’s code interpreter mode (now called Advanced Data Analysis), Simon uploaded a SQLite database and asked it a question; it flawlessly wrote the correct SQL query, executed it via Python’s sqlite3 library, and returned the answer.
His reaction was a mix of awe and existential dread: the tool he had spent years building was being matched by a general-purpose AI that didn’t even advertise SQLite as a feature.
This shifted his mental model for Datasette — he now explores adding LLM-based plugins to stay ahead, but acknowledges the problem space has fundamentally changed.
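The step Simon watched code interpreter perform, writing a SQL query and running it through Python’s sqlite3 library, can be sketched as follows. This is only an illustration: the table, column names, and rows are invented and have nothing to do with the database Simon actually uploaded.

```python
import sqlite3

# In-memory database with a hypothetical table; every name and row is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (name TEXT, state TEXT, capacity_mw REAL)")
conn.executemany(
    "INSERT INTO plants VALUES (?, ?, ?)",
    [
        ("Alpha", "CA", 120.0),
        ("Beta", "CA", 80.0),
        ("Gamma", "TX", 300.0),
    ],
)


def total_capacity_by_state(connection):
    """Answer 'which state has the most capacity?' with one SQL query."""
    query = """
        SELECT state, SUM(capacity_mw) AS total
        FROM plants
        GROUP BY state
        ORDER BY total DESC
    """
    return connection.execute(query).fetchall()


rows = total_capacity_by_state(conn)
print(rows)
```

The hard part the model handled was going from a plain-English question to the right query; executing it, as shown here, is a few lines of standard library code.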
Languages where LLMs excel — and where they struggle
Python, JavaScript, and SQL are the three languages LLMs are best at, because they have the most training data; Simon’s daily stack aligns perfectly with this.
Rust is still difficult for LLMs due to its memory management model (borrowing and lifetimes); Simon uses “can this model explain Rust borrowing to me?” as a stress test.
He deliberately chooses “boring technology” like Django, which LLMs already know well, maximizing the tools’ usefulness.
Fine-tuning vs. RAG: what actually works for adding knowledge
Fine-tuning means taking an existing model and running additional training cycles on custom data (e.g., uploading a CSV of examples via an API). It sounds tempting but is expensive, difficult, and often counterproductive.
Fine-tuning does not reliably add new factual knowledge to a model — the existing weights overwhelm what’s added, and hallucinations can actually increase.
It does work for task-specific improvements, like training a model on thousands of question-to-SQL-query pairs to make it better at SQL generation.
RAG (Retrieval-Augmented Generation) is the practical alternative: when a user asks a question, you search relevant documents, paste them into the model’s context window along with the question, and let the model answer using that material.
The basic implementation is simple — Simon has versions in ~30 lines of Python or two dozen lines of bash.
Getting good RAG is hard: the core challenge is information retrieval — picking the most relevant content to fill the context window, a problem search engineers have worked on for 30 years.
Real users ask questions in unpredictable ways, so production RAG systems take months to harden against edge cases.
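The basic RAG loop described above (search, paste into the context, ask) can be sketched in a few lines of Python. This is a minimal illustration, not Simon’s implementation: the documents are invented, retrieval is naive word overlap rather than a real search engine, and `complete` is a hypothetical callable standing in for whatever model API you use.

```python
import re

# Hypothetical document store; a real system would index far more text.
DOCUMENTS = {
    "install": "Install Datasette with pip install datasette and run it against a SQLite file.",
    "plugins": "Datasette plugins add features such as new output formats and authentication.",
    "deploy": "Publish a database to a hosting provider with the datasette publish command.",
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Rank documents by word overlap with the question; return up to k ids."""
    q = tokenize(question)
    scored = [(len(q & tokenize(body)), doc_id) for doc_id, body in documents.items()]
    # Highest overlap first; ties broken alphabetically by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


def build_prompt(question, documents, k=2):
    """Paste the retrieved documents into the prompt ahead of the question."""
    context = "\n\n".join(documents[d] for d in retrieve(question, documents, k))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


def answer(question, documents, complete):
    """`complete` is any callable that takes a prompt string and returns text."""
    return complete(build_prompt(question, documents))
```

The `retrieve` function is where the real difficulty lives: swapping word overlap for full-text search or embeddings changes nothing else in the sketch, which is why retrieval quality dominates how good a RAG system feels.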
The cost and difficulty of evaluating LLM-powered software
Traditional unit testing doesn’t work well with LLMs because they are non-deterministic: they rarely return exactly the same answer twice.
Automated evaluations (“evals”) are essential but expensive: one AI company Simon spoke to spends $50 per eval run, making it impractical to run on every commit.
A common evaluation technique is “LLM as judge” — using one model (e.g., GPT-4) to compare outputs from two models and pick the better one — which Simon finds uncomfortable but acknowledges is one of the few practical options available.
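A pairwise “LLM as judge” eval can be sketched as below. Everything here is an assumption made for illustration: the prompt wording, the `judge` callable (a stand-in for a call to a real model), and the toy `longer_answer_judge` used so the example runs without an API. Shuffling the answer order is one common way to reduce position bias; it is not something the source prescribes.

```python
import random


def judge_pairwise(question, answer_a, answer_b, judge, rng):
    """Return "A" or "B" for the answer the judge prefers.

    `judge` is a hypothetical callable: prompt string in, "1" or "2" out.
    """
    flipped = rng.random() < 0.5  # randomize which answer is shown first
    first, second = (answer_b, answer_a) if flipped else (answer_a, answer_b)
    prompt = (
        f"Question: {question}\n\n"
        f"Answer 1: {first}\n\nAnswer 2: {second}\n\n"
        "Which answer is better? Reply with 1 or 2."
    )
    verdict = judge(prompt).strip()
    if verdict not in ("1", "2"):
        raise ValueError(f"unexpected judge verdict: {verdict!r}")
    # Map the positional verdict back to the original A/B labels.
    if verdict == "1":
        return "B" if flipped else "A"
    return "A" if flipped else "B"


def win_rate(cases, judge, seed=0):
    """Fraction of (question, answer_a, answer_b) cases where A is preferred."""
    rng = random.Random(seed)
    wins = sum(judge_pairwise(q, a, b, judge, rng) == "A" for q, a, b in cases)
    return wins / len(cases)


def longer_answer_judge(prompt):
    """Toy stand-in for a real model: prefers whichever answer is longer."""
    first = prompt.split("Answer 1: ")[1].split("\n\nAnswer 2: ")[0]
    second = prompt.split("Answer 2: ")[1].split("\n\nWhich answer")[0]
    return "1" if len(first) >= len(second) else "2"


cases = [
    ("What is SQLite?", "An embedded SQL database engine.", "A database."),
    ("What is jq?", "A tool.", "A command-line JSON processor."),
]
print(win_rate(cases, longer_answer_judge))
```

With a real model as the judge, each call costs money and the verdicts themselves are non-deterministic, which is part of why full eval runs get expensive.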
Simon’s current AI stack
Claude 3.5 Sonnet (Anthropic) is his default model for most work — he considers it the first time a non-OpenAI model has been clearly the best available.
By Simon’s account, Anthropic was founded by a splinter group from OpenAI who had tried to remove Sam Altman; they include people who worked on OpenAI’s GPT-3.
GPT-4o (OpenAI) is used for two specific features:
Code interpreter mode: can write and execute Python code, iterating until it works — useful for fiddly problems.
Voice mode: Simon uses AirPods to have hour-long conversational coding sessions while walking his dog, combining code generation with web lookups.
GitHub Copilot is always on in his IDE; he primarily uses autocomplete and the “sparkly icon” feature that lets him select lines of code and give a prompt to transform them.
He doesn’t use Copilot’s chat window at all.
Copilot actually runs a sophisticated RAG-like mechanism, pulling context from other files in the project based on semantic similarity — most users don’t realize this.
LLM (his own open source command-line tool): plugin-based, supports 100+ models, lets him pipe files and output from other commands into prompts. He mainly uses it with Claude but has also run local models like Microsoft’s Phi-3, Llama, Mistral, and Google’s Gemma.
Claude Artifacts: a feature where Claude can write HTML/CSS/JavaScript and render it in a secure iframe — Simon uses it to prototype UI changes, like redesigning blog pages from screenshots.
Running local models: why it’s worth it even if they’re worse
Local models are not as good as frontier cloud models, but running them teaches you how LLMs actually work.
Local models hallucinate wildly, which is educational — Simon recommends “ego searches” (asking a local model “who is Simon Willison?”) to see how confidently it fabricates answers.
Some are small enough to run on a phone (e.g., via the MLC Chat app), and are useful enough for looking up API documentation on a plane.
Tools like Hugging Face make downloading and running local models much less complicated than most people expect.
Productivity: how much faster is Simon now?
For the specific activity of typing code at a keyboard, Simon estimates he is 2–3x faster at turning thoughts into working code.
However, typing code is only about 10% of a senior engineer’s job — the rest is research, requirements gathering, design, and communication.
LLMs also speed up research: asking for options for solving a problem and getting working examples back is effectively a faster, more productive form of Google search.
The bigger productivity gain is scope: Simon can now take on projects in languages he doesn’t know well (like Go) because the LLM fills in the trivia — he shipped production Go code with tests and CI/CD despite not being a Go programmer.
He believes most engineers should limit themselves to 3–4 languages they know deeply, and use LLMs to bridge into others as needed.
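To see why a 2–3x typing speedup does not translate into a 2–3x faster engineer, the two figures above can be combined in an Amdahl’s-law style estimate. This is an illustrative calculation, not one Simon makes; it assumes only the coding share of the job speeds up, so it ignores the research and scope gains he also describes.

```python
def overall_speedup(coding_share, coding_speedup):
    """Estimate the whole-job speedup when only the coding share gets faster."""
    remaining = (1 - coding_share) + coding_share / coding_speedup
    return 1 / remaining


# 10% of the job is typing code, sped up by 2x or 3x.
for factor in (2, 3):
    print(factor, round(overall_speedup(0.10, factor), 3))
```

Under these assumptions the overall gain is roughly 5 to 7 percent, which is consistent with Simon’s point that the bigger win is being able to take on projects he otherwise would not attempt.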
Historical context: what were the big productivity bumps before LLMs?
Firebug (Firefox extension, precursor to Chrome DevTools) was a revelation for JavaScript debugging — before it, developers used alert() statements and had no element inspector. It brought Python’s interactive REPL-style workflow to front-end development.
Open source as a concept was the biggest productivity boost of Simon’s career: 25 years ago, companies had blanket bans on open source; today, no front-end code could be written without npm.
GitHub massively accelerated open source by replacing SourceForge, mailing lists, and CVS/Subversion with a one-click experience.
Package managers (CPAN for Perl in the late ’90s, then pip and npm) made the real difference: the ability to pip install or npm install a solution to a problem reduced the cost of building software to a fraction of what it was 20 years ago.
The industry spent decades chasing software reusability through OOP; the actual fix was a vibrant open source community with documented, packaged software.
Misconceptions and resistance
“It’ll make you productive on day one” — false. Simon emphasizes that these are power-user tools requiring significant investment to learn. There is no manual for Copilot; even OpenAI has almost no documentation. Claude’s prompting guide is the best resource available.
Superstition and junk advice are rampant: tips like “tell the model you’re the world’s greatest expert in X” spread because people try them once, get a good result, and form superstitious habits — like a dog checking a bush where it once found a hamburger.
Ethical concerns are legitimate: these models were trained on vast quantities of unlicensed copyrighted data. Image models like Stable Diffusion were trained on artists’ work and are now used in ways that replace commissions. Simon respects people who refuse to use these tools on ethical grounds, comparing it to veganism.
“The technology is plateauing”: Simon would welcome a plateau — the pace has been exhausting. He notes that many of the best techniques (like Chain of Thought prompting) were discovered independently, months after models shipped, so even without model improvements, better usage patterns would continue to emerge.
AGI hype is disconnected from reality: Simon does not believe you get to AGI from auto-completing sentences, and is skeptical of claims that AI will replace all knowledge work. The mainstream narrative that software engineers are “replacing themselves” is, in his view, primarily a fundraising strategy.
The future: coding vs. professional coding
Simon draws a parallel to professional video creation: iPhones and YouTube didn’t kill professional video — they enabled millions of new creators while professionals continued to thrive.
He hopes LLMs increase the number of people who can do basic programming by an order of magnitude, since today you almost need a computer science degree to automate a dull task.
Two possible demand curves:
Demand for professional engineers goes down because basic work is automated.
Demand goes up because companies that couldn’t justify building custom software before (hiring 20 people for 6 months) can now do it with 5 people in 2 months — making more projects feasible.
Code equals liability: more code means more maintenance burden. Simon has observed that teams of less-experienced developers tend to produce spaghetti code over time, and the engineers who add the most value are those who can simplify, delete code, and explain why — systems thinking, not typing speed.
The skills that matter most now: system design, prioritization, QA/testing, and the ability to evaluate and question LLM outputs. Simon’s personal rule: never commit a line of code you don’t understand.
Advice for engineers
Experienced engineers: maintain side projects as low-stakes exploration spaces for AI tools. If your employer allows internal hack days, advocate for them. Simon uses his personal blog to test features like GitHub Copilot Workspace in live demos.
Less experienced engineers: take advantage of having more free time in your 20s to build side projects. Set yourself the challenge of having AI tools write every line of code.
Universal advice: get free accounts with the best available models (GPT-4o and Claude 3.5 Sonnet are both free with login) and just throw questions at them — including ones you think they’ll fail on, because that’s useful information. Play with Claude Artifacts for fun.
The key mindset: this stuff is supposed to be fun. Simon uses voice mode for prank phone calls to his dog, asks models to rap about technical answers, and builds custom CSS tools on demand instead of Googling — the joy of trying weird things and having them work is a genuine part of the experience.
Rapid-fire recommendations
Book: Martin Kleppmann’s Designing Data-Intensive Applications. The Bluesky team requires all engineers to have it on their shelf.
Fiction genre: British wizards tangled up in old-school British bureaucracy — Charles Stross’s Laundry Files series and Ben Aaronovitch’s Rivers of London series.
Favorite programming language/framework: JavaScript with no framework (vanilla JS with querySelectorAll and map; most of what jQuery used to provide is now built into browsers).
Exciting company: Fly.io — a hosting platform that makes it easy to spin up secure containers with an API and pricing model that makes sense, which Simon used to build a SaaS platform on top of Datasette where each customer gets an isolated container with encrypted volumes.