Chip Huyen, author of the O’Reilly book AI Engineering, joins to discuss what AI engineering is, how it differs from traditional ML engineering, and practical approaches for software engineers building AI applications today. She emphasizes that while the field moves fast, many fundamentals are stable, and the real challenge is disciplined problem-solving rather than chasing shiny tools.
What AI engineering is and how it differs from ML engineering
AI engineering describes building applications on top of foundation models accessed via APIs, which dramatically lowers the entry barrier compared to traditional ML engineering.
Previously, ML engineers had to gather and label data, then train and babysit their own models from scratch.
Now, engineers can start with a product idea, call a powerful model via API, and only later worry about data, evaluation, and potentially fine-tuning or hosting their own models.
The development path has reversed:
ML engineering: data → model → product.
AI engineering: product → data → model.
This shift places much more emphasis on product thinking and data, since the underlying AI capabilities are increasingly commoditized.
Chip surveyed practitioners building on foundation models and found most identified with the term “AI engineer,” which is why she chose it for the book.
How the book stays relevant in a fast-moving field
Chip focused on fundamentals that have been around for years, even if the buzzwords are new.
Language modeling dates back to Claude Shannon in the 1950s.
RAG (retrieval-augmented generation) is built on retrieval, a mature technology powering search and recommendation systems.
Vector databases and vector search have a long history in information retrieval.
She distinguishes between temporary capability gaps and fundamental limitations, betting that some “hot topics” (like prompt engineering tips) will fade as models become more robust.
She also anticipated trends like multimodality early, arguing it was inevitable even when others thought it was too soon.
Typical steps to build an AI application
Start by defining what a good response looks like.
This is not always intuitive; LinkedIn found that candidates didn’t want a “you’re a terrible fit” answer but rather guidance on gaps and better-fitting roles.
Build clear guidelines and iterate on prompts with examples.
Evaluate rigorously.
Create a set of queries and expected responses.
Use both automated metrics (AI-as-a-judge) and human evaluation to measure progress.
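A minimal sketch of such an evaluation set, assuming the application can be called as a plain function from query to response. The names EvalCase, exact_match, and run_eval are hypothetical, and exact matching is only a stand-in scorer: an AI-as-a-judge or human score can be plugged in through the same scorer argument.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    query: str
    expected: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences are not counted as failures.
    return " ".join(text.lower().split())


def exact_match(response: str, expected: str) -> float:
    # Stand-in scorer: 1.0 on a normalized exact match, else 0.0.
    return 1.0 if normalize(response) == normalize(expected) else 0.0


def run_eval(
    app: Callable[[str], str],
    cases: List[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> Tuple[float, List[Tuple[EvalCase, str]]]:
    """Return the mean score and the cases that scored below 1.0."""
    if not cases:
        raise ValueError("evaluation set is empty")
    total = 0.0
    failures = []
    for case in cases:
        response = app(case.query)
        score = scorer(response, case.expected)
        total += score
        if score < 1.0:
            failures.append((case, response))
    return total / len(cases), failures


# Hypothetical stand-in for the real application (a model call in practice).
def toy_app(query: str) -> str:
    canned = {"Capital of France?": " paris ", "What is 2 + 2?": "5"}
    return canned.get(query, "")


cases = [
    EvalCase("Capital of France?", "Paris"),
    EvalCase("What is 2 + 2?", "4"),
]
mean_score, failures = run_eval(toy_app, cases)
```

Keeping the failing cases alongside the score makes it easier to inspect individual responses by hand rather than tracking only an aggregate number.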
Add context via RAG when needed.
Retrieve relevant documents, resumes, job listings, etc., to help the model answer better.
Start simple: keyword retrieval and basic chunking before jumping to vector databases.
Vector search can be expensive, add latency, and miss exact matches such as error codes.
BM25, a 20+ year old term-based retrieval method, is a strong baseline that many systems should be benchmarked against.
Hybrid search combining keyword and semantic retrieval is common in practice.
Data preparation—metadata extraction, summarization, contextual retrieval—often yields bigger performance gains than switching databases.
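To make the keyword baseline concrete, here is a small self-contained BM25 sketch in pure Python. It is an illustration under assumptions, not a production retriever: the tokenizer is a simple regex, the parameters k1=1.5 and b=0.75 are common defaults rather than tuned values, the IDF formula is one common non-negative variant, and the documents are invented.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Lowercased alphanumeric tokens; an error code like "E1234" stays whole.
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        if not docs:
            raise ValueError("need at least one document")
        self.k1 = k1
        self.b = b
        self.term_freqs = [Counter(tokenize(d)) for d in docs]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        # Fall back to 1.0 so all-empty corpora do not divide by zero.
        self.avg_len = sum(self.doc_lens) / len(docs) or 1.0
        # Document frequency: number of documents containing each term.
        df = Counter()
        for tf in self.term_freqs:
            df.update(tf.keys())
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5))
            for term, f in df.items()
        }

    def score(self, query: str, idx: int) -> float:
        tf = self.term_freqs[idx]
        # Length normalization: longer documents are penalized via b.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[idx] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            freq = tf.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total

    def search(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        # Return (document index, score) pairs, best first, dropping zero scores.
        scored = [(i, self.score(query, i)) for i in range(len(self.term_freqs))]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return [(i, s) for i, s in scored[:top_k] if s > 0]


docs = [
    "Restart the service if the deployment fails with error E1234.",
    "General troubleshooting guide for deployment failures.",
    "How to request vacation days.",
]
index = BM25(docs)
exact_hits = index.search("E1234")
broad_hits = index.search("deployment error")
```

Here the query "E1234" matches only the first document, which is the exact-match behavior that a term-based baseline handles well. Any vector or hybrid setup can then be compared against these rankings on the same queries.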
Fine-tune only after exhausting other options.
Fine-tuning introduces new problems: hosting, memory constraints, maintenance, and the risk that a newly released open-source model outperforms your fine-tuned version within days.
It should be a last resort, not a first-line defense.
Practical ways for software engineers to get started
Use a structured, incremental deployment approach (Microsoft’s crawl-walk-run framework):
Start with a human in the loop: AI suggests responses, humans pick or lightly edit.
Roll out to smaller user groups or internal use cases as confidence grows.
Expand automation and scope gradually.
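The three rollout steps above can be sketched as a simple gate in front of the model. This is a hypothetical illustration: the stage names, the pilot_users set, and the draft_fn and review_fn callables are assumptions made for the example, not part of any framework.

```python
from enum import Enum
from typing import Callable, Set


class Stage(Enum):
    HUMAN_REVIEW = 1  # every AI draft is picked or edited by a human
    PILOT_AUTO = 2    # drafts go out unreviewed only for a small pilot group
    FULL_AUTO = 3     # drafts go out unreviewed for everyone


def needs_review(stage: Stage, user_id: str, pilot_users: Set[str]) -> bool:
    if stage is Stage.HUMAN_REVIEW:
        return True
    if stage is Stage.PILOT_AUTO:
        return user_id not in pilot_users
    return False


def respond(
    query: str,
    user_id: str,
    stage: Stage,
    pilot_users: Set[str],
    draft_fn: Callable[[str], str],
    review_fn: Callable[[str, str], str],
) -> str:
    # The model always drafts; the stage decides whether a human sees it first.
    draft = draft_fn(query)
    if needs_review(stage, user_id, pilot_users):
        return review_fn(query, draft)
    return draft


# Hypothetical stand-ins for a model call and a human review step.
def fake_model(query: str) -> str:
    return f"draft answer to: {query}"


def fake_reviewer(query: str, draft: str) -> str:
    return draft + " [approved by human]"


pilot = {"alice"}
reviewed = respond("reset password", "bob", Stage.PILOT_AUTO, pilot, fake_model, fake_reviewer)
automatic = respond("reset password", "alice", Stage.PILOT_AUTO, pilot, fake_model, fake_reviewer)
```

Moving to the next stage is then a one-line configuration change, which keeps the scope of automation an explicit decision rather than a side effect of the code.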
Don’t jump to Gen AI by default.
Many problems can be solved with simpler methods like classifiers or routing models.
Focus on the business problem first: understand bottlenecks, user needs, and existing workflows.
Use the simplest solution that works, not the fanciest one.
Avoid common mistakes:
Using Gen AI when it’s not needed (e.g., scheduling tasks to off-peak hours can be done greedily without AI).
Giving up on Gen AI too soon because of poor product understanding, weak evaluation, or failure to pinpoint where the system breaks.
Adopting complex frameworks prematurely; many agent frameworks today contain bugs, typos in prompts, and untested best practices.
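The off-peak scheduling example shows how little machinery such a problem can need. Below is a rough greedy sketch, assuming each task has a single cost, each hour has a known existing load, and a task fits entirely within one hour. All names and numbers are invented for illustration.

```python
from typing import Dict, List, Tuple


def schedule_off_peak(
    task_costs: Dict[str, int], hourly_load: List[int]
) -> Tuple[Dict[str, int], List[int]]:
    """Greedily place each task in the currently least-loaded hour.

    Tasks are handled largest first; ties between hours go to the
    earliest hour. Returns the task -> hour assignment and the
    resulting load per hour.
    """
    if not hourly_load:
        raise ValueError("need at least one hour slot")
    load = list(hourly_load)  # copy so the caller's list is not mutated
    assignment = {}
    for name, cost in sorted(task_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        hour = min(range(len(load)), key=lambda h: (load[h], h))
        assignment[name] = hour
        load[hour] += cost
    return assignment, load


existing_load = [5, 1, 2, 8]  # load already present in each of four hours
tasks = {"backup": 3, "reindex": 2, "report": 1}
assignment, final_load = schedule_off_peak(tasks, existing_load)
# assignment -> {"backup": 1, "reindex": 2, "report": 1}
# final_load -> [5, 5, 4, 8]
```

A greedy pass like this is not guaranteed to be optimal, but it is deterministic, cheap, and easy to test, which is often enough for this kind of task.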
Evaluating AI systems
Evaluation is one of the hardest and most important parts of AI engineering.
As AI becomes smarter, it’s harder for humans to judge outputs (e.g., verifying a convincing book summary might require reading the whole book).
Functional correctness: evaluate based on task performance (e.g., does the code compile and produce the expected output?).
Coding is a popular use case partly because we have well-established ways to test code.
AI-as-a-judge: using one AI to evaluate another is growing in popularity and can be cost-effective, but judge quality depends on the model and prompt, and results can be non-deterministic.
Comparative evaluation: humans may struggle to give absolute scores but can reliably judge which of two outputs is better, even when both are superhuman.
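Of the approaches above, functional correctness is the easiest to sketch in code. The example below assumes each model output is Python source that should define a function with a known name. The helper names are hypothetical, and real generated code should only ever be executed inside a sandbox, which this sketch does not provide.

```python
from typing import Any, List, Sequence, Tuple

TestCase = Tuple[Sequence[Any], Any]  # (positional arguments, expected result)


def passes_tests(source: str, func_name: str, test_cases: List[TestCase]) -> bool:
    """Return True only if the source compiles, defines func_name,
    and returns the expected value for every test case."""
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code; sandbox untrusted model output.
        exec(compile(source, "<candidate>", "exec"), namespace)
    except Exception:
        return False  # syntax error or failure while defining the code
    func = namespace.get(func_name)
    if not callable(func):
        return False
    for args, expected in test_cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False  # runtime error counts as a failed test
    return True


def pass_rate(candidates: List[str], func_name: str, test_cases: List[TestCase]) -> float:
    # Fraction of candidate solutions that pass every test case.
    if not candidates:
        raise ValueError("no candidates to evaluate")
    passed = sum(passes_tests(src, func_name, test_cases) for src in candidates)
    return passed / len(candidates)


good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"  # does not compile

cases = [((1, 2), 3), ((0, 0), 0)]
rate = pass_rate([good, wrong, broken], "add", cases)
```

Only the first candidate passes here, so the pass rate is one in three. The same pattern extends to any task where an output can be checked mechanically against an expected result.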
Don’t skip human evaluation and manual data inspection.
Talk to users, observe their interactions, and look at real usage data to understand what actually matters.
Examples:
A meeting summary tool found users cared only about their personal action items, not overall correctness.
A tax chatbot saw low adoption not because of hallucinations but because users didn’t know what questions to ask; guiding them with suggested questions helped.
Manual data inspection has a very high value-to-effort ratio but is often undervalued.
Learning AI engineering as a software engineer
Combine project-based and structured learning.
Project-based learning: pick a project, work through real problems, and finish it.
Structured learning: take courses, read books, or study papers to fill gaps and ask the right questions.
Tutorials are useful but can lead to mindless copying; always stop to ask why things are done a certain way.
Observe your own workflow.
For a week, note what you do and estimate what percentage could be automated by AI, then try automating those tasks.
This builds intuition for viable use cases.
Will AI replace software engineering?
AI will automate parts of coding but not the core of software engineering, which is problem-solving.
Writing used to mean the physical act of putting words on paper; now it means arranging ideas into a readable format. Similarly, coding is the physical act; software engineering is devising executable solutions to problems.
Natural language is inherently ambiguous, while programming languages are precise. Bridging that gap requires someone who understands edge cases, system constraints, and the environment.
AI will enable engineers to build and maintain much more complex systems, just as word processors enabled writers to produce far longer works.
Exciting use cases beyond coding
Education: AI can help people learn faster by making it easier to find answers and references; the harder skill is learning to ask the right questions.
Entertainment: AI can create content that is both fun and intellectually stimulating, like strategy games that teach negotiation.
Content adaptation: AI can help adapt content across mediums—books to movies, papers to podcasts—more efficiently.
Enterprise efficiency: AI can automate information aggregation and transmission, potentially reshaping middle management and organizational structures.
Rapid fire
Most-used programming languages: Python and JavaScript, the latter for quickly building demos and products.
Favorite LLM: No single favorite; she uses different models for different tasks (e.g., Claude for less clichéd writing, DeepSeek R1 out of curiosity, Llama Vision for screenshot-to-code experiments).
Neat AI tool: A custom research assistant she built that scans paper links, extracts abstracts, checks authors, and retrieves citations—something that would have taken weeks to build before but now takes very little time.
Book recommendations:
Complex Adaptive Systems: about systems thinking and designing social dynamics to achieve goals.
The Selfish Gene: explores how ideas (memes) replicate like genes, offering a new perspective on free will and legacy.
Antifragile: Nassim Nicholas Taleb's book, whose ideas Chip finds compelling and thought-provoking.