Field notes from the production trenches
Engineering, research, and process essays from the Overflow Labs team. Published when we've got something worth saying — usually monthly.
Evals are the product
Most LLM systems fail in production not because the model is wrong, but because no one defined what 'right' looks like. Here's how we approach evaluation as a first-class deliverable.
RAG isn't magic — it's information retrieval with extra steps
Why teams keep shipping disappointing RAG systems, and what 30 years of IR research can teach us about building ones that actually work.
Buy vs. build vs. fine-tune in 2026
A decision framework for technical leaders evaluating AI infrastructure, with real cost models for the three most common architectures.
Agents without frameworks
Why the most reliable agentic systems we've shipped contain fewer than 200 lines of orchestration code, and what that says about the current framework landscape.
The quiet rise of small models
Frontier models get the headlines, but a 7B model fine-tuned on the right 50k examples is still the right answer for half the problems we see.
Why we ship on day 90
Every Overflow Labs engagement targets a production milestone in the first quarter. Here's the operating system that makes it possible.