The Model That Passed Eval and Failed in Production
Offline metrics look great. Production behavior is a disaster. This gap isn't bad luck — it's a design failure you can prevent.
The eval suite is green. Accuracy is up 3 points over the baseline. The team ships it.
Two weeks later, support tickets are climbing. Users are complaining about exactly the kind of outputs the eval was supposed to catch. The model passed every test and still made the product worse.
This is the offline-online gap, and it bites teams that mistake metric improvement for production readiness.
Why evals lie to you.
Your eval dataset is a snapshot. It was collected at a point in time, from a distribution of inputs that existed then. Users in production didn’t get the memo. They type differently, ask things you didn’t anticipate, and find edge cases your curators didn’t think to include.
When you optimize against a fixed benchmark, you’re fitting to that snapshot. If the benchmark is representative of production, great. It usually isn’t — because production is live and the benchmark isn’t.
The other problem: evals measure what you measure. If your eval scores fluency and factual accuracy, but users actually care about tone, brevity, and whether the answer is actionable — you can improve your eval scores while degrading user experience. The numbers go up. The product gets worse.
The causes nobody wants to admit.
Eval data leakage. If any of your eval examples were ever used in prompt engineering, fine-tuning, or system design decisions, information about them has leaked into the system, even if the model never trained on them directly. You’ve overfit to your own benchmark.
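One cheap way to catch the most blatant leakage is to fingerprint eval inputs and check them against anything used during development. The sketch below is illustrative only: `find_leaked`, `eval_inputs`, and `dev_inputs` are hypothetical names, and it only catches matches that are identical after lowercasing and whitespace cleanup. Paraphrased or partially overlapping examples would need fuzzier matching.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide a match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaked(eval_inputs: list[str], dev_inputs: list[str]) -> list[str]:
    """Return eval inputs whose normalized form also appears in dev_inputs."""
    seen = {fingerprint(t) for t in dev_inputs}
    return [t for t in eval_inputs if fingerprint(t) in seen]


# Hypothetical data: dev_inputs stands in for prompt-tuning or fine-tuning examples.
eval_inputs = ["How do I reset my password?", "Cancel my order"]
dev_inputs = ["how do I  reset my password?", "Where is my refund?"]

leaked = find_leaked(eval_inputs, dev_inputs)
print(leaked)
```

Anything this flags should be pulled from the eval set or at least reported separately.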
Coverage gaps. The long tail of production inputs is always larger than your eval set. That 2% of inputs that look weird? At a million requests a day, 2% is 20,000 requests. Your eval might have zero examples of them.
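If production requests carry some kind of category label, a coverage check is a few lines. This is a minimal sketch under that assumption; the `coverage_gaps` helper and the category names are made up for illustration, and how you assign categories (a classifier, routing tags, manual sampling) is up to you.

```python
from collections import Counter


def coverage_gaps(prod_categories, eval_categories, min_eval_examples=1):
    """List production categories with fewer than min_eval_examples in the eval set.

    Returns (category, share_of_production_traffic, eval_example_count) tuples,
    ordered from most to least common in production.
    """
    prod = Counter(prod_categories)
    evl = Counter(eval_categories)
    total = sum(prod.values())
    gaps = []
    for category, count in prod.most_common():
        if evl[category] < min_eval_examples:
            gaps.append((category, count / total, evl[category]))
    return gaps


# Hypothetical sample: 2% of production traffic has no eval coverage at all.
prod_sample = ["billing"] * 60 + ["shipping"] * 38 + ["legal"] * 2
eval_set = ["billing"] * 30 + ["shipping"] * 20

gaps = coverage_gaps(prod_sample, eval_set)
print(gaps)
```

Raising `min_eval_examples` turns this from "is it covered at all" into "is it covered enough to trust the number".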
Label drift. What “correct” meant when the eval was built might not be what correct means now. Products change. Policies change. User expectations change. Eval datasets don’t always keep up.
Single-metric thinking. Optimizing for one score almost always trades off against something else. Lower latency vs. better reasoning. Higher accuracy vs. more refusals. You can’t see the tradeoff if you’re only watching one number.
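One way to make tradeoffs visible is to compare every tracked metric against the baseline and flag anything that regresses past a tolerance, rather than gating on a single score. The sketch below assumes relative change and a 2% tolerance; the metric names, numbers, and the `check_tradeoffs` helper are all hypothetical.

```python
def check_tradeoffs(baseline, candidate, higher_is_better, max_regression=0.02):
    """Return metrics where the candidate is worse than baseline by more than
    max_regression (relative change). Values are signed: negative means worse.
    Assumes every baseline value is non-zero."""
    regressions = {}
    for name, base in baseline.items():
        change = (candidate[name] - base) / base
        if not higher_is_better[name]:
            change = -change  # for "lower is better" metrics, an increase is a loss
        if change < -max_regression:
            regressions[name] = round(change, 3)
    return regressions


# Hypothetical numbers: accuracy improves while two other metrics get worse.
baseline = {"accuracy": 0.80, "refusal_rate": 0.05, "p95_latency_s": 1.20}
candidate = {"accuracy": 0.83, "refusal_rate": 0.08, "p95_latency_s": 1.25}
higher_is_better = {"accuracy": True, "refusal_rate": False, "p95_latency_s": False}

regressions = check_tradeoffs(baseline, candidate, higher_is_better)
print(regressions)
```

Here the headline metric goes up, yet the check still reports refusal rate and latency as regressions, which is the conversation you want to have before shipping.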
What production-grade eval looks like.
Shadow mode before cutover. Run the new model alongside the existing one in production — same real inputs, no user impact. Compare outputs. Look for systematic differences, not just aggregate metrics. Does the new model behave differently on your top user segments? On your most common query types?
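A shadow comparison can be sketched in a few lines. Everything here is an assumption for illustration: models are plain callables, each request has a `text` and a `segment`, and outputs are compared by exact equality (a real system would likely use a task-specific comparison and run the shadow call asynchronously).

```python
from collections import defaultdict


def shadow_compare(requests, current_model, candidate_model, same=lambda a, b: a == b):
    """Serve current_model's output; run candidate_model on the same input in shadow.

    Returns (served_outputs, disagreement_rate_by_segment).
    """
    served = []
    stats = defaultdict(lambda: {"total": 0, "disagree": 0})
    for req in requests:
        live = current_model(req["text"])
        served.append(live)  # users only ever see the current model's output
        try:
            shadow = candidate_model(req["text"])
        except Exception:
            shadow = None  # a shadow failure must never affect the live response
        seg = stats[req["segment"]]
        seg["total"] += 1
        if shadow is None or not same(live, shadow):
            seg["disagree"] += 1
    rates = {name: s["disagree"] / s["total"] for name, s in stats.items()}
    return served, rates


# Toy stand-ins for real models: the candidate behaves differently on long inputs.
current = lambda text: text.upper()
candidate = lambda text: text.upper() if len(text) < 10 else text

requests = [
    {"text": "hi", "segment": "short"},
    {"text": "ok", "segment": "short"},
    {"text": "a much longer query", "segment": "long"},
]
served, rates = shadow_compare(requests, current, candidate)
print(rates)
```

The per-segment rates are the point: an aggregate disagreement rate of one in three hides that the candidate disagrees on every long query.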
Slice your evals. Don’t just report overall accuracy. Break it down by query category, user cohort, input length, language, time of day if it matters. Regressions hide in aggregates. They show up in slices.
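Slicing needs nothing more than a group-by. A minimal sketch, assuming each eval record is a dict with a boolean `correct` field plus whatever slice keys you log; the `category` key and the data are invented for the example.

```python
from collections import defaultdict


def accuracy_by_slice(records, key):
    """Accuracy per distinct value of records[i][key]."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        correct[record[key]] += int(record["correct"])
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical results: a healthy aggregate hiding a slice that fails completely.
records = (
    [{"category": "faq", "correct": True}] * 8
    + [{"category": "refunds", "correct": False}] * 2
)

overall = sum(r["correct"] for r in records) / len(records)
by_category = accuracy_by_slice(records, "category")
print(overall, by_category)
```

Overall accuracy is 80%, which looks fine until the breakdown shows every refund query failing.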
Track production metrics separately. Downstream signals — session length, retry rate, thumbs-down, escalation to human, task completion — are lagging indicators, but they’re honest ones. Wire them up. An eval suite with no connection to business outcomes is a comfort metric, not a signal.
Treat your eval set like a codebase. It needs to be versioned, maintained, and updated as production evolves. Old examples that no longer reflect real usage should be retired. New failure modes discovered in production should be added. Eval maintenance is not optional work.
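What "versioned and maintained" means will vary by team. As one possible shape, here is a small sketch where examples are added with a source, retired with a reason instead of deleted, and the version is a hash of the active examples. The `EvalSet` class and its fields are hypothetical, not a reference to any existing tool.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class EvalSet:
    # Maps example id -> {"input", "expected", "source", "retired_reason"}.
    examples: dict = field(default_factory=dict)

    def add(self, ex_id: str, input_text: str, expected: str, source: str) -> None:
        self.examples[ex_id] = {
            "input": input_text,
            "expected": expected,
            "source": source,
            "retired_reason": None,
        }

    def retire(self, ex_id: str, reason: str) -> None:
        """Keep the example for history, but drop it from the active set."""
        self.examples[ex_id]["retired_reason"] = reason

    def active(self) -> dict:
        return {k: v for k, v in self.examples.items() if v["retired_reason"] is None}

    def version(self) -> str:
        """Content hash of the active examples; changes whenever they change."""
        payload = json.dumps(self.active(), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


suite = EvalSet()
suite.add("faq-001", "How do I reset my password?", "Link to reset flow", source="curated")
v1 = suite.version()
suite.add("prod-017", "pw reset not working on mobile??", "Link to reset flow",
          source="production_incident")
v2 = suite.version()
suite.retire("faq-001", reason="flow redesigned; expected answer is stale")
v3 = suite.version()
print(v1, v2, v3)
```

Recording the version next to every eval score makes it obvious when two numbers were measured against different sets and are not comparable.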
The model is not the product.
The model is a component. Production behavior is a function of the model, the prompt, the retrieval layer, the input preprocessing, the output parsing, the downstream integration, and the distribution of real user inputs hitting all of it at once.
Your eval covers a slice of that stack. A passing eval means the slice you measured is probably fine.
Everything else, you find out the hard way — unless you’ve built the infrastructure to find out the smart way first.
Green evals are a starting point, not a finish line.