The Retry Loop That Ate Your API Quota
Naive retry logic is one of the most common — and most expensive — bugs in LLM production systems. Here's what it looks like and how to fix it.
Every LLM integration has a retry loop. Most of them are wrong.
Not wrong in an obvious way — wrong in the quiet, expensive way. They work fine in development. They mostly work in staging. Then production hits a rate limit at 2 AM, and suddenly you’re burning 10x your expected API spend watching requests slam against a wall in tight succession.
You’ve seen the code. It’s usually three lines of time.sleep(1) and a for i in range(3) wrapper someone added after the first timeout. It looks like a solution. It’s actually a liability.
The problem is compounding failures.
When your system hits a rate limit or a transient error, a naive retry loop does one thing: it sends the same request again, immediately or after a fixed delay. If the error is a transient spike, that might work. If the error is systemic — high load, quota exhaustion, a provider degradation — it makes things worse.
Fixed delays don’t back off fast enough. Under load, multiple concurrent callers all retry at the same interval and pile up together. The provider sees a wave, not a trickle. Your system sees a cascade, not a recovery.
Worse: if the failure is on your side — a malformed request, a token limit exceeded, a bad input — retrying is pointless. You’re paying per token to fail the same way three times in a row.
What production retry logic actually needs:
Exponential backoff with jitter. Not sleep(1), sleep(2), sleep(3). Exponential: sleep(2^attempt). Jitter: randomize within a range so concurrent callers don’t synchronize. This is standard advice. Most LLM code doesn’t follow it.
Error classification before retry. 429 (rate limit)? Retry with backoff. 400 (bad request)? Don’t retry — fix the request. 500 (server error)? Retry with caution. Treating all errors as retryable is how you waste quota and mask bugs.
A circuit breaker. If your retry loop has failed five times in the last minute, stop retrying entirely for a cooldown window. A circuit breaker prevents your system from hammering a degraded provider indefinitely. It’s the difference between a spike and an outage.
Budget awareness. Know how much retry spend is acceptable per request, per workflow, per hour. Set hard limits. Log when you hit them. Retries are not free — they’re real tokens, real spend, real latency.
Fallback behavior. After retries are exhausted, what happens? “Raise an exception” is not a product decision, it’s a dodge. Graceful degradation, cached responses, queued retry — pick one. Have it ready before you need it.
The real cost isn’t the API bill.
It’s the silent failures. The request that retried three times, spent 15 seconds, and returned a stale cached answer without telling the user. The workflow that silently dropped a step because the error handler gave up. The on-call alert at 3 AM because a retry storm woke the monitoring.
Retry logic is infrastructure. It deserves the same design discipline as anything else in your stack — not a three-liner someone copy-pasted from a Stack Overflow thread.
Build the retry logic like you expect it to fail. Because it will.
And when it does, you want it failing on your terms — not the provider’s.