Retries are designed to improve reliability. Under stress, they do the opposite.
When a system slows down, retry logic increases load. Increased load slows the system further. This feedback loop collapses capacity.
A request fails. The client retries.
If the failure was transient, the retry succeeds.
But if the failure was caused by overload, the retry increases overload.
Reliability logic becomes a load amplifier.
Reference: AWS Builders Library – Timeouts, Retries, and Backoff with Jitter
Assume:
System is stable.
Utilization = 80%.
In queueing theory terms (M/M/1 approximation):
ρ = λ / μ
ρ = 8000 / 10000 = 0.8
As utilization approaches 1, latency increases non-linearly.
Now assume 5% of requests timeout due to latency spike.
8,000 × 5% = 400 retries per second
Effective traffic becomes:
8,000 + 400 = 8,400 RPS
Utilization:
ρ = 8400 / 10000 = 0.84
Latency increases. More requests exceed timeout.
Suppose timeout rate increases to 10%.
8,000 × 10% = 800 retries
Effective load = 8,800 RPS
ρ = 0.88
Positive feedback loop.
Retries target slow requests. Slow requests are usually P95 or P99.
That means retries amplify tail latency pressure, not average latency.
Google SRE notes that tail latency dominates user experience.
Reference: Google SRE – Handling Overload
Now add:
Worst-case multiplier:
Original: 8,000 RPS
1st retry: 8,000
2nd retry: 8,000
3rd retry: 8,000
Total potential = 32,000 RPS
A system built for 10,000 RPS now receives 32,000.
Collapse is deterministic.
In an M/M/1 queue, average response time:
W = 1 / (μ - λ)
If:
μ = 10000
λ = 9500
W = 1 / (500) = small
If λ increases to 9900:
W = 1 / (100)
5× worse latency from a 4% load increase.
Retries accelerate this non-linear region.
1. Exponential Backoff
delay = base * 2^attempt
2. Add Jitter
delay = random(0, base * 2^attempt)
3. Circuit Breaker
Stop sending traffic when failure threshold is exceeded.
4. Retry Budgets
Limit total retries as percentage of baseline traffic.
Reference: Google SRE – Cascading Failures
Retries are not free.
Under normal conditions, they improve reliability. Under stress, they multiply traffic.
Systems do not collapse because of a single failure. They collapse because feedback loops amplify load.
Retry logic is a load amplifier. Design it like one.