When Retry Made Our System Worse

We thought retry would make our system more reliable.
It made it collapse faster.


1. Context

We had a system using Memcache to improve read performance.

To support cache invalidation, we stored all active cache keys in a shared List<string>. Whenever data was updated, we would iterate through that list and clear related cache entries.
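
The pattern looked roughly like the sketch below. This is a reconstruction, not the original code: the static registry, the key naming, and the ICacheClient interface are assumptions, with ICacheClient.Remove standing in for the Memcache client's delete call.

    using System.Collections.Generic;
    using System.Linq;

    public interface ICacheClient
    {
        void Remove(string key);   // stands in for the Memcache client's delete call
    }

    public static class CacheInvalidator
    {
        // One shared, mutable list of every active cache key.
        private static readonly List<string> ActiveCacheKeys = new List<string>();

        public static void Register(string key) => ActiveCacheKeys.Add(key);

        // Called on every data update: find the related keys and clear them.
        public static void InvalidateFor(string prefix, ICacheClient cache)
        {
            var related = ActiveCacheKeys.Where(k => k.StartsWith(prefix)).ToList();
            foreach (var key in related)
            {
                cache.Remove(key);
                ActiveCacheKeys.Remove(key);   // unsynchronized mutation of shared state
            }
        }
    }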

It worked perfectly in testing.

Single requests. Controlled environment. No visible issues.


2. The Assumptions

We assumed that only one request at a time would be updating data and touching the shared key list.

We assumed that invalidation would either fully succeed or fully fail.

We assumed that retrying a failed request was always safe.

All three assumptions were wrong.


3. What Happened in Production

Under real traffic, concurrent requests started updating data at the same time.

The shared List<string> was not thread-safe.

Sometimes the invalidation step threw exceptions.
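
That failure is easy to reproduce in isolation. A minimal sketch of the race (the key format, the writer loop, and the task setup are arbitrary; in production the "writer" and the "reader" were concurrent requests):

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class ListRaceRepro
    {
        static readonly List<string> ActiveCacheKeys = new List<string>();

        static void Main()
        {
            // One task registers keys while the main thread enumerates the
            // list to invalidate entries - the same shape as concurrent requests.
            var writer = Task.Run(() =>
            {
                for (int i = 0; i < 100_000; i++)
                    ActiveCacheKeys.Add("product:" + i);   // unsynchronized write
            });

            try
            {
                while (!writer.IsCompleted)
                {
                    foreach (var key in ActiveCacheKeys)   // unsynchronized enumeration
                    {
                        _ = key;
                    }
                }
                Console.WriteLine("No exception this run - the race is timing-dependent.");
            }
            catch (Exception ex)
            {
                // Typical failure: InvalidOperationException, "Collection was
                // modified; enumeration operation may not execute."
                Console.WriteLine("Invalidation failed: " + ex.GetType().Name);
            }
        }
    }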

The endpoint returned HTTP 500.

And then the retry policy kicked in.


4. Where Things Got Worse

Here’s the critical part:

The data update itself had often already gone through; what failed was the cache invalidation afterwards.

But because the endpoint returned 500, the client retried.

Each retry:

- re-ran the whole handler, including side effects that had already completed,
- hit the same non-thread-safe list under even more concurrency,
- and piled extra load onto the exact code path that was already failing.

Retry didn’t fix the failure.
It amplified it.


5. Root Causes

a. Shared Mutable State

A non-thread-safe collection was shared across concurrent requests.

b. Non-Idempotent Side Effects

Retry re-executed logic that had already partially succeeded.
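
In handler terms, the ordering of side effects looked something like the sketch below (IDatabase and the handler names are placeholders; ICacheClient and CacheInvalidator are the sketch from the Context section):

    // Placeholder handler shape - the names and IDatabase are assumptions.
    public interface IDatabase
    {
        void Update(int id, string payload);
    }

    public class ProductUpdateHandler
    {
        private readonly IDatabase _db;
        private readonly ICacheClient _cache;

        public ProductUpdateHandler(IDatabase db, ICacheClient cache)
        {
            _db = db;
            _cache = cache;
        }

        public void Handle(int id, string payload)
        {
            _db.Update(id, payload);                                  // side effect #1: durable write, succeeds

            CacheInvalidator.InvalidateFor("product:" + id, _cache);  // side effect #2: can throw under load

            // If side effect #2 throws, the framework answers with HTTP 500,
            // the client retries, and side effect #1 runs a second time even
            // though it already completed.
        }
    }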

c. Blind Retry Policy

The system retried on 500 errors without understanding whether the operation was safe to retry.
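
The policy amounted to "retry anything that returns 500". A sketch of what that looks like with a bare HttpClient (the loop, the attempt count, and the content factory are assumptions; the real client may have used a retry library):

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    public static class BlindRetryClient
    {
        private static readonly HttpClient Client = new HttpClient();

        public static async Task<HttpResponseMessage> PostWithRetryAsync(
            string url, Func<HttpContent> bodyFactory, int maxAttempts = 3)
        {
            HttpResponseMessage response = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++)
            {
                response = await Client.PostAsync(url, bodyFactory());

                // Anything other than a 500 ends the loop.
                if (response.StatusCode != HttpStatusCode.InternalServerError)
                    return response;

                // A 500 only says "something broke". It says nothing about
                // whether the POST's side effects already happened, so every
                // extra attempt replays a non-idempotent operation.
            }
            return response;
        }
    }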


6. How I Would Redesign It Today
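
One direction that addresses all three root causes: keep the key registry in a thread-safe structure, make invalidation idempotent, and never let a failed cache cleanup turn into a 500 that invites a retry. The sketch below illustrates that idea rather than a drop-in fix; ConcurrentDictionary as the registry, the reuse of the ICacheClient interface from earlier, and the swallow-and-continue error handling are all assumptions.

    using System;
    using System.Collections.Concurrent;
    using System.Linq;

    public class SafeCacheInvalidator
    {
        // Thread-safe "set" of active cache keys; the byte value is unused.
        private readonly ConcurrentDictionary<string, byte> _activeKeys =
            new ConcurrentDictionary<string, byte>();

        private readonly ICacheClient _cache;

        public SafeCacheInvalidator(ICacheClient cache) => _cache = cache;

        public void Register(string key) => _activeKeys.TryAdd(key, 0);

        // Idempotent: removing a key that is already gone is a no-op, so a
        // retried invalidation cannot trip over partially completed work.
        public void InvalidateFor(string prefix)
        {
            foreach (var key in _activeKeys.Keys.Where(k => k.StartsWith(prefix)))
            {
                try
                {
                    _cache.Remove(key);
                    _activeKeys.TryRemove(key, out _);
                }
                catch (Exception)
                {
                    // Swallow (and, in real code, log) the failure: let the
                    // entry expire via its TTL instead of surfacing a 500
                    // that invites a blind retry.
                }
            }
        }
    }

An even simpler option is to drop the registry entirely and lean on key versioning or short TTLs, so there is no shared state to protect in the first place.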


7. Big Lesson

Caching is hard.

Concurrency makes it harder.

Retry makes it dangerous.

Retry doesn’t improve reliability.
It magnifies design flaws.


Frameworks don’t scale systems.
Engineering decisions do.