When Retry Made Our System Worse

We thought retry would make our system more reliable.
It made it collapse faster.


1. Context

We had a system using Memcache to improve read performance.

To support cache invalidation, we stored all active cache keys in a shared List<string>. Whenever data was updated, we would iterate through that list and clear related cache entries.
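
The pattern looked roughly like the sketch below. This is a reconstruction, not the original code: the static registry, the key naming, and the ICacheClient interface are assumptions, with ICacheClient.Remove standing in for the Memcache client's delete call.

    using System.Collections.Generic;
    using System.Linq;

    public interface ICacheClient
    {
        void Remove(string key);   // stands in for the Memcache client's delete call
    }

    public static class CacheInvalidator
    {
        // One shared, mutable list of every active cache key.
        private static readonly List<string> ActiveCacheKeys = new List<string>();

        public static void Register(string key) => ActiveCacheKeys.Add(key);

        // Called on every data update: find the related keys and clear them.
        public static void InvalidateFor(string prefix, ICacheClient cache)
        {
            var related = ActiveCacheKeys.Where(k => k.StartsWith(prefix)).ToList();
            foreach (var key in related)
            {
                cache.Remove(key);
                ActiveCacheKeys.Remove(key);   // unsynchronized mutation of shared state
            }
        }
    }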

It worked perfectly in testing.

Single requests. Controlled environment. No visible issues.


2. The Assumptions

We assumed that only one request at a time would be updating data and touching the shared key list.

We assumed that invalidation would either fully succeed or fully fail.

We assumed that retrying a failed request was always safe.

All three assumptions were wrong.


3. What Happened in Production

Under real traffic, concurrent requests started updating data at the same time.

The shared List<string> was not thread-safe.

Sometimes the invalidation step threw exceptions.
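
That failure is easy to reproduce in isolation. A minimal sketch of the race (the key format, the writer loop, and the task setup are arbitrary; in production the "writer" and the "reader" were concurrent requests):

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class ListRaceRepro
    {
        static readonly List<string> ActiveCacheKeys = new List<string>();

        static void Main()
        {
            // One task registers keys while the main thread enumerates the
            // list to invalidate entries - the same shape as concurrent requests.
            var writer = Task.Run(() =>
            {
                for (int i = 0; i < 100_000; i++)
                    ActiveCacheKeys.Add("product:" + i);   // unsynchronized write
            });

            try
            {
                while (!writer.IsCompleted)
                {
                    foreach (var key in ActiveCacheKeys)   // unsynchronized enumeration
                    {
                        _ = key;
                    }
                }
                Console.WriteLine("No exception this run - the race is timing-dependent.");
            }
            catch (Exception ex)
            {
                // Typical failure: InvalidOperationException, "Collection was
                // modified; enumeration operation may not execute."
                Console.WriteLine("Invalidation failed: " + ex.GetType().Name);
            }
        }
    }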

The endpoint returned HTTP 500.

And then the retry policy kicked in.


4. Where Things Got Worse

Here’s the critical part:

The data update itself had often already gone through; what failed was the cache invalidation afterwards.

But because the endpoint returned 500, the client retried.

Each retry:

- re-ran the whole handler, including side effects that had already completed,
- hit the same non-thread-safe list under even more concurrency,
- and piled extra load onto the exact code path that was already failing.

Retry didn’t fix the failure.
It amplified it.


5. Root Causes

a. Shared Mutable State

A non-thread-safe collection was shared across concurrent requests.

b. Non-Idempotent Side Effects

Retry re-executed logic that had already partially succeeded.
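
In handler terms, the ordering of side effects looked something like the sketch below (IDatabase and the handler names are placeholders; ICacheClient and CacheInvalidator are the sketch from the Context section):

    // Placeholder handler shape - the names and IDatabase are assumptions.
    public interface IDatabase
    {
        void Update(int id, string payload);
    }

    public class ProductUpdateHandler
    {
        private readonly IDatabase _db;
        private readonly ICacheClient _cache;

        public ProductUpdateHandler(IDatabase db, ICacheClient cache)
        {
            _db = db;
            _cache = cache;
        }

        public void Handle(int id, string payload)
        {
            _db.Update(id, payload);                                  // side effect #1: durable write, succeeds

            CacheInvalidator.InvalidateFor("product:" + id, _cache);  // side effect #2: can throw under load

            // If side effect #2 throws, the framework answers with HTTP 500,
            // the client retries, and side effect #1 runs a second time even
            // though it already completed.
        }
    }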

c. Blind Retry Policy

The system retried on 500 errors without understanding whether the operation was safe to retry.
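
The policy amounted to "retry anything that returns 500". A sketch of what that looks like with a bare HttpClient (the loop, the attempt count, and the content factory are assumptions; the real client may have used a retry library):

    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    public static class BlindRetryClient
    {
        private static readonly HttpClient Client = new HttpClient();

        public static async Task<HttpResponseMessage> PostWithRetryAsync(
            string url, Func<HttpContent> bodyFactory, int maxAttempts = 3)
        {
            HttpResponseMessage response = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++)
            {
                response = await Client.PostAsync(url, bodyFactory());

                // Anything other than a 500 ends the loop.
                if (response.StatusCode != HttpStatusCode.InternalServerError)
                    return response;

                // A 500 only says "something broke". It says nothing about
                // whether the POST's side effects already happened, so every
                // extra attempt replays a non-idempotent operation.
            }
            return response;
        }
    }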


6. How I Would Redesign It Today
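
One direction that addresses all three root causes: keep the key registry in a thread-safe structure, make invalidation idempotent, and never let a failed cache cleanup turn into a 500 that invites a retry. The sketch below illustrates that idea rather than a drop-in fix; ConcurrentDictionary as the registry, the reuse of the ICacheClient interface from earlier, and the swallow-and-continue error handling are all assumptions.

    using System;
    using System.Collections.Concurrent;
    using System.Linq;

    public class SafeCacheInvalidator
    {
        // Thread-safe "set" of active cache keys; the byte value is unused.
        private readonly ConcurrentDictionary<string, byte> _activeKeys =
            new ConcurrentDictionary<string, byte>();

        private readonly ICacheClient _cache;

        public SafeCacheInvalidator(ICacheClient cache) => _cache = cache;

        public void Register(string key) => _activeKeys.TryAdd(key, 0);

        // Idempotent: removing a key that is already gone is a no-op, so a
        // retried invalidation cannot trip over partially completed work.
        public void InvalidateFor(string prefix)
        {
            foreach (var key in _activeKeys.Keys.Where(k => k.StartsWith(prefix)))
            {
                try
                {
                    _cache.Remove(key);
                    _activeKeys.TryRemove(key, out _);
                }
                catch (Exception)
                {
                    // Swallow (and, in real code, log) the failure: let the
                    // entry expire via its TTL instead of surfacing a 500
                    // that invites a blind retry.
                }
            }
        }
    }

An even simpler option is to drop the registry entirely and lean on key versioning or short TTLs, so there is no shared state to protect in the first place.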


7. Big Lesson

Caching is hard.

Concurrency makes it harder.

Retry makes it dangerous.

Retry doesn’t improve reliability.
It magnifies design flaws.


Frameworks don’t scale systems.
Engineering decisions do.