We thought retry would make our system more reliable.
It made it collapse faster.
We had a system using Memcache to improve read performance.
To support cache invalidation, we stored all active cache keys in a shared List<string>.
Whenever data was updated, we would iterate through that list and clear related cache entries.
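Here is a minimal sketch of that design, reconstructed for illustration; the class names and the Memcache wrapper are stand-ins, not our production code:

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the real Memcache client; the actual calls are omitted.
public static class Memcache
{
    public static void Delete(string key) { /* network call in the real system */ }
}

public static class CacheKeyTracker
{
    // One list shared by every request thread. List<string> is not thread-safe:
    // concurrent Add/RemoveAll can corrupt its internal array, and mutating it
    // while another thread enumerates throws InvalidOperationException.
    private static readonly List<string> ActiveKeys = new List<string>();

    public static void Track(string key) => ActiveKeys.Add(key);

    // Called on every data update: walk the list and clear related entries.
    public static void InvalidateRelated(string prefix)
    {
        foreach (var key in ActiveKeys)
        {
            if (key.StartsWith(prefix, StringComparison.Ordinal))
                Memcache.Delete(key);
        }
        ActiveKeys.RemoveAll(k => k.StartsWith(prefix, StringComparison.Ordinal));
    }
}
```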
It worked perfectly in testing. No visible issues. We had assumed:

- requests would arrive one at a time,
- the environment would stay as controlled as our tests,
- a List<string> was enough to track cache keys.

All three assumptions were wrong.
Under real traffic, concurrent requests started updating data at the same time.
The shared List<string> was not thread-safe.
Sometimes the invalidation step threw exceptions.
The endpoint returned HTTP 500.
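The failure is easy to reproduce with a few concurrent tasks hammering the tracker sketched above; under contention, enumeration and mutation overlap and List<string> throws InvalidOperationException ("Collection was modified"), which is what surfaced as our 500s:

```csharp
using System;
using System.Threading.Tasks;

public static class Repro
{
    public static async Task Main()
    {
        var tasks = new Task[16];
        for (int i = 0; i < tasks.Length; i++)
        {
            int n = i;
            tasks[i] = Task.Run(() =>
            {
                for (int j = 0; j < 10_000; j++)
                {
                    CacheKeyTracker.Track($"user:{n}:{j}");
                    CacheKeyTracker.InvalidateRelated($"user:{n}");
                }
            });
        }

        // Typically dies with InvalidOperationException
        // ("Collection was modified; enumeration operation may not execute").
        await Task.WhenAll(tasks);
    }
}
```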
And then the retry policy kicked in.
Here’s the critical part: because the endpoint returned 500, the client retried. Each retry:

- re-executed invalidation logic that had already partially succeeded,
- hit the same shared List<string> from yet another concurrent request,
- failed the same way, returned another 500, and triggered the next retry.
Retry didn’t fix the failure.
It amplified it.
What actually went wrong:

- A non-thread-safe collection was shared across concurrent requests.
- Retry re-executed logic that had already partially succeeded.
- The system retried on 500 errors without understanding whether the operation was safe to retry.
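On the last point, here is a sketch of the guard we were missing: retry only what the caller has declared idempotent. The RetryPolicy helper and the isIdempotent flag are assumptions for illustration, not a real library API:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

public static class RetryPolicy
{
    // Retries only operations the caller has declared safe to re-execute.
    public static async Task<HttpStatusCode> ExecuteAsync(
        Func<Task<HttpStatusCode>> operation,
        bool isIdempotent,
        int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            HttpStatusCode status = await operation();
            if (status != HttpStatusCode.InternalServerError)
                return status;

            // A non-idempotent operation may have partially succeeded;
            // re-running it can amplify the damage, so fail fast instead.
            if (!isIdempotent || attempt == maxAttempts)
                return status;

            await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
        }
    }
}
```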
The fix: replace the shared List<string> with a ConcurrentDictionary (or eliminate shared state entirely), and make invalidation safe to re-execute before any retry is allowed.
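As a sketch, the same tracker with the shared list replaced by a ConcurrentDictionary used as a concurrent set (reusing the Memcache stand-in from above); enumeration is safe alongside writers, and re-running invalidation is now harmless:

```csharp
using System;
using System.Collections.Concurrent;

public static class SafeCacheKeyTracker
{
    // Used as a thread-safe set: only the keys matter, the byte values are ignored.
    private static readonly ConcurrentDictionary<string, byte> ActiveKeys =
        new ConcurrentDictionary<string, byte>();

    public static void Track(string key) => ActiveKeys.TryAdd(key, 0);

    public static void InvalidateRelated(string prefix)
    {
        // Enumerating a ConcurrentDictionary is safe alongside concurrent
        // writers, and removing a key that is already gone is a no-op,
        // so re-running this method (e.g. on a retry) is harmless.
        foreach (var entry in ActiveKeys)
        {
            if (entry.Key.StartsWith(prefix, StringComparison.Ordinal))
            {
                Memcache.Delete(entry.Key);
                ActiveKeys.TryRemove(entry.Key, out _);
            }
        }
    }
}
```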
Caching is hard.
Concurrency makes it harder.
Retry makes it dangerous.
Retry doesn’t improve reliability.
It magnifies design flaws.
Frameworks don’t scale systems.
Engineering decisions do.