Production Lessons – What Broke and Why

Episode #1 – Redis Is Fast — Until You Design It Wrong


Context

We had a feature that allowed administrators to target a specific group of users for a time-based campaign. The selected user IDs were configured from an internal admin portal.

The requirement was simple: on every login, check whether the user belongs to the active campaign group.

In development, everything worked perfectly.


Original Design

All targeted user IDs were stored inside a single Redis key as a serialized JSON array.

        Key: campaign:2026-02-12
        Value: [1, 5, 20, 500, 999, ...]
        

On every login request:

  1. GET the key from Redis
  2. Deserialize the JSON array
  3. Perform a linear search to check if the userId exists
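
The three steps above can be sketched in Python roughly as follows. This is an illustration rather than the original service code: `client` is assumed to be a redis-py style client (for example `redis.Redis`), and the function name `is_targeted_original` is hypothetical.

```python
import json


def is_targeted_original(client, campaign_key, user_id):
    """Original design: fetch the whole list, then search it locally."""
    raw = client.get(campaign_key)   # 1. GET: the full payload crosses the network
    if raw is None:                  # key missing, so no campaign is configured
        return False
    user_ids = json.loads(raw)       # 2. deserialize the entire JSON array
    return user_id in user_ids       # 3. linear search over a list: O(n)
```

Every call pays for the whole list, whether it holds 20 IDs or 30,000.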

In staging, the list usually contained 20–100 users. Latency was negligible. No one questioned the design.


The Incident

One day, an administrator added more than 30,000 user IDs into the campaign group.

Unfortunately, this happened during peak traffic hours.

Within minutes, the service showed symptoms that, at first glance, looked like a memory leak. It wasn’t.


Root Cause Analysis

1. O(n) Lookup Per Login

        GET campaign:2026-02-12
        Deserialize 30,000 IDs
        Linear search
        

This is an O(n) operation, where n is the number of targeted user IDs, executed on every login.

Under high concurrency, we were repeatedly fetching the full payload, deserializing all 30,000 IDs, and scanning them linearly.

The cost wasn’t obvious at small scale. At peak traffic, it became catastrophic.
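
To get a feel for the per-login cost, a small local experiment along these lines can help. It is only a sketch: the 30,000 sequential IDs are hypothetical stand-ins for the real campaign list, the timings vary by machine, and network transfer is not measured at all. The Python `set` here is a local stand-in for a constant-time membership test.

```python
import json
import timeit

# Hypothetical data: 30,000 sequential IDs standing in for the real campaign list.
user_ids = list(range(1, 30_001))
payload = json.dumps(user_ids).encode("utf-8")  # roughly what GET would return
payload_bytes = len(payload)


def check_with_json(user_id):
    """Per-login work in the original design: parse everything, then scan."""
    return user_id in json.loads(payload)


id_set = set(user_ids)


def check_with_set(user_id):
    """Local stand-in for a constant-time membership test: one hash lookup."""
    return user_id in id_set


if __name__ == "__main__":
    runs = 200
    json_s = timeit.timeit(lambda: check_with_json(30_000), number=runs)
    set_s = timeit.timeit(lambda: check_with_set(30_000), number=runs)
    print(f"payload per login: {payload_bytes:,} bytes")
    print(f"parse + scan: {json_s / runs * 1e6:.1f} us per check")
    print(f"set lookup:   {set_s / runs * 1e6:.3f} us per check")
```

Multiplying the payload size and the parse time by the number of concurrent logins gives a rough sense of why the cost only showed up at peak traffic.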


2. Single Hot Key

All users were hitting the exact same Redis key simultaneously.

This created heavy contention on one key and a large network payload on every read.

We unintentionally turned Redis into a bottleneck.


3. Misaligned Data Modeling

Redis is a data structure server. We ignored that.

We treated Redis like:

"Give me everything, I will filter locally."

Instead of:

"Ask Redis the exact question you need answered."

Why It Looked Like a Memory Leak

It was not a true leak.

It was the combination of a large payload, full deserialization of the list on every login, and high concurrency.

Under peak load, this pattern amplified resource consumption dramatically.


Refactored Design

Option 1 – Use Redis Set

        Key: campaign:2026-02-12
        Type: SET
        Members: userId
        

On login:

        SISMEMBER campaign:2026-02-12 userId
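
In application code, the Set-based design might look like the sketch below. It assumes `client` is a redis-py style client exposing `sadd` and `sismember`. The function names and the `batch_size` parameter are hypothetical, introduced only for illustration.

```python
def add_campaign_users(client, campaign_key, user_ids, batch_size=1000):
    """Admin side: store each targeted user ID as a member of a Redis Set."""
    ids = list(user_ids)
    # Send members in batches so one huge SADD does not carry every ID at once.
    for start in range(0, len(ids), batch_size):
        client.sadd(campaign_key, *ids[start:start + batch_size])


def is_targeted(client, campaign_key, user_id):
    """Login side: a single SISMEMBER call; only a 0/1 reply comes back."""
    return bool(client.sismember(campaign_key, user_id))
```

The login path no longer transfers or parses the list; Redis answers the membership question itself.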
        

Benefits: an O(1) membership check, a minimal network payload, and no deserialization in the application.


Option 2 – Per User Key with TTL

        Key: user:{userId}:campaign
        Value: 1
        TTL: 24h
        

On login:

        EXISTS user:{userId}:campaign
        

This removes the hot key issue entirely.
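
A minimal sketch of the per-user key design, again assuming a redis-py style `client` with `set` and `exists`. The helper names and the `CAMPAIGN_TTL_SECONDS` constant are hypothetical; the key pattern and the 24h TTL come from the layout above.

```python
CAMPAIGN_TTL_SECONDS = 24 * 60 * 60  # the 24h TTL from the key layout


def campaign_key_for(user_id):
    """Build the per-user key, e.g. user:42:campaign."""
    return f"user:{user_id}:campaign"


def target_user(client, user_id):
    """Admin side: SET user:{userId}:campaign 1 with a 24h expiry."""
    client.set(campaign_key_for(user_id), 1, ex=CAMPAIGN_TTL_SECONDS)


def is_targeted(client, user_id):
    """Login side: EXISTS returns how many of the given keys exist (0 or 1 here)."""
    return client.exists(campaign_key_for(user_id)) == 1
```

Because each login touches a different key, no single key absorbs all the traffic, and expiry cleans up the campaign without a separate job.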


Performance Comparison

        Design          Time Complexity    Network Payload    Concurrency Risk
        JSON Array      O(n)               High               High
        Redis Set       O(1)               Minimal            Low
        Per-user Key    O(1)               Minimal            Very Low

Senior Engineering Lesson

Redis was never the problem.

The problem was asking the wrong question to the system.

The difference between:

"Give me the whole list."

and

"Does this member exist?"

is the difference between surviving peak traffic and crashing in production.


Closing Thought

Fast systems are not built by faster code. They are built by correct data modeling.