Production Lessons – What Broke and Why

Episode #1 – Redis Is Fast — Until You Design It Wrong

Context

We had a feature that allowed administrators to target a specific group of users for a time-based campaign. The selected user IDs were configured from an internal admin portal.

The requirement was simple:

Admin selects users
System stores the selected user IDs in Redis
When a user logs in, the system checks whether they belong to that campaign group

In development, everything worked perfectly.

Original Design

All targeted user IDs were stored inside a single Redis key as a serialized JSON array.

        Key: campaign:2026-02-12
        Value: [1, 5, 20, 500, 999, ...]

On every login request:

GET the key from Redis
Deserialize the JSON array
Perform a linear search to check if the userId exists
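
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the original service code: the text does not say which language or client library was used, so Python is an assumption, and `InMemoryStub` and `is_targeted_original` are hypothetical names. The stub only mimics the `get`/`set` calls a client such as redis-py exposes, so the snippet runs without a server.

```python
import json


class InMemoryStub:
    """Hypothetical stand-in for a Redis client; only get/set are modeled."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def is_targeted_original(client, campaign_key, user_id):
    raw = client.get(campaign_key)   # 1. fetch the whole serialized array
    if raw is None:
        return False
    user_ids = json.loads(raw)       # 2. deserialize every ID in the campaign
    return user_id in user_ids       # 3. linear scan of the resulting list


client = InMemoryStub()
client.set("campaign:2026-02-12", json.dumps([1, 5, 20, 500, 999]))
print(is_targeted_original(client, "campaign:2026-02-12", 500))
```

With five IDs this is harmless. Every step still scales with the size of the list, and every login pays for all three.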

In staging, the list usually contained 20–100 users. Latency was negligible. No one questioned the design.

The Incident

One day, an administrator added more than 30,000 user IDs to the campaign group.

Unfortunately, this happened during peak traffic hours.

Within minutes:

Redis memory usage spiked
CPU usage increased significantly
Application response time degraded
Login requests started timing out

At first glance, it looked like a memory leak. It wasn’t.

Root Cause Analysis

1. O(n) Lookup Per Login

        GET campaign:2026-02-12
        Deserialize 30,000 IDs
        Linear search

Each of these steps is O(n) in the number of targeted IDs, and all three ran on every login.

Under high concurrency, we were repeatedly:

Transferring large payloads over the network
Allocating large objects in memory
Triggering heavy garbage collection cycles

The cost wasn’t obvious at small scale. At peak traffic, it became catastrophic.
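
A back-of-the-envelope calculation makes the cost concrete. The sketch below is an estimate under stated assumptions, not a measurement from the incident: the IDs are hypothetical sequential integers, and `ASSUMED_LOGINS_PER_SECOND` is an invented figure, since the text gives no actual login rate.

```python
import json

ASSUMED_LOGINS_PER_SECOND = 500      # hypothetical; the real peak rate is not stated
user_ids = list(range(1, 30_001))    # hypothetical IDs; the real ones may be longer

# Size of the value that every single login pulled from Redis.
payload_bytes = len(json.dumps(user_ids).encode("utf-8"))

# Aggregate traffic generated just to answer "is this user in the group?".
bytes_per_second = payload_bytes * ASSUMED_LOGINS_PER_SECOND

print(f"payload per login: {payload_bytes / 1024:.0f} KiB")
print(f"transfer at {ASSUMED_LOGINS_PER_SECOND} logins/s: "
      f"{bytes_per_second / 1024**2:.0f} MiB/s")
```

Under these assumptions each login moves roughly 200 KB, which adds up to nearly 100 MB per second of traffic for a yes/no answer.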

2. Single Hot Key

All users were hitting the exact same Redis key simultaneously.

This created:

Hot key contention
High network I/O
Repeated full-value deserialization

We unintentionally turned Redis into a bottleneck.

3. Misaligned Data Modeling

Redis is a data structure server. We ignored that.

We treated Redis like:

"Give me everything, I will filter locally."

Instead of:

"Ask Redis the exact question you need answered."

Why It Looked Like a Memory Leak

It was not a true leak.

It was the combination of:

Large repeated allocations
High-frequency deserialization
Concurrent access to a large payload
Increased memory fragmentation

Under peak load, each in-flight login held its own deserialized copy of the list, so memory use scaled with concurrency rather than with the size of the data.
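
One way to see the allocation pattern is to measure what a single lookup allocates. The sketch below uses Python's standard `tracemalloc` module. The runtime of the original service is not stated, so the absolute numbers are only indicative. `peak_bytes_for_one_lookup` is a hypothetical helper, and the IDs are made up.

```python
import json
import tracemalloc

payload = json.dumps(list(range(1, 30_001)))   # hypothetical 30,000-ID payload


def peak_bytes_for_one_lookup(raw, user_id):
    """Deserialize the payload, search it, and report peak traced memory."""
    tracemalloc.start()
    try:
        found = user_id in json.loads(raw)
        _, peak = tracemalloc.get_traced_memory()   # returns (current, peak)
    finally:
        tracemalloc.stop()
    return found, peak


found, peak = peak_bytes_for_one_lookup(payload, 29_999)
print(f"found={found}, peak allocation for one login: {peak / 1024:.0f} KiB")
```

The list is garbage as soon as the function returns, so nothing leaks. Multiply the peak by the number of logins in flight, though, and the memory graph looks very much like a leak.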

Refactored Design

Option 1 – Use Redis Set

        Key: campaign:2026-02-12
        Type: SET
        Members: userId

On login:

        SISMEMBER campaign:2026-02-12 userId

Benefits:

O(1) lookup
No full payload transfer
No JSON deserialization
No large object allocation
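
A minimal sketch of Option 1, assuming a Python client. The method names `sadd` and `sismember` follow redis-py. `InMemoryStub`, `load_campaign`, and `is_targeted` are hypothetical names, and the stub exists only so the example runs without a Redis server.

```python
class InMemoryStub:
    """Hypothetical stand-in for a Redis client; models only SADD and SISMEMBER."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        target = self._sets.setdefault(key, set())
        before = len(target)
        target.update(str(m) for m in members)   # Redis stores members as strings
        return len(target) - before              # number of newly added members

    def sismember(self, key, member):
        return str(member) in self._sets.get(key, set())


def load_campaign(client, campaign_key, user_ids):
    """Admin side: store the targeted IDs as members of one set."""
    if not user_ids:
        return 0                                 # SADD needs at least one member
    return client.sadd(campaign_key, *user_ids)


def is_targeted(client, campaign_key, user_id):
    """Login side: one membership question, one tiny reply."""
    return bool(client.sismember(campaign_key, user_id))


client = InMemoryStub()
load_campaign(client, "campaign:2026-02-12", [1, 5, 20, 500, 999])
print(is_targeted(client, "campaign:2026-02-12", 500))   # True
print(is_targeted(client, "campaign:2026-02-12", 7))     # False
```

The login path no longer sees the list at all. Whether the campaign holds 20 users or 30,000, the reply is a single flag.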

Option 2 – Per User Key with TTL

        Key: user:{userId}:campaign
        Value: 1
        TTL: 24h

On login:

        EXISTS user:{userId}:campaign

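Each login now reads a different key, so the hot key issue disappears entirely. The trade-off is on the write side: configuring a campaign means writing one key per targeted user.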
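
A minimal sketch of Option 2, again assuming a Python client. `set(..., ex=...)` and `exists(...)` mirror the redis-py calls. `InMemoryStub`, `campaign_key`, `mark_targeted`, and `is_targeted` are hypothetical names, and the stub is only there so the snippet runs without a server.

```python
import time

CAMPAIGN_TTL_SECONDS = 24 * 60 * 60   # the 24h TTL from the key layout above


class InMemoryStub:
    """Hypothetical stand-in for a Redis client; models SET with `ex` and EXISTS."""

    def __init__(self):
        self._expires_at = {}

    def set(self, key, value, ex=None):
        # The value is ignored here; only presence and expiry matter for EXISTS.
        self._expires_at[key] = None if ex is None else time.monotonic() + ex
        return True

    def exists(self, *keys):
        now = time.monotonic()
        return sum(
            1
            for k in keys
            if k in self._expires_at
            and (self._expires_at[k] is None or self._expires_at[k] > now)
        )


def campaign_key(user_id):
    return f"user:{user_id}:campaign"


def mark_targeted(client, user_ids, ttl_seconds=CAMPAIGN_TTL_SECONDS):
    """Admin side: one small key per targeted user, expiring on its own."""
    for user_id in user_ids:
        client.set(campaign_key(user_id), 1, ex=ttl_seconds)


def is_targeted(client, user_id):
    """Login side: EXISTS on a key that only this user's logins touch."""
    return client.exists(campaign_key(user_id)) == 1


client = InMemoryStub()
mark_targeted(client, [1, 5, 20, 500, 999])
print(is_targeted(client, 500))   # True
print(is_targeted(client, 7))     # False
```

Against a real server, 30,000 individual SET calls mean 30,000 round trips. Batching them, for example with redis-py's `pipeline()`, is worth considering for the admin path.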

Performance Comparison

Design          Time Complexity    Network Payload    Concurrency Risk
JSON Array      O(n)               High               High
Redis Set       O(1)               Minimal            Low
Per-user Key    O(1)               Minimal            Very Low

Senior Engineering Lesson

Always model data based on access pattern
Hot keys can silently destroy performance
Complexity analysis matters in distributed systems
Peak traffic reveals architectural flaws

Redis was never the problem.

The problem was asking the system the wrong question.

The difference between:

"Give me the whole list."

and

"Does this member exist?"

is the difference between surviving peak traffic and crashing in production.

Closing Thought

Fast systems are not built by faster code. They are built by correct data modeling.