We discovered 209,033 regime keys with no TTL and fixed them in a single SCAN pass, then cut the regime endpoint latency 13x by eliminating per-request key scans.
209,033
Keys without TTL (found)
13x
Regime endpoint speedup (1,048ms → 79ms)
3.2x
Macro endpoint speedup (1,476ms → 465ms)
<4 min
SCAN sweep time
CHAPTER 01
Argus computes market regime classifications across 8 timeframes for every active symbol in its universe. Each classification cycle reads feature vectors from Redis, applies a 5-class classifier, and writes the results back as Redis hashes. Apex, the execution layer, reads those keys before sizing any position.
The problem was subtle and dangerous: nothing enforced a TTL on regime keys. The write path called redis.hset_multiple() to write the classification hash, then stopped. No EXPIRE call followed. If argus-regime stopped or fell behind, its last classification persisted indefinitely. Apex would read a regime label that could be hours or days old and apply position sizing and risk gates as if the regime were current.
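A minimal sketch of that failure mode, using the redis-rs async API the post implies; the helper name, key format, and field names are illustrative, not the actual argus-regime source:

```rust
use redis::AsyncCommands;

// Illustrative version of the original write path: the classification hash is
// written and the function returns. No EXPIRE follows, so the key carries a
// TTL of -1 and persists until something deletes it.
async fn write_regime(
    conn: &mut redis::aio::MultiplexedConnection,
    key: &str,                  // e.g. "argus:regime:BTCUSDT:1h" (format assumed)
    fields: &[(&str, String)],  // regime label, confidence, timestamp, ...
) -> redis::RedisResult<()> {
    let _: () = conn.hset_multiple(key, fields).await?;
    Ok(()) // bug: nothing sets an expiry on the key
}
```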
An audit on April 25, 2026 confirmed the scope: 209,033 regime keys in Redis, all with a TTL of -1 (permanent). The system had been running for weeks, accumulating keys that would silently go stale the moment the pipeline stopped or fell behind.
CHAPTER 02
The fix needed to operate at two levels. The immediate remediation covered the existing 209,033 stale keys. The permanent fix embedded TTL enforcement directly into the writer so no future key could ever be written without an expiry.
For the remediation sweep, a naive KEYS argus:regime:* was ruled out immediately. Redis KEYS is O(N) and blocks the server for the duration of the scan; with 209K matching keys in a production database also serving live feature reads, a blocking scan would have caused latency spikes in Apex's regime reads. Instead, the sweep used an iterative SCAN with a cursor and COUNT 200 per batch, issuing EXPIRE key 86400 for every key returned. The full sweep completed without measurable impact on Redis latency.
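A minimal sketch of the sweep under those constraints, assuming the redis-rs async client; the connection URL and bookkeeping are illustrative, while the argus:regime:* pattern, COUNT 200, and the 86400-second TTL come from the remediation described above:

```rust
use redis::AsyncCommands;

#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut conn = client.get_multiplexed_async_connection().await?;

    let mut cursor: u64 = 0;
    let mut touched: u64 = 0;
    loop {
        // SCAN is incremental, so live feature reads keep flowing between batches.
        let (next, keys): (u64, Vec<String>) = redis::cmd("SCAN")
            .arg(cursor)
            .arg("MATCH").arg("argus:regime:*")
            .arg("COUNT").arg(200)
            .query_async(&mut conn)
            .await?;
        for key in &keys {
            let _: () = conn.expire(key, 86_400).await?;
            touched += 1;
        }
        cursor = next;
        if cursor == 0 {
            break;
        }
    }
    println!("set TTL on {touched} keys");
    Ok(())
}
```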
For the permanent fix, the choice was 86,400 seconds (24 hours). A 1-hour TTL would surface stale-regime bugs faster, but the regime pipeline runs on a configurable interval, and a transient delay of several minutes stacked on a long interval could push a refresh past the 1-hour mark and expire healthy keys during normal operation.
ARCHITECTURE OVERVIEW
[Architecture diagram: ingress into a six-pod cluster, plus storage and observability layers; stack badges show Redis 7.2, Rust 1.84, Tokio 1.40, and ioredis 5.3.]
CHAPTER 03
The EXPIRE call follows hset_multiple in the same async context. If hset_multiple succeeds but EXPIRE fails, the key exists without a TTL. That was accepted as tolerable given Redis 7.2 reliability on localhost: in the failure scenario, the health writer would mark the cycle as degraded before the stale key aged past 24 hours. Complete atomicity via a pipeline or a Lua script was considered but rejected as over-engineering for a single-host configuration.
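A sketch of the corrected writer under the same assumptions as the earlier snippet: EXPIRE follows hset_multiple in the same async context, and an EXPIRE failure propagates to the caller so the cycle can be marked degraded.

```rust
use redis::AsyncCommands;

const REGIME_TTL_SECS: i64 = 86_400; // 24 hours, per Chapter 02

async fn write_regime_with_ttl(
    conn: &mut redis::aio::MultiplexedConnection,
    key: &str,
    fields: &[(&str, String)],
) -> redis::RedisResult<()> {
    let _: () = conn.hset_multiple(key, fields).await?;
    // Not atomic with the hash write: if this call fails, the key briefly has
    // no TTL, which the post accepts because the health writer flags the cycle.
    let _: () = conn.expire(key, REGIME_TTL_SECS).await?;
    Ok(())
}
```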
The regime collection cache at the /api/intelligence/regime route was the bigger win. The original route iterated over all 209,033 Redis regime keys using a SCAN loop to build a regime summary. SCAN on 209,033 keys took 1,048ms even with Redis's non-blocking approach, because the collection traversal was synchronous from the API route's perspective. The fix was a single pre-aggregated cache key updated by the regime pipeline after every write cycle. Instead of scanning 209,033 keys per request, each request performed a single GET. Latency dropped from 1,048ms to between 79 and 90ms.
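A sketch of that cache shape, shown in Rust for consistency with the other snippets even though the route layer itself may not be Rust; the summary key name, payload shape, and reuse of the 24-hour expiry are assumptions:

```rust
use redis::AsyncCommands;

const SUMMARY_KEY: &str = "argus:regime:summary"; // assumed name, not from the post

// Writer side: called by the regime pipeline after every write cycle, with the
// per-cycle summary already serialized to JSON upstream.
async fn publish_regime_summary(
    conn: &mut redis::aio::MultiplexedConnection,
    summary_json: &str,
) -> redis::RedisResult<()> {
    // SETEX keeps the summary on the same 24h expiry discipline as the hashes.
    let _: () = conn.set_ex(SUMMARY_KEY, summary_json, 86_400).await?;
    Ok(())
}

// Reader side: what /api/intelligence/regime now does per request, a single GET
// instead of a 209,033-key SCAN.
async fn read_regime_summary(
    conn: &mut redis::aio::MultiplexedConnection,
) -> redis::RedisResult<Option<String>> {
    conn.get(SUMMARY_KEY).await
}
```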
CHAPTER 04
Before the fix: 209,033 keys, 0 with TTL, 100% permanent. After the SCAN sweep: 209,033 keys, all with TTL approximately 86,400 seconds. The sweep ran in under 4 minutes on a server with Redis 7.2 running at roughly 4,000 operations per second for active ingest.
The /api/macro route demonstrated the performance model. Cold-cache latency measured at 1,476ms. Warm-cache latency measured at 465ms. A 3.2x improvement for the most common repeated query pattern. Regime hash reads via hgetall on a warm key return in under 5ms versus 40 to 80ms for a ClickHouse fallback.
CHAPTER 05
DECISION · 01
The core lesson is that write-path invariants are only enforced when a test asserts them explicitly. Unit tests for argus-regime covered classification accuracy, but nothing verified that a written key had an expiry. An integration test that writes a classification cycle and then checks TTL (sketched below) would have caught this before production.
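A sketch of that integration test against a local Redis; in practice it would call the production writer rather than inlining the write, and the key and field names here are illustrative:

```rust
use redis::AsyncCommands;

#[tokio::test]
async fn regime_write_sets_ttl() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut conn = client.get_multiplexed_async_connection().await?;

    // Stand-in for one classification write cycle (the real test would call the
    // production write path instead of duplicating it here).
    let key = "argus:regime:TEST:1h";
    let fields = [("regime", "range"), ("confidence", "0.55")];
    let _: () = conn.hset_multiple(key, &fields).await?;
    let _: () = conn.expire(key, 86_400).await?;

    // TTL of -1 means the write path regressed to permanent keys.
    let ttl: i64 = redis::cmd("TTL").arg(key).query_async(&mut conn).await?;
    assert!(ttl > 0 && ttl <= 86_400, "regime key written without TTL (ttl = {ttl})");

    let _: i64 = conn.del(key).await?; // clean up the test key
    Ok(())
}
```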
DECISION · 02
The decision to use Redis hashes per symbol-timeframe rather than a single large sorted set was correct for this access pattern: Apex reads regimes for one symbol at a time before a trade decision, so hgetall on one key is more natural than range scans.
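For illustration, the read Apex performs under this layout is a single hgetall on one symbol/timeframe hash; the key format here is assumed from the namespace described above:

```rust
use redis::AsyncCommands;
use std::collections::HashMap;

// One round trip per trade decision: fetch the full regime hash for one
// symbol/timeframe pair.
async fn read_regime(
    conn: &mut redis::aio::MultiplexedConnection,
    symbol: &str,
    timeframe: &str,
) -> redis::RedisResult<HashMap<String, String>> {
    let key = format!("argus:regime:{symbol}:{timeframe}");
    conn.hgetall(&key).await
}
```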
DECISION · 03
A recurring maintenance task that checks for regime keys with TTL of -1 and alerts would catch this class of bug in any future regression, particularly if new key namespaces are added without TTL discipline.
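A minimal sketch of that check, reusing the same cursor-driven SCAN as the remediation sweep; the alerting hook is left to the caller and all names are illustrative:

```rust
// Count regime keys that have no expiry (TTL -1). A non-zero result is the
// alert condition for the recurring task described above.
async fn count_permanent_regime_keys(
    conn: &mut redis::aio::MultiplexedConnection,
) -> redis::RedisResult<u64> {
    let mut cursor: u64 = 0;
    let mut permanent: u64 = 0;
    loop {
        let (next, keys): (u64, Vec<String>) = redis::cmd("SCAN")
            .arg(cursor)
            .arg("MATCH").arg("argus:regime:*")
            .arg("COUNT").arg(200)
            .query_async(conn)
            .await?;
        for key in &keys {
            let ttl: i64 = redis::cmd("TTL").arg(key).query_async(conn).await?;
            if ttl == -1 {
                permanent += 1;
            }
        }
        cursor = next;
        if cursor == 0 {
            break;
        }
    }
    Ok(permanent)
}
```

Run on a schedule, any non-zero count flags a writer that has slipped back to permanent keys, including in namespaces added after this fix.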