We built a shared Rust validation library that blocked 1,319 corrupt rows from entering ClickHouse and caught 4.35M corrupt records through nightly out-of-band audits.
723M+
Total rows validated
1,319
Hard quarantine events (rows blocked)
4.35M
Audit-detected corruption (rows flagged)
0.8ms
Validation overhead (per 10K rows)
CHAPTER 01
Raw market data is not clean. Every exchange feed contains some fraction of corrupt records: ticks with negative volume from sign-extension bugs in the exchange's serializer, timestamps that overflow 64-bit integers and wrap to year 2299, prices that are three standard deviations outside any historically plausible range for that symbol. These anomalies cluster at ingest events: startup of a new exchange connector, recovery after a network partition, transitions between API versions.
The damage from silent corruption is asymmetric. A corrupt row in bars_1m with a year-2299 timestamp creates an entire partition just for that stray date. QuestDB generated 2,810 such partitions, each requiring its own directory entry and metadata overhead, which saturated all 10 CPU cores on every restart. A corrupt row in bars_1d with a price of 0.0 for AAPL drags the computed average down and biases every regime model trained on that signal.
A Deribit options ingest introduced a unit conversion bug that generated 4,349,487 rows in weather_data and 1,315 rows in bars_1d with year-2299 timestamps before anyone noticed. The discovery came when a regime query returned no results for November 2024. ClickHouse's optimizer chose the 2299 partition as the most recent and skipped the real 2024 data.
CHAPTER 02
The validation engine is a Rust library crate (argus-validate) shared across all ingest binaries as a cargo dependency. Every ingest binary that writes to ClickHouse links against argus-validate and calls the validation layer before any row reaches the insert buffer. The library approach means validation rules are centrally maintained and new ingest binaries inherit the full ruleset automatically.
Each ClickHouse table has a corresponding ValidationSpec struct that specifies timestamp bounds, price sanity rules, volume rules, a deduplication key, and a quarantine policy. Two quarantine policies cover different severity levels. Hard-fail is for rows where corruption is unambiguous: year-2299 timestamp, negative volume. These are written to a quarantine_hard Redis Stream with the offending row serialized as JSON and a structured error code. Quarantine-and-continue is for rows that are anomalous but plausibly legitimate: a price that is 4 standard deviations outside the trailing mean may be a genuine flash crash. Neither policy drops rows silently; the quarantine stream is the audit log.
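A minimal sketch of the shape this takes. The field names, error codes, and check signature below are illustrative assumptions, not necessarily the actual argus-validate definitions:

```rust
use std::ops::RangeInclusive;

/// The two quarantine policies described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuarantinePolicy {
    /// Unambiguous corruption: block the row and write it, serialized as
    /// JSON with a structured error code, to the quarantine_hard stream.
    HardFail,
    /// Anomalous but plausibly legitimate: flag the row, keep ingesting.
    QuarantineAndContinue,
}

/// One spec per ClickHouse table.
pub struct ValidationSpec {
    pub table: &'static str,
    /// Per-table bounds in epoch microseconds.
    pub timestamp_bounds: RangeInclusive<i64>,
    /// Columns that form the deduplication key.
    pub dedup_key: &'static [&'static str],
    pub allow_zero_volume: bool,
}

/// A failed check: which rule fired and how to quarantine the row.
pub struct Violation {
    pub code: &'static str,
    pub policy: QuarantinePolicy,
}

impl ValidationSpec {
    /// Hard-fail checks only; the statistical price check needs the cached
    /// symbol stats and is sketched later.
    pub fn check(&self, ts_micros: i64, price: f64, volume: f64) -> Option<Violation> {
        if !self.timestamp_bounds.contains(&ts_micros) {
            return Some(Violation {
                code: "TS_OUT_OF_BOUNDS",
                policy: QuarantinePolicy::HardFail,
            });
        }
        if volume < 0.0 || (volume == 0.0 && !self.allow_zero_volume) {
            return Some(Violation {
                code: "NEGATIVE_OR_ZERO_VOLUME",
                policy: QuarantinePolicy::HardFail,
            });
        }
        if !price.is_finite() || price <= 0.0 {
            return Some(Violation {
                code: "NON_POSITIVE_PRICE",
                policy: QuarantinePolicy::HardFail,
            });
        }
        None
    }
}
```

Because every ingest binary links the same crate, adding a rule or tightening a bound is a single change that all writers pick up on their next deploy.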
A nightly audit job runs as a separate binary on a systemd timer. It queries ClickHouse directly, re-checking rows already at rest so that corruption entering through any path that bypasses the ingest validation is still caught, and writes every failure to an audit_findings table in ClickHouse.
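For a rough sense of the audit's shape, here is a hedged sketch assuming the community clickhouse and tokio crates; the database, column names, query, and finding schema are placeholders, not the production audit binary:

```rust
use clickhouse::{Client, Row};
use serde::Serialize;

/// One audit finding; this schema is a placeholder for illustration.
#[derive(Row, Serialize)]
struct Finding {
    table: String,
    rule: String,
    bad_rows: u64,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::default()
        .with_url("http://localhost:8123")
        .with_database("market"); // database name is a placeholder

    // Re-check rows already at rest, independent of the ingest path.
    // `ts` is a placeholder column name.
    let bad_rows: u64 = client
        .query("SELECT count() FROM bars_1d WHERE toYear(ts) > 2100")
        .fetch_one()
        .await?;

    if bad_rows > 0 {
        let mut insert = client.insert("audit_findings")?;
        insert
            .write(&Finding {
                table: "bars_1d".into(),
                rule: "TS_OUT_OF_BOUNDS".into(),
                bad_rows,
            })
            .await?;
        insert.end().await?;
    }
    Ok(())
}
```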
ARCHITECTURE OVERVIEW
SOURCES → TRANSFORM (validate + dedup) → STORE (partitioned) → QUERY (Query API + cache)
CHAPTER 03
The price deviation check required a pre-computation step. Computing a 30-day trailing mean and standard deviation at ingest time for every tick would require a ClickHouse query per row, impractical at 80,000 ticks per second. Instead, the validation library reads a precomputed symbol statistics table refreshed every 4 hours from a materialized view over bars_1d.
The Z-score threshold of 10.0 was chosen after analyzing the distribution of legitimate price moves in bars_1d going back 5 years. Flash crashes like Knight Capital's August 2012 event produced moves in the 6-sigma to 8-sigma range. A 10-sigma threshold catches only implausible values, not extreme but plausible market events. We log 3-sigma to 10-sigma moves to the soft quarantine for investigation but do not reject them.
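A sketch of how the deviation check can consult the cached statistics. The struct and function names are assumptions; only the 3-sigma and 10-sigma cutoffs come from the text above:

```rust
use std::collections::HashMap;

/// Trailing 30-day stats per symbol, loaded from the precomputed
/// statistics table and refreshed every 4 hours.
#[derive(Clone, Copy)]
pub struct SymbolStats {
    pub mean: f64,
    pub std_dev: f64,
}

pub enum PriceVerdict {
    Ok,
    /// 3-sigma to 10-sigma: log to the soft quarantine, keep the row.
    SoftQuarantine { zscore: f64 },
    /// Beyond 10 sigma: implausible, reject.
    HardFail { zscore: f64 },
}

pub fn check_price(stats: &HashMap<String, SymbolStats>, symbol: &str, price: f64) -> PriceVerdict {
    let Some(s) = stats.get(symbol) else {
        // Unknown symbol: no baseline to compare against, let it pass.
        return PriceVerdict::Ok;
    };
    if s.std_dev <= f64::EPSILON {
        return PriceVerdict::Ok;
    }
    let z = (price - s.mean).abs() / s.std_dev;
    if z >= 10.0 {
        PriceVerdict::HardFail { zscore: z }
    } else if z >= 3.0 {
        PriceVerdict::SoftQuarantine { zscore: z }
    } else {
        PriceVerdict::Ok
    }
}
```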
The timestamp validation bounds were constants until we needed to handle crypto symbol listings that pre-date our expected bounds. We had set the minimum valid timestamp to 2010-01-01T00:00:00Z. When we added on-chain BTC data going back to 2009, the validator began quarantining legitimate pre-2010 rows. The fix was per-table timestamp bounds rather than a global minimum.
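As a hypothetical illustration of per-table bounds (the table names and exact values are placeholders):

```rust
/// Hypothetical per-table minimum valid timestamps (epoch seconds),
/// replacing the old global 2010-01-01 floor.
fn min_valid_ts(table: &str) -> i64 {
    match table {
        // On-chain BTC history starts at the January 2009 genesis block.
        "btc_onchain" => 1_230_940_800, // 2009-01-03T00:00:00Z
        // Exchange feeds: nothing before 2010 is expected.
        _ => 1_262_304_000, // 2010-01-01T00:00:00Z
    }
}
```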
TECH STACK
Rust 1.84 (argus-validate)
ClickHouse 26.3 (ReplacingMergeTree)
Redis 7.2 (quarantine stream)
systemd (nightly audit timer)
CHAPTER 04
Total rows validated at the ingest boundary: 723M+
Hard quarantine events (rows blocked from ClickHouse): 1,319
Soft quarantine events (anomalous but plausible rows): 4,217
Audit-detected corruption in bars_1d: 1,315 rows (year-2299 timestamps, deleted)
Audit-detected corruption in weather_data: 4,349,487 rows (year-2299 timestamps, deleted)
Ingest-time validation overhead: 0.8ms per 10K rows
Nightly audit run time: 4.2 minutes p50 across all 43 ClickHouse tables
Soft-quarantine false positive rate: under 0.1% on a manual review of 100 sampled rows
CHAPTER 05
DECISION · 01
Chose a shared library over per-ingestor validation. The tradeoff: the library has its own release cycle, and any change to the validation rules requires recompiling and redeploying every ingest binary. What it gave us: a single definition of "valid" across all data sources. When we discovered the on-chain BTC timestamp issue, fixing the library fixed all binaries.
DECISION · 02
Chose quarantine over silent drop. This is the most important architectural decision in the validation engine. Silent drops are operationally invisible. A quarantine stream provides an audit log, a replay path if the validation rules are updated, and a signal that something in the upstream feed is producing unexpected data.
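As an illustration only, a hard-quarantine write could look roughly like this, assuming the redis crate with stream commands; the field layout is a guess, not the production schema:

```rust
use redis::Commands;

/// Append a blocked row to the quarantine_hard stream.
/// Returns the stream entry ID assigned by Redis.
fn write_hard_quarantine(
    con: &mut redis::Connection,
    table: &str,
    error_code: &str,
    row_json: &str,
) -> redis::RedisResult<String> {
    // XADD quarantine_hard * table <table> code <code> row <json>
    con.xadd(
        "quarantine_hard",
        "*",
        &[("table", table), ("code", error_code), ("row", row_json)],
    )
}
```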
DECISION · 03
The epoch-1970 rows in 7 tables need timestamp backfill from source systems, not just classification as anomalous. The audit continues to flag them on every nightly run, generating noise in the audit_findings table. Until those tables have their timestamps corrected at the source, we plan to add a suppression rule for known-ambiguous tables.