Observing Live Validators: Building a Validator Metrics Layer from Zero

June 12, 2025

When you run a validator on a public chain, the worst feeling in the world is a Discord message at 2 a.m. that says "is your node behind?" There is no good answer to give without metrics. "Let me check" takes minutes when it should take seconds; "I think it's fine" is a lie even when it's true. The job of a validator metrics layer is to make that question answerable in one place, in real time, with enough fidelity that you find the problem before anyone has to ping you.

I joined a team running validator infrastructure for a Cairo-based L2 whose production binary had almost no instrumentation. There were a handful of info! logs and a JSON-RPC health endpoint that returned {"status":"ok"} if the process was still running, which is an operationally useless statement. The work I'm describing here is the metrics layer we built into the L2 sequencer framework — what we measured, how we shaped the time series, and how we instrumented a high-throughput Rust binary without making it slower in a way the chain would notice.

This post walks through the design from operator question down to hot-path Rust.

What "metrics" actually means for a validator

The temptation when starting from zero is to expose the metrics the underlying libraries already happen to track — request counts, queue depths, GC stats — and call it done. That gives you a graph wall, not visibility. You end up with a hundred panels and no answer to "is my node behind?"

Useful validator metrics fall into five families. Each maps to a question an operator actually asks:

| Family | The question | Examples |
|---|---|---|
| Sync / lag | Am I caught up? | current_block, head_lag_blocks, finalized_lag_blocks, sync mode |
| Peers | Am I connected to enough of the network? | peer_count, peer_churn_per_min, peers_by_region |
| Block pipeline | Are blocks moving through me cleanly? | block_validate_seconds, block_apply_seconds, block_propagate_seconds |
| Mempool | Is my view of pending txs healthy? | mempool_size, mempool_age_seconds, mempool_evictions |
| Reorg | Did the chain just rewrite history? | reorg_depth_blocks, reorg_count, last_reorg_seconds_ago |

Sync lag is the headline. Almost every operational question reduces to "why is head_lag_blocks not zero?" Once you have that one, the others tell you whether the answer is "no peers", "slow validator", "stuck mempool", or "we just got reorg'd".
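As a minimal sketch of how the headline gauge could be derived, assume the node tracks its own head height and the heights its peers advertise. The function names and the choice of "highest peer-advertised height" as the network head are illustrative assumptions, not the framework's actual API.

```rust
/// Best-known network head: the highest height any peer has advertised.
/// Returns `None` when there are no peers, which is itself a signal
/// (lag is unknown, not zero).
pub fn network_head(peer_heads: &[u64]) -> Option<u64> {
    peer_heads.iter().copied().max()
}

/// Blocks between the network head and our own head.
/// `saturating_sub` pins the result at 0 if our head is briefly ahead
/// of a stale view of the network, instead of underflowing.
pub fn head_lag_blocks(network_head: u64, local_head: u64) -> u64 {
    network_head.saturating_sub(local_head)
}

/// Same calculation against finalized heights.
pub fn finalized_lag_blocks(network_finalized: u64, local_finalized: u64) -> u64 {
    network_finalized.saturating_sub(local_finalized)
}
```

Treating "no peers" as unknown rather than zero keeps the lag gauge from reading "caught up" on a node that is simply isolated.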

    flowchart TB
      subgraph NODE["Validator process"]
        direction TB
        SYNC["Sync layer"]
        BLK["Block pipeline"]
        MEM["Mempool"]
        REORG["Reorg detector"]
      end
      subgraph METRICS["metrics façade"]
        direction LR
        COUNT["Counters"]
        HIST["Histograms"]
        GAUGE["Gauges"]
      end
      NODE --> METRICS
      METRICS -- "/metrics" --> SCRAPE["Prometheus<br/>(15s scrape)"]
      SCRAPE --> ALERT["Alertmanager<br/>+ Grafana"]

The split between what the process knows and what we expose is the design — keep the surface narrow, label things sparingly, and don't let convenience leak label cardinality into Prometheus.

Prometheus shape decisions that pay off later

Prometheus rewards a few choices and punishes the rest of them. The choices that paid off:

Counters, histograms, gauges — and never the other way

The cardinal sin we found in older code was using gauges for things that were really cumulative counts (set instead of increment), and counters for things that were really histograms (block_seconds_total rather than a histogram of per-block latency). Once we had that wrong, dashboards lied: Prometheus treats any decrease in a counter as a process restart, so a rate() over a counter that was secretly being set produced spurious spikes whenever the value went down.
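The distortion can be reproduced with a simplified model of counter-reset handling. This is a sketch of the documented behavior (a drop between two samples is read as a restart from zero), not Prometheus's implementation, which also extrapolates to the window boundaries; the function name is hypothetical.

```rust
/// Total increase across a series of counter samples, treating any drop
/// between consecutive samples as a reset to zero followed by a climb
/// back up to the new value.
pub fn increase_with_resets(samples: &[f64]) -> f64 {
    let mut total = 0.0;
    for pair in samples.windows(2) {
        let (prev, cur) = (pair[0], pair[1]);
        if cur >= prev {
            total += cur - prev;
        } else {
            // Interpreted as "process restarted, counter climbed to `cur`".
            total += cur;
        }
    }
    total
}
```

For a well-behaved counter sampled as 100, 105, 110 the increase is 10. For a "counter" that was set from 105 down to 40 and then rose to 45, the model reports 5 + 40 + 5 = 50, even though the underlying value fell by 55 overall.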

The simple rule we ended up enforcing in code review:

  • A counter only ever goes up. Its only API is inc and inc_by.
  • A gauge is a snapshot of "how many right now" — peer count, mempool size, head lag.
  • A histogram is for distributions of values you care about quantiles of — block validation time, RPC latency, peer ping.

We banned Gauge::set for anything monotonic. We banned histograms on high-cardinality dimensions (per_peer_latency) because every extra label value multiplies the number of bucket series in storage. Each metric got exactly one type, and we lost no expressiveness for it.
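One way to turn the review rule into something the compiler enforces is to wrap each type so the wrong operation does not exist. The wrappers below are a std-only illustration with hypothetical names; they are not the metrics façade's own types.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Monotonic counter: there is deliberately no `set` and no decrement.
#[derive(Default)]
pub struct MonotonicCounter(AtomicU64);

impl MonotonicCounter {
    pub fn inc(&self) {
        self.inc_by(1);
    }
    pub fn inc_by(&self, n: u64) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Snapshot gauge: "how many right now", so `set` is the whole API.
#[derive(Default)]
pub struct SnapshotGauge(AtomicU64);

impl SnapshotGauge {
    pub fn set(&self, value: u64) {
        self.0.store(value, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}
```

With wrappers like these, "set on a counter" becomes a compile error instead of a review comment.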

Label cardinality is the silent killer

The worst metric we shipped early was block_validate_seconds_bucket{validator_id="0x..."}. It looked sensible. It produced a million time series within a week, because the validator set rotated and every new validator got its own permanent series in Prometheus storage. The retention bill went from "fine" to "noticeable" overnight.

The discipline that survived: labels are for cardinalities you can put on a dashboard. A few rules of thumb:

  • Anything with more than ~50 distinct values does not become a label. It becomes a log field.
  • Anything you would never sum by (...) does not need to be a label.
  • Anything that grows unbounded over time (validator IDs, peer IDs, request IDs) is never a label.

Two labels we kept on almost everything: chain (so a single Prometheus could ingest multiple networks) and node_role (validator, full node, archive). Two labels we removed: peer_id and request_id. The drop was painful for ad-hoc debugging; the savings were enormous.
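The arithmetic behind the blow-up is easy to sketch. A Prometheus histogram emits one _bucket series per finite bound plus +Inf, _sum, and _count, and that set is repeated for every label combination. The label counts used in the example are hypothetical, and the result is an upper bound because not every combination necessarily occurs.

```rust
/// Series emitted by one histogram for a single label combination:
/// one `_bucket` per finite bound, plus `+Inf`, `_sum`, and `_count`.
pub fn series_per_label_set(finite_buckets: u64) -> u64 {
    finite_buckets + 3
}

/// Upper bound on series for a histogram, given the number of distinct
/// values each label can take. An empty slice means "no labels".
pub fn histogram_series(finite_buckets: u64, label_cardinalities: &[u64]) -> u64 {
    let combinations: u64 = label_cardinalities.iter().product();
    combinations * series_per_label_set(finite_buckets)
}
```

With ten finite buckets, two chains, and three node roles, that is 2 × 3 × 13 = 78 series. Add a label with 5,000 distinct values and the same metric can reach 390,000.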

Histogram buckets are an opinion, not a default

The prometheus crate ships with default buckets that are fine for HTTP latency and bad for almost everything else. Block validation time is bimodal — most blocks are 30–80 ms, but the occasional state-trie-cold block is 500–800 ms. Default buckets put both populations in the same bin, which is the same as not measuring the slow tail at all.

We picked buckets per metric, on purpose, with a comment:

    // Block validation: bimodal. Tight low-end for the 95% case,
    // long tail captured to flag pathological state-trie misses.
    const BLOCK_VALIDATE_BUCKETS: &[f64] = &[
        0.005, 0.010, 0.025, 0.050, 0.100,
        0.200, 0.500, 1.000, 2.000, 5.000,
    ];

The buckets are checked in. They are reviewed in code review. Changing them is a metric-versioning event.
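To see what a bucket layout does to a bimodal population, it helps to compute which bucket a given observation lands in. The helper below is a hypothetical illustration using the bounds above; it returns the smallest bucket that contains the value, whereas Prometheus stores buckets cumulatively.

```rust
const BLOCK_VALIDATE_BUCKETS: &[f64] = &[
    0.005, 0.010, 0.025, 0.050, 0.100,
    0.200, 0.500, 1.000, 2.000, 5.000,
];

/// Index of the first bucket whose upper bound is >= `value`
/// (Prometheus `le` semantics). `bounds.len()` stands for `+Inf`.
pub fn bucket_index(bounds: &[f64], value: f64) -> usize {
    bounds
        .iter()
        .position(|&upper| value <= upper)
        .unwrap_or(bounds.len())
}
```

With these bounds, a 45 ms block lands in the 0.050 bucket (index 3) and a 650 ms block in the 1.000 bucket (index 7), so the two populations stay separable. With a single coarse bound such as 1.0, both fall into index 0 and the slow tail disappears.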

Instrumenting hot paths in Rust without slowing them down

The block pipeline is on the hot path. A single instrumentation mistake — a String::from per block, a Mutex::lock inside a histogram observe, a label allocated on every call — moves block validation latency by tens of milliseconds. Tens of milliseconds is the difference between "caught up" and "slipping". So we cared.

A few patterns paid for themselves.

Use the metrics façade, register once

We standardized on the metrics façade with a Prometheus exporter behind it. The façade lets us register a metric once at startup and re-fetch a cheap handle anywhere in code, without touching a global hashmap on every call.

    use metrics::{counter, gauge, histogram};

    pub struct BlockMetrics {
        validate_seconds: metrics::Histogram,
        apply_seconds: metrics::Histogram,
        head_block: metrics::Gauge,
        head_lag_blocks: metrics::Gauge,
        reorgs_total: metrics::Counter,
    }

    impl BlockMetrics {
        pub fn install() -> Self {
            // Buckets are registered with the recorder at process start;
            // here we just take cheap handles to the already-registered
            // metrics so the hot path doesn't pay a hashmap lookup.
            Self {
                validate_seconds: histogram!("node_block_validate_seconds"),
                apply_seconds: histogram!("node_block_apply_seconds"),
                head_block: gauge!("node_head_block"),
                head_lag_blocks: gauge!("node_head_lag_blocks"),
                reorgs_total: counter!("node_reorgs_total"),
            }
        }
    }

The handles are cheap to clone, and the underlying recorder updates them with atomics rather than locks. The "register once, hold a handle" pattern saved roughly 200 ns per metric site in our benchmarks. Multiplied by hundreds of sites per block, it stops being a rounding error.
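The cost difference is easier to see in a toy model. The registry below is a std-only sketch with hypothetical names, not the metrics crate's internals: fetching a handle takes a lock and a hash lookup, while using a held handle is a single atomic add.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Cheap, cloneable handle to one counter cell.
#[derive(Clone)]
pub struct CounterHandle(Arc<AtomicU64>);

impl CounterHandle {
    /// Hot path: one relaxed atomic add, no lock, no lookup.
    #[inline]
    pub fn increment(&self, n: u64) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

#[derive(Default)]
pub struct ToyRegistry {
    counters: Mutex<HashMap<&'static str, Arc<AtomicU64>>>,
}

impl ToyRegistry {
    /// Slow path: lock plus hash lookup. Intended to run once at startup.
    pub fn counter(&self, name: &'static str) -> CounterHandle {
        let mut map = self.counters.lock().unwrap();
        let cell = map
            .entry(name)
            .or_insert_with(|| Arc::new(AtomicU64::new(0)));
        CounterHandle(Arc::clone(cell))
    }
}
```

Every handle fetched for the same name shares one cell, so holding a handle in a struct changes the cost of an update without changing what gets recorded.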

Measure with RAII, not paired calls

Manual start = Instant::now(); … record(start.elapsed()) is a footgun. Someone returns early, the metric never records, and the histogram silently undercounts. We wrap all timing in a small RAII guard:

    pub struct TimerGuard {
        start: std::time::Instant,
        histogram: metrics::Histogram,
    }

    impl TimerGuard {
        #[inline]
        pub fn new(histogram: metrics::Histogram) -> Self {
            Self { start: std::time::Instant::now(), histogram }
        }
    }

    impl Drop for TimerGuard {
        #[inline]
        fn drop(&mut self) {
            self.histogram.record(self.start.elapsed().as_secs_f64());
        }
    }

    // usage:
    let _t = TimerGuard::new(self.metrics.validate_seconds.clone());
    self.validate(block)?;
    // histogram observes on drop, even on `?` early-return

This pattern eliminated an entire class of "we never noticed the slow path because the metric never fired on the slow path" bugs. The guard is the metric — if you forget to construct it, the metric is missing; if you remember to construct it, it cannot lie about latency.
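The early-return behavior can be checked without a metrics backend. The sketch below swaps the histogram for an in-memory sink and uses a made-up validation rule (odd heights fail), so every name in it is hypothetical.

```rust
use std::cell::RefCell;
use std::time::Instant;

/// Stand-in for a histogram handle: collects observations in memory.
#[derive(Default)]
pub struct Sink(RefCell<Vec<f64>>);

impl Sink {
    pub fn count(&self) -> usize {
        self.0.borrow().len()
    }
}

/// Same shape as the timer guard, recording into the sink on drop.
pub struct SinkTimer<'a> {
    start: Instant,
    sink: &'a Sink,
}

impl<'a> SinkTimer<'a> {
    pub fn new(sink: &'a Sink) -> Self {
        Self { start: Instant::now(), sink }
    }
}

impl Drop for SinkTimer<'_> {
    fn drop(&mut self) {
        self.sink.0.borrow_mut().push(self.start.elapsed().as_secs_f64());
    }
}

/// Hypothetical validation step that rejects odd heights.
pub fn validate(sink: &Sink, height: u64) -> Result<(), String> {
    let _t = SinkTimer::new(sink); // dropped on every exit path
    if height % 2 == 1 {
        return Err(format!("block {} rejected", height));
    }
    Ok(())
}
```

One Rust detail matters here: the guard must be bound to a named placeholder such as `_t`. Writing `let _ = SinkTimer::new(sink);` drops the guard immediately and records a near-zero duration.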

Don't allocate strings on the hot path

A surprising amount of Rust observability advice tells you to format dynamic labels at call sites. "Just include the chain ID, the block height, and the producer." Every one of those is an allocation per call.

The cheapest pattern: every label that's actually a label is a &'static str. Anything else is a structured log, not a metric.

    // Good: zero allocation on the hot path.
    counter!("node_blocks_validated_total", "result" => "ok").increment(1);

    // Bad: format! allocates every call.
    counter!("node_blocks_validated_total",
        "block_height" => format!("{}", block.height)
    ).increment(1);

The "bad" example also turns block height into a label, which violates the cardinality rule above. If you need block height in a metric, it goes in the log line, not the time series.
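One way to keep label values both static and bounded is to route them through an enum, so the set of possible values is fixed at compile time. The variant names below are hypothetical; only the "ok" value appears in the example above.

```rust
/// Closed set of outcomes for block validation (illustrative variants).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ValidateResult {
    Ok,
    InvalidSignature,
    StateMismatch,
    Timeout,
}

impl ValidateResult {
    /// Every value the `result` label can ever take.
    pub const ALL: [ValidateResult; 4] = [
        ValidateResult::Ok,
        ValidateResult::InvalidSignature,
        ValidateResult::StateMismatch,
        ValidateResult::Timeout,
    ];

    /// Label value as a `&'static str`: no allocation, no formatting.
    pub const fn as_label(self) -> &'static str {
        match self {
            ValidateResult::Ok => "ok",
            ValidateResult::InvalidSignature => "invalid_signature",
            ValidateResult::StateMismatch => "state_mismatch",
            ValidateResult::Timeout => "timeout",
        }
    }
}

// At a call site this would slot into the same shape as the "good" example:
//   counter!("node_blocks_validated_total", "result" => r.as_label()).increment(1);
```

The cardinality of the label is then the length of `ALL`, which a reviewer can read off the type instead of inferring from call sites.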

Lock-free updates for polled gauges

Some gauges are fundamentally state-of-the-world — head lag, peer count, mempool size. The naive pattern is to take a Mutex on the relevant struct from a background publisher and read its size to update the gauge. Under load, that Mutex becomes a contention point precisely when you need the metric most.

We instead exposed an AtomicU64 snapshot per polled gauge. The hot path stores into the atomic. A small publisher task ticks on a fixed interval shorter than the scrape interval (once per second in the code below, against a 15-second scrape), loads each atomic, and pushes the value to the gauge:

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::time::Duration;

    use metrics::gauge;

    pub struct HeadLagTracker {
        value: AtomicU64,
    }

    impl HeadLagTracker {
        /// Hot path: a single relaxed store.
        #[inline]
        pub fn update(&self, lag: u64) {
            self.value.store(lag, Ordering::Relaxed);
        }

        /// Publisher side: a single relaxed load.
        pub fn snapshot(&self) -> u64 {
            self.value.load(Ordering::Relaxed)
        }
    }

    // Background publisher (one task for all polled gauges):
    async fn publish_polled_gauges(state: Arc<NodeState>) {
        let mut tick = tokio::time::interval(Duration::from_secs(1));
        loop {
            tick.tick().await;
            gauge!("node_head_lag_blocks").set(state.head_lag.snapshot() as f64);
            gauge!("node_peer_count").set(state.peers.snapshot() as f64);
            gauge!("node_mempool_size").set(state.mempool_size.snapshot() as f64);
        }
    }

Writes on the hot path are a single atomic store with Relaxed ordering — about as cheap as instrumentation gets. The publisher never touches the mempool's Mutex or the peer set's RwLock; it reads atomics that the producer side has already updated. That removed the largest source of latency variance we used to see during scrapes.
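For gauges that count members of a set, such as peers, the producer side does not have to compute a size at all: it can adjust the atomic on each connect and disconnect. The tracker below is a sketch under that assumption, with hypothetical names, plus a small threaded simulation to show the result is consistent under concurrent writers.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

#[derive(Default)]
pub struct PeerCountTracker {
    value: AtomicU64,
}

impl PeerCountTracker {
    pub fn on_connect(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    /// Saturates at zero so a duplicate disconnect cannot wrap the gauge.
    pub fn on_disconnect(&self) {
        let _ = self
            .value
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| {
                Some(v.saturating_sub(1))
            });
    }

    pub fn snapshot(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}

/// Each thread connects `connects_per_thread` peers, then drops one.
pub fn simulate_connections(threads: u64, connects_per_thread: u64) -> u64 {
    let tracker = Arc::new(PeerCountTracker::default());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let t = Arc::clone(&tracker);
            thread::spawn(move || {
                for _ in 0..connects_per_thread {
                    t.on_connect();
                }
                t.on_disconnect();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    tracker.snapshot()
}
```

Relaxed ordering is enough here because the gauge only needs an eventually accurate count, not ordering relative to other memory.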

What the dashboards look like, in practice

The dashboard we kept has six panels, in this order:

    flowchart TB
      subgraph TOP["Top of dashboard: 'is the node OK?'"]
        direction LR
        P1["1 · Head lag (blocks)<br/>+ alert at 5"]
        P2["2 · Finalized lag (blocks)<br/>+ alert at 32"]
      end
      subgraph MID["Middle: 'why?'"]
        direction LR
        P3["3 · Block validate p50 / p95<br/>seconds, 5m window"]
        P4["4 · Peer count<br/>by region, 1m window"]
      end
      subgraph BOT["Bottom: 'what just happened?'"]
        direction LR
        P5["5 · Mempool size<br/>+ age p95"]
        P6["6 · Reorg events<br/>(annotations on every panel)"]
      end
      TOP --> MID --> BOT

The order matters. Operators read top-to-bottom. Panel 1 says "are we behind?". Panel 2 says "are we behind enough to matter?". Panels 3 and 4 are the two most common explanations. Panels 5 and 6 cover the long tail. Reorg events are also annotated on every panel as vertical lines, so a spike in lag visibly aligns with the reorg that caused it.

Everything else lives on a deeper "node internals" dashboard that exists for incident postmortems. The top-level dashboard is six panels because operators don't read more than that.

How the layer earns its keep

The combined effect, after the metrics layer landed in production, was that on-call duty stopped being a guessing game. A few concrete changes:

  • Sync lag alerts fire before users notice. head_lag_blocks > 5 for 2 minutes paged the operator roughly thirty seconds before any external monitor that polled the public RPC would have caught it.
  • Peer churn became visible. A subtle networking regression that disconnected a region for ninety seconds at a time used to look like "my node feels slow sometimes." It now looks like a 90-second dip on the peer-count panel, with a clear correlation to the validate-latency p95.
  • Reorg events are annotated, not buried. Every operational anomaly gets cross-referenced against the reorg counter. Most don't correlate. The ones that do save a postmortem hour each.
  • Allocation budget on the hot path is preserved. Block validate p50 did not move when the metrics layer was added. The AtomicU64-plus-publisher pattern, the &'static str label discipline, and the RAII timer guard combine to keep the instrumentation invisible to throughput.
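The "greater than 5 for 2 minutes" condition in the first bullet has a specific meaning: the threshold must be breached continuously, and any sample back under the threshold restarts the clock. The evaluator below is a simplified model of that rule with hypothetical names; in production the evaluation is done by the alerting stack, not by the node.

```rust
/// Tracks whether a threshold has been breached continuously for `hold_secs`.
pub struct PendingAlert {
    threshold: u64,
    hold_secs: u64,
    breached_since: Option<u64>,
}

impl PendingAlert {
    pub fn new(threshold: u64, hold_secs: u64) -> Self {
        Self { threshold, hold_secs, breached_since: None }
    }

    /// Feed one sample (timestamp in seconds, current lag).
    /// Returns true while the alert should be firing.
    pub fn observe(&mut self, now_secs: u64, head_lag_blocks: u64) -> bool {
        if head_lag_blocks > self.threshold {
            // Remember when the breach started; keep the earliest time.
            let since = *self.breached_since.get_or_insert(now_secs);
            now_secs.saturating_sub(since) >= self.hold_secs
        } else {
            // Any healthy sample resets the pending state.
            self.breached_since = None;
            false
        }
    }
}
```

A lag that spikes above 5 and recovers within two minutes never pages, which is what keeps a single slow block from waking anyone up.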

The shape of the layer is small on purpose. Five metric families, three Prometheus types, a six-panel dashboard, and a hot-path discipline that keeps allocation off the critical sections. The next post goes one floor down — what happens when the chain rewrites itself underneath all of these metrics, and how the rollback algorithm preserves the operational invariants the metrics layer was watching for.
