Surviving a Chain Reorg: Rolling State Back Without Losing Your Mind

The first time my node logged a reorg in production, I was holding my coffee. The relevant log line was "detected reorg, depth=3, common_ancestor=0x82af…", and the node moved on. The chain was fine. The validator was fine. Three blocks of work disappeared and reappeared in a slightly different shape, mempool entries got reinjected, the JSON-RPC subscribers got their notifications, and nobody noticed.

That's the goal. A reorg is a normal event in a probabilistic chain, and the right response is to handle it with the same discipline you'd handle a database transaction rolling back. The wrong response is what most early-stage L2 implementations do — partially rewind, leave indices and caches stale, and hope no one queries them before the next block lands.

This post is about the reorg handling we shipped in the L2 sequencer framework — what reorgs actually break, what we had to roll back, the invariants we refused to violate, and how we tested it without sacrificing a live network.

What a reorg actually is, for our purposes

In a Cairo-based L2, the chain has two notions of head:

The pending head — the block the sequencer most recently produced.
The L1-finalized head — the block depth proven and settled on Ethereum.

A reorg is when the canonical pending chain rewinds to a common ancestor and resumes along a different path. This happens for ordinary reasons: a sequencer race, a network partition healing, a settlement contract preferring a different fork. We assumed it would happen, treated it as an event the node had to react to correctly, and built around it.

The two depths are different in their consequences:

Below finalized: a reorg here is impossible by definition. If our node ever sees one, the right move is to halt and shout, not roll back.
Above finalized, below pending: rollback territory. This is what this post is about.

flowchart TB
    subgraph BEFORE["Before reorg"]
        direction TB
        F0["F · finalized (immutable)"]
        F0 --> A["A"]
        A --> B["B"]
        B --> C["C ← old pending head"]
    end
    subgraph AFTER["After reorg"]
        direction TB
        F0b["F · finalized (unchanged)"]
        F0b --> Ab["A"]
        Ab --> Bp["B′"]
        Bp --> Cp["C′"]
        Cp --> Dp["D′ ← new pending head"]
    end
    BEFORE -- "depth = 2 (B, C reverted)" --> AFTER

The common ancestor is A. Everything from B onwards is reverted; B′ onwards is applied. The job of the node is to make the world look indistinguishable from the case where A, B′, C′, D′ was always the canonical chain — to every consumer of node state, internal and external.
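To make the picture concrete, here is a minimal sketch of computing the revert path and the apply path from two branches. The `Header` and `ReorgPlan` types and the `plan_reorg` function are hypothetical names for illustration, not the framework's API, and the sketch assumes both branches are handed over oldest-to-newest starting from the same block. The revert path is listed oldest-first, the same order the rollback flow in this post uses.

```rust
/// Hypothetical block header: just enough to compare two branches.
#[derive(Clone, Debug, PartialEq)]
pub struct Header {
    pub height: u64,
    pub hash: u64,   // stand-in for a real BlockHash
    pub parent: u64, // hash of the parent block
}

#[derive(Debug, PartialEq)]
pub struct ReorgPlan {
    /// Hash of the last block both branches share.
    pub common_ancestor: u64,
    /// Old-branch block hashes above the ancestor, oldest first.
    pub revert: Vec<u64>,
    /// New-branch block hashes above the ancestor, oldest first.
    pub apply: Vec<u64>,
}

/// Both slices are ordered oldest to newest and are assumed to start at the
/// same block (for example the finalized block). Returns `None` when the
/// branches share no block inside the window that was passed in.
pub fn plan_reorg(old_branch: &[Header], new_branch: &[Header]) -> Option<ReorgPlan> {
    // Length of the shared prefix: blocks that are identical on both branches.
    let shared = old_branch
        .iter()
        .zip(new_branch.iter())
        .take_while(|(old, new)| old.hash == new.hash)
        .count();
    if shared == 0 {
        return None;
    }
    Some(ReorgPlan {
        common_ancestor: old_branch[shared - 1].hash,
        revert: old_branch[shared..].iter().map(|h| h.hash).collect(),
        apply: new_branch[shared..].iter().map(|h| h.hash).collect(),
    })
}
```

For the diagram above, the old branch F, A, B, C and the new branch F, A, B′, C′, D′ give a plan with A as the common ancestor, two blocks to revert, and three to apply. The reorg depth is simply `plan.revert.len()`.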

What reorgs actually break

The naive picture is: "just rewind state to the common ancestor and replay." That's about a third of the work. What actually breaks falls into four categories.

The state trie

A Merkle Patricia trie is value-immutable per block, which is a beautiful property until you realize that "rewinding" a trie isn't the same as deleting a few keys. Every state diff between A and C was committed as a series of trie updates. Reverting them means knowing the exact pre-image of every key the reverted blocks touched.

We solved this with versioned trie storage: every state change recorded the block height that produced it, and the storage layer kept enough history to reconstruct the trie at any height above the finalized cutoff. Rewinding is then a single operation: discard everything written above the common-ancestor height. Replaying B′, C′, D′ writes new versioned entries on top.

The cost is bookkeeping. Versioned storage is more expensive than overwriting, and pruning has to be wired into the finalization path so it doesn't grow without bound. We wrote it, measured it, and concluded the cost was less than rebuilding the trie from genesis on every reorg, which was the alternative.
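A minimal sketch of the idea, assuming a flat key-value layer underneath the trie. The `VersionedStore` type and its method names are hypothetical, and a real trie store would version nodes rather than string keys, but the rewind and prune operations have the same shape.

```rust
use std::collections::BTreeMap;

/// Hypothetical versioned key-value layer. Every write is tagged with the
/// height of the block that produced it.
#[derive(Default)]
pub struct VersionedStore {
    // key -> (height -> value); `None` records a deletion at that height.
    history: BTreeMap<String, BTreeMap<u64, Option<u64>>>,
}

impl VersionedStore {
    pub fn write(&mut self, height: u64, key: &str, value: Option<u64>) {
        self.history
            .entry(key.to_string())
            .or_default()
            .insert(height, value);
    }

    /// Value visible at `height`: the newest version written at or below it.
    pub fn get_at(&self, height: u64, key: &str) -> Option<u64> {
        self.history
            .get(key)?
            .range(..=height)
            .next_back()
            .and_then(|(_, value)| *value)
    }

    /// Rewind: discard every version written above the common ancestor.
    pub fn rewind_to(&mut self, ancestor_height: u64) {
        for versions in self.history.values_mut() {
            let _discarded = versions.split_off(&(ancestor_height + 1));
        }
        self.history.retain(|_, versions| !versions.is_empty());
    }

    /// Prune on finalization: per key, keep the newest version at or below
    /// the finalized height and everything above it.
    pub fn prune_below(&mut self, finalized: u64) {
        for versions in self.history.values_mut() {
            let keep = versions.range(..=finalized).next_back().map(|(h, _)| *h);
            if let Some(keep) = keep {
                *versions = versions.split_off(&keep);
            }
        }
    }
}
```

Replaying the new branch is just more `write` calls at the same heights. Calling `prune_below` from the finalization path is what keeps the history bounded.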

The mempool

Mempool reinjection is one of those problems that's easy to describe and full of subtle invariants.

When B, C are reverted, the transactions they contained are not necessarily invalid — they may simply belong in B′, C′ instead. So we reinject them into the mempool. But:

A transaction that was reverted might have been invalidated by reordering in the new chain (its sender's nonce may now be different).
A transaction that's already in the new chain (B′, C′, D′) must not be reinjected — it's no longer pending.
A transaction reinjected into the mempool must respect the current state, not the state at the time it was originally accepted. A reinjected tx whose preconditions no longer hold is a tx that should be evicted, not retried.

The reinjection algorithm in pseudocode:

1. Collect reverted txs:        T_reverted = txs(B) ∪ txs(C)
2. Collect newly-applied txs:   T_applied  = txs(B') ∪ txs(C') ∪ txs(D')
3. Candidates for reinjection:  T_candidates = T_reverted \ T_applied
4. For each tx in T_candidates:
       re-validate against the new head's state
       if valid → push to mempool
       if invalid → drop and emit "tx_dropped_after_reorg"

The dropped event is important. It's how a wallet or a dApp learns that the tx it was waiting on was orphaned by the reorg. If we dropped it silently, the user's UI would spin forever waiting for a tx that was never going to land.
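The reinjection steps above translate almost line for line into Rust. This is a sketch under assumptions: the `Tx` and `Reinjection` types and the `reinject` function are hypothetical, and validation against the new head's state is passed in as a closure, because the real check depends on the execution layer.

```rust
use std::collections::HashSet;

/// Hypothetical minimal transaction.
#[derive(Clone, Debug, PartialEq)]
pub struct Tx {
    pub hash: u64, // stand-in for a real tx hash
    pub sender: u32,
    pub nonce: u64,
}

#[derive(Debug, Default, PartialEq)]
pub struct Reinjection {
    /// Pushed back into the mempool.
    pub requeued: Vec<Tx>,
    /// One "tx_dropped_after_reorg" event would be emitted per entry.
    pub dropped: Vec<Tx>,
}

/// `reverted` holds the txs of the reverted blocks, `applied` the txs of the
/// newly applied blocks. `is_valid_at_new_head` re-validates a candidate
/// against the state at the new head.
pub fn reinject<F>(reverted: &[Tx], applied: &[Tx], is_valid_at_new_head: F) -> Reinjection
where
    F: Fn(&Tx) -> bool,
{
    let applied_hashes: HashSet<u64> = applied.iter().map(|tx| tx.hash).collect();
    let mut out = Reinjection::default();
    for tx in reverted {
        // Step 3: anything already included in the new chain is not pending.
        if applied_hashes.contains(&tx.hash) {
            continue;
        }
        // Step 4: re-validate against the new head, then requeue or drop.
        if is_valid_at_new_head(tx) {
            out.requeued.push(tx.clone());
        } else {
            out.dropped.push(tx.clone());
        }
    }
    out
}
```

Returning the dropped list instead of emitting events inline keeps the function pure, so the caller decides when the notifications go out.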

Finalized indices and derived data

Almost every component above the raw chain builds an index — a tx-by-hash map, a block-by-number cache, an event log feed for subscribers, an account-history table for explorers. Each of those is a derived view.

The reorg has to walk every derived index and tell it: "these block heights are gone; you must invalidate, recompute, or notify."

We modeled it as a publish/subscribe contract. The chain emitted a ChainEvent enum:

pub enum ChainEvent {
    BlockApplied   { height: u64, hash: BlockHash, parent: BlockHash },
    BlockReverted  { height: u64, hash: BlockHash },
    Finalized      { height: u64, hash: BlockHash },
}

…and every derived index subscribed and reacted accordingly:

Subscriber	On `BlockApplied`	On `BlockReverted`
Tx-by-hash index	Insert tx hashes	Delete tx hashes (they may reappear)
Block-by-number cache	Insert at height	Delete at height
Event log subscriptions	Push events	Emit `chain_reorg` notification
Explorer history	Append rows	Mark rows as `reorged_at = now()`
RPC `getTransactionReceipt`	Cache	Invalidate; mark "pending" again

The single most useful decision was making BlockReverted a first-class event, not an inferred consequence of "you saw the new chain, figure out the rest". Subscribers who didn't handle reverts at all were obvious in code review — they didn't have a match arm for the variant. Rust's exhaustive enum matching made the contract self-enforcing.
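As an illustration, here is roughly what a block-by-number cache looks like as a subscriber. The `BlockByNumber` type is hypothetical and `BlockHash` is stubbed as a 32-byte array. What matters is that the `match` has no wildcard arm, so the compiler rejects the subscriber if a variant goes unhandled.

```rust
use std::collections::BTreeMap;

/// Stand-in for the real hash type.
pub type BlockHash = [u8; 32];

pub enum ChainEvent {
    BlockApplied { height: u64, hash: BlockHash, parent: BlockHash },
    BlockReverted { height: u64, hash: BlockHash },
    Finalized { height: u64, hash: BlockHash },
}

/// Hypothetical block-by-number cache subscribed to chain events.
#[derive(Default)]
pub struct BlockByNumber {
    by_height: BTreeMap<u64, BlockHash>,
    finalized: Option<u64>,
}

impl BlockByNumber {
    pub fn on_event(&mut self, event: &ChainEvent) {
        // Deliberately no `_ =>` arm: leaving out BlockReverted, or adding a
        // new variant to ChainEvent later, is a compile error here.
        match event {
            ChainEvent::BlockApplied { height, hash, .. } => {
                self.by_height.insert(*height, *hash);
            }
            ChainEvent::BlockReverted { height, hash } => {
                // Only evict if the cached entry is the block being reverted.
                if self.by_height.get(height) == Some(hash) {
                    self.by_height.remove(height);
                }
            }
            ChainEvent::Finalized { height, .. } => {
                self.finalized = Some(*height);
            }
        }
    }

    pub fn get(&self, height: u64) -> Option<&BlockHash> {
        self.by_height.get(&height)
    }

    pub fn finalized_height(&self) -> Option<u64> {
        self.finalized
    }
}
```

A subscriber that hides reverts behind a wildcard arm still compiles, which is why the code-review check mattered alongside the compiler.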

Optimistic state in the JSON-RPC layer

The fourth category is subtle. JSON-RPC clients hold open subscriptions for new heads, new transactions, new events. When a reorg happens, those subscribers have already been told about blocks that no longer exist.

We modeled this on Ethereum's JSON-RPC reorg semantics, where logs from reverted blocks are delivered again with removed: true. In our node, every reverted block emits a removed: true event on the same subscription that originally announced it. Clients that handle this correctly update their UI; clients that don't are at least no worse off than they would be on Ethereum, which is the lowest bar of consistency we were willing to ship.

The invariants we refused to violate

A few things were not negotiable. They drove the entire design.

A nonce that was used cannot be reused

Once a transaction with (sender, nonce) has been validated and applied, no future block on the canonical chain may apply a different transaction with the same (sender, nonce). This is what prevents a reorg from being weaponized into a double-spend.

The invariant is enforced at re-validation: a reinjected tx is checked against the current state's nonce and against the in-flight mempool. If either says "this nonce is taken", the tx is dropped, not reordered.
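A sketch of that check, with hypothetical names (`NonceView`, `Recheck`, `recheck_nonce`). It assumes the state exposes the next expected nonce per sender and the mempool exposes the (sender, nonce) pairs it already holds. How a nonce gap (a nonce above the expected one) is treated is not covered above, so this sketch simply requeues it and leaves that decision to the mempool.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical view of what re-validation needs to see.
pub struct NonceView {
    /// Next expected nonce per sender at the new head.
    pub next_nonce: HashMap<u32, u64>,
    /// (sender, nonce) pairs already pending in the mempool.
    pub in_flight: HashSet<(u32, u64)>,
}

#[derive(Debug, PartialEq)]
pub enum Recheck {
    Requeue,
    /// Dropped, with the reason to attach to the drop event.
    Drop(&'static str),
}

pub fn recheck_nonce(view: &NonceView, sender: u32, nonce: u64) -> Recheck {
    // Senders the state has never seen start at nonce 0.
    let expected = view.next_nonce.get(&sender).copied().unwrap_or(0);
    if nonce < expected {
        return Recheck::Drop("nonce already consumed on the new chain");
    }
    if view.in_flight.contains(&(sender, nonce)) {
        return Recheck::Drop("nonce held by a pending transaction");
    }
    // Assumption: nonces above `expected` are requeued, not rejected.
    Recheck::Requeue
}
```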

The finalized prefix is immutable, full stop

If the rollback algorithm is ever asked to revert a block at or below the finalized height, it does not. It panics with a structured fault that pages the operator and halts the validator. There is no "best-effort" rollback below finality. The whole reason finality exists is to give every consumer of the chain (exchanges, bridges, dApps) a height beyond which they don't have to think about reorgs. Violating that quietly would be worse than halting.
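The guard itself is tiny. In this sketch the `FinalityViolation` type and `check_rollback_target` function are hypothetical, and the check returns a `Result` rather than panicking so it can be tested in isolation. The caller at the top of the rollback path would turn the error into the structured fault and halt.

```rust
use std::fmt;

/// Hypothetical structured fault raised when a rollback would cross finality.
#[derive(Debug, PartialEq)]
pub struct FinalityViolation {
    /// Lowest block height the rollback was asked to revert.
    pub lowest_reverted: u64,
    /// Current L1-finalized height.
    pub finalized: u64,
}

impl fmt::Display for FinalityViolation {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "refusing rollback: height {} is at or below finalized height {}",
            self.lowest_reverted, self.finalized
        )
    }
}

/// Call this before touching any state. `lowest_reverted` is the height of
/// the first block above the common ancestor on the old branch.
pub fn check_rollback_target(
    lowest_reverted: u64,
    finalized: u64,
) -> Result<(), FinalityViolation> {
    if lowest_reverted <= finalized {
        Err(FinalityViolation { lowest_reverted, finalized })
    } else {
        Ok(())
    }
}
```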

Externally observable state never lies

The hardest invariant to preserve. It says: at any moment a JSON-RPC client queries us, every answer we give must be consistent with the current canonical chain.

That sounds obvious. It implies, concretely:

The rollback is atomic from the consumer's perspective. We do not serve responses mid-rollback. The RPC server briefly returns a chain_reorganizing error during the swap, then resumes.
Subscriptions are drained and replayed in the right order: revert events for the old branch first, then apply events for the new branch, then the new head.
Caches are invalidated before new state is exposed. We do not serve a tx receipt for a tx that is no longer in the canonical chain, even for a microsecond.

The atomicity is enforced by a single RwLock<ChainState> in the RPC layer: the rollback acquires it for write, and RPC handlers acquire it for read. The lock is held for tens of milliseconds during a typical depth-2 reorg, which is short enough to be invisible to most clients, and short enough that backpressure is the right response for the clients that do notice.
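One way to get both the lock and the chain_reorganizing error is for handlers to use a non-blocking read. This is a sketch using `std::sync::RwLock`; the `Rpc` wrapper, the `RpcError` enum, and the choice of `try_read` over a blocking `read` are assumptions for illustration, not a description of the shipped handlers.

```rust
use std::sync::{RwLock, TryLockError};

#[derive(Debug, Default)]
pub struct ChainState {
    pub head_height: u64,
    pub head_hash: u64, // stand-in for a real BlockHash
}

#[derive(Debug, PartialEq)]
pub enum RpcError {
    /// Returned while a rollback holds the write lock.
    ChainReorganizing,
    /// The lock was poisoned by a panic in another thread.
    Internal,
}

pub struct Rpc {
    state: RwLock<ChainState>,
}

impl Rpc {
    pub fn new(state: ChainState) -> Self {
        Self { state: RwLock::new(state) }
    }

    /// Read path: never waits on the rollback, reports it instead.
    pub fn head_height(&self) -> Result<u64, RpcError> {
        match self.state.try_read() {
            Ok(guard) => Ok(guard.head_height),
            Err(TryLockError::WouldBlock) => Err(RpcError::ChainReorganizing),
            Err(TryLockError::Poisoned(_)) => Err(RpcError::Internal),
        }
    }

    /// Write path: the whole revert-and-apply swap runs under one guard, so
    /// readers observe the old head or the new head, never a mixture.
    pub fn swap_head<F: FnOnce(&mut ChainState)>(&self, rollback: F) {
        let mut guard = self.state.write().expect("chain state lock poisoned");
        rollback(&mut guard);
    }
}
```

A blocking `read()` would give pure backpressure instead of an error. Which one a handler uses is a per-endpoint decision.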

flowchart TB
    DETECT["Reorg detected (new branch from L1 settlement or sequencer signal)"]
    DETECT --> LOCK["RPC lock acquired (write)"]
    LOCK --> COMPUTE["Compute revert path + apply path"]
    COMPUTE --> REVERT["Revert blocks oldest-first (BlockReverted events emitted)"]
    REVERT --> APPLY["Apply new blocks (BlockApplied events emitted)"]
    APPLY --> MEM["Mempool reinjection + re-validation"]
    MEM --> RELEASE["RPC lock released"]
    RELEASE --> NOTIFY["Subscriber notifications flushed in order"]

Testing reorgs without a live network

The testing problem for chain rollback is harder than the implementation. A bug in rollback is silent for hours and then catastrophic when somebody finally queries. You cannot afford to discover it on mainnet.

Three layers of testing earned their keep.

Deterministic unit tests over a synthetic chain

The state machine — apply a block, revert a block, replay — is a pure function of (state, event). We extracted it from anything async, anything I/O-bound, anything network-shaped, and tested it as a plain Rust state machine.

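#[test]
fn reorg_depth_two_replays_mempool() {
    // tx0..tx5 are fixture transactions from distinct senders, except that
    // tx5 reuses tx4's (sender, nonce), so applying tx5 invalidates tx4.
    let mut node = TestNode::new();
    let a = node.apply_block(block_with_txs(&[tx0]));            // common ancestor, kept
    let _b = node.apply_block(block_with_txs(&[tx1, tx2, tx3])); // reverted
    let _c = node.apply_block(block_with_txs(&[tx4]));           // reverted

    node.reorg_to(&[
        block_after(a, &[tx2, tx3]),  // tx1 left out of the new branch
        block_after_prev(&[tx5]),     // builds on the block above; conflicts with tx4
    ]);

    assert_eq!(node.head_height(), 3);
    assert!(node.mempool().contains(&tx1));     // still valid, so reinjected
    assert!(!node.mempool().contains(&tx4));    // orphaned and now invalid, so not reinjected
    assert!(node.dropped_after_reorg().contains(&tx4));
}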

Tests like this caught dozens of subtle ordering bugs early. They run in milliseconds. They form a lower-bound regression suite that runs on every commit.

Property-based testing with reorg shape generators

Hand-written tests cover reorg depths 1, 2, and 3. They do not cover "depth 7 where two of the seven blocks share a tx that was reverted twice". We used proptest to generate random valid chain shapes and assert invariants:

Every tx that was once in the canonical chain and is no longer must either be back in the mempool or have a tx_dropped_after_reorg event.
Every state value at the new head matches a fresh replay from genesis along the new path.
No subscriber receives two BlockApplied events for the same hash without a BlockReverted for that hash in between.
No BlockReverted event fires for a block that was never applied.

Each invariant is one assertion in a test that runs against thousands of generated chain shapes per CI run. The first few weeks, this generator found a new bug every other day. It tapered off as we fixed them. The test now serves as a tripwire.
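To show the shape of such a test without pulling in a dependency, here is a self-contained sketch of the replay invariant. Everything in it is a stand-in: a small linear congruential generator takes the place of proptest's strategies, a flat map takes the place of the trie, and a per-block undo log (reverted newest-first) takes the place of the versioned storage. Only the invariant is the same: after any sequence of reorgs, the state must equal a fresh replay of the canonical chain.

```rust
use std::collections::BTreeMap;

type State = BTreeMap<u8, u64>;
type Block = Vec<(u8, u64)>; // writes: key <- value

/// Tiny deterministic generator standing in for proptest strategies.
struct Lcg(u64);

impl Lcg {
    fn below(&mut self, bound: u64) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 33) % bound
    }

    fn block(&mut self) -> Block {
        let writes = 1 + self.below(3);
        (0..writes).map(|_| (self.below(4) as u8, self.below(100))).collect()
    }
}

#[derive(Default)]
struct Node {
    state: State,
    chain: Vec<Block>,
    undo: Vec<Vec<(u8, Option<u64>)>>, // pre-images, one list per applied block
}

impl Node {
    fn apply(&mut self, block: Block) {
        let mut pre = Vec::new();
        for &(key, value) in &block {
            pre.push((key, self.state.insert(key, value)));
        }
        self.undo.push(pre);
        self.chain.push(block);
    }

    fn revert_head(&mut self) {
        let pre = self.undo.pop().expect("revert of a block that was never applied");
        self.chain.pop();
        // Restore pre-images in reverse so repeated writes to one key unwind correctly.
        for (key, old) in pre.into_iter().rev() {
            match old {
                Some(value) => { self.state.insert(key, value); }
                None => { self.state.remove(&key); }
            }
        }
    }

    fn reorg(&mut self, depth: usize, new_branch: Vec<Block>) {
        for _ in 0..depth {
            self.revert_head();
        }
        for block in new_branch {
            self.apply(block);
        }
    }
}

/// Oracle: rebuild the state from genesis along the given chain.
fn replay(chain: &[Block]) -> State {
    let mut state = State::new();
    for block in chain {
        for &(key, value) in block {
            state.insert(key, value);
        }
    }
    state
}

/// Runs `rounds` random reorgs and checks the replay invariant after each.
pub fn check_random_reorgs(seed: u64, rounds: usize) -> bool {
    let mut rng = Lcg(seed);
    let mut node = Node::default();
    for _ in 0..rounds {
        let depth = rng.below(node.chain.len() as u64 + 1) as usize;
        let new_len = 1 + rng.below(4) as usize;
        let branch: Vec<Block> = (0..new_len).map(|_| rng.block()).collect();
        node.reorg(depth, branch);
        if node.state != replay(&node.chain) {
            return false;
        }
    }
    true
}
```

With proptest, the seed loop becomes a strategy for chain shapes, and a failing case comes back shrunk to a minimal reorg sequence.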

Local devnet with fault injection

The third layer is a small docker-compose setup that runs four nodes plus a fake L1. The fake L1 exposes an admin endpoint that lets the test issue "declare a reorg from height N along this branch". We use this to script realistic scenarios:

Sequencer publishes blocks A, B, C; one node loses the network briefly; sequencer reorganizes to A, B′, C′, D′; the dropped node reconnects.
Two-thirds of validators agree on chain X; the remaining third on Y; settlement contract picks X.
A long-running tx subscriber is mid-stream when a reorg fires.

Each scenario has an oracle: a known-good final state we expect every node to converge to, byte-for-byte. The CI runs them on every PR. They catch the integration bugs that the unit tests miss — usually around timing and lock-ordering — and they cost about three minutes per run.

How the layer behaves in production

A reorg, observed from the outside of the node, looks like a few sub-second pauses on the JSON-RPC layer followed by a clean resumption of state, with removed: true notifications fanned out to subscribers in the right order. From the inside, it is a sequence of well-typed events flowing through the same pipeline that handled normal block application — just in reverse, then forward again, against a versioned trie that already knew how to roll back.

The properties that fall out of the design are the ones that matter operationally:

Every reorg is observable. The reorg counter increments, the depth gauge spikes, and the panel annotations on the metrics dashboard show exactly when state was rewritten. The validator metrics layer is the consumer of these signals.
No derived index drifts. Because BlockReverted is a first-class event in the ChainEvent enum, every index — tx-by-hash, block-by-number, explorer history, RPC subscriptions — is forced by the type system to handle reverts. A new index added next year will get the same compiler nudge.
The finalized prefix is structurally immutable. Rollback below finality is not "discouraged" or "rare". It is impossible by construction; the rollback function refuses heights at or below the L1 cutoff and halts the validator with a structured fault. Bridges, exchanges, and dApps reading our finalized stream do not have to think about reorgs.
Externally observable state stays consistent. The single RwLock<ChainState> in the RPC layer is held for the duration of the swap. Clients see either the old chain or the new chain, never a partial mixture, and the cost is tens of milliseconds of backpressure on RPC calls during the reorg window.
Rollback is unoptimized on purpose. Brute-force "discard above the common ancestor, replay from there" is slower than a clever trie-skip would be, but it is correct by construction. The performance budget for reorgs is spent on making them rare and shallow, not on shaving milliseconds off the rare deep ones.

The shape of the layer is small on purpose. Five families of derived state, one event enum that every subscriber matches exhaustively, a single lock that gates the swap, and three layers of tests that catch different classes of bug. Reorgs are no longer the "is the chain OK?" moment they used to be; they're a structured, observable event that the validator deals with cleanly before anyone has to ask.