Symptom · search entry

Solana validator stuck catching up and snapshot download slow: why it happens

A validator stuck catching up is losing a race: it must replay slots faster than the cluster produces them, starting from a snapshot that ages while it downloads. The cluster advances at roughly two and a half slots per second (a target slot time of about 400 ms), so a node that replays at or below that rate will never close the gap, and the solana catchup command reports the distance growing instead of shrinking. A slow snapshot download makes the race longer; slow replay makes it unwinnable.

Common causes

The snapshot is being fetched from a slow or distant RPC peer, and the node does not abort and retry a faster one, so the snapshot is already thousands of slots old when replay starts.
Replay throughput is capped by disk: the accounts database and ledger are on shared or throttled NVMe, and untarring the snapshot competes with replay for the same IOPS.
CPU is undersized or shared, so banking and replay stages cannot exceed cluster pace even with healthy disks.
The node restarts into catchup during a period of high cluster load, when blocks are fuller and each slot costs more to replay.
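The first cause can be put in numbers. What follows is a minimal Python sketch, not taken from any Solana tooling: the function name, its parameters, and the sample sizes and speeds are illustrative assumptions, and the cluster pace is the approximate two and a half slots per second cited above. It estimates how many slots behind the tip a node is at the moment the download finishes.

```python
CLUSTER_SLOTS_PER_SECOND = 2.5  # approximate cluster pace; an assumption, not a measurement


def slots_behind_after_download(snapshot_bytes, throughput_bytes_per_s,
                                initial_age_slots=0,
                                cluster_rate=CLUSTER_SLOTS_PER_SECOND):
    """Estimate how far behind the tip the snapshot is once the download ends."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    # The cluster keeps producing slots for the whole duration of the download.
    download_seconds = snapshot_bytes / throughput_bytes_per_s
    return initial_age_slots + download_seconds * cluster_rate


# Illustrative numbers only: a hypothetical 60 GB snapshot fetched from two peers.
slow = slots_behind_after_download(60e9, 10e6)    # hypothetical 10 MB/s peer
fast = slots_behind_after_download(60e9, 200e6)   # hypothetical 200 MB/s peer
print(f"slow peer: {slow:.0f} slots behind; fast peer: {fast:.0f} slots behind")
```

The point of the sketch is the ratio rather than the absolute figures: the slots lost during download scale inversely with peer throughput, so abandoning a slow peer early can be cheaper than finishing the transfer.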

System-level mechanism

Catchup is bounded by the slowest stage of a pipeline that runs on one host: snapshot download, snapshot decompression, accounts index rebuild, then replay. Each stage has a different bottleneck, which is why the same symptom appears on hosts that fail for different reasons. The race framing matters because it makes the arithmetic explicit. A one-hour-old snapshot is about 9,000 slots behind at the cluster's two and a half slots per second. If replay manages three slots per second, the node closes that gap at only half a slot per second, so catchup still takes around five hours. Operators who only provision for steady-state validation discover the gap during recovery, which is exactly when stake is offline and the cost is visible.
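The race arithmetic can be written out as a short Python sketch. The function name and parameters are hypothetical, and the model assumes constant replay and cluster rates, which real nodes will not hold exactly. The node has to replay the existing gap plus every slot produced while it is replaying, so the time to converge is the gap divided by the difference between the two rates.

```python
import math


def catchup_seconds(snapshot_age_s, replay_rate, cluster_rate=2.5):
    """Seconds of replay needed to reach the tip, or infinity if the node never does.

    Rates are in slots per second; the default cluster rate is the approximate
    pace cited in the text, not a measured value.
    """
    gap_slots = snapshot_age_s * cluster_rate   # slots produced while the snapshot aged
    closing_rate = replay_rate - cluster_rate   # net slots gained on the cluster per second
    if closing_rate <= 0:
        return math.inf                         # at or below cluster pace: the gap never closes
    return gap_slots / closing_rate


# The worked example from the text: one-hour-old snapshot, replay at 3 slots/s.
hours = catchup_seconds(3600, 3.0) / 3600
print(f"catchup takes about {hours:.1f} hours")
```

Because the denominator is the margin over cluster pace and not the replay rate itself, small improvements in replay throughput near the break-even point shorten catchup disproportionately.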

What this indicates

A catchup that converges slowly indicates marginal provisioning; one that diverges indicates a hard bottleneck, usually disk, and no amount of waiting will fix it. Measure snapshot peer throughput and replay slots per second separately before changing anything, because the remedies are different: peer selection for the first, hardware isolation for the second.
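One way to keep the two measurements separate is sketched below in Python. Everything here is an assumption for illustration: the Sample type, the function names, and the idea of taking two readings some time apart. How the slot numbers are collected is left out; polling the getSlot RPC method on the local node and on a caught-up reference node is one option.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float            # wall-clock time of the reading, in seconds
    local_slot: int     # highest slot the catching-up node has processed
    cluster_slot: int   # slot reported by a caught-up reference node


def slot_rates(first, last):
    """Return (replay, cluster) rates in slots per second between two samples."""
    elapsed = last.t - first.t
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    replay = (last.local_slot - first.local_slot) / elapsed
    cluster = (last.cluster_slot - first.cluster_slot) / elapsed
    return replay, cluster


def diagnose(first, last):
    """Classify catchup as converging or diverging from two samples."""
    replay, cluster = slot_rates(first, last)
    return "converging" if replay > cluster else "diverging"


def download_throughput(bytes_received, seconds):
    """Average snapshot download speed in bytes per second, measured on its own."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_received / seconds


# Hypothetical readings taken 60 seconds apart.
a = Sample(t=0.0, local_slot=1000, cluster_slot=5000)
b = Sample(t=60.0, local_slot=1180, cluster_slot=5150)
print(slot_rates(a, b), diagnose(a, b))
```

A longer sampling window smooths out bursts; a single short interval can misclassify a node that is replaying in spurts.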

Related issues

Slot distance increasing on a running node; IOPS saturation and packet loss under load; vote credits dropping once the node finally rejoins.

Deep references

"We're securing validators at the wrong layer" covers why the infrastructure layer underneath consensus, including recovery paths like this one, gets the least attention and absorbs the most failure.
"Expensive work before authentication" covers the load on the RPC nodes that serve snapshots, which is the other half of a slow download.
slashr.dev shows delinquency windows across networks, which is where extended catchup time becomes publicly visible.

Evidence

slashr.dev · live validator incident feed