Solana validator stuck catching up and snapshot download slow: why it happens
A validator stuck catching up is losing a race: it must replay slots faster than the cluster produces them, starting from a snapshot that is ageing while it downloads. The cluster advances at roughly two and a half slots per second (a slot time of about 400 ms), so a node that replays at or below that rate will never close the gap, and the solana catchup command reports the distance growing instead of shrinking. A slow snapshot download makes the race longer; slow replay makes it unwinnable.
Common causes
- The snapshot is being fetched from a slow or distant RPC peer, and the node does not abort and retry a faster one, so the snapshot is already thousands of slots old when replay starts.
- Replay throughput is capped by disk: the accounts database and ledger are on shared or throttled NVMe, and untarring the snapshot competes with replay for the same IOPS.
- CPU is undersized or shared, so the replay stage cannot exceed cluster pace even with healthy disks.
- The node restarts into catchup during a period of high cluster load, when blocks are fuller and each slot costs more to replay.
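The first cause can be put into numbers. The sketch below estimates how many slots the cluster produces while a snapshot is still downloading. The snapshot size and peer throughputs are hypothetical figures chosen for illustration, not measurements, and the cluster pace is the approximate two and a half slots per second quoted above.

```python
CLUSTER_SLOTS_PER_SEC = 2.5  # approximate cluster pace quoted in this article


def slots_aged_during_download(snapshot_bytes: float, bytes_per_sec: float) -> float:
    """Slots the cluster produces while the snapshot is still downloading."""
    if bytes_per_sec <= 0:
        raise ValueError("bytes_per_sec must be positive")
    download_seconds = snapshot_bytes / bytes_per_sec
    return download_seconds * CLUSTER_SLOTS_PER_SEC


# Hypothetical figures: a 50 GB snapshot from a 25 MB/s peer and a 250 MB/s peer.
slow_peer = slots_aged_during_download(50e9, 25e6)
fast_peer = slots_aged_during_download(50e9, 250e6)
print(f"slow peer: {slow_peer:.0f} slots behind before replay starts")
print(f"fast peer: {fast_peer:.0f} slots behind before replay starts")
```

Under these assumed figures the slow peer leaves the node thousands of slots behind before replay has begun, while the fast peer leaves it hundreds behind, which is why aborting a slow download and retrying is usually worth it.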
System-level mechanism
Catchup runs as a sequence of stages on one host: snapshot download, snapshot decompression, accounts index rebuild, then replay. Each stage has a different bottleneck (network, CPU, memory and disk, then disk and CPU together), which is why the same symptom appears on hosts that fail for different reasons. The race framing matters because it makes the arithmetic explicit. A one-hour-old snapshot is about 9,000 slots behind at the cluster's two and a half slots per second. If replay manages three slots per second, the gap closes at a net half slot per second, so catchup still takes around five hours. Operators who only provision for steady-state validation discover the gap during recovery, which is exactly when stake is offline and the cost is visible.
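The race arithmetic can be written as a small function. This is a sketch under the simplifying assumption that both rates stay constant for the whole catchup, which real clusters do not guarantee; the function name and parameters are illustrative only.

```python
from typing import Optional


def catchup_seconds(snapshot_age_sec: float,
                    replay_slots_per_sec: float,
                    cluster_slots_per_sec: float = 2.5) -> Optional[float]:
    """Estimated time to close the gap, or None if the gap never closes."""
    # Slots the cluster produced while the snapshot aged.
    gap_slots = snapshot_age_sec * cluster_slots_per_sec
    # The node only gains ground at the difference between the two rates.
    net_rate = replay_slots_per_sec - cluster_slots_per_sec
    if net_rate <= 0:
        return None  # replay at or below cluster pace: the race is unwinnable
    return gap_slots / net_rate


# The example from the text: a one-hour-old snapshot, replay at 3 slots/s.
estimate = catchup_seconds(3600, 3.0)
print(f"{estimate / 3600:.1f} hours")

# Replay exactly at cluster pace never converges.
print(catchup_seconds(3600, 2.5))
```

With the figures from the text this gives 5.0 hours, and it returns None as soon as replay falls to cluster pace. The estimate is very sensitive to the replay rate near that pace: a small drop in replay throughput shrinks the denominator and lengthens catchup disproportionately.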
What this indicates
A catchup that converges slowly indicates marginal provisioning; one that diverges indicates a hard bottleneck, usually disk, and no amount of waiting will fix it. Measure snapshot peer throughput and replay slots per second separately before changing anything, because the remedies are different: peer selection for the first, hardware isolation for the second.
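To tell slow convergence from divergence, two samples of the node's slot and the cluster's slot are enough. The sketch below assumes the caller has already collected those samples, for example by polling the getSlot RPC method on the local node and on a trusted cluster endpoint; how the samples are collected is an assumption and is left out. It covers only the replay half of the measurement, not snapshot peer throughput, and the Sample type and classify function are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Sample(NamedTuple):
    unix_time: float   # when the sample was taken, in seconds
    node_slot: int     # slot the local node has replayed up to
    cluster_slot: int  # slot reported by a reference cluster endpoint


def classify(first: Sample, last: Sample) -> Tuple[float, float, str]:
    """Return (replay rate, cluster rate, verdict) from two samples."""
    elapsed = last.unix_time - first.unix_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    replay_rate = (last.node_slot - first.node_slot) / elapsed
    cluster_rate = (last.cluster_slot - first.cluster_slot) / elapsed
    # Compare the slot distance at the two points in time.
    gap_change = ((last.cluster_slot - last.node_slot)
                  - (first.cluster_slot - first.node_slot))
    if gap_change < 0:
        verdict = "converging"
    elif gap_change > 0:
        verdict = "diverging"
    else:
        verdict = "holding"
    return replay_rate, cluster_rate, verdict


# Made-up samples taken 60 seconds apart.
first = Sample(unix_time=0.0, node_slot=1_000, cluster_slot=10_000)
last = Sample(unix_time=60.0, node_slot=1_180, cluster_slot=10_150)
replay_rate, cluster_rate, verdict = classify(first, last)
print(f"replay {replay_rate:.2f} slots/s, cluster {cluster_rate:.2f} slots/s, {verdict}")
```

A longer sampling window smooths out bursts; a single short window taken during a period of fuller blocks can make a healthy node look divergent.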
Related issues
Slot distance increasing on a running node; IOPS saturation and packet loss under load; vote credits dropping once the node finally rejoins.
Deep references
- "We're securing validators at the wrong layer" covers why the infrastructure layer underneath consensus, including recovery paths like this one, gets the least attention and absorbs the most failure.
- "Expensive work before authentication" covers the load on the RPC nodes that serve snapshots, which is the other half of a slow download.
- slashr.dev shows delinquency windows across networks, which is where extended catchup time becomes publicly visible.
