GitHub suffered three major outages between February 2 and March 5, 2025. The February 9 incident traced back to a compounding failure: two popular client apps drove a tenfold increase in API read traffic, a cache TTL was cut from 12 to 2 hours during a model rollout on February 7, and peak Monday load hit all at once. The auth and user management database cluster collapsed under the combined write and read pressure. User settings payloads that once measured bytes per user had quietly grown to kilobytes, a change the long TTL had masked until the cut exposed it.
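The compounding effect is worth doing the arithmetic on. A minimal sketch, assuming steady-state cache-miss traffic to the database scales linearly with refresh rate, request volume, and payload size (the payload factor below is an illustrative guess; the report only says bytes grew to kilobytes):

```python
def load_multiplier(ttl_before_h: float, ttl_after_h: float,
                    traffic_factor: float, payload_factor: float) -> float:
    """Back-of-envelope database read-bandwidth multiplier.

    Assumes cache misses (and therefore backing-store reads) scale
    linearly with how often entries expire, how many requests arrive,
    and how large each refreshed payload is.
    """
    refresh_factor = ttl_before_h / ttl_after_h  # shorter TTL -> more refreshes
    return refresh_factor * traffic_factor * payload_factor

# From the writeup: TTL cut 12h -> 2h (6x refreshes), tenfold API read
# traffic. The 10x payload growth is a hypothetical stand-in for
# "bytes grew to kilobytes".
print(load_multiplier(12, 2, 10, 10))  # -> 600.0
```

Even with a generous payload guess, three individually survivable factors multiply into orders of magnitude more read bandwidth, which is why the failure only detonated when all three aligned.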

The Actions outages tell a different story about failover theater. On February 2, a telemetry gap triggered security policies that locked internal storage accounts across all regions, killing VM creation and hosted runner lifecycle operations globally. On March 5, a Redis cluster used for Actions job orchestration failed over correctly, then sat with no writable primary due to a latent configuration bug. Engineers fixed it manually. Both incidents exposed single points of failure that dry-run testing should have caught.
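The report doesn't detail the specific Redis misconfiguration, but one classic way a failover completes yet leaves no writable primary is every replica carrying `replica-priority 0`, which in Redis marks a node as never eligible for promotion. A hypothetical dry-run check, sketched as a simulation rather than a live Sentinel query:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    priority: int        # models Redis replica-priority; 0 = never promote
    healthy: bool = True

def promotable(replicas: list[Replica]) -> list[Replica]:
    """Return replicas eligible to become primary after a failover.

    Mirrors the Redis rule that priority 0 disqualifies a node: a
    cluster whose replicas are all priority 0 (or down) fails over
    "correctly" and then sits with no writable primary.
    """
    return [r for r in replicas if r.healthy and r.priority > 0]

# A dry run would assert every production cluster has a candidate,
# catching the latent bug before a real primary dies:
cluster = [Replica("redis-1", 0), Replica("redis-2", 0)]
assert not promotable(cluster)  # flagged: failover would strand writes
```

This is exactly the kind of invariant a periodic dry-run exercise can verify cheaply, without ever touching the real failover path.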

GitHub's remediation plan includes redesigning the user cache into a segmented database cluster, auditing critical data and compute infrastructure capacity, and isolating GitHub Actions and Git from shared infrastructure failures. The original incident writeup is worth reading in full for one specific reason: the technical postmortem on how a cache TTL change combined with gradual client app adoption created an invisible load bomb that only detonated under peak production conditions.
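The writeup doesn't specify the segmentation scheme, but the core idea of a segmented cluster is that a stable hash pins each user to one of several independent shards, so a hot or oversized segment can't take down auth for everyone. A minimal sketch with a hypothetical shard count and naming convention:

```python
import hashlib

N_SHARDS = 8  # hypothetical; the report gives no shard count

def shard_for(user_id: str) -> str:
    """Route a user's settings to one of N independent cache/database
    shards. SHA-256 gives a stable, evenly spread mapping, so a user
    always lands on the same shard and load splits roughly uniformly.
    """
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return f"user-settings-shard-{h % N_SHARDS}"

# The same user always routes to the same shard:
print(shard_for("octocat"))
```

Modulo hashing is the simplest choice; a production design would likely prefer consistent hashing so resharding doesn't remap most keys.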

[READ ORIGINAL →]