GitHub had two unacceptable outages. On April 23, a merge queue regression corrupted squash merges across 658 repositories and 2,092 pull requests; the bug triggered whenever a merge group (the batch of queued pull requests that a merge queue validates together before merging) contained more than one pull request. No data was lost, but default branches were left in incorrect states, and not all of them could be repaired automatically. On April 27, a botnet attack overloaded the Elasticsearch cluster, breaking search-backed UI across pull requests, issues, and projects. Git operations and APIs stayed up, but the user-facing disruption was significant. Root cause analyses are forthcoming.
The underlying pressure is structural. Agentic development workflows have pushed GitHub to record volumes since late December 2025: 90 million pull requests merged, 1.4 billion commits, and 20 million new repositories in a single month. GitHub began planning for 10x capacity in October 2025; by February 2026, the target had already been revised to 30x. A single pull request now touches Git storage, mergeability checks, branch protection, Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases simultaneously. At this scale, cache misses become database load, retries amplify traffic, and one slow dependency cascades across multiple product surfaces.
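To make the cache-miss dynamic concrete, here is a minimal Go sketch (Go because it is the language GitHub says it is migrating hot paths into) of how a single hot-key cache miss fans out into duplicate database reads under concurrency, and how collapsing concurrent misses with golang.org/x/sync/singleflight bounds that load. This is a generic mitigation pattern, not GitHub's implementation; fetchFromDB and the key name are illustrative.

```go
// Sketch: at high concurrency, one cache miss fans out into N identical
// database queries unless concurrent misses are collapsed. Generic
// illustration only; fetchFromDB is a stand-in, not GitHub's code.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"

	"golang.org/x/sync/singleflight"
)

var dbQueries atomic.Int64

// fetchFromDB stands in for an expensive primary-store read.
func fetchFromDB(key string) string {
	dbQueries.Add(1)
	return "value-for-" + key
}

func main() {
	var group singleflight.Group
	var wg sync.WaitGroup

	// 1,000 concurrent requests all miss the cache for the same key.
	// Without collapsing, that is 1,000 database reads; with
	// singleflight, concurrent callers share one in-flight read.
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			group.Do("hot-key", func() (any, error) {
				return fetchFromDB("hot-key"), nil
			})
		}()
	}
	wg.Wait()
	fmt.Printf("database reads for 1000 requests: %d\n", dbQueries.Load())
}
```

The same collapsing logic is why retries are dangerous at this scale: if every caller retries independently against a slow dependency, effective traffic multiplies by the retry count at exactly the moment the dependency can least absorb it.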
GitHub's stated priority order is now availability, then capacity, then features. Specific work in progress includes moving webhooks out of MySQL, redesigning session caching, migrating performance-critical paths from a Ruby monolith into Go, isolating Git and Actions from general workloads, and pursuing a multi-cloud architecture beyond its current Azure migration. A separate post on the new API design for large monorepo and merge queue efficiency is coming soon. The full article is worth reading for the dependency-analysis methodology GitHub is using to rank and sequence blast-radius reduction; it is the clearest public explanation the company has given of how it thinks about reliability tradeoffs at this scale.
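One way to picture the isolation work (Git and Actions separated from general workloads) is the bulkhead pattern: each workload class gets its own concurrency budget, so a saturated pool sheds its own requests instead of starving everything else. The sketch below illustrates that generic pattern under assumed pool names and limits; it is not GitHub's actual design.

```go
// Sketch of the bulkhead idea behind workload isolation: each workload
// class draws from its own concurrency budget, so a slow dependency
// exhausts only its own pool. Pool names and limits are invented.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

type bulkhead struct {
	name string
	sem  *semaphore.Weighted
}

func newBulkhead(name string, limit int64) *bulkhead {
	return &bulkhead{name: name, sem: semaphore.NewWeighted(limit)}
}

// run executes fn only if the workload's pool has capacity before the
// context deadline; otherwise it sheds the request instead of queueing.
func (b *bulkhead) run(ctx context.Context, fn func()) error {
	if err := b.sem.Acquire(ctx, 1); err != nil {
		return fmt.Errorf("%s pool saturated: %w", b.name, err)
	}
	defer b.sem.Release(1)
	fn()
	return nil
}

func main() {
	gitPool := newBulkhead("git", 2)
	webPool := newBulkhead("web", 2)

	// Saturate the git pool with slow work; web traffic is unaffected
	// because it draws from a separate budget.
	for i := 0; i < 2; i++ {
		go gitPool.run(context.Background(), func() { time.Sleep(time.Second) })
	}
	time.Sleep(10 * time.Millisecond) // let the slow work claim the pool

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	fmt.Println("git:", gitPool.run(ctx, func() {}))                  // shed: pool full
	fmt.Println("web:", webPool.run(context.Background(), func() {})) // unaffected
}
```

The design point is blast-radius reduction: when the git pool is saturated, only git requests fail fast, which is presumably the property GitHub's dependency analysis is ranking candidate isolations by.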
[READ ORIGINAL →]