AI agents are shipping more code and breaking more things. Anthropic, which generates over 80% of its production code with Claude Code, let an input-resetting bug live on Claude.ai long enough to hit every paying customer, every single day. Nobody caught it internally. Amazon's retail org now requires senior sign-off on junior engineers' AI-assisted changes after a measurable spike in SEVs caused by its own AI agents. Uber's internal data shows 'power users' of AI generate 52% more PRs, while product quality goes untracked entirely.
The velocity numbers look good until they don't. Dax Raad, creator of OpenCode, says AI agents lower the bar for what ships, actively discourage refactoring, and do not actually speed teams up. Sentry's CTO echoes this: LLMs remove the barrier to starting, but produce bloated, hard-to-maintain code that erodes long-term velocity. Research backs this up, showing a pattern of short-lived productivity gains followed by significant tech-debt accumulation. Meta and Uber are now tracking AI token usage in performance reviews, pressuring engineers to use the tools regardless of the quality impact.
The piece is worth reading in full not for its conclusions but for its specifics: the exact failure sequence on Claude.ai, how Amazon is restructuring its review process after agent-caused outages, and what the proposed fixes, including formal validation methods and revived QA practices, actually look like in practice. The core question it forces is whether the industry is measuring the right thing at all.
[READ ORIGINAL →]