Mozilla distinguished engineer Brian Grinstead ran an agentic bug-finding pipeline against Firefox, a codebase of tens of millions of lines, and produced a record month of security fixes. A viral chart attributed the result entirely to Anthropic's Mythos model. Grinstead disputes that framing: he puts the split closer to 50-50 between the model and the custom harness his team built around it. One confirmed find was a bug that had sat undetected in the codebase for 15 years.

The pipeline has three core components. An LLM judge scores and ranks files before any expensive compute runs, so the agent is never pointed blindly at a massive repo. A verifier subagent then filters out false positives by checking whether the main agent is cheating its way to a passing result. The whole thing runs on a goal-loop pattern: tight scope, a clear pass-or-fail signal, and automatic retries well beyond what a human engineer would attempt. The barrier to entry is deliberately low: a starter version needs only Claude Code or Codex, a single prompt, and the -p flag. No SDK is required.
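Read as an architecture, the three components compose into a short control loop. The Python below is a minimal sketch under stated assumptions, not Mozilla's actual harness. The names rank_files, verify, goal_loop, run_pipeline, and Finding, the weakened_check flag, the keyword weights, and the file paths are all hypothetical. In a real setup the judge would be an LLM call and each agent attempt a separate model run (for example, a Claude Code invocation with -p). Here both are deterministic stand-ins so the control flow runs on its own. Reducing the verifier's cheating check to two boolean fields is likewise an assumption about what such a subagent would inspect.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    path: str
    attempt: int
    reproduced: bool      # the agent's test case actually triggers the bug
    weakened_check: bool  # the agent edited the check itself to force a "pass"


Judge = Callable[[str], float]
Agent = Callable[[str, int], Optional[Finding]]
Verifier = Callable[[Finding], bool]


def rank_files(files: list[str], judge: Judge, top_k: int) -> list[str]:
    """Cheap pass: score every file, keep only the top_k for the expensive agent."""
    return sorted(files, key=judge, reverse=True)[:top_k]


def verify(finding: Finding) -> bool:
    """Reject 'passes' that come from gaming the check rather than a real bug."""
    return finding.reproduced and not finding.weakened_check


def goal_loop(path: str, agent: Agent, verifier: Verifier,
              max_attempts: int) -> Optional[Finding]:
    """Tight scope (one file), binary signal (the verifier), automatic retries."""
    for attempt in range(1, max_attempts + 1):
        finding = agent(path, attempt)
        if finding is not None and verifier(finding):
            return finding
    return None


def run_pipeline(files: list[str], judge: Judge, agent: Agent, verifier: Verifier,
                 top_k: int = 2, max_attempts: int = 5) -> list[Finding]:
    findings = []
    for path in rank_files(files, judge, top_k):
        result = goal_loop(path, agent, verifier, max_attempts)
        if result is not None:
            findings.append(result)  # still needs human review before any patch ships
    return findings


# Hypothetical stand-ins so the sketch runs without any model calls.
RISK_WEIGHTS = {"parser": 3.0, "jit": 2.0, "ipc": 1.0}


def toy_judge(path: str) -> float:
    # Stand-in for an LLM judge: sums the weights of risky keywords in the path.
    return sum(w for word, w in RISK_WEIGHTS.items() if word in path)


def toy_agent(path: str, attempt: int) -> Optional[Finding]:
    # Stand-in for one agent run: only the parser has a bug. Attempt 1 games
    # the check, attempt 2 fails to reproduce, attempt 3 and later succeed.
    if "parser" not in path:
        return None
    return Finding(path, attempt, reproduced=attempt != 2, weakened_check=attempt == 1)


if __name__ == "__main__":
    files = ["dom/html_parser.cpp", "js/jit/codegen.cpp",
             "ui/about_dialog.js", "ipc/channel.cpp"]
    for f in run_pipeline(files, toy_judge, toy_agent, verify):
        print(f.path, "verified on attempt", f.attempt)
    # prints: dom/html_parser.cpp verified on attempt 3
```

Run as-is, the judge ranks the parser and JIT files highest and the rest are never handed to the agent. The loop then rejects a gamed pass and a failed reproduction before accepting the parser finding on the third attempt. Keeping the judge separate from the loop is the point of this shape: the cheap scoring pass decides where the expensive retries are spent.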

The full episode is worth reading for two things the summary cannot convey: the live demo showing exactly how file prioritization works in practice, and Grinstead's specific argument for why teams with existing fuzzing and CI infrastructure have a compounding advantage that pure AI tooling cannot replicate. He also addresses where humans remain mandatory (every AI-generated patch still requires human review before it ships) and how the same score-verify-fix loop applies outside security, to design quality, conversion rate, and tech debt.
