GitHub Copilot now uses two concrete mechanisms to stop wasting tokens in long agentic sessions in VS Code: prompt caching, which reuses model state for repeated prompt prefixes instead of recomputing them each turn, and tool search, which loads tool definitions on demand rather than dumping every full schema into context on every request. In a session with MCP tools, terminal commands, file operations, and workspace search all available, that front-loaded schema cost was real and fixed. Now it is not.
The second piece is Auto, a model router built on a system called HyDRA. After your first prompt, Auto reads task intent and live model health signals including availability, error rates, and speed, then picks the model. On SWEBench, HyDRA's conservative operating point matched OpenRouter Auto's 70.8% resolution rate at 3.3x the cost savings. Its aggressive point outperformed both Azure Foundry modes. The result is not a quality-cost tradeoff. It is task-matched routing across a five-model production pool.
The original post is worth reading in full for two reasons. First, the VS Code technical deep dive it links covers cache-control breakpoints and provider-specific tool search implementation details that matter if you run long agentic sessions today. Second, the HyDRA paper on arXiv explains the routing model's mechanics, including how it handles cache-aware switching, multilingual tasks, and mid-conversation context shifts. The summary tells you what changed. The source tells you how it actually works.
[READ ORIGINAL →]