DataCurve's DeepSWE benchmark reveals significant performance gaps between AI systems on realistic, long-horizon coding tasks. This is not a synthetic test. It measures how models perform on extended, real-world software engineering work, and the gaps are large enough to matter for anyone deploying agents in production.
The 'AI summer slowdown' narrative is back, as it is every year, but this cycle it carries more weight, arriving alongside active debate over job displacement and deployment friction. The tension between what models can do in demos and what they do in production is the actual story here.
Token shortages and major inference funding rounds for Base10 and OpenRouter are pushing the market toward pay-per-use pricing. That shift directly constrains agent experimentation and raises questions about access. The full episode gets into what the funding signals about where inference capacity is heading and who gets left out.