OpenAI built a new network protocol called Multipath Reliable Connection (MRC) because training frontier models at scale requires every GPU in a cluster to stay in lockstep, and a single network failure stalls the entire job. Mark Handley and Greg Steinbrecher explain how MRC routes around failures in real time, keeping training runs intact across record-scale GPU deployments. The protocol was developed jointly with AMD, Broadcom, Intel, Microsoft, and Nvidia, and OpenAI has made it an open industry standard.
The technical core of the conversation, starting around 10:05, is worth your attention: the discussion of why AI training stresses networks differently than traditional HPC or cloud workloads, what bottlenecks actually cost in terms of wasted compute, and the specific failure modes MRC was designed to eliminate. This is not a generic networking talk. The protocol details start at 15:19.
OpenAI's decision to open-source MRC rather than keep it proprietary is addressed directly at 25:05, and the reasoning is more strategic than altruistic. The episode closes with a speculative segment on whether AI compute could move to space, which is either a footnote or the most interesting thing here depending on your appetite for infrastructure futures.
[WATCH ON YOUTUBE →]