Anthropic has built Natural Language Autoencoders (NLAs), a system that converts Claude's internal numerical activations into readable text. Activations are the raw computational states that govern model behavior, and until now they have existed only in a form humans cannot directly interpret.

The significance is not just academic. Anthropic reports that NLAs are already in operational use, improving safety testing methods and diagnosing why Claude produces specific outputs. That makes this deployed interpretability tooling, not a research prototype sitting on a shelf.

The blog post behind this video is worth reading for the architectural details of how NLAs encode and decode activation space into natural language, and for the failure modes or limitations Anthropic found in the process. The gap between what a model says and what it computes is one of AI safety's hardest problems. This is a concrete attempt to close it.
