Standard Intelligence is training general computer agents directly from raw video, not text or screenshots. Their first model, FDM-1, learns computer use by predicting mouse movements, clicks, and keystrokes from pixel streams, the same way Tesla FSD learns to drive from camera feeds. The team has assembled an 11-million-hour computer action dataset, the largest in the industry, and built a video encoder that is 50 times more token-efficient than competing approaches, fitting nearly two hours of 30 FPS footage into a 1-million-token context window. They racked a 30-petabyte storage cluster in San Francisco for under $500K, roughly 20 times cheaper than hyperscaler pricing.
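The headline numbers are easier to judge once reduced to per-unit figures. The sketch below is a back-of-the-envelope check, not anything from the FDM-1 report: it treats "nearly two hours" as exactly two hours and "under $500K" as exactly $500,000, and it assumes decimal units (1 PB = 1,000 TB). The function names are illustrative only.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
# Assumptions (not from the source): "nearly two hours" is rounded up to
# exactly 2 hours, "under $500K" is rounded up to exactly $500,000, and
# a petabyte is taken as 1,000 terabytes (decimal units).

CONTEXT_TOKENS = 1_000_000   # 1-million-token context window
FPS = 30                     # frame rate of the footage
HOURS_IN_CONTEXT = 2.0       # upper bound for "nearly two hours"

STORAGE_PB = 30              # cluster capacity in petabytes
STORAGE_COST_USD = 500_000   # upper bound for "under $500K"


def tokens_per_frame(context_tokens: int, fps: int, hours: float) -> float:
    """Average token budget per video frame for a clip that fills the context."""
    frames = fps * hours * 3600  # frames in the clip
    return context_tokens / frames


def cost_per_terabyte(total_cost_usd: float, petabytes: float) -> float:
    """Up-front hardware cost per terabyte of raw capacity."""
    return total_cost_usd / (petabytes * 1000)


if __name__ == "__main__":
    tpf = tokens_per_frame(CONTEXT_TOKENS, FPS, HOURS_IN_CONTEXT)
    cpt = cost_per_terabyte(STORAGE_COST_USD, STORAGE_PB)
    print(f"tokens per frame: about {tpf:.2f}")
    print(f"storage cost: about ${cpt:.2f} per TB")
```

Under those assumptions the encoder averages roughly 4.6 tokens per frame, and the cluster costs at most about $16.67 per terabyte of capacity up front. The "20 times cheaper" comparison depends on which hyperscaler tier and billing period is used as the baseline, which the summary does not specify, so it is not reproduced here.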
FDM-1 can already extrude a CAD gear in Blender, drive a car around a San Francisco block after one hour of fine-tuning, and find software bugs by exploring application state space. Founders Galen Mead and Devansh Pandey, ages 21 and 20, met in 2022 at the Atlas Fellowship, a selective program for high schoolers focused on AI alignment. Both left their undergraduate programs to build Standard Intelligence. The six-person team carries no legacy assumptions from the video research world, which is either a liability or the reason they solved problems others abandoned.
The core bet is the bitter lesson applied to knowledge work: skip the hand-engineered scaffolding, skip the language model wrappers, and pre-train on raw computer-use video at scale until generality emerges from the data. Sequoia is leading the Series A alongside Spark Capital. The FDM-1 technical report is where this piece earns its depth. Read it not for the conclusion but for how they solved the token-efficiency and storage problems that killed prior attempts to scale video toward general agents.