Why AI Labs Pay People for Dirty Robot Training Work

Summary
– OpenAI announced it would revive its robotics program, signaling a race among major AI labs to develop physical-world AI, but a lack of robot-specific training data is a major bottleneck.
– Startup XDOF raised $70 million to build data pipelines, collection tools, and annotation systems for robots, aiming to solve the data feedback loop problem that frontier labs face.
– XDOF is releasing ABC, which it believes is the largest collection of high-quality robot training data, including 130,000 manipulation trajectories, to advance academic robotics research.
– The company plans to collect data across three tiers: teleoperation on deployed robots, general teleoperation data, and egocentric human data gathered via custom wearable sensors.
– XDOF argues that major labs outsource data production because it requires massive operational scale—warehouses, robot maintenance, and trained operators—that is costly and distracting.
Two weeks ago, OpenAI announced it would revive the robotics program it shut down in 2021, the latest sign that major AI labs are scrambling to teach machines how to navigate and act in the physical world. But engineering capable robots hinges on a resource the industry still lacks: the kind of training data that powers today’s language models.
That shortfall is giving rise to a new breed of infrastructure company. Unlike large language models, which were trained on a vast trove of publicly available text, robots require data that captures physical interaction, and that data is scarce. YouTube clips and footage from gig workers lack fidelity and are difficult to translate into real-world robotic movements.
XDOF (pronounced “ecks-doff”), emerging from stealth mode today, is betting that the next critical bottleneck in AI isn’t models or chips, but the data feedback loop needed to teach robots how to interact with their surroundings. The startup aims to build the data pipelines, collection tools, and annotation systems that frontier labs and robotics firms can’t easily develop on their own. It has raised $70 million from Thrive Capital, Spark Capital, a16z, Lux, and WndrCo. Co-founder and CEO Philipp Wu says XDOF, which employs about 60 people, is already working with 20 customers, including several top AI labs, though he cannot name them.
“All of the top labs are trying to pursue robotics,” Wu said. “We’ve already seen some of the downfalls of falling a little bit behind in the language model race … you don’t want to be in this type of situation where you pursue this technology too late, and everyone is in this boat where physical AI is the next frontier.”
Wu encountered this problem firsthand as a PhD student at UC Berkeley. His research focused on enabling robots to learn skills from large-scale datasets, but there was one major obstacle.
“We didn’t have large-scale data to work with,” he told TechCrunch. “There was this chicken-and-egg problem: we first needed to actually collect data before we could even ask how to train a foundation model for robotics.”
Wu and his future XDOF co-founder and CTO, Fred Shentu, worked on a project called GELLO, a low-cost teleoperation system that lets a human operator control a robotic arm to generate training data. “It ended up becoming a very influential paper in robotics, because a lot of people had similar needs and bottlenecks, and many started leveraging this type of device for data collection,” Wu said.
Recognizing the opportunity, Wu, Shentu, and third co-founder and COO Nemo Jin launched XDOF in October 2024 to create a full data ecosystem for companies developing robotics models. Aware that providing raw data alone can be a dead-end business, the company also focuses on data cleaning, tooling, and annotation, building a self-reinforcing feedback loop for robot trainers.
As a starting point, XDOF is partnering with UC Berkeley’s AI Research lab to release what it believes is the largest collection of high-quality robot training data ever assembled, called ABC. The dataset includes 130,000 trajectories of robot manipulation data, 300 hours of simulation, and 100 hours of evaluations. That kind of scaled-up pre-training data has never been available to academia before.
“We’ve seen in language, image generation, and other fields, that when models and data are released, the community achieves things that you wouldn’t necessarily have expected,” David McAllister, a Berkeley PhD student who helped organize the release, told TechCrunch.
The team has already used the data to train robots on benchmark tasks like folding T-shirts, flattening boxes, and loading AirPods into their cases.
The company plans to work across three tiers of a data pyramid. The most valuable tier is teleoperation data collected on the actual robot being deployed; the next involves teleoperated robots gathering more general data, as with GELLO; and the third is egocentric data gathered by humans performing everyday tasks, for which XDOF plans to build its own wearable sensors.
“Your camera choice is going to affect the quality of your data, which is going to affect how your hand-tracking algorithm performs,” Wu said. “If you don’t design the hardware well from the start, the data you collect might have very specific problems that you didn’t anticipate.”
The company intends to hire and train armies of teleoperators and egocentric data operators around the world. That labor-intensive model raises an obvious question: Why aren’t the major labs doing this data production work themselves?
“You need a warehouse of hundreds of thousands of square feet with hundreds of robots,” Wu said. “You need to maintain these robots, calibrate their physical parameters, and properly train operators.”
It’s a build-out that requires focus, capital, and operational scale that most AI labs would rather outsource, which is precisely the market XDOF is betting on.
The name XDOF is a play on the robotics term “degrees of freedom,” which describes the number of independent motions a robot can perform. Your arm, from shoulder to wrist, has seven degrees of freedom. Humanoid robotics company Figure AI’s latest robot has 30. The X in the company’s name captures its ambition: “Arbitrary degrees of freedom, unlimited degrees of freedom,” Wu says.
(Source: TechCrunch)