
Why AI Agents Fail as Freelancers

Summary

– Top AI agents performed poorly in online freelance work simulations, completing less than 3% of tasks and earning only $1,810 out of a possible $143,991.
– The Remote Labor Index benchmark was created by Scale AI and CAIS researchers to measure AI’s ability to automate economically valuable work.
– Leading AI models like Manus, Grok, Claude, ChatGPT, and Gemini were tested across diverse freelance tasks including graphic design, video editing, and data scraping.
– Despite improvements in coding and reasoning, AI still struggles with multi-step tasks, tool usage, and lacks long-term memory and on-the-job learning capabilities.
– These findings contrast with OpenAI’s GDPval benchmark, which suggests AI models are approaching human-level performance on office tasks.

The notion that artificial intelligence will soon replace human freelancers on a massive scale faces a significant challenge, according to recent experimental findings. A newly developed benchmark, the Remote Labor Index, reveals that even the most advanced AI agents struggle profoundly with online freelance assignments. Created through collaboration between researchers at data annotation firm Scale AI and the nonprofit Center for AI Safety, this index evaluates how well cutting-edge AI models handle economically valuable work.

During testing, multiple leading AI systems attempted various simulated freelance jobs, yet the most capable among them completed fewer than 3 percent of the total tasks. Out of a potential earnings pool of $143,991, the top-performing AI agent managed to secure just $1,810. Performance rankings placed Manus, developed by a Chinese startup, at the forefront, followed by Grok from xAI, Claude by Anthropic, OpenAI’s ChatGPT, and Google’s Gemini.
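As a back-of-the-envelope check on the reported figures, the top agent's $1,810 amounts to only about 1.3 percent of the $143,991 earnings pool. A minimal sketch of that arithmetic (the variable names are illustrative, not part of the benchmark):

```python
# Rough arithmetic from the article's reported figures -- a sketch,
# not data drawn from the benchmark itself.
total_pool = 143_991   # total value of all simulated freelance tasks, in USD
top_agent = 1_810      # earnings of the best-performing agent, in USD

# Share of available earnings actually captured by the top agent.
share = top_agent / total_pool * 100
print(f"Top agent captured {share:.1f}% of available earnings")
```

The earnings share (about 1.3 percent) tracks closely with the reported task-completion rate of under 3 percent, suggesting the agents failed across the board rather than only on the highest-value jobs.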

Dan Hendrycks, who leads the Center for AI Safety, remarked that these findings provide a much clearer picture of actual AI capabilities. He observed that while some agents have shown notable improvement over the past year, there is no guarantee this progress will continue at the same pace. This research arrives amid widespread speculation about artificial intelligence rapidly overtaking human intellect and displacing huge portions of the workforce. Earlier this year, Anthropic CEO Dario Amodei predicted that 90 percent of coding tasks would become automated within months.

History shows that previous AI advancements also sparked premature forecasts about job replacement, such as claims that radiologists would soon be superseded by AI algorithms. For this study, researchers assembled a diverse set of freelance assignments originally completed by verified workers on Upwork. These tasks covered graphic design, video editing, game development, and administrative functions like data scraping. Each job came with a detailed description, the necessary file directories, and a sample of human-produced work for reference.

Hendrycks explained that although AI models have demonstrated improved capabilities in coding, mathematics, and logical reasoning in recent years, they continue to face fundamental limitations. They struggle to coordinate multiple tools and execute complex, multi-step processes effectively. Unlike humans, these systems lack long-term memory storage and cannot engage in continuous learning from experience. They remain unable to acquire new skills through on-the-job practice.

This analysis presents a contrasting perspective to OpenAI’s GDPval benchmark, introduced last September, which claims to measure economically valuable work. According to GDPval metrics, frontier AI models like GPT-5 are nearing human-level performance across 220 office-related tasks. OpenAI has not commented on these differing assessments.

(Source: Wired)
