New AI Agent Benchmark Questions Workplace Readiness

Summary
– Despite predictions of AI replacing knowledge work, the transformation of white-collar jobs has been slower than expected, even with advanced AI models.
– New research from Mercor introduces the APEX-Agents benchmark, which tests AI on real professional tasks from fields like consulting, banking, and law; current models are failing, often scoring below 25% accuracy.
– The primary challenge for AI models is performing multi-domain reasoning, such as gathering information across tools like Slack and Google Drive, which is essential for human knowledge work.
– The benchmark uses complex, real-world scenarios from professionals, requiring deep analysis of policies and regulations, making it a more accurate measure of job automation potential than broader knowledge tests.
– While current AI performance is limited, with Gemini 3 Flash leading at 24% accuracy, rapid yearly improvements suggest these systems could significantly impact professional work in the near future.
Nearly two years after Microsoft’s CEO suggested artificial intelligence would soon transform knowledge-based professions, the reality in offices remains largely unchanged. While AI models have made impressive strides in research and planning, their impact on the daily work of lawyers, consultants, and bankers has been minimal. New research from data firm Mercor provides a compelling explanation for this delay, introducing a benchmark that reveals a significant gap between AI capabilities and the complex demands of professional work.
The study, which led to the creation of the APEX-Agents benchmark, tested leading AI models on authentic tasks from consulting, investment banking, and law. The results were sobering. Every model evaluated received a failing grade, with even the top performers struggling to answer more than a quarter of the questions correctly. Most attempts ended with an incorrect response or no answer at all.
According to Mercor’s CEO, the core challenge for AI lies in synthesizing information from multiple, disparate sources. Real-world professional work doesn’t happen in a single, neatly packaged document. It involves navigating across platforms like Slack, Google Drive, and various databases to gather and connect relevant data. This multi-domain reasoning, essential for human experts, remains a significant hurdle for current agentic AI systems.
The benchmark’s scenarios were developed with input from actual professionals on Mercor’s platform, who also defined what constituted a successful answer. Reviewing the publicly available questions reveals their daunting complexity. One legal example asks whether a company’s data export during an outage complies with EU privacy regulations, a query that requires deep analysis of both internal policies and international law. Answering such questions reliably would signal an AI’s readiness to perform substantive legal work.
This new benchmark differs from previous efforts like GPQA. While GPQA assesses broad professional knowledge, APEX-Agents evaluates a system’s ability to execute sustained, intricate tasks within specific, high-value fields. This focus makes it a tougher test but also a more accurate gauge of real-world job automation potential.
In the initial evaluations, no model came close to professional proficiency. Gemini 3 Flash achieved the highest one-shot accuracy at 24%, with GPT-5.2 close behind at 23%. Other models like Opus 4.5 and GPT-5 scored around 18%. Despite these low scores, the AI industry has a track record of rapidly overcoming difficult benchmarks. Now that APEX-Agents is public, it presents a clear target for labs aiming to prove their systems’ workplace readiness.
Progress, while incremental, is accelerating. As one researcher noted, current AI performance might be comparable to an intern who is correct a quarter of the time, a notable improvement from just a year ago. That pace of annual advancement suggests the long-predicted shift in knowledge work may be closer than it appears, even if the technology isn’t quite there yet.
(Source: TechCrunch)
