
GPT-5 Matches Human Performance in Diverse Jobs, Says OpenAI

Summary

– OpenAI released a new benchmark called GDPval to measure how its AI models perform against human professionals across economically significant industries and jobs.
– The benchmark tests AI performance in 44 occupations across nine major industries, including healthcare and finance, by having experts compare AI-generated reports to human ones.
– OpenAI’s GPT-5 model was rated as good as or better than industry experts 40.6% of the time, while Anthropic’s Claude Opus 4.1 scored 49%, though OpenAI attributes Claude’s high score partly to its visually appealing graphics.
– OpenAI acknowledges the current test is limited, as it only evaluates report generation and not the full range of tasks in a real job, but plans to develop more robust future versions.
– The company views the rapid progress as significant, noting that GPT-5’s performance has nearly tripled compared to the GPT-4o model from 15 months prior.

A new benchmark from OpenAI suggests its latest AI models are reaching performance levels comparable to those of human professionals in several key industries. The evaluation, known as GDPval, is an initial effort to gauge how closely artificial intelligence systems can match or exceed the quality of work performed by people in economically significant roles. The effort forms part of the company’s broader push toward artificial general intelligence.

According to the company’s findings, its GPT-5 model and Anthropic’s Claude Opus 4.1 are already approaching the quality of work produced by industry experts. That does not signal an immediate, widespread replacement of human workers: the current version of the test covers only a narrow slice of the tasks that make up most jobs. It does, however, provide a new metric for tracking how close AI is getting to expert-level work.

The GDPval benchmark is structured around the nine industries that contribute most significantly to the United States’ gross domestic product. These sectors include healthcare, finance, manufacturing, and government. The evaluation assesses AI performance across 44 distinct occupations within these fields, from software engineering and nursing to journalism.

For this initial iteration, GDPval-v0, the methodology involved experienced professionals comparing AI-generated reports against those created by their human counterparts. They were then asked to select the superior submission. In one example, investment bankers were prompted to develop a competitor landscape analysis for the last-mile delivery industry; their reports were then evaluated alongside those generated by AI. The model’s overall “win rate” is an average of its performance across all 44 professions.
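To make the scoring concrete, the sketch below shows one way such a macro-averaged “win rate” could be computed in Python. It is illustrative only: the treatment of ties as favorable outcomes, the equal weighting of each occupation, and all function and variable names are assumptions for this example, not OpenAI’s published methodology.

```python
# Illustrative sketch only: GDPval's exact scoring is not described in detail here.
# Assumes each occupation contributes a list of pairwise judgments, where a
# judgment is "win", "tie", or "loss" for the AI deliverable versus the human one.
# The "win rate" counts wins and ties (rated better than or equal to the expert),
# then macro-averages across occupations so each job weighs equally.

from typing import Dict, List


def occupation_win_rate(judgments: List[str]) -> float:
    """Fraction of comparisons where the AI report was rated >= the human report."""
    favorable = sum(1 for j in judgments if j in ("win", "tie"))
    return favorable / len(judgments)


def overall_win_rate(results: Dict[str, List[str]]) -> float:
    """Macro-average of per-occupation win rates (hypothetical aggregation)."""
    per_occupation = [occupation_win_rate(js) for js in results.values()]
    return sum(per_occupation) / len(per_occupation)


if __name__ == "__main__":
    # Hypothetical judgments for three of the 44 occupations.
    sample = {
        "investment banking": ["win", "loss", "tie", "loss"],
        "nursing": ["loss", "loss", "win", "loss"],
        "journalism": ["tie", "win", "loss", "loss"],
    }
    print(f"Overall win rate: {overall_win_rate(sample):.1%}")
```

Run on the hypothetical sample above, the script prints an overall win rate of 41.7%; the point is to show the aggregation mechanics, not to reproduce any real GDPval figure.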

The results indicated that GPT-5-high, a more computationally powerful variant of GPT-5, was rated as better than or equal to industry experts 40.6% of the time. Interestingly, Anthropic’s Claude Opus 4.1 achieved an even higher score, with a win rate of 49%. OpenAI theorizes that Claude’s strong performance may be partly attributable to its ability to create visually appealing graphics, which could influence human evaluators.

It is important to recognize that most professional roles involve a far wider range of activities than simply producing research reports, which is the sole focus of GDPval-v0. OpenAI acknowledges this limitation and has stated its intention to develop more robust future versions of the test that incorporate a greater variety of industries and interactive workflows.

Despite these limitations, the company views the progress as noteworthy. Discussing the results, OpenAI’s chief economist, Dr. Aaron Chatterji, suggested that professionals in these fields can increasingly hand routine tasks to AI, freeing their time for more complex, higher-value work that requires human judgment and creativity.

The rapid improvement is another key takeaway. OpenAI’s evaluations lead, Tejal Patwardhan, pointed out that the GPT-4o model, released approximately 15 months ago, achieved a win rate of just 13.7%. The fact that GPT-5’s performance has nearly tripled this figure indicates a swift pace of development that is expected to continue.

The AI industry relies on a variety of benchmarks to measure progress and determine state-of-the-art status. Popular examples include tests like AIME 2025 for complex math problems and GPQA Diamond for PhD-level science questions. However, many leading models are beginning to saturate these existing benchmarks, leading researchers to call for more practical evaluations that reflect real-world applications. Benchmarks like GDPval could therefore play an increasingly vital role in demonstrating AI’s practical utility across diverse sectors, though a more comprehensive assessment will ultimately be needed to make definitive claims about outperforming humans.

(Source: TechCrunch)

Topics

AI benchmarking, GDPval test, GPT-5 performance, human comparison, Claude Opus, industry applications, occupational testing, job automation, AGI development, benchmark limitations