
GPT-5.4 Shatters Professional Benchmark Records

Originally published on: March 6, 2026
Summary

– OpenAI has released GPT-5.4, a major new model featuring native computer use, a 1-million-token context window, and a new tool-calling system for efficiency.
– The model shows significant benchmark improvements, such as matching or exceeding professionals in 83% of knowledge work tasks and achieving a 75% success rate on desktop navigation.
– A key caveat is that while it leads on certain benchmarks, no model is close to professional-grade reliability on complex, multi-step tasks, as highlighted by the APEX-Agents results.
– The release includes a new open-source safety evaluation, CoT Controllability, designed to test if models can hide their reasoning, with GPT-5.4 showing low ability to do so.
– The launch occurs during intense competition, with rivals like Anthropic and Google holding advantages in other areas, and OpenAI’s rapid release pace aims to maintain visibility in the news cycle.

The latest iteration of OpenAI’s flagship model, GPT-5.4, represents a significant leap in capability, particularly for professional applications. Announced just days after the release of GPT-5.3 Instant, this new model arrives during a period of intense competition and corporate scrutiny for the company. It is being promoted as the most advanced and efficient frontier model tailored for serious work, available in three distinct configurations: a standard version, a specialized GPT-5.4 Thinking model for complex reasoning, and a high-performance GPT-5.4 Pro tier.

Initial performance data is compelling. On OpenAI’s internal GDPval benchmark, which assesses knowledge work across 44 professions, GPT-5.4 matched or surpassed industry experts in 83% of evaluations, a notable increase from the previous model’s 70.9%. Perhaps more striking is its performance on the OSWorld-Verified test for desktop computer navigation, where it achieved a 75% success rate, exceeding the reported human benchmark. The model also claims top marks on the Mercor APEX-Agents benchmark for sustained professional tasks.

Beyond raw scores, the update introduces transformative new features. A cornerstone is native computer use, allowing the model to directly interact with software, file systems, and applications through Codex and the API. This built-in capability simplifies the development of automated workflows. Furthermore, the API now supports an immense 1-million-token context window, enabling the processing of vast documents, codebases, or financial records in a single session. Developers will also appreciate the efficiency gains from a reworked tool-calling system, which in testing reduced token usage by nearly half for complex tasks.
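The reworked tool-calling system described above follows the function-tool pattern already established in OpenAI's chat-completions API. As a rough sketch of what a request might look like (the model name is taken from the article, and the `search_files` tool is purely illustrative, not part of any published SDK), a client could shape its payload like this:

```python
# Sketch of a tool-calling request in the chat-completions "tools" format.
# The model name "gpt-5.4" comes from the article; the search_files tool
# and its schema are hypothetical examples.

import json

# A single function tool, described with a JSON-schema parameter block.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_files",  # hypothetical tool name
        "description": "Search the local file system for matching documents.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search terms"},
                "max_results": {"type": "integer", "minimum": 1},
            },
            "required": ["query"],
        },
    },
}

# The request body a client would send. Actually dispatching it requires
# an SDK client and API credentials, which are omitted here.
request_body = {
    "model": "gpt-5.4",  # model name as reported in the article
    "messages": [
        {"role": "user", "content": "Find last quarter's invoices."}
    ],
    "tools": [search_tool],
}

print(json.dumps(request_body, indent=2))
```

Efficiency gains like the reported ~50% token reduction would come from how the model plans and batches calls against such tool definitions, not from any change a developer needs to make to this payload shape.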

However, interpreting these benchmarks requires nuance. While leading the APEX-Agents ranking is an achievement, the benchmark’s own creators have noted that even top models currently perform like an intern who “gets it right a quarter of the time.” This highlights that while GPT-5.4 shows impressive progress on specific deliverables, achieving reliable, end-to-end professional workflow automation remains a work in progress. The company has also introduced a new open-source safety evaluation, CoT Controllability, to assess whether reasoning models can hide their internal thought processes, reporting that GPT-5.4 Thinking shows a low propensity for such evasion.

The launch occurs in a fiercely competitive landscape. Anthropic’s Claude Opus 4.6 maintains an edge in certain coding areas, while Google’s Gemini 3.1 Pro offers a larger context window at a competitive price. GPT-5.4 appears to carve out its lead in desktop interaction and professional knowledge tasks. The blistering release cadence of two major models in one week signals a strategic push to dominate the news cycle and continuously refresh the company's offering. Whether this rapid-fire approach will cement long-term enterprise loyalty or simply contribute to an exhausting cycle of fleeting benchmark advantages is the pivotal question moving forward.

(Source: The Next Web)

Topics

model release, benchmark performance, computer use, context window, tool calling, ai safety, market competition, release cadence, enterprise adoption, hallucination reduction