OpenAI Launches GPT-5.5, First Full Retrain Since GPT-4.5

Summary
– OpenAI released GPT-5.5, its first fully retrained base model since GPT-4.5, designed to complete complex multi-step tasks with minimal human direction across applications like email and spreadsheets.
– GPT-5.5 sets new benchmarks in agentic coding, computer use, and knowledge work, scoring 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, and 84.9% on GDPval.
– The model matches GPT-5.4’s per-token latency while using fewer tokens to complete tasks, improving cost efficiency for enterprise customers despite a higher per-token price.
– API access for GPT-5.5 is delayed pending additional safety work, with the model currently available only in ChatGPT and Codex for paid subscribers.
– The launch is a direct response to Anthropic’s Claude gaining enterprise market share, as OpenAI aims to win back the B2B segment with a model focused on autonomous task completion.
For months, the unspoken reality across the AI industry has been clear: Anthropic’s Claude was steadily capturing the enterprise market. Internal sources at OpenAI described a “Code Red” situation dating back to at least December 2025, as Anthropic’s annual recurring revenue surged from $9 billion to $30 billion, eroding OpenAI’s once-dominant B2B positioning. On Thursday, the company fired back with a decisive response: GPT-5.5.
This new model, internally codenamed “Spud,” represents OpenAI’s first full retraining of a base model since GPT-4.5. It is rolling out immediately to Plus, Pro, Business, and Enterprise users within ChatGPT and Codex. The central ambition behind GPT-5.5 is autonomy: the ability to handle complex, multi-step tasks with minimal human oversight. Instead of requiring meticulously structured prompts and constant supervision, OpenAI claims GPT-5.5 can take a “messy, multi-part task,” independently plan, use tools, verify its own work, navigate uncertainty, and persist until completion.
The performance improvements are concentrated in four critical domains: agentic coding, computer use, knowledge work, and early scientific research. OpenAI describes these as areas “where progress depends on reasoning across context and taking action over time,” signaling a shift from simple question-answering to autonomous execution.
The benchmark data is compelling. On Terminal-Bench 2.0, which evaluates complex command-line workflows involving planning and tool coordination, GPT-5.5 scores 82.7%. On SWE-Bench Pro, a test of real-world GitHub issue resolution across four programming languages, it achieves 58.6%, solving more tasks in a single pass than any predecessor. The model reaches 84.9% on GDPval, which assesses agents across 44 knowledge-work occupations, and 78.7% on OSWorld-Verified, a measure of autonomous computer environment operation. On Tau2-bench Telecom, it hits 98.0% without any prompt tuning. Across all these metrics, OpenAI says GPT-5.5 surpasses GPT-5.4’s scores while using fewer tokens.
That efficiency claim carries significant commercial weight. Larger, more capable models typically serve more slowly, forcing enterprises into a cost-quality trade-off. OpenAI asserts that GPT-5.5 matches GPT-5.4’s per-token latency in real-world deployment, meaning users get a substantial intelligence upgrade without slower response times. The model also uses significantly fewer tokens to complete equivalent tasks in Codex, directly lowering the cost per task. While GPT-5.5 is priced higher per token than GPT-5.4, OpenAI argues the net effect is better results at lower total cost in most workflows.
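The arithmetic behind that argument is worth making explicit: a higher per-token price can still produce a lower total cost per task if the model finishes in fewer tokens. The sketch below illustrates the trade-off with entirely hypothetical figures (neither the prices nor the token counts are published OpenAI numbers):

```python
# Illustrative cost-per-task arithmetic. All figures are hypothetical,
# chosen only to show how a pricier-per-token model can be cheaper per task.

def cost_per_task(price_per_million_tokens: float, tokens_used: int) -> float:
    """Total cost of completing one task at a given per-token price."""
    return price_per_million_tokens * tokens_used / 1_000_000

# Assumed numbers for the sake of the example:
# older model: cheaper per token, but needs more tokens per task
old_model = cost_per_task(price_per_million_tokens=10.0, tokens_used=120_000)  # $1.20
# newer model: 40% pricier per token, but finishes in far fewer tokens
new_model = cost_per_task(price_per_million_tokens=14.0, tokens_used=70_000)   # $0.98

assert new_model < old_model  # pricier per token, yet cheaper per task
```

Whether the net effect actually favors the newer model depends entirely on how large the token savings are relative to the price increase for a given workflow.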
The safety narrative surrounding this launch is notably more cautious. OpenAI says it evaluated GPT-5.5 across its “full suite of safety and preparedness frameworks,” collaborating with internal and external red-teamers, adding targeted testing for advanced cybersecurity and biology capabilities, and gathering feedback from nearly 200 trusted early-access partners before release. The caution is most visible in cybersecurity: OpenAI describes deploying “stricter classifiers for potential cyber risk which some users may find annoying initially,” acknowledging that GPT-5.5 represents a meaningful leap in cyber capability and framing the enhanced safeguards as a necessary investment in responsible deployment.
Conspicuously absent from the launch is API access. GPT-5.5 is available now in ChatGPT and Codex for paid subscribers, but the company states that API deployments “require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale.” OpenAI promises API availability “very soon” but offers no specific date. For enterprise customers who build on the API rather than the ChatGPT interface, this delay is significant. A variant called GPT-5.5 Pro, which includes extended reasoning, is limited to Pro, Business, and Enterprise subscribers.
Every design decision in GPT-5.5 reflects the competitive landscape. OpenAI is building its unified desktop “super-app” around this model, merging ChatGPT, Codex, and the Atlas browser agent into a single session. GPT-5.5 is engineered to power intent-aware reasoning within that unified workspace, a product category that barely existed six months ago. As a legacy option, GPT-5.2 Thinking will remain available for three months before being retired on 5 June 2026.
The breakneck pace of model releases (GPT-5, 5.1, 5.2, 5.3-Codex, 5.4, and now 5.5 in under a year) underscores both the speed of AI development and the intensity of competition from Anthropic, Google, and the open-source ecosystem. OpenAI makes no effort to hide its target. Bloomberg’s framing, a model intended to “keep pace with rivals like Anthropic,” captures the situation precisely.
GPT-5.5 is the clearest signal yet that OpenAI has fully absorbed the threat posed by Claude’s enterprise market share. It represents a concerted effort to reclaim the B2B segment with a model that can genuinely work, not merely answer questions. Whether it succeeds will depend on whether those benchmark gains hold up in real-world production workflows, whether the API arrives before enterprise customers finalize their next procurement decisions, and whether “Spud” can deliver on its promises when the prompts are messy and the tasks are real.
(Source: The Next Web)




