Mythos Beats GPT-5.5 on Chrome Security Exploits

Originally published on: June 5, 2026
Summary

– Anthropic’s Claude Mythos outperformed OpenAI’s GPT-5.5 on ExploitBench, a new benchmark measuring real-world Chrome vulnerability exploitation, scoring 9.90 versus 5.51 on average.
– ExploitBench, launched by Bugcrowd with Carnegie Mellon University, scores staged exploitation outcomes up to arbitrary code execution, not just crash detection.
– Mythos exploited a one-day Chrome vulnerability about 50% of the time, a “lead-tier” result potentially worth up to $10,000, and found solutions missed by top human hackers.
– Experts caution that results are limited to specific Chrome vulnerabilities and should not be extrapolated; AI models are improving but not yet reliably capable of large-scale exploitation.
– Bugcrowd released ExploitBench alongside reinforcement learning environments to measure and improve model capability, urging defenders to develop AI-driven remediation at scale.

At Infosecurity Europe 2026, Bugcrowd unveiled the first results from ExploitBench, a new benchmark designed to test how frontier AI models handle real-world Chrome security exploits. The findings reveal that Anthropic’s Claude Mythos significantly outperformed OpenAI’s GPT-5.5 in exploiting vulnerabilities found in Google Chrome.

Developed in collaboration with Carnegie Mellon University and leading Chrome vulnerability researchers, ExploitBench launched in May 2026 as an independent, graded evaluation. David Brumley, chief AI and science officer at Bugcrowd, called it “the first independent benchmark that measures what AI models can actually do with a vulnerability, not just identify it but exploit it step by step.” Anthropic was among the first to participate.

In head-to-head testing, Mythos achieved a markedly higher exploitation score than GPT-5.5. The benchmark moves beyond simple crash-or-no-crash assessments, scoring progress through five stages of exploitation, up to arbitrary code execution, against a vulnerable build of V8, the JavaScript and WebAssembly engine powering Google Chrome, Microsoft Edge, Node.js, and Cloudflare Workers.

Mythos posted an average score of 9.90 out of 16 and reached the highest tier on 21 of 41 vulnerabilities, aided by occasional human hints or “nudges.” GPT-5.5 averaged 5.51 and reached the top tier on just two cases. “For example, Mythos is able to exploit a one-day vulnerability in Chrome about 50% of the time. This is lead-tier activity,” Brumley said, noting that Google could reward up to $10,000 for such a vulnerability with no previously known exploit. “Anthropic’s model is churning these out and actually found solutions for exploiting the flaws that even top-tier hackers missed.”
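To make the two headline figures concrete, here is a minimal sketch of how an average score and a top-tier count could be derived from per-vulnerability scores. The article does not describe how ExploitBench maps stages to points, so the `summarize` helper, the assumption that "top tier" means a full 16 points, and the sample values are all hypothetical and are not ExploitBench data.

```python
from statistics import mean

MAX_SCORE = 16  # the article reports scores out of 16


def summarize(scores, top_tier_threshold=MAX_SCORE):
    """Return (average score, number of cases at or above the threshold).

    Assumption: a case counts as "top tier" when it reaches the threshold,
    which defaults to the maximum score. The real benchmark may define
    tiers differently.
    """
    if not scores:
        raise ValueError("need at least one score")
    if any(s < 0 or s > MAX_SCORE for s in scores):
        raise ValueError("scores must lie between 0 and MAX_SCORE")
    top_tier_cases = sum(1 for s in scores if s >= top_tier_threshold)
    return mean(scores), top_tier_cases


# Made-up scores for four hypothetical vulnerabilities.
example_scores = [16, 4, 16, 8]
average, top_tier = summarize(example_scores)
print(f"average {average:.2f}/{MAX_SCORE}, "
      f"top tier on {top_tier} of {len(example_scores)}")
```

The sketch also shows why the two numbers are reported together: an average alone does not reveal how many cases reached full exploitation, since partial progress on many cases can produce the same mean as full success on a few.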

Brumley added that while GPT-5.5’s performance is currently lower, its broader availability opens opportunities for more people to use it for exploit development.

AI models are closing the gap with elite human researchers, but experts urge caution. ExploitBench measures staged exploitation rather than superficial signals, a distinction critical for assessing real-world capability. Models that can reliably exploit zero-day flaws lower the barrier for threat actors to weaponize vulnerabilities. Dave Gerry, Bugcrowd CEO, warned that automation and AI are already being integrated into attacker workflows, increasing the pace at which discovered flaws become active exploits.

However, Brumley cautioned against overgeneralization. “We measured a very sophisticated target application. Chrome is made of hundreds of thousands of lines of code, it’s been audited for years. It doesn’t necessarily mean we would get the same results trying to exploit a vulnerability in a web application.”

Michael Price, VP of product engineering at VulnCheck, noted that while AI models are improving, they are not yet fully capable of reliable exploitation at scale. Citing a recent UK AI Security Institute report on Mythos, Price explained that the most significant advance is in planning ability: producing step-by-step plans, replanning as needed, and executing multi-stage actions. “They’re getting better, but they still are not actually like that great,” he said. “I would expect them every month or every quarter to get 1% better and probably over the course of two or four years they get really good.”

Bugcrowd released ExploitBench alongside reinforcement learning environments to both measure and improve model capability. Gerry emphasized that the benchmark and training environments are complementary: one drives measurement, the other drives improvement through targeted RL training with industry partners.

Company leaders urged defenders to match offensive speed with automated remediation and prioritization. Gerry said the shrinking “zero-day clock” and the surge in AI-assisted discovery mean organizations must develop AI-driven remediation at scale. “Finding more bugs faster only amplifies the noise unless you can automatically prioritize and act on the ones that actually enable exploits,” he said.

Brumley echoed that urgency, calling for contextual intelligence to prioritize and remediate the most critical vulnerabilities before adversaries exploit them. This requires models trained not just to find flaws but to recommend and, where safe, initiate fixes at scale, allowing human developers to focus on the highest-risk work. “Over the coming months, we will have announcements on that, with tools focusing on helping give people intelligence about how certain vulnerabilities are affecting them,” he said.

(Source: Infosecurity Magazine)
