Claude writes 80% of Anthropic’s code as company calls for AI pause

Summary
– As of May 2026, Claude writes over 80% of Anthropic’s production code, and the typical engineer ships eight times more code per day than in 2024.
– On complex engineering problems, Claude’s success rate rose to 76% in May 2026, and Anthropic staff say the quality of its code is now at rough parity with human-written code.
– Claude demonstrated the ability to run an open-ended AI safety research project end to end, recovering 97% of the performance gap with minimal human input.
– The length of tasks AI can reliably complete alone is doubling roughly every four months, with current models handling 12- to 16-hour tasks.
– Anthropic’s paper calls for a verifiable global mechanism to slow or temporarily pause frontier AI development, while acknowledging that enforcement would be difficult and the incentive to defect strong.
In May 2026, one engineer at Anthropic hadn’t typed a single line of code in five months. The reason isn’t a lack of work. It’s that Claude now writes over 80% of the production code merged into Anthropic’s own codebase. That marks a staggering leap from low single digits when Claude Code launched in early 2025. The company published these findings Wednesday in a new paper from the Anthropic Institute, titled “When AI builds itself.” But the headline isn’t the productivity surge. It’s what comes next: AI that can design and train its own successor. Anthropic insists it isn’t there yet, but warns it may be closer than most institutions are ready for.
The raw numbers paint a clear picture. In the second quarter of 2026, the typical Anthropic engineer merged eight times more code per day than in 2024. An internal poll of 130 research staff found the median respondent estimated roughly four times the output when using Anthropic’s latest model, Mythos Preview, compared to working without AI. On the most complex, open-ended engineering challenges, Claude’s success rate hit 76% in May 2026, a 50-percentage-point jump in just six months. One example: when a routine upgrade began crashing tens of thousands of training jobs, an engineer gave Claude live incident context and cluster access. Claude isolated an obscure debugging flag, reproduced the crash, and confirmed a fix in about two hours. That task normally takes two to three days.
Code quality is converging, too. Anthropic staff say Claude-written code was “somewhat worse” than human code in late 2025, is at rough parity today, and is expected to be strictly better within the year. An automated Claude reviewer now checks every proposed change to Anthropic’s codebase before it can merge. A retrospective analysis found it would have caught roughly a third of the bugs behind past claude.ai incidents before they reached production.
Writing code is the easy part. The harder question is whether Claude can do research: the open-ended scientific reasoning that drives AI forward. Anthropic’s evidence here is more preliminary but still striking. In April 2026, the company demonstrated Claude running an end-to-end AI safety research project. Nine parallel agents were given a problem, left to propose hypotheses, run experiments, share findings through a common forum, and iterate. Over 800 cumulative hours and roughly $18,000 in compute, the agents recovered 97% of the performance gap on the task. Two human researchers, working for a week, recovered 23%. Another experiment measured whether Claude could pick a better “next step” than a human researcher at difficult junctures. In November 2025, Claude matched the human’s judgment 51% of the time. By April 2026, that rose to 64%. The day-to-day work of research is largely a chain of these next-step decisions. If that trend continues, the gap between AI-as-assistant and AI-as-researcher narrows fast.
Anthropic’s internal data aligns with a broader pattern tracked by METR, a non-profit that benchmarks AI capabilities. The length of tasks AI can reliably complete on its own has been doubling roughly every four months, accelerating from an earlier pace of every seven months. In March 2024, Claude Opus 3 could handle tasks that take a human about four minutes. By early 2025, Claude Sonnet 3.7 managed hour-and-a-half tasks. Today, Claude Opus 4.6 handles 12-hour tasks, and METR found that Mythos Preview could sustain work for at least 16 hours, at the upper end of what the current benchmark suite can measure. If the trend holds, tasks requiring days of skilled human work come into range this year. Weeks-long tasks could follow in 2027.
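The extrapolation behind those last two sentences is plain compound doubling, and it can be sketched in a few lines of Python. This is an illustrative calculation, not METR’s methodology. The function names are hypothetical, the 16-hour starting point and four-month doubling time are taken from the figures above, and the 40-hour and 160-hour targets are assumptions standing in for roughly one and four working weeks of human effort.

```python
import math

DOUBLING_MONTHS = 4.0  # doubling time cited in the article
START_HOURS = 16.0     # Mythos Preview task horizon cited in the article


def horizon_after(months, start_hours=START_HOURS, doubling_months=DOUBLING_MONTHS):
    """Task horizon in hours after `months`, assuming steady doubling."""
    return start_hours * 2 ** (months / doubling_months)


def months_until(target_hours, start_hours=START_HOURS, doubling_months=DOUBLING_MONTHS):
    """Months of steady doubling needed to reach `target_hours`."""
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    # Assumed targets: 40 h (one working week) and 160 h (about four working weeks).
    for label, hours in [("one working week", 40.0), ("four working weeks", 160.0)]:
        print(f"{label}: about {months_until(hours):.1f} months")
```

Under those assumptions, a 40-hour task is a little over five months away and a 160-hour task a little over 13 months away, which is consistent with the timeline of days-long tasks this year and weeks-long tasks in 2027. The result is sensitive to the doubling time: at the earlier seven-month pace, the same targets would take proportionally longer.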
The downstream effects are already visible. GitHub, the platform most of the world’s software is built on, saw roughly one billion code commits in all of 2025. By mid-2026, the platform was processing 275 million commits per week, on pace for 14 billion over the year. Claude Code alone accounts for 4.5% of all public commits on GitHub, generating 2.6 million weekly. GitHub’s COO has said the company is “pushing incredibly hard” on capacity just to keep up. Inside Anthropic, the bottleneck has already shifted. As Claude generates more code, human code review has become the constraint. The company says it has encountered a textbook example of Amdahl’s law, where speeding up one part of a process simply reveals the next slowest link.
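Amdahl’s law puts a number on that shift. If a fraction p of the total work is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies the formula to a hypothetical split between writing and reviewing code. The 50% share of time spent writing is an assumption for illustration, not a figure from Anthropic’s paper; the eightfold factor is borrowed loosely from the merged-code figure cited earlier; and the function name is made up for this example.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when `accelerated_fraction` of the work runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # The remaining (1 - accelerated_fraction) of the work is not sped up at all.
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


if __name__ == "__main__":
    # Hypothetical: writing code is half of the work and becomes 8x faster;
    # review is the other half and stays at human speed.
    print(f"overall speedup: {amdahl_speedup(0.5, 8.0):.2f}x")
```

With half the work accelerated eightfold, the whole process speeds up by less than a factor of two, and no amount of further acceleration can push it past two. The untouched half, here standing in for human review, sets the ceiling.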
The paper’s most significant section is not about productivity. It is a call for a verifiable global mechanism to slow or temporarily pause frontier AI development. Anthropic is careful with the framing. A unilateral pause by one lab would simply change who leads, not create the deliberative process the company says is missing. What Anthropic proposes instead is a system where multiple frontier labs, in multiple countries, could agree to stop under the same conditions and verify that the others had actually done so. It draws a parallel to nuclear arms control but acknowledges the differences: training runs are far easier to conceal than missile silos, the inputs are general-purpose, and the incentive to defect quietly is enormous. “If it were possible to effectively slow the development of this technology to give ourselves more time to deal with its immense implications, we think that would likely be a good thing,” the paper states. The AI coding market is now worth tens of billions. Asking the industry to pause is asking it to leave money on the table while trusting that competitors, including those in China, will do the same.
The paper lays out three possible futures. In the first, the trend stalls, but even today’s capabilities reshape the economy. In the second, AI development becomes substantially automated while humans still set research direction, meaning 100-person companies could do the work of 100,000-person organisations. In the third, AI systems achieve full recursive self-improvement and begin designing their own successors. Anthropic says it does not have “good intuitions” for what that third scenario looks like. But it offers one observation: even recursive intelligence cannot speed up everything. It cannot learn what a drug does over decades of use, hold elections sooner than a constitution dictates, or turn a stranger into an old friend in a weekend. The felt pace of this future, for most people, would still be set by the bottlenecks.
The company’s growing enterprise push makes the timing of this paper notable. Anthropic is simultaneously selling Claude as a productivity revolution and warning that the trajectory it enables could require a global emergency brake. Whether that tension is principled transparency or strategic positioning depends on what happens next.
(Source: The Next Web)




