
AI Agents vs. Smart Contract Exploits: New Open-Source Benchmark

▼ Summary

– Smart contract exploits are a persistent threat due to the permanent, autonomous nature of EVM code, driving demand for better AI security tools.
– EVMbench is an open-source benchmark from OpenAI and Paradigm that tests AI agents on three practical security tasks: detecting, patching, and exploiting vulnerabilities.
– The benchmark uses a dataset of 120 real-world vulnerabilities from audits and contests, evaluated in containerized environments for reproducible, automated scoring.
– Results show AI models struggle with exploit and patching tasks, though exploit performance has improved significantly in recent model generations.
– EVMbench is freely available on GitHub to provide a consistent evaluation framework for researchers and security teams as AI capabilities advance.

The security of smart contracts remains a critical challenge for the blockchain ecosystem, with vulnerabilities in immutable code posing a significant financial risk. EVMbench emerges as a new, open-source framework designed to rigorously evaluate how effectively artificial intelligence systems can handle practical security tasks. Developed through a collaboration between OpenAI and Paradigm, this benchmark moves beyond theoretical tests by using real-world vulnerability data sourced from professional audits and security contests. Its creation addresses the pressing need for standardized, repeatable methods to assess AI agents that claim to enhance smart contract auditing and automated analysis.

This benchmark focuses on three core security functions: detection, patching, and exploitation. In the detect phase, an AI model reviews contract code to identify known vulnerabilities that have been previously documented by auditors. Performance is scored on recall: the model's ability to correctly flag these confirmed security flaws. The patch task requires the model to modify the vulnerable code, removing the exploit while ensuring the contract’s original functionality and tests still pass. For the exploit evaluation, the AI is placed in a sandboxed blockchain environment and must execute a working attack against a vulnerable contract, with success measured by verifying specific on-chain state changes, such as drained balances.
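The recall metric used for the detect task can be illustrated with a short sketch. The function and argument names below are assumptions for illustration, not EVMbench's actual API:

```python
# Hypothetical sketch of recall-based scoring for the detect task.
# `flagged` is the set of flaws the model reported; `known_vulns` is the
# set of auditor-confirmed flaws for that contract.

def detection_recall(flagged: set, known_vulns: set) -> float:
    """Fraction of confirmed vulnerabilities the model correctly flagged."""
    if not known_vulns:
        return 1.0  # nothing to find, trivially perfect recall
    true_positives = flagged & known_vulns
    return len(true_positives) / len(known_vulns)

# Example: the model flags 2 of 3 documented flaws.
score = detection_recall({"reentrancy", "integer-overflow"},
                         {"reentrancy", "integer-overflow", "access-control"})
print(round(score, 2))  # 0.67
```

Note that recall deliberately ignores false positives; it measures only whether the documented flaws were found, matching the article's description of scoring against previously confirmed vulnerabilities.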

The dataset powering EVMbench is built from 120 curated vulnerabilities across 40 real audits, primarily drawn from open competitions like Code4rena and Paradigm’s own Tempo audit process. Each case includes the full vulnerable contract and the necessary infrastructure to recreate the scenario, ensuring tasks reflect the complexity of real development conditions. This approach forces AI agents to reason about contract interactions and state changes, a step up from simpler, synthetic vulnerability tests.
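One way to picture the shape of a single benchmark case is a small record bundling the contract, its setup infrastructure, and the documented flaws. The field names below are hypothetical, chosen only to mirror the components the article describes, not EVMbench's actual schema:

```python
# Illustrative-only sketch of what one curated case might carry.
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    case_id: str          # identifier for the audit finding
    source: str           # e.g. "code4rena" or "tempo" (Paradigm's audit process)
    contract_paths: list  # full vulnerable contract(s), not an isolated snippet
    setup_script: str     # recreates the deployed scenario for the agent
    known_vulns: list     # auditor-documented flaws used for grading

case = BenchmarkCase(
    case_id="c4-example-001",
    source="code4rena",
    contract_paths=["contracts/Vault.sol"],
    setup_script="setup.sh",
    known_vulns=["reentrancy"],
)
```

Carrying the full contract plus deployment infrastructure, rather than a stripped-down snippet, is what forces agents to reason about cross-contract interactions and state, as the article notes.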

Grading for exploit tasks is handled through deterministic replay in a controlled local EVM instance. Agents can deploy contracts and call functions, with an automated harness verifying success based on concrete outcomes like contract balances. This objective, automated scoring allows for consistent and repeatable evaluation across different models and test runs, providing a reliable basis for comparison.
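The key idea in this grading scheme is that success is judged by concrete end-state, not by reading the agent's transcript. A minimal sketch, assuming a hypothetical local-EVM handle with a `balance_of` call (the real harness runs containerized and is more involved):

```python
# Stand-in for a local EVM instance used during deterministic replay.
class FakeEVM:
    def __init__(self, balances):
        self.balances = dict(balances)

    def balance_of(self, addr: str) -> int:
        return self.balances.get(addr, 0)

def exploit_succeeded(evm, victim: str, attacker: str,
                      victim_balance_before: int) -> bool:
    """Outcome-based check: did funds actually leave the victim contract?"""
    drained = evm.balance_of(victim) < victim_balance_before
    gained = evm.balance_of(attacker) > 0
    return drained and gained

# After replaying the agent's transactions, inspect the resulting state.
post_state = FakeEVM({"victim_contract": 0, "attacker": 100})
print(exploit_succeeded(post_state, "victim_contract", "attacker", 100))  # True
```

Because the replay is deterministic, the same agent transcript always produces the same final balances, which is what makes scores comparable across models and runs.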

Initial results from applying the benchmark reveal significant performance gaps. Exploit tasks prove particularly difficult for many AI systems, even when they can identify vulnerabilities at a surface level. Notably, Paradigm reported rapid improvement in this area across recent model generations. A partner at the firm, Alpin Yukseloglu, stated that while top models initially exploited less than 20% of critical bugs, newer iterations like GPT-5.3-Codex now succeed in over 70% of cases. However, patching remains a major weakness: fixing a flaw without breaking intended behavior across all edge cases requires a deep understanding of the code’s design assumptions.

EVMbench is publicly available on GitHub, complete with tasks, tooling, and documentation. Its release aims to provide researchers and security teams with a consistent testing ground to measure progress as AI capabilities for blockchain security continue to evolve.

(Source: HelpNet Security)
