OpenAI, in collaboration with crypto investment firm Paradigm, has launched EVMbench, a benchmark designed to evaluate how well AI agents can identify, fix, and, in controlled settings, exploit vulnerabilities in Ethereum smart contracts. The initiative aims to measure AI performance in "economically meaningful environments" that mirror the real-world security challenges auditors and developers face before contract deployment.
The benchmark organizes tasks across three modes: detect (finding bugs), patch (proposing safe fixes), and exploit (demonstrating a working attack in a controlled setting). According to reports, recent testing shows a dramatic improvement in exploit capability: a model variant called GPT-5.3-Codex achieved a 72.2% success rate in exploit mode, a significant leap from the 31.9% recorded for its predecessor, GPT-5. Performance in the detect and patch categories continues to lag, however, indicating persistent gaps in defensive applications.
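The article does not specify how these tasks are represented, so the following Python sketch is only a hypothetical model of the three-mode split; the `Mode`, `Task`, and `score` names, their fields, and the string-match grading rule are illustrative assumptions rather than EVMbench's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    DETECT = "detect"    # find the bug
    PATCH = "patch"      # propose a safe fix
    EXPLOIT = "exploit"  # demonstrate the bug against a sandboxed chain

@dataclass(frozen=True)
class Task:
    task_id: str
    mode: Mode
    contract_source: str  # Solidity source handed to the agent (assumed field)
    answer_key: str       # ground truth used for objective scoring

def score(task: Task, agent_output: str) -> bool:
    """Toy pass/fail rule: a detect task passes if the agent names the
    vulnerable function recorded in the answer key. Patch and exploit
    modes would need execution-based checks (tests pass after the fix,
    funds drained in the sandbox) rather than string matching."""
    if task.mode is Mode.DETECT:
        return task.answer_key in agent_output
    raise NotImplementedError("patch/exploit grading requires execution")

# Example: a classic reentrancy finding graded against its answer key.
task = Task("reentrancy-01", Mode.DETECT, "contract Vault { ... }", "withdraw")
print(score(task, "Reentrancy in withdraw(): external call before state update"))  # True
```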
EVMbench is constructed from real vulnerabilities sourced from roughly 40 security audits, supplemented by custom, unreleased contract tasks. To ensure safety and reproducibility, agents run inside containerized sandboxes, and each task carries an answer key for objective scoring. The design aims to support apples-to-apples comparisons across AI models and versions over time while minimizing real-world risk.
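To make the sandbox-plus-answer-key design concrete, here is a minimal harness sketch under stated assumptions: the docker invocation, the /workspace file layout, and the output.json/answer_key.json convention are all hypothetical, not details confirmed by the article.

```python
import json
import subprocess
from pathlib import Path

def run_task_sandboxed(task_dir: Path, agent_image: str) -> bool:
    """Hypothetical harness: run an agent in an isolated container, then
    grade its result against the task's answer key on the host."""
    subprocess.run(
        [
            "docker", "run", "--rm",
            "--network=none",              # no outbound access keeps exploit payloads contained
            "-v", f"{task_dir.resolve()}:/workspace",
            agent_image,
        ],
        check=True,
        timeout=1800,                      # bound runtime so a stuck agent can't stall the suite
    )
    # Assumed convention: the agent writes its finding to /workspace/output.json,
    # and each task directory ships an answer_key.json.
    output = json.loads((task_dir / "output.json").read_text())
    answer = json.loads((task_dir / "answer_key.json").read_text())
    # Deterministic grading against a fixed key is what enables apples-to-apples
    # comparisons across models and versions.
    return output.get("finding") == answer.get("finding")
```

Cutting network access inside the container is one straightforward way to let exploit code execute against a local fork or in-memory EVM with no path to live chains.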
The launch coincides with OpenAI's broader $10 million commitment to cybersecurity research. Alpin Yukseloglu, a Partner at Paradigm, emphasized the significance of the benchmark, stating, "With $100B+ in assets sitting in open source crypto contracts, there's a real risk from AI agents capable of finding exploits. EVMbench is designed to measure what agents can do." He added that the results preview a structural shift, predicting that "a growing portion of audits in the future will be done by agents."
Independent research, including Anthropic's SCONE-bench, underscores the dual-use nature of this technology, showing that agents can autonomously generate exploit code that, in simulation, would account for millions of dollars in losses. The result points to a narrowing window for defenders between vulnerability disclosure and potential exploitation. Industry experts, such as those at OpenZeppelin, caution that while AI can handle many known vulnerability classes, it still struggles with novel or adversarial cases, meaning human-led review and governance will remain central to security workflows.