The ARC Prize Foundation has released its latest artificial general intelligence benchmark, ARC-AGI-3, delivering a stark reality check to the AI industry. The results reveal a vast gap between current AI capabilities and human-like reasoning: the best-performing frontier models score below 1%, while ordinary human testers achieve 100%.
The benchmark, created by the foundation co-founded by François Chollet and Mike Knoop, poses a fundamentally different challenge from its predecessors. Instead of testing memorized patterns or trained knowledge, it drops AI agents into 135 original, interactive game-like environments with no instructions, stated goals, or rule descriptions. The agent must explore, deduce the objective, formulate a plan, and execute it, a task any young child manages intuitively; a sketch of that loop appears below.
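To make the task concrete, here is a minimal sketch of that explore-deduce-plan-execute loop. Everything in it is illustrative: the `ToyEnv` class, the action names, and the `run_episode` interface are invented for this article and bear no relation to the actual ARC-AGI-3 API.

```python
import random

AVAILABLE_ACTIONS = ["up", "down", "left", "right", "select"]

class ToyEnv:
    """Stand-in environment: reach cell 5 on a 1-D track. The real
    ARC-AGI-3 environments are far richer; this exists only so the
    loop below runs."""
    def reset(self):
        self.pos = 0
        return {"pos": self.pos}

    def step(self, action):
        self.pos += {"right": 1, "left": -1}.get(action, 0)
        return {"pos": self.pos}, self.pos >= 5

def choose_action(frame, history):
    # Placeholder policy: a real agent must infer rules and goals from
    # its interaction history; blind exploration like this is exactly
    # what the efficiency metric punishes.
    return random.choice(AVAILABLE_ACTIONS)

def run_episode(env, max_actions=1000):
    """Play one unseen environment with no instructions or stated goal."""
    frame = env.reset()          # initial observation, meaning unknown
    history = []                 # (action, frame) pairs for hypothesis-building
    for _ in range(max_actions):
        action = choose_action(frame, history)
        frame, done = env.step(action)   # observe how the world responds
        history.append((action, frame))
        if done:                         # objective discovered and achieved
            return len(history)          # action count feeds the score
    return None                          # never inferred the objective

print(run_episode(ToyEnv()))
```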
Leading AI models performed dismally. Google's Gemini 3.1 Pro led the pack with a score of just 0.37%. OpenAI's GPT-5.4 scored 0.26%, Anthropic's Claude Opus 4.6 managed 0.25%, and xAI's Grok-4.20 scored exactly zero.

These results stand in brutal contrast to the recent hype from industry leaders. Just days before the benchmark's release, Nvidia CEO Jensen Huang stated on Lex Fridman's podcast, "I think we've achieved AGI." OpenAI's Sam Altman has claimed the company has "basically built AGI," and Microsoft is promoting a lab dedicated to building Artificial Superintelligence (ASI).
ARC-AGI-3 was specifically engineered to resist benchmark saturation through brute-force training. Of the 135 environments, 110 are kept private (55 semi-private for API testing, 55 fully locked for competition), leaving no dataset for models to memorize. Scoring uses a strict metric called Relative Human Action Efficiency (RHAE), which heavily penalizes inefficiency, backtracking, and guessing: an AI that takes ten times as many actions as a human scores just 1% for that level, a far steeper penalty than the raw action ratio alone would produce.
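That single data point, ten times the human action count yielding 1%, is consistent with a squared efficiency ratio; a plain ratio would give 10%. The function below is an assumption constructed to match the stated figure, not the formula from the ARC paper.

```python
# Assumed RHAE form: (human_actions / agent_actions)^2, capped at 100%.
# Chosen only because it reproduces the article's 10x-actions -> 1% example.
def rhae(human_actions: int, agent_actions: int) -> float:
    if agent_actions <= 0:
        return 0.0
    return min(1.0, (human_actions / agent_actions) ** 2) * 100

print(rhae(20, 200))   # 10x the human's actions -> 1.0 (%), matching the article
print(rhae(20, 20))    # human-level efficiency -> 100.0 (%)
```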
The foundation acknowledges a methodological debate: the benchmark feeds agents game state as raw JSON rather than rendered visual input. A custom harness from Duke University reportedly pushed Claude Opus 4.6's performance on a single environment variant (TR87) from 0.25% to 97.1%, suggesting that perception format may influence results. The ARC paper, however, contends that "frame content perception and API format are not limiting factors," arguing the core gap lies in reasoning and generalization, not perception.
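To illustrate what the two sides are arguing over, the snippet below contrasts a hypothetical JSON frame with the grid a vision-based harness would render from it. The field names and grid encoding are invented for this example; the real ARC-AGI-3 schema is not given in the article.

```python
import json

# Invented frame: roughly the kind of structured payload the debate concerns.
frame = json.loads("""
{
  "grid": [[0, 0, 3],
           [0, 3, 0],
           [3, 0, 0]],
  "score": 0,
  "state": "running"
}
""")

# Rendered view, approximately what a vision-based harness would show a model:
for row in frame["grid"]:
    print("".join("#" if cell else "." for cell in row))
# ..#
# .#.
# #..
```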
The ARC Prize 2026 competition is now open, offering $2 million across three tracks hosted on Kaggle. A key requirement is that all winning solutions must be open-sourced. The results serve as a powerful counter-narrative to commercial AGI claims, suggesting current systems are sophisticated pattern-matching tools but lack the fundamental, flexible reasoning that defines true general intelligence.