
Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

Summary

Researchers from the University of California, Berkeley, have created a new AI benchmark that tests models on real-world professional workflows.
OpenAI's GPT-5.5 has beaten Anthropic's Claude Fable 5 on this challenging test.
The Agents' Last Exam (ALE) is a new AI benchmark designed to measure a model's ability to perform economically valuable, long-term tasks.
OpenAI's GPT-5.5 has topped the leaderboard with a 24% pass rate, beating Anthropic's Claude Fable 5.
The ALE benchmark is unique because it forces models to interact with real-world tools and software, rather than just answering static questions.
This is a significant departure from traditional AI benchmarks, which have been criticized for being too easy or allowing models to "cheat" by reading hidden answer keys.
The ALE benchmark consists of 1,490 task instances across 55 non-physical industry sub-domains, sourced from the professional histories of industry practitioners.
Agents must use their "Eyes" and "Hands" to navigate Linux or Windows virtual machines, interleaving shell scripting with point-and-click operations.
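To put the headline figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the 24% pass rate is a simple fraction of all 1,490 task instances, which the summary does not state explicitly; the names used are illustrative only.

```python
# Assumption: the pass rate is computed per task instance over the full set.
TOTAL_TASKS = 1490   # task instances reported for ALE
PASS_RATE = 0.24     # GPT-5.5's reported pass rate


def estimated_passes(total: int, rate: float) -> int:
    """Estimate how many task instances were passed, given a pass rate."""
    return round(total * rate)


passed = estimated_passes(TOTAL_TASKS, PASS_RATE)
failed = TOTAL_TASKS - passed
print(f"roughly {passed} passed, roughly {failed} failed")
```

Under that assumption, even the top-ranked model would leave well over a thousand task instances unsolved.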

Why it matters

The test highlights the limitations of current AI models, which still struggle with real-world tasks.
The results suggest that AI models are not yet ready to replace humans in professional workflows.
This has significant implications for industries that rely on automation, such as healthcare, finance, and education.
The ALE benchmark also raises questions about the fairness and accuracy of AI evaluation methods, and the need for more robust and transparent testing frameworks.

Explanation

Key technical terms from this story:

Generalist Computer-Use Agent (GCUA): An AI agent that can carry out a wide range of tasks on a computer, much as a human user would. It is not a specialized tool for one task but a general-purpose agent that interacts with software flexibly.

Functional layers: The ALE benchmark maps an agent's capabilities across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). An agent must therefore combine several abilities, from reasoning and visual perception to invoking tools and operating within its runtime environment.

Deterministic evaluation: The ALE benchmark grades many tasks deterministically, meaning the same agent output always receives the same score. This contrasts with the "LLM-as-a-judge" grading paradigm, in which a language model scores the output and its verdicts can vary from run to run.
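The idea of deterministic grading can be illustrated with a minimal sketch. This is not ALE's actual grader: the `TaskResult` type, the `check_report` rule, and the file name `report.csv` are all hypothetical, chosen only to show how a programmatic check yields the same verdict every time.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of what an agent left behind after a task."""
    task_id: str
    output_files: dict[str, str]  # file name -> file contents


def check_report(result: TaskResult) -> bool:
    """Hypothetical deterministic check: pass only if report.csv
    exists and contains the exact expected row."""
    content = result.output_files.get("report.csv", "")
    return "total,1200" in content.splitlines()


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of results that pass the deterministic check."""
    if not results:
        return 0.0
    return sum(check_report(r) for r in results) / len(results)


results = [
    TaskResult("t1", {"report.csv": "item,amount\ntotal,1200"}),
    TaskResult("t2", {"report.csv": "item,amount\ntotal,900"}),
]
print(pass_rate(results))
```

Because the check depends only on the files produced, rerunning it on the same output cannot change the score, which is the property that separates it from a judge model.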

