New AI University · Jobs Simplified

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

Summary

  • TITLE: GPT-5.5 Beats Claude Fable 5 in AI Benchmark Challenge
  • HOMEPAGE: Researchers from the University of California, Berkeley, have created a new AI benchmark that tests models on real-world professional workflows.
  • OpenAI's GPT-5.5 has beaten Anthropic's Claude Fable 5 on this challenging test.
  • SUMMARY: The Agents' Last Exam (ALE) is a new AI benchmark designed to measure a model's ability to perform economically valuable, long-horizon tasks.
  • OpenAI's GPT-5.5 has topped the leaderboard with a 24% pass rate, beating Anthropic's Claude Fable 5.
  • The ALE benchmark is unique because it forces models to interact with real-world tools and software, rather than just answering static questions.
  • This is a significant departure from traditional AI benchmarks, which have been criticized for being too easy or allowing models to "cheat" by reading hidden answer keys.
  • The ALE benchmark consists of 1,490 task instances across 55 non-physical industry sub-domains, sourced from the professional histories of industry practitioners.
  • Agents must use their "Eyes" and "Hands" to navigate Linux or Windows virtual machines, interleaving shell scripting with point-and-click operations.
  • WHY IT MATTERS: The results highlight the limits of current AI models on real-world tasks: even the top-scoring model passed only 24% of them.
  • The ALE benchmark suggests that AI models are not yet ready to replace humans in professional workflows.
  • This has significant implications for industries looking to automate professional work, such as healthcare, finance, and education.
  • The ALE benchmark also raises questions about the fairness and accuracy of AI evaluation methods, and the need for more robust and transparent testing frameworks.
  • EXPLANATION: Let's break down some key technical terms from this story.
  • Generalist Computer-Use Agent (GCUA): A GCUA is an AI agent that operates a computer to carry out a wide range of tasks, much as a human professional would.
  • It's not a specialized tool for one specific task, but a general-purpose agent that can work across different tools and software in a flexible way.
  • Functional layers: The ALE benchmark maps an agent's capabilities across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate).
  • This means that an agent must combine a range of abilities, from reasoning and visual perception to invoking tools and running inside a live software environment.
  • Deterministic evaluation: The ALE benchmark uses deterministic evaluation for many tasks, which means that grading is done by fixed, rule-based checks, so the same output always receives the same score.
  • This is in contrast to the "LLM-as-a-judge" grading paradigm, in which a language model is asked to score the output, so the grade can vary from one run to the next.
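To make the idea of deterministic evaluation concrete, here is a minimal Python sketch of how a rule-based grader and a pass rate could work. It is an illustration only, not the benchmark's actual harness: the `Task` class, the `grade` function, the task names, and the recorded states are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """A hypothetical task: an ID plus a fixed, rule-based check on the final state."""
    task_id: str
    check: Callable[[Dict[str, str]], bool]


def grade(tasks: List[Task], final_states: Dict[str, Dict[str, str]]) -> float:
    """Return the pass rate: the fraction of tasks whose check succeeds."""
    passed = sum(1 for t in tasks if t.check(final_states.get(t.task_id, {})))
    return passed / len(tasks)


# Illustrative tasks only; none of these are taken from the benchmark.
tasks = [
    Task("invoice-total", lambda s: s.get("total") == "1250.00"),
    Task("report-saved", lambda s: s.get("report.txt", "").startswith("Q3 summary")),
    Task("user-created", lambda s: s.get("username") == "jdoe"),
    Task("config-updated", lambda s: s.get("port") == "8080"),
]

# The state an imaginary agent left behind after each task.
final_states = {
    "invoice-total": {"total": "1250.00"},    # matches the expected value
    "report-saved": {"report.txt": "Draft"},  # wrong content
    "user-created": {},                       # nothing was done
    "config-updated": {"port": "80"},         # wrong value
}

pass_rate = grade(tasks, final_states)
print(f"Pass rate: {pass_rate:.0%}")  # 1 of 4 checks pass -> 25%
```

Because each check is a plain comparison, grading the same final state twice always gives the same score, which is the property the article contrasts with LLM-as-a-judge grading. For scale, and assuming the reported rate is taken over all 1,490 task instances, a 24% pass rate would correspond to roughly 358 passed tasks.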
