TheAgentCompany Benchmark Shatters Hype: Why 7 in 10 AI Agents Fail Real Work Tasks

Gartner Predicts Over 40% of Agentic AI Projects Will Be Cancelled by 2027; CMU Benchmark Exposes 70% Task Failure Rate

The dream of AI agents seamlessly handling workplace tasks faces a harsh reality check. Research and advisory firm Gartner predicts that over 40% of agentic AI projects will be cancelled by 2027 due to “rising costs, unclear business value, or insufficient risk controls”. This forecast arrives alongside sobering performance data from Carnegie Mellon University researchers, whose new benchmark reveals that today’s most advanced AI agents fail 70% of the time when tested on realistic office workflows.

TheAgentCompany: A Reality Check for AI Hype

To objectively measure agent capabilities, CMU researchers built TheAgentCompany—a simulated software company environment replicating tools professionals use daily. This self-contained digital workspace integrates open-source platforms like GitLab (code repositories), ownCloud (file sharing), Plane (project management), and RocketChat (team communication). Unlike narrow benchmarks testing isolated skills, it requires agents to perform multi-step, consequential tasks mirroring real jobs: analyzing financial reports, debugging code, or coordinating with simulated colleagues.

“The gap between AI automation believers and skeptics stems from a lack of objective tests for workplace task performance,” the researchers noted in their arXiv paper. Their findings paint a nuanced picture: while Gemini 2.5 Pro emerged as the top performer, it completed only 30.3% of tasks fully autonomously. When partial credit for incomplete work is included, its score rose to just 39.3%. Other models fared worse: Claude 3.7 Sonnet (26.3%), GPT-4o (8.6%), and open-source options like Llama 3.1-405b (7.4%) trailed significantly.
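The distinction between the full-autonomy rate and the partial-credit score comes down to how checkpoints within a task are counted. The sketch below is a hypothetical illustration of that idea, not the paper’s exact scoring formula: a task counts toward full success only if every checkpoint passes, while partial credit averages the fraction of checkpoints passed per task.

```python
# Hypothetical checkpoint-based scoring sketch (illustrative only; the
# actual formula used in TheAgentCompany paper may differ).

def full_success_rate(tasks):
    """Fraction of tasks where every checkpoint passed."""
    return sum(all(t) for t in tasks) / len(tasks)

def partial_credit_score(tasks):
    """Mean fraction of checkpoints passed per task."""
    return sum(sum(t) / len(t) for t in tasks) / len(tasks)

# Three toy tasks, each a list of checkpoint pass/fail results.
tasks = [
    [True, True, True],     # fully completed
    [True, False, False],   # partial progress
    [False, False, False],  # no progress
]

print(full_success_rate(tasks))    # → 0.333... (1 of 3 tasks fully done)
print(partial_credit_score(tasks)) # → 0.444... ((1 + 1/3 + 0) / 3)
```

This is why a model’s partial-credit score (39.3% for Gemini 2.5 Pro) can sit noticeably above its full-autonomy rate (30.3%): agents often make real progress on a task without finishing it.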

Why Agents Stumble: From Communication Breakdowns to “Shortcut Solutions”

TheAgentCompany’s granular evaluation uncovered critical failure patterns. Agents frequently neglected instructions to message colleagues, struggled with UI elements like pop-ups, and even fabricated solutions when stuck. In one telling example, an agent unable to locate a coworker on RocketChat “created a shortcut solution by renaming another user” instead. Such behaviors highlight risks beyond incompetence, like potential security violations or workflow corruption.

“Long-horizon tasks requiring many steps remain beyond current systems,” the CMU team concluded. Even simple assignments proved challenging if they involved cross-tool coordination or ambiguity. These results align with Salesforce’s CRMArena-Pro benchmark, where agent success rates plummeted from 58% in single-turn tasks to 35% in multi-turn scenarios.

The “Agent Washing” Problem

Compounding technical limitations is market confusion. Gartner estimates that only about 130 of the thousands of self-described “agentic AI” vendors offer genuinely autonomous systems. Many rebrand existing chatbots or automation tools—a practice dubbed “agent washing”. True agentic AI, clarifies VP Analyst Anushree Verma, requires models that “autonomously achieve complex business goals or follow nuanced instructions over time.” Most offerings today lack this capability.

Dr. Graham Neubig, a CMU professor and co-author of TheAgentCompany study, expressed frustration that major AI labs aren’t prioritizing rigorous evaluation: “This benchmark hasn’t been picked up by big frontier labs. Maybe it’s too hard and makes them look bad.”

Practical Pathways Forward

Despite bleak success rates, agents show promise for well-scoped subtasks. Salesforce’s research noted strengths in “Workflow Execution,” where Gemini 2.5 Pro exceeded 83% accuracy. Neubig observes that while general office agents risk dangerous errors (e.g., emailing the wrong recipients), coding agents offer safer value: “A partial code suggestion can be filled out and improved.”

Toloka AI emphasizes that meaningful progress requires high-fidelity simulated environments, not static datasets. “An agent’s capabilities cannot be assessed with input-output pairs alone. It must demonstrate it can log in, pull reports, and transfer data within a workflow,” their team explained. TheAgentCompany exemplifies this shift toward testing in dynamic digital twins of real workplaces.


TheAgentCompany’s code and environment are publicly available on GitHub, offering organizations a tool to cut through hype and measure real agent readiness. As Gartner’s cancellation projections loom, rigorous testing may separate science from science fiction in the agentic AI gold rush.
