Meta just got caught red-handed gaming AI benchmarks, and the episode reveals a much bigger problem in the tech industry. The company's latest AI model, Llama 4 Maverick, soared to the #2 spot on the popular LMArena leaderboard, beating giants like GPT-4o and Gemini 2.5 Pro. But there's a catch: Meta used a secretly tweaked version of the model to inflate its scores, one that regular users and developers can't even access.
1. The “Chatty” AI That Fooled Everyone
The version of Maverick tested on LMArena was "optimized for conversationality," meaning it gave longer, more engaging responses packed with emojis and fluff. This made human evaluators prefer it over rivals even when its answers were less accurate. When researchers tested the publicly released model, they found it performed far worse, especially on coding tasks.
2. Meta’s History of Benchmark Manipulation
This isn't the first time Meta has been accused of gaming benchmarks. A former employee claimed that test data was mixed into training sets for Llama 1, artificially boosting scores. Even Meta's VP of AI, Ahmad Al-Dahle, had to deny claims that Llama 4 was trained on test sets, while admitting its performance was "inconsistent."
3. Why This Matters for AI’s Future
Benchmarks are supposed to help developers choose the best models, but if companies game the system, rankings become impossible to trust. LMArena has already updated its policies to prevent future manipulation, yet the damage is done. As the AI race grows more competitive, transparency is crumbling.
Meta's stunt exposes a broken incentive system in AI, one where looking good on paper matters more than real-world performance. If even a giant like Meta resorts to tricks, how can we trust any benchmark again?
Subscribe to my WhatsApp channel.