Neural Notes: Inside Arena, the Unofficial Scoreboard for the AI Model Wars
Welcome back to Neural Notes, a weekly column exploring how AI is shaping Australia. In this edition, we look at Arena, a platform that has become a public referee for AI models like ChatGPT, Claude, and Gemini. But how reliable are its rankings for founders and businesses?
The Arena Phenomenon
Arena, originally known as LMArena, is a live leaderboard where AI models from various companies compete in anonymous side-by-side comparisons. Users enter a prompt, receive responses from two unnamed models, and vote for the better one. Rankings are determined by a system similar to the Elo rating used in chess, where the result counts for more or less depending on the opponent: beating a higher-rated model lifts a score more than beating a lower-rated one. The resulting leaderboard is built from millions of comparisons and shifts as new votes come in.
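To make the chess analogy concrete, here is a minimal sketch of a textbook Elo update in Python. It is an illustration only, not Arena's actual scoring code: the column says only that the system is "similar to" Elo. The function names, the K-factor of 32, and the 400-point scale are conventional chess values assumed here for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve.

    The 400-point scale is the chess convention, assumed for illustration.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one head-to-head vote.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    k (assumed value: 32) controls how far a single vote moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two evenly rated models: the winner gains half of k.
    print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)

    # An underdog win moves ratings far more than a favourite's win.
    underdog_after, _ = update_elo(1000.0, 1400.0, 1.0)
    favourite_after, _ = update_elo(1400.0, 1000.0, 1.0)
    print(round(underdog_after - 1000.0, 2), round(favourite_after - 1400.0, 2))
```

The second example shows why the opponent matters: under these assumed values, an upset win against a model rated 400 points higher is worth roughly ten times as much as a win the ratings already predicted.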
A Platform for Founders
For startup founders and small businesses, Arena offers a valuable tool. Instead of deciphering vendor benchmark charts, teams can directly test prompts and compare model behaviors. This shift from static test suites to continuous evaluation is particularly beneficial, as models are updated regularly.
The Rise of Arena
Arena began as an academic project at UC Berkeley that explored human preference evaluation as an alternative to static benchmarks, and its popularity grew from there. In May 2025, it became a venture-backed company, raising $100 million in seed funding at a $600 million valuation. This was followed by a $150 million Series A in early 2026, valuing the company at $1.7 billion.
Measuring Relative Preference
Arena's rankings have limitations, however. They measure relative preference among the platform's own users, not universal quality, and those preferences may not align with broader social needs. The platform's influence is also concentrated in a small evaluation layer. This was highlighted in 2025, when researchers analyzed Arena's battles and found that the platform unintentionally favored large proprietary labs.
The Influence of Arena
Despite these concerns, Arena remains a valuable resource for founders and businesses. It provides a reality check on vendor claims and helps teams understand model behavior. However, it should be used alongside bespoke testing, because its rankings reflect global averages rather than the needs of a specific customer base.
The Future of AI Measurement
Arena's impact extends beyond individual models. It influences adoption decisions, investor narratives, and media coverage. While it increases transparency, it also concentrates soft power in a small evaluation layer. As AI continues to evolve, the role of platforms like Arena in shaping perception and measurement will become increasingly significant.