Evaluating Gemini in an arena for learning
LearnLM Team, Google: Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Ankit Anand, Avishkar Bhoopchand, Brett Wiltshire, Daniel Gillick, Daniel Kasenberg, Eleni Sgouritsa, Gal Elidan, Hengrui Liu, Holger Winnemoeller, Irina Jurenka, James Cohan

TL;DR
This study introduces an 'arena for learning' benchmark in which educators compare AI models in realistic educational scenarios. It finds that Gemini 2.5 Pro outperforms the other evaluated models at supporting learning goals.
Contribution
The paper presents a novel evaluation framework for AI in education, using expert comparisons to assess model effectiveness in learning contexts.
Findings
Gemini 2.5 Pro ranked first in the arena.
Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of its match-ups.
Gemini 2.5 Pro showed superior pedagogical performance.
Abstract
Artificial intelligence (AI) is poised to transform education, but the research community lacks a robust, general benchmark to evaluate AI models for learning. To assess state-of-the-art support for educational use cases, we ran an "arena for learning" where educators and pedagogy experts conduct blind, head-to-head, multi-turn comparisons of leading AI models. In particular, educators drew from their experience to role-play realistic learning use cases, interacting with two models sequentially, after which experts judged which model better supported the user's learning goals. The arena evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of these match-ups -- ranking it first overall in the arena. Gemini 2.5 Pro also demonstrated markedly higher performance…
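To make the "excluding ties" preference rate concrete, here is a minimal sketch of how such a figure could be computed from pairwise judgments. The function name, the tuple layout, and the toy match-ups are all hypothetical and chosen only for illustration; they are not the paper's data or its analysis code.

```python
import math


def win_rate_excluding_ties(matchups, model):
    """Fraction of decided (non-tied) match-ups involving `model` that it won.

    `matchups` is an iterable of (model_a, model_b, winner) tuples, where
    `winner` is one of the two model names, or None for a tie.
    This record layout is an assumption made for this sketch.
    """
    wins = 0
    losses = 0
    for model_a, model_b, winner in matchups:
        if model not in (model_a, model_b):
            continue  # match-up does not involve this model
        if winner is None:
            continue  # ties are excluded from the denominator
        if winner == model:
            wins += 1
        else:
            losses += 1
    decided = wins + losses
    return wins / decided if decided else math.nan


# Toy, made-up judgments with placeholder model names (not real results).
toy_matchups = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_c", None),       # tie, ignored
    ("model_b", "model_a", "model_a"),
    ("model_a", "model_c", "model_c"),
    ("model_b", "model_c", "model_b"),  # does not involve model_a
]

rate_a = win_rate_excluding_ties(toy_matchups, "model_a")
rate_b = win_rate_excluding_ties(toy_matchups, "model_b")
```

In this toy set, `model_a` has three decided match-ups and wins two of them, so its rate is 2/3. The tie is dropped from both numerator and denominator, which is the sense in which a rate "excludes ties".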
Taxonomy
Topics: Construction Project Management and Performance · Complex Systems and Decision Making
