Copilot Arena: A Platform for Code LLM Evaluation in the Wild
Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, Ameet Talwalkar

TL;DR
Copilot Arena is a platform that evaluates code-generating large language models inside real developer workflows, revealing human preferences and model rankings that differ from those of traditional benchmarks.
Contribution
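The paper introduces an evaluation platform integrated into the developer's working environment. It combines an interface for comparing pairs of model outputs, a sampling strategy optimized to reduce latency, and a prompting scheme that enables code completion, supporting large-scale, human-centric assessment of code LLMs.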
Findings
Model rankings differ from those of existing evaluations, which the authors attribute to the more realistic distribution of data and tasks.
User preferences are consistent across languages but vary by task category.
Over 4.5 million suggestions served from 10 models, and over 11k pairwise judgments collected.
Abstract
Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no clear solution. We introduce Copilot Arena, a platform to collect user preferences for code generation through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy optimized to reduce latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the more realistic distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human…
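The abstract above does not say how pairwise judgements are turned into model rankings. As an illustration only, the sketch below assumes a Bradley-Terry model, a common choice for arena-style pairwise evaluations, fitted with a simple iterative update. The model names and votes are hypothetical toy data, not results from the paper.

```python
from collections import defaultdict


def bradley_terry(judgments, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Assumption: this is a generic ranking method, not necessarily
    the one used by Copilot Arena. Returns a dict mapping each
    model to a strength; strengths are normalized to sum to 1.
    """
    wins = defaultdict(int)         # total wins per model
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    models = set()
    for winner, loser in judgments:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for m in models:
            # Minorization-maximization update:
            # p_m = wins_m / sum_j n_mj / (p_m + p_j)
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0)
                if n:
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Hypothetical votes: each tuple is (preferred model, rejected model).
votes = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")] * 1
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")] * 1
    + [("model_a", "model_c")] * 2 + [("model_c", "model_a")] * 1
)
strengths = bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

A model with zero wins would be driven to a strength of zero under this update, so a production system would typically add regularization and confidence intervals; both are omitted here for brevity.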
Taxonomy
Topics: Natural Language Processing Techniques · Digital and Cyber Forensics
