Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps
Kangyu Wang, Hongliang He, Lin Liu, Ruiqi Liang, Zhenzhong Lan, Jianguo Li

TL;DR
Inclusion Arena is a live, human-feedback-based leaderboard platform that evaluates large language models in real-world application scenarios, improving ranking reliability and robustness over traditional static benchmarks.
Contribution
The paper introduces Inclusion Arena, a novel open platform that evaluates models through real-world user interactions and advanced ranking techniques, addressing limitations of existing static benchmarks.
Findings
Inclusion Arena provides more reliable and stable model rankings.
The platform reduces manipulation risks compared to crowdsourced datasets.
It accelerates development of practical, user-centric large models.
Abstract
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and leaderboards (e.g., Chatbot Arena) have been proposed to help evolve the development of LLMs and MLLMs, most rely on static datasets or crowdsourced general-domain prompts, often falling short of reflecting performance in real-world applications. To bridge this critical gap, we present Inclusion Arena, a live leaderboard that ranks models based on human feedback collected directly from AI-powered applications. Our platform integrates pairwise model comparisons into natural user interactions, ensuring evaluations reflect practical usage scenarios. For robust model ranking, we employ the Bradley-Terry model augmented with two key innovations: (1) Placement…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
