A Statistical Framework for Ranking LLM-Based Chatbots
Siavash Ameli, Siyuan Zhuang, Ion Stoica, Michael W. Mahoney

TL;DR
This paper introduces a statistical framework for ranking LLM-based chatbots from human-judged pairwise comparisons. It fits the observed data better, gives insight into performance relationships between models, and yields more stable parameter estimates.
Contribution
The paper presents a novel factored tie model, covariance modeling between competitors, and optimization constraints that stabilize parameter estimation when ranking chatbots from human judgments.
Findings
Significantly improved data fit with the factored tie model.
Enabled deeper insights through covariance modeling.
Achieved stable and interpretable parameter estimation.
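The stability finding relates to a symmetry of pairwise-comparison likelihoods: in a Bradley-Terry-style model only score differences matter, so adding the same constant to every score leaves the likelihood unchanged and the maximizer is not unique. The Python sketch below illustrates this with a plain Bradley-Terry likelihood and a sum-to-zero constraint. Both are standard textbook choices assumed here for illustration, not the paper's exact model or constraints, and the win counts are made up.

```python
import math


def bt_log_likelihood(scores, wins):
    """Bradley-Terry log-likelihood.

    scores: list of real-valued scores, one per competitor.
    wins:   dict mapping (i, j) to the number of times i beat j.
    """
    ll = 0.0
    for (i, j), count in wins.items():
        diff = scores[i] - scores[j]
        # log P(i beats j) = log sigmoid(diff) = -log(1 + exp(-diff))
        ll += count * -math.log1p(math.exp(-diff))
    return ll


def center(scores):
    """Impose a sum-to-zero constraint by subtracting the mean score."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# Hypothetical win counts among three competitors (illustrative only).
wins = {(0, 1): 7, (1, 0): 3, (1, 2): 6, (2, 1): 4, (0, 2): 8, (2, 0): 2}

scores = [1.2, 0.3, -0.4]
shifted = [s + 5.0 for s in scores]  # same differences, different scores

# The likelihood cannot tell these two score vectors apart.
print(bt_log_likelihood(scores, wins))
print(bt_log_likelihood(shifted, wins))

# After centering, both collapse to the same representative.
print(center(scores))
print(center(shifted))
```

Centering picks one representative from each family of equivalent score vectors, which is one simple way to make the estimated scores unique and comparable across fits.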
Abstract
Large language models (LLMs) have transformed natural language processing, with frameworks like Chatbot Arena providing pioneering platforms for evaluating these models. By facilitating millions of pairwise comparisons based on human judgments, Chatbot Arena has become a cornerstone in LLM evaluation, offering rich datasets for ranking models in open-ended conversational tasks. Building upon this foundation, we propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle ties -- an integral aspect of human-judged comparisons -- significantly improving the model's fit to observed data. Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships and facilitating intuitive…
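To make the role of ties concrete, the Python sketch below implements Davidson's classic tie extension of the Bradley-Terry model, in which a single global parameter `nu` controls how often ties occur. It is an assumed baseline for illustration only: the paper's factored tie model is not specified in this summary and is not reproduced here. The function names and the outcome counts are hypothetical.

```python
import math


def davidson_probs(x_i, x_j, nu):
    """Return (P(i wins), P(j wins), P(tie)) under Davidson's tie model.

    x_i, x_j: real-valued scores of the two competitors.
    nu:       non-negative global tie parameter (nu = 0 means no ties).
    """
    p_i, p_j = math.exp(x_i), math.exp(x_j)
    tie = nu * math.sqrt(p_i * p_j)
    total = p_i + p_j + tie
    return p_i / total, p_j / total, tie / total


def log_likelihood(scores, nu, outcomes):
    """Log-likelihood of win/loss/tie counts.

    outcomes: dict mapping (i, j) to a tuple (wins_i, wins_j, ties).
    """
    ll = 0.0
    for (i, j), counts in outcomes.items():
        probs = davidson_probs(scores[i], scores[j], nu)
        for count, prob in zip(counts, probs):
            if count:  # skip zero counts so log(0) is never evaluated
                ll += count * math.log(prob)
    return ll


# Hypothetical counts: (wins for first, wins for second, ties).
outcomes = {(0, 1): (6, 2, 2), (1, 2): (5, 3, 2), (0, 2): (7, 1, 2)}
scores = [0.8, 0.1, -0.9]

print(davidson_probs(0.0, 0.0, 1.0))   # evenly matched pair
print(log_likelihood(scores, 0.5, outcomes))
```

In this baseline the tie probability is largest for evenly matched competitors and shrinks as the score gap grows, and setting `nu = 0` recovers the ordinary Bradley-Terry win probability.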
Peer Reviews
Decision·ICLR 2025 Poster
1. The proposed method can account for ties in an axiomatic framework. This not only improves tie prediction but also enhances win-loss inference.
2. The mathematical analysis is comprehensive and thorough.
3. This work has an open-source Python package that is easy to use.

1. Model ranking has no "ground truth", so it is hard to convince others that the ranking produced by this method is more accurate.
2. The evaluation depends heavily on the Chatbot Arena dataset, which makes this work a slight improvement on Chatbot Arena. Therefore, the impact of this work is limited.
3. For Chatbot Arena, a sufficiently simple ranking rule is more important if the users are ordinary users. This work will make the rule too complicated for them to understand and then reduce

1. The authors identify symmetry issues in the likelihood function of traditional models, which could lead to instability in parameter estimation. To address this, they propose symmetry constraints that ensure stable parameter estimation, thereby enhancing the model’s optimization performance and interpretability;
2. They effectively address the issue of ties, which is a limitation of the existing Elo system, and make optimizations to handle this;
3. They provide a Python package that allows for

1. The authors use multiple models to rank models, showing that high-ranking models exhibit greater consistency than lower-ranking ones. However, they do not provide further analysis, such as examining the specific characteristics of models that initially show ranking inconsistencies;
2. The authors’ work focuses on optimizing the ranking model but lacks subsequent analysis. Additional insights, such as a more in-depth examination of correlations between LLMs or an analysis of the differences be

1. **Pioneering Approach**: This paper tackles a novel and important problem in LLM evaluation, presenting a unique perspective on using advanced statistical models to refine the ranking process in chatbot comparisons. Its approach to systematically modeling ties and latent structures sets a new precedent for evaluating LLM-based chatbots and offers fresh insights that go beyond traditional ranking methods.
2. **Innovative Use of Thurstonian Representations**: By integrating Thurstonian models a

1. While the paper introduces an alternative framework, it does not clearly discuss why Arena’s Elo-based approach is insufficient beyond the issue of ties. Additional insight into Arena’s limitations in capturing competitive dynamics or certain statistical shortcomings would strengthen the argument for this new model.
2. The authors mention consistency in high-ranking models and variability in lower-ranking models, but do not explore further distinctions within these groups. For example, identi
Taxonomy
Topics: AI in Service Interactions
