Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework
Nora Petrova, Andrew Gordon, Enzo Blindow

TL;DR
This paper introduces HUMAINE, a multidimensional, demographically aware framework for evaluating large language models through naturalistic conversations, revealing significant demographic-based performance differences and emphasizing the importance of diverse evaluation metrics.
Contribution
The paper presents HUMAINE, a novel framework for comprehensive, demographically stratified evaluation of LLMs, addressing limitations of existing benchmarks and preference assessments.
Findings
Google Gemini-2.5-Pro ranks first overall, with a 95.6% posterior probability of holding the top rank.
User age significantly influences perceived model performance.
Discriminative power varies by evaluation dimension: Trust, Ethics & Safety shows high tie rates.
Abstract
The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first…
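The two statistical ingredients named in the abstract can be illustrated with a minimal sketch. It assumes the standard Davidson tie extension of the Bradley-Terry model and a plain weighted-average post-stratification; the paper's actual hierarchical structure, priors, and parameterization are not given in this summary, so the function names, the log-strength parameter `theta`, the tie parameter `nu`, and the group labels below are all hypothetical.

```python
import math


def btd_probabilities(theta_i, theta_j, nu):
    """Return (P(i wins), P(j wins), P(tie)) for one pairwise comparison.

    Assumed form (standard Davidson model, not confirmed as the paper's):
      pi = exp(theta), tie mass = nu * sqrt(pi_i * pi_j).
    nu = 0 recovers the plain Bradley-Terry model with no ties.
    """
    pi_i = math.exp(theta_i)
    pi_j = math.exp(theta_j)
    tie_mass = nu * math.sqrt(pi_i * pi_j)
    denom = pi_i + pi_j + tie_mass
    return pi_i / denom, pi_j / denom, tie_mass / denom


def poststratify(group_estimates, population_shares):
    """Weight per-group estimates by population (e.g. census) shares.

    Both arguments are dicts keyed by a hypothetical group label.
    Shares are renormalized, so they need not sum to exactly 1.
    """
    total = sum(population_shares.values())
    weighted = sum(group_estimates[g] * population_shares[g]
                   for g in population_shares)
    return weighted / total


if __name__ == "__main__":
    # Two equally strong models with nu = 1: win, loss, and tie are equally likely.
    print(btd_probabilities(0.0, 0.0, 1.0))
    # Hypothetical per-group scores reweighted to hypothetical population shares.
    print(poststratify({"18-34": 0.6, "35+": 0.4}, {"18-34": 0.3, "35+": 0.7}))
```

In this form a larger `nu` shifts probability mass toward ties for every pair, which is one way to read a dimension with a high tie rate as having low discriminative power.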
Peer Reviews
Decision·ICLR 2026 Poster
- Dataset scale: The study is based on 106,760 pairwise comparisons from 21,352 participants across 27 language models, which provides substantial empirical depth.
- Methodological rigor: The hierarchical Bayesian BTD model is statistically sound and appropriate for modeling heterogeneous human preferences.
- Insightful analysis: The examination of demographic heterogeneity and metric discriminability yields novel and meaningful findings for human-centered LLM evaluation.
- Data Availability: The paper briefly mentions data and framework availability in the conclusion but does not provide access at review time. Given the paper’s emphasis on the dataset, it would be important to release at least a partial dataset or representative samples during the review process. If the paper is accepted and made public, the authors should clearly commit to releasing the full dataset for research use.
- Representativeness: While the paper makes a valuable contribution, its claims
- The presentation of all methodology and results is very clear.
- A large-scale data collection with a thoroughly curated design of DIVERSE (many participants, many data points, and multi-turn interaction logs).
- Allowing participants to choose their own topic of conversation can enhance the validity of the experimental setup (data collection especially), but it is likely to introduce heterogeneity in task type and difficulty that can affect the pairwise evaluation. Although the paper collects LLM-as-judge annotations over several aspects, these variables are not included in the hierarchical BTD model. This risks the model treating all conversations as equivalent, even though some task c
- Well-motivated and timely research direction. The paper tackles an increasingly critical issue in LLM evaluation: how human preferences differ across demographic groups and qualitative dimensions. It contributes a more nuanced understanding of model performance evaluation.
- Valuable dataset contribution. The dataset could be a useful addition to the community: it is large-scale, multi-turn, demographically stratified, and spans a broad set of 27 LLMs. Such a dataset can serve as an important
- Insufficient alignment between the stated problem and the proposed solution. The paper’s introduction highlights two major issues: (1) the dominance of single-metric evaluation and (2) the neglect of subjectivity. However, the proposed solution, evaluating models on five dimensions across demographic strata, resembles running multiple benchmarks, each focusing on a different aspect. While this is a meaningful improvement, it does not fully capture the “subjectivity” aspect claimed in the intro
Taxonomy
Topics: Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education · Topic Modeling
