Benchmarking Overton Pluralism in LLMs
Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, Michiel A. Bakker

TL;DR
This paper introduces OVERTONBENCH, a framework for measuring the diversity of viewpoints in LLM outputs. It comprises a formal metric, a large-scale human study, and an automated benchmark that correlates well with human judgments.
Contribution
The paper formalizes Overton pluralism as a set coverage metric, conducts a large-scale human evaluation, and develops an automated benchmark that closely matches human assessments.
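To make the idea of a set coverage metric concrete, here is a minimal sketch. It is not the paper's definition of OVERTONSCORE, which this summary does not spell out. It assumes that each question has a set of viewpoint clusters, that cluster members rate on a 1-5 scale how well a response represents their view, and that a cluster counts as covered when its mean rating reaches a threshold. The function name `coverage_score`, the cluster labels, the ratings, and the threshold of 4.0 are all hypothetical.

```python
from statistics import mean


def coverage_score(cluster_ratings, threshold=4.0):
    """Fraction of viewpoint clusters that a single response covers.

    cluster_ratings maps a cluster label to the representation ratings
    given by that cluster's members (assumed 1-5 scale). A cluster is
    treated as covered when its mean rating is at least `threshold`
    (an assumed cutoff, not taken from the paper).
    """
    if not cluster_ratings:
        raise ValueError("need at least one viewpoint cluster")
    covered = [
        label
        for label, ratings in cluster_ratings.items()
        if mean(ratings) >= threshold
    ]
    return len(covered) / len(cluster_ratings)


# Made-up ratings for one response to one question.
ratings_for_one_response = {
    "cluster_a": [5, 4, 4],  # mean 4.33, covered
    "cluster_b": [2, 3, 1],  # mean 2.00, not covered
    "cluster_c": [4, 5, 3],  # mean 4.00, covered
    "cluster_d": [1, 2, 2],  # mean 1.67, not covered
}

score = coverage_score(ratings_for_one_response)
print(score)
```

Under these assumptions a score of 1.0 would mean every viewpoint cluster feels represented, which matches the theoretical maximum mentioned in the abstract. A model-level score would then be an average over questions.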
Findings
Models achieve OVERTONSCOREs of 0.35--0.41
DeepSeek V3 performs best among tested models
Automated benchmark correlates highly with human judgments (ρ = 0.88)
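The rank correlation reported above can be illustrated with a short, dependency-free sketch of Spearman's ρ: rank both score lists, then take the Pearson correlation of the ranks. The numbers below are toy values chosen for illustration, not results from the paper, and the helper names are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-model scores (illustrative only, not from the paper).
human_scores = [0.9, 0.5, 0.1, 0.7, 0.3]
automated_scores = [0.8, 0.2, 0.1, 0.6, 0.4]

rho = spearman_rho(human_scores, automated_scores)
print(round(rho, 2))
```

A high ρ says the automated benchmark orders models much as the human study does. It says nothing about whether the two sets of absolute scores agree.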
Abstract
We introduce OVERTONBENCH, a novel framework for measuring Overton pluralism in LLMs--the extent to which diverse viewpoints are represented in model outputs. We (i) formalize Overton pluralism as a set coverage metric (OVERTONSCORE), (ii) conduct a large-scale U.S.-representative human study (N = 1208; 60 questions; 8 LLMs), and (iii) develop an automated benchmark that closely reproduces human judgments. On average, models achieve OVERTONSCOREs of 0.35--0.41, with DeepSeek V3 performing best; yet all models remain far below the theoretical maximum of 1.0, revealing substantial headroom for improvement. Because repeated large-scale human studies are costly and slow, scalable evaluation tools are essential for model development. Hence, we propose an automated benchmark that achieves high rank correlation with human judgments (ρ = 0.88), providing a practical proxy without replacing…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper is well-presented and explained, outlining the concept of Overton pluralism and how it is measured and evaluated.
- The paper is accompanied by a dataset, where users were asked to evaluate statements in free form or from selected views, as well as rate models’ responses. I believe that the experimental protocol is thoughtfully designed for the narrow setting it targets: participants provide both free-form statements and Likert ratings, as well as pairwise agreement votes, which ar
- The study’s focus is very narrow and is actually dataset and task-dependent. It does not tell us much about the model’s abilities or how these scores can be generalised or used for developing models. The entire benchmark is built on 15 questions, drawn from a US-focused political dataset.
- I believe that the paper’s contribution is limited for this type of conference, and can be better suited for a workshop.
- The automated benchmark is presented as a tool for model selection, but the pap
- Novel clustering methodology: Using participant voting patterns to identify viewpoints is innovative and more faithful to human understanding than semantic similarity or NLI approaches.
- Rigorous statistical framework: Proper significance testing with OLS models, fixed effects, and cluster-robust standard errors enables principled model comparison.
- Validated automation: LLM-as-judge achieves strong correlation (ρ=0.88) with human judgments and shows small fairness disparities (η² < 0.004),
- False "first" claim: Modular Pluralism and VITAL already measure Overton pluralism. The contribution is methodological refinement (human validation + clustering), not pioneering measurement.
- Contradictory "lack of benchmarks" statement: Page 1 claims methods aren't evaluated due to lacking benchmarks, but Modular Pluralism explicitly evaluates Overton pluralism improvements.
- Limited scope: US-only, 15 questions, 300 participants. PRISM has 75 countries; Model Slant has 10,007 respondents;
This paper stands out to me as a strong contribution in originality, quality, clarity and significance. The operationalisation of pluralistic alignment via the Overton window, and specifically the extent to which surveyed humans agree that their view is represented, is innovative and human-centric. The methods employed are well-described and thorough. The implications for the community are valuable in highlighting the current lack of pluralism across models.
1. The human sample is limited in size and coverage. Although the authors clearly recognize this, a sample of only 300 people is likely to underrepresent less common views, which are key to an Overton-type analysis. Similarly, the questions and participants are US-centric.
2. There are fundamental trade-offs between Overton coverage and response length. An optimal model probably does not include every possible viewpoint on every topic, which means a perfect score on this benchmark is not always
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
