On the Credibility of Evaluating LLMs using Survey Questions
Jindřich Libovický

TL;DR
This paper critically examines the methodology of evaluating large language models' value orientations using survey questions, highlighting how prompting and decoding strategies influence results and proposing new metrics for better assessment.
Contribution
It identifies limitations in current survey-based evaluation methods for LLMs, introduces a novel self-correlation distance metric, and offers recommendations for more robust evaluation practices.
Findings
Prompting methods significantly affect evaluation outcomes.
High agreement with human responses does not ensure structural alignment.
The mean-squared distance and KL divergence metrics correlate only weakly with each other.
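A toy illustration of why a distance between mean answers and a KL divergence between answer distributions can disagree. This is a minimal sketch, not the paper's implementation: the function names, the scaling of answers to [0, 1], the smoothing constant, and the example distributions are all assumptions made for illustration.

```python
import numpy as np

def mean_squared_distance(p, q):
    """Squared difference between the mean answers of two distributions.

    p, q: probabilities over the same ordered answer options.
    Options are scaled to [0, 1] (an assumption of this sketch).
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    scale = np.linspace(0.0, 1.0, len(p))
    return float((p @ scale - q @ scale) ** 2)

def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) with additive smoothing so zero probabilities stay finite."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical four-option question: humans cluster in the middle,
# while the model's sampled answers are split between the extremes.
human = [0.1, 0.4, 0.4, 0.1]
model = [0.5, 0.0, 0.0, 0.5]

msd = mean_squared_distance(human, model)
kl = kl_divergence(human, model)
print(f"mean-squared distance: {msd:.4f}")
print(f"KL divergence:         {kl:.4f}")
```

Both distributions have the same mean answer, so the mean-based distance is zero even though the distributions are very different and the KL divergence is large.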
Abstract
Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Value Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This indicates that even a high average agreement with human…
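The idea of comparing relationships between answers, rather than only average answers, can be sketched as follows. The abstract does not give the exact definition of self-correlation distance, so this is an assumed formulation: it computes the question-by-question correlation matrix for human respondents and for sampled model responses, then takes the mean absolute difference of the off-diagonal entries. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def self_correlation_distance(human, model):
    """Mean absolute difference between two question-correlation matrices.

    human, model: arrays of shape (n_respondents, n_questions), where each
    row is one respondent (or one sampled model run) and each column is
    one survey question. This is an assumed formulation for illustration.
    """
    corr_human = np.corrcoef(human, rowvar=False)
    corr_model = np.corrcoef(model, rowvar=False)
    # Compare each pair of questions once, skipping the diagonal of ones.
    upper = np.triu_indices_from(corr_human, k=1)
    return float(np.mean(np.abs(corr_human[upper] - corr_model[upper])))

rng = np.random.default_rng(0)
n = 1000

# Synthetic "humans": the answer to question 2 always matches question 1.
q1 = rng.integers(1, 5, size=n)
human_answers = np.column_stack([q1, q1])

# Synthetic "model": identical per-question answer counts, but the pairing
# between the two questions is shuffled, so the answers are unrelated.
model_answers = np.column_stack([q1, rng.permutation(q1)])

scd = self_correlation_distance(human_answers, model_answers)
print(f"self-correlation distance: {scd:.3f}")
```

In this example the per-question averages of the two groups are identical, yet the distance is large because the synthetic model does not reproduce the link between the two questions. Note that `np.corrcoef` returns NaN for a question whose answers never vary, so a sketch like this needs sampled responses rather than a single greedy answer per question.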
Taxonomy
Topics: Computational and Text Analysis Methods · Language and cultural evolution · Survey Methodology and Nonresponse
