On the Credibility of Evaluating LLMs using Survey Questions

Jindřich Libovický

arXiv:2602.04033·cs.CL·February 5, 2026

TL;DR

This paper critically examines the practice of evaluating large language models' value orientations with survey questions, shows how prompting and decoding strategies influence the results, and proposes a new metric, self-correlation distance, for better assessment.

Contribution

It identifies limitations in current survey-based evaluation methods for LLMs, introduces a novel self-correlation distance metric, and offers recommendations for more robust evaluation practices.

Findings

01 Prompting method (direct vs. chain-of-thought) and decoding strategy (greedy vs. sampling) significantly affect evaluation outcomes.

02 High average agreement with human responses does not ensure structural alignment.

03 Mean-squared distance and KL divergence correlate only weakly with each other.
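A toy illustration of why a distance between mean answers and a KL divergence between answer distributions can disagree is sketched below. This page does not give the paper's exact formulas, so both definitions here are assumptions: `squared_distance_of_means` compares expected answers on a 1..k scale, and `kl_divergence` compares full answer distributions with a small smoothing constant. All function names and the example distributions are hypothetical.

```python
import math


def mean_answer(dist):
    """Expected answer on a 1..k scale, given probabilities for options 1..k."""
    return sum((i + 1) * p for i, p in enumerate(dist))


def squared_distance_of_means(human, model):
    """Assumed definition: squared difference between the two mean answers."""
    return (mean_answer(human) - mean_answer(model)) ** 2


def kl_divergence(human, model, eps=1e-9):
    """KL(human || model) over answer options; eps avoids division by zero."""
    return sum(
        h * math.log((h + eps) / (m + eps))
        for h, m in zip(human, model)
        if h > 0
    )


# Hypothetical 4-point question: humans are polarized, the model is moderate.
human = [0.5, 0.0, 0.0, 0.5]
model = [0.0, 0.5, 0.5, 0.0]

# Both distributions have the same mean answer, so the mean-based distance
# is zero, while the KL divergence is large because the supports do not overlap.
print(squared_distance_of_means(human, model))
print(kl_divergence(human, model))
```

Under these assumed definitions, a model can look perfectly aligned on the mean-based score while placing almost no probability on the answers humans actually give, which is one way the two metrics can decouple.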

Abstract

Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Values Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This indicates that even a high average agreement with human…
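The abstract describes self-correlation distance only in words, so the following is a minimal sketch of one plausible reading, not the paper's definition. It assumes that each row holds one respondent's (or one sampled model run's) numeric answers to the same questions, that the "relationships between answers" are pairwise Pearson correlations between questions, and that the distance is the mean absolute difference between the human and model correlations. The names `pearson` and `self_correlation_distance` and the toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat constant answers as uncorrelated
    return cov / (sx * sy)


def self_correlation_distance(human_rows, model_rows):
    """Mean absolute gap between human and model question-pair correlations."""
    n_questions = len(human_rows[0])
    pairs = list(combinations(range(n_questions), 2))
    total = 0.0
    for i, j in pairs:
        h = pearson([r[i] for r in human_rows], [r[j] for r in human_rows])
        m = pearson([r[i] for r in model_rows], [r[j] for r in model_rows])
        total += abs(h - m)
    return total / len(pairs)


# Toy data: two questions, four respondents or model samples each.
# Humans answer both questions alike; the model answers them in opposition.
humans = [[1, 1], [2, 2], [3, 3], [4, 4]]
model = [[1, 4], [2, 3], [3, 2], [4, 1]]

# Per-question means are identical (2.5), yet the correlation structure is
# reversed, so the distance is close to its maximum of 2 for this pair.
print(self_correlation_distance(humans, model))
```

In this toy case the model matches the average human answer on every question while reversing the relationship between the two questions, which is the kind of mismatch an agreement score based on averages cannot detect.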

Taxonomy

Topics: Computational and Text Analysis Methods · Language and cultural evolution · Survey Methodology and Nonresponse