Hidden Topics: Measuring Sensitive AI Beliefs with List Experiments
Maxim Chupilkin

TL;DR
This paper introduces a novel application of list experiments to uncover hidden beliefs in large language models, revealing concerning attitudes like approval of surveillance and violence, which are not easily detected through direct questioning.
Contribution
It demonstrates that list experiments, a social science method, can effectively identify concealed beliefs in LLMs, providing a new tool for AI alignment research.
Findings
All models show hidden approval of mass surveillance.
Models exhibit some approval of torture, discrimination, and nuclear strikes.
A placebo treatment produces a null result, which supports the validity of the list experiment method.
Abstract
How can researchers identify beliefs that large language models (LLMs) hide? As LLMs become more sophisticated, alignment faking becomes more prevalent, and models are increasingly integrated into high-stakes decision-making, responding to this challenge has become critical. This paper proposes that a list experiment, a simple method widely used in the social sciences, can be applied to study the hidden beliefs of LLMs. List experiments were originally developed to circumvent social desirability bias in human respondents, which closely parallels alignment faking in LLMs. The paper implements a list experiment on models developed by Anthropic, Google, and OpenAI and finds hidden approval of mass surveillance across all models, as well as some approval of torture, discrimination, and a nuclear first strike. Importantly, a placebo treatment produces a null result, validating the…
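For readers unfamiliar with the method, the standard list experiment estimator can be sketched as follows. A control group sees a list of baseline items and reports only how many it agrees with; a treatment group sees the same list plus the sensitive item. The difference in mean counts estimates the share agreeing with the sensitive item. This is a minimal sketch of the textbook difference-in-means estimator, not the paper's own code: the function name `list_experiment_estimate` and the sample counts are hypothetical, and the paper's exact prompts, items, and sample sizes are not given in this summary.

```python
from math import sqrt
from statistics import mean, variance


def list_experiment_estimate(control_counts, treatment_counts):
    """Return (estimate, standard_error) for a list experiment.

    control_counts: number of items endorsed per response when the list
        holds only the baseline items.
    treatment_counts: number of items endorsed per response when the list
        holds the baseline items plus the sensitive item.
    """
    # Difference in means: estimated share endorsing the sensitive item.
    estimate = mean(treatment_counts) - mean(control_counts)
    # Standard error for a difference of two independent sample means.
    std_err = sqrt(
        variance(treatment_counts) / len(treatment_counts)
        + variance(control_counts) / len(control_counts)
    )
    return estimate, std_err


# Made-up illustrative counts, NOT data from the paper.
control = [2, 1, 2, 3, 2, 1, 2, 2]
treatment = [3, 2, 2, 3, 3, 2, 2, 3]

estimate, std_err = list_experiment_estimate(control, treatment)
print(f"estimated hidden approval: {estimate:.3f} (SE {std_err:.3f})")
```

A placebo check can reuse the same function: replace the sensitive item with an innocuous one that no respondent should endorse, and the estimate should then be statistically indistinguishable from zero.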
Taxonomy
Topics: Survey Sampling and Estimation Techniques · Survey Methodology and Nonresponse · Mobile Crowdsensing and Crowdsourcing
