Eliciting Trustworthiness Priors of Large Language Models via Economic Games
Siyu Yan, Lusha Zhu, Jian-Qiao Zhu

TL;DR
This paper introduces a novel method for measuring the trustworthiness priors of large language models using economic games. It finds that GPT-4.1's trustworthiness priors closely match human data and that models differentiate trust based on agent stereotypes.
Contribution
The paper presents a new elicitation technique for trust priors in LLMs using the Trust Game, enabling better understanding of AI trust calibration.
Findings
GPT-4.1's trust priors closely match human data
Models differentiate trust based on agent stereotypes
Variation in trust can be predicted by warmth and competence perceptions
Abstract
One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1's…
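To make concrete the idea of trust as "voluntary exposure to risk based on beliefs about another agent", here is a minimal sketch of the payoffs in one round of a Trust Game. It is not the authors' implementation. The multiplier of 3, the endowment of 10, the all-or-nothing reciprocation model, and every function and parameter name below are illustrative assumptions and are not taken from the abstract.

```python
def trust_game_payoffs(endowment, sent, multiplier, returned_fraction):
    """Return (investor_payoff, trustee_payoff) for one Trust Game round.

    All parameter names are hypothetical. The investor sends `sent` out of
    `endowment`; the amount is multiplied on its way to the trustee, who
    then returns `returned_fraction` of what they received.
    """
    if not 0 <= sent <= endowment:
        raise ValueError("sent must lie between 0 and the endowment")
    if not 0.0 <= returned_fraction <= 1.0:
        raise ValueError("returned_fraction must lie in [0, 1]")
    received = sent * multiplier
    returned = received * returned_fraction
    investor = endowment - sent + returned
    trustee = received - returned
    return investor, trustee


def expected_investor_payoff(endowment, sent, multiplier,
                             p_reciprocate, returned_fraction):
    """Expected investor payoff under a simple belief about the trustee.

    Assumption: the trustee either reciprocates (returning
    `returned_fraction`) with probability `p_reciprocate`, or keeps
    everything. This is a deliberately crude stand-in for a belief.
    """
    if not 0.0 <= p_reciprocate <= 1.0:
        raise ValueError("p_reciprocate must lie in [0, 1]")
    if_returned, _ = trust_game_payoffs(endowment, sent, multiplier,
                                        returned_fraction)
    if_kept, _ = trust_game_payoffs(endowment, sent, multiplier, 0.0)
    return p_reciprocate * if_returned + (1.0 - p_reciprocate) * if_kept


if __name__ == "__main__":
    # Illustrative values only: endowment 10, everything sent, multiplier 3.
    for belief in (0.5, 0.8):
        value = expected_investor_payoff(10, 10, 3, belief, 0.5)
        print(f"belief={belief}: expected payoff {value:.2f} vs. 10 for keeping")
```

Under these assumed numbers, sending the full endowment beats keeping it only when the investor believes reciprocation is likely enough. In that sense, the amount sent reveals a belief about the other agent without relying on a self-report.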
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · AI in Service Interactions · Explainable Artificial Intelligence (XAI)
