Bayesian Statistical Modeling with Predictors from LLMs
Michael Franke, Polina Tsvilodub, Fausto Carcassi

TL;DR
This paper uses Bayesian statistical models to evaluate how human-like LLM predictions are in multiple-choice decision tasks. It finds that LLMs do not capture the item-level variance in the human data, but that some methods of deriving condition-level predictions approximate aggregate human behavior.
Contribution
It applies Bayesian statistical modeling to assess how well LLM predictions align with human data and explores methods to derive full distributional predictions from LLMs.
Findings
LLMs do not capture variance at the individual item level.
Some methods of deriving condition-level predictions fit human data adequately.
Assessment of LLM performance depends on methodological choices.
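To make the distinction between item-level and condition-level predictions concrete, here is a minimal sketch of one generic aggregation scheme: turn each item's option scores into a probability distribution with a softmax, then average those distributions over all items in a condition. This is an illustrative assumption, not necessarily one of the methods the authors evaluate; the function names and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw option scores (e.g. log-probabilities) into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def condition_level_prediction(item_scores):
    """Average item-level choice distributions into one condition-level distribution.

    item_scores: one list of option scores per item, all of the same length.
    Assumption: averaging after the softmax is only one possible way to aggregate.
    """
    per_item = [softmax(scores) for scores in item_scores]
    n_items = len(per_item)
    n_options = len(per_item[0])
    return [sum(p[k] for p in per_item) / n_items for k in range(n_options)]

# Hypothetical scores for two items with three response options each.
item_scores = [
    [-1.0, -2.0, -3.0],
    [-2.0, -1.0, -3.0],
]
prediction = condition_level_prediction(item_scores)
print(prediction)
```

The resulting vector can be read as predicted choice probabilities for the condition and compared with the observed human choice proportions. Other choices, such as averaging scores before the softmax, can give different predictions, which is one way methodological decisions can affect the assessment.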
Abstract
State-of-the-art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks and are increasingly used as components in larger applications, where LLM-based predictions serve as proxies for human judgements or decisions. This raises questions about the human-likeness of LLM-derived information, alignment with human intuition, and whether LLMs could possibly be considered (parts of) explanatory models of (aspects of) human cognition or language use. To shed more light on these issues, we here investigate the human-likeness of LLMs' predictions for multiple-choice decision tasks from the perspective of Bayesian statistical modeling. Using human data from a forced-choice experiment on pragmatic language use, we find that LLMs do not capture the variance in the human data at the item level. We suggest different ways of deriving full distributional…
Taxonomy
Topics: Machine Learning and Data Classification
