Turning large language models into cognitive models
Marcel Binz, Eric Schulz

TL;DR
This paper demonstrates that large language models, when fine-tuned on psychological data, can serve as accurate and versatile cognitive models of human behavior across various decision-making tasks.
Contribution
It shows that large language models can be transformed into cognitive models that outperform traditional models and generalize across tasks after fine-tuning.
Findings
Fine-tuned models accurately replicate human decision-making behavior.
Models outperform traditional cognitive models in two decision domains.
Fine-tuning enables models to predict behavior in unseen tasks.
Abstract
Large language models are powerful systems that excel at many tasks, ranging from translation to mathematical reasoning. Yet, at the same time, these models often show unhuman-like characteristics. In the present paper, we address this gap and ask whether large language models can be turned into cognitive models. We find that -- after finetuning them on data from psychological experiments -- these models offer accurate representations of human behavior, even outperforming traditional cognitive models in two decision-making domains. In addition, we show that their representations contain the information necessary to model behavior on the level of individual subjects. Finally, we demonstrate that finetuning on multiple tasks enables large language models to predict human behavior in a previously unseen task. Taken together, these results suggest that large, pre-trained models can be…
Peer Reviews
Decision: ICLR 2024 poster
I found this paper really interesting and worthwhile. While I suspect that there is much more work to be done in assessing the results, the main result is clear and fascinating: LLMs can quickly adapt to predict specific human behaviors. Perhaps most interesting is that this only requires linear regression. While this isn't really fine-tuning in the typical sense and could be considered a limitation (the authors could have trained a single layer or so, but with LLMs this is not easy by any means
The main weakness in my view is the difficulty of comparing to the past work in the relevant domains that the paper cites. Past work appears to use different metrics, different splits of the data, and different baseline models. For example, BEAST, which is usually evaluated using mean squared error, does not appear to be the best or only relevant baseline. There is also a history of work applying machine learning methods to the choices13k dataset that the authors don't review. The aut
- This paper is well written, with a clear motivation: using large language models to simulate human behavior (or at least binary choice results on two types of decision-making tasks). The implementation, which extracts embeddings and then fits a linear probe, is good enough for a scalable method.
- I like the idea of using large language models as a proxy model for analyzing human behavior. The crux is how to probe or design proper experimental methods (analogous to methods developed in ex
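The embeddings-then-linear-probe recipe that the reviewers describe can be illustrated with a minimal sketch. It is not the authors' code. It assumes a binary choice task and uses logistic regression as the probe; the random vectors produced by the hypothetical helper `make_toy_data` merely stand in for frozen LLM embeddings of task prompts, and the simulated choices stand in for human responses. Fit quality is reported as mean negative log-likelihood, where ln 2 (about 0.693) corresponds to chance-level prediction.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Probability of choosing option 1 given one embedding vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_linear_probe(X, y, lr=0.5, epochs=300, l2=1e-3):
    """Fit a logistic-regression probe by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Averaged gradient step with a small L2 penalty on the weights.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def mean_nll(w, b, X, y):
    """Mean negative log-likelihood of the observed binary choices."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(w, b, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def make_toy_data(n, seed):
    """HYPOTHETICAL stand-in: random vectors play the role of frozen LLM
    embeddings, and simulated choices play the role of human responses."""
    rng = random.Random(seed)
    true_w = [2.0, -1.5, 1.0, 0.5]  # arbitrary weights, chosen for illustration
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        p = sigmoid(sum(a * c for a, c in zip(true_w, x)))
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y


X, y = make_toy_data(n=400, seed=0)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

w, b = fit_linear_probe(X_train, y_train)
test_nll = mean_nll(w, b, X_test, y_test)
print(f"held-out NLL: {test_nll:.3f} (chance level: {math.log(2):.3f})")
```

In a real pipeline the rows of `X` would come from a language model's hidden states for each trial's prompt, and only the probe's weights would be trained while the language model itself stays frozen.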
- Although using open-source models (e.g., llama) is a good choice, the most powerful models to date, including instruction fine-tuned ones, are not tested as baselines (e.g., few-shot evaluations on GPT-3.5/GPT-4/PaLM-2/Claude/instruction-finetuned llama 2 variants). It has been suggested that some of those models can also demonstrate human-like behaviors in some human decision-making tasks through prompting or few-shot evaluation [A]. The few-shot method might be done by prompting some o
- This work presents an interesting and novel approach to cognitive modeling that outperforms domain-specific cognitive models.
- The model is shown to be capable of generating qualitative cognitive insights, in addition to superior quantitative performance.
- The model accounts for individual differences.
- The model generalizes to a novel task.
- The paper includes some interesting discussion of the broader implications of this approach for cognitive science.
- I am not sure if this is a weakness per se, but the work is primarily oriented toward cognitive science. It may be better suited to a more cog-sci oriented venue. However, I think the work generally makes a strong contribution and would be happy for it to be published at ICLR.
- My primary substantive concern is that the model is only evaluated on publicly available datasets. Do the authors know whether this data is included in LLaMa's pretraining data? I'm not entirely certain, but given the ope
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
