Evaluating Language Model Character Traits
Francis Rhys Ward, Zejia Yang, Alex Jackson, Randy Brown, Chandler Smith, Grace Colverd, Louis Thomson, Raymond Douglas, Patrik Bartak, Andrew Rowan

TL;DR
This paper formalises a behaviourist framework for describing language model (LM) character traits, analyses how consistently those traits appear and how they develop over the course of an interaction, and offers a precise, non-anthropomorphic language for characterising LM behaviour.
Contribution
The paper introduces a formalism for describing LM character traits, grounded in empirical demonstrations, and shows that the consistency with which LMs exhibit traits varies with model size, fine-tuning, and prompting.
Findings
Traits such as truthfulness can be stationary, i.e., consistent over the course of an interaction, in certain contexts.
Traits such as harmfulness can be reflective, i.e., mirroring the model's previous behaviour in the interaction.
Consistency of traits depends on model size, fine-tuning, and prompting.
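The distinction between stationary and reflective traits in the findings above can be illustrated with a small sketch. This is not the paper's formalism or its evaluation code. It assumes, purely for illustration, that each trial has already been reduced to two binary labels: whether the trait was present in the preceding interaction (`history`) and whether it was present in the model's next response (`response`). The function names `trait_rate` and `history_effect` are hypothetical.

```python
from typing import Iterable, List, Tuple


def trait_rate(labels: Iterable[int]) -> float:
    """Fraction of responses (labelled 0 or 1) that exhibit the trait."""
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one label")
    return sum(labels) / len(labels)


def history_effect(records: Iterable[Tuple[int, int]]) -> float:
    """Difference in trait rate after trait-present vs trait-absent histories.

    Each record is (history, response), both 0 or 1 (an assumed encoding).
    A value near 0 suggests the trait is stationary: the response does not
    depend on the preceding behaviour. A value near 1 suggests the trait is
    reflective: the response mirrors the preceding behaviour.
    """
    records = list(records)
    after_present: List[int] = [resp for hist, resp in records if hist == 1]
    after_absent: List[int] = [resp for hist, resp in records if hist == 0]
    return trait_rate(after_present) - trait_rate(after_absent)


# Hypothetical labels: the response always copies the history (reflective).
reflective = [(1, 1), (1, 1), (0, 0), (0, 0)]
# Hypothetical labels: the response is the same whatever the history (stationary).
stationary = [(1, 1), (1, 1), (0, 1), (0, 1)]

print(history_effect(reflective))  # 1.0
print(history_effect(stationary))  # 0.0
```

In practice the labels would come from a human or automated grader, and a real analysis would also need enough trials per condition to separate a genuine dependence on history from noise.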
Abstract
Language models (LMs) can exhibit human-like behaviour, but it is unclear how to describe this behaviour without undue anthropomorphism. We formalise a behaviourist view of LM character traits: qualities such as truthfulness, sycophancy, or coherent beliefs and intentions, which may manifest as consistent patterns of behaviour. Our theory is grounded in empirical demonstrations of LMs exhibiting different character traits, such as accurate and logically coherent beliefs, and helpful and harmless intentions. We find that the consistency with which LMs exhibit certain character traits varies with model size, fine-tuning, and prompting. In addition to characterising LM character traits, we evaluate how these traits develop over the course of an interaction. We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in certain contexts,…
Taxonomy
Topics: Natural Language Processing Techniques
