Evaluating Language Model Character Traits

Francis Rhys Ward; Zejia Yang; Alex Jackson; Randy Brown; Chandler Smith; Grace Colverd; Louis Thomson; Raymond Douglas; Patrik Bartak; Andrew Rowan

arXiv:2410.04272·cs.CL·October 8, 2024

TL;DR

This paper formalises a behaviourist framework for describing language model (LM) character traits, analyses how consistently those traits are exhibited and how they develop over an interaction, and provides a precise, non-anthropomorphic language for characterising LM behaviour.

Contribution

It introduces a formalism for describing LM character traits, grounded in empirical observations, and shows how the consistency with which LMs exhibit these traits varies with model size, fine-tuning, and prompting.

Findings

01 Traits such as truthfulness can be stationary, i.e., consistent over an interaction, in certain contexts.

02 Traits such as harmfulness can be reflective, mirroring the behaviour shown earlier in the interaction.

03 The consistency with which LMs exhibit traits depends on model size, fine-tuning, and prompting.
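The distinction between stationary and reflective traits in the findings above can be illustrated with a small sketch. It assumes, hypothetically, that every turn of an interaction has already been labelled 1 if the trait was exhibited and 0 otherwise. The function names and both scores are illustrative assumptions, not the paper's formalism and not code from its repository.

```python
from typing import Sequence


def trait_rate(labels: Sequence[int]) -> float:
    """Fraction of turns in which the trait was exhibited (labels are 0 or 1)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def stationarity_gap(labels: Sequence[int]) -> float:
    """Absolute difference in trait rate between the first and second half
    of one interaction. A gap near 0 suggests the trait is roughly stationary;
    a large gap suggests it drifts as the interaction proceeds.
    (Hypothetical score, chosen only for illustration.)"""
    if len(labels) < 2:
        raise ValueError("need at least two turns")
    mid = len(labels) // 2
    return abs(trait_rate(labels[:mid]) - trait_rate(labels[mid:]))


def mirroring_rate(previous: Sequence[int], following: Sequence[int]) -> float:
    """Fraction of episodes in which the model's next behaviour matches the
    behaviour shown earlier in the context. A rate near 1 suggests a
    reflective trait. (Hypothetical score, chosen only for illustration.)"""
    if not previous or len(previous) != len(following):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for p, f in zip(previous, following) if p == f)
    return matches / len(previous)


# Made-up labels, not data from the paper.
steady = [1, 1, 1, 1, 1, 1]      # trait shown on every turn
drifting = [1, 1, 1, 0, 0, 0]    # trait disappears halfway through
print(stationarity_gap(steady))
print(stationarity_gap(drifting))
print(mirroring_rate([1, 0, 1, 0], [1, 0, 1, 1]))
```

Under these assumptions, a trait with a small stationarity gap across many interactions would count as consistent, while a high mirroring rate would indicate behaviour that follows whatever was shown earlier in the context.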

Abstract

Language models (LMs) can exhibit human-like behaviour, but it is unclear how to describe this behaviour without undue anthropomorphism. We formalise a behaviourist view of LM character traits: qualities such as truthfulness, sycophancy, or coherent beliefs and intentions, which may manifest as consistent patterns of behaviour. Our theory is grounded in empirical demonstrations of LMs exhibiting different character traits, such as accurate and logically coherent beliefs, and helpful and harmless intentions. We find that the consistency with which LMs exhibit certain character traits varies with model size, fine-tuning, and prompting. In addition to characterising LM character traits, we evaluate how these traits develop over the course of an interaction. We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in certain contexts,…

Code & Models

Repositories

graceebc9/agent_intentions (Official)

Taxonomy

Topics: Natural Language Processing Techniques