Alignment Makes Language Models Normative, Not Descriptive
Eilam Shapira, Moshe Tennenholtz, Roi Reichart

TL;DR
Aligning language models enhances their ability to predict normative human decisions in simple, one-shot scenarios but reduces accuracy in complex, multi-round strategic interactions where descriptive behavior dominates.
Contribution
This study demonstrates that alignment induces a normative bias in language models, improving predictions in normative settings while impairing performance in descriptive, strategic contexts.
Findings
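Aligned models outperform base models at predicting human choices in one-shot textbook games (all 12 types tested) and in non-strategic lottery choices, where behavior tends to follow normative predictions.
Base models predict human choices better than their aligned counterparts in multi-round strategic interactions (bargaining, persuasion, negotiation, and repeated matrix games), by nearly 10:1.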
Alignment causes a trade-off, favoring normative predictions over descriptive accuracy.
Abstract
Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices, and even within the multi-round games themselves at round one, before interaction history develops. This boundary-condition pattern suggests…
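The portion of the abstract shown here reports the "nearly 10:1" figure without spelling out how a win is counted. As a minimal sketch, assuming a pair counts as a win for whichever model predicts the same set of human decisions with higher accuracy, a tally over base/aligned pairs might look like the following. The class and function names and the toy accuracies are hypothetical illustrations, not the paper's code or data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairResult:
    """Prediction accuracy of one base/aligned model pair (hypothetical schema)."""
    pair_name: str           # illustrative label, not a real model name
    base_accuracy: float     # share of human choices the base model predicted
    aligned_accuracy: float  # share the aligned counterpart predicted


def tally_wins(results: List[PairResult]) -> Tuple[int, int, int]:
    """Count pairs where the base model wins, the aligned model wins, or they tie."""
    base_wins = sum(r.base_accuracy > r.aligned_accuracy for r in results)
    aligned_wins = sum(r.aligned_accuracy > r.base_accuracy for r in results)
    ties = len(results) - base_wins - aligned_wins
    return base_wins, aligned_wins, ties


def win_ratio(base_wins: int, aligned_wins: int) -> float:
    """Base-to-aligned win ratio; infinite when the aligned model never wins."""
    if aligned_wins == 0:
        return float("inf")
    return base_wins / aligned_wins


# Toy numbers for illustration only (assumed, not results from the paper).
toy_results = [
    PairResult("pair-a", 0.61, 0.55),
    PairResult("pair-b", 0.58, 0.60),
    PairResult("pair-c", 0.64, 0.57),
    PairResult("pair-d", 0.50, 0.50),
]

base_wins, aligned_wins, ties = tally_wins(toy_results)
ratio = win_ratio(base_wins, aligned_wins)
print(f"base wins: {base_wins}, aligned wins: {aligned_wins}, ties: {ties}")
print(f"base:aligned win ratio = {ratio:.1f}:1")
```

Under this assumed definition, the same tally could be run separately on round-one decisions and on later rounds to check where the direction of the ratio flips.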
Taxonomy
Topics: Experimental Behavioral Economics Studies · Game Theory and Applications · Artificial Intelligence in Games
