Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning
Nicola Milano, Davide Marocco

TL;DR
This paper demonstrates that fine-tuning large language models on synthetic behavioral datasets can induce stable, systematic behavioral patterns that influence their language generation and interpretative tendencies.
Contribution
The study introduces a behavioral induction framework for LLMs, showing how structured fine-tuning can produce specific, dissociable behavioral biases in language models.
Findings
Fine-tuned models assign higher probability to negative and threat-related interpretations in open-ended language tasks.
Behavioral biases generalize beyond the training contexts and are detectable with multiple evaluation methods.
Different behavioral patterns lead to distinct response tendencies.
Abstract
Large language models are increasingly used as computational tools for modeling human-like behavior. We introduce a behavioral induction framework that modifies model policies through fine-tuning on structured decision-making tasks: using synthetic datasets inspired by maladaptive behavioral patterns, including depression and paranoia, we train transformer-based language models to consistently select specific classes of actions across diverse contexts. We then test whether this behavioral optimization produces systematic changes in generative distributions. Across two architectures, fine-tuned models show stable, context-general shifts in next-token probability distributions, including increased probability assigned to negative and threat-related interpretations in open-ended language tasks. These effects generalize beyond training contexts and are detectable in qualitative…
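One way to picture the "shift in next-token probability distributions" described in the abstract is to compare how much probability a base model and a fine-tuned model place on a set of threat-related tokens at the same position. The sketch below is illustrative only and is not the authors' evaluation code: the four-word vocabulary, the `threat_tokens` set, and both logit vectors are made-up toy values, and the KL divergence is one common summary statistic that the paper may or may not use.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def category_mass(probs, vocab, category):
    """Total probability assigned to tokens that belong to `category`."""
    return sum(p for tok, p in zip(vocab, probs) if tok in category)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is nonzero wherever p is nonzero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy vocabulary and category (assumptions, not from the paper).
vocab = ["friendly", "neutral", "hostile", "dangerous"]
threat_tokens = {"hostile", "dangerous"}

# Made-up next-token logits for the same prompt under each model.
base_logits = [1.0, 1.0, 0.0, 0.0]
tuned_logits = [0.0, 0.0, 1.0, 1.0]

base_probs = softmax(base_logits)
tuned_probs = softmax(tuned_logits)

base_mass = category_mass(base_probs, vocab, threat_tokens)
tuned_mass = category_mass(tuned_probs, vocab, threat_tokens)
shift = tuned_mass - base_mass          # change in threat-related probability mass
divergence = kl_divergence(tuned_probs, base_probs)

print(f"threat mass: base={base_mass:.3f}, tuned={tuned_mass:.3f}, shift={shift:.3f}")
print(f"KL(tuned || base) = {divergence:.3f} nats")
```

With real models, the logit vectors would come from a forward pass of each model on the same prompt, and the category set would need to be defined over the tokenizer's actual vocabulary.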
