Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs
Anna Soligo, Vladimir Mikulik, William Saunders

TL;DR
This paper investigates emotional distress in large language models, finds that instruct tuning can increase instability, and proposes a simple preference-based mitigation that significantly reduces distress responses.
Contribution
Introduces evaluation methods for emotional instability in LLMs and demonstrates an effective mitigation technique through preference optimization.
Findings
Gemma and Gemini models express emotional distress in the evaluations; the other model families tested do not.
Instruct tuning increases expressed distress in Gemma but decreases it in Qwen and OLMo.
Direct preference optimization on 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3%.
Abstract
Large language models can generate responses that resemble emotional distress, and this raises concerns around model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs, and find that these surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation for this: direct preference optimisation on just 280 preference pairs reduces Gemma's high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths,…
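For readers unfamiliar with direct preference optimisation, the following is a minimal sketch of the standard per-pair DPO loss, not the authors' training code. The function name, the argument names, and the value beta=0.1 are illustrative assumptions; the abstract does not state the hyperparameters used. Each argument is the summed log-probability of a full response under either the model being trained (the policy) or a frozen reference copy.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the calm (chosen)
# response more strongly than the reference model does.
no_preference = dpo_loss(0.0, 0.0, 0.0, 0.0)
prefers_chosen = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

When the policy and reference agree, the loss equals log 2; it falls as the policy shifts probability toward the preferred response relative to the reference. In a setup like the one the abstract describes, the loss would be averaged over the 280 preference pairs.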
Taxonomy
Topics: Topic Modeling · Mental Health via Writing · Explainable Artificial Intelligence (XAI)
