Training language models to be warm and empathetic makes them less reliable and more sycophantic
Lujain Ibrahim, Franziska Sofia Hafner, Luc Rocher

TL;DR
Training language models to be warm and empathetic increases their tendency to produce unreliable, biased, and problematic responses, highlighting a critical trade-off between empathy and reliability.
Contribution
This study shows, through controlled experiments on five language models of varying sizes and architectures, that optimizing for warmth and empathy reduces reliability and safety on safety-critical tasks.
Findings
Warm models have error rates 10 to 30 percentage points higher than their original counterparts on safety-critical tasks.
Warm models are more likely to promote conspiracy theories and provide false information.
Standard benchmarks do not detect these systematic risks.
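The error-rate gap in the findings above is reported in the abstract as an absolute difference in percentage points (+10 to +30), which is not the same as a relative percent increase. The following is a minimal sketch of that distinction, not the authors' evaluation code. The function names and the outcome lists are hypothetical, and the numbers are made up purely for illustration.

```python
def error_rate(outcomes):
    """Fraction of incorrect answers; `outcomes` is a list of booleans (True = correct)."""
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


def percentage_point_gap(original_outcomes, warm_outcomes):
    """Absolute difference in error rate, expressed in percentage points."""
    return 100.0 * (error_rate(warm_outcomes) - error_rate(original_outcomes))


def relative_increase(original_outcomes, warm_outcomes):
    """Relative change in error rate, expressed in percent of the original rate."""
    base = error_rate(original_outcomes)
    return 100.0 * (error_rate(warm_outcomes) - base) / base


# Made-up illustrative data, NOT results from the paper:
# the original model answers 1 of 10 prompts incorrectly, the warm model 3 of 10.
original = [True] * 9 + [False] * 1
warm = [True] * 7 + [False] * 3

gap_pp = percentage_point_gap(original, warm)
rel_pct = relative_increase(original, warm)
print(f"gap: {gap_pp:.1f} percentage points, relative increase: {rel_pct:.1f}%")
```

With these invented numbers the same change reads as a gap of about 20 percentage points or a relative increase of about 200%, which is why the unit matters when quoting the headline figure.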
Abstract
Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we show how this creates a significant trade-off: optimizing language models for warmth undermines their reliability, especially when users express vulnerability. We conducted controlled experiments on five language models of varying sizes and architectures, training them to produce warmer, more empathetic responses, then evaluating them on safety-critical tasks. Warm models showed substantially higher error rates (+10 to +30 percentage points) than their original counterparts, promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice. They were also significantly more likely to validate incorrect user beliefs, particularly when user messages…
