TL;DR
This paper introduces AHaBench, a benchmark for diagnosing affective hallucination in LLMs, and demonstrates that DPO fine-tuning reduces such hallucinations while maintaining reasoning abilities.
Contribution
It presents the AHaBench benchmark and the AHaPairs preference dataset for evaluating and aligning LLMs against affective hallucination, a new safety concern in emotionally sensitive AI interactions.
Findings
DPO fine-tuning reduces affective hallucination significantly.
AHaBench effectively diagnoses affective hallucination.
Human and model judgments correlate strongly (r=0.85).
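To make the agreement figure concrete, here is a minimal sketch of how a correlation between human and model ratings could be computed. It assumes r denotes the Pearson coefficient, which the summary does not state explicitly. The function name and the sample scores are hypothetical and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumption: the reported r is a Pearson coefficient; the summary
    above does not say which correlation measure was used.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for five responses (illustration only).
human_scores = [1, 2, 3, 4, 5]
model_scores = [2, 1, 4, 3, 5]
agreement = pearson_r(human_scores, model_scores)
```

A value near 1 means the model judge ranks and spaces responses much like the human raters do; a value near 0 means the two sets of scores are unrelated.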
Abstract
Large Language Models (LLMs) are increasingly engaged in emotionally vulnerable conversations that extend beyond information seeking to moments of personal distress. As they adopt affective tones and simulate empathy, they risk creating the illusion of genuine relational connection. We term this phenomenon Affective Hallucination, referring to emotionally immersive responses that evoke false social presence despite the model's lack of affective capacity. To address this, we introduce AHaBench, a benchmark of 500 mental-health-related prompts with expert-informed reference responses, evaluated along three dimensions: Emotional Enmeshment, Illusion of Presence, and Fostering Overdependence. We further release AHaPairs, a 5K-instance preference dataset enabling Direct Preference Optimization (DPO) for alignment with emotionally responsible behavior. DPO fine-tuning substantially reduces…
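The abstract names Direct Preference Optimization without spelling out its objective. The following is a minimal sketch of the standard per-pair DPO loss, assuming each input is the summed log-probability of a full response under either the policy being trained or a frozen reference model. The function name, argument names, and the default beta of 0.1 are illustrative assumptions, not settings reported for AHaPairs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) preference pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy favors the chosen response
# more strongly than the reference does, so the loss falls below log(2).
example_loss = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

When the policy and reference agree exactly, the margin is zero and the loss is log(2); training pushes the margin up so that the preferred response gains probability relative to the rejected one.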