Detecting and Steering LLMs' Empathy in Action
Juan P. Cadile

TL;DR
This paper examines how large language models encode empathy-in-action and how they can be steered toward or away from it, revealing architecture-specific differences and the effect of safety training on steering.
Contribution
It introduces a framework for detecting and steering empathy-in-action in LLMs, and shows that empathy encoding emerges independently of safety training and is implemented differently across architectures.
Findings
High detection accuracy across all three models (AUROC 0.996-1.00 at optimal layers).
Empathy encoding emerges independently of safety training: the uncensored Dolphin model matches the safety-trained models on detection.
Steering robustness and coherence differ from model to model.
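The detection finding above can be illustrated with a minimal sketch of a linear probe scored by AUROC. This is not the paper's code: the activations are synthetic Gaussian vectors, the difference-of-means probe is one common way to obtain a linear direction and is an assumption here, and the names `mean_difference_direction` and `auroc` are hypothetical helpers.

```python
import numpy as np

def mean_difference_direction(pos, neg):
    """Unit vector pointing from the mean 'neutral' activation to the mean 'empathic' one."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Synthetic stand-ins for layer activations (assumption: real ones would come from a model).
rng = np.random.default_rng(0)
dim = 64
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
empathic = rng.normal(size=(200, dim)) + 3.0 * true_dir
neutral = rng.normal(size=(200, dim)) - 3.0 * true_dir

# Fit the direction on the first half, evaluate on the held-out second half.
direction = mean_difference_direction(empathic[:100], neutral[:100])
score = auroc(empathic[100:] @ direction, neutral[100:] @ direction)
print(f"held-out AUROC: {score:.3f}")
```

Projecting each activation onto the direction gives a scalar score per example, and the AUROC measures how well that scalar separates the two classes without fixing a threshold.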
Abstract
We investigate empathy-in-action -- the willingness to sacrifice task efficiency to address human needs -- as a linear direction in LLM activation space. Using contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark, we test detection and steering across Phi-3-mini-4k (3.8B), Qwen2.5-7B (safety-trained), and Dolphin-Llama-3.1-8B (uncensored). Detection: All models show AUROC 0.996-1.00 at optimal layers. Uncensored Dolphin matches safety-trained models, demonstrating empathy encoding emerges independent of safety training. Phi-3 probes correlate strongly with EIA behavioral scores (r=0.71, p<0.01). Cross-model probe agreement is limited (Qwen: r=-0.06, Dolphin: r=0.18), revealing architecture-specific implementations despite convergent detection. Steering: Qwen achieves 65.3% success with bidirectional control and coherence at extreme interventions. Phi-3 shows 61.7%…
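Steering along a linear direction, as the abstract describes, is commonly done by adding a scaled copy of the direction to a layer's hidden states. The sketch below assumes that additive form; the summary does not give the paper's actual layers, coefficients, or intervention procedure, and `steer`, the tensor shapes, and the value of `alpha` are hypothetical.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift every hidden state by alpha along the unit-normalized direction.

    Positive alpha pushes toward the concept, negative alpha away from it.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Toy hidden states: 5 token positions, 16 dimensions (assumed shapes).
rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 16))
direction = rng.normal(size=16)
unit = direction / np.linalg.norm(direction)

before = hidden @ unit                        # projection before the intervention
up = steer(hidden, direction, 4.0) @ unit     # steered toward the concept
down = steer(hidden, direction, -4.0) @ unit  # steered away from it
print(np.round(up - before, 6), np.round(down - before, 6))
```

Because the direction is unit-normalized, the projection of each hidden state onto it moves by exactly `alpha`, which is what makes the control bidirectional in this simplified setting.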
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Social Robot Interaction and HRI · Reinforcement Learning in Robotics
