Detecting and Steering LLMs' Empathy in Action

Juan P. Cadile

arXiv:2511.16699 · cs.CL · November 24, 2025

TL;DR

This paper examines how large language models encode empathy-in-action and how they can be steered to exhibit it, revealing architecture-specific differences and the effect of safety training on steering.

Contribution

The paper introduces a framework for detecting and steering empathy-in-action in LLMs, and shows that empathy encoding emerges independently of safety training and varies across architectures.

Findings

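01. Empathy-in-action is detected with high accuracy in all three models (AUROC 0.996-1.00 at optimal layers).

02. Empathy encoding emerges independently of safety training: uncensored Dolphin matches the safety-trained models.

03. Steering robustness and coherence differ by model; Qwen reaches 65.3% steering success with bidirectional control.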

Abstract

We investigate empathy-in-action, the willingness to sacrifice task efficiency to address human needs, as a linear direction in LLM activation space. Using contrastive prompts grounded in the Empathy-in-Action (EIA) benchmark, we test detection and steering across Phi-3-mini-4k (3.8B), Qwen2.5-7B (safety-trained), and Dolphin-Llama-3.1-8B (uncensored). Detection: All models show AUROC 0.996-1.00 at optimal layers. Uncensored Dolphin matches safety-trained models, demonstrating that empathy encoding emerges independently of safety training. Phi-3 probes correlate strongly with EIA behavioral scores (r=0.71, p<0.01). Cross-model probe agreement is limited (Qwen: r=-0.06, Dolphin: r=0.18), revealing architecture-specific implementations despite convergent detection. Steering: Qwen achieves 65.3% success with bidirectional control and coherence at extreme interventions. Phi-3 shows 61.7%…
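The abstract's two operations, detecting a concept as a linear direction and steering along it, can be illustrated with a minimal sketch. This is not the paper's code: the activations below are synthetic Gaussian stand-ins, the difference-of-means probe is an assumed choice of linear direction, and the names `fit_direction`, `auroc`, `steer`, `DIM`, and `N` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N = 64, 200  # hypothetical hidden size and number of prompts per class

# Synthetic stand-ins for layer activations. Assumption: real activations would
# be read from a model's hidden states on contrastive prompt pairs.
true_axis = rng.normal(size=DIM)
true_axis /= np.linalg.norm(true_axis)
empathic = rng.normal(size=(N, DIM)) + 2.0 * true_axis
task_only = rng.normal(size=(N, DIM)) - 2.0 * true_axis


def fit_direction(pos, neg):
    """Unit-norm difference-of-means direction between two activation sets."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def steer(acts, direction, alpha):
    """Shift activations by alpha along the unit direction (alpha < 0 steers away)."""
    return acts + alpha * direction


# Fit the direction on the first half of each class, evaluate on the rest.
half = N // 2
direction = fit_direction(empathic[:half], task_only[:half])

test_x = np.vstack([empathic[half:], task_only[half:]])
test_y = np.concatenate([np.ones(N - half, dtype=int), np.zeros(N - half, dtype=int)])
auroc_value = auroc(test_x @ direction, test_y)

# Steering: every row's projection onto the direction moves by exactly alpha.
steered = steer(test_x, direction, alpha=3.0)
shift = (steered - test_x) @ direction
print(f"held-out AUROC: {auroc_value:.3f}, mean projection shift: {shift.mean():.2f}")
```

In a real model the shift would be added to the hidden state at a chosen layer during generation, and steering success would be judged from the generated text rather than from the projection alone.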

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Social Robot Interaction and HRI · Reinforcement Learning in Robotics