Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents
Deeraj S K, Sadhana Devarajan, Krishna Mehra, Sudhakar Mishra

TL;DR
This paper evaluates the robustness of RL-trained empathetic language models against adversarial emotional interactions, introducing a new benchmark and score to measure their resilience.
Contribution
It constructs the Adversarial Empathy Benchmark (AEB) and the Emotional Consistency Score (ECS) to assess empathetic robustness under adversarial conditions.
Findings
RLVER-PPO-Think outperforms baseline models in empathetic response quality.
Training improves emotional responsiveness but not internal state tracking.
ECS remains nearly flat, indicating a dissociation between behavior and internal understanding.
Abstract
Reinforcement learning from verifiable emotion rewards (RLVER) has produced language models with strong empathetic performance, evaluated on benchmarks that assume cooperative, honest users. Yet real emotional interactions systematically violate this assumption: users gaslight, escalate, and pressure AI systems for unconditional validation, dynamics that cooperative benchmarks cannot surface. We construct the Adversarial Empathy Benchmark (AEB) and introduce the Emotional Consistency Score (ECS) to evaluate empathetic robustness under adversarial conditions. AEB comprises six psychologically grounded adversarial trajectory types with discriminative reward structures that penalize formulaic responses; ECS formally disentangles a model's capacity to track user emotional states from its capacity to improve them. In a controlled experiment across eight scenario-matched conditions (think and…
