AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence
Kate M. Lubrano, Faisal Sayed, Ankita Rathod, Akshansh, Craver Corbyn Thomas-Smith, Mark E. Whiting, Karina Nguyen

TL;DR
AttuneBench is a new benchmark for evaluating emotional intelligence in large language models through real multi-turn conversations with detailed annotations.
Contribution
It introduces a framework for assessing multiple aspects of emotional intelligence in LLMs using genuine multi-turn interactions and turn-by-turn annotations.
Findings
Model rankings vary across emotion recognition and response quality tasks.
Preference prediction and response quality separate models more sharply than emotion-label accuracy does.
Emotionally intelligent behavior involves predicting user-specific responses in context.
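One way to quantify how much model rankings differ between two tasks is a rank correlation. The sketch below is illustrative only and is not taken from the paper: the model names and scores are made-up placeholders, the helper names `ranks` and `spearman` are hypothetical, and it assumes no tied scores.

```python
def ranks(scores):
    """Map each model to its rank (1 = best) by descending score; assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score dicts over the same models (no ties)."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    # Sum of squared rank differences across models.
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Placeholder numbers for illustration only, NOT results from the paper.
emotion_recognition = {"model_a": 0.71, "model_b": 0.68, "model_c": 0.64, "model_d": 0.60}
response_quality = {"model_a": 0.55, "model_b": 0.74, "model_c": 0.81, "model_d": 0.62}

rho = spearman(emotion_recognition, response_quality)
print(f"Spearman rho between task rankings: {rho:.2f}")
```

A value near 1 would mean the two tasks order the models the same way; values near 0 or below indicate that a model's standing on one task says little about its standing on the other.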
Abstract
Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to human communication, and increasingly important to assess as LLMs assume conversational roles in everyday life. Existing EI benchmarks rely on synthetic prompts, single-turn cases, or third-party annotation. These approaches do not directly measure how models infer and respond to a participant's emotional state over the course of a real conversation. We introduce AttuneBench, a benchmark grounded in 200 genuine multi-turn human-model conversations in which participants conversed with anonymized LLMs and provided turn-by-turn annotations of their emotional state, the model's behavior, and their preferred responses. Across 11 evaluated models, we find that model rankings on emotion recognition, behavioral classification, preference prediction, and judged…
