SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
Shuai Huang, Wenxuan Zhao, Jun Gao

TL;DR
SI-Bench is a new benchmark that evaluates the social intelligence of large language models in authentic human conversations, revealing their strengths and limitations in complex social reasoning and dialogue quality.
Contribution
Introduces SI-Bench, a social intelligence benchmark grounded in social science theories and built from real human dialogues, addressing the limitations of earlier datasets built from simulated interactions.
Findings
SOTA models outperform humans in social process reasoning.
Models still lag behind humans in reply quality.
Chain-of-Thought reasoning may impair social dialogue performance.
Abstract
As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, which fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that SOTA models have surpassed the human expert in…
