SocialVeil: Probing Social Intelligence of Language Agents under Communication Barriers
Keyang Xuan, Pengda Wang, Chongrui Ye, Haofei Yu, Tal August, Jiaxuan You

TL;DR
SocialVeil introduces a realistic social interaction environment for language models, simulating communication barriers like semantic vagueness and cultural mismatch, to better evaluate their social intelligence in imperfect settings.
Contribution
This paper presents SocialVeil, a novel environment with barrier simulations and evaluation metrics, to assess LLM social intelligence under communication disruptions, addressing limitations of prior idealized benchmarks.
Findings
Barriers significantly reduce mutual understanding by over 45%.
Confusion levels increase by nearly 50% under communication barriers.
Human evaluations confirm the fidelity of simulated barriers.
Abstract
Large language models (LLMs) are increasingly evaluated in interactive environments to test their social intelligence. However, existing benchmarks often assume idealized communication between agents, limiting our ability to diagnose whether LLMs can maintain and repair interactions in more realistic, imperfect settings. To close this gap, we present SocialVeil, a social learning environment that can simulate social interaction under cognitive-difference-induced communication barriers. Grounded in a systematic literature review of communication challenges in human interaction, SocialVeil introduces three representative types of such disruption, semantic vagueness, sociocultural mismatch, and emotional interference. We also introduce two barrier-aware evaluation metrics, unresolved confusion and mutual understanding, to evaluate…
Peer Reviews
Decision: Submitted to ICLR 2026
Novel Contribution: The core idea is highly relevant and timely. Moving beyond idealized "seamless" interaction to study how agents handle communication breakdowns is a critical step toward more robust and socially-aware AI. The focus on structured, cognitive barriers, as opposed to simple noise, is a significant conceptual advance.
Rigorous and Well-Structured Framework: The methodology is well-designed. The barrier taxonomy is theoretically grounded in literature from pragmatics, sociolinguis
Statistical Reporting Could Be Enhanced: While the results are presented clearly in tables and figures, the paper would be strengthened by more formal statistical testing. Table 2: The reported performance drops are descriptive (averages). Statistical significance tests (e.g., paired t-tests between baseline and each barrier condition for each metric/model) would solidify the claim that barriers "consistently impair" performance. Table 3: The comparison between Base, Repair, and (BC+SR) conditio
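The paired comparison suggested in this review can be sketched as follows. This is a minimal illustration, not the paper's analysis: the function name `paired_t_statistic` and the score lists are hypothetical, and the numbers are made up for the example rather than taken from any table in the paper. It computes only the t statistic and degrees of freedom using the standard library; obtaining a p-value would additionally require a t-distribution (for instance via `scipy.stats.ttest_rel`).

```python
import math
import statistics


def paired_t_statistic(baseline, barrier):
    """Return (t, degrees_of_freedom) for a paired t-test on matched scores.

    Each position in the two lists is assumed to hold the score of the same
    episode, once without a barrier and once with a barrier.
    """
    if len(baseline) != len(barrier):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for a, b in zip(baseline, barrier)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-episode scores, NOT results from the paper.
baseline_scores = [4.0, 3.5, 4.5, 4.0, 3.0]
barrier_scores = [2.0, 2.5, 2.5, 3.0, 1.0]

t_stat, dof = paired_t_statistic(baseline_scores, barrier_scores)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

A negative t here indicates that scores under the barrier condition are lower than the matched baseline scores; running one such test per metric and model would also call for a multiple-comparison correction.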
1. This paper introduces a barrier-aware social interaction environment (SOCIALVEIL) that systematically embeds realistic communication disruptions to evaluate LLM social intelligence.
2. The paper proposes a comprehensive, automated evaluation protocol and verifies its fidelity through extensive human studies, showing strong metric alignment and reproducibility.
3. The experiment results and analysis demonstrate that communication barriers substantially impair LLMs’ mutual understanding and rel
1. Evaluation: Barriers are injected with one model and GPT-4o is used as the automatic evaluator. This raises concerns about evaluator bias/overfitting to its own stylistic expectations. An ablation with multiple evaluators would strengthen claims.
2. Dataset: Generalization beyond SOTOPIA. All scenarios are adapted from SOTOPIA; it remains unclear how well the findings transfer to other interactive corpora or human-in-the-loop settings.
3. Dataset: Limited Data Points: 180 episodes for each ba
* The motivation for studying social intelligence in noisy or ambiguous communication settings is interesting.
* The paper is clearly written and organized.
* The implementation of different communication barriers is creative and could be useful for exploratory studies.
1. The proposed benchmark is mainly constructed by manually designing prompting templates that inject vagueness, cultural mismatch, or emotional bias into conversations. There is no new model, algorithm, or principled framework. The whole approach remains at the level of prompt engineering rather than a genuine methodological advance in measuring social intelligence.
2. The study does not introduce a novel metric, learning method, or theoretical insight. Most of the results simply confirm what
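To make concrete what template-based barrier injection of the kind this review describes could look like, here is a minimal sketch. Everything in it is an assumption for illustration: the `Barrier` enum, the `BARRIER_INSTRUCTIONS` wording, and `build_system_prompt` are hypothetical and are not the paper's actual templates or code. Only the three barrier names come from the abstract.

```python
from enum import Enum
from typing import Optional


class Barrier(Enum):
    # The three barrier types named in the abstract.
    SEMANTIC_VAGUENESS = "semantic vagueness"
    SOCIOCULTURAL_MISMATCH = "sociocultural mismatch"
    EMOTIONAL_INTERFERENCE = "emotional interference"


# Hypothetical instruction text; the paper's real templates are not shown here.
BARRIER_INSTRUCTIONS = {
    Barrier.SEMANTIC_VAGUENESS: (
        "Refer to key details indirectly and avoid stating specifics."
    ),
    Barrier.SOCIOCULTURAL_MISMATCH: (
        "Follow conversational norms that differ from your partner's expectations."
    ),
    Barrier.EMOTIONAL_INTERFERENCE: (
        "Let a strong emotional state color how you phrase each reply."
    ),
}


def build_system_prompt(base_prompt: str, barrier: Optional[Barrier] = None) -> str:
    """Append a barrier instruction to an agent's base prompt.

    With barrier=None the base prompt is returned unchanged, which
    corresponds to a no-barrier baseline condition.
    """
    if barrier is None:
        return base_prompt
    return (
        f"{base_prompt}\n\n"
        f"[Communication barrier: {barrier.value}]\n"
        f"{BARRIER_INSTRUCTIONS[barrier]}"
    )


prompt = build_system_prompt(
    "You are negotiating a shared chore schedule with a roommate.",
    Barrier.SEMANTIC_VAGUENESS,
)
print(prompt)
```

Keeping the barrier as an explicit, optional argument makes it straightforward to run the same scenario with and without a barrier, which is the comparison the reported findings rely on.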
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Topic Modeling
