SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji

TL;DR
SocialOmni is a new benchmark for evaluating audio-visual social interactivity in omni-modal large language models. It focuses on dynamic conversational cues and robustness, and it reveals gaps between perception and interaction capabilities.
Contribution
This work presents SocialOmni, a comprehensive benchmark for assessing social interactivity in omni-modal models, including perception and generation tasks, with a detailed diagnostic set and analysis of current models.
Findings
Significant variance in social-interaction capabilities across models.
Perceptual accuracy does not correlate strongly with interaction quality.
Diagnostics provide actionable insights for improving social interactivity in models.
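One way to probe a claim like the second finding is a rank correlation between per-model perception accuracy and per-model interaction quality. The sketch below is illustrative only: the summary does not say which statistic the authors use, so Spearman's rho is an assumption, and the helper names and all numbers are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# HYPOTHETICAL placeholder scores for five unnamed models.
# These are NOT taken from the paper and imply nothing about its results.
perception_acc = [0.62, 0.71, 0.55, 0.80, 0.68]
interaction_score = [3.1, 2.4, 3.5, 2.9, 2.2]

rho = spearman(perception_acc, interaction_score)
print(f"Spearman rho = {rho:.2f}")
```

A rho near zero, or a negative one, across the evaluated models would be consistent with perception accuracy being a poor predictor of interaction quality. With only a handful of models, any such estimate is noisy and should be read as descriptive rather than conclusive.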
Abstract
Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems
