SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Tianyu Xie; Jinfa Huang; Yuexiao Ma; Rongfang Luo; Yan Yang; Wang Chen; Yuhui Zeng; Ruize Fang; Yixuan Zou; Xiawu Zheng; Jiebo Luo; Rongrong Ji

arXiv:2603.16859·cs.AI·March 18, 2026

TL;DR

SocialOmni is a benchmark for evaluating audio-visual social interactivity in omni-modal large language models. It focuses on dynamic conversational cues and robustness, and its results reveal gaps between perception and interaction capabilities.

Contribution

This work presents SocialOmni, a benchmark for assessing social interactivity in omni-modal models. It covers both perception and generation tasks, includes a detailed diagnostic set, and analyzes current models.

Findings

01. Significant variance in social-interaction capabilities across models.

02. Perceptual accuracy does not correlate strongly with interaction quality.

03. Diagnostics provide actionable insights for improving social interactivity in models.
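One way a reader might check a claim like finding 02 on their own evaluation results is a rank correlation between per-model perception accuracy and per-model interaction quality. The sketch below computes Spearman's rho in plain Python. The function names and the score lists are hypothetical placeholders, not values or methods from the paper, and the listing does not say which correlation statistic the authors used.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for five models (NOT results from the paper).
perception_accuracy = [0.62, 0.71, 0.55, 0.80, 0.67]
interaction_quality = [3.1, 2.4, 3.0, 2.9, 3.6]

rho = spearman(perception_accuracy, interaction_quality)
print(f"Spearman rho = {rho:.2f}")
```

A rho near zero or negative would indicate that models ranking high on perception do not necessarily rank high on interaction. With only a handful of models, any such estimate is noisy and should be read with caution.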

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled…

Code & Models

Datasets

alexisty/SocialOmni
dataset · 1.5k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems