Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, Yapeng Tian

TL;DR
Omni-MMSI introduces a new task for AI to understand social interactions from raw multi-modal data, emphasizing identity attribution and reasoning, and proposes a reference-guided pipeline that outperforms existing models.
Contribution
The paper presents Omni-MMSI-R, a novel reference-guided pipeline for identity attribution and social reasoning from raw multi-modal data, addressing limitations of prior methods.
Findings
Omni-MMSI-R outperforms existing pipelines and multi-modal LLMs on Omni-MMSI.
The authors construct participant-level reference pairs and curate reasoning annotations.
The pipeline improves social interaction understanding from raw data.
Abstract
We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social…
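To make "identity-attributed social cues" concrete, here is a minimal sketch of how a "who is speaking what" record could be represented. It is an illustration only, not the paper's data format or pipeline: the `Utterance` class, its field names, the `build_transcript` helper, and the participant labels `P1` and `P2` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    """One identity-attributed cue: who spoke, what was said, and when."""
    speaker_id: str   # hypothetical participant label, e.g. "P1"
    text: str         # transcribed speech
    start_s: float    # start time in seconds
    end_s: float      # end time in seconds


def build_transcript(utterances: List[Utterance]) -> str:
    """Render utterances in time order as 'speaker: text' lines."""
    ordered = sorted(utterances, key=lambda u: u.start_s)
    return "\n".join(f"{u.speaker_id}: {u.text}" for u in ordered)


# Toy example data (invented for illustration, not taken from the dataset).
cues = [
    Utterance("P2", "Who, me?", 2.5, 3.1),
    Utterance("P1", "I think you are bluffing.", 0.0, 2.0),
]
print(build_transcript(cues))
```

A reasoning step such as resolving whom "you" refers to would consume a transcript like this one, which is why an error in the `speaker_id` field would carry over into the downstream answer.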
