Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Kaining Ying; Henghui Ding; Guangquan Jie; Yu-Gang Jiang

arXiv:2507.22886 · cs.CV · August 1, 2025

TL;DR

This paper introduces OmniAVS, a dataset for referring audio-visual segmentation whose expressions combine text, speech, sound, and visual cues, and proposes OISA, a segmentation model that reasons over audiovisual content.

Contribution

The paper presents OmniAVS, a dataset of 2,104 videos and 61,095 referring expressions spanning 8 multimodal expression types, many of which require reasoning and world knowledge, along with OISA, a model for multimodal reasoning-based segmentation.

Findings

01

OISA outperforms existing methods on OmniAVS.

02

OmniAVS includes 8 types of multimodal expressions.

03

OISA achieves competitive results on related tasks.

Abstract

Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and in deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,104 videos and 61,095 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content rather than merely detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained…

Taxonomy

Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Speech and dialogue systems