Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
Kaining Ying, Henghui Ding, Guangquan Jie, Yu-Gang Jiang

TL;DR
This paper introduces OmniAVS, a dataset of 2,104 videos and 61,095 multimodal referring expressions for referring audio-visual segmentation, and proposes OISA, a segmentation model built to reason over complex audiovisual content.
Contribution
The paper presents OmniAVS, a novel dataset with diverse multimodal expressions and complex reasoning, along with OISA, a new model for multimodal reasoning-based segmentation.
Findings
OISA outperforms existing methods on OmniAVS.
OmniAVS includes 8 types of multimodal expressions.
OISA achieves competitive results on related tasks.
Abstract
Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,104 videos and 61,095 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained…
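The abstract says the eight expression types combine text, speech, sound, and visual cues, but this summary does not list them. As a rough illustration only, the sketch below enumerates one combination scheme that happens to yield eight types: a language carrier (text or speech), optionally paired with a sound cue and/or an image cue. The scheme, the modality names, and the function name are assumptions made for illustration, not the dataset's documented taxonomy.

```python
from itertools import product

# Hypothetical modality names. The summary only states that expressions
# combine text, speech, sound, and visual cues.
CARRIERS = ("text", "speech")


def enumerate_expression_types():
    """Enumerate carrier x optional sound x optional image combinations.

    This is an assumed scheme: 2 carriers * 2 * 2 = 8 combinations.
    """
    types = []
    for carrier, with_sound, with_image in product(
        CARRIERS, (False, True), (False, True)
    ):
        parts = [carrier]
        if with_sound:
            parts.append("sound")
        if with_image:
            parts.append("image")
        types.append(tuple(parts))
    return types


EXPRESSION_TYPES = enumerate_expression_types()

if __name__ == "__main__":
    for index, parts in enumerate(EXPRESSION_TYPES, start=1):
        print(index, " + ".join(parts))
```

Under this assumed scheme, every expression has exactly one language carrier, and the sound and image cues are optional add-ons. Whether OmniAVS is organized this way should be checked against the paper itself.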
Taxonomy
Topics: Subtitles and Audiovisual Media · Music and Audio Processing · Speech and dialogue systems
