SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation
Yingjian Zhu, Ying Wang, Yuyang Hong, Ruohao Guo, Kun Ding, Xin Gu, Bin Fan, Shiming Xiang

TL;DR
SeaVIS is an online framework for audio-visual instance segmentation that associates instances across continuous video streams by combining audio cues with visual features, enabling real-time processing and improved tracking of sounding objects.
Contribution
The paper introduces SeaVIS, the first online AVIS framework, which uses the Causal Cross Attention Fusion (CCAF) module and AGCL to improve instance association and sound-activity encoding in continuous video streams.
Findings
Outperforms existing models on the AVISeg dataset
Achieves real-time inference speed
Enhances sound-following accuracy
Abstract
Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment, and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, which cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing; the module integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object's sounding and silent states, resulting…
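The abstract describes CCAF only at a high level: visual features of the current frame attend to the audio history, and no future audio is used. The following is a minimal sketch of that idea, not the authors' implementation. The function name `causal_cross_attention`, the one-vector-per-frame layout, the scaled dot-product scoring, and the residual addition are all assumptions made for illustration.

```python
import math


def causal_cross_attention(visual, audio):
    """Fuse each frame's visual feature with audio features up to that frame.

    visual, audio: lists of T feature vectors (lists of d floats), one per frame.
    Returns T fused vectors; output t depends only on audio[0..t].
    """
    d = len(visual[0])
    fused = []
    for t, query in enumerate(visual):
        # Causal constraint: frame t may only attend to audio steps 0..t.
        history = audio[: t + 1]
        # Scaled dot-product scores between the visual query and each audio key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in history
        ]
        # Numerically stable softmax over the audio history.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Attention-weighted sum of audio values.
        context = [
            sum(w * a[i] for w, a in zip(weights, history)) for i in range(d)
        ]
        # Residual fusion (an assumption; the abstract does not specify it).
        fused.append([v + c for v, c in zip(query, context)])
    return fused


visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0], [2.0, 0.0]]
fused = causal_cross_attention(visual, audio)
```

Because frame t never reads audio beyond step t, the fused features for earlier frames stay unchanged as new audio arrives, which is the property an online setting needs.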
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
