SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation

Yingjian Zhu; Ying Wang; Yuyang Hong; Ruohao Guo; Kun Ding; Xin Gu; Bin Fan; Shiming Xiang

arXiv:2603.01431·cs.CV·March 3, 2026

Open Access

TL;DR

SeaVIS is an online framework for audio-visual instance segmentation that associates sounding instances across continuous video streams by combining audio cues with visual features, enabling real-time processing and more accurate tracking of sounding objects.

Contribution

The paper introduces SeaVIS, the first online AVIS framework, which uses the Causal Cross Attention Fusion (CCAF) module and AGCL to improve instance association and sound-activity encoding in real-time video.

Findings

01 Outperforms existing models on the AVISeg dataset

02 Achieves real-time inference speed

03 Enhances sound-following accuracy

Abstract

Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment, and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, which cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing; the module integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object's sounding and silent states, resulting…
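The causal constraint described for CCAF can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, no learned projections, one audio feature vector per frame, and a simple residual addition for fusion. The names `causal_cross_attention`, `visual_tokens`, `audio_feats`, and `t` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_cross_attention(visual_tokens, audio_feats, t):
    """Fuse visual tokens of frame t with audio features from frames 0..t.

    visual_tokens: list of D-dim vectors (queries) for the current frame.
    audio_feats:   list of D-dim vectors, one per frame (keys and values).
    t:             index of the current frame.

    Assumption: queries, keys, and values are used without learned
    projections, and the fused output is query + attended audio context.
    """
    # Causal constraint: audio from frames after t is never visible.
    history = audio_feats[: t + 1]
    d = len(history[0])
    scale = 1.0 / math.sqrt(d)

    fused = []
    for q in visual_tokens:
        # Scaled dot-product score between the query and each past audio step.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in history]
        weights = softmax(scores)
        # Weighted sum of the audio history (values equal keys in this sketch).
        context = [sum(w * k[j] for w, k in zip(weights, history)) for j in range(d)]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


audio = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
visual = [[0.5, -0.5]]
fused_t1 = causal_cross_attention(visual, audio, 1)
```

In this sketch, changing `audio[2]` has no effect on `fused_t1`, because the slice `audio_feats[: t + 1]` hides every audio step after the current frame. That property is what allows frame-by-frame processing of a stream without waiting for future audio.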

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization