AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
Sitong Gong, Yunzhi Zhuge, Lu Zhang, Yifan Wang, Pingping Zhang, Lijun Wang, Huchuan Lu

TL;DR
AVS-Mamba is a linear-complexity, multi-modal alternative to Transformers for audio-visual segmentation. It uses new blocks for temporal modeling and cross-modal fusion to improve performance on AVS benchmarks.
Contribution
The paper presents AVS-Mamba, a selective state space model with new modules for efficient multi-scale, multi-modal video understanding in AVS tasks.
Findings
Achieves state-of-the-art results on the AVSBench datasets.
Introduces a linear-complexity model for AVS that outperforms Transformer-based methods.
Demonstrates effective multi-scale and cross-modal fusion techniques.
Abstract
The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, their handling of long-range dependencies struggles due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: Temporal Mamba Block for sequential video processing and Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on this, we develop the Multi-scale Temporal Encoder, aimed at enhancing the learning of visual features across scales, facilitating the perception of intra- and inter-frame…
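The linear-complexity claim can be illustrated with a generic selective state-space recurrence. The following is a minimal sketch of a Mamba-style scan for a single channel with a scalar hidden state, not the AVS-Mamba implementation: the function name `selective_scan` and the weights `a`, `w_delta`, `w_b`, and `w_c` are hypothetical, and the input-dependent step size, input gate, and output gate are reduced to scalar multiples of the input. The point is only that each token is visited once, so cost grows linearly with sequence length rather than quadratically as in full self-attention.

```python
import math


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Illustrative single-channel selective SSM scan (hypothetical helper).

    xs: list of floats (one token per step).
    a: fixed negative state decay; w_delta, w_b, w_c: assumed scalar
    projections that make the step size, input gate and output gate
    depend on the current input (the "selective" part).
    """
    h = 0.0  # hidden state carried across time steps
    ys = []
    for x in xs:  # one pass over the sequence: O(len(xs)) work
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps step size > 0
        a_bar = math.exp(delta * a)                # discretized decay, in (0, 1)
        b_bar = delta * (w_b * x)                  # simplified discretized input gate
        h = a_bar * h + b_bar * x                  # state update
        ys.append((w_c * x) * h)                   # input-dependent readout
    return ys


if __name__ == "__main__":
    print(selective_scan([0.0, 1.0, 0.5, -0.5]))
```

Because the state `h` summarizes everything seen so far, no pairwise token comparison is needed. Practical implementations use vector-valued states, learned projections, and a parallel scan instead of a Python loop.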
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Subtitles and Audiovisual Media
Methods: ADaptive gradient method with the OPTimal convergence rate · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
