AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation

Sitong Gong; Yunzhi Zhuge; Lu Zhang; Yifan Wang; Pingping Zhang; Lijun Wang; Huchuan Lu

arXiv:2501.07810 · cs.CV · January 15, 2025

TL;DR

AVS-Mamba is a linear-complexity selective state space model proposed as an alternative to Transformer-based methods for audio-visual segmentation. It introduces dedicated blocks for temporal modeling and cross-modal fusion to improve performance on AVS benchmarks.

Contribution

The paper presents AVS-Mamba, a selective state space model with new modules for efficient multi-scale, multi-modal video understanding in AVS tasks.

Findings

01

Achieves state-of-the-art results on AVSBench datasets.

02

Introduces a linear-complexity model for AVS that outperforms Transformer-based methods.

03

Demonstrates effective multi-scale and cross-modal fusion techniques.

Abstract

The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, their handling of long-range dependencies struggles due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: Temporal Mamba Block for sequential video processing and Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on this, we develop the Multi-scale Temporal Encoder, aimed at enhancing the learning of visual features across scales, facilitating the perception of intra- and inter-frame…
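To make the "linear complexity" claim in the abstract concrete, here is a minimal sketch of a selective state space recurrence in the general style of Mamba. It is not the paper's Temporal Mamba Block or Vision-to-Audio Fusion Block: the function name `selective_scan`, the scalar state, and the weights `a`, `w_delta`, `w_b`, and `w_c` are all hypothetical and chosen only for illustration. The point it shows is that each step updates a fixed-size state from the current input, so the cost grows linearly with sequence length rather than quadratically as in pairwise attention.

```python
import math


def selective_scan(xs, a=-1.0, w_delta=0.5, w_b=1.0, w_c=1.0):
    """Toy 1-D selective SSM: one pass over xs, O(len(xs)) time, O(1) state.

    All parameters are illustrative assumptions, not values from the paper.
    """
    h = 0.0  # hidden state carried across time steps
    ys = []
    for x in xs:
        # Input-dependent step size; softplus keeps it positive.
        delta = math.log1p(math.exp(w_delta * x))
        # Discretized state transition; a < 0 keeps the factor in (0, 1).
        a_bar = math.exp(delta * a)
        # Input-dependent input and output projections (the "selective" part).
        b_t = w_b * x
        c_t = w_c * x
        # Recurrence: decay the old state, then write the current input.
        h = a_bar * h + delta * b_t * x
        ys.append(c_t * h)
    return ys


if __name__ == "__main__":
    outputs = selective_scan([0.2, -0.1, 0.7, 0.0, 0.4])
    print(len(outputs))  # one output per input step
```

Because `delta`, `b_t`, and `c_t` depend on the current input, the model can decide per step how much of the past to keep and how much of the present to write, which is the selectivity that distinguishes this family from fixed-parameter state space models.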

Code & Models

Repositories

sitonggong/avs-mamba
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Subtitles and Audiovisual Media

Methods: ADaptive gradient method with the OPTimal convergence rate · Mamba: Linear-Time Sequence Modeling with Selective State Spaces