AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

Zili Wang; Qi Yang; Linsu Shi; Jiazhong Yu; Qinghua Liang; Fei Li; Shiming Xiang

arXiv:2408.01708·cs.CV·August 6, 2024

TL;DR

AVESFormer is a novel, real-time audio-visual segmentation transformer that improves efficiency and performance by addressing attention dissipation and decoder complexity, enabling practical real-time applications.

Contribution

The paper introduces AVESFormer, which the authors describe as the first real-time AVS transformer. It combines an efficient prompt query generator with an ELF decoder to improve both speed and accuracy.

Findings

01 Achieves 79.9% on the S4 dataset

02 Outperforms previous state-of-the-art methods

03 Balances performance with real-time speed

Abstract

Recently, transformer-based models have demonstrated remarkable performance on audio-visual segmentation (AVS) tasks. However, their expensive computational cost makes real-time inference impractical. By characterizing attention maps of the network, we identify two key obstacles in AVS models: 1) attention dissipation, corresponding to the over-concentrated attention weights produced by Softmax within restricted frames, and 2) an inefficient, burdensome transformer decoder, caused by narrow focus patterns in early stages. In this paper, we introduce AVESFormer, the first real-time Audio-Visual Efficient Segmentation transformer that is simultaneously fast, efficient and light-weight. Our model leverages an efficient prompt query generator to correct the behaviour of cross-attention. Additionally, we propose the ELF decoder to bring greater efficiency by facilitating convolutions suitable for…
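The abstract attributes attention dissipation to how Softmax behaves when it normalizes over only a few keys. The sketch below is a generic illustration of that property of Softmax, not the authors' analysis or code: the score values are made up, and `softmax` and `entropy` are hypothetical helper names. It shows that with a single key the attention weight is always 1 regardless of the score, and that with a handful of keys one large score absorbs nearly all of the weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(weights):
    """Shannon entropy in nats; 0 means all weight sits on one key."""
    return -sum(w * math.log(w) for w in weights if w > 0)

# Assumed, illustrative scores of one query against a small set of keys.
single_key = softmax([3.7])                      # one key: weight is 1 whatever the score
few_keys = softmax([8.0, 1.0, 0.5, 0.2, 0.1])    # five keys, one dominant score
uniform = softmax([0.0] * 5)                     # five keys, equal scores (reference case)

print("single key weights:", single_key)
print("few keys weights:  ", [round(w, 4) for w in few_keys])
print("entropy few/uniform:", entropy(few_keys), entropy(uniform))
```

Entropy is used here only as a convenient scalar for how concentrated the weights are; the paper may characterize its attention maps differently.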

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Data Compression Techniques

Methods: Attention Is All You Need · Softmax · Focus