AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation
Zili Wang, Qi Yang, Linsu Shi, Jiazhong Yu, Qinghua Liang, Fei Li and, Shiming Xiang

TL;DR
AVESFormer is a novel, real-time audio-visual segmentation transformer that improves efficiency and performance by addressing attention dissipation and decoder complexity, enabling practical real-time applications.
Contribution
It introduces AVESFormer, the first real-time AVS transformer with an efficient prompt query generator and ELF decoder for enhanced speed and accuracy.
Findings
Achieves 79.9% on S4 dataset
Outperforms previous state-of-the-art methods
Balances performance with real-time speed
Abstract
Recently, transformer-based models have demonstrated remarkable performance on audio-visual segmentation (AVS) tasks. However, their expensive computational cost makes real-time inference impractical. By characterizing attention maps of the network, we identify two key obstacles in AVS models: 1) attention dissipation, corresponding to the over-concentrated attention weights by Softmax within restricted frames, and 2) inefficient, burdensome transformer decoder, caused by narrow focus patterns in early stages. In this paper, we introduce AVESFormer, the first real-time Audio-Visual Efficient Segmentation transformer that achieves fast, efficient and light-weight simultaneously. Our model leverages an efficient prompt query generator to correct the behaviour of cross-attention. Additionally, we propose ELF decoder to bring greater efficiency by facilitating convolutions suitable for…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Music and Audio Processing · Advanced Data Compression Techniques
MethodsAttention Is All You Need · Softmax · Focus
