LightAVSeg: Lightweight Audio-Visual Segmentation
Qing Zhong, Guodong Ding, Lingqiao Liu, Zaiwen Feng, Lin Yuanbo Wu, Angela Yao

TL;DR
LightAVSeg is a lightweight audio-visual segmentation framework that reduces computational complexity while maintaining high accuracy, enabling efficient deployment on resource-constrained devices.
Contribution
The paper introduces a decoupled attention design and an auxiliary alignment loss to improve efficiency and semantic consistency in AVS models.
Findings
Achieves 50.4 mIoU on the MS3 benchmark with only 20.5M parameters.
Replaces quadratic-cost attention with an interaction whose cost scales linearly with spatial resolution, reducing computational cost.
Enables real-time inference on mobile processors.
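To make the quadratic-versus-linear claim concrete, the sketch below counts multiply-accumulate operations under a deliberately simple cost model. It is not the paper's actual module: the function names are hypothetical, the dense case assumes an attention map over all N = H x W spatial positions, and the linear case assumes each pixel interacts with a single pooled audio vector (one channel-wise gate plus one per-pixel dot product).

```python
# Illustrative cost model only. All names are hypothetical and the
# operation counts are assumptions, not figures from the paper.

def dense_attention_cost(num_pixels: int, channels: int) -> int:
    """MACs for an N x N attention map: Q @ K^T plus the weighted sum over V."""
    return 2 * num_pixels * num_pixels * channels


def linear_interaction_cost(num_pixels: int, channels: int) -> int:
    """MACs when every pixel meets one pooled audio vector:
    a channel-wise gate (N * C) plus a per-pixel dot product (N * C)."""
    return 2 * num_pixels * channels


def cost_ratio(height: int, width: int, channels: int) -> float:
    """How many times more expensive the dense variant is at this resolution."""
    n = height * width
    return dense_attention_cost(n, channels) / linear_interaction_cost(n, channels)


# Under this model the ratio equals the number of pixels N,
# so the gap widens as the feature map grows.
ratio_small = cost_ratio(28, 28, 64)   # 784.0
ratio_large = cost_ratio(56, 56, 64)   # 3136.0
```

Under these assumptions, doubling the feature-map side length multiplies the dense cost by 16 but the linear cost by only 4, which is why the interaction module, rather than the backbone, can dominate at higher resolutions.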
Abstract
Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters ~1/7 of…
