LightAVSeg: Lightweight Audio-Visual Segmentation

Qing Zhong; Guodong Ding; Lingqiao Liu; Zaiwen Feng; Lin Yuanbo Wu; Angela Yao

arXiv:2605.08805 · cs.CV · May 12, 2026

TL;DR

LightAVSeg is a lightweight audio-visual segmentation framework that cuts the computational cost of cross-modal interaction while maintaining accuracy, targeting efficient deployment on resource-constrained devices.

Contribution

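The paper introduces a decoupled interaction design, which splits cross-modal attention into semantic filtering and spatial grounding, and an auxiliary alignment loss that enforces semantic consistency during training without adding inference cost.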
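As a rough illustration of what a training-only alignment loss can look like, here is a minimal NumPy sketch. The formulation is an assumption: it pools visual features under the predicted mask and penalizes low cosine similarity with the audio embedding. The function name `alignment_loss` and its arguments are hypothetical and do not come from the paper, which may define the loss differently.

```python
import numpy as np

def alignment_loss(audio_emb, visual_feat, mask_prob, eps=1e-8):
    """Illustrative alignment loss (assumed form, not the paper's definition).

    audio_emb:   (D,)      audio embedding for one frame
    visual_feat: (D, H, W) visual feature map for the same frame
    mask_prob:   (H, W)    predicted foreground probabilities in [0, 1]
    """
    # Normalize the mask so it acts as a set of pooling weights.
    weights = mask_prob / (mask_prob.sum() + eps)
    # Mask-weighted average of the visual features: shape (D,).
    pooled = (visual_feat * weights[None, :, :]).sum(axis=(1, 2))
    # Cosine similarity between the pooled visual vector and the audio vector.
    denom = np.linalg.norm(pooled) * np.linalg.norm(audio_emb) + eps
    cosine = float(pooled @ audio_emb) / denom
    # Loss is 0 when the two vectors point the same way, 2 when opposite.
    return 1.0 - cosine
```

Because a term like this only enters the training objective, it can be dropped entirely at inference, which is consistent with the zero-overhead claim.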

Findings

01

Achieves 50.4 mIoU on the MS3 benchmark with only 20.5M parameters.

02

Replaces quadratic-cost dense attention with an interaction whose cost scales linearly with spatial resolution, reducing computational cost.

03

Enables real-time inference on mobile processors.
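To make the quadratic-versus-linear claim in the second finding concrete, the sketch below counts multiply-accumulate operations under simplified cost models. Both formulas are assumptions for illustration: dense attention over N tokens of width d is taken as roughly 2·N²·d (scores plus weighted sum), and a per-pixel interaction with a single audio vector as roughly 2·N·d. Neither function name comes from the paper.

```python
def dense_attention_macs(n_tokens: int, dim: int) -> int:
    """Approximate MACs for dense attention over n_tokens (assumed model).

    One n x n score matrix (n^2 * dim) plus one attention-weighted
    sum of values (n^2 * dim).
    """
    return 2 * n_tokens * n_tokens * dim

def linear_interaction_macs(n_tokens: int, dim: int) -> int:
    """Approximate MACs when every token interacts once with one audio vector.

    One channel-wise gating pass (n * dim) plus one per-token dot
    product (n * dim).
    """
    return 2 * n_tokens * dim

def tokens_for(side: int) -> int:
    """Number of spatial tokens in a square side x side feature map."""
    return side * side
```

Doubling the feature-map side length quadruples the token count, so under these models the dense cost grows 16 times while the linear cost grows 4 times.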

Abstract

Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters ~1/7 of…
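The abstract does not spell out the decoupled module, so the following is only a sketch of one way such a design could work. It assumes semantic filtering is an audio-conditioned channel gate and spatial grounding is a per-pixel similarity to the audio vector. The names `decoupled_interaction` and `w_gate` are hypothetical, and the actual LightAVSeg layers may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decoupled_interaction(visual, audio, w_gate):
    """Illustrative decoupled audio-visual interaction (assumed design).

    visual: (D, H, W) visual feature map
    audio:  (D,)      audio embedding
    w_gate: (D, D)    hypothetical projection producing channel gates
    """
    dim = visual.shape[0]
    # Semantic filtering: the audio vector decides which channels to keep.
    # Cost is O(D^2) for the gate plus O(D*H*W) to apply it.
    gate = sigmoid(w_gate @ audio)                   # (D,)
    filtered = visual * gate[:, None, None]          # (D, H, W)
    # Spatial grounding: one dot product per pixel against the audio vector.
    # Cost is O(D*H*W), i.e. linear in the number of pixels.
    scores = np.einsum("dhw,d->hw", filtered, audio) / np.sqrt(dim)
    heat = sigmoid(scores)                           # (H, W)
    # Re-weight the filtered features by the spatial map.
    return filtered * heat[None, :, :], heat
```

No step here builds a pixel-by-pixel or pixel-by-token attention matrix, which is why the cost stays linear in H·W under these assumptions.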
