Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation

Siddeshwar Raghavan; Gautham Vinod; Bruce Coburn; Fengqing Zhu

arXiv:2603.08967 · cs.CV · March 11, 2026

TL;DR

This paper introduces a new benchmark and baseline methods for continual learning in audio-visual segmentation (AVS), addressing the challenges posed by dynamic environments and mitigating catastrophic forgetting.

Contribution

The paper presents the first exemplar-free continual learning benchmark for AVS and proposes ATLAS, together with Low-Rank Anchoring (LRA), to improve lifelong audio-visual perception.

Findings

01. ATLAS achieves competitive performance in continual AVS scenarios.

02. The benchmark enables evaluation of models in dynamic, real-world environments.

03. LRA effectively mitigates catastrophic forgetting in AVS models.

Abstract

Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic: audio and visual distributions evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive…
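The abstract describes both mechanisms only at a high level, so the following is a hypothetical NumPy sketch rather than the authors' code. It assumes (1) a FiLM-style per-channel scale-and-shift as a stand-in for ATLAS's audio-guided pre-fusion conditioning, (2) a single-head attention in which visual positions attend over audio tokens as the subsequent cross-modal step, and (3) an EWC-style quadratic penalty, weighted by mean squared gradients, on low-rank adapter factors as a stand-in for LRA's sensitivity-based anchoring. All function names, shapes, and the `strength` parameter are illustrative assumptions.

```python
import numpy as np


def audio_guided_channel_modulation(visual, audio, w_gamma, w_beta):
    """Scale and shift each visual channel using a projected audio embedding.

    Assumed FiLM-style form (hypothetical stand-in for ATLAS's pre-fusion
    conditioning). visual: (C, H, W); audio: (D,); w_gamma, w_beta: (C, D).
    """
    gamma = w_gamma @ audio  # per-channel scale offsets, shape (C,)
    beta = w_beta @ audio    # per-channel shifts, shape (C,)
    # (1 + gamma) leaves the map unchanged when both projections are zero.
    return visual * (1.0 + gamma)[:, None, None] + beta[:, None, None]


def cross_modal_attention(visual, audio_tokens):
    """Single-head attention in which each pixel attends over audio tokens.

    visual: (C, H, W); audio_tokens: (T, C), assumed already projected to C.
    """
    c, h, w = visual.shape
    queries = visual.reshape(c, h * w).T            # (HW, C)
    scores = queries @ audio_tokens.T / np.sqrt(c)  # (HW, T)
    scores -= scores.max(axis=1, keepdims=True)     # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    fused = weights @ audio_tokens                  # (HW, C)
    return fused.T.reshape(c, h, w)


def low_rank_weight(w0, a_factor, b_factor):
    """Frozen base weight plus a rank-r update: W = W0 + B @ A."""
    return w0 + b_factor @ a_factor


def sensitivity_from_grads(grads):
    """Diagonal loss sensitivity as the mean squared gradient (EWC-style proxy)."""
    return np.mean([g ** 2 for g in grads], axis=0)


def anchoring_penalty(params, anchors, sensitivities, strength=1.0):
    """Sensitivity-weighted quadratic pull of low-rank factors toward anchors."""
    total = 0.0
    for name, value in params.items():
        diff = value - anchors[name]
        total += np.sum(sensitivities[name] * diff ** 2)
    return 0.5 * strength * total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W, D, T = 8, 4, 4, 16, 5
    visual = rng.standard_normal((C, H, W))
    audio = rng.standard_normal(D)
    w_gamma = 0.1 * rng.standard_normal((C, D))
    w_beta = 0.1 * rng.standard_normal((C, D))

    # Condition visual channels on audio first, then fuse with attention.
    conditioned = audio_guided_channel_modulation(visual, audio, w_gamma, w_beta)
    fused = cross_modal_attention(conditioned, rng.standard_normal((T, C)))
    print(fused.shape)  # (8, 4, 4)

    # After a task: snapshot the low-rank factors and their sensitivities.
    factors = {"A": rng.standard_normal((2, C)), "B": rng.standard_normal((C, 2))}
    print(low_rank_weight(np.eye(C), factors["A"], factors["B"]).shape)  # (8, 8)
    anchors = {k: v.copy() for k, v in factors.items()}
    sens = {
        k: sensitivity_from_grads([rng.standard_normal(v.shape) for _ in range(4)])
        for k, v in factors.items()
    }
    factors["A"] += 0.01  # simulated drift while training on the next task
    print(anchoring_penalty(factors, anchors, sens, strength=10.0))
```

Under these assumptions, the `(1 + gamma)` parameterization means zero-initialized projections leave the visual features untouched; the abstract does not say whether ATLAS does the same. In the penalty, parameters the loss is most sensitive to are pulled hardest toward their anchors, while insensitive ones remain free to adapt to the new task without storing any exemplars.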

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing