PACE: Pretrained Audio Continual Learning

Chang Li; Kanglei Zhou; Liyuan Wang

arXiv:2602.03355·cs.SD·February 4, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces PACE, a method for audio continual learning with pretrained models. It addresses challenges specific to audio, such as spectral focus and representation drift, and outperforms state-of-the-art baselines on six audio CL benchmarks.

Contribution

The work presents the first systematic benchmark for audio CL with PTMs, analyzes unique challenges, and proposes PACE, a new method combining regularized classifiers and adaptive PEFT for better stability and semantic alignment.

Findings

01. PACE outperforms state-of-the-art baselines on six audio CL benchmarks.

02. Analytic classifiers with FSA are promising but face limitations like saturation and drift.

03. Spectrogram-based perturbations improve stability and reduce representation overlap.
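To make the third finding more concrete, here is a minimal sketch of one kind of spectrogram-level perturbation. The summary does not say which perturbations PACE uses, so this swaps in SpecAugment-style frequency and time masking as a stand-in. The function name `mask_spectrogram` and its parameters are hypothetical, and the input is assumed to be a 2-D array of shape (frequency bins, time frames).

```python
import numpy as np

def mask_spectrogram(spec, rng, max_freq_width=8, max_time_width=16):
    """Return a perturbed copy of `spec` (freq_bins x time_frames).

    Assumption: SpecAugment-style masking, not necessarily the
    perturbation used in the paper. One frequency band and one time
    span are overwritten with the mean value of the spectrogram.
    """
    out = spec.copy()                      # never modify the caller's array
    n_freq, n_time = out.shape
    fill = spec.mean()

    # Frequency mask: a random band of at most `max_freq_width` bins.
    f_width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
    f_start = int(rng.integers(0, n_freq - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Time mask: a random span of at most `max_time_width` frames.
    t_width = int(rng.integers(0, min(max_time_width, n_time) + 1))
    t_start = int(rng.integers(0, n_time - t_width + 1))
    out[:, t_start:t_start + t_width] = fill

    return out
```

Calling `mask_spectrogram(spec, np.random.default_rng(0))` on a log-mel spectrogram yields a second view of the same clip. How such views are used during training is not described in this summary.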

Abstract

Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a…
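One way to picture the analytic classifier mentioned in the abstract is a ridge-regression head fitted in closed form on frozen backbone features. The sketch below is an illustration under stated assumptions, not the authors' implementation: the class name `AnalyticClassifier`, the `ridge` parameter, and the use of NumPy are hypothetical choices, and the features are assumed to come from a backbone that is frozen after first-session adaptation.

```python
import numpy as np

class AnalyticClassifier:
    """Ridge-regression head updated session by session (illustrative only)."""

    def __init__(self, feat_dim, num_classes, ridge=1.0):
        # Running statistics: X^T X + ridge * I and X^T Y.
        self.gram = ridge * np.eye(feat_dim)
        self.cross = np.zeros((feat_dim, num_classes))
        self.weights = np.zeros((feat_dim, num_classes))

    def update(self, feats, labels):
        """Add one session of (frozen) features and integer labels."""
        onehot = np.eye(self.cross.shape[1])[labels]
        self.gram += feats.T @ feats
        self.cross += feats.T @ onehot
        # Closed-form ridge solution on everything seen so far.
        self.weights = np.linalg.solve(self.gram, self.cross)

    def predict(self, feats):
        return np.argmax(feats @ self.weights, axis=1)
```

Because `gram` and `cross` are plain sums over samples, updating session by session gives the same weights as fitting once on all sessions together. A head of this kind therefore does not forget as long as the features stay fixed. If the backbone keeps changing, the stored statistics no longer match the current features, which is one way to read the drift limitation noted in the findings.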

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

- Well written and easy to follow.
- Comparisons are fair and support the claims.
- Addresses the area of audio continual learning.

Weaknesses

- **Learning vs. forgetting.** Contributions such as the projection-based update that protects past tasks can also limit learning on the current task. Please add an analysis that separates how much performance comes from stability and how much is lost in plasticity.
- **Fine-grained tasks are mostly human voices.** Current fine-grained results are on speech or voice-like data, for example TIMIT and VocalSet. A non voice fine grained audio task such as environmental or musical instrument c

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. The experiment results are impressive.
2. Analysis of the challenges of audio continual learning is clear and understandable.
3. The illustrations in the paper are good.

Weaknesses

1. The technical points are scattered: The techniques presented in this paper are rather fragmented, jumping from one idea to another without a unified narrative. This makes it difficult for readers to clearly understand the motivation behind each technique.
2. Limited novelty of the proposed methods: The idea of using a lower learning rate for training the head seems more like an empirical tuning strategy rather than a fundamentally new contribution, and the concept of Multi-Session Adaptati

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. AFAIK, this paper is the first to provide a deep dive into the unique challenges of audio CL with PTMs. The empirical analysis that distinguishes the difficulties in coarse-grained vs. fine-grained audio scenarios (Findings 1, 2, and 3 in Section 2) is a major strength and provides a clear motivation for the proposed method.
2. Their PACE framework is technically sound and its components directly address the problems identified.
3. Their empirical evaluation is thorough. The authors benchmark

Weaknesses

We appreciate the detailed analysis and strong empirical results achieved in this submission. To maximize the impact and clarity of the work, we suggest the authors address the following points:

1. The central claim of the paper is about addressing a "fundamental property of audio backbones" in continual learning. Despite this broad claim, all experiments are exclusively conducted using the EAT backbone, which is a spectrogram-based masked prediction model. Demonstrating that the observed chall

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Domain Adaptation and Few-Shot Learning