Continual Learning for On-Device Speech Recognition using Disentangled Conformers
Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol Choi, David Harwath, Abdelrahman Mohamed

TL;DR
This paper introduces DisConformer, a novel model architecture with frozen and tunable components, and a continual learning algorithm DisentangledCL, to improve speaker-specific on-device speech recognition while maintaining compute efficiency.
Contribution
It proposes DisConformer with disentangled components and DisentangledCL for efficient continual learning in on-device speech recognition, addressing real-world user adaptation challenges.
Findings
DisConformer outperforms baseline models on LibriSpeech with a 15.58% relative WER reduction.
DisentangledCL significantly improves speaker-specific adaptation, reducing WER by 20.65%.
DisConformer models match fully finetuned baselines in some settings.
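The findings above are stated as relative word error rate (WER) reductions. As a minimal sketch of how such a figure is typically computed, the Python below derives WER from a word-level edit distance and then expresses an improvement relative to a baseline. The function names and the numbers in the usage lines are hypothetical illustrations, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Standard Levenshtein distance over words, kept to two rows.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent against a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper).
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
example_reduction = relative_wer_reduction(baseline_wer=10.0, new_wer=8.0)
```

A relative reduction is measured against the baseline's own error rate, so the same absolute gain counts for more when the baseline WER is already low.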
Abstract
Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt a general-purpose training algorithm NetAug for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment'…
Taxonomy
Topics: Speech Recognition and Synthesis
