Continual Learning for On-Device Speech Recognition using Disentangled Conformers

Anuj Diwan; Ching-Feng Yeh; Wei-Ning Hsu; Paden Tomasello; Eunsol Choi; David Harwath; Abdelrahman Mohamed

arXiv:2212.01393·eess.AS·May 26, 2023·1 cites

Open Access

TL;DR

This paper introduces DisConformer, a novel model architecture with frozen and tunable components, and a continual learning algorithm DisentangledCL, to improve speaker-specific on-device speech recognition while maintaining compute efficiency.

Contribution

It proposes DisConformer with disentangled components and DisentangledCL for efficient continual learning in on-device speech recognition, addressing real-world user adaptation challenges.

Findings

01. DisConformer outperforms baseline models on LibriSpeech, with a 15.58% relative WER reduction.

02. DisentangledCL significantly improves speaker-specific adaptation, reducing WER by 20.65%.

03. DisConformer models match fully finetuned baselines in some settings.
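
The findings above are stated as relative WER reductions. As a reminder of how such a figure is typically computed, here is a minimal sketch. The function name and the two WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical absolute WERs (in %), NOT taken from the paper.
hypothetical_baseline = 10.0
hypothetical_new = 8.442

reduction = relative_wer_reduction(hypothetical_baseline, hypothetical_new)
print(f"{reduction:.2f}% relative WER reduction")
```

A relative reduction scales with the baseline: the same absolute gain counts for more when the baseline WER is already low.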

Abstract

Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data corresponding to 118 individual speakers and 6 train splits per speaker of different sizes. Additionally, current speech recognition models and continual learning algorithms are not optimized to be compute-efficient. We adapt NetAug, a general-purpose training algorithm, for ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment'…
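
The split between a frozen 'core' and tunable 'augment' components can be illustrated with a toy update step. This is a minimal sketch, not the paper's DisConformer or DisentangledCL implementation: the `Param` class, the parameter names, the gradients, and the plain SGD rule are all assumptions made for illustration. It shows only the general idea that adaptation updates the tunable parameters and leaves the frozen ones unchanged.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A scalar parameter with a flag saying whether adaptation may change it."""
    value: float
    trainable: bool


def sgd_step(params: dict, grads: dict, lr: float = 0.1) -> None:
    """Apply one SGD update in place, skipping every frozen parameter."""
    for name, param in params.items():
        if param.trainable:
            param.value -= lr * grads[name]


# Hypothetical parameters: one frozen 'core' weight, one tunable 'augment' weight.
params = {
    "core.w": Param(value=1.0, trainable=False),
    "augment.w": Param(value=0.5, trainable=True),
}

# Hypothetical gradients from a speaker-specific batch.
grads = {"core.w": 0.3, "augment.w": 0.2}

sgd_step(params, grads)
print(params["core.w"].value, params["augment.w"].value)
```

Because the core is never updated in this sketch, the general-purpose behaviour it encodes is preserved while only the smaller tunable part is adapted per speaker.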

Taxonomy

Topics: Speech Recognition and Synthesis