Speaker Adaptation for Quantised End-to-End ASR Models

Qiuming Zhao; Guangzhi Sun; Chao Zhang; Mingxing Xu; Thomas Fang Zheng

arXiv:2408.03979·cs.SD·August 9, 2024

Open Access

TL;DR

This paper introduces P4Q, a novel speaker adaptation method for quantised end-to-end ASR models, significantly reducing WER while maintaining small model sizes suitable for edge deployment.

Contribution

It proposes a new strategy combining speaker adaptation with quantisation for end-to-end ASR models, improving performance on resource-constrained devices.

Findings

01 Achieved relative WER reductions of 15.1% and 23.3% on quantised Whisper and Conformer models, respectively.

02 Reduced model size sevenfold while adding only a small number of speaker-specific parameters.

03 Demonstrated effectiveness on the LibriSpeech and TED-LIUM 3 datasets.
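
For readers unfamiliar with the metric, the relative WER reductions quoted in the findings are conventionally computed as the drop in WER divided by the baseline WER. The sketch below shows that arithmetic only. The function name and the WER values are hypothetical and chosen for illustration; they are not results from the paper, and the baseline the paper compares against is not specified on this page.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Made-up WERs for illustration only; these are not figures from the paper.
example = relative_wer_reduction(baseline_wer=10.0, new_wer=8.0)
print(f"{example:.1f}% relative WER reduction")
```

A relative reduction is larger than the absolute one whenever the baseline WER is below 100%, so the two should not be compared directly.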

Abstract

End-to-end models have shown superior performance for automatic speech recognition (ASR). However, such models are often very large and thus challenging to deploy on resource-constrained edge devices. While quantisation can reduce model sizes, it can lead to increased word error rates (WERs). Although improved quantisation methods have been proposed to address this performance degradation, the fact that quantised models deployed on edge devices often target only a small group of users is under-explored. To this end, we propose personalisation for quantised models (P4Q), a novel strategy that uses speaker adaptation (SA) to improve quantised end-to-end ASR models by fitting them to the characteristics of the target speakers. In this paper, we study the P4Q strategy based on Whisper and Conformer attention-based encoder-decoder (AED) end-to-end ASR models, which leverages a…
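
The trade-off the abstract mentions, smaller models at the cost of accuracy, comes from the rounding error that quantisation introduces. The sketch below illustrates this with plain symmetric uniform quantisation in pure Python. It is an assumption made for illustration and is not the quantisation scheme used in the paper (the abstract is cut off before it names one); the function names, the bit width, and the toy weights are all hypothetical.

```python
def quantise(weights, bits=4):
    """Symmetric uniform quantisation: map floats to integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole list; guard against an all-zero input.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -0.07, 0.31]  # toy values, not from any model
codes, scale = quantise(weights, bits=4)
restored = dequantise(codes, scale)

# Rounding error per weight is bounded by half a quantisation step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max abs error:", max_error, "(bound:", scale / 2, ")")
```

Fewer bits mean a larger step size and therefore a larger worst-case error on each weight, which is one reason quantised models can show higher WERs than their full-precision counterparts.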

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing