Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR
Shaojun Li, Daimeng Wei, Hengchao Shang, Jiaxin Guo, ZongYao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang

TL;DR
This paper introduces a speaker-smoothed kNN adaptation method for end-to-end ASR that improves recognition accuracy across different speakers without fine-tuning, especially in diverse domain scenarios.
Contribution
The paper presents a novel kNN-based speaker adaptation technique that dynamically adjusts interpolation parameters using x-vectors, achieving state-of-the-art results without model fine-tuning.
Findings
Performs comparably to fine-tuning, without the performance degradation that fine-tuning shows during speaker changes.
Achieves state-of-the-art CER reductions in all-domain settings.
Effective in both single and multi-speaker scenarios.
Abstract
Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, performance can degrade due to mismatches in vocal characteristics between training and testing data, particularly when target-speaker adaptation data is limited. We propose a novel speaker adaptation approach, Speaker-Smoothed kNN, that leverages k-Nearest Neighbors (kNN) retrieval to improve model output by finding correctly pronounced tokens in a pre-built datastore during decoding. Moreover, we use x-vectors to dynamically adjust the kNN interpolation parameters to address data sparsity. This approach was validated on the KeSpeech and MagicData corpora under in-domain and all-domain settings. Our method consistently performs comparably to fine-tuning without the associated performance degradation during speaker changes. Furthermore, in the all-domain setting, our method…
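The retrieval-and-interpolation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it follows the generic kNN interpolation pattern (mix the model's token distribution with a distribution built from the nearest datastore entries), and the `speaker_lambda` rule that scales the interpolation weight by x-vector cosine similarity is purely a hypothetical stand-in for the paper's speaker smoothing, whose exact form is not given here. All function names, the toy datastore, and the parameter values are assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sq_dist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_distribution(query, datastore, vocab_size, k=2, temperature=1.0):
    # datastore: list of (key_vector, token_id) pairs (hypothetical layout).
    nearest = sorted(datastore, key=lambda kv: sq_dist(query, kv[0]))[:k]
    weights = softmax([-sq_dist(query, key) / temperature for key, _ in nearest])
    probs = [0.0] * vocab_size
    for w, (_, token) in zip(weights, nearest):
        probs[token] += w  # neighbours sharing a token pool their weight
    return probs

def speaker_lambda(test_xvec, store_xvec, max_lam=0.5):
    # ASSUMPTION: trust retrieval more when the test speaker's x-vector is
    # close to the datastore speaker's. Not the paper's formula.
    return max_lam * max(0.0, cosine(test_xvec, store_xvec))

def interpolate(model_probs, knn_probs, lam):
    # Convex combination of the model and kNN distributions.
    return [lam * q + (1.0 - lam) * p for p, q in zip(model_probs, knn_probs)]

# Toy example: 3-token vocabulary, 2-d decoder states, 2-d "x-vectors".
datastore = [([0.0, 0.0], 1), ([0.1, 0.0], 1), ([5.0, 5.0], 2)]
model_probs = [0.1, 0.3, 0.6]
knn_probs = knn_distribution([0.0, 0.1], datastore, vocab_size=3, k=2)
lam = speaker_lambda([1.0, 0.0], [1.0, 0.0])
mixed = interpolate(model_probs, knn_probs, lam)
```

In this toy setup the model alone prefers token 2, while both nearest datastore entries vote for token 1, so the interpolated distribution shifts toward token 1. When the two x-vectors are dissimilar, the hypothetical `speaker_lambda` shrinks toward zero and the output falls back to the unadapted model distribution.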
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
