Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR

Shaojun Li; Daimeng Wei; Hengchao Shang; Jiaxin Guo; ZongYao Li; Zhanglin Wu; Zhiqiang Rao; Yuanchang Luo; Xianghui He; Hao Yang

arXiv:2406.04791 · cs.SD · July 3, 2024

Open Access

TL;DR

This paper introduces a speaker-smoothed kNN adaptation method for end-to-end ASR that improves recognition accuracy across different speakers without fine-tuning, especially in diverse domain scenarios.

Contribution

The paper presents a novel kNN-based speaker adaptation technique that dynamically adjusts interpolation parameters using x-vectors, achieving state-of-the-art results without model fine-tuning.

Findings

01

Performs comparably to fine-tuning while avoiding the degradation that fine-tuning shows when the speaker changes.

02

Achieves state-of-the-art CER reductions in all-domain settings.

03

Effective in both single and multi-speaker scenarios.

Abstract

Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, performance can degrade due to mismatches in vocal characteristics between training and testing data, particularly when target-speaker adaptation data is limited. We propose a novel speaker adaptation approach, Speaker-Smoothed kNN, that leverages k-Nearest Neighbors (kNN) retrieval to improve model output by finding correctly pronounced tokens in its pre-built datastore during the decoding phase. Moreover, we use x-vectors to dynamically adjust the kNN interpolation parameters to address data sparsity. The approach was validated on the KeSpeech and MagicData corpora under in-domain and all-domain settings. Our method consistently performs comparably to fine-tuning without the associated performance degradation during speaker changes. Furthermore, in the all-domain setting, our method…
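
The retrieval-and-interpolation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the softmax-over-distances weighting, and in particular the rule that scales the interpolation weight by x-vector cosine similarity are assumptions made for illustration, since the page does not specify how the parameters are adjusted.

```python
import numpy as np


def knn_distribution(query, keys, values, vocab_size, k=2, temperature=1.0):
    """Turn the k nearest datastore entries into a distribution over tokens.

    keys:   (N, d) stored decoder states (hypothetical datastore layout)
    values: (N,)   token id that followed each stored state
    """
    # Squared L2 distance from the current decoder state to every key.
    dists = np.sum((keys - query) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances: closer neighbours get more weight.
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Accumulate weights per token id (handles repeated token ids).
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)
    return p_knn


def speaker_lambda(xvec_query, xvec_store, lam_max=0.5):
    """ASSUMED rule: scale the kNN weight by x-vector cosine similarity."""
    cos = xvec_query @ xvec_store / (
        np.linalg.norm(xvec_query) * np.linalg.norm(xvec_store)
    )
    return lam_max * float(np.clip(cos, 0.0, 1.0))


def interpolate(p_model, p_knn, lam):
    """Mix the ASR model's token distribution with the retrieved one."""
    return lam * p_knn + (1.0 - lam) * p_model


# Toy example: 3 stored states, vocabulary of 3 tokens.
keys = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
values = np.array([0, 1, 2])
p_model = np.array([0.2, 0.3, 0.5])

p_knn = knn_distribution(np.array([0.0, 0.0]), keys, values, vocab_size=3)
lam = speaker_lambda(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
p_final = interpolate(p_model, p_knn, lam)
```

In this sketch, a test speaker whose x-vector is close to the datastore speaker's gets a larger interpolation weight, so retrieval influences decoding more; a dissimilar speaker falls back toward the unadapted model output.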

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing