Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems
Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Guinan Li, Shujie Hu, Xunying Liu

TL;DR
This paper introduces a confidence score-based speaker adaptation method for Conformer speech recognition systems, improving accuracy by addressing data scarcity and supervision errors with Bayesian modeling and confidence estimation.
Contribution
It proposes a novel confidence score-based unsupervised speaker adaptation approach using Bayesian learning for data-efficient and robust Conformer ASR systems.
Findings
Significant WER reductions on Switchboard and AMI datasets.
Consistent performance improvements over baseline models.
Effective confidence score estimation modules enhance adaptation reliability.
Abstract
Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data efficient speaker-dependent (SD) parameter representations are used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity…
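The confidence score-based selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hypothesis` record, the `select_confident` helper, utterance-level granularity, and the 0.8 threshold are all hypothetical choices made for illustration, and the confidence values are assumed to come from some external estimator rather than the paper's proposed modules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One first-pass recognition result for a speaker (hypothetical record)."""
    utt_id: str
    text: str
    confidence: float  # assumed to lie in [0, 1]; produced by some estimator


def select_confident(hyps: List[Hypothesis], threshold: float = 0.8) -> List[Hypothesis]:
    """Keep only hypotheses whose confidence meets the threshold.

    The retained subset would then serve as the (less erroneous)
    supervision for unsupervised speaker adaptation. The threshold
    value is an illustrative assumption, not a figure from the paper.
    """
    return [h for h in hyps if h.confidence >= threshold]


if __name__ == "__main__":
    speaker_hyps = [
        Hypothesis("utt1", "hello there", 0.93),
        Hypothesis("utt2", "i sink so", 0.41),
        Hypothesis("utt3", "see you tomorrow", 0.85),
    ]
    for h in select_confident(speaker_hyps):
        print(h.utt_id, h.text)
```

A hard threshold is only one possible selection rule; it trades the amount of adaptation data against the error rate of the supervision, which is the tension between data scarcity and transcription errors that the abstract describes.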
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Label Smoothing · Tanh Activation · Absolute Position Encodings · Adam · Sigmoid Activation · Position-Wise Feed-Forward Layer
