Cross-domain Speech Recognition with Unsupervised Character-level Distribution Matching
Wenxin Hou, Jindong Wang, Xu Tan, Tao Qin, Takahiro Shinozaki

TL;DR
This paper introduces CMatch, a novel unsupervised character-level distribution matching method for domain adaptation in speech recognition, significantly reducing word error rates across different devices and environments.
Contribution
Proposes a fine-grained domain adaptation technique for ASR that matches character-level feature distributions between domains, using CTC pseudo labels for frame-level label assignment and self-training to train the model.
Findings
Achieves 14.39% and 16.50% relative WER reduction on Libri-Adapt for cross-device and cross-environment ASR.
Effectively matches character distributions across domains.
Analyzes strategies for label assignment and model adaptation.
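The figures above are relative reductions, not absolute WER differences. A minimal sketch of how such a number is usually computed is given here; the function name and the input values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both inputs are word error rates in percent. The values used below
    are made-up illustrations, not results reported in the paper.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical example: a baseline at 20.0% WER improved to 17.0% WER.
    print(f"{relative_wer_reduction(20.0, 17.0):.2f}% relative reduction")
```

Under this definition, a 3-point absolute drop from a 20% baseline counts as a 15% relative reduction, so relative figures should always be read together with the baseline they refer to.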
Abstract
End-to-end automatic speech recognition (ASR) can achieve promising performance with large-scale training data. However, it is known that domain mismatch between training and testing data often leads to a degradation of recognition accuracy. In this work, we focus on the unsupervised domain adaptation for ASR and propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains. First, to obtain labels for the features belonging to each character, we achieve frame-level label assignment using the Connectionist Temporal Classification (CTC) pseudo labels. Then, we match the character-level distributions using Maximum Mean Discrepancy. We train our algorithm using the self-training technique. Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER)…
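The two steps named in the abstract, frame-level label assignment from CTC pseudo labels followed by per-character Maximum Mean Discrepancy, can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' implementation: the function names, the Gaussian kernel with a single bandwidth `sigma`, the blank index `0`, the choice to skip blank frames, and the plain averaging over shared characters are all assumptions. It uses only the standard library to stay self-contained; a practical version would operate on batched tensors inside the training loss.

```python
import math
from collections import defaultdict

BLANK = 0  # assumed index of the CTC blank symbol


def frame_labels(ctc_posteriors):
    """Pseudo-label each frame with its most probable CTC output index."""
    return [max(range(len(p)), key=p.__getitem__) for p in ctc_posteriors]


def group_by_character(features, labels, blank=BLANK):
    """Collect frame features per pseudo-labelled character (blanks skipped, by assumption)."""
    groups = defaultdict(list)
    for feat, lab in zip(features, labels):
        if lab != blank:
            groups[lab].append(feat)
    return groups


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


def character_level_mmd(src_feats, src_post, tgt_feats, tgt_post, sigma=1.0):
    """Average squared MMD over characters that appear in both domains."""
    src = group_by_character(src_feats, frame_labels(src_post))
    tgt = group_by_character(tgt_feats, frame_labels(tgt_post))
    shared = sorted(set(src) & set(tgt))
    if not shared:
        return 0.0
    return sum(mmd2(src[c], tgt[c], sigma) for c in shared) / len(shared)


if __name__ == "__main__":
    # Toy data: two frames per domain, three CTC outputs (blank, 'a', 'b').
    posteriors = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    source = [[0.0, 0.0], [1.0, 1.0]]
    target = [[0.5, 0.0], [1.0, 2.0]]
    print(character_level_mmd(source, posteriors, target, posteriors))
```

Grouping by character before computing the discrepancy is what makes the matching fine-grained: frames pseudo-labelled as the same character are compared only with each other across domains, instead of comparing the two domains' pooled feature distributions as a whole.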
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dropout · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding
