Transducer Consistency Regularization for Speech to Text Applications
Cindy Tseng, Yun Tang, Vijendra Raj Apsingekar

TL;DR
This paper introduces Transducer Consistency Regularization (TCR), a method that improves speech-to-text transducer models by encouraging consistent outputs across distorted inputs, reducing word error rate (WER) by 4.3% relative on LibriSpeech.
Contribution
The paper proposes TCR, a new regularization technique for transducer models that weights alignments based on proximity to oracle alignments, enhancing model training and performance.
Findings
Reduces WER by 4.3% relative on LibriSpeech.
Outperforms other consistency regularization methods.
Effectively leverages data distortions and alignment weighting.
Abstract
Consistency regularization is a commonly used practice that encourages a model to generate consistent representations from distorted input features and thereby improves generalization. It yields significant improvements on various speech applications that are optimized with a cross-entropy criterion. However, it is not straightforward to apply consistency regularization to transducer-based approaches, which are widely adopted for speech applications because of their competitive performance and streaming capability. The main challenge stems from the vast alignment space of the transducer optimization criterion: not all alignments within that space contribute equally to model optimization. In this study, we present Transducer Consistency Regularization (TCR), a consistency regularization method for transducer models. We apply distortions such as spec augmentation and dropout to create…
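To make the general idea in the abstract concrete, the following is a minimal sketch of plain frame-level consistency regularization: the same input is distorted twice (time masking plus dropout) and the two output distributions are pulled together with a symmetric KL term. It is not the TCR method itself, which operates over transducer alignments and weights them; that part is not reproduced here. The names `time_mask`, `toy_model`, and `symmetric_kl`, the linear stand-in model, and all shapes and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def time_mask(features, start, width):
    """Zero out `width` consecutive frames from `start` (SpecAugment-style time masking)."""
    masked = features.copy()
    masked[start:start + width, :] = 0.0
    return masked

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def toy_model(features, weights, rng, dropout_p):
    """Hypothetical stand-in for a speech model: inverted dropout, then a linear layer."""
    keep = rng.random(features.shape) >= dropout_p
    dropped = features * keep / (1.0 - dropout_p)
    return log_softmax(dropped @ weights)

def symmetric_kl(log_p, log_q):
    """Mean over frames of 0.5 * (KL(p || q) + KL(q || p))."""
    p, q = np.exp(log_p), np.exp(log_q)
    kl_pq = (p * (log_p - log_q)).sum(axis=-1)
    kl_qp = (q * (log_q - log_p)).sum(axis=-1)
    return 0.5 * (kl_pq + kl_qp).mean()

rng = np.random.default_rng(0)
features = rng.normal(size=(20, 8))   # 20 frames, 8 feature dimensions (assumed)
weights = rng.normal(size=(8, 5))     # 5 output classes (assumed)

# Two independently distorted views of the same utterance.
view_a = toy_model(time_mask(features, 2, 4), weights, rng, dropout_p=0.1)
view_b = toy_model(time_mask(features, 10, 4), weights, rng, dropout_p=0.1)

# The consistency term would be added to the main training loss with some weight.
consistency_loss = symmetric_kl(view_a, view_b)
print(f"consistency loss: {consistency_loss:.4f}")
```

In a cross-entropy setting this term compares per-frame posteriors directly. For a transducer, the output is a distribution over a lattice of alignments, which is the difficulty the abstract points to.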
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Dropout
