Sequence-to-Sequence ASR Optimization via Reinforcement Learning
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

TL;DR
This paper introduces a reinforcement learning-based training method for sequence-to-sequence speech recognition models, directly optimizing recognition accuracy and reducing errors compared to traditional likelihood-based training.
Contribution
It proposes policy-gradient reinforcement learning, with the negative Levenshtein distance between the predicted and reference transcriptions as the reward, to improve sequence-to-sequence ASR performance.
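To make the reward concrete, here is a minimal sketch of a REINFORCE-style surrogate loss with a negative Levenshtein distance reward. It is an illustration under stated assumptions, not the authors' implementation: the function names (`levenshtein`, `reward`, `reinforce_loss`), the single sentence-level reward, and the optional `baseline` argument are all hypothetical choices made for this example. Plain Python floats stand in for the differentiable log-probabilities that a real model would produce.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def reward(ref, hyp):
    """Sentence-level reward (assumed form): negative edit distance, so 0 is best."""
    return -levenshtein(ref, hyp)


def reinforce_loss(log_probs, ref, hyp, baseline=0.0):
    """Surrogate loss for one sampled transcription.

    log_probs: log-probability the model assigned to each sampled grapheme.
    With differentiable log_probs, minimizing this loss raises the probability
    of samples whose reward is above the baseline and lowers it for samples
    whose reward is below the baseline.
    """
    advantage = reward(ref, hyp) - baseline
    return -advantage * sum(log_probs)


if __name__ == "__main__":
    ref, hyp = "speech", "speach"
    print(levenshtein(ref, hyp))  # one substitution
    print(reinforce_loss([-0.1] * len(hyp), ref, hyp))
```

In an actual training loop the hypothesis would be sampled from the decoder, so the model is scored on its own predictions rather than on ground-truth history.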
Findings
Significant performance improvements over maximum likelihood training.
Effective direct optimization of recognition error metrics.
Demonstrated robustness in transcription accuracy.
Abstract
Despite the success of sequence-to-sequence approaches in automatic speech recognition (ASR) systems, the models still suffer from several problems, mainly due to the mismatch between the training and inference conditions. In the sequence-to-sequence architecture, the model is trained to predict the grapheme of the current time-step given the input of speech signal and the ground-truth grapheme history of the previous time-steps. However, it remains unclear how well the model approximates real-world speech during inference. Thus, generating the whole transcription from scratch based on previous predictions is complicated and errors can propagate over time. Furthermore, the model is optimized to maximize the likelihood of training data instead of error rate evaluation metrics that actually quantify recognition quality. This paper presents an alternative strategy for training…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
