Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models
Adnan Haider, Xingyu Na, Erik McDermott, Tim Ng, Zhen Huang, and Xiaodan Zhuang

TL;DR
This paper proposes Focused Discriminative Training (FDT), a new method to enhance streaming end-to-end speech recognition models by targeting challenging audio segments, leading to significant WER reductions without relying on HMMs or lattices.
Contribution
The paper introduces FDT, a novel training framework that improves streaming ASR models by focusing on difficult segments, without the HMMs and lattices that standard discriminative training approaches require.
Findings
FDT achieves greater WER reduction than additional fine-tuning of the encoder with MMI or MWER loss.
FDT improves performance on models trained on LibriSpeech and large-scale datasets.
The approach is effective without HMMs or lattice-based decision processes.
Abstract
This paper introduces a novel training framework called Focused Discriminative Training (FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models trained using either CTC or an interpolation of CTC and attention-based encoder-decoder (AED) loss. The proposed approach identifies challenging segments of the audio and improves the model's recognition on them. Notably, this training framework is independent of hidden Markov models (HMMs) and lattices, eliminating the need for substantial decision-making regarding HMM topology, lexicon, and graph generation, as typically required in standard discriminative training approaches. Compared to additional fine-tuning with MMI or MWER loss on the encoder, FDT is shown to be more effective in achieving greater reductions in Word Error Rate (WER) on streaming models trained on…
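Because the results are reported as Word Error Rate, a minimal sketch of how WER is conventionally computed may help: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the paper's scoring pipeline, which may apply its own text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    with no text normalization (an assumption, not the paper's setup).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.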
Taxonomy
Topics: Speech Recognition and Synthesis
