Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models

Adnan Haider, Xingyu Na, Erik McDermott, Tim Ng, Zhen Huang, and Xiaodan Zhuang

arXiv:2408.13008 · cs.LG · August 26, 2024

Open Access

TL;DR

This paper proposes Focused Discriminative Training (FDT), a method that improves streaming end-to-end speech recognition models by identifying and targeting challenging audio segments. It yields greater WER reductions than additional fine-tuning with MMI or MWER loss, without relying on HMMs or lattices.

Contribution

The paper introduces FDT, a novel training framework that improves streaming ASR models by focusing on difficult segments, independent of traditional HMM-based discriminative methods.

Findings

01

FDT achieves greater WER reduction than additional fine-tuning of the encoder with MMI or MWER loss.

02

FDT improves performance on models trained on LibriSpeech and large-scale datasets.

03

The approach requires neither HMMs nor lattices, so it avoids decisions about HMM topology, lexicon, and graph generation.

Abstract

This paper introduces a novel training framework called Focused Discriminative Training (FDT) to further improve streaming word-piece end-to-end (E2E) automatic speech recognition (ASR) models trained using either CTC or an interpolation of CTC and attention-based encoder-decoder (AED) loss. The proposed approach identifies challenging segments of the audio and improves the model's recognition on them. Notably, this training framework is independent of hidden Markov models (HMMs) and lattices, eliminating the need for substantial decision-making regarding HMM topology, lexicon, and graph generation, as typically required in standard discriminative training approaches. Compared to additional fine-tuning with MMI or MWER loss on the encoder, FDT is shown to be more effective in achieving greater reductions in Word Error Rate (WER) on streaming models trained on…
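
For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognized hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the paper's evaluation code, and the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level Levenshtein distance / reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no text normalization), which real evaluation pipelines usually add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.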

Taxonomy

Topics: Speech Recognition and Synthesis