HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation
Amir Hussein, Cihan Xiao, Matthew Wiesner, Dan Povey, Leibny Paola Garcia, Sanjeev Khudanpur

TL;DR
HENT-SRT is a hierarchical neural transducer framework with self-distillation for joint speech recognition and translation. It improves the handling of word reordering, computational efficiency, and translation quality, and achieves state-of-the-art results across multiple languages.
Contribution
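The paper presents a hierarchical neural transducer with self-distillation and efficiency improvements for joint speech recognition and translation. It targets two weaknesses of existing transducer-based approaches: word reordering and the performance degradation that arises when ASR and ST are modeled jointly.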
Findings
Achieves state-of-the-art results on Arabic, Spanish, and Mandarin datasets.
Reduces training complexity through hierarchical encoding and pruned loss.
Improves translation quality with a blank penalty during decoding.
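The summary does not say how the blank penalty is applied. A minimal sketch, assuming the common formulation in which a constant is subtracted from the blank logit before each decoding step so that the transducer emits non-blank tokens more readily. The function names, the `blank_id=0` convention, and the penalty values are illustrative assumptions, not taken from the paper.

```python
import math


def apply_blank_penalty(logits, blank_id=0, penalty=0.5):
    """Return a copy of `logits` with `penalty` subtracted from the blank entry.

    Assumption: the penalty is a constant in logit space; the paper's exact
    formulation is not given in this summary.
    """
    adjusted = list(logits)
    adjusted[blank_id] -= penalty
    return adjusted


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def greedy_step(logits, blank_id=0, penalty=0.0):
    """Pick the highest-scoring token index after penalizing blank."""
    scores = log_softmax(apply_blank_penalty(logits, blank_id, penalty))
    return max(range(len(scores)), key=scores.__getitem__)


# Toy joiner output for one step: index 0 is blank (hypothetical values).
toy_logits = [2.0, 1.8, 0.5]
no_penalty_choice = greedy_step(toy_logits, penalty=0.0)
with_penalty_choice = greedy_step(toy_logits, penalty=0.5)
```

In this toy step, blank narrowly wins without a penalty. With a penalty of 0.5 the blank logit drops below the best non-blank token, so a token is emitted instead. A larger penalty trades fewer deletions for a higher risk of spurious insertions, which is why it is usually tuned on held-out data.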
Abstract
Neural transducers (NT) provide an effective framework for speech streaming, demonstrating strong performance in automatic speech recognition (ASR). However, the application of NT to speech translation (ST) remains challenging, as existing approaches struggle with word reordering and performance degradation when jointly modeling ASR and ST, resulting in a gap with attention-based encoder-decoder (AED) models. Existing NT-based ST approaches also suffer from high computational training costs. To address these issues, we propose HENT-SRT (Hierarchical Efficient Neural Transducer for Speech Recognition and Translation), a novel framework that factorizes ASR and translation tasks to better handle reordering. To ensure robust ST while preserving ASR performance, we use self-distillation with CTC consistency regularization. Moreover, we improve computational efficiency by incorporating best…
Taxonomy
Topics: Speech Recognition and Synthesis · Neural Networks and Applications
