Tight Integrated End-to-End Training for Cascaded Speech Translation
Parnia Bahar, Tobias Bieschke, Ralf Schlüter, Hermann Ney

TL;DR
This paper introduces a tightly integrated end-to-end training approach for cascaded speech translation that jointly optimizes ASR and MT components, improving translation quality and consistency over traditional cascade and direct models.
Contribution
It proposes a novel end-to-end trainable model that collapses cascade components into a single system using soft decision passing, enhancing performance and consistency.
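The difference between a discrete transcription and soft decision passing can be illustrated with a minimal sketch. It assumes that "soft decisions" means feeding the MT side the expected embedding under the ASR posterior distribution rather than the embedding of the single best token; the summary here does not spell out the paper's exact formulation. The function names, the `temperature` parameter, and the toy shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def hard_decision_inputs(asr_logits, mt_embeddings):
    """Cascade-style hand-off: pick the best token, then look up its embedding.

    The argmax is discrete, so no gradient can flow from the MT loss
    back into the ASR parameters through this step.
    """
    token_ids = asr_logits.argmax(axis=-1)
    return mt_embeddings[token_ids]


def soft_decision_inputs(asr_logits, mt_embeddings, temperature=1.0):
    """Soft hand-off (assumed formulation): posterior-weighted embedding.

    Every operation is differentiable, so in an autodiff framework the
    MT loss could be backpropagated into the ASR model. `temperature`
    is a hypothetical knob: small values sharpen the posterior.
    """
    posteriors = softmax(asr_logits / temperature, axis=-1)
    return posteriors @ mt_embeddings


# Toy example: 4 output positions, source vocabulary of 6, embedding size 3.
rng = np.random.default_rng(0)
asr_logits = rng.normal(size=(4, 6))
mt_embeddings = rng.normal(size=(6, 3))

hard = hard_decision_inputs(asr_logits, mt_embeddings)
soft = soft_decision_inputs(asr_logits, mt_embeddings)
print(hard.shape, soft.shape)
```

Both functions return one vector per output position, so the MT encoder sees the same input shape in either case. When the best token is unique, the soft input approaches the hard one as `temperature` goes to zero, which is one way to see the cascade as a limiting case of the soft hand-off.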
Findings
Outperforms cascade models by up to 1.8% BLEU and 2.0% TER.
Achieves better results than direct speech translation models.
Enables joint optimization of all components with backpropagation.
Abstract
A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between ASR and MT models. Direct speech translation is an alternative method to avoid error propagation; however, its performance is often behind the cascade system. To use an intermediate representation and preserve the end-to-end trainability, previous studies have proposed using two-stage models by passing the hidden vectors of the recognizer into the decoder of the MT model and ignoring the MT encoder. This work explores the feasibility of collapsing the entire cascade components into a single end-to-end trainable model by optimizing all parameters of ASR and MT models jointly without ignoring any learned parameters. It is…
