Understanding and Bridging the Modality Gap for Speech Translation
Qingkai Fang, Yang Feng

TL;DR
This paper investigates the modality gap between speech translation and machine translation, linking it to exposure bias, and proposes a novel regularization method with adaptive training to improve end-to-end speech translation performance.
Contribution
The paper introduces Cress (Cross-modal Regularization with Scheduled Sampling), which combines scheduled sampling with token-level adaptive training to bridge the modality gap in speech translation.
Findings
Cress reduces the modality gap during inference.
The approach improves translation quality across multiple language directions.
Results demonstrate significant gains on the MuST-C dataset.
Abstract
How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target mapping. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training except for some difficult cases, but keeps increasing during inference due to the cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically,…
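The abstract ties the modality gap to exposure bias and names scheduled sampling as part of the proposed remedy. As a rough illustration of that general technique only, the sketch below mixes ground-truth target tokens with the model's own predictions when building decoder inputs. It is not the authors' implementation: the function name `mix_decoder_inputs`, the parameter `teacher_forcing_prob`, and the toy token lists are all assumptions made for this example, and the cross-modal regularization and token-level adaptive training parts of Cress are not shown.

```python
import random


def mix_decoder_inputs(gold_tokens, predicted_tokens, teacher_forcing_prob, rng=None):
    """Generic scheduled sampling over one target sequence (illustrative only).

    For each position, keep the ground-truth token with probability
    `teacher_forcing_prob`; otherwise use the model's own prediction.
    All names here are hypothetical and not taken from the paper.
    """
    if len(gold_tokens) != len(predicted_tokens):
        raise ValueError("gold and predicted sequences must have equal length")
    if not 0.0 <= teacher_forcing_prob <= 1.0:
        raise ValueError("teacher_forcing_prob must be in [0, 1]")
    if rng is None:
        rng = random.Random()

    mixed = []
    for gold, predicted in zip(gold_tokens, predicted_tokens):
        # rng.random() is in [0, 1), so prob 1.0 always keeps the gold token
        # and prob 0.0 always takes the prediction.
        use_gold = rng.random() < teacher_forcing_prob
        mixed.append(gold if use_gold else predicted)
    return mixed


if __name__ == "__main__":
    gold = ["das", "ist", "gut"]        # toy reference translation
    predicted = ["das", "war", "gut"]   # toy model predictions
    print(mix_decoder_inputs(gold, predicted, 0.5, random.Random(0)))
```

Lowering `teacher_forcing_prob` over the course of training exposes the decoder to more of its own predictions, which is the usual way scheduled sampling narrows the mismatch between training and inference conditions.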
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
