Understanding and Bridging the Modality Gap for Speech Translation

Qingkai Fang; Yang Feng

arXiv:2305.08706 · cs.CL · May 16, 2023 · 1 citation


TL;DR

This paper investigates the modality gap between speech translation and machine translation, linking it to exposure bias, and proposes a novel regularization method with adaptive training to improve end-to-end speech translation performance.

Contribution

It introduces Cress (Cross-modal Regularization with Scheduled Sampling), combined with token-level adaptive training, to bridge the modality gap in speech translation.
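Token-level adaptive training gives different target tokens different training weights, so the hard positions where the speech and text inputs disagree count for more. This listing does not give the weighting formula. The sketch below is therefore a hypothetical illustration, not the method in ictnlp/cress. Each token's weight grows with the KL divergence between the MT and ST output distributions at that position, normalized by the sentence's mean gap, and the weights then rescale the ST cross-entropy. The names `token_weights`, `weighted_nll`, and the `scale` parameter are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Dist = Sequence[float]  # one probability distribution over the target vocabulary


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def token_weights(st_probs: Sequence[Dist], mt_probs: Sequence[Dist],
                  scale: float = 1.0) -> List[float]:
    """Hypothetical weighting: 1 + scale * (token gap / mean gap of the sentence).

    The per-token gap is KL(MT || ST), so positions where the speech input pushes
    the prediction away from the text-input prediction receive larger weights.
    """
    gaps = [kl_divergence(m, s) for s, m in zip(st_probs, mt_probs)]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0.0:
        return [1.0] * len(gaps)  # no gap anywhere: fall back to uniform weights
    return [1.0 + scale * g / mean_gap for g in gaps]


def weighted_nll(st_probs: Sequence[Dist], targets: Sequence[int],
                 weights: Sequence[float], eps: float = 1e-12) -> float:
    """Weighted mean negative log-likelihood of the gold tokens under the ST model."""
    losses = [-math.log(dist[y] + eps) for dist, y in zip(st_probs, targets)]
    return sum(w * loss for w, loss in zip(weights, losses)) / sum(weights)


# Toy example: position 0 has no ST/MT gap, position 1 has a large disagreement.
st = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
mt = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
w = token_weights(st, mt)
print(w)  # the second token receives the larger weight
print(weighted_nll(st, [0, 1], w))
```

With uniform weights, `weighted_nll` reduces to the ordinary mean cross-entropy. The normalization by the mean gap keeps the weights on a similar scale from one sentence to the next.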

Findings

1. Cress reduces the modality gap during inference.
2. The approach improves translation quality across multiple language directions.
3. Results demonstrate significant gains on the MuST-C dataset.

Abstract

How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target mapping. However, due to the differences between speech and text, there is always a gap between ST and MT. In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias. We find that the modality gap is relatively small during training except for some difficult cases, but keeps increasing during inference due to the cascading effect. To address these problems, we propose the Cross-modal Regularization with Scheduled Sampling (Cress) method. Specifically,…
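The abstract links the modality gap to exposure bias. During training, the decoder always sees ground-truth target prefixes. At inference it sees its own outputs, so errors cascade. Scheduled sampling addresses this by mixing the model's own predictions into the target-side context during training, and cross-modal regularization pulls the ST prediction toward the MT prediction. The sketch below shows these two pieces under stated assumptions. It uses the inverse-sigmoid decay from the original scheduled-sampling work as the schedule, and a symmetric KL divergence as the regularizer. The listing's abstract is truncated before the method details, so neither choice is confirmed. The function names (`mix_context`, `cress_style_loss`, etc.) and the weight `lam` are hypothetical and are not taken from the ictnlp/cress code.

```python
import math
import random
from typing import List, Sequence

Dist = Sequence[float]  # probability distribution over the target vocabulary


def inverse_sigmoid_decay(step: int, k: float = 1000.0) -> float:
    """Probability of feeding the ground-truth token at a given training step.

    Inverse-sigmoid schedule from Bengio et al. (2015); an illustrative choice here.
    """
    x = step / k
    if x > 700.0:  # math.exp would overflow; the probability is effectively 0
        return 0.0
    return k / (k + math.exp(x))


def mix_context(gold: Sequence[int], predicted: Sequence[int],
                p_gold: float, rng: random.Random) -> List[int]:
    """Target-side context: keep each ground-truth token with probability p_gold,
    otherwise substitute the model's own prediction at that position."""
    return [g if rng.random() < p_gold else pred for g, pred in zip(gold, predicted)]


def kl(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def nll(probs: Sequence[Dist], targets: Sequence[int], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of the gold tokens."""
    return -sum(math.log(d[y] + eps) for d, y in zip(probs, targets)) / len(targets)


def cress_style_loss(st_probs: Sequence[Dist], mt_probs: Sequence[Dist],
                     targets: Sequence[int], lam: float = 1.0) -> float:
    """ST and MT cross-entropy plus a symmetric-KL term (assumed form) that
    pulls the ST and MT predictions together at every target position."""
    reg = sum(0.5 * (kl(s, m) + kl(m, s)) for s, m in zip(st_probs, mt_probs)) / len(targets)
    return nll(st_probs, targets) + nll(mt_probs, targets) + lam * reg


rng = random.Random(0)
p_gold = inverse_sigmoid_decay(step=3000)
context = mix_context(gold=[5, 8, 2, 9], predicted=[5, 7, 2, 4], p_gold=p_gold, rng=rng)
st = [[0.6, 0.4], [0.3, 0.7]]
mt = [[0.8, 0.2], [0.2, 0.8]]
print(round(p_gold, 3), context, cress_style_loss(st, mt, targets=[0, 1]))
```

In a full model, both the ST and MT decoders would run on the same mixed `context` to produce `st_probs` and `mt_probs`. The sketch takes those distributions as given. As training proceeds, `p_gold` decays, so the decoder is increasingly conditioned on its own outputs, which more closely resembles inference.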


Code & Models

Repositories

ictnlp/cress (official PyTorch implementation)


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications