Bridging the Modality Gap for Speech-to-Text Translation

Yuchen Liu, Junnan Zhu, Jiajun Zhang, and Chengqing Zong

arXiv:2010.14920·cs.CL·October 29, 2020·39 cites

Open Access

TL;DR

This paper introduces the STAST model, which bridges the modality gap between speech and text in end-to-end speech translation, leading to improved performance and state-of-the-art results.

Contribution

The paper proposes a novel decoupling and adaptation approach to better align speech and text representations in speech translation models.

Findings

01. Significant performance improvements over baseline models.
02. Achieved new state-of-the-art results on English-French and English-German datasets.
03. Effective cross-modal adaptation enhances semantic representation.

Abstract

End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner. Most existing methods employ an encoder-decoder structure with a single encoder that learns acoustic representation and semantic information simultaneously. This ignores the modality differences between speech and text and overloads the encoder, which makes such a model very difficult to learn. To address these issues, we propose a Speech-to-Text Adaptation for Speech Translation (STAST) model, which aims to improve end-to-end model performance by bridging the modality gap between speech and text. Specifically, we decouple the speech translation encoder into three parts and introduce a shrink mechanism to match the length of the speech representation with that of the corresponding text transcription. To obtain better semantic representation, we completely integrate…
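The abstract excerpt does not say how the shrink mechanism decides which frames to keep. As a rough illustration only, the sketch below assumes each speech frame already carries a CTC-style label (with a blank symbol), drops the blank frames, and averages each run of consecutive frames that share a label. The function name `shrink_frames`, the `BLANK` index, and the averaging rule are hypothetical choices for this sketch, not details taken from the paper.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the blank label (hypothetical choice)


def shrink_frames(
    frames: Sequence[Sequence[float]],
    labels: Sequence[int],
    blank: int = BLANK,
) -> List[List[float]]:
    """Shorten a frame sequence toward transcription length.

    Blank-labelled frames are dropped, and each run of consecutive
    frames with the same non-blank label is averaged into one vector.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    segments: List[List[Sequence[float]]] = []
    prev = blank
    for frame, label in zip(frames, labels):
        if label == blank:
            # A blank ends the current run, so a later repeat of the
            # same label starts a new segment.
            prev = blank
            continue
        if label == prev:
            segments[-1].append(frame)
        else:
            segments.append([frame])
        prev = label

    # Average each segment dimension by dimension.
    return [[sum(col) / len(seg) for col in zip(*seg)] for seg in segments]
```

With eight input frames labelled `[0, 3, 3, 0, 0, 5, 0, 3]`, this sketch returns three vectors, one per non-blank run, so the output length follows the number of predicted tokens rather than the number of acoustic frames.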

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling