Bridging the Modality Gap for Speech-to-Text Translation
Yuchen Liu, Junnan Zhu, Jiajun Zhang, and Chengqing Zong

TL;DR
This paper introduces the STAST model, which bridges the modality gap between speech and text in end-to-end speech translation, leading to improved performance and state-of-the-art results.
Contribution
The paper proposes a novel decoupling and adaptation approach to better align speech and text representations in speech translation models.
Findings
Significant performance improvements over baseline models.
New state-of-the-art results on English-French and English-German datasets.
Effective cross-modal adaptation enhances semantic representation.
Abstract
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner. Most existing methods employ an encoder-decoder structure with a single encoder that learns acoustic representation and semantic information simultaneously. This ignores the differences between the speech and text modalities and overloads the encoder, making such a model very difficult to learn. To address these issues, we propose a Speech-to-Text Adaptation for Speech Translation (STAST) model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text. Specifically, we decouple the speech translation encoder into three parts and introduce a shrink mechanism to match the length of speech representation with that of the corresponding text transcription. To obtain better semantic representation, we completely integrate…
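The abstract does not spell out how the shrink mechanism works, so the following is only an illustrative sketch of one plausible way to shorten a frame sequence to roughly transcription length. It assumes each speech frame already carries a CTC-style label (with a blank symbol), drops blank frames, and averages each run of consecutive frames that share a label. The function name `shrink_frames`, the `blank` parameter, and the averaging rule are hypothetical choices for illustration, not the authors' confirmed method.

```python
from itertools import groupby


def shrink_frames(frames, labels, blank=0):
    """Shorten a frame sequence using per-frame labels (hypothetical sketch).

    frames: list of equal-length feature vectors (lists of floats), one per frame.
    labels: list of per-frame label ids, e.g. the argmax of a CTC output.
    blank:  label id treated as "no token" (an assumption of this sketch).

    Consecutive frames with the same label form one run. Blank runs are
    dropped; every other run is replaced by the mean of its frames.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    shrunk = []
    start = 0
    for label, run in groupby(labels):
        run_len = len(list(run))
        segment = frames[start:start + run_len]
        start += run_len
        if label == blank:
            continue  # blank frames carry no token, so they are removed
        dim = len(segment[0])
        # Average the run dimension by dimension to get one vector per token.
        shrunk.append([sum(f[d] for f in segment) / run_len for d in range(dim)])
    return shrunk


# Six frames labelled (blank, 7, 7, blank, 7, 4) shrink to three vectors.
example_frames = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [2.0, 2.0], [4.0, 4.0]]
example_labels = [0, 7, 7, 0, 7, 4]
example_shrunk = shrink_frames(example_frames, example_labels)
```

In this toy input the two adjacent frames labelled 7 are merged into one averaged vector, while the later 7 stays separate because a blank sits between them, so the output length equals the number of emitted tokens.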
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
