Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation
Chen Wang, Yuchen Liu, Boxing Chen, Jiajun Zhang, Wei Luo, Zhongqiang, Huang, Chengqing Zong

TL;DR
This paper introduces a Discrete Cross-Modal Alignment method that uses a shared discrete vocabulary to enable zero-shot speech translation, significantly improving performance without requiring parallel translation data.
Contribution
The novel DCMA approach aligns speech and text modalities in a shared discrete space using vector quantization, enabling effective zero-shot speech translation.
Findings
Significant improvement over previous zero-shot methods
Achieves performance comparable to supervised baselines
Effective across multiple language pairs
Abstract
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions. However, the training of end-to-end methods relies on parallel ST data, which are difficult and expensive to obtain. Fortunately, the supervised data for automatic speech recognition (ASR) and machine translation (MT) are usually more accessible, making zero-shot speech translation a potential direction. Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space, resulting in much worse performance compared to the supervised ST methods. In order to enable zero-shot ST, we propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text. Specifically, we introduce a vector…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
MethodsALIGN
