Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation

Chen Wang; Yuchen Liu; Boxing Chen; Jiajun Zhang; Wei Luo; Zhongqiang Huang; Chengqing Zong

arXiv:2210.09556 · cs.CL · October 19, 2022 · 1 citation

TL;DR

This paper introduces Discrete Cross-Modal Alignment (DCMA), a method that maps speech and text into a shared discrete vocabulary space to enable zero-shot speech translation. It is trained on ASR and MT data rather than parallel speech translation data, and it improves substantially over previous zero-shot methods.

Contribution

The novel DCMA approach aligns speech and text modalities in a shared discrete space using vector quantization, enabling effective zero-shot speech translation.
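To make the idea of a shared discrete space concrete, here is a minimal sketch of nearest-neighbour vector quantization in plain Python. It is not the authors' implementation: the codebook values, the encoder outputs, and the names `quantize` and `squared_distance` are all invented for illustration, and a real system would learn the codebook jointly with the encoders.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


# Toy 2-D codebook with three entries (values are assumptions, not from the paper).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical encoder outputs for the same content in two modalities.
speech_states = [(0.9, 0.1), (0.1, 0.8)]
text_states = [(1.1, -0.1), (-0.1, 1.2)]

speech_codes = quantize(speech_states, codebook)
text_codes = quantize(text_states, codebook)

# Both modalities land on the same discrete codes.
print(speech_codes, text_codes)  # [1, 2] [1, 2]
```

In this toy case the speech and text vectors differ slightly but quantize to the same code indices, which is the property that would let a decoder trained on one modality consume the other.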

Findings

01. Significant improvement over previous zero-shot methods

02. Achieves performance comparable to supervised baselines

03. Effective across multiple language pairs

Abstract

End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions. However, the training of end-to-end methods relies on parallel ST data, which are difficult and expensive to obtain. Fortunately, the supervised data for automatic speech recognition (ASR) and machine translation (MT) are usually more accessible, making zero-shot speech translation a potential direction. Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space, resulting in much worse performance compared to the supervised ST methods. In order to enable zero-shot ST, we propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text. Specifically, we introduce a vector…

Code & Models

Repositories

znlp/zero-shot-st (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: ALIGN