Soft Alignment of Modality Space for End-to-end Speech Translation

Yuhao Zhang; Kaiqi Kou; Bei Li; Chen Xu; Chunliang Zhang; Tong Xiao; Jingbo Zhu

arXiv:2312.10952 · cs.CL · December 19, 2023 · 1 citation

Open Access

TL;DR

This paper introduces Soft Alignment (S-Align), an adversarial training method that aligns speech and text representations into a shared space, improving end-to-end speech translation performance across multiple languages.

Contribution

It proposes a novel soft alignment approach using adversarial training to better align modality spaces in speech translation models, surpassing traditional hard alignment methods.

Findings

01 S-Align outperforms H-Align in multiple translation tasks

02 Achieves translation quality comparable to specialized models

03 Effective cross-modal and cross-lingual transfer demonstrated

Abstract

End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
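
The abstract does not say how the adversarial objective is built, so the following is only a generic illustration of adversarial modality alignment, not the authors' implementation. It assumes a hypothetical logistic "modality discriminator" with weights `w` and bias `b` that guesses whether a feature vector came from speech or text. The encoders are scored on the same predictions with the labels flipped. All names and the toy feature values are assumptions made for this sketch.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def modality_prob(feature, w, b):
    # Probability the (hypothetical) discriminator assigns to "speech".
    return sigmoid(sum(wi * xi for wi, xi in zip(w, feature)) + b)

def bce(p, label, eps=1e-12):
    # Binary cross-entropy for a single prediction.
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

def discriminator_loss(speech, text, w, b):
    # Loss the discriminator minimizes: speech -> 1, text -> 0.
    losses = [bce(modality_prob(x, w, b), 1) for x in speech]
    losses += [bce(modality_prob(x, w, b), 0) for x in text]
    return sum(losses) / len(losses)

def encoder_adversarial_loss(speech, text, w, b):
    # Loss the encoders minimize: same predictions, labels flipped,
    # so the encoders are rewarded for fooling the discriminator.
    losses = [bce(modality_prob(x, w, b), 0) for x in speech]
    losses += [bce(modality_prob(x, w, b), 1) for x in text]
    return sum(losses) / len(losses)

# Toy 2-D features (made up for illustration).
separated_speech = [[2.0, 0.0], [3.0, 0.5]]
separated_text = [[-2.0, 0.0], [-3.0, -0.5]]
aligned_speech = [[0.1, 0.0], [-0.1, 0.2]]
aligned_text = [[0.1, 0.0], [-0.1, 0.2]]
w, b = [2.0, 0.0], 0.0

if __name__ == "__main__":
    print(discriminator_loss(separated_speech, separated_text, w, b))
    print(encoder_adversarial_loss(separated_speech, separated_text, w, b))
    print(discriminator_loss(aligned_speech, aligned_text, w, b))
```

When the two modalities occupy separate regions, the discriminator loss is small and the encoder loss is large. When the feature sets coincide, no discriminator can score below log 2 on average, which is the sense in which the space is modality-invariant. Note that this objective constrains the two feature distributions as a whole rather than forcing individual speech and text pairs to match.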

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: ALIGN