Soft Alignment of Modality Space for End-to-end Speech Translation
Yuhao Zhang, Kaiqi Kou, Bei Li, Chen Xu, Chunliang Zhang, Tong Xiao, Jingbo Zhu

TL;DR
This paper introduces Soft Alignment (S-Align), an adversarial training method that aligns speech and text representations in a shared space, improving end-to-end speech translation on three languages from the MuST-C dataset.
Contribution
It proposes a novel soft alignment approach using adversarial training to better align modality spaces in speech translation models, surpassing traditional hard alignment methods.
Findings
S-Align outperforms H-Align across multiple tasks on three MuST-C languages
Achieves translation quality on par with specialized translation models
Demonstrates effective cross-modal and cross-lingual transfer
Abstract
End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
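The adversarial idea in the abstract can be illustrated with a generic modality-discriminator objective. This is a minimal sketch under stated assumptions, not the paper's confirmed formulation: the function names, the labeling convention (speech = 1, text = 0), and the use of binary cross-entropy are all hypothetical choices made for illustration. A discriminator tries to tell speech representations from text representations, while the encoder is trained on the negated loss so that the two modalities become indistinguishable.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def discriminator_loss(speech_logits, text_logits):
    """Mean binary cross-entropy of a modality discriminator.

    Assumption: each logit is the discriminator's score for one encoded
    representation; speech is labeled 1 and text is labeled 0.
    """
    # -log(sigmoid(z)) == softplus(-z) for the speech (label 1) terms.
    speech_terms = [softplus(-z) for z in speech_logits]
    # -log(1 - sigmoid(z)) == softplus(z) for the text (label 0) terms.
    text_terms = [softplus(z) for z in text_logits]
    total = sum(speech_terms) + sum(text_terms)
    return total / (len(speech_logits) + len(text_logits))


def encoder_adversarial_loss(speech_logits, text_logits):
    """Loss for the encoder side of the game (hypothetical formulation).

    The encoder minimizes the negated discriminator loss, i.e. it is
    rewarded when the discriminator cannot separate the modalities.
    """
    return -discriminator_loss(speech_logits, text_logits)


# Well-separated modalities: the discriminator is confident and correct.
separated = discriminator_loss([5.0, 4.0], [-5.0, -4.0])
# Aligned modalities: the discriminator can only guess (logit 0 -> p = 0.5).
aligned = discriminator_loss([0.0, 0.0], [0.0, 0.0])
```

In this toy setting the discriminator loss is small when the two sets of logits are easy to separate and equals ln 2 when every prediction is a coin flip, so an encoder that minimizes `encoder_adversarial_loss` is pushed toward the second case. Because the pressure acts on the two distributions as a whole rather than on matched speech-text pairs, this is the sense in which such an objective is "soft" compared with pairwise hard alignment.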
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
Methods: ALIGN
