Cross-modal Contrastive Learning for Speech Translation

Rong Ye; Mingxuan Wang; Lei Li

arXiv:2205.02444 · cs.CL · May 6, 2022

TL;DR

This paper introduces ConST, a cross-modal contrastive learning approach that aligns speech and text representations, significantly improving speech translation performance and cross-modal retrieval accuracy.

Contribution

It presents a novel contrastive learning method for unified speech-text representations, outperforming previous approaches on the MuST-C benchmark.

Findings

01. Achieves an average BLEU score of 29.4 on MuST-C.

02. Improves cross-modal speech-text retrieval accuracy from 4% to 88%.

03. Consistently outperforms the previous baselines evaluated on MuST-C.
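
To make the retrieval figure in finding 02 concrete, here is a minimal sketch of how a top-1 cross-modal retrieval accuracy could be computed. This page does not say how the paper measures retrieval, so the cosine-similarity scoring, the top-1 criterion, the function names, and the toy vectors are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_retrieval_accuracy(speech, text):
    """Fraction of speech vectors whose most similar text vector is their own pair.

    Assumption: speech[i] and text[i] come from the same utterance.
    """
    hits = 0
    for i, s in enumerate(speech):
        scores = [cosine(s, t) for t in text]
        best = max(range(len(text)), key=lambda j: scores[j])
        hits += int(best == i)
    return hits / len(speech)


# Hypothetical 2-d sentence-level embeddings, not data from the paper.
speech_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
text_vecs = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
accuracy = top1_retrieval_accuracy(speech_vecs, text_vecs)
```

In this toy batch the third speech vector sits closer to the first transcript than to its own, so it counts as a miss. A better-aligned representation space would move such pairs together and raise the score.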

Abstract

How can we learn unified representations for spoken utterances and their written text? Learning similar representations for semantically similar speech and text is important for speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on the popular MuST-C benchmark. Experiments show that the proposed ConST consistently outperforms the previous methods and achieves an average BLEU of 29.4. The analysis further verifies that ConST indeed closes the representation gap between modalities: its learned representation improves the accuracy of cross-modal speech-text retrieval from 4% to 88%. Code and models are available at https://github.com/ReneeYe/ConST.
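
As a rough illustration of what a cross-modal contrastive objective looks like, the sketch below computes an InfoNCE-style loss over a batch of paired speech and text vectors. The abstract does not give the loss formula, so the cosine similarity, the temperature value, the use of in-batch negatives, and every name here are assumptions, not the authors' implementation; the linked repository is the reference for the actual method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(speech, text, temperature=0.1):
    """Mean InfoNCE-style loss over a batch.

    Assumption: speech[i] and text[i] form the positive pair, and every
    other text vector in the batch serves as a negative.
    """
    total = 0.0
    for i, s in enumerate(speech):
        logits = [cosine(s, t) / temperature for t in text]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(speech)


# Hypothetical 2-d embeddings: the same speech batch scored against
# correctly paired transcripts and against swapped transcripts.
speech_batch = [[1.0, 0.0], [0.0, 1.0]]
paired_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]
loss_paired = contrastive_loss(speech_batch, paired_text)
loss_swapped = contrastive_loss(speech_batch, swapped_text)
```

The loss is small when each speech vector is closest to its own transcript and large when the pairs are swapped. Minimizing it therefore pulls matched speech and text representations together and pushes mismatched ones apart.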

Code & Models

Repositories

reneeye/const (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Multimodal Machine Learning Applications

Methods: Contrastive Learning