Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer

Paul-Ambroise Duquenne; Holger Schwenk; Benoît Sagot

arXiv:2310.03724 · cs.CL · October 9, 2023

Open Access

TL;DR

This paper shows that multilingual training significantly improves zero-shot cross-modal speech translation in a modular architecture, where independently trained encoders and decoders are combined through a shared fixed-size representation. For several languages, the zero-shot system even outperforms a supervised approach based on XLSR.

Contribution

It applies multilingual training to modular speech-to-text translation, improving zero-shot cross-modal transfer to the point of outperforming a supervised XLSR-based approach for several languages.

Findings

01

Multilingual training yields significant improvements in zero-shot cross-modal speech translation.

02

Outperforms a supervised XLSR-based approach for several languages.

03

Shows the effectiveness of shared representations in modular models.

Abstract

Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation. In this work, we show that this type of approach can be further improved with multilingual training. We observe significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.
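
The modular idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoders and decoder below are hypothetical stand-ins with no trained weights, the dimension `DIM` and the mean-pooling step are assumptions chosen for brevity, and every name (`toy_text_encoder`, `toy_speech_encoder`, `toy_text_decoder`, `translate`) is invented for illustration. The only point is the interface: every encoder emits a vector of the same fixed size, every decoder consumes only that vector, so a speech encoder can be paired with a text decoder it was never trained with.

```python
from typing import Callable, Dict, List, Sequence

DIM = 4  # hypothetical size of the shared fixed-size representation

Vector = List[float]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Collapse a variable-length sequence of DIM-sized frames into one vector."""
    if not frames:
        raise ValueError("need at least one frame to pool")
    n = len(frames)
    return [sum(frame[i] for frame in frames) / n for i in range(DIM)]


def toy_text_encoder(tokens: Sequence[str]) -> Vector:
    # Stand-in for a trained text encoder: one deterministic frame per token.
    frames = [[float((len(tok) + i) % 3) for i in range(DIM)] for tok in tokens]
    return mean_pool(frames)


def toy_speech_encoder(samples: Sequence[float]) -> Vector:
    # Stand-in for a trained speech encoder: cut the signal into
    # non-overlapping frames of DIM samples (needs at least DIM samples).
    frames = [list(samples[i:i + DIM]) for i in range(0, len(samples) - DIM + 1, DIM)]
    return mean_pool(frames)


def toy_text_decoder(embedding: Vector) -> str:
    # Stand-in for a trained decoder: it only ever sees the fixed-size vector,
    # never the modality or length of the original input.
    if len(embedding) != DIM:
        raise ValueError("decoder expects a vector of size DIM")
    return " ".join(f"{x:.2f}" for x in embedding)


ENCODERS: Dict[str, Callable[..., Vector]] = {
    "text": toy_text_encoder,
    "speech": toy_speech_encoder,
}
DECODERS: Dict[str, Callable[[Vector], str]] = {"text": toy_text_decoder}


def translate(source_modality: str, target_modality: str, source) -> str:
    """Pair any encoder with any decoder through the shared representation."""
    embedding = ENCODERS[source_modality](source)
    return DECODERS[target_modality](embedding)
```

Calling `translate("speech", "text", samples)` exercises a speech-to-text pairing through the same code path as `translate("text", "text", tokens)`. In a real system the zero-shot quality depends on how well the separately trained modules agree on the shared space, which is what the paper reports multilingual training improves.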

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: XLSR