Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot

TL;DR
This paper shows that multilingual training improves modular speech-to-text translation, in which independently trained encoders and decoders are combined through a shared fixed-size representation. In zero-shot cross-modal transfer, the resulting system outperforms a supervised approach based on XLSR for several languages.
Contribution
It applies multilingual training to modular speech-to-text translation, improving zero-shot cross-modal transfer to the point of outperforming a supervised XLSR-based approach for several languages.
Findings
Significant improvements in zero-shot speech translation across languages.
Outperforms XLSR-based supervised approaches in several languages.
Shows the effectiveness of shared representations in modular models.
Abstract
Recent research has shown that independently trained encoders and decoders, combined through a shared fixed-size representation, can achieve competitive performance in speech-to-text translation. In this work, we show that this type of approach can be further improved with multilingual training. We observe significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.
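The modular idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoders, decoders, embedding size, and every name below (`DIM`, `mean_pool`, `toy_text_encoder`, `toy_speech_encoder`, `toy_decoder`, `translate`) are hypothetical stand-ins, and mean pooling is only one assumed way to obtain a fixed-size vector. The point is structural: every encoder emits a vector of the same size and every decoder consumes one, so a speech encoder can be paired with a text decoder it was never trained with.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
DIM = 4  # hypothetical size of the shared fixed-size representation


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Collapse a variable-length sequence of DIM-sized frames into one vector."""
    n = len(frames)
    return [sum(frame[i] for frame in frames) / n for i in range(DIM)]


def toy_text_encoder(tokens: List[str]) -> Vector:
    # Stand-in for a trained text encoder: one crude DIM-sized frame per token.
    frames = [[float(len(t)), float(ord(t[0]) % 7), 1.0, 0.0] for t in tokens]
    return mean_pool(frames)


def toy_speech_encoder(samples: List[float]) -> Vector:
    # Stand-in for a trained speech encoder: cut the signal into full frames
    # of DIM samples each, then pool them.
    frames = [samples[i:i + DIM] for i in range(0, len(samples) - DIM + 1, DIM)]
    return mean_pool(frames)


def toy_decoder(lang: str) -> Callable[[Vector], str]:
    # Stand-in for a trained text decoder: it only sees the fixed-size vector,
    # never the modality or language that produced it.
    def decode(embedding: Vector) -> str:
        return f"<{lang}> " + " ".join(f"{x:.2f}" for x in embedding)
    return decode


ENCODERS: Dict[str, Callable] = {
    "text_de": toy_text_encoder,
    "speech_de": toy_speech_encoder,
}
DECODERS: Dict[str, Callable[[Vector], str]] = {
    "text_en": toy_decoder("en"),
    "text_fr": toy_decoder("fr"),
}


def translate(encoder_name: str, decoder_name: str, source) -> str:
    """Pair any encoder with any decoder through the shared representation."""
    embedding = ENCODERS[encoder_name](source)
    if len(embedding) != DIM:
        raise ValueError("encoder must emit a DIM-sized vector")
    return DECODERS[decoder_name](embedding)
```

In this sketch, `translate("speech_de", "text_en", samples)` mirrors the zero-shot cross-modal setting: the speech encoder and the text decoder are combined only at inference time, and the shared vector size is what makes the combination possible.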
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: XLSR
