
TL;DR
This paper investigates zero-shot speech translation, enabling models trained only on ASR and MT tasks to translate speech between unseen language pairs, addressing data scarcity and error propagation issues.
Contribution
It introduces methods including additional training data and an auxiliary loss to improve zero-shot speech translation performance.
Findings
Improved zero-shot translation by up to 11.8 BLEU points.
Yielded significant improvements in few-shot settings with limited data.
Demonstrated that zero-shot speech translation is feasible without direct training data.
Abstract
Speech Translation (ST) is the task of translating speech in one language into text in another language. Traditional cascaded approaches to ST, which chain Automatic Speech Recognition (ASR) and Machine Translation (MT) systems, are prone to error propagation. End-to-end approaches use a single system to avoid propagating errors, yet are difficult to employ due to data scarcity. We explore zero-shot translation, which enables translating a pair of languages that is unseen during training, thus avoiding the use of end-to-end ST data. Zero-shot translation has been shown to work for multilingual machine translation, yet has not been studied for speech translation. We attempt to build zero-shot ST models that are trained only on ASR and MT tasks but can perform the ST task during inference. The challenge is that the representations of text and audio are significantly different, thus the models learn ASR and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
