Data Efficient Direct Speech-to-Text Translation with Modality Agnostic Meta-Learning
Sathish Indurthi, Houjeung Han, Nikhil Kumar Lakumarapu, Beomseok Lee, Insoo Chung, Sangha Kim, Chanwoo Kim

TL;DR
This paper introduces a modality agnostic meta-learning approach for end-to-end speech translation that transfers knowledge from ASR and MT tasks, significantly improving translation quality, especially in low-data scenarios.
Contribution
The paper proposes a novel meta-learning framework that trains a multi-task model to transfer knowledge from ASR and MT to speech translation, outperforming previous transfer learning methods.
Findings
Achieved state-of-the-art BLEU scores on En-De and En-Fr translation tasks.
Outperformed previous transfer learning approaches by large margins.
Demonstrated effectiveness in low-resource speech translation scenarios.
Abstract
End-to-end Speech Translation (ST) models have several advantages, such as lower latency, smaller model size, and less error compounding, over conventional pipelines that combine Automatic Speech Recognition (ASR) and text Machine Translation (MT) models. However, collecting large amounts of parallel data for the ST task is more difficult than for the ASR and MT tasks. Previous studies have proposed transfer learning approaches to overcome this difficulty. These approaches benefit from weakly supervised training data, such as ASR speech-to-transcript or MT text-to-text translation pairs. However, the parameters in these models are updated independently of each task, which may lead to sub-optimal solutions. In this work, we adopt a meta-learning algorithm to train a modality agnostic multi-task model that transfers knowledge from source tasks=ASR+MT to target task=ST where…
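To make the idea of meta-learning across source tasks and then adapting to a data-poor target task more concrete, here is a minimal toy sketch. It is not the paper's model or training recipe: it assumes a first-order MAML-style update, a scalar linear model, and synthetic regression tasks whose slopes (2.0 and 3.0) merely stand in for ASR and MT, with a target slope of 2.5 standing in for ST. All function names, learning rates, and step counts are hypothetical.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Mean squared error of the toy model y_hat = w * x, and its gradient in w."""
    err = w * x - y
    return float(np.mean(err ** 2)), float(np.mean(2.0 * err * x))

def sample_batch(slope, rng, n=16):
    """Draw a small synthetic batch for the task y = slope * x."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, slope * x

def meta_train(source_slopes, steps=500, inner_lr=0.1, outer_lr=0.05, seed=0):
    """First-order MAML-style meta-training over the source tasks (an assumption)."""
    rng = np.random.default_rng(seed)
    w = 0.0  # shared initialization that meta-learning tries to improve
    for _ in range(steps):
        meta_grad = 0.0
        for slope in source_slopes:
            # Inner step: adapt a copy of w on a support batch of this task.
            xs, ys = sample_batch(slope, rng)
            _, g = loss_and_grad(w, xs, ys)
            w_task = w - inner_lr * g
            # Outer signal: gradient on a fresh query batch at the adapted
            # parameters (first-order: second derivatives are ignored).
            xq, yq = sample_batch(slope, rng)
            _, gq = loss_and_grad(w_task, xq, yq)
            meta_grad += gq
        # Update the shared initialization with the averaged task gradients.
        w -= outer_lr * meta_grad / len(source_slopes)
    return w

def fine_tune(w, slope, steps=3, lr=0.1, seed=1):
    """Adapt to the target task with only a few gradient steps; return held-out loss."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x, y = sample_batch(slope, rng)
        _, g = loss_and_grad(w, x, y)
        w -= lr * g
    x, y = sample_batch(slope, rng, n=256)
    return loss_and_grad(w, x, y)[0]

# Source tasks stand in for ASR and MT; the target task stands in for ST.
meta_w = meta_train(source_slopes=[2.0, 3.0])
loss_meta = fine_tune(meta_w, slope=2.5)
loss_scratch = fine_tune(0.0, slope=2.5)
print(f"meta init: {meta_w:.3f}")
print(f"target loss after 3 steps, meta init: {loss_meta:.5f}")
print(f"target loss after 3 steps, zero init: {loss_scratch:.5f}")
```

In this toy setting the meta-learned initialization settles between the two source tasks, so a handful of gradient steps on the target task is enough, whereas the same few steps from a zero initialization leave a much larger error. The real system operates on speech and text inputs with a shared sequence-to-sequence model, which this sketch does not attempt to reproduce.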
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
