Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

Brian Yan; Xuankai Chang; Antonios Anastasopoulos; Yuya Fujita; Shinji Watanabe

arXiv:2309.15826·cs.CL·September 28, 2023

Open Access

TL;DR

This paper introduces a multi-task learning framework for speech-to-text translation that uses hard parameter sharing and a pre-processing step to unify speech and text inputs, improving translation performance without external data.

Contribution

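The paper proposes an ST/MT multi-tasking framework with hard parameter sharing, in which all model parameters are shared across modalities, together with a pre-processing stage that converts speech and text inputs into discrete token sequences of similar length for end-to-end translation.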

Findings

01

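Improves BLEU by an average of +0.5 across attentional encoder-decoder, CTC, transducer, and joint CTC/attention models on MuST-C, without external data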

02

Incorporating external MT data yields a +0.8 BLEU improvement

03

Enhances transfer learning from pre-trained models for +1.8 BLEU

Abstract

Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose an ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modally. Our method reduces the speech-text modality gap via a pre-processing stage which converts speech and text inputs into two discrete token sequences of similar length -- this allows models to indiscriminately process both modalities simply using a joint vocabulary. With experiments on MuST-C, we demonstrate that our multi-tasking framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU…
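
The joint-vocabulary idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not say how speech is discretized or how the two sequence lengths are matched, so the unit inventory size, the character-level text symbols, the run-collapsing step, and all names below are assumptions made only for illustration. The point shown is that once both modalities are integer IDs in one shared ID space, a single model with a single embedding table can consume either input.

```python
from itertools import groupby

# Hypothetical sizes and symbols, chosen only for illustration.
NUM_SPEECH_UNITS = 8  # e.g. IDs produced by some speech quantizer (assumed)
TEXT_SYMBOLS = list("abcdefghijklmnopqrstuvwxyz ")

# Joint vocabulary: speech units occupy IDs [0, NUM_SPEECH_UNITS),
# text symbols are placed immediately after them.
TEXT_OFFSET = NUM_SPEECH_UNITS
TEXT_TO_ID = {ch: TEXT_OFFSET + i for i, ch in enumerate(TEXT_SYMBOLS)}
JOINT_VOCAB_SIZE = NUM_SPEECH_UNITS + len(TEXT_SYMBOLS)


def speech_to_tokens(unit_ids):
    """Map frame-level speech units into the joint ID space.

    Collapsing runs of repeated units is an assumed way to shorten the
    speech sequence; the abstract does not specify the actual method.
    """
    for u in unit_ids:
        if not 0 <= u < NUM_SPEECH_UNITS:
            raise ValueError(f"speech unit out of range: {u}")
    return [u for u, _ in groupby(unit_ids)]


def text_to_tokens(text):
    """Map text into the joint ID space, skipping unknown characters."""
    return [TEXT_TO_ID[ch] for ch in text.lower() if ch in TEXT_TO_ID]


if __name__ == "__main__":
    speech = speech_to_tokens([3, 3, 3, 1, 1, 5, 5, 5, 5, 2])
    text = text_to_tokens("hi there")
    # Both outputs are lists of ints below JOINT_VOCAB_SIZE, so the same
    # embedding table and the same model parameters can process either one.
    print(speech, text, JOINT_VOCAB_SIZE)
```

Keeping the two ID ranges disjoint means the model can still tell the modalities apart from the tokens alone, while every parameter downstream of the embedding lookup is shared.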

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis