Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Renjie Zheng; Junkun Chen; Mingbo Ma; Liang Huang

arXiv:2102.05766 · cs.CL · September 15, 2021 · 6 cites

TL;DR

This paper introduces FAT-MLM and FAT-ST, a unified framework for learning joint representations of speech and text, significantly enhancing speech translation performance by leveraging diverse data sources.

Contribution

The paper proposes a fused acoustic and text masked language model (FAT-MLM) and an end-to-end speech translation model (FAT-ST) that use multimodal data to improve translation quality.

Findings

01. Up to +5.9 BLEU improvement in speech translation

02. Effective joint learning from diverse speech and text corpora

03. Enhanced translation performance across multiple directions

Abstract

Recently, representation learning for text and speech has successfully improved many language-related tasks. However, all existing methods suffer from two limitations: (a) they only learn from one input modality, while a unified representation for both speech and text is needed by tasks such as end-to-end speech translation, and as a result, (b) they cannot exploit various large-scale text and speech data and their performance is limited by the scarcity of parallel speech translation data. To address these problems, we propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input from various types of corpora including parallel data for speech recognition and machine translation, and even pure speech and text data. Within this cross-modal representation learning framework, we further present an end-to-end…

Videos

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling