Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

Hang Le; Juan Pino; Changhan Wang; Jiatao Gu; Didier Schwab; Laurent Besacier

arXiv:2011.00747 · cs.CL · November 21, 2020 · 22 cites

TL;DR

This paper presents a dual-decoder Transformer architecture that jointly performs speech recognition and multilingual speech translation, outperforming previously reported multilingual speech translation results and supporting effective multitask learning.
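Joint training of two task-specific decoders is commonly set up as a weighted sum of the per-task losses. The sketch below illustrates that general multitask idea under stated assumptions: each decoder is scored with token-level cross-entropy, and `alpha` is a hypothetical weight on the ASR term. It is not taken from the paper and should not be read as its exact training objective.

```python
import math
from typing import List


def cross_entropy(gold_probs: List[float]) -> float:
    # Mean negative log-likelihood of the reference tokens, given the
    # probability the model assigned to each gold token.
    return -sum(math.log(p) for p in gold_probs) / len(gold_probs)


def joint_loss(asr_gold_probs: List[float], st_gold_probs: List[float], alpha: float) -> float:
    # Weighted sum of the ASR and ST losses.
    # `alpha` is a hypothetical task weight in [0, 1] (an assumption, not a paper value).
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * cross_entropy(asr_gold_probs) + (1.0 - alpha) * cross_entropy(st_gold_probs)


if __name__ == "__main__":
    # Toy per-token probabilities for the gold ASR transcript and ST translation.
    print(joint_loss([0.9, 0.8], [0.6, 0.5, 0.7], alpha=0.5))
```

Setting `alpha` to 1.0 or 0.0 recovers a single-task ASR or ST objective, which is one simple way to compare joint training against separate models.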

Contribution

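The paper introduces a dual-decoder Transformer with two interaction variants, parallel and cross, which differ in the level of dependency between the ASR and ST decoders, and uses them to improve joint ASR and speech translation performance.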

Findings

01 Outperforms previously reported multilingual speech translation models
02 Parallel dual-decoder models show no trade-off between the ASR and ST tasks
03 Achieves state-of-the-art results on the MuST-C dataset

Abstract

We introduce the dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in how these decoders interact with each other: one decoder can attend to different information sources from the other via a dual-attention mechanism. We propose two variants of these architectures corresponding to two different levels of dependencies between the decoders, called the parallel and cross dual-decoder Transformers, respectively. Extensive experiments on the MuST-C dataset show that our models outperform the previously-reported highest translation performance in the multilingual settings, and also outperform bilingual one-to-one…
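The dual-attention idea, where one decoder attends to the other decoder's hidden states in addition to its own attention, can be illustrated with a minimal single-head sketch. Several simplifications are assumptions rather than the paper's design: there are no learned query/key/value projections, the decoder's "main" attention is reduced to attention over its own states, and the two outputs are merged by a weighted sum with a hypothetical weight `lam`. The sketch also does not model how the parallel and cross variants differ in which of the other decoder's states are visible.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    # Single-head scaled dot-product attention (Vaswani et al., 2017),
    # without learned projections (a simplifying assumption).
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def dual_attention_step(
    own_query: Vector,
    own_states: List[Vector],
    other_states: List[Vector],
    lam: float,
) -> Vector:
    # Main attention: the decoder attends over its own hidden states.
    main = attend(own_query, own_states, own_states)
    # Dual attention: the same query attends over the OTHER decoder's states
    # (e.g. the ST decoder looking at ASR decoder hidden states).
    dual = attend(own_query, other_states, other_states)
    # Merge by weighted sum; `lam` is a hypothetical interpolation weight.
    return [m + lam * d for m, d in zip(main, dual)]


if __name__ == "__main__":
    st_query = [1.0, 0.0]                   # current ST decoder query (toy values)
    st_states = [[1.0, 0.0], [0.0, 1.0]]    # ST decoder hidden states
    asr_states = [[0.5, 0.5]]               # ASR decoder hidden states
    print(dual_attention_step(st_query, st_states, asr_states, lam=0.5))
```

With `lam=0` the step reduces to ordinary single-decoder attention, so the weight controls how much information flows from the other task's decoder.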

Code & Models

Repositories

formiel/speech-translation (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Dropout · Multi-Head Attention · Byte Pair Encoding · Softmax · Dense Connections