Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation
Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier

TL;DR
This paper presents a dual-decoder Transformer architecture that jointly performs speech recognition and multilingual speech translation, improving performance over previous models and enabling effective multitask learning.
Contribution
The paper introduces a novel dual-decoder Transformer with two interaction variants, enhancing joint ASR and speech translation performance.
Findings
Outperforms previous multilingual speech translation models
Parallel dual-decoder models show no trade-off between ASR and ST tasks
Achieves state-of-the-art results on the MuST-C dataset
Abstract
We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in how these decoders interact with each other: one decoder can attend to different information sources from the other via a dual-attention mechanism. We propose two variants of these architectures corresponding to two different levels of dependencies between the decoders, called the parallel and cross dual-decoder Transformers, respectively. Extensive experiments on the MuST-C dataset show that our models outperform the previously-reported highest translation performance in the multilingual settings, and outperform as well bilingual one-to-one…
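To make the dual-attention idea in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes single-head scaled dot-product attention with no learned projections, no layer normalization, and no encoder attention, and it assumes the self-attention and dual-attention outputs are merged by a simple sum. The names `attention` and `dual_attention_step` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention.

    q has shape (Tq, d); k and v have shape (Tk, d).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Query position i may only look at key positions j <= i.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

def dual_attention_step(own, other, causal=True):
    """Hypothetical dual-attention block for one decoder.

    `own` holds this decoder's hidden states, `other` holds the hidden
    states of the second decoder. Merging by sum is an assumption made
    for this sketch, not a detail taken from the abstract.
    """
    self_out = attention(own, own, own, causal=causal)      # attend to own history
    dual_out = attention(own, other, other, causal=causal)  # attend to the other decoder
    return own + self_out + dual_out                        # residual connection + sum merge

# Toy usage: two decoders (ASR and ST) with 5 positions of width 8 each.
rng = np.random.default_rng(0)
asr_states = rng.normal(size=(5, 8))
st_states = rng.normal(size=(5, 8))

asr_out = dual_attention_step(asr_states, st_states)  # ASR decoder looks at ST states
st_out = dual_attention_step(st_states, asr_states)   # ST decoder looks at ASR states
```

In this sketch each decoder keeps its own stream of hidden states and reads from the other decoder's stream through a second attention call. How much of the other decoder's state is visible at each step is the kind of dependency level that, according to the abstract, distinguishes the parallel and cross variants; the causal mask used here is only one plausible choice.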
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Dropout · Multi-Head Attention · Byte Pair Encoding · Softmax · Dense Connections
