Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates

Hirofumi Inaguma; Siddharth Dalmia; Brian Yan; Shinji Watanabe

arXiv:2109.12804 · eess.AS · September 28, 2021

TL;DR

Fast-MD introduces a non-autoregressive decoding approach for multi-decoder speech translation, significantly improving inference speed while maintaining translation quality, making it more suitable for real-world applications.

Contribution

It proposes a non-autoregressive method for generating hidden intermediates in multi-decoder speech translation, reducing decoding time while keeping translation quality comparable.

Findings

01. Achieved about 2x faster decoding on GPU and about 4x on CPU than the naive MD model.

02. Maintained comparable translation quality at the faster decoding speed.

03. Improved translation quality further with a Conformer encoder and an intermediate CTC loss.

Abstract

The multi-decoder (MD) end-to-end speech translation model has demonstrated high translation quality by searching for better intermediate automatic speech recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass decoding model decomposing the overall task into ASR and machine translation sub-tasks. However, the decoding speed is not fast enough for real-world applications because it conducts beam search for both sub-tasks during inference. We propose Fast-MD, a fast MD model that generates HI by non-autoregressive (NAR) decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder. We investigated two types of NAR HI: (1) parallel HI by using an autoregressive Transformer ASR decoder and (2) masked HI by using Mask-CTC, which combines CTC and the conditional masked language model. To reduce a mismatch in the ASR decoder between…
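The CTC-based step that the abstract describes can be illustrated with a minimal sketch. It shows greedy CTC decoding (take the best label per frame, merge repeats, drop blanks) and then builds the token sequence that an ASR decoder could consume in a single parallel pass instead of a step-by-step beam search. The names `ctc_greedy_collapse`, `build_parallel_decoder_input`, `BLANK_ID`, and the `sos_id` parameter are hypothetical and chosen for illustration; the blank index and the start-of-sequence convention are assumptions, not details taken from the paper.

```python
from typing import List, Optional, Sequence

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_collapse(
    frame_scores: Sequence[Sequence[float]], blank_id: int = BLANK_ID
) -> List[int]:
    """Pick the best label per frame, merge consecutive repeats, drop blanks."""
    tokens: List[int] = []
    previous: Optional[int] = None
    for scores in frame_scores:
        # Greedy choice: index of the highest-scoring label for this frame.
        best = max(range(len(scores)), key=lambda i: scores[i])
        # Emit only when the label changes and is not the blank symbol.
        if best != previous and best != blank_id:
            tokens.append(best)
        previous = best
    return tokens


def build_parallel_decoder_input(tokens: List[int], sos_id: int) -> List[int]:
    """Hypothetical helper: prepend a start symbol so a decoder can process
    every position of the CTC hypothesis in one teacher-forced pass."""
    return [sos_id] + list(tokens)


# Toy example with 3 labels (0 = blank) over 6 frames.
frames = [
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # label 1
    [0.10, 0.80, 0.10],  # label 1 again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank separates the next occurrence of label 1
    [0.10, 0.80, 0.10],  # label 1
    [0.10, 0.10, 0.80],  # label 2
]
hypothesis = ctc_greedy_collapse(frames)
decoder_input = build_parallel_decoder_input(hypothesis, sos_id=3)
```

Because the whole hypothesis is available before the decoder runs, the decoder needs one forward pass rather than one pass per output token per beam, which is the general source of the speed-up that non-autoregressive decoding targets.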

Code & Models

Repositories

aigc-audio/audiogpt
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Dense Connections · Byte Pair Encoding · Label Smoothing