Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Jian Xue; Peidong Wang; Jinyu Li; Matt Post; Yashesh Gaur

arXiv:2204.05352·cs.CL·July 5, 2022

TL;DR

This paper introduces a neural transducer-based streaming end-to-end speech translation model that reduces latency, improves speech information utilization, and supports multilingual translation, outperforming traditional cascaded methods.

Contribution

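It presents a Transformer transducer (TT) model for streaming speech translation, adds attention pooling to the joint network, and extends the model to multilingual output. The reported gains are lower inference time than a cascaded system and better translation quality than a non-streaming cascaded baseline.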

Findings

01

Reduces inference latency compared to cascaded systems.

02

Outperforms non-streaming cascaded speech translation for English-German.

03

Extends to multilingual speech translation, generating text in multiple languages at the same time.

Abstract

Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce them to streaming end-to-end speech translation (ST), which aims to convert audio signals directly to text in other languages. Compared with cascaded ST, which performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve the modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates texts of multiple languages at the same time. Experimental results on a large-scale pseudo-labeled training set of 50 thousand (K) hours show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming…
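
To make the joint-network idea concrete, here is a minimal pure-Python sketch. The `additive_joint` function follows the usual transducer form, where an encoder frame and a prediction-network state are summed, passed through tanh, and projected to vocabulary logits. The `attention_pooled_joint` function is only one plausible reading of "attention pooling": the abstract does not say how the pooling is computed, so the causal window, the dot-product scoring, and every name and dimension below are assumptions for illustration, not the authors' formulation.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    return [dot(row, v) for row in W]


def additive_joint(enc_t, pred_u, W_out):
    """Usual transducer joint: add, apply tanh, project to vocabulary logits."""
    hidden = [math.tanh(e + p) for e, p in zip(enc_t, pred_u)]
    return matvec(W_out, hidden)


def attention_pooled_joint(enc, t, pred_u, W_out, window=3):
    """HYPOTHETICAL variant: the prediction state queries the last `window`
    encoder frames ending at t (no lookahead), and the attention-weighted
    average replaces the single frame enc[t]."""
    frames = enc[max(0, t - window + 1): t + 1]
    scale = math.sqrt(len(pred_u))
    weights = softmax([dot(pred_u, f) / scale for f in frames])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames))
              for d in range(len(pred_u))]
    hidden = [math.tanh(c + p) for c, p in zip(pooled, pred_u)]
    return matvec(W_out, hidden)


def joint_lattice(enc, pred, W_out, window=3):
    """Logits for every (frame t, label position u) pair: shape T x U x V."""
    return [[attention_pooled_joint(enc, t, p, W_out, window) for p in pred]
            for t in range(len(enc))]


# Toy sizes (assumed): T frames, U label positions, D hidden dims, V vocab entries.
rng = random.Random(0)
T, U, D, V = 5, 3, 4, 6
enc = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
pred = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(U)]
W_out = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(V)]

lattice = joint_lattice(enc, pred, W_out)
print(len(lattice), len(lattice[0]), len(lattice[0][0]))
```

With `window=1` the pooled variant reduces to the additive joint, since attention over a single frame gives that frame a weight of 1. Restricting the window to frames at or before `t` keeps the sketch compatible with streaming, because no future audio is needed to score a label.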


Code & Models

Repositories

mu-y/diarist
pytorch


Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing

Methods: Attention Pooling · Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Residual Connection