Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur

TL;DR
This paper introduces a neural transducer-based streaming end-to-end speech translation model that reduces inference latency, uses the speech signal directly rather than an intermediate transcript, and supports multilingual translation, outperforming non-streaming cascaded systems.
Contribution
It presents a Transformer transducer (TT) model for streaming speech translation, adds attention pooling to the TT joint network to increase modeling capacity, and extends the model to multilingual translation.
Findings
Reduces inference latency compared to cascaded systems.
Outperforms non-streaming cascaded speech translation for English-German.
Extends to multilingual speech translation, generating text in multiple target languages at the same time.
Abstract
Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce them to streaming end-to-end speech translation (ST), which aims to convert audio signals directly to text in other languages. Compared with cascaded ST, which performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency, exploits speech information, and avoids error propagation from ASR to MT. To improve the modeling capacity, we propose attention pooling for the joint network in TT. In addition, we extend TT-based ST to multilingual ST, which generates text in multiple languages at the same time. Experimental results on a large-scale pseudo-labeled training set of 50 thousand (K) hours show that TT-based ST not only significantly reduces inference time but also outperforms non-streaming…
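For readers unfamiliar with the joint network mentioned in the abstract, the following is a minimal sketch of the conventional additive joint network used in neural transducers, written with NumPy. It is not the paper's attention-pooling variant, whose details are not given in this summary; the function name, weight names, and dimensions are all illustrative assumptions.

```python
import numpy as np


def joint_network(enc_out, pred_out, w_enc, w_pred, w_out):
    """Conventional additive transducer joint network (illustrative only).

    This is NOT the attention-pooling joint network proposed in the paper;
    all names and shapes here are assumptions made for the sketch.

    enc_out:  (T, d_enc)  encoder outputs, one per audio frame
    pred_out: (U, d_pred) prediction-network outputs, one per token prefix
    returns:  (T, U, V)   log-probabilities over V output symbols
    """
    enc_proj = enc_out @ w_enc      # (T, d_joint)
    pred_proj = pred_out @ w_pred   # (U, d_joint)
    # Combine every audio frame with every token prefix by broadcast addition.
    hidden = np.tanh(enc_proj[:, None, :] + pred_proj[None, :, :])
    logits = hidden @ w_out         # (T, U, V)
    # Numerically stable log-softmax over the vocabulary axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))


# Toy usage with random inputs and weights (hypothetical sizes).
rng = np.random.default_rng(0)
T, U, d_enc, d_pred, d_joint, V = 5, 3, 8, 6, 4, 10
log_probs = joint_network(
    rng.normal(size=(T, d_enc)),
    rng.normal(size=(U, d_pred)),
    rng.normal(size=(d_enc, d_joint)),
    rng.normal(size=(d_pred, d_joint)),
    rng.normal(size=(d_joint, V)),
)
print(log_probs.shape)  # (5, 3, 10)
```

The output grid has one distribution per (frame, prefix) pair, which is what lets a transducer emit tokens incrementally as audio arrives. According to the abstract, the paper changes how the joint network combines its inputs by introducing attention pooling; that change is not reproduced here.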
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
Methods: Attention Pooling · Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Residual Connection
