Neural Simultaneous Speech Translation Using Alignment-Based Chunking
Patrick Wilken, Tamer Alkhouli, Evgeny Matusov, Pavel Golik

TL;DR
This paper introduces a neural simultaneous translation model that dynamically decides when to keep reading source input and when to produce output, improving the trade-off between translation quality and latency in speech translation.
Contribution
It presents a novel alignment-based chunking method and a joint training approach for dynamic decision-making in neural speech translation models.
Findings
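Outperforms a wait-k baseline by 2.6-3.7% BLEU on the IWSLT 2020 English-to-German task.
Generates chunked training data from word alignments while preserving enough context for translation.
Evaluates models with bidirectional and unidirectional encoders of different depths on both real speech and text input.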
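For context on the baseline named in the findings: a wait-k policy follows a fixed schedule, first reading k source words and then alternating between writing one target word and reading one more source word until the source is exhausted. The sketch below only illustrates that schedule. The function name `wait_k_schedule` and the "READ"/"WRITE" labels are illustrative choices, not taken from the paper.

```python
from typing import List


def wait_k_schedule(src_len: int, tgt_len: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of a fixed wait-k policy.

    Hypothetical helper for illustration only; it models the schedule,
    not a translation model.
    """
    actions: List[str] = []
    read = 0      # source words consumed so far
    written = 0   # target words produced so far
    while written < tgt_len:
        # Before writing target word number `written` (0-based), wait-k
        # requires k + written source words, capped at the source length.
        needed = min(k + written, src_len)
        if read < needed:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


print(wait_k_schedule(src_len=4, tgt_len=4, k=2))
# ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE']
```

Because the schedule is fixed in advance, it cannot adapt to word-order differences between the languages, which is the limitation a dynamic chunking decision is meant to address.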
Abstract
In simultaneous machine translation, the objective is to determine when to produce a partial translation given a continuous stream of source words, with a trade-off between latency and quality. We propose a neural machine translation (NMT) model that makes dynamic decisions when to continue feeding on input or generate output words. The model is composed of two main components: one to dynamically decide on ending a source chunk, and another that translates the consumed chunk. We train the components jointly and in a manner consistent with the inference conditions. To generate chunked training data, we propose a method that utilizes word alignment while also preserving enough context. We compare models with bidirectional and unidirectional encoders of different depths, both on real speech and text input. Our results on the IWSLT 2020 English-to-German task outperform a wait-k baseline by…
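The core idea of deriving chunks from word alignments can be sketched as follows: cut the sentence pair wherever no alignment link crosses the cut, so that each source chunk can be translated into its target chunk without waiting for later source words. This is a simplified illustration under stated assumptions, not the authors' procedure. In particular, it omits the context-preservation step the abstract mentions, and the function name `monotone_chunks` and the half-open span representation are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) word indices


def monotone_chunks(
    src_len: int, tgt_len: int, links: Set[Tuple[int, int]]
) -> List[Tuple[Span, Span]]:
    """Split a sentence pair into chunk pairs that no alignment link crosses.

    `links` holds (source_index, target_index) pairs. Simplified sketch:
    unaligned words are attached to whichever chunk covers them.
    """
    chunks: List[Tuple[Span, Span]] = []
    src_start = tgt_start = 0
    max_tgt = -1  # highest target index aligned to any source word seen so far
    for i in range(src_len):
        for s, t in links:
            if s == i:
                max_tgt = max(max_tgt, t)
        # A cut after source word i is invalid if a later source word is
        # aligned to a target word inside the current target prefix.
        crossing = any(s > i and t <= max_tgt for s, t in links)
        has_target = max_tgt >= tgt_start  # chunk would contain a target word
        if not crossing and has_target and i < src_len - 1:
            chunks.append(((src_start, i + 1), (tgt_start, max_tgt + 1)))
            src_start, tgt_start = i + 1, max_tgt + 1
    # Fold the remainder into the previous chunk if it has no target words.
    if chunks and tgt_start == tgt_len:
        (src_start, _), (tgt_start, _) = chunks.pop()
    chunks.append(((src_start, src_len), (tgt_start, tgt_len)))
    return chunks


# "I have seen it" -> "ich habe es gesehen": the last two words are reordered.
example_links = {(0, 0), (1, 1), (2, 3), (3, 2)}
print(monotone_chunks(4, 4, example_links))
# [((0, 1), (0, 1)), ((1, 2), (1, 2)), ((2, 4), (2, 4))]
```

In the example, the reordered pair "seen it" / "es gesehen" stays in one chunk because a cut between the two source words would separate aligned words. Chunks this small carry little context, which is presumably why the abstract stresses preserving enough of it.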
