Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation
Keqi Deng, Philip C. Woodland

TL;DR
This paper introduces LS-Transducer-SST, a label-synchronous neural transducer for simultaneous speech translation (SST). It uses an Auto-regressive Integrate-and-Fire (AIF) mechanism to decide when to emit translation tokens, which lets it trade translation quality against latency, and it achieves a better quality-latency trade-off than existing methods.
Contribution
The paper proposes a label-synchronous neural transducer for SST built on an Auto-regressive Integrate-and-Fire (AIF) mechanism. The model supports streaming and re-ordering natively, can use monolingual text-only data through its prediction network, and offers controllable latency.
Findings
Achieves a better quality-latency trade-off than existing methods.
Improves BLEU by 3.1 points on Es-En and 2.9 points on En-De.
Reduces average lagging latency by 1.4 seconds.
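This summary does not define the latency metric. Assuming "average lagging latency" refers to the Average Lagging (AL) measure commonly used in the simultaneous translation literature, a minimal sketch of the computation could look like the following. The function name, the argument names, and the choice of seconds as the unit are illustrative assumptions, not taken from the paper, and published variants differ on whether the target length is the hypothesis or the reference length.

```python
from typing import List


def average_lagging(delays: List[float], source_duration: float, target_length: int) -> float:
    """Average Lagging under the common definition (illustrative sketch).

    delays[i] is the amount of source audio, in seconds, that had been
    consumed when target token i was emitted (hypothetical input format).
    """
    if not delays or target_length <= 0 or source_duration <= 0:
        raise ValueError("need at least one delay and positive lengths")

    # Source seconds per target token for an ideal, evenly paced translator.
    step = source_duration / target_length

    total = 0.0
    count = 0
    for i, delay in enumerate(delays):
        # Lag of token i relative to the ideal pace.
        total += delay - i * step
        count += 1
        # Stop at the first token emitted after the whole source was read;
        # later tokens carry no additional latency information.
        if delay >= source_duration:
            break
    return total / count
```

Under this definition, a system that waits for the whole utterance before emitting anything gets an AL equal to the utterance duration, while a system that stays a fixed 2 seconds behind an evenly paced ideal gets an AL of 2 seconds.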
Abstract
While the neural transducer is popular for online speech recognition, simultaneous speech translation (SST) requires both streaming and re-ordering capabilities. This paper presents the LS-Transducer-SST, a label-synchronous neural transducer for SST, which naturally possesses these two properties. The LS-Transducer-SST dynamically decides when to emit translation tokens based on an Auto-regressive Integrate-and-Fire (AIF) mechanism. A latency-controllable AIF is also proposed, which can control the quality-latency trade-off either during decoding only or during both decoding and training. The LS-Transducer-SST can naturally utilise monolingual text-only data via its prediction network, which helps alleviate the key issue of data sparsity for E2E SST. During decoding, a chunk-based incremental joint decoding technique is designed to refine and expand the search space.…
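The abstract says that emission timing is decided by an integrate-and-fire mechanism but does not give its details. As a rough illustration of the general idea only, the sketch below implements a plain integrate-and-fire accumulator: per-frame weights are summed, and a token boundary fires each time the running total reaches a threshold. This is not the paper's AIF; in particular, how AIF computes its weights auto-regressively is not described here, so the weights are simply passed in. The function name and the threshold parameter are hypothetical.

```python
from typing import List


def integrate_and_fire(weights: List[float], threshold: float = 1.0) -> List[int]:
    """Return the frame indices at which a token would be emitted.

    weights[t] is a non-negative score for encoder frame t (assumed to be
    produced by some upstream model; not specified in this summary).
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")

    fire_frames: List[int] = []
    accumulated = 0.0
    for t, weight in enumerate(weights):
        # Integrate: add this frame's weight to the running total.
        accumulated += weight
        # Fire: one emission per threshold crossing. A single heavy frame
        # may cross the threshold more than once.
        while accumulated >= threshold:
            fire_frames.append(t)
            accumulated -= threshold
    return fire_frames
```

For example, with weights `[0.5, 0.5, 0.25, 0.25, 0.5, 1.0]` and the default threshold, the accumulator fires at frames 1, 4, and 5, so three tokens would be emitted as the audio streams in.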
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
