Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR
Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang

TL;DR
This paper introduces a synchronized dual-decoder approach for simultaneous speech-to-text translation, combining the benefits of cascaded and end-to-end methods to improve translation quality with low latency.
Contribution
It proposes a novel synchronized decoding paradigm with multitask training, enhancing translation accuracy while maintaining low latency in real-time speech translation.
Findings
Achieves better translation quality than traditional methods.
Maintains similar latency levels to existing approaches.
Demonstrates effectiveness on the MuST-C dataset for En-De and En-Es translation.
Abstract
Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder because it combines two separate tasks. We instead propose a new paradigm with the advantages of both cascaded and end-to-end approaches. The key idea is to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, where the intermediate results of ASR guide the decoding policy of (but are not fed as input to) ST. During training, we use multitask learning to jointly learn these two tasks with a shared encoder. En-to-De and En-to-Es experiments on the…
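The synchronization idea in the abstract can be illustrated with a minimal sketch. It assumes a wait-k style policy in which the number of source words committed so far by the streaming ASR decoder decides when the ST decoder may write; the abstract itself does not name the policy, so this choice, the function name synchronized_wait_k, and its parameters are all hypothetical. The target tokens are passed in as a fixed list purely as a stand-in for a real ST decoder, which would generate them from the shared speech encoder rather than from the ASR text.

```python
from typing import Iterable, Iterator, List, Tuple


def synchronized_wait_k(
    asr_counts: Iterable[int],
    target_tokens: List[str],
    k: int,
) -> Iterator[Tuple[int, str]]:
    """Yield (chunk_index, target_token) pairs under an ASR-guided wait-k policy.

    asr_counts[t] is the cumulative number of source words the streaming ASR
    decoder has committed after speech chunk t (hypothetical input format).
    target_tokens stands in for the ST decoder's output; only the ASR word
    COUNT is used here, never the ASR text itself.
    """
    written = 0
    step = -1  # stays -1 if no speech chunk is ever consumed
    for step, committed in enumerate(asr_counts):
        # wait-k: the i-th target token (1-indexed) may be written once
        # k + i - 1 source words are available, i.e. at most
        # committed - k + 1 tokens may have been written so far.
        while written < len(target_tokens) and written < committed - k + 1:
            yield step, target_tokens[written]
            written += 1
    # The speech has ended: emit any remaining target tokens at the last chunk.
    while written < len(target_tokens):
        yield step, target_tokens[written]
        written += 1


if __name__ == "__main__":
    counts = [0, 1, 1, 2, 3, 3]        # ASR words committed after each chunk
    tokens = ["a", "b", "c", "d"]      # placeholder ST output
    for chunk, tok in synchronized_wait_k(counts, tokens, k=2):
        print(f"chunk {chunk}: write {tok}")
```

In this sketch the ASR stream acts only as a clock for the translation decoder, which mirrors the abstract's point that ASR results guide the ST policy without being fed to ST as input.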
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
