Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
Shanbo Cheng, Zhichao Huang, Tom Ko, Hang Li, Ningxin Peng, Lu Xu, Qini Zhang

TL;DR
This paper introduces CLASI, a novel high-quality simultaneous speech translation system inspired by human interpreters, which balances translation quality and latency using a data-driven strategy and multi-modal retrieval, outperforming existing systems.
Contribution
The paper presents CLASI, a new approach integrating a read-write strategy and multi-modal retrieval to improve real-time speech translation quality and robustness, especially in challenging scenarios.
Findings
CLASI achieves a valid information proportion (VIP) of 81.3% for Chinese-to-English and 78.0% for English-to-Chinese translation.
It significantly outperforms state-of-the-art systems on real-world and hard datasets.
It remains robust on disfluent, informal speech.
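The VIP figures above are percentages. As a rough illustration of how a proportion-style human evaluation score could be tallied, the sketch below assumes that evaluators give each translated segment a binary valid/invalid judgment and that the score is the share of valid segments. Both the per-segment binary judgment and the function name `valid_information_proportion` are assumptions made for this example, not the paper's evaluation protocol, and the sample judgments are invented.

```python
from typing import Sequence


def valid_information_proportion(judgments: Sequence[bool]) -> float:
    """Return the percentage of segments that evaluators judged valid.

    Assumption: one boolean per segment, True meaning the segment's
    information was judged to be conveyed correctly.
    """
    if not judgments:
        raise ValueError("at least one judged segment is required")
    # True counts as 1 and False as 0, so sum() counts the valid segments.
    return 100.0 * sum(judgments) / len(judgments)


# Hypothetical judgments for eight segments; not data from the paper.
example = [True, True, False, True, True, True, False, True]
print(f"VIP = {valid_information_proportion(example):.1f}%")  # VIP = 75.0%
```

Under this reading, a score of 81.3% would mean that roughly four out of five judged segments were rated as valid.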
Abstract
In this paper, we present Cross Language Agent -- Simultaneous Interpretation, CLASI, a high-quality and human-like Simultaneous Speech Translation (SiST) System. Inspired by professional human interpreters, we utilize a novel data-driven read-write strategy to balance the translation quality and latency. To address the challenge of translating in-domain terminologies, CLASI employs a multi-modal retrieving module to obtain relevant information to augment the translation. Supported by LLMs, our approach can generate error-tolerated translation by considering the input audio, historical context, and retrieved information. Experimental results show that our system outperforms other systems by significant margins. Aligned with professional human interpreters, we evaluate CLASI with a better human evaluation metric, valid information proportion (VIP), which measures the amount of…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
