Streaming Speech-to-Text Translation with a SpeechLLM
Titouan Parcollet, Shucong Zhang, Xianrui Zheng, Rogier C. van Dalen

TL;DR
This paper introduces a streaming speech-to-text translation system in which an LLM decides when it has heard enough audio to emit output, reaching translation quality close to non-streaming systems at a latency of 1-2 seconds.
Contribution
It presents an LLM-based architecture for low-latency streaming speech translation. Earlier SpeechLLM systems either wait for the complete utterance before translating or emit tokens at fixed intervals.
Findings
Achieves translation quality close to non-streaming baselines.
Operates with a latency of only 1-2 seconds.
Works effectively across different language pairs.
Abstract
Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close…
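The decision mechanism described in the abstract, where the model either emits a token or waits for more audio, can be illustrated with a toy decoding loop. This is only a sketch under stated assumptions, not the authors' implementation: the `<wait>` and `<eos>` control tokens, the chunk-based audio interface, and the names `stream_translate` and `toy_policy` are all hypothetical. The toy policy stands in for the trained LLM and simply looks up a hand-written alignment (`ready_at`) that says how many audio chunks must have arrived before each output token may be emitted.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

WAIT = "<wait>"  # hypothetical control token: "not enough audio yet"
EOS = "<eos>"    # hypothetical end-of-translation token

# A policy maps (chunks seen, tokens emitted so far, audio finished?) to the next token.
Policy = Callable[[int, Sequence[str], bool], str]


def stream_translate(
    audio_chunks: Iterable[object],
    next_token: Policy,
    max_tokens_per_step: int = 16,
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Consume audio chunk by chunk, emitting tokens whenever the policy allows."""
    emitted: List[str] = []
    trace: List[Tuple[int, str]] = []  # (chunks seen at emission time, token)
    seen = 0

    def step(finished: bool) -> bool:
        """Emit tokens until the policy waits; return True once EOS is produced."""
        for _ in range(max_tokens_per_step):
            token = next_token(seen, emitted, finished)
            if token == EOS:
                return True
            if token == WAIT:
                return False
            emitted.append(token)
            trace.append((seen, token))
        return False

    for _chunk in audio_chunks:
        seen += 1
        if step(finished=False):
            return emitted, trace
    # The audio has ended: the policy may no longer wait for more input.
    step(finished=True)
    return emitted, trace


def toy_policy(target: Sequence[str], ready_at: Sequence[int]) -> Policy:
    """Stand-in for the LLM: emit target[i] once ready_at[i] chunks have arrived."""

    def next_token(seen: int, emitted: Sequence[str], finished: bool) -> str:
        i = len(emitted)
        if i >= len(target):
            return EOS
        if finished or seen >= ready_at[i]:
            return target[i]
        return WAIT

    return next_token


policy = toy_policy(["guten", "Morgen", "zusammen"], ready_at=[2, 2, 4])
tokens, trace = stream_translate(range(5), policy)
print(tokens)
print(trace)
```

In this toy run the first two tokens are emitted after the second chunk and the third after the fourth chunk, so output appears before the audio ends. In a real system the policy would be the SpeechLLM itself, and the wait-or-emit behaviour would be learned from the automatic speech-text alignments mentioned in the abstract rather than read from a table.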
