StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
Shoutao Guo, Xiang Li, Mengge Liu, Wei Chen, Yang Feng

TL;DR
StreamUni introduces a unified large speech-language model that performs streaming speech translation by integrating segmentation, policy, and translation tasks, achieving state-of-the-art results without extensive task-specific training.
Contribution
The paper presents StreamUni, a novel unified model that combines multiple stages of streaming speech translation into a single framework using speech Chain-of-Thought, reducing reliance on segmentation models and extensive training.
Findings
Achieves state-of-the-art performance on StreamST benchmarks.
Effectively balances low latency with high translation quality.
Reduces need for task-specific policy training.
Abstract
Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni…
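To make the notion of a policy concrete, here is a minimal sketch of a fixed wait-k read/write schedule, the kind of generation policy one of the reviews below refers to. It is an illustration only and is not taken from the paper: the function name `wait_k_schedule`, the abstraction of speech into discrete chunks, and all parameters are assumptions introduced here.

```python
from typing import List, Tuple


def wait_k_schedule(
    num_source_chunks: int, num_target_tokens: int, k: int
) -> List[Tuple[str, int]]:
    """Return the READ/WRITE action sequence of a fixed wait-k policy.

    Hypothetical setup: the source speech arrives as `num_source_chunks`
    chunks and the translation has `num_target_tokens` tokens. The policy
    stays k chunks ahead of the output until the source is exhausted.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    actions: List[Tuple[str, int]] = []
    read, written = 0, 0
    while written < num_target_tokens:
        # READ while fewer than (written + k) chunks have been consumed
        # and there is still source audio left.
        if read < min(written + k, num_source_chunks):
            read += 1
            actions.append(("READ", read))
        else:
            # Otherwise emit the next target token.
            written += 1
            actions.append(("WRITE", written))
    return actions


if __name__ == "__main__":
    for action in wait_k_schedule(num_source_chunks=3, num_target_tokens=3, k=2):
        print(action)
```

A larger `k` delays the first output token (higher latency) but gives the model more source context before each write, which is the latency/quality trade-off the abstract describes.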
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. **Unified Speech Language Model for Multiple Tasks:** StreamUni effectively fine-tunes a single speech language model to perform three key actions (transcription, truncation, and streaming generation) within a unified framework. This design demonstrates strong empirical performance despite being trained on relatively limited data.
2. **Reinforcement of Data Mixing Importance:** Although the benefit of mixing offline and streaming data during training for streaming transla
1. **Flaws in Experimental Design:**
   - **Unfair baseline comparison:** The baselines (e.g., EDAtt for SimulST and StreamAtt for StreamST) rely on offline speech translation models trained with significantly weaker base architectures than Phi-4-Multimodal, which serves as the backbone of StreamUni. Since even their *offline translation quality* differs substantially, the comparison does not fairly isolate the advantage of the proposed method itself.
   - **Lack of cascade baseline:** The ent
* The proposed method is interesting.
* The empirical results are strong.
* The method is applicable to both streaming and simultaneous ST.
* Situate the work better; for example, https://arxiv.org/abs/2502.03382 is neither cited nor compared against.
* Given that SimulST is quite a practical application, add considerations on real-time deployment and clarify whether the evaluation takes computational latency into account.

Due to these weaknesses, the current overall rating is a more conservative 6, but the reviewer is looking forward to the authors' answers.
- A good streaming system that leverages a SOTA LLM for streaming speech translation.
- A truncation policy to limit the input audio length.
- The proposed CoT has been studied before, as discussed in the summary section.
- The wait-k based generation policy is suboptimal.
- The comparison in the experiments is unfair, and it is hard to figure out the main factor contributing to the good performance. See the Question section for more details.
- Forced alignment is required to train the model.
- Truncated speech input/text output, especially text output, leads to loss of context information. It actually degrades document-level translation to utt
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
