Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

Roman Koshkin; Jeon Haesung; Lianbo Liu; Hao Shi; Mengjie Zhao; Yusuke Fujita; Yui Sudo

arXiv:2603.11578·cs.CL·March 13, 2026

TL;DR

Hikari is an end-to-end, policy-free streaming translation model that uses probabilistic WAIT tokens and decoder time dilation to improve translation quality and latency trade-offs across multiple languages.

Contribution

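The paper introduces Hikari, an end-to-end model for simultaneous speech-to-text translation and streaming transcription. It replaces human-engineered heuristics and learned policies with a probabilistic WAIT token mechanism, and it adds two training-side techniques: Decoder Time Dilation and a supervised fine-tuning strategy that teaches the model to recover from delays.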

Findings

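01. Achieves new state-of-the-art BLEU scores on English-to-Japanese, English-to-German, and English-to-Russian.

02. Improves the quality-latency trade-off in streaming translation through supervised fine-tuning that trains the model to recover from delays.

03. Outperforms recent baselines in both low- and high-latency regimes.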

Abstract

Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
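
The abstract describes folding READ/WRITE decisions into the decoder's own output through a WAIT token. The toy loop below is only a sketch of that general idea, not the authors' implementation: the abstract does not specify the decoding procedure, so the token names (`<wait>`, `<eos>`), the function names `stream_decode` and `toy_next_token`, and the stand-in "model" are all hypothetical. In a real system the WAIT decision would come from the model's output distribution rather than from a fixed rule.

```python
from typing import Callable, List, Sequence

WAIT = "<wait>"  # hypothetical name for the token that requests more audio
EOS = "<eos>"    # hypothetical end-of-sequence token

NextToken = Callable[[Sequence[str], Sequence[str], bool], str]


def stream_decode(chunks: Sequence[str], next_token: NextToken,
                  max_steps: int = 1000) -> List[str]:
    """Interleave READ and WRITE actions based only on the decoder's output."""
    heard: List[str] = []    # source chunks consumed so far
    written: List[str] = []  # target tokens emitted so far
    pos = 0
    for _ in range(max_steps):
        source_done = pos >= len(chunks)
        token = next_token(heard, written, source_done)
        if token == WAIT:
            if source_done:
                break  # nothing left to read, so stop instead of looping
            heard.append(chunks[pos])  # READ: consume one more chunk
            pos += 1
        elif token == EOS:
            break
        else:
            written.append(token)  # WRITE: commit one target token
    return written


def toy_next_token(heard: Sequence[str], written: Sequence[str],
                   source_done: bool) -> str:
    """Stand-in for a model: 'translates' by upper-casing, one chunk behind."""
    lag = 1  # illustrative lag, in chunks
    i = len(written)
    if i < len(heard) - lag or (source_done and i < len(heard)):
        return heard[i].upper()
    return EOS if source_done else WAIT


print(stream_decode(["a", "b", "c"], toy_next_token))
```

Because the loop has no separate policy module, latency is controlled entirely by how often the decoder chooses WAIT; in this toy the `lag` constant plays that role.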

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications