TL;DR
This paper introduces Hierarchical Policy Optimization to improve simultaneous speech translation by balancing translation quality and latency, leveraging post-training on imperfect supervised data and hierarchical rewards.
Contribution
It proposes a novel hierarchical reward-based post-training method for SST that enhances translation quality and reduces latency without relying on high-quality supervised dialogue data.
Findings
Achieved over +7 COMET score improvement
Gained +1.25 MetricX score at 1.5s latency
Validated effectiveness through extensive ablation studies
Abstract
Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM's key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-train models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
