RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation

Jiaxuan Luo; Siqi Ouyang; Lei Li

arXiv:2601.22777 · cs.CL · February 2, 2026

TL;DR

RASST introduces a fast, integrated cross-modal retrieval method to enhance terminology accuracy and overall quality in simultaneous speech translation, addressing challenges with domain-specific terms.

Contribution

RASST presents a retrieval-augmented pipeline for SST that trains a lightweight speech-text retriever and synthesizes training data to improve terminology translation.

Findings

01. Terminology translation accuracy improved by up to 16%.

02. Overall translation quality increased by up to 3 BLEU points.

03. Effective integration of retrieval enhances SST performance.

Abstract

Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise…
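The sliding-window retrieval idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mean pooling of frames, the cosine-similarity scoring, and the fixed threshold used to decide whether a term is worth passing on are all assumptions made for illustration, and the two-dimensional vectors stand in for the outputs of a trained speech-text retriever.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Average a non-empty run of frame embeddings into one query vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def sliding_window_retrieve(
    frames: Sequence[Vector],
    glossary: Dict[str, Vector],
    window: int = 4,
    stride: int = 2,
    top_k: int = 1,
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Hypothetical retrieval over the speech received so far.

    Slides a fixed-size window over the frame embeddings, scores each
    window against every glossary term embedding, and keeps the best
    score per term if it clears the (assumed) threshold.
    """
    if not frames:
        return []
    hits: Dict[str, float] = {}
    last_start = max(len(frames) - window, 0)
    for start in range(0, last_start + 1, stride):
        query = mean_pool(frames[start:start + window])
        scored = sorted(
            ((cosine(query, emb), term) for term, emb in glossary.items()),
            reverse=True,
        )
        for score, term in scored[:top_k]:
            if score >= threshold:
                hits[term] = max(score, hits.get(term, 0.0))
    return sorted(hits.items(), key=lambda kv: -kv[1])


# Toy usage: two glossary terms and six incoming speech frames.
toy_glossary = {"transformer": [1.0, 0.0], "tokenizer": [0.0, 1.0]}
toy_frames = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
    [0.9, 0.1], [0.1, 0.9], [0.0, 1.0],
]
toy_hits = sliding_window_retrieve(toy_frames, toy_glossary, window=2, stride=2)
print(toy_hits)
```

In a streaming setting the function would be called again as new frames arrive, and the returned terms would be offered to the translation model as hints for the current chunk. The threshold here is only a crude stand-in for the model's own decision about whether and when to use a retrieved term.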

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems