RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation
Jiaxuan Luo, Siqi Ouyang, Lei Li

TL;DR
RASST introduces a fast, integrated cross-modal retrieval method to enhance terminology accuracy and overall quality in simultaneous speech translation, addressing challenges with domain-specific terms.
Contribution
The paper presents a retrieval-augmented pipeline for SST that trains a lightweight speech-text retriever and synthesizes training data to improve terminology translation.
Findings
Terminology translation accuracy improved by up to 16%.
Overall translation quality increased by up to 3 BLEU points.
Tightly integrating cross-modal retrieval into the SST pipeline improves both terminology accuracy and overall translation quality.
Abstract
Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise…
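The sliding-window retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that speech frames and glossary terms have already been embedded into a shared vector space. Mean pooling and a fixed cosine threshold stand in for the trained retriever and for the model's learned decision about whether to use a retrieved term. All names here (`retrieve_terms`, `sliding_windows`, `window`, `stride`, `threshold`) and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict, Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Average frame embeddings into one query vector (an assumed pooling choice)."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def sliding_windows(n_frames: int, window: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover the frames received so far."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if n_frames <= 0:
        return
    start = 0
    while True:
        end = min(start + window, n_frames)
        yield start, end
        if end == n_frames:
            break
        start += stride


def retrieve_terms(
    frames: Sequence[Vector],
    glossary: Dict[str, Tuple[Vector, str]],
    window: int = 4,
    stride: int = 2,
    top_k: int = 2,
    threshold: float = 0.8,
) -> List[Tuple[str, str, float]]:
    """Return up to top_k (term, translation, score) hits for a speech chunk.

    Each glossary entry maps a source term to (text embedding, target translation).
    A term is kept only if some window scores at or above the threshold.
    """
    best: Dict[str, float] = {}
    for start, end in sliding_windows(len(frames), window, stride):
        query = mean_pool(frames[start:end])
        for term, (embedding, _translation) in glossary.items():
            score = cosine(query, embedding)
            if score >= threshold and score > best.get(term, -1.0):
                best[term] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return [(term, glossary[term][1], score) for term, score in ranked]


# Toy 2-D embeddings (hypothetical values, for illustration only).
toy_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_glossary = {
    "term_a": ([1.0, 0.0], "target_a"),
    "term_b": ([0.0, 1.0], "target_b"),
    "term_c": ([-1.0, -1.0], "target_c"),
}
hits = retrieve_terms(toy_frames, toy_glossary, window=2, stride=2)
print(hits)
```

In a streaming setting, a function like `retrieve_terms` would be called once per incoming speech chunk, and the returned term pairs would be passed to the translation model as hints. How RASST actually formats and consumes those hints is not specified in the truncated abstract above.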
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
