SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation
Chunyu Sun, Bingyu Liu, Zhichao Cui, Junhan Shi, Anbin Qi, Tian-hao Zhang, Dinghao Zhou, Lewei Lu

TL;DR
This paper introduces SEAL, a unified speech embedding framework for speech large language models that reduces latency and improves retrieval accuracy without relying on intermediate text representations.
Contribution
The paper presents a unified embedding approach in which separate speech and text encoders feed a shared scaling layer that maps both modalities into a common embedding space, improving retrieval efficiency and accuracy in speech LLMs over traditional two-stage methods.
Findings
Reduces pipeline latency by 50%
Achieves higher retrieval accuracy than traditional two-stage (ASR plus text retrieval) methods
Robust across diverse acoustic conditions
Abstract
Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language model (LLM) applications. However, when it comes to speech large language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates the need for intermediate text representations. Specifically, the framework includes separate speech and text encoders, followed by a shared scaling layer that maps both modalities into a common embedding space. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods. We also provide a…
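The architecture described in the abstract (separate speech and text encoders feeding one shared layer, with retrieval done directly in the common embedding space) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the encoders are stand-in random projections with mean pooling, the shared layer is assumed to be a linear projection followed by L2 normalization, and every name and dimension (`encode_speech`, `encode_text`, `shared_scaling`, `retrieve`, `EMBED_DIM`, and so on) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
SPEECH_FEAT_DIM = 80   # e.g. acoustic features per frame
TEXT_FEAT_DIM = 64     # e.g. token embedding size
HIDDEN_DIM = 32        # output size of each modality-specific encoder
EMBED_DIM = 16         # size of the common embedding space

# Stand-in weights; a real system would use trained encoders.
W_speech = rng.normal(size=(SPEECH_FEAT_DIM, HIDDEN_DIM))
W_text = rng.normal(size=(TEXT_FEAT_DIM, HIDDEN_DIM))
W_shared = rng.normal(size=(HIDDEN_DIM, EMBED_DIM))


def encode_speech(frames: np.ndarray) -> np.ndarray:
    """Hypothetical speech encoder: project each frame, then mean-pool over time."""
    return np.tanh(frames @ W_speech).mean(axis=0)


def encode_text(tokens: np.ndarray) -> np.ndarray:
    """Hypothetical text encoder: project each token, then mean-pool over tokens."""
    return np.tanh(tokens @ W_text).mean(axis=0)


def shared_scaling(hidden: np.ndarray) -> np.ndarray:
    """Assumed form of the shared layer: one projection used by both
    modalities, followed by L2 normalization."""
    projected = hidden @ W_shared
    return projected / np.linalg.norm(projected)


def retrieve(query: np.ndarray, docs: np.ndarray, top_k: int = 1):
    """Rank documents by cosine similarity (dot product of unit vectors)."""
    scores = docs @ query
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# A spoken query is embedded directly, with no ASR transcript in between.
speech_frames = rng.normal(size=(120, SPEECH_FEAT_DIM))
speech_query = shared_scaling(encode_speech(speech_frames))

# Text passages are embedded with the text encoder and the same shared layer.
passages = [rng.normal(size=(n, TEXT_FEAT_DIM)) for n in (10, 25, 40)]
doc_matrix = np.stack([shared_scaling(encode_text(p)) for p in passages])

indices, scores = retrieve(speech_query, doc_matrix, top_k=2)
print(indices, scores)
```

The point of the sketch is the data flow: because the query never passes through a transcript, there is no ASR stage to add latency or propagate recognition errors into retrieval. With untrained random weights the ranking itself is meaningless.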
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems
