SEAL: Speech Embedding Alignment Learning for Speech Large Language Model with Retrieval-Augmented Generation

Chunyu Sun; Bingyu Liu; Zhichao Cui; Junhan Shi; Anbin Qi; Tian-hao Zhang; Dinghao Zhou; Lewei Lu

arXiv:2502.02603 · eess.AS · December 11, 2025

TL;DR

This paper introduces SEAL, a unified speech embedding framework for speech large language models that reduces latency and improves retrieval accuracy without relying on intermediate text representations.

Contribution

The paper presents a novel unified embedding approach with shared scaling for speech and text, enhancing retrieval efficiency and accuracy in speech LLMs over traditional two-stage methods.

Findings

01 Reduces pipeline latency by 50%

02 Achieves higher retrieval accuracy

03 Robust across diverse acoustic conditions

Abstract

Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language model (LLM) applications. However, for speech large language models (SLLMs), these methods are limited to a two-stage process in which automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates the need for intermediate text representations. Specifically, the framework includes separate speech and text encoders, followed by a shared scaling layer that maps both modalities into a common embedding space. Our model reduces pipeline latency by 50% while achieving higher retrieval accuracy compared to traditional two-stage methods. We also provide a…
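The architecture described in the abstract (separate speech and text encoders feeding one shared layer, then retrieval in the common space) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not specify the encoders or the form of the scaling layer, so `ToyEncoder`, `SharedScaling`, `rank_documents`, the dimensions, and the random weights are hypothetical stand-ins, and the shared layer is modelled here as a single linear map followed by L2 normalization.

```python
import math
import random

HIDDEN, SHARED = 5, 3  # illustrative sizes, not taken from the paper


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def random_weights(rows, cols, rng):
    """Random weights stand in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class ToyEncoder:
    """Hypothetical modality-specific encoder: input features -> hidden vector."""

    def __init__(self, dim_in, rng):
        self.weights = random_weights(HIDDEN, dim_in, rng)

    def __call__(self, features):
        return [math.tanh(h) for h in linear(self.weights, features)]


class SharedScaling:
    """Assumed form of the shared layer: the same weights serve both modalities."""

    def __init__(self, rng):
        self.weights = random_weights(SHARED, HIDDEN, rng)

    def __call__(self, hidden):
        return l2_normalize(linear(self.weights, hidden))


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


rng = random.Random(0)
speech_encoder = ToyEncoder(dim_in=8, rng=rng)  # speech features: 8-dim (made up)
text_encoder = ToyEncoder(dim_in=4, rng=rng)    # text features: 4-dim (made up)
shared = SharedScaling(rng)

speech_features = [rng.uniform(-1, 1) for _ in range(8)]
documents = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]

# The spoken query is embedded directly; no transcript is produced in between.
query_vec = shared(speech_encoder(speech_features))
doc_vecs = [shared(text_encoder(doc)) for doc in documents]
ranking = rank_documents(query_vec, doc_vecs)
print(ranking)
```

Because the weights are random, the ranking itself carries no meaning. The sketch only shows the data flow: a spoken query and text passages end up in the same space and are compared there, without an ASR step in between.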

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems