Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition

Mu Yang; Szu-Jui Chen; Jiamin Xie; John Hansen

arXiv:2506.05706 · eess.AS · November 20, 2025

TL;DR

This paper introduces a soft discretization technique, based on vector quantization with the LLM embedding table as the codebook, to bridge the gap between continuous audio representations and the discrete token inputs of LLMs. The authors report significant gains over an LLM-based ASR baseline, particularly in out-of-domain conditions.

Contribution

The paper proposes a novel soft discretization method that integrates vector quantization (VQ) into LLM-based automatic speech recognition (ASR) to improve recognition performance.

Findings

01. Significant improvement over the LLM-based ASR baseline, particularly in out-of-domain conditions

02. Effective alignment of continuous audio encoder representations with discrete LLM inputs

03. Demonstrates the potential of soft discretization as a bridge between the audio and text modalities

Abstract

One challenge of integrating speech input with large language models (LLMs) stems from the discrepancy between the continuous nature of audio data and the discrete token-based paradigm of LLMs. To mitigate this gap, we propose a method for integrating vector quantization (VQ) into LLM-based automatic speech recognition (ASR). Using the LLM embedding table as the VQ codebook, the VQ module aligns the continuous representations from the audio encoder with the discrete LLM inputs, enabling the LLM to operate on a discretized audio representation that better reflects the linguistic structure. We further create a soft "discretization" of the audio representation by updating the codebook and performing a weighted sum over the codebook embeddings. Empirical results demonstrate that our proposed method significantly improves upon the LLM-based ASR baseline, particularly in out-of-domain…
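The difference between hard VQ and the soft "discretization" described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `codebook` stands in for an LLM embedding table, and the choice of softmax over negative squared distances, the `temperature` parameter, and the function names are all assumptions made for illustration. Codebook updating during training is not shown.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_quantize(frame, codebook):
    """Standard VQ: return the index of the nearest codebook entry."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(frame, codebook[i]))

def soft_quantize(frame, codebook, temperature=1.0):
    """Soft variant: return (weights, weighted sum of codebook entries).

    Assumption: weights are a softmax over negative squared distances.
    The paper may use a different similarity or weighting scheme.
    """
    logits = [-squared_distance(frame, c) / temperature for c in codebook]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * c[d] for w, c in zip(weights, codebook))
             for d in range(len(frame))]
    return weights, mixed

# Hypothetical 3-entry, 2-dimensional "embedding table" and one audio frame.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frame = [0.9, 0.1]

print(hard_quantize(frame, codebook))   # index of the nearest entry
weights, mixed = soft_quantize(frame, codebook)
print(weights)                          # one weight per codebook entry
print(mixed)                            # vector passed on in place of a token embedding
```

In this sketch, lowering `temperature` concentrates the weights on the nearest entry, so the soft output approaches the hard VQ result; higher values blend more entries together.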

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing