Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR

Ziwei Li; Lukuang Dong; Saierdaer Yusuyin; Xianyu Zhao; Zhijian Ou

arXiv:2604.09332 · eess.AS · April 13, 2026

TL;DR

This paper compares phoneme-based and projector-based speech-language interfaces for connecting pretrained speech encoders to LLMs in ASR. It also introduces a BPE-phoneme interface that yields further gains on LibriSpeech, while the phoneme-based interface substantially outperforms the vanilla projector on low-resource Tatar.

Contribution

It compares phoneme-based and vanilla projector-based interfaces using the same encoder and LLM backbones, in high-resource English and low-resource Tatar, and proposes a BPE-phoneme interface that groups frequent local phoneme patterns while preserving explicit word-boundary cues.

Findings

01

The phoneme-based interface is competitive with the vanilla projector-based interface on LibriSpeech.

02

The BPE-phoneme interface yields further gains on LibriSpeech.

03

Phoneme supervision improves performance in low-resource Tatar.

Abstract

Integrating pretrained speech encoders with large language models (LLMs) is promising for ASR, but performance and data efficiency depend on the speech-language interface. A common choice is a learned projector that maps encoder features into the LLM embedding space, whereas an alternative is to expose discrete phoneme sequences to the LLM. Using the same encoder and LLM backbones, we compare phoneme-based and vanilla projector-based interfaces in high-resource English and low-resource Tatar. We also propose a BPE-phoneme interface that groups frequent local phoneme patterns while preserving explicit word-boundary cues for phoneme-to-grapheme generation. On LibriSpeech, the phoneme-based interface is competitive with the vanilla projector, and the BPE-phoneme interface yields further gains. On Tatar, the phoneme-based interface substantially outperforms the vanilla projector. We further…
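To make the BPE-phoneme idea concrete, the following is a minimal sketch of byte-pair-style merging applied to phoneme sequences, with merges restricted to within-word pairs so that word boundaries remain explicit. It is not the authors' implementation: the boundary marker `|`, the `+` used to join merged phonemes, the ARPAbet-style toy corpus, and all function names are assumptions made for illustration only.

```python
from collections import Counter

WORD_BOUNDARY = "|"  # hypothetical boundary token, not taken from the paper


def merge_pair(word, pair):
    """Replace each adjacent occurrence of `pair` in `word` with one merged unit."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + "+" + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Learn merges from a corpus of utterances.

    Each utterance is a list of words and each word is a list of phonemes.
    Pairs are counted inside words only, so no merge can cross a boundary.
    """
    words = [list(word) for utterance in corpus for word in utterance]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for word in words:
            counts.update(zip(word, word[1:]))
        if not counts:
            break
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(counts), key=counts.get)
        merges.append(best)
        words = [merge_pair(word, best) for word in words]
    return merges


def encode(utterance, merges):
    """Apply learned merges word by word and insert explicit boundary tokens."""
    tokens = []
    for index, word in enumerate(utterance):
        units = list(word)
        for pair in merges:
            units = merge_pair(units, pair)
        if index > 0:
            tokens.append(WORD_BOUNDARY)
        tokens.extend(units)
    return tokens


# Toy corpus (assumed ARPAbet-style phonemes): "the cat", "the hat", "that".
corpus = [
    [["DH", "AH"], ["K", "AE", "T"]],
    [["DH", "AH"], ["HH", "AE", "T"]],
    [["DH", "AE", "T"]],
]
merges = learn_merges(corpus, num_merges=2)
encoded = encode([["DH", "AH"], ["K", "AE", "T"]], merges)
print(encoded)  # ['DH+AH', '|', 'K', 'AE+T']
```

Restricting pair counts to within-word spans is one simple way to keep boundary cues intact for phoneme-to-grapheme generation. The paper may handle boundaries differently.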
