Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR
Ziwei Li, Lukuang Dong, Saierdaer Yusuyin, Xianyu Zhao, Zhijian Ou

TL;DR
This paper compares phoneme-based and projector-based speech-language interfaces for connecting pretrained speech encoders to LLMs in ASR. It also proposes a BPE-phoneme interface that yields further gains on LibriSpeech, and it finds that the phoneme-based interface substantially outperforms the vanilla projector on low-resource Tatar.
Contribution
Using the same encoder and LLM backbones, it compares phoneme-based and vanilla projector-based interfaces in high-resource English and low-resource Tatar, and proposes a BPE-phoneme interface that groups frequent local phoneme patterns while preserving explicit word-boundary cues.
Findings
The phoneme-based interface is competitive with the vanilla projector on LibriSpeech.
The BPE-phoneme interface yields further gains on LibriSpeech.
Phoneme supervision improves performance in low-resource Tatar.
Abstract
Integrating pretrained speech encoders with large language models (LLMs) is promising for ASR, but performance and data efficiency depend on the speech-language interface. A common choice is a learned projector that maps encoder features into the LLM embedding space, whereas an alternative is to expose discrete phoneme sequences to the LLM. Using the same encoder and LLM backbones, we compare phoneme-based and vanilla projector-based interfaces in high-resource English and low-resource Tatar. We also propose a BPE-phoneme interface that groups frequent local phoneme patterns while preserving explicit word-boundary cues for phoneme-to-grapheme generation. On LibriSpeech, the phoneme-based interface is competitive with the vanilla projector, and the BPE-phoneme interface yields further gains. On Tatar, the phoneme-based interface substantially outperforms the vanilla projector. We further…
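The BPE-phoneme idea described in the abstract (grouping frequent local phoneme patterns while keeping word boundaries explicit) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `|` boundary symbol, the `_` joiner for merged units, and the stopping rule are all assumptions made for illustration. The one property taken from the abstract is that merges are learned and applied within words only, so the boundary marker is never absorbed into a merged unit.

```python
from collections import Counter


def merge_pair(units, pair):
    """Replace each adjacent occurrence of `pair` in `units` with one joined unit."""
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            out.append(units[i] + "_" + units[i + 1])  # "_" joiner is an assumption
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


def learn_phoneme_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (each a phoneme list).

    Pairs are counted inside each word only, so no merge spans a word boundary.
    """
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # hypothetical stopping rule: ignore pairs seen only once
            break
        merges.append(best)
        corpus = [merge_pair(w, best) for w in corpus]
    return merges


def encode(utterance, merges, boundary="|"):
    """Encode an utterance (list of words) into BPE-phoneme tokens.

    The `boundary` token between words is kept explicit (assumed symbol).
    """
    tokens = []
    for k, word in enumerate(utterance):
        units = list(word)
        for pair in merges:  # apply merges in the order they were learned
            units = merge_pair(units, pair)
        if k > 0:
            tokens.append(boundary)
        tokens.extend(units)
    return tokens


# Toy lexicon with ARPAbet-style phonemes (illustrative data only).
train_words = [["DH", "AH"], ["K", "AE", "T"], ["DH", "AH"], ["K", "AE", "B"]]
merges = learn_phoneme_bpe(train_words, num_merges=2)
tokens = encode([["DH", "AH"], ["K", "AE", "T"]], merges)
```

In this toy run the two pairs that occur twice are merged, and the encoded sequence is shorter than the raw phoneme sequence while still carrying the word boundary as its own token, which is the cue the abstract says is preserved for phoneme-to-grapheme generation.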
