Phoneme-based speech recognition driven by large language models and sampling marginalization

Te Ma; Nanjie Li; Hao Huang; Zhijian Ou

arXiv:2512.18371·eess.AS·December 23, 2025

Open Access

TL;DR

This paper introduces Sampling-K Marginalized (SKM) training for phoneme-based speech recognition with large language models. Compared with the earlier Top-K Marginalized (TKM) approach, it improves training efficiency and recognition accuracy, and it is shown to be effective across multiple languages.

Contribution

The paper proposes the Sampling-K Marginalized (SKM) training strategy, replacing beam search with random sampling to improve marginalized modeling and training efficiency in LLM-P2G speech recognition.

Findings

01 SKM improves model convergence speed.

02 SKM enhances recognition performance.

03 SKM keeps model complexity unchanged.

Abstract

Recently, the Large Language Model-based Phoneme-to-Grapheme (LLM-P2G) method has shown excellent performance in speech recognition and has become a feasible replacement for traditional WFST decoding. The framework balances recognition accuracy and system scalability by modeling recognition in two stages: phoneme prediction, then text generation. However, existing LLM-P2G systems use the Top-K Marginalized (TKM) training strategy, whose candidate phoneme sequences are generated by beam search; this leads to insufficient path diversity, low training efficiency, and high resource overhead. To address this, the paper proposes a sampling marginalized training strategy, Sampling-K Marginalized (SKM), which replaces beam search with random sampling to generate candidate paths, improving marginalized modeling and training efficiency. Experiments were…
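The contrast between beam-search candidates and randomly sampled candidates can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the per-frame phoneme posteriors in `FRAME_PROBS` are treated as independent, `toy_log_p_text` is a hypothetical stand-in for the LLM-P2G score log p(text | phonemes), and the uniform average in `log_marginal_mc` is a plain Monte Carlo estimate rather than the paper's stated objective.

```python
import math
import random

# Assumed toy posteriors: 3 frames, 3 phoneme classes, frames independent.
FRAME_PROBS = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
]


def sample_k_paths(frame_probs, k, rng):
    """Draw k phoneme paths by sampling every frame from its posterior."""
    paths = []
    for _ in range(k):
        path = tuple(
            rng.choices(range(len(p)), weights=p, k=1)[0] for p in frame_probs
        )
        paths.append(path)
    return paths


def beam_top_k_paths(frame_probs, k):
    """Beam search: keep the k highest-scoring prefixes after each frame."""
    beams = [((), 0.0)]
    for p in frame_probs:
        expanded = [
            (path + (i,), score + math.log(q))
            for path, score in beams
            for i, q in enumerate(p)
            if q > 0
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:k]
    return [path for path, _ in beams]


def toy_log_p_text(path, reference=(0, 1, 2)):
    """Hypothetical P2G score: each mismatch with `reference` costs 1 nat."""
    mismatches = sum(a != b for a, b in zip(path, reference))
    return -1.0 * mismatches - 0.1


def log_marginal_mc(paths, log_p_text_given_path):
    """Log of the average of p(text | path) over the given paths (log-sum-exp)."""
    logs = [log_p_text_given_path(z) for z in paths]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(len(logs))


if __name__ == "__main__":
    rng = random.Random(0)
    sampled = sample_k_paths(FRAME_PROBS, 4, rng)
    top = beam_top_k_paths(FRAME_PROBS, 4)
    print("sampled paths:", sampled)
    print("beam paths:   ", top)
    print("sampled log-marginal estimate:", log_marginal_mc(sampled, toy_log_p_text))
```

In this sketch, beam search returns the same highest-scoring paths on every call, while sampling can return lower-probability paths and a different set on each call, which is one way to read the abstract's point about path diversity.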

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Face recognition and analysis