Phoneme-based speech recognition driven by large language models and sampling marginalization
Te Ma, Nanjie Li, Hao Huang, Zhijian Ou

TL;DR
This paper introduces a sampling marginalized training strategy for phoneme-based speech recognition with large language models, enhancing training efficiency and recognition accuracy over previous methods, and demonstrating its effectiveness across multiple languages.
Contribution
The paper proposes the Sampling-K Marginalized (SKM) training strategy, replacing beam search with random sampling to improve marginalized modeling and training efficiency in LLM-P2G speech recognition.
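To illustrate the difference between beam-search candidates and sampled candidates, here is a minimal sketch. It assumes the phoneme predictor emits frame-level (CTC-style) log-posteriors with index 0 as the blank symbol. That assumption, along with the helper names `sample_k_paths` and `collapse`, is hypothetical and not taken from the paper.

```python
import math
import random


def sample_k_paths(log_probs, k, rng):
    """Draw k frame-level paths by sampling each frame's symbol
    from its posterior (assumed: frames are sampled independently)."""
    paths = []
    for _ in range(k):
        path = []
        for frame in log_probs:
            weights = [math.exp(lp) for lp in frame]
            symbol = rng.choices(range(len(frame)), weights=weights, k=1)[0]
            path.append(symbol)
        paths.append(path)
    return paths


def collapse(path, blank=0):
    """CTC-style collapse (assumed): merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return tuple(out)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy posteriors: 6 frames over {blank, phoneme 1, phoneme 2}.
    frame = [math.log(0.4), math.log(0.35), math.log(0.25)]
    toy_log_probs = [frame] * 6
    candidates = {collapse(p) for p in sample_k_paths(toy_log_probs, 8, rng)}
    print(candidates)
```

Because each draw is random, repeated calls can surface lower-probability phoneme sequences that a top-K beam would prune, which is the path-diversity argument made above.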
Findings
SKM improves model convergence speed.
SKM enhances recognition performance.
SKM does not increase model complexity.
Abstract
Recently, the Large Language Model-based Phoneme-to-Grapheme (LLM-P2G) method has shown excellent performance in speech recognition and has become a feasible replacement for traditional WFST decoding. The framework balances recognition accuracy and system scalability by modeling in two stages: phoneme prediction followed by text generation. However, the existing LLM-P2G uses the Top-K Marginalized (TKM) training strategy, whose candidate phoneme sequences are generated by beam search. This leads to insufficient path diversity, low training efficiency, and high resource overhead. To address this, the paper proposes a sampling marginalized training strategy (Sampling-K Marginalized, SKM), which replaces beam search with random sampling to generate candidate paths, improving marginalized modeling and training efficiency. Experiments were…
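The marginalization described in the abstract can be written schematically as follows. This is a sketch under stated assumptions: the symbols x (speech), z (phoneme sequence), and y (text), and the exact weighting of the K candidates, are illustrative and may differ from the paper's formulation.

```latex
% Two-stage model: marginalize over intermediate phoneme sequences z.
p(y \mid x) = \sum_{z} p(z \mid x)\, p(y \mid z)

% K-candidate approximation (assumed form) shared by both strategies:
\log p(y \mid x) \approx \log \sum_{k=1}^{K} p(z_k \mid x)\, p(y \mid z_k)

% TKM: \{z_k\}_{k=1}^{K} are the top-K hypotheses from beam search.
% SKM: z_k \sim p(z \mid x), drawn by random sampling.
```

Under this reading, the two strategies differ only in how the K candidates are obtained, not in the model being trained.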
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Face recognition and analysis
