LLM-based phoneme-to-grapheme for phoneme-based speech recognition

Te Ma; Min Bi; Saierdaer Yusuyin; Hao Huang; Zhijian Ou

arXiv:2506.04711·cs.SD·June 6, 2025

TL;DR

This paper introduces LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based speech recognition. It improves crosslingual ASR over WFST-based systems and counters information loss with two training strategies: data augmentation with noisy phonemes, and randomized top-K marginalized training and decoding.

Contribution

It proposes an LLM-based decoding framework for phoneme-to-grapheme conversion in speech recognition, together with training strategies that mitigate the information loss from cascading the speech-to-phoneme and phoneme-to-grapheme stages.

Findings

01

Outperforms WFST-based systems in Polish and German crosslingual ASR

02

Achieves relative WER reductions of 3.6% (Polish) and 6.9% (German)

03

Demonstrates effectiveness of LLM-based decoding with proposed training strategies
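
The relative WER reductions quoted above follow the usual definition: the drop in WER divided by the baseline WER. A minimal sketch of that arithmetic is given below. The function name and the numeric WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT taken from the paper:
# a baseline at 10.0 and a new system at 9.64.
reduction = relative_wer_reduction(10.0, 9.64)
print(f"{reduction:.1%}")
```

A relative reduction is distinct from an absolute one: in the hypothetical case above, the absolute drop is 0.36 WER points, while the relative reduction is about 3.6%.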

Abstract

In automatic speech recognition (ASR), phoneme-based multilingual pre-training and crosslingual fine-tuning is attractive for its high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). Therefore, we propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G). A challenge is that there seems to be information loss in cascading S2P and P2G. To address this challenge, we propose two training strategies: data augmentation with noisy phonemes (DANP), and randomized top-$K$ marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and…
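
The abstract does not spell out how top-$K$ marginalization is computed, so the following is only a generic sketch of the idea, not the authors' formulation. It assumes that the S2P stage yields K candidate phoneme sequences with log-scores, that the P2G stage scores the target text given each candidate, and that the text likelihood is approximated by summing over the candidates. The function names, the renormalization of S2P scores over the K candidates, and the omission of the randomized candidate selection are all assumptions made here.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def marginal_log_likelihood(
    s2p_log_scores: Sequence[float],
    p2g_log_probs: Sequence[float],
) -> float:
    """Approximate log p(text | speech) over K phoneme hypotheses.

    s2p_log_scores[k]: log-score of phoneme hypothesis k from the S2P stage.
    p2g_log_probs[k]:  log p(text | hypothesis k) from the P2G stage.
    """
    if len(s2p_log_scores) != len(p2g_log_probs) or not s2p_log_scores:
        raise ValueError("need one P2G score per S2P hypothesis, K >= 1")
    # Assumption: renormalize S2P scores so the K candidates sum to 1.
    norm = logsumexp(s2p_log_scores)
    joint = [(s - norm) + g for s, g in zip(s2p_log_scores, p2g_log_probs)]
    return logsumexp(joint)


# Two equally likely phoneme hypotheses; the text is better explained by the second.
ll = marginal_log_likelihood(
    [math.log(0.5), math.log(0.5)],
    [math.log(0.2), math.log(0.4)],
)
print(round(math.exp(ll), 3))
```

Working in log space with `logsumexp` avoids underflow when sequence probabilities are very small, which is the usual reason to write the sum this way.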

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Emotion and Mood Recognition