Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition

Lukuang Dong; Ziwei Li; Saierdaer Yusuyin; Xianyu Zhao; Zhijian Ou

arXiv:2603.29217·eess.AS·April 1, 2026

TL;DR

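This paper improves multilingual LLM-based phoneme-to-grapheme (P2G) conversion for speech recognition by training the P2G model to be robust to speech-to-phoneme (S2P) uncertainty and by oversampling low-resource languages, reducing the average WER on the ten-language CV-Lang10 benchmark from 10.56% to 7.66%.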

Contribution

It examines robustness strategies for LLM-based P2G that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM), and combines them with low-resource oversampling for multilingual speech recognition.

Findings

01

Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.

02

S-SKM, a Monte Carlo approximation, avoids CTC-based S2P probability weighting in P2G training.

03

The study evaluates multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark.
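
For scale, the reported drop in average WER from 10.56% to 7.66% is an absolute reduction of 2.90 points, or roughly 27.5% relative to the baseline. The short Python check below reproduces that arithmetic; the function name `wer_reduction` is illustrative and does not come from the paper.

```python
def wer_reduction(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute, relative) WER reduction, both expressed in percent."""
    absolute = baseline - improved            # percentage points
    relative = 100.0 * absolute / baseline    # percent of the baseline WER
    return absolute, relative


# Average WER figures quoted in the listing above.
absolute, relative = wer_reduction(10.56, 7.66)
print(f"absolute: {absolute:.2f} points, relative: {relative:.1f}%")
```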

Abstract

Phoneme-based ASR factorizes recognition into speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G), enabling cross-lingual acoustic sharing while keeping language-specific orthography in a separate module. While large language models (LLMs) are promising for P2G, multilingual P2G remains challenging due to language-aware generation and severe cross-language data imbalance. We study multilingual LLM-based P2G on the ten-language CV-Lang10 benchmark. We examine robustness strategies that account for S2P uncertainty, including DANP and Simplified SKM (S-SKM). S-SKM is a Monte Carlo approximation that avoids CTC-based S2P probability weighting in P2G training. Robust training and low-resource oversampling reduce the average WER from 10.56% to 7.66%.
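
The Monte Carlo idea mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a tiny discrete set of phoneme hypotheses with made-up S2P posteriors and P2G likelihoods, and every name in it (`s2p_posterior`, `p2g_likelihood`, `weighted_marginal`, `sampled_marginal`) is hypothetical. It contrasts an explicitly weighted marginal, the sum over hypotheses of p(phonemes | speech) times p(text | phonemes), with an estimate that draws K hypotheses from the S2P distribution and averages the P2G likelihoods uniformly, so that no S2P probability appears as a weight in the objective.

```python
import random

# Toy S2P posterior over three phoneme hypotheses (made-up numbers).
s2p_posterior = {"k ae t": 0.6, "k ah t": 0.3, "g ae t": 0.1}
# Toy P2G likelihood of the target word given each hypothesis (made-up numbers).
p2g_likelihood = {"k ae t": 0.9, "k ah t": 0.4, "g ae t": 0.05}


def weighted_marginal(posterior, likelihood):
    """Marginal likelihood with explicit S2P probability weights."""
    return sum(posterior[h] * likelihood[h] for h in posterior)


def sampled_marginal(posterior, likelihood, k, rng):
    """Monte Carlo estimate: sample k hypotheses, then average uniformly."""
    hyps = list(posterior)
    weights = [posterior[h] for h in hyps]
    samples = rng.choices(hyps, weights=weights, k=k)
    # No S2P probability enters this average; the sampling itself accounts for it.
    return sum(likelihood[h] for h in samples) / k


exact = weighted_marginal(s2p_posterior, p2g_likelihood)
rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = sampled_marginal(s2p_posterior, p2g_likelihood, k=20000, rng=rng)
print(f"weighted marginal: {exact:.3f}")
print(f"sampled estimate:  {estimate:.3f}")
```

With enough samples the uniform average approaches the weighted sum, which is why sampling can stand in for explicit probability weighting. How S-SKM draws its samples and how many it uses are details of the paper that this toy example does not attempt to reproduce.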
