Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models

Dongrui Han; Mingyu Cui; Jiawen Kang; Xixin Wu; Xunying Liu; Helen Meng

arXiv:2411.07563 · cs.AI · March 21, 2025

TL;DR

This paper introduces a G2P conversion approach that uses the in-context knowledge retrieval (ICKR) capabilities of Large Language Models to improve disambiguation, reducing the weighted average phoneme error rate on the Librig2p dataset by 2.0% absolute (28.9% relative) over the baseline.

Contribution

It proposes a new contextual G2P conversion method using LLMs' ICKR capabilities, demonstrating improved accuracy over baseline models on the Librig2p dataset.

Findings

01  The best ICKR-based G2P system reduces weighted average PER by 2.0% absolute (28.9% relative) over the baseline.

02  GPT-4 enhances PER reduction by 3.5% absolute.

03  Performance gains are demonstrated on the Librig2p dataset.

Abstract

Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it faces an ambiguity problem: the same grapheme can represent multiple phonemes depending on context, which poses a challenge for G2P conversion. Inspired by the remarkable success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems with LLMs' in-context knowledge retrieval (ICKR) capabilities are proposed to promote disambiguation capability. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated thoroughly on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline with weighted average phoneme error rate (PER) reductions of 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR…
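
To make the metric in the abstract concrete, here is a minimal sketch of how a phoneme error rate and its absolute and relative reductions are commonly computed. It assumes the standard edit-distance definition of PER; the abstract does not say how the paper weights its average, so no weighting is modeled. The function names, the ARPAbet-style phoneme symbols, and the sample sentence are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: Sequence[Sequence[str]],
                       hyps: Sequence[Sequence[str]]) -> float:
    """PER in percent: total edits divided by total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


def reductions(baseline_per: float, system_per: float) -> Tuple[float, float]:
    """Return (absolute, relative) PER reduction, both in percent."""
    absolute = baseline_per - system_per
    relative = 100.0 * absolute / baseline_per
    return absolute, relative


# Hypothetical example: "read" in "I read the book yesterday" is past tense,
# so the reference is R EH D; a context-blind system may output R IY D.
refs = [["R", "EH", "D"], ["B", "UH", "K"]]
hyps = [["R", "IY", "D"], ["B", "UH", "K"]]
per = phoneme_error_rate(refs, hyps)  # 1 substitution over 6 phonemes

# Hypothetical PER values (not from the paper) showing the two reduction types.
abs_red, rel_red = reductions(10.0, 8.0)
```

An absolute reduction is a difference in percentage points, while a relative reduction divides that difference by the baseline PER. The reported pair of 2.0% absolute and 28.9% relative therefore implies a baseline weighted average PER of roughly 2.0 / 0.289 ≈ 6.9%, a back-of-envelope figure derived from rounded numbers rather than one stated in the abstract.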

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection