Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models

Mahta Fetrat Qharabagh; Zahra Dehghanian; Hamid R. Rabiee

arXiv:2505.12973 · cs.CL · May 20, 2025

TL;DR

This paper introduces a semi-automated dataset creation pipeline for homograph disambiguation in G2P conversion, and advocates for rule-based methods informed by rich datasets to achieve fast, accurate disambiguation suitable for real-time applications.

Contribution

The paper presents the construction of the HomoRich dataset, uses this data to enhance a deep learning G2P system, and improves a rule-based system for real-time homograph disambiguation.

Findings

01. 30% improvement in disambiguation accuracy
02. Effective dataset generation pipeline
03. Enhanced rule-based G2P system for real-time use

Abstract

Homograph disambiguation remains a significant challenge in grapheme-to-phoneme (G2P) conversion, especially for low-resource languages. This challenge is twofold: (1) creating balanced and comprehensive homograph datasets is labor-intensive and costly, and (2) specific disambiguation strategies introduce additional latency, making them unsuitable for real-time applications such as screen readers and other accessibility tools. In this paper, we address both issues. First, we propose a semi-automated pipeline for constructing homograph-focused datasets, introduce the HomoRich dataset generated through this pipeline, and demonstrate its effectiveness by applying it to enhance a state-of-the-art deep learning-based G2P system for Persian. Second, we advocate for a paradigm shift: utilizing rich offline datasets to inform the development of fast, rule-based methods suitable for…
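To make the second idea in the abstract concrete, here is a minimal sketch of how an offline labeled dataset might feed a fast lookup-based disambiguator. It is not the authors' method: the toy English sentences, the ARPAbet-style labels, and the names `TRAIN`, `build_table`, and `disambiguate` are all assumptions made for illustration, and the scoring rule (summing context-word counts) is just one simple choice among many.

```python
from collections import Counter, defaultdict

# Toy labeled examples: (sentence, homograph, pronunciation label).
# Illustrative English data only; not drawn from HomoRich.
TRAIN = [
    ("the bass guitar was loud", "bass", "B EY S"),
    ("he plays bass in a band", "bass", "B EY S"),
    ("she caught a bass in the lake", "bass", "B AE S"),
    ("grilled bass from the lake", "bass", "B AE S"),
]


def build_table(examples):
    """Offline step: count the context words seen with each pronunciation."""
    table = defaultdict(lambda: defaultdict(Counter))
    for sentence, word, pron in examples:
        for tok in sentence.split():
            if tok != word:
                table[word][pron][tok] += 1
    return table


def disambiguate(table, sentence, word):
    """Online step: return the pronunciation whose stored context counts
    best match the sentence, or None if the word was never seen."""
    tokens = [t for t in sentence.split() if t != word]
    scores = {
        pron: sum(ctx[t] for t in tokens)  # Counter yields 0 for unseen words
        for pron, ctx in table[word].items()
    }
    if not scores:
        return None
    return max(scores, key=scores.get)


table = build_table(TRAIN)
print(disambiguate(table, "a bass swam in the lake", "bass"))
print(disambiguate(table, "loud bass guitar", "bass"))
```

The expensive work (collecting and labeling examples) happens once, offline; at inference time the decision is a handful of dictionary lookups, which is the kind of latency profile the abstract argues real-time tools need. A real system would at least filter function words and handle ties, which this sketch does not.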

Code & Models

Models

🤗 MahtaFetrat/Homo-GE2PE-Persian (model)

Taxonomy

Topics: ICT in Developing Communities · Natural Language Processing Techniques · Digital Accessibility for Disabilities