Khmer Spellchecking: A Holistic Approach

Marry Kong; Rina Buoy; Sovisal Chenda; Nguonly Taing

arXiv:2511.09812 · cs.CL · November 14, 2025


Open Access

TL;DR

This paper presents a Khmer spellchecking system that integrates subword segmentation, named-entity recognition (NER), grapheme-to-phoneme (G2P) conversion, and a language model, and reports a state-of-the-art Khmer spellchecking accuracy of 94.4%.

Contribution

The paper introduces a holistic approach to Khmer spellchecking that combines subword segmentation, NER, G2P conversion, and language modeling, and reports higher accuracy than existing methods.

Findings

01. Achieved 94.4% spellchecking accuracy
02. Developed benchmark datasets for Khmer NER and spellchecking
03. Integrated multiple linguistic modules for improved performance

Abstract

Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the…
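The pipeline described in the abstract (segment, skip named entities, flag out-of-lexicon tokens, propose candidates, rank them in context) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name below (`segment`, `toy_g2p`, `candidates`, `context_score`, `spellcheck`) and all of the data are hypothetical. English words stand in for Khmer. Whitespace splitting stands in for subword segmentation, a fixed set stands in for the NER model, a crude phonetic key stands in for G2P, and bigram counts stand in for the language model.

```python
"""Toy spellchecking pipeline: segment -> skip entities -> flag -> propose -> rank."""

# All data below is made up for illustration (English stand-ins, not Khmer).
LEXICON = {"the", "night", "knight", "rode", "at", "horse"}
NAMED_ENTITIES = {"sokha"}  # hypothetical NER output: tokens to leave alone
BIGRAM_COUNTS = {
    ("the", "knight"): 5,
    ("knight", "rode"): 4,
    ("at", "night"): 6,
    ("the", "night"): 2,
}


def segment(text):
    """Stand-in for a word/subword segmenter: lowercase and split on spaces."""
    return text.lower().split()


def toy_g2p(word):
    """Stand-in for G2P: a crude phonetic key so that homophones collide."""
    key = word[1:] if word.startswith("kn") else word
    key = key.replace("igh", "i").replace("y", "i")
    return key[:-1] if key.endswith("e") else key


def candidates(token):
    """Lexicon words that share the token's phonetic key."""
    target = toy_g2p(token)
    return [w for w in sorted(LEXICON) if toy_g2p(w) == target]


def context_score(prev_tok, cand, next_tok):
    """Stand-in for a language model: sum of the neighbouring bigram counts."""
    return (BIGRAM_COUNTS.get((prev_tok, cand), 0)
            + BIGRAM_COUNTS.get((cand, next_tok), 0))


def spellcheck(text):
    """Return (flagged token, ranked suggestions) pairs for one input string."""
    tokens = segment(text)
    report = []
    for i, tok in enumerate(tokens):
        # Named entities and in-lexicon tokens are not flagged.
        if tok in NAMED_ENTITIES or tok in LEXICON:
            continue
        prev_tok = tokens[i - 1] if i > 0 else None
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
        # Highest context score first; ties broken alphabetically.
        ranked = sorted(
            candidates(tok),
            key=lambda c: (-context_score(prev_tok, c, next_tok), c),
        )
        report.append((tok, ranked))
    return report


print(spellcheck("Sokha rode at nite"))  # [('nite', ['night', 'knight'])]
print(spellcheck("the nite rode"))       # [('nite', ['knight', 'night'])]
```

In this toy setup the same misspelling gets a different top suggestion depending on its neighbours, which is the role the abstract assigns to the language model. The entity set keeps the proper noun from being flagged at all.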

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling