Khmer Spellchecking: A Holistic Approach
Marry Kong, Rina Buoy, Sovisal Chenda, Nguonly Taing

TL;DR
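This paper presents a Khmer spellchecking system that integrates subword segmentation, named-entity recognition (NER), grapheme-to-phoneme (G2P) conversion, and a language model. It reports a state-of-the-art spellchecking accuracy of 94.4% and targets challenges specific to Khmer, such as variant spellings and loosely formed compound words.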
Contribution
The paper introduces a holistic approach to Khmer spellchecking that combines subword segmentation, NER, G2P conversion, and language modeling, and reports higher accuracy than existing methods.
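To illustrate how such modules might fit together, here is a minimal sketch, not the authors' implementation. Every name and data value in it is hypothetical: the lexicon uses romanized placeholder tokens instead of Khmer script, whitespace splitting stands in for subword segmentation, a fixed set stands in for the NER model, a letter-collapsing function stands in for G2P, and unigram probabilities stand in for the language model.

```python
from difflib import SequenceMatcher

# Hypothetical toy data; a real system would use trained Khmer models.
LEXICON = {"sala": 0.5, "rien": 0.3, "sara": 0.2}  # word -> unigram probability
NAMED_ENTITIES = {"kong"}                          # stand-in for an NER model


def segment(text):
    """Stand-in for Khmer subword segmentation: split on whitespace."""
    return text.split()


def phonetic_key(word):
    """Stand-in for G2P: collapse letters assumed to sound alike."""
    return word.replace("r", "l")


def candidates(word):
    """Lexicon entries that share a phonetic key or are close in spelling."""
    found = []
    for entry in LEXICON:
        same_sound = phonetic_key(entry) == phonetic_key(word)
        similarity = SequenceMatcher(None, word, entry).ratio()
        if same_sound or similarity >= 0.75:
            found.append((entry, similarity))
    return found


def check(text):
    """Map each flagged token to suggestions ranked by similarity * probability."""
    report = {}
    for token in segment(text):
        if token in LEXICON or token in NAMED_ENTITIES:
            continue  # known word or proper noun: not flagged
        ranked = sorted(
            candidates(token),
            key=lambda pair: pair[1] * LEXICON[pair[0]],
            reverse=True,
        )
        report[token] = [entry for entry, _ in ranked]
    return report
```

Under these assumptions, `check("kong salla rien")` skips the proper noun and the known word and flags only `salla`, suggesting `sala`. The ranking step is where a real language model would replace the unigram probabilities.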
Findings
Achieved 94.4% spellchecking accuracy
Developed benchmark datasets for Khmer NER and spellchecking
Integrated multiple linguistic modules for improved performance
Abstract
Compared to English and other high-resource languages, spellchecking for Khmer remains an unresolved problem due to several challenges. First, there are misalignments between words in the lexicon and the word segmentation model. Second, a Khmer word can be written in different forms. Third, Khmer compound words are often loosely and easily formed, and these compound words are not always found in the lexicon. Fourth, some proper nouns may be flagged as misspellings due to the absence of a Khmer named-entity recognition (NER) model. Unfortunately, existing solutions do not adequately address these challenges. This paper proposes a holistic approach to the Khmer spellchecking problem by integrating Khmer subword segmentation, Khmer NER, Khmer grapheme-to-phoneme (G2P) conversion, and a Khmer language model to tackle these challenges, identify potential correction candidates, and rank the…
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling
