Building a Syllable Database to Solve the Problem of Khmer Word Segmentation
Nam Tran Van

TL;DR
This paper introduces a novel Khmer syllable database and a segmentation method that improves accuracy and handles ambiguity, advancing natural language processing for the Khmer language.
Contribution
The paper presents a new approach to Khmer word segmentation using a syllable database and component clustering, addressing previous limitations and ambiguity issues.
Findings
High segmentation accuracy achieved
Effective ambiguity elimination demonstrated
Database supports improved Khmer NLP applications
Abstract
Word segmentation is a basic problem in natural language processing. For languages with complex writing systems, such as the Khmer language spoken in southern Vietnam, the problem is particularly intractable and poses significant challenges. Although researchers in Vietnam and abroad have studied this problem in depth, no results so far meet practical demands; in particular, none thoroughly handles ambiguity in Khmer language processing. This paper presents a solution based on dividing syllables into component clusters using two proposed syllable models, thereby building a Khmer syllable database, a resource that is not yet actually available. The method uses a lexical database, updated from online Khmer dictionaries and several supporting dictionaries, that serves as training data and complementary linguistic…
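As a rough illustration of what dividing Khmer text into component clusters can look like, the Python sketch below splits a string into orthographic clusters (a base character, any subscript consonants, and attached vowels or signs) using code point ranges from the Unicode Khmer block. It is not the paper's method: the two syllable models are not described in the abstract, an orthographic cluster is not always a full syllable, and the names split_clusters and components are hypothetical.

```python
import re

# Code point ranges from the Unicode Khmer block (U+1780..U+17FF).
BASE = "[\u1780-\u17B3]"                   # consonants and independent vowels
SUBSCRIPT = "(?:\u17D2[\u1780-\u17A2])"    # COENG sign followed by a consonant
MARK = "[\u17B6-\u17D1\u17D3\u17DD]"       # dependent vowels and diacritic signs

# One orthographic cluster, or any single other character as a fallback.
CLUSTER_RE = re.compile(f"{BASE}{SUBSCRIPT}*{MARK}*|.", re.DOTALL)


def split_clusters(text):
    """Split text into orthographic clusters (assumed simplification)."""
    return CLUSTER_RE.findall(text)


def components(cluster):
    """Break one cluster into its base, subscript consonants, and marks."""
    base, rest = cluster[0], cluster[1:]
    subscripts = re.findall("\u17D2([\u1780-\u17A2])", rest)
    marks = re.sub("\u17D2[\u1780-\u17A2]", "", rest)
    return {"base": base, "subscripts": subscripts, "marks": marks}


# The word for "Khmer": KHA, COENG, MO, vowel AE, RO.
word = "\u1781\u17D2\u1798\u17C2\u179A"
clusters = split_clusters(word)
first = components(clusters[0])
```

Here the word yields two clusters, the first carrying a subscript consonant and a dependent vowel, even though it is pronounced as a single syllable. A syllable model would still need rules for attaching final consonants, which is where a lexical database of attested syllables could help.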
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Advanced Computational Techniques and Applications
