Building a Syllable Database to Solve the Problem of Khmer Word Segmentation

Nam Tran Van

arXiv:1703.02166 · cs.CL · March 8, 2017

TL;DR

This paper introduces a Khmer syllable database and a segmentation method that improves accuracy and handles ambiguity, advancing natural language processing for the Khmer language.

Contribution

The paper presents a new approach to Khmer word segmentation using a syllable database and component clustering, addressing previous limitations and ambiguity issues.

Findings

01. High segmentation accuracy achieved
02. Effective ambiguity elimination demonstrated
03. Database supports improved Khmer NLP applications

Abstract

Word segmentation is a basic problem in natural language processing. For languages with a complex writing system, such as the Khmer language spoken in southern Vietnam, this problem is especially intractable and poses significant challenges. Although experts in Vietnam and abroad have researched the problem in depth, no results so far adequately meet the demand; in particular, ambiguity in Khmer language processing has not been treated thoroughly. This paper presents a solution based on dividing syllables into component clusters using two proposed syllable models, thereby building a Khmer syllable database, which is not yet actually available. The method uses a lexical database, updated from online Khmer dictionaries and some supporting dictionaries, which serves as training data and complementary linguistic…
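As a rough illustration of what dividing Khmer text into component clusters can look like, the sketch below splits a string into orthographic clusters and counts them to form a tiny syllable table. It is not the paper's method: the two syllable models are not described in this excerpt, so the character classes here are a simplifying assumption based on the Unicode Khmer block, and the names `split_clusters` and `build_syllable_counts` are hypothetical.

```python
import re
from collections import Counter

# Assumed character classes from the Unicode Khmer block (U+1780-U+17FF).
# This is a simplification, not the syllable models proposed in the paper.
BASE = "\u1780-\u17B3"               # consonants and independent vowels
COENG = "\u17D2"                     # sign that turns the next consonant into a subscript
CONSONANT = "\u1780-\u17A2"          # consonants only
MARKS = "\u17B6-\u17D1\u17D3\u17DD"  # dependent vowels and diacritic signs

# One cluster: a base character followed by any number of
# subscript consonants (coeng + consonant) or dependent marks.
CLUSTER = re.compile(f"[{BASE}](?:{COENG}[{CONSONANT}]|[{MARKS}])*")


def split_clusters(text):
    """Return the orthographic clusters found in text, in order."""
    return CLUSTER.findall(text)


def build_syllable_counts(words):
    """Count how often each cluster occurs across a list of words."""
    counts = Counter()
    for word in words:
        counts.update(split_clusters(word))
    return counts


if __name__ == "__main__":
    khmer = "\u1781\u17D2\u1798\u17C2\u179A"  # the word "Khmer" in Khmer script
    print(split_clusters(khmer))
    print(build_syllable_counts([khmer, "\u179A"]))
```

A real syllable database would also need to merge clusters into full syllables (for example, attaching a final consonant to the preceding cluster) and to resolve ambiguous boundaries, which this sketch does not attempt.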

Code & Models

Repositories

buda-base/lucene-km

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Advanced Computational Techniques and Applications