TL;DR
This paper presents an SVM-based Vietnamese word segmentation method whose new features reduce overlap ambiguity and improve prediction of unknown words containing suffixes. It achieves a higher F1-score than the state-of-the-art UETsegmenter and RDRsegmenter without relying on the longest matching algorithm or post-processing.
Contribution
The paper introduces two novel feature extraction techniques for SVM-based Vietnamese word segmentation: one reduces overlap ambiguity, and the other improves prediction of unknown words containing suffixes.
Findings
Achieved a higher F1-score than UETsegmenter and RDRsegmenter on benchmark Vietnamese datasets
Proposed features reduce overlap ambiguity and improve prediction of unknown words containing suffixes
Method does not require longest matching or post-processing
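For readers unfamiliar with how a segmentation F1-score is typically computed, here is a minimal sketch. The summary does not describe the paper's evaluation procedure, so this assumes the common word-level convention: a predicted word counts as correct only if it covers exactly the same syllable span as a gold word. The function names `word_spans` and `segmentation_f1` are hypothetical.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map each word to its (start, end) syllable offsets.

    Assumes syllables inside a word are joined by '_' (e.g. "Hà_Nội").
    """
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        length = len(word.split("_"))
        spans.add((start, start + length))
        start += length
    return spans


def segmentation_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F1 under the exact-span-match assumption stated above."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["Hà_Nội", "là", "thủ_đô"]
pred = ["Hà_Nội", "là", "thủ", "đô"]
print(round(segmentation_f1(gold, pred), 4))
```

In this example two of the four predicted words match gold spans, giving precision 2/4 and recall 2/3.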
Abstract
In this paper, we approach Vietnamese word segmentation as a binary classification problem using a Support Vector Machine classifier. We inherit features from prior work, such as n-grams of syllables, n-grams of syllable types, and dictionary checks on the conjunction of adjacent syllables. We propose two novel feature extraction methods: one reduces overlap ambiguity, and the other improves the ability to predict unknown words containing suffixes. Unlike UETsegmenter and RDRsegmenter, two state-of-the-art Vietnamese word segmentation methods, we do not employ the longest matching algorithm as an initial processing step, nor any post-processing technique. According to experimental results on benchmark Vietnamese datasets, our proposed method obtains a better F1-score than the prior state-of-the-art methods UETsegmenter and RDRsegmenter.
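The binary-classification framing can be sketched as follows: each gap between two adjacent syllables receives one decision (same word or word boundary), and the decisions are then joined into words. This is an illustrative sketch only, not the authors' implementation. The feature names, the syllable-type categories, and the toy lexicon are assumptions, and the paper's two novel features are not reproduced because the abstract does not specify them. In a full system the feature dictionaries would be vectorized and passed to an SVM classifier; here the decisions are supplied by hand.

```python
from typing import Dict, List, Set


def syllable_type(syllable: str) -> str:
    """Coarse syllable type; these categories are illustrative assumptions."""
    if syllable.isdigit():
        return "NUM"
    if syllable.isalpha():
        if syllable.islower():
            return "LOWER"
        if syllable.isupper():
            return "UPPER"
        if syllable[0].isupper():
            return "CAP"
        return "MIXED"
    return "OTHER"


def gap_features(syls: List[str], i: int, lexicon: Set[str]) -> Dict[str, object]:
    """Features for the gap between syls[i] and syls[i + 1]."""
    left, right = syls[i].lower(), syls[i + 1].lower()
    return {
        "left": left,                      # syllable unigrams
        "right": right,
        "bigram": f"{left} {right}",       # syllable bigram
        "type_bigram": f"{syllable_type(syls[i])} {syllable_type(syls[i + 1])}",
        "pair_in_dict": f"{left} {right}" in lexicon,  # dictionary check
    }


def join_syllables(syls: List[str], same_word: List[bool]) -> List[str]:
    """Turn one binary decision per gap into words joined by '_'."""
    if not syls:
        return []
    if len(same_word) != len(syls) - 1:
        raise ValueError("need exactly one decision per gap")
    words = [syls[0]]
    for syllable, joined in zip(syls[1:], same_word):
        if joined:
            words[-1] += "_" + syllable
        else:
            words.append(syllable)
    return words


lexicon = {"hà nội", "thủ đô"}  # toy dictionary, hypothetical
syls = "Hà Nội là thủ đô".split()
features = [gap_features(syls, i, lexicon) for i in range(len(syls) - 1)]
# A trained classifier would predict these from `features`; set by hand here.
decisions = [True, False, False, True]
print(join_syllables(syls, decisions))
```

Because every gap is classified independently, this framing needs no longest-matching pass to propose initial word candidates, which matches the design choice the abstract describes.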
