Is word segmentation necessary for Vietnamese sentiment classification?
Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen

TL;DR
This paper investigates whether word segmentation is essential for Vietnamese sentiment classification, comparing models with and without segmentation across different datasets and classifiers.
Contribution
It provides the first comprehensive analysis of the necessity of word segmentation in Vietnamese sentiment classification using pre-trained language models.
Findings
With traditional classifiers such as Naive Bayes or Support Vector Machines, word segmentation may not be necessary for sentiment classification on social-domain text.
Segmentation is beneficial when using byte-pair encoding (BPE) and deep learning models.
RDRsegmenter is identified as the most stable segmentation toolkit.
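To make the comparison in these findings concrete, the sketch below contrasts the two inputs a classifier might see: raw whitespace-separated syllables versus word-segmented tokens, where the syllables of a multi-syllable word are joined with an underscore (a common output convention for Vietnamese segmenters). It is only an illustration under stated assumptions: the tiny `LEXICON` and the greedy longest-match `segment` function are hypothetical stand-ins, not the algorithm used by RDRsegmenter, uitnlp, pyvi, or underthesea, and the example sentence is not taken from either corpus.

```python
import unicodedata
from collections import Counter


def nfc(text):
    """Normalize to NFC so accented syllables compare consistently."""
    return unicodedata.normalize("NFC", text)


# Hypothetical toy lexicon of multi-syllable words (illustrative only).
LEXICON = {
    tuple(nfc(word).split())
    for word in ("sản phẩm", "chất lượng", "tuyệt vời")
}
MAX_WORD_LEN = max(len(entry) for entry in LEXICON)


def syllable_tokens(text):
    """No segmentation: every whitespace-separated syllable is a token."""
    return nfc(text).lower().split()


def segment(text):
    """Greedy longest-match segmentation against the toy lexicon.

    Syllables of a matched word are joined with '_'; unmatched syllables
    are kept as single-syllable tokens.
    """
    syllables = syllable_tokens(text)
    words = []
    i = 0
    while i < len(syllables):
        longest = min(MAX_WORD_LEN, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in LEXICON:
                words.append("_".join(candidate))
                i += n
                break
    return words


sentence = "Sản phẩm chất lượng tuyệt vời"
unsegmented = syllable_tokens(sentence)
segmented = segment(sentence)

# Bag-of-words counts, as a Naive Bayes or SVM pipeline might consume them.
features_without_segmentation = Counter(unsegmented)
features_with_segmentation = Counter(segmented)

print(unsegmented)
print(segmented)
```

With segmentation, the feature vocabulary consists of whole words; without it, the classifier sees individual syllables and has to rely on their co-occurrence. Which representation works better is the empirical question the paper examines.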
Abstract
To the best of our knowledge, this paper made the first attempt to answer whether word segmentation is necessary for Vietnamese sentiment classification. To do this, we presented five pre-trained monolingual S4-based language models for Vietnamese, including one model without word segmentation and four models using the RDRsegmenter, uitnlp, pyvi, or underthesea toolkits in the data pre-processing phase. Based on comprehensive experimental results on two corpora, the VLSP2016-SA corpus of technical article reviews from the news and social media and the UIT-VSFC corpus of the educational survey, we have two suggestions. Firstly, with traditional classifiers like Naive Bayes or Support Vector Machines, word segmentation may not be necessary when the Vietnamese sentiment classification corpus comes from the social domain. Secondly, word segmentation is necessary for…
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Text and Document Classification Technologies
Methods: Byte Pair Encoding
