An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification
Andre Rusli, Makoto Shishido

TL;DR
This paper compares three Japanese tokenizers (MeCab, Sudachi, and SentencePiece) as a preprocessing step for sentiment classification, and finds that SentencePiece combined with TF-IDF and Logistic Regression gives the best classification performance.
Contribution
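It provides an empirical evaluation of tokenizer choice in Japanese sentiment classification, using TF-IDF features with Multinomial Naive Bayes and Logistic Regression classifiers, and shows that SentencePiece paired with Logistic Regression is the most effective combination tested.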
Findings
SentencePiece with TF-IDF and Logistic Regression achieves the best classification performance among the combinations tested.
Sudachi produces tokens closely aligned with dictionary definitions.
MeCab and SentencePiece process text faster than Sudachi.
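A processing-speed comparison like the one above can be sketched with a small timing harness. This is only an illustration, not the paper's benchmark: `char_tokenizer` and `whitespace_tokenizer` are hypothetical stand-ins, the sample sentences are invented, and in practice each stand-in would be replaced by a callable wrapping MeCab, Sudachi, or SentencePiece.

```python
import time


def whitespace_tokenizer(text):
    """Stand-in tokenizer (assumption): split on whitespace."""
    return text.split()


def char_tokenizer(text):
    """Stand-in tokenizer (assumption): one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def time_tokenizer(tokenize, texts, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    return best


# Invented sample corpus, repeated so the timings are measurable.
texts = ["この映画は最高でした", "とてもつまらない時間でした"] * 500

candidates = [("chars", char_tokenizer), ("whitespace", whitespace_tokenizer)]
timings = {name: time_tokenizer(fn, texts) for name, fn in candidates}

# Report the fastest tokenizer first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e3:.2f} ms for {len(texts)} texts")
```

Taking the best of several passes reduces noise from other processes; for a real comparison, model or dictionary loading time would likely need to be measured separately from per-text tokenization.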
Abstract
This study investigates the performance of three popular tokenization tools: MeCab, Sudachi, and SentencePiece, when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.
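The pipeline described in the abstract (tokenize, vectorize with TF-IDF, then classify) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: `char_bigrams` is a hypothetical stand-in for a real tokenizer such as MeCab, Sudachi, or SentencePiece, the four training sentences are invented toy data, the smoothed IDF formula is one common variant, and the classifier shown is Multinomial Naive Bayes, one of the two classifiers the study evaluates.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Stand-in tokenizer (assumption): overlapping character bigrams."""
    text = text.replace(" ", "")
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


def fit_idf(token_lists):
    """Smoothed inverse document frequency for every training term."""
    n_docs = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    """TF-IDF weights for one document; terms unseen in training are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


class MultinomialNB:
    """Multinomial Naive Bayes over (fractional) TF-IDF weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing

    def fit(self, vectors, labels, vocab):
        self.classes = sorted(set(labels))
        n = len(labels)
        self.log_prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        totals = {c: defaultdict(float) for c in self.classes}
        for vec, label in zip(vectors, labels):
            for term, weight in vec.items():
                totals[label][term] += weight
        self.log_lik = {}
        for c in self.classes:
            denom = sum(totals[c].values()) + self.alpha * len(vocab)
            self.log_lik[c] = {
                t: math.log((totals[c][t] + self.alpha) / denom) for t in vocab
            }
        return self

    def predict(self, vec):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_lik[c][t] for t, w in vec.items()
            )
        return max(self.classes, key=score)


# Invented toy data for illustration only.
train = [
    ("この映画は最高でした", "pos"),
    ("とても楽しい時間でした", "pos"),
    ("この映画は最悪でした", "neg"),
    ("とてもつまらない時間でした", "neg"),
]
token_lists = [char_bigrams(text) for text, _ in train]
labels = [label for _, label in train]
idf = fit_idf(token_lists)
model = MultinomialNB().fit([tfidf(t, idf) for t in token_lists], labels, set(idf))


def classify(text, tokenize=char_bigrams):
    """Tokenize, vectorize, and predict a sentiment label for `text`."""
    return model.predict(tfidf(tokenize(text), idf))


print(classify("楽しい映画"), classify("つまらない映画"))
```

Because the tokenizer is passed in as a plain callable, comparing tokenizers amounts to rebuilding `idf` and `model` with a different `tokenize` function while keeping the vectorizer and classifier fixed, which mirrors the experimental design the abstract describes.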
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Advanced Text Analysis Techniques
Methods: Byte Pair Encoding · SentencePiece · Logistic Regression
