An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification

Andre Rusli; Makoto Shishido

arXiv:2412.17361·cs.CL·December 24, 2024

TL;DR

This paper compares three Japanese tokenizers—MeCab, Sudachi, and SentencePiece—in sentiment classification tasks, finding that SentencePiece combined with TF-IDF and Logistic Regression yields the best results.

Contribution

It provides an empirical comparison of MeCab, Sudachi, and SentencePiece as preprocessing steps for Japanese sentiment classification, and shows that SentencePiece paired with TF-IDF features and Logistic Regression performs best among the tested combinations.

Findings

01

SentencePiece with TF-IDF and Logistic Regression achieves the best classification performance among the tested combinations.

02

Sudachi produces tokens closely aligned with dictionary definitions.

03

MeCab and SentencePiece process text faster than Sudachi.
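
A speed comparison of this kind can be sketched with a small timing harness. The following is a minimal illustration, not the authors' benchmark: `char_tokenizer` and `bigram_tokenizer` are hypothetical stand-ins for MeCab, Sudachi, and SentencePiece, and `benchmark` is an assumed helper name. Any callable that maps a string to a list of tokens could be plugged in instead.

```python
import time


def char_tokenizer(text):
    """Stand-in tokenizer (assumption): one token per character."""
    return list(text)


def bigram_tokenizer(text):
    """Stand-in tokenizer (assumption): overlapping character bigrams."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def benchmark(tokenizers, texts, repeats=100):
    """Time each tokenizer over `texts`, repeated `repeats` times.

    Returns a dict mapping tokenizer name to elapsed wall-clock seconds
    and the number of tokens produced in a single pass over `texts`.
    """
    results = {}
    for name, tokenize in tokenizers.items():
        n_tokens = 0
        start = time.perf_counter()
        for _ in range(repeats):
            for text in texts:
                n_tokens += len(tokenize(text))
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "tokens_per_pass": n_tokens // repeats,
        }
    return results


if __name__ == "__main__":
    # Illustrative sentence, not taken from the paper's dataset.
    sample_texts = ["この映画は最高でした"]
    candidates = {"char": char_tokenizer, "bigram": bigram_tokenizer}
    for name, stats in benchmark(candidates, sample_texts).items():
        print(name, stats)
```

Measured times depend on hardware, corpus size, and dictionary loading, so a harness like this only supports relative comparisons made on the same machine and data.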

Abstract

This study investigates the performance of three popular tokenization tools: MeCab, Sudachi, and SentencePiece, when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other alternatives in terms of classification performance.
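
The pipeline described in the abstract (tokenize, vectorize with TF-IDF, classify) can be sketched in pure Python as follows. This is a simplified illustration under several assumptions, not the authors' implementation: `bigram_tokenizer` is a hypothetical stand-in for MeCab, Sudachi, or SentencePiece; the TF-IDF uses a smoothed IDF without length normalization; only Multinomial Naive Bayes is shown because it fits in a few lines (Logistic Regression, the best performer in the paper, would need an iterative optimizer); and the four training sentences are invented for illustration rather than drawn from the paper's dataset.

```python
import math
from collections import Counter


def bigram_tokenizer(text):
    """Stand-in tokenizer (assumption): overlapping character bigrams."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def fit_idf(token_lists):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; tokens unseen during fitting are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def train_multinomial_nb(vectors, labels, vocab_size, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    model = {}
    for label in sorted(set(labels)):
        weights = Counter()
        n_docs = 0
        for vec, y in zip(vectors, labels):
            if y == label:
                n_docs += 1
                for t, w in vec.items():
                    weights[t] += w
        total = sum(weights.values()) + alpha * vocab_size
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "weights": weights,
            "log_total": math.log(total),
        }
    return model


def predict(model, vec, alpha=1.0):
    """Return the label with the highest log posterior score."""
    def score(params):
        s = params["log_prior"]
        for t, w in vec.items():
            smoothed = params["weights"].get(t, 0.0) + alpha
            s += w * (math.log(smoothed) - params["log_total"])
        return s
    return max(model, key=lambda label: score(model[label]))


def build_classifier(samples, tokenizer):
    """Fit IDF and Naive Bayes on (text, label) pairs; return a predictor."""
    token_lists = [tokenizer(text) for text, _ in samples]
    idf = fit_idf(token_lists)
    vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    labels = [label for _, label in samples]
    model = train_multinomial_nb(vectors, labels, len(idf))
    return lambda text: predict(model, tfidf_vector(tokenizer(text), idf))


# Toy training data (illustrative only).
TRAIN = [
    ("この映画は最高でした", "positive"),
    ("とても楽しい時間でした", "positive"),
    ("この映画は最悪でした", "negative"),
    ("とてもつまらない時間でした", "negative"),
]

if __name__ == "__main__":
    classify = build_classifier(TRAIN, bigram_tokenizer)
    print(classify("最高"))
    print(classify("最悪"))
```

Because the tokenizer is passed in as a plain callable, swapping tokenizers while holding the vectorizer and classifier fixed isolates the tokenizer's effect, which is the comparison the study sets out to make.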

Code & Models

Repositories

arusl/anlp_nlp2021_d3-1 (Official)

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Advanced Text Analysis Techniques

Methods: Byte Pair Encoding · SentencePiece · Logistic Regression