Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?
Xiang Zhang, Yann LeCun

TL;DR
This study empirically compares various encoding methods for text classification across Chinese, English, Japanese, and Korean, analyzing their effectiveness with different models and datasets to identify optimal encoding strategies.
Contribution
It provides a comprehensive empirical comparison of encoding techniques for multilingual text classification, highlighting the effectiveness of byte-level, word-level, and character-level encodings across models.
Findings
Byte-level one-hot encoding with UTF-8 is consistently competitive for convolutional networks.
Linear models on word-level n-grams are effective even without perfect word segmentation.
fastText performs best with character-level n-gram encoding but can overfit with rich features.
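To make two of the encodings named above concrete, here is a minimal Python sketch of byte-level one-hot encoding over UTF-8 and of character-level n-gram extraction. It is an illustration only, not the authors' implementation: the function names `utf8_one_hot` and `char_ngrams`, the fixed length `max_len`, the zero-padding scheme, and the n-gram range `n_max` are all assumptions, since this summary does not state the paper's actual settings.

```python
def utf8_one_hot(text, max_len=16):
    """One-hot encode the UTF-8 bytes of `text` as a max_len x 256 matrix.

    Assumption: sequences are truncated or zero-padded to `max_len` bytes.
    Truncation works on bytes, so it may cut a multi-byte character in half.
    """
    data = text.encode("utf-8")[:max_len]
    matrix = [[0] * 256 for _ in range(max_len)]
    for position, byte in enumerate(data):
        matrix[position][byte] = 1  # one active entry per byte position
    return matrix


def char_ngrams(text, n_max=2):
    """Return all character n-grams of length 1..n_max, shortest first."""
    return [
        text[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


if __name__ == "__main__":
    encoded = utf8_one_hot("a中", max_len=6)
    # Index of the active entry in each non-padding row (the raw byte values).
    print([row.index(1) for row in encoded if any(row)])
    print(char_ngrams("日本語", n_max=2))
```

The byte-level encoding needs no language-specific vocabulary or segmenter: every CJK character simply becomes a short run of bytes (three for most common characters in UTF-8), so the same 256-wide input works across all four languages.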
Abstract
This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText and convolutional networks. For convolutional networks, we compare between encoding mechanisms using character glyph images, one-hot (or one-of-n) encoding, and embedding. In total there are 473 models, using 14 large-scale text classification datasets in 4 languages including Chinese, English, Japanese and Korean. Some conclusions from these results include that byte-level one-hot encoding based on UTF-8 consistently produces competitive results for convolutional networks, that word-level n-grams linear models are…
Code & Models
- 🤗 uer/roberta-base-finetuned-chinanews-chinese (model) · 4.8k downloads · 75 likes
- 🤗 uer/roberta-base-finetuned-dianping-chinese (model) · 6.4k downloads · 72 likes
- 🤗 uer/roberta-base-finetuned-ifeng-chinese (model) · 38 downloads · 1 like
- 🤗 uer/roberta-base-finetuned-jd-binary-chinese (model) · 29k downloads · 42 likes
- 🤗 uer/roberta-base-finetuned-jd-full-chinese (model) · 271 downloads · 14 likes
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: fastText
