Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?

Xiang Zhang; Yann LeCun

arXiv:1708.02657·cs.CL·August 18, 2017·39 cites

TL;DR

This study empirically compares various encoding methods for text classification across Chinese, English, Japanese, and Korean, analyzing their effectiveness with different models and datasets to identify optimal encoding strategies.

Contribution

The paper provides a comprehensive empirical comparison of byte-level, character-level, and word-level encodings (including romanized variants) for multilingual text classification, evaluated with linear models, fastText, and convolutional networks.

Findings

01. Byte-level one-hot encoding with UTF-8 is consistently competitive for convolutional networks.

02. Word-level n-grams are effective even without perfect segmentation.

03. fastText performs best with character-level n-gram encoding but can overfit with rich features.
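
The word-level and character-level n-gram features mentioned in the findings can be illustrated with a short sketch. This is a generic illustration, not the paper's or fastText's implementation; the helper name `ngrams` and the `n_max` parameter are hypothetical, and whitespace splitting stands in for real word segmentation.

```python
def ngrams(tokens, n_max):
    """Collect all contiguous n-grams of length 1..n_max as tuples.

    `tokens` can be a list of characters (character-level features)
    or a list of words (word-level features).
    """
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(tuple(tokens[i:i + n]))
    return out


# Character-level: each Unicode character is one token.
char_grams = ngrams(list("東京都"), 2)

# Word-level: naive whitespace split (an assumption; CJK text
# would need a word segmenter instead).
word_grams = ngrams("the cat sat".split(), 2)

print(char_grams)
print(word_grams)
```

The feature set grows quickly with `n_max`, which is one plausible reading of why richer n-gram features can lead to overfitting.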

Abstract

This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText and convolutional networks. For convolutional networks, we compare between encoding mechanisms using character glyph images, one-hot (or one-of-n) encoding, and embedding. In total there are 473 models, using 14 large-scale text classification datasets in 4 languages including Chinese, English, Japanese and Korean. Some conclusions from these results include that byte-level one-hot encoding based on UTF-8 consistently produces competitive results for convolutional networks, that word-level n-grams linear models are…
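
As a rough illustration of what byte-level one-hot encoding based on UTF-8 means, the sketch below turns a string into one 256-dimensional indicator row per byte. The function name `utf8_one_hot`, the `max_len` parameter, and the choice of a 256-wide vector are assumptions made for this example; the paper's exact input format and padding scheme may differ.

```python
def utf8_one_hot(text, max_len=None):
    """Encode text as a list of 256-dim one-hot rows, one per UTF-8 byte.

    Assumption: plain truncation to `max_len` bytes, which may cut a
    multi-byte character in the middle.
    """
    data = text.encode("utf-8")
    if max_len is not None:
        data = data[:max_len]
    rows = []
    for b in data:
        row = [0] * 256
        row[b] = 1  # the byte value (0..255) selects the active position
        rows.append(row)
    return rows


sample = "文本 text"
encoded = utf8_one_hot(sample)

# CJK characters usually occupy 3 bytes each in UTF-8, so the byte
# sequence is longer than the character sequence.
print(len(sample), len(encoded))
```

Because the vocabulary is fixed at 256 byte values, the same encoder applies unchanged to Chinese, English, Japanese, and Korean text, at the cost of longer sequences for CJK input.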

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: fastText