Is Word Segmentation Necessary for Deep Learning of Chinese Representations?
Xiaoya Li, Yuxian Meng, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li

TL;DR
This paper investigates whether Chinese word segmentation is necessary for deep learning-based NLP tasks. It finds that character-based models outperform word-based models, which suffer from data sparsity and out-of-vocabulary (OOV) words.
Contribution
The study provides empirical evidence that character-based models outperform word-based models in Chinese NLP, challenging the traditional reliance on word segmentation.
Findings
Char-based models outperform word-based models on four NLP tasks: language modeling, machine translation, sentence matching/paraphrase, and text classification.
Word-based models are more vulnerable to data sparsity and OOV issues.
Char-based models are less prone to overfitting than word-based models in Chinese NLP tasks.
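The OOV finding can be illustrated with a minimal sketch. It is not the paper's experimental setup: the hand-segmented sentences and the `to_chars` and `oov_rate` helpers are hypothetical and serve only to show how a word-level vocabulary can miss test tokens that a char-level vocabulary still partly covers.

```python
# Toy, hand-segmented sentences (hypothetical data, not from the paper).
train = [["我", "喜欢", "北京"], ["他", "喜欢", "上海"]]
test = [["我", "喜欢", "上海"], ["他", "去", "北京大学"]]


def to_chars(sentence):
    """Flatten a segmented sentence into a list of single characters."""
    return [ch for word in sentence for ch in word]


def oov_rate(train_sents, test_sents):
    """Fraction of test tokens that never appear in the training sentences."""
    vocab = {tok for sent in train_sents for tok in sent}
    test_tokens = [tok for sent in test_sents for tok in sent]
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Word-level view: tokens are the segmented words.
word_oov = oov_rate(train, test)

# Char-level view: the same sentences, split into characters.
char_oov = oov_rate([to_chars(s) for s in train], [to_chars(s) for s in test])

print(f"word-level OOV rate: {word_oov:.3f}")
print(f"char-level OOV rate: {char_oov:.3f}")
```

On this toy data, the word-level OOV rate is 2/6 (the words 去 and 北京大学 are unseen), while the char-level rate is 3/11, because the characters 北 and 京 already occur in training. Real corpora will give different numbers; the sketch only shows the direction of the effect.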
Abstract
Segmenting a chunk of text into words is usually the first step of processing Chinese text, but its necessity has rarely been explored. In this paper, we ask the fundamental question of whether Chinese word segmentation (CWS) is necessary for deep learning-based Chinese Natural Language Processing. We benchmark neural word-based models which rely on word segmentation against neural char-based models which do not involve word segmentation in four end-to-end NLP benchmark tasks: language modeling, machine translation, sentence matching/paraphrase and text classification. Through direct comparisons between these two types of models, we find that char-based models consistently outperform word-based models. Based on these observations, we conduct comprehensive experiments to study why word-based models underperform char-based models in these deep learning-based NLP tasks. We show that it is…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
