Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation
Chunqi Wang, Bo Xu

TL;DR
This paper introduces a convolutional neural network model with word embeddings for Chinese word segmentation, automatically capturing n-gram features and achieving state-of-the-art results without external resources.
Contribution
It proposes a CNN-based model that automatically captures n-gram features and effectively integrates word embeddings for improved Chinese word segmentation.
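As a rough illustration of how a convolution can pick up n-gram features from characters, the sketch below slides one width-k filter over a sequence of character vectors. It is a minimal pure-Python example, not the paper's architecture: the function names `conv1d_same` and `receptive_field`, the zero padding, and the ReLU activation are all assumptions made for illustration.

```python
def conv1d_same(seq, kernel, bias=0.0):
    """Slide one width-k filter over a sequence of d-dimensional vectors.

    seq:    list of n vectors (one per character), each a list of d floats
    kernel: list of k weight vectors, each a list of d floats (k is odd)
    Zero padding keeps the output length equal to n, so every character
    gets one activation computed from its k-character window.
    """
    n, k, d = len(seq), len(kernel), len(seq[0])
    pad = k // 2
    zero = [0.0] * d
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for i in range(n):
        window = padded[i:i + k]
        s = bias + sum(w * x
                       for wv, xv in zip(kernel, window)
                       for w, x in zip(wv, xv))
        out.append(max(0.0, s))  # ReLU: an illustrative choice only
    return out


def receptive_field(num_layers, k):
    """Number of characters one output position can see after stacking layers."""
    return 1 + num_layers * (k - 1)


if __name__ == "__main__":
    # Three characters with 1-dimensional "embeddings" and an all-ones trigram filter.
    print(conv1d_same([[1.0], [2.0], [3.0]], [[1.0], [1.0], [1.0]]))
    # Stacking width-3 layers widens the n-gram window: 3, 5, 7, ...
    print([receptive_field(layers, 3) for layers in (1, 2, 3)])
```

A single width-3 filter covers trigrams; stacking such layers widens the window, which is one way a convolutional model can reach longer n-grams without hand-written feature templates.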
Findings
Achieves 95.7% on PKU and 97.3% on MSR without feature engineering.
With word embeddings, reaches 96.5% on PKU and 98.0% on MSR.
Outperforms previous models on the PKU and MSR benchmark datasets.
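The summary does not name the metric behind these percentages. Word segmentation results are conventionally reported as word-level F-score, so the sketch below assumes that metric and shows how it can be computed from a gold and a predicted segmentation. The helper names `to_spans` and `prf` are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F-score for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


if __name__ == "__main__":
    gold = ["我", "爱", "北京", "天安门"]
    pred = ["我", "爱", "北", "京", "天安门"]  # "北京" wrongly split in two
    print(prf(gold, pred))
```

In this example three of the five predicted words match the four gold words, so precision is 3/5 and recall is 3/4.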
Abstract
The character-based sequence labeling framework is flexible and efficient for Chinese word segmentation (CWS). Recently, many character-based neural models have been applied to CWS. While they obtain good performance, they have two obvious weaknesses. The first is that they rely heavily on manually designed bigram features, i.e., they are not good at capturing n-gram features automatically. The second is that they make no use of full-word information. To address the first weakness, we propose a convolutional neural model that captures rich n-gram features without any feature engineering. To address the second, we propose an effective approach to integrating the proposed model with word embeddings. We evaluate the model on two benchmark datasets, PKU and MSR. Without any feature engineering, the model obtains competitive performance: 95.7% on PKU and 97.3% on MSR. Armed with word embeddings,…
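To make the character-based sequence labeling framing concrete, the sketch below converts a segmented sentence into one tag per character and back. It assumes the widely used BMES scheme (Begin, Middle, End, Single); the abstract does not state which tag set the paper uses, and the function names are hypothetical.

```python
def words_to_tags(words):
    """Map a segmented sentence to one BMES tag per character."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, anything between M
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted BMES tags."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(buf)
            buf = ""
    if buf:  # tolerate a tag sequence that ends mid-word
        words.append(buf)
    return words


if __name__ == "__main__":
    words = ["我", "爱", "北京", "天安门"]
    tags = words_to_tags(words)
    print(tags)
    print(tags_to_words("".join(words), tags))
```

Under this framing a segmenter only has to predict one tag per character, which is why a model that produces per-character features can be applied directly.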
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling
