Effective Neural Solution for Multi-Criteria Word Segmentation
Han He, Lei Wu, Hua Yan, Zhimin Gao, Yi Feng, George Townsend

TL;DR
This paper introduces a unified neural model for Chinese Word Segmentation that uses artificial tokens to specify criteria, achieving state-of-the-art results across multiple datasets without increasing model complexity.
Contribution
The paper proposes a novel approach that incorporates artificial tokens to handle multi-criteria segmentation in a single shared model, eliminating the need for private layers.
Findings
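Surpassed both single-criterion and multi-criteria state-of-the-art results on the Bakeoff 2005 and Bakeoff 2008 datasets.
Reported, to the authors' knowledge, as the first design to reach this level of performance on such large-scale datasets.
Parameter count stays minimal and constant, because every layer is shared across all datasets.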
Abstract
We present a simple yet elegant solution to train a single joint model on multi-criteria corpora for Chinese Word Segmentation (CWS). Our novel design requires no private layers in the model architecture; instead, it introduces two artificial tokens at the beginning and end of the input sentence to specify the required target criterion. The rest of the model, including the Long Short-Term Memory (LSTM) layer and the Conditional Random Field (CRF) layer, remains unchanged and is shared across all datasets, keeping the parameter set minimal and constant in size. On the Bakeoff 2005 and Bakeoff 2008 datasets, our design surpasses both single-criterion and multi-criteria state-of-the-art results. To the best of our knowledge, our design is the first to achieve state-of-the-art performance on such large-scale datasets. Source code and corpora for this paper are available on…
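The token mechanism described in the abstract can be illustrated with a minimal preprocessing sketch. It is not the authors' code: the `<name>` / `</name>` token spelling, the corpus names in `CRITERIA`, and the helper names `wrap_with_criterion`, `build_vocab`, and `strip_artificial` are all assumptions made for illustration. The point it shows is that the criterion is carried entirely by two extra input symbols, so one shared vocabulary (and one shared model) can serve every corpus.

```python
from typing import Dict, List

# Illustrative corpus names only; the summary above does not list them.
CRITERIA = ("pku", "msr")


def wrap_with_criterion(chars: List[str], criterion: str) -> List[str]:
    """Surround a character sequence with a pair of criterion tokens.

    The "<name>" / "</name>" spelling is a hypothetical choice.
    """
    if criterion not in CRITERIA:
        raise ValueError(f"unknown criterion: {criterion!r}")
    return [f"<{criterion}>", *chars, f"</{criterion}>"]


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Map every symbol, artificial tokens included, into one shared id space."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for symbol in seq:
            vocab.setdefault(symbol, len(vocab))
    return vocab


def strip_artificial(tags: List[str]) -> List[str]:
    """Drop the labels predicted for the two artificial boundary tokens."""
    return tags[1:-1]


sentence = list("我爱北京")
as_pku = wrap_with_criterion(sentence, "pku")
as_msr = wrap_with_criterion(sentence, "msr")

# Both variants share the same character ids; only the boundary tokens differ.
vocab = build_vocab([as_pku, as_msr])
```

Under these assumptions, supporting one more criterion only adds two entries to the shared vocabulary; no criterion-specific layer is introduced.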
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
