State-of-the-art Chinese Word Segmentation with Bi-LSTMs

Ji Ma; Kuzman Ganchev; David Weiss

arXiv:1808.06511 · cs.CL · August 27, 2018 · 6 cites

TL;DR

This paper demonstrates that a simple Bi-LSTM model, combined with standard deep learning practices, can outperform more complex architectures in Chinese word segmentation, highlighting the importance of data resources for further progress.

Contribution

It shows that a straightforward Bi-LSTM approach with best practices can surpass complex models in Chinese word segmentation accuracy.

Findings

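01 A bidirectional LSTM achieves better accuracy than more complex architectures on many popular datasets

02 Out-of-vocabulary words remain a significant challenge for neural-network models

03 Many remaining errors are unlikely to be fixed by architecture changes; further improvement requires better resources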
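To make the out-of-vocabulary finding concrete, here is a minimal sketch of how OOV rate and OOV recall are commonly computed for segmentation output. It is an illustration, not the paper's evaluation script: the helper names (`spans`, `oov_stats`), the toy vocabulary, and the example sentence are all assumptions made for this example.

```python
from typing import List, Set, Tuple


def spans(words: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character span of each word in a sentence."""
    out, start = [], 0
    for word in words:
        out.append((start, start + len(word)))
        start += len(word)
    return out


def oov_stats(train_vocab: Set[str], gold: List[str], pred: List[str]) -> Tuple[int, int]:
    """Count gold words unseen in training, and how many of them were recovered.

    A gold word counts as recovered only if the prediction contains a word
    with exactly the same character span.
    """
    pred_spans = set(spans(pred))
    oov_total, oov_hit = 0, 0
    for word, span in zip(gold, spans(gold)):
        if word not in train_vocab:
            oov_total += 1
            if span in pred_spans:
                oov_hit += 1
    return oov_total, oov_hit


# Hypothetical toy data: "自然语言" never appeared in the training vocabulary.
train_vocab = {"我", "喜欢", "处理"}
gold = ["我", "喜欢", "自然语言", "处理"]
pred = ["我", "喜欢", "自然", "语言", "处理"]  # the model split the unseen word

oov_total, oov_hit = oov_stats(train_vocab, gold, pred)
oov_rate = oov_total / len(gold)
```

In this toy case one of four gold words is out of vocabulary, and the prediction fails to recover it, which is the kind of error the finding refers to.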

Abstract

A wide variety of neural-network architectures have been proposed for the task of Chinese word segmentation. Surprisingly, we find that a bidirectional LSTM model, when combined with standard deep learning techniques and best practices, can achieve better accuracy on many of the popular datasets as compared to models based on more complex neural-network architectures. Furthermore, our error analysis shows that out-of-vocabulary words remain challenging for neural-network models, and many of the remaining errors are unlikely to be fixed through architecture changes. Instead, more effort should be made on exploring resources for further improvement.
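Neural segmenters such as the Bi-LSTM described above are usually trained as character-level taggers: each character receives a label marking its position in a word, and words are recovered from the predicted labels. The sketch below assumes the common B/M/E/S scheme (begin, middle, end, single); the abstract does not state which label set the authors use, and the function names are hypothetical.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens rather than emit bogus tags
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their (predicted) tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a malformed sequence that ends mid-word
        words.append(current)
    return words


# Hypothetical example sentence, already segmented.
gold = ["我", "喜欢", "自然语言", "处理"]
tags = words_to_tags(gold)
restored = tags_to_words("".join(gold), tags)
```

Under this framing, a tagger only has to emit one label per character, so the model architecture can vary freely while the conversion to and from words stays the same.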

Code & Models

Repositories

efeatikkan/Chinese_Word_Segmenter (TensorFlow)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Handwritten Text Recognition Techniques

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory