Subword Encoding in Lattice LSTM for Chinese Word Segmentation
Jie Yang, Yue Zhang, Shuailong Liang

TL;DR
This paper introduces a lattice LSTM model for Chinese word segmentation that effectively utilizes subword information without external segmentors, achieving competitive results and providing insights into the contributions of lexicon and embeddings.
Contribution
It demonstrates that subword encoding in lattice LSTM performs comparably to word embeddings while removing the dependence on an external segmentor for Chinese word segmentation.
Findings
Subword encoding achieves similar performance to word embeddings.
Lattice LSTM with subword encoding outperforms previous models on benchmarks.
Lexicon information contributes more than pretrained embeddings.
Abstract
We investigate a lattice LSTM network for Chinese word segmentation (CWS) to utilize words or subwords. It integrates the character sequence features with information from all subsequences matched from a lexicon. The matched subsequences serve as information shortcut tunnels which link their start and end characters directly. Gated units are used to control the contribution of multiple input links. Through formula derivation and comparison, we show that the lattice LSTM is an extension of the standard LSTM with the ability to take multiple inputs. Previous lattice LSTM models take word embeddings as the lexicon input; we prove that subword encoding can give comparable performance and has the benefit of not relying on any external segmentor. The contribution of lattice LSTM comes from both lexicon and pretrained embedding information; we find that the lexicon information contributes more…
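The two mechanisms the abstract describes, lexicon matching and gated merging of multiple input links, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `match_lexicon` and `gated_merge`, the `max_len` cutoff, the toy lexicon, and the use of scalar states with a softmax over gate scores. It is not the paper's implementation or its exact gate equations.

```python
import math


def match_lexicon(chars, lexicon, max_len=4):
    """Return (start, end, subsequence) for every multi-character span
    of `chars` that appears in `lexicon`. `end` is inclusive, so each
    tuple names the two characters a shortcut link would connect."""
    edges = []
    n = len(chars)
    for start in range(n):
        # Only spans of length 2..max_len are considered (assumed cutoff).
        for end in range(start + 2, min(n, start + max_len) + 1):
            piece = chars[start:end]
            if piece in lexicon:
                edges.append((start, end - 1, piece))
    return edges


def gated_merge(candidate, shortcuts, gate_scores):
    """Mix the character-level candidate state with the states arriving
    over shortcut links. Gate scores are normalised with a softmax so the
    weights sum to 1 (an illustrative choice; states are scalars here,
    whereas a real model would use vectors)."""
    states = [candidate] + list(shortcuts)
    exps = [math.exp(s) for s in gate_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * s for w, s in zip(weights, states))


# Hypothetical toy lexicon and sentence, chosen only to show overlapping matches.
toy_lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
edges = match_lexicon("南京市长江大桥", toy_lexicon)
for start, end, piece in edges:
    print(start, end, piece)
```

Each printed tuple corresponds to one shortcut from a start character to an end character; where several links end at the same character, something like `gated_merge` would weigh them against the ordinary character-to-character input.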
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
