Neural Word Segmentation with Rich Pretraining
Jie Yang, Yue Zhang, Fei Dong

TL;DR
This paper studies how pretraining on rich external sources of information improves neural word segmentation. Pretraining significantly improves the model, yielding accuracies competitive with the best methods on six benchmarks.
Contribution
It introduces a modular neural segmentation model whose most important submodule is pretrained on rich external sources, of the kind statistical segmenters have exploited (punctuation, automatic segmentation, and POS).
Findings
Pretraining the key submodule on rich external sources significantly improves segmentation accuracy.
The model reaches accuracies competitive with the best methods on six benchmarks.
Abstract
Neural word segmentation research has benefited from large-scale raw texts by leveraging them for pretraining character and word embeddings. On the other hand, statistical segmentation research has exploited richer sources of external information, such as punctuation, automatic segmentation and POS. We investigate the effectiveness of a range of external training sources for neural word segmentation by building a modular segmentation model, pretraining the most important submodule using rich external sources. Results show that such pretraining significantly improves the model, leading to accuracies competitive with the best methods on six benchmarks.
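As an illustration of how one of the external sources named in the abstract, punctuation, could yield a pretraining signal, the sketch below derives weak word-boundary labels from raw text: a character that opens the text or directly follows punctuation must start a word, while every other character is left unlabeled. This is a minimal sketch under stated assumptions, not the authors' procedure; the function name `punctuation_boundary_labels`, the `PUNCT` set, and the "B"/"?" label scheme are all hypothetical.

```python
# Hypothetical punctuation inventory; a real setup would use a fuller list.
PUNCT = set("，。！？；：、,.!?;:")


def punctuation_boundary_labels(text):
    """Return (char, label) pairs for the non-punctuation characters of text.

    label is "B" when the character is known to begin a word because it is
    the first character or directly follows punctuation, and "?" when the
    punctuation gives no evidence either way.
    """
    pairs = []
    prev_is_punct = True  # the start of the text counts as a boundary
    for ch in text:
        if ch in PUNCT:
            prev_is_punct = True
            continue  # punctuation itself is not labeled
        pairs.append((ch, "B" if prev_is_punct else "?"))
        prev_is_punct = False
    return pairs


labels = punctuation_boundary_labels("我爱北京，天安门。")
```

Only the "B" positions would carry supervision in a scheme like this; the "?" positions could be masked out of the loss when pretraining a character-level submodule.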
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
