Word Segmentation as Graph Partition
Yuanhao Liu, Sheng Yu

TL;DR
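This paper introduces an unsupervised graph-based method for Chinese word segmentation: each sentence is modeled as an undirected graph of characters, and spectral partitioning groups the characters into words. The method is evaluated on Chinese electronic health records and on benchmark data from the Second International Chinese Word Segmentation Bakeoff.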
Contribution
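It frames Chinese word segmentation as graph partitioning, applying spectral partition algorithms to character graphs whose edge weights measure the connection strength between characters, and designs several unsupervised algorithms evaluated on Chinese electronic health records and the Second International Chinese Word Segmentation Bakeoff data.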
Findings
Effective segmentation on Chinese health records
Encouraging results on benchmark data from the Second International Chinese Word Segmentation Bakeoff
Unsupervised approach reduces reliance on labeled data
Abstract
We propose a new approach to the Chinese word segmentation problem that considers the sentence as an undirected graph, whose nodes are the characters. One can use various techniques to compute the edge weights that measure the connection strength between characters. Spectral graph partition algorithms are used to group the characters and achieve word segmentation. We follow the graph partition approach and design several unsupervised algorithms, and we show their encouraging segmentation results on two corpora: (1) electronic health records in Chinese, and (2) benchmark data from the Second International Chinese Word Segmentation Bakeoff.
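As a rough illustration of the idea in the abstract, the sketch below treats a sentence as a chain graph whose nodes are characters and whose edges join adjacent characters, then recursively bipartitions it with the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the graph Laplacian). This is not the authors' algorithm: the abstract does not specify the edge weights or the partition procedure, so the hand-picked weights, the `max_cut_weight` stopping threshold, and the function names `fiedler_cut` and `segment` are all assumptions made for this example.

```python
import numpy as np


def fiedler_cut(weights):
    """Return i such that the chain is cut between characters i and i+1.

    `weights[i]` is the (assumed, strictly positive) connection strength
    between adjacent characters i and i+1.
    """
    n = len(weights) + 1
    W = np.zeros((n, n))
    for i, w in enumerate(weights):
        W[i, i + 1] = W[i + 1, i] = w
    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order; column 1 is the
    # Fiedler vector when the graph is connected (all weights > 0).
    _, vecs = np.linalg.eigh(L)
    positive = vecs[:, 1] >= 0
    # On a chain graph the Fiedler vector changes sign once, so the
    # sign pattern yields a single contiguous cut.
    changes = np.flatnonzero(positive[1:] != positive[:-1])
    return int(changes[0])


def segment(chars, weights, max_cut_weight=1.0):
    """Recursively split `chars` at spectral cuts.

    Hypothetical stopping rule: stop when the edge selected by the
    Fiedler vector is stronger than `max_cut_weight`.
    """
    if len(chars) < 2:
        return [chars]
    k = fiedler_cut(weights)
    if weights[k] > max_cut_weight:
        return [chars]
    left = segment(chars[: k + 1], weights[:k], max_cut_weight)
    right = segment(chars[k + 1 :], weights[k + 1 :], max_cut_weight)
    return left + right


# Toy sentence with made-up weights: strong links inside each
# two-character word, a weak link between the words.
print(segment("患者发热", [5.0, 0.1, 5.0]))  # ['患者', '发热']
```

In practice the weights would come from corpus statistics rather than being set by hand, and edges between non-adjacent characters would require an extra step to keep each group contiguous.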
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Topic Modeling
