Combine CRF and MMSEG to Boost Chinese Word Segmentation in Social Media
Yao Yushi, Huang Zheng

TL;DR
This paper introduces a joint CRF and MMSEG algorithm with extended features and an Internet lexicon to improve Chinese word segmentation specifically for social media text, outperforming existing models.
Contribution
It presents a novel combination of CRF and MMSEG with extended features and lexicon integration tailored for social media Chinese text segmentation.
Findings
Outperforms state-of-the-art models on Sina Weibo data
Effective handling of colloquial and Internet terms in social media
Enhanced segmentation accuracy for social media Chinese text
Abstract
In this paper, we propose a joint algorithm for word segmentation on Chinese social media. Previous work mainly focuses on word segmentation for plain Chinese text. To develop a Chinese social media processing tool, we need to take the main features of social media into account: its grammatical structure is not rigorous, and the tendency to use colloquial and Internet terms prevents existing Chinese-processing tools from performing well on social media. In our approach, we combine the CRF and MMSEG algorithms and extend the features of the traditional CRF algorithm to train the model for word segmentation. We use an Internet lexicon to improve the performance of our model on Chinese social media. Our experimental results on Sina Weibo show that our approach outperforms the state-of-the-art model.
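One way to picture the combination described in the abstract is to run a lexicon-based matcher over the text and hand its output to the CRF as an extra per-character feature. The sketch below is only an illustration under stated assumptions: it uses simple forward maximum matching (the simplest MMSEG variant, not the full chunk-based rule set), and the function names, the BMES tag scheme, the feature keys, and the toy lexicon are all hypothetical. The summary does not specify the paper's actual feature templates.

```python
from typing import Dict, List, Set


def max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching (the 'simple' MMSEG variant)."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def bmes_tags(words: List[str]) -> List[str]:
    """Map each word to per-character tags: Begin, Middle, End, Single."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def char_features(text: str, lexicon: Set[str]) -> List[Dict[str, str]]:
    """Build one feature dict per character, including the matcher's tag.

    The feature keys are hypothetical; a real system would use a richer
    template (character n-grams, character types, and so on).
    """
    tags = bmes_tags(max_match(text, lexicon))
    features = []
    for i, ch in enumerate(text):
        features.append({
            "char": ch,
            "prev": text[i - 1] if i > 0 else "<BOS>",
            "next": text[i + 1] if i < len(text) - 1 else "<EOS>",
            "mmseg_tag": tags[i],  # lexicon-based hint for the CRF
        })
    return features


# Toy Internet lexicon (assumed entries, for illustration only).
internet_lexicon = {"微博", "给力"}
segmented = max_match("微博真给力", internet_lexicon)
features = char_features("微博真给力", internet_lexicon)
```

Feature dictionaries of this shape, one list per sentence, are the kind of input that sequence-labelling CRF toolkits typically accept. Adding entries to the lexicon changes the matcher's tags, which is how an Internet lexicon could influence the trained model without changing the CRF itself.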
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
Methods: Conditional Random Field
