Reduce Meaningless Words for Joint Chinese Word Segmentation and Part-of-speech Tagging
Kaixu Zhang, Maosong Sun

TL;DR
This paper presents a new framework for joint Chinese word segmentation and part-of-speech tagging that significantly reduces meaningless word generation and improves overall accuracy by incorporating extensive lexical features.
Contribution
It introduces a novel feature-enhanced framework utilizing large-scale lexical resources to minimize meaningless words in joint Chinese S&T tasks.
Findings
62.9% reduction in meaningless word generation
F1 score for segmentation increased to 0.984
Effective use of large-scale lexical resources
Abstract
Conventional statistics-based methods for joint Chinese word segmentation and part-of-speech tagging (S&T) have generalization ability to recognize new words that do not appear in the training data. An undesirable side effect is that a number of meaningless words will be incorrectly created. We propose an effective and efficient framework for S&T that introduces features to significantly reduce meaningless words generation. A general lexicon, Wikepedia and a large-scale raw corpus of 200 billion characters are used to generate word-based features for the wordhood. The word-lattice based framework consists of a character-based model and a word-based model in order to employ our word-based features. Experiments on Penn Chinese treebank 5 show that this method has a 62.9% reduction of meaningless word generation in comparison with the baseline. As a result, the F1 measure for segmentation…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques
