Reduce Meaningless Words for Joint Chinese Word Segmentation and Part-of-speech Tagging

Kaixu Zhang; Maosong Sun

arXiv:1305.5918 · cs.CL · May 28, 2013 · 1 cites

Open Access

TL;DR

This paper presents a new framework for joint Chinese word segmentation and part-of-speech tagging that significantly reduces meaningless word generation and improves overall accuracy by incorporating extensive lexical features.

Contribution

It introduces a feature-enhanced framework that uses large-scale lexical resources to reduce the meaningless words produced in joint Chinese word segmentation and part-of-speech tagging (S&T).

Findings

01 62.9% reduction in meaningless word generation compared with the baseline

02 F1 score for segmentation increased to 0.984

03 Effective use of large-scale lexical resources
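
For readers unfamiliar with how a segmentation F1 such as the one above is usually computed, the following is a minimal sketch. It assumes the common convention of comparing character spans of gold and predicted words; the paper's exact scoring procedure is not given on this page, and the function names and example sentence are illustrative only.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_f1(gold, pred):
    """Span-based F1 between a gold and a predicted segmentation.

    Assumption: a predicted word counts as correct only if both of its
    boundaries match a gold word exactly.
    """
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy example (not from the paper): the prediction wrongly splits one word.
gold = ["我们", "喜欢", "自然", "语言"]
pred = ["我们", "喜", "欢", "自然", "语言"]
f1 = segmentation_f1(gold, pred)
```

In this toy case three of the five predicted words match three of the four gold words, so precision is 0.6 and recall is 0.75.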

Abstract

Conventional statistics-based methods for joint Chinese word segmentation and part-of-speech tagging (S&T) can generalize to recognize new words that do not appear in the training data. An undesirable side effect is that a number of meaningless words will be incorrectly created. We propose an effective and efficient framework for S&T that introduces features to significantly reduce meaningless word generation. A general lexicon, Wikipedia and a large-scale raw corpus of 200 billion characters are used to generate word-based features for the wordhood. The word-lattice based framework consists of a character-based model and a word-based model in order to employ our word-based features. Experiments on Penn Chinese Treebank 5 show that this method has a 62.9% reduction of meaningless word generation in comparison with the baseline. As a result, the F1 measure for segmentation…
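
The word-lattice idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it covers segmentation only (no part-of-speech tags), replaces the character-based model with a neutral score of zero, and uses a single hypothetical wordhood feature (membership in a toy lexicon) with hand-picked weights. All names, weights, and data below are assumptions made for illustration.

```python
def build_lattice(sentence, max_len=4):
    """Enumerate every substring of up to max_len characters as a candidate
    word, represented as an edge (start, end) in the lattice."""
    n = len(sentence)
    return [(i, j) for i in range(n)
            for j in range(i + 1, min(i + max_len, n) + 1)]


def edge_score(word, lexicon):
    """Hypothetical word-based score: reward multi-character words found in
    the lexicon and penalize unseen ones, which may be meaningless words.
    Single characters score 0.0, standing in for a character-based model."""
    if len(word) == 1:
        return 0.0
    return 2.0 if word in lexicon else -1.0


def decode(sentence, lexicon, max_len=4):
    """Viterbi search for the highest-scoring path through the lattice."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[j]: best score of a path ending at j
    back = [0] * (n + 1)              # back[j]: start of the last word on that path
    best[0] = 0.0
    # Processing edges by end position guarantees best[i] is final before use.
    for i, j in sorted(build_lattice(sentence, max_len), key=lambda e: e[1]):
        score = best[i] + edge_score(sentence[i:j], lexicon)
        if score > best[j]:
            best[j], back[j] = score, i
    words, j = [], n
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]


# Toy lexicon and sentence, not taken from the paper.
LEXICON = {"我们", "喜欢", "自然", "语言"}
result = decode("我们喜欢自然语言", LEXICON)
```

With the toy lexicon the decoder selects the four lexicon words; with an empty lexicon every multi-character candidate is penalized, so it falls back to single characters. In a real system the scores would come from trained models rather than fixed constants.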

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques