Binary Tree based Chinese Word Segmentation

Kaixu Zhang; Can Wang; Maosong Sun

arXiv:1305.3981·cs.CL·May 20, 2013

TL;DR

This paper introduces a binary tree based framework for Chinese word segmentation that targets the granularity mismatch problem. Using a more sophisticated tree pruning function on top of a conditional random field based baseline reduces errors by up to 20%, and the framework also supports quantitative error analysis.

Contribution

The paper proposes a binary tree based framework, consisting of a tree building step and a tree pruning step, to handle granularity mismatch in Chinese word segmentation.

Findings

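01 Error reduction of up to 20% over a state-of-the-art conditional random field based baseline when a more sophisticated tree pruning function is used.

02 Existing methods such as sequence tagging can be employed within the framework.

03 The framework provides quantitative methods for error analysis.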

Abstract

Chinese word segmentation is a fundamental task for Chinese language processing. The granularity mismatch problem is the main cause of segmentation errors. This paper shows that a binary tree representation can store outputs with different granularity, and designs a binary tree based framework to overcome the granularity mismatch problem. The framework has two steps, tree building and tree pruning; the tree pruning step is specially designed to focus on the granularity problem. Previous approaches to Chinese word segmentation, such as sequence tagging, can be easily employed in this framework. The framework also provides quantitative error analysis methods. Experiments show that using a more sophisticated tree pruning function with a state-of-the-art conditional random field based baseline reduces errors by up to 20%.
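To make the idea of one tree holding several granularities concrete, here is a minimal sketch. It is not the authors' implementation: the tree is built by hand rather than by a learned tree building step, and the names `Node`, `leaf`, `join`, `prune`, and `by_max_len` are hypothetical. The length-based pruning rule is only a stand-in for the pruning function described in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """A span of characters; internal nodes join two adjacent spans."""
    text: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(char: str) -> Node:
    return Node(char)


def join(left: Node, right: Node) -> Node:
    return Node(left.text + right.text, left, right)


def prune(node: Node, keep_whole: Callable[[Node], bool]) -> List[str]:
    """Cut the tree top-down: a kept node becomes one output word."""
    if node.left is None or node.right is None or keep_whole(node):
        return [node.text]
    return prune(node.left, keep_whole) + prune(node.right, keep_whole)


def by_max_len(limit: int) -> Callable[[Node], bool]:
    """Assumed stand-in rule: keep any span of at most `limit` characters."""
    return lambda node: len(node.text) <= limit


# Hand-built tree over the characters of one phrase (illustrative only).
zhonghua = join(leaf("中"), leaf("华"))
renmin = join(leaf("人"), leaf("民"))
gongheguo = join(join(leaf("共"), leaf("和")), leaf("国"))
root = join(join(zhonghua, renmin), gongheguo)

print(prune(root, by_max_len(7)))  # coarse: ['中华人民共和国']
print(prune(root, by_max_len(3)))  # finer:  ['中华', '人民', '共和国']
```

Every pruning of the same tree yields a valid segmentation of the same character sequence, so changing only the pruning rule changes the granularity of the output without rebuilding the tree.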

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Rough Sets and Fuzzy Logic