Character Feature Engineering for Japanese Word Segmentation

Mike Tian-Jian Jiang

arXiv:1910.01761·cs.CL·October 7, 2019

TL;DR

This paper introduces a character feature engineering approach for Japanese word segmentation that leverages linguistic intuition to enable dynamic lexicon expansion while maintaining model stability, improving OOV handling.

Contribution

The work proposes a novel character feature scheme that allows for flexible lexicon updates without retraining, addressing limitations of traditional word-based methods.
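One way to picture lexicon updates without retraining is to compute per-character features from a mutable lexicon at inference time, so that new entries change the features a fixed model sees rather than its weights. The following is a minimal sketch of that general idea only; it is not the paper's actual feature scheme. The function name `lexicon_features`, the three feature names, and the example lexicon are all hypothetical.

```python
from typing import Dict, List, Set


def lexicon_features(
    chars: str, lexicon: Set[str], max_len: int = 8
) -> List[Dict[str, bool]]:
    """Mark, for each character, whether it begins, ends, or lies inside
    any lexicon entry found in the text (hypothetical feature names)."""
    feats = [
        {"begins_entry": False, "ends_entry": False, "inside_entry": False}
        for _ in chars
    ]
    for start in range(len(chars)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(chars):
                break
            if chars[start:end] in lexicon:
                feats[start]["begins_entry"] = True
                feats[end - 1]["ends_entry"] = True
                for i in range(start + 1, end - 1):
                    feats[i]["inside_entry"] = True
    return feats


# Illustrative lexicon and sentence (assumptions, not data from the paper).
lexicon = {"東京", "行く"}
text = "東京スカイツリーに行く"

before = lexicon_features(text, lexicon)
lexicon.add("スカイツリー")  # expand the lexicon; no model weights are touched
after = lexicon_features(text, lexicon)
```

In this sketch, only the feature values at the characters covered by the new entry change after the lexicon is expanded; a tagger consuming these features would keep its trained weights as they are.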

Findings

01. Competitive F1 scores achieved across datasets

02. Improved OOV recall with lexicon expansion

03. Model remains stable during dynamic updates

Abstract

In word segmentation, machine learning architecture engineering often draws the attention. The problem representation itself, however, has remained almost static for at least two decades, as either word lattice ranking or character sequence tagging. The latter often shows stronger predictive power than the former on the out-of-vocabulary (OOV) issue. When the issue escalates to rapid adaptation, a common scenario in industrial applications, active learning from partial annotations or re-training with additional lexical resources is usually applied, though still from a somewhat word-based perspective. Not only is it difficult for end-users to comply with linguistically consistent word boundary decisions, but the risk and cost of permanently forking models with estimated weights is also seldom affordable. To overcome this obstacle, this work provides an alternative, which uses linguistic…
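The character sequence tagging representation mentioned in the abstract can be illustrated with a small sketch. It assumes a two-tag scheme (`B` for a word-initial character, `I` for any other character), which is one common convention and not necessarily the tag set used in the paper; the helper names and the example sentence are hypothetical.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Encode a segmented sentence as one B/I tag per character."""
    tags: List[str] = []
    for word in words:
        tags.append("B")                     # first character of the word
        tags.extend("I" * (len(word) - 1))   # remaining characters, if any
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Decode per-character B/I tags back into a list of words."""
    words: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


# Illustrative example (an assumption, not data from the paper).
segmented = ["東京", "に", "行く"]
chars = "".join(segmented)
tags = words_to_tags(segmented)
restored = tags_to_words(chars, tags)
```

Because every character receives a tag regardless of whether its word is in a dictionary, a tagger over this representation can propose boundaries for unseen words, which is the usual explanation for its advantage on OOV items.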

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques