Character Feature Engineering for Japanese Word Segmentation
Mike Tian-Jian Jiang

TL;DR
This paper introduces a character feature engineering approach for Japanese word segmentation that leverages linguistic intuition to enable dynamic lexicon expansion while maintaining model stability, improving OOV handling.
Contribution
The work proposes a novel character feature scheme that allows for flexible lexicon updates without retraining, addressing limitations of traditional word-based methods.
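To make the idea of lexicon updates without retraining concrete, here is a minimal sketch. It is not the paper's feature scheme, which this summary does not spell out. The function `lexicon_features`, the B/I/E flag names, and the `max_len` parameter are all hypothetical. The only point illustrated is that features computed from a lexicon at inference time change when the lexicon changes, while any learned weights stay fixed.

```python
def lexicon_features(chars, lexicon, max_len=4):
    """Per-character flags derived from dictionary matches (illustrative only).

    For each character, record whether some lexicon entry found in the text
    begins at it (B), covers it strictly inside (I), or ends at it (E).
    """
    n = len(chars)
    feats = [{"B": False, "I": False, "E": False} for _ in range(n)]
    for i in range(n):
        for length in range(1, max_len + 1):
            j = i + length
            if j > n:
                break
            if chars[i:j] in lexicon:
                feats[i]["B"] = True
                feats[j - 1]["E"] = True
                for k in range(i + 1, j - 1):
                    feats[k]["I"] = True
    return feats


text = "東京都に行く"
lexicon = {"東京"}
before = lexicon_features(text, lexicon)

# Expanding the lexicon changes the features for the same text;
# no model parameters are involved in this step.
lexicon.add("東京都")
after = lexicon_features(text, lexicon)
```

In a setup like this, a tagger trained on such flags would see different inputs for 都 after the lexicon grows, which is one plausible way a fixed model could respond to new entries.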
Findings
Competitive F1 scores achieved across datasets
Improved OOV recall with lexicon expansion
Model remains stable during dynamic updates
Abstract
On word segmentation problems, machine learning architecture engineering often draws attention. The problem representation itself, however, has remained almost static for at least two decades, as either word lattice ranking or character sequence tagging. The latter often shows stronger predictive power than the former on out-of-vocabulary (OOV) issues. When the issue escalates to rapid adaptation, which is a common scenario for industrial applications, active learning of partial annotations or re-training with additional lexical resources is usually applied, though from a somewhat word-based perspective. Not only is it difficult for end-users to comply with linguistically consistent word boundary decisions, but the risk and cost of permanently forking models with estimated weights are also seldom affordable. To overcome the obstacle, this work provides an alternative, which uses linguistic…
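The character sequence tagging representation mentioned in the abstract can be illustrated with a short sketch. It assumes the common B/I/E/S tag set (begin, inside, end, single), which the abstract does not name, and the helper names `words_to_tags` and `tags_to_words` are hypothetical.

```python
def words_to_tags(words):
    """Map a segmented sentence to one B/I/E/S tag per character."""
    tags = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the segmentation from characters and their tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


words = ["東京", "に", "行く"]
tags = words_to_tags(words)
restored = tags_to_words("".join(words), tags)
```

A tagger predicts one label per character, so an unseen word can still be segmented correctly if its characters receive the right tags. This is the usual explanation for the OOV advantage the abstract refers to.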
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques
