Universal Word Segmentation: Implementation and Interpretation

Yan Shao; Christian Hardmeier; Joakim Nivre

arXiv:1807.02974·cs.CL·July 10, 2018

TL;DR

This paper introduces a universal sequence tagging framework for word segmentation across diverse languages, analyzing typological factors influencing accuracy and achieving state-of-the-art results on the Universal Dependencies datasets.

Contribution

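It presents a sequence tagging model for word segmentation that adapts to different languages through a small set of language-specific settings, together with an analysis of how typological factors affect segmentation accuracy. The model performs substantially better on languages that are non-trivial to segment, such as Chinese.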

Findings

01. Segmentation accuracy is positively related to word boundary markers.

02. Segmentation accuracy is negatively related to the number of unique non-segmental terms.

03. The model obtains state-of-the-art accuracies on all the Universal Dependencies (UD) languages.

Abstract

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese,…
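
To make the idea of treating word segmentation as sequence tagging concrete, here is a minimal sketch in Python. It assumes a common four-tag boundary scheme (B = beginning of a word, I = inside, E = end, S = single-character word); the tag set the paper actually uses may differ, and the function names words_to_tags and tags_to_words are hypothetical, chosen only for illustration. The sketch covers the encoding and decoding steps around a tagger, not the tagger itself.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Turn a gold segmentation into one boundary tag per character.

    Assumed tag scheme (an illustration, not necessarily the paper's):
    B = first character of a multi-character word, I = inside,
    E = last character, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and predicted boundary tags."""
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # flush a trailing word the tags left unfinished
        words.append(current)
    return words


example_words = ["我", "喜欢", "自然语言"]
example_tags = words_to_tags(example_words)
restored = tags_to_words("".join(example_words), example_tags)
```

A tagger trained on such tag sequences predicts one tag per character, and tags_to_words turns its predictions back into words. The decoder closes a word at every E or S tag and flushes any unfinished word at the end, which is one simple way to tolerate ill-formed tag sequences.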

Code & Models

Repositories

yanshao9798/segmenter (TensorFlow, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification