Universal Word Segmentation: Implementation and Interpretation
Yan Shao, Christian Hardmeier, Joakim Nivre

TL;DR
This paper introduces a universal sequence tagging framework for word segmentation across diverse languages, analyzing typological factors influencing accuracy and achieving state-of-the-art results on the Universal Dependencies datasets.
Contribution
The paper presents an adaptable sequence tagging model for word segmentation, together with an analysis of how typological features affect segmentation performance. The model improves accuracy on languages that are non-trivial to segment.
Findings
Segmentation accuracy correlates positively with word boundary markers.
Accuracy decreases with the number of unique non-segmental terms.
The model achieves state-of-the-art accuracy on all the Universal Dependencies (UD) languages evaluated.
Abstract
Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese,…
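To illustrate what casting word segmentation as sequence tagging means, here is a minimal sketch that converts a segmented sentence into per-character boundary tags and decodes tags back into words. The B/I/E/S tagset (begin, inside, end, single-character word) and the function names `words_to_tags` and `tags_to_words` are assumptions made for illustration; the abstract does not specify the tagset. The sketch also assumes the input contains no whitespace characters, and it does not include the tagging model itself.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a list of words to one boundary tag per character (assumed BIES scheme)."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty strings rather than emit malformed tags
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, zero or more inner characters, last character
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted boundary tags."""
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:
        # tag sequence ended mid-word; keep the remaining characters as a word
        words.append(current)
    return words


if __name__ == "__main__":
    gold = ["我", "喜欢", "自然语言"]
    tags = words_to_tags(gold)
    print(tags)
    print(tags_to_words("".join(gold), tags))
```

Under this view, a segmenter only has to predict one tag per character, so the same tagger can in principle be trained on any language for which character sequences and gold word boundaries are available.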
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
