Lexically Grounded Subword Segmentation
Jindřich Libovický, Jindřich Helcl

TL;DR
This paper introduces three subword segmentation techniques grounded in lexical meaning: Morfessor-based morphological pre-tokenization, algebraically derived subword embeddings, and an efficient subword bigram model. Together they improve the linguistic plausibility of the segmentation and part-of-speech tagging performance.
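One way to picture the algebraic embedding step is as a least-squares problem: if each word vector is modeled as the sum of the vectors of the subwords it contains, subword vectors can be recovered from a word-by-subword membership matrix. The sketch below is only an illustration under that assumption. The function name, the toy vocabulary, and the additive model are hypothetical and are not taken from the paper.

```python
import numpy as np

def subword_embeddings(membership, word_vectors):
    """Solve membership @ S ~= word_vectors for S in the least-squares sense.

    membership:   (n_words, n_subwords) matrix, 1.0 where a subword occurs in a word.
    word_vectors: (n_words, dim) matrix of pretrained word embeddings.
    Returns S with shape (n_subwords, dim).
    """
    solution, *_ = np.linalg.lstsq(membership, word_vectors, rcond=None)
    return solution

# Toy setup (assumed for illustration): subwords are "walk", "ed", "talk".
membership = np.array([
    [1.0, 0.0, 0.0],  # walk   = walk
    [1.0, 1.0, 0.0],  # walked = walk + ed
    [0.0, 1.0, 1.0],  # talked = talk + ed
    [0.0, 0.0, 1.0],  # talk   = talk
])

# Pretend these are the "true" subword vectors, and build word vectors from them.
true_subword_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]])
word_vectors = membership @ true_subword_vectors

recovered = subword_embeddings(membership, word_vectors)
```

With exact additive data and a full-rank membership matrix, `recovered` matches `true_subword_vectors` up to floating-point error. Real word embeddings are not exactly additive, so the solution would only be an approximation.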
Contribution
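The paper makes three contributions: unsupervised morphological analysis with Morfessor as a pre-tokenization step, an algebraic method for grounding subword embeddings in a word embedding space together with a segmentation algorithm built on those embeddings, and a subword bigram model that can be initialized from the lexically aware segmentation so that neither Morfessor nor large embedding tables are needed at inference time.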
Findings
Improved segmentation precision on morpheme boundaries.
Enhanced Rényi efficiency across 8 languages.
Consistent POS tagging performance gains.
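For readers unfamiliar with the Rényi efficiency metric named in the findings, it is commonly defined as the Rényi entropy of the token unigram distribution divided by the logarithm of the vocabulary size. The sketch below follows that definition. The default order alpha = 2.5 and the use of observed token types as the vocabulary are assumptions made here, and the paper's exact evaluation setup may differ.

```python
import math
from collections import Counter

def renyi_efficiency(tokens, alpha=2.5):
    """Rényi entropy of the unigram distribution, normalized by log(vocabulary size).

    Assumption: the vocabulary is the set of token types observed in `tokens`.
    """
    counts = Counter(tokens)
    vocab_size = len(counts)
    if vocab_size < 2:
        return 0.0  # log(1) == 0, so the ratio is undefined; report 0 by convention
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)
```

A uniform token distribution yields an efficiency of 1, and the value drops as the distribution becomes more skewed toward a few frequent tokens.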
Abstract
We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space. Based on that, we design a novel subword segmentation algorithm that uses the embeddings, ensuring that the procedure considers lexical meaning. Third, we introduce an efficient segmentation algorithm based on a subword bigram model that can be initialized with the lexically aware segmentation method to avoid using Morfessor and large embedding tables at inference time. We evaluate the proposed approaches using two intrinsic metrics and measure their performance on two downstream tasks: part-of-speech tagging and machine translation. Our experiments show significant improvements in the morphological plausibility of…
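To make the third idea more concrete, segmentation under a subword bigram model can be framed as a best-path search: among all ways of splitting a word into vocabulary subwords, pick the one with the highest sum of bigram log-probabilities. The following is a minimal sketch of such a dynamic program. The function name, the `<w>` start symbol, the fixed penalty for unseen bigrams, and the toy scores are all assumptions for illustration and do not reproduce the paper's actual model or training procedure.

```python
def segment(word, vocab, bigram_logprob, bos="<w>", unseen=-20.0):
    """Return the highest-scoring split of `word` into subwords from `vocab`.

    bigram_logprob maps (previous_subword, subword) to a log-probability.
    Bigrams missing from the table receive the `unseen` penalty.
    Returns None if the word cannot be covered by the vocabulary.
    """
    # best[(end, last_subword)] = (score, previous_state)
    best = {(0, bos): (0.0, None)}
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece not in vocab:
                continue
            # Extend every partial segmentation that ends exactly at `start`.
            for (pos, prev), (score, _) in list(best.items()):
                if pos != start:
                    continue
                cand = score + bigram_logprob.get((prev, piece), unseen)
                key = (end, piece)
                if key not in best or cand > best[key][0]:
                    best[key] = (cand, (pos, prev))

    finals = [(score, key) for key, (score, _) in best.items() if key[0] == len(word)]
    if not finals:
        return None
    _, state = max(finals)

    # Follow back-pointers from the final state to the start symbol.
    pieces = []
    while best[state][1] is not None:
        pieces.append(state[1])
        state = best[state][1]
    return pieces[::-1]

# Toy vocabulary and scores, invented for this example.
vocab = {"un", "lock", "able", "unlock"} | set("unlockable")
bigram_logprob = {
    ("<w>", "un"): -1.0,
    ("un", "lock"): -0.5,
    ("lock", "able"): -0.7,
    ("<w>", "unlock"): -4.0,
    ("unlock", "able"): -1.0,
}
result = segment("unlockable", vocab, bigram_logprob)
```

Here `result` is `["un", "lock", "able"]`, because that path scores -2.2 while `unlock` + `able` scores -5.0 and every path through single characters pays the unseen penalty. Only a vocabulary and a bigram table are needed at decoding time, which is consistent with the abstract's point about avoiding Morfessor and embedding tables at inference.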
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Lexicography and Language Studies
Methods: Attentive Walk-Aggregating Graph Neural Network
