TL;DR
This study evaluates subword tokenization methods for Uralic languages and shows that Overlap BPE (OBPE) improves POS tagging accuracy and cross-lingual transfer in low-resource, agglutinative languages.
Contribution
The study systematically compares three subword paradigms (BPE, OBPE, and the Unigram Language Model) across six Uralic languages, highlighting OBPE's advantages for morphological alignment and cross-lingual transfer.
Findings
OBPE outperforms BPE and Unigram in POS tagging accuracy
OBPE reduces fragmentation in open-class categories
Transfer efficacy depends on tagging architecture and language proximity
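One way to make "fragmentation in open-class categories" concrete is to measure the average number of subword pieces per word, split by open versus closed POS class. The sketch below is illustrative only: `chunk_tokenizer` is a hypothetical stand-in for a trained BPE, OBPE, or Unigram tokenizer, the sample words are made up, and the paper's own fragmentation metric may be defined differently.

```python
from collections import defaultdict

# Open-class tags as defined by Universal Dependencies UPOS.
OPEN_CLASS = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}


def chunk_tokenizer(word, size=3):
    """Hypothetical tokenizer: split a word into fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def fertility_by_class(tagged_words, tokenize):
    """Return mean subword pieces per word for open and closed classes."""
    totals = defaultdict(lambda: [0, 0])  # class -> [pieces, words]
    for word, upos in tagged_words:
        key = "open" if upos in OPEN_CLASS else "closed"
        totals[key][0] += len(tokenize(word))
        totals[key][1] += 1
    return {key: pieces / words for key, (pieces, words) in totals.items()}


# Made-up Finnish sample: (word form, UPOS tag).
sample = [
    ("talossani", "NOUN"),
    ("ja", "CCONJ"),
    ("juoksen", "VERB"),
    ("on", "AUX"),
]

scores = fertility_by_class(sample, chunk_tokenizer)
print(scores)
```

Running the same function with different tokenizers on the same tagged corpus gives a like-for-like comparison: a lower open-class value means fewer pieces per content word.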
Abstract
Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and…
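For readers unfamiliar with the baseline, the following is a minimal sketch of standard BPE merge learning on a toy word-frequency table. It is not the paper's implementation and does not include OBPE's modification; the function name `learn_bpe` and the toy Finnish-like counts are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges BPE merges from a {word: frequency} table."""
    # Start with each word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy counts: a shared stem with different case and number endings.
toy_freqs = {"talo": 5, "talossa": 3, "talot": 2}
merges, vocab = learn_bpe(toy_freqs, num_merges=3)
print(merges)
print(vocab)
```

In this toy case the three merges assemble the shared stem before any suffix material, because the stem-internal pairs are the most frequent. Whether the pieces learned on real data align with morpheme boundaries is the kind of question the comparison in the abstract addresses.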