Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation

Nuo Xu; Ahrii Kim

arXiv:2602.04241·cs.CL·March 31, 2026

TL;DR

This study evaluates three subword tokenization methods (BPE, Overlap BPE, and Unigram Language Model) across six Uralic languages, showing that OBPE yields stronger morphological alignment and improves POS tagging accuracy and transferability in these low-resource, agglutinative languages.

Contribution

The paper systematically compares three subword paradigms (BPE, OBPE, and Unigram Language Model) across six Uralic languages, highlighting OBPE's advantages for morphological alignment and cross-lingual transfer.

Findings

1. OBPE outperforms BPE and Unigram in POS tagging accuracy.
2. OBPE reduces fragmentation in open-class categories.
3. Transfer efficacy depends on the tagging architecture and language proximity.

Abstract

Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and…
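
To make the notion of fragmentation concrete, the sketch below trains a toy BPE model on a handful of Finnish word forms and measures the average number of subword pieces per word (often called fertility). It is an illustration only: it implements plain BPE, not the OBPE variant, and the function names, the toy word counts, and the fertility measure are assumptions for this example rather than the paper's code or evaluation metric.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} dict (toy version)."""
    # Each word starts as a sequence of single characters.
    vocab = {tuple(word): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def fertility(words, merges):
    """Average number of subword pieces per word; lower means less fragmentation."""
    return sum(len(segment(w, merges)) for w in words) / len(words)


if __name__ == "__main__":
    # Hypothetical counts for inflected forms of Finnish "talo" (house).
    toy_counts = {"talo": 5, "talossa": 3, "talot": 2, "talossani": 1}
    toy_merges = learn_bpe(toy_counts, num_merges=6)
    for w in toy_counts:
        print(w, segment(w, toy_merges))
    print("fertility:", fertility(list(toy_counts), toy_merges))
```

Comparing fertility separately for open-class words (nouns, verbs, adjectives) and closed-class words would be one simple way to look at the kind of category-level fragmentation the abstract refers to, under the assumption that gold POS labels are available for the evaluation words.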
