Contextual morphologically-guided tokenization for Latin encoder models

Marisa Hudspeth; Patrick J. Burns; Brendan O'Connor

arXiv:2511.09709·cs.CL·November 14, 2025

TL;DR

This paper investigates morphologically-guided tokenization for Latin encoder models. It finds that tokenization informed by curated linguistic resources improves language model performance, especially on out-of-domain texts, and suggests the approach is promising for morphologically rich and low-resource languages.

Contribution

The paper introduces a morphologically-aware tokenization method for Latin that improves downstream task performance and out-of-domain generalization, and it highlights the value of curated linguistic resources in language modeling.

Findings

01 Overall performance improved on four downstream tasks

02 Gains are most pronounced for out-of-domain texts

03 Linguistic resources enhance modeling for morphologically complex languages

Abstract

Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out of domain texts,…
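To make the abstract's contrast between fertility and morphological alignment concrete, here is a minimal Python sketch. The segmentations, the function names, and the boundary-recall measure are hypothetical illustrations chosen for this example; they are not taken from the paper and do not represent its tokenizer or its evaluation.

```python
from itertools import accumulate


def fertility(segmented_words):
    """Average number of subword tokens per word."""
    if not segmented_words:
        raise ValueError("need at least one segmented word")
    return sum(len(pieces) for pieces in segmented_words) / len(segmented_words)


def internal_boundaries(pieces):
    """Character offsets of the cuts inside a word (the word end is excluded)."""
    offsets = list(accumulate(len(piece) for piece in pieces))
    return set(offsets[:-1])


def boundary_recall(pieces, gold_pieces):
    """Fraction of gold morpheme boundaries that the segmentation reproduces."""
    gold = internal_boundaries(gold_pieces)
    if not gold:
        return 1.0  # an unsegmented gold word has no boundaries to miss
    return len(gold & internal_boundaries(pieces)) / len(gold)


# Hypothetical segmentations of "puellarum amabantur" (assumed for illustration).
frequency_driven = [["pu", "ellar", "um"], ["amab", "antur"]]
morph_guided = [["puell", "arum"], ["ama", "ba", "ntur"]]
# Assumed gold analysis: puell-arum (stem + ending), ama-ba-ntur (stem + tense + ending).
gold = [["puell", "arum"], ["ama", "ba", "ntur"]]

for name, segmentation in [("frequency-driven", frequency_driven),
                           ("morphologically guided", morph_guided)]:
    recalls = [boundary_recall(p, g) for p, g in zip(segmentation, gold)]
    print(name, fertility(segmentation), sum(recalls) / len(recalls))
```

In this toy case both segmentations have the same fertility (2.5 tokens per word), yet only the second places its cuts at morpheme boundaries, which is the kind of difference a purely compression-oriented objective does not capture.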

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification