Contextual morphologically-guided tokenization for Latin encoder models
Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor

TL;DR
This paper explores morphologically-guided tokenization for Latin. It shows that leveraging curated linguistic resources improves encoder model performance, especially on out-of-domain texts, and suggests the approach is promising for other morphologically rich and low-resource languages.
Contribution
The paper introduces a morphologically-aware tokenization method for Latin that improves downstream task performance and generalization, and it emphasizes the value of linguistic resources in language modeling.
Findings
Morphologically-guided tokenization improved overall performance on four downstream tasks
Gains were most pronounced on out-of-domain texts
Linguistic resources enhance modeling for morphologically complex languages
Abstract
Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than linguistic goals like morphological alignment. In fact, they have been shown to be suboptimal for morphologically rich languages, where tokenization quality directly impacts downstream performance. In this work, we investigate morphologically-aware tokenization for Latin, a morphologically rich language that is medium-resource in terms of pretraining data, but high-resource in terms of curated lexical resources -- a distinction that is often overlooked but critical in discussions of low-resource language modeling. We find that morphologically-guided tokenization improves overall performance on four downstream tasks. Performance gains are most pronounced for out of domain texts,…
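The contrast the abstract draws between fertility and morphological alignment can be made concrete with a small sketch. The segmentations below are hand-written, hypothetical examples, not output from the paper's tokenizers, and `boundary_agreement` is an illustrative measure chosen for this sketch rather than the metric the authors report. Fertility is taken here as the average number of subword tokens per word.

```python
def fertility(segmentations):
    """Average number of subword tokens per word."""
    total_tokens = sum(len(tokens) for tokens in segmentations.values())
    return total_tokens / len(segmentations)


def boundary_set(tokens):
    """Character offsets of the internal boundaries between tokens."""
    offsets, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def boundary_agreement(predicted, gold):
    """Share of gold morpheme boundaries that the tokenizer reproduces."""
    hit = total = 0
    for word, gold_tokens in gold.items():
        gold_bounds = boundary_set(gold_tokens)
        pred_bounds = boundary_set(predicted[word])
        hit += len(gold_bounds & pred_bounds)
        total += len(gold_bounds)
    return hit / total if total else 1.0


# Hypothetical gold morpheme segmentations for two Latin word forms.
gold = {
    "amabantur": ["ama", "ba", "nt", "ur"],
    "puellarum": ["puell", "arum"],
}

# Hypothetical output of a purely frequency-driven subword tokenizer.
frequency_based = {
    "amabantur": ["am", "aban", "tur"],
    "puellarum": ["pu", "ella", "rum"],
}

# Hypothetical output of a morphologically-guided tokenizer.
morph_guided = {
    "amabantur": ["ama", "ba", "ntur"],
    "puellarum": ["puell", "arum"],
}

for name, seg in [("frequency-based", frequency_based), ("morph-guided", morph_guided)]:
    print(name, fertility(seg), boundary_agreement(seg, gold))
```

In this toy case the two tokenizations have similar fertility, yet only the morphologically-guided one places its splits on morpheme boundaries. That is the gap between information-theoretic and linguistic goals the abstract describes.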
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
