The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

Francois Meyer; Jan Buys

arXiv:2511.09197·cs.CL·November 20, 2025

Open Access

TL;DR

This paper studies how learned subword segmentation evolves during language model pretraining and finetuning across three typologically diverse languages. It identifies four stages of subword learning and points to potential benefits for low-resource, morphologically complex languages.

Contribution

The paper extends the subword segmental language model (SSLM) framework to support pretraining and finetuning, analyses how subwords change during training across languages, and presents learnable subwords as a promising approach for text generation and transfer.

Findings

01. Four stages of subword learning are identified.

02. Subword boundaries become finer during finetuning.

03. Learnable subwords improve text generation and transfer.

Abstract

Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper, we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective,…

Taxonomy

Topics: Language and cultural evolution · Natural Language Processing Techniques · Topic Modeling