The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Francois Meyer, Jan Buys

TL;DR
This paper investigates how subword segmentation evolves during training in language models across diverse languages, revealing four learning stages and potential benefits for low-resource, morphologically complex languages.
Contribution
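The paper extends the subword segmental language model (SSLM) framework to support pretraining and finetuning, analyses how subwords evolve during training in three typologically diverse languages, and proposes learnable subwords as a way to improve text generation and transfer.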
Findings
Four stages of subword learning identified.
Subword boundaries become finer during finetuning.
Learnable subwords improve text generation and transfer.
Abstract
Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective,…
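To make the contrast in the abstract concrete, the toy sketch below scores one word under a hypothetical subword vocabulary in two ways. A fixed preprocessing step commits to a single segmentation, whereas a model that learns subwords during training can score a word by summing over all candidate segmentations. This is only an illustration under stated assumptions: the vocabulary, the probabilities, and the function names are invented for the example, a context-free subword table stands in for the neural model, and nothing here is taken from the SSLM implementation.

```python
import math

# Hypothetical, unnormalised toy subword probabilities (illustrative only).
TOY_LOGP = {
    "un": math.log(0.2),
    "lock": math.log(0.2),
    "ed": math.log(0.2),
    "unlock": math.log(0.1),
    "locked": math.log(0.05),
}


def best_segmentation(word, logp, max_len=6):
    """Viterbi search: the single highest-scoring segmentation of `word`."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-score of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"no segmentation of {word!r} with this vocabulary")
    pieces, end = [], n
    while end > 0:  # follow back-pointers from the end of the word
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]


def marginal_log_likelihood(word, logp, max_len=6):
    """Forward pass: log of the summed probability of all segmentations."""
    n = len(word)
    alpha = [-math.inf] * (n + 1)  # alpha[i]: log-sum over segmentations of word[:i]
    alpha[0] = 0.0
    for end in range(1, n + 1):
        scores = [
            alpha[start] + logp[word[start:end]]
            for start in range(max(0, end - max_len), end)
            if word[start:end] in logp and alpha[start] > -math.inf
        ]
        if scores:
            top = max(scores)  # log-sum-exp for numerical stability
            alpha[end] = top + math.log(sum(math.exp(s - top) for s in scores))
    return alpha[n]


pieces, score = best_segmentation("unlocked", TOY_LOGP)
total = marginal_log_likelihood("unlocked", TOY_LOGP)
print(pieces, math.exp(score), math.exp(total))
```

In this toy setting the single best segmentation accounts for only part of the total probability of the word. The rest is spread over alternative segmentations, and a training objective that sums over them can shift that mass as parameters are updated.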
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Topic Modeling
