Training Language Models with homotokens Leads to Delayed Overfitting
Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

TL;DR
This paper introduces homotokens, a data augmentation method that uses alternative subword segmentations to improve language model training by delaying overfitting and enhancing generalization, especially in data-constrained scenarios.
Contribution
The paper formalizes homotokens as meaning-preserving segmentation variants and proposes a lightweight architecture to incorporate them, improving model robustness without altering the core training objective.
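As a rough illustration of what "segmentation variants" means, the sketch below enumerates every way to split a word into pieces from a vocabulary and treats all splits other than the greedy longest-prefix one as homotokens. The toy vocabulary and the function names (`canonical`, `segmentations`, `homotokens`, `sample_homotoken`) are assumptions made for this example, not the paper's code or any tokenizer library's API.

```python
import random
from functools import lru_cache

# Hypothetical toy vocabulary; a real subword vocabulary would be used in practice.
VOCAB = frozenset({"un", "able", "ab", "le", "u", "n", "a", "b", "l", "e"})

def canonical(word, vocab=VOCAB):
    """Greedy longest-prefix segmentation (the 'canonical' tokenization)."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate prefix first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return tuple(pieces)

def segmentations(word, vocab=VOCAB):
    """Every split of `word` into vocabulary pieces, as tuples of strings."""
    @lru_cache(maxsize=None)
    def rec(i):
        if i == len(word):
            return ((),)
        out = []
        for j in range(i + 1, len(word) + 1):
            piece = word[i:j]
            if piece in vocab:
                out.extend((piece,) + rest for rest in rec(j))
        return tuple(out)
    return list(rec(0))

def homotokens(word, vocab=VOCAB):
    """Alternative segmentations: all valid splits except the canonical one."""
    canon = canonical(word, vocab)
    return [s for s in segmentations(word, vocab) if s != canon]

def sample_homotoken(word, rng, vocab=VOCAB):
    """Draw one alternative segmentation uniformly; fall back to canonical."""
    alts = homotokens(word, vocab)
    return rng.choice(alts) if alts else canonical(word, vocab)

if __name__ == "__main__":
    print(canonical("unable"))
    print(sample_homotoken("unable", random.Random(0)))
```

Every segmentation concatenates back to the same surface string, which is the sense in which the augmentation preserves meaning. Uniform sampling is only one possible choice here; the summary does not say how variants are sampled.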
Findings
Homotoken augmentation delays overfitting in pretraining.
Effectiveness depends on tokenizer quality, with stronger gains when tokens are highly compressed.
Homotokens improve generalization across diverse datasets.
Abstract
Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained using a single canonical longest-prefix tokenization. We formalize homotokens (alternative valid subword segmentations of the same lexical item) as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse…
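The abstract does not spell out what "block-causal cross-attention" looks like. One plausible reading, sketched below purely as an assumption, is that each canonical position may attend to the variant pieces derived from canonical tokens at or before that position, and to none from later positions. The function name `block_causal_mask` and the `block_ids` layout are hypothetical.

```python
def block_causal_mask(block_ids, num_canonical):
    """Build a boolean cross-attention mask (assumed reading, not the paper's spec).

    block_ids[j] is the index of the canonical token that variant piece j
    was derived from. mask[i][j] is True when canonical position i is
    allowed to attend to variant piece j, i.e. when block_ids[j] <= i.
    """
    return [[b <= i for b in block_ids] for i in range(num_canonical)]

if __name__ == "__main__":
    # Canonical tokens: ["un", "able"]; sampled variant pieces: ["u", "n", "ab", "le"].
    # Pieces 0-1 come from canonical token 0, pieces 2-3 from canonical token 1.
    mask = block_causal_mask([0, 0, 1, 1], num_canonical=2)
    for row in mask:
        print(["x" if allowed else "." for allowed in row])
```

Under this reading, the position holding "un" sees only the pieces of "un", so no information about later tokens leaks into its next-token prediction.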
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
