Training Language Models with homotokens Leads to Delayed Overfitting

Adrian Cosma; Stefan Ruseti; Emilian Radoi; Mihai Dascalu

arXiv:2601.02867·cs.CL·January 14, 2026

Open Access

TL;DR

This paper introduces homotokens, a data augmentation method that uses alternative subword segmentations to improve language model training by delaying overfitting and enhancing generalization, especially in data-constrained scenarios.

Contribution

The paper formalizes homotokens as meaning-preserving segmentation variants and proposes a lightweight architecture to incorporate them, improving model robustness without altering the core training objective.

Findings

01. Homotoken augmentation delays overfitting in pretraining.

02. Effectiveness depends on tokenizer quality, with stronger gains when tokens are highly compressed.

03. Homotokens improve generalization across diverse datasets.

Abstract

Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness, language models are typically trained using a single canonical longest-prefix tokenization. We formalize homotokens (alternative valid subword segmentations of the same lexical item) as a strictly meaning-preserving form of data augmentation. We introduce a lightweight training architecture that conditions canonical next-token prediction on sampled homotoken variants via an auxiliary causal encoder and block-causal cross-attention, without modifying the training objective or token interface. In data-constrained pretraining, homotoken augmentation consistently delays overfitting under repeated data exposure and improves generalization across diverse…
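To make the abstract's terms concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The toy vocabulary `VOCAB`, the function names, and the example word are all hypothetical. The mask at the end shows only one plausible reading of "block-causal": each canonical position may attend to the homotoken pieces of itself and of earlier canonical tokens. The abstract does not spell out the exact masking rule.

```python
import random

# Hypothetical toy vocabulary; a real subword vocabulary is far larger.
VOCAB = {"tokens", "token", "tok", "to", "ken", "en", "s"}


def canonical_segmentation(word, vocab):
    """Greedy longest-prefix segmentation, used as the canonical form here."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return pieces


def all_segmentations(word, vocab):
    """Every way to split `word` into in-vocabulary pieces."""
    if not word:
        return [[]]
    results = []
    for j in range(1, len(word) + 1):
        prefix = word[:j]
        if prefix in vocab:
            for rest in all_segmentations(word[j:], vocab):
                results.append([prefix] + rest)
    return results


def sample_homotoken(word, vocab, rng):
    """Pick a non-canonical segmentation, or the canonical one if none exists."""
    canonical = canonical_segmentation(word, vocab)
    alternatives = [s for s in all_segmentations(word, vocab) if s != canonical]
    return rng.choice(alternatives) if alternatives else canonical


def block_causal_mask(piece_counts):
    """Assumed masking rule: mask[i][j] is True when canonical position i may
    attend to homotoken piece j, i.e. piece j belongs to canonical token <= i.
    piece_counts[k] is the number of homotoken pieces for canonical token k."""
    owners = [k for k, n in enumerate(piece_counts) for _ in range(n)]
    return [[owner <= i for owner in owners] for i in range(len(piece_counts))]


if __name__ == "__main__":
    print(canonical_segmentation("tokens", VOCAB))
    for seg in all_segmentations("tokens", VOCAB):
        print(seg)
    print(sample_homotoken("tokens", VOCAB, random.Random(0)))
    print(block_causal_mask([1, 3, 2]))
```

Every segmentation concatenates back to the same surface string, which is the sense in which the augmentation preserves meaning. Only the token boundaries, and therefore the model's internal computation, differ.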

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques