Establishing a Scale for Kullback-Leibler Divergence in Language Models Across Various Settings

Ryo Kishino; Yusuke Takase; Momose Oyama; Hiroaki Yamagiwa; Hidetoshi Shimodaira

arXiv:2505.15353 · cs.CL · April 21, 2026

TL;DR

This paper establishes a consistent scale for Kullback-Leibler (KL) divergence between language models, so that differences arising from pretraining, model size, random seeds, quantization, fine-tuning, and layers can be compared on a common footing.

Contribution

The paper extends the log-likelihood vector framework to training checkpoints and intermediate layers, and uses it to establish a consistent KL divergence scale across these settings.

Findings

01 During Pythia pretraining, changes in log-likelihood space, as measured by the scaling behavior of KL divergence, are much smaller than changes in weight space.

02 Language-model behavior stabilizes early even as the weights continue to drift.

03 Learning trajectories in log-likelihood space are subdiffusive.
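
As a rough illustration of what "subdiffusive" means, one common convention is to fit a power law, squared displacement ∝ (number of steps)^α, and call the trajectory subdiffusive when α < 1 (ordinary diffusion gives α = 1). The sketch below fits α by least squares in log-log space. The function name `diffusion_exponent` and the synthetic data are hypothetical and chosen for illustration only; they are not the paper's procedure or results.

```python
import numpy as np


def diffusion_exponent(steps, displacement_sq):
    """Fit displacement_sq ~ c * steps**alpha in log-log space and return alpha."""
    slope, _intercept = np.polyfit(np.log(steps), np.log(displacement_sq), 1)
    return float(slope)


# Synthetic data for illustration only (not taken from the paper).
steps = np.array([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)

# A trajectory whose squared displacement grows like steps**0.4 (subdiffusive).
slow_growth = 0.5 * steps ** 0.4
# A trajectory whose squared displacement grows linearly (ordinary diffusion).
linear_growth = 2.0 * steps

alpha_slow = diffusion_exponent(steps, slow_growth)
alpha_linear = diffusion_exponent(steps, linear_growth)

print(f"alpha (slow growth):   {alpha_slow:.3f}")
print(f"alpha (linear growth): {alpha_linear:.3f}")
print("subdiffusive:", alpha_slow < 1.0)
```

In practice the displacement would be some measured distance between checkpoints, and the fit would be noisy, so the exponent should be read together with its uncertainty.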

Abstract

Log-likelihood vectors define a common space for comparing language models as probability distributions, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space, as measured by the scaling behavior of KL divergence, are much smaller than in weight space, resulting in subdiffusive learning trajectories and early stabilization of language-model behavior despite weight drift.
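
To make the idea of a log-likelihood vector concrete, here is a minimal sketch under strong simplifying assumptions: two toy "models" are plain categorical distributions over five possible texts, and each model's log-likelihood vector is the list of log-probabilities it assigns to the same sampled texts. When the texts are sampled from `p`, the mean coordinate-wise difference between the two vectors is the standard Monte Carlo estimate of KL(p || q). The distributions, sample size, and variable names are hypothetical, and the paper's exact estimator and normalization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy "models": categorical distributions over 5 possible texts.
# These numbers are made up for illustration.
p = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
q = np.array([0.30, 0.30, 0.20, 0.10, 0.10])

# Sample a shared set of texts from p and score them with both models.
n_texts = 200_000
texts = rng.choice(len(p), size=n_texts, p=p)

# Log-likelihood vectors: one coordinate per sampled text.
loglik_p = np.log(p[texts])
loglik_q = np.log(q[texts])

# Monte Carlo estimate of KL(p || q) from the two vectors.
kl_estimate = float(np.mean(loglik_p - loglik_q))

# Exact KL(p || q), available here only because the toy models are tiny.
kl_exact = float(np.sum(p * (np.log(p) - np.log(q))))

print(f"estimated KL: {kl_estimate:.4f}")
print(f"exact KL:     {kl_exact:.4f}")
```

Because every model is scored on the same texts, vectors from different checkpoints, sizes, or quantization levels live in one shared space, which is what makes comparisons across such settings possible.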
