Establishing a Scale for Kullback-Leibler Divergence in Language Models Across Various Settings
Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira

TL;DR
This paper introduces a unified scale for Kullback-Leibler divergence between language models, enabling consistent comparisons across pretraining checkpoints, model sizes, random seeds, quantization, fine-tuning, and layers.
Contribution
It extends the log-likelihood vector framework to include checkpoints and layers, establishing a stable KL divergence scale across diverse settings.
Findings
During Pythia pretraining, changes in log-likelihood space, as measured by the scaling behavior of KL divergence, are much smaller than changes in weight space.
Language model behavior stabilizes early despite ongoing weight drift.
Subdiffusive learning trajectories are observed in log-likelihood space.
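To make "subdiffusive" concrete: a trajectory is usually called subdiffusive when its squared displacement grows like t^alpha with alpha < 1 (alpha = 1 corresponds to ordinary diffusion). The sketch below estimates alpha by a log-log fit on a synthetic trajectory. The function name `displacement_exponent` and the toy data are illustrative assumptions; this is not the paper's estimator, and the summary does not specify how the paper measures displacement.

```python
import numpy as np

def displacement_exponent(trajectory, steps):
    """Fit alpha in ||x_t - x_0||^2 ~ t^alpha by least squares in log-log space.

    trajectory: array of shape (T, d), one point per checkpoint.
    steps: array of shape (T,), with steps[0] == 0 as the origin.
    """
    traj = np.asarray(trajectory, dtype=float)
    steps = np.asarray(steps, dtype=float)
    # Squared displacement from the starting point, skipping t = 0.
    sq_disp = np.sum((traj[1:] - traj[0]) ** 2, axis=1)
    # Slope of log(squared displacement) against log(step) is alpha.
    alpha, _intercept = np.polyfit(np.log(steps[1:]), np.log(sq_disp), deg=1)
    return float(alpha)

# Synthetic (hypothetical) trajectory whose squared displacement is exactly t^0.5.
steps = np.arange(0, 101)
trajectory = np.column_stack([steps ** 0.25, np.zeros_like(steps, dtype=float)])
alpha = displacement_exponent(trajectory, steps)
print(f"estimated alpha = {alpha:.3f}")
```

An estimated alpha below 1 on real checkpoints would indicate that the trajectory spreads more slowly than a random walk, which is the sense of "subdiffusive" used above.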
Abstract
Log-likelihood vectors define a common space for comparing language models as probability distributions, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space, as measured by the scaling behavior of KL divergence, are much smaller than in weight space, resulting in subdiffusive learning trajectories and early stabilization of language-model behavior despite weight drift.
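As a rough illustration of the idea, a model's log-likelihood vector collects its log-probabilities on a fixed set of texts, and differences between two such vectors carry information about the KL divergence between the models. The sketch below uses toy categorical "models" over three text ids and the standard Monte Carlo identity KL(p || q) ≈ mean(log p(x) - log q(x)) for x sampled from p. All names and distributions here are hypothetical, and this is not the paper's estimator, which the abstract does not spell out.

```python
import numpy as np

def log_likelihood_vector(probs, texts):
    """Log-likelihood of each text (an integer id) under a categorical model."""
    return np.log(probs[texts])

def mc_kl(ll_p, ll_q):
    """Monte Carlo estimate of KL(p || q); valid when the texts were sampled from p."""
    return float(np.mean(ll_p - ll_q))

def exact_kl(p, q):
    """Exact KL(p || q) for two categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # toy model p (assumption)
q = np.array([0.4, 0.4, 0.2])   # toy model q (assumption)

# Fixed evaluation set, sampled from p so that the Monte Carlo identity applies.
texts = rng.choice(len(p), size=100_000, p=p)

ll_p = log_likelihood_vector(p, texts)
ll_q = log_likelihood_vector(q, texts)

kl_estimate = mc_kl(ll_p, ll_q)
kl_exact = exact_kl(p, q)
print(f"estimate = {kl_estimate:.4f}, exact = {kl_exact:.4f}")
```

With real language models, `probs[texts]` would be replaced by per-text log-probabilities computed by each model on a shared corpus; the comparison logic on the resulting vectors stays the same.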
