Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion
Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

TL;DR
This paper investigates the hyperfitting phenomenon in fine-tuned LLMs, revealing it as a late-stage geometric expansion in the final transformer layer that enhances diversity and generation quality.
Contribution
It uncovers the geometric mechanism behind hyperfitting, distinguishes it from temperature scaling, and introduces Late-Stage LoRA for efficient fine-tuning.
Findings
Hyperfitting is distinct from temperature scaling and distribution sharpening.
Hyperfitting involves a significant geometric expansion in the final transformer layer.
Late-Stage LoRA achieves robust generation with minimal parameter updates.
Abstract
Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis…
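The entropy-matched control described in the abstract can be illustrated with a minimal sketch. It assumes the control works by finding, for a given logit vector, the temperature whose softmax entropy equals a target value (for example, the entropy observed from a hyperfitted model). The abstract does not specify the authors' procedure, so the function names, the bisection search, and the toy logits below are illustrative assumptions. The sketch also checks a general property of temperature scaling: dividing logits by a positive constant never changes token ranking, so it cannot by itself produce rank reordering.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def match_temperature(logits, target_entropy, lo=1e-3, hi=100.0, iters=60):
    """Hypothetical helper: bisect (in log space) for the temperature whose
    softmax entropy equals target_entropy. Relies on softmax entropy being
    non-decreasing in temperature."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if entropy(softmax(logits, mid)) > target_entropy:
            hi = mid  # distribution too flat: lower the temperature
        else:
            lo = mid  # distribution too sharp: raise the temperature
    return float(np.sqrt(lo * hi))

# Toy logits and target entropy (assumed values, not taken from the paper).
logits = np.array([2.0, 1.0, 0.5, -1.0])
target = 0.3
T = match_temperature(logits, target)
sharpened = softmax(logits, T)

# Entropy matches the target, yet the token ranking is unchanged.
same_ranking = np.array_equal(np.argsort(sharpened), np.argsort(logits))
```

In this setup, a temperature-scaled baseline can match a hyperfitted model's entropy at every step while still choosing the same greedy token as the original model, which is one way to see why entropy alone cannot separate the two.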
