Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li; Yuanhao Ding; Esteban Garces Arias; Christian Heumann

arXiv:2605.22579·cs.CL·May 22, 2026

TL;DR

This paper investigates the hyperfitting phenomenon in fine-tuned LLMs, revealing it as a late-stage geometric expansion in the final transformer layer that enhances diversity and generation quality.

Contribution

It uncovers the geometric mechanism behind hyperfitting, distinguishes it from temperature scaling, and introduces Late-Stage LoRA for efficient fine-tuning.

Findings

01. Hyperfitting is distinct from temperature scaling and distribution sharpening.

02. Hyperfitting involves a significant geometric expansion in the final transformer layer.

03. Late-Stage LoRA achieves robust generation with minimal parameter updates.

Abstract

Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis…
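
To make the entropy-matched control concrete, the following is a minimal sketch and not the authors' procedure. It assumes the control is built by searching for a temperature T at which softmax(logits / T) reaches a chosen target entropy. The function names (`softmax`, `entropy`, `match_temperature`, `ranking`), the toy logits, and the 25% entropy target are all illustrative assumptions. The sketch also checks a general property of temperature scaling: dividing logits by a positive constant never changes the token ranking, so sharpening alone cannot produce a context-dependent rank reordering.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def match_temperature(logits, target_entropy, lo=1e-3, hi=100.0, iters=100):
    """Bisect for T so that entropy(softmax(logits / T)) is close to the target.

    Relies on softmax entropy increasing with T for non-uniform logits.
    The search bounds are hypothetical defaults for this toy example.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint suits a log-scale search
        if entropy(softmax(logits, mid)) > target_entropy:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)


def ranking(values):
    """Indices sorted from largest to smallest value."""
    return sorted(range(len(values)), key=lambda i: -values[i])


# Toy next-token logits (invented for illustration, not from the paper).
logits = [2.0, 1.5, 0.3, -1.0, -2.5]
base = softmax(logits)

# Pretend the fine-tuned model has 25% of the base entropy and match it.
target = 0.25 * entropy(base)
t = match_temperature(logits, target)
sharpened = softmax(logits, t)

print(f"matched temperature: {t:.4f}")
print(f"target entropy: {target:.6f}, achieved: {entropy(sharpened):.6f}")
print("ranking preserved:", ranking(sharpened) == ranking(base))
```

In this sketch the sharpened distribution reaches the target entropy while keeping exactly the same token order as the base distribution, which illustrates why an entropy-matched temperature control can separate pure sharpening from rank reordering.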
