A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
Tomohiro Hayase, Ryo Karakida

TL;DR
This paper introduces a unified theoretical framework that determines the optimal inverse-temperature scaling in self-attention models based on the gap-counting function, reconciling previous conflicting laws.
Contribution
It provides a general theory linking the inverse-temperature scale to the gap-counting function, unifying prior scaling laws and enabling diagnostics for various attention-score families.
Findings
The critical inverse-temperature scale is determined by the upper-tail accumulation of gap counts.
Below this scale, the top competitors are not separated from the maximum score; above it, the attention entropy collapses.
The framework unifies existing scaling laws and applies to practical transformer models.
Abstract
Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws as functions of the context length. We provide a general theory showing that the desirable scale is determined by the gap-counting function of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
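To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' code. It assumes one plausible reading of the gap-counting function (the number of competitors whose score lies within a given gap of the row maximum) and computes the entropy of a softmax row at inverse temperature `beta`. The function names `gap_count` and `softmax_entropy` and the example scores are hypothetical, and the paper's upper-tail accumulation scale is not reproduced here.

```python
import math


def gap_count(scores, gap):
    """Count competitors whose score lies within `gap` of the row maximum.

    Assumed reading of the gap-counting function; the paper's exact
    definition may differ (for example in how ties are handled).
    """
    top = max(scores)
    # Subtract 1 so the maximizer itself is not counted as a competitor.
    return sum(1 for s in scores if top - s <= gap) - 1


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores)."""
    top = max(scores)
    # Shift by the maximum for numerical stability; the softmax is unchanged.
    weights = [math.exp(-beta * (top - s)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Terms with p == 0 (underflow at large beta) contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical attention-score row, chosen only for illustration.
scores = [4.0, 3.5, 2.0, 0.0, -4.0]

for gap in (0.0, 0.5, 2.0):
    print(f"gap={gap}: competitors within gap = {gap_count(scores, gap)}")

for beta in (0.0, 1.0, 10.0, 100.0):
    print(f"beta={beta}: entropy = {softmax_entropy(scores, beta):.6f}")
```

At `beta = 0` the row is uniform and the entropy equals the logarithm of the row length; as `beta` grows, the mass concentrates on the maximum and the entropy falls toward zero, which is the collapse regime the abstract describes.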
