A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

Tomohiro Hayase; Ryo Karakida

arXiv:2605.12697·stat.ML·May 14, 2026

TL;DR

This paper introduces a unified theoretical framework in which the gap-counting function of each attention row determines the critical inverse-temperature scale for self-attention, reconciling previously conflicting scaling laws.

Contribution

The paper provides a general theory linking the critical inverse-temperature scale to the gap-counting function $N_{n}$ of each attention row, recovering prior scaling laws as different choices of $N_{n}$ and yielding a direct diagnostic for attention-score families.

Findings

01

The critical inverse-temperature scale is determined by the upper-tail accumulation of gap counts.

02

Below this scale, the top competitors remain unseparated; above it, the attention entropy collapses.

03

The framework unifies existing scaling laws and applies to practical transformer models.

Abstract

Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^{2}$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_{n}$ of each attention row. Counting how many competitors lie within each gap from the maximum, we define an upper-tail accumulation scale and prove that it gives the critical inverse-temperature scale for softmax concentration: below this scale, the top competitors remain unseparated, whereas above it, the attention entropy collapses. This framework unifies prior scaling laws as different $N_{n}$ and yields a direct diagnostic for attention-score families, from idealized theoretical models to more practical transformers.
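
The two quantities the abstract describes, the count of competitors within a given gap of the row maximum and the entropy of the softmax at a given inverse temperature, can be illustrated with a minimal sketch. This is not the authors' code. The function names `gap_counts` and `softmax_entropy` are hypothetical, and the sketch assumes a non-strict inequality and excludes the maximizer from the count; the paper's exact definition of $N_{n}$ may differ. The upper-tail accumulation scale is not computed here, because the abstract does not state its formula.

```python
import math


def gap_counts(scores, thresholds):
    """For each threshold t, count competitors whose gap to the row maximum is <= t.

    Assumption: the maximizer itself (one zero gap) is excluded from the count.
    """
    top = max(scores)
    gaps = sorted(top - s for s in scores)
    competitor_gaps = gaps[1:]  # drop the single zero gap of the maximizer
    return [sum(1 for g in competitor_gaps if g <= t) for t in thresholds]


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores) for one attention row."""
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    row = [2.0, 1.5, 1.0, 0.0]  # toy attention scores, chosen for illustration
    print(gap_counts(row, [0.5, 1.0, 2.0]))
    for beta in (0.0, 1.0, 5.0, 50.0):
        print(beta, softmax_entropy(row, beta))
```

On the toy row, the entropy starts at $\log 4$ for $\beta = 0$ (uniform attention) and decreases toward zero as $\beta$ grows, which is the collapse regime the abstract refers to. How fast $\beta$ must grow with $n$ to reach that regime is what the paper ties to $N_{n}$.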
