Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions

Samet Demir; Zafer Dogan

arXiv:2511.01292 · stat.ML · May 12, 2026

TL;DR

This paper proposes tuning the attention temperature at inference time to make in-context learning in pretrained Transformers more robust to distribution shift, and supports the approach with a high-dimensional theoretical analysis and empirical validation.

Contribution

The paper derives a closed-form expression for the ICL generalization error under distribution shift and the explicit attention temperature that minimizes it, linking that optimal temperature to moments of the pre-softmax attention scores.

Findings

1. Optimal temperature minimizes ICL error under distribution shift.
2. Temperature adjustment improves performance on QA benchmarks with noisy demonstrations.
3. Theory aligns with empirical results on GPT-2 and Llama2-7B.

Abstract

Pretrained Transformers can perform in-context learning (ICL) from a few demonstrations, but this ability can fail sharply when the test distribution differs from pretraining, a common deployment setting. We study attention temperature as a simple inference-time control for improving ICL robustness under such shifts. In a high-dimensional linear-regression framework, we analyze a Transformer with "approximate softmax" attention, which preserves softmax's normalization and temperature-dependent selectivity while remaining tractable. We derive a closed-form expression for the ICL generalization error under distribution shift, and show that it is minimized by an explicit optimal attention temperature. This characterization yields interpretable guidance by linking the best temperature to moments of the pre-softmax attention scores, and predicts when temperature adjustment can recover near…
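To make the idea of attention temperature as an inference-time control concrete, here is a minimal NumPy sketch. It is not the paper's "approximate softmax" model or its closed-form optimum. It assumes the common convention that temperature divides the pre-softmax scores, and the helper names (`attention_weights`, `entropy`), the toy dimensions, and the random data are all hypothetical. The sketch shows how temperature changes the selectivity of the attention weights, and computes the empirical mean and variance of the pre-softmax scores, the kind of moments the abstract says the best temperature depends on.

```python
import numpy as np


def attention_weights(scores, temperature=1.0):
    """Standard softmax over the last axis of scores / temperature.

    Assumption: temperature divides the scores, so a small temperature
    sharpens the weights and a large temperature flattens them.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    z = scores / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)


def entropy(weights):
    """Shannon entropy of the attention weights (higher means less selective)."""
    return -(weights * np.log(weights + 1e-12)).sum(axis=-1)


# Toy setup (hypothetical): one query attending over n in-context examples.
rng = np.random.default_rng(0)
d, n = 16, 32
query = rng.normal(size=d)
keys = rng.normal(size=(n, d))
values = rng.normal(size=n)

# Pre-softmax attention scores and their first two empirical moments.
scores = keys @ query / np.sqrt(d)
score_mean, score_var = scores.mean(), scores.var()
print(f"score mean={score_mean:.3f}, variance={score_var:.3f}")

# Sweep the temperature at inference time; the model weights stay fixed.
for tau in (0.25, 1.0, 4.0):
    w = attention_weights(scores, tau)
    prediction = w @ values  # attention-weighted average of the values
    print(f"tau={tau}: entropy={entropy(w):.3f}, prediction={prediction:.3f}")
```

In practice one would pick the temperature by minimizing a validation error over such a sweep, or, following the paper, from its closed-form expression, which this sketch does not reproduce.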
