Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions
Samet Demir, Zafer Dogan

TL;DR
This paper proposes tuning the attention temperature at inference time to make in-context learning in pretrained Transformers more robust to distribution shift, backed by theoretical analysis and empirical validation.
Contribution
The paper derives a closed-form expression for the ICL generalization error under distribution shift and an explicit optimal attention temperature that minimizes it, linking that temperature to moments of the pre-softmax attention scores.
Findings
Optimal temperature minimizes ICL error under distribution shift.
Temperature adjustment improves performance on QA benchmarks with noisy demonstrations.
Theory aligns with empirical results on GPT-2 and Llama2-7B.
Abstract
Pretrained Transformers can perform in-context learning (ICL) from a few demonstrations, but this ability can fail sharply when the test distribution differs from pretraining, a common deployment setting. We study attention temperature as a simple inference-time control for improving ICL robustness under such shifts. In a high-dimensional linear-regression framework, we analyze a Transformer with "approximate softmax" attention, which preserves softmax's normalization and temperature-dependent selectivity while remaining tractable. We derive a closed-form expression for the ICL generalization error under distribution shift, and show that it is minimized by an explicit optimal attention temperature. This characterization yields interpretable guidance by linking the best temperature to moments of the pre-softmax attention scores, and predicts when temperature adjustment can recover near…
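To make the notion of attention temperature concrete, here is a minimal NumPy sketch of temperature-scaled softmax weights. It assumes the common convention of dividing the pre-softmax scores by a temperature; the function names, the random scores, and the two temperature values are illustrative assumptions, and the code implements neither the paper's "approximate softmax" attention nor its closed-form optimal temperature.

```python
import numpy as np


def attention_weights(scores, temperature=1.0):
    """Softmax over the last axis of scores / temperature (assumed convention)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(scores, dtype=float) / temperature
    # Subtract the row maximum for numerical stability; the result is unchanged.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def entropy(weights):
    """Shannon entropy of a weight vector; higher means less selective attention."""
    w = np.clip(weights, 1e-12, 1.0)
    return -(w * np.log(w)).sum(axis=-1)


# Hypothetical pre-softmax scores of one query against 8 demonstrations.
rng = np.random.default_rng(0)
scores = rng.normal(size=8)

# Empirical first and second moments of the scores, the kind of statistic
# the abstract says the best temperature is linked to.
score_mean, score_var = scores.mean(), scores.var()

sharp = attention_weights(scores, temperature=0.5)  # more selective
flat = attention_weights(scores, temperature=4.0)   # closer to uniform
```

Lowering the temperature concentrates weight on the highest-scoring demonstrations, while raising it spreads weight more evenly. This selectivity is the knob the abstract describes adjusting at inference time.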
