Boosting In-Context Learning in LLMs Through the Lens of Classical Supervised Learning
Korel Gundem, Juncheng Dong, Dennis Zhang, Vahid Tarokh, Zhengling Qi

TL;DR
This paper introduces Supervised Calibration, a novel framework for improving in-context learning in large language models by optimally adjusting their decision boundaries, leading to state-of-the-art calibration performance.
Contribution
The paper proposes a flexible, loss-based calibration method that can modify and reverse LLM decision boundaries, unifying and surpassing existing calibration techniques.
Findings
SC achieves state-of-the-art calibration across multiple datasets.
It effectively addresses bias and instability in ICL.
The method generalizes many existing calibration approaches.
Abstract
In-Context Learning (ICL) allows Large Language Models (LLMs) to adapt to new tasks with just a few examples, but their predictions often suffer from systematic biases, leading to unstable performance in classification. While calibration techniques are proposed to mitigate these biases, we show that, in the logit space, many of these methods are equivalent to merely shifting the LLM's decision boundary without having the ability to alter its orientation. This proves inadequate when biases cause the LLM to be severely misaligned. To address these limitations and provide a unifying framework, we propose Supervised Calibration (SC), a loss-minimization-based framework, which learns an optimal, per-class affine transformation of the LLM's predictive probabilities in the logit space without requiring external data beyond the context. By using a more expressive functional class, SC not only…
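The abstract's contrast between shifting a decision boundary and reorienting it can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the calibrator applies a per-class scale w_k and offset b_k to the LLM's log-probabilities and is fit by plain gradient descent on cross-entropy. The function names, the synthetic data, and the optimizer settings are all illustrative assumptions. Fixing w_k = 1 gives a shift-only calibrator for comparison.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_calibrator(logp, y, shift_only=False, lr=0.1, steps=2000):
    """Fit per-class scale w and offset b by minimizing mean cross-entropy.

    logp: (n, k) log-probabilities from the model (hypothetical input).
    y:    (n,) integer class labels.
    shift_only=True keeps w fixed at 1, so only the offsets are learned.
    """
    n, k = logp.shape
    w, b = np.ones(k), np.zeros(k)
    onehot = np.eye(k)[y]
    for _ in range(steps):
        # Gradient of the mean cross-entropy w.r.t. the calibrated logits.
        g = (softmax(logp * w + b) - onehot) / n
        b -= lr * g.sum(axis=0)
        if not shift_only:
            w -= lr * (g * logp).sum(axis=0)
    return w, b

def accuracy(logp, y, w, b):
    return float(np.mean(np.argmax(logp * w + b, axis=1) == y))

# Synthetic, assumed data: a binary task where the simulated model is
# confidently wrong, i.e. its logit for class 1 is low when the label is 1.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
logit = -2.0 * (2 * y - 1) + rng.normal(size=y.size)
logp = np.stack([-np.logaddexp(0.0, logit),    # log P(class 0)
                 -np.logaddexp(0.0, -logit)],  # log P(class 1)
                axis=1)

w_shift, b_shift = fit_calibrator(logp, y, shift_only=True)
w_aff, b_aff = fit_calibrator(logp, y)

# Accuracy is measured on the same points used for fitting (illustration only).
acc_shift = accuracy(logp, y, w_shift, b_shift)
acc_affine = accuracy(logp, y, w_aff, b_aff)
print("shift-only accuracy:", acc_shift)
print("affine accuracy:    ", acc_affine)
print("learned scales:     ", w_aff)
```

In this toy setup the simulated model's logits point the wrong way, so no offset alone can recover the labels, whereas a negative learned scale reverses the boundary. The paper's regularization terms and its way of building training data from the context are not modeled here.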
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The breadth of benchmarking datasets is an advantage, and >2 LLM families is also a slight advantage, although it is increasingly expected. The former is attenuated somewhat by the fact that these are only classification datasets (and this attenuation is attenuated by the fact that narrowly focusing on classification is the point); the latter is attenuated by the fact that only small 7B models are used.
- Not that there are word counts per section, and this is minor, but there is so little context given (Sec 3.4 somewhat notwithstanding) -- Sec 2 is barely a paragraph with only a handful of papers mentioned, with very little nuance into how they're mentioned, nor any comparison or caveat between them. Another reason this is a minor complaint is that other references are strewn throughout, but still some claims throughout could benefit from additional external context (e.g., that order can bias I
1. This paper proposes an automatic data generation method for ICL calibration, which helps reduce the high data requirements of previous approaches. 2. This paper designs a regularization term to improve prediction consistency for the same query under different demonstration conditions. This is interesting, and I would like to see more analysis on this point (see Weakness). 3. This paper employs batched calibration training across different numbers of demonstrations. Specifically, a k
1. The empirical method proposed in this paper, i.e., training an affine transformation to rescale the results of restricted decoding, does not go beyond the scope of previous works, and the authors have not compared their approach against them. These prior works include KNN prompting [1], Hidden Calibration [2], and Prototypical Calibration [3], which all utilize the high-degree-of-freedom decision boundary modification. While I acknowledge that the proposed automatic training data generation m
1. This paper presents a novel idea for calibrating LLMs, achieving notable performance improvements. 2. The paper is well-organized, with clear and precise mathematical formulations, making it highly readable. 3. It presents a clear and compelling motivation, which effectively supports the subsequent experiments.
1. It cannot be applied to black-box model calibration, as the method requires access to the model’s internal outputs or representations for fitting the calibration estimator. 2. The model is somewhat overly complex, and the inclusion of multiple regularization terms blurs the main focus, which in turn reduces its practical applicability. 3. The paper lacks comparisons with the latest calibration methods. While expecting comparisons with 2025 methods might be unrealistic, there should at least b
Taxonomy
Topics: Natural Language Processing Techniques
