A Study on the Calibration of In-context Learning
Hanlin Zhang, Yi-Fan Zhang, Yaodong Yu, Dhruv Madeka, Dean Foster, Eric Xing, Himabindu Lakkaraju, Sham Kakade

TL;DR
This paper investigates how in-context learning affects model calibration in language models, revealing that calibration varies with the number of examples and prompting methods, and proposing recalibration techniques to improve reliability.
Contribution
It provides a comprehensive analysis of calibration behavior in in-context learning and introduces a scaling-binning method for better calibration of language models.
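This summary does not spell out how the scaling-binning calibrator is built, so the following is only a minimal sketch of the general scale-then-bin recipe for a binary setting. It makes several assumptions: temperature scaling (fit by a coarse grid search) stands in for the scaling step, the bins are equal-mass, and the names `fit_temperature` and `fit_scaling_binning` are hypothetical helpers, not the paper's code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature (from a coarse grid) that minimizes binary NLL.

    Assumption: temperature scaling is used as the scaling function.
    """
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0

    def nll(t):
        total = 0.0
        for z, y in zip(logits, labels):
            p = min(max(sigmoid(z / t), 1e-12), 1.0 - 1e-12)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        return total

    return min(grid, key=nll)


def fit_scaling_binning(fit_logits, fit_labels, bin_logits, n_bins=10):
    """Return a calibrator: scale first, then snap to a per-bin mean.

    Step 1: fit the scaling function on one split (fit_logits, fit_labels).
    Step 2: build equal-mass bins from the scaled values of a second split.
    Step 3: map any new input to the mean scaled value of its bin.
    """
    t = fit_temperature(fit_logits, fit_labels)
    scaled = sorted(sigmoid(z / t) for z in bin_logits)
    n = len(scaled)
    n_bins = min(n_bins, n)  # guarantees every bin is non-empty

    edges, means = [], []
    for b in range(n_bins):
        chunk = scaled[b * n // n_bins:(b + 1) * n // n_bins]
        edges.append(chunk[-1])               # upper edge of this bin
        means.append(sum(chunk) / len(chunk))  # value emitted for this bin

    def calibrate(z):
        p = sigmoid(z / t)
        for edge, mean in zip(edges, means):
            if p <= edge:
                return mean
        return means[-1]  # above the last edge: use the top bin

    return calibrate
```

The calibrator emits at most `n_bins` distinct values, which is what makes its calibration error straightforward to estimate on held-out data; the scaling step keeps those values closer to the truth than binning raw scores alone.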
Findings
Calibration initially worsens as in-context examples are added, then improves with more examples
Fine-tuning and chain-of-thought prompting can cause miscalibration
Scaling-binning calibrator reduces calibration errors effectively
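The findings above are stated in terms of calibration error. One common way to quantify it is the expected calibration error (ECE); the sketch below assumes equal-width confidence bins and a hypothetical helper name, and is not necessarily the exact metric or binning used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: predicted probability of the chosen answer, one per example.
    correct:     1 if the chosen answer was right, else 0.
    Returns the bin-size-weighted mean of |accuracy - mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one example")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

For example, four predictions made at 0.9 confidence with three of them correct fall into a single bin with accuracy 0.75, giving an ECE of 0.15.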
Abstract
Accurate uncertainty quantification is crucial for the safe deployment of machine learning models, and prior research has demonstrated improvements in the calibration of modern language models (LMs). We study in-context learning (ICL), a prevalent method for adapting static LMs through tailored prompts, and examine the balance between performance and calibration across a broad spectrum of natural language understanding and reasoning tasks. Through comprehensive experiments, we observe that, with an increasing number of ICL examples, models initially exhibit increased miscalibration before achieving better calibration, and that miscalibration tends to arise in low-shot settings. Moreover, we find that methods aimed at improving usability, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations. Furthermore, we explore…
Taxonomy
Topics
Education and Learning Interventions
