Large Language Models are Miscalibrated In-Context Learners
Chengzu Li, Han Zhou, Goran Glavaš, Anna Korhonen, Ivan Vulić

TL;DR
This paper investigates the calibration issues of instruction-tuned large language models in low-resource settings and proposes self-ensembling strategies to improve both calibration and task performance.
Contribution
It provides an in-depth analysis of miscalibration in in-context learning and introduces self-ensembling methods to enhance calibration without sacrificing performance.
Findings
Miscalibration persists across all learning methods in low-resource scenarios.
Self-ensembling with max probability improves calibration and robustness.
Guidelines are provided for choosing learning paradigms based on data familiarity.
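To make the self-ensembling finding concrete, here is a minimal sketch of combining class probabilities obtained from several prompt or in-context-example variations of the same input. The summary does not spell out how "max probability" is computed, so this is one plausible reading and an assumption: take the per-class maximum across variants, then renormalize. The function name `self_ensemble` and the example numbers are hypothetical.

```python
from typing import List, Sequence


def self_ensemble(variant_probs: Sequence[Sequence[float]],
                  strategy: str = "max") -> List[float]:
    """Combine class distributions from several prompt/example variants.

    variant_probs: one probability distribution over classes per variant.
    strategy: "max" (per-class maximum, an assumed reading of
              "max probability") or "mean" (per-class average).
    Returns a renormalized distribution over classes.
    """
    if not variant_probs:
        raise ValueError("need at least one variant")
    # Transpose so that each column holds one class across all variants.
    columns = list(zip(*variant_probs))
    if strategy == "max":
        combined = [max(col) for col in columns]
    elif strategy == "mean":
        combined = [sum(col) / len(col) for col in columns]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    total = sum(combined)
    return [p / total for p in combined]


# Hypothetical outputs of the same model under three prompt variations.
variant_probs = [
    [0.70, 0.20, 0.10],
    [0.40, 0.50, 0.10],
    [0.90, 0.05, 0.05],
]
ensembled = self_ensemble(variant_probs, strategy="max")
prediction = max(range(len(ensembled)), key=ensembled.__getitem__)
```

In this toy case the ensembled confidence in the predicted class is lower than that of the most confident single variant, which illustrates how ensembling can temper overconfidence.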
Abstract
When adapting in-context learning (ICL) with or without fine-tuning, we ask whether an instruction-tuned language model can achieve well-calibrated results without suffering from overconfidence (i.e., miscalibration), given its strong instruction-following ability, especially in such limited-data setups. In this work, we deliver an in-depth analysis of model behavior across different choices of learning methods from the perspective of both performance and calibration. Through extensive controlled experiments, we observe that the miscalibration problem exists across all learning methods in low-resource setups. To achieve simultaneous gains in both in-task performance and calibration, we then study the potential of self-ensembling applied at different modeling stages (e.g., variations of in-context examples, variations in prompts, or different ensembling strategies) to…
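As an illustration of what miscalibration means in practice, the sketch below computes the expected calibration error (ECE), a widely used calibration metric. The text above does not state which metric the authors report, so treating ECE as the measure is an assumption; the function name, the equal-width binning, and the default of 10 bins are illustrative choices.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: whether each prediction matched the gold label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the share of samples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 90% confidence, but only half the answers correct.
overconfident = expected_calibration_error([0.9, 0.9], [True, False])
```

A well-calibrated model would score close to zero, while the overconfident toy example above yields a gap of roughly 0.4 between stated confidence and actual accuracy.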
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications
Methods: Shrink and Fine-Tune
