Large Language Models are Miscalibrated In-Context Learners

Chengzu Li; Han Zhou; Goran Glavaš; Anna Korhonen; Ivan Vulić

arXiv:2312.13772·cs.CL·May 23, 2025·1 cites

TL;DR

This paper investigates the calibration issues of instruction-tuned large language models in low-resource settings and proposes self-ensembling strategies to improve both calibration and task performance.

Contribution

It provides an in-depth analysis of miscalibration in in-context learning and introduces self-ensembling methods to enhance calibration without sacrificing performance.

Findings

1. Miscalibration persists across all learning methods in low-resource scenarios.

2. Self-ensembling with max probability improves calibration and robustness.

3. Guidelines are provided for choosing learning paradigms based on data familiarity.
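The second finding can be illustrated with a small sketch. It assumes one plausible reading of "max probability" self-ensembling: the same input is scored several times (for example with different in-context examples or prompt variants), the highest probability each class receives across those runs is kept, and the result is renormalized. The function names and the toy numbers are hypothetical and are not taken from the paper's repository.

```python
import numpy as np


def max_prob_ensemble(member_probs):
    """Combine per-member class distributions by class-wise max, then renormalize.

    member_probs: array-like of shape (n_members, n_classes); each row is one
    ensemble member's probability distribution over the label set.
    This is an illustrative assumption, not the paper's confirmed procedure.
    """
    probs = np.asarray(member_probs, dtype=float)
    if probs.ndim != 2:
        raise ValueError("expected shape (n_members, n_classes)")
    pooled = probs.max(axis=0)      # highest probability seen for each class
    return pooled / pooled.sum()    # renormalize so the result sums to 1


def mean_prob_ensemble(member_probs):
    """Baseline for comparison: plain average of the member distributions."""
    return np.asarray(member_probs, dtype=float).mean(axis=0)


# Three hypothetical prompt variants scoring one binary-classification input.
members = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
ensembled = max_prob_ensemble(members)
averaged = mean_prob_ensemble(members)
print(ensembled, averaged)
```

In this toy case the ensembled confidence in the predicted class is lower than that of the most confident single member, which is the kind of softening that can reduce overconfidence. Whether it does so on real data is an empirical question.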

Abstract

When adapting ICL with or without fine-tuning, we ask whether an instruction-tuned language model can achieve well-calibrated results without suffering from overconfidence (i.e., miscalibration), given its strong instruction-following ability, especially in such limited-data setups. In this work, we deliver an in-depth analysis of model behavior across different choices of learning methods from the perspective of both performance and calibration. Through extensive controlled experiments, we observe that the miscalibration problem exists across all learning methods in low-resource setups. To achieve simultaneous gains in both in-task performance and calibration, we then study the potential of self-ensembling applied at different modeling stages (e.g., variations of in-context examples, variations in prompts, or different ensembling strategies) to…
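To make "calibration" concrete, here is a minimal sketch of expected calibration error (ECE), a common way to quantify the gap between a model's confidence and its accuracy. This page does not state which metric or binning the authors use, so the equal-width binning, the default of 10 bins, and the function name are assumptions for illustration only.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted average of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen label, in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a right-inclusive bin (lo, hi].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece


# Overconfident toy model: 95% confident on every item, but only half are right.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
# Fully confident and always right: no calibration gap.
calibrated = expected_calibration_error([1.0, 1.0], [1, 1])
print(overconfident, calibrated)
```

A lower value means confidence tracks accuracy more closely. The overconfident toy model above has a large gap because its stated confidence far exceeds its hit rate.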

Code & Models

Repositories

cambridgeltl/ensembled-sicl (PyTorch, official)

Videos

Large Language Models are Miscalibrated In-Context Learners · underline

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications

Methods: Shrink and Fine-Tune