Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles
Yuxi Xia, Pedro Henrique Luz de Araujo, Klim Zaporojets, Benjamin Roth

TL;DR
This study investigates how response agreement, loss functions, and prompt styles influence LLM calibration, proposing Calib-n to improve confidence estimation across diverse models and prompts, with empirical validation on multiple datasets.
Contribution
The paper introduces Calib-n, a framework that trains an auxiliary confidence-estimation model on the aggregated responses of multiple LLMs, using focal and AUC surrogate losses alongside binary cross-entropy. It also evaluates how well calibration generalizes across prompt styles and model sizes.
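To make the idea of inter-model agreement concrete, here is a minimal sketch of hand-built agreement features for one question. It is an illustration only: Calib-n trains an auxiliary model on aggregated responses, whereas the function names (`normalize`, `agreement_features`) and the exact-match comparison below are assumptions made for this example, not the paper's method.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def agreement_features(target_answer: str, other_answers: list[str]) -> dict[str, float]:
    """Hypothetical agreement features for one question.

    agree_frac:    share of the other models whose answer matches the target's.
    majority_frac: share of the other models that gave the most common answer.
    """
    target = normalize(target_answer)
    others = [normalize(a) for a in other_answers]
    if not others:
        # No other models to compare against: report zero agreement.
        return {"agree_frac": 0.0, "majority_frac": 0.0}
    counts = Counter(others)
    agree_frac = counts[target] / len(others)
    majority_frac = counts.most_common(1)[0][1] / len(others)
    return {"agree_frac": agree_frac, "majority_frac": majority_frac}


features = agreement_features("Paris", ["paris", " Paris ", "Lyon", "paris"])
```

Features like these could be fed to a confidence estimator as a signal that an answer many models agree on is more likely to be correct.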
Findings
Response agreement improves calibration performance.
Focal loss outperforms binary cross-entropy in calibration.
Few-shot prompts are most effective for auxiliary models.
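The difference between the two losses in the second finding can be sketched for a single prediction. This uses the standard binary focal loss, which multiplies the cross-entropy term by (1 - p_t)^gamma; the value gamma = 2.0 is an assumption chosen for illustration, not a setting taken from the paper.

```python
import math


def bce_loss(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal_loss(p: float, y: int, gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Binary focal loss: cross-entropy scaled by (1 - p_t) ** gamma.

    p_t is the probability assigned to the true class, so confident correct
    predictions (p_t near 1) are down-weighted. gamma = 0 recovers BCE.
    """
    p = min(max(p, eps), 1 - eps)
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)


easy_ratio = focal_loss(0.9, 1) / bce_loss(0.9, 1)  # well-classified example
hard_ratio = focal_loss(0.3, 1) / bce_loss(0.3, 1)  # poorly classified example
```

The ratio of focal loss to cross-entropy is much smaller for the easy example than for the hard one, so training signal concentrates on the predictions the model gets wrong or is unsure about.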
Abstract
Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure the generalization of their methods to other prompt styles and different sizes of LLMs. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate if incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal…
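The abstract's definition of calibration, the alignment between confidence and accuracy, can be made measurable in several ways. One widely used option is expected calibration error (ECE), sketched below with equal-width bins. The choice of ECE and of 10 bins is an assumption for illustration; this summary does not state which metrics the paper reports.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[int], n_bins: int = 10
) -> float:
    """ECE: weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] that each answer is correct.
    correct:     1 if the answer was actually correct, else 0.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Overconfident toy case: always certain, right only half the time.
overconfident = expected_calibration_error([1.0, 1.0], [1, 0])
```

A perfectly calibrated model scores 0; the toy case above scores 0.5 because its stated confidence of 1.0 is matched by only 50% accuracy.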
Taxonomy
Topics: LLM Calibration
Methods: Focal Loss
