Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles

Yuxi Xia; Pedro Henrique Luz de Araujo; Klim Zaporojets; Benjamin Roth

arXiv:2501.03991·cs.CL·January 8, 2025

TL;DR

This study investigates how response agreement, loss functions, and prompt styles influence LLM calibration. It proposes Calib-n, a framework for improving confidence estimation across models and prompt styles, and evaluates it on four datasets.

Contribution

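The paper introduces Calib-n, a framework that trains an auxiliary model for confidence estimation. The auxiliary model aggregates responses from multiple LLMs to capture inter-model agreement and is trained with focal and AUC surrogate losses alongside binary cross-entropy. The study also measures how well calibration generalizes across prompt styles and model sizes.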
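To make the idea of inter-model agreement concrete, here is a minimal sketch of how agreement signals could be derived from several LLM responses to the same question. It is an illustration only: the function name `agreement_features`, the exact-match comparison after lowercasing, and the two features returned are assumptions, not the feature set Calib-n actually uses.

```python
from collections import Counter


def agreement_features(target_answer, other_answers):
    """Hypothetical agreement features for one question.

    Returns (agreement_rate, majority_share):
      agreement_rate  fraction of the other models whose answer matches
                      the target model's answer
      majority_share  share of all answers (target included) taken by
                      the most common answer
    Matching is a naive normalized string comparison (an assumption).
    """
    def norm(s):
        return s.strip().lower()

    target = norm(target_answer)
    others = [norm(a) for a in other_answers]
    if not others:
        return 0.0, 0.0

    agreement_rate = sum(a == target for a in others) / len(others)
    counts = Counter(others + [target])
    majority_share = counts.most_common(1)[0][1] / (len(others) + 1)
    return agreement_rate, majority_share


# Three other models answer the same question as the target model.
features = agreement_features("Paris", ["paris", "Paris ", "Lyon"])
```

Features like these could be fed to an auxiliary confidence model together with other inputs; how the paper encodes the responses is not stated in this summary.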

Findings

1. Response agreement improves calibration performance.
2. Focal loss outperforms binary cross-entropy in calibration.
3. Few-shot prompts are most effective for auxiliary models.
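The comparison between focal loss and binary cross-entropy in the findings can be illustrated with a short sketch for a single binary correctness label. It uses the standard focal loss form, -(1 - p_t)^gamma * log(p_t), without a class-weighting term; the value `gamma=2.0` is an assumed default, since the paper's hyperparameters are not given in this summary.

```python
import math


def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Focal loss: BCE scaled by (1 - p_t)^gamma, where p_t is the
    probability assigned to the true label. gamma=2.0 is an assumption."""
    p = min(max(p, eps), 1 - eps)
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)


# A confident, correct prediction is penalized far less by focal loss,
# so training concentrates on harder examples.
easy_bce = bce_loss(0.9, 1)
easy_focal = focal_loss(0.9, 1)
```

With `gamma=0` the scaling factor is 1 and focal loss reduces to binary cross-entropy, which makes the two losses easy to compare in the same training setup.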

Abstract

Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure the generalization of their methods to other prompt styles and different sizes of LLMs. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate if incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal…
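The abstract defines calibration as the alignment between model confidence and prediction accuracy. One common way to quantify that gap is the expected calibration error (ECE), sketched below with equal-width confidence bins. This summary does not say which metrics the paper reports, so the choice of ECE and the default `n_bins=10` are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |mean confidence - accuracy| over
    equal-width confidence bins on [0, 1].

    confidences  list of floats in [0, 1]
    correct      list of 0/1 flags, 1 if the prediction was right
    """
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident model: 90% stated confidence, 25% actual accuracy.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
```

A perfectly calibrated set of predictions yields an ECE of 0, and larger values indicate a wider gap between stated confidence and observed accuracy.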

Taxonomy

Topics: Nuclear Engineering Thermal-Hydraulics

Methods: Focal Loss