Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models

Zicheng Xu; Guanchu Wang; Guangyao Zheng; Yu-Neng Chuang; Alexander Szalay; Xia Hu; Vladimir Braverman

arXiv:2506.01951·cs.CL·October 14, 2025

TL;DR

This paper introduces Self-ensemble, a method that improves confidence calibration in large language models on multiple-choice questions, especially those with many options, by aggregating predictions across groups of answer choices without extra training.

Contribution

The paper proposes a plug-and-play Self-ensemble approach that mitigates confidence distortion in LLMs during multi-choice tasks without needing labeled data for tuning.

Findings

01. Self-ensemble improves LLM accuracy on multi-choice questions.

02. It reduces both over-confidence and under-confidence.

03. It outperforms standard inference and baseline methods.

Abstract

Although Large Language Models (LLMs) perform well in general domains, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to substantially degraded performance. To solve this problem, we propose Self-ensemble. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature: it can be integrated into existing LLM architectures through a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that…
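The split-then-ensemble idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract describes an approach built on a designed attention mask and positional encoding, whereas the sketch below simply scores each group separately and then compares the group winners. The names `Scorer`, `split_into_groups`, `self_ensemble_pick`, and `toy_score`, as well as the two-stage aggregation rule, are assumptions made for illustration only; `toy_score` stands in for an LLM that returns one logit per answer choice.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scorer type: given a list of answer choices, return one raw
# score (logit) per choice. In practice this would wrap an LLM call.
Scorer = Callable[[Sequence[str]], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_into_groups(choices: Sequence[str], group_size: int) -> List[List[str]]:
    """Split the choices into consecutive groups of at most `group_size`."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [list(choices[i:i + group_size])
            for i in range(0, len(choices), group_size)]


def self_ensemble_pick(choices: Sequence[str], score: Scorer, group_size: int) -> str:
    """Pick an answer by scoring small groups, then comparing group winners.

    ASSUMPTION: this two-stage rule is only one possible way to combine
    per-group predictions; the source does not specify the aggregation.
    """
    winners = []
    for group in split_into_groups(choices, group_size):
        probs = softmax(score(group))
        winners.append(group[probs.index(max(probs))])
    if len(winners) == 1:
        return winners[0]
    final_probs = softmax(score(winners))
    return winners[final_probs.index(max(final_probs))]


def toy_score(group: Sequence[str]) -> List[float]:
    """Stand-in for an LLM: assigns a higher logit to the choice 'Paris'."""
    return [2.0 if choice == "Paris" else 0.0 for choice in group]


if __name__ == "__main__":
    options = ["Rome", "Berlin", "Paris", "Madrid", "Lisbon", "Vienna"]
    print(self_ensemble_pick(options, toy_score, group_size=4))
```

Each call to the scorer sees only a few choices at a time, which is the regime where the abstract suggests LLM confidence is less distorted. The cost in this simplified version is one extra scoring call per group plus a final comparison.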

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis