Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models
Zicheng Xu, Guanchu Wang, Guangyao Zheng, Yu-Neng Chuang, Alexander Szalay, Xia Hu, Vladimir Braverman

TL;DR
This paper introduces Self-ensemble, a method that improves confidence calibration in large language models on multiple-choice questions, especially those with many options, by splitting the options into groups and aggregating predictions across them without extra training.
Contribution
The paper proposes a plug-and-play Self-ensemble approach that mitigates confidence distortion in LLMs during multi-choice tasks without needing labeled data for tuning.
Findings
Self-ensemble improves LLM accuracy on multi-choice questions.
It reduces over-confidence and under-confidence issues.
It outperforms standard inference and baseline methods.
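Over-confidence and under-confidence are commonly quantified with expected calibration error (ECE), which compares a model's stated confidence with its observed accuracy. The summary does not say which calibration metric the paper reports, so the choice of ECE, the function name, and the default of 10 bins are all assumptions made for illustration. A minimal sketch:

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed default; not taken from the paper
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Bucket predictions by confidence; a confidence of 1.0 goes in the last bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes its |confidence - accuracy| gap, weighted by size.
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only three right is over-confident in that bin: the gap is |0.9 - 0.75|. Under-confidence shows up the same way with the sign reversed before the absolute value is taken.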
Abstract
Although Large Language Models (LLMs) perform well in general fields, they exhibit a confidence distortion problem on multi-choice question-answering (MCQA), particularly as the number of answer choices increases. Specifically, on MCQA with many choices, LLMs suffer from under-confidence in correct predictions and over-confidence in incorrect ones, leading to substantially degraded performance. To solve this problem, we propose Self-ensemble in this work. Our method splits the choices into several groups and ensembles LLM predictions across these groups to reach a final decision. The advantage of Self-ensemble is its plug-and-play nature: it can be integrated into existing LLM architectures through a designed attention mask and positional encoding, without requiring labeled datasets for parameter tuning. Experimental results on three LLMs and datasets demonstrate that…
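The split-and-ensemble idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract says the method works inside the model through an attention mask and positional encoding, whereas the code below makes separate scoring calls. The `Scorer` type, the `grouped_answer` function, the group size of 4, and the two-stage aggregation rule (pick a winner per group, then compare the winners) are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: given a question and a short list of choices, return one
# probability per choice. In practice this would wrap an LLM call.
Scorer = Callable[[str, Sequence[str]], List[float]]


def split_into_groups(choices: Sequence[str], group_size: int) -> List[List[str]]:
    """Split the choices into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [list(choices[i:i + group_size]) for i in range(0, len(choices), group_size)]


def _argmax(values: Sequence[float]) -> int:
    """Index of the largest value; ties go to the earliest index."""
    return max(range(len(values)), key=lambda i: values[i])


def grouped_answer(
    question: str,
    choices: Sequence[str],
    score: Scorer,
    group_size: int = 4,  # assumed value; not taken from the paper
) -> str:
    """Pick an answer by scoring small groups, then comparing group winners."""
    if not choices:
        raise ValueError("choices must not be empty")

    # Stage 1: the scorer only ever sees a few choices at a time.
    winners = []
    for group in split_into_groups(choices, group_size):
        probs = score(question, group)
        winners.append(group[_argmax(probs)])

    # Stage 2 (assumed aggregation rule): compare the group winners directly.
    if len(winners) == 1:
        return winners[0]
    final_probs = score(question, winners)
    return winners[_argmax(final_probs)]
```

The point of the sketch is only that each scoring call faces a small number of options, which is the regime where the abstract says confidence is less distorted. How the paper actually combines the per-group predictions is not stated in the text above.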
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis
