Calibrating LLMs with Information-Theoretic Evidential Deep Learning
Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei

TL;DR
This paper introduces IB-EDL, a regularized evidential deep learning method that improves the calibration and uncertainty estimation of fine-tuned large language models by incorporating an information bottleneck to reduce overfitting.
Contribution
It proposes a novel regularization technique for evidential deep learning using an information bottleneck, enhancing LLM calibration and uncertainty estimation.
Findings
IB-EDL outperforms existing methods in calibration accuracy.
The approach reduces overconfidence when fine-tuning on small datasets.
Experiments show improved trustworthiness of LLMs.
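Calibration claims like these are often quantified with expected calibration error (ECE). This summary does not say which metric the paper reports, so the following is only a minimal sketch of one common equal-width-bin variant; the function name `expected_calibration_error` and the default of 10 bins are illustrative choices, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: top-class probabilities in [0, 1], one per prediction.
    correct: booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

An overconfident model, in this sense, is one whose average confidence within a bin exceeds its accuracy there: four predictions made at 0.95 confidence with only two correct give a gap of 0.45 in that bin.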
Abstract
Fine-tuned large language models (LLMs) often exhibit overconfidence, particularly when trained on small datasets, resulting in poor calibration and inaccurate uncertainty estimates. Evidential Deep Learning (EDL), an uncertainty-aware approach, enables uncertainty estimation in a single forward pass, making it a promising method for calibrating fine-tuned LLMs. However, despite its computational efficiency, EDL is prone to overfitting, as its training objective can result in overly concentrated probability distributions. To mitigate this, we propose regularizing EDL by incorporating an information bottleneck (IB). Our approach IB-EDL suppresses spurious information in the evidence generated by the model and encourages truly predictive information to influence both the predictions and uncertainty estimates. Extensive experiments across various fine-tuned LLMs and tasks demonstrate that…
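To make the "single forward pass" idea concrete, here is a minimal sketch of how a commonly used Dirichlet-based EDL formulation turns one vector of model outputs into both class probabilities and an uncertainty score. The softplus activation, the `alpha = evidence + 1` parameterization, and the names `softplus` and `edl_predict` are assumptions for illustration; the abstract does not specify these details.

```python
import math


def softplus(x):
    # Numerically stable softplus; maps any real number to a non-negative value.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def edl_predict(logits):
    """Map one vector of raw model outputs to Dirichlet-based predictions.

    Returns (expected class probabilities, uncertainty in (0, 1]).
    """
    evidence = [softplus(z) for z in logits]   # assumed activation choice
    alpha = [e + 1.0 for e in evidence]        # Dirichlet concentration parameters
    strength = sum(alpha)                      # total evidence plus number of classes
    probs = [a / strength for a in alpha]      # mean of the Dirichlet distribution
    uncertainty = len(alpha) / strength        # approaches 1.0 as evidence vanishes
    return probs, uncertainty


# Little evidence for any class: uncertainty stays close to 1.
vague_probs, vague_u = edl_predict([-20.0, -20.0, -20.0, -20.0])

# Strong evidence for class 0: uncertainty drops.
sure_probs, sure_u = edl_predict([10.0, -20.0, -20.0, -20.0])
```

In this formulation, large evidence values yield a sharply concentrated Dirichlet, which is the mechanism by which an unregularized objective could end up overconfident.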
Peer Reviews
Decision: ICLR 2025 Poster
- Relevance: The paper addresses an important problem: the uncertainty calibration of LLM outputs (at the token level), which is an important component for providing safe deployment and error estimation for LLMs.
- Theoretical Soundness: The paper takes a very principled approach, identifying precise issues in the application of EDL to LLMs and addressing them with mathematically derived solutions based on reasonable starting assumptions.
- Experimental Soundness: The experiments are more than
The main weakness of the paper is the clarity of its scope: from the experiments, it is clear that this is a method for uncertainty calibration for a specific task at the fine-tuning stage of an LLM. For example, an LLM is fine-tuned for summarisation and the proposed method offers a way to obtain superior uncertainty calibration *for the fine-tuned task*. This is indeed very useful, but it needs to be stated more clearly in the abstract and intro, as from all sections up until the experiments it is unc
The paper is well-presented and easy to follow. It proposes a unified perspective on EDL methods by framing several existing EDL methods as special cases within the IB-EDL framework. The experimental results are strong and cover multiple LLMs, showing that IB-EDL addresses the overconfidence problem well.
The pipeline of applying IB to EDL ($x \to f(x; \theta) \to \tilde{e} \to e \to \alpha \to \pi \to y$) seems computationally heavy; it would be informative to include an inference-time comparison of IB-EDL against other methods. Although the training overhead of IB-EDL is relatively small compared with the pretraining step, the paper lacks a training-time comparison with LoRA, which is the fine-tuning backbone of IB-EDL. For out-of-distribution (OOD) scenarios, the paper only compares the OOD detection capabilities
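The chain of variables in this review's pipeline can be walked through in code. The excerpt names only the variables, so what each arrow does is an assumption here: a hypothetical model head that outputs a Gaussian over the pre-evidence, a softplus to obtain evidence, `alpha = evidence + 1`, a Dirichlet draw for the class probabilities, and a categorical draw for the label. The function name `sample_pipeline` and its parameters are illustrative and not taken from the paper.

```python
import math
import random


def softplus(x):
    # Numerically stable softplus; keeps evidence non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_pipeline(mean, std, rng):
    """One stochastic pass: pre-evidence -> evidence -> alpha -> pi -> y."""
    # Assumption: f(x; theta) yields a per-class mean and std for the pre-evidence.
    pre_evidence = [rng.gauss(m, s) for m, s in zip(mean, std)]
    evidence = [softplus(v) for v in pre_evidence]
    alpha = [e + 1.0 for e in evidence]

    # Dirichlet sample obtained by normalizing independent Gamma draws.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    pi = [g / total for g in gammas]

    # Categorical sample of the class label from pi.
    y = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    return alpha, pi, y


rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha, pi, y = sample_pipeline([2.0, -1.0, 0.5], [0.1, 0.1, 0.1], rng)
```

Each step is a cheap elementwise operation over the class dimension, so under these assumptions the extra cost sits on top of, rather than inside, the forward pass of the LLM. Whether the actual method samples at inference time or uses the means directly is not stated in this excerpt.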
* This work is well-motivated theoretically, pointing out deficiencies with previous EDL losses (e.g. paragraph *Challenges when applying IB to an internal layer of an LLM*).
* The final regularization term $\mathcal{L}_{\text{IB-Info}}$ seems relatively simple, although I do have some questions about practical implementation (see questions).
* Strong experimental results; the OOD detection and noise injection experiments (Tables 3 and 4) in particular seem to support the argument for this additional regulari
* This work seems applicable only to a discrete set of mutually exclusive classes. I'm not sure how this extends to the case of open-ended generation, where multiple responses may be semantically equivalent or have different logical relationships.
* While experiments show improvements in OOD detection when moving from OBQA to the ARC and CSQA datasets, I'd be interested to see the performance when the datasets are 'further apart' semantically (e.g. moving from a reading comprehension to a math task).
Taxonomy
Topics: Natural Language Processing Techniques · Imbalanced Data Classification Techniques · Advanced Computational Techniques and Applications
