Calibrated Large Language Models for Binary Question Answering
Patrizio Giovannotti, Alexander Gammerman

TL;DR
This paper introduces a calibration method for large language models in binary question answering, based on the inductive Venn–Abers predictor (IVAP), so that the predicted probabilities better reflect how often the predictions are actually correct.
Contribution
The paper applies IVAP to calibrate the probabilities of the output tokens that correspond to the binary labels, and reports that this outperforms the commonly used temperature scaling method.
Findings
IVAP consistently achieves better calibration than temperature scaling across various label token choices.
The method maintains high predictive quality.
Results are demonstrated on the BoolQ dataset with Llama 2.
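For readers unfamiliar with the temperature scaling baseline named above, here is a minimal sketch of how it is usually set up for a binary task. It assumes the model's output is reduced to a single number per question, the logit of the "yes" token minus the logit of the "no" token, and that one temperature T is chosen by grid search to minimise the negative log-likelihood on a held-out calibration set. The function names, the grid, and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def nll(logit_diff, labels, T):
    """Mean negative log-likelihood of sigmoid(logit_diff / T) against 0/1 labels."""
    signs = 2.0 * labels - 1.0  # +1 for label 1, -1 for label 0
    # -log sigmoid(x) == log(1 + exp(-x)), computed stably with logaddexp
    return np.mean(np.logaddexp(0.0, -signs * logit_diff / T))

def fit_temperature(logit_diff, labels, grid=None):
    """Pick the temperature on a grid that minimises NLL on the calibration set."""
    logit_diff = np.asarray(logit_diff, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if grid is None:
        grid = np.linspace(0.05, 10.0, 2000)  # assumed search range
    losses = [nll(logit_diff, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

def calibrated_prob(logit_diff, T):
    """Probability of the positive label after dividing the logit gap by T."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logit_diff, dtype=float) / T))

# Toy calibration set: the model is very confident (|logit gap| = 4)
# but is right only 3 times out of 4, so it is overconfident.
z = [4, 4, 4, 4, -4, -4, -4, -4]
y = [1, 1, 1, 0, 0, 0, 0, 1]
T = fit_temperature(z, y)
print(T, calibrated_prob(4.0, T))
```

On this toy data the fitted temperature is greater than 1, which softens the overconfident probabilities toward the observed accuracy. Temperature scaling has a single parameter, so it can only rescale confidence uniformly; it cannot correct miscalibration that differs across score ranges.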
Abstract
Quantifying the uncertainty of predictions made by large language models (LLMs) in binary text classification tasks remains a challenge. Calibration, in the context of LLMs, refers to the alignment between the model's predicted probabilities and the actual correctness of its predictions. A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct. We propose a novel approach that utilizes the inductive Venn–Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels. Our experiments on the BoolQ dataset using the Llama 2 model demonstrate that IVAP consistently outperforms the commonly used temperature scaling method for various label token choices, achieving well-calibrated probabilities while maintaining high predictive quality. Our findings contribute to…
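The abstract does not spell out the IVAP procedure, so the following is a minimal sketch of the standard inductive Venn–Abers construction under stated assumptions. Each example is assumed to carry one scalar score (for instance, the model's probability for the positive label token; the exact score used in the paper is not given here). For a test score s, isotonic regression is fitted twice on the calibration set, once with (s, 0) added and once with (s, 1) added, giving a probability pair (p0, p1). The pair is merged into one probability with p1 / (1 - p0 + p1), a common choice under log loss. All names and the toy data are hypothetical.

```python
import numpy as np

def pava(values, weights):
    """Weighted pool-adjacent-violators: returns the non-decreasing fit."""
    blocks = []  # each block holds [mean, weight, number of input points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

def isotonic_at(scores, labels, s):
    """Value at s of the isotonic fit to (scores, labels); s must occur in scores."""
    xs, inv = np.unique(scores, return_inverse=True)  # tied scores are pooled first
    counts = np.bincount(inv)
    means = np.bincount(inv, weights=labels) / counts
    fit = pava(means, counts)
    return float(fit[np.searchsorted(xs, s)])

def ivap_predict(cal_scores, cal_labels, s):
    """Return (p0, p1, merged probability) for one test score s."""
    cal_scores = np.asarray(cal_scores, dtype=float).ravel()
    cal_labels = np.asarray(cal_labels, dtype=float).ravel()
    scores = np.append(cal_scores, float(s))
    p0 = isotonic_at(scores, np.append(cal_labels, 0.0), float(s))  # test label assumed 0
    p1 = isotonic_at(scores, np.append(cal_labels, 1.0), float(s))  # test label assumed 1
    return p0, p1, p1 / (1.0 - p0 + p1)

# Toy calibration set of (score, true label) pairs.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(ivap_predict(cal_scores, cal_labels, 0.5))
```

The width of the interval p1 - p0 gives a rough indication of how uncertain the calibrated estimate is: it tends to shrink as the calibration set grows. This sketch refits the isotonic regression for every test score, which is simple but slow; faster precomputed variants exist and are not shown here.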
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Speech and dialogue systems
Methods: LLaMA
