TL;DR
This paper introduces MUSE, an information-theoretic ensemble method that improves uncertainty calibration in large language models by aggregating well-calibrated subsets of diverse models. It is evaluated on binary prediction tasks.
Contribution
The paper presents MUSE, a novel subset ensemble approach using Jensen-Shannon Divergence to enhance LLM uncertainty quantification and calibration.
Findings
MUSE improves calibration over single models and naive ensembles.
MUSE improves predictive performance on binary prediction tasks relative to single-model and naive ensemble baselines.
Using MUSE as a guide improves LLM calibration through chain-of-thought distillation.
Abstract
Large language models (LLMs) often behave inconsistently across inputs, indicating uncertainty and motivating the need for its quantification in high-stakes settings. Prior work on calibration and uncertainty quantification often focuses on individual models, overlooking the potential of model diversity. We hypothesize that LLMs make complementary predictions due to differences in training and the Zipfian nature of language, and that aggregating their outputs leads to more reliable uncertainty estimates. To leverage this, we propose MUSE (Multi-LLM Uncertainty via Subset Ensembles), a simple information-theoretic method that uses Jensen-Shannon Divergence to identify and aggregate well-calibrated subsets of LLMs. Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines. In addition, we…
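To make the role of Jensen-Shannon Divergence concrete, here is a minimal Python sketch for the binary case. It is not the authors' algorithm: the abstract does not specify the selection rule, so the function names, the fixed subset size `k`, the exhaustive search, and the "pick the lowest-divergence subset, then average" rule are all illustrative assumptions. Each model is assumed to output a single probability for the positive class.

```python
import math
from itertools import combinations


def bernoulli_entropy(p):
    """Shannon entropy, in bits, of a Bernoulli(p) distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def jsd(probs):
    """Equal-weight generalized Jensen-Shannon divergence of Bernoulli predictions.

    JSD = H(mean of distributions) - mean of H(each distribution).
    It is 0 when all models agree and at most 1 bit for binary outcomes.
    """
    mean_p = sum(probs) / len(probs)
    mean_entropy = sum(bernoulli_entropy(p) for p in probs) / len(probs)
    return bernoulli_entropy(mean_p) - mean_entropy


def lowest_divergence_subset(probs, k):
    """Hypothetical selection rule: the size-k subset of models with minimal JSD.

    Returns the chosen model indices and their averaged positive-class probability.
    Exhaustive search is only practical for a small number of models.
    """
    best = min(
        combinations(range(len(probs)), k),
        key=lambda idx: jsd([probs[i] for i in idx]),
    )
    aggregated = sum(probs[i] for i in best) / k
    return best, aggregated


# Three hypothetical models scoring the same input; the third disagrees.
model_probs = [0.90, 0.85, 0.20]
subset, p_hat = lowest_divergence_subset(model_probs, k=2)
print(subset, round(p_hat, 3))
```

In this toy input the two agreeing models are selected and their probabilities averaged. Whether low divergence is the right selection criterion is exactly the kind of design choice the paper itself would settle, so treat the rule above as a placeholder.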