Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs

Duygu Nur Yaldiz; Yavuz Faruk Bakman; Baturalp Buyukates; Chenyang Tao; Anil Ramakrishna; Dimitrios Dimitriadis; Jieyu Zhao; Salman Avestimehr

arXiv:2406.11278 · cs.CL · February 14, 2025

TL;DR

This paper introduces Learnable Response Scoring (LARS), a trainable scoring function that improves uncertainty estimation in large language models by capturing complex token dependencies, leading to more reliable confidence assessments.

Contribution

The paper proposes LARS, a novel supervised scoring function that outperforms existing methods in uncertainty estimation for LLMs across multiple tasks.

Findings

01 LARS achieves up to 16% AUROC improvement over existing methods.

02 LARS effectively captures complex token dependencies for better uncertainty calibration.

03 Experimental results demonstrate LARS's superior performance in QA and reasoning tasks.

Abstract

Uncertainty estimation (UE) of generative large language models (LLMs) is crucial for evaluating the reliability of generated sequences. A significant subset of UE methods utilize token probabilities to assess uncertainty, aggregating multiple token probabilities into a single UE score using a scoring function. Existing scoring functions for probability-based UE, such as length-normalized scoring and semantic contribution-based weighting, are designed to solve certain aspects of the problem but exhibit limitations, including the inability to handle biased probabilities and complex semantic dependencies between tokens. To address these issues, in this work, we propose the Learnable Response Scoring (LARS) function, a novel scoring function that leverages supervised data to capture complex dependencies between tokens and probabilities, thereby producing more reliable and calibrated response…
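To make the idea of a scoring function concrete, the following is a minimal Python sketch of length-normalized scoring, one of the hand-designed baselines the abstract names. It is not LARS itself, which is learned from supervised data and is not reproduced here. The function names and the example token probabilities are hypothetical, chosen only for illustration.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Unnormalized score: the sum of token log-probabilities.

    Every extra token adds a negative term, so longer responses
    score lower even when each token is equally likely.
    """
    return sum(math.log(p) for p in token_probs)


def length_normalized_score(token_probs: Sequence[float]) -> float:
    """Length-normalized score: the mean token log-probability.

    Dividing by the number of tokens removes the penalty that
    comes from response length alone.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return sequence_log_prob(token_probs) / len(token_probs)


# Hypothetical token probabilities for a short and a long response
# whose tokens are, on average, equally confident.
short_answer = [0.9, 0.8]
long_answer = [0.9, 0.8, 0.9, 0.8]

short_raw = sequence_log_prob(short_answer)
long_raw = sequence_log_prob(long_answer)
short_norm = length_normalized_score(short_answer)
long_norm = length_normalized_score(long_answer)
```

In this sketch the raw sum ranks the longer response as less confident purely because it has more tokens, while the normalized score treats the two responses as equally confident. Normalization still weights every token the same, which is the kind of limitation that motivates a learned scoring function.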

Taxonomy

Topics: Data Stream Mining Techniques · Simulation Techniques and Applications · Semantic Web and Ontologies

Methods: LARS