Defining and Evaluating Decision and Composite Risk in Language Models Applied to Natural Language Inference
Ke Shen, Mayank Kejriwal

TL;DR
This paper introduces a framework to measure and evaluate the risks associated with confidence levels in large language models, focusing on decision abstention and inference accuracy in natural language reasoning tasks.
Contribution
It defines decision and composite risks, and proposes an experimental framework with metrics for assessing these risks in discriminative and generative LLMs.
Findings
Framework improves confidence in low-risk tasks by 20.1%
Framework skips 19.8% of high-risk tasks to avoid errors
Demonstrates utility on four commonsense reasoning datasets
Abstract
Despite their impressive performance, large language models (LLMs) such as ChatGPT are known to pose important risks. One such set of risks arises from misplaced confidence, whether over-confidence or under-confidence, that the models have in their inference. While the former is well studied, the latter is not, leading to an asymmetry in understanding the comprehensive risk of the model based on misplaced confidence. In this paper, we address this asymmetry by defining two types of risk (decision and composite risk), and proposing an experimental framework consisting of a two-level inference architecture and appropriate metrics for measuring such risks in both discriminative and generative LLMs. The first level relies on a decision rule that determines whether the underlying language model should abstain from inference. The second level (which applies if the model does not abstain) is…
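The two-level architecture described in the abstract can be illustrated with a minimal sketch. It assumes the model exposes one confidence score per candidate answer. The names two_level_inference and margin_rule are hypothetical, and the margin-based rule is a simple stand-in chosen for illustration, not the decision rule used in the paper.

```python
from typing import Callable, Optional, Sequence

# A decision rule maps per-candidate confidence scores to True (abstain) or False.
DecisionRule = Callable[[Sequence[float]], bool]


def margin_rule(threshold: float) -> DecisionRule:
    """Hypothetical stand-in rule: abstain when the gap between the two
    highest scores is below `threshold`. Not the paper's decision rule."""
    def rule(scores: Sequence[float]) -> bool:
        top_two = sorted(scores, reverse=True)[:2]
        if len(top_two) < 2:
            return False  # a single candidate leaves nothing to compare
        return (top_two[0] - top_two[1]) < threshold
    return rule


def two_level_inference(scores: Sequence[float],
                        should_abstain: DecisionRule) -> Optional[int]:
    """Level 1: apply the decision rule. Level 2: if the model does not
    abstain, return the index of the highest-scoring candidate."""
    if should_abstain(scores):
        return None  # the model abstains from inference
    return max(range(len(scores)), key=lambda i: scores[i])


rule = margin_rule(0.2)
confident = two_level_inference([0.9, 0.05, 0.05], rule)  # clear winner
uncertain = two_level_inference([0.4, 0.35, 0.25], rule)  # narrow margin
```

Under these assumptions, the first call selects candidate 0 and the second abstains by returning None. Keeping the rule as a separate callable means a learned abstention classifier could be swapped in without changing the second level.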
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax · Dense Connections · Dropout · Linear Layer · Attention Dropout · Residual Connection · Linear Warmup With Linear Decay
