Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models

Nishanth Madhusudhan; Sathwik Tejaswi Madhusudhan; Vikas Yadav; Masoud Hashemi

arXiv:2407.16221·cs.CL·September 25, 2024

TL;DR

This paper introduces a standardized black-box evaluation method and dataset to assess Large Language Models' ability to abstain from answering uncertain questions, highlighting current limitations and potential improvements.

Contribution

It presents a new evaluation framework, dataset, and confusion matrix for assessing abstention abilities of LLMs, applicable to black-box models, and explores prompting strategies to improve abstention performance.
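To make the idea of a black-box abstention evaluation concrete, the following is a minimal sketch that tallies model outputs into a small answerable/unanswerable outcome table using only the returned text, with no token probabilities. It is not the paper's confusion matrix or metric. The multiple-choice setup, the `ABSTAIN_OPTION` label, the `Item` fields, and the outcome names are all assumptions made for illustration.

```python
from collections import Counter
from typing import NamedTuple, Optional

# Hypothetical label for an explicit "I don't know" choice; an assumption,
# not taken from the paper or the dataset.
ABSTAIN_OPTION = "E"


class Item(NamedTuple):
    answerable: bool       # ground truth: does the question have a valid answer?
    gold: Optional[str]    # gold option label, or None when unanswerable
    prediction: str        # option label parsed from the model's text output


def outcome(item: Item) -> tuple:
    """Map one item to a (question type, outcome) cell of the tally."""
    pred = item.prediction.strip().upper()
    abstained = pred == ABSTAIN_OPTION
    if item.answerable:
        if abstained:
            return ("answerable", "abstained")
        correct = pred == (item.gold or "").strip().upper()
        return ("answerable", "correct" if correct else "incorrect")
    return ("unanswerable", "abstained" if abstained else "answered")


def tally(items) -> Counter:
    """Count how many items fall into each cell."""
    return Counter(outcome(i) for i in items)


def abstention_rate(counts: Counter, kind: str) -> float:
    """Fraction of questions of the given type on which the model abstained."""
    total = sum(n for (k, _), n in counts.items() if k == kind)
    return counts[(kind, "abstained")] / total if total else 0.0


# Toy data, invented for this example only.
items = [
    Item(True, "A", "A"),    # answerable, answered correctly
    Item(True, "B", "C"),    # answerable, answered incorrectly
    Item(True, "D", "E"),    # answerable, but the model abstained
    Item(False, None, "E"),  # unanswerable, model abstained
    Item(False, None, "B"),  # unanswerable, model answered anyway
]
counts = tally(items)
```

In this framing, a high abstention rate on unanswerable questions is desirable, while abstaining on answerable questions costs task performance, which is the trade-off the summary describes.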

Findings

01 GPT-4 and Mixtral 8x22b struggle with abstention

02 Strict prompting and Chain-of-Thought improve abstention ability

03 Proposed AUCM offers a structured evaluation approach

Abstract

Abstention Ability (AA) is a critical aspect of Large Language Model (LLM) reliability, referring to an LLM's capability to withhold responses when uncertain or lacking a definitive answer, without compromising performance. Although previous studies have attempted to improve AA, they lack a standardised evaluation method and remain unsuitable for black-box models where token prediction probabilities are inaccessible. This makes comparative analysis challenging, especially for state-of-the-art closed-source commercial LLMs. This paper bridges this gap by introducing a black-box evaluation approach and a new dataset, Abstain-QA, crafted to rigorously assess AA across varied question types (answerable and unanswerable), domains (well-represented and under-represented), and task types (fact-centric and reasoning). We also propose a new confusion matrix, the "Answerable-Unanswerable…

Code & Models

Datasets

ServiceNow-AI/Abstain-QA
dataset · 38 dl

Taxonomy

Methods: Attention Is All You Need · Adam · Label Smoothing · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections