On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks

Kaya Stechly; Karthik Valmeekam; Subbarao Kambhampati

arXiv:2402.08115 · cs.AI · August 6, 2024 · 2 cites

TL;DR

This paper empirically investigates the limitations of self-verification in large language models, revealing that external verification significantly outperforms self-critique in reasoning and planning tasks.

Contribution

It provides a systematic empirical study on the effectiveness of iterative prompting and external verification for LLMs in reasoning and planning, highlighting the limitations of self-critique.

Findings

01. Self-critique causes significant performance collapse.

02. External verification leads to notable performance improvements.

03. Re-prompting with a sound verifier retains most benefits.
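The third finding refers to a simple loop: generate a candidate, check it with a sound external verifier, and ask again if the check fails. The sketch below illustrates that loop under stated assumptions and does not reproduce the paper's prompts, models, or verifiers. Graph coloring is used purely as an illustrative task, and every name here (`verify_coloring`, `reprompt_until_verified`, `make_random_proposer`) is hypothetical. The proposer is a seeded random stub standing in for an LLM call.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Edge = Tuple[int, int]
Coloring = Dict[int, int]


def verify_coloring(edges: List[Edge], coloring: Coloring, num_colors: int) -> bool:
    """Sound external verifier: returns True only for a genuinely valid coloring."""
    vertices = {v for edge in edges for v in edge}
    # Every vertex that appears in an edge must be assigned a color.
    if any(v not in coloring for v in vertices):
        return False
    # Every assigned color must be in the allowed range.
    if any(not 0 <= c < num_colors for c in coloring.values()):
        return False
    # No edge may connect two vertices of the same color.
    return all(coloring[u] != coloring[v] for u, v in edges)


def reprompt_until_verified(
    propose: Callable[[], Coloring],
    verify: Callable[[Coloring], bool],
    max_rounds: int,
) -> Optional[Coloring]:
    """Ask for candidates until the verifier accepts one or the budget runs out."""
    for _ in range(max_rounds):
        candidate = propose()
        if verify(candidate):
            return candidate
    return None


def make_random_proposer(
    vertices: List[int], num_colors: int, seed: int
) -> Callable[[], Coloring]:
    """Hypothetical stand-in for an LLM: proposes uniformly random colorings."""
    rng = random.Random(seed)

    def propose() -> Coloring:
        return {v: rng.randrange(num_colors) for v in vertices}

    return propose


# Usage: try to 3-color a triangle with at most 200 proposals.
triangle = [(0, 1), (1, 2), (0, 2)]
result = reprompt_until_verified(
    make_random_proposer([0, 1, 2], num_colors=3, seed=0),
    lambda c: verify_coloring(triangle, c, 3),
    max_rounds=200,
)
```

Because the verifier is sound, anything the loop returns is a correct solution, and a `None` result means only that the budget ran out. A self-critique setup would replace `verify` with a second call to the same model, which removes that guarantee.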

Abstract

There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples--ranging from multiplication to simple planning--there persists a widespread belief that LLMs can self-critique and improve their own solutions in an iterative fashion. This belief seemingly rests on the assumption that verification of correctness should be easier than generation--a rather classical argument from computational complexity--which should be irrelevant to LLMs to the extent that what they are doing is approximate retrieval. In this paper, we set out to systematically investigate the effectiveness of iterative prompting in the context of reasoning and planning. We present a principled empirical study of the performance of GPT-4 in…

Videos

On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Sparse Evolutionary Training · Position-Wise Feed-Forward Layer · Dropout · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Softmax · Byte Pair Encoding