Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Angelica Chen; Jason Phang; Alicia Parrish; Vishakh Padmakumar; Chen Zhao; Samuel R. Bowman; Kyunghyun Cho

arXiv:2305.14279·cs.CL·February 9, 2024·6 cites

Open Access

TL;DR

This paper identifies two failures of self-consistency in the multi-step reasoning of large language models: multiple variants of GPT-3/-4 show poor hypothetical and compositional consistency across a variety of tasks, which calls the validity of their multi-step reasoning into question.

Contribution

It introduces two novel types of self-consistency relevant for multi-step reasoning and demonstrates their failure in existing large language models.

Findings

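01. GPT-3/-4 models show poor hypothetical consistency: they often fail to predict what their own output would be in another context.

02. Models show low compositional consistency: their final outputs change when intermediate sub-steps are replaced with the models' own outputs for those steps.

03. These self-consistency failures call into question the reliability of LLMs on multi-step tasks.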

Abstract

Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
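The two definitions in the abstract can be made concrete with a small sketch. This is an illustration only, not the paper's evaluation protocol: the function names, the lookup-table "model" (TOY_OUTPUTS), the distractor options, and the string-substitution step are all assumptions chosen so the example runs without any API access.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stand-in for an LLM: a fixed lookup table of prompt -> output.
# The entry for "2*3" is deliberately wrong so that one check fails.
TOY_OUTPUTS = {
    "(2+3)*4": "20", "2+3": "5", "5*4": "20",
    "(1+1)*3": "6", "1+1": "2", "2*3": "7",
}


def toy_answer(prompt: str) -> str:
    """Return the toy model's output for a prompt."""
    return TOY_OUTPUTS[prompt]


def toy_predict_own_answer(prompt: str, options: Sequence[str]) -> str:
    """Toy 'self-prediction': always picks the smallest option string."""
    return min(options)


def hypothetical_consistency_rate(
    answer: Callable[[str], str],
    predict_own_answer: Callable[[str, Sequence[str]], str],
    prompts: Sequence[str],
    distractors: Sequence[Sequence[str]],
) -> float:
    """Fraction of prompts where the model picks out its own actual output."""
    hits = 0
    for prompt, wrong in zip(prompts, distractors):
        actual = answer(prompt)
        # Sort so the position of the true output carries no information.
        options = sorted([actual, *wrong])
        if predict_own_answer(prompt, options) == actual:
            hits += 1
    return hits / len(prompts)


def compositional_consistency_rate(
    answer: Callable[[str], str],
    items: Sequence[Tuple[str, str, str]],
) -> float:
    """Fraction of items whose final output is unchanged after substitution.

    Each item is (full_prompt, sub_prompt, span_in_full_prompt).
    """
    hits = 0
    for full, sub, span in items:
        # Replace the sub-step with the model's own output for that sub-step.
        rewritten = full.replace(span, answer(sub))
        if answer(full) == answer(rewritten):
            hits += 1
    return hits / len(items)


hyp_rate = hypothetical_consistency_rate(
    toy_answer, toy_predict_own_answer,
    prompts=["2+3", "1+1"], distractors=[["6"], ["1"]],
)
comp_rate = compositional_consistency_rate(
    toy_answer,
    items=[("(2+3)*4", "2+3", "(2+3)"), ("(1+1)*3", "1+1", "(1+1)")],
)
```

Neither rate depends on whether the outputs are correct: the toy model answers "(1+1)*3" correctly yet is still compositionally inconsistent on it, because its answer changes once "(1+1)" is replaced with its own output "2". This is the distinction between correctness and consistency that the abstract draws.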

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)

Methods: Attention Is All You Need · Cosine Annealing · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer