Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs
Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

TL;DR
This paper identifies two failures of self-consistency in multi-step reasoning by large language models, showing that GPT-3/-4 models are often inconsistent in hypothetical and compositional contexts, which calls the validity of their reasoning into question.
Contribution
It introduces two novel types of self-consistency relevant for multi-step reasoning and demonstrates their failure in existing large language models.
Findings
GPT-3/-4 models show poor consistency in hypothetical reasoning.
Models exhibit low compositional consistency when intermediate steps are replaced.
These self-consistency failures call into question the reliability of LLMs on multi-step tasks.
Abstract
Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
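The two definitions in the abstract can be made concrete with a small sketch. This is not the authors' evaluation protocol: it assumes the model is a plain function from a prompt string to an answer string, that agreement is measured by exact string match, and that the hypothetical prompt is built from an invented template. The names `toy_model`, `TOY_ANSWERS`, `make_hypothetical`, and both rate functions are hypothetical and exist only for illustration.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]

def hypothetical_consistency_rate(
    model: Model,
    prompts: Sequence[str],
    make_hypothetical: Callable[[str], str],
) -> float:
    """Fraction of prompts where the model's prediction of its own output
    (asked via a hypothetical prompt) matches what it actually outputs."""
    hits = 0
    for prompt in prompts:
        actual = model(prompt)
        predicted = model(make_hypothetical(prompt))
        hits += int(actual.strip() == predicted.strip())
    return hits / len(prompts)

def compositional_consistency_rate(
    model: Model,
    tasks: Sequence[Tuple[str, str, str]],
) -> float:
    """Each task is (full_prompt, sub_prompt, template), where template has a
    '{sub}' slot. The model's own answer to the sub-step is substituted in,
    and the resulting final answer is compared with the direct answer."""
    hits = 0
    for full_prompt, sub_prompt, template in tasks:
        direct = model(full_prompt)
        sub_answer = model(sub_prompt)
        composed = model(template.format(sub=sub_answer))
        hits += int(direct.strip() == composed.strip())
    return hits / len(tasks)

# Assumption: a lookup table standing in for an LLM, with two planted
# inconsistencies (a wrong self-prediction and a wrong sub-step answer).
TOY_ANSWERS = {
    "(2 + 3) * 4": "20",
    "2 + 3": "5",
    "5 * 4": "20",
    "(1 + 6) * 2": "14",
    "1 + 6": "8",
    "8 * 2": "16",
    "What would you answer to: (2 + 3) * 4": "20",
    "What would you answer to: (1 + 6) * 2": "12",
}

def toy_model(prompt: str) -> str:
    return TOY_ANSWERS[prompt]

def make_hypothetical(prompt: str) -> str:
    return f"What would you answer to: {prompt}"

PROMPTS = ["(2 + 3) * 4", "(1 + 6) * 2"]
TASKS = [
    ("(2 + 3) * 4", "2 + 3", "{sub} * 4"),
    ("(1 + 6) * 2", "1 + 6", "{sub} * 2"),
]

if __name__ == "__main__":
    print(hypothetical_consistency_rate(toy_model, PROMPTS, make_hypothetical))
    print(compositional_consistency_rate(toy_model, TASKS))
```

Note that the second toy task is answered correctly when asked directly ("14") yet is compositionally inconsistent, because the model's own sub-step answer leads to a different final result. This is the sense in which the abstract separates consistency from correctness.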
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
Methods: Attention Is All You Need · Cosine Annealing · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer
