Existing LLMs Are Not Self-Consistent For Simple Tasks
Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao

TL;DR
This paper demonstrates that even simple tasks reveal significant self-inconsistency in large language models, and introduces metrics and methods to quantify and improve their internal reasoning consistency.
Contribution
The study introduces inconsistency metrics and two automated methods, a graph-based approach and an energy-based approach, to quantify and mitigate self-inconsistency in LLMs on simple tasks.
Findings
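All smaller models are highly inconsistent on simple tasks, and even state-of-the-art models such as DeepSeek-R1 and GPT-o4-mini are not fully self-consistent.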
Proposed methods partially improve model consistency.
The partial gains highlight how hard and how important it is to achieve self-consistent reasoning in LLMs.
Abstract
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods -- a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
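To make the idea of self-inconsistency on a "points on a line" task concrete, here is a minimal sketch of two ways contradictions among pairwise answers could be counted. It is an illustration only, not the paper's metrics or its graph-based or energy-based methods. The function names (`antisymmetry_violations`, `cyclic_triples`) and the `answers` dictionary, which stands in for a model's yes/no replies to "is x to the left of y?", are hypothetical.

```python
from itertools import combinations, permutations

def antisymmetry_violations(less_than, items):
    """Fraction of unordered pairs whose two answers contradict each other.

    less_than[(a, b)] is the (hypothetical) model answer to "is a left of b?".
    For distinct points, exactly one of (a, b) and (b, a) should be True.
    """
    pairs = list(combinations(items, 2))
    bad = sum(1 for a, b in pairs if less_than[(a, b)] == less_than[(b, a)])
    return bad / len(pairs)

def cyclic_triples(less_than, items):
    """Fraction of unordered triples containing a cycle x < y < z < x."""
    triples = list(combinations(items, 3))
    bad = 0
    for triple in triples:
        if any(less_than[(x, y)] and less_than[(y, z)] and less_than[(z, x)]
               for x, y, z in permutations(triple)):
            bad += 1
    return bad / len(triples)

# Assumed toy data: four points with ground-truth positions on a line.
positions = {"A": 0, "B": 1, "C": 2, "D": 3}
items = list(positions)

# Start from fully consistent answers, then inject one contradiction:
# the model also claims C is left of A while still claiming A is left of C.
answers = {(a, b): positions[a] < positions[b]
           for a in items for b in items if a != b}
answers[("C", "A")] = True

print(antisymmetry_violations(answers, items))  # 1 of 6 pairs is contradictory
print(cyclic_triples(answers, items))           # 1 of 4 triples (A, B, C) is cyclic
```

A single flipped answer already produces both a direct contradiction and a transitivity cycle, which is why counting violations over all pairs and triples gives a more complete picture than checking each answer against ground truth alone.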
