How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
Chun Zheng, Lianlong Wu, Bingqian Li, Lvting Liu, and Yi Zhou

TL;DR
This study evaluates large language models on a simple long-chain reasoning task, the Equivalence Class Problem (ECP), and finds that reasoning models outperform non-reasoning ones but still struggle to solve it completely.
Contribution
The paper provides an empirical analysis of LLMs' reasoning capabilities on ECP, contrasting reasoning and non-reasoning models and identifying the factors that make problem instances hard.
Findings
Non-reasoning LLMs fail ECP tasks.
Reasoning models perform better but do not fully solve ECP.
For non-reasoning models, the hardest instances align with phase transition points; for reasoning models, they relate to maximum diameter.
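To make the "maximum diameter" finding concrete, here is a minimal sketch that measures the longest shortest-path chain in the graph formed by the equivalence relations. It assumes "diameter" means the largest graph diameter over all connected components, which this summary does not spell out; the function name `max_component_diameter` and the edge-list input format are hypothetical, not taken from the paper.

```python
from collections import deque


def max_component_diameter(n, relations):
    """Return the largest diameter over all connected components.

    n: number of variables, labelled 0..n-1 (assumed labelling).
    relations: list of (i, j) pairs, each asserting variable i equals variable j.
    """
    # Build an undirected adjacency list from the equivalence relations.
    adj = [[] for _ in range(n)]
    for i, j in relations:
        adj[i].append(j)
        adj[j].append(i)

    def eccentricity(src):
        # Breadth-first search: distance from src to every reachable variable.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())

    # The maximum eccentricity over all vertices equals the maximum
    # diameter over all components.
    return max((eccentricity(v) for v in range(n)), default=0)
```

Under this reading, a larger diameter means that some pair of equal variables can only be linked through a longer chain of relations, which is one plausible way to quantify how "long-chain" an instance is.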
Abstract
Large Language Models (LLMs) have improved substantially in recent years. Nevertheless, it remains unclear how well LLMs handle reasoning tasks, especially long-chain ones. In this paper, we evaluate LLMs' performance on the simplest yet long-chain reasoning task, namely the Equivalence Class Problem (ECP), i.e., determining whether two variables are equal given a set of randomly generated equivalence relations. We consider representative reasoning and non-reasoning LLMs over a large variety of problem instances, varying the number of variables, connectivity probabilities, prompts, and other factors. The experimental results show that non-reasoning LLMs fail ECP, while reasoning models are significantly better but still struggle to completely solve this problem. Interestingly, considering various connectivity probabilities with a fixed number of…
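For reference, the task described in the abstract can be solved exactly with a union-find structure. The sketch below is illustrative only: the generator assumes each unordered pair of variables is related independently with a connectivity probability `p`, which is one natural reading of "randomly generated equivalence relations" but is not confirmed by this summary. The names `generate_instance`, `find`, and `equivalent` are hypothetical.

```python
import random


def generate_instance(n, p, seed=None):
    """Hypothetical generator: assert i = j for each pair with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def find(parent, x):
    """Return the representative of x's class, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def equivalent(n, relations, a, b):
    """Decide whether variables a and b are equal under the given relations."""
    parent = list(range(n))
    for i, j in relations:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj  # merge the two classes
    return find(parent, a) == find(parent, b)
```

A ground-truth label for a prompt could then be obtained with, for example, `equivalent(n, generate_instance(n, p, seed=0), 0, n - 1)`; how the paper actually samples instances and query pairs is not stated here.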
