LR$^2$Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang

TL;DR
LR$^2$Bench is a new benchmark with 850 constraint satisfaction problems designed to evaluate the long-chain reflective reasoning abilities of large language models, revealing current models' limited performance and highlighting the need for further improvements.
Contribution
This paper introduces LR$^2$Bench, the first comprehensive benchmark specifically designed to evaluate the reflective reasoning capabilities of large language models across diverse constraint satisfaction tasks.
Findings
State-of-the-art Large Reasoning Models (LRMs) achieve only around 20-24% accuracy on LR$^2$Bench.
Reflective reasoning remains a significant challenge for current large language models.
LR$^2$Bench covers diverse constraint types, providing a thorough evaluation framework.
Abstract
Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse…
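To make the "making assumptions" and "backtracking" steps named in the abstract concrete, here is a minimal sketch of a backtracking search over a tiny finite-domain CSP. It is an illustration only: the toy problem, the function names (`solve`, `toy_constraints`), and the constraints are hypothetical, and nothing here is taken from the benchmark's tasks or evaluation code.

```python
from typing import Callable, Dict, List, Optional

Assignment = Dict[str, int]


def solve(variables: List[str],
          domains: Dict[str, List[int]],
          consistent: Callable[[Assignment], bool],
          assignment: Optional[Assignment] = None) -> Optional[Assignment]:
    """Depth-first search with backtracking; returns a full assignment or None."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return dict(assignment)
    # Pick the next unassigned variable in the given order.
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value            # make an assumption
        if consistent(assignment):         # check constraints on the partial assignment
            result = solve(variables, domains, consistent, assignment)
            if result is not None:
                return result
        del assignment[var]                # backtrack: retract the assumption
    return None


def toy_constraints(a: Assignment) -> bool:
    """Hypothetical toy constraints: all values distinct, A < B, and C < A."""
    values = list(a.values())
    if len(set(values)) != len(values):
        return False
    if "A" in a and "B" in a and not a["A"] < a["B"]:
        return False
    if "C" in a and "A" in a and not a["C"] < a["A"]:
        return False
    return True


variables = ["A", "B", "C"]
domains = {v: [1, 2, 3] for v in variables}
solution = solve(variables, domains, toy_constraints)
print(solution)
```

In this toy instance the first guess, A = 1, satisfies every constraint that can be checked at that point, but it leaves no valid value for C, so the search has to retract it and try A = 2. Revising an earlier assumption after a later contradiction is the kind of behavior the abstract groups under reflection.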
