LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

Jianghao Chen; Zhenlin Wei; Zhenjiang Ren; Ziyong Li; Jiajun Zhang

arXiv:2502.17848 · cs.CL · June 26, 2025

TL;DR

LR$^2$Bench is a benchmark of 850 samples spanning six types of constraint satisfaction problems, designed to evaluate the long-chain reflective reasoning abilities of large language models. Results show that current models perform poorly on it, leaving substantial room for improvement.

Contribution

This paper introduces LR$^2$Bench, the first comprehensive benchmark specifically designed to evaluate the reflective reasoning capabilities of large language models across diverse constraint satisfaction tasks.

Findings

01

State-of-the-art Large Reasoning Models (LRMs) achieve only about 20-24% accuracy on LR$^2$Bench.

02

Reflective reasoning remains a significant challenge for current large language models.

03

LR$^2$Bench covers diverse constraint types, providing a thorough evaluation framework.

Abstract

Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse…
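
For readers unfamiliar with the terminology, the pattern of "making assumptions" and "backtracking" on a CSP can be illustrated with classical backtracking search. The following is a minimal sketch on a toy problem, not one of the benchmark's tasks and not the authors' evaluation code; the function names `backtrack` and `consistent` and the three-variable puzzle are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional

Assignment = Dict[str, int]


def backtrack(
    variables: List[str],
    domains: Dict[str, List[int]],
    consistent: Callable[[Assignment], bool],
    assignment: Optional[Assignment] = None,
) -> Optional[Assignment]:
    """Depth-first search: assume a value, check constraints, undo on failure."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return dict(assignment)  # every variable assigned, all constraints hold
    # Pick the first variable that has no value yet.
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        assignment[var] = value  # make an assumption
        if consistent(assignment):
            result = backtrack(variables, domains, consistent, assignment)
            if result is not None:
                return result
        del assignment[var]  # assumption failed: backtrack
    return None  # no value works for this variable under earlier assumptions


def consistent(a: Assignment) -> bool:
    """Toy constraints (illustrative only): all values differ, A > B, B > C."""
    values = list(a.values())
    if len(set(values)) != len(values):
        return False
    if "A" in a and "B" in a and not a["A"] > a["B"]:
        return False
    if "B" in a and "C" in a and not a["B"] > a["C"]:
        return False
    return True


variables = ["A", "B", "C"]
domains = {v: [1, 2, 3] for v in variables}
solution = backtrack(variables, domains, consistent)
print(solution)
```

On this toy instance the search first assumes `A = 1` and `A = 2`, finds that neither can be extended to a full assignment, and retracts each before reaching a solution. That retract-and-retry loop is the behavior the abstract refers to as reflection.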
