RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin

TL;DR
This paper introduces RealCritic, a benchmark for evaluating the critique capabilities of LLMs through a closed-loop approach involving self-critique, cross-critique, and iterative critique across reasoning tasks.
Contribution
It presents a novel benchmark with a closed-loop evaluation method for assessing LLM critique abilities, highlighting differences between classical and advanced models.
Findings
Classical LLMs lag behind advanced models in critique tasks.
In self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities.
The benchmark distinguishes reasoning capabilities of different LLMs.
Abstract
Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging…
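The closed-loop idea described above, scoring a critique by the quality of the correction it produces, can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: the names `Item`, `Critic`, `closed_loop_accuracy`, and `toy_critic` are hypothetical, the critic is a stub rather than an LLM call, and correctness is reduced to exact string match on the final answer. Setting `rounds` above 1 mimics iterative critique; whether the initial solution came from the same model or another one is what would separate self-critique from cross-critique.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    question: str
    initial_solution: str   # the solution handed to the critic
    reference_answer: str   # ground-truth final answer


# Hypothetical interface: a critic reads (question, solution) and
# returns a corrected solution. A real critic would call an LLM.
Critic = Callable[[str, str], str]


def closed_loop_accuracy(items: List[Item], critic: Critic, rounds: int = 1) -> float:
    """Score a critic by the correctness of the corrections it yields.

    rounds=0 gives the baseline accuracy of the initial solutions;
    rounds>1 feeds each correction back to the critic (iterative critique).
    """
    if not items:
        return 0.0
    correct = 0
    for item in items:
        solution = item.initial_solution
        for _ in range(rounds):
            solution = critic(item.question, solution)
        # Assumption: exact match on the stripped final answer.
        if solution.strip() == item.reference_answer.strip():
            correct += 1
    return correct / len(items)


# Toy stand-in for a model: it only knows how to fix one question.
_KNOWN_FIXES = {"2+2": "4"}


def toy_critic(question: str, solution: str) -> str:
    return _KNOWN_FIXES.get(question, solution)


items = [
    Item("2+2", "5", "4"),    # wrong, and the critic can fix it
    Item("3*3", "9", "9"),    # already correct, left unchanged
    Item("10-7", "4", "3"),   # wrong, and the critic cannot fix it
]

baseline = closed_loop_accuracy(items, toy_critic, rounds=0)
after_critique = closed_loop_accuracy(items, toy_critic, rounds=1)
```

Comparing `after_critique` with `baseline` is what makes the evaluation closed-loop: a critique only earns credit if the correction derived from it is right, and a critic that turns correct solutions into wrong ones would score below the baseline.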
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
