RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques

Zhengyang Tang; Ziniu Li; Zhenyang Xiao; Tian Ding; Ruoyu Sun; Benyou Wang; Dayiheng Liu; Fei Huang; Tianyu Liu; Bowen Yu; Junyang Lin

arXiv:2501.14492 · cs.CL · January 27, 2025

TL;DR

This paper introduces RealCritic, a benchmark for evaluating the critique capabilities of LLMs through a closed-loop approach involving self-critique, cross-critique, and iterative critique across reasoning tasks.

Contribution

It presents a novel benchmark with a closed-loop evaluation method for assessing LLM critique abilities, highlighting differences between classical and advanced models.

Findings

1. Classical LLMs lag behind advanced reasoning models in critique tasks.

2. In self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities.

3. The benchmark distinguishes the reasoning capabilities of different LLMs.

Abstract

Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging…
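The closed-loop idea described above, scoring a critique by whether the correction it leads to is right, can be illustrated with a minimal sketch. Everything named here (`Item`, `Critic`, `closed_loop_accuracy`, `toy_critic`) is hypothetical and is not taken from the paper or its repository; the sketch also assumes answers can be compared by exact string match, which a real benchmark may not do. Whether a run counts as self-critique or cross-critique depends only on whether the candidate solution came from the same model as the critic, and `rounds > 1` stands in for iterative critique.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    question: str
    solution: str   # candidate answer to be critiqued (possibly wrong)
    reference: str  # ground-truth final answer


# Hypothetical interface: a critic reads a question and a candidate
# answer, then returns a corrected answer.
Critic = Callable[[str, str], str]


def closed_loop_accuracy(items: List[Item], critic: Critic, rounds: int = 1) -> float:
    """Fraction of items whose answer is correct after critique.

    The critique text itself is never graded; only the correction it
    produces is compared with the reference (assumed: exact match).
    """
    if not items:
        return 0.0
    correct = 0
    for item in items:
        answer = item.solution
        # Iterative critique: feed each correction back to the critic.
        for _ in range(rounds):
            answer = critic(item.question, answer)
        correct += int(answer.strip() == item.reference.strip())
    return correct / len(items)


def toy_critic(question: str, solution: str) -> str:
    # Stand-in for a model call: it only knows how to fix one question.
    return "4" if question == "2+2" else solution


items = [Item("2+2", "5", "4"), Item("3+3", "7", "6")]
print(closed_loop_accuracy(items, toy_critic))
```

Comparing this score with the accuracy of the uncritiqued solutions shows whether critique helped or hurt, which is the kind of comparison a closed-loop evaluation makes possible.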

Code & Models

Repositories

tangzhy/realcritic (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling