Critique Ability of Large Language Models
Liangchen Luo, Zi Lin, Yinxiao Liu, Lei Shu, Yun Zhu, Jingbo Shang, Lei Meng

TL;DR
This paper evaluates how well large language models critique responses across several tasks, introduces a benchmark called CriticBench, and finds that critique ability is limited, particularly in self-critique, and improves with model size.
Contribution
It introduces CriticBench, a new benchmark for assessing LLM critique abilities, and provides insights into the challenges and potential of self-critique for model improvement.
Findings
Critique ability improves with larger model size.
Self-critique remains challenging even for top models.
Models perform worse on problems where they are most uncertain.
Abstract
Critical thinking is essential for rational decision-making and problem-solving. This skill hinges on the ability to provide precise and reasoned critiques and is a hallmark of human intelligence. In the era of large language models (LLMs), this study explores the ability of LLMs to deliver accurate critiques across various tasks. We are interested in this topic as a capable critic model could not only serve as a reliable evaluator, but also as a source of supervised signals for model tuning. Particularly, if a model can self-critique, it has the potential for autonomous self-improvement. To examine this, we introduce a unified evaluation framework for assessing the critique abilities of LLMs. We develop a benchmark called CriticBench, which comprises 3K high-quality natural language queries and corresponding model responses; and annotate the correctness of these responses. The…
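The setup described in the abstract (queries, model responses, and an annotated correctness label for each response) can be illustrated with a minimal sketch. The class name `CritiqueExample`, the function `critic_accuracy`, and the choice of plain accuracy as the score are assumptions made for illustration; they are not taken from the paper and may differ from how CriticBench is actually scored.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CritiqueExample:
    """One benchmark item (hypothetical layout, not the paper's schema)."""
    query: str        # natural language query
    response: str     # model response to be critiqued
    is_correct: bool  # annotated correctness of the response


def critic_accuracy(examples: Sequence[CritiqueExample],
                    judgments: Sequence[bool]) -> float:
    """Fraction of responses where the critic's verdict matches the annotation.

    judgments[i] is True when the critic judged examples[i].response correct.
    """
    if len(examples) != len(judgments):
        raise ValueError("need exactly one judgment per example")
    if not examples:
        return 0.0
    hits = sum(ex.is_correct == verdict
               for ex, verdict in zip(examples, judgments))
    return hits / len(examples)
```

In this sketch a critic is reduced to a list of boolean verdicts, so the same function would apply whether the verdicts come from a separate critic model or from the model that produced the responses.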
Peer Reviews
Decision·Submitted to ICLR 2024
The paper is well-written and easy to follow. The authors are very clear about all details of the data collection process and provide good motivation for the various design choices. The evaluation is thorough and covers a wide range of models. The proposed new heuristic is not particularly novel, but it achieves solid improvement on the new benchmark.
A critique in this paper is defined as a language model's assessment of another language model's output on some underlying task. A good critique model should be effective at identifying flaws in language model outputs. The challenging examples for the critique task are nuanced flaws, which would also require a detailed explanation by the critique model. But the benchmark proposed by this paper uses a simplistic quantitative metric that reduces the quality of a critique to a binary decision, which a
1. The paper addresses an important and under-explored aspect of LLMs, which is their ability to critique their own outputs. This is a valuable contribution as it moves beyond traditional evaluation metrics and looks at a model's ability to self-improve.
2. The paper presents a clear definition of critique ability and distinguishes between critique and self-critique, which helps in setting the scope and understanding the objectives of the study.
1. The paper could benefit from a more detailed discussion on the limitations of the current approach, particularly regarding the scalability of the self-check method and its applicability to real-world scenarios [1,2,3].
2. The study is limited to a few tasks and datasets. Expanding the benchmark to include more diverse tasks and domains would make the findings more generalizable.
3. The evaluation of self-critique abilities shows that models struggle with certain tasks, but the paper does no
- Exploring the critique ability of LLMs is interesting and timely.
- This paper provides a standardized way to evaluate the critique ability of LLMs on diverse tasks.
- The paper offers several noteworthy insights, such as the challenges associated with self-critique in LLMs. These findings can guide future research and model development.
- The evaluation is not comprehensive. While it claims to evaluate the critique ability, it only evaluates this across three tasks: math, code, and commonsense. A broader range of tasks should be tested.
- The paper does not discuss potential biases. Without discussing these biases, it's unclear how they might influence the evaluation results, which could affect the validity of the findings.
- Authors could offer a more in-depth analysis of the utility of self-critique. Understanding why self-cr
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques
