RefineBench: Evaluating Refinement Capability of Language Models via Checklists

Young-Jun Lee; Seungone Kim; Byung-Kwan Lee; Minkyeong Moon; Yechan Hwang; Jong Myoung Kim; Graham Neubig; Sean Welleck; Ho-Jin Choi

arXiv:2511.22173·cs.CL·December 1, 2025



TL;DR

RefineBench is a benchmark designed to evaluate the ability of language models to self-refine or improve their responses, revealing current limitations and the effectiveness of guided feedback in enhancing model outputs.

Contribution

This paper introduces RefineBench, a new benchmark with a checklist-based evaluation framework for assessing LM self-refinement capabilities across diverse domains.

Findings

01. Self-refinement scores are modest for frontier LMs like GPT-5.

02. Most models fail to improve responses consistently over iterations.

03. Guided refinement significantly improves response quality within five turns.

Abstract

Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs' refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve…
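
To make the two refinement modes concrete, the following is a minimal sketch, not the authors' implementation. It assumes a response is scored as the fraction of checklist items a judge marks as satisfied, and that guided feedback simply names the unmet items. The names `checklist_score`, `refine`, `toy_judge`, and `toy_model` are hypothetical, and the judge and model are toy stand-ins for LM calls.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces: a real setup would back both with LM calls.
Judge = Callable[[str, str], bool]                 # (response, item) -> satisfied?
Model = Callable[[str, str, Optional[str]], str]   # (query, previous, feedback) -> new response

FEEDBACK_PREFIX = "Please address: "


def checklist_score(response: str, checklist: List[str], judge: Judge) -> float:
    """Fraction of checklist items the judge marks as satisfied (assumed metric)."""
    if not checklist:
        return 0.0
    return sum(judge(response, item) for item in checklist) / len(checklist)


def refine(query: str, first_response: str, checklist: List[str],
           model: Model, judge: Judge, guided: bool, max_turns: int = 5) -> List[float]:
    """Return the score of the initial response and of each refined response."""
    response = first_response
    scores = [checklist_score(response, checklist, judge)]
    for _ in range(max_turns):
        feedback = None
        if guided:
            # Guided mode: feedback lists the checklist items still unmet.
            unmet = [item for item in checklist if not judge(response, item)]
            if not unmet:
                break  # nothing left to fix
            feedback = FEEDBACK_PREFIX + "; ".join(unmet)
        # Self-refinement mode: the model gets no feedback (feedback is None).
        response = model(query, response, feedback)
        scores.append(checklist_score(response, checklist, judge))
    return scores


def toy_judge(response: str, item: str) -> bool:
    """Toy judge: an item counts as satisfied if its text appears in the response."""
    return item.lower() in response.lower()


def toy_model(query: str, previous: str, feedback: Optional[str]) -> str:
    """Toy model: fixes one named item per turn, and changes nothing without feedback."""
    if feedback is None:
        return previous
    first_unmet = feedback[len(FEEDBACK_PREFIX):].split("; ")[0]
    return previous + " " + first_unmet


checklist = ["units", "assumptions", "final answer"]
first = "Here is the final answer."
guided_scores = refine("toy query", first, checklist, toy_model, toy_judge, guided=True)
self_scores = refine("toy query", first, checklist, toy_model, toy_judge, guided=False)
print("guided:", guided_scores)
print("self:  ", self_scores)
```

The toy model's behavior without feedback is an artifact of the stub and says nothing about real LMs. The sketch only shows where the two modes differ: whether the refinement turn receives feedback derived from the checklist.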

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Very interesting problem.
- The dataset is a significant contribution.
- Human evaluation of checklist generation.
- Summary statistics of the dataset and comparison to other benchmarks are included and well-presented.
- Extensive experiments.

Weaknesses

I really only found one letdown in this paper, but it is a big one -- there is no human evaluation/verification of the evaluation pipeline. To be strong, there needs to be a human verification of a sample of the end-to-end evaluation process. How do we know how good the LLMs are at comparing the answer to the checklist and providing good feedback? This is instrumental to understanding the results. I also think that a good baseline would have been to compare the success with human feedback (de

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The benchmark is well-curated, covering a wide range of topics and domains. The manual quality control process also seems solid.
2. The evaluation spans a large number of models, showing a commendable level of comprehensiveness.
3. The findings are interesting - especially the comparison between thinking models and standard ones. As the paper notes, whether refinement itself is beneficial has been extensively studied and debated in prior work, but revisiting this question in the context of re

Weaknesses

1. I have concerns about using the same checklist for both external guidance and evaluation. Could this create potential leakage, where models optimize for missing checklist items instead of genuinely improving quality? It's unclear whether the provided guidance leads to real improvement or just better checklist completion.
2. Discussion of related work is strangely organized. The CriticBench line of work seems most relevant and should probably be introduced earlier in Section 2. In contrast, th

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- The paper introduces a new benchmark with a relatively large problem set and clear per-problem checklists, enabling more reliable evaluation of LLMs’ reasoning abilities.
- The analyses are clear and highlight that self-refinement remains challenging, particularly due to LLMs’ difficulty in identifying specific errors and determining how to adjust initial answers.

Weaknesses

- The study uses GPT-4.1 as the sole evaluator, which may introduce bias. Incorporating a second independent LLM-as-judge or human auditing would strengthen the evaluation.
- For problems that originally include images, textual descriptions may omit important details. Expanding the benchmark to a multimodal setting would address this limitation.

Code & Models

Datasets

RefineBench/RefineBench
dataset· 1.7k dl


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education