RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

Jianhao Yan; Yun Luo; Yue Zhang

arXiv:2502.18308·cs.CL·February 26, 2025

TL;DR

RefuteBench 2.0 is a benchmark that uses LLM agents as refuters and evaluators to assess how well LLMs incorporate and retain refutation feedback in multi-turn interactions.

Contribution

The paper extends the original RefuteBench by adding LLM-based refuters and evaluators, which allows a more flexible and comprehensive assessment of how models respond to refutation.

Findings

01. LLM-based refuters produce more human-like refutations.
02. Evaluators show high correlation with human scoring.
03. Models struggle to memorize and utilize refutation information over long dialogues.

Abstract

In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter could generate more human-like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly,…
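
The distinction between transient and persistent refutation instructions can be illustrated with a small toy sketch. This is not the authors' implementation: the `Refutation` class, the `active_refutations` and `score_dialogue` functions, and the string-based `check` callables are all hypothetical stand-ins (in the benchmark, refuting and scoring are done by LLM agents). The sketch assumes a transient instruction applies only to the response in the turn where it is issued, while a persistent one stays valid for every later turn.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Refutation:
    """Hypothetical container for one piece of refutation feedback."""
    issued_at: int                # turn index at which the refuter gives the feedback
    persistent: bool              # True: valid for all later turns; False: this turn only
    check: Callable[[str], bool]  # toy stand-in for an LLM-based evaluator


def active_refutations(refutations: List[Refutation], turn: int) -> List[Refutation]:
    """Return the refutations a response at `turn` is expected to respect."""
    return [
        r for r in refutations
        if r.issued_at == turn or (r.persistent and r.issued_at < turn)
    ]


def score_dialogue(responses: List[str], refutations: List[Refutation]) -> List[float]:
    """Score each turn as the fraction of active refutations it satisfies."""
    scores = []
    for turn, response in enumerate(responses):
        active = active_refutations(refutations, turn)
        if not active:
            scores.append(1.0)  # nothing to satisfy in this turn
        else:
            scores.append(sum(r.check(response) for r in active) / len(active))
    return scores


# Toy dialogue: one persistent and one transient instruction (both assumptions).
refutations = [
    Refutation(issued_at=0, persistent=True, check=lambda s: "colour" in s),
    Refutation(issued_at=1, persistent=False, check=lambda s: s.endswith("!")),
]
responses = [
    "The colour is red.",    # satisfies the persistent instruction
    "The colour is blue!",   # satisfies both active instructions
    "The color is green.",   # forgets the persistent instruction
]
scores = score_dialogue(responses, refutations)
```

In this toy run the model complies in the turns where feedback is given but drops the persistent instruction in the last turn, which is the kind of "satisfy but fail to memorize" pattern the abstract describes.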
