RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction
Jianhao Yan, Yun Luo, Yue Zhang

TL;DR
RefuteBench 2.0 is a benchmark that uses LLM agents as refuters and evaluators to assess how well LLMs incorporate and retain refutation feedback in multi-turn interactions.
Contribution
It extends the original RefuteBench by integrating LLM-based refuters and evaluators, enabling a more flexible and comprehensive assessment of how models respond to refutation.
Findings
LLM-based refuters produce more human-like refutations
Evaluators show high correlation with human scoring
Models can satisfy a refutation when it is given but fail to retain it over long dialogues
Abstract
In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter could generate more human-like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly,…
Taxonomy
Topics: Asian Geopolitics and Ethnography · Nuclear Materials and Properties · Nuclear reactor physics and engineering
Methods: Softmax · Attention Is All You Need
