RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang, Yufei Wang, Tiezheng YU, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma

TL;DR
RevisEval introduces response-adapted references using LLMs to improve the reliability and accuracy of automatic text generation evaluation, outperforming traditional metrics and LLM-based judgments.
Contribution
The paper proposes a novel evaluation paradigm that adaptively revises responses to create more relevant references, enhancing the performance of LLM-based and classical evaluation metrics.
Findings
RevisEval outperforms traditional reference-free and reference-based evaluation methods.
Response-adapted references improve classical metrics like BLEU and BERTScore.
RevisEval reduces bias and enhances relevance in evaluation results.
Abstract
With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing text generation quality in a wide range of tasks. However, there still remains a reliability gap between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of references pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, and then treats the revised text as the reference (response-adapted reference) for the…
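The revise-then-compare idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `revise` callable stands in for an LLM reviser, `toy_revise` is a hypothetical rule-based placeholder, and `unigram_f1` is a simple overlap score used here in place of BLEU or BERTScore.

```python
from collections import Counter
from typing import Callable


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate and a reference (simple stand-in metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def revis_eval_score(
    instruction: str,
    response: str,
    revise: Callable[[str, str], str],
    metric: Callable[[str, str], float] = unigram_f1,
) -> float:
    """Score a response against a reference obtained by revising that same response.

    `revise` is an assumption: in practice it would call an LLM with a
    revision prompt; here any (instruction, response) -> text function works.
    """
    adapted_reference = revise(instruction, response)
    return metric(response, adapted_reference)


def toy_revise(instruction: str, response: str) -> str:
    """Hypothetical reviser that only corrects one known factual error."""
    return response.replace("Lyon", "Paris")


question = "What is the capital of France?"
good = revis_eval_score(question, "The capital of France is Paris", toy_revise)
bad = revis_eval_score(question, "The capital of France is Lyon", toy_revise)
```

Under these assumptions, a response that needs no revision scores 1.0, while a response the reviser has to correct scores lower, because the adapted reference diverges from it exactly where the errors were.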
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper motivates the problem very well by identifying the issues with current reference-based evaluation paradigms. The idea of dynamically generating contextually relevant references is creative and interesting. It aims to address important and highly relevant aspects of using LLMs as evaluators.
2. Extensive experiments have been conducted across different tasks, and various metrics have been evaluated. The authors also show the generalizability of their approach to different m
While I agree with the motivation behind the paper, I am not sure about the soundness of the methodology followed to generate the reference answers:
1. Because the response itself is used to generate an "adapted reference", the evaluation might indirectly validate the response’s content and structure. This may lead to artificially inflated evaluations, as the evaluator is essentially comparing the response against a modified version of itself, which serves as the reference.
2. If the response contains su
* Simple yet effective method. The core idea of the proposed method, RevisEval, is very simple: "rewrite" the response based on the human-written reference (and the rubric) and use it as a new reference. The method is also effective in many settings, including LLM-as-a-Judge and traditional reference-based metrics. It is easy to imagine the proposed method being used to evaluate many NLG tasks going forward.
* Good ablation studies. The paper provides a wide set of ablation studie
No major weaknesses as far as I can see. Here are some minor ones:
* Unclear names. Personally, I find "response-adapted references" very confusing. It sounds like the method adapts references based on the response, but it is actually the other way around. It is really reference-adapted responses, but I'm not sure whether that is a better way of describing it (I don't have any better ideas).
* Unclear description of the experiment settings. The main body of the paper would benefit from a bit of description about t
The proposed method is intuitive and reasonable, with a straightforward implementation that advances previous work using LLMs to generate references for evaluation. They also consider a comprehensive range of experimental setups, baseline methods, and evaluation benchmarks to verify the effectiveness of their method, resulting in solid experimental analyses.
Given that previous studies have already utilized LLMs to generate higher-quality references as replacements for traditional references (Tang et al., 2024), the innovation and contribution of this method are somewhat diminished. I believe they could further enhance the analysis by more comprehensively comparing these two approaches for generating references (generation as reference vs. revision as reference). Additionally, I suggest exploring the use of more refined response-adapted references,
Taxonomy
Topics: Artificial Intelligence in Law
