Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models

Shuliang Liu; Xinze Li; Zhenghao Liu; Yukun Yan; Cheng Yang; Zheni Zeng; Zhiyuan Liu; Maosong Sun; Ge Yu

arXiv:2502.18817·cs.CL·February 27, 2025

TL;DR

This paper introduces ConsJudge, a method to improve the consistency and accuracy of LLM-based evaluations of RAG outputs, leading to better training and assessment of retrieval-augmented generation models.

Contribution

The paper proposes ConsJudge, a novel approach that enhances LLM judgment consistency for RAG evaluation, addressing prompt sensitivity issues in automated metrics.

Findings

01

ConsJudge improves judgment accuracy across various RAG models and datasets.

02

Judgments by ConsJudge align closely with those of superior LLMs.

03

ConsJudge enhances the training process of RAG models through more reliable evaluations.

Abstract

Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilizes the judge-consistency to evaluate these judgments, and selects the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge…
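
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each judgment reduces to a discrete verdict and that judge-consistency is the fraction of other judgments sharing that verdict. The dimension names, the function names, and the `chosen`/`rejected` keys are all hypothetical.

```python
from itertools import combinations

# Hypothetical judgment dimensions; the abstract does not list them.
DIMENSIONS = ["hallucination", "completeness", "coherence"]


def dimension_subsets(dimensions):
    """Return every non-empty combination of judgment dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def consistency_scores(verdicts):
    """Score each verdict by the fraction of the other verdicts that agree.

    Assumption: agreement means an identical verdict label. Other
    similarity measures could be substituted here.
    """
    n = len(verdicts)
    if n < 2:
        raise ValueError("need at least two judgments to measure consistency")
    return [
        sum(1 for j, other in enumerate(verdicts) if j != i and other == v) / (n - 1)
        for i, v in enumerate(verdicts)
    ]


def select_preference_pair(judgments):
    """Pick the most and least consistent judgments as a preference pair.

    `judgments` is a list of (judgment_text, verdict) tuples, one per
    prompt built from a dimension combination. Ties keep the earliest item.
    """
    scores = consistency_scores([verdict for _, verdict in judgments])
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return {"chosen": judgments[best][0], "rejected": judgments[worst][0]}


# Toy usage: three judgments over the same pair of RAG outputs.
example_judgments = [
    ("Output A has fewer unsupported claims.", "A"),
    ("Output A covers more of the question.", "A"),
    ("Output B reads more fluently.", "B"),
]
pair = select_preference_pair(example_judgments)
```

In this toy case the two judgments favoring output A agree with each other, so one of them becomes the accepted judgment, and the lone dissenting judgment becomes the rejected one. A pair in this shape could then be passed to a DPO training routine.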

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education