Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Shuliang Liu, Xinze Li, Zhenghao Liu, Yukun Yan, Cheng Yang, Zheni Zeng, Zhiyuan Liu, Maosong Sun, Ge Yu

TL;DR
This paper introduces ConsJudge, a method to improve the consistency and accuracy of LLM-based evaluations of RAG outputs, leading to better training and assessment of retrieval-augmented generation models.
Contribution
The paper proposes ConsJudge, a novel approach that enhances LLM judgment consistency for RAG evaluation, addressing prompt sensitivity issues in automated metrics.
Findings
ConsJudge improves judgment accuracy across various RAG models and datasets.
Judgments by ConsJudge align closely with those of superior LLMs.
ConsJudge enhances the training process of RAG models through more reliable evaluations.
Abstract
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilizes the judge-consistency to evaluate these judgments, and selects the accepted and rejected judgments for Direct Preference Optimization (DPO) training. Our experiments show that ConsJudge…
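The selection step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the dimension names in `DIMENSIONS`, the helper names, the judgment record layout, and in particular the consistency measure, which is taken to be simple agreement between final verdicts. The abstract does not say how judge-consistency is actually computed.

```python
from itertools import combinations

# Hypothetical judgment dimensions; the abstract does not list them.
DIMENSIONS = ["hallucination", "completeness", "coherence", "semantic_consistency"]


def dimension_combinations(dimensions, min_size=1):
    """Enumerate every subset of dimensions with at least min_size members."""
    combos = []
    for size in range(min_size, len(dimensions) + 1):
        combos.extend(combinations(dimensions, size))
    return combos


def consistency_scores(verdicts):
    """Score each verdict by the fraction of the other verdicts that agree with it.

    Assumption: agreement on the final verdict stands in for judge-consistency.
    """
    n = len(verdicts)
    if n < 2:
        raise ValueError("need at least two judgments to measure consistency")
    return [
        sum(1 for j, other in enumerate(verdicts) if j != i and other == v) / (n - 1)
        for i, v in enumerate(verdicts)
    ]


def select_preference_pair(judgments):
    """Pick the most and least consistent judgments as a DPO-style pair.

    judgments: list of dicts with keys "dimensions", "text", "verdict".
    Returns None when every judgment is equally consistent (no training signal).
    """
    scores = consistency_scores([j["verdict"] for j in judgments])
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        return None
    return {"chosen": judgments[best]["text"], "rejected": judgments[worst]["text"]}


# Toy judgments, one per prompt built from a different dimension combination.
example_judgments = [
    {"dimensions": ("hallucination",), "text": "judgment 1", "verdict": "A"},
    {"dimensions": ("hallucination", "coherence"), "text": "judgment 2", "verdict": "A"},
    {"dimensions": ("completeness",), "text": "judgment 3", "verdict": "B"},
]
pair = select_preference_pair(example_judgments)
```

In this toy run, the two judgments that prefer answer "A" agree with each other, so one of them becomes the accepted sample and the lone "B" judgment becomes the rejected one. A real pipeline would obtain the judgment texts from an LLM prompted once per dimension combination, and would likely need a softer similarity measure than exact verdict matching.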
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
