xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Qingchen Yu; Zifan Zheng; Shichao Song; Zhiyu Li; Feiyu Xiong; Bo Tang; Ding Chen

arXiv:2405.11874 · cs.CL · February 26, 2025 · 2 cites

TL;DR

This paper introduces xFinder, an LLM-based evaluator that strengthens the key answer extraction module of the evaluation pipeline, improving both extraction accuracy and evaluation reliability over existing methods.

Contribution

The paper proposes xFinder, a new LLM-based evaluation framework with a specialized dataset, significantly improving answer extraction and judgment accuracy over traditional RegEx and judge models.

Findings

01. xFinder achieves 93.42% extraction accuracy with only 500M parameters.

02. xFinder's judgment accuracy reaches 97.61%, surpassing existing frameworks.

03. Improving answer extraction enhances overall evaluation reliability.

Abstract

The continuous advancement of large language models (LLMs) has brought increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. Particularly, the emergence of cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. As evaluation frameworks commonly use Regular Expression (RegEx) for answer extraction, models may adjust their responses to fit formats easily handled by RegEx. Nevertheless, the key answer extraction module based on RegEx frequently suffers from extraction errors. Furthermore, recent studies proposing fine-tuned LLMs as judge models for automated evaluation face challenges in terms of generalization ability and fairness. This paper comprehensively analyzes the entire LLM evaluation chain and demonstrates that optimizing the key…
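
The RegEx extraction errors the abstract refers to can be illustrated with a minimal sketch. The pattern and the `extract_choice` helper below are hypothetical, written only for this illustration; they are not taken from xFinder or from any particular evaluation framework.

```python
import re
from typing import Optional

# Hypothetical pattern of the kind a RegEx-based extractor might use for
# multiple-choice answers. It is an assumption for illustration only.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter found by the pattern, or None if nothing matches."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


if __name__ == "__main__":
    # All three responses are meant to select option B.
    responses = [
        "The answer is (B).",                        # fits the expected format
        "I would choose option B because it fits.",  # correct, but no match
        "The answer is definitely B.",               # matches the 'd' in 'definitely'
    ]
    for text in responses:
        print(repr(text), "->", extract_choice(text))
```

Only the first response is extracted correctly. The second is missed because it does not use the expected phrasing, and the third yields a wrong letter because the case-insensitive pattern matches the first character of "definitely". A model that adapts its output to the first format is scored correctly, which is the format overfitting concern raised above.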

Code & Models

Repositories

iaar-shanghai/xfinder
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: Sparse Evolutionary Training