xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Qingchen Yu, Zifan Zheng, Shichao Song, Zhiyu Li, Feiyu Xiong, Bo Tang, Ding Chen

TL;DR
This paper introduces xFinder, an LLM-based evaluator that targets the key answer extraction step of the evaluation pipeline, improving both extraction accuracy and overall evaluation reliability over existing methods.
Contribution
The paper proposes xFinder, an LLM-based evaluator trained on a specialized dataset, which improves answer extraction and judgment accuracy over RegEx-based extraction and fine-tuned judge models.
Findings
xFinder achieves 93.42% extraction accuracy with only 500M parameters.
xFinder's judgment accuracy reaches 97.61%, surpassing existing frameworks.
Improving answer extraction enhances overall evaluation reliability.
Abstract
The continuous advancement of large language models (LLMs) has brought increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. Particularly, the emergence of cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. As evaluation frameworks commonly use Regular Expression (RegEx) for answer extraction, models may adjust their responses to fit formats easily handled by RegEx. Nevertheless, the key answer extraction module based on RegEx frequently suffers from extraction errors. Furthermore, recent studies proposing fine-tuned LLMs as judge models for automated evaluation face challenges in terms of generalization ability and fairness. This paper comprehensively analyzes the entire LLM evaluation chain and demonstrates that optimizing the key…
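To make the RegEx failure mode described in the abstract concrete, here is a minimal sketch in Python. The pattern, the sample responses, and the helper names (`regex_extract`, `extraction_accuracy`) are all hypothetical and written for illustration; they are not taken from the paper or from any specific evaluation framework.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rule for multiple-choice answers (an assumption for this
# sketch, not the pattern of any real evaluation framework).
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?")


def regex_extract(response: str) -> Optional[str]:
    """Return the first option letter matched by ANSWER_PATTERN, or None."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else None


# Made-up (response, gold key answer) pairs covering three cases:
# a format the rule expects, a format it misses, and a misleading match.
samples: List[Tuple[str, str]] = [
    ("The answer is (B).", "B"),
    ("After comparing the options, I would go with B.", "B"),
    ("I first thought the answer is A, but on reflection it is C.", "C"),
]


def extraction_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of responses whose extracted answer equals the gold key answer."""
    hits = sum(regex_extract(response) == gold for response, gold in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    for response, gold in samples:
        print(f"gold={gold} extracted={regex_extract(response)!r}")
    print(f"extraction accuracy: {extraction_accuracy(samples):.2f}")
```

Only the first response follows the format the rule expects. The second yields no extraction at all, and the third yields a confident but wrong letter, so any judgment built on the extracted answer is wrong even though the model's final answer is correct.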
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Sparse Evolutionary Training
