BatchEval: Towards Human-like Text Evaluation
Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, Kan Li

TL;DR
BatchEval introduces a batch-wise evaluation paradigm for automatic text assessment using large language models, significantly improving robustness, consistency, and correlation with human judgment over traditional sample-wise methods.
Contribution
The paper proposes a batch-wise evaluation framework that addresses the prompt sensitivity and poor noise resistance of sample-wise evaluation, and identifies a two-stage procedure with a heterogeneous batch composition strategy as the best-performing configuration.
Findings
Outperforms state-of-the-art methods by 10.5% in Pearson correlation.
Achieves comparable performance with only 64% of the API cost.
Demonstrates robustness and generalization across multiple tasks and models.
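The headline metric above is the Pearson correlation between evaluator scores and human ratings. As a minimal sketch of how such a correlation is computed, the function below implements the standard formula in plain Python. The function name `pearson` and the score lists are illustrative only; they are made-up values, not data or code from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five generated texts (made up for illustration).
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
evaluator_scores = [1.5, 1.8, 3.2, 3.9, 4.6]
print(pearson(evaluator_scores, human_ratings))
```

A value close to 1 means the automatic evaluator ranks and spaces the texts much as the human raters do.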
Abstract
Significant progress has been made in automatic text evaluation with the introduction of large language models (LLMs) as evaluators. However, the current sample-wise evaluation paradigm suffers from the following issues: (1) sensitivity to prompt design; (2) poor resistance to noise; (3) inferior ensemble performance with static reference. Inspired by the fact that humans treat both criterion definition and inter-sample comparison as references for evaluation, we propose BatchEval, a paradigm that conducts batch-wise evaluation iteratively to alleviate the above problems. We explore variants under this paradigm and confirm that the optimal settings are a two-stage procedure with a heterogeneous batch composition strategy and a decimal scoring format. Comprehensive experiments across 3 LLMs on 4 text evaluation tasks demonstrate that BatchEval outperforms state-of-the-art methods by 10.5% on Pearson…
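The abstract names a heterogeneous batch composition strategy but this page does not spell out how batches are formed. The sketch below shows one possible reading, under a stated assumption: "heterogeneous" is taken to mean that each batch mixes samples with different scores from a previous evaluation round. The function name `heterogeneous_batches`, its parameters, and the sort-then-deal scheme are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def heterogeneous_batches(prev_scores: Sequence[float], batch_size: int) -> List[List[int]]:
    """Group sample indices so each batch spans the range of previous-round scores.

    Assumption (not from the paper): heterogeneity is approximated by sorting
    samples by their previous score and dealing them round-robin into batches.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Indices of samples ordered from lowest to highest previous score.
    order = sorted(range(len(prev_scores)), key=lambda i: prev_scores[i])
    n_batches = -(-len(order) // batch_size)  # ceiling division
    batches: List[List[int]] = [[] for _ in range(n_batches)]
    # Dealing in rank order places neighbouring scores in different batches.
    for rank, idx in enumerate(order):
        batches[rank % n_batches].append(idx)
    return batches


# Hypothetical previous-round scores for six samples (made up for illustration).
scores = [1.0, 1.2, 3.0, 3.1, 4.8, 5.0]
print(heterogeneous_batches(scores, batch_size=3))
```

In an iterative setup, the scores produced for each batch in one round could feed the grouping for the next round. That loop, and the prompt sent to the LLM for each batch, are left out here because the page does not describe them.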
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
