Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
Sher Badshah, Hassan Sajjad

TL;DR
This paper introduces a reference-guided verdict method using multiple LLMs as judges to improve the automatic evaluation of open-ended question-answering tasks, showing strong correlation with human judgments.
Contribution
It presents a novel multi-LLM judging framework that enhances evaluation reliability for free-form QA, surpassing traditional metrics in capturing semantic depth.
Findings
Verdicts from the proposed method correlate strongly with human judgments.
Combining multiple LLMs yields more reliable assessments than relying on a single judge.
The method outperforms conventional metrics such as EM and F1 on open-ended tasks.
Abstract
The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as exact match (EM) and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.
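The contrast the abstract draws can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the normalization rule, and the use of a simple majority vote to combine judge verdicts are all assumptions, since the abstract says only that multiple models are combined. The judge verdicts are passed in as plain booleans; no real LLM is called.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(prediction, reference):
    """EM: true only if the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Token-level F1 between a prediction and a reference answer."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def majority_verdict(verdicts):
    """Hypothetical aggregation rule: the answer counts as correct
    when more judges vote True than False (an assumption, not
    something the abstract specifies)."""
    votes = Counter(verdicts)
    return votes[True] > votes[False]


# A correct but verbose free-form answer is penalized by lexical metrics.
prediction = "The capital of France is Paris."
reference = "Paris"
print(exact_match(prediction, reference))   # False
print(round(token_f1(prediction, reference), 3))  # 0.286

# Placeholder verdicts standing in for three LLM judges that each saw
# the question, the reference answer, and the candidate answer.
print(majority_verdict([True, True, False]))  # True
```

In this example a correct answer fails EM and scores low on F1 because it contains extra words, while judges that compare it against the reference can still agree it is correct.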
Taxonomy
Topics: Artificial Intelligence in Law · Comparative and International Law Studies · Legal Education and Practice Innovations
