Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

Sher Badshah; Hassan Sajjad

arXiv:2408.09235 · cs.CL · November 12, 2025 · 2 cites

TL;DR

This paper introduces a reference-guided verdict method using multiple LLMs as judges to improve the automatic evaluation of open-ended question-answering tasks, showing strong correlation with human judgments.

Contribution

It presents a novel multi-LLM judging framework that enhances evaluation reliability for free-form QA, surpassing traditional metrics in capturing semantic depth.

Findings

01

The method's verdicts correlate strongly with human judgments.

02

Combining multiple LLMs yields more reliable assessments.

03

Method outperforms conventional metrics in open-ended tasks.

Abstract

The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.
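The contrast the abstract draws between EM/F1 and a panel of LLM judges can be illustrated with a minimal Python sketch. It is not the authors' implementation. The normalization rules, the `Judge` callable signature, the majority-vote aggregation, and the three stub judges are all assumptions made for illustration; the abstract says only that multiple LLMs are combined, not how. In practice each judge would prompt an LLM with the question, the candidate answer, and the reference answer.

```python
import re
import string
from collections import Counter
from typing import Callable, Sequence


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(candidate: str, reference: str) -> bool:
    """EM: true only if the normalized token sequences are identical."""
    return normalize(candidate) == normalize(reference)


def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between the candidate and the reference answer."""
    cand, ref = normalize(candidate), normalize(reference)
    if not cand or not ref:
        return float(cand == ref)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical judge interface: (question, candidate, reference) -> verdict.
Judge = Callable[[str, str, str], bool]


def majority_verdict(judges: Sequence[Judge], question: str,
                     candidate: str, reference: str) -> bool:
    """Assumed aggregation: correct if more than half of the judges say so."""
    votes = [judge(question, candidate, reference) for judge in judges]
    return sum(votes) > len(votes) / 2


# Stub judges standing in for LLM calls (assumptions, for illustration only).
def strict_judge(question: str, candidate: str, reference: str) -> bool:
    return exact_match(candidate, reference)


def surname_judge(question: str, candidate: str, reference: str) -> bool:
    return normalize(reference)[-1] in normalize(candidate)


def lenient_judge(question: str, candidate: str, reference: str) -> bool:
    return token_f1(candidate, reference) > 0


question = "Who wrote Hamlet?"
reference = "William Shakespeare"
candidate = "It was written by Shakespeare."

em = exact_match(candidate, reference)
f1 = token_f1(candidate, reference)
verdict = majority_verdict(
    [strict_judge, surname_judge, lenient_judge], question, candidate, reference
)
print(em, round(f1, 3), verdict)
```

In this toy case the candidate is a correct free-form answer, yet EM rejects it and token F1 is low because of the extra words. Two of the three stub judges accept it, so the majority verdict is "correct", which is the kind of gap between surface-overlap metrics and semantic judgment that the abstract describes.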

Taxonomy

Topics: Artificial Intelligence in Law · Comparative and International Law Studies · Legal Education and Practice Innovations