Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

Ikram Belmadani; Oumaima El Khettari; Pacôme Constant dit Beaufils; Richard Dufour; Benoit Favre

arXiv:2603.04033 · cs.CL · March 5, 2026

TL;DR

This paper investigates the use of large language models as automated judges for French medical open-ended question answering, highlighting the importance of domain adaptation and fine-tuning for reliable evaluation.

Contribution

It demonstrates that domain-adapted and large general-purpose LLMs align well with expert judgments, and shows that lightweight fine-tuning improves evaluation consistency and reduces generator bias.

Findings

1. Domain-adapted models achieve high agreement with experts.

2. Fine-tuning reduces sensitivity to answer generators.

3. Lightweight adaptation enables scalable evaluation in low-resource settings.
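
As a rough illustration of what "agreement with experts" and "sensitivity to answer generators" could mean in practice, the sketch below computes Cohen's kappa between judge and expert labels separately for each answer generator, then reports the spread across generators. This page does not name the agreement metric used in the paper, so kappa, the binary equivalent/not-equivalent labels, the record layout, and all function names here are assumptions made for illustration only.

```python
from collections import defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary labels (0/1)."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    # Chance agreement: both say 1, or both say 0.
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Both raters used the same single label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


def kappa_by_generator(records):
    """records: iterable of (generator_name, expert_label, judge_label).

    Hypothetical layout: one record per judged answer.
    """
    grouped = defaultdict(lambda: ([], []))
    for generator, expert, judge in records:
        grouped[generator][0].append(expert)
        grouped[generator][1].append(judge)
    return {g: cohen_kappa(e, j) for g, (e, j) in grouped.items()}


def generator_sensitivity(kappas):
    """Assumed summary: gap between the best and worst per-generator kappa."""
    return max(kappas.values()) - min(kappas.values())


if __name__ == "__main__":
    # Toy labels, not data from the paper.
    toy = [
        ("gen_a", 1, 1), ("gen_a", 0, 0), ("gen_a", 1, 1), ("gen_a", 0, 0),
        ("gen_b", 1, 1), ("gen_b", 0, 1), ("gen_b", 1, 0), ("gen_b", 0, 0),
    ]
    per_generator = kappa_by_generator(toy)
    print(per_generator)
    print(generator_sensitivity(per_generator))
```

A judge that looks reliable when labels are pooled can still show a large gap here, which is why a per-generator breakdown is a natural complement to a single overall score.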

Abstract

Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation…
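
For readers unfamiliar with GRPO, the following minimal sketch shows the group-relative advantage step that gives the method its name: several completions are sampled for the same prompt, each is scored, and the scores are standardized within that group. The reward shown (1 if the judge's verdict matches the expert label, else 0) is a hypothetical choice, since the abstract does not describe the paper's reward design, and the use of population standard deviation with a small epsilon is likewise an assumption.

```python
from statistics import mean, pstdev


def equivalence_reward(predicted_label, expert_label):
    """Hypothetical reward: 1.0 when the judge verdict matches the expert."""
    return 1.0 if predicted_label == expert_label else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so
    completions are compared only against siblings from the same prompt.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled verdicts for one (question, reference, answer) triple
    # whose expert label is 1. Toy values, not data from the paper.
    sampled_verdicts = [1, 0, 1, 1]
    rewards = [equivalence_reward(v, 1) for v in sampled_verdicts]
    print(group_relative_advantages(rewards))
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason group size and reward granularity matter when training data is limited.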

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare