Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre

TL;DR
This paper investigates the use of large language models as automated judges for French medical open-ended question answering, highlighting the importance of domain adaptation and fine-tuning for reliable evaluation.
Contribution
It demonstrates that domain-adapted and large general-purpose LLMs align well with expert judgments, and shows that lightweight fine-tuning improves evaluation consistency and reduces generator bias.
Findings
Domain-adapted models achieve high agreement with experts.
Fine-tuning reduces sensitivity to answer generators.
Lightweight adaptation enables scalable evaluation in low-resource settings.
Abstract
Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation…
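The generator sensitivity described in the abstract can be illustrated by computing judge-expert agreement separately for each answer generator. The sketch below is a minimal illustration, not the paper's protocol: it assumes binary equivalence labels and uses Cohen's kappa as the agreement statistic, which the abstract does not specify. The generator names (`gen_a`, `gen_b`) and the label values are hypothetical.

```python
from collections import defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_generator(records):
    """Group (generator, expert_label, judge_label) rows and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for generator, expert, judge in records:
        grouped[generator][0].append(expert)
        grouped[generator][1].append(judge)
    return {g: cohen_kappa(e, j) for g, (e, j) in grouped.items()}


# Hypothetical labels: 1 = answer judged equivalent to the reference, 0 = not.
records = [
    ("gen_a", 1, 1), ("gen_a", 0, 0), ("gen_a", 1, 1), ("gen_a", 0, 0),
    ("gen_b", 1, 0), ("gen_b", 0, 0), ("gen_b", 1, 1), ("gen_b", 0, 1),
]
scores = agreement_by_generator(records)
for generator, kappa in sorted(scores.items()):
    print(f"{generator}: kappa = {kappa:.2f}")
```

Reporting one score per generator, rather than a single pooled score, is what makes a gap like the one between `gen_a` and `gen_b` visible in this toy data.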
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
