YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D'Souza, Hamed Babaei Giglou, Quentin Münch

TL;DR
YESciEval is a framework that enhances the robustness and reliability of LLM-based evaluation for scientific question answering by combining rubric-based assessment with reinforcement learning, enabling scalable and transparent evaluation.
Contribution
It introduces a novel evaluation framework that reduces bias and improves the reliability of LLMs as judges in scientific QA, independent of proprietary models and human feedback.
Findings
YESciEval improves evaluation consistency across models.
The framework reduces optimism bias in LLM evaluators.
It supports scalable, cost-free scientific QA assessment.
Abstract
Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry.
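To make the idea of rubric-based scoring and optimism bias concrete, here is a minimal sketch. It is not the authors' implementation: the criterion names, the 1-5 rating scale, and the helper names `overall_score` and `optimism_gap` are all assumptions made for illustration. It assumes a judge has already produced per-criterion ratings, and it compares scores on ordinary answers with scores on deliberately degraded (adversarial) ones.

```python
from statistics import mean

# Hypothetical rubric: these criterion names are illustrative,
# not taken from the paper.
RUBRIC = ("relevancy", "correctness", "completeness")


def overall_score(ratings):
    """Average per-criterion ratings (assumed 1-5 scale) into one score."""
    for criterion in RUBRIC:
        if criterion not in ratings:
            raise ValueError(f"missing rubric criterion: {criterion}")
        if not 1 <= ratings[criterion] <= 5:
            raise ValueError(f"rating out of range for: {criterion}")
    return mean(ratings[criterion] for criterion in RUBRIC)


def optimism_gap(benign_ratings, adversarial_ratings):
    """Mean score on benign answers minus mean score on adversarial ones.

    A gap close to zero would suggest the judge is not penalizing
    degraded answers, i.e. it is scoring optimistically.
    """
    benign = mean(overall_score(r) for r in benign_ratings)
    adversarial = mean(overall_score(r) for r in adversarial_ratings)
    return benign - adversarial


if __name__ == "__main__":
    benign = [{"relevancy": 5, "correctness": 4, "completeness": 5}]
    adversarial = [{"relevancy": 2, "correctness": 1, "completeness": 3}]
    print(round(optimism_gap(benign, adversarial), 2))
```

In this sketch a robust judge would show a clearly positive gap, while an optimistic judge would rate the adversarial answers nearly as highly as the benign ones.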
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Advanced Graph Neural Networks
