Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge
Fiona Lau

TL;DR
This paper systematically evaluates how consistently several large language models assign scores when used as automated judges. It finds significant variability that depends on the model and the temperature setting, which affects reliability in enterprise applications.
Contribution
It provides a comprehensive analysis of scoring stability across multiple LLMs and settings, highlighting the need for monitoring and robust evaluation strategies in practical deployments.
Findings
Substantial variability in scores across models and runs.
Lower temperatures can improve stability for some models.
Systematic differences in scoring styles lead to divergent ratings.
Abstract
Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models, GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5, two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model's scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM's output. Despite…
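To make the first two questions concrete, here is a minimal Python sketch of one way to quantify them. It is not the paper's method: the metrics (mean per-item standard deviation across repeated runs, and the range of per-model mean scores on one item) are assumptions chosen for illustration. The judge names, item IDs, and scores are hypothetical toy values, not results from the study.

```python
from statistics import mean, pstdev

# Hypothetical toy data (NOT from the paper):
# scores[model][item] holds the scores from repeated runs on the same input.
scores = {
    "judge_a": {"q1": [4, 4, 5], "q2": [2, 2, 2]},
    "judge_b": {"q1": [3, 3, 3], "q2": [2, 3, 1]},
}

def within_model_spread(runs_by_item):
    """Average, over items, of the population std dev of repeated-run scores.

    0.0 means the judge returned the same score on every run for every item.
    """
    return mean(pstdev(runs) for runs in runs_by_item.values())

def between_model_gap(all_scores, item):
    """Range (max - min) of per-model mean scores on a single item."""
    per_model_means = [mean(by_item[item]) for by_item in all_scores.values()]
    return max(per_model_means) - min(per_model_means)

if __name__ == "__main__":
    for model, runs_by_item in scores.items():
        print(model, "within-model spread:", round(within_model_spread(runs_by_item), 3))
    for item in ("q1", "q2"):
        print(item, "between-model gap:", round(between_model_gap(scores, item), 3))
```

The third question could be explored the same way by collecting one such `scores` table per temperature setting and comparing the within-model spread between them.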
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education
