Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge

Fiona Lau

arXiv:2603.04417·cs.CL·March 6, 2026

Open Access

TL;DR

This paper systematically evaluates how consistently five large language models assign numerical scores when used as automated judges. It finds significant variability across models and temperature settings, which affects their reliability in enterprise applications.

Contribution

It provides a comprehensive analysis of scoring stability across multiple LLMs and settings, highlighting the need for monitoring and robust evaluation strategies in practical deployments.

Findings

01

Substantial variability in scores across models and runs.

02

Lower temperatures can improve stability for some models.

03

Systematic differences in scoring styles lead to divergent ratings.

Abstract

Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models, GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5, two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model's scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM's output. Despite…
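The three questions in the abstract can be made concrete with a small sketch. The abstract does not state which statistics the study uses, so the metrics below (per-item standard deviation across runs, the share of items scored identically on every run, and the mean absolute gap between two judges) are assumptions chosen for illustration. The judge names and score values are hypothetical toy data, not results from the paper.

```python
from statistics import mean, pstdev

# Hypothetical toy data (NOT from the paper): scores[judge][item] holds the
# scores one judge gave the same question-answer pair over repeated runs.
scores = {
    "judge_a": {"q1": [4, 4, 4], "q2": [3, 4, 3]},
    "judge_b": {"q1": [5, 5, 5], "q2": [2, 4, 3]},
}


def run_to_run_spread(item_scores):
    """Mean per-item population standard deviation across repeated runs."""
    return mean(pstdev(runs) for runs in item_scores.values())


def exact_repeat_rate(item_scores):
    """Fraction of items that received the identical score on every run."""
    return mean(
        1.0 if len(set(runs)) == 1 else 0.0 for runs in item_scores.values()
    )


def mean_gap_between(judge_x, judge_y):
    """Mean absolute difference between two judges' per-item average scores."""
    shared_items = judge_x.keys() & judge_y.keys()
    return mean(
        abs(mean(judge_x[item]) - mean(judge_y[item])) for item in shared_items
    )


if __name__ == "__main__":
    for name, item_scores in scores.items():
        print(name, run_to_run_spread(item_scores), exact_repeat_rate(item_scores))
    print("gap", mean_gap_between(scores["judge_a"], scores["judge_b"]))
```

To probe the temperature question under the same assumptions, one could collect a separate `scores` table at each temperature setting and compare `run_to_run_spread` between them for each judge.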

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education