The Necessity of Setting Temperature in LLM-as-a-Judge
Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State

TL;DR
This paper investigates how temperature settings affect the performance of LLMs used as judges in evaluating text quality, revealing that temperature choice significantly impacts evaluation outcomes.
Contribution
It provides a systematic analysis of temperature effects on LLM judge performance using controlled experiments and causal inference methods.
Findings
Temperature significantly influences LLM judge behavior.
Lower temperatures do not always lead to better evaluation accuracy.
Task-dependent effects of temperature are observed in LLM judging performance.
Abstract
LLM-as-a-Judge has emerged as an effective and low-cost paradigm for evaluating text quality and factual correctness. Prior studies have shown substantial agreement between LLM judges and human experts, even on tasks that are difficult to assess automatically. In practice, researchers commonly use a fixed temperature during evaluation, most often 0.1 or 1.0, a convention that is largely empirical rather than principled. However, recent research suggests that LLM performance is sensitive to temperature in non-trivial ways, that lower temperatures do not universally yield optimal outcomes, and that such effects are highly task-dependent. This raises a critical research question: does temperature influence judge performance in LLM-centric evaluation? To address this, we systematically investigate the relationship…
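For readers unfamiliar with the parameter, the sketch below shows the standard definition of sampling temperature: logits are divided by the temperature before the softmax, so a low value concentrates probability on the top-scoring token and a value of 1.0 leaves the distribution unchanged. It is an illustration only, not the paper's experimental setup; the function name `softmax_with_temperature` and the verdict logits are hypothetical values chosen for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for logits rescaled by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits a judge might assign to the verdict tokens "A", "B", "tie".
verdict_logits = [2.0, 1.0, 0.5]

# Compare the two temperature values the abstract names as most common.
for t in (0.1, 1.0):
    probs = softmax_with_temperature(verdict_logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these made-up logits, the judge's verdict is nearly deterministic at 0.1 and noticeably more varied at 1.0, which is why the choice can change repeated-run agreement.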
