Adaptive Rigor in AI System Evaluation using Temperature-Controlled Verdict Aggregation via Generalized Power Mean
Aleksandr Meshkov

TL;DR
This paper introduces TCVA, a flexible evaluation method for LLMs that adjusts strictness via a temperature parameter, aligning better with human judgments across domains.
Contribution
It proposes Temperature-Controlled Verdict Aggregation, a novel approach combining verdict scoring and generalized power mean with a tunable temperature for domain-adaptive evaluation.
Findings
TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman ρ = 0.667 vs. 0.676).
It consistently outperforms DeepEval in correlation with human judgments.
The approach requires no extra LLM calls when tuning the temperature.
Abstract
Existing evaluation methods for LLM-based AI systems, such as LLM-as-a-Judge, verdict systems, and NLI, do not always align well with human assessment because they cannot adapt their strictness to the application domain. This paper presents Temperature-Controlled Verdict Aggregation (TCVA), a method that combines a five-level verdict-scoring system with generalized power-mean aggregation and an intuitive temperature parameter T ∈ [0.1, 1.0] to control evaluation rigor. Low temperatures yield pessimistic scores suited for safety-critical domains; high temperatures produce lenient scores appropriate for conversational AI. Experimental evaluation on three benchmark datasets with human Likert-scale annotations (SummEval and USR) shows that TCVA achieves correlation with human judgments comparable to RAGAS on faithfulness (Spearman ρ = 0.667 vs. 0.676) while consistently outperforming DeepEval.…
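To make the aggregation idea concrete, here is a minimal Python sketch of temperature-controlled power-mean aggregation. It is not the paper's implementation. The abstract gives only the temperature range [0.1, 1.0], the five-level scale, and the use of a generalized power mean. The verdict labels and their scores, the linear mapping from T to the exponent p, the exponent range [-5, 5], and the clamping constant are all assumptions chosen for illustration.

```python
import math

# Hypothetical five-level verdict scale (labels and values are assumptions,
# not taken from the paper).
VERDICT_SCORES = {
    "fully": 1.0,
    "mostly": 0.75,
    "partial": 0.5,
    "minor": 0.25,
    "none": 0.0,
}

T_MIN, T_MAX = 0.1, 1.0    # temperature range stated in the abstract
P_MIN, P_MAX = -5.0, 5.0   # assumed exponent range for this sketch
EPS = 1e-6                 # assumed floor so zero scores work with p <= 0


def temperature_to_exponent(t: float) -> float:
    """Map T linearly onto the power-mean exponent p (assumed mapping)."""
    if not T_MIN <= t <= T_MAX:
        raise ValueError(f"temperature must lie in [{T_MIN}, {T_MAX}]")
    frac = (t - T_MIN) / (T_MAX - T_MIN)
    return P_MIN + frac * (P_MAX - P_MIN)


def power_mean(values: list[float], p: float) -> float:
    """Generalized power mean M_p = (mean(x_i ** p)) ** (1 / p)."""
    if not values:
        raise ValueError("need at least one score")
    xs = [max(v, EPS) for v in values]
    if abs(p) < 1e-9:
        # The limit p -> 0 is the geometric mean.
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)


def aggregate(verdicts: list[str], t: float) -> float:
    """Aggregate verdict labels into one score at temperature t."""
    scores = [VERDICT_SCORES[v] for v in verdicts]
    return power_mean(scores, temperature_to_exponent(t))


if __name__ == "__main__":
    verdicts = ["fully", "mostly", "partial"]
    for t in (0.1, 0.55, 1.0):
        print(f"T={t:.2f} -> {aggregate(verdicts, t):.3f}")
```

Because the power mean is non-decreasing in its exponent, lowering T in this sketch pulls the aggregate toward the weakest verdict, and raising T pulls it toward the strongest. The verdicts are fixed inputs, so re-aggregating at a different temperature is pure arithmetic, which is consistent with the finding that tuning T needs no extra LLM calls.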
