Tuning LLM Judge Design Decisions for 1/1000 of the Cost
David Salinas, Omar Swelam, Frank Hutter

TL;DR
This paper introduces a systematic approach to optimizing LLM-based judges by tuning their hyperparameters with multi-objective multi-fidelity methods, achieving better accuracy-cost trade-offs and relying on open-weight models for accessible evaluation.
Contribution
It presents a novel hyperparameter tuning method for LLM judges that reduces evaluation costs and improves performance using multi-objective multi-fidelity optimization.
Findings
The judges identified by the search outperform existing benchmarks in accuracy and cost-efficiency.
The identified judges rely on open-weight models, improving accessibility and reproducibility.
Multi-fidelity methods significantly reduce evaluation costs.
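The accuracy-cost trade-off in these findings can be read as a Pareto front: a judge configuration is worth keeping only if no other configuration is at least as accurate and at least as cheap. The sketch below is illustrative only. The configuration names, accuracies, and costs are made-up placeholders, not results from the paper, and `JudgeResult`, `dominates`, and `pareto_front` are hypothetical helper names.

```python
from typing import List, NamedTuple


class JudgeResult(NamedTuple):
    name: str        # hypothetical configuration label
    accuracy: float  # agreement with human preferences, higher is better
    cost: float      # arbitrary cost units, lower is better


def dominates(a: JudgeResult, b: JudgeResult) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(results: List[JudgeResult]) -> List[JudgeResult]:
    """Keep only the configurations that no other configuration dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Placeholder numbers for illustration only (NOT taken from the paper).
results = [
    JudgeResult("small-model/short-prompt", 0.62, 1.0),
    JudgeResult("small-model/long-prompt", 0.60, 2.0),
    JudgeResult("large-model/short-prompt", 0.71, 8.0),
    JudgeResult("large-model/long-prompt", 0.70, 12.0),
]

front = pareto_front(results)
front_names = [r.name for r in front]
```

With these placeholder values, the two long-prompt configurations are dominated (each is both less accurate and more expensive than its short-prompt counterpart), so only the short-prompt configurations remain on the front.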
Abstract
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which makes it possible to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and…
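To illustrate how a multi-fidelity search can cut the cost of tuning, here is a minimal sketch of successive halving, one common multi-fidelity scheme. The abstract does not say which scheme the authors use, so this is an assumption chosen for illustration. Here "fidelity" is the number of annotated comparisons used to score a judge. The configuration names, agreement rates, and the simulated `evaluate` function are all hypothetical.

```python
import random
from typing import Callable, List, Tuple


def successive_halving(
    configs: List[str],
    evaluate: Callable[[str, int], float],
    min_budget: int,
    max_budget: int,
    eta: int = 3,
) -> Tuple[str, int]:
    """Score all configs on a small budget, keep the top 1/eta, and repeat
    with eta times more budget. Returns the best config and the total number
    of evaluated annotations (each rung re-evaluates from scratch here)."""
    survivors = list(configs)
    budget = min_budget
    spent = 0
    while True:
        scores = {c: evaluate(c, budget) for c in survivors}
        spent += budget * len(survivors)
        if budget >= max_budget or len(survivors) == 1:
            break
        keep = max(1, len(survivors) // eta)
        survivors = sorted(survivors, key=scores.get, reverse=True)[:keep]
        budget = min(budget * eta, max_budget)
    return max(survivors, key=scores.get), spent


# Hypothetical judge configurations with made-up "true" human-agreement rates.
true_rate = {f"judge-{i}": 0.50 + 0.03 * i for i in range(9)}


def evaluate(config: str, n_annotations: int) -> float:
    """Simulated noisy agreement with human labels on n_annotations examples."""
    rng = random.Random(f"{config}-{n_annotations}")  # deterministic per call
    hits = sum(rng.random() < true_rate[config] for _ in range(n_annotations))
    return hits / n_annotations


best, spent = successive_halving(list(true_rate), evaluate, min_budget=30, max_budget=270)
full_cost = len(true_rate) * 270  # cost of scoring every config at full fidelity
```

In this toy setup the search evaluates 810 annotations in total, versus 2430 for scoring all nine configurations at full fidelity. The saving grows with the number of candidates, at the risk of discarding a good configuration that happened to score poorly at low fidelity.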
Taxonomy
Topics: Dispute Resolution and Class Actions
