TL;DR
This paper introduces a diverse, domain-specific evaluation set for LLM-as-a-Judge benchmarking that separates models more clearly and aligns more closely with human preferences, and it releases an open-source tool for detailed analysis.
Contribution
It presents a novel data pipeline for creating diverse evaluation sets across multiple domains and languages, enhancing benchmark effectiveness for LLMs as judges.
Findings
High separability (84%) across models
84% agreement with Chatbot Arena
0.915 Spearman correlation with human judgments
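The three figures above can be illustrated with a small sketch. It assumes Arena-Hard-style definitions, which this summary does not spell out: separability as the fraction of model pairs whose confidence intervals do not overlap, agreement as the fraction of model pairs ranked in the same order by both leaderboards, and Spearman correlation computed on ranks without ties. The model names, intervals, and scores are invented toy values, not results from the paper.

```python
from itertools import combinations


def separability(intervals):
    """Fraction of model pairs whose (low, high) score intervals do not overlap."""
    pairs = list(combinations(intervals.values(), 2))
    separated = sum(
        1 for (lo_a, hi_a), (lo_b, hi_b) in pairs if hi_a < lo_b or hi_b < lo_a
    )
    return separated / len(pairs)


def pairwise_agreement(scores_a, scores_b):
    """Fraction of model pairs that both score sets order the same way."""
    pairs = list(combinations(sorted(scores_a), 2))
    agree = sum(
        1
        for m, n in pairs
        if (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n]) > 0
    )
    return agree / len(pairs)


def spearman(scores_a, scores_b):
    """Spearman rank correlation; assumes no tied scores."""
    models = sorted(scores_a)

    def ranks(scores):
        ordered = sorted(models, key=lambda m: scores[m])
        return {m: r for r, m in enumerate(ordered, start=1)}

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(models)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in models)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical toy data: benchmark win-rate intervals and point scores,
# plus made-up human-preference ratings for the same four models.
intervals = {
    "model_a": (70, 76),
    "model_b": (60, 66),
    "model_c": (64, 69),
    "model_d": (40, 48),
}
benchmark = {"model_a": 73.0, "model_b": 63.0, "model_c": 66.5, "model_d": 44.0}
human = {"model_a": 1250, "model_b": 1190, "model_c": 1180, "model_d": 1100}

print(f"separability: {separability(intervals):.3f}")
print(f"agreement:    {pairwise_agreement(benchmark, human):.3f}")
print(f"spearman:     {spearman(benchmark, human):.3f}")
```

In this toy data only the model_b and model_c intervals overlap, and the two score sets disagree on that same pair, so both separability and agreement come out at 5 of 6 pairs.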
Abstract
Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark's usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and closely align with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC \cite{dubois2024lengthcontrolledalpacaevalsimpleway} and Arena-Hard v0.1 \cite{li2024crowdsourced} are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach leverages a combination of manual curation,…
Taxonomy
Methods: ALIGN · Focus
