Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge

Ravi Raju; Swayambhoo Jain; Bo Li; Jonathan Li; Urmish Thakker

arXiv:2408.08808·cs.LG·August 21, 2024

TL;DR

This paper introduces a data pipeline that curates diverse, domain-specific evaluation sets for LLM-as-a-judge frameworks. The resulting evaluation set separates models more clearly and aligns more closely with human preferences, and the authors provide an open-source tool for detailed analysis.

Contribution

It presents a novel data pipeline for creating diverse evaluation sets across multiple domains and languages, enhancing benchmark effectiveness for LLMs as judges.

Findings

01 High separability (84%) across models

02 84% agreement with Chatbot Arena

03 0.915 Spearman correlation with human judgments
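
The three metrics above can be illustrated with a minimal Python sketch. It is not the authors' implementation: this summary does not give the paper's exact definitions, so the sketch assumes that separability is the fraction of model pairs whose score confidence intervals do not overlap, that agreement is the fraction of model pairs ordered the same way by the benchmark and by a reference ranking, and that the Spearman coefficient is computed on untied scores. All model names, intervals, and scores are hypothetical.

```python
from itertools import combinations


def separability(intervals):
    """Fraction of model pairs whose score confidence intervals do not overlap.

    `intervals` maps a model name to a (low, high) tuple. This definition is
    an assumption made for illustration.
    """
    pairs = list(combinations(intervals, 2))
    separated = sum(
        1
        for a, b in pairs
        if intervals[a][1] < intervals[b][0] or intervals[b][1] < intervals[a][0]
    )
    return separated / len(pairs)


def pairwise_agreement(bench, reference):
    """Fraction of model pairs that both score sets order the same way."""
    pairs = list(combinations(bench, 2))
    same = sum(
        1
        for a, b in pairs
        if (bench[a] - bench[b]) * (reference[a] - reference[b]) > 0
    )
    return same / len(pairs)


def spearman(bench, reference):
    """Spearman rank correlation, assuming no tied scores."""
    models = list(bench)
    n = len(models)

    def ranks(scores):
        # Rank 1 goes to the highest-scoring model.
        ordered = sorted(models, key=lambda m: scores[m], reverse=True)
        return {m: i + 1 for i, m in enumerate(ordered)}

    rb, rr = ranks(bench), ranks(reference)
    d2 = sum((rb[m] - rr[m]) ** 2 for m in models)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical data: benchmark score intervals, benchmark point scores,
# and reference ratings (for example, arena-style ratings).
intervals = {
    "model_a": (70, 76),
    "model_b": (60, 66),
    "model_c": (64, 69),
    "model_d": (40, 48),
}
bench = {"model_a": 73.0, "model_b": 63.0, "model_c": 66.5, "model_d": 44.0}
reference = {"model_a": 1250, "model_b": 1180, "model_c": 1170, "model_d": 1050}

print(f"separability: {separability(intervals):.3f}")
print(f"agreement:    {pairwise_agreement(bench, reference):.3f}")
print(f"spearman:     {spearman(bench, reference):.3f}")
```

In this toy data only `model_b` and `model_c` have overlapping intervals, and they are also the only pair ranked differently by the two score sets, so five of the six pairs are separated and five of the six are in agreement.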

Abstract

Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark's usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and closely align with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC \cite{dubois2024lengthcontrolledalpacaevalsimpleway} and Arena-Hard v0.1 \cite{li2024crowdsourced} are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach leverages a combination of manual curation,…

Taxonomy

Methods: ALIGN · Focus