JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment
Hossein A. Rahmani, Emine Yilmaz, Nick Craswell, Bhaskar Mitra

TL;DR
JudgeBlender is a framework that combines multiple smaller open-source models and prompts to generate reliable relevance judgments, reducing reliance on expensive large LLMs and improving efficiency in search system evaluation.
Contribution
It introduces a novel ensembling approach using open-source models and prompts for relevance assessment, challenging the need for large LLMs like GPT-4.
Findings
JudgeBlender achieves competitive performance on the LLMJudge benchmark.
Ensembling multiple models or prompts improves relevance judgment reliability.
Smaller models can effectively replace large LLMs for evaluation tasks.
Abstract
The effective training and evaluation of retrieval systems require a substantial amount of relevance judgments, which are traditionally collected from human assessors -- a process that is both costly and time-consuming. Large Language Models (LLMs) have shown promise in generating relevance labels for search tasks, offering a potential alternative to manual assessments. Current approaches often rely on a single LLM, such as GPT-4, which, despite being effective, is expensive and prone to intra-model biases that can favour systems leveraging similar models. In this work, we introduce JudgeBlender, a framework that employs smaller, open-source models to provide relevance judgments by combining evaluations across multiple LLMs (LLMBlender) or multiple prompts (PromptBlender). By leveraging the LLMJudge benchmark [18], we compare JudgeBlender with state-of-the-art methods and the top…
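The abstract describes combining judgments from several models or prompts but, as excerpted here, does not say how they are aggregated. The following is a minimal sketch of two common aggregation rules (majority voting and rounded averaging), not the authors' implementation. The function names, the graded integer label scale, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Return the most frequent label.

    Assumption: ties are broken toward the lowest (most conservative) label.
    """
    if not labels:
        raise ValueError("need at least one judgment")
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


def average_vote(labels: Sequence[int]) -> int:
    """Return the mean label, rounded half up to an integer grade.

    Assumption: labels are non-negative integers on a graded scale.
    """
    if not labels:
        raise ValueError("need at least one judgment")
    return int(sum(labels) / len(labels) + 0.5)


# Hypothetical judgments for one query-passage pair. The keys could be
# different models (LLMBlender-style) or different prompts
# (PromptBlender-style); the judge names are placeholders.
judgments = {"judge_a": 2, "judge_b": 3, "judge_c": 2}
labels = list(judgments.values())

blended_majority = majority_vote(labels)
blended_average = average_vote(labels)
```

Majority voting keeps the output on the original label scale and is robust to a single outlying judge. Averaging uses the magnitude of each grade, so one strongly dissenting judge can shift the result.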
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Adam · Layer Normalization · Softmax
