Does UMBRELA Work on Other LLMs?

Naghmeh Farzi; Laura Dietz

arXiv:2507.09483·cs.IR·July 15, 2025

TL;DR

This paper evaluates how well the UMBRELA LLM Judge framework generalizes across large language models, analyzing relevance assessment accuracy and leaderboard rank correlation beyond the original GPT-4o-based setup.

Contribution

It reproduces UMBRELA on multiple LLMs and measures how the choice of model affects its performance, showing that performance degrades as models get smaller.

Findings

01

UMBRELA with DeepSeek V3 performs comparably to GPT-4o.

02

Performance is slightly lower with LLaMA-3.3-70B and degrades further with smaller LLMs.

03

Relevance assessment accuracy varies with the choice of LLM, as measured by leaderboard rank correlation and per-label agreement.

Abstract

We reproduce the UMBRELA LLM Judge evaluation framework across a range of large language models (LLMs) to assess its generalizability beyond the original study. Our investigation evaluates how LLM choice affects relevance assessment accuracy, focusing on leaderboard rank correlation and per-label agreement metrics. Results demonstrate that UMBRELA with DeepSeek V3 obtains performance comparable to GPT-4o (used in the original work). For LLaMA-3.3-70B, we obtain slightly lower performance, which further degrades with smaller LLMs.
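The two kinds of metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say which specific statistics are used, so Kendall's tau (for leaderboard rank correlation) and Cohen's kappa (for agreement, overall and one-vs-rest per label) are assumptions chosen here as common representatives. The system names, scores, and labels are made-up toy data, and the 0 to 3 grading scale is likewise assumed for illustration.

```python
from collections import Counter
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two leaderboards given as {system: score}."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant if both leaderboards order it the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def per_label_kappa(labels_a, labels_b, grades):
    """One-vs-rest kappa for each grade (an assumed reading of 'per-label')."""
    return {
        g: cohen_kappa([a == g for a in labels_a], [b == g for b in labels_b])
        for g in grades
    }


# Hypothetical toy data: system scores under human vs. LLM judgments.
human_scores = {"sysA": 0.61, "sysB": 0.55, "sysC": 0.48, "sysD": 0.40}
llm_scores = {"sysA": 0.66, "sysB": 0.52, "sysC": 0.54, "sysD": 0.41}

# Hypothetical toy data: relevance grades for the same eight passages.
human_labels = [0, 0, 1, 2, 3, 1, 0, 2]
llm_labels = [0, 1, 1, 2, 3, 0, 0, 2]

tau = kendall_tau(human_scores, llm_scores)
kappa = cohen_kappa(human_labels, llm_labels)
by_grade = per_label_kappa(human_labels, llm_labels, grades=[0, 1, 2, 3])
```

Rank correlation asks whether the LLM judge orders systems the same way human judgments do, while the agreement statistics ask whether individual labels match. A judge can score well on the former while disagreeing on many individual labels, which is presumably why the abstract mentions both.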
