
TL;DR
This paper evaluates how well the UMBRELA LLM Judge framework generalizes across large language models, analyzing its relevance assessment accuracy and leaderboard ranking consistency beyond the original GPT-4o-based setup.
Contribution
It systematically reproduces and assesses UMBRELA on multiple LLMs, revealing performance variations and limitations with smaller models.
Findings
UMBRELA with DeepSeek V3 performs comparably to GPT-4o.
Performance is slightly lower with LLaMA-3.3-70B and degrades further with smaller LLMs.
Relevance assessment accuracy varies with the choice of LLM, as measured by leaderboard rank correlation and per-label agreement.
Abstract
We reproduce the UMBRELA LLM Judge evaluation framework across a range of large language models (LLMs) to assess its generalizability beyond the original study. Our investigation evaluates how LLM choice affects relevance assessment accuracy, focusing on leaderboard rank correlation and per-label agreement metrics. Results demonstrate that UMBRELA with DeepSeek V3 obtains performance comparable to GPT-4o (used in the original work). For LLaMA-3.3-70B we obtain slightly lower performance, which further degrades with smaller LLMs.
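The two kinds of metric named in the abstract can be illustrated with a minimal sketch. It assumes Kendall's tau (tau-a, no tie correction) for leaderboard rank correlation and Cohen's kappa for per-label agreement; the summary does not say which exact statistics the paper reports, so both choices are assumptions. The system names, scores, and graded labels below are hypothetical and are not results from the paper.

```python
from collections import Counter
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two leaderboards given as {system: score}."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both leaderboards order the pair the same way.
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical leaderboard scores under human and LLM relevance judgments.
human_scores = {"sysA": 0.52, "sysB": 0.47, "sysC": 0.40, "sysD": 0.31}
llm_scores = {"sysA": 0.55, "sysB": 0.44, "sysC": 0.46, "sysD": 0.30}

# Hypothetical graded relevance labels for six query-passage pairs.
human_labels = [0, 0, 1, 2, 3, 3]
llm_labels = [0, 1, 1, 2, 3, 2]

tau = kendall_tau(human_scores, llm_scores)
kappa = cohen_kappa(human_labels, llm_labels)
print(f"Kendall tau: {tau:.3f}, Cohen kappa: {kappa:.3f}")
```

A high tau with a modest kappa would mean the LLM judge ranks systems much as human assessors do even though it disagrees with them on individual labels, which is why the two metrics are reported separately.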
