CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
Asaf Yehudai, Lilach Eden, Yotam Perlitz, Roy Bar-Haim, Michal Shmueli-Scheuer

TL;DR
CLEAR is an open-source tool that enhances LLM evaluation by providing detailed, instance-level error analysis and visualizations, helping users understand specific model weaknesses beyond simple scores.
Contribution
We introduce CLEAR, a novel interactive package that offers detailed error analysis and visualization for LLM evaluation, addressing the lack of interpretability in current scoring methods.
Findings
CLEAR effectively identifies and visualizes specific error types.
It facilitates detailed analysis of model performance on RAG and math benchmarks.
User case studies demonstrate improved understanding of model behaviors.
Abstract
The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then creates a set of system-level error issues, and finally quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a…
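The last two steps described in the abstract, quantifying how prevalent each issue is and drilling down to the instances that exhibit it, can be illustrated with a minimal sketch. This is not CLEAR's actual API: the function names, the `instance_issues` mapping, and the example issue labels are all hypothetical, and the sketch assumes an LLM judge has already assigned issue labels to each instance.

```python
from collections import Counter


def issue_prevalence(instance_issues):
    """Return {issue: fraction of instances flagged with that issue}.

    `instance_issues` maps an instance id to the list of issue labels
    assigned to it (hypothetical structure, assumed for illustration).
    """
    n = len(instance_issues)
    # Count each issue at most once per instance.
    counts = Counter(
        issue for issues in instance_issues.values() for issue in set(issues)
    )
    return {issue: count / n for issue, count in counts.items()}


def instances_with_issue(instance_issues, issue):
    """Drill down: return the sorted ids of instances flagged with `issue`."""
    return sorted(i for i, issues in instance_issues.items() if issue in issues)


# Hypothetical judge output for four instances.
instance_issues = {
    "q1": ["missing citation"],
    "q2": ["arithmetic error", "missing citation"],
    "q3": [],
    "q4": ["arithmetic error"],
}

prevalence = issue_prevalence(instance_issues)
flagged = instances_with_issue(instance_issues, "arithmetic error")
```

Counting an issue once per instance keeps prevalence interpretable as "the share of instances affected", which is the quantity an aggregate view would typically plot.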
Taxonomy
Topics: Natural Language Processing Techniques
