Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra

TL;DR
This paper introduces FBI, a framework to evaluate the proficiency of LLMs as evaluators by testing their ability to detect quality drops in generated answers across multiple capabilities, revealing significant shortcomings.
Contribution
The paper presents a novel framework, FBI, for systematically assessing the reliability of LLMs as evaluators through targeted perturbations and comprehensive testing.
Findings
Evaluator LLMs failed to detect quality drops in over 50% of cases
Reference-based evaluations performed better than single-answer and pairwise methods
Current Evaluator LLMs are unreliable for accurate assessment of generated text
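To make the first finding concrete, here is a minimal sketch of how a "missed quality drop" rate could be computed from evaluator scores. It is an illustration under stated assumptions, not the paper's actual metric or code: the `ScoredPair` type, the `missed_drop_rate` function, the rule that a drop counts as missed when the perturbed answer scores at least as high as the original, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredPair:
    """One original/perturbed answer pair as scored by an evaluator (hypothetical type)."""
    category: str           # illustrative perturbation category label
    original_score: float   # evaluator's score for the unperturbed answer
    perturbed_score: float  # evaluator's score for the perturbed answer


def missed_drop_rate(pairs):
    """Return the fraction of pairs where the perturbed answer was NOT scored lower.

    Assumption: a quality drop counts as detected only if the perturbed
    answer receives a strictly lower score than the original.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    missed = sum(1 for p in pairs if p.perturbed_score >= p.original_score)
    return missed / len(pairs)


# Made-up scores for illustration only; these are not results from the paper.
pairs = [
    ScoredPair("factual", 9.0, 9.0),    # same score: drop missed
    ScoredPair("factual", 8.0, 5.0),    # lower score: drop detected
    ScoredPair("reasoning", 7.0, 8.0),  # higher score: drop missed
    ScoredPair("coherence", 9.0, 6.0),  # lower score: drop detected
]

rate = missed_drop_rate(pairs)
print(f"missed quality drops: {rate:.0%}")
```

With these invented scores, two of the four perturbed answers are not penalized, so the sketch reports a 50% miss rate. A pairwise or reference-based setup would need a different detection rule than the score comparison assumed here.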
Abstract
Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive…
Taxonomy
Topics: Natural Language Processing Techniques · Artificial Intelligence in Law
