Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

Sumanth Doddapaneni; Mohammed Safi Ur Rahman Khan; Sshubam Verma; Mitesh M. Khapra

arXiv:2406.13439 · cs.CL · November 27, 2024

TL;DR

This paper introduces FBI, a framework to evaluate the proficiency of LLMs as evaluators by testing their ability to detect quality drops in generated answers across multiple capabilities, revealing significant shortcomings.

Contribution

The paper presents FBI, a framework for systematically assessing the reliability of LLMs as evaluators: it introduces targeted perturbations into LLM-generated answers and tests whether the evaluator detects the resulting quality drop.

Findings

01 Evaluator LLMs failed to detect quality drops in over 50% of cases

02 Reference-based evaluations performed better than single-answer and pairwise methods

03 Current Evaluator LLMs are unreliable for accurate assessment of generated text

Abstract

Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive…
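The perturbation test described in the abstract can be illustrated with a minimal sketch. It assumes a single-answer scoring setup in which a perturbation counts as detected only when the evaluator gives the perturbed answer a strictly lower score than the original. The names `PerturbedPair`, `detection_rates`, and `toy_length_scorer`, the category labels, and the example answers are all hypothetical and are not taken from the paper or its repository; the toy scorer stands in for a real Evaluator LLM call.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class PerturbedPair:
    """One original answer and its perturbed counterpart (hypothetical schema)."""
    category: str
    question: str
    original: str
    perturbed: str


# A scorer maps (question, answer) to a numeric quality score.
Scorer = Callable[[str, str], float]


def detection_rates(pairs: Iterable[PerturbedPair], score: Scorer) -> Dict[str, float]:
    """Return, per category, the fraction of perturbations the scorer detects.

    Assumption: "detected" means the perturbed answer scores strictly lower
    than the original answer.
    """
    detected: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        total[pair.category] += 1
        if score(pair.question, pair.perturbed) < score(pair.question, pair.original):
            detected[pair.category] += 1
    return {category: detected[category] / total[category] for category in total}


def toy_length_scorer(question: str, answer: str) -> float:
    """Stand-in for an Evaluator LLM: rewards longer answers, capped at 10."""
    return min(10.0, float(len(answer.split())))


# Illustrative data only; these are not items from the FBI benchmark.
EXAMPLE_PAIRS = [
    PerturbedPair(
        category="factual_accuracy",
        question="What is the capital of France?",
        original="Paris is the capital of France.",
        perturbed="Lyon is the capital of France.",
    ),
    PerturbedPair(
        category="long_form_writing",
        question="Explain how to double the sum of two and three.",
        original="First add two and three, then double the result to get ten.",
        perturbed="Double the result.",
    ),
]

if __name__ == "__main__":
    print(detection_rates(EXAMPLE_PAIRS, toy_length_scorer))
```

The toy scorer catches the truncated answer but gives the factually wrong answer the same score as the correct one, which is the kind of blind spot a per-category breakdown is meant to expose. Replacing `toy_length_scorer` with a function that queries an actual evaluator model would leave the rest of the sketch unchanged.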

Code & Models

Repositories

ai4bharat/fbi (Official)

Datasets

Videos

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Artificial Intelligence in Law