Towards Understanding the Robustness of LLM-based Evaluations under Perturbations
Manav Chaudhary, Harshit Gupta, Savita Bhat, Vasudeva Varma

TL;DR
This paper investigates the use of Large Language Models, specifically Google Gemini 1, as automatic evaluators for summarization and dialog tasks, analyzing their alignment with human judgments and robustness under input perturbations.
Contribution
It introduces an evaluation framework using LLMs for non-standardized metrics and assesses their performance and robustness compared to human evaluators.
Findings
LLMs show limited alignment with human judgments.
LLMs are not robust against input perturbations.
Significant improvements are needed for reliable LLM-based evaluation.
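One way to make the robustness finding concrete is to measure how much an evaluator's score moves when its input is perturbed. The sketch below is illustrative only: `score_shift`, `toy_scorer`, and `drop_last_word` are hypothetical names, the toy scorer stands in for an LLM call, and the perturbation shown is not claimed to be the one used in the paper.

```python
def score_shift(scorer, originals, perturb):
    """Mean absolute change in score when each input is perturbed.

    scorer:  callable mapping a text to a numeric score (assumed stand-in
             for an LLM evaluator; not the paper's actual prompt or model call)
    perturb: callable mapping a text to a perturbed text (assumed)
    """
    shifts = [abs(scorer(perturb(text)) - scorer(text)) for text in originals]
    return sum(shifts) / len(shifts)


# Toy stand-ins so the sketch runs without any model access.
def toy_scorer(text):
    # Hypothetical scorer: word count, capped at 5.
    return min(5, len(text.split()))


def drop_last_word(text):
    # Hypothetical perturbation: remove the final word.
    return " ".join(text.split()[:-1])


shift = score_shift(toy_scorer, ["a b c", "a b c d e f g"], drop_last_word)
```

A shift near zero on perturbations that preserve quality, together with a clear shift on perturbations that degrade it, is what one would hope for from a robust evaluator.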
Abstract
Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini 1, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score and a justification for it. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited, they are not robust against perturbations, and significant improvements are…
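Alignment between LLM scores and human judgments is commonly summarized with a rank correlation. The sketch below computes Spearman's rho in plain Python; the choice of Spearman is an assumption made for illustration (the abstract does not name the statistic used), and the score lists are made-up placeholders, not data from SummEval or USR.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for five outputs (illustrative values only).
human_scores = [4.0, 2.5, 5.0, 3.0, 1.5]
llm_scores = [4, 3, 4, 3, 2]
rho = spearman(human_scores, llm_scores)
```

LLM evaluators often emit a small set of integer scores, which produces many ties; averaging tied ranks keeps the statistic well defined in that case.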
Taxonomy
Topics: Fuzzy Logic and Control Systems · Neural Networks and Applications
