Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

Jiangnan Fang; Cheng-Tse Liu; Hanieh Deilamsalehy; Nesreen K. Ahmed; Puneet Mathur; Nedim Lipka; Franck Dernoncourt; Ryan A. Rossi

arXiv:2602.07673 · cs.CL · February 10, 2026

TL;DR

This paper investigates biases in LLM-based summary evaluation. It finds that LLM judges increasingly favor LLM-generated summaries over human-written ones as overlap with the human-written summaries decreases, which points to limitations in current evaluation methods.

Contribution

The study analyzes LLM judge bias as a function of overlap with human-written summaries, across 9 recent LLMs ranging from 1 billion to 12 billion parameters, in the context of summarization evaluation.

Findings

01. LLM judges prefer LLM summaries over human summaries as overlap decreases.

02. Models struggle to evaluate summaries with limited overlap, indicating a need for improved techniques.

03. Bias patterns are consistent across different LLMs regardless of their own biases.

Abstract

Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges show biases toward length and order, among others, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work, we provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the…
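
The kind of analysis the abstract describes, judge preference as a function of overlap with a human-written summary, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' method: this page does not say which overlap metric the paper uses, so the unigram F1 score here is an assumption, and the names `unigram_f1`, `preference_by_overlap`, and `num_bins` are hypothetical.

```python
from collections import Counter, defaultdict


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram F1 overlap between two texts (an assumed stand-in metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    shared = sum((cand & ref).values())
    if shared == 0:
        return 0.0
    precision = shared / sum(cand.values())
    recall = shared / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def preference_by_overlap(records, num_bins=4):
    """Rate at which the judge picked the LLM summary, per overlap bin.

    records: iterable of (llm_summary, human_summary, judge_picked_llm),
    where judge_picked_llm is True if the judge preferred the LLM summary.
    Returns {bin_index: preference_rate}; bin 0 holds the lowest overlap.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [LLM wins, total pairs]
    for llm_summary, human_summary, judge_picked_llm in records:
        overlap = unigram_f1(llm_summary, human_summary)
        # Clamp so an overlap of exactly 1.0 falls in the top bin.
        bin_index = min(int(overlap * num_bins), num_bins - 1)
        counts[bin_index][0] += int(judge_picked_llm)
        counts[bin_index][1] += 1
    return {b: wins / total for b, (wins, total) in sorted(counts.items())}
```

Under these assumptions, a preference rate that rises toward the low-overlap bins would correspond to the bias pattern the abstract reports.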

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Computational and Text Analysis Methods