Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

Yinuo Liu; Emre Sezgin; Eric A. Youngstrom

arXiv:2601.19925·cs.CL·January 29, 2026

TL;DR

This study evaluates the consistency and reliability of large language models in assessing academic abstracts, comparing their performance to human reviewers and exploring their potential as supplementary tools in scientific review processes.

Contribution

It provides empirical evidence on the agreement levels between LLMs and humans in abstract evaluation, highlighting their strengths and limitations for supporting peer review.

Findings

1. LLMs showed good-to-excellent agreement with each other (ICCs: 0.59-0.87).

2. ChatGPT and Claude had moderate agreement with humans on overall quality (ICCs ~0.45-0.60).

3. LLMs are suitable for batch processing and consistent application of rubrics, but less reliable on subjective criteria.

Abstract

Introduction: Large language models (LLMs) can process requests and generate text, but their feasibility for assessing complex academic content needs further investigation. To explore LLMs' potential in assisting scientific review, this study examined the consistency and reliability of ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5 in evaluating abstracts, compared to one another and to human reviewers. Methods: A total of 160 abstracts from a local conference were graded by human reviewers and three LLMs using one rubric. Composite score distributions across three LLMs and fourteen reviewers were examined. Inter-rater reliability was calculated using intraclass correlation coefficients (ICCs) for within-AI reliability and AI-human concordance. Bland-Altman plots were examined for visual agreement patterns and systematic bias. Results: LLMs achieved good-to-excellent agreement with each other (ICCs:…
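The two agreement statistics named in the Methods can be illustrated with a minimal sketch. The abstract does not say which ICC form the authors used, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here, as are the function names and the toy scores, which are made up for illustration and are not data from the study.

```python
import numpy as np


def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: array-like of shape (n_subjects, k_raters).
    Assumption: this ICC form is illustrative; the paper may use another.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)  # one mean per abstract
    col_means = x.mean(axis=0)  # one mean per rater

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired scores a and b.

    Bias is the mean difference a - b; the limits of agreement are
    bias +/- 1.96 sample standard deviations of the differences.
    """
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = d.mean()
    sd = d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical scores: rater B is always exactly one point above rater A.
toy = [[1, 2], [2, 3], [3, 4]]
icc = icc2_1(toy)
bias, lower, upper = bland_altman([1, 2, 3], [2, 3, 4])
```

In this toy case the two raters rank the abstracts identically, yet the absolute-agreement ICC falls below 1 because of the constant one-point offset, and the Bland-Altman bias exposes that same offset directly. This is the kind of systematic bias the Bland-Altman plots in the study are meant to reveal.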

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Delphi Technique in Research