Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study
Yinuo Liu, Emre Sezgin, Eric A. Youngstrom

TL;DR
This study evaluates the consistency and reliability of large language models in assessing academic abstracts, comparing their performance to human reviewers and exploring their potential as supplementary tools in scientific review processes.
Contribution
It provides empirical evidence on the agreement levels between LLMs and humans in abstract evaluation, highlighting their strengths and limitations for supporting peer review.
Findings
LLMs showed good-to-excellent agreement with each other (ICCs: 0.59-0.87).
ChatGPT and Claude had moderate agreement with humans on overall quality (ICCs ~0.45-0.60).
LLMs are suitable for batch processing and consistent application of rubrics, but less reliable on subjective criteria.
Abstract
Introduction: Large language models (LLMs) can process requests and generate text, but their feasibility for assessing complex academic content needs further investigation. To explore LLMs' potential in assisting scientific review, this study examined the consistency and reliability of ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5 in evaluating abstracts, compared with one another and with human reviewers. Methods: A total of 160 abstracts from a local conference were graded by human reviewers and three LLMs using one rubric. Composite score distributions across the three LLMs and fourteen reviewers were examined. Inter-rater reliability was calculated using intraclass correlation coefficients (ICCs) for within-AI reliability and AI-human concordance. Bland-Altman plots were examined for visual agreement patterns and systematic bias. Results: LLMs achieved good-to-excellent agreement with each other (ICCs:…
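The two agreement statistics named in the Methods can be sketched as follows. This is a minimal illustration, not the authors' analysis: the abstract does not say which ICC form was used, so the sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single rater), and the function names `icc2_1` and `bland_altman` are hypothetical. Input is a score matrix with one row per abstract and one column per rater.

```python
import numpy as np


def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    Assumption: the ICC form is chosen for illustration only; the source
    abstract does not specify which form the study used.
    scores: array-like of shape (n_subjects, n_raters), no missing values.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)  # one mean per abstract
    col_means = x.mean(axis=0)  # one mean per rater

    # Two-way ANOVA sums of squares
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-subject mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired scores a and b.

    bias is the mean difference a - b; the limits of agreement are
    bias +/- 1.96 sample standard deviations of the differences.
    """
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

In this framing, within-AI reliability would use a matrix whose columns are the three LLMs, while AI-human concordance would pair one LLM's scores with the human scores. A nonzero bias from `bland_altman` would indicate that one rater scores systematically higher than the other.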
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Delphi Technique in Research
