Can AI Solve the Peer Review Crisis? A Large Scale Cross Model Experiment of LLMs' Performance and Biases in Evaluating over 1000 Economics Papers

Pat Pataranutaporn; Nattavudh Powdthavee; Chayapatr Achiwaranguprok; Pattie Maes

arXiv:2502.00070·cs.CY·April 4, 2025

Open Access

TL;DR

This study evaluates the performance and biases of four large language models in assessing over 1,200 economics papers, revealing their ability to distinguish research quality and highlighting biases related to author identity and institutional prestige.

Contribution

It provides one of the first large-scale experimental assessments of LLMs in peer review, comparing their evaluation accuracy and biases across multiple models and experiments.

Findings

01

LLMs can differentiate research quality based on text content.

02

Claude and Gemma excel at capturing quality gradients.

03

GPT and other models exhibit biases favoring top authors and institutions.

Abstract

This study examines the potential of large language models (LLMs) to augment the academic peer review process by reliably evaluating the quality of economics research without introducing systematic bias. We conduct one of the first large-scale experimental assessments of four LLMs (GPT-4o, Claude 3.5, Gemma 3, and LLaMA 3.3) across two complementary experiments. In the first, we use nonparametric binscatter and linear regression techniques to analyze over 29,000 evaluations of 1,220 anonymized papers drawn from 110 economics journals excluded from the training data of current LLMs, along with a set of AI-generated submissions. The results show that LLMs consistently distinguish between higher- and lower-quality research based solely on textual content, producing quality gradients that closely align with established journal prestige measures. Claude and Gemma perform exceptionally well…
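The binscatter and linear regression analysis named in the abstract can be illustrated with a minimal sketch. It runs on synthetic data only; the variable names (`journal_rank`, `llm_score`), the bin count, and the data-generating process are assumptions made for illustration and are not the paper's specification, controls, or results.

```python
import random
from statistics import mean


def binscatter(x, y, n_bins=10):
    """Sort observations by x, split them into equal-count bins,
    and return the (mean x, mean y) point for each bin."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # skip empty bins when n < n_bins
            points.append((mean(p[0] for p in chunk),
                           mean(p[1] for p in chunk)))
    return points


def ols_slope_intercept(x, y):
    """Simple one-regressor OLS fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic stand-in data (hypothetical): a prestige measure in [0, 1]
# and a noisy score that rises linearly with it.
rng = random.Random(0)
journal_rank = [rng.uniform(0, 1) for _ in range(1000)]
llm_score = [5.0 + 2.0 * r + rng.gauss(0, 1) for r in journal_rank]

points = binscatter(journal_rank, llm_score, n_bins=10)
slope, intercept = ols_slope_intercept(journal_rank, llm_score)

for bx, by in points:
    print(f"bin mean rank={bx:.2f}  bin mean score={by:.2f}")
print(f"OLS slope={slope:.2f}, intercept={intercept:.2f}")
```

The binned means show the shape of the relationship without imposing a functional form, while the OLS slope summarizes it in a single number; a quality gradient of the kind the abstract describes would appear as bin means rising with the prestige measure.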

Taxonomy

Topics: Impact of AI and Big Data on Business and Society · Explainable Artificial Intelligence (XAI) · Stock Market Forecasting Methods

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Cosine Annealing · Multi-Head Attention · Dense Connections · Linear Regression · Discriminative Fine-Tuning