Can large language models provide useful feedback on research papers? A large-scale empirical analysis

Weixin Liang; Yuhui Zhang; Hancheng Cao; Binglu Wang; Daisy Ding; Xinyu Yang; Kailas Vodrahalli; Siyu He; Daniel Smith; Yian Yin; Daniel McFarland; James Zou

arXiv:2310.01783 · cs.LG · October 4, 2023 · 42 cites

Open Access · 1 Repo

TL;DR

This study systematically evaluates GPT-4's ability to generate useful scientific feedback by comparing it with human peer reviews across major journals and conferences, revealing promising results and notable limitations.

Contribution

The paper introduces an automated pipeline in which GPT-4 generates feedback on research papers, and it offers a large-scale empirical analysis comparing that feedback with comments from human reviewers.

Findings

01 GPT-4's feedback overlaps with human reviews by about 30-40%.

02 Over 57% of researchers found GPT-4 feedback helpful or very helpful.

03 GPT-4 feedback was perceived as more beneficial than some human reviews.
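
The overlap figure in the first finding can be read as a hit rate: the share of GPT-4 comments that match at least one human reviewer comment. The snippet below is a minimal sketch of that arithmetic only. It assumes the semantic matching between comments has already been done by some other step, and the function name `hit_rate`, its parameters, and the example data are hypothetical illustrations, not the authors' code.

```python
from typing import Iterable, Set, Tuple


def hit_rate(num_llm_comments: int, matches: Iterable[Tuple[int, int]]) -> float:
    """Return the fraction of LLM comments matched to at least one human comment.

    `matches` holds (llm_comment_index, human_comment_index) pairs produced by
    an upstream matching step (assumed here, not implemented).
    """
    if num_llm_comments <= 0:
        raise ValueError("num_llm_comments must be positive")
    # An LLM comment counts once, however many human comments it matches.
    matched_llm: Set[int] = {llm_idx for llm_idx, _ in matches}
    if any(idx < 0 or idx >= num_llm_comments for idx in matched_llm):
        raise ValueError("match refers to an LLM comment index out of range")
    return len(matched_llm) / num_llm_comments


# Hypothetical example: 6 LLM comments; comments 0 and 3 match human comments.
example_matches = [(0, 2), (3, 0), (3, 1)]
print(f"{hit_rate(6, example_matches):.1%}")  # prints 33.3%
```

Counting each GPT-4 comment at most once keeps the rate between 0 and 1 even when a single comment matches several human comments.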

Abstract

Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production and intricate knowledge specialization challenge the conventional scientific feedback mechanisms. High-quality peer reviews are increasingly difficult to obtain. Researchers who are more junior or from under-resourced settings have especially hard times getting timely feedback. With the breakthrough of large language models (LLM) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback on research manuscripts. However, the utility of LLM-generated feedback has not been systematically studied. To address this gap, we created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers. We evaluated the quality of GPT-4's feedback through two large-scale studies. We first quantitatively compared GPT-4's generated feedback…

Code & Models

Repositories

weixin-liang/llm-scientific-feedback (Official)

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Scientific Computing and Data Management

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Dense Connections · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization