APRES: An Agentic Paper Revision and Evaluation System
Bingchen Zhao, Jenny Zhang, Chenxi Whitehouse, Minqi Jiang, Michael Shvartsman, Abhishek Charnalia, Despoina Magka, Tatiana Shavrina, Derek Dunfield, Oisin Mac Aodha, Yoram Bachrach

TL;DR
APRES leverages Large Language Models to automatically revise scientific papers based on an evaluation rubric, improving citation prediction accuracy and human preference, thereby enhancing scientific communication and impact.
Contribution
This paper introduces APRES, a novel LLM-powered system that automatically revises scientific manuscripts to improve their quality and future citation potential without altering core content.
Findings
Improves citation prediction accuracy by 19.6%.
Papers revised by APRES are preferred by experts 79% of the time.
Demonstrates LLMs can effectively stress-test manuscripts before submission.
Abstract
Scientific discoveries must be communicated clearly to realize their full potential. Without effective communication, even the most groundbreaking findings risk being overlooked or misunderstood. The primary way scientists communicate their work and receive feedback from the community is through peer review. However, the current system often provides inconsistent feedback between reviewers, ultimately hindering the improvement of a manuscript and limiting its potential impact. In this paper, we introduce APRES, a novel method powered by Large Language Models (LLMs), to update a scientific paper's text based on an evaluation rubric. Our automated method discovers a rubric that is highly predictive of future citation counts and integrates it with APRES in an automated system that revises papers to enhance their quality and impact. Crucially, this objective should be met without altering the…
Peer Reviews
Decision · Submitted to ICLR 2026
The paper proposes an original problem formulation and solution to the important problem of evaluating and improving the presentation of scientific papers. While prior work has studied aspects of this problem at a more component level, this paper takes a more end-to-end approach in a mostly ecologically valid setting with recent AI papers. The use of language models to automatically derive rubrics is understudied, and I appreciate this paper demonstrating the end-to-end effectiveness of this app
This paper's primary original contribution lies in the problem definition, and it does not make a large advance in terms of the development of new optimization methods (this is fine). See the "Questions" section for a significant question about the experimental setup in the evaluation of the Phase 2 (rewriting) problem: there is a lack of clarity, but if the evaluation metric is not held constant across the subplots in Figure 2 then that would call into question the internal validity of that expe
1. Problem impact/significance: The paper presents an end-to-end system for the impactful and timely problem of using LLMs to generate useful feedback for paper revision. 2. The idea of using an LLM-based agent to iteratively generate rubrics that are predictive of citation count based on text content appears to be novel, and the proposed MultiAIDE search method outperforms existing methods for predicting future citation count. 3. The human evaluation results scoring paper quality suggest tha
1. As noted in the abstract, it is important that any paper revision system not revise core scientific content in a paper. While there are some constraints placed on the system to limit revision of such content, none of the evaluations of APRES and revised papers measure whether scientific content changed. This to me is one of the key weaknesses of this work, as a paper revision system that always changed a lot of scientific content such that the results looked more positive would be expected to
1. The idea of using LLMs to assist paper writing is promising. 2. Agentic paper revision and evaluation is an intuitive way to help researchers write papers.
1. The core design of this paper uses the number of **citations** as a metric to improve given papers' presentation and readability, which is questionable. The authors cite Ante (2022) as support. However: a) The claim of Ante's paper is "Our analysis furthermore shows that higher readability scores significantly relates to the likelihood of articles not receiving any citations", which **contradicts the claim of this paper**. b) Even if readability (x) can increase or decrease scientific impact
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling · Advanced Text Analysis Techniques
