CycleResearcher: Improving Automated Research via Automated Review
Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, Linyi Yang

TL;DR
This paper introduces CycleResearcher, an open-source LLM-based framework for fully automated research and review, demonstrating promising results in automating scientific inquiry and peer review processes.
Contribution
It presents a novel iterative training framework using open-source LLMs for autonomous research and peer review, along with new datasets and promising evaluation results.
Findings
CycleReviewer reduces review score prediction error by 26.89%.
Generated papers achieve a competitive review score of 5.36.
The framework demonstrates potential for fully automated scientific research.
Abstract
The automation of scientific discovery has been a long-standing goal within the research community, driven by the potential to accelerate knowledge creation. While significant progress has been made using commercial large language models (LLMs) as research assistants or idea generators, the possibility of automating the entire research process with open-source LLMs remains largely unexplored. This paper explores the feasibility of using open-source post-trained LLMs as autonomous agents capable of performing the full cycle of automated research and review, from literature review and manuscript preparation to peer review and paper refinement. Our iterative preference training framework consists of CycleResearcher, which conducts research tasks, and CycleReviewer, which simulates the peer review process, providing iterative feedback via reinforcement learning. To train these models, we…
Peer Reviews
Decision: ICLR 2025 Poster
The introduction of CycleResearcher and CycleReviewer models to automate the entire research process, including literature review, manuscript preparation, peer review, and revision, is highly innovative. This framework mimics the real-world research cycle, enhancing the efficiency and consistency of scientific inquiry. Performance Improvement: The CycleReviewer model demonstrates a significant improvement in predicting paper scores, outperforming human reviewers by 26.89% in mean absolute error
Generalizability Across Domains: The models are primarily designed for machine learning-related research, and their generalizability to other scientific fields remains unexplored. This limitation suggests that the framework might not perform as well in domains outside of machine learning. Reward Design: The paper highlights the issue of reward definition, where the policy model might exploit loopholes in the reward model to maximize rewards without genuinely improving the quality of the generate
Originality: The idea to design both a researcher and a reviewer is novel and interesting. Quality: The usage of recent preference optimization methods is a nice technical plus. The work contributes datasets to the direction of scientific peer reviewing, which is a resource that is rather helpful for the field. RL details and how they fit in is nice. Clarity: Figures are well-designed and artistically pleasing. Appreciate the various different ways that are used to evaluate the methods (qual
Originality: N/A Quality: One big issue of the paper is the method by which the authors obtain the "ground truth" review score: "for each submission, we use the average of the other n − 1 reviewers’ scores as an estimator of the true score." In my opinion (and what feels like a general consensus in the community), it's pretty clear that this isn't the correct approach to determining a ground-truth quality of a paper. Different reviewers have different expertise and opinions, and may disagree
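The leave-one-out estimator quoted in this review, and the mean absolute error (MAE) it feeds into, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the example scores are hypothetical, and how the paper applies the estimator to model predictions is not stated in this summary.

```python
from statistics import mean


def leave_one_out_errors(scores):
    """Absolute gap between each reviewer's score and the mean of the other n - 1 scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two reviewer scores per submission")
    total = sum(scores)
    # (total - s) / (n - 1) is the average of the remaining reviewers' scores.
    return [abs(s - (total - s) / (n - 1)) for s in scores]


def mean_absolute_error(papers):
    """Pool the per-reviewer errors over all submissions and average them."""
    return mean(e for scores in papers for e in leave_one_out_errors(scores))


def relative_reduction(baseline_mae, model_mae):
    """Fractional reduction in MAE relative to a baseline (0.25 means 25% lower)."""
    return (baseline_mae - model_mae) / baseline_mae


# Made-up scores for two submissions (hypothetical data, for illustration only).
papers = [[6, 8, 5], [3, 5]]
human_mae = mean_absolute_error(papers)
```

A figure such as the reported 26.89% would correspond to `relative_reduction(human_mae, model_mae)` under this reading. The reviewer's objection still applies: the estimator treats the mean of the remaining reviewers as the true score, so any disagreement between reviewers is counted as error.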
* Training LLMs with reinforcement learning on parts of the AI research process is a novel and significant contribution. * The paper includes numerous experiments and ablations. The overall methodology is sound (with exceptions, see weaknesses). * The authors achieve strong results on the metrics they choose. It is somewhat impressive that their system achieved an acceptance rate of 31.07%, similar to ICLR 2024's acceptance rate. * Authors use open-source models with a large range of scale (from
* The writing overclaims the extent to which the paper covers the full research process. The authors write that the paper “explores performing the *full* cycle of automated research and review”; however, the paper omits a crucial part of the process: actually running experiments. Instead, the authors train models to write complete papers purely from abstracts of past work, with completely hallucinated experiment design and results. * I do not think that the task authors train models for - halluc
Taxonomy
Topics: Scientific Computing and Data Management
