ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows

Penghao Wang; Yuhao Zhou; Mengxuan Wu; Ziheng Qin; Bangyuan Zhu; Shengbin Huang; Xuanlei Zhao; Panpan Zhang; Xiaojiang Peng; Yuzhang Shang; Jianfei Yang; Zheng Zhu; Tianlong Chen; Zhangyang Wang; Kai Wang

arXiv:2510.20279·cs.LG·October 27, 2025

Open Access · 1 Dataset · 4 Reviews

TL;DR

This paper introduces ResearchGPT, a benchmark and training dataset for LLMs to assist in end-to-end computer science research workflows, demonstrating that domain-specific high-quality data significantly improves research assistance capabilities.

Contribution

The paper presents CS-54k, a high-quality scientific Q&A corpus, and demonstrates how training on this data enhances LLMs' ability to support scientific research tasks.

Findings

01

Models trained on CS-50k outperform larger proprietary models.

02

High-quality domain data improves research assistance capabilities.

03

CS-4k effectively stratifies LLMs by capability tiers.

Abstract

As large language models (LLMs) advance, the ultimate vision for their role in science is emerging: we could build an AI collaborator to effectively assist human beings throughout the entire scientific research process. We refer to this envisioned system as ResearchGPT. Given that scientific research progresses through multiple interdependent phases, achieving this vision requires rigorous benchmarks that evaluate the end-to-end workflow rather than isolated sub-tasks. To this end, we contribute CS-54k, a high-quality corpus of scientific Q&A pairs in computer science, built from 14k CC-licensed papers. It is constructed through a scalable, paper-grounded pipeline that combines retrieval-augmented generation (RAG) with multi-stage quality control to ensure factual grounding. From this unified corpus, we derive two complementary subsets: CS-4k, a carefully curated benchmark for…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

The paper constructs a large corpus of Q&A pairs for computer science literature. While not an entirely new concept, this corpus might be useful training data. A subset of this corpus is used as a benchmark for Q&A on computer science questions and might be useful when evaluating models on their CS paper knowledge, though the evaluation should be validated thoroughly. The authors also experiment with finetuning and RL and show that the training corpus can lead to higher scores on the

Weaknesses

* The filtration step, in which `(Q, A)` pairs are removed if GPT 4 mini, Gemini 2.5 flash, and Claude 3.5 haiku either all answer correctly or all answer incorrectly (as judged by an LM given the ground-truth answer), requires justification. Does this not introduce significant bias into the CS-4k benchmark? For the training dataset, this probably doesn't matter, but for the CS-4k benchmark it might be especially problematic that comparatively weak models (compared to, e.g., o3) are used for this filtr
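
As a rough illustration of the filtration rule this reviewer describes, the sketch below keeps a `(Q, A)` pair only when the filter models disagree, and drops it when they are all judged correct or all judged incorrect. The names `keep_pair`, `filter_pairs`, the `judgments` field, and the `model_a`/`model_b`/`model_c` keys are hypothetical; the paper's actual implementation is not shown in this excerpt.

```python
from typing import Dict, List


def keep_pair(judgments: Dict[str, bool]) -> bool:
    """Return True only if the filter models disagree on this pair.

    `judgments` maps a (hypothetical) model name to whether an LM judge
    marked that model's answer as correct against the ground truth.
    """
    # A set with more than one distinct verdict means mixed outcomes.
    return len(set(judgments.values())) > 1


def filter_pairs(pairs: List[dict]) -> List[dict]:
    """Drop pairs that every model got right or every model got wrong."""
    return [p for p in pairs if keep_pair(p["judgments"])]


# Toy data, invented purely for illustration.
pairs = [
    {"id": "q1", "judgments": {"model_a": True, "model_b": True, "model_c": True}},
    {"id": "q2", "judgments": {"model_a": True, "model_b": False, "model_c": False}},
    {"id": "q3", "judgments": {"model_a": False, "model_b": False, "model_c": False}},
]

kept_ids = [p["id"] for p in filter_pairs(pairs)]
print(kept_ids)  # ['q2']
```

Under this rule the surviving set depends entirely on which filter models are chosen, which is the source of the bias concern raised above.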

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- This work contributes lots of QA pairs for computer science papers
- The proposed trained 7B model outperforms several proprietary LLMs that are much larger than 7B
- The proposed data construction pipeline may benefit the research community
- The benchmark reveals some interesting findings on LLMs' capabilities in answering questions related to computer science papers

Weaknesses

- The question-answering setting seems too naive. It seems that the questions in the benchmark don't contain the raw text from related papers (Figure 5). I don't think this is a very useful setting. It would be more meaningful if you append the entire paper or append the same text chunk with the question.
- The comparison is not very fair. The answer is generated by LLM based on guidelines, which introduces a bias. The guidelines for generating the answer at test time is not introduced in the qu

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The authors are tackling an important problem, that being AI-assisted research beyond exam-style questions like in Humanity’s Last Exam (HLE).
2. The authors extract QA pairs from a large corpus of research papers, which is useful as a set of tasks and data for frontier models.
3. The results show clear differences in performance between top frontier models and weaker models on their benchmarks.

Weaknesses

1. The authors claim this dataset / benchmark “systematically evaluates the end-to-end research workflow in computer science through open-ended scientific question answering”, but do not justify this claim. The dataset / benchmark consists of synthetically generated QA questions from existing papers, and it is unclear how this connects to the claim.
2. There is a lack of analysis on where the LMs fail / succeed on the tasks, and how to improve existing LM capabilities on these tasks other than

Reviewer 04 · Rating 4 · Confidence 3

Strengths

- The paper addresses an important gap in LLM evaluation by attempting to cover the full spectrum of the research workflow rather than isolated skills.
- The dataset construction pipeline is well‑documented, reproducible, and based on real, high‑quality scientific papers with explicit measures to reduce hallucination through RAG and strict prompt constraints.
- The multi‑stage quality control process is comprehensive, combining automatic checks for reasonability, difficulty balancing, and fi

Weaknesses

- Possible overstatement of “end‑to‑end” evaluation. The paper states that CS‑4k supports *end‑to‑end* evaluation of scientific research workflows. In reality, the benchmark combines results from separate sub‑tasks that match different workflow stages. These are tested independently without linked task sequences or shared context, so the process is not a full end‑to‑end pipeline. The claim may therefore be overstated, as the benchmark measures broad coverage rather than full workflow execution

Code & Models

Datasets

wph6/CS-54k
dataset· 45 dl

Taxonomy

Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Artificial Intelligence in Healthcare and Education