ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows
Penghao Wang, Yuhao Zhou, Mengxuan Wu, Ziheng Qin, Bangyuan Zhu, Shengbin Huang, Xuanlei Zhao, Panpan Zhang, Xiaojiang Peng, Yuzhang Shang, Jianfei Yang, Zheng Zhu, Tianlong Chen, Zhangyang Wang, Kai Wang

TL;DR
This paper introduces ResearchGPT, a benchmark and training dataset for LLMs to assist in end-to-end computer science research workflows, demonstrating that domain-specific high-quality data significantly improves research assistance capabilities.
Contribution
The paper presents CS-54k, a high-quality scientific Q&A corpus, and demonstrates how training on this data enhances LLMs' ability to support scientific research tasks.
Findings
Models trained on CS-50k outperform larger proprietary models.
High-quality domain data improves research assistance capabilities.
CS-4k effectively stratifies LLMs by capability tiers.
Abstract
As large language models (LLMs) advance, the ultimate vision for their role in science is emerging: we could build an AI collaborator to effectively assist human beings throughout the entire scientific research process. We refer to this envisioned system as ResearchGPT. Given that scientific research progresses through multiple interdependent phases, achieving this vision requires rigorous benchmarks that evaluate the end-to-end workflow rather than isolated sub-tasks. To this end, we contribute CS-54k, a high-quality corpus of scientific Q&A pairs in computer science, built from 14k CC-licensed papers. It is constructed through a scalable, paper-grounded pipeline that combines retrieval-augmented generation (RAG) with multi-stage quality control to ensure factual grounding. From this unified corpus, we derive two complementary subsets: CS-4k, a carefully curated benchmark for…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper constructs a large corpus of Q & A pairs for computer science literature. While mostly not a completely new concept, this corpus might be useful training data. A subset of this corpus is used as a benchmark for Q&A on computer science questions and might be useful when evaluating models on their CS paper knowledge, though the evaluation should be validated thoroughly. The authors also experiment with finetuning and RL and show that the training corpus can lead to higher scores on the
* The filtration step, where `(Q, A)` pairs are removed if GPT 4 mini, Gemini 2.5 flash and Claude 3.5 haiku are either all correct or all incorrect (based on LM as a judge given the ground truth answer), requires justification. Does this not introduce significant bias into the CS-4k benchmark? For the training dataset, this probably doesn't matter, but for the benchmark CS-4k it might be especially problematic that comparatively weak models (compared to e.g., o3 etc.) are used for this filtr
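
The filtration rule this review describes can be illustrated with a minimal sketch. It assumes each `(Q, A)` pair already carries one boolean verdict per reference model from an LM judge; the field names (`id`, `judgements`) and the placeholder model keys are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def keep_pair(judgements: Dict[str, bool]) -> bool:
    """Keep a pair only when the reference models disagree.

    `judgements` maps a model name to True (judged correct) or
    False (judged incorrect). Pairs that every model gets right,
    or every model gets wrong, are dropped.
    """
    return len(set(judgements.values())) > 1


def filter_pairs(pairs: List[dict]) -> List[dict]:
    """Apply the disagreement filter to a list of (Q, A) records."""
    return [p for p in pairs if keep_pair(p["judgements"])]


# Hypothetical records; the model keys are placeholders.
pairs = [
    {"id": "q1", "judgements": {"model_a": True, "model_b": True, "model_c": True}},
    {"id": "q2", "judgements": {"model_a": False, "model_b": False, "model_c": False}},
    {"id": "q3", "judgements": {"model_a": True, "model_b": False, "model_c": True}},
]

kept = filter_pairs(pairs)
print([p["id"] for p in kept])
```

Under this rule, which pairs survive depends entirely on the chosen reference models, which is the source of the bias the review asks about.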
- This work contributes lots of QA pairs for computer science papers
- The proposed trained 7B model outperforms several proprietary LLMs that are much larger than 7B
- The proposed data construction pipeline may benefit the research community
- The benchmark reveals some interesting findings on LLMs' capabilities in answering questions related to computer science papers
- The question-answering setting seems too naive. It seems that the questions in the benchmark don't contain the raw text from related papers (Figure 5). I don't think this is a very useful setting. It would be more meaningful if you append the entire paper or append the same text chunk with the question.
- The comparison is not very fair. The answer is generated by LLM based on guidelines, which introduces a bias. The guidelines for generating the answer at test time is not introduced in the qu
1. The authors are tackling an important problem, that being AI-assisted research beyond exam-style questions like in Humanity’s Last Exam (HLE).
2. The authors extract QA pairs from a large corpus of research papers, which is useful as a set of tasks and data for frontier models.
3. The results show clear differences in performance between top frontier models and weaker models on their benchmarks.
1. The authors claim this dataset / benchmark “systematically evaluates the end-to-end research workflow in computer science through open-ended scientific question answering”, but do not justify this claim. The dataset / benchmark consists of synthetically generated QA questions from existing papers, and it is unclear how this connects to the claim.
2. There is a lack of analysis on where the LMs fail / succeed on the tasks, and how to improve existing LM capabilities on these tasks other than
- The paper addresses an important gap in LLM evaluation by attempting to cover the full spectrum of the research workflow rather than isolated skills.
- The dataset construction pipeline is well‑documented, reproducible, and based on real, high‑quality scientific papers with explicit measures to reduce hallucination through RAG and strict prompt constraints.
- The multi‑stage quality control process is comprehensive, combining automatic checks for reasonability, difficulty balancing, and fi
- Possible overstatement of “end‑to‑end” evaluation. The paper states that CS‑4k supports *end‑to‑end* evaluation of scientific research workflows. In reality, the benchmark combines results from separate sub‑tasks that match different workflow stages. These are tested independently without linked task sequences or shared context, so the process is not a full end‑to‑end pipeline. The claim may therefore be overstated, as the benchmark measures broad coverage rather than full workflow execution
Taxonomy
Topics: Machine Learning in Materials Science · Scientific Computing and Data Management · Artificial Intelligence in Healthcare and Education
