TL;DR
RPC-Bench is a large-scale, fine-grained QA benchmark for research paper comprehension, revealing significant gaps in current models' ability to understand scholarly content accurately.
Contribution
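It introduces a benchmark with a fine-grained taxonomy aligned with the scientific research flow, an LLM-human interaction framework for large-scale annotation and quality control, and an LLM-as-a-Judge framework for evaluating scientific understanding.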
Findings
Even GPT-5 achieves only 68.2% correctness-completeness.
GPT-5's score drops to 37.46% after conciseness adjustment.
RPC-Bench exposes substantial gaps in current scientific paper comprehension.
Abstract
Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We also define an elaborate LLM-human interaction annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable framework that evaluates models on correctness-completeness and conciseness, with high agreement to human…
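To make the two evaluation dimensions concrete, here is a minimal sketch of how per-answer judge scores might be aggregated by question type. Everything in it is an assumption for illustration: the `JudgedAnswer` schema, the `summarize` helper, the [0, 1] score range, and in particular the adjustment rule (multiplying correctness-completeness by conciseness). The abstract does not state the adjustment formula, so this should not be read as RPC-Bench's actual metric.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgedAnswer:
    """One model answer scored by a judge (hypothetical schema)."""
    question_type: str               # "why", "what", or "how"
    correctness_completeness: float  # judge score in [0, 1] (assumed range)
    conciseness: float               # judge score in [0, 1] (assumed range)


def summarize(answers):
    """Return mean scores per question type, as percentages."""
    by_type = {}
    for answer in answers:
        by_type.setdefault(answer.question_type, []).append(answer)

    report = {}
    for qtype, group in by_type.items():
        raw = mean(a.correctness_completeness for a in group)
        # Assumed adjustment: scale each answer's score by its conciseness.
        adjusted = mean(a.correctness_completeness * a.conciseness for a in group)
        report[qtype] = {
            "correctness_completeness": round(100 * raw, 2),
            "conciseness_adjusted": round(100 * adjusted, 2),
        }
    return report


if __name__ == "__main__":
    demo = [
        JudgedAnswer("why", 0.8, 0.5),
        JudgedAnswer("why", 0.6, 1.0),
        JudgedAnswer("how", 1.0, 0.5),
    ]
    for qtype, scores in summarize(demo).items():
        print(qtype, scores)
```

Under a multiplicative rule like this one, a verbose but correct answer loses credit, which is one way a headline score could fall sharply once conciseness is taken into account.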
