ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents
Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu

TL;DR
ResearchRubrics is a comprehensive benchmark with detailed rubrics and evaluation protocols designed to assess the performance of deep research agents across diverse, open-ended tasks, highlighting current limitations in reasoning and contextual understanding.
Contribution
The paper introduces ResearchRubrics, a large-scale, standardized benchmark with detailed rubrics and a new complexity framework for evaluating deep research agents.
Findings
Leading DR systems score below 68% compliance with rubrics.
Current agents often miss implicit context and reason inadequately about retrieved information.
ResearchRubrics facilitates systematic assessment and comparison of DR capabilities.
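A rubric compliance percentage like the one in the findings above can be read as the share of weighted rubric criteria that a response satisfies. The following is a minimal sketch under stated assumptions, not the paper's scoring formula: the `Criterion` class, the `compliance_score` function, the weight values, and the choice to normalize by the total positive weight and clip at zero are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    description: str
    weight: float    # assumed convention: > 0 rewards a behavior, < 0 penalizes one
    satisfied: bool  # grader's verdict: did the response exhibit this behavior?


def compliance_score(criteria: list[Criterion]) -> float:
    """Weighted fraction of the rubric met, clipped to be non-negative.

    Assumption: the score is normalized by the sum of positive weights,
    so meeting every positive criterion and triggering no negative
    criterion yields 1.0.
    """
    max_positive = sum(c.weight for c in criteria if c.weight > 0)
    if max_positive <= 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    # Satisfied positive criteria add weight; triggered negative criteria subtract it.
    earned = sum(c.weight for c in criteria if c.satisfied)
    return max(0.0, earned / max_positive)


# Illustrative rubric for a single response (made-up criteria and weights).
example = [
    Criterion("Cites a primary source for each key claim", 3.0, True),
    Criterion("Compares at least two competing explanations", 2.0, False),
    Criterion("States the answer's main limitation", 1.0, True),
    Criterion("Includes fabricated references", -2.0, False),
]

score = compliance_score(example)
print(f"{score:.1%}")  # prints 66.7%
```

In this sketch a negative criterion only affects the score when it is triggered, which is one way to penalize undesirable behavior without inflating the maximum attainable score.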
Abstract
Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with more than 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper focuses on an underexplored but important problem: evaluating whether LLMs can reason like scientists rather than just answer questions.
- The proposed Research Rubrics provides four well-defined dimensions: Problem Understanding, Reasoning Process, Solution Design, and Scientific Contribution.
- The experiments benchmark multiple leading models (GPT-4, Claude 3, Gemini, Qwen2, Mistral) and compare different prompting strategies (chain-of-thought, critique loop, research-plan promp
- The study mainly focuses on computer science problems (e.g., ICLR/NeurIPS-style tasks), so it’s unclear how well the framework generalizes to other scientific domains.
- The framework evaluates the quality of reasoning, but it does not assess whether the model’s ideas could actually produce valid or impactful scientific outcomes.
- Model performance depends heavily on prompt design (e.g., “research-plan” prompts boost scores), which suggests the framework might partly reflect prompt engineer
Diagnostic clarity. Few works evaluate the output of deep research agents, and this work makes a meaningful contribution on that front. The rubric design makes sense and helps evaluate these very long, open-ended responses. The work is a meaningful step toward turning the messy, subjective space of research quality into a more structured and reproducible measurement framework.
The paper's main weakness is validity. Disagreements occur during rubric creation, so the final rubric produced by the three-stage process masks inherent disagreement and measures progress against a rubric that an average human would create. No inter-annotator agreement (IAA) measures or analyses of disagreements are reported; the main question for a benchmark is whether progress on it would demonstrate a meaningful improvement on the task for end users. It is unclear from the rubric creation process that
- Clear motivation: DR tasks are open-ended, dynamic, and require long-form synthesis; existing QA-style benchmarks and short-answer datasets underrepresent these needs.
- Human-authored rubrics: The choice to use carefully designed, expert-written rubrics (including negative criteria and weighted mandatory vs. optional items) is a thoughtful departure from purely automated reference-based metrics and helps capture nuanced expectations.
- Fine-grained evaluation: The rubric axes (explicit, impli
- Scale and representativeness: 75 tasks and 1,868 criteria are substantial in rubric detail but relatively small in task count for a general-purpose benchmark, raising questions about coverage and generalization across the wide variety of DR use cases.
- Expert definition and domain specialization: “Experts” are defined as strong STEM generalists rather than domain specialists (e.g., legal, medical). For domains with high stakes or regulatory complexity, lack of specialist involvement may reduc
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
