ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research
Hao Shen, Hang Yang, Zhouhong Gu, Weili Han

TL;DR
ScholarGym provides a structured evaluation environment for the information-gathering stage of deep research with large language models, enabling detailed analysis of individual components and revealing key bottlenecks.
Contribution
ScholarGym introduces a benchmark that isolates and assesses each stage of information gathering (Query Planning, Tool Invocation, and Relevance Assessment), enabling decomposable analysis of language model capabilities in academic literature retrieval.
Findings
Iterative query decomposition improves retrieval F1 by 2.9-3.3×
Extended thinking models trade recall for precision
Query Planning and Relevance Assessment are key performance bottlenecks
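The findings above are stated in terms of retrieval precision, recall, and F1. This summary does not give the paper's exact metric definitions, so the following is only a minimal sketch that assumes set-based scoring over paper identifiers. The function name `retrieval_scores` and the example identifiers are hypothetical.

```python
def retrieval_scores(retrieved, relevant):
    """Set-based precision, recall, and F1 over paper identifiers.

    Assumption: each paper counts once, and ranking order is ignored.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # papers that are both returned and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 4 papers returned, 3 relevant papers, 2 in common.
p, r, f1 = retrieval_scores(["a", "b", "c", "d"], ["a", "b", "e"])
```

Under this definition, a system that returns fewer but better-targeted papers raises precision and lowers recall, which is the kind of trade-off the second finding describes.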
Abstract
Large language models have advanced from single-turn question answering to deep research systems that iteratively decompose research questions, invoke retrieval tools, and synthesize information across multiple rounds. Evaluating such systems typically involves scoring their final research reports holistically, but this end-to-end paradigm tightly couples the language model's decision-making, workflow design, and environmental feedback, precluding decomposable analysis of individual components. We introduce ScholarGym, an evaluation environment that isolates the information-gathering stage of deep research on academic literature. Under a unified workflow, ScholarGym decomposes the research process into three explicit stages -- Query Planning, Tool Invocation, and Relevance Assessment -- and evaluates each against 2,536 expert-annotated queries over a static corpus of 570K papers with…
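The three stages named in the abstract can be pictured as an iterative loop. The sketch below is illustrative only and is not ScholarGym's actual interface. The function `gather`, its callback parameters, the stopping rule, and the toy corpus are all assumptions made for this example.

```python
from typing import Callable, List, Set


def gather(question: str,
           plan_queries: Callable[[str, Set[str]], List[str]],
           search: Callable[[str], List[str]],
           is_relevant: Callable[[str, str], bool],
           max_rounds: int = 3) -> Set[str]:
    """Run up to max_rounds of plan -> search -> assess and return selected paper ids."""
    selected: Set[str] = set()
    for _ in range(max_rounds):
        queries = plan_queries(question, selected)   # Query Planning
        if not queries:                              # planner has nothing left to ask
            break
        for query in queries:
            for paper_id in search(query):           # Tool Invocation
                if is_relevant(question, paper_id):  # Relevance Assessment
                    selected.add(paper_id)
    return selected


# Toy stand-ins (hypothetical): a three-paper corpus and keyword matching.
CORPUS = {
    "p1": "dense retrieval for scientific papers",
    "p2": "query decomposition with language models",
    "p3": "image segmentation",
}


def toy_plan(question: str, selected: Set[str]) -> List[str]:
    # Issue sub-queries once, then stop as soon as anything has been selected.
    return [] if selected else ["retrieval", "decomposition", "segmentation"]


def toy_search(query: str) -> List[str]:
    return [pid for pid, title in CORPUS.items() if query in title]


def toy_relevant(question: str, paper_id: str) -> bool:
    return "image" not in CORPUS[paper_id]


result = gather("How do LLMs retrieve literature?", toy_plan, toy_search, toy_relevant)
```

Keeping the three stages as separate callables mirrors the decomposable analysis the abstract argues for: each one can be swapped or scored in isolation while the loop stays fixed.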
Taxonomy
Topics: Scientific Computing and Data Management · Topic Modeling · Information Retrieval and Search Behavior
