ScholarGym: Benchmarking Large Language Model Capabilities in the Information-Gathering Stage of Deep Research

Hao Shen; Hang Yang; Zhouhong Gu; Weili Han

arXiv:2601.21654 · cs.AI · February 18, 2026

TL;DR

ScholarGym provides a structured evaluation environment for the information-gathering stage of deep research with large language models, enabling detailed analysis of individual components and revealing key bottlenecks.

Contribution

ScholarGym introduces a benchmark that isolates and assesses each stage of information gathering (Query Planning, Tool Invocation, and Relevance Assessment), enabling decomposable analysis of language model capabilities in academic literature retrieval.

Findings

01 Iterative query decomposition improves retrieval F1 by 2.9-3.3×

02 Extended thinking models trade recall for precision

03 Query Planning and Relevance Assessment are key performance bottlenecks
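The findings above are stated in terms of retrieval precision, recall, and F1. As a reminder of how these quantities relate, here is a minimal sketch of the standard set-based definitions for a single query. The function name `retrieval_prf` and the paper IDs are hypothetical, and this listing does not say how ScholarGym aggregates scores across queries, so treat this as the textbook formula rather than the benchmark's own scoring code.

```python
def retrieval_prf(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query.

    `retrieved` holds the paper IDs a system returned; `relevant` holds
    the annotated ground-truth IDs. Both names are illustrative.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # correctly retrieved papers
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 4 papers returned, 3 papers annotated as relevant, 2 overlap.
precision, recall, f1 = retrieval_prf(["a", "b", "c", "d"], ["a", "b", "e"])
```

Because F1 is a harmonic mean, a model that gains precision by returning fewer papers can still lose F1 if recall drops enough, which is the trade-off the second finding describes.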

Abstract

Large language models have advanced from single-turn question answering to deep research systems that iteratively decompose research questions, invoke retrieval tools, and synthesize information across multiple rounds. Evaluating such systems typically involves scoring their final research reports holistically, but this end-to-end paradigm tightly couples the language model's decision-making, workflow design, and environmental feedback, precluding decomposable analysis of individual components. We introduce ScholarGym, an evaluation environment that isolates the information-gathering stage of deep research on academic literature. Under a unified workflow, ScholarGym decomposes the research process into three explicit stages -- Query Planning, Tool Invocation, and Relevance Assessment -- and evaluates each against 2,536 expert-annotated queries over a static corpus of 570K papers with…
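The three stages named in the abstract can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions: the function `gather`, its parameters, the stopping rule, and the toy corpus are all hypothetical and are not taken from the ScholarGym code or dataset.

```python
def gather(question, plan, search, assess, max_rounds=3):
    """Illustrative multi-round information-gathering loop (not ScholarGym's API)."""
    kept = set()
    for _ in range(max_rounds):
        queries = plan(question, kept)       # Query Planning: propose sub-queries
        if not queries:                      # assumed stopping rule: planner has nothing new
            break
        for query in queries:
            candidates = search(query)       # Tool Invocation: call the retrieval tool
            for paper_id in candidates:
                if assess(question, paper_id):  # Relevance Assessment: keep or discard
                    kept.add(paper_id)
    return kept


# Toy stand-ins so the sketch runs; a real system would use an LLM and a search index.
CORPUS = {
    "p1": "dense retrieval for scientific papers",
    "p2": "query decomposition for literature search",
    "p3": "image segmentation with transformers",
}

def toy_plan(question, kept):
    # Split the question into one-word sub-queries; stop once anything was kept.
    return [] if kept else question.split()

def toy_search(query):
    # Exact word match against the toy titles.
    return [pid for pid, title in CORPUS.items() if query in title.split()]

def toy_assess(question, paper_id):
    # Hypothetical relevance rule: reject anything about transformers.
    return "transformers" not in CORPUS[paper_id]

result = gather("retrieval decomposition", toy_plan, toy_search, toy_assess)
```

Keeping the three stages as separate callables mirrors the decomposable evaluation the abstract argues for: each one can be swapped out or scored independently of the others.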

Code & Models

Datasets

shenhao/ScholarGym
dataset · 9 dl

Taxonomy

Topics: Scientific Computing and Data Management · Topic Modeling · Information Retrieval and Search Behavior