SciNetBench: A Relation-Aware Benchmark for Scientific Literature Retrieval Agents
Chenyang Shao, Yong Li, Fengli Xu

TL;DR
SciNetBench is a new benchmark designed to evaluate scientific literature retrieval agents' ability to understand and utilize relations within scientific papers, addressing a key limitation of current content-focused retrieval methods.
Contribution
We introduce SciNetBench, the first relation-aware benchmark for scientific literature retrieval, and demonstrate the significant performance gap in current agents' ability to handle relational information.
Findings
Current retrieval agents perform below 20% accuracy on relation-aware tasks.
Providing relational ground truth improves review quality by 23.4%.
Relational understanding is crucial for effective scientific literature retrieval.
Abstract
The rapid development of AI agents has spurred the development of advanced research tools, such as Deep Research. Achieving this requires a nuanced understanding of the relations within scientific literature, which surpasses the scope of keyword-based or embedding-based retrieval. Existing retrieval agents mainly focus on content-level similarities and are unable to decode critical relational dynamics, such as identifying corroborating or conflicting studies or tracing technological lineages, all of which are essential for a comprehensive literature review. Consequently, this fundamental limitation often results in a fragmented knowledge structure, misleading sentiment interpretation, and inadequate modeling of collective scientific progress. To investigate relation-aware retrieval more deeply, we propose SciNetBench, the first Scientific Network Relation-aware Benchmark for literature…
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper introduces a large benchmark, SciNetBench, built on top of OpenAlex (full snapshot), that contains 18M+ AI papers across 177 subfields.
2. Taking into account the relations between papers through citation contexts is good, although not a novel idea.
3. The baselines used in the paper cover embedding-based, agentic search, and deep research agents.
1. Although the dataset is interesting, I find the novelty of this paper limited. I also wonder whether the components added to the overall system introduce biases. First, regarding the novelty of the paper: besides putting together some metrics from the literature (e.g., novelty by Uzzi et al. or disruption by Funk and Owen-Smith), I do not see the novelty in the creation of this new benchmark. Second, for the novelty / disruption, the question that is used in the paper
1. This paper is well-written and clearly structured, making it easy to follow. The figures are well designed and effectively facilitate the understanding of both dataset construction and ground-truth labels.
2. Retrieving scientific literature with relation-aware conditions is a highly interesting research task. I believe this study holds significant potential for real-world applications.
1. The scope of the proposed benchmark dataset is limited, as it includes only AI-related papers and therefore lacks the generality needed to thoroughly evaluate relation-aware retrieval capabilities across diverse scenarios.
2. Although a key novelty of this benchmark lies in its construction of three types of queries, the total number of queries remains relatively small (only 1,000), which hinders effective model training and comprehensive evaluation. Moreover, it is unclear why the authors o
- The authors did not rely on the more common topical matching and instead extended it to network‑aware retrieval, which is important for actual research workflows.
- The three task granularities that the authors proposed are well‑motivated, technically sound, and complementary to each other.
- The authors adopted robust metrics such as Uzzi novelty, Funk–Owen‑Smith disruption, and credit allocation.
- SciNetBench is a large-scale, open-source benchmark, which will be very valuable for the community.
-
- The ground‑truth path heuristic may be biased toward popularity. Selecting the path with maximum cumulative citations introduces the risk of preferring older/highly cited detours over the widely accepted or methodologically coherent trajectories.
- Novelty‑LLM / Disruption‑LLM and Rationality‑LLM rely on a single model. The authors should consider multi‑judge aggregation or report inter‑rater reliability.
- Some of the strongest baselines (web‑search and deep‑research agents) are proprietary API-based and chang
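The popularity concern raised in the last review can be illustrated with a small sketch. It assumes, purely for illustration, that the citation network is a DAG whose edges point from an earlier paper to a later paper that builds on it, and that the ground-truth path is the one with the largest summed citation count. The function name `max_citation_path`, the paper IDs, and the citation counts are hypothetical; this is not the authors' implementation.

```python
from functools import lru_cache


def max_citation_path(edges, citations, source, target):
    """Return (total_citations, path) for the source->target path whose
    nodes have the largest summed citation count, or None if no path exists.

    Assumes the graph is acyclic (a DAG); edges are (earlier, later) pairs.
    """
    successors = {}
    for earlier, later in edges:
        successors.setdefault(earlier, []).append(later)

    @lru_cache(maxsize=None)
    def best(node):
        # Base case: reaching the target ends the path.
        if node == target:
            return citations[node], (node,)
        # Keep only successors from which the target is reachable.
        options = [best(nxt) for nxt in successors.get(node, [])]
        options = [opt for opt in options if opt is not None]
        if not options:
            return None
        total, path = max(options, key=lambda opt: opt[0])
        return citations[node] + total, (node,) + path

    return best(source)


# Toy network (made-up numbers): two routes from A to D.
# B is a very highly cited paper; C is a lightly cited one.
toy_edges = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
toy_citations = {"A": 100, "B": 5000, "C": 40, "D": 10}

result = max_citation_path(toy_edges, toy_citations, "A", "D")
print(result)  # (5110, ('A', 'B', 'D'))
```

In this toy case the heuristic always routes through B, because its citation count dominates the sum regardless of whether B or C is the more coherent methodological step. That is the kind of bias the reviewer describes.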
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Advanced Graph Neural Networks
