AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang

TL;DR
This paper introduces AgenticRAGTracer, a new automatically constructed benchmark for multi-step reasoning in agentic retrieval-augmented generation, enabling detailed diagnosis of model reasoning failures across multiple domains.
Contribution
It presents the first large-scale, automatically generated benchmark for step-by-step validation of agentic RAG models, supporting fine-grained analysis of reasoning capabilities.
Findings
Large language models perform poorly on the benchmark, with GPT-5 achieving only 22.6% accuracy.
Failures are mainly due to distorted reasoning chains, such as premature collapse or wandering.
The benchmark reveals critical reasoning deficiencies not captured by traditional evaluation methods.
Abstract
With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG…
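The abstract's point about hop-level questions can be made concrete with a small sketch. The code below is illustrative only and is not the benchmark's actual schema or scoring code: the `Hop` record, the `normalize` helper, the `first_failed_hop` function, and the toy questions are all hypothetical names and data introduced here. It assumes each multi-hop item carries an ordered list of intermediate questions with gold answers, and that answers are compared by a simple normalized exact match.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    """One intermediate step of a multi-hop item (hypothetical schema)."""
    question: str
    gold_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def first_failed_hop(hops: List[Hop], predictions: List[str]) -> Optional[int]:
    """Return the 1-based index of the first hop answered incorrectly.

    A missing prediction counts as a failure at that hop.
    Returns None when every hop matches its gold answer.
    """
    for i, hop in enumerate(hops):
        pred = predictions[i] if i < len(predictions) else ""
        if normalize(pred) != normalize(hop.gold_answer):
            return i + 1
    return None


# Toy data, invented for illustration only.
hops = [
    Hop("Who directed Film X?", "Jane Doe"),
    Hop("Where was Jane Doe born?", "Lyon"),
]

print(first_failed_hop(hops, ["jane doe", "Paris"]))
```

With only a final question and answer, the second prediction above would register as a single wrong answer. With hop-level gold answers, the failure can be attributed to a specific step.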
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
