AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG

Qijie You; Wenkai Yu; Wentao Zhang

arXiv:2602.19127·cs.CL·February 24, 2026

TL;DR

This paper introduces AgenticRAGTracer, a new automatically constructed benchmark for multi-step reasoning in agentic retrieval-augmented generation, enabling detailed diagnosis of model reasoning failures across multiple domains.

Contribution

It presents the first large-scale, automatically generated benchmark for step-by-step validation of agentic RAG models, supporting fine-grained analysis of reasoning capabilities.

Findings

1. Large language models perform poorly on the benchmark, with GPT-5 achieving only 22.6% accuracy.

2. Failures are mainly due to distorted reasoning chains, such as premature collapse or wandering.

3. The benchmark reveals critical reasoning deficiencies not captured by traditional evaluation methods.

Abstract

With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, while lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from analyzing at which step an agent fails and restricts more fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, which is both time-consuming and labor-intensive, while also limiting scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG…
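To make the idea of hop-level diagnosis concrete, the following is a minimal sketch of how intermediate hop questions could be used to locate the step at which an agent first fails. It is not the paper's evaluation code: the `Hop` schema, the `normalize` helper, the exact-match criterion, and the toy two-hop chain are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    """One intermediate step of a multi-hop chain (hypothetical schema)."""
    question: str     # hop-level question leading toward the final query
    gold_answer: str  # reference answer for this hop


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def first_failed_hop(hops: List[Hop], predictions: List[str]) -> Optional[int]:
    """Return the 0-based index of the first hop the agent got wrong.

    A hop counts as failed if the agent gave no prediction for it or if the
    normalized prediction differs from the normalized gold answer (exact
    match is an assumption; a real benchmark may use a softer metric).
    Returns None when every hop matches.
    """
    for i, hop in enumerate(hops):
        if i >= len(predictions):
            return i
        if normalize(predictions[i]) != normalize(hop.gold_answer):
            return i
    return None


# Toy chain, made up for illustration only (not taken from the dataset).
chain = [
    Hop("Who directed the film Inception?", "Christopher Nolan"),
    Hop("In which city was the director of Inception born?", "London"),
]
agent_answers = ["christopher nolan", "Los Angeles"]

print(first_failed_hop(chain, agent_answers))  # prints 1
```

With only a final question and answer, the example above would register a single wrong answer; with hop-level references, the failure can be attributed to the second step specifically.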

Code & Models

Datasets

YqjMartin/AgenticRAGTracer
dataset · 54 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks