Total Recall QA: A Verifiable Evaluation Suite for Deep Research Agents

Mahta Rafiee; Heydar Soudani; Zahra Abbasiantaeb; Mohammad Aliannejadi; Faegheh Hasibi; Hamed Zamani

arXiv:2603.18516·cs.IR·March 20, 2026

Open Access

TL;DR

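This paper introduces Total Recall QA (TRQA), an evaluation framework and benchmark for deep research agents: LLM-based systems that perform multi-step information seeking and reasoning over large, open-domain sources. It targets requirements that existing benchmarks do not fully satisfy, using single-answer, total recall queries that can be evaluated precisely and are backed by structured, large-scale datasets.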

Contribution

The paper identifies a list of requirements and optional properties for evaluating deep research agents, proposes an evaluation framework that satisfies them, and builds TRQA, a benchmark that combines real-world and synthetic data.

Findings

01 Established baseline retrieval and end-to-end results.

02 Demonstrated the framework's ability to evaluate complex research agents.

03 Provided a scalable, structured approach for future evaluations.

Abstract

Deep research agents have emerged as LLM-based systems designed to perform multi-step information seeking and reasoning over large, open-domain sources to answer complex questions by synthesizing information from multiple information sources. Given the complexity of the task and despite various recent efforts, evaluation of deep research agents remains fundamentally challenging. This paper identifies a list of requirements and optional properties for evaluating deep research agents. We observe that existing benchmarks do not satisfy all identified requirements. Inspired by prior research on TREC Total Recall Tracks, we introduce the task of Total Recall Question Answering and develop a framework for deep research agents evaluation that satisfies the identified criteria. Our framework constructs single-answer, total recall queries with precise evaluation and relevance judgments derived…

Taxonomy

Topics: Topic Modeling · Information Retrieval and Search Behavior · Expert finding and Q&A systems