Total Recall QA: A Verifiable Evaluation Suite for Deep Research Agents
Mahta Rafiee, Heydar Soudani, Zahra Abbasiantaeb, Mohammad Aliannejadi, Faegheh Hasibi, Hamed Zamani

TL;DR
This paper introduces TRQA (Total Recall QA), an evaluation framework and benchmark for deep research agents, LLM-based systems that perform multi-step information seeking and reasoning over large, open-domain sources. It targets the gaps in existing benchmarks with structured, large-scale datasets.
Contribution
The paper identifies requirements for evaluating deep research agents, proposes an evaluation framework that satisfies them, and builds TRQA, a benchmark that combines real-world and synthetic data.
Findings
Established baseline retrieval and end-to-end results.
Demonstrated the framework's ability to evaluate complex research agents.
Provided a scalable, structured approach for future evaluations.
Abstract
Deep research agents have emerged as LLM-based systems designed to perform multi-step information seeking and reasoning over large, open-domain sources to answer complex questions by synthesizing information from multiple information sources. Given the complexity of the task and despite various recent efforts, evaluation of deep research agents remains fundamentally challenging. This paper identifies a list of requirements and optional properties for evaluating deep research agents. We observe that existing benchmarks do not satisfy all identified requirements. Inspired by prior research on TREC Total Recall Tracks, we introduce the task of Total Recall Question Answering and develop a framework for deep research agents evaluation that satisfies the identified criteria. Our framework constructs single-answer, total recall queries with precise evaluation and relevance judgments derived…
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Expert finding and Q&A systems
