Benchmarking Deep Search over Heterogeneous Enterprise Data
Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu

TL;DR
This paper introduces a benchmark for evaluating deep search over enterprise data: source-aware, multi-hop reasoning across heterogeneous sources such as documents, meeting transcripts, Slack messages, GitHub content, and URLs. The benchmark is intended to expose weaknesses in retrieval-augmented generation (RAG) systems.
Contribution
It provides a synthetic data pipeline that simulates realistic business workflows, together with a benchmark whose retrieval pool holds 39,190 enterprise artifacts, for assessing source-aware, multi-hop retrieval in complex enterprise environments.
Findings
Deep search remains a significant bottleneck in current RAG systems.
Existing methods often retrieve incomplete evidence, affecting reasoning accuracy.
Performance scores indicate substantial room for improvement in deep search techniques.
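The finding about incomplete evidence can be made concrete with a small sketch. The code below is not the paper's evaluation code: the `Query` schema, the artifact IDs, and the `evidence_recall` function are hypothetical names chosen for illustration. It assumes each multi-hop question comes with a set of artifact IDs that together support the answer, and measures what fraction of them a retriever returned.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """One benchmark question (hypothetical schema, not the paper's format)."""
    question: str
    # IDs of the artifacts that must all be retrieved to answer the question.
    gold_evidence: set[str] = field(default_factory=set)
    answerable: bool = True


def evidence_recall(query: Query, retrieved_ids: list[str]) -> float:
    """Fraction of the gold evidence artifacts found in the retrieved list."""
    if not query.gold_evidence:
        # Nothing to find, e.g. for an unanswerable query with no evidence.
        return 1.0
    hits = query.gold_evidence & set(retrieved_ids)
    return len(hits) / len(query.gold_evidence)


def is_complete(query: Query, retrieved_ids: list[str]) -> bool:
    """True only when every gold evidence artifact was retrieved."""
    return evidence_recall(query, retrieved_ids) == 1.0


# A two-hop question whose answer needs both a Slack thread and a pull request.
q = Query(
    question="Which Slack thread led to the fix in the pull request?",
    gold_evidence={"slack_17", "github_pr_4"},
)
partial = ["slack_17", "doc_2"]           # second hop is missing
full = ["doc_2", "github_pr_4", "slack_17"]

print(evidence_recall(q, partial))  # 0.5
print(is_complete(q, full))         # True
```

Under this sketch, a retriever that finds only one of two required artifacts scores 0.5, so a multi-hop question can fail even when the first hop succeeds.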
Abstract
We present a new benchmark for evaluating Deep Search, a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries, and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an…
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems
