Benchmarking Deep Search over Heterogeneous Enterprise Data

Prafulla Kumar Choubey; Xiangyu Peng; Shilpa Bhagavath; Kung-Hsiang Huang; Caiming Xiong; Chien-Sheng Wu

arXiv:2506.23139 · cs.CL · July 1, 2025

TL;DR

This paper introduces a benchmark for evaluating Deep Search over enterprise data: source-aware, multi-hop reasoning across heterogeneous sources such as documents, meeting transcripts, Slack messages, GitHub, and URLs. It targets long-context LLM and retrieval-augmented generation (RAG) systems.

Contribution

It provides a synthetic data pipeline that simulates realistic business workflows, together with a benchmark of answerable and unanswerable queries over a retrieval pool of 39,190 enterprise artifacts, for assessing source-aware, multi-hop retrieval.

Findings

01

Deep search remains a significant bottleneck in current RAG systems.

02

Existing methods often retrieve incomplete evidence, affecting reasoning accuracy.

03

Performance scores indicate substantial room for improvement in deep search techniques.
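
The second finding, that retrieved evidence is often incomplete, can be made concrete with a simple coverage measure. The following is a minimal sketch, not the paper's evaluation code: it assumes each multi-hop question comes with a set of gold evidence artifact IDs, and the function names and IDs are hypothetical.

```python
def evidence_recall(retrieved_ids, gold_ids):
    """Fraction of gold evidence artifacts present in the retrieved set."""
    gold = set(gold_ids)
    if not gold:
        # Assumption: a question with no gold evidence counts as fully covered.
        return 1.0
    return len(gold & set(retrieved_ids)) / len(gold)


def is_complete(retrieved_ids, gold_ids):
    """True only if every gold evidence artifact was retrieved."""
    return set(gold_ids) <= set(retrieved_ids)


# Hypothetical artifact IDs for one three-hop question.
gold = ["doc_12", "slack_48", "pr_7"]
retrieved = ["doc_12", "slack_48", "url_3", "meeting_9"]

recall = evidence_recall(retrieved, gold)
complete = is_complete(retrieved, gold)
print(f"evidence recall = {recall:.2f}, complete = {complete}")
```

Here two of the three gold artifacts are retrieved, so a multi-hop answer that depends on the missing one cannot be fully supported even though most of the evidence is present.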

Abstract

We present a new benchmark for evaluating Deep Search--a realistic and complex form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. These include documents, meeting transcripts, Slack messages, GitHub, and URLs, which vary in structure and often contain human-to-human interactions. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages, generating interconnected content with realistic noise and multi-hop questions with guaranteed ground-truth answers. We release our benchmark with both answerable and unanswerable queries, and a retrieval pool of 39,190 enterprise artifacts, enabling fine-grained evaluation of long-context LLM and RAG systems. Our experiments reveal that even the best-performing agentic RAG methods achieve an…

Code & Models

Datasets

Salesforce/HERB
dataset · 55 dl

Videos

Benchmarking Deep Search over Heterogeneous Enterprise Data · underline

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems