EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge
Yuhong Sun, Joachim Rahmfeld, Chris Weaver, Weijia Chen, Roshan Desai, Wenxi Huang, Mark H. Butler

TL;DR
EnterpriseRAG-Bench provides a synthetic but realistic dataset and evaluation framework for testing retrieval-augmented generation (RAG) systems on company-internal knowledge, a setting that existing benchmarks built on web or other public sources do not cover.
Contribution
The paper introduces a large-scale, multi-source synthetic enterprise corpus together with its generation framework and a leaderboard, enabling realistic benchmarking of RAG systems on data that resembles proprietary company knowledge.
Findings
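The dataset includes approximately 500,000 documents across nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence).
The benchmark comprises 500 questions across ten categories, each testing distinct retrieval and reasoning capabilities.
The generation framework allows customization for different industries and data sources.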
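
As an illustration of how results on a question set organized by category could be broken down, here is a minimal sketch that computes recall@k per category. It is not the benchmark's official scoring: the metric choice, the field names (`category`, `retrieved`, `relevant`), the category labels, and the document IDs are all assumptions made for this example, since the summary above does not specify them.

```python
from collections import defaultdict


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant document IDs that appear in the top-k retrieved IDs."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


def recall_by_category(questions, k=5):
    """Average recall@k over the questions in each category."""
    scores = defaultdict(list)
    for q in questions:
        scores[q["category"]].append(recall_at_k(q["retrieved"], q["relevant"], k))
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


# Hypothetical toy data: category labels and document IDs are invented for illustration.
questions = [
    {
        "category": "single_source",
        "retrieved": ["slack-1", "jira-7", "gmail-3"],
        "relevant": ["slack-1"],
    },
    {
        "category": "multi_source",
        "retrieved": ["jira-7", "gmail-3", "slack-9"],
        "relevant": ["jira-7", "confluence-2"],
    },
]

print(recall_by_category(questions, k=3))
```

Reporting one score per category, rather than a single aggregate, makes it easier to see which retrieval or reasoning capability a system struggles with.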
Abstract
Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI Agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and…
