Overcoming the "Impracticality" of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework
Kenichirou Narita, Siqi Peng, Taku Fukui, Moyuru Yamada, Satoshi Munakata, Satoru Takahashi

TL;DR
This paper introduces a multi-dimensional diagnostic framework and an enterprise RAG benchmark for evaluating and diagnosing the real-world deployment challenges that simple accuracy scores miss.
Contribution
It proposes a novel four-axis difficulty taxonomy and integrates it into a benchmark to systematically diagnose RAG system weaknesses in enterprise settings.
Findings
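Existing academic benchmarks do not systematically diagnose the interlocking challenges of enterprise RAG: reasoning complexity, retrieval difficulty, diverse document structure, and operational explainability requirements.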
The framework reveals specific weaknesses in RAG systems.
The benchmark aids in aligning model performance with operational needs.
Abstract
Performance evaluation of Retrieval-Augmented Generation (RAG) systems within enterprise environments is governed by multi-dimensional and composite factors extending far beyond simple final accuracy checks. These factors include reasoning complexity, retrieval difficulty, the diverse structure of documents, and stringent requirements for operational explainability. Existing academic benchmarks fail to systematically diagnose these interlocking challenges, resulting in a critical gap where models achieving high performance scores fail to meet the expected reliability in practical deployment. To bridge this discrepancy, this research proposes a multi-dimensional diagnostic framework by defining a four-axis difficulty taxonomy and integrating it into an enterprise RAG benchmark to diagnose potential system weaknesses.
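As a rough illustration of what diagnosing along four difficulty axes could look like in practice, the sketch below tags each benchmark item with a level per axis and reports accuracy per axis and level instead of a single overall score. It is not the authors' implementation. The axis names are paraphrased from the four factors listed in the abstract, and the "low"/"high" levels, the `Item` class, and the `accuracy_by_axis` function are hypothetical names introduced only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical axis names, paraphrased from the four factors named in the abstract.
AXES = ("reasoning", "retrieval", "structure", "explainability")


@dataclass
class Item:
    question_id: str
    difficulty: dict  # axis name -> hypothetical level label, e.g. "low" or "high"
    correct: bool     # whether the RAG system answered this item correctly


def accuracy_by_axis(items):
    """Return {axis: {level: accuracy}} so weak spots are visible per axis."""
    counts = defaultdict(lambda: [0, 0])  # (axis, level) -> [num_correct, num_total]
    for item in items:
        for axis in AXES:
            level = item.difficulty[axis]
            counts[(axis, level)][0] += int(item.correct)
            counts[(axis, level)][1] += 1
    report = {axis: {} for axis in AXES}
    for (axis, level), (num_correct, total) in counts.items():
        report[axis][level] = num_correct / total
    return report


def _levels(**overrides):
    """Build a difficulty dict that defaults every axis to "low"."""
    levels = {axis: "low" for axis in AXES}
    levels.update(overrides)
    return levels


# Toy items invented for illustration only; they are not data from the paper.
items = [
    Item("q1", _levels(), True),
    Item("q2", _levels(reasoning="high"), False),
    Item("q3", _levels(reasoning="high", retrieval="high"), True),
]

report = accuracy_by_axis(items)
for axis in AXES:
    print(axis, report[axis])
```

Breaking results out this way shows, for instance, whether failures concentrate on items tagged as hard for reasoning rather than for retrieval, which a single aggregate accuracy figure would hide.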
