CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents
Minsoo Khang, Sangjun Park, Teakgyu Hong, Dawoon Jung

TL;DR
CReSt is a benchmark that evaluates large language models on retrieval-augmented generation tasks requiring complex reasoning, understanding of structured documents, appropriate refusal to answer, and precise citation. Prior work addresses these capabilities only individually.
Contribution
This work introduces CReSt, a unified benchmark with 2,245 examples in English and Korean, and a tailored evaluation methodology for assessing LLMs on multiple practical RAG capabilities.
Findings
Even advanced LLMs struggle to perform consistently across the key RAG dimensions.
CReSt reveals significant gaps in models' reasoning and structural understanding.
The benchmark highlights areas for future improvement in RAG system development.
Abstract
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer when appropriate, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, reliability, and structural understanding. While some prior work addresses these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises…
