CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents

Minsoo Khang; Sangjun Park; Teakgyu Hong; Dawoon Jung

arXiv:2505.17503 · cs.CL · May 26, 2025

TL;DR

CReSt is a comprehensive benchmark designed to evaluate large language models' abilities in retrieval-augmented generation tasks involving complex reasoning, structured document understanding, and responsible response handling, addressing a critical gap in current evaluation methods.

Contribution

This work introduces CReSt, a unified benchmark with 2,245 examples in English and Korean, and a tailored evaluation methodology for assessing LLMs on multiple practical RAG capabilities.

Findings

1. Advanced LLMs struggle with consistent performance across key RAG dimensions.

2. CReSt reveals significant gaps in models' reasoning and structural understanding.

3. The benchmark highlights areas for future improvement in RAG system development.

Abstract

Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer when appropriate, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises…
