ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
Zichun Guo, Yuling Shi, Wenhao Zeng, Chao Hu, Haotian Lin, Terry Yue Zhuo, Jiawei Chen, Xiaodong Gu, Wenping Ma

TL;DR
ShredBench is a new benchmark for evaluating multimodal large language models' ability to reconstruct shredded documents, revealing significant challenges in semantic reasoning across visual discontinuities.
Contribution
The paper introduces ShredBench, a systematic evaluation framework with an automated generation pipeline for assessing the Visually Rich Document Understanding (VRDU) capabilities of multimodal LLMs (MLLMs) on shredded documents across four scenarios: English, Chinese, code, and tables.
Findings
MLLMs perform well on intact documents but struggle with shredded content.
Performance drops sharply as fragmentation increases, indicating difficulty in visual-semantic reasoning.
Current MLLMs lack the fine-grained reasoning needed for robust document reconstruction.
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8,…
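To make the task setup concrete, here is a minimal sketch of a shred-and-shuffle step. It is not the authors' pipeline: the abstract says fragments are rendered from Markdown, while this toy version works on a plain character grid as a stand-in for a rendered page and cuts only vertical strips. The function names (`to_grid`, `shred_vertical`, `reassemble`), the strip-based cut, and the seed are all assumptions made for illustration; only the granularity of 8 pieces appears in the abstract.

```python
import random


def to_grid(text):
    """Pad every line to the same width so the page is a rectangular grid."""
    lines = text.splitlines()
    width = max(len(line) for line in lines)
    return [line.ljust(width) for line in lines]


def shred_vertical(grid, n_pieces, seed=0):
    """Cut a grid into n_pieces vertical strips and shuffle them.

    Returns the shuffled strips and the permutation used, where
    order[pos] is the original index of the strip now at position pos.
    """
    width = len(grid[0])
    if n_pieces > width:
        raise ValueError("more pieces than columns")
    # Column boundaries, spread as evenly as integer division allows.
    bounds = [i * width // n_pieces for i in range(n_pieces + 1)]
    strips = [
        [row[bounds[i]:bounds[i + 1]] for row in grid]
        for i in range(n_pieces)
    ]
    order = list(range(n_pieces))
    random.Random(seed).shuffle(order)
    return [strips[i] for i in order], order


def reassemble(shuffled, order):
    """Invert the shuffle using the known permutation (the ground truth)."""
    original = [None] * len(order)
    for pos, idx in enumerate(order):
        original[idx] = shuffled[pos]
    # Join the matching row from every strip, left to right.
    return ["".join(parts) for parts in zip(*original)]


# Hypothetical sample page; any Markdown text would do.
sample = "# Report\n\nShredded pages break words across strips.\n- item one\n- item two"
grid = to_grid(sample)
strips, order = shred_vertical(grid, n_pieces=8, seed=1)
restored = reassemble(strips, order)
```

In a benchmark of this kind the model sees only the shuffled fragments and must recover the content without `order`. The `reassemble` helper here just confirms that the shredding step loses no information.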
