From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

Yandi Wang; Libin Zhan; Ziwei Huang; Tiancheng Luo; Yuxuan Jiang; Wang Dong; Leilei Gan; Jun Chen

arXiv:2605.22413 · cs.CV · May 22, 2026

TL;DR

This paper introduces ReceiptBench, a large-scale benchmark for receipt document understanding, and proposes a two-stage training framework with reinforcement learning to improve multimodal models' reasoning capabilities.

Contribution

The paper presents ReceiptBench, a human-annotated benchmark of 10k receipts organized into four hierarchical sub-tasks, and a two-stage training framework that incorporates Metric-Aware Group Relative Policy Optimization to strengthen reasoning in multimodal models.
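
Group Relative Policy Optimization scores each sampled response relative to the other responses drawn for the same input. The sketch below illustrates only that normalization step, assuming the reward is a field-level exact-match F1. The reward choice, the function names, and the sample values are illustrative assumptions, not the paper's definition of "metric-aware".

```python
from statistics import mean, pstdev


def field_f1(pred: dict, gold: dict) -> float:
    """Exact-match F1 over key/value pairs (hypothetical reward metric)."""
    if not pred and not gold:
        return 1.0
    true_pos = sum(1 for key, value in pred.items() if gold.get(key) == value)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up example: three sampled extractions for one receipt.
gold = {"total": "12.50", "date": "2026-05-22"}
samples = [
    {"total": "12.50", "date": "2026-05-22"},  # fully correct
    {"total": "12.50"},                        # misses the date
    {"total": "1250"},                         # wrong value
]
rewards = [field_f1(sample, gold) for sample in samples]
advantages = group_relative_advantages(rewards)
```

Because advantages are centered within the group, a response is reinforced only to the extent that it beats its siblings on the metric, which is why the choice of reward metric matters in this setup.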

Findings

01 Our method achieves state-of-the-art performance on ReceiptBench.

02 Models trained with our framework outperform proprietary models on complex reasoning tasks.

03 ReceiptBench covers diverse receipt types and detailed sub-tasks for comprehensive evaluation.

Abstract

Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group…
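
The four sub-tasks named in the abstract can be pictured as a per-category breakdown of extraction accuracy. The sketch below is a minimal illustration under stated assumptions: the field names, the field-to-sub-task mapping, the exact-match scoring, and the receipt values are all hypothetical and are not taken from ReceiptBench.

```python
from collections import defaultdict

# Hypothetical mapping from output fields to the four sub-tasks.
SUBTASK_OF = {
    "merchant_raw": "basic_perception",       # text as printed on the receipt
    "date_iso": "format_normalization",       # date rewritten to a fixed format
    "expense_category": "semantic_reasoning", # attribute inferred from context
    "line_items": "structure_parsing",        # nested list of purchased items
}


def per_subtask_accuracy(pred: dict, gold: dict) -> dict:
    """Exact-match accuracy per sub-task over the fields present in gold."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for field, subtask in SUBTASK_OF.items():
        if field not in gold:
            continue
        totals[subtask] += 1
        hits[subtask] += int(pred.get(field) == gold[field])
    return {subtask: hits[subtask] / totals[subtask] for subtask in totals}


# Made-up annotation and prediction for a single receipt.
gold_receipt = {
    "merchant_raw": "CAFE LUNA",
    "date_iso": "2026-05-22",
    "expense_category": "meals",
    "line_items": [{"name": "Latte", "qty": 2, "unit_price": "4.50"}],
}
pred_receipt = {
    "merchant_raw": "CAFE LUNA",
    "date_iso": "05/22/2026",  # read correctly but not normalized
    "expense_category": "meals",
    "line_items": [{"name": "Latte", "qty": 2, "unit_price": "4.50"}],
}
scores = per_subtask_accuracy(pred_receipt, gold_receipt)
```

Here the prediction reads the date correctly but ignores the formatting instruction, so it fails only the normalization category. Separating the categories makes that kind of error visible instead of folding it into one aggregate score.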


Code & Models

Repositories

wwwT0ri/ReceiptBench (GitHub)
