TL;DR
This paper introduces ReceiptBench, a large-scale benchmark for receipt document understanding, and proposes a two-stage training framework with reinforcement learning to improve multimodal models' reasoning capabilities.
Contribution
The paper presents ReceiptBench, a human-annotated benchmark of 10k receipts organized into four hierarchical sub-tasks, and a two-stage training method that uses Metric-Aware Group Relative Policy Optimization to strengthen reasoning in multimodal models.
Findings
Our method achieves state-of-the-art performance on ReceiptBench.
Models trained with our framework outperform proprietary models on complex reasoning tasks.
ReceiptBench covers diverse receipt types and evaluates four sub-tasks: Basic Perception, Format Normalization, Semantic Reasoning, and Structure Parsing.
Abstract
Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group…
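The abstract is cut off before it describes the reward design, so the following is only a minimal sketch of the general idea behind group-relative policy optimization with a metric-based reward. It is not the paper's implementation. The reward function `field_f1`, the field names, and the sample predictions are all hypothetical. The only standard ingredient is the group-relative advantage: each sampled response is scored, and its reward is normalized by the mean and standard deviation of its own sampling group.

```python
from statistics import mean, pstdev


def field_f1(pred: dict, gold: dict) -> float:
    """Hypothetical metric reward: F1 over exactly matching (key, value) pairs."""
    if not pred or not gold:
        return 0.0
    # A predicted field counts as a hit only if the key exists in gold
    # and the value matches exactly.
    hits = sum(1 for key, value in pred.items() if gold.get(key) == value)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each reward against the statistics of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical gold annotation and three sampled model outputs for one receipt.
gold = {"total": "12.50", "date": "2024-01-05"}
samples = [
    {"total": "12.50", "date": "2024-01-05"},  # both fields correct
    {"total": "12.50"},                        # one field missing
    {"total": "1250"},                         # wrong value
]

rewards = [field_f1(sample, gold) for sample in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value model is needed to form the baseline.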
