Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models
Yichao Jin, Yushuo Wang, Qishuai Zhong, Kent Chiu Jin-Chun, Kenneth Zhu Ke, Donald MacDonald

TL;DR
This paper presents a multistage pipeline combining image processing, OCR, and compact vision-language models to efficiently extract structured information from complex, multilingual financial documents, significantly reducing computational costs and improving accuracy.
Contribution
The authors introduce a novel multistage approach that integrates traditional image processing, OCR, and compact VLMs for scalable financial document analysis, outperforming large VLMs in accuracy and efficiency.
Findings
Achieves 8.8x higher field-level accuracy than large VLMs.
Reduces GPU cost to 0.7% of that required by large VLMs.
Lowers end-to-end latency by 92.6%.
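To put the cost and latency figures on the same footing as the accuracy multiplier, the percentages above can be converted into ratios. The following is only a small arithmetic sketch: the function names are hypothetical, and it assumes the 0.7% and 92.6% figures are measured against the same large-VLM baseline.

```python
def cost_reduction_factor(remaining_cost_percent: float) -> float:
    """Return how many times cheaper a system is, given its cost as a percent of the baseline."""
    return 100.0 / remaining_cost_percent


def speedup_from_latency_reduction(reduction_percent: float) -> float:
    """Return the speedup implied by a percentage reduction in latency."""
    return 100.0 / (100.0 - reduction_percent)


# Figures taken from the findings listed above.
gpu_factor = cost_reduction_factor(0.7)
latency_factor = speedup_from_latency_reduction(92.6)

print(f"GPU cost: roughly {gpu_factor:.0f}x lower than the baseline")
print(f"Latency: roughly {latency_factor:.1f}x faster than the baseline")
```

Under these assumptions, a cost of 0.7% of the baseline corresponds to roughly a 143x reduction, and a 92.6% latency reduction corresponds to roughly a 13.5x speedup.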
Abstract
Financial documents are essential sources of information for regulators, auditors, and financial institutions, particularly for assessing the wealth and compliance of Small and Medium-sized Businesses (SMBs). However, SMB documents are often difficult to parse. They are rarely born digital and are instead distributed as scanned images that are not machine readable. The scans themselves are low in resolution, affected by skew or rotation, and often contain noisy backgrounds. These documents also tend to be heterogeneous, mixing narratives, tables, figures, and multilingual content within the same report. Such characteristics pose major challenges for automated information extraction, especially when relying on end-to-end large Vision-Language Models (VLMs), which are computationally expensive, sensitive to noise, and slow when applied to files with hundreds of pages. We propose a multistage…
