Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices
Parshva Dhilankumar Patel

TL;DR
This paper introduces an OCR-based pipeline that extracts structured table data from invoices, improving accuracy and robustness for financial automation and archiving tasks.
Contribution
It combines Tesseract OCR with custom post-processing logic designed specifically for noisy and non-standard invoice formats.
Findings
Improved accuracy in table data extraction from invoices.
Robustness to noisy and non-standard invoice formats.
Supports real-world financial workflows.
Abstract
This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extraction accuracy and consistency, supporting real-world use cases such as automated financial workflows and digital archiving.
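The abstract does not spell out the post-processing logic, so the following is only a minimal sketch of what row-column mapping over OCR word boxes might look like, not the author's implementation. It assumes each word arrives as a dict with `left`, `top`, `width`, `height`, and `text` keys (the same field names that pytesseract's `image_to_data` reports). The function names and the `y_tol` and `min_gap` thresholds are hypothetical choices made for illustration.

```python
from statistics import median


def group_rows(words, y_tol=None):
    """Cluster word boxes into rows by comparing vertical centres."""
    if not words:
        return []
    if y_tol is None:
        # Assumption: half the median word height is a usable row tolerance.
        y_tol = 0.5 * median(w["height"] for w in words)

    def centre_y(w):
        return w["top"] + w["height"] / 2

    ordered = sorted(words, key=centre_y)
    rows = [[ordered[0]]]
    for w in ordered[1:]:
        current = rows[-1]
        row_cy = sum(centre_y(p) for p in current) / len(current)
        if abs(centre_y(w) - row_cy) <= y_tol:
            current.append(w)
        else:
            rows.append([w])
    # Read each row left to right.
    return [sorted(r, key=lambda w: w["left"]) for r in rows]


def column_spans(words, min_gap=20):
    """Merge horizontal word extents; gaps of at least min_gap split columns."""
    intervals = sorted((w["left"], w["left"] + w["width"]) for w in words)
    spans = [list(intervals[0])]
    for left, right in intervals[1:]:
        if left - spans[-1][1] < min_gap:
            spans[-1][1] = max(spans[-1][1], right)
        else:
            spans.append([left, right])
    return spans


def to_table(words, min_gap=20):
    """Map word boxes onto a list of rows, each a list of cell strings."""
    if not words:
        return []
    spans = column_spans(words, min_gap)
    table = []
    for row in group_rows(words):
        cells = [[] for _ in spans]
        for w in row:
            cx = w["left"] + w["width"] / 2
            # Every word lies inside exactly one merged span by construction.
            idx = next(i for i, (lo, hi) in enumerate(spans) if lo <= cx <= hi)
            cells[idx].append(w["text"])
        table.append([" ".join(c) for c in cells])
    return table


# Hand-made boxes standing in for OCR output of a two-row invoice table.
sample_words = [
    {"left": 10, "top": 10, "width": 90, "height": 12, "text": "Description"},
    {"left": 200, "top": 10, "width": 30, "height": 12, "text": "Qty"},
    {"left": 300, "top": 11, "width": 40, "height": 12, "text": "Price"},
    {"left": 10, "top": 40, "width": 35, "height": 12, "text": "Blue"},
    {"left": 50, "top": 41, "width": 50, "height": 12, "text": "widget"},
    {"left": 210, "top": 40, "width": 8, "height": 12, "text": "2"},
    {"left": 300, "top": 40, "width": 35, "height": 12, "text": "9.50"},
]

for row in to_table(sample_words):
    print(row)
```

Projecting every word onto the x-axis and splitting at wide whitespace gaps tolerates the small vertical jitter typical of scanned pages, but it would need tuning (or a different boundary detector) for invoices with ruled lines, merged cells, or skewed scans.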
Taxonomy
Topics: Handwritten Text Recognition Techniques · Digital Media Forensic Detection · Currency Recognition and Detection
