Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices

Parshva Dhilankumar Patel

arXiv:2507.07029 · cs.CV · July 10, 2025

TL;DR

This paper introduces an OCR-based pipeline that extracts structured table data from scanned invoices, with the goal of more accurate and robust extraction for financial automation and archiving tasks.

Contribution

The paper combines Tesseract OCR with custom post-processing logic designed specifically for noisy and non-standard invoice formats.

Findings

01. Improved accuracy in table data extraction from invoices.

02. Robustness to noisy and non-standard invoice formats.

03. Support for real-world financial workflows.

Abstract

This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extraction accuracy and consistency, supporting real-world use cases such as automated financial workflows and digital archiving.
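The abstract names row-column mapping as a pipeline stage but does not describe how it is implemented. As a rough illustration only, the sketch below shows one common way such a step can work on word bounding boxes of the kind Tesseract reports in its TSV output (left, top, width, height). Everything here is an assumption rather than the paper's method: the `Word` type, the function names, the pixel tolerances, and the premise that columns are roughly left-aligned are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR word with its bounding box in pixels (hypothetical type)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def group_rows(words: List[Word], y_tol: int = 8) -> List[List[Word]]:
    """Group words into rows by vertical centre, then sort each row left to right."""
    rows = []
    for w in sorted(words, key=lambda w: w.top + w.height / 2):
        cy = w.top + w.height / 2
        if rows and abs(cy - rows[-1]["cy"]) <= y_tol:
            rows[-1]["words"].append(w)
        else:
            rows.append({"cy": cy, "words": [w]})
    return [sorted(r["words"], key=lambda w: w.left) for r in rows]


def merge_cells(row: List[Word], gap: int = 15) -> List[list]:
    """Join neighbouring words separated by a small gap into one cell."""
    cells = []  # each cell is [left, right, text]
    for w in row:
        if cells and w.left - cells[-1][1] <= gap:
            cells[-1][1] = w.left + w.width
            cells[-1][2] += " " + w.text
        else:
            cells.append([w.left, w.left + w.width, w.text])
    return cells


def column_anchors(rows_cells: List[List[list]], x_tol: int = 20) -> List[int]:
    """Cluster cell left edges into column anchors (assumes left-aligned columns)."""
    anchors: List[int] = []
    for x in sorted(c[0] for cells in rows_cells for c in cells):
        if not anchors or x - anchors[-1] > x_tol:
            anchors.append(x)
    return anchors


def to_table(words: List[Word], y_tol: int = 8, gap: int = 15, x_tol: int = 20) -> List[List[str]]:
    """Map words to a row-by-column grid; cells with no text stay empty strings."""
    rows_cells = [merge_cells(r, gap) for r in group_rows(words, y_tol)]
    anchors = column_anchors(rows_cells, x_tol)
    table = []
    for cells in rows_cells:
        out = [""] * len(anchors)
        for left, _right, text in cells:
            idx = min(range(len(anchors)), key=lambda i: abs(anchors[i] - left))
            out[idx] = (out[idx] + " " + text).strip()
        table.append(out)
    return table


# Made-up sample data for illustration; not taken from the paper.
sample = [
    Word("Item", 50, 100, 40, 12), Word("Qty", 300, 101, 30, 12), Word("Price", 420, 99, 45, 12),
    Word("Blue", 50, 130, 38, 12), Word("Widget", 94, 131, 55, 12),
    Word("2", 300, 130, 8, 12), Word("9.99", 420, 130, 35, 12),
    Word("Gasket", 52, 160, 52, 12), Word("19.50", 421, 161, 42, 12),
]

for row in to_table(sample):
    print(row)
```

Fixed pixel tolerances are the weakest part of this sketch. Right-aligned numeric columns or skewed scans would likely need anchors based on cell centres or right edges, and tolerances scaled to the detected text height.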

Taxonomy

Topics: Handwritten Text Recognition Techniques · Digital Media Forensic Detection · Currency Recognition and Detection