Document Image Cleaning using Budget-Aware Black-Box Approximation

Ganesh Tata; Katyani Singh; Eric Van Oeveren; Nilanjan Ray

arXiv:2306.13236 · cs.CV · June 26, 2023

TL;DR

This paper introduces budget-aware sample selection algorithms for training a document preprocessor together with a neural approximation of a black-box OCR engine, significantly reducing query costs and training time while maintaining high accuracy.

Contribution

It presents novel sample selection methods that enable efficient training of OCR preprocessors with minimal queries, reducing costs and computational resources.

Findings

1. Achieved over 60% reduction in training time.

2. Reduced OCR engine queries to less than 10% of the original number.

3. Improved word-level accuracy by 4% with minimal queries.
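
To make the query-reduction finding concrete, the following is a minimal sketch of training under a per-epoch query budget. It is not the paper's selection algorithm: the uniform random selection, the `QueryCounter` wrapper, the `toy_engine` stand-in, and the 8% budget are all assumptions chosen for illustration.

```python
import random


def select_within_budget(num_samples, budget_percent, rng):
    """Pick which sample indices may be sent to the black box this epoch.

    Assumption: uniform sampling without replacement. The paper's own
    selection criteria are not reproduced here.
    """
    k = num_samples * budget_percent // 100
    return sorted(rng.sample(range(num_samples), k))


class QueryCounter:
    """Wraps a black-box callable and counts how often it is queried."""

    def __init__(self, engine):
        self.engine = engine
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.engine(x)


def toy_engine(text):
    # Hypothetical stand-in for an expensive OCR engine call.
    return text.upper()


rng = random.Random(0)
samples = [f"sample-{i}" for i in range(200)]
engine = QueryCounter(toy_engine)
epochs = 5
budget_percent = 8  # assumed budget: 8% of the samples per epoch

for _ in range(epochs):
    for i in select_within_budget(len(samples), budget_percent, rng):
        engine(samples[i])  # only the selected samples cost a query

full_cost = epochs * len(samples)  # cost of querying every sample every epoch
print(f"queries used: {engine.calls} of {full_cost}")
```

Counting queries through a wrapper keeps the budget check independent of how the samples are chosen, so a different selection rule can be swapped in without touching the accounting.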

Abstract

Recent work has shown that by approximating the behaviour of a non-differentiable black-box function using a neural network, the black-box can be integrated into a differentiable training pipeline for end-to-end training. This methodology is termed "differentiable bypass," and a successful application of this method involves training a document preprocessor to improve the performance of a black-box OCR engine. However, a good approximation of an OCR engine requires querying it for all samples throughout the training process, which can be computationally and financially expensive. Several zeroth-order optimization (ZO) algorithms have been proposed in black-box attack literature to find adversarial examples for a black-box model by computing its gradient in a query-efficient manner. However, the query complexity and convergence rate of such algorithms make them infeasible for our…
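
The idea of a differentiable bypass can be illustrated with a toy one-dimensional sketch. Everything in it is an assumption made for illustration: `black_box` is a hypothetical piecewise-constant error score standing in for an OCR engine, the surrogate is a small quadratic model rather than a neural network, the "preprocessor" is a single shift parameter `theta`, and the surrogate is fitted once on a grid of queries instead of being updated throughout training.

```python
import math


def black_box(y):
    """Hypothetical stand-in for an OCR error score.

    It is piecewise constant, so its gradient is zero almost everywhere
    and cannot drive gradient-based training directly.
    """
    return math.floor(10 * (y - 2.0) ** 2) / 10


def features(y):
    z = (y - 1.0) / 2.0  # rescale y from [-1, 3] to [-1, 1]
    return (1.0, z, z * z)


def surrogate_grad_y(w, y):
    """Derivative of the surrogate w . features(y) with respect to y."""
    z = (y - 1.0) / 2.0
    return (w[1] + 2.0 * w[2] * z) / 2.0  # chain rule: dz/dy = 1/2


def fit_surrogate(queries, targets, lr=0.5, steps=600):
    """Fit the differentiable surrogate to black-box outputs by gradient descent."""
    w = [0.0, 0.0, 0.0]
    n = len(queries)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for y, t in zip(queries, targets):
            phi = features(y)
            err = sum(wi * fi for wi, fi in zip(w, phi)) - t
            for j in range(3):
                grad[j] += err * phi[j] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def mean_error(theta, inputs):
    """True black-box error of the preprocessed inputs (used for evaluation only)."""
    return sum(black_box(x + theta) for x in inputs) / len(inputs)


# 1) Query the black box and fit the surrogate to its responses.
grid = [-1.0 + 0.1 * i for i in range(41)]
w = fit_surrogate(grid, [black_box(y) for y in grid])

# 2) Train the preprocessor y = x + theta using gradients of the surrogate only.
inputs = [-0.5, 0.0, 0.5]
theta = 0.0
initial_error = mean_error(theta, inputs)
for _ in range(200):
    grad = sum(surrogate_grad_y(w, x + theta) for x in inputs) / len(inputs)
    theta -= 0.1 * grad
final_error = mean_error(theta, inputs)

print(f"theta={theta:.3f} error {initial_error:.3f} -> {final_error:.3f}")
```

The preprocessor never differentiates the black box itself; it only follows the surrogate's gradient, which is why the quality of the approximation, and hence the number of queries spent fitting it, matters.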

Code & Models

Repositories

tataganesh/query-efficient-approx-to-improve-ocr
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Digital Media Forensic Detection