Document Image Cleaning using Budget-Aware Black-Box Approximation
Ganesh Tata, Katyani Singh, Eric Van Oeveren, Nilanjan Ray

TL;DR
This paper introduces budget-aware sample selection algorithms for training document preprocessors that approximate OCR engines, significantly reducing query costs and training time while maintaining high accuracy.
Contribution
It presents novel sample selection methods that enable efficient training of OCR preprocessors with minimal queries, reducing costs and computational resources.
Findings
Achieved over 60% reduction in training time.
Reduced OCR engine queries to less than 10% of the original count.
Improved word-level accuracy by 4% with minimal queries.
Abstract
Recent work has shown that by approximating the behaviour of a non-differentiable black-box function using a neural network, the black-box can be integrated into a differentiable training pipeline for end-to-end training. This methodology is termed "differentiable bypass," and a successful application of this method involves training a document preprocessor to improve the performance of a black-box OCR engine. However, a good approximation of an OCR engine requires querying it for all samples throughout the training process, which can be computationally and financially expensive. Several zeroth-order optimization (ZO) algorithms have been proposed in the black-box attack literature to find adversarial examples for a black-box model by computing its gradient in a query-efficient manner. However, the query complexity and convergence rate of such algorithms make them infeasible for our…
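The differentiable-bypass idea described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not the paper's method: `black_box` is a hypothetical quantized score standing in for an OCR engine, the "preprocessor" is a single scalar shift `theta`, and a closed-form linear fit stands in for the neural-network approximator. The sketch also counts black-box queries to show why querying every sample at every step becomes expensive.

```python
import random


def black_box(y):
    """Hypothetical stand-in for an OCR engine: a quantized error score.

    The score is piecewise constant, so its gradient is zero almost everywhere.
    """
    return round(10 * abs(y - 0.5)) / 10


def fit_linear_surrogate(ys, scores):
    """Closed-form least-squares fit of score ~ a * y + b; returns (a, b)."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_s = sum(scores) / n
    var = sum((y - mean_y) ** 2 for y in ys)
    cov = sum((y - mean_y) * (s - mean_s) for y, s in zip(ys, scores))
    a = cov / var
    return a, mean_s - a * mean_y


def train(x=0.0, steps=100, samples_per_step=64, sigma=0.2, lr=0.1, seed=0):
    """Alternate between fitting the surrogate and updating the preprocessor."""
    rng = random.Random(seed)
    theta = 0.0   # the only trainable preprocessor parameter in this toy
    queries = 0   # running count of black-box calls
    for _ in range(steps):
        y = x + theta  # preprocessor output
        # Query the black box on jittered copies of the current output.
        ys = [y + rng.gauss(0.0, sigma) for _ in range(samples_per_step)]
        scores = [black_box(v) for v in ys]
        queries += len(ys)
        # Fit the differentiable surrogate to the queried scores.
        a, _ = fit_linear_surrogate(ys, scores)
        # The gradient of a * (x + theta) + b with respect to theta is a,
        # so the surrogate supplies the gradient the black box cannot.
        theta -= lr * a
    return theta, queries


theta, queries = train()
print(f"learned shift: {theta:.3f}, black-box queries used: {queries}")
```

In this toy, every training step queries the black box for all jittered samples, so the query count grows as `steps * samples_per_step`. That per-step cost is the quantity a query budget would cap.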
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Digital Media Forensic Detection
