Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting
Omar Hamed, Souhail Bakkali, Marie-Francine Moens, Matthew Blaschko, and Jordy Van Landeghem

TL;DR
This paper introduces a multimodal early exit model for document image classification that reduces latency while preserving accuracy, making it suitable for scalable, real-world visually-rich document understanding (VDU) applications.
Contribution
It presents the first multimodal early exit design for VDU, combining training strategies and exit layer placements to optimize performance and efficiency.
Findings
Over 20% latency reduction while maintaining accuracy
Improved performance-efficiency trade-off over traditional methods
Calibration enhances confidence scores for early exiting
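To make the early-exit idea concrete, here is a minimal sketch of a confidence-threshold exit policy with temperature scaling as the calibration step. It is not the authors' implementation: the function names (`softmax`, `early_exit_predict`), the threshold of 0.9, the temperature values, and the example logits are all assumptions chosen for illustration.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    exit_logits: Iterable[Sequence[float]],
    threshold: float = 0.9,      # assumed value, not from the paper
    temperature: float = 1.0,    # assumed calibration parameter
) -> Tuple[int, int, float]:
    """Return (exit_index, predicted_class, confidence).

    Exits are visited in order. The first exit whose top calibrated
    probability reaches `threshold` is used; if none does, the last
    exit's prediction is returned.
    """
    result = None
    for index, logits in enumerate(exit_logits):
        probs = softmax(logits, temperature)
        confidence = max(probs)
        result = (index, probs.index(confidence), confidence)
        if confidence >= threshold:
            break  # confident enough: skip the remaining exits
    if result is None:
        raise ValueError("at least one exit is required")
    return result


# Hypothetical logits from three exit heads for one document image.
exits = [[1.0, 1.2, 0.9], [4.0, 0.5, 0.2], [9.0, 0.1, 0.1]]
print(early_exit_predict(exits))                   # uncalibrated confidences
print(early_exit_predict(exits, temperature=2.0))  # softened confidences
```

Because `exit_logits` is consumed lazily, passing a generator that runs one block of the network per step would skip the remaining computation once the loop breaks. A temperature above 1 lowers overconfident scores, so the same input may exit later than it would without calibration.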
Abstract
This work addresses the need for a balanced approach between performance and efficiency in scalable production environments for visually-rich document understanding (VDU) tasks. Currently, there is a reliance on large document foundation models that offer advanced capabilities but come with a heavy computational burden. In this paper, we propose a multimodal early exit (EE) model design that incorporates various training strategies, exit layer types and placements. Our goal is to achieve a Pareto-optimal balance between predictive performance and efficiency for multimodal document image classification. Through a comprehensive set of experiments, we compare our approach with traditional exit policies and showcase an improved performance-efficiency trade-off. Our multimodal EE design preserves the model's predictive capabilities, enhancing both speed and latency. This is achieved through…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques
Methods: Sparse Evolutionary Training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
