Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting

Omar Hamed, Souhail Bakkali, Marie-Francine Moens, Matthew Blaschko, and Jordy Van Landeghem

arXiv:2405.12705·cs.CV·May 22, 2024

TL;DR

This paper introduces a multimodal early exit model for document image classification that balances high accuracy with reduced latency, making it suitable for scalable, real-world VDU applications.

Contribution

It presents the first multimodal early exit design for VDU, combining training strategies and exit layer placements to optimize performance and efficiency.

Findings

01 Over 20% latency reduction while maintaining accuracy

02 Improved performance-efficiency trade-off over traditional methods

03 Calibration enhances confidence scores for early exiting
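
The third finding concerns calibrated confidence scores. This page does not say which calibration method the paper uses, so the sketch below swaps in temperature scaling, a common post-hoc technique, purely as an illustration. The function names and the temperature values are assumptions, not taken from the paper or its repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max_confidence(logits, temperature=1.0):
    """Confidence score of an exit: the largest softmax probability."""
    return max(softmax(logits, temperature))


# Hypothetical logits from one exit head. A temperature above 1 softens
# the distribution, so an over-confident exit reports a lower score.
logits = [4.0, 1.0, 0.5]
raw = max_confidence(logits, temperature=1.0)
softened = max_confidence(logits, temperature=2.0)
```

In practice the temperature would be fitted on held-out data. A softened score makes a confidence-threshold exit rule less likely to fire on predictions that are confidently wrong.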

Abstract

This work addresses the need for a balanced approach between performance and efficiency in scalable production environments for visually-rich document understanding (VDU) tasks. Currently, there is a reliance on large document foundation models that offer advanced capabilities but come with a heavy computational burden. In this paper, we propose a multimodal early exit (EE) model design that incorporates various training strategies, exit layer types and placements. Our goal is to achieve a Pareto-optimal balance between predictive performance and efficiency for multimodal document image classification. Through a comprehensive set of experiments, we compare our approach with traditional exit policies and showcase an improved performance-efficiency trade-off. Our multimodal EE design preserves the model's predictive capabilities, enhancing both speed and latency. This is achieved through…
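
To make the early exit idea concrete, here is a minimal sketch of a confidence-threshold exit policy, one of the traditional policies that early exit models are usually compared against. It is not the authors' implementation. The names `early_exit_predict` and `exits` and the 0.9 threshold are hypothetical. Each exit is modelled as a zero-argument callable so that deeper exits are only computed when they are needed.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exits, threshold=0.9):
    """Evaluate exits from shallow to deep and stop at the first confident one.

    exits: list of zero-argument callables, each returning a list of logits.
           (Hypothetical stand-in for the exit heads of a real model.)
    Returns (predicted_label, exit_index, confidence).
    """
    if not exits:
        raise ValueError("at least one exit is required")
    last = len(exits) - 1
    for index, compute_logits in enumerate(exits):
        probs = softmax(compute_logits())  # deeper exits are not run yet
        confidence = max(probs)
        # The final exit always answers, whatever its confidence.
        if confidence >= threshold or index == last:
            return probs.index(confidence), index, confidence


# Usage with made-up logits: the first exit is unsure, the second is confident,
# so the third (most expensive) exit is never evaluated.
evaluated = []


def make_exit(name, logits):
    def run():
        evaluated.append(name)
        return logits
    return run


exits = [
    make_exit("exit_0", [0.1, 0.2, 0.0]),
    make_exit("exit_1", [5.0, 0.0, 0.0]),
    make_exit("exit_2", [0.0, 9.0, 0.0]),
]
label, exit_index, confidence = early_exit_predict(exits, threshold=0.9)
```

The threshold sets the trade-off: a lower value exits earlier and saves more computation, while a higher value defers more inputs to the deeper, more accurate layers.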

Code & Models

Repositories

Jordy-VL/multi-modal-early-exit (PyTorch, official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques

Methods: Sparse Evolutionary Training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings