Modular Multimodal Architecture for Document Classification
Tyler Dauphinee, Nikunj Patel, Mohammad Rashidi

TL;DR
This paper introduces a modular multimodal architecture that combines visual and textual features to improve document page classification, achieving state-of-the-art accuracy on the RVL-CDIP benchmark.
Contribution
The paper presents a multimodal approach that integrates the visual and textual content of a page for document classification, surpassing the previously reported state of the art on the RVL-CDIP benchmark.
Findings
Achieved 93.03% test accuracy on the RVL-CDIP benchmark.
Outperformed the previous state-of-the-art methods on that benchmark.
Demonstrated effectiveness of multimodal integration in document analysis.
Abstract
Page classification is a crucial component of any document analysis system, allowing for complex branching control flows for different components of a given document. Utilizing both the visual and textual content of a page, the proposed method exceeds the current state-of-the-art performance on the RVL-CDIP benchmark, reaching 93.03% test accuracy.
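The abstract does not say how the visual and textual signals are combined, so the following is only a minimal sketch of one common option: late fusion by concatenation followed by a linear classifier. It is not the authors' architecture. The function name `classify_page`, the feature dimensions, and the random weights are all hypothetical placeholders. In practice the two feature vectors would come from separate image and text encoders. The class count of 16 matches the number of RVL-CDIP categories.

```python
import math
import random

NUM_CLASSES = 16  # RVL-CDIP defines 16 document categories


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_page(visual_feats, text_feats, weights, bias):
    """Hypothetical late-fusion head: concatenate, linear layer, softmax."""
    fused = visual_feats + text_feats  # list concatenation of both modalities
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Placeholder inputs: real features would come from an image encoder
# and a text encoder run over the page's OCR output (assumption).
random.seed(0)
VISUAL_DIM, TEXT_DIM = 8, 4
visual = [random.gauss(0, 1) for _ in range(VISUAL_DIM)]
text = [random.gauss(0, 1) for _ in range(TEXT_DIM)]

# Untrained, randomly initialised classifier parameters (illustration only).
weights = [
    [random.gauss(0, 0.1) for _ in range(VISUAL_DIM + TEXT_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

probs = classify_page(visual, text, weights, bias)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Keeping each modality behind its own feature vector is what makes this kind of design modular: either encoder can be swapped without touching the other, and only the fusion head's input size changes.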
Taxonomy
Topics: Advanced Computational Techniques and Applications · Text and Document Classification Technologies · Web Data Mining and Analysis
Methods: Test
