Modular Multimodal Architecture for Document Classification

Tyler Dauphinee; Nikunj Patel; Mohammad Rashidi

arXiv:1912.04376 · cs.CV · December 11, 2019 · 26 cites

TL;DR

This paper introduces a modular multimodal architecture that combines visual and textual features to improve document page classification, achieving state-of-the-art accuracy on the RVL-CDIP benchmark.

Contribution

The paper presents a multimodal approach that integrates the visual and textual content of a page for document classification, exceeding the previously reported state-of-the-art accuracy on the RVL-CDIP benchmark.

Findings

01. Achieved 93.03% test accuracy on the RVL-CDIP benchmark.

02. Outperformed the previous state-of-the-art methods on that benchmark.

03. Demonstrated the effectiveness of combining visual and textual features for document page classification.

Abstract

Page classification is a crucial component of any document analysis system, allowing for complex branching control flows for different components of a given document. Utilizing both the visual and textual content of a page, the proposed method exceeds the current state-of-the-art performance on the RVL-CDIP benchmark, reaching 93.03% test accuracy.
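The abstract does not say how the visual and textual signals are combined, so the following is only a minimal sketch of one generic option, late fusion, and not the authors' architecture. It assumes two separate models have already produced per-class logits for a page; the names `softmax`, `fuse_late`, `predict`, and the parameter `visual_weight` are hypothetical and chosen for illustration. The class count of 16 matches the RVL-CDIP category set.

```python
import math
from typing import List, Sequence

NUM_CLASSES = 16  # RVL-CDIP defines 16 document categories


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_late(
    visual_logits: Sequence[float],
    text_logits: Sequence[float],
    visual_weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> List[float]:
    """Weighted average of the two modalities' class probabilities."""
    if len(visual_logits) != len(text_logits):
        raise ValueError("both modalities must score the same set of classes")
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")
    p_visual = softmax(visual_logits)
    p_text = softmax(text_logits)
    return [
        visual_weight * v + (1.0 - visual_weight) * t
        for v, t in zip(p_visual, p_text)
    ]


def predict(fused: Sequence[float]) -> int:
    """Return the index of the most probable class."""
    return max(range(len(fused)), key=lambda i: fused[i])


# Illustrative inputs: the visual model leans weakly toward class 3,
# the text model leans strongly toward class 7.
visual = [0.0] * NUM_CLASSES
text = [0.0] * NUM_CLASSES
visual[3] = 1.0
text[7] = 3.0

fused = fuse_late(visual, text)
label = predict(fused)
```

With equal weights, the more confident modality dominates the fused prediction, which is the basic intuition behind combining an image model with a text model. A learned combiner over both sets of scores is a common alternative to a fixed weight.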

Taxonomy

Topics: Advanced Computational Techniques and Applications · Text and Document Classification Technologies · Web Data Mining and Analysis

Methods: Test