# Multimodal deep networks for text and image-based document classification

**Authors:** Nicolas Audebert, Catherine Herold, Kuider Slimani, Cédric Vidal

arXiv: 1907.06370 · 2019-07-16

## TL;DR

This paper introduces a multimodal deep learning approach combining visual features and OCR-extracted text for improved document image classification, outperforming purely visual methods on benchmark datasets.

## Contribution

The authors develop a multimodal neural network that integrates image data and OCR text embeddings, enhancing classification accuracy in real-world document analysis tasks.
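The fusion idea can be illustrated with a minimal sketch. It assumes mean-pooled word embeddings, fusion by simple concatenation, and a single linear layer with softmax; this summary does not specify the paper's actual architecture, so none of these choices should be read as the authors' method. The names `mean_pool`, `classify`, the toy vocabulary, and the random (untrained) weights are all hypothetical.

```python
import math
import random


def mean_pool(token_vectors, dim):
    """Average word embeddings into one document-level text vector."""
    if not token_vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_features, ocr_tokens, embeddings, weights, bias):
    """Fuse image and text features, then score each class.

    Assumption: fusion is plain concatenation followed by one linear layer.
    """
    text_dim = len(next(iter(embeddings.values())))
    # Tokens missing from the vocabulary (e.g. OCR errors) are skipped.
    token_vectors = [embeddings[t] for t in ocr_tokens if t in embeddings]
    text_features = mean_pool(token_vectors, text_dim)
    fused = list(image_features) + text_features  # concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy setup: random, untrained parameters, for illustration only.
rng = random.Random(0)
TEXT_DIM, IMAGE_DIM, NUM_CLASSES = 4, 3, 2
embeddings = {
    word: [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]
    for word in ("invoice", "total", "dear", "sincerely")
}
weights = [
    [rng.uniform(-1, 1) for _ in range(IMAGE_DIM + TEXT_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

image_features = [0.2, -0.1, 0.5]  # stand-in for CNN output
probs = classify(image_features, ["invoice", "t0tal", "total"],
                 embeddings, weights, bias)
print(probs)
```

In this sketch the misrecognised token `t0tal` is simply dropped, so the text branch degrades gracefully when OCR is noisy, and the image branch still contributes when no token is recognised at all. In a real system both branches and the classifier would be trained on labelled documents.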

## Key findings

- Improves accuracy over the image-only baseline by 3% on the Tobacco3482 and RVL-CDIP datasets.
- Remains effective even when the OCR-extracted text is noisy.
- Releases QS-OCR, a new dataset of OCR-extracted text for Tobacco3482 and RVL-CDIP (https://github.com/Quicksign/ocrized-text-dataset).

## Abstract

Classification of document images is a critical step for archival of old manuscripts, online subscription and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based on their visual appearance. However, the fine-grained classification that is required in real-world settings cannot be achieved by visual analysis alone. Often, the relevant information is in the actual text content of the document. We design a multimodal neural network that is able to learn from word embeddings, computed on text extracted by OCR, and from the image. We show that this approach boosts pure image accuracy by 3% on Tobacco3482 and RVL-CDIP augmented by our new QS-OCR text dataset (https://github.com/Quicksign/ocrized-text-dataset), even without clean text information.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.06370/full.md

## Figures

30 figures with captions in the complete paper: https://tomesphere.com/paper/1907.06370/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/1907.06370/full.md

---
Source: https://tomesphere.com/paper/1907.06370