VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach

Mohamed Kerroumi; Othmane Sayem; Aymen Shabou

arXiv:2010.02358·cs.CV·July 6, 2021

TL;DR

VisualWordGrid is a multimodal method that encodes text, visual, and layout features into a 3D tensor for improved field extraction from scanned documents, outperforming recent models especially on small datasets.

Contribution

It introduces a novel multimodal representation combining textual, visual, and layout information into a 3D tensor for document segmentation.

Findings

01 Higher performance than state-of-the-art methods

02 Robustness on small datasets

03 Low inference time

Abstract

We introduce a novel approach to scanned document representation for field extraction. It encodes textual, visual and layout information simultaneously in a 3-axis tensor that is used as input to a segmentation model. We improve on the recent Chargrid and Wordgrid models \cite{chargrid} in several ways: first by taking the visual modality into account, then by boosting robustness on small datasets while keeping inference time low. Our approach is tested on public and private document-image datasets and shows higher performance than recent state-of-the-art methods.
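
The 3-axis tensor described above can be illustrated with a minimal sketch. It is not the authors' implementation: the channel layout (word-embedding channels followed by one grayscale channel), the hash-based `toy_embedding` stand-in for a learned embedding, and the names `build_grid` and `EMBED_DIM` are all assumptions made for illustration. Layout is carried by position: each word's vector is written into the pixels of its bounding box.

```python
import hashlib

EMBED_DIM = 4  # hypothetical embedding size, chosen small for readability


def toy_embedding(word, dim=EMBED_DIM):
    """Deterministic stand-in for a learned word embedding (assumption)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def build_grid(image, words, dim=EMBED_DIM):
    """Fuse text, layout and pixels into an H x W x (dim + 1) nested list.

    image: H x W nested list of grayscale intensities in [0, 1].
    words: list of (text, (x0, y0, x1, y1)) boxes in pixel coordinates,
           with x1 and y1 exclusive.
    """
    height, width = len(image), len(image[0])
    # Start with zero text channels everywhere; the last channel is the pixel.
    grid = [
        [[0.0] * dim + [image[y][x]] for x in range(width)]
        for y in range(height)
    ]
    for text, (x0, y0, x1, y1) in words:
        vector = toy_embedding(text, dim)
        # Clip the box to the image, then paint the word vector into it.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x][:dim] = vector
    return grid


if __name__ == "__main__":
    page = [[0.5] * 6 for _ in range(4)]  # tiny 4 x 6 "scan"
    tensor = build_grid(page, [("Total", (1, 1, 4, 3))])
    print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Pixels outside every word box keep zero text channels, so a segmentation model receiving this tensor would still see the visual channel there. A real pipeline would take the words and boxes from OCR output and would likely use an array library rather than nested lists.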
