MATrIX -- Modality-Aware Transformer for Information eXtraction

Thomas Delteil; Edouard Belval; Lei Chen; Luis Goncalves; Vijay Mahadevan

arXiv:2205.08094 · cs.CV · May 18, 2022 · 1 citation

TL;DR

MATrIX is a multi-modal transformer for extracting information from visually rich documents. It integrates spatial, visual, and textual information, using a learned modality-aware relative bias to modulate attention between tokens of different modalities.

Contribution

The paper introduces a modality-aware transformer pre-trained with unsupervised tasks that effectively combines multiple modalities for document information extraction.

Findings

01 Outperforms strong baselines on three datasets
02 Effectively integrates spatial, visual, and textual information
03 Uses learned modality-aware relative bias in attention mechanism

Abstract

We present MATrIX - a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs, presentations, or advertisements. In these, text semantics and visual information supplement each other to provide a global understanding of the document. MATrIX is pre-trained in an unsupervised way with specifically designed tasks that require the use of multi-modal information (spatial, visual, or textual). We consider the spatial and text modalities all at once in a single token set. To make the attention more flexible, we use a learned modality-aware relative bias in the attention mechanism to modulate the attention between the tokens of different modalities. We evaluate MATrIX on 3 different datasets each with strong baselines.
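
The abstract describes a learned modality-aware relative bias that modulates attention between tokens of different modalities, but gives no formula. The following is a minimal sketch of one way such a bias could work, not the authors' implementation. It assumes a single attention head, a 1-D token position, and a bias table indexed by (query modality, key modality, clipped relative position). The function name `modality_aware_attention`, its parameters, and the table layout are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modality_aware_attention(q, k, v, modality, position, bias_table, max_dist):
    """Single-head attention with an additive bias looked up per token pair.

    Assumed shapes (illustrative, not taken from the paper):
      q, k, v     (n, d) float arrays
      modality    (n,) int array, modality id per token (e.g. 0 = text, 1 = spatial)
      position    (n,) int array, 1-D position per token
      bias_table  (M, M, 2 * max_dist + 1) array of scalars that would be learned
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)  # standard scaled dot-product scores

    # Relative offset key - query, clipped and shifted into [0, 2 * max_dist].
    rel = np.clip(position[None, :] - position[:, None], -max_dist, max_dist) + max_dist

    # One scalar per (query modality, key modality, relative offset).
    bias = bias_table[modality[:, None], modality[None, :], rel]

    weights = softmax(scores + bias, axis=-1)
    return weights @ v, weights
```

With an all-zero `bias_table` this reduces to plain scaled dot-product attention. Setting a large negative value for one (query modality, key modality) pair suppresses attention across that pair, which illustrates how a bias of this kind can modulate cross-modality attention.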

Taxonomy

Topics: Video Analysis and Summarization · Data Visualization and Analytics · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Residual Connection · Label Smoothing · Absolute Position Encodings · Softmax · Multi-Head Attention