MATrIX -- Modality-Aware Transformer for Information eXtraction
Thomas Delteil, Edouard Belval, Lei Chen, Luis Goncalves, Vijay Mahadevan

TL;DR
MATrIX is a novel multi-modal transformer designed for extracting information from visually rich documents, integrating spatial, visual, and textual data through modality-aware attention mechanisms for improved understanding.
Contribution
The paper introduces a modality-aware transformer pre-trained with unsupervised tasks that effectively combines multiple modalities for document information extraction.
Findings
Outperforms strong baselines on three datasets
Effectively integrates spatial, visual, and textual information
Uses learned modality-aware relative bias in attention mechanism
Abstract
We present MATrIX - a Modality-Aware Transformer for Information eXtraction in the Visual Document Understanding (VDU) domain. VDU covers information extraction from visually rich documents such as forms, invoices, receipts, tables, graphs, presentations, or advertisements. In these, text semantics and visual information supplement each other to provide a global understanding of the document. MATrIX is pre-trained in an unsupervised way with specifically designed tasks that require the use of multi-modal information (spatial, visual, or textual). We consider the spatial and text modalities all at once in a single token set. To make the attention more flexible, we use a learned modality-aware relative bias in the attention mechanism to modulate the attention between the tokens of different modalities. We evaluate MATrIX on 3 different datasets each with strong baselines.
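The modality-aware bias described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head and one learned scalar per (query modality, key modality) pair, whereas the paper's bias may also depend on relative position. The function name `modality_aware_attention`, the `modality_bias` table, and the modality ids (0 = text, 1 = visual) are hypothetical names chosen for illustration.

```python
import numpy as np

def modality_aware_attention(x, modality_ids, w_q, w_k, w_v, modality_bias):
    """Single-head self-attention with a bias per (query modality, key modality) pair.

    x:             (n_tokens, d_model) embeddings of all tokens in one shared set
    modality_ids:  (n_tokens,) integer modality of each token (assumed: 0 = text, 1 = visual)
    modality_bias: (n_modalities, n_modalities) scalars that would be learned in training
                   (hypothetical parameterization, not taken from the paper)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Add the bias selected by the modality of the query token and of the key token.
    scores = scores + modality_bias[modality_ids[:, None], modality_ids[None, :]]
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage: three text tokens followed by two visual tokens.
rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(5, d_model))
modality_ids = np.array([0, 0, 0, 1, 1])
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
# Illustrative values only: discourage text queries from attending to visual keys.
modality_bias = np.array([[0.0, -2.0],
                          [0.0, 0.0]])
out, weights = modality_aware_attention(x, modality_ids, w_q, w_k, w_v, modality_bias)
```

With an all-zero `modality_bias` this reduces to ordinary scaled dot-product attention, so under these assumptions the bias table is the only part that makes the attention modality-aware.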
Taxonomy
Topics: Video Analysis and Summarization · Data Visualization and Analytics · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Residual Connection · Label Smoothing · Absolute Position Encodings · Softmax · Multi-Head Attention
