Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

Rafał Powalski; Łukasz Borchmann; Dawid Jurkiewicz; Tomasz Dwojak; Michał Pietruszka; Gabriela Pałka

arXiv:2102.09550·cs.CL·July 13, 2021

TL;DR

This paper introduces TILT, a unified encoder-decoder Transformer architecture that integrates layout, visual, and textual information for document understanding, achieving state-of-the-art results on layout-aware tasks (DocVQA, CORD, SROIE).

Contribution

The paper presents a novel TILT neural network that unifies layout, visual, and textual features in a single end-to-end model for document comprehension.

Findings

01 Achieves state-of-the-art results on the DocVQA, CORD, and SROIE datasets.

02 Effectively unifies layout, visual, and textual information.

03 Simplifies document understanding with an end-to-end approach.

Abstract

We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
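The abstract says the layout is represented as an attention bias. The sketch below illustrates that general idea only and is not the authors' implementation: a bias term, looked up from the bucketed horizontal and vertical distance between two tokens' positions on the page, is added to the attention score before the softmax. The bucketing scheme, the table sizes, and the names `bucket`, `layout_bias`, and `biased_attention` are assumptions made for illustration; in a trained model the table entries would be learned parameters.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def bucket(delta, num_buckets=8, max_dist=1.0):
    """Map a signed distance in [-max_dist, max_dist] to a bucket index.

    Hypothetical uniform bucketing, chosen only to keep the example short.
    """
    clipped = max(-max_dist, min(max_dist, delta))
    idx = int((clipped + max_dist) / (2 * max_dist) * num_buckets)
    return min(idx, num_buckets - 1)

def layout_bias(centers, h_table, v_table):
    """bias[i][j] = h_table[bucket(dx)] + v_table[bucket(dy)].

    centers: (x, y) token positions normalized to [0, 1].
    h_table, v_table: per-bucket bias values (learned in a real model).
    """
    n = len(centers)
    return [
        [
            h_table[bucket(centers[j][0] - centers[i][0], len(h_table))]
            + v_table[bucket(centers[j][1] - centers[i][1], len(v_table))]
            for j in range(n)
        ]
        for i in range(n)
    ]

def biased_attention(q, k, v, bias):
    """Single-head scaled dot-product attention with an additive bias."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + bias[i][j]
            for j, kj in enumerate(k)
        ]
        weights = softmax(scores)
        out.append([
            sum(weights[j] * v[j][c] for j in range(len(v)))
            for c in range(len(v[0]))
        ])
    return out

# Three tokens: two on the same line near the top, one far below.
centers = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.9)]
h_table = [0.0] * 8
v_table = [0.0] * 8
v_table[7] = -100.0  # strongly discourage attending to tokens far below

q = k = [[0.0, 0.0]] * 3  # identical content, so only the bias matters
v = [[1.0], [2.0], [3.0]]
out = biased_attention(q, k, v, layout_bias(centers, h_table, v_table))
```

With identical queries and keys, the content scores are all equal, so any difference in the attention weights comes from the layout bias alone: the first token effectively ignores the token far below it and averages over the two tokens on its own line.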

Code & Models

Repositories

uakarsh/TiLT-Implementation (PyTorch)

Taxonomy

Methods

Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax