TL;DR
This paper introduces TILT, a unified encoder-decoder Transformer architecture that integrates layout, visual, and textual information for advanced document understanding, achieving state-of-the-art results in layout-aware tasks.
Contribution
The paper presents TILT, a neural network that unifies layout, visual, and textual features in a single end-to-end model for document comprehension.
Findings
Achieves state-of-the-art results on the DocVQA, CORD, and SROIE datasets.
Effectively unifies layout, visual, and textual information.
Simplifies document understanding with an end-to-end approach.
Abstract
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture which simultaneously learns layout information, visual features, and textual semantics. Contrary to previous approaches, we rely on a decoder capable of unifying a variety of problems involving natural language. The layout is represented as an attention bias and complemented with contextualized visual information, while the core of our model is a pretrained encoder-decoder Transformer. Our novel approach achieves state-of-the-art results in extracting information from documents and answering questions which demand layout understanding (DocVQA, CORD, SROIE). At the same time, we simplify the process by employing an end-to-end model.
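The abstract states that layout is represented as an attention bias but does not say how the bias is computed. The sketch below shows one plausible reading, not the authors' implementation: signed horizontal and vertical distances between token positions are bucketed, each bucket indexes a bias table, and the resulting values are added to the attention scores before the softmax. The names `bucketize`, `layout_biased_attention`, `h_table`, and `v_table`, the linear bucketing scheme, and the use of normalized token centers are all assumptions made for illustration. In a trained model the tables would be learned parameters.

```python
import numpy as np

def bucketize(delta, num_buckets, max_dist):
    """Map signed distances in [-max_dist, max_dist] to integer bucket ids.

    Linear bucketing is an assumption chosen for simplicity.
    """
    clipped = np.clip(delta, -max_dist, max_dist)
    scaled = (clipped + max_dist) / (2.0 * max_dist)  # now in [0, 1]
    return np.minimum((scaled * num_buckets).astype(int), num_buckets - 1)

def layout_biased_attention(q, k, v, centers, h_table, v_table, max_dist=1.0):
    """Single-head attention with an additive 2D layout bias.

    q, k, v:  arrays of shape (n, d)
    centers:  array of shape (n, 2), normalized (x, y) centers of each token
    h_table:  1D array of bias values, one per horizontal-distance bucket
    v_table:  1D array of bias values, one per vertical-distance bucket
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (n, n) content scores

    # Pairwise signed offsets from query token i to key token j.
    dx = centers[None, :, 0] - centers[:, None, 0]
    dy = centers[None, :, 1] - centers[:, None, 1]

    # Look up one bias value per pair and per axis, then add to the scores.
    bias = (h_table[bucketize(dx, len(h_table), max_dist)]
            + v_table[bucketize(dy, len(v_table), max_dist)])
    scores = scores + bias

    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With both tables set to zero, the function reduces to ordinary scaled dot-product attention, so the layout signal enters only through the additive term. This keeps the token embeddings themselves free of position information, which is the appeal of treating layout as a bias.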
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
