Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout
Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek

TL;DR
Kleister introduces a new information extraction task for long, complex documents, emphasizing the importance of layout and structure, and provides datasets and baseline methods to advance research in this area.
Contribution
The paper presents a novel IE task for long documents with complex layouts, along with datasets, baseline pipelines, and an analysis of text extraction tools' impact.
Findings
A text-only Pipeline method built on Named Entity Recognition (NER) architectures provides the baseline results.
Layout features improve information extraction accuracy.
Text extraction errors affect IE system performance.
Abstract
State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, like the sentence-level context or document-level context for short documents. But these solutions are still struggling when it comes to longer, real-world documents with the information encoded in the spatial structure of the document, such as page elements like tables, forms, headers, openings or footers; complex page layout or presence of multiple pages. To encourage progress on deeper and more complex Information Extraction (IE) we introduce a new task (named Kleister) with two new datasets. Utilizing both textual and structural layout features, an NLP system must find the most important information, about various types of entities, in long formal documents. We propose Pipeline method as a text-only baseline with different Named Entity Recognition architectures (Flair,…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis
Methods: Linear Layer · Weight Decay · Residual Connection · Adam · Layer Normalization · Softmax · Attention Is All You Need · Dropout · Multi-Head Attention
