KVP10k : A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents

Oshri Naparstek; Roi Pony; Inbar Shapira; Foad Abo Dahood; Ophir Azulai; Yevgeny Yaroker; Nadav Rubinstein; Maksym Lysak; Peter Staar; Ahmed Nassar; Nikolaos Livathinos; Christoph Auer; Elad Amrani; Idan Friedman; Orit Prince; Yevgeny Burshtein; Adi Raz Goldfarb; Udi Barzelay

arXiv:2405.00505·cs.IR·May 2, 2024

TL;DR

This paper introduces KVP10k, a large, diverse dataset and benchmark for extracting key-value pairs from business documents without predefined keys, addressing a significant gap in existing resources.

Contribution

It presents the first comprehensive dataset and benchmark specifically designed for non-predetermined key-value pair extraction in complex business documents.

Findings

01. KVP10k contains 10,707 annotated images.

02. The benchmark challenges models to extract KVPs without predefined keys.

03. The dataset enhances diversity and annotation detail for better model training.

Abstract

In recent years, extracting information from business documents has emerged as a critical task with applications across numerous domains. It has attracted substantial interest from both industry and academia. Most datasets in this area focus on Key Information Extraction (KIE), where extraction targets a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, across diverse templates and complex layouts. This task presents unique challenges, primarily due to the absence of comprehensive datasets and benchmarks tailored for non-predetermined KVP extraction. To address this gap, we introduce…
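The distinction the abstract draws between KIE and non-predetermined KVP extraction can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the sample lines, the `PREDEFINED_KEYS` schema, and the naive "label: value" heuristic are invented, and none of it reflects the KVP10k annotation format or the paper's method.

```python
# Toy text lines standing in for OCR output of a business document (invented data).
LINES = [
    "Invoice No: 4821",
    "Date: 2024-05-02",
    "Ship via: Ground",
    "Thank you for your business",
]

# KIE-style setup: the schema is fixed in advance (hypothetical field names).
PREDEFINED_KEYS = {"invoice_no": "Invoice No", "date": "Date"}


def extract_kie(lines):
    """Fill a fixed schema; labels outside the schema are ignored."""
    result = {field: None for field in PREDEFINED_KEYS}
    for line in lines:
        label, sep, value = line.partition(":")
        if not sep:
            continue  # no separator, so no candidate label on this line
        for field, expected_label in PREDEFINED_KEYS.items():
            if label.strip() == expected_label:
                result[field] = value.strip()
    return result


def extract_kvp(lines):
    """Discover (key, value) pairs with no schema, via a naive 'key: value' split."""
    pairs = []
    for line in lines:
        label, sep, value = line.partition(":")
        if sep and label.strip() and value.strip():
            pairs.append((label.strip(), value.strip()))
    return pairs


if __name__ == "__main__":
    print(extract_kie(LINES))
    print(extract_kvp(LINES))
```

In this sketch the KIE function never reports "Ship via" because that label is not in its schema, while the schema-free function returns it alongside the other pairs. Real documents need layout-aware models rather than a string split; the point is only the difference in output shape.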

Code & Models

Repositories

ibm/kvp10k (Official)

Taxonomy

Topics: Advanced Text Analysis Techniques

Methods: Sparse Evolutionary Training · Focus