SAIL: Sample-Centric In-Context Learning for Document Information Extraction

Jinyu Zhang; Zhiyuan You; Jize Wang; Xinyi Le

arXiv:2412.17092 · cs.CL · December 24, 2024

TL;DR

SAIL introduces a sample-centric in-context learning approach that enhances document information extraction from visually rich documents by leveraging fine-grained textual and layout similarities, outperforming existing training-free methods.

Contribution

The paper proposes SAIL, a training-free method that combines entity-level textual similarity and layout similarity with a unified prompt template to improve document information extraction.

Findings

1. Outperforms training-free baselines on multiple benchmarks.
2. Achieves results close to full-training methods.
3. Demonstrates strong generalization across datasets.

Abstract

Document Information Extraction (DIE) aims to extract structured information from Visually Rich Documents (VRDs). Previous full-training approaches have demonstrated strong performance but may struggle with generalization to unseen data. In contrast, training-free methods leverage powerful pre-trained models like Large Language Models (LLMs) to address various downstream tasks with only a few examples. Nonetheless, training-free methods for DIE encounter two primary challenges: (1) understanding the complex relationship between layout and textual elements in VRDs, and (2) providing accurate guidance to pre-trained models. To address these challenges, we propose Sample-centric In-context Learning (SAIL) for DIE. SAIL introduces a fine-grained entity-level textual similarity to facilitate in-depth text analysis by LLMs and incorporates layout similarity to enhance the analysis of layouts…
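The abstract names two signals for choosing in-context examples: entity-level textual similarity and layout similarity. The sketch below is not the paper's implementation; it is a minimal illustration of how such a selection step might look. Everything in it is an assumption made for the example: the document representation (a dict with a list of `(text, box)` entities), the use of `difflib` string matching as the text measure, box IoU as the layout measure, the weight `alpha`, and all function names. Boxes are assumed to share one normalized coordinate space.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in for a learned text measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def box_iou(a, b) -> float:
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def entity_text_similarity(query, candidate) -> float:
    """Mean, over query entities, of the best text match in the candidate."""
    q_texts = [t for t, _ in query["entities"]]
    c_texts = [t for t, _ in candidate["entities"]]
    if not q_texts or not c_texts:
        return 0.0
    return sum(max(text_similarity(q, c) for c in c_texts) for q in q_texts) / len(q_texts)


def layout_similarity(query, candidate) -> float:
    """Mean, over query boxes, of the best IoU with any candidate box."""
    q_boxes = [b for _, b in query["entities"]]
    c_boxes = [b for _, b in candidate["entities"]]
    if not q_boxes or not c_boxes:
        return 0.0
    return sum(max(box_iou(q, c) for c in c_boxes) for q in q_boxes) / len(q_boxes)


def select_examples(query, pool, k=2, alpha=0.5):
    """Return the k pool documents with the highest combined score.

    alpha weights text similarity against layout similarity (hypothetical choice).
    """
    def score(doc):
        return alpha * entity_text_similarity(query, doc) + (1 - alpha) * layout_similarity(query, doc)

    return sorted(pool, key=score, reverse=True)[:k]


if __name__ == "__main__":
    query = {"name": "q", "entities": [("Total: 12.50", (0, 0, 10, 2)),
                                       ("Date: 2024-01-01", (0, 5, 10, 7))]}
    receipt = {"name": "receipt", "entities": [("Total: 9.99", (0, 0, 10, 2)),
                                               ("Date: 2023-05-05", (0, 5, 10, 7))]}
    letter = {"name": "letter", "entities": [("Dear Sir or Madam", (20, 20, 40, 25))]}
    best = select_examples(query, [letter, receipt], k=1)
    print(best[0]["name"])
```

In a training-free setup, the selected documents and their labels would then be placed in the prompt as demonstrations ahead of the query document.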

Code & Models

Repositories

sky-goldfish/sail (PyTorch, official)

Videos

SAIL: Sample-Centric In-Context Learning for Document Information Extraction· underline

Taxonomy

Topics: Text and Document Classification Technologies · Topic Modeling · Handwritten Text Recognition Techniques

Methods: Balanced Selection