Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs

Gaye Colakoglu; Gürkan Solmaz; Jonathan Fürst

arXiv:2502.18179·cs.CL·February 4, 2026

TL;DR

This paper explores how to effectively use large language models for extracting information from layout-rich documents, proposing a new benchmark and methods to optimize their performance without fine-tuning.

Contribution

It introduces LayIE-LLM, an open-source test suite for layout-aware IE, and develops a simple one-factor-at-a-time (OFAT) method for finding a well-performing LLM configuration efficiently.
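As a rough illustration of a one-factor-at-a-time (OFAT) search, the sketch below starts from a baseline configuration and varies a single factor at a time, carrying the best value forward. This is one common greedy variant and not necessarily the paper's exact procedure. The factor names, option values, and `toy_score` function are hypothetical stand-ins; in practice the score would be an F1 measured by running the IE pipeline.

```python
import math
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]


def ofat_search(
    space: Dict[str, List[Any]],
    baseline: Config,
    score: Callable[[Config], float],
) -> Tuple[Config, float, int]:
    """Greedy OFAT: vary one factor at a time, keep the best value found."""
    best = dict(baseline)
    best_score = score(best)
    evaluations = 1  # the baseline itself
    for factor, options in space.items():
        current = best[factor]  # already scored as part of `best`
        for option in options:
            if option == current:
                continue
            candidate = {**best, factor: option}
            candidate_score = score(candidate)
            evaluations += 1
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score, evaluations


# Hypothetical design space (names are illustrative, not taken from the paper).
SPACE = {
    "input_format": ["text", "text+bbox"],
    "chunk_size": [1024, 2048, 4096],
    "prompt": ["zero_shot", "few_shot", "cot"],
}
BASELINE = {"input_format": "text", "chunk_size": 1024, "prompt": "zero_shot"}

# Toy additive gains standing in for a measured F1 score (an assumption).
GAINS = {
    "text": 0, "text+bbox": 3,
    1024: 0, 2048: 1, 4096: 2,
    "zero_shot": 0, "few_shot": 5, "cot": 4,
}


def toy_score(config: Config) -> float:
    return sum(GAINS[value] for value in config.values())


best, best_score, n_evals = ofat_search(SPACE, BASELINE, toy_score)
full_factorial = math.prod(len(options) for options in SPACE.values())
print(best, best_score, f"{n_evals} of {full_factorial} configurations evaluated")
```

The number of evaluations grows with the sum of the option counts rather than their product, which is where the cost saving over a full factorial sweep comes from. The trade-off is that interactions between factors can be missed, so the result is not guaranteed to be the global optimum.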

Findings

01 LLMs require adjustment of the IE pipeline to be competitive on layout-rich documents

02 Optimized configurations outperform a general-practice baseline by 13.3 to 37.5 F1 points using the same LLM

03 Near-optimal results are achievable at low computational cost

Abstract

This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study investigates the sub-problems and methods within these core challenges, such as input representation, chunking, prompting, selection of LLMs, and multimodal models. It examines the effect of different design choices through LayIE-LLM, a new, open-source, layout-aware IE test suite, benchmarking against traditional, fine-tuned IE models. The results on two IE datasets show that LLMs require adjustment of the IE pipeline to achieve competitive performance: the optimized configuration found with LayIE-LLM achieves 13.3–37.5 F1 points more than a general-practice baseline configuration using the same LLM. To find…

Code & Models

Repositories

gayecolakoglu/layie-llm (official)

Videos

Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs· underline

Taxonomy

Topics: Semantic Web and Ontologies