DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations

Nicholas Popović; Ashish Kangen; Tim Schopf; Michael Färber

arXiv:2507.05997·cs.CL·July 9, 2025

TL;DR

This paper introduces a fully automated synthetic data generation pipeline combined with in-context learning for document-level entity and relation extraction, reducing the need for manual annotation and improving zero-shot extraction capabilities.

Contribution

It presents a novel approach that integrates synthetic data creation with retrieval-based in-context learning using LLMs, enhancing document-level information extraction without manual labels.

Findings

1. A synthetic dataset with approximately 59k entities and 30k relation triples was created.

2. In-context learning is evaluated on challenging document-level extraction tasks.

3. State-of-the-art LLMs still face difficulties in zero-shot document-level extraction.

Abstract

Large, high-quality annotated corpora remain scarce in document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach we produce a synthetic dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. Finally, we evaluate in-context learning…
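The retrieval step mentioned in the abstract (selecting relevant demonstrations from a database at inference time) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which retriever, similarity measure, or prompt format the authors use, so a bag-of-words cosine similarity stands in for a real embedding model, and the names `DEMOS`, `retrieve`, and `build_prompt` as well as the toy demonstrations are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy demonstration database (not taken from the paper's dataset).
DEMOS = [
    {"text": "Marie Curie was a physicist who discovered polonium.",
     "triples": [("Marie Curie", "discovered", "polonium")]},
    {"text": "The Danube is a river that flows through Vienna.",
     "triples": [("Danube", "flows through", "Vienna")]},
    {"text": "Albert Einstein was a physicist born in Ulm.",
     "triples": [("Albert Einstein", "born in", "Ulm")]},
]

def bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, demos, k=2):
    """Return the k demonstrations whose text is most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(demos,
                    key=lambda d: cosine(q, bag_of_words(d["text"])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, demos):
    """Assemble a few-shot prompt: retrieved demonstrations, then the query."""
    parts = [f"Text: {d['text']}\nTriples: {d['triples']}" for d in demos]
    parts.append(f"Text: {query}\nTriples:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    query = "Niels Bohr was a physicist born in Copenhagen."
    print(build_prompt(query, retrieve(query, DEMOS, k=2)))
```

In this toy setup the query about a physicist's birthplace shares the most tokens with the Einstein demonstration, so that example is ranked first and placed in the prompt ahead of the query. A real system would likely swap the token-count vectors for dense embeddings and pass the assembled prompt to an LLM.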

Code & Models

Datasets

nicpopovic/vital_articles_synthetic_information_extraction

Videos

DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations· underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification