Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction

Mikel Zubillaga; Oscar Sainz; Oier Lopez de Lacalle; Eneko Agirre

arXiv:2601.18395 · cs.CL · January 27, 2026

Open Access

TL;DR

This paper shows that sampling multiple candidate templates from an LLM, especially a reasoning model, and then selecting the best one improves document-level information extraction over greedy decoding.

Contribution

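The paper introduces ThinkTwice, a sampling and selection framework for DocIE. It includes an unsupervised selection method based on agreement across generated outputs, a supervised selection method based on reward models trained on labeled DocIE data, and a way to generate training data with reasoning traces.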

Findings

01

Sampling with ThinkTwice outperforms greedy decoding.

02

Supervised selection with reward models improves accuracy.

03

Rejection sampling effectively creates training data with reasoning traces.
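The rejection sampling idea in finding 03 can be sketched as follows. This is only an illustration under stated assumptions: the page does not say how candidates are accepted, so the sketch assumes a template is a set of (event, role, filler) triples, that each sample pairs a reasoning trace with a template, and that a sample is kept when its F1 against the gold template reaches a threshold. The names `template_f1` and `rejection_sample` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a template is a set of (event, role, filler) triples.
Fact = Tuple[str, str, str]
Template = FrozenSet[Fact]
# Assumed sample shape: (reasoning trace, predicted template).
Sample = Tuple[str, Template]


def template_f1(pred: Template, gold: Template) -> float:
    """Exact-match F1 between two sets of facts."""
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def rejection_sample(
    samples: List[Sample], gold: Template, threshold: float = 1.0
) -> List[Sample]:
    """Keep only samples whose template scores at least `threshold` against gold.

    The surviving (trace, template) pairs could then serve as training data
    that includes reasoning traces. The threshold value is an assumption.
    """
    return [
        (trace, template)
        for trace, template in samples
        if template_f1(template, gold) >= threshold
    ]


if __name__ == "__main__":
    gold = frozenset({("attack", "perpetrator", "X")})
    samples = [
        ("trace A", gold),
        ("trace B", frozenset({("attack", "perpetrator", "Z")})),
    ]
    print(rejection_sample(samples, gold))
```

Lowering `threshold` would admit partially correct templates, trading label quality for more training examples.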

Abstract

Document-level Information Extraction (DocIE) aims to produce an output template with the entities and relations of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a…
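The unsupervised selection step mentioned in the abstract, which exploits agreement across generated outputs, can be illustrated with a minimal sketch. The abstract does not specify the agreement measure, so this example assumes one plausible choice: each candidate template is a set of (event, role, filler) triples, and the selected candidate is the one with the highest mean pairwise F1 against the other candidates. The names `pairwise_f1` and `select_by_agreement` are hypothetical and not taken from the paper.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a template is a set of (event, role, filler) triples.
Fact = Tuple[str, str, str]
Template = FrozenSet[Fact]


def pairwise_f1(a: Template, b: Template) -> float:
    """Exact-match F1 between two candidate templates."""
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def select_by_agreement(candidates: List[Template]) -> int:
    """Return the index of the candidate that agrees most with the others.

    Agreement is the mean pairwise F1 against every other candidate.
    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    if len(candidates) == 1:
        return 0
    best_index, best_score = 0, -1.0
    for i, candidate in enumerate(candidates):
        scores = [
            pairwise_f1(candidate, other)
            for j, other in enumerate(candidates)
            if j != i
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_index, best_score = i, mean_score
    return best_index


if __name__ == "__main__":
    sampled = [
        frozenset({("attack", "perpetrator", "Z")}),
        frozenset({("attack", "perpetrator", "X"), ("attack", "target", "Y")}),
        frozenset({("attack", "perpetrator", "X"), ("attack", "target", "Y")}),
    ]
    print(select_by_agreement(sampled))
```

In this sketch the outlier candidate is never selected, because it shares no facts with the two candidates that agree with each other. A supervised selector would replace the agreement score with a reward model's score for each candidate.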

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques