Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction
Mikel Zubillaga, Oscar Sainz, Oier Lopez de Lacalle, Eneko Agirre

TL;DR
This paper shows that sampling multiple candidate templates from reasoning models and selecting the best one improves document-level information extraction over greedy decoding.
Contribution
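The paper introduces ThinkTwice, a sampling and selection framework for DocIE. It includes an unsupervised selection method based on agreement across generated outputs, a supervised selection method based on reward models trained on labeled DocIE data, and a rejection-sampling procedure for generating training data with reasoning traces.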
Findings
Sampling with ThinkTwice outperforms greedy decoding.
Supervised selection with reward models improves accuracy.
Rejection sampling effectively creates training data with reasoning traces.
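The rejection-sampling finding can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: templates are modeled as sets of hypothetical (slot, value) pairs, `sample_fn` stands in for a sampled LLM call that returns a reasoning trace and a template, and a sample is accepted only when its template exactly matches the gold one (the paper may use a different acceptance criterion).

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Fact = Tuple[str, str]        # hypothetical (slot, value) pair
Template = FrozenSet[Fact]    # a template modeled as a set of facts


def rejection_sample(
    document: str,
    gold: Template,
    sample_fn: Callable[[str], Tuple[str, Template]],
    num_samples: int = 8,
) -> List[Dict[str, object]]:
    """Keep sampled (reasoning, template) pairs whose template matches gold.

    Assumption: exact match against the gold template is the acceptance test.
    """
    kept: List[Dict[str, object]] = []
    for _ in range(num_samples):
        reasoning, template = sample_fn(document)
        if template == gold:
            # The reasoning trace led to a correct template, so it can be
            # reused as a training example.
            kept.append(
                {"document": document, "reasoning": reasoning, "template": template}
            )
    return kept
```

In this sketch the accepted traces become training examples, while rejected samples are simply discarded.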
Abstract
Document-level Information Extraction (DocIE) aims to produce an output template with the entities and relations of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a…
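As a rough illustration of unsupervised selection by agreement, the sketch below picks the candidate template that agrees most with the other sampled candidates. The details are assumptions, not taken from the paper: templates are modeled as sets of hypothetical (slot, value) pairs, agreement is measured as mean pairwise F1, and the function names are invented for this example.

```python
from typing import FrozenSet, List, Tuple

Fact = Tuple[str, str]        # hypothetical (slot, value) pair
Template = FrozenSet[Fact]    # a template modeled as a set of facts


def f1(a: Template, b: Template) -> float:
    """F1 overlap between two templates, treating `b` as the reference."""
    if not a and not b:
        return 1.0  # two empty templates agree fully
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def select_by_agreement(candidates: List[Template]) -> Template:
    """Return the candidate with the highest mean F1 against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i: int) -> float:
        scores = [f1(candidates[i], c) for j, c in enumerate(candidates) if j != i]
        return sum(scores) / len(scores)

    # Ties are broken in favor of the earliest candidate.
    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]
```

A supervised variant would replace `mean_agreement` with a score from a trained reward model; that component is not sketched here.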
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
