Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora

Tristan Karch; Luca Engel; Philippe Schwaller; Frédéric Kaplan

arXiv:2502.13691 · cs.CL · January 9, 2026

Open Access

TL;DR

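This paper introduces an automated, model-agnostic method for estimating how much new information a text collection offers a large language model (LLM). It generates multiple-choice questions (MCQs) from the texts and measures how much the model's performance changes when it is given access to the source material.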

Contribution

It presents a novel pipeline that estimates information gain from text corpora without training or fine-tuning LLMs, aiding data prioritization.

Findings

01. Effectively identifies valuable, information-rich collections

02. Correlates performance gaps with information potential

03. Validated on diverse datasets, including historical and Wikipedia texts

Abstract

As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection's information potential. We validate our approach using five strategically selected datasets: EPFL PhD…
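
The with/without-source comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `MCQ` dataclass, the `information_potential` function, and the `toy_answerer` stand-in for an LLM are all hypothetical names introduced here, the two demo questions are invented toy data, and plain accuracy difference is assumed as the performance gap.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class MCQ:
    """One generated multiple-choice question (hypothetical structure)."""
    question: str
    options: tuple[str, ...]
    answer_index: int   # index of the correct option
    source_text: str    # passage the question was generated from


# An answerer maps (question, optional context) to a chosen option index.
# In practice this would wrap an LLM call; here it is only an interface.
Answerer = Callable[[MCQ, Optional[str]], int]


def accuracy(mcqs: Sequence[MCQ], answerer: Answerer, with_context: bool) -> float:
    """Fraction of questions answered correctly in one condition."""
    if not mcqs:
        raise ValueError("need at least one question")
    correct = sum(
        answerer(q, q.source_text if with_context else None) == q.answer_index
        for q in mcqs
    )
    return correct / len(mcqs)


def information_potential(mcqs: Sequence[MCQ], answerer: Answerer) -> float:
    """Accuracy with the source text minus accuracy without it (assumed metric)."""
    return accuracy(mcqs, answerer, True) - accuracy(mcqs, answerer, False)


def toy_answerer(mcq: MCQ, context: Optional[str]) -> int:
    """Stand-in for an LLM: picks the first option found in the context,
    and falls back to option 0 when no context is given or nothing matches."""
    if context is not None:
        for i, option in enumerate(mcq.options):
            if option in context:
                return i
    return 0


# Invented toy questions, for illustration only.
demo = [
    MCQ("When was the institute founded?", ("1850", "1869", "1901"), 1,
        "The institute was founded in 1869."),
    MCQ("Where is the archive held?", ("Paris", "Lausanne", "Rome"), 0,
        "The archive is held in Paris."),
]

gap = information_potential(demo, toy_answerer)
print(f"performance gap: {gap:.2f}")
```

A gap near zero suggests the model already answers the questions without the source, while a large gap suggests the collection carries information the model lacks. Replacing `toy_answerer` with a function that queries a real model is the only change needed to try the idea on actual data.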

Taxonomy

Topics: Natural Language Processing Techniques · Artificial Intelligence in Law