Greedy Information Projection for LLM Data Selection
Victor Ye Dong, Kuan-Yun Lee, Jiamei Shuai, Shengfei Liu, Yi Liu, Jian Jiao

TL;DR
The paper introduces GIP, a mutual information-based framework for selecting training data for large language models, balancing quality and diversity to improve fine-tuning efficiency.
Contribution
GIP provides a novel, principled approach to data selection using mutual information and a geometric projection perspective, enabling efficient subset selection for LLM fine-tuning.
Findings
GIP matches full-data fine-tuning performance with fewer examples.
GIP efficiently balances quality and diversity in data selection.
GIP reduces computational resources needed for fine-tuning.
Abstract
We present Greedy Information Projection (GIP), a principled framework for choosing training examples for large language model fine-tuning. GIP casts selection as maximizing mutual information between a subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. The framework involves optimizing a closed-form mutual information objective defined using both data and query embeddings, naturally balancing quality and diversity. Optimizing this score is equivalent to maximizing the projection of the query embedding matrix onto the span of the selected data, which provides a geometric explanation for the co-emergence of quality and diversity. Building on this view, we employ a fast greedy matching-pursuit procedure with efficient projection-based updates. On instruction-following and…
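The projection view in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a plain forward-selection loop written under the assumption that data and query embeddings are rows of two matrices living in the same space, and that the score of a subset is the squared Frobenius norm of the query matrix projected onto the span of the selected rows. The function name `greedy_projection_select`, the argument names, and the tolerance `tol` are all hypothetical.

```python
import numpy as np

def greedy_projection_select(X, Q, k, tol=1e-12):
    """Greedily pick up to k rows of X whose span captures the most energy of Q.

    X: (m, d) data embeddings, Q: (q, d) query embeddings (assumed layout).
    Illustrative sketch only; the paper's procedure may differ.
    """
    R = np.array(X, dtype=float)          # residuals of data rows w.r.t. the selected span
    Q = np.asarray(Q, dtype=float)
    available = np.ones(len(R), dtype=bool)
    selected = []
    for _ in range(k):
        norms = np.einsum("ij,ij->i", R, R)        # ||r_i||^2
        RQ = R @ Q.T                               # inner products <r_i, q_j>
        energy = np.einsum("ij,ij->i", RQ, RQ)     # ||Q r_i||^2
        valid = available & (norms > tol)
        if not valid.any():
            break                                  # remaining rows add no new direction
        gains = np.full(len(R), -np.inf)
        gains[valid] = energy[valid] / norms[valid]  # increase in projected query energy
        i = int(np.argmax(gains))
        u = R[i] / np.sqrt(norms[i])               # new orthonormal direction
        R -= np.outer(R @ u, u)                    # deflate every row against it
        available[i] = False
        selected.append(i)
    return selected

# Toy data: row 1 is a near-duplicate of row 0; the queries weight axis 0 most.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Q = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(greedy_projection_select(X, Q, 2))  # [0, 2]
```

In the toy run, row 1 has the second-highest score at the first step, but its gain drops to zero once row 0 is chosen. The second pick is therefore the row that covers the other query direction, which is the sense in which diversity emerges from a projection objective without a separate penalty term.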
Peer Reviews
Decision·Submitted to ICLR 2026
This paper tackles a very practical question, how to get full-data performance from a much smaller, smarter subset, and does so with a clean, unified formulation. Casting selection as mutual information between data embeddings and “query” signals gives a principled way to make quality and diversity fall out of the same objective, rather than balancing hand-tuned terms. The Gaussian projection view makes the geometry intuitive: choose examples whose span covers the directions encoded by the score
The central theory rests on a jointly Gaussian, approximately linear coupling between data embeddings and query signals. The paper does not probe robustness to misspecification of this assumption. A concrete way to strengthen this is to add stress tests where either the embedding map is deliberately distorted (e.g., random rotations, dimension reduction, or adversarial noise) or the score vectors are corrupted or biased, and then measure both objective values and downstream fine-tuning performan
This paper presents a principled and unified information-theoretic framework for LLM data selection, casting the problem as mutual information maximization under a joint Gaussian model. The formulation connects data quality and diversity within a single objective and provides a clear geometric interpretation through projection onto the span of selected data. The proposed greedy matching pursuit algorithm is conceptually simple, computationally efficient, and scalable to realistic dataset sizes.
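For reference, the joint Gaussian model mentioned above admits a standard closed form for mutual information. The notation here ($x$, $y$, $\Sigma$, $\mathrm{Id}$) is introduced only for illustration and is not taken from the paper, whose exact objective may be parameterized differently. For jointly Gaussian vectors $x$ and $y$ with joint covariance $\Sigma$ and blocks $\Sigma_{xx}$, $\Sigma_{xy}$, $\Sigma_{yx}$, $\Sigma_{yy}$:

$$
I(x; y) = \frac{1}{2}\log\frac{\det\Sigma_{xx}\,\det\Sigma_{yy}}{\det\Sigma}
= -\frac{1}{2}\log\det\!\left(\mathrm{Id} - \Sigma_{yy}^{-1}\Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\right)
$$

The second form follows from the Schur complement identity $\det\Sigma = \det\Sigma_{xx}\,\det\!\left(\Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\right)$. It shows that the mutual information depends only on how much of $y$ is linearly explained by $x$, which is consistent with the projection interpretation the reviewers describe.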
1. The theoretical framework relies heavily on the assumption that data and query embeddings are jointly Gaussian, which is unlikely to hold for real LLM embeddings.
2. Evaluations are conducted only on relatively small instruction-tuning datasets and mid-sized models (7B–8B). Could the authors provide results on larger datasets and larger models?
3. The method comparison is narrow, particularly on reasoning tasks. More recent or SOTA baselines are required to assess how much improvement comes fr
1. The framework models data selection as maximizing mutual information between the selected subset and query signals within a single, unified information-theoretic objective, which balances quality and diversity jointly rather than handling them sequentially.
2. A fast greedy matching pursuit (MP) algorithm is proposed to solve the approximate dual problem. It uses efficient, projection-based updates.
1. The total runtime complexity includes a substantial initial $O(m^2d)$ cost for precomputing the data inner product matrix. This makes the method hard to scale to truly massive datasets in practice, and suggests that the claim of "nearly linear" scaling is an overstatement, since linearity holds only after the quadratic precomputation.
2. To achieve efficiency, the method relies on linearization, optimizing an upper bound/trace approximation of the determinant objective, and
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms
