Reliability-Aware Determinantal Point Processes for Robust Informative Data Selection in Large Language Models
Ahmad Sarlak, Abolfazl Razi

TL;DR
This paper introduces ProbDPP, a reliability-aware determinantal point process method for robust, diverse data selection in large language models, addressing data access uncertainties with a new objective and online learning algorithm.
Contribution
The paper proposes ProbDPP, a novel approach that incorporates data reliability into DPP-based selection, together with an online learning algorithm that comes with theoretical guarantees.
Findings
ProbDPP effectively accounts for data access unreliability.
The proposed algorithm achieves bounded regret in online learning.
The method enhances data selection robustness under uncertainty.
Abstract
Informative data selection is a key requirement for large language models (LLMs) to minimize the amount of data required for fine-tuning, network distillation, and token pruning, enabling fast and efficient deployment, especially under computational and communication constraints. Traditional subset selection methods, including those based on Determinantal Point Processes (DPP), focus on maximizing diversity but assume that selected data batches are always available error-free. This presumption prohibits their use under partial storage outage, imperfect communication, and stochastic access failures. Furthermore, we show that the original formulation collapses under such conditions. To address this gap, we introduce ProbDPP, a novel reliability-aware implementation of k-DPP that accounts for probabilistic data access by recasting the objective function with a regularization term that…
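To make the setting in the abstract concrete, the following is a minimal sketch of standard greedy log-determinant (DPP-style) subset selection, plus a Monte Carlo estimate of how the diversity score changes when each selected item is only available with some probability. It is not the ProbDPP objective or algorithm: the abstract is truncated before the regularization term is defined, so none of that is reproduced here. The function names, the independent per-item access probabilities `p`, and the synthetic kernel are all assumptions made for illustration.

```python
import numpy as np

def log_det_score(L, subset):
    """Standard DPP diversity score: log det of the kernel restricted to subset."""
    if len(subset) == 0:
        return 0.0  # determinant of the empty matrix is 1
    sign, logdet = np.linalg.slogdet(L[np.ix_(subset, subset)])
    return logdet if sign > 0 else -np.inf

def greedy_select(L, k):
    """Greedy approximation: repeatedly add the item that maximizes log det."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(L.shape[0]) if i not in selected]
        best = max(candidates, key=lambda i: log_det_score(L, selected + [i]))
        selected.append(best)
    return selected

def expected_score_under_failures(L, subset, p, n_trials=2000, seed=0):
    """Monte Carlo estimate of the score when item i arrives with probability p[i].

    Assumption (illustrative only): access failures are independent Bernoulli draws.
    """
    rng = np.random.default_rng(seed)
    subset = np.asarray(subset)
    total = 0.0
    for _ in range(n_trials):
        arrived = subset[rng.random(len(subset)) < p[subset]]
        total += log_det_score(L, list(arrived))
    return total / n_trials

# Synthetic example: 8 items with 4-dimensional features (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
L = X @ X.T + 1e-3 * np.eye(8)   # small ridge keeps the kernel positive definite
chosen = greedy_select(L, k=3)
p = np.full(8, 0.9)              # assumed per-item access probability
print(chosen, log_det_score(L, chosen), expected_score_under_failures(L, chosen, p))
```

Comparing the two printed scores shows how far the score of a subset chosen under the error-free assumption can drift once items may fail to arrive, which is the gap the abstract says a reliability-aware objective is meant to address.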
Taxonomy
Topics: Distributed systems and fault tolerance · Stochastic Gradient Optimization Techniques · Age of Information Optimization
