Sampling from a $k$-DPP without looking at all items

Daniele Calandriello; Michał Dereziński; Michal Valko

arXiv:2006.16947·cs.LG·July 1, 2020

TL;DR

This paper introduces an efficient algorithm for sampling from a $k$-DPP that requires observing only a small subset of all items, significantly reducing computational costs while maintaining exact distributional guarantees.

Contribution

The authors develop a novel adaptive sampling algorithm that efficiently generates $k$-DPP samples without examining all items, improving scalability for large datasets.

Findings

01 Achieves several orders of magnitude faster sampling compared to previous methods.

02 Produces exact $k$-DPP samples by observing only a small fraction of data.

03 Empirically validated on large datasets with high accuracy.

Abstract

Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning and more. Given a kernel function and a subset size $k$, our goal is to sample $k$ out of $n$ items with probability proportional to the determinant of the kernel matrix induced by the subset (a.k.a. $k$-DPP). Existing $k$-DPP sampling algorithms require an expensive preprocessing step which involves multiple passes over all $n$ items, making them infeasible for large datasets. A naïve heuristic addressing this problem is to uniformly subsample a fraction of the data and perform $k$-DPP sampling only on those items; however, this method offers no guarantee that the produced sample will even approximately resemble the target distribution over the original dataset. In this…
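To make the target distribution concrete, here is a minimal brute-force sketch of the $k$-DPP definition stated in the abstract: each size-$k$ subset is weighted by the determinant of its kernel submatrix, and the weights are normalized. This is not the paper's algorithm. It enumerates every subset and looks at all $n$ items, so it only works for tiny $n$. The Gaussian kernel, the `bandwidth` parameter, and all function names are assumptions chosen for illustration.

```python
import itertools
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples of floats.
    The kernel choice is an assumption; any positive semidefinite kernel works."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def k_dpp_probabilities(items, k, kernel=rbf_kernel):
    """Exact k-DPP probabilities by enumerating all size-k subsets (tiny n only)."""
    weights = {}
    for subset in itertools.combinations(range(len(items)), k):
        sub = [[kernel(items[i], items[j]) for j in subset] for i in subset]
        # Clamp tiny negative values caused by floating-point round-off.
        weights[subset] = max(determinant(sub), 0.0)
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def sample_k_dpp(items, k, rng, kernel=rbf_kernel):
    """Draw one subset of indices from the enumerated k-DPP distribution."""
    probs = k_dpp_probabilities(items, k, kernel)
    subsets = list(probs)
    return rng.choices(subsets, weights=[probs[s] for s in subsets], k=1)[0]


# Two nearly identical items (indices 0 and 1) and one distant item (index 2).
items = [(0.0,), (0.1,), (5.0,)]
probs = k_dpp_probabilities(items, 2)
sample = sample_k_dpp(items, 2, random.Random(0))
```

In this toy example the pair of near-duplicate items `(0, 1)` receives far less probability than either pair containing the distant item, which is the diversity-favoring behavior the abstract describes. The enumeration cost grows as $\binom{n}{k}$, which is why practical samplers avoid it.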

Code & Models

Datasets

misovalko/my-research-papers
dataset · 21 dl

Videos

Sampling from a k-DPP without looking at all items · slideslive

Taxonomy

Topics: Data Management and Algorithms · Biometric Identification and Security · Bayesian Methods and Mixture Models