Statistical Foundations of DIME: Risk Estimation for Practical Index Selection

Giulio D'Erasmo; Cesare Campagnano; Antonio Mallia; Pierpaolo Brutti; Nicola Tonellotto; Fabrizio Silvestri

arXiv:2601.05649 · cs.IR · April 13, 2026

TL;DR

This paper introduces a statistically grounded method for selecting optimal embedding dimensions per query, reducing embedding size by about 50% while maintaining effectiveness in information retrieval.

Contribution

It provides a new criterion for dynamic, query-dependent dimension selection that eliminates the need for costly grid searches in DIME.

Findings

01. Achieves parity in effectiveness with reduced embedding size.

02. Reduces embedding size by approximately 50% on average.

03. Works across different models and datasets.

Abstract

High-dimensional dense embeddings have become central to modern Information Retrieval, but many dimensions are noisy or redundant. The recently proposed DIME (Dimension IMportance Estimation) provides query-dependent scores to identify the informative components of embeddings. However, DIME relies on a costly grid search to select, a priori, a dimensionality for all the embeddings in the query corpus. Our work provides a statistically grounded criterion that directly identifies the optimal set of dimensions for each query at inference time. Experiments confirm that the criterion achieves parity of effectiveness while reducing embedding size by an average of $\sim 50\%$ across different models and datasets at inference time.
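
To make the setting concrete, the following is a minimal sketch of query-dependent dimension selection with a fixed retention fraction, which is the kind of cutoff a grid search would have to tune. It is not the paper's criterion, which this summary does not spell out. The scoring rule (query times the centroid of a few feedback documents), the function names, and the toy vectors are all assumptions made for illustration.

```python
from typing import List, Sequence


def importance_scores(query: Sequence[float],
                      feedback_docs: Sequence[Sequence[float]]) -> List[float]:
    """Assumed scorer: element-wise product of the query with the
    centroid of a few feedback document embeddings."""
    n = len(feedback_docs)
    centroid = [sum(doc[i] for doc in feedback_docs) / n
                for i in range(len(query))]
    return [q * c for q, c in zip(query, centroid)]


def keep_top_fraction(query: Sequence[float],
                      scores: Sequence[float],
                      fraction: float) -> List[float]:
    """Zero out every query dimension except the highest-scoring ones.

    `fraction` is the fixed cutoff that would otherwise be grid-searched;
    a per-query criterion would replace this single global value.
    """
    keep = max(1, int(fraction * len(query)))
    ranked = sorted(range(len(query)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [q if i in kept else 0.0 for i, q in enumerate(query)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner-product relevance score between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


# Toy 4-dimensional example (made-up numbers).
query = [0.9, -0.2, 0.4, 0.1]
feedback = [[1.0, 0.5, 0.2, -0.3], [0.8, 0.7, 0.4, -0.1]]

scores = importance_scores(query, feedback)
masked = keep_top_fraction(query, scores, fraction=0.5)
relevance = dot(masked, feedback[0])
```

Here half of the query dimensions are dropped before scoring documents. The point of a per-query criterion is that the retained set, and its size, would be chosen for each query at inference time rather than fixed in advance.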
