CoRECT: A Framework for Evaluating Embedding Compression Techniques at Scale

L. Caspari; M. Dinzinger; K. Ghosh Dastidar; C. Fellicious; J. Mitrović; M. Granitzer

arXiv:2510.19340 · cs.IR · January 16, 2026

TL;DR

This paper introduces CoRECT, a comprehensive framework for evaluating embedding compression techniques at scale, addressing the impact of corpus complexity on dense retrieval performance and benchmarking eight compression methods.

Contribution

The paper presents CoRECT, a new large-scale evaluation framework with a curated dataset collection, enabling consistent comparison of embedding compression techniques across diverse corpus complexities.

Findings

01

Non-learned compression reduces index size significantly with minimal performance loss.

02

Performance of compression methods varies across models and datasets.

03

CoRECT facilitates informed selection of compression techniques.

Abstract

Dense retrieval systems have proven to be effective across various benchmarks, but require substantial memory to store large search indices. Recent advances in embedding compression show that index sizes can be greatly reduced with minimal loss in ranking quality. However, existing studies often overlook the role of corpus complexity -- a critical factor, as recent work shows that both corpus size and document length strongly affect dense retrieval performance. In this paper, we introduce CoRECT (Controlled Retrieval Evaluation of Compression Techniques), a framework for large-scale evaluation of embedding compression methods, supported by a newly curated dataset collection. To demonstrate its utility, we benchmark eight representative types of compression methods. Notably, we show that non-learned compression achieves substantial index size reduction, even on up to 100M passages, with…
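
To make the scale of the index size reduction concrete, the following is a minimal sketch of the storage arithmetic for a 100M-passage index, together with one simple non-learned quantizer. It is an illustration under stated assumptions, not the paper's implementation: the embedding dimension of 768, the helper names, and the choice of sign-based binarization are all hypothetical, and the abstract does not say which eight methods CoRECT benchmarks.

```python
# Illustrative sketch only: the quantizer below is a generic example of
# non-learned compression, not necessarily a method benchmarked in CoRECT.


def index_size_bytes(num_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Raw embedding storage in bytes, ignoring any index overhead."""
    return num_vectors * dim * bits_per_dim // 8


def binarize(vector: list[float]) -> bytes:
    """Sign-based binary quantization: 1 bit per dimension, packed into bytes."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            # Set bit i, most significant bit first within each byte.
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two packed binary codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


NUM_PASSAGES = 100_000_000  # corpus scale mentioned in the abstract
DIM = 768                   # assumed embedding dimension (hypothetical)

full_precision = index_size_bytes(NUM_PASSAGES, DIM, 32)  # float32 baseline
eight_bit = index_size_bytes(NUM_PASSAGES, DIM, 8)        # 8 bits per dimension
one_bit = index_size_bytes(NUM_PASSAGES, DIM, 1)          # 1 bit per dimension

for name, size in [("float32", full_precision), ("8-bit", eight_bit), ("1-bit", one_bit)]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

Under these assumptions, going from 32 bits to 1 bit per dimension shrinks raw storage by a factor of 32. How much ranking quality survives such a reduction is the empirical question the framework is built to answer.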

Code & Models

Datasets

PaDaS-Lab/CoRE
dataset· 185 dl

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Advanced Image and Video Retrieval Techniques