OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation

Haoyang Fang; Shuai Zhang; Yifei Ma; Hengyi Wang; Cuixiong Hu; Katrin Kirchhoff; Bernie Wang; George Karypis

arXiv:2603.17205·cs.IR·April 2, 2026

TL;DR

OPERA introduces static and dynamic data pruning methods to enhance the efficiency and effectiveness of retrieval model fine-tuning across multiple domains, reducing training time while improving ranking and retrieval performance.

Contribution

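The paper proposes a data pruning framework for fine-tuning retrieval models with two strategies: static pruning, which keeps only high-similarity query-document pairs, and dynamic pruning, which adaptively adjusts sampling probabilities over the full training set during training. It evaluates both on eight datasets spanning six domains and reports gains in performance and training efficiency.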

Findings

01

Static pruning improves ranking (NDCG) but may reduce retrieval (Recall).

02

Dynamic pruning achieves the best balance, enhancing both ranking and retrieval metrics.

03

Dynamic pruning (DP) reduces training time by over 50% while maintaining or improving performance.
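
The tradeoff in the first finding can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the cosine scoring, the `keep_fraction` parameter, and the hand-made two-dimensional vectors are all assumptions chosen for illustration. It shows how keeping only the highest-similarity pairs can drop entire queries from the training set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def static_prune(pairs, keep_fraction):
    """Keep the top `keep_fraction` of (query_id, query_vec, doc_vec) pairs
    ranked by query-document similarity. Hypothetical helper, not from the paper."""
    scored = sorted(pairs, key=lambda p: cosine(p[1], p[2]), reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]

# Toy data (assumed): q1 has two well-matched documents, q2 and q3 have poor matches.
pairs = [
    ("q1", [1.0, 0.0], [1.0, 0.0]),
    ("q1", [1.0, 0.0], [0.9, 0.1]),
    ("q2", [0.0, 1.0], [1.0, 0.0]),
    ("q3", [1.0, 1.0], [1.0, -1.0]),
]

kept = static_prune(pairs, keep_fraction=0.5)
kept_queries = {qid for qid, _, _ in kept}
all_queries = {qid for qid, _, _ in pairs}
print(f"kept {len(kept)} pairs covering {len(kept_queries)} of {len(all_queries)} queries")
```

In this toy set, both surviving pairs belong to `q1`, so the pruned data covers one query out of three. That loss of query diversity is the kind of effect the finding associates with reduced Recall.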

Abstract

Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both…
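
The abstract describes dynamic pruning as two-stage sampling at the query and document levels that keeps the full training set reachable. A minimal sketch of that idea follows, under stated assumptions: the softmax weighting, the `temperature` parameter, and the score dictionaries are hypothetical choices made here for illustration, and the paper's actual scoring and schedule are not given in this summary.

```python
import math
import random

def softmax(scores, temperature):
    """Turn scores into sampling probabilities; every entry stays strictly positive."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_pair(query_scores, doc_scores, temperature, rng):
    """Two-stage draw (assumed design): pick a query first, then one of its documents.

    query_scores: dict mapping query id -> quality score
    doc_scores:   dict mapping query id -> {document id -> quality score}
    """
    qids = list(query_scores)
    q_probs = softmax([query_scores[q] for q in qids], temperature)
    qid = rng.choices(qids, weights=q_probs, k=1)[0]

    dids = list(doc_scores[qid])
    d_probs = softmax([doc_scores[qid][d] for d in dids], temperature)
    did = rng.choices(dids, weights=d_probs, k=1)[0]
    return qid, did

# Toy scores (assumed). Higher means the example is treated as higher quality.
query_scores = {"q1": 0.9, "q2": 0.4, "q3": 0.1}
doc_scores = {
    "q1": {"d1": 0.8, "d2": 0.7},
    "q2": {"d3": 0.5},
    "q3": {"d4": 0.2, "d5": 0.1},
}

rng = random.Random(0)
draws = [sample_pair(query_scores, doc_scores, temperature=0.5, rng=rng) for _ in range(1000)]
```

Unlike a hard similarity cutoff, no pair is ever assigned zero probability here, so low-scoring queries can still be drawn. Lowering `temperature` concentrates sampling on high-scoring examples, and raising it moves the distribution toward uniform, which is one possible way to modulate sampling over the course of training.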
