AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training

Feiyang Kang; Nadine Chang; Maying Shen; Marc T. Law; Rafid Mahmood; Ruoxi Jia; Jose M. Alvarez

arXiv:2507.00049 · cs.CV · July 2, 2025

TL;DR

AdaDeDup is a hybrid data pruning method that adaptively reduces dataset size for large-scale object detection training, maintaining high performance while significantly cutting computational costs.

Contribution

The paper introduces a hybrid framework that combines density-based and model-informed pruning, using cluster-adaptive thresholds to improve data efficiency.

Findings

01  Outperforms existing pruning baselines on large-scale benchmarks.

02  Prunes 20% of the training data with minimal performance loss.

03  Cuts performance degradation by over 54% relative to random sampling.
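One plausible reading of "reduction in performance degradation" is the share of the random-sampling performance drop that a pruning method avoids. The sketch below works through that arithmetic. The function name and the scores are hypothetical, chosen only to illustrate the calculation; they are not taken from the paper, and the paper's exact metric definition may differ.

```python
def degradation_reduction(full_score, random_score, method_score):
    """Relative reduction in performance drop versus random sampling.

    All three scores are assumed to be on the same higher-is-better scale
    (for example, a detection metric measured after training).
    """
    random_drop = full_score - random_score   # loss from random pruning
    method_drop = full_score - method_score   # loss from the pruning method
    return (random_drop - method_drop) / random_drop


# Hypothetical scores, for illustration only (not results from the paper).
reduction = degradation_reduction(full_score=100.0,
                                  random_score=90.0,
                                  method_score=95.0)
print(f"{reduction:.0%}")  # prints "50%"
```

Under this reading, a value above 0.54 would mean the method loses less than half of what random sampling loses at the same pruning ratio.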

Abstract

The computational burden and inherent redundancy of large-scale datasets challenge the training of contemporary machine learning models. Data pruning offers a solution by selecting smaller, informative subsets, yet existing methods struggle: density-based approaches can be task-agnostic, while model-based techniques may introduce redundancy or prove computationally prohibitive. We introduce Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-informed feedback in a cluster-adaptive manner. AdaDeDup first partitions data and applies an initial density-based pruning. It then employs a proxy model to evaluate the impact of this initial pruning within each cluster by comparing losses on kept versus pruned samples. This task-aware signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive…
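The pipeline described in the abstract (partition, initial density-based pruning, proxy-loss comparison of kept versus pruned samples, per-cluster threshold adjustment) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cluster assignments and proxy-model losses are taken as given inputs, density-based pruning is approximated by greedy cosine-similarity de-duplication, and the update rule (a fixed `step` added to or subtracted from a `base` threshold) is a hypothetical stand-in for whatever rule the paper uses. All function and parameter names are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def density_prune(features, threshold):
    """Greedy de-duplication: keep a sample only if its similarity to every
    already-kept sample is below `threshold`. Returns kept indices."""
    kept = []
    for i, f in enumerate(features):
        if all(cosine(f, features[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def adapt_thresholds(features, cluster_ids, proxy_losses, base=0.9, step=0.05):
    """Assumed rule: if pruned samples have higher proxy loss than kept ones,
    the initial pruning likely removed informative data, so raise the
    similarity threshold (prune less); otherwise lower it (prune more)."""
    thresholds = {}
    for c in sorted(set(cluster_ids)):
        members = [i for i, cid in enumerate(cluster_ids) if cid == c]
        kept_local = set(density_prune([features[i] for i in members], base))
        kept = [proxy_losses[m] for k, m in enumerate(members) if k in kept_local]
        pruned = [proxy_losses[m] for k, m in enumerate(members) if k not in kept_local]
        if not pruned:  # nothing was pruned, so there is no signal to act on
            thresholds[c] = base
            continue
        gap = sum(pruned) / len(pruned) - sum(kept) / len(kept)
        thresholds[c] = min(1.0, base + step) if gap > 0 else max(0.0, base - step)
    return thresholds


def hybrid_prune(features, cluster_ids, proxy_losses, base=0.9, step=0.05):
    """Re-run density pruning in each cluster with its adapted threshold."""
    thresholds = adapt_thresholds(features, cluster_ids, proxy_losses, base, step)
    selected = []
    for c, t in thresholds.items():
        members = [i for i, cid in enumerate(cluster_ids) if cid == c]
        local = density_prune([features[i] for i in members], t)
        selected += [members[k] for k in local]
    return sorted(selected), thresholds


# Toy data: two clusters with identical geometry but different proxy losses.
features = [[1.0, 0.0], [0.92, 0.392], [1.0, 0.0], [0.92, 0.392]]
cluster_ids = [0, 0, 1, 1]
proxy_losses = [0.5, 0.4, 0.2, 0.9]
selected, thresholds = hybrid_prune(features, cluster_ids, proxy_losses)
print(selected, thresholds)
```

In the toy data both clusters contain the same near-duplicate pair, so a purely density-based rule treats them identically. The proxy loss breaks the tie: in cluster 1 the pruned sample has a much higher loss than the kept one, so the threshold rises and the sample is retained, while in cluster 0 the threshold drops and the duplicate stays pruned.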
