AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training
Feiyang Kang, Nadine Chang, Maying Shen, Marc T. Law, Rafid Mahmood, Ruoxi Jia, Jose M. Alvarez

TL;DR
AdaDeDup is a hybrid data pruning method that adaptively reduces dataset size for large-scale object detection training, maintaining high performance while significantly cutting computational costs.
Contribution
AdaDeDup introduces a hybrid framework that combines density-based and model-informed pruning, using cluster-adaptive thresholds to improve data efficiency.
Findings
Outperforms existing pruning baselines on large-scale benchmarks.
Reduces data by 20% with minimal performance loss.
Achieves over 54% reduction in performance degradation compared to random sampling.
Abstract
The computational burden and inherent redundancy of large-scale datasets challenge the training of contemporary machine learning models. Data pruning offers a solution by selecting smaller, informative subsets, yet existing methods struggle: density-based approaches can be task-agnostic, while model-based techniques may introduce redundancy or prove computationally prohibitive. We introduce Adaptive De-Duplication (AdaDeDup), a novel hybrid framework that synergistically integrates density-based pruning with model-informed feedback in a cluster-adaptive manner. AdaDeDup first partitions data and applies an initial density-based pruning. It then employs a proxy model to evaluate the impact of this initial pruning within each cluster by comparing losses on kept versus pruned samples. This task-aware signal adaptively adjusts cluster-specific pruning thresholds, enabling more aggressive…
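The cluster-adaptive step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `adaptive_keep_ratios`, the parameters `base_keep_ratio` and `strength`, and the `tanh`-based update rule are all hypothetical. It assumes clustering, the initial density-based pruning, and per-sample proxy-model losses have already been computed upstream. It also assumes that a cluster whose pruned samples show higher loss than its kept samples should be pruned less, and the reverse should be pruned more.

```python
import numpy as np

def adaptive_keep_ratios(cluster_ids, kept_mask, losses,
                         base_keep_ratio=0.8, strength=0.5):
    """Hypothetical rule: adjust each cluster's keep ratio from the gap
    between proxy-model loss on pruned samples and on kept samples."""
    ratios = {}
    for c in np.unique(cluster_ids):
        in_cluster = cluster_ids == c
        kept = losses[in_cluster & kept_mask]
        pruned = losses[in_cluster & ~kept_mask]
        if kept.size == 0 or pruned.size == 0:
            # No comparison possible; fall back to the global ratio.
            ratios[int(c)] = base_keep_ratio
            continue
        # Relative loss gap: positive when pruned samples look harder
        # than kept ones, i.e. the initial pruning may have been too
        # aggressive for this cluster (assumed interpretation).
        gap = (pruned.mean() - kept.mean()) / (kept.mean() + 1e-12)
        # tanh bounds the adjustment to [-strength, +strength].
        ratio = base_keep_ratio * (1.0 + strength * np.tanh(gap))
        ratios[int(c)] = float(np.clip(ratio, 0.0, 1.0))
    return ratios

# Toy data: three clusters of four samples, two kept and two pruned each.
cluster_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
kept_mask = np.array([True, True, False, False] * 3)
losses = np.array([1, 1, 1, 1,  1, 1, 3, 3,  2, 2, 1, 1], dtype=float)
print(adaptive_keep_ratios(cluster_ids, kept_mask, losses))
```

In this toy input, cluster 0 has equal losses on both sides and keeps the base ratio, cluster 1 has harder pruned samples and receives a higher keep ratio, and cluster 2 has easier pruned samples and receives a lower one.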
