Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression
Chenyue Yu, Lingao Xiao, Jinhong Deng, Ivor W. Tsang, Yang He

TL;DR
This paper introduces Dataset Color Quantization (DCQ), a novel framework that compresses large image datasets by reducing color-space redundancy, maintaining training effectiveness while significantly lowering storage requirements.
Contribution
DCQ is a new training-oriented dataset compression method that preserves essential color and structural information, improving training performance under aggressive data reduction.
Findings
DCQ achieves significant dataset size reduction with minimal impact on model accuracy.
Experiments show DCQ outperforms existing dataset compression approaches.
DCQ maintains training effectiveness across multiple benchmark datasets.
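To give a rough sense of where the storage reduction from color quantization comes from, the sketch below compares the bits needed for a raw 24-bit RGB image with those for a palette-indexed image. It is a back-of-the-envelope illustration, not the paper's accounting: the function names are hypothetical, and it ignores file-format overhead, entropy coding, and any palette sharing across images.

```python
def raw_bits(height: int, width: int) -> int:
    """Bits for an uncompressed RGB image at 8 bits per channel."""
    return height * width * 24


def indexed_bits(height: int, width: int, k: int) -> int:
    """Bits for a palette-indexed image with k colours (hypothetical accounting).

    Each pixel stores an index of ceil(log2(k)) bits, and the palette itself
    stores k RGB entries at 24 bits each.
    """
    bits_per_index = (k - 1).bit_length()  # integer form of ceil(log2(k)) for k >= 1
    return height * width * bits_per_index + k * 24


# Example: a CIFAR-sized 32x32 image with a 16-colour palette.
raw = raw_bits(32, 32)
indexed = indexed_bits(32, 32, 16)
print(raw, indexed, raw / indexed)
```

If a palette is shared by a group of images, as the abstract suggests, its 24 * k bits would be paid once per group rather than once per image, which matters most for small images.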
Abstract
Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image -- particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves…
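The idea of a palette shared across similar images can be illustrated with a minimal sketch. It assumes plain k-means on pooled RGB pixels with NumPy, and the function names are hypothetical. It is not the authors' implementation: it omits the model-guided color selection and structure-preserving steps the abstract describes.

```python
import numpy as np


def _sq_dists(pixels, palette):
    # Squared Euclidean distance from every pixel to every palette colour: (N, k).
    return ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=2)


def fit_shared_palette(images, k=8, iters=10, seed=0):
    """Fit one k-colour palette to the pooled pixels of a group of images."""
    rng = np.random.default_rng(seed)
    pixels = np.concatenate([im.reshape(-1, 3) for im in images]).astype(np.float64)
    # Initialise the palette from k distinct pixel positions.
    palette = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = _sq_dists(pixels, palette).argmin(axis=1)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # keep the old colour if a cluster is empty
                palette[j] = members.mean(axis=0)
    return palette


def quantize(image, palette):
    """Map each pixel to the index of its nearest palette colour."""
    flat = image.reshape(-1, 3).astype(np.float64)
    idx = _sq_dists(flat, palette).argmin(axis=1)
    return idx.reshape(image.shape[:2]).astype(np.uint8)


def reconstruct(indices, palette):
    """Rebuild an RGB image from palette indices."""
    return np.clip(np.rint(palette[indices]), 0, 255).astype(np.uint8)


# Toy usage on random data standing in for a group of similar images.
rng = np.random.default_rng(1)
images = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(4)]
palette = fit_shared_palette(images, k=8)
idx = quantize(images[0], palette)
recon = reconstruct(idx, palette)
```

Every image in the group is stored as an index map plus the single shared palette, so images quantized this way draw from the same set of colors.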
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper's method is clearly stated, and the paper is easy to follow.
2. The proposed method is intuitive and proven to be effective under the defined settings.

1. I am not very clear about the comparison with direct image quantization, especially as the size of the dataset increases. For example, if extending the experiment to ImageNet 22k, will the proposed method still outperform the direct image quantization?
2. As the classification task is highly abstract, the proposed method could work. However, it is not clear if the method still works for dense prediction tasks such as image segmentation and object detection. It would make the paper stronger i

- This work proposes a dataset-level color quantization framework tailored specifically for training tasks, overcoming the limitations of traditional color quantization methods. It effectively balances the trade-off between storage compression and model trainability.
- This paper introduces the training-oriented dataset color quantization framework that integrates (1) a shared clustering-based palette, (2) attention-guided bit allocation, and (3) edge-preserving optimization.
- The proposed

1. The method proposed in this paper reduces dataset storage by compressing the color space, whereas traditional dataset pruning achieves this goal by removing data samples. However, directly comparing these two approaches is somewhat unfair, as the method in this paper does not reduce the actual number of training samples. In other words, the proposed approach may not necessarily decrease the total number of training iterations required for the model training, leading to have some questions abo

1. The work focuses on color-level redundancy as a distinct dimension of dataset compression, rather than only reducing samples or resolution. This makes the problem practically relevant for scenarios where storage and bandwidth are constrained but training quality must be preserved.
2. The proposed framework combines chromaticity-aware clustering, attention-guided palette allocation, and differentiable, texture-preserving refinement. The components are mutually consistent and clearly aligned w

1. Most individual elements (clustering, attention guidance, STE-based refinement) are known in the literature; the central contribution lies in integrating them for dataset-level color compression. Readers may perceive this as a strong system design rather than a conceptual breakthrough.
2. The paper does not fully quantify the training-time overhead of palette learning, attention computation, and differentiable refinement, particularly on large datasets and modern backbones.
3. The evaluati
Taxonomy
Topics: Advanced Data Compression Techniques · Generative Adversarial Networks and Image Synthesis · Color Science and Applications
