dreaMLearning: Data Compression Assisted Machine Learning
Xiaobo Zhao, Aaron Hurst, Panagiotis Karras, Daniel E. Lucani

TL;DR
dreaMLearning introduces a framework that enables machine learning directly on compressed data, significantly reducing resource requirements while maintaining performance, thus advancing efficient learning on resource-constrained devices.
Contribution
The paper presents dreaMLearning, a novel approach allowing learning from compressed data without decompression, utilizing entropy-based lossless compression to improve efficiency across various ML tasks.
Findings
Training speed up to 8.8x faster
Memory usage reduced by 10x
Storage requirements cut by 42%
Abstract
Despite rapid advancements, machine learning, particularly deep learning, is hindered by the need for large amounts of labeled data to learn meaningful patterns without overfitting and immense demands for computation and storage, which motivate research into architectures that can achieve good performance with fewer resources. This paper introduces dreaMLearning, a novel framework that enables learning from compressed data without decompression, built upon Entropy-based Generalized Deduplication (EntroGeDe), an entropy-driven lossless compression method that consolidates information into a compact set of representative samples. DreaMLearning accommodates a wide range of data types, tasks, and model architectures. Extensive experiments on regression and classification tasks with tabular and image data demonstrate that dreaMLearning accelerates training by up to 8.8x, reduces memory usage…
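The abstract builds on generalized deduplication, which can be illustrated with a minimal sketch. This is not the authors' EntroGeDe implementation: the function names `gede_compress` and `gede_decompress`, the fixed `dev_bits` split, and the toy data are all assumptions made for illustration. Each unsigned integer is split into a base (high bits) and a deviation (low bits); identical bases are stored once, and their counts can serve as sample weights when computing on the representatives alone.

```python
import numpy as np

def gede_compress(x, dev_bits):
    """Toy generalized deduplication (illustrative, not the paper's code).

    Splits each value into a base (high bits) and a deviation (low bits),
    then stores every distinct base only once.
    """
    x = np.asarray(x, dtype=np.uint32)
    bases = x >> dev_bits                      # high bits, shared by similar values
    deviations = x & ((1 << dev_bits) - 1)     # low bits, kept per sample
    unique_bases, base_ids, counts = np.unique(
        bases, return_inverse=True, return_counts=True
    )
    return unique_bases, base_ids, deviations, counts

def gede_decompress(unique_bases, base_ids, deviations, dev_bits):
    """Exact inverse of gede_compress, so the scheme is lossless."""
    return (unique_bases[base_ids] << dev_bits) | deviations

# Hypothetical 8-bit samples; dev_bits=3 is an arbitrary choice.
x = np.array([200, 201, 203, 64, 66, 207], dtype=np.uint32)
unique_bases, base_ids, deviations, counts = gede_compress(x, dev_bits=3)
restored = gede_decompress(unique_bases, base_ids, deviations, dev_bits=3)

# Computing on representatives only: each base stands in for `counts` samples.
representatives = (unique_bases << 3).astype(float)
approx_mean = np.average(representatives, weights=counts)
```

In this sketch the six samples collapse to two weighted representatives, and the weighted mean approximates the true mean without touching the deviations. How the paper actually selects bases and uses them during training is not specified in this summary.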
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. A novel method that learns from compressed data without decompressing it, speeding up training and lowering memory demand; it outperforms all related approaches.
   - "accelerates training by up to 10×, reduces peak memory usage of training data by 10×, and cuts storage by 37%, with a minimal impact on model performance."
2. Per-epoch costs decrease significantly.
   - "dreaMLearning delivers higher accuracy at every budget with much lower time"
   - "Training time scales near-linearly with budget f
1. Lacks a strong theoretical section; the method seems to rest on heuristics rather than rigorous mathematics.
   - It includes some mathematics, covering implementation details and how the algorithm can be derived from MSE, but it lacks formal proofs.
2. For DCT-compressed images, they inverse-transform and reconvert before training; that is still a decompression step, which contradicts their framing.
   - "For DCT-compressed datasets, each retrie
1. Use of bit-level entropy for adaptive clustering and compression improves both information retention and compression efficiency.
2. Evaluations span multiple domains: not only image datasets but also tabular regression.
3. Achieves 10× faster training, 10× lower memory, and ~40% storage savings with minimal accuracy drop; especially promising for edge/federated applications.
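The bit-level entropy mentioned in the first point can be made concrete with a small sketch. It is an assumption-laden illustration rather than the paper's algorithm: the helper `bit_entropies`, the 0.95 threshold, and the toy samples are hypothetical. The idea shown is simply that bit positions with low empirical entropy are nearly constant across samples and therefore deduplicate well, while high-entropy positions behave like noise.

```python
import numpy as np

def bit_entropies(x, n_bits):
    """Empirical Shannon entropy (in bits) of each bit position, MSB first."""
    x = np.asarray(x, dtype=np.uint32)
    shifts = np.arange(n_bits - 1, -1, -1, dtype=np.uint32)
    bits = (x[:, None] >> shifts[None, :]) & 1   # shape: (samples, n_bits)
    p = bits.mean(axis=0)                        # fraction of ones per position
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return np.nan_to_num(h)                      # constant bits have zero entropy

# Hypothetical 8-bit samples.
x = np.array([200, 201, 203, 64, 66, 207], dtype=np.uint32)
h = bit_entropies(x, n_bits=8)

# Assumed rule for this sketch: low-entropy positions form the base,
# the remaining positions form the deviation. The threshold is arbitrary.
base_mask = h < 0.95
```

Here the two least significant bits are maximally unpredictable and end up in the deviation, while the six higher positions are assigned to the base. The summary does not state how EntroGeDe actually chooses its split.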
1. I have a big concern about the baseline results reported in the paper. All of them seem to be lower than the naive random-selection baseline, and all of the approaches show lower performance than full-data training. For instance, the reported results for InfoBatch are significantly weaker than those in the original paper. At first glance, I thought this was due to the large pruning ratio used in this paper. However, I checked the original InfoBatch paper and found that they can achieve 94.
The strengths can be summarized as follows.
- This paper proposes a unified pipeline that avoids explicit decompression and reports accuracy, time, RAM, and storage with consistent protocols.
- The proposed entropy-guided EntroGeDe provides a way to trade information retention against compression, with an explicit compressed-size objective and an implementable algorithm.
- The implementation details are clear for training setups and baselines, which improves reproducibility.
The weaknesses are summarized below.
- **The practical gains can be conservative in large-scale vision.** On ImageNet-1K with a 10% subset, accuracy is 59.9% vs. 59.7% for random selection with identical training time, while storage remains at 63% of the full dataset. The gain over random is small, and the storage figure limits the headline benefit in this regime.
- **The results of subset training may be confounded.** CIFAR-10 at 10% already reaches 90.2% versus 95.2% for the full dataset, which is a well-known
The paper proposes dreaMLearning to train directly on losslessly compressed data, integrating an entropy-guided extension of Generalized Deduplication (EntroGeDe); this creative combination distinguishes it from coreset selection and dataset distillation. Its quality is supported by a principled foundation in GeDe, intuitive mechanisms for information preservation, and empirical evidence across tabular and image tasks showing substantial runtime, memory, and storage gains with minimal accuracy
- The authors claim that training on compressed representations preserves the training-relevant distribution while accelerating convergence. However, there is limited formal analysis connecting EntroGeDe’s compression objective (entropy and deduplication) to generalization or optimization dynamics.
- Provide theoretical guarantees or bounds (e.g., stability, gradient bias, variance changes… relative to full-batch and mini-batch SGD) for some models beyond linear regression, and include conditio
Taxonomy
Topics: Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques · Big Data and Digital Economy
