EntroGD: Scalable Generalized Deduplication for Efficient Direct Analytics on Compressed IoT Data
Xiaobo Zhao, Daniel E. Lucani

TL;DR
EntroGD is a scalable, entropy-guided generalized deduplication framework that enables efficient direct analytics on compressed IoT data by lowering configuration complexity from quadratic to linear and reducing the amount of data that analytics must access.
Contribution
It introduces a novel entropy-guided GD framework that achieves linear complexity and supports direct analytics on compressed high-dimensional IoT datasets.
Findings
Reduces configuration time by up to 53.5 times.
Enables analytics with only 2.6% of original data.
Accelerates clustering by up to 31.6 times.
Abstract
Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access and direct analytics on compressed data. However, existing GD algorithms exhibit quadratic complexity, which limits their scalability for high-dimensional datasets. This paper proposes EntroGD, an entropy-guided GD framework that decouples analytical fidelity from compression efficiency to achieve linear complexity. EntroGD adopts a two-stage design, first constructing compact condensed samples to preserve information critical for analytics, and then applying entropy-based bit selection to maximize compression. Experiments on 18 IoT datasets show that EntroGD reduces configuration time by up to…
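To make the idea of generalized deduplication and entropy-ranked bits concrete, here is a minimal sketch in Python. It is a generic illustration, not the EntroGD algorithm: the abstract does not specify how condensed samples are built or which entropy criterion is used, so the function names (`bit_entropies`, `split_mask`, `gd_compress`, `gd_decompress`), the choice to treat the highest-entropy bit positions as deviation bits, and the toy 8-bit samples are all assumptions made for illustration.

```python
import math

def bit_entropies(samples, width):
    """Shannon entropy (in bits) of each bit position across integer samples."""
    n = len(samples)
    entropies = []
    for pos in range(width):
        ones = sum((s >> pos) & 1 for s in samples)
        p = ones / n
        if p in (0.0, 1.0):
            entropies.append(0.0)  # constant bit carries no information
        else:
            entropies.append(-(p * math.log2(p) + (1 - p) * math.log2(1 - p)))
    return entropies

def split_mask(entropies, num_deviation_bits):
    """Assumed rule: the highest-entropy positions become deviation bits."""
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    mask = 0
    for pos in ranked[:num_deviation_bits]:
        mask |= 1 << pos
    return mask

def gd_compress(samples, dev_mask):
    """Split each sample into a deduplicated base and a per-sample deviation."""
    base_ids, bases, encoded = {}, [], []
    for s in samples:
        base, dev = s & ~dev_mask, s & dev_mask
        if base not in base_ids:
            base_ids[base] = len(bases)
            bases.append(base)  # each distinct base is stored once
        encoded.append((base_ids[base], dev))
    return bases, encoded

def gd_decompress(bases, encoded):
    """Lossless reconstruction: OR each deviation back onto its base."""
    return [bases[i] | dev for i, dev in encoded]

# Toy 8-bit readings that differ only in their two lowest bits.
samples = [0b10110000, 0b10110001, 0b10110010, 0b10110011, 0b10110001]
mask = split_mask(bit_entropies(samples, 8), num_deviation_bits=2)
bases, encoded = gd_compress(samples, mask)
restored = gd_decompress(bases, encoded)
```

In this toy case all five samples share a single base, so an analytic such as clustering could operate on the one stored base rather than on every sample, and any individual sample can be rebuilt from its `(base id, deviation)` pair without decoding the rest.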
Taxonomy
Topics: Cloud Data Security Solutions · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
