How Many Samples Required in Big Data Collection: A Differential Message Importance Measure
Shanyun Liu, Rui She, Pingyi Fan

TL;DR
This paper introduces the Differential Message Importance Measure (DMIM), a new metric for assessing sample distribution quality in big data collection, linking it to distribution goodness-of-fit and aiding sampling point selection.
Contribution
It defines DMIM as a novel measure for message importance, connecting it to the Kolmogorov-Smirnov statistic and providing a new approach for evaluating distribution accuracy in sampling.
Findings
DMIM deviation correlates with distribution goodness-of-fit.
Numerical results validate the properties and accuracy of DMIM.
Decreasing DMIM deviation improves empirical distribution approximation.
Abstract
Information collection is a fundamental problem in big data, where the size of the sampling set plays a very important role. This work considers the information collection process by taking message importance into account. By analogy with differential entropy, we define the differential message importance measure (DMIM) as a measure of message importance for a continuous random variable. It is proved that the change in DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to the Kolmogorov-Smirnov statistic, but it offers a new way to characterize distribution goodness-of-fit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, it is also obtained that the empirical distribution approaches the real distribution with decreasing of the…
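The abstract relates the deviation of DMIM to the Kolmogorov-Smirnov statistic. This page does not reproduce the DMIM definition, so the sketch below does not compute DMIM. It only illustrates the Kolmogorov-Smirnov statistic itself, that is, the largest gap between the empirical CDF of a sample and a theoretical CDF, and how that gap tends to shrink as more samples are collected. The function names `normal_cdf` and `ks_statistic`, the choice of a standard normal as the theoretical distribution, and the sample sizes are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic: sup over x of |F_n(x) - F(x)|."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so the largest
        # gap at this point is found by checking both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs print the same values
    for n in (10, 100, 1000, 10000):
        samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
        print(n, round(ks_statistic(samples, normal_cdf), 4))
```

Running the script prints one statistic per sample size. For samples that really come from the assumed distribution, the values typically decrease roughly like one over the square root of the sample size, which is the sense in which the empirical distribution approaches the theoretical one.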
Taxonomy
Topics: Statistical Mechanics and Entropy · Complex Network Analysis Techniques · Neural Networks and Applications
