Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization
Shanyun Liu, Rui She, Pingyi Fan

TL;DR
This paper introduces the Differential Message Importance Measure (DMIM), a new distribution-free criterion for determining the required sample size in big data structure characterization, linking message importance to distribution goodness-of-fit.
Contribution
It proposes DMIM as a novel measure for data importance, providing a new approach to sample size determination and distribution assessment in big data analysis.
Findings
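The change in DMIM describes the gap between the distribution of a set of sample values and a theoretical distribution.
High-precision approximate values of DMIM are given for the normal distribution.
A smaller DMIM deviation indicates that the empirical distribution approximates the theoretical one more closely.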
Abstract
Data collection is a fundamental problem in the scenario of big data, where the size of sampling sets plays a very important role, especially in the characterization of data structure. This paper considers the information collection process by taking message importance into account, and gives a distribution-free criterion to determine how many samples are required in big data structure characterization. Similar to differential entropy, we define the differential message importance measure (DMIM) as a measure of message importance for a continuous random variable. The DMIM for many common densities is discussed, and high-precision approximate values for the normal distribution are given. Moreover, it is proved that the change of DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to…
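The abstract does not reproduce the DMIM definition itself. As a rough illustration only, the sketch below assumes DMIM is the integral of f(x)·exp(−f(x)) over the support of the density f, by analogy with differential entropy's −∫ f(x) log f(x) dx. That integrand, and the helper names `dmim`, `normal_pdf`, and `uniform_pdf`, are assumptions made here and are not taken from this page. If the paper's definition differs, only the integrand line would need to change.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def uniform_pdf(a, b):
    """Return the density of a uniform distribution on [a, b]."""
    width = b - a
    return lambda x: 1.0 / width if a <= x <= b else 0.0


def dmim(pdf, lo, hi, n=20000):
    """Midpoint-rule estimate of the ASSUMED DMIM integral
    of pdf(x) * exp(-pdf(x)) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        fx = pdf(lo + (i + 0.5) * h)       # density at the cell midpoint
        total += fx * math.exp(-fx)        # assumed DMIM integrand
    return total * h


# Uniform on [0, 2]: the integrand is constant, so under the assumed
# definition the closed form is exp(-1 / (b - a)) = exp(-0.5).
dmim_uniform = dmim(uniform_pdf(0.0, 2.0), 0.0, 2.0)

# Standard normal, truncated to [-10, 10], where the tails are negligible.
dmim_normal = dmim(normal_pdf, -10.0, 10.0)

print(dmim_uniform, math.exp(-0.5))
print(dmim_normal)
```

The uniform case serves as a sanity check because its value has a closed form under the assumed integrand; the normal case has no such elementary form, which is consistent with the abstract's mention of approximate values for that density.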
Taxonomy
Topics: Complex Network Analysis Techniques · Anomaly Detection Techniques and Applications · Advanced Clustering Algorithms Research
