Rules of Thumb for Information Acquisition from Large and Redundant Data
Wolfgang Gatterbauer

TL;DR
This paper models how information is acquired from large, redundant datasets, revealing that common intuitions like the 80-20 rule often do not apply, and provides robust rules for understanding sampling effects in power-law distributed data.
Contribution
It introduces an abstract model of information acquisition from redundant data, deriving new rules of thumb for sampling effects, especially under power-law and Zipf distributions, validated with real web data.
Findings
The 80-20 rule does not hold for Zipf distributions: randomly sampling 20% of the data is expected to yield less than 40% of the information.
For large data sets, random sampling from power-law distributions results in truncated distributions with the same power-law exponent.
Certain power-law functions remain invariant under sampling.
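The first finding can be explored numerically. The sketch below is not the paper's model verbatim; it rests on several assumptions made here for illustration: the i-th most frequent piece of information appears floor(n / i) times (one possible discretization of a Zipf redundancy distribution), and every copy in the data is kept independently with probability r (Bernoulli sampling, which approximates drawing a fraction r of a large data set). The function names are hypothetical. A piece of information counts as learned if at least one of its copies is sampled.

```python
import random


def zipf_redundancies(n_items):
    """Assumed Zipf-like redundancies: item i (1-based rank) has floor(n_items / i) copies."""
    return [n_items // i for i in range(1, n_items + 1)]


def expected_unique_recall(redundancies, r):
    """Expected fraction of distinct items seen at least once.

    Assumes each copy is sampled independently with probability r, so an item
    with rho copies is missed with probability (1 - r) ** rho.
    """
    learned = sum(1.0 - (1.0 - r) ** rho for rho in redundancies)
    return learned / len(redundancies)


def simulated_unique_recall(redundancies, r, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    learned = 0
    for rho in redundancies:
        # The item is learned if any one of its rho copies lands in the sample.
        if any(rng.random() < r for _ in range(rho)):
            learned += 1
    return learned / len(redundancies)


if __name__ == "__main__":
    rho = zipf_redundancies(10_000)
    for r in (0.1, 0.2, 0.5):
        exact = expected_unique_recall(rho, r)
        approx = simulated_unique_recall(rho, r)
        print(f"sampled fraction {r:.1f}: expected {exact:.3f}, simulated {approx:.3f}")
```

Under these assumptions, a 20% sample recovers a fraction of the distinct information in the neighborhood of 40%, far short of the 80% that the 80-20 intuition would suggest. The exact figure depends on how the Zipf tail is discretized and truncated, so the sketch should be read as a plausibility check rather than a reproduction of the paper's result.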
Abstract
We develop an abstract model of information acquisition from redundant data. We assume a random sampling process from data which provide information with bias and are interested in the fraction of information we expect to learn as a function of (i) the sampled fraction (recall) and (ii) varying bias of information (redundancy distributions). We develop two rules of thumb with varying robustness. We first show that, when information bias follows a Zipf distribution, the 80-20 rule or Pareto principle surprisingly does not hold, and we rather expect to learn less than 40% of the information when randomly sampling 20% of the overall data. We then analytically prove that for large data sets, randomized sampling from power-law distributions leads to "truncated distributions" with the same power-law exponent. This second rule is very robust and also holds for distributions that deviate…
Taxonomy
Topics: Algorithms and Data Compression · Complex Network Analysis Techniques · Advanced Text Analysis Techniques
