Ambiguities, Built-in Biases and Flaws in Big Data Insight Extraction

Serge Galam

arXiv:2506.21262·physics.soc-ph·June 27, 2025·Inf.

Open Access

TL;DR

This paper demonstrates that hierarchical classification methods in big data analysis can introduce systematic biases and ambiguities, affecting the reliability of extracted insights, even with complete data.

Contribution

It reveals how local ambiguity resolution in hierarchical models causes biases, highlighting fundamental flaws in common data reduction techniques.

Findings

1. Additional white aggregates emerge from local ambiguities.
2. Systematic bias increases with recursive aggregation.
3. Local symmetry-breaking decisions skew outcomes.

Abstract

I address the challenge of extracting reliable insights from large datasets using a simplified model that illustrates how hierarchical classification can distort outcomes. The model consists of discrete pixels labeled red, blue, or white. Red and blue indicate distinct properties, and white represents unclassified or ambiguous data. A macro-color is assigned only if one color holds a strict majority among the pixels. Otherwise, the aggregate is labeled white, reflecting uncertainty. This setup mimics a percolation threshold at fifty percent. Assuming that the proportions of the colors cannot be read directly from the data, I implement a hierarchical coarse-graining procedure. Elements (first pixels, then aggregates) are recursively grouped and reclassified via local majority rules, ultimately producing a single super-aggregate whose color represents the inferred macro-property of…
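
The coarse-graining procedure described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the fixed group size, the single-letter color codes, and the function names `majority_color`, `coarse_grain`, and `super_aggregate` are all assumptions introduced here. It also assumes that white elements count toward the group size when testing for a strict majority, which is one possible reading of the abstract.

```python
from collections import Counter

# Hypothetical single-letter color codes (not from the paper).
RED, BLUE, WHITE = "R", "B", "W"


def majority_color(group):
    """Return RED or BLUE if it holds a strict majority in the group, else WHITE.

    Assumption: white elements count toward the group size, so a color
    needs more than half of ALL elements in the group.
    """
    counts = Counter(group)
    half = len(group) / 2
    if counts[RED] > half:
        return RED
    if counts[BLUE] > half:
        return BLUE
    return WHITE


def coarse_grain(elements, group_size):
    """Apply one level of aggregation over consecutive, equal-sized groups."""
    if len(elements) % group_size != 0:
        raise ValueError("number of elements must be a multiple of group_size")
    return [
        majority_color(elements[i:i + group_size])
        for i in range(0, len(elements), group_size)
    ]


def super_aggregate(pixels, group_size):
    """Coarse-grain repeatedly until a single super-aggregate remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    level = list(pixels)
    while len(level) > 1:
        level = coarse_grain(level, group_size)
    return level[0]


# Toy example: 16 pixels, 9 red and 7 blue, in four groups of four.
pixels = list("RRBB" + "RRBB" + "RRRB" + "RRBB")
flat_result = majority_color(pixels)              # majority over all pixels at once
first_level = coarse_grain(pixels, 4)             # colors of the four aggregates
hierarchical_result = super_aggregate(pixels, 4)  # color of the super-aggregate
```

In this toy input red holds a strict majority of the pixels (9 of 16), so `flat_result` is red. Three of the four groups are tied 2 to 2 and become white, so the hierarchical result is white. This is a small instance of how local ambiguities can produce extra white aggregates even when the full data are available.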

Taxonomy

Topics: Big Data Technologies and Applications