Basic statistics for probabilistic symbolic variables: a novel metric-based approach
Antonio Irpino, Rosanna Verde

TL;DR
This paper introduces a novel metric-based approach using Wasserstein distance to compute basic statistics for distributional data, enabling improved analysis of multi-valued variables in data mining.
Contribution
It extends classic inertia measures with Wasserstein distance for distributional data, providing new tools for variability and association analysis in multi-valued variables.
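For orientation, the quantities involved can be written out for one-dimensional distributions. This is a sketch in standard notation, not taken from the summary above: the symbols $F_i$, $F_i^{-1}$ (quantile function), $\bar{F}$ and $s_W^2$ are assumptions introduced here for illustration.

```latex
% Squared L2 Wasserstein distance between two 1-D distributions F and G,
% written with their quantile functions F^{-1} and G^{-1}:
d_W^2(F, G) = \int_0^1 \left( F^{-1}(t) - G^{-1}(t) \right)^2 \, dt

% A mean distribution \bar{F} for n distributional observations,
% defined through the average quantile function:
\bar{F}^{-1}(t) = \frac{1}{n} \sum_{i=1}^{n} F_i^{-1}(t)

% A variance-like inertia measure: the mean squared distance to \bar{F}
s_W^2 = \frac{1}{n} \sum_{i=1}^{n} d_W^2(F_i, \bar{F})
```

When every $F_i$ is a point mass, $d_W^2$ reduces to the squared Euclidean distance and $s_W^2$ to the ordinary variance, which is the sense in which the classic inertia measure is extended.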
Findings
Proves properties of the Wasserstein distance in this context
Demonstrates the approach with a clustering algorithm on real data
Shows the effectiveness of the new statistics for distributional data
Abstract
In data mining, it is usual to describe a set of individuals using summaries (means, standard deviations, histograms, confidence intervals) that generalize individual descriptions into a typology description. In this case, data can be described by several values. In this paper, we propose an approach for computing basic statistics for such data and, in particular, for data described by numerical multi-valued variables (intervals, histograms, discrete multi-valued descriptions). We propose to treat all numerical multi-valued variables as distributional data, i.e. as individuals described by distributions. To obtain new basic statistics for measuring the variability and the association between such variables, we extend the classic measure of inertia, calculated with the Euclidean distance, using the squared Wasserstein distance defined between probability measures. The distance is a…
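As a rough illustration of the idea in the abstract, the following Python sketch computes a variance-like statistic for a set of one-dimensional distributions. It is not the authors' implementation: it assumes each distribution is represented by its quantile function sampled on a fixed grid, approximates the squared Wasserstein distance by a mean of squared quantile differences, and uses the averaged quantile function as the mean distribution. All function names (`quantile_grid`, `interval_quantiles`, `wasserstein2_squared`, `barycenter`, `wasserstein_variance`) are hypothetical.

```python
import numpy as np


def quantile_grid(sample, n_levels=1000):
    """Empirical quantile function of a sample, evaluated at midpoint levels in (0, 1)."""
    levels = (np.arange(n_levels) + 0.5) / n_levels
    return np.quantile(np.asarray(sample, dtype=float), levels)


def interval_quantiles(a, b, n_levels=1000):
    """Quantiles of an interval [a, b], treated (as an assumption) as a uniform distribution."""
    levels = (np.arange(n_levels) + 0.5) / n_levels
    return a + levels * (b - a)


def wasserstein2_squared(q_f, q_g):
    """Approximate the integral over t in [0, 1] of (F^-1(t) - G^-1(t))^2."""
    return float(np.mean((np.asarray(q_f) - np.asarray(q_g)) ** 2))


def barycenter(quantiles):
    """Mean distribution, represented by the pointwise average of the quantile functions."""
    return np.mean(quantiles, axis=0)


def wasserstein_variance(quantiles):
    """Mean squared Wasserstein distance of each distribution to the barycenter."""
    center = barycenter(quantiles)
    return float(np.mean([wasserstein2_squared(q, center) for q in quantiles]))


# Illustrative data only: three individuals, each described by a distribution.
rng = np.random.default_rng(0)
quantiles = np.stack([
    quantile_grid(rng.normal(loc=0.0, scale=1.0, size=5000)),
    quantile_grid(rng.normal(loc=2.0, scale=1.0, size=5000)),
    interval_quantiles(3.0, 5.0),
])
variance_w = wasserstein_variance(quantiles)
```

Representing every distribution on the same grid of quantile levels keeps the distance a plain vector operation, so intervals, histograms and raw samples can be compared once each has been converted to quantiles.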
