Basic statistics for probabilistic symbolic variables: a novel metric-based approach

Antonio Irpino; Rosanna Verde

arXiv:1110.2295·stat.ME·May 3, 2016

TL;DR

This paper introduces a metric-based approach that uses the squared Wasserstein distance to compute basic statistics for distributional data, so that numerical multi-valued variables (intervals, histograms, discrete multi-valued descriptions) can be analyzed in data mining.

Contribution

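It extends the classic measure of inertia, calculated with the Euclidean distance, to distributional data by using the squared Wasserstein distance between probability measures, providing new statistics for the variability of and the association between multi-valued variables.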

Findings

01

Proves properties of the Wasserstein distance in this context

02

Demonstrates the approach with a clustering algorithm on real data

03

Shows the effectiveness of the new statistics for distributional data

Abstract

In data mining, it is usual to describe a set of individuals using summaries (means, standard deviations, histograms, confidence intervals) that generalize individual descriptions into a typology description. In this case, data can be described by several values. In this paper, we propose an approach for computing basic statistics for such data and, in particular, for data described by numerical multi-valued variables (intervals, histograms, discrete multi-valued descriptions). We propose to treat all numerical multi-valued variables as distributional data, i.e. as individuals described by distributions. To obtain new basic statistics for measuring the variability of and the association between such variables, we extend the classic measure of inertia, calculated with the Euclidean distance, using the squared Wasserstein distance defined between probability measures. The distance is a…
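
The idea of replacing the Euclidean distance with the squared Wasserstein distance in an inertia-type measure can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one-dimensional distributions, each represented by its quantile function evaluated on a common grid of probability levels, and it relies on the standard fact that in one dimension the squared L2 Wasserstein distance equals the integral over (0, 1) of the squared difference between quantile functions. The function names, the grid size, and the simulated samples are all hypothetical choices made for illustration.

```python
import numpy as np

def quantiles_from_sample(sample, levels):
    """Empirical quantile function of a sample, evaluated at `levels`."""
    return np.quantile(np.asarray(sample, dtype=float), levels)

def wasserstein2_squared(q_a, q_b):
    """Squared L2 Wasserstein distance between two 1-D distributions.

    Both inputs are quantile functions on the same equally spaced grid of
    levels, so the integral over (0, 1) is approximated by a plain mean.
    """
    return float(np.mean((q_a - q_b) ** 2))

def mean_distribution(q_matrix):
    """Barycenter under this distance: the pointwise mean of the quantile functions."""
    return q_matrix.mean(axis=0)

def wasserstein_variance(q_matrix):
    """Inertia-style variance: average squared distance to the mean distribution."""
    mean_q = mean_distribution(q_matrix)
    return float(np.mean([wasserstein2_squared(q, mean_q) for q in q_matrix]))

# Hypothetical data: three individuals, each described by a sample
# (standing in for a histogram or interval description).
levels = (np.arange(200) + 0.5) / 200          # midpoints of 200 equal-probability bins
rng = np.random.default_rng(0)
samples = [rng.normal(loc=m, scale=s, size=500) for m, s in [(0, 1), (2, 1), (4, 2)]]
Q = np.vstack([quantiles_from_sample(s, levels) for s in samples])

variance = wasserstein_variance(Q)
print(f"Wasserstein-based variance: {variance:.3f}")
```

Because quantile functions on a common grid behave like ordinary vectors, the usual decomposition of inertia carries over: the variance computed above equals the sum of all pairwise squared distances divided by 2n², which gives a simple consistency check for any implementation.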
