Distance Assessment and Hypothesis Testing of High-Dimensional Samples using Variational Autoencoders
Marco Henrique de Almeida Inácio, Rafael Izbicki, Bálint Gyires-Tóth

TL;DR
This paper presents a method that uses variational autoencoders to measure the distance between high-dimensional datasets and to test whether they originate from the same distribution, aiding data exploration in machine learning.
Contribution
It introduces a novel approach combining variational autoencoders with permutation hypothesis testing for dataset comparison in high dimensions.
Findings
Effective distance measurement on generated and public datasets
Supports data exploration by quantifying dataset discrepancies
Applicable in early machine learning pipeline stages
Abstract
Given two distinct datasets, an important question is whether they have arisen from the same data generating function or, alternatively, how their data generating functions diverge from one another. In this paper, we introduce an approach for measuring the distance between two datasets with high dimensionality using variational autoencoders. This approach is augmented by a permutation hypothesis test in order to check the hypothesis that the data generating distributions are the same within a significance level. We evaluate both the distance measurement and hypothesis testing approaches on generated and on public datasets. According to the results, the proposed approach can be used for data exploration (e.g. by quantifying the discrepancy/separability between categories of images), which can be particularly useful in the early phases of the pipeline of most machine learning projects.
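The permutation hypothesis test mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper derives its distance from variational autoencoders, whereas the sketch below substitutes a deliberately simple stand-in (the Euclidean distance between sample means) so that it runs with the standard library alone. The names `mean_distance` and `permutation_test`, the number of permutations, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import random


def mean_distance(xs, ys):
    """Stand-in distance (an assumption, NOT the paper's VAE-based distance):
    Euclidean distance between the per-dimension sample means."""
    dim = len(xs[0])
    mx = [sum(x[d] for x in xs) / len(xs) for d in range(dim)]
    my = [sum(y[d] for y in ys) / len(ys) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mx, my)) ** 0.5


def permutation_test(xs, ys, distance=mean_distance, n_permutations=500, seed=0):
    """Return (observed distance, permutation p-value) for H0: same distribution."""
    rng = random.Random(seed)
    observed = distance(xs, ys)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        if distance(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical synthetic data: two samples from the same Gaussian,
# and one sample whose mean is shifted in every dimension.
data_rng = random.Random(1)
same_a = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
same_b = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
shifted = [[data_rng.gauss(1, 1) for _ in range(5)] for _ in range(100)]

d_same, p_same = permutation_test(same_a, same_b)
d_shift, p_shift = permutation_test(same_a, shifted)
print(f"same distribution:    distance={d_same:.3f}, p-value={p_same:.3f}")
print(f"shifted distribution: distance={d_shift:.3f}, p-value={p_shift:.3f}")
```

Any distance function with the same two-sample signature could be passed as `distance`; the permutation logic does not depend on how the distance is computed, which is what would allow a learned, model-based distance to be plugged in.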
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Anomaly Detection Techniques and Applications · Cell Image Analysis Techniques
