Recognizing Variables from their Data via Deep Embeddings of Distributions
Jonas Mueller, Alex Smola

TL;DR
This paper introduces a deep learning-based method for recognizing variables across datasets by embedding data distributions, improving robustness and accuracy over traditional statistical techniques.
Contribution
The paper presents a novel neural embedding approach for variable recognition that handles numeric and string data, outperforming existing distributional similarity methods.
Findings
Embeddings generalize well to new data sources.
The method outperforms standard distributional-similarity techniques.
Handles both numeric and string data effectively.
Abstract
A key obstacle in automated analytics and meta-learning is the inability to recognize when different datasets contain measurements of the same variable. Because provided attribute labels are often uninformative in practice, this task may be more robustly addressed by leveraging the data values themselves rather than just relying on their arbitrarily selected variable names. Here, we present a computationally efficient method to identify high-confidence variable matches between a given set of data values and a large repository of previously encountered datasets. Our approach enjoys numerous advantages over distributional similarity based techniques because we leverage learned vector embeddings of datasets which adaptively account for natural forms of data variation encountered in practice. Based on the neural architecture of deep sets, our embeddings can be computed for both numeric and…
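The abstract's core idea, embedding an unordered set of data values and then matching it against stored variables, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, only numeric values are handled, and the names `make_params`, `embed`, and `best_match`, the layer sizes, and the cosine-similarity matching rule are all hypothetical choices made for this example. It shows only the Deep Sets pattern of applying a per-element map, pooling with a permutation-invariant mean, and transforming the pooled vector.

```python
import numpy as np


def make_params(hidden=16, dim=8, seed=0):
    """Random, untrained weights (assumption: a real model would learn these)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(size=(1, hidden)),
        "b1": rng.normal(size=hidden),
        "W2": rng.normal(size=(hidden, dim)) / np.sqrt(hidden),
    }


def embed(values, params):
    """Deep Sets style embedding: rho(mean_i phi(x_i)), scaled to unit length."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    h = np.tanh(x @ params["W1"] + params["b1"])  # phi, applied per element
    pooled = h.mean(axis=0)                       # order-invariant pooling
    z = pooled @ params["W2"]                     # rho, applied to the pooled vector
    return z / (np.linalg.norm(z) + 1e-12)


def best_match(query, repository, params):
    """Return the repository variable whose embedding is most cosine-similar."""
    q = embed(query, params)
    scores = {name: float(q @ embed(vals, params))
              for name, vals in repository.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    params = make_params()
    rng = np.random.default_rng(1)
    # Hypothetical repository of previously seen variables.
    repository = {
        "age": rng.normal(40, 12, size=500),
        "price": rng.lognormal(3, 1, size=500),
    }
    name, scores = best_match(rng.normal(40, 12, size=200), repository, params)
    print(name, scores)
```

Because the pooling step is a mean, shuffling the input values leaves the embedding unchanged up to floating-point error, and sets of different sizes map to vectors of the same dimension. With untrained weights the similarity scores are not meaningful; the sketch only illustrates the data flow.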
