Can Deep Neural Networks Predict Data Correlations from Column Names?
Immanuel Trummer

TL;DR
This study investigates whether deep language models can predict data correlations from column names, introducing a new benchmark and analyzing factors affecting prediction accuracy to enhance database schema analysis.
Contribution
The paper introduces a novel benchmark for data correlation analysis using Kaggle datasets and evaluates language models' ability to predict correlations from schema text.
Findings
Language models can predict data correlations from column names.
Column name length and word ratio influence prediction success.
Schema text provides valuable information for database tuning and profiling.
Abstract
Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that…
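To make the task in the abstract concrete, here is a minimal sketch of how column pairs from one table could be turned into labeled samples (column name pair, correlated or not) for a language model to predict. It is an illustration, not the paper's actual pipeline. The Pearson coefficient, the 0.9 threshold, the helper names `pearson` and `label_column_pairs`, and the toy table are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def label_column_pairs(table, threshold=0.9):
    """Return (name_a, name_b, is_correlated) for every column pair.

    `table` maps column names to lists of numbers. The threshold is an
    assumed cut-off for this sketch, not a value taken from the paper.
    """
    samples = []
    for (name_a, col_a), (name_b, col_b) in combinations(table.items(), 2):
        r = pearson(col_a, col_b)
        samples.append((name_a, name_b, abs(r) >= threshold))
    return samples


# Hypothetical toy table: two columns measure the same quantity in
# different units, the third is unrelated.
table = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "lucky_number": [7.0, 3.0, 9.0, 1.0],
}
samples = label_column_pairs(table)
for name_a, name_b, correlated in samples:
    print(name_a, name_b, correlated)
```

A model trained on such samples would see only the two column names and would have to predict the boolean label without access to the underlying data.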
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dense Connections · Softmax · Layer Normalization
