Can Deep Neural Networks Predict Data Correlations from Column Names?

Immanuel Trummer

arXiv:2107.04553 · cs.DB · September 12, 2023 · 1 citation

TL;DR

This study investigates whether deep language models can predict data correlations from column names, introducing a new benchmark and analyzing factors affecting prediction accuracy to enhance database schema analysis.

Contribution

The paper introduces a novel benchmark for data correlation analysis using Kaggle datasets and evaluates language models' ability to predict correlations from schema text.

Findings

01. Language models can predict data correlations from column names.
02. Column name length and word ratio influence prediction success.
03. Schema text provides valuable information for database tuning and profiling.

Abstract

Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text. This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that…
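To make the prediction task in the abstract concrete, the following is a minimal sketch of how ground-truth labels for column pairs might be derived from a single table. It is not the paper's benchmark pipeline: the choice of the Pearson coefficient, the 0.9 threshold, the function names `pearson` and `label_column_pairs`, and the toy table are all assumptions made for illustration. A language model would then receive only the two column names and try to predict the final boolean.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Constant column: treated as uncorrelated here (an assumption).
        return 0.0
    return cov / sqrt(var_x * var_y)


def label_column_pairs(table, threshold=0.9):
    """Return (name_a, name_b, coefficient, is_correlated) for each column pair.

    `table` maps column names to lists of numeric values. The threshold is a
    hypothetical cut-off, not a value taken from the paper.
    """
    rows = []
    for a, b in combinations(sorted(table), 2):
        r = pearson(table[a], table[b])
        rows.append((a, b, r, abs(r) >= threshold))
    return rows


# Toy table invented for illustration; it is not part of the benchmark.
toy_table = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "lucky_number": [7, 3, 9, 3, 7],
}
labels = label_column_pairs(toy_table)
for name_a, name_b, r, correlated in labels:
    print(f"{name_a} / {name_b}: r={r:.3f}, correlated={correlated}")
```

In this toy case the two height columns are labeled as correlated and neither pairing with `lucky_number` is, which matches what the column names alone would suggest. The abstract mentions several correlation metrics, so swapping `pearson` for another coefficient would be a natural variation of the same sketch.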

Code & Models

Repositories

itrummer/DataCorrelationPredictionWithNLP (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dense Connections · Softmax · Layer Normalization