CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping
Veera V S Bhargav Nunna, Shinae Kang, Zheyuan Zhou, Virginia Wang, Sucharitha Boinapally, Michael Foley

TL;DR
This paper presents a character-level autoencoder that groups semantically identical columns in large-scale non-semantic relational datasets, avoiding the out-of-vocabulary and semantic-interpretability limitations of conventional NLP models and improving data profiling accuracy.
Contribution
The novel CAE approach operates at the character level with fixed dictionaries, enabling scalable, efficient identification of column similarities in industrial data environments.
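To make "character level with fixed dictionaries" concrete, here is a minimal sketch of how column values might be turned into fixed-length integer sequences before being fed to an encoder. The dictionary contents (printable ASCII), the `PAD`/`UNK` slots, the `max_len` default, and the function names are all assumptions for illustration; the summary does not specify the paper's actual dictionary or sequence length.

```python
import string

# Hypothetical fixed dictionary: printable ASCII plus two reserved slots.
# The paper's real dictionary is not given in this summary.
PAD, UNK = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(string.printable)}


def encode_value(value, max_len=16):
    """Map one cell value to a fixed-length list of character ids."""
    # Characters outside the dictionary share the single UNK id, so the
    # vocabulary size never grows with the data.
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in value[:max_len]]
    # Right-pad short values so every sequence has the same length.
    return ids + [PAD] * (max_len - len(ids))


def encode_column(values, max_len=16):
    """Encode every cell of a column; non-string cells are stringified."""
    return [encode_value(str(v), max_len) for v in values]
```

Because the vocabulary is fixed up front, an unseen identifier or IP address never produces an out-of-vocabulary token; at worst an unusual character collapses to `UNK`.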
Findings
Achieved 80.95% accuracy in top-5 column matching
Outperformed traditional NLP methods such as Bag of Words (47.62%)
Reduced memory and training time for large datasets
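One plausible reading of "top-5 column matching" accuracy is: for each query column, rank candidate columns by embedding similarity and count a hit when the true match appears among the first five. The sketch below follows that reading. The use of cosine similarity, the dictionary-based inputs, and all function names are assumptions, not details confirmed by the summary.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_matches(query, candidates, k=5):
    """Return the names of the k candidates most similar to the query embedding."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


def top_k_accuracy(queries, candidates, truth, k=5):
    """Fraction of query columns whose true match is among the top k."""
    hits = sum(
        truth[name] in top_k_matches(emb, candidates, k)
        for name, emb in queries.items()
    )
    return hits / len(queries)
```

Here `queries` and `candidates` map column names to embedding vectors and `truth` maps each query column to its correct candidate. The same function can score any embedding method, which is how a learned embedding and a Bag of Words baseline could be compared on equal terms.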
Abstract
Enterprise relational databases increasingly contain vast amounts of non-semantic data - IP addresses, product identifiers, encoded keys, and timestamps - that challenge traditional semantic analysis. This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets by detecting column similarities based on data patterns and structures. Unlike conventional Natural Language Processing (NLP) models that struggle with limitations in semantic interpretability and out-of-vocabulary tokens, our approach operates at the character level with fixed dictionary constraints, enabling scalable processing of large-scale data lakes and warehouses. The CAE architecture encodes text representations of non-semantic relational table columns and extracts high-dimensional feature embeddings for…
Taxonomy
Topics: Data Quality and Management · Time Series Analysis and Forecasting · Advanced Graph Neural Networks
