Scaling Experiments in Self-Supervised Cross-Table Representation Learning
Maximilian Schambach, Dominique Paul, Johannes S. Otterbach

TL;DR
This paper introduces a Transformer-based model for self-supervised representation learning on tabular data, studies how it scales across model sizes of up to roughly 10^7 parameters pretrained on 135M tokens from 76 datasets, and evaluates the pretrained models via linear probing.
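Linear probing means freezing the pretrained encoder and fitting only a linear classifier on its output embeddings. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: the embeddings are made-up toy vectors standing in for frozen encoder outputs, and the function names, learning rate, and step count are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_linear_probe(embeddings, labels, lr=0.5, steps=200):
    """Fit binary logistic regression on frozen embeddings (full-batch gradient descent).

    Only the probe's weights and bias are updated; the embeddings are treated
    as fixed inputs, which is what makes this a *linear probe*.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(logit) - y  # derivative of the log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, embeddings, labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = 0
    for x, y in zip(embeddings, labels):
        logit = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((1 if logit > 0 else 0) == y)
    return correct / len(labels)


# Toy stand-ins for frozen encoder outputs (assumption: 2-dimensional embeddings).
frozen_embeddings = [[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]]
labels = [1, 1, 0, 0]
weights, bias = fit_linear_probe(frozen_embeddings, labels)
```

Because the probe has no hidden layers, its accuracy reflects how linearly separable the pretrained representations already are, rather than the capacity of the classifier.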
Contribution
It presents a novel Transformer architecture tailored to tabular data, built from table-specific tokenizers and a shared Transformer backbone, and systematically studies its scaling properties across model sizes in both single-table and cross-table pretraining setups.
Findings
Scaling improves performance on benchmark datasets.
Cross-table pretraining enhances generalization.
Models with up to approximately 10^7 parameters are feasible to train and effective.
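To give a sense of what a parameter budget around 10^7 corresponds to, the following sketch counts the trainable parameters of a standard Transformer encoder stack (four attention projections with biases, a two-layer feed-forward block, and two layer norms per layer). The formula and the example configurations are assumptions for illustration; the paper's actual layer layout, widths, and depths are not given here, and tokenizer or embedding parameters are excluded.

```python
def encoder_param_count(d_model, n_layers, d_ff=None):
    """Approximate parameter count of a standard Transformer encoder stack.

    Assumes a textbook layer, not necessarily the architecture in the paper.
    """
    if d_ff is None:
        d_ff = 4 * d_model  # common default ratio, assumed here
    # Query, key, value, and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, each with bias.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return n_layers * (attention + feed_forward + layer_norms)


# Hypothetical configurations, chosen only to bracket small and large budgets.
small = encoder_param_count(d_model=16, n_layers=2)     # 6,560
large = encoder_param_count(d_model=256, n_layers=12)   # 9,477,120
```

Under these assumptions, a 12-layer backbone of width 256 already lands close to 10^7 parameters, since the count grows quadratically in the width and linearly in the depth.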
Abstract
To analyze the scaling potential of deep tabular representation learning models, we introduce a novel Transformer-based architecture specifically tailored to tabular data and cross-table representation learning by utilizing table-specific tokenizers and a shared Transformer backbone. Our training approach encompasses both single-table and cross-table models, trained via missing value imputation through a self-supervised masked cell recovery objective. To understand the scaling behavior of our method, we train models of varying sizes, ranging from approximately 10^4 to 10^7 parameters. These models are trained on a carefully curated pretraining dataset, consisting of 135M training tokens sourced from 76 diverse datasets. We assess the scaling of our architecture in both single-table and cross-table pretraining setups by evaluating the pretrained models using linear probing on a…
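The masked cell recovery objective mentioned in the abstract can be illustrated with a minimal sketch of the data side: randomly hide some cells of a row and keep the original values as prediction targets. This is an illustration under stated assumptions, not the authors' pipeline; the `[MASK]` placeholder, the masking probability, the function names, and the example row are all hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a hidden cell


def mask_cells(row, mask_prob=0.15, rng=None):
    """Hide each cell independently with probability mask_prob.

    Returns (masked_row, targets), where targets maps each masked column
    name to the original value the model would be trained to recover.
    """
    rng = rng or random.Random()
    masked_row, targets = {}, {}
    for column, value in row.items():
        if rng.random() < mask_prob:
            masked_row[column] = MASK
            targets[column] = value
        else:
            masked_row[column] = value
    return masked_row, targets


def recovery_accuracy(predictions, targets):
    """Fraction of masked cells whose predicted value equals the original."""
    if not targets:
        return 0.0
    hits = sum(1 for column, value in targets.items()
               if predictions.get(column) == value)
    return hits / len(targets)


# Made-up example row; the seed only makes the sketch reproducible.
row = {"age": 42, "city": "Berlin", "income": 51000.0}
masked_row, targets = mask_cells(row, mask_prob=0.5, rng=random.Random(0))
```

A model trained this way never needs labels: the supervision signal comes from the table itself, which is what allows the same objective to be used for both single-table and cross-table pretraining.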
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · AI in cancer detection
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Dropout · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Adam · Softmax
