PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization
Marco Spinaci, Marek Polewczyk, Johannes Hoffart, Markus C. Kohler, Sam Thelin, Tassilo Klein

TL;DR
PORTAL is a scalable self-supervised framework for tabular data that uses content-specific tokenization, enabling effective pre-training on diverse, uncleaned datasets and achieving competitive results on classification and regression tasks.
Contribution
The paper introduces PORTAL, a novel content-specific tokenization method that allows scalable pre-training of tabular models without data cleaning or structural constraints.
Findings
Effective pre-training on large, uncleaned datasets
Matches state-of-the-art methods on complex classification and regression tasks after fine-tuning
Handles multiple data modalities without preprocessing
Abstract
Self-supervised learning on tabular data seeks to apply advances from natural language and image domains to the diverse domain of tables. However, current techniques often struggle with integrating multi-domain data and require data cleaning or specific structural requirements, limiting the scalability of pre-training datasets. We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing. This simple yet powerful approach can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks. This work offers a practical advancement in self-supervised learning for large-scale tabular data.
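To make the idea of content-specific tokenization concrete, the sketch below processes a single row and picks a token type for each cell from the cell's own content rather than from a fixed schema. It is a minimal illustration under stated assumptions, not the authors' implementation: the names `tokenize_cell` and `tokenize_row`, the four token kinds (number, date, text, missing), and the detection order are all hypothetical choices made for this example.

```python
import math
from datetime import date


def tokenize_cell(column, value):
    """Return one token dict for a cell, chosen by the cell's content.

    Hypothetical helper for illustration only; the token kinds and the
    detection order are assumptions, not taken from the paper.
    """
    # Missing values get their own kind instead of being dropped or imputed.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return {"column": column, "kind": "missing"}

    # Native numbers (bool is excluded so flags stay categorical text).
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return {"column": column, "kind": "number", "value": float(value)}

    text = str(value).strip()
    if text == "":
        return {"column": column, "kind": "missing"}

    # Numbers stored as strings, common in uncleaned tables.
    try:
        number = float(text)
        if math.isfinite(number):
            return {"column": column, "kind": "number", "value": number}
    except ValueError:
        pass

    # ISO-formatted dates are split into year, month, and day parts.
    try:
        parsed = date.fromisoformat(text)
        return {
            "column": column,
            "kind": "date",
            "value": (parsed.year, parsed.month, parsed.day),
        }
    except ValueError:
        pass

    # Everything else is treated as free text and split on whitespace.
    return {"column": column, "kind": "text", "value": text.lower().split()}


def tokenize_row(row):
    """Tokenize one row (a column-name to value mapping), one cell at a time."""
    return [tokenize_cell(column, value) for column, value in row.items()]


example_row = {
    "price": "19.99",
    "sold_on": "2024-03-01",
    "note": "Limited Edition",
    "stock": None,
}
for token in tokenize_row(example_row):
    print(token)
```

Because each cell is handled independently and the column name travels with the token, rows from tables with different schemas can pass through the same function, which is the property that makes pre-training on heterogeneous, uncleaned data plausible in a design like this.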
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Model-Driven Software Engineering Techniques
