PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization

Marco Spinaci; Marek Polewczyk; Johannes Hoffart; Markus C. Kohler; Sam Thelin; Tassilo Klein

arXiv:2410.13516·cs.LG·October 18, 2024

TL;DR

PORTAL is a scalable self-supervised framework for tabular data that uses content-specific tokenization, enabling effective pre-training on diverse, uncleaned datasets and achieving competitive results on classification and regression tasks.

Contribution

The paper introduces PORTAL, a novel content-specific tokenization method that allows scalable pre-training of tabular models without data cleaning or structural constraints.

Findings

01  Pre-trains effectively on large, uncleaned, online-collected datasets

02  Matches state-of-the-art methods on complex classification and regression tasks after fine-tuning

03  Handles multiple data modalities without cleaning or preprocessing

Abstract

Self-supervised learning on tabular data seeks to apply advances from natural language and image domains to the diverse domain of tables. However, current techniques often struggle with integrating multi-domain data and require data cleaning or specific structural requirements, limiting the scalability of pre-training datasets. We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles various data modalities without the need for cleaning or preprocessing. This simple yet powerful approach can be effectively pre-trained on online-collected datasets and fine-tuned to match state-of-the-art methods on complex classification and regression tasks. This work offers a practical advancement in self-supervised learning for large-scale tabular data.
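To make the idea of content-specific tokenization concrete, the following is a minimal sketch of how a single row might be turned into typed tokens, one cell at a time, without any prior cleaning. It is an illustration only and not the authors' implementation: the function names `tokenize_cell` and `tokenize_row`, the four token kinds, and the detection rules (plain float parsing, ISO `YYYY-MM-DD` dates, lowercase whitespace splitting for text) are all assumptions chosen for brevity.

```python
from datetime import datetime
import math


def tokenize_cell(column, value):
    """Route one raw cell to a content-specific token (hypothetical scheme)."""
    text = "" if value is None else str(value).strip()
    if text == "":
        # Empty or missing cells get their own kind instead of being dropped.
        return {"column": column, "kind": "missing", "payload": None}

    # Numbers: keep the numeric value rather than the raw string.
    # Assumption: commas are thousands separators.
    try:
        number = float(text.replace(",", ""))
        if math.isfinite(number):
            return {"column": column, "kind": "number", "payload": number}
    except ValueError:
        pass

    # Dates: only ISO "YYYY-MM-DD" is recognized in this sketch.
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
        payload = (parsed.year, parsed.month, parsed.day)
        return {"column": column, "kind": "date", "payload": payload}
    except ValueError:
        pass

    # Everything else is treated as free text.
    return {"column": column, "kind": "text", "payload": text.lower().split()}


def tokenize_row(row):
    """Tokenize one row (a dict of column name -> raw cell), cell by cell."""
    return [tokenize_cell(column, value) for column, value in row.items()]


row = {"price": "1,234.5", "listed": "2024-10-18", "note": " Needs Repair ", "owner": None}
tokens = tokenize_row(row)
for token in tokens:
    print(token["column"], token["kind"], token["payload"])
```

In a real model each token kind would presumably feed its own embedding layer; the sketch stops at the routing step, which is the part that removes the need for a fixed schema or a cleaning pass.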

Code & Models

Repositories

sap-samples/portal (PyTorch, official)

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Model-Driven Software Engineering Techniques