Data Language Models: A New Foundation Model Class for Tabular Data
Eda Erol, Giuliano Pezzoli, Ozer Cem Kelahmet

TL;DR
This paper introduces Data Language Models (DLMs), a new foundation model class for tabular data that understands raw tables directly, eliminating preprocessing and enabling new AI applications.
Contribution
The paper presents Schema-1, the first DLM trained on diverse datasets, and reports better imputation and dataset classification performance than existing models.
Findings
Schema-1 outperforms traditional models on prediction benchmarks.
It achieves lower imputation error than classical statistical methods.
It can identify industry sectors from raw data alone.
Abstract
Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand…
