Prior-Aligned Data Cleaning for Tabular Foundation Models

Laure Berti-Equille

arXiv:2604.25154 · cs.LG · April 29, 2026

TL;DR

This paper introduces L2C2, a reinforcement learning framework that optimizes data cleaning for tabular foundation models by aligning real-world data with synthetic priors, improving accuracy and calibration.

Contribution

It presents the first deep RL approach for prior alignment in tabular data cleaning, demonstrating improved model performance and transferability across datasets.

Findings

01 Reward engineering is complex; some reward designs lead to trivial cleaning.

02 The TFMAwareReward improves pipeline selection and accuracy on certain datasets.

03 Parameterized cleaning actions enhance reward outcomes in most datasets.

Abstract

Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating processes, making them highly attractive for practitioners who cannot afford large annotated corpora. However, their in-context learning mechanism assumes approximately clean inputs: missing values, outliers, and duplicates in real-world data create a prior mismatch that degrades both accuracy and confidence calibration. Correcting this mismatch requires sequential decisions over cleaning operators whose interactions no static preprocessing rule can anticipate, a natural fit for reinforcement learning (RL). We introduce L2C2, the first deep RL framework framing tabular data cleaning as prior alignment: a learned policy sequences operators to minimize the distributional gap between dirty input and the TFM's synthetic…
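To make the claim about operator interactions concrete, here is a minimal sketch showing that the order of cleaning operators changes the result. It is not L2C2: it uses no RL policy and no TFM, and it swaps in a brute-force search over orderings plus a simple count of remaining defects as a stand-in for the distributional gap. The toy table, the `OUTLIER_LIMIT` threshold, the three operators, and the `dirtiness` score are all hypothetical names introduced for illustration.

```python
import itertools
import statistics

# Toy two-column table (hypothetical data). None marks a missing cell.
DIRTY = [
    (1.0, 10.0),
    (1.0, 10.0),    # exact duplicate row
    (2.0, None),    # missing value
    (2.0, 20.0),
    (3.0, 20.0),
    (500.0, 30.0),  # outlier under the fixed limit below
]
OUTLIER_LIMIT = 100.0  # assumed threshold, chosen only for this example


def impute_median(rows):
    """Fill each missing cell with the median of its column's observed values."""
    columns = list(zip(*rows))
    medians = [statistics.median([v for v in col if v is not None]) for col in columns]
    return [tuple(m if v is None else v for v, m in zip(row, medians)) for row in rows]


def clip_outliers(rows):
    """Clip every observed value into [-OUTLIER_LIMIT, OUTLIER_LIMIT]."""
    def clip(v):
        return v if v is None else max(-OUTLIER_LIMIT, min(OUTLIER_LIMIT, v))
    return [tuple(clip(v) for v in row) for row in rows]


def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))


OPERATORS = {
    "drop_duplicates": drop_duplicates,
    "impute_median": impute_median,
    "clip_outliers": clip_outliers,
}


def dirtiness(rows):
    """Stand-in for the prior gap: missing cells + duplicate rows + outlier cells."""
    missing = sum(v is None for row in rows for v in row)
    duplicates = len(rows) - len(set(rows))
    outliers = sum(v is not None and abs(v) > OUTLIER_LIMIT for row in rows for v in row)
    return missing + duplicates + outliers


def run(order, rows):
    """Apply the named operators to the rows in the given order."""
    for name in order:
        rows = OPERATORS[name](rows)
    return rows


# Score every ordering of the three operators (brute force, not a learned policy).
gaps = {order: dirtiness(run(order, DIRTY)) for order in itertools.permutations(OPERATORS)}
best_order = min(gaps, key=gaps.get)
best_gap = gaps[best_order]

for order, gap in gaps.items():
    print(" -> ".join(order), "leaves", gap, "defect(s)")
```

In this toy table, imputing after deduplication turns the row with the missing value into a copy of an existing row, so orderings that deduplicate first leave a duplicate behind. A static rule that fixes the order in advance cannot account for this kind of interaction, and exhaustive search stops being practical once operators take parameters and pipelines grow longer.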

Code & Models

Repositories

laureberti/Learn2Clean (GitHub)