Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training
Yunbo Long, Tejumade Afonja, Guangya Hao, Alexandra Brintrup, Mario Fritz

TL;DR
This paper introduces TabGRAA, a reward-guided post-training method for tabular language models that improves synthetic data quality and utility through an iterative generate-score-align protocol.
Contribution
The paper proposes TabGRAA, a group-relative alignment method for self-improving tabular language models via reward-guided post-training, and reports that it outperforms existing baselines.
Findings
TabGRAA improves fidelity and utility trade-offs across benchmarks.
Stable group-level updates are crucial for gains.
Both classifier-based and classifier-free rewards are effective.
Abstract
Tabular language models can generate synthetic tables by modeling rows as token sequences, but they are typically trained once with supervised fine-tuning and then used as static synthesizers. This is limiting because next-token likelihood does not directly optimize the distributional, utility, and indistinguishability properties used to evaluate synthetic data. We study iterative reward-guided post-training for tabular language models through a generate-score-align protocol, where a generator samples synthetic rows, a task-specified reward ranks them, and the model is updated relative to a fixed supervised reference. Within this protocol, we propose TabGRAA (Tabular Group-Relative Advantage Alignment), a group-relative alignment method that compares high- and low-reward generated groups using group-averaged policy/reference…
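To make the generate-score-align protocol concrete, here is a minimal toy sketch in Python. It is not the TabGRAA objective, which the truncated abstract does not fully specify. The single-column categorical "generator", the function names (`sample_rows`, `split_by_reward`, `align_step`, `run`), the reward function, and the `lr` and `beta` parameters are all hypothetical stand-ins chosen for illustration. The align step simply shifts logits toward categories that are more frequent in the high-reward group than in the low-reward group, with a pull back toward a fixed reference.

```python
import math
import random


def sample_rows(logits, n, rng):
    """Generate: draw n category indices from a softmax over the logits."""
    weights = [math.exp(value) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=n)


def split_by_reward(rows, reward_fn):
    """Score: rank rows by reward, then split into high and low halves."""
    ranked = sorted(rows, key=reward_fn, reverse=True)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


def align_step(logits, ref_logits, high, low, lr=0.5, beta=0.1):
    """Align (toy stand-in, not the paper's loss): move each logit by the
    difference in category frequency between the high and low groups,
    and shrink the gap to the fixed reference logits by a factor beta."""
    updated = list(logits)
    for category in range(len(logits)):
        freq_high = high.count(category) / max(len(high), 1)
        freq_low = low.count(category) / max(len(low), 1)
        updated[category] += lr * (freq_high - freq_low)
        updated[category] -= beta * (logits[category] - ref_logits[category])
    return updated


def run(rounds=5, n=200, seed=0):
    """Iterate generate -> score -> align for a fixed number of rounds."""
    rng = random.Random(seed)
    ref_logits = [0.0, 0.0, 0.0]  # fixed reference, standing in for the SFT model
    logits = list(ref_logits)     # the policy starts at the reference

    def reward_fn(row):
        # Hypothetical reward: category 2 is the "good" value.
        return 1.0 if row == 2 else 0.0

    for _ in range(rounds):
        rows = sample_rows(logits, n, rng)
        high, low = split_by_reward(rows, reward_fn)
        logits = align_step(logits, ref_logits, high, low)
    return logits


if __name__ == "__main__":
    print(run())
```

In this sketch the update depends only on group-level statistics (category frequencies in each half) rather than on individual rows, which loosely mirrors the group-level framing in the summary above. A real tabular language model would replace the logit vector with model parameters and the frequency difference with a loss defined on model and reference likelihoods.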
