RADAR: Benchmarking Language Models on Imperfect Tabular Data

Ken Gu; Zhihan Zhang; Kate Lin; Yuwei Zhang; Akshay Paruchuri; Hong Yu; Mehran Kazemi; Kumar Ayush; A. Ali Heydari; Maxwell A. Xu; Girish Narayanswamy; Yun Liu; Ming-Zher Poh; Yuzhe Yang; Mark Malhotra; Shwetak Patel; Hamid Palangi; Xuhai Xu; Daniel McDuff; Tim Althoff; and Xin Liu

arXiv:2506.08249 · cs.DB · November 3, 2025

TL;DR

RADAR is a benchmark for evaluating language models' ability to recognize and handle data artifacts in tabular data. It reveals significant performance gaps, especially when tables are perturbed with such artifacts.

Contribution

We introduce RADAR, a new benchmark with a framework for simulating data artifacts to systematically assess language models' data-aware reasoning on tabular data.

Findings

01

Models perform well on clean tables but degrade with data artifacts.

02

Performance drops are significant when handling missing values and outliers.

03

RADAR reveals critical gaps in current models' robustness to data imperfections.

Abstract

Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness -- the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies -- remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2980 table query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to…
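The idea of simulating data artifacts through programmatic perturbations can be illustrated with a minimal sketch. This is not the authors' framework: the function names (`inject_missing`, `inject_outlier`, `naive_mean`, `aware_mean`), the `steps` column, and the sample values are all hypothetical, chosen only to show how a clean table can be perturbed and how a data-unaware computation diverges from a data-aware one.

```python
import copy
import random
from statistics import mean

# Hypothetical illustration only; not the RADAR implementation.

def inject_missing(rows, column, fraction, seed=0):
    """Return a copy of rows with a fraction of `column` values set to None."""
    rng = random.Random(seed)
    out = copy.deepcopy(rows)
    k = max(1, round(fraction * len(out)))
    for i in rng.sample(range(len(out)), k):
        out[i][column] = None
    return out

def inject_outlier(rows, column, row_index, factor=100.0):
    """Return a copy of rows with one value scaled into an implausible range."""
    out = copy.deepcopy(rows)
    out[row_index][column] = out[row_index][column] * factor
    return out

def naive_mean(rows, column):
    """Data-unaware: silently treats missing values as zero."""
    return mean((r[column] or 0) for r in rows)

def aware_mean(rows, column):
    """Data-aware: excludes missing values before averaging."""
    return mean(r[column] for r in rows if r[column] is not None)

# A small clean table (assumed sample data).
clean = [{"day": d, "steps": s}
         for d, s in enumerate([8000, 8200, 7900, 8100, 8050])]

# Perturbed variants of the same table.
perturbed = inject_missing(clean, "steps", fraction=0.4, seed=1)
spiked = inject_outlier(clean, "steps", row_index=0)

print("clean mean:        ", aware_mean(clean, "steps"))
print("naive on perturbed:", naive_mean(perturbed, "steps"))
print("aware on perturbed:", aware_mean(perturbed, "steps"))
```

Because each perturbation returns a copy, the clean table stays available as a reference, so a model's answer on the perturbed table can be compared against an answer computed with the artifact handled appropriately.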

Code & Models

Repositories

kenqgu/radar
Official

Datasets

kenqgu/RADAR
dataset · 80 downloads

Taxonomy

Topics: Topic Modeling · Data Quality and Management · Machine Learning and Data Classification