Tab-Shapley: Identifying Top-k Tabular Data Quality Insights
Manisha Padala, Lokesh Nagalapatti, Atharv Tyagi, Ramasuri Narayanam, Shiv Kumar Saini

TL;DR
This paper introduces Tab-Shapley, an efficient, unsupervised method leveraging Shapley values to identify top-k anomalous attribute sets and data quality insights in tabular datasets, addressing complex dependencies without labeled data.
Contribution
We propose a novel, game theory-based framework that efficiently computes attribute contributions to anomalies, overcoming computational challenges and capturing attribute dependencies.
Findings
Effective identification of top-k anomaly insights
Efficient closed-form Shapley value computation
Validated on real-world datasets with ground-truth anomalies
Abstract
We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that…
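To make the idea of attributing anomalies to attributes with Shapley values concrete, here is a minimal sketch in Python. It is not the paper's method: it computes exact Shapley values by brute-force enumeration of coalitions (exponential time, which is the cost the abstract says Tab-Shapley avoids), and the characteristic function `coverage`, the `flags` table, and the attribute names are all hypothetical, chosen only for illustration. The value of a set of attributes is assumed here to be the number of records in which at least one of those attributes is flagged as anomalous.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (exponential time)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            # Weight of a coalition of size r that player p joins.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical per-record anomaly flags (1 = attribute looks anomalous).
flags = [
    {"age": 1, "zip": 1, "income": 0},
    {"age": 1, "zip": 0, "income": 0},
    {"age": 0, "zip": 0, "income": 1},
    {"age": 0, "zip": 0, "income": 0},
]
attributes = ["age", "zip", "income"]


def coverage(attrs):
    """Assumed characteristic function: records flagged by any attribute in attrs."""
    return sum(1 for row in flags if any(row[a] for a in attrs))


phi = shapley_values(attributes, coverage)

# Rank attributes by contribution and keep the top k.
k = 2
top_k = sorted(attributes, key=phi.get, reverse=True)[:k]
```

In this toy game a record flagged by two attributes splits its credit equally between them, so an attribute that is the sole explanation for a record ranks above one that only co-occurs with others. The Shapley values also sum to the value of the full attribute set, which is the efficiency property of the Shapley value.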
Taxonomy
Methods: Sparse Evolutionary Training
