DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI
Hyowon Cho, Soonwon Ka, Daechul Park, Jaewook Kang, Minjoon Seo, Bokyung Son

TL;DR
DSAI is a novel framework that extracts unbiased, interpretable latent features from large datasets, addressing limitations of LLMs in data grounding and enabling better data-driven insights.
Contribution
The paper introduces DSAI, a multi-stage pipeline with quantifiable metrics for unbiased, interpretable feature extraction directly from data.
Findings
High recall on synthetic datasets with known ground-truth features
Effective uncovering of meaningful patterns in real-world data
Supports interpretable classification with minimal expert input
Abstract
Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets due to their reliance on pre-trained knowledge rather than actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications on real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. The title of our paper is chosen from multiple candidates based on…
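As a rough illustration of what "recall in identifying expert-defined features" could mean on a synthetic dataset, the sketch below computes the fraction of known ground-truth features that show up among the extracted ones. It is a minimal sketch, not the paper's evaluation procedure: the function name `feature_recall`, the example feature names, and the case-insensitive exact-match rule are all assumptions made for illustration. The abstract does not say how extracted features are matched to ground truth.

```python
def feature_recall(extracted, ground_truth):
    """Fraction of ground-truth features that appear among the extracted ones."""
    # Assumption: two features match if they are equal after trimming
    # whitespace and lowercasing. A real evaluation may match more loosely.
    def normalize(name):
        return name.strip().lower()

    truth = {normalize(f) for f in ground_truth}
    if not truth:
        raise ValueError("ground_truth must contain at least one feature")
    found = {normalize(f) for f in extracted}
    return len(truth & found) / len(truth)


# Hypothetical feature names, for illustration only.
ground_truth = ["formal tone", "short sentences", "uses numbers", "second person"]
extracted = ["Formal tone", "uses numbers", "rhetorical questions", "short sentences"]

recall = feature_recall(extracted, ground_truth)
print(recall)  # 3 of the 4 ground-truth features are recovered
```

Recall alone ignores spurious extractions such as "rhetorical questions" above, so it would usually be read alongside a measure of how well each extracted feature is supported by the data.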
Taxonomy
Topics: Machine Learning and Data Classification · Time Series Analysis and Forecasting · Neural Networks and Applications
