DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI

Hyowon Cho; Soonwon Ka; Daechul Park; Jaewook Kang; Minjoon Seo; Bokyung Son

arXiv:2412.06303·cs.LG·February 19, 2025

Open Access

TL;DR

DSAI is a novel framework that extracts unbiased, interpretable latent features from large datasets, addressing limitations of LLMs in data grounding and enabling better data-driven insights.

Contribution

The paper introduces DSAI, a multi-stage pipeline with quantifiable metrics for unbiased, interpretable feature extraction directly from data.

Findings

01 High recall on synthetic datasets with known ground-truth features

02 Meaningful patterns uncovered in real-world data

03 Interpretable classification supported with minimal expert oversight

Abstract

Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets due to their reliance on pre-trained knowledge rather than actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications on real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. The title of our paper is chosen from multiple candidates based on…
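To make the recall claim concrete, here is a minimal sketch of how recall against known ground-truth features could be computed. It is an illustration only, not the paper's evaluation code: the function name `feature_recall`, the example feature strings, and the use of case-insensitive exact string matching are all assumptions made for this example, and the paper may match features differently.

```python
def feature_recall(ground_truth, extracted):
    """Fraction of ground-truth features that also appear among the extracted ones.

    Assumption: two features match when their text is identical after
    trimming whitespace and lowercasing. This is a simplification.
    """
    truth = {f.strip().lower() for f in ground_truth}
    found = {f.strip().lower() for f in extracted}
    if not truth:
        raise ValueError("ground_truth must contain at least one feature")
    return len(truth & found) / len(truth)


# Hypothetical feature lists, invented for this example.
expert_features = ["uses concrete numbers", "short sentences", "poses a question"]
extracted_features = ["Short sentences", "poses a question", "mentions a brand"]

recall = feature_recall(expert_features, extracted_features)
print(f"recall = {recall:.2f}")
```

In this toy case two of the three expert-defined features are recovered, so recall is 2/3. Extracted features that no expert listed do not lower recall; they would only affect a precision-style measure.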

Taxonomy

Topics: Machine Learning and Data Classification · Time Series Analysis and Forecasting · Neural Networks and Applications