LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model
Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu,, Shanghang Zhang

TL;DR
This paper introduces a novel framework called SSD-LLM that uses large language models to analyze and uncover subpopulation structures within datasets, aiding various data understanding and management tasks.
Contribution
The paper proposes a new method leveraging LLMs for interpretable subpopulation structure discovery and provides workflows for multiple downstream subpopulation analysis tasks.
Findings
Effective subpopulation structure analysis using LLMs.
Unified workflow for multiple subpopulation tasks.
Improved understanding of dataset heterogeneity.
Abstract
The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution within datasets provides a comprehensive understanding of the datasets, standing as a powerful tool beneficial to various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, there has been no work that systematically explores the subpopulation distribution of datasets to our knowledge. To address the limitation and solve all the mentioned tasks in a unified way, we introduce a novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs world…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
