LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Yulin Luo; Ruichuan An; Bocheng Zou; Yiming Tang; Jiaming Liu; Shanghang Zhang

arXiv:2405.02363·cs.CV·July 25, 2024

TL;DR

This paper introduces a novel framework called SSD-LLM that uses large language models to analyze and uncover subpopulation structures within datasets, aiding various data understanding and management tasks.

Contribution

The paper proposes a new method leveraging LLMs for interpretable subpopulation structure discovery and provides workflows for multiple downstream subpopulation analysis tasks.

Findings

01. Effective subpopulation structure analysis using LLMs.
02. Unified workflow for multiple subpopulation tasks.
03. Improved understanding of dataset heterogeneity.

Abstract

The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing this distribution provides a comprehensive understanding of the dataset and benefits various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, to our knowledge no prior work has systematically explored the subpopulation distribution of datasets. To address this limitation and solve all the mentioned tasks in a unified way, we introduce a novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs world…

Code & Models

Repositories

llm-as-dataset-analyst/SSDLLM
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques