Simulating Tabular Datasets through LLMs to Rapidly Explore Hypotheses about Real-World Entities

Miguel Zabaleta; Joel Lehman

arXiv:2411.18071·cs.AI·November 28, 2024

TL;DR

This paper investigates how large language models can rapidly generate and analyze tabular data about real-world entities to explore hypotheses, combining estimation and analysis to accelerate scientific discovery.

Contribution

The paper demonstrates that LLMs can serve as effective estimators of entity properties and can suggest relevant variables for hypothesis testing, which supports further automation of research.

Findings

01. Estimation accuracy improves with larger LLMs.

02. LLMs can identify variables relevant to a hypothesis.

03. LLMs show potential for automating hypothesis exploration.

Abstract

Do horror writers have worse childhoods than other writers? Though biographical details are known about many writers, quantitatively exploring such a qualitative hypothesis requires significant human effort, e.g. to sift through many biographies and interviews of writers and to iteratively search for quantitative features that reflect what is qualitatively of interest. This paper explores the potential to quickly prototype these kinds of hypotheses through (1) applying LLMs to estimate properties of concrete entities like specific people, companies, books, kinds of animals, and countries; (2) performing off-the-shelf analysis methods to reveal possible relationships among such properties (e.g. linear regression); and towards further automation, (3) applying LLMs to suggest the quantitative properties themselves that could help ground a particular qualitative hypothesis (e.g. number of…
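The first two steps described in the abstract (estimating properties per entity, then running an off-the-shelf analysis) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: `estimate_property` is a hypothetical stand-in for an LLM call, the entity names are placeholders, the property names are invented for this example, and the numbers are toy values made up purely so the example runs. They are not results from the paper.

```python
from statistics import mean

# Placeholder entities and illustrative property names (assumptions, not from the paper).
ENTITIES = ["writer_a", "writer_b", "writer_c", "writer_d"]
PROPERTIES = ["is_horror_writer", "childhood_adversity_score"]

# Toy values standing in for LLM estimates. Invented for illustration only.
_TOY_ESTIMATES = {
    ("writer_a", "is_horror_writer"): 1.0,
    ("writer_a", "childhood_adversity_score"): 6.0,
    ("writer_b", "is_horror_writer"): 1.0,
    ("writer_b", "childhood_adversity_score"): 8.0,
    ("writer_c", "is_horror_writer"): 0.0,
    ("writer_c", "childhood_adversity_score"): 3.0,
    ("writer_d", "is_horror_writer"): 0.0,
    ("writer_d", "childhood_adversity_score"): 5.0,
}


def estimate_property(entity: str, prop: str) -> float:
    """Hypothetical estimator. A real pipeline would prompt an LLM here
    and parse a number from its reply; this stub just looks up a toy value."""
    return _TOY_ESTIMATES[(entity, prop)]


def build_table(entities: list, props: list) -> list:
    """Step 1: build one row per entity with an estimated value per property."""
    return [
        {"entity": e, **{p: estimate_property(e, p) for p in props}}
        for e in entities
    ]


def fit_line(xs: list, ys: list) -> tuple:
    """Step 2: ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


table = build_table(ENTITIES, PROPERTIES)
slope, intercept = fit_line(
    [row["is_horror_writer"] for row in table],
    [row["childhood_adversity_score"] for row in table],
)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With a binary predictor like this, the fitted slope is simply the difference between the two group means, so it serves only as a quick first look at a hypothesis. Swapping the stub for a real LLM call would leave `build_table` and `fit_line` unchanged.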

Code & Models

Repositories

mzabaletasar/llm_hypoth_simulation (Official)

Taxonomy

Topics: Data Mining Algorithms and Applications · Data Quality and Management