Simulating Tabular Datasets through LLMs to Rapidly Explore Hypotheses about Real-World Entities
Miguel Zabaleta, Joel Lehman

TL;DR
This paper investigates how large language models can rapidly simulate tabular data about real-world entities, which off-the-shelf methods such as linear regression then analyze, so that qualitative hypotheses can be prototyped quickly.
Contribution
It demonstrates that LLMs can serve as effective estimators of entity properties and suggest relevant variables for hypothesis testing, enhancing research automation.
Findings
Estimation accuracy improves with larger models.
LLMs can identify relevant variables for hypotheses.
LLMs show potential to automate hypothesis exploration.
Abstract
Do horror writers have worse childhoods than other writers? Though biographical details are known about many writers, quantitatively exploring such a qualitative hypothesis requires significant human effort, e.g. to sift through many biographies and interviews of writers and to iteratively search for quantitative features that reflect what is qualitatively of interest. This paper explores the potential to quickly prototype these kinds of hypotheses through (1) applying LLMs to estimate properties of concrete entities like specific people, companies, books, kinds of animals, and countries; (2) performing off-the-shelf analysis methods to reveal possible relationships among such properties (e.g. linear regression); and towards further automation, (3) applying LLMs to suggest the quantitative properties themselves that could help ground a particular qualitative hypothesis (e.g. number of…
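The first two steps described in the abstract (estimate properties per entity, then run an off-the-shelf analysis) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `stub_estimator`, `build_table`, `fit_line`, the entity names, the property names, and every numeric value are hypothetical placeholders, and a real pipeline would replace the stub with an actual LLM query.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical placeholder values standing in for LLM answers.
# They are invented for illustration and describe no real writer.
_CANNED: Dict[Tuple[str, str], float] = {
    ("writer_a", "is_horror"): 0.0, ("writer_a", "childhood_adversity"): 2.0,
    ("writer_b", "is_horror"): 0.0, ("writer_b", "childhood_adversity"): 4.0,
    ("writer_c", "is_horror"): 1.0, ("writer_c", "childhood_adversity"): 5.0,
    ("writer_d", "is_horror"): 1.0, ("writer_d", "childhood_adversity"): 7.0,
}


def stub_estimator(entity: str, prop: str) -> float:
    """Stand-in for an LLM call that estimates one property of one entity."""
    return _CANNED[(entity, prop)]


def build_table(
    entities: List[str],
    properties: List[str],
    estimate: Callable[[str, str], float],
) -> List[Dict[str, float]]:
    """Step 1: one row per entity, one estimated value per property."""
    return [{p: estimate(e, p) for p in properties} for e in entities]


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Step 2: ordinary least squares for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


entities = ["writer_a", "writer_b", "writer_c", "writer_d"]
properties = ["is_horror", "childhood_adversity"]
table = build_table(entities, properties, stub_estimator)

slope, intercept = fit_line(
    [row["is_horror"] for row in table],
    [row["childhood_adversity"] for row in table],
)
```

With a binary predictor such as `is_horror`, the fitted slope is simply the difference in mean estimated adversity between the two groups, which is the quantity the opening question asks about. Any such result only reflects the estimator's outputs, so it is a prompt for closer investigation rather than evidence in itself.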
Taxonomy
Topics: Data Mining Algorithms and Applications · Data Quality and Management
