Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models
Naoki Egami, Musashi Hinck, Brandon M. Stewart, Hanying Wei

TL;DR
This paper introduces design-based supervised learning (DSL), a method that uses imperfect surrogate labels from large language models in social science analyses while keeping downstream statistical inference valid even when those labels are biased.
Contribution
The paper develops a doubly-robust estimator that combines surrogate labels with gold-standard labels, guaranteeing valid inference under minimal assumptions even when the surrogates are biased.
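To illustrate how combining surrogate and gold-standard labels can remove surrogate bias, here is a minimal sketch on simulated data. It is not the authors' implementation, and the summary above does not state the estimator's formula. The sketch assumes a standard doubly-robust construction: each document gets a pseudo-outcome equal to its surrogate label minus an inverse-probability-weighted correction that applies only to documents sampled for gold-standard labeling. As a simplification it uses the raw surrogate label where a fitted prediction model would normally go, and it estimates a simple mean rather than regression coefficients. The function name `dsl_mean`, the sampling probability `pi`, and the simulated error rates are all hypothetical.

```python
import numpy as np

def dsl_mean(q, y, r, pi):
    """Bias-corrected mean from surrogate labels q and gold labels y.

    q  : surrogate labels for all documents
    y  : gold-standard labels (only read where r is True)
    r  : boolean mask, True where a document was sampled for gold labeling
    pi : known probability with which each document was sampled
    """
    # Pseudo-outcome: surrogate minus an inverse-probability-weighted
    # correction that is nonzero only on gold-labeled documents.
    correction = np.where(r, (q - y) / pi, 0.0)
    pseudo = q - correction
    estimate = pseudo.mean()
    std_error = pseudo.std(ddof=1) / np.sqrt(len(pseudo))
    return estimate, std_error

# Simulated corpus (all numbers below are illustrative assumptions).
rng = np.random.default_rng(0)
n_docs, pi = 20_000, 0.1
x = rng.normal(size=n_docs)
y_true = rng.random(n_docs) < 1 / (1 + np.exp(-x))   # true binary labels
# Biased surrogate: 30% of true negatives are mislabeled as positive.
false_pos = (~y_true) & (rng.random(n_docs) < 0.3)
q = (y_true | false_pos).astype(float)
# Gold labels are collected for a random subset with known probability pi.
r = rng.random(n_docs) < pi
y_obs = np.where(r, y_true.astype(float), np.nan)    # unlabeled docs stay NaN

truth = y_true.mean()
surrogate_est = q.mean()
dsl_est, dsl_se = dsl_mean(q, y_obs, r, pi)
print(f"truth={truth:.3f} surrogate-only={surrogate_est:.3f} "
      f"corrected={dsl_est:.3f} (se={dsl_se:.3f})")
```

Because the sampling probability is fixed by the researcher, the correction term has mean equal to the surrogate's average error, so the corrected estimate stays centered on the truth regardless of how the surrogate errs. The surrogate-only average inherits the full bias.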
Findings
DSL achieves valid inference with biased surrogates.
DSL's root mean squared error is comparable to prediction-focused methods.
Theoretical guarantees hold under arbitrary surrogate bias.
Abstract
In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then explain labels using interpretable regression analyses in the second step. One increasingly common way to annotate documents cheaply at scale is through large language models (LLMs). However, like other scalable ways of producing annotations, such surrogate labels are often imperfect and biased. We present a new algorithm for using imperfect annotation surrogates for downstream statistical analyses while guaranteeing statistical properties -- like asymptotic unbiasedness and proper uncertainty quantification -- which are fundamental to CSS research. We show that direct use of surrogate labels in downstream statistical analyses leads to substantial bias and invalid confidence intervals, even with high…
Taxonomy
Topics: Computational and Text Analysis Methods
