Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models

Naoki Egami; Musashi Hinck; Brandon M. Stewart; Hanying Wei

arXiv:2306.04746·stat.ME·January 17, 2024·24 cites

TL;DR

This paper introduces design-based supervised learning (DSL), a method for using imperfect surrogate labels from large language models in downstream social science analyses while preserving valid statistical inference even when the surrogates are biased.

Contribution

The paper develops a doubly-robust estimator that combines surrogate labels with gold-standard labels. It guarantees valid inference under minimal assumptions, even when the surrogates are biased.
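
To make the idea concrete, here is a minimal Python sketch of one common design-based bias correction, applied to the simplest possible target: the mean of a binary label. It assumes gold-standard labels are collected for a random subset of documents with a known sampling probability `pi`, and it adjusts each surrogate prediction by an inverse-probability-weighted residual. The function name, the simulated data, and the error rates are illustrative assumptions, not the paper's implementation or results.

```python
import numpy as np


def bias_corrected_pseudo_outcome(surrogate, gold, labeled, pi):
    """Hypothetical helper: surrogate minus an inverse-probability-weighted residual.

    pseudo_i = surrogate_i - (labeled_i / pi) * (surrogate_i - gold_i)

    surrogate : surrogate (e.g. LLM) label for every document
    gold      : gold-standard label, only read where `labeled` is True
    labeled   : boolean indicator that a document received a gold-standard label
    pi        : known probability of being sampled for gold-standard labeling
    """
    surrogate = np.asarray(surrogate, dtype=float)
    labeled = np.asarray(labeled, dtype=bool)
    # Gold values of unlabeled documents are never used; fill them with 0
    # so that NaN placeholders do not propagate through the arithmetic.
    gold_filled = np.where(labeled, np.asarray(gold, dtype=float), 0.0)
    return surrogate - (labeled / pi) * (surrogate - gold_filled)


# Simulated data (all numbers below are assumptions for illustration).
rng = np.random.default_rng(0)
n = 20_000
y = rng.binomial(1, 0.3, size=n).astype(float)  # true labels

# A biased surrogate: true positives are kept 90% of the time,
# while true negatives are wrongly marked positive 25% of the time.
keep_positive = rng.random(n) < 0.90
false_positive = rng.random(n) < 0.25
surrogate = np.where(y == 1, keep_positive, false_positive).astype(float)

# Gold-standard labels for a random 10% of documents, sampled by design.
pi = 0.1
labeled = rng.random(n) < pi
gold = np.where(labeled, y, np.nan)

pseudo = bias_corrected_pseudo_outcome(surrogate, gold, labeled, pi)

truth = y.mean()
naive_est = surrogate.mean()    # uses surrogate labels directly
corrected_est = pseudo.mean()   # uses the bias-corrected pseudo-outcome
corrected_se = pseudo.std(ddof=1) / np.sqrt(n)

print(f"truth={truth:.3f} naive={naive_est:.3f} "
      f"corrected={corrected_est:.3f} (se={corrected_se:.3f})")
```

Because the sampling probability is fixed by the researcher, the correction term has mean zero whatever the surrogate's error pattern, which is why the corrected estimate stays close to the truth while the naive surrogate mean does not. In a regression setting the same pseudo-outcome would stand in for the label as the dependent variable; that extension is not shown here.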

Findings

1. DSL achieves valid inference with biased surrogates.

2. DSL's root mean squared error is comparable to that of prediction-focused methods.

3. Theoretical guarantees hold under arbitrary surrogate bias.

Abstract

In computational social science (CSS), researchers analyze documents to explain social and political phenomena. In most scenarios, CSS researchers first obtain labels for documents and then explain labels using interpretable regression analyses in the second step. One increasingly common way to annotate documents cheaply at scale is through large language models (LLMs). However, like other scalable ways of producing annotations, such surrogate labels are often imperfect and biased. We present a new algorithm for using imperfect annotation surrogates for downstream statistical analyses while guaranteeing statistical properties -- like asymptotic unbiasedness and proper uncertainty quantification -- which are fundamental to CSS research. We show that direct use of surrogate labels in downstream statistical analyses leads to substantial bias and invalid confidence intervals, even with high…

Videos

Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models (SlidesLive)

Taxonomy

Topics: Computational and Text Analysis Methods