What Is Actually Being Annotated? Inter-Prompt Reliability as a Measurement Problem in LLM-Based Social Science Labeling
Jingyuan Liu

TL;DR
This paper introduces Inter-Prompt Reliability (IPR), a framework for measuring how stable LLM annotations are across varied prompts. It finds substantial stochastic variation, especially in interpretative tasks, and advocates aggregating labels across prompts.
Contribution
The paper proposes the IPR framework to evaluate LLM annotation reliability across different prompts, highlighting the importance of prompt aggregation for reproducibility in social science research.
Findings
LLM annotations show high stochastic variation in interpretative tasks.
Majority voting across prompts improves reproducibility.
Prompt wording introduces methodological uncertainty.
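As an illustration of the majority-voting finding, the sketch below aggregates labels item by item across prompt variants. It is a minimal sketch, not the paper's implementation: the function name `majority_vote`, the prompt identifiers, the toy labels, and the tie-breaking rule (the label seen first in prompt order wins) are all assumptions made for the example.

```python
from collections import Counter

def majority_vote(labels_by_prompt):
    """Return one aggregated label per item.

    labels_by_prompt maps a prompt id to a list of labels, with every
    list covering the same items in the same order (an assumption).
    Ties go to the label seen first in prompt order (also an assumption).
    """
    lengths = {len(v) for v in labels_by_prompt.values()}
    if len(lengths) != 1:
        raise ValueError("every prompt must label the same items")
    # zip(*...) turns per-prompt rows into per-item columns.
    columns = zip(*labels_by_prompt.values())
    return [Counter(col).most_common(1)[0][0] for col in columns]

# Hypothetical toy data: three prompt variants labeling four items.
toy_labels = {
    "prompt_a": ["LOC", "NUM", "HUM", "LOC"],
    "prompt_b": ["LOC", "NUM", "ENTY", "LOC"],
    "prompt_c": ["LOC", "DESC", "HUM", "LOC"],
}
aggregated = majority_vote(toy_labels)
print(aggregated)
```

With an odd number of prompts and two candidate labels, ties cannot occur; with more labels or an even number of prompts, the tie-breaking rule becomes a design choice that should be reported.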
Abstract
Large language models (LLMs) are increasingly used for annotation in computational social science, yet their methodological reliability under prompt variation remains unclear. This paper introduces Inter-Prompt Reliability (IPR), a framework for evaluating the stability of LLM outputs across semantically equivalent but linguistically varied prompts. Drawing on Inter-Rater Reliability, IPR is measured by Pairwise Agreement Rate (PAR) and its distribution to capture both consistency and stochasticity in model behavior. We evaluate this framework on two tasks with distinct properties: TREC (interpretative) and Politifact (knowledge-anchored). Results show that LLM annotation exhibits substantial stochastic variation in interpretative tasks, while appearing more stable in knowledge-based tasks. We further show that majority voting across prompts significantly improves reproducibility and…
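The abstract names Pairwise Agreement Rate (PAR) but this excerpt does not give its formula. The sketch below assumes one plausible reading: for each pair of prompts, PAR is the fraction of items on which the two prompts yield the same label, and the distribution is the set of these values over all prompt pairs. The function name, prompt identifiers, and toy labels are hypothetical.

```python
from itertools import combinations
from statistics import mean

def pairwise_agreement_rates(labels_by_prompt):
    """Map each prompt pair to the share of items labeled identically.

    Assumes every prompt labeled the same items in the same order.
    This is an assumed reading of PAR, not the paper's stated formula.
    """
    rates = {}
    for (p, a), (q, b) in combinations(labels_by_prompt.items(), 2):
        if len(a) != len(b) or not a:
            raise ValueError("prompts must label the same non-empty item set")
        rates[(p, q)] = sum(x == y for x, y in zip(a, b)) / len(a)
    return rates

# Hypothetical toy data: three prompt variants labeling four items.
labels = {
    "prompt_a": ["LOC", "NUM", "HUM", "LOC"],
    "prompt_b": ["LOC", "NUM", "ENTY", "LOC"],
    "prompt_c": ["LOC", "DESC", "HUM", "LOC"],
}
rates = pairwise_agreement_rates(labels)
mean_par = mean(rates.values())
for pair, rate in rates.items():
    print(pair, rate)
print("mean PAR:", round(mean_par, 3))
```

Keeping the per-pair values rather than only their mean makes it possible to inspect the spread of agreement, which is what a distribution-based view of reliability would need.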
