Codebook LLMs: Evaluating LLMs as Measurement Tools for Political Science Concepts
Andrew Halterman, Katherine A. Keith

TL;DR
This paper evaluates how well large language models can automatically measure complex political science concepts using real-world codebooks. It proposes an evaluation framework and shows that zero-shot performance is limited but improves with supervised instruction tuning.
Contribution
It introduces a five-stage framework for assessing LLMs in political text coding and provides curated datasets, evaluation methods, and guidance for researchers.
Findings
Open-weight LLMs struggle with zero-shot codebook adherence.
Supervised instruction tuning significantly improves LLM measurement accuracy.
The paper offers datasets and an evaluation framework for future research.
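As a rough illustration of what comparing LLM-assigned codes against human labels can look like, the sketch below computes accuracy and macro-averaged F1 in plain Python. It is not the paper's evaluation code: the function names, the label strings, and the choice of macro-F1 as the metric are all assumptions made for this example.

```python
def accuracy(gold, pred):
    """Share of documents where the predicted code matches the human code."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, taken over the classes in the human labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class the model never predicts correctly scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only (not drawn from the paper's datasets).
human_labels = ["protest", "riot", "protest", "none"]
llm_labels = ["protest", "protest", "protest", "none"]

acc = accuracy(human_labels, llm_labels)
f1 = macro_f1(human_labels, llm_labels)
print(f"accuracy={acc:.2f} macro_f1={f1:.2f}")
```

Macro-averaging weights every class equally, so a rare category that the model never predicts pulls the score down even when overall accuracy looks acceptable.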
Abstract
Codebooks -- documents that operationalize concepts and outline annotation procedures -- are used almost universally by social scientists when coding political texts. To code these texts automatically, researchers are increasingly turning to generative large language models (LLMs). However, there is limited empirical evidence on whether "off-the-shelf" LLMs faithfully follow real-world codebook operationalizations and measure complex political constructs with sufficient accuracy. To address this, we gather and curate three real-world political science codebooks -- covering protest events, political violence and manifestos -- along with their unstructured texts and human labels. We also propose a five-stage framework for codebook-LLM measurement: preparing a codebook for both humans and LLMs, testing LLMs' basic capabilities on a codebook, evaluating zero-shot measurement accuracy (i.e.…
