TL;DR
This study systematically evaluates how implementation choices (model, model size, learning approach, and prompt style) affect LLM-based political text annotation. It finds that interactions between these choices matter more than any single factor and proposes a validation-first framework for researchers.
Contribution
The paper provides a controlled comparison of models, prompt styles, and learning approaches, shows that interaction effects outweigh main effects, and introduces a validation-first framework.
Findings
Interaction effects dominate main effects in pipeline choices.
No single model or prompt style is best across all tasks.
Larger models are not always more resource-efficient or better performing.
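To make the first finding concrete, here is a minimal sketch of how a two-factor score grid can be split into main effects and an interaction term. The scores, the factor names (`model_a`, `zero_shot`, and so on), and the `decompose` helper are all hypothetical, chosen only for illustration. They are not results or code from the paper, which may use a different statistical procedure.

```python
from statistics import mean

# Hypothetical toy scores (e.g. macro-F1) for a 2 x 2 design.
# These numbers are illustrative only and are NOT taken from the paper.
scores = {
    ("model_a", "zero_shot"): 0.70,
    ("model_a", "few_shot"): 0.60,
    ("model_b", "zero_shot"): 0.60,
    ("model_b", "few_shot"): 0.70,
}


def decompose(scores):
    """Split a two-factor grid into grand mean, main effects, and interaction."""
    rows = sorted({r for r, _ in scores})
    cols = sorted({c for _, c in scores})
    grand = mean(scores.values())
    # Main effect of a level = its marginal mean minus the grand mean.
    row_eff = {r: mean(scores[(r, c)] for c in cols) - grand for r in rows}
    col_eff = {c: mean(scores[(r, c)] for r in rows) - grand for c in cols}
    # Interaction = what remains after removing the grand mean and both main effects.
    inter = {
        (r, c): scores[(r, c)] - grand - row_eff[r] - col_eff[c]
        for r in rows
        for c in cols
    }
    return grand, row_eff, col_eff, inter


grand, model_eff, prompt_eff, interaction = decompose(scores)

# Sums of squares give a rough sense of which source of variation dominates.
ss_main = sum(v ** 2 for v in model_eff.values()) * len(prompt_eff) \
    + sum(v ** 2 for v in prompt_eff.values()) * len(model_eff)
ss_inter = sum(v ** 2 for v in interaction.values())
print(f"main-effect SS: {ss_main:.4f}, interaction SS: {ss_inter:.4f}")
```

In this toy grid neither model and neither prompt style is better on average, yet each model's best prompt style differs. All of the variation sits in the interaction term, which is the pattern the finding describes: picking a "best" model or prompt in isolation says little about the combination.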
Abstract
Political scientists are rapidly adopting large language models (LLMs) for text annotation, yet the sensitivity of annotation results to implementation choices remains poorly understood. Most evaluations test a single model or configuration; how model choice, model size, learning approach, and prompt style interact, and whether popular "best practices" survive controlled comparison, are largely unexplored. We present a controlled evaluation of these pipeline choices, testing six open-weight models across four political science annotation tasks under identical quantisation, hardware, and prompt-template conditions. Our central finding is methodological: interaction effects dominate main effects, so seemingly reasonable pipeline choices can become consequential researcher degrees of freedom. No single model, prompt style, or learning approach is uniformly superior, and the best-performing…
