A Methodological Guide on Using Large Language Models for Text Annotation in the Social Sciences and Humanities with Python and R

Qixiang Fang; Javier Garcia Bernardo; Erik-Jan van Kesteren

arXiv:2604.09638·cs.CY·April 14, 2026


TL;DR

This paper offers a practical, step-by-step guide with Python and R code examples to help social science and humanities researchers use large language models for text annotation, and explains how annotation errors can bias downstream statistical analyses even when accuracy appears high.

Contribution

It provides a comprehensive, step-by-step methodology with Python and R code for using LLMs in text annotation, including quality evaluation and integration into statistical analysis.

Findings

01. Guidance on designing prompts and evaluating annotation quality

02. Strategies for integrating LLM annotations into statistical models

03. Best practices for scaling and managing costs in LLM-based annotation

Abstract

Large language models (LLMs) have become an essential tool for social science and humanities (SSH) researchers who work with text. One particularly valuable application is automating text annotation, a traditionally time-consuming step in preparing data for empirical analysis. Yet many SSH researchers face two challenges: getting started with LLMs and understanding how to address their limitations. Practically, the rapid pace of model development can make LLMs seem inaccessible or intimidating, while even experienced users may overlook how annotation errors can bias downstream statistical analyses (e.g., regression estimates and $p$-values), even when annotation accuracy appears high. This paper provides a comprehensive, step-by-step methodological guide for using LLMs for text annotation in SSH research, with clear Python and R code snippets. We cover (1) how LLMs work and what they…
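The abstract's point that annotation errors can bias regression estimates even when accuracy looks high can be illustrated with a small simulation. The sketch below is not taken from the paper. It assumes a binary label with roughly 50% prevalence, a symmetric 10% annotation error rate that is independent of the outcome, and a true slope of 1.0. The function names (`ols_slope`, `simulate_attenuation`) and all parameter values are hypothetical choices made for illustration.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def simulate_attenuation(n=50_000, beta=1.0, error_rate=0.10, seed=0):
    """Compare slopes estimated from true labels and from noisy labels.

    Assumptions (illustrative only): binary label, symmetric label flips
    that do not depend on the outcome, Gaussian noise in the outcome.
    """
    rng = random.Random(seed)
    # True binary annotation for each document.
    x_true = [rng.randint(0, 1) for _ in range(n)]
    # Simulated "LLM" annotation: flip each true label with prob. error_rate.
    x_llm = [1 - x if rng.random() < error_rate else x for x in x_true]
    # The outcome depends on the TRUE label, not on the annotated one.
    y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]

    accuracy = sum(a == b for a, b in zip(x_true, x_llm)) / n
    return accuracy, ols_slope(x_true, y), ols_slope(x_llm, y)


if __name__ == "__main__":
    accuracy, slope_true, slope_llm = simulate_attenuation()
    print(f"annotation accuracy:       {accuracy:.3f}")
    print(f"slope using true labels:   {slope_true:.3f}")
    print(f"slope using noisy labels:  {slope_llm:.3f}")
```

Under these assumptions, the slope estimated from the error-prone labels shrinks toward zero (to roughly 0.8 of the true value in expectation) even though about 90% of the labels are correct. Other error structures, such as errors correlated with the outcome, can bias estimates in either direction.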
