Researchers waste 80% of LLM annotation costs by classifying one text at a time
Christian Pipal, Eva-Maria Vogel, Morgan Wack, Frank Esser

TL;DR
Batching multiple texts and stacking multiple variables into a single LLM prompt cuts annotation costs substantially while keeping coding accuracy close to the single-item baseline, at least for batch sizes up to 100 and up to 10 stacked variables.
Contribution
This study shows that large language models can classify many texts on many variables in a single prompt with little loss of accuracy, which lowers annotation costs.
Findings
Batching and stacking cut token costs by over 80% (for 100,000 texts on four variables, 400,000 API calls fall to 4,000) without significant accuracy loss.
Six of eight models stayed within 2 percentage points of single-item coding accuracy through batch sizes of 100.
Stacking up to 10 variables per prompt yields comparable results to single-variable coding.
Abstract
Large language models (LLMs) are increasingly being used for text classification across the social sciences, yet researchers overwhelmingly classify one text per variable per prompt. Coding 100,000 texts on four variables requires 400,000 API calls. Batching 25 items and stacking all variables into a single prompt reduces this to 4,000 calls, cutting token costs by over 80%. Whether this degrades coding quality is unknown. We tested eight production LLMs from four providers on 3,962 expert-coded tweets across four tasks, varying batch size from 1 to 1,000 items and stacking up to 25 coding dimensions per prompt. Six of eight models maintained accuracy within 2 pp of the single-item baseline through batch sizes of 100. Variable stacking with up to 10 dimensions produced results comparable to single-variable coding, with degradation driven by task complexity rather than prompt length.…
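The call-count arithmetic in the abstract, and the general shape of a batched, stacked prompt, can be sketched as follows. This is a minimal illustration only, not the authors' code: the function names `api_calls` and `build_prompt`, the prompt wording, and the JSON output format are all assumptions made for this example.

```python
import json
from math import ceil


def api_calls(n_texts: int, n_variables: int, batch_size: int = 1, stacked: bool = False) -> int:
    """Number of prompts needed to code every text on every variable."""
    batches = ceil(n_texts / batch_size)
    # Without stacking, each batch must be sent once per variable.
    return batches if stacked else batches * n_variables


def build_prompt(texts: list[str], variables: dict[str, str]) -> str:
    """Hypothetical prompt asking for all variables on all texts in one batch."""
    codebook = "\n".join(f"- {name}: {definition}" for name, definition in variables.items())
    items = "\n".join(f"[{i}] {text}" for i, text in enumerate(texts, start=1))
    keys = json.dumps(["id", *variables])
    return (
        "Code each numbered text on every variable below.\n"
        f"Variables:\n{codebook}\n\n"
        f"Texts:\n{items}\n\n"
        f"Return a JSON list with one object per text, with keys {keys}."
    )


baseline = api_calls(100_000, 4)                              # one text, one variable per call
batched = api_calls(100_000, 4, batch_size=25, stacked=True)  # 25 texts, all variables per call
print(baseline, batched, f"{1 - batched / baseline:.0%} fewer calls")
# prints: 400000 4000 99% fewer calls
```

Note that the number of calls falls by 99% in this configuration, while the abstract reports token savings of over 80%. The two figures measure different things: every text still has to be sent once, so the token savings plausibly come mainly from not repeating the instructions with each text.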
