Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman

TL;DR
ContentBench is a new benchmark suite that measures how accurately, and at what cost, low-cost large language models perform interpretive coding tasks. Initial results show that many models reach high agreement with the reference labels, which would make large-scale content analysis far cheaper.
Contribution
Introduces ContentBench, a public benchmark suite for assessing LLMs' interpretive coding accuracy and cost, with initial results demonstrating high agreement levels for several models.
Findings
Top models reach 97-99% agreement with the reference labels
Low-cost LLMs can code 50,000 posts for a few dollars
Small open-weight models struggle with sarcasm-heavy content
Abstract
Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks. The suite uses versioned tracks that invite researchers to contribute new benchmark datasets. I report results from the first track, ContentBench-ResearchTalk v1.0: 1,000 synthetic, social-media-style posts about academic research labeled into five categories spanning praise, critique, sarcasm, questions, and procedural remarks. Reference labels are assigned only when three state-of-the-art reasoning models (GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) agree unanimously, and all final labels are checked by the author as a quality-control…
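The labeling rule described in the abstract (keep a reference label only when all three jury models agree, then score a candidate model by its agreement with those labels) can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the function names, the category strings, and the toy votes are all assumptions made for the example.

```python
from typing import Optional, Sequence


def unanimous_label(votes: Sequence[str]) -> Optional[str]:
    """Return the shared label if every vote is identical, else None."""
    # A set with exactly one element means all jury models agreed.
    return votes[0] if len(set(votes)) == 1 else None


def agreement_rate(predicted: Sequence[str],
                   reference: Sequence[Optional[str]]) -> float:
    """Share of items where the prediction matches the reference label.

    Items without a reference label (no unanimous jury) are skipped.
    """
    pairs = [(p, r) for p, r in zip(predicted, reference) if r is not None]
    if not pairs:
        raise ValueError("no items with a reference label")
    return sum(p == r for p, r in pairs) / len(pairs)


# Hypothetical jury votes for four posts (three jury models each).
jury_votes = [
    ("praise", "praise", "praise"),
    ("sarcasm", "praise", "sarcasm"),      # not unanimous, so dropped
    ("question", "question", "question"),
    ("critique", "critique", "critique"),
]
reference = [unanimous_label(v) for v in jury_votes]

# Hypothetical predictions from a low-cost candidate model.
candidate = ["praise", "sarcasm", "question", "praise"]
rate = agreement_rate(candidate, reference)
print(f"agreement on labeled items: {rate:.2%}")
```

Dropping non-unanimous items keeps the reference set conservative, at the price of excluding exactly the ambiguous posts on which models are most likely to disagree.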
Taxonomy
Topics: Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education · Topic Modeling
