Can Large Language Models Replace Human Coders? Introducing ContentBench

Michael Haman

arXiv:2602.19467·cs.CY·February 24, 2026

Open Access

TL;DR

ContentBench is a new benchmark suite that evaluates how well low-cost large language models can perform interpretive coding tasks, showing many models achieve high agreement with expert labels and enabling scalable content analysis.

Contribution

Introduces ContentBench, a public benchmark suite for assessing LLMs' interpretive coding accuracy and cost, with initial results demonstrating high agreement levels for several models.

Findings

01. Top models reach 97-99% agreement with expert labels

02. Low-cost LLMs can code 50,000 posts for a few dollars

03. Small open-weight models struggle with sarcasm-heavy content

Abstract

Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks. The suite uses versioned tracks that invite researchers to contribute new benchmark datasets. I report results from the first track, ContentBench-ResearchTalk v1.0: 1,000 synthetic, social-media-style posts about academic research labeled into five categories spanning praise, critique, sarcasm, questions, and procedural remarks. Reference labels are assigned only when three state-of-the-art reasoning models (GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1) agree unanimously, and all final labels are checked by the author as a quality-control…
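The labeling rule described in the abstract (keep a post only when all three reference models give the same label, then score a candidate model by its agreement with those labels) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the label strings, and the toy votes are hypothetical and are not taken from the ContentBench code or data. The author's manual quality-control check is not modeled.

```python
from typing import Optional, Sequence


def unanimous_label(votes: Sequence[str]) -> Optional[str]:
    """Return the shared label if every vote is identical, otherwise None."""
    if not votes:
        return None
    first = votes[0]
    return first if all(v == first for v in votes) else None


def percent_agreement(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Percentage of items where the predicted label equals the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute agreement on an empty set")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


# Hypothetical toy data: one tuple of three panel votes per post.
panel_votes = [
    ("praise", "praise", "praise"),
    ("sarcasm", "critique", "sarcasm"),   # not unanimous, so the post is dropped
    ("question", "question", "question"),
    ("critique", "critique", "critique"),
]

# Keep only the posts on which the panel agreed unanimously.
consensus = [unanimous_label(v) for v in panel_votes]
kept = [i for i, label in enumerate(consensus) if label is not None]
reference_labels = [consensus[i] for i in kept]

# Hypothetical labels from a low-cost candidate model, one per original post.
candidate_all = ["praise", "sarcasm", "question", "praise"]
candidate = [candidate_all[i] for i in kept]

agreement = percent_agreement(candidate, reference_labels)
print(f"kept {len(kept)} of {len(panel_votes)} posts, agreement = {agreement:.1f}%")
```

Raw percent agreement is used here because the findings are stated in those terms; a chance-corrected statistic could be substituted without changing the filtering step.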

Taxonomy

Topics: Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education · Topic Modeling