Open-Weight LLMs Are Often Competitive with Commercial APIs for Political Science Text Classification
Hanno Hilbig

TL;DR
Local open-weight LLMs often perform comparably to commercial APIs in political science text classification, especially on simpler tasks, offering a cost-effective and data-secure alternative.
Contribution
This study benchmarks five local models against four commercial APIs across 34 tasks, highlighting the competitive performance of open-weight models and practical considerations for their use.
Findings
In a task-specific oracle comparison, local models match or exceed API performance on 9 of the 34 tasks.
On average, the best API model exceeds the best local model by 0.015 F1.
API models have their clearest edge on complex tasks with many labels or multiple outputs per item.
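One way to read the oracle comparison in these findings: for each task, take the best-scoring local model and the best-scoring API model, then compare the two. The sketch below illustrates that calculation only. The function name `oracle_gaps`, the tuple layout, and all scores are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def oracle_gaps(scores):
    """Per-task gap between the best API model and the best local model.

    `scores` is an iterable of (task, model, group, f1) tuples, where group
    is "local" or "api". This layout is an assumption for illustration.
    """
    best = defaultdict(dict)
    for task, _model, group, f1 in scores:
        # Keep only the highest F1 seen so far for this task and group.
        if f1 > best[task].get(group, float("-inf")):
            best[task][group] = f1
    # Positive gap: best API model ahead. Zero or negative: local matches or exceeds.
    return {task: b["api"] - b["local"] for task, b in best.items()}


# Made-up scores, not results from the paper.
toy_scores = [
    ("task_a", "local_1", "local", 0.80),
    ("task_a", "local_2", "local", 0.84),
    ("task_a", "api_1", "api", 0.82),
    ("task_b", "local_1", "local", 0.60),
    ("task_b", "api_1", "api", 0.70),
    ("task_b", "api_2", "api", 0.66),
]

gaps = oracle_gaps(toy_scores)
n_local_match_or_exceed = sum(1 for g in gaps.values() if g <= 0)
mean_gap = sum(gaps.values()) / len(gaps)
```

Because the best model is chosen separately for each task, this is an upper bound on what a researcher would get from committing to a single model in advance.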
Abstract
Can researchers use local open-weight models instead of commercial APIs for LLM text classification? Local models avoid marginal API charges, keep data on the researcher's machine, and make exact model versions easier to preserve. I benchmark five local models against four commercial API models on 34 political science classification tasks. Local models are often competitive, especially on simpler tasks. In a task-specific oracle comparison, local models match or exceed API performance on 9 tasks; on average, the best API model exceeds the best local model by 0.015 F1. The four strongest observed model means fall within 0.021 F1. API models have their clearest edge on complex tasks with many labels or multiple outputs per item. Batching several items in one prompt usually reduces local runtime per item, but specific model-task pairs can return invalid response formats or labels. Taken…
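The batching caveat in the abstract suggests validating every batched response before using its labels. The following is a minimal sketch under stated assumptions: the model is asked to return a JSON array with one label per item, and `parse_batch_response` is a hypothetical helper, not the paper's pipeline.

```python
import json


def parse_batch_response(raw, n_items, allowed_labels):
    """Validate one batched response.

    Returns a list with one entry per item (None where the label is not in
    `allowed_labels`), or None when the whole response is unusable.
    Assumes the prompt requested a JSON array of label strings.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None  # invalid response format
    if not isinstance(parsed, list) or len(parsed) != n_items:
        return None  # wrong shape or wrong number of outputs
    return [
        label if isinstance(label, str) and label in allowed_labels else None
        for label in parsed
    ]


# Hypothetical label set and responses for illustration.
allowed = {"positive", "negative"}
ok = parse_batch_response('["positive", "negative"]', 2, allowed)
bad_label = parse_batch_response('["positive", "neutral"]', 2, allowed)
wrong_count = parse_batch_response('["positive"]', 2, allowed)
not_json = parse_batch_response("not json", 2, allowed)
```

Items that come back as None, or whole batches that fail, could then be retried one item per prompt, trading some of the runtime savings for valid output.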
