Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Study of ChatGPT Interventions
Angjelin Hila, Elliott Hauser

TL;DR
This study evaluates ChatGPT's ability to perform deductive qualitative coding using various intervention strategies, finding that tailored approaches can achieve reliability suitable for rigorous qualitative analysis workflows.
Contribution
It introduces a novel Step-by-Step Task Decomposition method and demonstrates its effectiveness in improving LLM reliability for deductive coding tasks.
Findings
The Step-by-Step strategy achieved the highest reliability metrics.
Intervention strategies significantly affected classification patterns.
ChatGPT showed stable agreement despite semantic ambiguity.
Abstract
In this study, we investigate the use of large language models (LLMs), specifically ChatGPT, for structured deductive qualitative coding. While most current research emphasizes inductive coding applications, we address the underexplored potential of LLMs to perform deductive classification tasks aligned with established human-coded schemes. Using the Comparative Agendas Project (CAP) Master Codebook, we classified U.S. Supreme Court case summaries into 21 major policy domains. We tested four intervention methods across repeated samples: zero-shot, few-shot, definition-based, and a novel Step-by-Step Task Decomposition strategy. Performance was evaluated using standard classification metrics (accuracy, F1-score, Cohen's kappa, Krippendorff's alpha), and construct validity was assessed using chi-squared tests and Cramér's V. Chi-squared and effect size analyses confirmed that…
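To make two of the statistics named in the abstract concrete, here is a minimal Python sketch of Cohen's kappa (chance-corrected agreement between two coders) and Cramér's V (effect size derived from a chi-squared statistic). It is an illustration under stated assumptions, not the authors' evaluation code: the function names `cohens_kappa` and `cramers_v` and the toy labels are hypothetical, and Krippendorff's alpha is omitted for brevity.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items both coders labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Pearson chi-squared statistic over all cells with non-zero expectation.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    # Normalize by sample size and the smaller table dimension minus one.
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


# Toy data (hypothetical, not from the study): human codes vs. model codes.
human = ["Health", "Health", "Defense", "Defense"]
model = ["Health", "Defense", "Defense", "Defense"]
kappa = cohens_kappa(human, model)

# Toy 2x2 table: rows are intervention strategies, columns are assigned codes.
v_perfect = cramers_v([[10, 0], [0, 10]])
v_none = cramers_v([[5, 5], [5, 5]])
```

In a setting like the one described, the label sequences would hold one of the 21 CAP major policy domains per case summary, and the contingency table would cross intervention strategy with assigned code. In practice, an established statistics library would usually be preferable to hand-rolled functions.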
