Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

Neha Sharma; Navneet Agarwal; Kairit Sirts

arXiv:2511.01482 · cs.CL · May 21, 2026

TL;DR

This paper investigates using Large Language Models, especially GPT-4, as consistent annotators for cognitive distortion detection, proposing a dataset-agnostic evaluation framework to improve reliability and comparability across datasets.

Contribution

The paper introduces an LLM-based annotation approach for cognitive distortion detection and a dataset-agnostic evaluation method for subjective NLP tasks that uses Cohen's kappa as an effect size measure.
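
As a rough illustration of why a chance-corrected statistic helps with cross-dataset comparison, the sketch below computes the standard Cohen's kappa between a model's predictions and reference labels. The function name `cohen_kappa` and the toy labels are assumptions made for this example; the summary does not detail how the paper applies the statistic, so this is only the textbook definition.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Textbook Cohen's kappa between two equal-length label sequences.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both sequences agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both sequences use one identical label; kappa is undefined here.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Toy data (made up): an imbalanced test set and a majority-class predictor.
gold = ["distorted", "distorted", "distorted", "none"]
majority = ["distorted"] * 4
kappa_majority = cohen_kappa(gold, majority)

# Toy data (made up): a balanced test set with one error.
gold_balanced = ["distorted", "distorted", "none", "none"]
pred_balanced = ["distorted", "none", "none", "none"]
kappa_balanced = cohen_kappa(gold_balanced, pred_balanced)
```

In the first toy case the predictor is 75% accurate yet scores a kappa of 0, because that agreement is exactly what the label distribution yields by chance. This is the property that makes kappa less sensitive to class balance than accuracy or F1.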

Findings

1. GPT-4 produces consistent annotations across independent runs (Fleiss' kappa = 0.78).

2. Models trained on LLM-annotated data outperform those trained on human labels.

3. The proposed framework enables fair cross-dataset comparison of NLP models.

Abstract

Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on…
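
Agreement across several independent annotation runs is usually summarized with Fleiss' kappa. The sketch below is a minimal version of the standard formula, treating each LLM run as one rater; the function name `fleiss_kappa`, the input layout, and the toy labels are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Standard Fleiss' kappa.

    `ratings` is a list of items; each item is the list of labels assigned
    by the raters (here: independent LLM runs). Every item must have the
    same number of labels. Hypothetical helper, not the authors' code.
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of labels")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Per-item agreement: share of rater pairs that chose the same label.
        same_pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += same_pairs / (n_raters * (n_raters - 1))

    mean_agreement = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the overall label proportions.
    chance = sum((c / total) ** 2 for c in category_totals.values())
    if chance == 1.0:
        # Only one label was ever used; kappa is undefined here.
        return float("nan")
    return (mean_agreement - chance) / (1.0 - chance)


# Toy data (made up): four texts, each labeled by three runs.
runs = [
    ["a", "a", "a"],
    ["b", "b", "b"],
    ["a", "a", "b"],
    ["b", "b", "a"],
]
kappa_runs = fleiss_kappa(runs)
```

Values near 1 indicate that the runs almost always assign the same label; the 0.78 reported in the abstract falls in the range commonly described as substantial agreement.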

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Mobile Crowdsensing and Crowdsourcing