Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks
Yiming Zhu, Peixian Zhang, Ehsan-Ul Haq, Pan Hui, Gareth Tyson

TL;DR
This study evaluates ChatGPT's ability to replicate human-generated labels across various social computing tasks, revealing moderate success and highlighting challenges in using LLMs for data annotation.
Contribution
It systematically assesses ChatGPT's performance on multiple social computing datasets, demonstrating its potential and limitations in automating annotation tasks.
Findings
ChatGPT achieves an average accuracy of 60.9% across tasks.
Performance varies significantly across different labels.
The sentiment analysis dataset shows the highest accuracy, at 64.9%.
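As a rough illustration of how figures like these are computed, the sketch below compares a list of human labels against a list of model-produced labels and reports both overall accuracy and accuracy broken down by human label. The function names and the toy data are hypothetical and are not taken from the paper; the authors' actual evaluation pipeline may differ.

```python
from collections import defaultdict


def accuracy(human, model):
    """Fraction of items where the model label equals the human label."""
    if len(human) != len(model):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("label lists must not be empty")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def per_label_accuracy(human, model):
    """Accuracy computed separately for each human-assigned label."""
    if len(human) != len(model):
        raise ValueError("label lists must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for h, m in zip(human, model):
        total[h] += 1
        correct[h] += int(h == m)
    return {label: correct[label] / total[label] for label in total}


# Toy data invented for illustration only (not from the paper's datasets).
human_labels = ["positive", "negative", "neutral", "positive", "negative", "positive"]
model_labels = ["positive", "negative", "positive", "positive", "neutral", "negative"]

overall = accuracy(human_labels, model_labels)
by_label = per_label_accuracy(human_labels, model_labels)
print(f"overall accuracy: {overall:.3f}")
for label, score in sorted(by_label.items()):
    print(f"{label}: {score:.3f}")
```

A single overall number can hide large gaps between labels, which is why the per-label breakdown matters when judging whether model annotations can stand in for human ones.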
Abstract
The release of ChatGPT has uncovered a range of possibilities whereby large language models (LLMs) can substitute human intelligence. In this paper, we seek to understand whether ChatGPT has the potential to reproduce human-generated label annotations in social computing tasks. Such an achievement could significantly reduce the cost and complexity of social computing research. As such, we use ChatGPT to relabel five seminal datasets covering stance detection (2x), sentiment analysis, hate speech, and bot detection. Our results highlight that ChatGPT does have the potential to handle these data annotation tasks, although a number of challenges remain. ChatGPT obtains an average accuracy of 0.609. Performance is highest for the sentiment analysis dataset, with ChatGPT correctly annotating 64.9% of tweets. Yet, we show that performance varies substantially across individual labels. We believe…
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts
