Are LLMs Ready to Replace Bangla Annotators?

Md. Najib Hasan; Touseef Hasan; Souvika Sarkar

arXiv:2602.16241 · cs.CL · March 3, 2026

TL;DR

This paper evaluates the reliability of large language models as zero-shot annotators for Bangla hate speech detection, revealing biases and instability that challenge their use in sensitive, low-resource language tasks.

Contribution

The paper systematically benchmarks 17 LLMs as annotators of Bangla hate speech, uncovers annotator bias and instability in their judgments, and challenges the assumption that larger models produce better annotations.

Findings

01. Larger models do not always produce better annotations.

02. Smaller, task-aligned models can be more consistent.

03. LLMs exhibit bias and instability in sensitive tasks.

Abstract

Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality--smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in…
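The abstract refers to annotator instability and consistency without, in this excerpt, naming the metrics used. As a minimal sketch only, and not the authors' evaluation framework, the Python below shows two common ways such properties could be quantified: Cohen's kappa between a model's labels and a human reference, and the share of items whose label changes across repeated runs of the same model. The function names `cohen_kappa` and `flip_rate`, the label strings, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def flip_rate(runs):
    """Fraction of items whose label differs across repeated runs."""
    if not runs or not runs[0]:
        raise ValueError("need at least one non-empty run")
    n_items = len(runs[0])
    if any(len(run) != n_items for run in runs):
        raise ValueError("all runs must label the same number of items")
    unstable = sum(1 for i in range(n_items) if len({run[i] for run in runs}) > 1)
    return unstable / n_items


# Hypothetical toy data: one human reference and two runs of the same model.
human = ["hate", "not", "hate", "not"]
run_1 = ["hate", "not", "not", "not"]
run_2 = ["hate", "hate", "not", "not"]

kappa_run_1 = cohen_kappa(human, run_1)
instability = flip_rate([run_1, run_2])
```

Kappa corrects raw agreement for chance, which matters when one class dominates, as is often the case in hate speech data. The flip rate needs no human labels, so it isolates a model's self-consistency from its accuracy.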

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Topic Modeling · Hate Speech and Cyberbullying Detection