MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media
Souvik Pramanik, S.M. Riaz Rahman Antu, Shak Mohammad Abyad, Md. Ibrahim Khalil, Md. Shahriar Hussain

TL;DR
This paper introduces MultiSoc-4D, a benchmark dataset of Bengali social media comments, and uses it to show that LLM annotators systematically favor fallback labels (Other, Neutral, No), which suppresses detection of minority categories.
Contribution
The paper presents a new benchmark dataset and systematically diagnoses instruction-induced label collapse in LLM annotations for Bengali social media content.
Findings
LLMs show high agreement with one another but under-detect minority categories such as hate speech and sarcasm.
Instruction-induced label collapse creates an illusion of agreement: raw agreement rates are high while Fleiss' kappa stays near zero.
Benchmarking across 40+ LLMs reveals widespread bias propagation in annotation pipelines.
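The gap between high raw agreement and near-zero Fleiss' kappa can be illustrated with a small Python sketch. The data below is synthetic and purely illustrative, not taken from MultiSoc-4D: four hypothetical raters label 100 items for hate speech, each rater says "Yes" on a different handful of items, and everyone falls back to "No" otherwise. The function name `fleiss_kappa` and the toy setup are assumptions made for this example.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Return (mean observed agreement, Fleiss' kappa).

    ratings: list of items; each item is a list of labels, one per rater.
    Every item must have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing / (n_raters * (n_raters - 1))
    p_bar /= n_items
    # Agreement expected by chance from the overall label proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in label_totals.values())
    kappa = (p_bar - p_e) / (1 - p_e)
    return p_bar, kappa

# Synthetic data (illustrative only): rater r says "Yes" on items
# 5r .. 5r+4 and falls back to "No" everywhere else.
ratings = [
    ["Yes" if r * 5 <= i < r * 5 + 5 else "No" for r in range(4)]
    for i in range(100)
]

p_bar, kappa = fleiss_kappa(ratings)
print(f"observed agreement = {p_bar:.3f}, Fleiss' kappa = {kappa:.3f}")
# prints: observed agreement = 0.900, Fleiss' kappa = -0.053
```

In this toy case the raters agree on 90% of rater pairs, yet kappa is slightly below zero, because nearly all of that agreement is already expected by chance once "No" dominates the label distribution.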
Abstract
Annotation automation via Large Language Models (LLMs) is a core approach for scaling NLP datasets; however, LLM behavior under closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media benchmark dataset containing 58K+ comments from six sources, annotated along four dimensions: category, sentiment, hate speech, and sarcasm. Using a structured pipeline in which ChatGPT, Gemini, Claude, and Grok each annotate separate partitions while sharing a common 20% validation set, we systematically diagnose LLM annotation behavior. We identify a prevalent phenomenon, "instruction-induced label collapse", in which LLMs show a systematic preference for fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that…
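One way to picture the partitioning described in the abstract is the Python sketch below. It rests on assumptions the abstract does not confirm: that the shared validation set is 20% of the whole corpus, that the remaining comments are split evenly and at random, and that every annotator labels the shared set in addition to its own partition. The helper `split_for_annotators` and its parameters are hypothetical.

```python
import random

def split_for_annotators(comment_ids, annotators, shared_frac=0.2, seed=0):
    """Split IDs into one shared set plus one disjoint partition per annotator.

    Assumption: shared_frac is a fraction of the full corpus, and the
    remainder is dealt out round-robin after a seeded shuffle.
    """
    ids = list(comment_ids)
    random.Random(seed).shuffle(ids)
    n_shared = int(len(ids) * shared_frac)
    shared = ids[:n_shared]
    rest = ids[n_shared:]
    partitions = {
        name: rest[i::len(annotators)] for i, name in enumerate(annotators)
    }
    return shared, partitions

annotators = ["ChatGPT", "Gemini", "Claude", "Grok"]
shared, partitions = split_for_annotators(range(1000), annotators)

# Each annotator labels its own partition plus the shared validation set.
workload = {name: len(shared) + len(part) for name, part in partitions.items()}
print(len(shared), workload)
```

Because every annotator labels the shared set, inter-annotator statistics such as Fleiss' kappa can be computed on it, while the disjoint partitions keep the total annotation cost down.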
