MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

Souvik Pramanik; S.M. Riaz Rahman Antu; Shak Mohammad Abyad; Md. Ibrahim Khalil; Md. Shahriar Hussain

arXiv:2605.06940·cs.CL·May 13, 2026

TL;DR

This paper introduces MultiSoc-4D, a benchmark dataset of Bengali social media comments, and uses it to reveal systematic biases in LLM annotation, especially a tendency to favor fallback labels that hampers detection of minority categories.

Contribution

The paper presents a new benchmark dataset and systematically diagnoses instruction-induced label collapse in LLM annotations for Bengali social media content.

Findings

01

LLMs show high agreement with one another but under-detect minority categories such as hate speech and sarcasm.

02

Instruction-induced label collapse creates an illusion of agreement: raw label agreement is high, yet Fleiss' kappa is near zero.

03

Benchmarking across 40+ LLMs reveals widespread bias propagation in annotation pipelines.
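
The agreement illusion in the second finding can be reproduced on toy data. The sketch below is illustrative only and is not the authors' code: the function name `fleiss_kappa`, the four-rater setup, and the label counts are assumptions chosen for the example, not figures from the paper. It computes Fleiss' kappa from scratch for four hypothetical annotators who almost always pick the fallback label "No" and who each flag a different pair of items as "Yes".

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Return (kappa, mean observed agreement) for a list of items,
    where each item is a list holding one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals = Counter()
    observed_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))
    p_observed = observed_sum / n_items
    # Agreement expected by chance from the overall label proportions.
    total_ratings = n_items * n_raters
    p_chance = sum((t / total_ratings) ** 2 for t in label_totals.values())
    if p_chance == 1.0:
        return float("nan"), p_observed  # kappa is undefined if only one label is used
    return (p_observed - p_chance) / (1.0 - p_chance), p_observed

# Toy data (hypothetical): 100 items, 4 raters, fallback label "No".
# Rater r says "Yes" only on items 2r and 2r+1, so no two raters share a "Yes".
ratings = [["No"] * 4 for _ in range(100)]
for rater in range(4):
    for item_index in (2 * rater, 2 * rater + 1):
        ratings[item_index][rater] = "Yes"

kappa, p_observed = fleiss_kappa(ratings)
print(f"observed agreement = {p_observed:.3f}, Fleiss' kappa = {kappa:.3f}")
```

In this toy setting the raters agree on about 96% of rater pairs, but nearly all of that agreement is what chance alone would produce when almost every label is "No", so kappa lands slightly below zero. The raters never agree on a single positive case.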

Abstract

Annotation automation via Large Language Models (LLMs) is the core approach for scaling NLP datasets; however, LLM behavior with respect to closed-set instructions in low-resource languages has not been well studied. We present MultiSoc-4D, a Bengali social media dataset benchmark, which contains 58K+ social media comments from six sources annotated along four dimensions: category, sentiment, hate speech, and sarcasm. By employing a structured pipeline where ChatGPT, Gemini, Claude, and Grok individually annotate separate partitions, while sharing a common validation set of 20%, we diagnose LLM behavior systematically. We discover a prevalent phenomenon called "instruction-induced label collapse", wherein LLMs show a systematic preference towards fallback labels (Other, Neutral, No), leading to high agreement rates but under-detection of minority categories. For example, we find that…
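
The partitioning scheme described in the abstract might look roughly like the following. This is a minimal sketch under stated assumptions: the abstract does not say how the shared 20% is drawn or how the remaining comments are divided, so the random shuffle, the even round-robin split, and the names `split_for_annotation`, `shared_fraction`, and `seed` are all hypothetical.

```python
import random

# The four annotators named in the abstract.
ANNOTATORS = ["ChatGPT", "Gemini", "Claude", "Grok"]

def split_for_annotation(comment_ids, shared_fraction=0.2, seed=0):
    """Return (shared, partitions): a validation set that every annotator
    labels, plus one disjoint partition per annotator.
    Assumption: random sampling and an even split of the remainder."""
    ids = list(comment_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_shared = round(len(ids) * shared_fraction)
    shared = ids[:n_shared]
    remainder = ids[n_shared:]
    # Round-robin assignment keeps partition sizes within one item of each other.
    partitions = {
        name: remainder[i::len(ANNOTATORS)]
        for i, name in enumerate(ANNOTATORS)
    }
    return shared, partitions

shared, partitions = split_for_annotation(range(1000))
print(len(shared), {name: len(part) for name, part in partitions.items()})
```

Only the shared set yields labels from all four models for the same comment, so inter-annotator statistics such as Fleiss' kappa would be computed there, while the separate partitions extend coverage of the corpus.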
