SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detection

Arefeh Kazemi; Hamza Qadeer; Joachim Wagner; Hossein Hosseini; Sri Balaaji Natarajan Kalaivendan; Brian Davis

arXiv:2511.11599·cs.AI·March 20, 2026

TL;DR

SynBullying is a synthetic multi-turn conversational dataset, generated with multiple large language models, for cyberbullying detection. It offers realistic conversations with context-aware, fine-grained annotations for analysis and model training.

Contribution

The paper introduces SynBullying, a synthetic dataset in which LLMs simulate realistic multi-turn cyberbullying interactions, annotated in detail to support detection methods.

Findings

1. Effective in training cyberbullying detection models
2. Improves model performance when used as augmentation data
3. Captures diverse cyberbullying behaviors and linguistic patterns

Abstract

We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance…
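To make the three properties in the abstract concrete (conversational structure, context-aware annotation, fine-grained labeling), here is a minimal Python sketch of how one such record might be represented and turned into context-aware training examples. The field names (`speaker`, `harmful`, `cb_type`), the role and category values, and the helper `with_context` are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Turn:
    speaker: str                    # hypothetical role label, e.g. "bully" or "victim"
    text: str
    harmful: bool                   # label assigned with the conversation in view
    cb_type: Optional[str] = None   # hypothetical fine-grained CB category


@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)


def with_context(conv: Conversation, window: int = 2) -> List[Tuple[List[str], str, bool]]:
    """Pair each turn with up to `window` preceding turns as context."""
    examples = []
    for i, turn in enumerate(conv.turns):
        context = [t.text for t in conv.turns[max(0, i - window):i]]
        examples.append((context, turn.text, turn.harmful))
    return examples


# Illustrative conversation (invented for this sketch, not taken from the dataset).
conv = Conversation(turns=[
    Turn("victim", "I finally finished my presentation.", False),
    Turn("bully", "Nobody wants to hear it, just like last time.", True, "insult"),
    Turn("victim", "Please stop.", False),
])

examples = with_context(conv, window=2)
```

Each example keeps the preceding turns alongside the target message, so a classifier can judge a message within the conversational flow rather than as an isolated post.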

Code & Models

Datasets

arrkaa-NLP/SynBullying
dataset · 20 downloads

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Bullying, Victimization, and Aggression · Mental Health via Writing