PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

Isun Chehreh; Ebrahim Ansari

arXiv:2602.19333·cs.CL·February 24, 2026

TL;DR

This paper introduces PerSoMed, the first large-scale, balanced Persian social media text classification dataset, and benchmarks multiple models, with transformer-based models like TookaBERT-Large achieving top performance.

Contribution

It provides a comprehensive, well-annotated dataset for Persian social media text classification and evaluates state-of-the-art models, establishing a foundation for future Persian NLP research.

Findings

01

Transformer models outperform traditional neural networks.

02

TookaBERT-Large achieves the best performance, with an F1-score of 0.9621.

03

Performance is robust across all nine categories, with slightly weaker results on Social and Political texts.
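To make the per-category finding concrete, the following is a minimal sketch of how per-class F1 could be computed so that weaker categories stand out. It is not the authors' evaluation code: the function names `per_class_f1` and `macro_f1` are hypothetical, the toy labels are invented for illustration, and this summary does not state which F1 averaging the paper reports, so macro averaging here is an assumption.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    return sum(scores.values()) / len(scores)


# Toy example with invented labels, not data from the paper.
y_true = ["Sports", "Sports", "Social", "Political"]
y_pred = ["Sports", "Sports", "Political", "Political"]
scores = per_class_f1(y_true, y_pred)
```

In this toy case the Social post misclassified as Political lowers the score of both classes, which is the kind of confusion a per-class breakdown would reveal.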

Abstract

This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa…
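The balancing step described in the abstract (undersampling combined with redundancy removal) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the function names, the `threshold` value, and the use of token-set Jaccard similarity as a stand-in for semantic similarity are all assumptions made for the example. A real system would more plausibly compare sentence embeddings.

```python
import random
from collections import defaultdict


def jaccard(a, b):
    """Token-set Jaccard similarity (a crude proxy for semantic similarity)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def balance_by_undersampling(posts, per_class, threshold=0.8, seed=0):
    """Keep at most `per_class` mutually dissimilar posts for each label.

    posts: iterable of (text, label) pairs.
    threshold: hypothetical similarity cutoff; posts at or above it
               relative to an already kept post are treated as redundant.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in posts:
        by_label[label].append(text)

    balanced = []
    for label in sorted(by_label):
        texts = by_label[label]
        rng.shuffle(texts)  # random order so undersampling is unbiased
        kept = []
        for text in texts:
            if len(kept) == per_class:
                break
            if all(jaccard(text, other) < threshold for other in kept):
                kept.append(text)
        balanced.extend((text, label) for text in kept)
    return balanced


# Toy data, invented for illustration; the paper's target is 4,000 per class.
posts = [
    ("a b c d", "Sports"), ("a b c d", "Sports"),
    ("e f g h", "Sports"), ("i j k l", "Sports"),
    ("m n o p", "Health"), ("q r s t", "Health"),
]
balanced = balance_by_undersampling(posts, per_class=2)
```

Classes that still fall short of the target after this step would need augmentation, which the abstract describes as lexical replacement plus generative prompting; that part is not sketched here.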

Taxonomy

Topics: Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining · Hate Speech and Cyberbullying Detection