PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
Isun Chehreh, Ebrahim Ansari

TL;DR
This paper introduces PerSoMed, the first large-scale, balanced Persian social media text classification dataset, and benchmarks multiple models, with transformer-based models like TookaBERT-Large achieving top performance.
Contribution
It provides a comprehensive, well-annotated dataset for Persian social media text classification and evaluates state-of-the-art models, establishing a foundation for future Persian NLP research.
Findings
Transformer models outperform traditional neural networks.
TookaBERT-Large achieves the best performance (F1-score: 0.9621).
Performance is robust across all categories, with slight challenges on Social and Political texts.
Abstract
This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa…
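The balanced layout described in the abstract (nine categories with 4,000 samples each, 36,000 posts in total) can be illustrated with a small sketch. This is not the authors' pipeline: it uses plain random undersampling in place of their semantic redundancy removal, and the function name `undersample_balanced`, the toy corpus, and the per-class target of 5 are hypothetical choices made only for illustration.

```python
import random
from collections import Counter, defaultdict

# Category names as listed in the abstract.
CATEGORIES = [
    "Economic", "Artistic", "Sports", "Political", "Social",
    "Health", "Psychological", "Historical", "Science & Technology",
]

def undersample_balanced(samples, per_class, seed=0):
    """Randomly keep exactly `per_class` (text, label) pairs for each label.

    Assumption: plain random undersampling, standing in for the
    redundancy-aware undersampling mentioned in the abstract.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    balanced = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) < per_class:
            # Under-represented classes would need augmentation instead.
            raise ValueError(f"{label!r} has only {len(texts)} samples")
        balanced.extend((text, label) for text in rng.sample(texts, per_class))
    return balanced

# Hypothetical imbalanced toy corpus: category k has 5 + k posts.
toy_corpus = [
    (f"post {i} about {category}", category)
    for k, category in enumerate(CATEGORIES)
    for i in range(5 + k)
]

balanced = undersample_balanced(toy_corpus, per_class=5)
label_counts = Counter(label for _, label in balanced)

# The same arithmetic at the scale reported in the abstract.
full_scale_total = len(CATEGORIES) * 4000
```

In this sketch a class with fewer samples than the target raises an error. The abstract indicates that such gaps were instead filled through data augmentation (lexical replacement and generative prompting), which is outside the scope of this example.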
Taxonomy
Topics: Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining · Hate Speech and Cyberbullying Detection
