# ALERT: A benchmark Bengali dataset for identifying and categorizing religiously aggressive texts

**Authors:** Suhana Binta Rashid, Bibhas Roy Chowdhury Piyas, Sadia Rahman, Bijoy Roy Chowdhury Preenon

PMC · DOI: 10.1016/j.dib.2025.112094 · Data in Brief · 2025-09-19

## TL;DR

This paper introduces ALERT, a Bengali dataset for identifying religiously aggressive texts, aiming to improve detection tools in regional languages.

## Contribution

The paper presents ALERT, a new benchmark Bengali dataset for classifying religiously aggressive content.

## Key findings

- ALERT contains 4027 annotated Bengali instances across four categories: hate speech, vandalism, atrocity, and no aggression.
- The dataset achieved a Cohen’s kappa score of 72%, indicating strong inter-annotator agreement.
- Experiments with machine learning and transformer models showed promising classification results.
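For readers unfamiliar with the agreement metric cited above, the following is a minimal sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the toy labels are illustrative assumptions. The summary does not say how the paper aggregated agreement across its annotator pairs, so this shows only the standard two-rater formula, not the authors' procedure.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels invented for illustration; they are not taken from ALERT.
annotator_1 = ["hate", "hate", "vandalism", "atrocity", "none", "none"]
annotator_2 = ["hate", "vandalism", "vandalism", "atrocity", "none", "hate"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is generally preferred over a plain percentage match when class frequencies differ.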

## Abstract

The widespread proliferation of religiously aggressive content on social media platforms poses significant threats to societal harmony and communal solidarity. It often incites religious animosity, provokes violence, and disseminates life-threatening messages that intensify societal divisions and undermine social harmony. Despite significant advancements in identifying such content in high-resource languages like English, there is a notable scarcity of resources for regional languages like Bengali, which constrains the development of effective detection and prevention tools. To address this gap, we introduce ALERT (Analysis of Linguistic Extremism in Religious Texts), a newly developed Bengali dataset with English translations that includes 4027 annotated instances classified into four categories: hate speech (995), vandalism (909), atrocity (1117), and no aggression (1006). The dataset was sourced from many online platforms, including Facebook, YouTube, online news websites, blogs, and group chats. Each instance was annotated by two of four annotators with diverse religious, ethnic, geographical, and academic backgrounds. Disagreements between annotators were resolved through consultation with a domain expert. The preprocessing stages include the removal of English words, duplicates, and alphanumeric characters to ensure data integrity. The dataset attains a Cohen’s kappa score of 72%, which signifies strong inter-annotator agreement, and a Jaccard similarity score between 16% and 23%, which reflects the degree of overlap between classes. Moreover, experiments with various machine learning, deep learning, and transformer-based models yield promising results. ALERT serves as a benchmark dataset for religiously aggressive text classification that may contribute to the advancement of research in this field. The dataset is publicly accessible for research purposes to promote innovation and collaboration within the Bengali NLP community.
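The preprocessing and class-overlap steps mentioned in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names (`clean_text`, `deduplicate`, `class_vocabulary`, `jaccard`) are hypothetical, "English words" and "alphanumeric characters" are interpreted here as Latin-script letters and ASCII digits, tokens are split on whitespace, and the Jaccard score is assumed to compare per-class vocabularies.

```python
import re

# Assumption: English words and alphanumeric residue are runs of
# Latin letters and ASCII digits.
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9]+")


def clean_text(text):
    """Remove Latin-script words and ASCII digits, then normalise whitespace."""
    without_latin = LATIN_OR_DIGIT.sub(" ", text)
    return " ".join(without_latin.split())


def deduplicate(texts):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(texts))


def class_vocabulary(texts):
    """Set of whitespace-separated tokens across all texts of one class."""
    return {token for text in texts for token in text.split()}


def jaccard(set_a, set_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Neutral toy sentences invented for illustration; not drawn from ALERT.
raw_class_a = ["আমি ভাত খাই hello 123", "আমি ভাত খাই hello 123"]
raw_class_b = ["আমি বই পড়ি"]

class_a = deduplicate([clean_text(t) for t in raw_class_a])
class_b = deduplicate([clean_text(t) for t in raw_class_b])
overlap = jaccard(class_vocabulary(class_a), class_vocabulary(class_b))
```

A low Jaccard score between two classes suggests that their vocabularies are largely distinct, which is one way to read the 16% to 23% range reported above.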

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12516529/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12516529/full.md

## References

9 references — full list in the complete paper: https://tomesphere.com/paper/PMC12516529/full.md

---
Source: https://tomesphere.com/paper/PMC12516529