BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution

Abdullah Muhammad Moosa (1), Nusrat Sultana (1), Mahdi Muhammad Moosa (2), Md. Miraiz Hossain (1)
(1) Department of Mechatronics & Industrial Engineering, Chittagong University of Engineering & Technology, Chittagong 4349, Bangladesh
(2) Department of Mathematics & Natural Sciences, Brac University, Dhaka 1212, Bangladesh

arXiv:2511.08085·cs.CL·November 12, 2025

TL;DR

This paper introduces BARD10, a new Bangla authorship attribution benchmark, and systematically analyzes the impact of stop-word removal on classical and deep learning models, revealing the stylistic importance of stop-words in Bangla texts.

Contribution

It presents BARD10, a balanced Bangla authorship dataset, and provides a comprehensive analysis of stop-word effects across multiple models, highlighting their stylistic significance.

Findings

1. Classical TF-IDF + SVM outperforms deep models on BARD10 and BAAD16.

2. Stop-word removal significantly affects authorship attribution accuracy.

3. Bangla stop-words are key stylistic indicators for authorship attribution.

Abstract

This research presents a comprehensive investigation into Bangla authorship attribution. It introduces a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzes the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. Four representative classifiers are methodically assessed: SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), using uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). In all datasets, the classical TF-IDF + SVM baseline performed best, attaining a macro-F1 score of…
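To make the stop-word ablation described in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It builds TF-IDF features for the same documents with and without stop-word filtering, using only the Python standard library. The stop-word list, the whitespace tokenizer, the smoothed IDF formula, and the two toy sentences are all assumptions chosen for illustration; the paper's actual preprocessing and SVM configuration are not given in this summary.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used in the paper):
# a handful of common Bangla function words.
STOP_WORDS = {"এবং", "ও", "কিন্তু", "না", "এই"}


def tokenize(text, remove_stop_words=False):
    """Whitespace tokenizer with optional stop-word filtering."""
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def tfidf(docs, remove_stop_words=False):
    """Return one {term: weight} dict per document.

    Uses relative term frequency and a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1, chosen here for simplicity.
    """
    tokenized = [tokenize(d, remove_stop_words) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        vectors.append({
            term: (c / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, c in counts.items()
        })
    return vectors


# Toy documents (hypothetical, for illustration only).
docs = ["আমি এবং তুমি", "আমি কিন্তু যাব না"]
with_stop = tfidf(docs, remove_stop_words=False)
without_stop = tfidf(docs, remove_stop_words=True)
```

In an ablation of this kind, the two feature sets would each be passed to the same classifier (for example, a linear SVM) and the resulting macro-F1 scores compared. A drop in score after filtering would suggest that the removed function words carry authorial style.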

Taxonomy

Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Topic Modeling