BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution
Abdullah Muhammad Moosa (1), Nusrat Sultana (1), Mahdi Muhammad Moosa (2), Md. Miraiz Hossain (1)
(1) Department of Mechatronics & Industrial Engineering, Chittagong University of Engineering & Technology, Chittagong 4349, Bangladesh

TL;DR
This paper introduces BARD10, a new Bangla authorship attribution benchmark, and systematically analyzes the impact of stop-word removal on classical and deep learning models, revealing the stylistic importance of stop-words in Bangla texts.
Contribution
It presents BARD10, a balanced Bangla authorship dataset, and provides a comprehensive analysis of stop-word effects across multiple models, highlighting their stylistic significance.
Findings
Classical TF-IDF + SVM outperforms deep models on BARD10 and BAAD16.
Stop-word removal significantly affects authorship attribution accuracy.
Bangla stop-words are key stylistic indicators for authorship attribution.
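To make the last two findings concrete, here is a minimal toy sketch in Python of how removing stop-words can erase the signal that separates two writers. It is not the paper's pipeline: there is no SVM, the four-word stop-word list is a hypothetical stand-in rather than the list used by the authors, the two "documents" are invented token sequences rather than real prose, and the smoothed IDF formula is an assumption chosen for the illustration.

```python
import math
from collections import Counter

# Hypothetical mini stop-word list, for illustration only (not the paper's list).
STOP_WORDS = {"এবং", "কিন্তু", "তবে", "ও"}


def tokenize(text, remove_stop_words):
    """Whitespace tokenizer with optional stop-word filtering."""
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def tfidf_vectors(docs, remove_stop_words):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d, remove_stop_words) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy token sequences: identical content words, different function words.
doc_author_a = "নদী এবং গ্রাম কিন্তু বৃষ্টি এবং আকাশ"
doc_author_b = "নদী ও গ্রাম তবে বৃষ্টি ও আকাশ"
docs = [doc_author_a, doc_author_b]

vec_a, vec_b = tfidf_vectors(docs, remove_stop_words=False)
sim_with_stop_words = cosine(vec_a, vec_b)

vec_a, vec_b = tfidf_vectors(docs, remove_stop_words=True)
sim_without_stop_words = cosine(vec_a, vec_b)

print("similarity, stop-words kept:   ", round(sim_with_stop_words, 3))
print("similarity, stop-words removed:", round(sim_without_stop_words, 3))
```

In this constructed case the two token sequences differ only in their function words, so filtering those words makes the TF-IDF vectors identical and leaves a classifier nothing to separate the writers with. Real corpora are far less clean, so this shows only the mechanism, not the size of the effect reported on BARD10 or BAAD16.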
Abstract
This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. The study methodically assesses four representative classifiers, SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), using uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On all datasets, the classical TF-IDF + SVM baseline performed best, attaining a macro-F1 score of…
Taxonomy
Topics: Authorship Attribution and Profiling · Hate Speech and Cyberbullying Detection · Topic Modeling
