# ANCHOLIK-NER: A benchmark dataset for Bangla regional named entity recognition

**Authors:** Bidyarthi Paul, Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque, Shahriar Manzoor, Matteo Bodini, Joanna Tindall

PMC · DOI: 10.1371/journal.pone.0342786 · PLOS One · 2026-02-25

## TL;DR

This paper introduces ANCHOLIK-NER, the first benchmark dataset for named entity recognition in Bangla regional dialects, addressing a critical gap in NLP for low-resource languages.

## Contribution

The paper presents the first annotated benchmark dataset for NER in five Bangla regional dialects (Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet) and provides baseline evaluations of three transformer-based models.

## Key findings

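- Bangla BERT achieved the highest overall performance of the three evaluated models, with F1-scores ranging from 75.31% (Chittagong) to 82.27% (Mymensingh).
- Mymensingh (82.27%) and Barishal (81.48%) showed the strongest NER performance, while Chittagong (75.31%) remained the most challenging dialect.
- The dataset includes 17,405 sentences and 101,817 words annotated with 10 entity tags across five regions.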
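
To make the per-dialect comparison concrete, the short sketch below tabulates the Bangla BERT F1-scores quoted in the abstract and ranks the dialects. The scores are taken from the abstract; the best-to-worst gap and the unweighted mean are derived here for illustration and are not figures reported by the authors.

```python
# Bangla BERT F1-scores (%) per dialect, as quoted in the abstract.
reported_f1 = {
    "Mymensingh": 82.27,
    "Barishal": 81.48,
    "Sylhet": 78.75,
    "Noakhali": 78.50,
    "Chittagong": 75.31,
}

# Rank dialects from highest to lowest F1.
ranked = sorted(reported_f1.items(), key=lambda kv: kv[1], reverse=True)
best_region, best_f1 = ranked[0]
worst_region, worst_f1 = ranked[-1]

# Derived quantities (NOT reported in the paper): spread and unweighted mean.
gap = round(best_f1 - worst_f1, 2)
mean_f1 = round(sum(reported_f1.values()) / len(reported_f1), 2)

for region, f1 in ranked:
    print(f"{region:<12} {f1:.2f}")
print(f"gap between {best_region} and {worst_region}: {gap:.2f} points")
print(f"unweighted mean F1: {mean_f1:.2f}")
```

The unweighted mean ignores differences in per-dialect test-set size, which this summary does not state, so it should be read only as a rough aggregate.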

## Abstract

Named Entity Recognition (NER) in regional dialects is a critical yet underexplored area in Natural Language Processing (NLP), especially for low-resource languages like Bangla. While NER systems for Standard Bangla have made progress, no existing resources or models specifically address the challenge of regional dialects such as Barishal, Chittagong, Mymensingh, Noakhali, and Sylhet, which exhibit unique linguistic features that existing models fail to handle effectively. To fill this gap, we introduce ANCHOLIK-NER, the first benchmark dataset for NER in Bangla regional dialects, comprising 17,405 sentences and 101,817 words annotated with 10 entity tags across 5 regions. The dataset was sourced from publicly available resources and supplemented with manual translations, ensuring alignment of named entities across dialects. We evaluate three transformer-based models—Bangla BERT, Bangla Bert Base, and BERT Base Multilingual Cased—on this dataset. Bangla BERT achieved the highest performance overall, with F1-scores of 82.27% (Mymensingh), 81.48% (Barishal), 78.75% (Sylhet), 78.50% (Noakhali), and 75.31% (Chittagong). These results highlight strong recognition capability in Mymensingh and Barishal, while dialectal variation in Chittagong remains challenging. As no prior NER resources exist for Bangla regional dialects, this work provides a foundational dataset and baseline benchmarks to facilitate future research. Future work will focus on dialect-aware model adaptation and expanding coverage to additional regions.
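
As a rough illustration of how an F1-score for NER can be computed, the sketch below scores predicted tag sequences against gold sequences at the entity level (exact span and type match). This summary does not state the paper's tagging scheme or whether its F1 is entity-level or token-level, so the BIO scheme, the `PER`/`LOC` tag names, and the helper functions `extract_spans` and `entity_f1` are all assumptions made for this example rather than the authors' evaluation code.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed scheme)."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith(("B-", "I-")):
            # Lenient choice: a stray I- tag opens a new span.
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged entity-level F1 over a list of sentences."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(idx, *s) for s in extract_spans(g)}
        pred_spans |= {(idx, *s) for s in extract_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the model finds the person but misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(f"entity-level F1: {entity_f1(gold, pred):.4f}")
```

Exact-match scoring is strict: a prediction that gets the entity type right but the boundary wrong counts as both a false positive and a false negative, which is one reason dialect-specific spelling and tokenization differences can lower scores.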

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12935308/full.md

## Figures

20 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12935308/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/PMC12935308/full.md

---
Source: https://tomesphere.com/paper/PMC12935308