BioUNER: A Benchmark Dataset for Clinical Urdu Named Entity Recognition
Wazir Ali, Adeeb Noor, Sanaullah Mahar, Alia, Muhammad Mazhar Younas

TL;DR
BioUNER is a high-quality benchmark dataset for biomedical Urdu named entity recognition, created from online health sources and validated for use in machine learning models.
Contribution
The paper introduces BioUNER, a novel gold-standard biomedical Urdu NER dataset, with extensive annotation and evaluation of multiple models demonstrating its utility.
Findings
Achieved an inter-annotator agreement score of 0.78.
Evaluated models include SVM, LSTM, mBERT, and XLM-RoBERTa.
BioUNER provides a reliable benchmark for Urdu biomedical NER.
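The summary reports an inter-annotator agreement of 0.78 for three annotators but does not name the statistic. As an illustration only, the sketch below assumes Fleiss' kappa computed over token-level labels, a common choice when more than two annotators label the same items. The function name `fleiss_kappa` and the tags `O` and `B-DISEASE` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per annotator).

    Assumes every item was labelled by the same number of annotators (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    total_counts = Counter()  # how often each label was used overall
    observed = 0.0            # running sum of per-item agreement

    for item in ratings:
        counts = Counter(item)
        total_counts.update(counts)
        # Fraction of annotator pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed += agreeing_pairs / (n_raters * (n_raters - 1))

    p_observed = observed / n_items
    # Agreement expected by chance from the overall label distribution.
    p_expected = sum(
        (count / (n_items * n_raters)) ** 2 for count in total_counts.values()
    )
    if p_expected == 1.0:
        return 1.0  # only one label was ever used, so agreement is trivially perfect
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: four tokens, three annotators each (hypothetical tags).
toy_ratings = [
    ["O", "O", "O"],
    ["B-DISEASE", "B-DISEASE", "O"],
    ["B-DISEASE", "B-DISEASE", "B-DISEASE"],
    ["O", "O", "O"],
]
print(round(fleiss_kappa(toy_ratings), 3))  # 0.657
```

In the toy data the annotators disagree on one token out of four, which gives a kappa of 23/35 (about 0.657). The toy labels say nothing about how the reported 0.78 was obtained.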
Abstract
In this article, we present a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition (BioUNER), developed by crawling health-related articles from online Urdu news portals, medical prescriptions, and hospital health blogs and websites. After preprocessing, three native-speaker annotators familiar with the medical domain annotated 153K tokens using the Doccano text annotation tool. Following annotation, the proposed BioUNER dataset was evaluated both intrinsically and extrinsically. An inter-annotator agreement score of 0.78 was achieved, supporting the dataset's gold-standard quality. To demonstrate the utility and benchmarking capability of the dataset, we evaluated several machine learning and deep learning models, including Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), Multilingual BERT…
