Medical Dataset Classification for Kurdish Short Text over Social Media

Ari M. Saeed; Shnya R. Hussein; Chro M. Ali; Tarik A. Rashid

arXiv:2204.09660·cs.CL·April 21, 2022

TL;DR

This paper presents a new Kurdish medical dataset from social media comments and applies text classification techniques to distinguish medical comments from non-medical ones, aiding health-related social media analysis.

Contribution

It introduces a novel Kurdish medical dataset from Facebook comments and details a preprocessing and labeling process for classifying medical versus non-medical comments.

Findings

01

The dataset contains 6756 comments: 45% medical and 55% non-medical.

02

Six preprocessing steps clean the raw comments and remove noise by replacing characters.

03

The dataset supports future research in medical text analysis for Kurdish social media.

Abstract

Facebook is used as the source for collecting the comments in this dataset. The dataset consists of 6756 comments and forms the Medical Kurdish Dataset (MKD). The samples are user comments gathered from posts on pages in different categories (Medical, News, Economy, Education, and Sport). Six preprocessing steps are applied to the raw dataset to clean the comments and remove noise by replacing characters. The comments (short texts) are labeled for text classification as positive class (medical comment) or negative class (non-medical comment). The negative class makes up 55% of the dataset and the positive class 45%.
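
The abstract does not list the six preprocessing steps, so the following is only a minimal Python sketch of what cleaning "by replacing characters" and checking the class balance might look like. The replacement table, the whitespace step, and the function names `clean_comment` and `class_ratio` are all assumptions made for illustration and are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical replacement table (an assumption, not the paper's six steps):
# map Arabic code points to the variants commonly used in Kurdish text.
REPLACEMENTS = {
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}


def clean_comment(text: str) -> str:
    """Apply the character replacements, then collapse runs of whitespace."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def class_ratio(labels):
    """Return the fraction of samples per label (1 = medical, 0 = non-medical)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

Applied to the full set of labels, `class_ratio` should come out near 0.45 for the positive class and 0.55 for the negative class, matching the split reported in the abstract.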
