The 2021 Urdu Fake News Detection Task using Supervised Machine Learning and Feature Combinations
Muhammad Humayoun

TL;DR
This paper describes a supervised machine learning system for Urdu fake news detection, built for a FIRE 2021 shared task. Combining text preprocessing, feature selection, and an SVM classifier, it reaches a best F1 Macro score of 0.6674.
Contribution
It presents a preprocessing and feature selection pipeline for SVM-based Urdu fake news detection that, in experiments run after the competition closed, scores higher than the second-best entry in the shared task.
Findings
Achieved a best F1 Macro score of 0.6674 with SVMs.
Selected 20K features from over 1.5 million using feature importance.
The submitted system ranked fifth; experiments run after the results were announced improved on it and exceeded the competition's second-best score.
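For readers unfamiliar with the metric, F1 Macro is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. The following is a minimal sketch of that calculation; the function name `f1_macro` and the toy labels are illustrative assumptions and are not taken from the paper or the shared task's evaluation script.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, not real task data
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
score = f1_macro(y_true, y_pred)
```

Here the "fake" class has F1 = 2/3 and the "real" class has F1 = 4/5, so the macro average is 11/15.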
Abstract
This paper presents the system description submitted to the FIRE Shared Task: "The 2021 Fake News Detection in the Urdu Language". The challenge aims at automatically identifying fake news written in Urdu. Our submitted results ranked fifth in the competition. After the competition results were declared, however, we obtained better results than those we submitted. The best F1 Macro score achieved by one of our models is 0.6674, higher than the second-best score in the competition. This result is achieved with Support Vector Machines (polynomial kernel of degree 1), with stopwords removed, lemmatization applied, and the 20K best features selected out of 1.557 million features in total (produced by word n-grams with n=1,2,3,4 and character n-grams with n=2,3,4,5,6). The code is made available for reproducibility.
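The feature space described in the abstract can be illustrated with a small sketch: word n-grams for n=1 to 4 and character n-grams for n=2 to 6 are pooled, and then only the k highest-scoring features are kept. This is not the author's code. The function names, the whitespace tokenizer, and the scoring rule (absolute difference in per-class document frequency) are assumptions made for illustration; the abstract says only that the 20K best features were selected, and it does not specify the ranking criterion here.

```python
from collections import Counter


def word_ngrams(text, n_values=(1, 2, 3, 4)):
    """Word n-grams over whitespace tokens (tokenizer is an assumption)."""
    tokens = text.split()
    return [
        "w:" + " ".join(tokens[i:i + n])
        for n in n_values
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n_values=(2, 3, 4, 5, 6)):
    """Character n-grams over the raw string, spaces included."""
    return [
        "c:" + text[i:i + n]
        for n in n_values
        for i in range(len(text) - n + 1)
    ]


def extract_features(text):
    """Pooled word and character n-gram counts for one document."""
    return Counter(word_ngrams(text) + char_ngrams(text))


def select_top_k(docs, labels, k):
    """Keep the k features whose document-frequency rate differs most
    between the two classes (a stand-in for the paper's selection step)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    a, b = classes
    n_docs = Counter(labels)
    doc_freq = {a: Counter(), b: Counter()}
    for text, label in zip(docs, labels):
        doc_freq[label].update(set(extract_features(text)))
    vocab = set(doc_freq[a]) | set(doc_freq[b])

    def score(feature):
        return abs(doc_freq[a][feature] / n_docs[a]
                   - doc_freq[b][feature] / n_docs[b])

    # Sort by descending score, breaking ties alphabetically for determinism
    return sorted(vocab, key=lambda f: (-score(f), f))[:k]


# Toy placeholder strings, not real Urdu news text
docs = ["aa bb", "aa cc"]
labels = ["fake", "real"]
top3 = select_top_k(docs, labels, k=3)
```

In the setting the abstract describes, the selected features would then be fed to an SVM with a degree-1 polynomial kernel, with k set to 20,000 rather than the toy value used here.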
Taxonomy
Topics: Misinformation and Its Impacts · Spam and Phishing Detection · Advanced Malware Detection Techniques
