An Investigation of Supervised Learning Methods for Authorship Attribution in Short Hinglish Texts using Char & Word N-grams
Abhay Sharma, Ananya Nandan, Reetika Ralhan

TL;DR
This study evaluates supervised learning techniques for authorship attribution in short Hinglish texts from WhatsApp, finding SVM and Naive Bayes most effective with specific n-gram features, achieving over 94% accuracy.
Contribution
It compares multiple supervised models and feature types for authorship attribution in short, code-mixed texts, highlighting the effectiveness of word unigrams and character 3-grams.
Findings
SVM achieved up to 95.08% accuracy.
Naive Bayes achieved up to 94.45% accuracy.
Word unigrams and character 3-grams are most effective.
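The two feature types named above can be illustrated with a small sketch. This is not the authors' implementation: it is a from-scratch toy multinomial Naive Bayes written only with the Python standard library, and every name in it (char_ngrams, word_unigrams, features, TinyNaiveBayes) as well as the example messages are hypothetical and invented for illustration. Tokenising on whitespace, lowercasing, and add-one smoothing are assumptions, not details taken from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased message."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_unigrams(text):
    """Return whitespace-separated lowercase tokens (assumed tokenisation)."""
    return text.lower().split()


def features(text):
    """Combine both feature types; prefixes keep the two vocabularies apart."""
    return (["w:" + w for w in word_unigrams(text)]
            + ["c:" + g for g in char_ngrams(text, 3)])


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, authors):
        self.counts = defaultdict(Counter)  # author -> feature counts
        self.docs = Counter(authors)        # author -> number of messages
        for text, author in zip(texts, authors):
            self.counts[author].update(features(text))
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_author, best_score = None, -math.inf
        for author, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.docs[author] / total_docs)  # log prior
            for f in features(text):
                if f in self.vocab:  # skip features never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Invented toy messages, NOT taken from the paper's WhatsApp corpus.
train_texts = [
    "kya kar raha hai bhai",
    "bhai kal milte hai",
    "ok will call you later",
    "call me when you are free",
]
train_authors = ["A", "A", "B", "B"]

model = TinyNaiveBayes().fit(train_texts, train_authors)
print(char_ngrams("bhai"))  # ['bha', 'hai']
print(model.predict("bhai kya scene hai"))
print(model.predict("will you call later"))
```

On this toy data the two test messages should be attributed to authors A and B respectively, since each shares words and character 3-grams with only one author's training messages. A real experiment would need a held-out test split and far more messages per author; the accuracy figures reported above come from the paper, not from this sketch.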
Abstract
A person's writing style can serve as a unique identity indicator: the words used and the structuring of sentences are clear measures that can identify the author of a specific work. Stylometry and its subset, Authorship Attribution, have a long history dating back to the 19th century and remain in use today. The emergence of the Internet has shifted the application of attribution studies towards non-standard texts that are comparatively shorter than, and different from, the long texts on which most research has been done. This paper studies short online texts retrieved from the messaging application WhatsApp and the distinctive features of a macaronic language (Hinglish), using supervised learning methods and then comparing the models. Various features such as word n-gram and character n-gram are compared…
Taxonomy
Topics: Authorship Attribution and Profiling · Spam and Phishing Detection
