Sarcasm Detection on Reddit Using Classical Machine Learning and Feature Engineering
Subrata Karmaker

TL;DR
This paper explores sarcasm detection on Reddit comments using classical machine learning techniques and explicit feature engineering, achieving a reproducible baseline with interpretable models.
Contribution
It demonstrates that lightweight, classical ML models with engineered features can provide a reproducible, interpretable baseline for sarcasm detection without neural networks or conversational context.
Findings
Naive Bayes and logistic regression were the strongest of the four models evaluated, with F1-scores around 0.57 on the sarcastic class.
Explicit feature engineering provides a clear baseline for sarcasm detection.
Performance is limited by lack of conversational context.
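For readers unfamiliar with the metric, the F1-score for the sarcastic class is the harmonic mean of precision and recall on that class. The following is a minimal sketch of the computation; the function name `f1_for_class` and the toy labels are illustrative assumptions, not data or code from the paper.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (1 = sarcastic, 0 = not sarcastic).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
score = f1_for_class(y_true, y_pred)
print(f"F1 (sarcastic class): {score:.2f}")
```

In this toy case there are 2 true positives, 2 false positives, and 2 false negatives, so precision and recall are both 0.5 and the F1-score is 0.5.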
Abstract
Sarcasm is common in online discussions, yet difficult for machines to identify because the intended meaning often contradicts the literal wording. In this work, I study sarcasm detection using only classical machine learning methods and explicit feature engineering, without relying on neural networks or context from parent comments. Using a 100,000-comment subsample of the Self-Annotated Reddit Corpus (SARC 2.0), I combine word-level and character-level TF-IDF features with simple stylistic indicators. Four models are evaluated: logistic regression, a linear SVM, multinomial Naive Bayes, and a random forest. Naive Bayes and logistic regression perform best, achieving F1-scores around 0.57 for sarcastic comments. Although the lack of conversational context limits performance, the results offer a clear and reproducible baseline for sarcasm detection using lightweight and…
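The feature setup described in the abstract (word-level TF-IDF, character-level TF-IDF, and stylistic indicators feeding a linear classifier) could be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the author's implementation: the n-gram ranges, the four stylistic indicators, the helper name `stylistic_features`, and the toy comments are all hypothetical, since the abstract does not specify them.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def stylistic_features(texts):
    """Hypothetical stylistic indicators; the paper's exact set is not given here."""
    rows = []
    for t in texts:
        n_chars = max(len(t), 1)
        rows.append([
            t.count("!"),                               # exclamation marks
            t.count("?"),                               # question marks
            sum(c.isupper() for c in t) / n_chars,      # share of uppercase characters
            len(t.split()),                             # length in tokens
        ])
    return csr_matrix(np.array(rows, dtype=float))


features = FeatureUnion([
    # Assumed n-gram ranges, chosen only for illustration.
    ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ("style", FunctionTransformer(stylistic_features)),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Made-up toy comments (1 = sarcastic, 0 = not sarcastic), not SARC data.
texts = [
    "Oh great, another Monday. Just what I needed!",
    "Yeah, because that ALWAYS works out so well.",
    "Wow, what a totally original idea...",
    "The library closes at 9 pm on weekdays.",
    "I updated the driver and the crash went away.",
    "This recipe needs two cups of flour.",
]
labels = [1, 1, 1, 0, 0, 0]

model.fit(texts, labels)
preds = [int(p) for p in model.predict(texts)]
n_features = model.named_steps["features"].transform(texts).shape[1]
print(preds, n_features)
```

Swapping `LogisticRegression` for `MultinomialNB`, `LinearSVC`, or `RandomForestClassifier` would cover the other three model families named in the abstract. All features in this sketch are non-negative, which `MultinomialNB` requires.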
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Hate Speech and Cyberbullying Detection · Topic Modeling
