A Large Self-Annotated Corpus for Sarcasm
Mikhail Khodak, Nikunj Saunshi, Kiran Vodrahalli

TL;DR
The paper introduces SARC, a large self-annotated Reddit corpus with 1.3 million sarcastic statements, enabling improved sarcasm detection research through extensive data and context information.
Contribution
It provides the largest self-annotated sarcasm dataset with contextual information, facilitating more accurate and scalable sarcasm detection models.
Findings
SARC contains 1.3 million sarcastic statements, 10 times more than any previous dataset, plus many times more non-sarcastic statements.
The authors evaluate baseline methods on benchmarks built from the corpus, which supplies user, topic, and conversation context for each statement.
The dataset enables benchmarking and evaluation of sarcasm detection methods.
Abstract
We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated -- sarcasm is labeled by the author, not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.
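The self-annotation and the balanced label regime described in the abstract can be illustrated with a minimal Python sketch. It assumes that the author's sarcasm label is a trailing "/s" token, following the Reddit convention; the abstract itself does not spell out the marker. The function names `label_comment` and `balanced_sample` and the example comments are hypothetical and are not part of the SARC release.

```python
import random


def label_comment(text):
    """Return (clean_text, label).

    Assumption: a comment counts as self-annotated sarcasm (label 1) when its
    last whitespace-separated token is "/s". The marker is stripped from the
    returned text so a classifier cannot simply read the label off the input.
    """
    tokens = text.split()
    if tokens and tokens[-1].lower() == "/s":
        return " ".join(tokens[:-1]), 1
    return text.strip(), 0


def balanced_sample(labeled, seed=0):
    """Downsample the larger class so both labels occur equally often."""
    rng = random.Random(seed)
    pos = [item for item in labeled if item[1] == 1]
    neg = [item for item in labeled if item[1] == 0]
    n = min(len(pos), len(neg))
    sample = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(sample)
    return sample


# Hypothetical comments, for illustration only.
comments = [
    "Great, another Monday. /s",
    "The meeting is at 3pm.",
    "I love waiting in line /s",
    "Here is the link you asked for.",
    "Thanks for the update.",
]

labeled = [label_comment(c) for c in comments]  # unbalanced: 2 sarcastic, 3 not
balanced = balanced_sample(labeled)             # balanced: 2 sarcastic, 2 not
```

Keeping the full `labeled` list corresponds to the unbalanced regime, where non-sarcastic statements dominate; `balanced_sample` corresponds to the balanced one. A real pipeline would also need to filter noisy markers, which this sketch does not attempt.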
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Advanced Text Analysis Techniques · Topic Modeling
