A Large Self-Annotated Corpus for Sarcasm

Mikhail Khodak; Nikunj Saunshi; Kiran Vodrahalli

arXiv:1704.05579 · cs.CL · March 26, 2018 · 133 cites

TL;DR

The paper introduces SARC, a self-annotated Reddit corpus of 1.3 million sarcastic statements. Each statement is labeled by its own author and comes with user, topic, and conversation context, so the corpus can be used to train and evaluate sarcasm detection systems.

Contribution

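The corpus is 10 times larger than any previous sarcasm dataset, is labeled by the authors of the statements rather than by independent annotators, and includes many times more non-sarcastic statements, which supports learning in both balanced and unbalanced label regimes.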

Findings

01

SARC contains 1.3 million sarcastic statements, 10 times more than any previous dataset.

02

Baseline models show improved performance using the corpus and context features.

03

The dataset enables benchmarking and evaluation of sarcasm detection methods.

Abstract

We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated -- sarcasm is labeled by the author, not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.
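
The balanced and unbalanced label regimes mentioned in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the `(text, label)` tuple format, the `balanced_sample` helper, and the toy sentences are all assumptions made for illustration. The unbalanced regime keeps every example, while the balanced regime downsamples the majority class until both labels are equally frequent.

```python
import random
from collections import Counter


def balanced_sample(examples, seed=0):
    """Downsample the larger class so both labels appear equally often.

    `examples` is a list of (text, label) pairs with label 1 for sarcastic
    and 0 for non-sarcastic. This pair format is a hypothetical choice,
    not the corpus's actual file layout.
    """
    rng = random.Random(seed)
    sarcastic = [ex for ex in examples if ex[1] == 1]
    literal = [ex for ex in examples if ex[1] == 0]
    n = min(len(sarcastic), len(literal))
    sample = rng.sample(sarcastic, n) + rng.sample(literal, n)
    rng.shuffle(sample)
    return sample


# Toy data invented for this sketch; none of it is taken from SARC.
toy = [
    ("Oh great, another Monday.", 1),
    ("Yeah, because that always works.", 1),
    ("The meeting is at 3 pm.", 0),
    ("I liked the second season more.", 0),
    ("The patch fixed the crash for me.", 0),
    ("It rained all weekend here.", 0),
    ("The store closes at nine.", 0),
    ("Thanks for the link.", 0),
]

unbalanced = toy                  # unbalanced regime: keep the natural skew
balanced = balanced_sample(toy)   # balanced regime: equal counts per label

print(Counter(label for _, label in unbalanced))
print(Counter(label for _, label in balanced))
```

Accuracy is a reasonable metric in the balanced regime, whereas the unbalanced regime usually calls for precision, recall, or F1, because predicting the majority label already scores well on accuracy alone.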

Code & Models

Datasets

marcbishara/sarcasm-on-reddit
dataset · 30 downloads

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Advanced Text Analysis Techniques · Topic Modeling