PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge

Yun He; Zhuoer Wang; Yin Zhang; Ruihong Huang; James Caverlee

arXiv:2010.03725·cs.CL·October 9, 2020

TL;DR

PARADE is a dataset for evaluating paraphrase identification in the computer science domain. Its examples require domain knowledge to judge, and it exposes the limitations of current models in this specialized setting.

Contribution

The paper introduces PARADE, a domain-specific paraphrase dataset that challenges existing models and highlights the importance of incorporating domain knowledge in paraphrase detection.

Findings

01. State-of-the-art neural models perform poorly on PARADE.

02. Fine-tuned BERT achieves an F1 score of only 0.709.

03. Models struggle to leverage domain knowledge effectively.
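
For readers unfamiliar with the metric behind the 0.709 figure, the following is a minimal sketch of how an F1 score is computed for binary paraphrase labels. It assumes that label 1 means "paraphrase" and is treated as the positive class; the labels and predictions are made up for illustration and are not taken from PARADE, and the function is not the authors' evaluation script.

```python
def f1_score(gold, predicted):
    """F1 for the positive class (label 1), from two equal-length label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only (1 = paraphrase, 0 = non-paraphrase).
gold = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
score = f1_score(gold, predicted)
```

F1 is the harmonic mean of precision and recall, so a model that misses many true paraphrases or flags many non-paraphrases is penalized either way.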

Abstract

We present a new benchmark dataset called PARADE for paraphrase identification that requires specialized domain knowledge. PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge. Experiments show that both state-of-the-art neural models and non-expert human annotators have poor performance on PARADE. For example, BERT after fine-tuning achieves an F1 score of 0.709, which is much lower than its performance on other paraphrase identification datasets. PARADE can serve as a resource for researchers interested in testing models that incorporate domain knowledge. We make our data and code freely available.
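
The contrast the abstract draws between lexical overlap and semantic equivalence can be illustrated with a small sketch. It uses Jaccard similarity over word tokens as a stand-in overlap measure; this choice of measure, the tokenizer, and both sentence pairs are assumptions made for illustration, not the procedure or data used to build PARADE.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_overlap(a, b):
    """Jaccard similarity between the token sets of two texts (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical pairs written for illustration (not drawn from PARADE).
# Both sentences describe a stack, yet they share almost no words.
paraphrase_pair = (
    "A last-in, first-out collection of items.",
    "Elements are removed in the reverse order of their insertion.",
)
# One sentence describes a queue and the other a stack, yet they differ by one word.
non_paraphrase_pair = (
    "A collection where the first item added is the first item removed.",
    "A collection where the last item added is the first item removed.",
)

low = jaccard_overlap(*paraphrase_pair)
high = jaccard_overlap(*non_paraphrase_pair)
```

Under these assumptions the paraphrase pair scores far lower than the non-paraphrase pair, which is the pattern that makes surface-overlap cues unreliable on this kind of data.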

Code & Models

Repositories

heyunh2015/PARADE_dataset (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Layer Normalization · Dense Connections · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Attention Is All You Need