PARADE: A New Dataset for Paraphrase Identification Requiring Computer Science Domain Knowledge
Yun He, Zhuoer Wang, Yin Zhang, Ruihong Huang, James Caverlee

TL;DR
PARADE is a new dataset for evaluating paraphrase identification in the computer science domain, where judging semantic equivalence requires domain knowledge. It exposes the limitations of current models in this specialized setting.
Contribution
The paper introduces PARADE, a domain-specific paraphrase dataset that challenges existing models and highlights the importance of incorporating domain knowledge in paraphrase detection.
Findings
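State-of-the-art neural models and non-expert human annotators both perform poorly on PARADE.
Fine-tuned BERT reaches an F1 score of only 0.709, much lower than its performance on other paraphrase identification datasets.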
Models struggle to leverage domain knowledge effectively.
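For readers unfamiliar with the metric, the sketch below shows how an F1 score is computed for a binary paraphrase classifier. It assumes F1 is measured on the positive (paraphrase) class, which this summary does not specify; the function name `f1_score` and the label lists are made up for illustration and are not PARADE data or results.

```python
def f1_score(gold: list[int], pred: list[int]) -> float:
    """F1 for the positive class (1 = paraphrase, 0 = non-paraphrase)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; NOT taken from PARADE.
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 0, 1, 1, 0, 1]
print(f"F1 = {f1_score(gold, pred):.3f}")
```

Because F1 is a harmonic mean, it falls whenever either precision or recall falls, so a model that is misled in only one direction (missing true paraphrases, or accepting look-alike non-paraphrases) is still penalized.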
Abstract
We present a new benchmark dataset called PARADE for paraphrase identification that requires specialized domain knowledge. PARADE contains paraphrases that overlap very little at the lexical and syntactic level but are semantically equivalent based on computer science domain knowledge, as well as non-paraphrases that overlap greatly at the lexical and syntactic level but are not semantically equivalent based on this domain knowledge. Experiments show that both state-of-the-art neural models and non-expert human annotators have poor performance on PARADE. For example, BERT after fine-tuning achieves an F1 score of 0.709, which is much lower than its performance on other paraphrase identification datasets. PARADE can serve as a resource for researchers interested in testing models that incorporate domain knowledge. We make our data and code freely available.
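To make the lexical-overlap property concrete, here is a minimal sketch that measures token-level Jaccard overlap between two sentences. The sentence pairs are hypothetical examples written for this illustration, not items from PARADE, and Jaccard similarity is an assumed stand-in for "lexical overlap"; the summary does not say how the authors measure it.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical pairs for illustration; they are NOT taken from PARADE.
# Same meaning to someone who knows what a FIFO queue is, few shared words.
paraphrase_pair = (
    "A data structure that removes items in the order they were added.",
    "First-in, first-out queue.",
)
# Nearly identical wording, but the complexity bounds differ.
non_paraphrase_pair = (
    "A sorting algorithm with O(n log n) worst-case running time.",
    "A sorting algorithm with O(n^2) worst-case running time.",
)

low = jaccard_overlap(*paraphrase_pair)
high = jaccard_overlap(*non_paraphrase_pair)
print(f"paraphrase pair overlap:     {low:.2f}")
print(f"non-paraphrase pair overlap: {high:.2f}")
```

In this toy case the true paraphrase scores far lower than the non-paraphrase, so any classifier that leans on surface overlap would label both pairs incorrectly. That is the failure mode the dataset is described as targeting.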
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Layer Normalization · Dense Connections · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Attention Is All You Need
