CORD19STS: COVID-19 Semantic Textual Similarity Dataset
Xiao Guo, Hengameh Mirzaalian, Ekraam Sabir, Ayush Jaiswal, and Wael Abd-Almageed

TL;DR
The paper introduces CORD19STS, a COVID-19 domain-specific semantic textual similarity (STS) dataset of 13,710 annotated sentence pairs, intended to support NLP applications such as information retrieval engines and dialog-based medical diagnosis systems calibrated for COVID-19.
Contribution
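It builds a COVID-19 STS dataset with 13,710 annotated sentence pairs and uses a fine-tuned BERT-like language model, Sen-SCI-CORD19-BERT, to compute similarity scores, addressing the failure of existing STS datasets and models to carry their performance over to the COVID-19 domain.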
Findings
Dataset includes 13,710 annotated sentence pairs.
Uses a fine-tuned BERT-like model (Sen-SCI-CORD19-BERT) for similarity scoring.
Provides a balanced dataset across semantic similarity levels.
Abstract
In order to combat the COVID-19 pandemic, society can benefit from various natural language processing applications, such as dialog medical diagnosis systems and information retrieval engines calibrated specifically for COVID-19. These applications rely on the ability to measure semantic textual similarity (STS), making STS a fundamental task that can benefit several downstream applications. However, existing STS datasets and models fail to translate their performance to a domain-specific environment such as COVID-19. To overcome this gap, we introduce the CORD19STS dataset, which includes 13,710 annotated sentence pairs collected from the COVID-19 Open Research Dataset (CORD-19) challenge. To be specific, we generated one million sentence pairs using different sampling strategies. We then used a fine-tuned BERT-like language model, which we call Sen-SCI-CORD19-BERT, to calculate the similarity…
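The abstract is cut off before it says how similarity scores are computed, so the following is only a minimal sketch of one common way to score a sentence pair: embed each sentence as a vector and take the cosine similarity. Everything here is an assumption for illustration. `toy_embed` is a hypothetical bag-of-words stand-in, not Sen-SCI-CORD19-BERT, and `score_pair` is a hypothetical helper name; in practice the embedding step would be replaced by a fine-tuned sentence encoder.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in encoder: lowercase bag-of-words counts.

    This is NOT the paper's model; it only gives each sentence a sparse
    vector so the similarity computation below can be demonstrated.
    """
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors stored as Counters."""
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty sentence has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def score_pair(sentence_1: str, sentence_2: str) -> float:
    """Score one sentence pair (assumed pipeline: embed, then cosine)."""
    return cosine_similarity(toy_embed(sentence_1), toy_embed(sentence_2))


if __name__ == "__main__":
    pair = ("covid causes fever", "covid causes cough")
    print(f"similarity = {score_pair(*pair):.3f}")
```

With a learned encoder the same two-step shape applies: only `toy_embed` changes, while the cosine step stays as written.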
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
