Constructing the CORD-19 Vaccine Dataset

Manisha Singh; Divy Sharma; Alonso Ma; Bridget Tyree; Margaret Mitchell

arXiv:2407.18471·cs.CL·July 29, 2024

Open Access

TL;DR

This paper introduces 'CORD-19-Vaccination', a comprehensive COVID-19 vaccine research dataset with enriched metadata, enabling advanced NLP tasks like question answering and text mining in vaccine research.

Contribution

The paper presents a new dataset with detailed metadata and topic annotations, specifically tailored for COVID-19 vaccine research, enhancing NLP applications in this domain.

Findings

01 Demonstrated question-answering capabilities using the dataset.

02 Performed sentence classification on abstracts with high accuracy.

03 The dataset contains 30,000 research papers for NLP research.

Abstract

We introduce a new dataset, 'CORD-19-Vaccination', to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from the CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns), we processed the JSON file for each paper and then further enhanced the result using Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper, and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one…
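The author-demography step described in the abstract can be illustrated with a minimal sketch. It assumes the per-paper JSON follows the commonly documented CORD-19 layout, in which `metadata.authors[]` entries carry an `affiliation` object with `institution`, `laboratory`, and a nested `location`. The function name `author_demography`, the output column names, and the sample record are hypothetical and are not taken from the paper. Rows whose country is missing are left as `None`, marking where an external lookup (the abstract mentions Google's search API) would be needed. That lookup is not implemented here.

```python
import json


def author_demography(paper_json: str) -> list[dict]:
    """Return one row per author with affiliation, location, and country.

    Assumes a CORD-19-style layout: metadata.authors[].affiliation.location.
    """
    paper = json.loads(paper_json)
    rows = []
    for author in paper.get("metadata", {}).get("authors", []):
        # Affiliation is often an empty object when the PDF parser found nothing.
        affiliation = author.get("affiliation") or {}
        location = affiliation.get("location") or {}
        name = " ".join(
            part for part in (author.get("first"), author.get("last")) if part
        )
        rows.append(
            {
                "author": name,
                # Prefer the institution; fall back to the laboratory name.
                "affiliation": affiliation.get("institution")
                or affiliation.get("laboratory")
                or None,
                "location": location.get("settlement") or None,
                # None marks a row that would need an external country lookup.
                "country": location.get("country") or None,
            }
        )
    return rows


# Invented sample record, for illustration only.
SAMPLE = json.dumps(
    {
        "paper_id": "example-0001",
        "metadata": {
            "title": "An example vaccine study",
            "authors": [
                {
                    "first": "Ada",
                    "last": "Example",
                    "affiliation": {
                        "laboratory": "",
                        "institution": "Example University",
                        "location": {"settlement": "Lisbon", "country": "Portugal"},
                    },
                },
                {"first": "Ben", "last": "Sample", "affiliation": {}},
            ],
        },
    }
)

rows = author_demography(SAMPLE)
missing_country = [row["author"] for row in rows if row["country"] is None]
```

Here `missing_country` collects the authors for whom the JSON alone gives no country, which is the subset a follow-up lookup would have to resolve.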

Taxonomy

Topics: COVID-19 diagnosis using AI

Methods: fastText · Latent Dirichlet Allocation