Constructing the CORD-19 Vaccine Dataset
Manisha Singh, Divy Sharma, Alonso Ma, Bridget Tyree, Margaret Mitchell

TL;DR
This paper introduces 'CORD-19-Vaccination', a comprehensive COVID-19 vaccine research dataset with enriched metadata, enabling advanced NLP tasks like question answering and text mining in vaccine research.
Contribution
The paper presents a new dataset with detailed metadata and topic annotations, specifically tailored for COVID-19 vaccine research, enhancing NLP applications in this domain.
Findings
Demonstrated question-answering capabilities using the dataset.
Performed sentence classification on abstracts with high accuracy.
The dataset contains 30,000 research papers for NLP research.
Abstract
We introduce a new dataset, 'CORD-19-Vaccination', to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from the CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns), we processed the JSON file for each paper and then enhanced the result using Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper, and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one…
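The column augmentation described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `enrich_record` and `frequency_keywords` are hypothetical names, the language detector is passed in as a callable (a fastText-based detector could be plugged in there), and the keyword step is a plain word-frequency stand-in rather than Yake. The topic (LDA) and author-demography columns are omitted.

```python
import re
from collections import Counter
from typing import Callable, Dict, List

# Tiny illustrative stopword list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"the", "and", "of", "in", "to", "for", "with", "on", "is", "was", "an"}


def frequency_keywords(text: str, top_k: int = 5) -> List[str]:
    """Stand-in keyword extractor: most frequent non-stopword tokens (NOT Yake)."""
    tokens = re.findall(r"[a-z][a-z0-9-]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]


def enrich_record(
    paper: Dict[str, str],
    detect_language: Callable[[str], str],
    extract_keywords: Callable[[str], List[str]] = frequency_keywords,
) -> Dict[str, object]:
    """Return a copy of `paper` with hypothetical 'language' and 'keywords' columns added."""
    # Keywords are drawn from title, abstract, and body, as the abstract describes.
    text = " ".join(paper.get(field, "") for field in ("title", "abstract", "body"))
    enriched: Dict[str, object] = dict(paper)  # leave the input record untouched
    enriched["language"] = detect_language(text)
    enriched["keywords"] = extract_keywords(text)
    return enriched
```

Passing the detector and extractor as callables keeps the sketch runnable without any model files; swapping in real components would only change the two arguments, not the record-building logic.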
Taxonomy
Topics: COVID-19 diagnosis using AI
Methods: fastText · Latent Dirichlet Allocation
