Document Classification for COVID-19 Literature

Bernal Jiménez Gutiérrez; Juncheng Zeng; Dongdong Zhang; Ping Zhang; Yu Su

arXiv:2006.13816 · cs.IR · September 11, 2020 · 5 cites

TL;DR

This paper evaluates several multi-label document classification models on the LitCovid dataset. It shows that fine-tuned pre-trained language models outperform the other baselines at classifying COVID-19 research papers, with BioBERT slightly ahead at a micro-F1 of around 86% and an accuracy of around 75% on the test set, and it analyzes the models' errors and limitations.

Contribution

It provides a comprehensive analysis of classification models on COVID-19 literature, highlighting the effectiveness of fine-tuned BioBERT and identifying key challenges for future improvements.

Findings

01

BioBERT achieves a micro-F1 of around 86% and an accuracy of around 75% on the test set.

02

Pre-trained language models outperform baselines on LitCovid.

03

Errors often involve label correlation and focus issues.

Abstract

The global pandemic has made it more important than ever to quickly and accurately retrieve relevant scientific literature for effective consumption by researchers in a wide range of fields. We provide an analysis of several multi-label document classification models on the LitCovid dataset, a growing collection of 23,000 research papers regarding the novel 2019 coronavirus. We find that pre-trained language models fine-tuned on this dataset outperform all other baselines and that BioBERT surpasses the others by a small margin with micro-F1 and accuracy scores of around 86% and 75% respectively on the test set. We evaluate the data efficiency and generalizability of these models as essential features of any system prepared to deal with an urgent situation like the current health crisis. Finally, we explore 50 errors made by the best performing models on LitCovid documents and find that…
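The gap between the two reported scores is easier to interpret with the metric definitions in hand. The sketch below is illustrative only and is not taken from the paper's repository. It assumes that "accuracy" in a multi-label setting means exact match (every label of a document predicted correctly), which is a common convention but is not stated in the abstract. The function names and the toy label sets are hypothetical.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool label decisions over all documents."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Assumed definition: a document counts only if its full label set matches."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data for illustration; these are not results from the paper.
gold = [{"Treatment", "Diagnosis"}, {"Prevention"}, {"Mechanism"}]
pred = [{"Treatment"}, {"Prevention"}, {"Mechanism", "Treatment"}]

print(micro_f1(gold, pred))              # 0.75
print(exact_match_accuracy(gold, pred))  # one of three documents matches exactly
```

Under this assumed definition, a single missing or extra label makes the whole document count as wrong for accuracy while costing only one label decision in micro-F1, so accuracy is typically the lower of the two numbers.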

Code & Models

Repositories

dki-lab/covid19-classification
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Linear Layer · Softmax · Linear Warmup With Linear Decay · Dense Connections · AdamW · Layer Normalization · Attention Is All You Need · WordPiece · Residual Connection