Document Classification for COVID-19 Literature
Bernal Jiménez Gutiérrez, Juncheng Zeng, Dongdong Zhang, Ping Zhang, Yu Su

TL;DR
This paper evaluates various multi-label document classification models on the LitCovid dataset, demonstrating that fine-tuned pre-trained language models, especially BioBERT, achieve high accuracy in classifying COVID-19 research papers, and analyzes their errors and limitations.
Contribution
It provides a comprehensive analysis of classification models on COVID-19 literature, highlighting the effectiveness of fine-tuned BioBERT and identifying key challenges for future improvements.
Findings
BioBERT achieves a micro-F1 of around 86% and an accuracy of around 75% on the LitCovid test set.
Pre-trained language models outperform baselines on LitCovid.
Errors often involve label correlation and focus issues.
Abstract
The global pandemic has made it more important than ever to quickly and accurately retrieve relevant scientific literature for effective consumption by researchers in a wide range of fields. We provide an analysis of several multi-label document classification models on the LitCovid dataset, a growing collection of 23,000 research papers regarding the novel 2019 coronavirus. We find that pre-trained language models fine-tuned on this dataset outperform all other baselines and that BioBERT surpasses the others by a small margin with micro-F1 and accuracy scores of around 86% and 75% respectively on the test set. We evaluate the data efficiency and generalizability of these models as essential features of any system prepared to deal with an urgent situation like the current health crisis. Finally, we explore 50 errors made by the best performing models on LitCovid documents and find that…
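The two metrics reported in the abstract can be illustrated with a minimal sketch. It assumes that "accuracy" in the multi-label setting means exact-match accuracy (a document counts as correct only when its full predicted label set equals the gold set); the abstract does not spell this out. The function names, label names, and toy data are hypothetical and are not taken from the paper or from LitCovid.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    over every (document, label) pair before computing F1."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of documents whose predicted label set equals the gold set.
    Assumption: this is the notion of accuracy meant in the abstract."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy example with placeholder labels (illustrative only, not real data).
gold = [{"Treatment"}, {"Prevention", "Transmission"}, {"Diagnosis"}]
pred = [{"Treatment"}, {"Prevention"}, {"Diagnosis", "Treatment"}]

print(f"micro-F1: {micro_f1(gold, pred):.2f}")
print(f"exact-match accuracy: {exact_match_accuracy(gold, pred):.2f}")
```

Under this reading, a partially correct prediction still contributes to micro-F1 but counts as a miss for accuracy, which is one plausible reason the reported accuracy (around 75%) sits below the micro-F1 (around 86%).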
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: Linear Layer · Softmax · Linear Warmup With Linear Decay · Dense Connections · AdamW · Layer Normalization · Attention Is All You Need · WordPiece · Residual Connection
