AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text
Zhi Hong, J. Gregory Pauloski, Logan Ward, Kyle Chard, Ben Blaiszik, and Ian Foster

TL;DR
This paper presents an AI- and HPC-based approach to identifying drug-like molecules mentioned in the scientific literature on SARS-CoV-2, supporting drug repurposing efforts through automated text analysis.
Contribution
It introduces a novel pipeline combining human-labeled data and machine learning to extract drug-like molecules from large-scale COVID-19 literature datasets.
Findings
Extracted 10,912 drug-like molecules from nearly 200,000 papers.
Achieved extraction performance comparable to non-expert human annotators.
Showed that a named entity recognition model trained on text labeled by non-experts can automate the detection of drug-like molecules in free text.
Abstract
Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of coronavirus research. We report here on a project that leverages both human and artificial intelligence to detect references to drug-like molecules in free text. We engage non-expert humans to create a corpus of labeled text, use this labeled corpus to train a named entity recognition model, and employ the trained model to extract 10,912 drug-like molecules from the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. Performance analyses show that our automated extraction model can achieve performance on par with that of non-expert humans.
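The final extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' model or code: it assumes that some trained named entity recognition model has already assigned BIO tags to each token, and it only shows how tagged tokens might be turned into a deduplicated list of molecule mentions. The tag names `B-DRUG` and `I-DRUG`, the function names, and the sample sentences are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def spans_from_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect entity strings from parallel token and BIO-tag lists.

    Assumes the hypothetical tag scheme B-DRUG / I-DRUG / O.
    """
    entities: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-DRUG":
            # A new entity starts; close any entity still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I-DRUG" and current:
            # Continue the entity that is currently open.
            current.append(token)
        else:
            # Outside any entity (or a stray I-DRUG): close what is open.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def unique_molecules(
    tagged_docs: Iterable[Tuple[List[str], List[str]]]
) -> Counter:
    """Count case-insensitive molecule mentions across tagged documents."""
    counts: Counter = Counter()
    for tokens, tags in tagged_docs:
        for name in spans_from_bio(tokens, tags):
            counts[name.lower()] += 1
    return counts


# Made-up sentences with made-up tags, standing in for model output.
docs = [
    (["Remdesivir", "inhibits", "viral", "replication", "."],
     ["B-DRUG", "O", "O", "O", "O"]),
    (["We", "tested", "remdesivir", "and", "lopinavir", "."],
     ["O", "O", "B-DRUG", "O", "B-DRUG", "O"]),
]
counts = unique_molecules(docs)
```

Lowercasing is the simplest possible normalization; a real pipeline would likely need to reconcile synonyms and spelling variants before reporting a count of distinct molecules.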
Taxonomy
Topics: Misinformation and Its Impacts · Topic Modeling · Computational Drug Discovery Methods
