VILLA: Versatile Information Retrieval From Scientific Literature Using Large LAnguage Models
Blessy Antony, Amartya Dutta, Sneha Aggarwal, Vasu Gatne, Ozan Gökdemir, Samantha Grimes, Adam Lauring, Brian R. Wasik, Anuj Karpatne, T. M. Murali

TL;DR
This paper introduces VILLA, a new retrieval-augmented generation framework for extracting complex mutation information from scientific literature in virology, addressing a gap in open-ended, domain-specific scientific information extraction.
Contribution
The paper presents VILLA, a novel multi-step RAG framework tailored for virology, along with a curated dataset of influenza mutations, advancing open-ended scientific information extraction.
Findings
VILLA outperforms baseline RAG and other state-of-the-art tools.
Contributes a curated dataset of 629 influenza mutations from 239 publications.
Demonstrates effectiveness in complex, open-ended SIE tasks.
Abstract
The lack of high-quality ground truth datasets to train machine learning (ML) models impedes the potential of artificial intelligence (AI) for science research. Scientific information extraction (SIE) from the literature using LLMs is emerging as a powerful approach to automate the creation of these datasets. However, existing LLM-based approaches and benchmarking studies for SIE focus on broad topics such as biomedicine and chemistry, are limited to choice-based tasks, and focus on extracting information from short and well-formatted text. The potential of SIE methods in complex, open-ended tasks is considerably under-explored. In this study, we used a domain that has been virtually ignored in SIE, namely virology, to address these research gaps. We design a unique, open-ended SIE task of extracting mutations in a given virus that modify its interaction with the host. We develop a new,…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Advanced Text Analysis Techniques · Topic Modeling
